Hazardous Freight Prediction: Siamese Neural Network and NLP¶

By Nigel Story

Contents¶

Introduction
ETL and Data Cleaning
EDA and Feature Selection
Feature Engineering
Model Definition
Model Evaluation
Conclusion

Introduction ¶

Throughout the COVID-19 pandemic we've heard a lot about the "essential worker": the doctors and nurses on the front lines, the cashiers and clerks keeping our grocery stores open, the postal workers delivering vital parcels across the U.S., and countless others who risk their health and safety every day to keep the rest of us on our feet. Freight line workers are certainly among this heroic group, with their work being invaluable in supplying the country with vital goods. It's dangerous work, regardless of the pandemic, involving operating heavy machinery and moving massive cargo freights, and with this project we seek to make the job a little safer using machine learning.

The objective of this project is to find the probability that a freight car with unknown contents contains hazardous materials, hazardous waste, or chemicals provided the rest of the information on the car's waybill. The ability to predict this would help freight line workers to anticipate potential hazards and take appropriate measures to stay safe.

We'll be using a mixed approach here, modeling-wise, using a simple multi-layer perceptron neural network to make predictions from the rail car's properties (i.e. its estimated short-line miles, number of articulated units, ownership category, and the intermodal code), and then we'll use the car's route data (origin, destination, and interchange states) to build a model analogous to sentiment analysis using NLP techniques.

About the Data¶

This project uses the Carload Sample Waybill (waybill) data. This data contains a stratified sample of all U.S. rail traffic for rail companies that terminate 4,500 or more revenue carloads annually. The waybill data is confidential for a number of reasons and can only be access by entities operating within approved industries, but the Surface Transportation Board makes a publicly available version of the data for independent use.

The publicly available waybill data, along with the reference guide documentation, can be downloaded from the Surface Transportation Board site here.

Package Imports¶

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

import plaidml.keras as pk
pk.install_backend()
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential, Model, load_model
from keras.layers import Dense, Conv1D, Dropout, LeakyReLU, MaxPooling1D, Embedding, Flatten, Input, Concatenate
from keras.callbacks import ReduceLROnPlateau, ModelCheckpoint, EarlyStopping

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score, classification_report, confusion_matrix

ETL and Data Cleaning ¶

The data file has been formatted as a flat, 247-byte-per-line .txt file. Let's take a look at the raw data to get an idea of what we're working with.

FILE_PATH = './'

with open(FILE_PATH + 'PublicUseWaybillSample2019.txt', 'r') as f:
    lines = f.readlines()
    print(lines[:5])

['01101701190001PC214XPI 41   0000   374220000027000002700000075800000000000000000020990030070140401704000641                  0641100000000000000000000000000000  0000879                                              T00004000000108000000030320000000\n', '12141701190001PS342FCA 46X  0001PC 461110000010000001000000193100000000000000000019999900860120400004000101                  0641100000662217628806100507050504EF8841379                                              T00004000000040000000077240000040\n', '01121801190001RF483FBC 49   0000   374220000030000003000000165200000000000000000020990000360130401704001465                  1810100000225060708007091015060208E41507900                                              T00004000000120000000066080000000\n', '01261801190001RA806XP  38   0000   371490000043000004300000496100000000000000000019190921080140400004000961                  1344110000151112509210091117001610E41500000                                              T00004000000172000000198440000000\n', '02021801190015PC112LO  41   0000   144130001738000173800006446300000000000000000010100031060240126801210003                  1274103280233052604200100815060903E46605600                                              T00018000002085600000773556000000\n']

At face-value, this data is a nightmare; it just looks like maybe 6 columns of perhaps tab-delimited alpha-numeric serial references. However, we can find some help in the reference guide in the section "2019 Surface Transportation Board Public Use Waybill 247-Byte Record Layout" (p. 99). Here we find a table that will help us in decoding the data, shown below:

Screen%20Shot%202021-01-18%20at%2012.08.32%20PM.png

This is only the first few rows of the table, but this table provides us with the field names in the data, along with the character positions that hold the data for a particular field.

First, since this table is in a pdf document, we'll need to get it into some workable format for us to use in our code. I was able to copy and paste it into Excel, but the formatting was still a hassle. After formatting the data, I saved it to a .csv file, to be read in by pandas.

Let's take a look at the data:

column_lookup_path = FILE_PATH + 'column_positions_lookup.csv'

column_lookup_df = pd.read_csv(column_lookup_path, delimiter=',')
column_lookup_df.head()

Now let's make a lookup dictionary that we can use to look up a column's name and the number of character positions that column should contain.

column_lookup_dict = {k: v for k,v in column_lookup_df[['name', 'number_of_positions']].to_numpy()}

We can use this dictionary, along with the following function, to parse each line of the data and create a JSON-like list of dictionaries containing the data.

def read_waybill_line(line, lookup_dict):
    out = {}
    start = 0
    for key in lookup_dict:
        if key == 'blank':
            continue
        end = start + int(lookup_dict[key])
        out[key] = line[start: end]
        start = end
        
    return out

dataset = [read_waybill_line(line, column_lookup_dict) for line in lines]

Let's convert the data to a pandas DataFrame and preview it.

df = pd.DataFrame(dataset)
df.head()

That looks much better!

Now that we have the data in a readable format, let's take a look at the shape, data types, and memory footprint.

df.shape

(676257, 61)

df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 676257 entries, 0 to 676256
Data columns (total 61 columns):
waybill_date                             676257 non-null object
accounting_period                        676257 non-null object
number_of_carloads                       676257 non-null object
car_ownership_category_code              676257 non-null object
aar_equipment_type_code                  676257 non-null object
aar_mechanical_designation               676257 non-null object
stb_car_type                             676257 non-null object
tofc_cofc_service_code                   676257 non-null object
number_of_tofc_cofc_units                676257 non-null object
trailer_container_unit_ownership_code    676257 non-null object
trailer_container_unit_type_code         676257 non-null object
hazardous_bulk_material_in_boxcar        676257 non-null object
commodity_code_stcc                      676257 non-null object
billed_weight_in_tons                    676257 non-null object
actual_weight_in_tons                    676257 non-null object
freight_revenue_dollars                  676257 non-null object
transit_charges_dollars                  676257 non-null object
miscellaneous_charges_dollars            676257 non-null object
inter_intra_state_code                   676257 non-null object
type_of_move                             676257 non-null object
all_rail_intermodal_code                 676257 non-null object
type_of_move_via_water                   676257 non-null object
transit_code                             676257 non-null object
substituted_truck_for_rail_service       676257 non-null object
rebill_code                              676257 non-null object
estimated_short_line_miles               676257 non-null object
stratum_identification                   676257 non-null object
subsample_code                           676257 non-null object
exact_expansion_factor                   676257 non-null object
theoretical_expansion_factor             676257 non-null object
number_of_interchanges                   676257 non-null object
origin_bea_area                          676257 non-null object
origin_freight_rate_territory            676257 non-null object
interchange_state_1                      676257 non-null object
interchange_state_2                      676257 non-null object
interchange_state_3                      676257 non-null object
interchange_state_4                      676257 non-null object
interchange_state_5                      676257 non-null object
interchange_state_6                      676257 non-null object
interchange_state_7                      676257 non-null object
interchange_state_8                      676257 non-null object
interchange_state_9                      676257 non-null object
termination_bea_area                     676257 non-null object
termination_freight_rate_territory       676257 non-null object
waybill_reporting_period_length          676257 non-null object
car_capacity                             676257 non-null object
nominal_car_capacity_expired             676257 non-null object
tare_weight_of_car                       676257 non-null object
outside_length                           676257 non-null object
outside_width                            676257 non-null object
outside_height                           676257 non-null object
extreme_outside_height                   676257 non-null object
type_of_wheel_bearings_and_brakes        676257 non-null object
number_of_axles                          676257 non-null object
draft_gear                               676257 non-null object
number_of_articulated_units              676257 non-null object
aar_error_codes                          676257 non-null object
routing_error_flag                       676257 non-null object
expanded_carloads                        676257 non-null object
expanded_tons                            676257 non-null object
expanded_freight_revenue                 676257 non-null object
dtypes: object(61)
memory usage: 2.4 GB

So this is a pretty big dataset, with over 670k rows, and it's eating up a ton of memory -- about 2.4 GB!

We have a couple of red flags here, though, that we should inspect: first, 670k rows is big, but not big enough to merit a 2.3 GB memory footprint; second, none of the columns contain any null values -- which we know should not be the case from our preview of the pandas data.

So we'll tackle the "no nulls" issue first. From inspecting the original .txt file, we can see that all the values that should be null have been filled with spaces, which eats up memory by both simply having the spaces there at all and by forcing some of the numerical columns to be encoded as text. However, with some regex, we can find and replace these values with np.nan.

Once that is done, we can cast the columns to their proper data types, and check their min and max values to make sure that we are using the proper number of bits for our integers.

The following function will complete these tasks and give us a summary at the end of the memory reduction.

def minimize_mem_usage(df, verbose=False):
    start_mem_usg = df.memory_usage(deep=True).sum() / 1024**2 # MB
    
    # replace space-filled values with np.nan
    df.replace(r'^\s*$', np.nan, regex=True, inplace=True)

    df['waybill_date'] = pd.to_datetime(df['waybill_date'], format='%m%d%y')
    df['accounting_period'] = pd.to_datetime(df['accounting_period'], format='%m%y')
    
    # cast numerical columns to int
    for col in df.columns:
        if col in ['waybill_date', 'accounting_period']:
            continue
        try:
            df[col] = df[col].astype(int)
        except ValueError:
            continue
    
    # drop na's
    df.dropna(axis=1, how='all', inplace=True)
    df.dropna(axis=0, how='all', inplace=True)
    
    NAlist = []
    for col in df.columns:
        if df[col].dtype not in ['object', 'datetime64[ns]']:
            
            IsInt = False
            mx = df[col].max()
            mn = df[col].min()
            
            # Integer does not support NA, therefore, NA needs to be filled
            if not np.isfinite(df[col]).all(): 
                NAlist.append(col)
                df[col].fillna(mn-1,inplace=True)  
                   
            # test if column can be converted to an integer
            asint = df[col].fillna(0).astype(np.int64)
            result = (df[col] - asint)
            result = result.sum()
            if result > -0.01 and result < 0.01:
                IsInt = True

            
            # Make Integer/unsigned Integer datatypes
            if IsInt:
                if mn >= 0:
                    if mx < 255:
                        df[col] = df[col].astype(np.uint8)
                    elif mx < 65535:
                        df[col] = df[col].astype(np.uint16)
                    elif mx < 4294967295:
                        df[col] = df[col].astype(np.uint32)
                    else:
                        df[col] = df[col].astype(np.uint64)
                else:
                    if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
                        df[col] = df[col].astype(np.int8)
                    elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
                        df[col] = df[col].astype(np.int16)
                    elif mn > np.iinfo(np.int32).min and mx < np.iinfo(np.int32).max:
                        df[col] = df[col].astype(np.int32)
                    elif mn > np.iinfo(np.int64).min and mx < np.iinfo(np.int64).max:
                        df[col] = df[col].astype(np.int64)    
            
            # Make float datatypes 32 bit
            else:
                df[col] = df[col].astype(np.float32)
    
    # Print final result
    end_mem_usg = df.memory_usage().sum() / 1024**2
    mem_reduction = 100*end_mem_usg/start_mem_usg
    
    if verbose:
        print("___MEMORY REDUCTION STATS:___")
        print("Starting memory usage was {:.2f}MB".format(start_mem_usg))
        print("Current memory usage is {:.2f}MB".format(end_mem_usg))
        print("This is {:.2f}% of the initial size".format(mem_reduction))
    
    return df

df = minimize_mem_usage(df, verbose=True)

___MEMORY REDUCTION STATS:___
Starting memory usage was 2462.56MB
Current memory usage is 127.05MB
This is 5.16% of the initial size

We have great performance coming out of our memory optimization, with a reduction of about 95% of the original memory!

Let's have a quick look at our datatypes to make sure everything went well and makes sense.

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 676257 entries, 0 to 676256
Data columns (total 53 columns):
waybill_date                             676257 non-null datetime64[ns]
accounting_period                        676257 non-null datetime64[ns]
number_of_carloads                       676257 non-null uint8
car_ownership_category_code              676257 non-null object
aar_equipment_type_code                  676257 non-null object
aar_mechanical_designation               676257 non-null object
stb_car_type                             676257 non-null uint8
tofc_cofc_service_code                   410310 non-null object
number_of_tofc_cofc_units                676257 non-null uint8
trailer_container_unit_ownership_code    410367 non-null object
trailer_container_unit_type_code         410367 non-null object
hazardous_bulk_material_in_boxcar        61465 non-null object
commodity_code_stcc                      676257 non-null uint16
billed_weight_in_tons                    676257 non-null uint32
actual_weight_in_tons                    676257 non-null uint32
freight_revenue_dollars                  676257 non-null uint32
transit_charges_dollars                  676257 non-null uint16
miscellaneous_charges_dollars            676257 non-null uint32
inter_intra_state_code                   676257 non-null uint8
type_of_move                             676257 non-null uint8
all_rail_intermodal_code                 676257 non-null uint8
type_of_move_via_water                   676257 non-null uint8
transit_code                             676257 non-null uint8
substituted_truck_for_rail_service       676257 non-null uint8
rebill_code                              676257 non-null uint8
estimated_short_line_miles               676257 non-null uint16
stratum_identification                   676257 non-null uint8
subsample_code                           676257 non-null uint8
exact_expansion_factor                   676257 non-null uint16
theoretical_expansion_factor             676257 non-null uint8
number_of_interchanges                   676257 non-null uint8
origin_bea_area                          676257 non-null uint8
origin_freight_rate_territory            676257 non-null uint8
interchange_state_1                      43118 non-null object
interchange_state_2                      5371 non-null object
interchange_state_3                      703 non-null object
interchange_state_4                      18 non-null object
interchange_state_5                      4 non-null object
termination_bea_area                     676257 non-null uint8
termination_freight_rate_territory       676257 non-null uint8
waybill_reporting_period_length          676257 non-null uint8
car_capacity                             676257 non-null uint16
nominal_car_capacity_expired             676257 non-null uint16
tare_weight_of_car                       676257 non-null uint16
outside_length                           676257 non-null uint16
outside_width                            676257 non-null uint16
outside_height                           676257 non-null uint16
extreme_outside_height                   676257 non-null uint16
type_of_wheel_bearings_and_brakes        667869 non-null object
number_of_axles                          676205 non-null object
draft_gear                               676257 non-null uint8
number_of_articulated_units              676257 non-null uint8
aar_error_codes                          676257 non-null uint16
dtypes: datetime64[ns](2), object(14), uint16(12), uint32(4), uint8(21)
memory usage: 127.1+ MB

We can see above that we now have plenty of null values (which is a good sign), and our integer data types have been optimized to the lowest bit size of unsigned integer types, giving us a very reasonable 127MB of data.

We'll have one final look at the data, and then save it to a .csv for later use in analysis.

df.head()

df.to_csv('./prelim_waybill_data.csv', index=False)

Combining with other data sources¶

We have our waybill data clean now, but we will need a couple of other datasets to get a complete picture. In particular, we will need to decode the Surface Transportation Commodity Codes (STCC) and Business Economic Area (BEA) codes, so that we can know what type of freight is being moved and its origin and destination points.

The STCC code headers and BEA codes can be found in similar tables to the data position tables in the reference guide. I've copied the tables into csv's.

We'll read all the data back into Spark DataFrames, so that we can take advantage of Spark SQL to combine the data.

import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession
sc = pyspark.SparkContext()
spark = SparkSession.builder.getOrCreate()

waybill = spark.read.csv('./prelim_waybill_data.csv', header=True, inferSchema=True)
stcc = spark.read.csv('./stcc_data.csv', header=True, inferSchema=True)
bea = spark.read.csv('./bea_codes.csv', header=True, inferSchema=True)

waybill.createOrReplaceTempView('waybill')
stcc.createOrReplaceTempView('stcc')
bea.createOrReplaceTempView('bea')

Read in hazardous data¶

haz_query = """
SELECT
    w.*,
    s.short_description as stcc_description,
    o.location as origin_location,
    t.location as terminal_location,
    s.is_hazardous
FROM waybill w
JOIN stcc s ON s.stcc_code = LEFT(w.commodity_code_stcc,2) /* stcc_code contains only the first 2 digits of the stcc */
JOIN bea o ON o.bea_code = w.origin_bea_area
JOIN bea t ON t.bea_code = w.termination_bea_area
WHERE 1=1
    AND s.is_hazardous = 1
"""

hazardous = spark.sql(haz_query)
hazardous.count()

34243

hdf = hazardous.toPandas()
hdf.head()

Read in non-hazardous data¶

nonhaz_query = """
SELECT
    w.*,
    s.short_description as stcc_description,
    o.location as origin_location,
    t.location as terminal_location,
    s.is_hazardous
FROM waybill w
JOIN stcc s ON s.stcc_code = LEFT(w.commodity_code_stcc,2)
JOIN bea o ON o.bea_code = w.origin_bea_area
JOIN bea t ON t.bea_code = w.termination_bea_area
WHERE 1=1
    AND s.is_hazardous = 0
"""

There are far more non-hazardous shipments than hazardous ones, so we'll take a random sample from the non-hazardous data equal in size to the hazardous dataset.

nonhaz = spark.sql(nonhaz_query).rdd.takeSample(False, len(hdf), 42)
nonhaz = sc.parallelize(nonhaz)
schema = hazardous.schema
nhdf = nonhaz.toDF(schema=schema).toPandas()

nhdf.head()

As our objective is to predict whether or not a shipment is hazardous, we won't need all of the columns. I've selected a handful of columns that would appear, based on their data descriptions in the reference guide, to have a relationship with whether or not the freight contains hazardous materials.

df = pd.concat([hdf, nhdf], axis=0)

relevant_cols = [
    'is_hazardous',
    'car_ownership_category_code',
    'miscellaneous_charges_dollars',
    'all_rail_intermodal_code',
    'estimated_short_line_miles',
    'number_of_interchanges',
    'number_of_articulated_units',
    'origin_location', 
    'interchange_state_1', 
    'interchange_state_2',                
    'interchange_state_3', 
    'interchange_state_4', 
    'interchange_state_5', 
    'terminal_location'
]

df = df[relevant_cols]
df.head()

len(df)

68486

This is now our prepared dataset, and I'll write it to a csv to be read in for EDA.

df.to_csv('./prepared_waybill_data.csv', index=False)

EDA and Feature Selection ¶

Let's start by taking a look at the correlation matrix, to see if any of our numerical columns correlate with the is_hazardous column.

df.corr()

There are slight negative correlations with the number of articulated units and the estimated short-line miles. The all_rail_intermodal_code should not actually be a numerical value, as the codes (1, 2, and 9) simply refer to whether or not the freight has been moved using multiple travel media (i.e. it traveled by ship for a time, then moved to the railway). We'll examine these values categorically in a future step.

As we discover columns that have weak predictive power, we will append them to a list of columns to drop, then we will drop them from our data at the end of our analysis.

drop_cols = []

Next, we'll take a look at the distribution of each data point. While some columns don't correlate with hazard, if their distributions differ widely between hazardous and non-hazardous freight, then we might still be able to get some segmenting power from them.

grouped = df.groupby('is_hazardous')

grouped.mean()

grouped.std()

At a glance, we can see a few things: miscellaneous charges tend to be higher and more variant for non-hazardous shipments; estimated short-line miles are far less for hazardous freight (meaning hazardous freight travels a smaller distance, on average); and non-hazardous freight tend to travel in articulated units more frequently than hazardous freight.

Let's visualize these with kernel density plots.

for c in df.columns[df.dtypes!='object']:
    if c == 'is_hazardous':
        continue
    
    sns.kdeplot(df.loc[df['is_hazardous']==0, c], label='Non-Hazardous')
    sns.kdeplot(df.loc[df['is_hazardous']==1, c], label='Hazardous')
    if c == 'miscellaneous_charges_dollars':
        plt.xlim(0, 200)
    plt.title(c)
    plt.legend()
    plt.show()
    
    if c in ['all_rail_intermodal_code', 'number_of_interchanges', 'number_of_articulated_units']:
        print('__HAZARDOUS__')
        print(df.loc[df['is_hazardous']==0, c].value_counts())
        print('__NON-HAZARDOUS__')
        print(df.loc[df['is_hazardous']==1, c].value_counts())
    
    print('='*20)
    print('\n')

====================

__HAZARDOUS__
9    20960
1    13106
2      177
Name: all_rail_intermodal_code, dtype: int64
__NON-HAZARDOUS__
1    19223
9    14988
2       32
Name: all_rail_intermodal_code, dtype: int64
====================

====================

__HAZARDOUS__
0    29541
1     4099
2      530
3       70
4        3
Name: number_of_interchanges, dtype: int64
__NON-HAZARDOUS__
0    29407
1     4173
2      549
3      110
4        3
5        1
Name: number_of_interchanges, dtype: int64
====================

__HAZARDOUS__
4    17436
0    11660
5     2647
3     2471
2       29
Name: number_of_articulated_units, dtype: int64
__NON-HAZARDOUS__
0    30785
4     3065
3      234
5      150
2        9
Name: number_of_articulated_units, dtype: int64
====================

It seems that even though miscellaneous_charges_dollars tend to be higher, on average, for non-hazardous freight, we don't have many non-zero charges, so the data for this column will not be particularly useful. We'll add it to the drop list.

We can see some separability in the all rail/intermodal code. It seems that non-hazardous freight tend to have a higher likelihood of being unidentified (code 9), so we'll keep this column for now.

As we gathered from the correlation matrix, we can see that the estimated short-line miles are shorter for hazardous freight. This could be good for our future model.

The number of interchanges seems to be the same for both types of freight, so we will drop this column.

Hazardous shipments are less likely to travel in articulated units, as we saw in the correlation matrix, so we will keep this column.

drop_cols.append('miscellaneous_charges_dollars')

drop_cols.append('number_of_interchanges')

Let's take a look at our categorical columns now.

g = sns.FacetGrid(df, col='is_hazardous', height=8)
g.map(sns.countplot, 'all_rail_intermodal_code')

/Users/nastory/opt/anaconda3/lib/python3.7/site-packages/seaborn/axisgrid.py:643: UserWarning: Using the countplot function without specifying `order` is likely to produce an incorrect plot.
  warnings.warn(warning)

<seaborn.axisgrid.FacetGrid at 0x7fcdf30b4710>

We see some evidence of separability in our all-rail/intermodal codes, so we will officially retain this column.

g = sns.FacetGrid(df, col='is_hazardous', height=8)
g.map(sns.countplot, 'car_ownership_category_code')

/Users/nastory/opt/anaconda3/lib/python3.7/site-packages/seaborn/axisgrid.py:643: UserWarning: Using the countplot function without specifying `order` is likely to produce an incorrect plot.
  warnings.warn(warning)

<seaborn.axisgrid.FacetGrid at 0x7fce842dd250>

We see here that hazardous freight is more likely to be privately owned, so we will retain this column as well.

The following plots are for the origin locations, terminal locations, and interchange states along the route. These appear to have a good amount of predictive power, as it seems that hazardous and non-hazardous freight tend to originate in separate areas, arrive in separate areas, and follow differing routes that hazardous freight.

def plot_bars(df, colname):
    haz_cnt = df.loc[df['is_hazardous']==1, colname].value_counts().reset_index(drop=False)
    haz_cnt['is_hazardous'] = 'yes'
    nhaz_cnt = df.loc[df['is_hazardous']==0, colname].value_counts().reset_index(drop=False)
    nhaz_cnt['is_hazardous'] = 'no'

    cnts = pd.concat([haz_cnt, nhaz_cnt], axis=0)
    cnts.head()

    g = sns.catplot(
        data=cnts, kind="bar",
        x="index", y=colname, hue="is_hazardous",
        ci="sd", palette="dark", alpha=.6, height=8, aspect=3
    )
    
    x_max = min(len(cnts), 50)
    
    plt.xlim((-1,x_max))
    plt.xticks(rotation=90)
    plt.show()

Below, one can see that the distributions of terminal locations and origin locations differ widely between hazardous and non-hazardous freight. This will be a powerful predictor.

plot_bars(df, 'origin_location')
plot_bars(df, 'terminal_location')

The distributions for interchange states, while not as drastically different as those for origin and terminal location, will still be very informative for our model as well.

plot_bars(df, 'interchange_state_1')
plot_bars(df, 'interchange_state_2')
plot_bars(df, 'interchange_state_3')

Very few shipments have more than 3 interchanges, so we will drop interchange states 4 and 5.

drop_cols.append('interchange_state_4')
drop_cols.append('interchange_state_5')

Let's execute our column drops and save the final data to a csv.

df.drop(drop_cols, axis=1, inplace=True)

df.to_csv('./waybill_relevant_data.csv', index=False)

Feature Engineering ¶

During our EDA, we were able to do the feature selection, so now all that's left is to prepare the data with encoding and scaling. We'll also be developing the "route sentences," which will be the text representation of our train routes.

For sequence data¶

We're going to try using some NLP techniques on the train route, since right now the routes are stored as sequences of string data. So let's engineer the features in the same ways we would engineer sentences, for example.

In this way our predictions become comparable to a simple case of sentiment analysis.

seq_cols = [
    'origin_location',
    'interchange_state_1',
    'interchange_state_2',
    'interchange_state_3',
    'terminal_location'
]


def to_sentence(row):
    ls = []
    for c in seq_cols:
        if (row[c] is None) or (row[c] is np.nan):
            continue
            
        ls.append(row[c].replace(', ', '').replace(' ', '').replace('-', ''))
        
    return ' '.join(ls)

    
routes = df.apply(to_sentence, axis=1).to_list()

routes[:10]

['ChicagoGaryKenoshaILINWI LosAngelesRiversideOrangeCountyCAAZ',
 'ChicagoGaryKenoshaILINWI PhiladelphiaWilmingtonAtlanticCityPANJDEMD',
 'NewOrleansLAMS AL BirminghamAL',
 'BatonRougeLAMS IL AB Alberta',
 'ChicagoGaryKenoshaILINWI ShreveportBossierCityLAAR',
 'CasperWYIDUT TX BeaumontPortArthurTX',
 'NewOrleansLAMS BeaumontPortArthurTX',
 'LosAngelesRiversideOrangeCountyCAAZ LosAngelesRiversideOrangeCountyCAAZ',
 'NewOrleansLAMS BeaumontPortArthurTX',
 'HoustonGalvestonBrazoriaTX TX TX Mexico']

Now that we have our "route sentences" we need to know the vocabulary size.

long_str = ' '.join(routes)

long_ls = long_str.split()
vocab_set = set()
for l in long_ls:
    vocab_set.add(l)
    
print(len(vocab_set))

212

We have 212 distinct words in our vocab, so we'll choose a vocab size of 300, allowing for some unseen values. The maximum route length is five stops, so our max sentence length will be 5 words.

vocab_size = 300
max_length = 5

Now we apply our encoding and padding, and then dump the numpy arrays of our prepared data and the encoder to a pickle file.

encoded_routes = [one_hot(r, vocab_size) for r in routes]
encoded_routes[:5]

[[187, 1], [187, 94], [224, 278, 1], [5, 36, 127, 106], [187, 247]]

with open('./one_hot_encoder.pickle', 'wb') as f:
    pickle.dump(one_hot, f)

padded_routes = pad_sequences(encoded_routes, maxlen=max_length, padding='post')
padded_routes[:5]

array([[187,   1,   0,   0,   0],
       [187,  94,   0,   0,   0],
       [224, 278,   1,   0,   0],
       [  5,  36, 127, 106,   0],
       [187, 247,   0,   0,   0]], dtype=int32)

X_seq = padded_routes
y = df['is_hazardous'].to_numpy()

seq_data = (X_seq, y)

with open('sequence_data.pickle', 'wb') as f:
    pickle.dump(seq_data, f)

For numerical and category data¶

For our traditional numerical and category data, we'll be applying one-hot encoding and min-max scaling, respectively.

num_cols = [
    'estimated_short_line_miles',
    'number_of_articulated_units'
]

cat_cols = [
    'car_ownership_category_code', 
    'all_rail_intermodal_code'
]

nums = df[num_cols]
cats = df[cat_cols].astype(str)

df['all_rail_intermodal_code'] = df['all_rail_intermodal_code'].astype(str)
encoded_cats = pd.get_dummies(cats)
encoded_cats.head()

X_cats = encoded_cats.to_numpy()

scaler = MinMaxScaler()
X_nums = scaler.fit_transform(nums)

X_nums[:10]

array([[0.38061041, 0.        ],
       [0.1454219 , 0.        ],
       [0.06283662, 0.        ],
       [0.44344704, 0.8       ],
       [0.15439856, 0.        ],
       [0.27648115, 0.        ],
       [0.04847397, 0.        ],
       [0.01256732, 0.        ],
       [0.04847397, 0.        ],
       [0.11849192, 0.        ]])

Now that the data is scaled and encoded, we will pickle the scaler and the numerical dataset.

with open('./min_max_scaler.pickle', 'wb') as f:
    pickle.dump(scaler, f)

X = np.concatenate([X_nums, X_cats], axis=1)
X.shape

(68486, 8)

X_seq.shape

(68486, 5)

num_data = (X, y)

with open('./numerical_data.pickle', 'wb') as f:
    pickle.dump(num_data, f)

Now we are ready to send our data to model training!

Model Definition ¶

We're going to be training a Siamese Neural Network to make our predictions. This is not a very common approach, as it's typical that all the data going into a given neural network would be of the same type, but here we will need to feed the network numerical data as well as sequences of words, which require differing input layers and differing initial layer structures.

The structure of our Siamese Neural Network will be the following: the numerical data will be fed into a shallow multilayer perceptron model whose output will be weights from fully-connected layer with 64 nodes; the train route sequence data will be embedded into a vector field and then fed into a 1D Convolutional Neural Network, like one might use for sentiment analysis, outputting another fully-connected layer with 64 nodes. The weights from the two models' outputs will then be concatenated and fed into a third neural network -- a deep learning MLP -- which will give us our final predictions.

Reading in training and validation data¶

with open('./sequence_data.pickle', 'rb') as f:
    sequence_data = pickle.load(f)
    
with open('./numerical_data.pickle', 'rb') as f:
    numeric_data = pickle.load(f)

X_seq, y = sequence_data

X_num, y = numeric_data

print(X_seq.shape)
print(X_num.shape)
print(y.shape)

(68486, 5)
(68486, 8)
(68486,)

X = np.concatenate([X_num, X_seq], axis=1)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=.33)

xs_train, xs_test, ys_train, ys_test = train_test_split(X_seq, y, test_size=.33)
xn_train, xn_test, yn_train, yn_test = train_test_split(X_num, y, test_size=.33)

Siamese Neural Network¶

vocab_size = 300
max_length = 5
batch_size = 32

def build_model():

    # cnn with only 1 dense layer
    seq_input = Input(shape=(max_length,))

    x = Embedding(vocab_size, 3, input_length=max_length)(seq_input)

    x = Conv1D(256, kernel_size=3, strides=1)(x)
    x = LeakyReLU()(x)
    x = MaxPooling1D(pool_size=2)(x)

    x = Flatten()(x)

    seq_output = Dense(64, activation='relu')(x)

    cnn = Model(inputs=seq_input, outputs=seq_output)

    # mlp with only one dense layer
    num_input = Input(shape=(8,))

    mlp_output = Dense(64, activation='relu')(num_input)

    mlp = Model(inputs=num_input, outputs=mlp_output)

    # combine
    combined = Concatenate()([cnn.output, mlp.output])
    
    z = Dense(512)(combined)
    z = LeakyReLU()(z)
    z = Dropout(.5)(z)
    
    z = Dense(512)(z)
    z = LeakyReLU()(z)
    z = Dropout(.2)(z)
    
    z = Dense(256)(z)
    z = LeakyReLU()(z)
    z = Dropout(.2)(z)
    
    z = Dense(64)(z)
    z = LeakyReLU()(z)
    z = Dropout(.2)(z)
    
    output = Dense(1, activation='sigmoid')(z)

    final_model = Model(inputs=mlp.inputs + cnn.inputs, outputs=[output])

    final_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
    
    return final_model

model = build_model()

INFO:plaidml:Opening device "metal_amd_radeon_pro_570x.0"

num_train = x_train[:, :8]
seq_train = x_train[:, 8:]

num_test = x_test[:, :8]
seq_test = x_test[:, 8:]

xc_train = [num_train, seq_train] 
xc_test = [num_test, seq_test]

learning_rate_reduction_combined = ReduceLROnPlateau(monitor='val_acc', patience=3, 
                                            verbose=2, factor=0.5, min_lr=0.00001)

best_model_combined = ModelCheckpoint('./combined_cnn_mlp_model.3.h5', monitor='val_acc', verbose=2, 
                             save_best_only=True, mode='max')

early_stopping_combined = EarlyStopping(monitor='val_loss', min_delta=1e-10, 
                               patience=10, restore_best_weights=True)

hist = model.fit(xc_train, y_train,
         batch_size=batch_size,
         epochs=50,
         validation_data=(xc_test, y_test),
         callbacks = [learning_rate_reduction_combined, best_model_combined, early_stopping_combined],
         verbose=1
)

Train on 45885 samples, validate on 22601 samples
Epoch 1/50
45885/45885 [==============================] - 16s 342us/step - loss: 0.3317 - acc: 0.8795 - val_loss: 0.2978 - val_acc: 0.8911

Epoch 00001: val_acc improved from -inf to 0.89107, saving model to ./combined_cnn_mlp_model.3.h5
Epoch 2/50
45885/45885 [==============================] - 11s 247us/step - loss: 0.2974 - acc: 0.8936 - val_loss: 0.2864 - val_acc: 0.8936

Epoch 00002: val_acc improved from 0.89107 to 0.89363, saving model to ./combined_cnn_mlp_model.3.h5
Epoch 3/50
45885/45885 [==============================] - 11s 234us/step - loss: 0.2903 - acc: 0.8972 - val_loss: 0.2913 - val_acc: 0.8863

Epoch 00003: val_acc did not improve from 0.89363
Epoch 4/50
45885/45885 [==============================] - 11s 232us/step - loss: 0.2814 - acc: 0.9008 - val_loss: 0.2839 - val_acc: 0.8939

Epoch 00004: val_acc improved from 0.89363 to 0.89394, saving model to ./combined_cnn_mlp_model.3.h5
Epoch 5/50
45885/45885 [==============================] - 11s 230us/step - loss: 0.2763 - acc: 0.9009 - val_loss: 0.2687 - val_acc: 0.9004

Epoch 00005: val_acc improved from 0.89394 to 0.90040, saving model to ./combined_cnn_mlp_model.3.h5
Epoch 6/50
45885/45885 [==============================] - 10s 226us/step - loss: 0.2724 - acc: 0.9025 - val_loss: 0.2719 - val_acc: 0.9010

Epoch 00006: val_acc improved from 0.90040 to 0.90098, saving model to ./combined_cnn_mlp_model.3.h5
Epoch 7/50
45885/45885 [==============================] - 10s 228us/step - loss: 0.2698 - acc: 0.9044 - val_loss: 0.2676 - val_acc: 0.9036

Epoch 00007: val_acc improved from 0.90098 to 0.90363, saving model to ./combined_cnn_mlp_model.3.h5
Epoch 8/50
45885/45885 [==============================] - 10s 227us/step - loss: 0.2652 - acc: 0.9066 - val_loss: 0.2675 - val_acc: 0.9021

Epoch 00008: val_acc did not improve from 0.90363
Epoch 9/50
45885/45885 [==============================] - 10s 224us/step - loss: 0.2632 - acc: 0.9070 - val_loss: 0.2576 - val_acc: 0.9055

Epoch 00009: val_acc improved from 0.90363 to 0.90554, saving model to ./combined_cnn_mlp_model.3.h5
Epoch 10/50
45885/45885 [==============================] - 10s 223us/step - loss: 0.2609 - acc: 0.9081 - val_loss: 0.2680 - val_acc: 0.9054

Epoch 00010: val_acc did not improve from 0.90554
Epoch 11/50
45885/45885 [==============================] - 10s 224us/step - loss: 0.2580 - acc: 0.9093 - val_loss: 0.2582 - val_acc: 0.9043

Epoch 00011: val_acc did not improve from 0.90554
Epoch 12/50
45885/45885 [==============================] - 10s 223us/step - loss: 0.2556 - acc: 0.9098 - val_loss: 0.2547 - val_acc: 0.9063

Epoch 00012: val_acc improved from 0.90554 to 0.90633, saving model to ./combined_cnn_mlp_model.3.h5
Epoch 13/50
45885/45885 [==============================] - 10s 224us/step - loss: 0.2547 - acc: 0.9106 - val_loss: 0.2592 - val_acc: 0.9099

Epoch 00013: val_acc improved from 0.90633 to 0.90987, saving model to ./combined_cnn_mlp_model.3.h5
Epoch 14/50
45885/45885 [==============================] - 10s 224us/step - loss: 0.2516 - acc: 0.9108 - val_loss: 0.2549 - val_acc: 0.9057

Epoch 00014: val_acc did not improve from 0.90987
Epoch 15/50
45885/45885 [==============================] - 10s 223us/step - loss: 0.2504 - acc: 0.9118 - val_loss: 0.2537 - val_acc: 0.9085

Epoch 00015: val_acc did not improve from 0.90987
Epoch 16/50
45885/45885 [==============================] - 10s 224us/step - loss: 0.2487 - acc: 0.9127 - val_loss: 0.2580 - val_acc: 0.9099

Epoch 00016: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.

Epoch 00016: val_acc did not improve from 0.90987
Epoch 17/50
45885/45885 [==============================] - 10s 226us/step - loss: 0.2380 - acc: 0.9161 - val_loss: 0.2442 - val_acc: 0.9128

Epoch 00017: val_acc improved from 0.90987 to 0.91279, saving model to ./combined_cnn_mlp_model.3.h5
Epoch 18/50
45885/45885 [==============================] - 10s 225us/step - loss: 0.2352 - acc: 0.9169 - val_loss: 0.2448 - val_acc: 0.9123

Epoch 00018: val_acc did not improve from 0.91279
Epoch 19/50
45885/45885 [==============================] - 10s 226us/step - loss: 0.2346 - acc: 0.9173 - val_loss: 0.2488 - val_acc: 0.9137

Epoch 00019: val_acc improved from 0.91279 to 0.91372, saving model to ./combined_cnn_mlp_model.3.h5
Epoch 20/50
45885/45885 [==============================] - 10s 227us/step - loss: 0.2329 - acc: 0.9183 - val_loss: 0.2510 - val_acc: 0.9120

Epoch 00020: val_acc did not improve from 0.91372
Epoch 21/50
45885/45885 [==============================] - 11s 249us/step - loss: 0.2312 - acc: 0.9184 - val_loss: 0.2512 - val_acc: 0.9097

Epoch 00021: val_acc did not improve from 0.91372
Epoch 22/50
45885/45885 [==============================] - 14s 301us/step - loss: 0.2311 - acc: 0.9191 - val_loss: 0.2437 - val_acc: 0.9124

Epoch 00022: ReduceLROnPlateau reducing learning rate to 0.0002500000118743628.

Epoch 00022: val_acc did not improve from 0.91372
Epoch 23/50
45885/45885 [==============================] - 11s 246us/step - loss: 0.2246 - acc: 0.9204 - val_loss: 0.2426 - val_acc: 0.9142

Epoch 00023: val_acc improved from 0.91372 to 0.91421, saving model to ./combined_cnn_mlp_model.3.h5
Epoch 24/50
45885/45885 [==============================] - 13s 288us/step - loss: 0.2227 - acc: 0.9213 - val_loss: 0.2407 - val_acc: 0.9149

Epoch 00024: val_acc improved from 0.91421 to 0.91492, saving model to ./combined_cnn_mlp_model.3.h5
Epoch 25/50
45885/45885 [==============================] - 11s 244us/step - loss: 0.2214 - acc: 0.9219 - val_loss: 0.2404 - val_acc: 0.9117

Epoch 00025: val_acc did not improve from 0.91492
Epoch 26/50
45885/45885 [==============================] - 11s 247us/step - loss: 0.2203 - acc: 0.9216 - val_loss: 0.2432 - val_acc: 0.9150

Epoch 00026: val_acc improved from 0.91492 to 0.91496, saving model to ./combined_cnn_mlp_model.3.h5
Epoch 27/50
45885/45885 [==============================] - 12s 270us/step - loss: 0.2194 - acc: 0.9226 - val_loss: 0.2435 - val_acc: 0.9154

Epoch 00027: val_acc improved from 0.91496 to 0.91545, saving model to ./combined_cnn_mlp_model.3.h5
Epoch 28/50
45885/45885 [==============================] - 12s 272us/step - loss: 0.2200 - acc: 0.9222 - val_loss: 0.2439 - val_acc: 0.9158

Epoch 00028: val_acc improved from 0.91545 to 0.91580, saving model to ./combined_cnn_mlp_model.3.h5
Epoch 29/50
45885/45885 [==============================] - 13s 281us/step - loss: 0.2194 - acc: 0.9224 - val_loss: 0.2431 - val_acc: 0.9142

Epoch 00029: val_acc did not improve from 0.91580
Epoch 30/50
45885/45885 [==============================] - 13s 284us/step - loss: 0.2178 - acc: 0.9230 - val_loss: 0.2428 - val_acc: 0.9155

Epoch 00030: val_acc did not improve from 0.91580
Epoch 31/50
45885/45885 [==============================] - 13s 277us/step - loss: 0.2183 - acc: 0.9234 - val_loss: 0.2429 - val_acc: 0.9140

Epoch 00031: ReduceLROnPlateau reducing learning rate to 0.0001250000059371814.

Epoch 00031: val_acc did not improve from 0.91580
Epoch 32/50
45885/45885 [==============================] - 11s 250us/step - loss: 0.2135 - acc: 0.9240 - val_loss: 0.2409 - val_acc: 0.9157

Epoch 00032: val_acc did not improve from 0.91580
Epoch 33/50
45885/45885 [==============================] - 11s 236us/step - loss: 0.2126 - acc: 0.9246 - val_loss: 0.2469 - val_acc: 0.9156

Epoch 00033: val_acc did not improve from 0.91580
Epoch 34/50
45885/45885 [==============================] - 11s 235us/step - loss: 0.2123 - acc: 0.9245 - val_loss: 0.2399 - val_acc: 0.9161

Epoch 00034: val_acc improved from 0.91580 to 0.91611, saving model to ./combined_cnn_mlp_model.3.h5
Epoch 35/50
45885/45885 [==============================] - 11s 236us/step - loss: 0.2124 - acc: 0.9247 - val_loss: 0.2441 - val_acc: 0.9156

Epoch 00035: val_acc did not improve from 0.91611
Epoch 36/50
45885/45885 [==============================] - 11s 235us/step - loss: 0.2115 - acc: 0.9252 - val_loss: 0.2417 - val_acc: 0.9163

Epoch 00036: val_acc improved from 0.91611 to 0.91629, saving model to ./combined_cnn_mlp_model.3.h5
Epoch 37/50
45885/45885 [==============================] - 11s 235us/step - loss: 0.2113 - acc: 0.9249 - val_loss: 0.2422 - val_acc: 0.9158

Epoch 00037: val_acc did not improve from 0.91629
Epoch 38/50
45885/45885 [==============================] - 11s 236us/step - loss: 0.2108 - acc: 0.9253 - val_loss: 0.2427 - val_acc: 0.9167

Epoch 00038: val_acc improved from 0.91629 to 0.91669, saving model to ./combined_cnn_mlp_model.3.h5
Epoch 39/50
45885/45885 [==============================] - 11s 232us/step - loss: 0.2111 - acc: 0.9252 - val_loss: 0.2437 - val_acc: 0.9161

Epoch 00039: val_acc did not improve from 0.91669
Epoch 40/50
45885/45885 [==============================] - 11s 234us/step - loss: 0.2111 - acc: 0.9252 - val_loss: 0.2409 - val_acc: 0.9171

Epoch 00040: val_acc improved from 0.91669 to 0.91708, saving model to ./combined_cnn_mlp_model.3.h5
Epoch 41/50
45885/45885 [==============================] - 11s 235us/step - loss: 0.2097 - acc: 0.9256 - val_loss: 0.2396 - val_acc: 0.9160

Epoch 00041: val_acc did not improve from 0.91708
Epoch 42/50
45885/45885 [==============================] - 11s 235us/step - loss: 0.2106 - acc: 0.9252 - val_loss: 0.2433 - val_acc: 0.9158

Epoch 00042: val_acc did not improve from 0.91708
Epoch 43/50
45885/45885 [==============================] - 11s 233us/step - loss: 0.2097 - acc: 0.9257 - val_loss: 0.2403 - val_acc: 0.9173

Epoch 00043: val_acc improved from 0.91708 to 0.91726, saving model to ./combined_cnn_mlp_model.3.h5
Epoch 44/50
45885/45885 [==============================] - 11s 234us/step - loss: 0.2085 - acc: 0.9254 - val_loss: 0.2420 - val_acc: 0.9168

Epoch 00044: val_acc did not improve from 0.91726
Epoch 45/50
45885/45885 [==============================] - 11s 234us/step - loss: 0.2089 - acc: 0.9255 - val_loss: 0.2436 - val_acc: 0.9161

Epoch 00045: val_acc did not improve from 0.91726
Epoch 46/50
45885/45885 [==============================] - 11s 234us/step - loss: 0.2091 - acc: 0.9260 - val_loss: 0.2422 - val_acc: 0.9170

Epoch 00046: ReduceLROnPlateau reducing learning rate to 6.25000029685907e-05.

Epoch 00046: val_acc did not improve from 0.91726
Epoch 47/50
45885/45885 [==============================] - 11s 235us/step - loss: 0.2066 - acc: 0.9269 - val_loss: 0.2433 - val_acc: 0.9175

Epoch 00047: val_acc improved from 0.91726 to 0.91748, saving model to ./combined_cnn_mlp_model.3.h5
Epoch 48/50
45885/45885 [==============================] - 11s 235us/step - loss: 0.2059 - acc: 0.9269 - val_loss: 0.2422 - val_acc: 0.9174

Epoch 00048: val_acc did not improve from 0.91748
Epoch 49/50
45885/45885 [==============================] - 11s 235us/step - loss: 0.2051 - acc: 0.9270 - val_loss: 0.2435 - val_acc: 0.9170

Epoch 00049: val_acc did not improve from 0.91748
Epoch 50/50
45885/45885 [==============================] - 11s 236us/step - loss: 0.2054 - acc: 0.9269 - val_loss: 0.2430 - val_acc: 0.9178

Epoch 00050: val_acc improved from 0.91748 to 0.91779, saving model to ./combined_cnn_mlp_model.3.h5

This model appears to produce very good results, with an accuracy score of 91.8%.

In the following section, we'll evaluate its performance a bit deeper.

Model Evaluation ¶

with open('./sequence_data.pickle', 'rb') as f:
    sequence_data = pickle.load(f)
    
with open('./numerical_data.pickle', 'rb') as f:
    numeric_data = pickle.load(f)
    
X_seq, y = sequence_data
X_num, y = numeric_data

X = np.concatenate([X_num, X_seq], axis=1)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=.33)

num_train = x_train[:, :8]
seq_train = x_train  

num_test = x_test[:, :8]
seq_test = x_test  

xc_train = [num_train, seq_train] 
xc_test = [num_test, seq_test]

def false_neg(preds, y_test):
    fn = 0
    for i in range(len(preds)): 
        if preds[i]==0 and y_test[i]!=preds[i]:
            fn += 1

    return fn / len(y_test)

def evaluate(preds, y_test):
    print(f'Accuracy Score: {round(accuracy_score(preds, y_test)*100, 2)}%')
    print(f'AUC ROC Score: {round(roc_auc_score(preds, y_test)*100, 2)}%')
    print('\n__CLASSIFICATION REPORT__:')
    print(classification_report(preds, y_test))
    print('\n__CONFUSION MATRIX__:')
    print(confusion_matrix(preds, y_test)/len(y_test))
    print('\n__FALSE NEGATIVE RATE__:')
    print(f'{round(false_neg(preds, y_test)*100,2)}%')

cm = load_model('./combined_cnn_mlp_model.3.h5')

Deep Learning Evaluation¶

input_data = [x_test[:, :8], x_test[:, 8:]]
cm_preds = cm.predict(input_data)
evaluate(np.rint(cm_preds), y_test)

Accuracy Score: 92.59%
AUC ROC Score: 92.6%

__CLASSIFICATION REPORT__:
              precision    recall  f1-score   support

         0.0       0.93      0.92      0.93     11427
         1.0       0.92      0.93      0.93     11174

    accuracy                           0.93     22601
   macro avg       0.93      0.93      0.93     22601
weighted avg       0.93      0.93      0.93     22601


__CONFUSION MATRIX__:
[[0.46489093 0.04070616]
 [0.0334056  0.4609973 ]]

__FALSE NEGATIVE RATE__:
4.07%

For this evaluation, model performed at 92.5% accuracy, with precision and recall falling at similar values between 92% and 93%. Crucially, the false negative rate is very low here, at only ~4% -- which means our model will be able to keep railway workers safe about 96% of the time when a car with unknown contents arrives.

Conclusion ¶

Overall, I was very please with this model's performance -- with a validated accuracy score of 91.8% and a false negative rate of only 4%. Treating the rail car's routes as sentences and applying NLP methods, like vector embedding and sentiment analysis, worked better than I had expected, and has made me consider other applications where treating sequences in the same way as language might be beneficial.

	is_hazardous	miscellaneous_charges_dollars	all_rail_intermodal_code	estimated_short_line_miles	number_of_interchanges	number_of_articulated_units
is_hazardous	1.000000	-0.008042	-0.175271	-0.354814	0.008080	-0.570608
miscellaneous_charges_dollars	-0.008042	1.000000	0.010225	0.009447	-0.015822	-0.004391
all_rail_intermodal_code	-0.175271	0.010225	1.000000	0.147743	0.119982	0.141289
estimated_short_line_miles	-0.354814	0.009447	0.147743	1.000000	0.208191	0.504362
number_of_interchanges	0.008080	-0.015822	0.119982	0.208191	1.000000	-0.089477
number_of_articulated_units	-0.570608	-0.004391	0.141289	0.504362	-0.089477	1.000000

	miscellaneous_charges_dollars	all_rail_intermodal_code	estimated_short_line_miles	number_of_interchanges	number_of_articulated_units
is_hazardous
0	11.42581	5.901936	1314.652630	0.157142	2.641416
1	7.93879	4.502497	761.128114	0.164063	0.400958

	miscellaneous_charges_dollars	all_rail_intermodal_code	estimated_short_line_miles	number_of_interchanges	number_of_articulated_units
is_hazardous
0	226.332045	3.892399	819.678226	0.420387	1.937867
1	206.821272	3.968175	625.963166	0.436034	1.201439

	car_ownership_category_code_P	all_rail_intermodal_code_1	all_rail_intermodal_code_9
0	1	1	0
1	1	0	1
2	1	1	0
3	1	0	1
4	1	1	0

	item	name	number_of_positions
0	1	waybill_date	6
1	2	accounting_period	4
2	3	number_of_carloads	4
3	4	car_ownership_category_code	1
4	5	aar_equipment_type_code	4

	waybill_date	accounting_period	number_of_carloads	car_ownership_category_code	aar_equipment_type_code	aar_mechanical_designation	stb_car_type	tofc_cofc_service_code	number_of_tofc_cofc_units	trailer_container_unit_ownership_code	...	extreme_outside_height	type_of_wheel_bearings_and_brakes	number_of_axles	draft_gear	number_of_articulated_units	aar_error_codes
0	011017	0119	0001	P	C214	XPI	41		0000		...	0000			00	0	0879
1	121417	0119	0001	P	S342	FCA	46	X	0001	P	...	0504	E	F	88	4	1379
2	011218	0119	0001	R	F483	FBC	49		0000		...	0208	E	4	15	0	7900
3	012618	0119	0001	R	A806	XP	38		0000		...	1610	E	4	15	0	0000
4	020218	0119	0015	P	C112	LO	41		0000		...	0903	E	4	66	0	5600

	waybill_date	accounting_period	number_of_carloads	car_ownership_category_code	aar_equipment_type_code	aar_mechanical_designation	stb_car_type	tofc_cofc_service_code	number_of_tofc_cofc_units	trailer_container_unit_ownership_code	...	tare_weight_of_car	outside_length	outside_width	outside_height	extreme_outside_height	type_of_wheel_bearings_and_brakes	number_of_axles	draft_gear	number_of_articulated_units	aar_error_codes
0	2017-01-10	2019-01-01	1	P	C214	XPI	41	NaN	0	NaN	...	0	0	0	0	0	NaN	NaN	0	0	879
1	2017-12-14	2019-01-01	1	P	S342	FCA	46	X	1	P	...	2176	28806	1005	705	504	E	F	88	4	1379
2	2018-01-12	2019-01-01	1	R	F483	FBC	49	NaN	0	NaN	...	607	8007	910	1506	208	E	4	15	0	7900
3	2018-01-26	2019-01-01	1	R	A806	XP	38	NaN	0	NaN	...	1125	9210	911	1700	1610	E	4	15	0	0
4	2018-02-02	2019-01-01	15	P	C112	LO	41	NaN	0	NaN	...	526	4200	1008	1506	903	E	4	66	0	5600

	waybill_date	accounting_period	number_of_carloads	car_ownership_category_code	aar_equipment_type_code	aar_mechanical_designation	stb_car_type	tofc_cofc_service_code	number_of_tofc_cofc_units	trailer_container_unit_ownership_code	...	extreme_outside_height	type_of_wheel_bearings_and_brakes	number_of_axles	draft_gear	number_of_articulated_units	aar_error_codes	stcc_description	origin_location	terminal_location	is_hazardous
0	2018-04-16	2019-01-01	1	P	T105	T	50	None	0	None	...	306	E	4	77	0	0	Chemicals	Chicago-Gary-Kenosha, IL-IN-WI	Los Angeles-Riverside-Orange County, CA-AZ	1
1	2018-08-30	2019-01-01	1	P	C214	LO	41	None	0	None	...	806	E	4	88	0	0	Chemicals	Chicago-Gary-Kenosha, IL-IN-WI	Philadelphia-Wilmington-Atlantic City, PA-NJ-D...	1
2	2018-09-05	2019-01-01	1	P	C214	LO	41	None	0	None	...	810	E	4	88	0	0	Chemicals	New Orleans, LA-MS	Birmingham, AL	1
3	2018-10-10	2019-01-01	1	P	S342	FCA	46	X	1	P	...	504	E	F	88	4	1300	Chemicals	Baton Rouge, LA-MS	Alberta	1
4	2018-10-23	2019-01-01	1	P	T106	T	51	None	0	None	...	400	E	4	77	0	0	Chemicals	Chicago-Gary-Kenosha, IL-IN-WI	Shreveport-Bossier City, LA-AR	1

	waybill_date	accounting_period	number_of_carloads	car_ownership_category_code	aar_equipment_type_code	aar_mechanical_designation	stb_car_type	tofc_cofc_service_code	number_of_tofc_cofc_units	trailer_container_unit_ownership_code	...	extreme_outside_height	type_of_wheel_bearings_and_brakes	number_of_axles	draft_gear	number_of_articulated_units	aar_error_codes	stcc_description	origin_location	terminal_location
0	2019-04-24	2019-04-01	1	P	T389	T	51	None	0	None	...	306	E	4	77	0	0	Petroleum Products	Houston-Galveston-Brazoria, TX	Houston-Galveston-Brazoria, TX
1	2019-03-08	2019-03-01	1	R	V418	FA	47	None	0	None	...	504	E	4	10	0	0	Transportation Equipment	Portland-Salem, OR-WA	Kansas City, MO-KS
2	2019-04-28	2019-04-01	1	R	S635	FCA	46	M	1	P	...	808	E	8	66	3	1367	Mixed Shipments	Los Angeles-Riverside-Orange County, CA-AZ	Atlanta, GA-AL-NC
3	2019-09-03	2019-09-01	1	T	V978	FA	47	None	0	None	...	508	E	4	10	0	0	Transportation Equipment	Detroit-Ann Arbor-Flint, MI	St. Louis, MO-IL
4	2019-06-19	2019-06-01	1	P	S342	FCA	46	X	1	P	...	504	E	F	88	4	1367	Unidentified Waste	Seattle-Tacoma-Bremerton, WA	Pendleton, OR-WA