Throughout the COVID-19 pandemic we've heard a lot about the "essential worker": the doctors and nurses on the front lines, the cashiers and clerks keeping our grocery stores open, the postal workers delivering vital parcels across the U.S., and countless others who risk their health and safety every day to keep the rest of us on our feet. Freight line workers are certainly among this heroic group, their work being invaluable in supplying the country with vital goods. It's dangerous work, pandemic or not, involving operating heavy machinery and moving massive freight loads, and with this project we seek to make the job a little safer using machine learning.
The objective of this project is to find the probability that a freight car with unknown contents contains hazardous materials, hazardous waste, or chemicals, given the rest of the information on the car's waybill. The ability to predict this would help freight line workers anticipate potential hazards and take appropriate measures to stay safe.
Modeling-wise, we'll take a mixed approach: a simple multi-layer perceptron neural network will make predictions from the rail car's properties (i.e., its estimated short-line miles, number of articulated units, ownership category, and intermodal code), while the car's route data (origin, destination, and interchange states) will feed a model analogous to sentiment analysis, built with NLP techniques.
This project uses the Carload Sample Waybill (waybill) data. This data contains a stratified sample of all U.S. rail traffic for rail companies that terminate 4,500 or more revenue carloads annually. The waybill data is confidential for a number of reasons and can only be accessed by entities operating within approved industries, but the Surface Transportation Board publishes a publicly available version of the data for independent use.
The publicly available waybill data, along with the reference guide documentation, can be downloaded from the Surface Transportation Board site here.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import plaidml.keras as pk
pk.install_backend()
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential, Model, load_model
from keras.layers import Dense, Conv1D, Dropout, LeakyReLU, MaxPooling1D, Embedding, Flatten, Input, Concatenate
from keras.callbacks import ReduceLROnPlateau, ModelCheckpoint, EarlyStopping
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score, classification_report, confusion_matrix
The data file has been formatted as a flat, 247-byte-per-line .txt file. Let's take a look at the raw data to get an idea of what we're working with.
FILE_PATH = './'
with open(FILE_PATH + 'PublicUseWaybillSample2019.txt', 'r') as f:
    lines = f.readlines()
print(lines[:5])
At face value, this data is a nightmare; it looks like roughly six columns of seemingly tab-delimited alphanumeric serial references. However, we can find some help in the reference guide, in the section "2019 Surface Transportation Board Public Use Waybill 247-Byte Record Layout" (p. 99). There we find a table that will help us decode the data, shown below:
This is only the first few rows of the table, but it provides us with the field names in the data, along with the character positions that hold the data for each field.
First, since this table is in a pdf document, we'll need to get it into some workable format for us to use in our code. I was able to copy and paste it into Excel, but the formatting was still a hassle. After formatting the data, I saved it to a .csv file, to be read in by pandas.
Let's take a look at the data:
column_lookup_path = FILE_PATH + 'column_positions_lookup.csv'
column_lookup_df = pd.read_csv(column_lookup_path, delimiter=',')
column_lookup_df.head()
Now let's make a lookup dictionary that we can use to look up a column's name and the number of character positions that column should contain.
column_lookup_dict = {k: v for k,v in column_lookup_df[['name', 'number_of_positions']].to_numpy()}
We can use this dictionary, along with the following function, to parse each line of the data and create a JSON-like list of dictionaries containing the data.
def read_waybill_line(line, lookup_dict):
    out = {}
    start = 0
    for key in lookup_dict:
        if key == 'blank':
            continue
        end = start + int(lookup_dict[key])
        out[key] = line[start: end]
        start = end
    return out
dataset = [read_waybill_line(line, column_lookup_dict) for line in lines]
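As an aside, pandas can also parse fixed-width files like this one directly with read_fwf. This is just a sketch of that alternative (not used below), assuming the same lookup table with its 'name' and 'number_of_positions' columns:
# Alternative sketch: let pandas parse the fixed-width file directly.
widths = column_lookup_df['number_of_positions'].astype(int).to_list()
raw_names = column_lookup_df['name'].to_list()
# read_fwf requires unique column names, so suffix any repeated filler names (e.g. 'blank')
names = [f'{n}_{i}' if raw_names.count(n) > 1 else n for i, n in enumerate(raw_names)]
df_fwf = pd.read_fwf(
    FILE_PATH + 'PublicUseWaybillSample2019.txt',
    widths=widths,   # field widths in character positions
    names=names,     # field names from the reference guide
    dtype=str        # keep everything as text for now; types get fixed later
)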
Let's convert the data to a pandas DataFrame and preview it.
df = pd.DataFrame(dataset)
df.head()
That looks much better!
Now that we have the data in a readable format, let's take a look at the shape, data types, and memory footprint.
df.shape
df.info(memory_usage='deep')
So this is a pretty big dataset, with over 670k rows, and it's eating up a ton of memory -- about 2.4 GB!
We have a couple of red flags here, though, that we should inspect: first, 670k rows is big, but not big enough on its own to merit a memory footprint of that size; second, none of the columns contain any null values -- which we know should not be the case from our preview of the pandas data.
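Before fixing anything, it helps to see which columns dominate that footprint; a quick per-column check (just pandas calls we've already used) looks like this:
# Per-column memory usage in MB, largest first -- the object (text) columns should dominate.
df.memory_usage(deep=True).div(1024**2).sort_values(ascending=False).head(10)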
So we'll tackle the "no nulls" issue first. From inspecting the original .txt file, we can see that all the values that should be null have been filled with spaces, which wastes memory both by storing the spaces themselves and by forcing some of the numerical columns to be encoded as text. However, with some regex, we can find and replace these values with np.nan.
Once that is done, we can cast the columns to their proper data types, and check their min and max values to make sure that we are using the proper number of bits for our integers.
The following function will complete these tasks and give us a summary at the end of the memory reduction.
def minimize_mem_usage(df, verbose=False):
    start_mem_usg = df.memory_usage(deep=True).sum() / 1024**2  # MB
    # replace space-filled values with np.nan
    df.replace(r'^\s*$', np.nan, regex=True, inplace=True)
    df['waybill_date'] = pd.to_datetime(df['waybill_date'], format='%m%d%y')
    df['accounting_period'] = pd.to_datetime(df['accounting_period'], format='%m%y')
    # cast numerical columns to int
    for col in df.columns:
        if col in ['waybill_date', 'accounting_period']:
            continue
        try:
            df[col] = df[col].astype(int)
        except ValueError:
            continue
    # drop na's
    df.dropna(axis=1, how='all', inplace=True)
    df.dropna(axis=0, how='all', inplace=True)
    NAlist = []
    for col in df.columns:
        if df[col].dtype not in ['object', 'datetime64[ns]']:
            IsInt = False
            mx = df[col].max()
            mn = df[col].min()
            # Integer does not support NA, therefore, NA needs to be filled
            if not np.isfinite(df[col]).all():
                NAlist.append(col)
                df[col].fillna(mn-1, inplace=True)
            # test if column can be converted to an integer
            asint = df[col].fillna(0).astype(np.int64)
            result = (df[col] - asint)
            result = result.sum()
            if result > -0.01 and result < 0.01:
                IsInt = True
            # Make Integer/unsigned Integer datatypes
            if IsInt:
                if mn >= 0:
                    if mx < 255:
                        df[col] = df[col].astype(np.uint8)
                    elif mx < 65535:
                        df[col] = df[col].astype(np.uint16)
                    elif mx < 4294967295:
                        df[col] = df[col].astype(np.uint32)
                    else:
                        df[col] = df[col].astype(np.uint64)
                else:
                    if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
                        df[col] = df[col].astype(np.int8)
                    elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
                        df[col] = df[col].astype(np.int16)
                    elif mn > np.iinfo(np.int32).min and mx < np.iinfo(np.int32).max:
                        df[col] = df[col].astype(np.int32)
                    elif mn > np.iinfo(np.int64).min and mx < np.iinfo(np.int64).max:
                        df[col] = df[col].astype(np.int64)
            # Make float datatypes 32 bit
            else:
                df[col] = df[col].astype(np.float32)
    # Print final result
    end_mem_usg = df.memory_usage().sum() / 1024**2
    mem_reduction = 100*end_mem_usg/start_mem_usg
    if verbose:
        print("___MEMORY REDUCTION STATS:___")
        print("Starting memory usage was {:.2f}MB".format(start_mem_usg))
        print("Current memory usage is {:.2f}MB".format(end_mem_usg))
        print("This is {:.2f}% of the initial size".format(mem_reduction))
    return df
df = minimize_mem_usage(df, verbose=True)
We have great performance coming out of our memory optimization, with a reduction of about 95% of the original memory!
Let's have a quick look at our datatypes to make sure everything went well and makes sense.
df.info()
We can see above that we now have plenty of null values (which is a good sign), and our integer data types have been optimized to the lowest bit size of unsigned integer types, giving us a very reasonable 127MB of data.
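As a quick sanity check on those null values, we can count the missing values per column:
# Columns with the most missing values after the space-to-NaN replacement.
df.isna().sum().sort_values(ascending=False).head(10)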
We'll have one final look at the data, and then save it to a .csv for later use in analysis.
df.head()
df.to_csv('./prelim_waybill_data.csv', index=False)
We have our waybill data clean now, but we will need a couple of other datasets to get a complete picture. In particular, we will need to decode the Surface Transportation Commodity Codes (STCC) and Business Economic Area (BEA) codes, so that we can know what type of freight is being moved and its origin and destination points.
The STCC code headers and BEA codes can be found in similar tables to the data position tables in the reference guide. I've copied the tables into csv's.
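The exact layout of those csv's isn't shown here, but based on how they're joined below, stcc_data.csv is assumed to carry at least stcc_code, short_description, and is_hazardous columns, and bea_codes.csv at least bea_code and location. A quick pandas preview (a small sanity-check sketch) confirms that before handing them to Spark:
# Sanity-check the hand-built lookup csv's before loading them into Spark.
# The column names are assumptions based on how they're referenced in the SQL below.
stcc_check = pd.read_csv('./stcc_data.csv')
bea_check = pd.read_csv('./bea_codes.csv')
print(stcc_check.columns.to_list())  # expecting: stcc_code, short_description, is_hazardous, ...
print(bea_check.columns.to_list())   # expecting: bea_code, location, ...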
We'll read all the data back into Spark DataFrames, so that we can take advantage of Spark SQL to combine the data.
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
sc = pyspark.SparkContext()
spark = SparkSession.builder.getOrCreate()
waybill = spark.read.csv('./prelim_waybill_data.csv', header=True, inferSchema=True)
stcc = spark.read.csv('./stcc_data.csv', header=True, inferSchema=True)
bea = spark.read.csv('./bea_codes.csv', header=True, inferSchema=True)
waybill.createOrReplaceTempView('waybill')
stcc.createOrReplaceTempView('stcc')
bea.createOrReplaceTempView('bea')
haz_query = """
SELECT
w.*,
s.short_description as stcc_description,
o.location as origin_location,
t.location as terminal_location,
s.is_hazardous
FROM waybill w
JOIN stcc s ON s.stcc_code = LEFT(w.commodity_code_stcc,2) /* stcc_code contains only the first 2 digits of the stcc */
JOIN bea o ON o.bea_code = w.origin_bea_area
JOIN bea t ON t.bea_code = w.termination_bea_area
WHERE 1=1
AND s.is_hazardous = 1
"""
hazardous = spark.sql(haz_query)
hazardous.count()
hdf = hazardous.toPandas()
hdf.head()
nonhaz_query = """
SELECT
w.*,
s.short_description as stcc_description,
o.location as origin_location,
t.location as terminal_location,
s.is_hazardous
FROM waybill w
JOIN stcc s ON s.stcc_code = LEFT(w.commodity_code_stcc,2)
JOIN bea o ON o.bea_code = w.origin_bea_area
JOIN bea t ON t.bea_code = w.termination_bea_area
WHERE 1=1
AND s.is_hazardous = 0
"""
There are far more non-hazardous shipments than hazardous ones, so we'll take a random sample from the non-hazardous data equal in size to the hazardous dataset.
nonhaz = spark.sql(nonhaz_query).rdd.takeSample(False, len(hdf), 42)
nonhaz = sc.parallelize(nonhaz)
schema = hazardous.schema
nhdf = nonhaz.toDF(schema=schema).toPandas()
nhdf.head()
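As a design note, the same down-sampling could be sketched on the Spark side with DataFrame.sample instead of collecting an RDD sample; something like the following (approximate, since fraction-based sampling doesn't guarantee an exact row count):
# Alternative sketch: approximate down-sampling without leaving the DataFrame API.
nonhaz_df = spark.sql(nonhaz_query)
fraction = len(hdf) / nonhaz_df.count()  # aim for roughly one non-hazardous row per hazardous row
nhdf_alt = nonhaz_df.sample(withReplacement=False, fraction=fraction, seed=42).toPandas()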
As our objective is to predict whether or not a shipment is hazardous, we won't need all of the columns. I've selected a handful of columns that would appear, based on their data descriptions in the reference guide, to have a relationship with whether or not the freight contains hazardous materials.
df = pd.concat([hdf, nhdf], axis=0)
relevant_cols = [
'is_hazardous',
'car_ownership_category_code',
'miscellaneous_charges_dollars',
'all_rail_intermodal_code',
'estimated_short_line_miles',
'number_of_interchanges',
'number_of_articulated_units',
'origin_location',
'interchange_state_1',
'interchange_state_2',
'interchange_state_3',
'interchange_state_4',
'interchange_state_5',
'terminal_location'
]
df = df[relevant_cols]
df.head()
len(df)
This is now our prepared dataset, and I'll write it to a csv to be read in for EDA.
df.to_csv('./prepared_waybill_data.csv', index=False)
Let's start by taking a look at the correlation matrix, to see if any of our numerical columns correlate with the is_hazardous column.
df.corr()
There are slight negative correlations with the number of articulated units and the estimated short-line miles. The all_rail_intermodal_code should not actually be treated as a numerical value, as the codes (1, 2, and 9) simply refer to whether or not the freight has been moved using multiple travel media (e.g., it traveled by ship for a time, then moved to the railway). We'll examine these values categorically in a future step.
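As a quick peek ahead, a normalized crosstab treats the code as a category and shows how the hazardous share differs across the code values:
# Share of hazardous vs. non-hazardous shipments within each intermodal code value.
pd.crosstab(
    df['all_rail_intermodal_code'].astype(str),  # treat codes 1, 2, and 9 as labels, not numbers
    df['is_hazardous'],
    normalize='index'
)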
As we discover columns that have weak predictive power, we will append them to a list of columns to drop, then we will drop them from our data at the end of our analysis.
drop_cols = []
Next, we'll take a look at the distribution of each data point. While some columns don't correlate with hazard, if their distributions differ widely between hazardous and non-hazardous freight, then we might still be able to get some segmenting power from them.
grouped = df.groupby('is_hazardous')
grouped.mean()
grouped.std()
At a glance, we can see a few things: miscellaneous charges tend to be higher and more variable for non-hazardous shipments; estimated short-line miles are far lower for hazardous freight (meaning hazardous freight travels a shorter distance, on average); and non-hazardous freight tends to travel in articulated units more frequently than hazardous freight.
Let's visualize these with kernel density plots.
for c in df.columns[df.dtypes!='object']:
    if c == 'is_hazardous':
        continue
    sns.kdeplot(df.loc[df['is_hazardous']==0, c], label='Non-Hazardous')
    sns.kdeplot(df.loc[df['is_hazardous']==1, c], label='Hazardous')
    if c == 'miscellaneous_charges_dollars':
        plt.xlim(0, 200)
    plt.title(c)
    plt.legend()
    plt.show()
    if c in ['all_rail_intermodal_code', 'number_of_interchanges', 'number_of_articulated_units']:
        # value_counts labels match the is_hazardous filter (1 = hazardous, 0 = non-hazardous)
        print('__HAZARDOUS__')
        print(df.loc[df['is_hazardous']==1, c].value_counts())
        print('__NON-HAZARDOUS__')
        print(df.loc[df['is_hazardous']==0, c].value_counts())
        print('='*20)
        print('\n')
It seems that even though miscellaneous_charges_dollars tend to be higher, on average, for non-hazardous freight, we don't have many non-zero charges, so the data for this column will not be particularly useful. We'll add it to the drop list.
We can see some separability in the all rail/intermodal code. It seems that non-hazardous freight tends to have a higher likelihood of being unidentified (code 9), so we'll keep this column for now.
As we gathered from the correlation matrix, we can see that the estimated short-line miles are shorter for hazardous freight. This could be good for our future model.
The number of interchanges seems to be the same for both types of freight, so we will drop this column.
Hazardous shipments are less likely to travel in articulated units, as we saw in the correlation matrix, so we will keep this column.
drop_cols.append('miscellaneous_charges_dollars')
drop_cols.append('number_of_interchanges')
Let's take a look at our categorical columns now.
g = sns.FacetGrid(df, col='is_hazardous', height=8)
g.map(sns.countplot, 'all_rail_intermodal_code')
We see some evidence of separability in our all-rail/intermodal codes, so we will officially retain this column.
g = sns.FacetGrid(df, col='is_hazardous', height=8)
g.map(sns.countplot, 'car_ownership_category_code')
We see here that hazardous freight is more likely to be privately owned, so we will retain this column as well.
The following plots are for the origin locations, terminal locations, and interchange states along the route. These appear to have a good amount of predictive power, as it seems that hazardous and non-hazardous freight tend to originate in separate areas, arrive in separate areas, and follow differing routes.
def plot_bars(df, colname):
    haz_cnt = df.loc[df['is_hazardous']==1, colname].value_counts().reset_index(drop=False)
    haz_cnt['is_hazardous'] = 'yes'
    nhaz_cnt = df.loc[df['is_hazardous']==0, colname].value_counts().reset_index(drop=False)
    nhaz_cnt['is_hazardous'] = 'no'
    cnts = pd.concat([haz_cnt, nhaz_cnt], axis=0)
    g = sns.catplot(
        data=cnts, kind="bar",
        x="index", y=colname, hue="is_hazardous",
        ci="sd", palette="dark", alpha=.6, height=8, aspect=3
    )
    x_max = min(len(cnts), 50)
    plt.xlim((-1, x_max))
    plt.xticks(rotation=90)
    plt.show()
Below, one can see that the distributions of terminal locations and origin locations differ widely between hazardous and non-hazardous freight. This will be a powerful predictor.
plot_bars(df, 'origin_location')
plot_bars(df, 'terminal_location')
The distributions for interchange states, while not as drastically different as those for origin and terminal location, will still be very informative for our model as well.
plot_bars(df, 'interchange_state_1')
plot_bars(df, 'interchange_state_2')
plot_bars(df, 'interchange_state_3')
Very few shipments have more than 3 interchanges, so we will drop interchange states 4 and 5.
drop_cols.append('interchange_state_4')
drop_cols.append('interchange_state_5')
Let's execute our column drops and save the final data to a csv.
df.drop(drop_cols, axis=1, inplace=True)
df.to_csv('./waybill_relevant_data.csv', index=False)
During our EDA, we were able to do the feature selection, so now all that's left is to prepare the data with encoding and scaling. We'll also be developing the "route sentences," which will be the text representation of our train routes.
We're going to try using some NLP techniques on the train routes, since right now each route is stored as a sequence of strings. So let's engineer these features the same way we would engineer sentences.
In this way our predictions become comparable to a simple case of sentiment analysis.
seq_cols = [
'origin_location',
'interchange_state_1',
'interchange_state_2',
'interchange_state_3',
'terminal_location'
]
def to_sentence(row):
    ls = []
    for c in seq_cols:
        # skip missing stops (None from Spark, NaN from pandas)
        if pd.isna(row[c]):
            continue
        ls.append(row[c].replace(', ', '').replace(' ', '').replace('-', ''))
    return ' '.join(ls)
routes = df.apply(to_sentence, axis=1).to_list()
routes[:10]
Now that we have our "route sentences" we need to know the vocabulary size.
long_str = ' '.join(routes)
long_ls = long_str.split()
vocab_set = set()
for l in long_ls:
    vocab_set.add(l)
print(len(vocab_set))
We have 212 distinct words in our vocab, so we'll choose a vocab size of 300, allowing for some unseen values. The maximum route length is five stops, so our max sentence length will be 5 words.
vocab_size = 300
max_length = 5
Now we apply our encoding and padding, and then dump the numpy arrays of our prepared data and the encoder to a pickle file.
encoded_routes = [one_hot(r, vocab_size) for r in routes]
encoded_routes[:5]
with open('./one_hot_encoder.pickle', 'wb') as f:
    pickle.dump(one_hot, f)
padded_routes = pad_sequences(encoded_routes, maxlen=max_length, padding='post')
padded_routes[:5]
X_seq = padded_routes
y = df['is_hazardous'].to_numpy()
seq_data = (X_seq, y)
with open('sequence_data.pickle', 'wb') as f:
    pickle.dump(seq_data, f)
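One note on that pickle: keras' one_hot is a hashing trick rather than a fitted vocabulary, so what actually needs to match at prediction time are vocab_size and max_length (and, because one_hot defaults to Python's built-in hash, PYTHONHASHSEED should be fixed for encodings to be reproducible across runs). A minimal sketch of encoding a hypothetical new route:
# Encoding a hypothetical new "route sentence" at inference time.
new_route = 'HoustonTX LA NewOrleansLA'  # hypothetical example, formatted like the routes above
new_encoded = one_hot(new_route, vocab_size)
new_padded = pad_sequences([new_encoded], maxlen=max_length, padding='post')
print(new_padded)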
For our traditional numerical and categorical data, we'll be applying min-max scaling and one-hot encoding, respectively.
num_cols = [
'estimated_short_line_miles',
'number_of_articulated_units'
]
cat_cols = [
'car_ownership_category_code',
'all_rail_intermodal_code'
]
nums = df[num_cols]
cats = df[cat_cols].astype(str)
df['all_rail_intermodal_code'] = df['all_rail_intermodal_code'].astype(str)
encoded_cats = pd.get_dummies(cats)
encoded_cats.head()
X_cats = encoded_cats.to_numpy()
scaler = MinMaxScaler()
X_nums = scaler.fit_transform(nums)
X_nums[:10]
Now that the data is scaled and encoded, we will pickle the scaler and the numerical dataset.
with open('./min_max_scaler.pickle', 'wb') as f:
    pickle.dump(scaler, f)
X = np.concatenate([X_nums, X_cats], axis=1)
X.shape
X_seq.shape
num_data = (X, y)
with open('./numerical_data.pickle', 'wb') as f:
    pickle.dump(num_data, f)
Now we are ready to send our data to model training!
We're going to be training a Siamese-style, two-branch neural network to make our predictions. This is not a very common approach: typically, all of the data going into a given neural network is of the same type, but here we need to feed the network both numerical data and sequences of words, which require different input layers and different initial layer structures. (Strictly speaking, a Siamese network shares weights between identical twin branches; since our two branches are different architectures, "multi-input model" is the more precise term, but the idea of training parallel branches jointly is the same.)
The structure of the network will be as follows: the numerical data will be fed into a shallow multilayer perceptron whose output is a fully-connected layer with 64 nodes; the train-route sequence data will be embedded into a vector space and then fed into a 1D convolutional neural network (like one might use for sentiment analysis), which also ends in a fully-connected layer with 64 nodes. The two branches' 64-node outputs will then be concatenated and fed into a third network -- a deeper MLP -- which gives us our final predictions.
with open('./sequence_data.pickle', 'rb') as f:
    sequence_data = pickle.load(f)
with open('./numerical_data.pickle', 'rb') as f:
    numeric_data = pickle.load(f)
X_seq, y = sequence_data
X_num, y = numeric_data
print(X_seq.shape)
print(X_num.shape)
print(y.shape)
X = np.concatenate([X_num, X_seq], axis=1)
# use a fixed random_state so the same split can be reproduced at evaluation time
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=42)
xs_train, xs_test, ys_train, ys_test = train_test_split(X_seq, y, test_size=.33, random_state=42)
xn_train, xn_test, yn_train, yn_test = train_test_split(X_num, y, test_size=.33, random_state=42)
vocab_size = 300
max_length = 5
batch_size = 32
def build_model():
    # cnn with only 1 dense layer
    seq_input = Input(shape=(max_length,))
    x = Embedding(vocab_size, 3, input_length=max_length)(seq_input)
    x = Conv1D(256, kernel_size=3, strides=1)(x)
    x = LeakyReLU()(x)
    x = MaxPooling1D(pool_size=2)(x)
    x = Flatten()(x)
    seq_output = Dense(64, activation='relu')(x)
    cnn = Model(inputs=seq_input, outputs=seq_output)
    # mlp with only one dense layer
    num_input = Input(shape=(8,))
    mlp_output = Dense(64, activation='relu')(num_input)
    mlp = Model(inputs=num_input, outputs=mlp_output)
    # combine
    combined = Concatenate()([cnn.output, mlp.output])
    z = Dense(512)(combined)
    z = LeakyReLU()(z)
    z = Dropout(.5)(z)
    z = Dense(512)(z)
    z = LeakyReLU()(z)
    z = Dropout(.2)(z)
    z = Dense(256)(z)
    z = LeakyReLU()(z)
    z = Dropout(.2)(z)
    z = Dense(64)(z)
    z = LeakyReLU()(z)
    z = Dropout(.2)(z)
    output = Dense(1, activation='sigmoid')(z)
    final_model = Model(inputs=mlp.inputs + cnn.inputs, outputs=[output])
    final_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
    return final_model
model = build_model()
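Before training, a quick summary is a cheap way to confirm that the two branches and the concatenation are wired up as described above:
# Inspect the combined architecture: two input layers, the embedding/conv branch,
# the dense branch, and the concatenation feeding the deep MLP head.
model.summary()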
num_train = x_train[:, :8]
seq_train = x_train[:, 8:]
num_test = x_test[:, :8]
seq_test = x_test[:, 8:]
xc_train = [num_train, seq_train]
xc_test = [num_test, seq_test]
learning_rate_reduction_combined = ReduceLROnPlateau(monitor='val_acc', patience=3,
verbose=2, factor=0.5, min_lr=0.00001)
best_model_combined = ModelCheckpoint('./combined_cnn_mlp_model.3.h5', monitor='val_acc', verbose=2,
save_best_only=True, mode='max')
early_stopping_combined = EarlyStopping(monitor='val_loss', min_delta=1e-10,
patience=10, restore_best_weights=True)
hist = model.fit(xc_train, y_train,
batch_size=batch_size,
epochs=50,
validation_data=(xc_test, y_test),
callbacks = [learning_rate_reduction_combined, best_model_combined, early_stopping_combined],
verbose=1
)
This model appears to produce very good results, with an accuracy score of 91.8%.
In the following section, we'll evaluate its performance a bit deeper.
with open('./sequence_data.pickle', 'rb') as f:
    sequence_data = pickle.load(f)
with open('./numerical_data.pickle', 'rb') as f:
    numeric_data = pickle.load(f)
X_seq, y = sequence_data
X_num, y = numeric_data
X = np.concatenate([X_num, X_seq], axis=1)
# re-create the exact same split used during training (same test_size and random_state)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=42)
num_train = x_train[:, :8]
seq_train = x_train[:, 8:]
num_test = x_test[:, :8]
seq_test = x_test[:, 8:]
xc_train = [num_train, seq_train]
xc_test = [num_test, seq_test]
def false_neg(preds, y_test):
    # fraction of all test samples that are hazardous cars predicted as non-hazardous
    fn = 0
    for i in range(len(preds)):
        if preds[i]==0 and y_test[i]!=preds[i]:
            fn += 1
    return fn / len(y_test)

def evaluate(preds, y_test):
    # sklearn metrics expect (y_true, y_pred)
    print(f'Accuracy Score: {round(accuracy_score(y_test, preds)*100, 2)}%')
    print(f'AUC ROC Score: {round(roc_auc_score(y_test, preds)*100, 2)}%')
    print('\n__CLASSIFICATION REPORT__:')
    print(classification_report(y_test, preds))
    print('\n__CONFUSION MATRIX__:')
    print(confusion_matrix(y_test, preds)/len(y_test))
    print('\n__FALSE NEGATIVE RATE__:')
    print(f'{round(false_neg(preds, y_test)*100,2)}%')
cm = load_model('./combined_cnn_mlp_model.3.h5')
input_data = [x_test[:, :8], x_test[:, 8:]]
cm_preds = cm.predict(input_data)
evaluate(np.rint(cm_preds), y_test)
For this evaluation, the model performed at 92.5% accuracy, with precision and recall both falling between 92% and 93%. Crucially, the false negative rate is very low: only about 4% of all test shipments were hazardous cars mislabeled as non-hazardous -- which means that, the vast majority of the time a car with unknown contents arrives, our model would give freight line workers a correct warning.
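One caveat on that 4% figure: it's the share of all test shipments that were hazardous cars labeled as non-hazardous. The number a worker probably cares most about is the miss rate among the hazardous cars themselves, which we can pull from the same confusion matrix; a small sketch:
# Miss rate among truly hazardous cars: FN / (FN + TP),
# using sklearn's (y_true, y_pred) argument order.
tn, fp, fn, tp = confusion_matrix(y_test, np.rint(cm_preds).ravel()).ravel()
print(f'Hazardous cars missed: {fn / (fn + tp) * 100:.2f}%')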
Overall, I was very pleased with this model's performance -- with a validated accuracy score of 91.8% and a false negative rate of only 4%. Treating the rail car's routes as sentences and applying NLP methods, like vector embedding and sentiment analysis, worked better than I had expected, and has made me consider other applications where treating sequences in the same way as language might be beneficial.