Time Series Clustering Analysis of ACNH Turnip Price Trends

By Nigel Story




1. Introduction

In February of 2020, the world was shaken by a new coronavirus disease poised to upend the world economy and cost hundreds of thousands, or perhaps millions, of lives. By March, the shadow of threat had very much been realized in the United States as community spread of the virus took hold of metropolitan areas and rural towns, unchecked by any definitive or cohesive containment strategy. One-by-one, states and counties began issuing emergency orders -- closing the doors of non-essential businesses, requiring office workers to work from home, imposing curfews and travel restrictions -- and whether your local government called it "safer at home" or "shelter in place," the message was clear: stay home.

For a nation under quarantine, Animal Crossing: New Horizons for the Nintendo Switch (ACNH) was the escape everyone needed. Its open-air in-game atmosphere and activities, like fishing and landscaping, offered players a relaxing and fresh alternative to the closed and uncertain reality of the pandemic, and the casual, stress-free gameplay made the game enjoyable not only to the seasoned gamer but to everyone, from young children to their parents.

The in-game Turnip "Stalk" Market, however, takes any notion of relaxation and throws it out the window. It's a frantic race of buying and selling where the stakes can go as high as you want, and players whose islands hold the highest prices can demand tariffs and tribute from visiting players hoping to take advantage of the "spike." In the world of ACNH, the players who know how to work the turnip market are royalty.

How does the Stalk Market work?

The mechanics of the Stalk Market are simple: on Sunday mornings, a turnip vendor comes to your island offering to sell you turnips at a randomly generated price; then, for the rest of the week, the island shop offers to buy the turnips back from you at a price that changes twice daily. With a little luck, one can pull a nice profit by buying low and selling high.
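
As a toy illustration of the buy-low, sell-high mechanic (the prices below are made up, not drawn from the data):

buy_price = 98      # Sunday purchase price per turnip (hypothetical)
sell_price = 480    # best sell price observed during the week (hypothetical)
quantity = 400      # turnips bought on Sunday

profit = (sell_price - buy_price) * quantity
print(profit)       # 152800 bells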

The skill comes in being able to predict how the prices for selling one's turnips will unfold as the week progresses. The game has certain trends programmed for the prices that, while the exact values are randomly generated, can give the investor an idea of when the prices will hit their peak. Another aspect of the game is that a player with high turnip prices on their island can invite other players to come sell at their island and charge commission in the form of in-game currency or rare items.

In this project, we will be exploring the weekly trends that the prices follow, with the objective of predicting which trend an island will have based on the first few days of pricing data, and thus the best day on which to sell turnips or to invite (and charge) visitors.

Approach

The approach will come in three phases: data collection and EDA, unsupervised clustering to detect how many trend families exist, and finally classification of weeks' trends based on the clusters.

The data will be collected from three sources: a turnip price tracking web application that I developed (full-stack), a community Google Sheet found here, and finally a random price generator website developed by Git user chengluyu, who used fellow user Treeki's reverse engineering of Nintendo's turnip price RNG.

For our clustering methods, rather than expressing each week's data as 2-dimensional time series data of the form {($t_{1}$, $p_{1}$),($t_{2}$, $p_{2}$),...,($t_{12}$, $p_{12}$)}, where $p_{i}$ is the turnip price at time $t_{i}$, we will project them as single points in $\mathbb{R}^{12}$ of the form ($p_{1}$, $p_{2}$, ... , $p_{12}$), and perform our clustering in that space. We will be comparing Hierarchical Clustering, KMeans++, and DBSCAN.
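
To make that representation concrete, here is a minimal sketch (with made-up prices) of two weeks treated as points in $\mathbb{R}^{12}$ and the Euclidean distance between them, which is the distance our clustering algorithms will work with:

import numpy as np

# Hypothetical weeks: 12 prices each, Mon-AM through Sat-PM
week_a = np.array([90, 85, 80, 76, 120, 210, 480, 160, 90, 60, 55, 50])  # large-spike shape
week_b = np.array([88, 84, 79, 75, 70, 66, 62, 58, 54, 50, 46, 42])      # decreasing shape

# Euclidean distance between the two points in R^12
print(np.linalg.norm(week_a - week_b))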

We will be comparing the performances of Random Forest Classification and an Artificial Neural Network classifier, testing the predictive power of the first six price readings for a given week.

Packages and Dependencies

For the most part, our choices in packages are pretty conventional: Pandas and Numpy for data wrangling, Sci-Kit Learn and SciPy for preprocessing and machine learning tasks, and Matplotlib and Seaborn for visualization. Where we deviate a bit from the norm is in our choice of using PlaidML as the back-end for Keras rather than TensorFlow. The reason for this is that our project was done on an iMac, and TensorFlow has not been optimized for MacOS (pre M1 processor chip), particularly when we want to make use of the AMD GPU.

We'll also make use of a custom class for interacting with the MySQL database that stores our turnip price data, which can be found in the helpers directory of this repo.

In [2]:
import os
import sys

import warnings
warnings.filterwarnings('ignore')

# override TensorFlow backend
import plaidml.keras as pk
pk.install_backend()

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split, RandomizedSearchCV, KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.decomposition import PCA

# for resampling of data
from imblearn.over_sampling import SMOTE, RandomOverSampler

# for hierarchical clustering
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import cdist

# for time series analysis
from statsmodels.tsa.statespace.tools import diff
from statsmodels.tsa.stattools import adfuller

# to build ANN
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.callbacks import ReduceLROnPlateau, ModelCheckpoint, EarlyStopping
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils, to_categorical

# for mysql db interaction
from helpers.db_connection import DBConnect

%matplotlib inline

Data Collection

Full-Stack Price Tracking Site

The first point of data collection we employed was a turnip price tracking site called "The Stalk Market," the code for which can be found here. The site had very few users because it was never deployed publicly, but from select friends and family we were able to collect about 15 weeks of pricing data (~186 observed prices).

The interface for the site was simple: there was a login screen, a page for one to enter their observed prices, and a simple dashboard where one could compare their weekly trends with those of friends.

[Screenshot of The Stalk Market web app: login, price-entry, and dashboard pages]

Though this was a fun way to interact with friends and family who enjoyed ACNH, it was clear that in order to collect enough data to analyze, we would need a much bigger net.

Community Contributions

The Animal Crossing: New Horizons - Era Tracker+ Google Sheet provided the next haul: about 91 weeks of data in csv files. However, these data were largely incomplete weeks that required a fair amount of munging and interpolation to be usable.

As we completed earlier iterations of this project, it again became clear that we were going to need data on the order of thousands of weeks to get any reliable results. This meant that, realistically, our analysis depended on our ability to generate our own data on a massive scale.

Web Scraping

Not long after the release of ACNH, Treeki uploaded his C++ code for the turnip price RNG, which paved the way for many turnip price predictor sites and the way for chengluyu's random price generator website.

Using Python's Selenium package, we were able to build a web-scraping bot to visit chengluyu's site, randomly generate 10,000 weeks of data, and scrape the data into a Pandas DataFrame to be uploaded to the MySQL database for use in this project.
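
A minimal sketch of what such a scraping loop can look like is below; the URL, selector, and overall structure are placeholders for illustration, not the actual script used:

import pandas as pd
from selenium import webdriver

SITE_URL = 'https://example.com/turnip-generator'   # placeholder, not the real URL
N_WEEKS = 10000

driver = webdriver.Chrome()
rows = []
for _ in range(N_WEEKS):
    driver.get(SITE_URL)  # each page load generates a fresh random week
    cells = driver.find_elements_by_css_selector('.price-cell')  # placeholder selector
    rows.append([int(c.text) for c in cells])
driver.quit()

cols = ['Mon-AM', 'Mon-PM', 'Tues-AM', 'Tues-PM', 'Wed-AM', 'Wed-PM',
        'Thurs-AM', 'Thurs-PM', 'Fri-AM', 'Fri-PM', 'Sat-AM', 'Sat-PM']
df_scraped = pd.DataFrame(rows, columns=cols)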

This is by far our most relied upon data source for this project, given that it produced 10,000 perfect weeks, compared to the other methods that, combined, only produced 106 weeks of imperfect data.

You can find the scripts for the web-scraping in the helpers directory of this repository.


Cleaning and EDA

Cleaning and Combining the Data

From Price Tracking Site

The user data from "The Stalk Market" site were uploaded to a MySQL server into a database called turnips, the setup scripts for which can be found in the sql directory of this repository.

The main table we'll be selecting from is the turnip_prices table, which contains columns for the date, user, and price of turnips -- dat, user_id, and price, respectively. We'll also make use of columns from the date_helper_day table, which will aid us in pivoting the data in a later step.

In [3]:
q = """
    SELECT 
        p.*, 
        d.week_id, 
        d.day_of_week,
        case when hour(p.dat) = 0 then 'am' else 'pm' end as time_of_day
    FROM turnip_prices p
    JOIN date_helper_day d ON d.dat = p.dat
"""

with DBConnect(db='turnips') as cnx:
    df_db = pd.read_sql(con=cnx.cnx, sql=q)
In [4]:
df_db.head()
Out[4]:
dat user_id price week_id day_of_week time_of_day
0 2020-03-23 00:00:00 1 44 1 2 am
1 2020-03-23 00:00:00 2 79 1 2 am
2 2020-03-23 12:00:00 1 40 1 2 pm
3 2020-03-23 12:00:00 2 75 1 2 pm
4 2020-03-24 00:00:00 1 36 1 3 am

The data are retrieved in a conventional time series format, with the date and price held as separate columns of the DataFrame. However, given that we will not be doing any forecasting per se, but will rather be classifying "trend families" based on the whole week of prices, it will benefit us to pivot this data to where each time value (Mon-AM, Mon-PM, Tues-AM, etc.) is in its own column. This will allow us to perform our analysis in $\mathbb{R}^{12}$ rather than in $\mathbb{R}^{2}$, which will simplify our clustering and classifying efforts.

In [5]:
def pivot_data(df):
    """Pivot data to have days of week in cols, separate weeks in rows.
    
    Args:
        df (pandas.DataFrame): Turnip price data to be pivoted.
        
    Returns:
        out (pandas.DataFrame): Pivoted data with days of week in cols, separate weeks in rows.
    """
    
    out = pd.pivot_table(df, index=['week_id', 'user_id'], columns=['day_of_week', 'time_of_day'])
    out.columns = ['Mon-AM', 'Mon-PM', 'Tues-AM', 'Tues-PM', 'Wed-AM', 'Wed-PM', 
                   'Thurs-AM', 'Thurs-PM', 'Fri-AM', 'Fri-PM', 'Sat-AM', 'Sat-PM']
    out.reset_index(drop=True, inplace=True)
    out.dropna(axis=0, inplace=True)
        
    return out
In [6]:
df_db = pivot_data(df_db)
df_db.head()
Out[6]:
Mon-AM Mon-PM Tues-AM Tues-PM Wed-AM Wed-PM Thurs-AM Thurs-PM Fri-AM Fri-PM Sat-AM Sat-PM
0 44.0 40.0 36.0 33.0 93.0 124.0 127.0 175.0 168.0 54.0 51.0 48.0
1 79.0 75.0 71.0 68.0 62.0 111.0 108.0 176.0 189.0 146.0 88.0 83.0
5 65.0 61.0 58.0 54.0 49.0 102.0 109.0 159.0 169.0 167.0 58.0 54.0
6 107.0 130.0 153.0 134.0 143.0 117.0 67.0 60.0 51.0 100.0 81.0 74.0
7 60.0 124.0 130.0 145.0 147.0 144.0 75.0 72.0 67.0 64.0 59.0 55.0

Now that we have our website data in the preferred format, we will move on to the community data from the Era Tracker Google Sheet.

From Animal Crossing: New Horizons - Era Tracker+

The data from the "Era Tracker+" were pulled down weekly as csv's and stored in the csv_data directory of this repository. We'll use a from_csv() function to read the data into Pandas, with regular expression matching to make sure we only read in the relevant files.

In [7]:
def from_csv():
    """Read turnip price data from csv's in "csv_data" directory. Format for use in analysis.
    
    Returns:
        out (Pandas.DataFrame): Formatted turnip price data.
    """
    root = './csv_data/'
    pattern = re.compile(r'turnip-week\d+-gs\.csv')  # escape the dot; allow multi-digit week numbers
    cols = ['Mon-AM', 'Mon-PM', 'Tues-AM', 'Tues-PM', 'Wed-AM', 'Wed-PM', 
            'Thurs-AM', 'Thurs-PM', 'Fri-AM', 'Fri-PM', 'Sat-AM', 'Sat-PM']
    out = pd.DataFrame(columns=cols)
    
    for f in os.listdir(root):
        if pattern.search(f): 
            dummy = pd.read_csv(root + f).iloc[:, 1:]
            dummy.dropna(inplace=True, axis=0, thresh=8)
            dummy.columns = cols
            out = out.append(dummy)
        else:
            continue
            
    out.reset_index(drop=True, inplace=True)
    
    return out
In [8]:
df_csv = from_csv()
df_csv.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 97 entries, 0 to 96
Data columns (total 12 columns):
Mon-AM      91 non-null float64
Mon-PM      91 non-null float64
Tues-AM     95 non-null float64
Tues-PM     96 non-null float64
Wed-AM      95 non-null float64
Wed-PM      91 non-null float64
Thurs-AM    93 non-null float64
Thurs-PM    93 non-null float64
Fri-AM      85 non-null float64
Fri-PM      74 non-null float64
Sat-AM      54 non-null float64
Sat-PM      47 non-null float64
dtypes: float64(12)
memory usage: 9.2 KB

Looking at the DataFrame's info, we have 97 weeks of data, though the completeness of the data declines over the week: Monday morning has 91 non-null entries, while Saturday evening has only 47. We'll make a quick heat map in Seaborn to visualize the missing data.

In [9]:
sns.heatmap(df_csv.isnull(), cbar=False)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8560ecac10>

This is a significant decision point in our analysis. We need to impute the missing values, and we have several ways of doing it. For simplicity, we'll impute with the mean, but the decision remains whether to use the mean of each day's prices or the mean of the week that the missing value belongs to.

Given that we're going to be performing clustering, we'll be making this decision based on maximizing separability of the data after the imputing. If we, for instance, fill missing data for Mon-PM with the average of Mon-PM prices, then in the Mon-PM dimension of our data we will be reducing the separability, since every week that had a missing value there will now have the same value, regardless of the trend for each week. So with this in mind, we're going to be imputing by week, filling missing values with the average price of the given week.
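
For comparison, the day-wise imputation we're passing on would be a one-liner; a quick sketch, shown only for illustration:

# Rejected alternative: fill each column (day) with that day's mean price.
# Every week missing, say, Mon-PM would then share the same Mon-PM value.
df_day_imputed = df_csv.fillna(df_csv.mean())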

To impute by week, we'll need to transpose our dataset so that each column is a week, fill each column's missing values with that column's mean, and then transpose the data back to its original form.

In [10]:
df_csvT = df_csv.transpose()

for col in list(df_csvT.columns):
    df_csvT.loc[df_csvT[col].isna(), col] = df_csvT[col].mean()

df_csv = df_csvT.transpose()
In [11]:
df_csv.isna().sum()
Out[11]:
Mon-AM      0
Mon-PM      0
Tues-AM     0
Tues-PM     0
Wed-AM      0
Wed-PM      0
Thurs-AM    0
Thurs-PM    0
Fri-AM      0
Fri-PM      0
Sat-AM      0
Sat-PM      0
dtype: int64

Now that we've filled the missing data, let's take a quick look at the head of the data as a quick quality check before moving on to the generated data.

In [12]:
df_csv.head()
Out[12]:
Mon-AM Mon-PM Tues-AM Tues-PM Wed-AM Wed-PM Thurs-AM Thurs-PM Fri-AM Fri-PM Sat-AM Sat-PM
0 96.0 92.0 87.0 83.0 78.0 73.0 68.0 63.0 59.0 55.0 50.0 46.0
1 89.0 86.0 82.0 78.0 74.0 70.0 65.0 122.0 160.0 342.0 160.0 122.0
2 121.0 105.0 110.0 71.0 64.0 54.0 93.0 112.0 123.0 92.4 71.0 92.4
3 99.0 93.0 119.0 202.0 589.0 174.0 105.0 55.0 51.0 93.0 158.0 158.0
4 46.0 43.0 39.0 34.0 30.0 26.0 21.0 126.0 115.0 204.0 68.4 68.4

Generated Data

The data scraped from chengluyu's site were stored in the turnips database, separate from the user-added data, in the generated_turnip_prices table. These data don't require any cleaning or modification, so we'll simply read them into Pandas.

In [13]:
with DBConnect(db='turnips') as cnx:
    df_gen = pd.read_sql(con=cnx.cnx, sql="SELECT * FROM generated_turnip_prices").iloc[:, 1:]
df_gen.head()
Out[13]:
Mon-AM Mon-PM Tues-AM Tues-PM Wed-AM Wed-PM Thurs-AM Thurs-PM Fri-AM Fri-PM Sat-AM Sat-PM
0 54 49 126 120 137 139 138 52 48 44 40 37
1 108 101 115 152 131 71 61 54 153 145 82 75
2 81 77 73 69 65 62 59 56 52 48 44 40
3 94 90 87 82 138 213 327 179 116 73 80 64
4 58 54 51 47 44 40 36 113 103 142 143 142

Combining the Data

To combine our data from the three sources, we just need to concatenate the data frames along the rows.

In [14]:
df = pd.concat([df_db, df_csv], axis=0)
df = pd.concat([df, df_gen], axis=0)
df = df.sample(frac=1).reset_index(drop=True) # shuffle rows and reset index
df = df.astype(float)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10106 entries, 0 to 10105
Data columns (total 12 columns):
Mon-AM      10106 non-null float64
Mon-PM      10106 non-null float64
Tues-AM     10106 non-null float64
Tues-PM     10106 non-null float64
Wed-AM      10106 non-null float64
Wed-PM      10106 non-null float64
Thurs-AM    10106 non-null float64
Thurs-PM    10106 non-null float64
Fri-AM      10106 non-null float64
Fri-PM      10106 non-null float64
Sat-AM      10106 non-null float64
Sat-PM      10106 non-null float64
dtypes: float64(12)
memory usage: 947.6 KB

We're now ready to do some EDA on our complete dataset of 10,106 weeks of turnip prices.

EDA

According to several community forums, each week of turnip prices can be sorted into four trend families: large spike, small spike, random, and decreasing. The large spike trend family is characterized by decreasing prices in the beginning of the week followed by a sudden and large increase and sharp decrease. The small spike is characterized by decreasing prices at the beginning of the week followed by a more gradual increase in prices with a lower maximum value than the large spike. The random trend behaves as a random walk, and the decreasing trend is monotonically decreasing over the week. Let's visualize our data and see if we can see those trends for ourselves.

The first thing we'll need to do is whip up a quick function to let us plot the weeks of data, which are stored as the rows of a DataFrame. The function will need to take in a data frame, transpose it, and then plot the columns.

In [15]:
def plot_ts(df, title=''):
    """Plot df values by row as line plot.
    """
    
    df.transpose().plot(legend=False)
    plt.title(title)

Let's plot the head of our dataset and see what kind of trends we can see.

In [16]:
plot_ts(df.head(10))

In the first ten weeks of data, we can see evidence for the four trend family hypothesis. The purple line follows the paradigm of a "large spike," the yellow lines follow the pattern of a small spike, and several lines are more or less zig-zagging and constantly decreasing. The gray and green lines, however, are middling and hard to definitively classify. We'll see if patterns like these affect our ability to discern large and small spikes.

Next, we'll be looking at the 2-dimensional projections of our data, seeing if we can visually detect any pairwise correlations or clustering in our data between the days and times.

In [17]:
sns.pairplot(df)
Out[17]:
<seaborn.axisgrid.PairGrid at 0x7f85b4b8c090>

This is a lot to look at all at once, but there are definite patterns emerging, so let's try to break some of those out.

First, we have a lot of these "Y" or "L" shaped patterns, which are valuable because they show us when we are more likely to see higher values. For instance, if we look at the plot at (Mon-PM, Wed-AM), we can see that if your Monday evening price is between 80 and 100, then your first chance for a spike would be on Wednesday morning. We can also tell that if you have a higher value, above 100, on Monday evening, then you are most likely to hit your max on Tuesday morning or Tuesday evening. Next, we have some plots that show localized but very strong positive correlation, like the plots for (Mon-AM, Mon-PM) and (Mon-AM, Tues-AM). This tells us that, given a Monday morning price less than about 100, we have a good chance of seeing a linear change over Monday evening and Tuesday morning. It's also important to note where we have no visible patterns, like between (Mon-AM, Sat-PM): using Monday morning's value alone, we have very little information for predicting the price on Saturday evening.
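
To zoom in on one of these panels, a single pairing can be plotted directly; a quick sketch for the (Mon-PM, Wed-AM) panel discussed above, using the same df:

# Scatter of Monday evening vs Wednesday morning prices, with marginal histograms
sns.jointplot(x='Mon-PM', y='Wed-AM', data=df, kind='scatter')
plt.show()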

From the above plots, we can also see the distribution of prices by day. This tells us which days can have the highest prices and which days can present large spikes. By looking at the ticks on the plots, we can see that from Tues-PM to Fri-PM prices reach over 600, whereas the rest of the days max out around 150 to 200.

Let's take a closer look at the daily price distributions with some kernel density plots.

In [18]:
for col in df.columns:
    sns.kdeplot(df[col])
    plt.xlim((0, 700))
    plt.title(col)
    plt.show()

Interestingly, all of the days display bimodality, though some more definitively than others. For every day, the most probable values fall between 0 and 100, and the second most probable fall between 100 and 200, which is indicative of small spikes. Between Tues-PM and Fri-PM, however, there are only trace probabilities of reaching prices from 250 to as high as 650. This shows just how high the prices can go and how infrequently one encounters a large spike.

Preprocessing and Feature Engineering

Given our particular dataset, the only preprocessing that we really have to do is scaling. The spikes in our data make [0,1] scaling impractical, as it would squash the trend families with smaller or no spikes, so instead we will standardize each week to zero mean and unit variance.

In [19]:
# Prep Data
def scale_data(df):
    
    dfT = df.transpose()
    
    scaler = StandardScaler()
    dfT = scaler.fit_transform(dfT)
    
    df = pd.DataFrame(dfT.transpose())
    #df = pd.DataFrame([x - x[0] for x in df.values], columns=df.columns) # force initial value to zero

    return df
In [20]:
X = scale_data(df)
plot_ts(X.head())

Looking at the same plot of the head of our data, now scaled (and showing only the first five weeks), we can see some improvements in clarity -- the green line, which we were initially not sure how to classify, can now be clearly identified as a large spike.

The red and blue lines are clearly members of the random trend family. These will add noise to our data and will complicate our clustering analysis by making our sample space more nebulous. It would be good if we had a way of identifying them and labeling them or removing them from the dataset.

If they are truly random walks, then it follows that as time series data they would be stationary. Since we only have one period of the time series, we can't really use the Augmented Dickey-Fuller (ADF) test to test the whole week for stationarity. We might, however, be able to run windowed ADF tests on segments of the weeks to see if they are everywhere stationary or not.

In [21]:
# ADF function -- test for stationarity
def adf_test(series):
    
    #p <= 0.05 means stationary (null hyp is non-stationary)
    result = adfuller(series.dropna(), autolag='AIC')
    
    return result[1] # return p-value

# scan series with window size, search for non-stationary segments
def windowed_stationarity_test(series, window_size):
    
    for i in range(0, len(series)+1-window_size):
        if adf_test(series[i:i+window_size+1]) > 0.05:
            return 'non-stationary'
        
    return 'stationary'

# apply windowed stationarity test to df, add column of results
def add_stationarity_col(df):
    
    stat_vals = []
    for idx, row in df.iterrows():
        stat_vals.append(windowed_stationarity_test(row, 5))
    return stat_vals

A preliminary run of this test wasn't successful: it flagged only three weeks as everywhere stationary. Our inability to filter out the random trends introduces a challenge to our clustering that will become evident as we go through the clustering steps.

In [22]:
#X['stationarity'] = add_stationarity_col(X)
In [23]:
#stationary = X.loc[X['stationarity']=='stationary', :]
#stationary.head()
In [24]:
#plot_ts(stationary.iloc[:5, :-1])
In [25]:
#non_stationary =  X.loc[X['stationarity']!='stationary', :]
#plot_ts(non_stationary.iloc[:5, :-1])
In [26]:
#X.loc[X['stationarity']!='stationary', 'stationarity'] = 1
#X.loc[X['stationarity']=='stationary', 'stationarity'] = 0
#X = X.iloc[:, :-1]

Clustering Analysis

Our clustering analysis will come in two steps: first we'll look at the dendrogram for hierarchical clustering to see if we can visually identify a good number of clusters to target, then we'll use KMeans++ to attempt the actual clustering (and, when that runs into trouble, DBSCAN).

When executing our clustering algorithms, we will be treating each week of data as a point in $\mathbb{R}^{12}$ and will be clustering based on Euclidean distances in that space. For visualizing the results, we will project each week back down to 2-dimensional space and plot them as conventional time series data.

Hierarchical Clustering Analysis (HCA)

We're going to be using agglomerative, or bottom-up, HCA. In this algorithm, each point starts as its own cluster and is sequentially joined into clusters with neighboring points based on the distance between the points or centroids of the clusters. We'll be using Ward's method to combine clusters by their squared Euclidean distances.
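
Before looking at the dendrogram, note that it is only for visual inspection; if we wanted flat cluster labels at a chosen cut, the already-imported AgglomerativeClustering could provide them. A quick sketch, assuming a hypothetical cut at four clusters (one per trend family described in the EDA):

# Hypothetical flat clustering at k=4; below we only use the dendrogram for inspection.
hca_labels = AgglomerativeClustering(n_clusters=4, linkage='ward').fit_predict(X)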

In [27]:
dend = dendrogram(linkage(X, method='ward'))

We don't get a whole lot of information about our clusters from this method, however. The noise introduced by the random trend family closes the distances between our centroids and leads to a dataset that is too homogeneous for purely distance-based clustering methods.

We are likely to run into the same issue with our KMeans analysis as well.

K-Means ++

Since we were not able to discern a good number of clusters from our dendrogram, we'll try an elbow plot to see if we can get any more information there.

In [28]:
# k means determine k
def elbow_plot(X):
    distortions = []
    K = range(1,20)
    for k in K:
        kmeanModel = KMeans(n_clusters=k,init='k-means++').fit(X)
        distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])

    # Plot the elbow
    plt.plot(K, distortions, 'bx-')
    plt.xlabel('k')
    plt.ylabel('Distortion')
    plt.title('The Elbow Method showing the optimal k')
    plt.show()
    
elbow_plot(X)

Here we see an almost perfectly smooth elbow plot, which gives us very little information due to the same issue with homogeneity that we saw in the dendrogram.

This calls into question whether this data can be effectively clustered at all, so as a working assumption we'll use 12 clusters, one for each price reading in the week.

In [29]:
n_clusters = 12
clusters = KMeans(n_clusters=n_clusters, init='k-means++').fit_predict(X)
In [30]:
clusters
Out[30]:
array([8, 0, 1, ..., 0, 7, 4], dtype=int32)
In [31]:
df['kmeans'] = clusters

Let's plot our original, unscaled data grouped by their clusters to visually evaluate the performance of KMeans clustering on our data.

In [32]:
for i in range(0, n_clusters):
    plot_ts(df.loc[df['kmeans']==i].iloc[0:250, :-1], title=f'cluster {i}')

The above plots actually show that the data is suitable for clustering, but that the clusters are sensitive to the noise of the random trend family. The cluster 9 plot is a great example: the clustering is picking up the small spike on Saturday morning, but it's being clouded by random trends with values between 100 and 125.

With this discovery in hand, it became necessary to find a clustering algorithm that was robust to noise, and DBSCAN was a great candidate.

DBSCAN

DBSCAN is a density-based clustering algorithm designed to deal with outlying noise. Roughly speaking, it works by looking at each point in the dataset, and if there are at least n other samples within some radius $\epsilon$ of that point, it groups those points into a cluster. Its strength is that if a point doesn't have at least n neighbors within $\epsilon$, it is marked as an outlier, represented as the -1 cluster in Sci-Kit Learn's DBSCAN class.

After some tuning and experimentation, it became clear that using 18 as our minimum sample threshold and 0.5 as our radius produced the best defined and balanced clusters.
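
The tuning itself was a simple sweep; a rough sketch of the kind of loop used (the grid values here are illustrative, not the exact ones tried):

for eps in (0.3, 0.5, 0.7):
    for min_samples in (10, 18, 25):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = labels.max() + 1        # clusters are labeled 0..k-1
        noise_frac = (labels == -1).mean()   # -1 marks noise points
        print(f'eps={eps}, min_samples={min_samples}: '
              f'{n_clusters} clusters, {noise_frac:.1%} noise')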

In [33]:
dbscan = DBSCAN(eps=.5, min_samples=18).fit_predict(X)

df['dbscan'] = dbscan
sns.countplot(df['dbscan'])
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f85a11fd750>

It's worth noting that our -1 cluster is by far the largest cluster, which means a significant proportion of our weeks followed the random trend. This is why our other clustering methods struggled so much.

In [34]:
for i in range(-1, dbscan.max()+1):
    plot_ts(df.loc[df['dbscan']==i].iloc[0:250, :-2], title=f'cluster {i}')

Comparing the above plots with the plots produced by KMeans clustering, we see a vast improvement. The -1 cluster is capturing all of the random noise in our data, and the remaining clusters are well segmented by trend family (decreasing, large spike, and small spike) and by the day and time on which the prices reach their peak.

Clustering Results

The weeks that would be classified as having the random trend family posed a real challenge for our distance-based clustering methods, KMeans++ and HCA, partially due to the high proportion of our weeks that followed that trend.

DBSCAN definitely performed the best, owing to its ability to ignore points that fall in low-density regions, and produced the clusters that we will use to train our classifiers.


Building the Classifier

We're going to try out two methods for building a classifier: an artificial neural network and a random forest. The objective is to predict the trend family that a week is likely to follow using only the first few days of the week, and, by doing so, be able to tell on which day the given week will have its peak price.

We'll be using the first six price values to classify the week, which is half of the total readings for the week. It would be better if we could use fewer readings, because then we could tell which trend family a week belongs to earlier and have a better chance of catching the best price; however, because nearly every trend begins with a steady price decrease over the first few readings, using too few price values would lead to heavy confusion between classes and tank our precision and recall metrics. This presents a significant drawback to our method: only 9 of the 16 non-random trends reach their maximum price after Wednesday evening, so the other 7 (7/16 = 43.8%) peak within the first six readings. Even a perfect classifier would therefore miss 43.8% of the weeks' maximum prices while waiting to collect enough values for the classifier to work.

With these challenges in mind, we'll jump in by preparing and balancing the data.

Preparing and Balancing the Data (Random Over-Sampling)

The first thing that we'll need to do is declare the X and y arrays that we'll be using in our algorithms, and then we'll feed them into imbalanced-learn's RandomOverSampler to ensure that our dataset is balanced.

In [35]:
col_fltr = df.columns != 'kmeans'
df_cls = df.loc[:, col_fltr]
In [36]:
df_cls
Out[36]:
Mon-AM Mon-PM Tues-AM Tues-PM Wed-AM Wed-PM Thurs-AM Thurs-PM Fri-AM Fri-PM Sat-AM Sat-PM dbscan
0 116.0 119.0 104.0 106.0 68.0 63.0 58.0 111.0 92.0 71.0 65.0 114.0 -1
1 83.0 78.0 75.0 72.0 68.0 63.0 59.0 55.0 51.0 48.0 44.0 41.0 0
2 89.0 85.0 81.0 78.0 73.0 69.0 95.0 141.0 330.0 190.0 122.0 56.0 9
3 133.0 113.0 80.0 75.0 129.0 70.0 66.0 57.0 120.0 140.0 92.0 134.0 -1
4 91.0 87.0 83.0 80.0 99.0 200.0 414.0 148.0 122.0 77.0 84.0 87.0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
10101 59.0 55.0 121.0 105.0 152.0 162.0 158.0 74.0 70.0 65.0 62.0 58.0 5
10102 92.0 88.0 85.0 81.0 77.0 72.0 68.0 97.0 116.0 199.0 202.0 195.0 10
10103 83.0 78.0 75.0 71.0 67.0 64.0 60.0 56.0 53.0 49.0 45.0 42.0 0
10104 98.0 93.0 88.0 85.0 104.0 160.0 248.0 215.0 111.0 50.0 54.0 96.0 -1
10105 92.0 105.0 170.0 596.0 171.0 104.0 71.0 58.0 73.0 47.0 77.0 61.0 7

10106 rows × 13 columns

In [37]:
val_counts = df['dbscan'].value_counts()
max_class = val_counts[val_counts == val_counts.max()].index[0]
class_count = len(val_counts)
print(f'There are {class_count} classes\nThe most frequent class is ({max_class})')
There are 17 classes
The most frequent class is (-1)
In [38]:
cols = df_cls.columns[df_cls.columns != 'dbscan']
X_ann = df_cls.loc[:, cols]
X_ann
Out[38]:
Mon-AM Mon-PM Tues-AM Tues-PM Wed-AM Wed-PM Thurs-AM Thurs-PM Fri-AM Fri-PM Sat-AM Sat-PM
0 116.0 119.0 104.0 106.0 68.0 63.0 58.0 111.0 92.0 71.0 65.0 114.0
1 83.0 78.0 75.0 72.0 68.0 63.0 59.0 55.0 51.0 48.0 44.0 41.0
2 89.0 85.0 81.0 78.0 73.0 69.0 95.0 141.0 330.0 190.0 122.0 56.0
3 133.0 113.0 80.0 75.0 129.0 70.0 66.0 57.0 120.0 140.0 92.0 134.0
4 91.0 87.0 83.0 80.0 99.0 200.0 414.0 148.0 122.0 77.0 84.0 87.0
... ... ... ... ... ... ... ... ... ... ... ... ...
10101 59.0 55.0 121.0 105.0 152.0 162.0 158.0 74.0 70.0 65.0 62.0 58.0
10102 92.0 88.0 85.0 81.0 77.0 72.0 68.0 97.0 116.0 199.0 202.0 195.0
10103 83.0 78.0 75.0 71.0 67.0 64.0 60.0 56.0 53.0 49.0 45.0 42.0
10104 98.0 93.0 88.0 85.0 104.0 160.0 248.0 215.0 111.0 50.0 54.0 96.0
10105 92.0 105.0 170.0 596.0 171.0 104.0 71.0 58.0 73.0 47.0 77.0 61.0

10106 rows × 12 columns

In [39]:
y = df_cls['dbscan']
y
Out[39]:
0        -1
1         0
2         9
3        -1
4         1
         ..
10101     5
10102    10
10103     0
10104    -1
10105     7
Name: dbscan, Length: 10106, dtype: int64
In [40]:
#smote = SMOTE(ratio='minority')
#X_sm, y_sm = smote.fit_sample(X, y)
In [41]:
ros = RandomOverSampler(ratio='auto', random_state=42);  
X_res, y_res = ros.fit_sample(X_ann, y)
In [42]:
sns.countplot(y_res)
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f851023d650>

Now that our data are balanced, we'll perform our train-test split, using 40% of our data for validation.

In [43]:
x_train, x_test, y_train, y_test = train_test_split(X_res, y_res, test_size=.4, random_state=100)

We will now extract the first six columns from our dataset -- the first six price values for the week -- and scale them to [0, 1]. These will be the independent variables for training and testing our models, with the cluster categories as the targets.

In [44]:
feature_count = 6

scaler = MinMaxScaler()
scaler.fit(x_train[:,:feature_count])

x_train = scaler.transform(x_train[:,:feature_count])
x_test = scaler.transform(x_test[:,:feature_count])
In [45]:
sns.countplot(y_train)
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f85734edf10>

We can see above that our data has remained balanced after our training and test sets have been defined.

Next, we'll encode our cluster classes, using the LabelEncoder class so that we're able to invert the encoding and retrieve our original cluster classes when we evaluate our models' performances.

In [46]:
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_test_encoded = le.transform(y_test)  # reuse the encoder fitted on the training labels

y_train_cat = to_categorical(y_train_encoded, num_classes=class_count, dtype='float32')
y_test_cat = to_categorical(y_test_encoded, num_classes=class_count, dtype='float32')

Our data is now ready to be used in training a classifier!

Artificial Neural Network

The first method we will use is an ANN, utilizing the Keras library with PlaidML on the back-end. We'll be using the rectified linear unit as our intermediate activation and softmax on the output, since this is a multi-class classification model. We will also be using the Adam optimizer rather than plain stochastic gradient descent, as Adam yielded better results in earlier iterations. For validation, we will track the model's accuracy, and we'll use categorical cross-entropy as our loss function.

The network itself will be composed of 7 hidden layers and an input and output layer. We will not be performing any kind of regularization or dropout, since the model showed no signs of over-fitting in earlier iterations. We'll start with Keras' default learning rate of 0.001, and we will reduce it by a factor of 0.5 for any learning plateaus lasting 10 epochs, with a minimum learning rate of 0.00001. We'll also be monitoring the validation loss for early stopping.

In [47]:
classifier = Sequential()

classifier.add(Dense(64, input_dim=feature_count))
classifier.add(Activation('relu'))

classifier.add(Dense(128))
classifier.add(Activation('relu'))

classifier.add(Dense(256))
classifier.add(Activation('relu'))

classifier.add(Dense(512))
classifier.add(Activation('relu'))

classifier.add(Dense(1024))
classifier.add(Activation('relu'))

classifier.add(Dense(1024))
classifier.add(Activation('relu'))

classifier.add(Dense(128))
classifier.add(Activation('relu'))

classifier.add(Dense(64))
classifier.add(Activation('relu'))

classifier.add(Dense(class_count))
classifier.add(Activation('softmax'))

classifier.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
INFO:plaidml:Opening device "metal_amd_radeon_pro_570x.0"
In [48]:
learning_rate_reduction = ReduceLROnPlateau(monitor='val_acc', patience=10, 
                                            verbose=2, factor=0.5, min_lr=0.00001)

best_model = ModelCheckpoint('trend_classifier_ann.h5', monitor='val_acc', verbose=2, 
                             save_best_only=True, mode='max')

early_stopping = EarlyStopping(monitor='val_loss', min_delta=1e-10, 
                               patience=25,restore_best_weights=True)

model = classifier.fit(x=x_train, y=y_train_cat,
               batch_size=500, epochs=500,
               validation_data=(x_test, y_test_cat),
               callbacks=[learning_rate_reduction,best_model,early_stopping],
               verbose=0)
Epoch 00001: val_acc improved from -inf to 0.74537, saving model to trend_classifier_ann.h5

Epoch 00002: val_acc improved from 0.74537 to 0.74953, saving model to trend_classifier_ann.h5

Epoch 00004: val_acc improved from 0.74953 to 0.76516, saving model to trend_classifier_ann.h5

Epoch 00005: val_acc improved from 0.76516 to 0.77532, saving model to trend_classifier_ann.h5

Epoch 00010: val_acc improved from 0.77532 to 0.77619, saving model to trend_classifier_ann.h5

Epoch 00011: val_acc improved from 0.77619 to 0.77850, saving model to trend_classifier_ann.h5

Epoch 00012: val_acc improved from 0.77850 to 0.78003, saving model to trend_classifier_ann.h5

Epoch 00013: val_acc improved from 0.78003 to 0.78039, saving model to trend_classifier_ann.h5

Epoch 00017: val_acc improved from 0.78039 to 0.78270, saving model to trend_classifier_ann.h5

Epoch 00022: val_acc improved from 0.78270 to 0.78301, saving model to trend_classifier_ann.h5

Epoch 00025: val_acc improved from 0.78301 to 0.78583, saving model to trend_classifier_ann.h5

Epoch 00031: val_acc improved from 0.78583 to 0.78788, saving model to trend_classifier_ann.h5

Epoch 00032: val_acc improved from 0.78788 to 0.79239, saving model to trend_classifier_ann.h5

Epoch 00042: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.

Epoch 00043: val_acc improved from 0.79239 to 0.79372, saving model to trend_classifier_ann.h5

Epoch 00044: val_acc improved from 0.79372 to 0.79552, saving model to trend_classifier_ann.h5

Epoch 00054: ReduceLROnPlateau reducing learning rate to 0.0002500000118743628.

Epoch 00055: val_acc improved from 0.79552 to 0.79813, saving model to trend_classifier_ann.h5

Epoch 00065: ReduceLROnPlateau reducing learning rate to 0.0001250000059371814.

Epoch 00066: val_acc improved from 0.79813 to 0.79875, saving model to trend_classifier_ann.h5

Epoch 00067: val_acc improved from 0.79875 to 0.80003, saving model to trend_classifier_ann.h5

Epoch 00071: val_acc improved from 0.80003 to 0.80049, saving model to trend_classifier_ann.h5

Epoch 00076: val_acc improved from 0.80049 to 0.80070, saving model to trend_classifier_ann.h5

Epoch 00077: val_acc improved from 0.80070 to 0.80177, saving model to trend_classifier_ann.h5

Epoch 00082: val_acc improved from 0.80177 to 0.80316, saving model to trend_classifier_ann.h5

Epoch 00083: val_acc improved from 0.80316 to 0.80352, saving model to trend_classifier_ann.h5

Epoch 00093: ReduceLROnPlateau reducing learning rate to 6.25000029685907e-05.

Epoch 00096: val_acc improved from 0.80352 to 0.80398, saving model to trend_classifier_ann.h5

Epoch 00097: val_acc improved from 0.80398 to 0.80516, saving model to trend_classifier_ann.h5

Epoch 00102: val_acc improved from 0.80516 to 0.80603, saving model to trend_classifier_ann.h5

Epoch 00106: val_acc improved from 0.80603 to 0.80629, saving model to trend_classifier_ann.h5

Epoch 00112: val_acc improved from 0.80629 to 0.80659, saving model to trend_classifier_ann.h5

Epoch 00114: val_acc improved from 0.80659 to 0.80772, saving model to trend_classifier_ann.h5

Epoch 00121: val_acc improved from 0.80772 to 0.80808, saving model to trend_classifier_ann.h5

Epoch 00131: ReduceLROnPlateau reducing learning rate to 3.125000148429535e-05.

Epoch 00134: val_acc improved from 0.80808 to 0.80890, saving model to trend_classifier_ann.h5

Epoch 00136: val_acc improved from 0.80890 to 0.80972, saving model to trend_classifier_ann.h5

Epoch 00146: ReduceLROnPlateau reducing learning rate to 1.5625000742147677e-05.

Epoch 00156: ReduceLROnPlateau reducing learning rate to 1e-05.

Epoch 00181: val_acc improved from 0.80972 to 0.81080, saving model to trend_classifier_ann.h5

Epoch 00412: val_acc did not improve from 0.81080

ANN Evaluation

This ANN is scoring just over 81% accuracy on the validation data, which is pretty good. However, given that we are already using half of each week's prices as inputs, I would prefer an accuracy of over 90%.

It also appears that we were overly patient with our early stopping criteria, which could lead to overfitting. Let's check the accuracy and loss histories to make sure we haven't overfit the model.

In [49]:
plt.title('Loss')
plt.plot(model.history['loss'], label='Train Loss')
plt.plot(model.history['val_loss'], label='Test Loss')
plt.legend()
plt.show()

plt.title('Accuracy')
plt.plot(model.history['acc'], label='Train Accuracy')
plt.plot(model.history['val_acc'], label='Test Accuracy')
plt.legend()
plt.show()

The plots of accuracy and loss show very little evidence of overfitting. We do, however, see a long plateau where the model had all but stopped improving, and though we were lucky enough to avoid overfitting, that plateau is a waste of compute hours.
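
If we were to rerun this, a tighter stopping configuration would likely cut that plateau off much sooner; a hypothetical sketch (the values are illustrative, not tuned):

# Stricter early stopping: watch validation accuracy and give up after 15 flat epochs
early_stopping = EarlyStopping(monitor='val_acc', min_delta=1e-4,
                               patience=15, restore_best_weights=True)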

To sum up our evaluation of the ANN: the model did a pretty good job of predicting the trend families of a given week, with an accuracy score of 81%, but we would prefer a score above 90% given the limitations of our data. The training was also expensive in time, though this was mainly due to our relaxed early stopping criteria.

Let's see if we can do better with a random forest model.

Random Forest

Random forest is notorious for having a lot of hyperparameters to tune when building models. Things like deciding how many decision trees to use, the maximum depth, the minimum samples when splitting, and many other parameters can make or break your model, and tuning them without the aid of a computer can be a real headache. So we're going to depend on the RandomizedSearchCV class from the Sci-Kit Learn library to search a range of parameters for us and return the model that performs the best.

For our model, we'll be tuning the number of decision tree estimators, the maximum number of features used, the max tree depth, the minimum number of samples to split, min samples in a leaf node, and whether or not to resample using bootstrapping. The reason we're doing a randomized search rather than a grid search is that we can search a wider range of parameters using a randomized search, since we won't have to exhaustively go through every combination of values; instead, we'll choose 100 parameter sets at random to test using 3-fold cross-validation.

In [50]:
rf = RandomForestClassifier()
In [51]:
n_estimators = [x for x in range(1,201,10)]
max_features = ['auto', 'sqrt']
max_depth = [x for x in range(1, 101, 10)] + [None]
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

params = {
    'n_estimators': n_estimators,
    'max_features': max_features,
    'max_depth': max_depth,
    'min_samples_split': min_samples_split,
    'min_samples_leaf': min_samples_leaf,
    'bootstrap': bootstrap   
}

rf_grid = RandomizedSearchCV(estimator = rf, param_distributions = params,
                             n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
In [52]:
rf_grid.fit(x_train, y_train_cat)
Fitting 3 folds for each of 100 candidates, totalling 300 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=-1)]: Done  29 tasks      | elapsed:   42.9s
[Parallel(n_jobs=-1)]: Done 150 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  7.1min finished
Out[52]:
RandomizedSearchCV(cv=3, error_score='raise-deprecating',
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
                                                    n_estimators='warn',
                                                    n_jobs=None,
                                                    oob_sc...
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [1, 11, 21, 31, 41, 51, 61,
                                                      71, 81, 91, None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [1, 11, 21, 31, 41, 51,
                                                         61, 71, 81, 91, 101,
                                                         111, 121, 131, 141,
                                                         151, 161, 171, 181,
                                                         191]},
                   pre_dispatch='2*n_jobs', random_state=42, refit=True,
                   return_train_score=False, scoring=None, verbose=2)

Now that we've fit our model to the data, let's take a look at the returned best parameter set.

In [53]:
rf_grid.best_params_
Out[53]:
{'n_estimators': 51,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'auto',
 'max_depth': 91,
 'bootstrap': True}

According to our search, we should be using 51 trees, splitting nodes with as few as 2 samples, a minimum of 1 sample per leaf, a maximum depth of 91, and bootstrapping.

Random Forest Evaluation

Let's take a look at our model's accuracy as a baseline performance measure.

In [54]:
best_rf = rf_grid.best_estimator_

preds = best_rf.predict(x_test)
acc = accuracy_score(preds, y_test_cat)
print(f'Accuracy: {round(acc*100,2)}%')
Accuracy: 94.28%

The random forest model is out-performing the ANN by quite a bit, with an accuracy score of 94.3% (compared to 81% for the ANN). To further confirm the accuracy, we'll run a 10-fold cross-validation and compare to our original accuracy score.

In [55]:
kfold = KFold(n_splits=10, shuffle=True)
kfold_results = cross_val_score(best_rf, x_test, y_test_cat, cv=kfold)
In [56]:
results = f'KFold Avg. Accuracy: {round(kfold_results.mean()*100, 2)}% ({round(kfold_results.std()*100,2)})%'
print(results)
KFold Avg. Accuracy: 92.51% (0.37)%

For our cross-validation, we're getting a slightly lower accuracy of 92.5%, with a standard deviation of 0.37%, but that is still very good.

Let's take a look at the precision and recall values using the classification_report() function.

In [57]:
decoded_preds = le.inverse_transform(preds.argmax(axis=1))
decoded_cats = le.inverse_transform(y_test_cat.argmax(axis=1))

report = classification_report(decoded_cats, decoded_preds)
print(report)
              precision    recall  f1-score   support

          -1       0.83      0.82      0.83      1176
           0       0.87      0.63      0.73      1133
           1       0.98      1.00      0.99      1122
           2       0.99      1.00      0.99      1121
           3       0.98      1.00      0.99      1142
           4       0.80      0.83      0.81      1134
           5       0.99      1.00      0.99      1161
           6       0.99      1.00      0.99      1136
           7       0.99      1.00      1.00      1184
           8       1.00      1.00      1.00      1206
           9       0.81      0.88      0.84      1135
          10       0.93      0.94      0.93      1145
          11       0.98      1.00      0.99      1159
          12       0.94      0.93      0.94      1099
          13       0.99      1.00      0.99      1158
          14       0.98      1.00      0.99      1176
          15       0.99      1.00      1.00      1116

    accuracy                           0.94     19503
   macro avg       0.94      0.94      0.94     19503
weighted avg       0.94      0.94      0.94     19503

Overall, this report is very good. Most of the clusters were correctly predicted, with precision and recall scores over 80%.

The lowest score here is 63% for the recall of cluster 0. Since recall is true positives divided by the number of actual positives, this means that our classifier frequently labeled true 0's as members of other clusters, and therefore the false negative rate for cluster 0 was high. Let's take a look at the confusion matrix to confirm.

We'll be using a custom labeled confusion matrix function so that we can more easily see the clusters' names.

In [58]:
def labeled_cm(actuals, preds, normalize=False, precision=2):
    """Create a labeled confusion matrix.
    
    Args:
        actuals (nx1 numpy.array): Actual label values from data.
        preds (nx1 numpy.array): Predicted label values from model.
        
    Returns:
        cm (pandas.DataFrame): Labeled confusion matrix.
    """
    
    if len(actuals) != len(preds):
        raise ValueError('Arguments must have the same length.')
        
    labels = list(set([x for x in actuals]))
    labels.sort()
    
    confusion_dict = {}
    for l in labels:
        confusion_dict[str(l)] = {}
        for k in labels:
            confusion_dict[str(l)][str(k)] = 0
            
    for i in range(len(actuals)):
        confusion_dict[str(actuals[i])][str(preds[i])] += 1
        
    cm = pd.DataFrame()
    
    for i in labels:
        i = str(i)
        for j in confusion_dict[i].keys():
            cm.loc[i, j] = confusion_dict[i][j]
            
    if normalize:
        for l in labels:
            cm.loc[str(l), :] = round(cm.loc[str(l), :] / cm.loc[str(l), :].sum(), precision)
        
    return cm
In [59]:
cm = labeled_cm(decoded_cats, decoded_preds, normalize=True)
cm
Out[59]:
-1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
-1 0.82 0.02 0.02 0.01 0.02 0.01 0.01 0.01 0.01 0.0 0.01 0.00 0.02 0.00 0.01 0.03 0.01
0 0.09 0.63 0.00 0.00 0.00 0.14 0.00 0.00 0.00 0.0 0.11 0.02 0.00 0.01 0.00 0.00 0.00
1 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00
3 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00
4 0.04 0.04 0.00 0.00 0.00 0.83 0.00 0.00 0.00 0.0 0.06 0.03 0.00 0.01 0.00 0.00 0.00
5 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00
6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00
7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00
8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00
9 0.03 0.03 0.00 0.00 0.00 0.04 0.00 0.00 0.00 0.0 0.88 0.01 0.00 0.00 0.00 0.00 0.00
10 0.01 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.0 0.02 0.94 0.00 0.03 0.00 0.00 0.00
11 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.00 1.00 0.00 0.00 0.00 0.00
12 0.01 0.01 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.0 0.02 0.01 0.00 0.93 0.00 0.00 0.00
13 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.00 1.00 0.00 0.00
14 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.00 0.00 1.00 0.00
15 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 1.00
In [60]:
plt.figure(figsize=(14,14))
sns.heatmap(cm, annot=True, fmt = '.2f', square=1, linewidth=1., yticklabels=False, cmap='YlGnBu')
Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8581daf7d0>

Looking at our confusion matrix, it's easy to confirm our hypothesis: weeks in cluster 0 are being incorrectly classified as -1, 4, and 9.

Looking at the plots of our clusters above, one can easily see how the confusion happens for clusters 4 and 9. Both are large spike trends with peaks in the latter half of the week, which means that for the first half of the week they look exactly like cluster 0: constantly decreasing.

This is the confusion that I referred to in the "Building the Classifier" introduction, and if we tried to use fewer days of data to predict the week, then this confusion would be exacerbated, and more clusters would be incorrectly identified as cluster 0 and vice versa.

Classification Results

The random forest classifier out-performed the ANN by quite a large margin: 92% accuracy versus 81% for the ANN. The random forest classifier was also much quicker to train and validate, though the ANN was slowed down mainly by our overly patient early stopping criterion.

The precision and recall for the random forest model were also very good, with only one value below 80% -- the recall for cluster 0. This was an expected bias caused by each trend family's early-week descent pattern, which mimics the decreasing trend family for the first few days.

Given that our model relies on the first six price values of the week and that only 9 of the 16 non-random trend families peak after that point, the model will only really be useful about 51% of the time. Not a particularly powerful result, but it's the hand we've been dealt.


Conclusion

This has been a fairly comprehensive project, from the collection of the data to the clustering and classification of each time series. The random trend family presented a serious threat to our ability to cluster the weeks of data, but we were able to overcome that by relying on the DBSCAN algorithm to filter out the random noise during its cluster assignment. The decreasing trend family made our classification of the data very difficult, forcing us to rely on more days of data than we would want in order for our classifier to make a usable predictor. Overall, we were able to get some promising results with our random forest classifier, which scored 92% accuracy in its cross-validation, but its usefulness in practice is limited by the fact that it requires 6 price values -- half of the week's readings -- to work.

In conclusion, if you want to know which trend your week is likely to follow, use one of the online calculators based on Treeki's work. Those will rely on deterministic predictions based on the actual RNG of ACNH, so you really just can't go wrong there. Regardless, this has been a ton of fun to work on, and thanks for reading!