I cut my teeth in the retail world - analyzing churn and LTV, identifying real estate opportunities for store deployments, optimizing margin, etc. - so I've often wanted to work on personal projects that use retail data. I started searching, and it turns out that retail companies are not so eager to share their data (understandably so), so I've embarked on a mission to create my own.
Meet Naturals Retailer, your one-stop shop for soaps, lotions, and suppliments, and the official retailer of brands like "Bill's Soap Co." and "Treesap Naturals"!
This simulation will revolve around the idea that the trends and patterns observed in retail analytics have their roots in individual customer decisions and behaviors. Therefore, our primary simulated object will be the customers themselves. This will allow us to see high-level metrics, like total seasonal sales, all the way down to micro-level "customer x likes to buy product y". If you want to see how things turn out, check out the Power BI dashboard that uses data from this simulation.
Now let's get going!
import pandas as pd
import numpy as np
import datetime as dt
import math
import requests
from tqdm import tqdm
from bs4 import BeautifulSoup
from io import StringIO
import matplotlib.pyplot as plt
%matplotlib inline
The first step is build our business - "Natural Retailers." I've created product categories, brands, and products for our simulated business to sell. Each product has a brand, category, msrp, and unit cost associated with it, which will allow us to do the majority of classic retail KPI tracking.
categories = [
{'id': 1, 'descr': 'Soap'},
{'id': 2, 'descr': 'Lotion'},
{'id': 3, 'descr': 'Supplement'}
]
brands = [
{'id': 1, 'descr': "Bill's Soap Co."},
{'id': 2, 'descr': "Taysha's Health & Wellness"},
{'id': 3, 'descr': "Treesap Naturals"}
]
products = [
{'id': 1, 'descr': 'Gentle Soap', 'msrp': 10.00, 'unit_cost': 5.00, 'brand': 1, 'category': 1},
{'id': 2, 'descr': 'Rejuvinating Soap', 'msrp': 12.00, 'unit_cost': 6.00, 'brand': 1, 'category': 1},
{'id': 3, 'descr': 'Moisturizing Lotion', 'msrp': 12.00, 'unit_cost': 6.00, 'brand': 1, 'category': 2},
{'id': 4, 'descr': 'All Clean Hand Soap', 'msrp': 11.00, 'unit_cost': 6.60, 'brand': 2, 'category': 1},
{'id': 5, 'descr': 'All Clean Hand Lotion', 'msrp': 11.00, 'unit_cost': 6.60, 'brand': 2, 'category': 2},
{'id': 6, 'descr': 'Hempy Hand Lotion', 'msrp': 13.00, 'unit_cost': 7.80, 'brand': 2, 'category': 2},
{'id': 7, 'descr': 'Hempy Health Supplements', 'msrp': 15.00, 'unit_cost': 9.00, 'brand': 2, 'category': 3},
{'id': 8, 'descr': 'Natural Cedar Lotion', 'msrp': 12.00, 'unit_cost': 6.00, 'brand': 3, 'category': 2},
{'id': 9, 'descr': 'Natural Extract', 'msrp': 15.00, 'unit_cost': 7.50, 'brand': 3, 'category': 3},
{'id': 10, 'descr': 'Natural Energy', 'msrp': 14.00, 'unit_cost': 7.00, 'brand': 3, 'category': 3}
]
products = pd.DataFrame(products)
categories = pd.DataFrame(categories)
brands = pd.DataFrame(brands)
vi_products = pd.merge(left=products, right=brands, left_on='brand', right_on='id', how='inner', suffixes=('', '_brand'))
vi_products = pd.merge(left=vi_products, right=categories, left_on='category', right_on='id', how='inner', suffixes=('', '_categ'))
vi_products.drop(['id_categ', 'id_brand'], axis=1, inplace=True)
vi_products.head()
Next, we need customers! Or at least, names to give them. We'll use the top 50 male and female names listed on the SSA website below, along with the most common U.S. surnames listed on thoughtco.com.
# baby names url
url = "https://www.ssa.gov/oact/babynames/decades/century.html"
res = requests.get(url)
if res.status_code==200:
soup = BeautifulSoup(res.text)
rows = soup.find('table').find('tbody').find_all('tr')
rows[:5]
names = []
for r in rows:
data = r.find_all('td')
try:
names.append((data[1].text, data[3].text))
except IndexError as e:
continue
len(names)
male_names = [x[0] for x in names]
female_names = [x[1] for x in names]
# last names url
last_url = "https://www.thoughtco.com/most-common-us-surnames-1422656"
res = requests.get(last_url)
if res.status_code==200:
soup = BeautifulSoup(res.text)
rows = soup.find('table').find('tbody').find_all('tr')
rows[0]
last_names = []
for r in rows:
try:
last_names.append(r.find_all('a')[0].text)
except IndexError as e:
continue
len(last_names)
The customers gotta live somewhere! Rather than listing out the 50 states myself, a much faster option is to scrape them from the web. This way, we also gain access to states' U.S. region names and divisions.
states_url = "https://raw.githubusercontent.com/cphalpert/census-regions/master/" + \
"us%20census%20bureau%20regions%20and%20divisions.csv"
res = requests.get(states_url)
if res.status_code == 200:
states_csv = StringIO(res.text)
states_df = pd.read_csv(states_csv, sep=',')
states_df.head()
state_names = states_df['State Code'].tolist()
states_df.Region.unique()
states_df.loc[states_df['State Code']=='NY', 'Region'].values[0]
Here's where the heavy lifting starts. We will need to be able to define customers' shopping behaviors, and fundamental to that is the concept of seasonality.
We're going to make a shopper's seasonality dependent on their region. For instance, the weather in California permits shoppers to shop year-round, whereas a shopper in the Notheast might have trouble driving to the store in winter. We're also going to account somewhat for the holiday's (a spike in November from Black Friday) and other seasonal "bumps."
Each Seasonality
object will have a seasonal CDF distribution that will allow for the use of the inverse transform method during simulation.
class Seasonality:
def __init__(self, state_code):
self._seasonality_lookup = {
'Northeast': [.2, .5, .1, .2, .5, .3, .6, .4, .2, .1, 1, .7], # Jan - Dec
'South': [0, .3, .1, .4, .6, .5, .6, .4, .1, .2, .8, .8],
'Midwest': [.3, .4, .1, 0, .3, .7, .8, .5, .4, .2, .9, .5],
'West': [.3, .4, .2, .1, .1, .2, .4, .6, .7, .3, .6, .5]
}
self._seasonal_region = states_df.loc[states_df['State Code']==state_code, 'Region'].values[0]
self._seasonal_weights = self._seasonality_lookup[self._seasonal_region]
self._seasonal_rands = [max(np.random.normal(loc=x, scale=0.15), 0) for x in self._seasonal_weights]
rand_tot = sum(self._seasonal_rands)
self.seasonal_behavior = [x/rand_tot for x in self._seasonal_rands]
self.seasonal_cdf = list(self.gen_seasonal_cdf())
def gen_seasonal_cdf(self):
tot = 0
for item in self.seasonal_behavior:
tot += item
yield tot
s = Seasonality('LA')
s.seasonal_cdf
Next, we'll need a class to determine the behavior of an individual customer. Each customer will have shopping frequencies, spending distributions, seasonalities, and brand and category preferences that will be initialized upon their instanciation. These will be created by a Behavior
class.
The Behavior
class will also create segmenting behaviors between male and female customers, though the customer's sex will be omitted from the final dataset. This will allow for analytists to explore these differences with unsupervised methods.
class Behavior:
def __init__(self):
self.frequency = 0
self.avg_spend = 0
self.std_spend = 0
self.seasonality = None
self.brand_weights = {
1: 0,
2: 0,
3: 0
}
self.category_weights = {
1: 0,
2: 0,
3: 0
}
self.seasonal_cdf = []
def cum_sum(self, ls):
tot = 0
for item in ls:
tot += item
yield tot
def init_brand_weights(self, customer):
if customer.gender == 'Male':
pref = [.4, .1, .3]
else:
pref = [.5, .4, .3]
rands = np.random.rand(1,3) + pref
weights = rands / rands.sum()
weights = list(weights[0])
cdf = [x for x in self.cum_sum(weights)]
for i in range(len(cdf)):
self.brand_weights[i+1] = cdf[i]
def init_category_weights(self, customer):
if customer.gender == 'Male':
pref = [.3, .5, 1]
else:
pref = [.3, .8, 1]
rands = np.random.rand(1,3) + pref
weights = rands / rands.sum()
weights = list(weights[0])
cdf = [x for x in self.cum_sum(weights)]
for i in range(len(cdf)):
self.category_weights[i+1] = cdf[i]
def init_seasonal_cdf(self, customer):
self.seasonality = Seasonality(customer.state)
self.seasonal_cdf = self.seasonality.seasonal_cdf
def fit(self, customer):
# create variance between sex
if customer.gender == 'Male':
self.avg_spend = max(np.random.normal(loc=23, scale=12.5), 10)
self.std_spend = max(np.random.normal(loc=8, scale=3), 2)
self.frequency = max(np.random.normal(loc=3, scale=3), 1)
else:
self.avg_spend = max(np.random.normal(loc=27, scale=14), 10)
self.std_spend = max(np.random.normal(loc=5, scale=2), 2)
self.frequency = max(np.random.normal(loc=4, scale=2), 1)
self.init_brand_weights(customer)
self.init_category_weights(customer)
self.init_seasonal_cdf(customer)
return self
The Customer
class will bring it all together. Each customer will have id information, location data, and a set of behaviors to determine what decisions they will make during the simulation.
class Customer:
def __init__(self, id):
self.id = id
self.first_name = None
self.last_name = None
self.gender = None
self.state = None
self.behavior = None
def __str__(self):
print_out = (f"First Name: {self.first_name}\n" +
f"Last Name: {self.last_name}\n" +
f"Gender: {self.gender}\n" +
f"State: {self.state}\n" +
f"Behavior:" +
f"\n\tAvg. Spend: {self.behavior.avg_spend:.2f}" +
f"\n\tFreq.: {self.behavior.frequency:.2f}" +
f"\n\tSeas.: {[round(x, 2) for x in self.behavior.seasonal_cdf]}" +
f"\n\tSeas. Reg.: {self.behavior.seasonality._seasonal_region}")
return print_out
def __call__(self):
return self.generate_customer()
def generate_customer(self):
r = np.random.rand()
self.gender = 'Male' if r < 0.4 else 'Female'
if self.gender == 'Male':
self.first_name = np.random.choice(male_names)
else:
self.first_name = np.random.choice(female_names)
self.last_name = np.random.choice(last_names)
self.state = np.random.choice(state_names)
self.behavior = Behavior().fit(self)
return self
print(Customer(id=0).generate_customer())
Simulate transactions for an entire year for each customer, then look at high-level trends to see how well the simulation works.
To simulate shopping times:
n_shops
.n_shops
random numbers [0, 1]
and use inverse transform on the customer's seasonal_cdf
to generate shopping months.To simulate a transactions:
For each customer:
# inverse tranform method on discrete cdf
def itm(n, cdf):
for i in range(len(cdf)):
if n > cdf[i]:
continue
else:
break
return i + 1
# shopping times
def generate_shopping_months(customer, year):
f = customer.behavior.frequency
cdf = customer.behavior.seasonal_cdf
n_shops = np.random.poisson(f)
r = np.random.rand(n_shops)
months = [itm(x, cdf) for x in r] # inverse transform on seasonal_cdf
months.sort()
months = [int(str(year)+str(x).zfill(2)) for x in months]
return months
test_cust = Customer(id=0)()
print(test_cust)
shop_months = generate_shopping_months(test_cust, 2020)
shop_months
def generate_transaction(customer, id, month):
month_id = month
brand_cdf = [customer.behavior.brand_weights[x] for x in customer.behavior.brand_weights]
categ_cdf = [customer.behavior.category_weights[x] for x in customer.behavior.category_weights]
spend_cap = max(np.random.normal(loc=customer.behavior.avg_spend, scale=customer.behavior.std_spend), 10)
purchased_items = []
ext_prc = 0
while ext_prc <= spend_cap:
r = np.random.rand(2)
brand = itm(r[0], brand_cdf)
prods = products.loc[products['brand']==brand, ['id', 'category', 'msrp']]
cat_opts = prods['category'].tolist()
cat = itm(r[1], categ_cdf)
if cat not in cat_opts:
cat = np.random.choice([min(cat_opts), max(cat_opts)])
items = prods.loc[prods['category']==cat, ['id', 'msrp']]
purchased_item = np.random.choice(items['id'].tolist())
ext_prc += items.loc[items['id']==purchased_item, 'msrp'].values[0]
purchased_items.append(purchased_item)
transaction = {'id': id, 'month': month, 'customer_id': customer.id, 'product_ids': purchased_items}
return transaction
generate_transaction(test_cust, 0, 202004)
def line_sales_rows(trx):
prods = trx['product_ids']
uprods = {}
for p in prods:
if p in uprods:
uprods[p] += 1
else:
uprods[p] = 1
rows = []
for p in uprods:
dummy = {}
dummy['id'] = trx['id']
dummy['product_id'] = p
dummy['qty_sold'] = uprods[p]
rows.append(dummy)
return rows
test_trx = generate_transaction(test_cust, 0, 202004)
line_sales_rows(test_trx)
def transaction_rows(trx):
return {'id': trx['id'], 'month': trx['month'], 'customer_id': trx['customer_id']}
transaction_rows(test_trx)
For each customer:
customers = [Customer(id=x)() for x in range(2000)]
cols = ['id', 'first_name', 'last_name', 'state']
cust_ls = [(c.id, c.first_name, c.last_name, c.state) for c in customers]
customers_df = pd.DataFrame(cust_ls, columns=cols)
customers_df.head()
years = [2018, 2019, 2020]
cust_months = [{c: generate_shopping_months(c, y)} for c in customers for y in years]
cust_months[:5]
cust_month_pairs = []
for cm in cust_months:
for c in cm:
for m in cm[c]:
cust_month_pairs.append((c, m))
cust_month_pairs[:5]
cust_month_df = pd.DataFrame(cust_month_pairs)
cust_month_df.sort_values(1, ascending=True, inplace=True)
cust_month_df.reset_index(drop=True, inplace=True)
cust_month_df.reset_index(drop=False, inplace=True) # use index as trx_id
cust_month_df.columns = ['id', 'customer_id', 'month']
cust_month_df.head()
cm = cust_month_df.to_numpy()
raw_transactions = []
for row in tqdm(cm):
raw_transactions.append(generate_transaction(row[1], row[0], row[2]))
line_transactions = []
for t in raw_transactions:
line_transactions += line_sales_rows(t)
transactions = [transaction_rows(t) for t in raw_transactions]
line_transactions = pd.DataFrame(line_transactions)
transactions = pd.DataFrame(transactions)
transactions['trx_dat'] = pd.to_datetime(
(transactions['month'].astype(str).str.slice(0, 4) +
'-' +
transactions['month'].astype(str).str.slice(4)) +
'-01'
)
line_transactions.head()
transactions.head()
vi_sales_history = pd.merge(left=transactions, right=line_transactions, on='id', how='inner')
vi_sales_history = pd.merge(left=vi_sales_history, right=vi_products,
left_on='product_id', right_on='id',
how='inner', suffixes=('_trx', '_product'))
vi_sales_history['ext_prc'] = vi_sales_history['qty_sold'] * vi_sales_history['msrp']
vi_sales_history.sort_values('id_trx', ascending=True, inplace=True)
vi_sales_history.head()
monthly_sales = vi_sales_history[['trx_dat', 'ext_prc']].groupby('trx_dat').sum()
monthly_sales.head()
monthly_sales.plot()
customers_df['last_trx_dat'] = vi_sales_history[['customer_id', 'trx_dat']].groupby('customer_id').max()
products.head()
categories.head()
brands.head()
vi_products.head()
line_transactions.head()
transactions.head()
states_df.columns = ['descr', 'id', 'region', 'division']
states_df.head()
customers_df.head()
vi_sales_history.head()
This simlulation produced the types of behaviors that a data analyst for a retail company would expect to see, minus one thing: trend. While the simulation produces realistic seasonalities and customer shopping habits, it does not allow for the growth of sales year-over-year.
However, the results are good enough to provide simlulated data for anyone looking to try out retail analytics. For a look at some of the results, check out the dashboard I created that uses this data.