Liquor Sales Prediction

Yousef Elbayoumi · Feb 4, 2022

Project Description:

The main goal of this project is to predict liquor sales (prices). I will work on uncleaned data: first I will show how I handle the cleaning, then I will build the machine learning models, find the optimal one, save it, and test it.

I will show the outputs in bold to make them easy to tell apart from the code.

Import necessary libraries

import numpy as np
import pandas as pd
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import iplot
import pickle

from sklearn import linear_model
from sklearn import metrics
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import KFold, cross_val_score, cross_val_predict
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from keras.layers import Dense, Activation
from keras.models import Sequential

Ignore all annoying warnings

warnings.filterwarnings("ignore")

The dataset I used

https://www.kaggle.com/residentmario/iowa-liquor-sales

Data description:

Note: I have no particular interest in the data itself :D I just wanted a huge, uncleaned dataset with many columns so I could show some real cleaning work.

df = pd.read_csv('Iowa_Liquor_Sales.csv')

The data at a glance

df.head()
# Print the numerical data analysis
df.describe()
# Print the data types
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12591077 entries, 0 to 12591076
Data columns (total 24 columns):
# Column Dtype
--- ------ -----
0 Invoice/Item Number object
1 Date object
2 Store Number int64
3 Store Name object
4 Address object
5 City object
6 Zip Code object
7 Store Location object
8 County Number float64
9 County object
10 Category float64
11 Category Name object
12 Vendor Number float64
13 Vendor Name object
14 Item Number int64
15 Item Description object
16 Pack int64
17 Bottle Volume (ml) int64
18 State Bottle Cost object
19 State Bottle Retail object
20 Bottles Sold int64
21 Sale (Dollars) object
22 Volume Sold (Liters) float64
23 Volume Sold (Gallons) float64
dtypes: float64(5), int64(5), object(14)
memory usage: 2.3+ GB

The number of rows:

# Print the number of rows
print(len(df))
12591077

The number of columns:

# Print the number of columns
print(len(df.columns))
24

Listing all column names:

# list all column names
list(df)
['Invoice/Item Number',
'Date',
'Store Number',
'Store Name',
'Address',
'City',
'Zip Code',
'Store Location',
'County Number',
'County',
'Category',
'Category Name',
'Vendor Number',
'Vendor Name',
'Item Number',
'Item Description',
'Pack',
'Bottle Volume (ml)',
'State Bottle Cost',
'State Bottle Retail',
'Bottles Sold',
'Sale (Dollars)',
'Volume Sold (Liters)',
'Volume Sold (Gallons)']

The shape of data:

# Print the shape of data
print(df.shape)
(12591077, 24)

Checking duplicates:

# Check for duplicated rows
df.drop_duplicates()
df.shape
(12591077, 24)

It looks like we don’t have duplicated values. (Caveat: drop_duplicates() without inplace=True returns a new frame and leaves df untouched, so a more direct check is df.duplicated().sum() or comparing df.drop_duplicates().shape against df.shape.)

# Print the size of data (columns * rows)
df.size
302185848

Checking for null values:

# Print the number of null values in each column
print(df.isnull().sum())
Invoice/Item Number 0
Date 0
Store Number 0
Store Name 0
Address 2376
City 2375
Zip Code 2420
Store Location 2375
County Number 79178
County 79178
Category 8020
Category Name 16086
Vendor Number 3
Vendor Name 1
Item Number 0
Item Description 0
Pack 0
Bottle Volume (ml) 0
State Bottle Cost 10
State Bottle Retail 10
Bottles Sold 0
Sale (Dollars) 10
Volume Sold (Liters) 0
Volume Sold (Gallons) 0
dtype: int64

The sum of all nulls:

# Print the total number of null values in data
print(df.isnull().sum().sum())
192042

First issues I’ve noticed

1. Null values

Multiple null values can occur in the same row, so the share of affected rows is at most 1.525%.

null_percent = round(df.isnull().sum().sum()/len(df) * 100.0, 3)
print('The max percent of null values in the dataset is: {}%'.format(null_percent))
The max percent of null values in the dataset is: 1.525%
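Since several of those nulls can fall in the same row, the actual share of affected rows is what matters when we later decide to drop them. A minimal sketch for checking it directly, using the same df:

# Fraction of rows that contain at least one null value
rows_with_null = df.isnull().any(axis=1).mean() * 100
print('Rows with at least one null: {:.3f}%'.format(rows_with_null))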

Dealing with numerical columns

Replace with mode value:

I decided to use the mode for the following null values: the mean obviously doesn’t work here because it would produce a value that doesn’t exist in the column, and min or max would only add outliers, so the mode is the right choice.

df['County Number'].value_counts().idxmax()

# 77.0 is the most frequent value, so I will use it to fill the NaN values
77.0

Dealing with NA countries:

# Replace the missing 'County Number' values by the most frequent 
df['County Number'].replace(np.nan, 77.0, inplace=True)

#df.loc[df['County Number'] == 77.0] # This is how I get the county name
# And by logic, I will replace nan values in 'County' with 'POLK' as it's the county name for county number 77.0
df['County'].replace(np.nan, 'POLK', inplace=True)

The mode value for the category column:

df['Category'].value_counts().idxmax()
1031080.0

Dealing with NA categories:

# Replace the missing 'Category' values by the most frequent 
df['Category'].replace(np.nan, 1031080.0, inplace=True)

#df.loc[df['Category'] == 1031080.0] # This is how I get the category name
# And by logic, I will replace nan values in 'Category Name' with 'VODKA 80 PROOF'
# as it's the category name for category number 1031080.0
df['Category Name'].replace(np.nan, 'VODKA 80 PROOF', inplace=True)

The mode value for the vendor number column:

df['Vendor Number'].value_counts().idxmax()
260.0

Dealing with NA vendors:

# Replace the missing 'Vendor Number' values by the most frequent 
df['Vendor Number'].replace(np.nan, 260.0, inplace=True)

#df.loc[df['Vendor Number'] == 260.0] # This is how I get the vendor name
# And by logic, I will replace nan values in 'Vendor Name' with 'DIAGEO AMERICAS'
# as it's the vendor name for vendor number 260.0
df['Vendor Name'].replace(np.nan, 'DIAGEO AMERICAS', inplace=True)
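As a side note, the same mode-based filling can be written as a short loop so the values don’t have to be hard-coded. This is just a sketch of the equivalent logic, with the caveat that it fills each column with its own mode independently, whereas above I paired the name columns with their number columns by hand:

# Fill each column's NaN values with that column's most frequent value (sketch)
for col in ['County Number', 'County', 'Category', 'Category Name', 'Vendor Number', 'Vendor Name']:
    df[col].fillna(df[col].mode()[0], inplace=True)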

Replace with the mean value:

I decided to use the mean for the following null values because it’s the most sensible choice for prices; min and max would only add outliers, and while the mode could work, the mean is the best option for missing prices.

# First I have to remove the $ sign and change the type to 'float' with this simple code
df['State Bottle Cost'] = df['State Bottle Cost'].str.replace('$', '', regex=False).astype('float')

avg_sbc = df['State Bottle Cost'].mean(axis=0)
print(round(avg_sbc, 2))
df['State Bottle Cost'].replace(np.nan, 9.66, inplace=True)
9.66

Removing dollar sign from ‘State Bottle Retail’ values:

# First I have to remove the $ sign and change the type to 'float' with this simple code
df['State Bottle Retail'] = df['State Bottle Retail'].str.replace('$', '', regex=False).astype('float')

avg_sbr = df['State Bottle Retail'].mean(axis=0)
print(round(avg_sbr, 2))
df['State Bottle Retail'].replace(np.nan, 14.51, inplace=True)
14.51

Removing dollar sign from ‘Sale (Dollars)’ values:

# First I have to remove the $ sign and change the type to 'float' with this simple code
df['Sale (Dollars)'] = df['Sale (Dollars)'].str.replace('$', '', regex=False).astype('float')

avg_sale = df['Sale (Dollars)'].mean(axis=0)
print(round(avg_sale, 2))
df['Sale (Dollars)'].replace(np.nan, 116.59, inplace=True)
116.59
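The hard-coded 9.66, 14.51 and 116.59 are just the rounded means printed above; a small sketch that plugs the computed mean in directly (assuming the three columns have already been converted to float) avoids that manual step:

# Fill each price column's NaN values with that column's mean (sketch)
for col in ['State Bottle Cost', 'State Bottle Retail', 'Sale (Dollars)']:
    df[col].fillna(df[col].mean(), inplace=True)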

Dealing with text columns:

When dealing with text there aren’t many options for null values: use the mode, fill from a neighbouring row, or delete the row. I decided to use the mode because these columns hold very detailed addresses that are hard to predict, so the most frequent value is the safest fill.

df['City'].value_counts().idxmax()
'DES MOINES'

# Replace the missing 'City' values by the most frequent
df['City'].replace(np.nan, 'DES MOINES', inplace=True)

Creating ‘DES MOINES’ data frame:

# As the rows that were missing a city are now labelled 'DES MOINES', I will create a sub data frame for that city
df2 = df.loc[df['City'] == 'DES MOINES']

Working on the address column:

df2['Address'].value_counts().idxmax()
'3221 SE 14TH ST'

# Replace the missing 'Address' values by the most frequent
df['Address'].replace(np.nan, '3221 SE 14TH ST', inplace=True)

Working on the store location column:

df2['Store Location'].value_counts().idxmax()
'3221 SE 14TH ST\nDES MOINES 50320\n(41.554101, -93.596754)'

# Replace the missing 'Store Location' values by the most frequent
df['Store Location'].replace(np.nan, '3221 SE 14TH ST\nDES MOINES 50320\n(41.554101, -93.596754)', inplace=True)

Working on the zip code column:

df2['Zip Code'].value_counts().idxmax()
'50314'

# Replace the missing 'Zip Code' values by the most frequent
df['Zip Code'].replace(np.nan, '50314', inplace=True)

Checking for null values, again:

# Print the number of null values in each column
print(df.isnull().sum())

# Print the total number of null values in data
print('Total number of null values:',df.isnull().sum().sum())
Invoice/Item Number 0
Date 0
Store Number 0
Store Name 0
Address 0
City 0
Zip Code 0
Store Location 0
County Number 0
County 0
Category 0
Category Name 0
Vendor Number 0
Vendor Name 0
Item Number 0
Item Description 0
Pack 0
Bottle Volume (ml) 0
State Bottle Cost 0
State Bottle Retail 0
Bottles Sold 0
Sale (Dollars) 0
Volume Sold (Liters) 0
Volume Sold (Gallons) 0
dtype: int64
Total number of null values: 0

BUT!

Normally this is how I deal with null values, but in our situation we can simply drop the rows that contain nulls and still have over 12M rows to work with; the loss is acceptable, and the result is more accurate than imputation.

So,

df = pd.read_csv('Iowa_Liquor_Sales.csv') # Loading the dataset again
df.dropna(inplace=True) # Dropping all rows contains null values

# Print the number of null values in each column
print(df.isnull().sum())
# Print the total number of null values in data
print('Total number of null values:',df.isnull().sum().sum())
Invoice/Item Number 0
Date 0
Store Number 0
Store Name 0
Address 0
City 0
Zip Code 0
Store Location 0
County Number 0
County 0
Category 0
Category Name 0
Vendor Number 0
Vendor Name 0
Item Number 0
Item Description 0
Pack 0
Bottle Volume (ml) 0
State Bottle Cost 0
State Bottle Retail 0
Bottles Sold 0
Sale (Dollars) 0
Volume Sold (Liters) 0
Volume Sold (Gallons) 0
dtype: int64
Total number of null values: 0

2. Data types

2.1. Date column

# Change the data type from object to date
df['Date'] = pd.to_datetime(df['Date'])
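Without a format string, pd.to_datetime has to infer the format, which is slow on 12M+ rows. If the dates follow the MM/DD/YYYY pattern (an assumption worth spot-checking on a few rows of this dataset), passing it explicitly speeds the conversion up considerably:

# Faster parse with an explicit format (assumes MM/DD/YYYY date strings)
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')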

2.2. State Bottle Cost column & State Bottle Retail & Sale (Dollars) columns

# Change the data type from object to float
df['State Bottle Cost'] = df['State Bottle Cost'].str.replace('$', '', regex=False).astype('float')
df['State Bottle Retail'] = df['State Bottle Retail'].str.replace('$', '', regex=False).astype('float')
df['Sale (Dollars)'] = df['Sale (Dollars)'].str.replace('$', '', regex=False).astype('float')

2.3. Zip Code column

# Actually, I didn't change the column type because I realized some rows have the Zip Code as "712-2",
# so it will stay an object type; I don't need it as an integer anyway

print('Number of unique Zip Code:', df['Zip Code'].nunique())

Number of unique Zip Code: 871

Rows with the ‘712-2’ zip code value:

df.loc[df['Zip Code'] == '712-2']

3. Column names

# Rename columns
nameDict = {"Invoice/Item Number":"Invoice_number",
"Store Number":"Store_number",
"Store Name":"Store_name",
"Zip Code":"Zip_code",
"Store Location":"Store_location",
"County Number":"County_number",
"County":"County_name",
"Category":"Category_number",
"Category Name":"Category_name",
"Vendor Name":"Vendor_name",
"Vendor Number":"Vendor_number",
"Item Number":"Item_number",
"Item Description":"Item_description",
"Bottle Volume (ml)":"Bottle_volume_ml",
"State Bottle Cost":"State_bottle_cost_($)",
"State Bottle Retail":"State_bottle_retail_($)",
"Bottles Sold":"Bottles_sold",
"Sale (Dollars)":"Sale_price_($)",
"Volume Sold (Liters)":"Volume_sold_litre",
"Volume Sold (Gallons)":"Volume_sold_gallon"}

df.rename(columns = nameDict,inplace=True)

4. Renaming and removing duplicated values

# Fixing 'Category_name' column
df.loc[df['Category_name'].str.contains('WHISK', case=False), 'Category_name'] = 'WHISKEY'
df.loc[df['Category_name'].str.contains('VODKA', case=False), 'Category_name'] = 'VODKA'
df.loc[df['Category_name'].str.contains('TEQ', case=False), 'Category_name'] = 'TEQUILA'
df.loc[df['Category_name'].str.contains('GIN', case=False), 'Category_name'] = 'GIN'
df.loc[df['Category_name'].str.contains('SCHNA', case=False), 'Category_name'] = 'SCHNAPPS'
df.loc[df['Category_name'].str.contains('RUM', case=False), 'Category_name'] = 'RUM'
df.loc[df['Category_name'].str.contains('Liqueur', case=False), 'Category_name'] = 'LIQUEUR'
df.loc[df['Category_name'].str.contains('WINE', case=False), 'Category_name'] = 'WINE'
df.loc[df['Category_name'].str.contains('BEER', case=False), 'Category_name'] = 'BEER'
df.loc[df['Category_name'].str.contains('Bourbon', case=False), 'Category_name'] = 'BOURBON'
df.loc[df['Category_name'].str.contains('Brandie', case=False), 'Category_name'] = 'Brandies'
df.loc[df['Category_name'].str.contains('scotch', case=False), 'Category_name'] = 'SCOTCH'
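The same normalization can also be driven by a small list of (substring, category) rules, which keeps them in one place; a sketch equivalent to the block above (the order of the rules still matters, e.g. 'WHISK' is applied before 'Bourbon'):

# Substring -> canonical category name (sketch of the same rules as above)
category_rules = [('WHISK', 'WHISKEY'), ('VODKA', 'VODKA'), ('TEQ', 'TEQUILA'),
                  ('GIN', 'GIN'), ('SCHNA', 'SCHNAPPS'), ('RUM', 'RUM'),
                  ('Liqueur', 'LIQUEUR'), ('WINE', 'WINE'), ('BEER', 'BEER'),
                  ('Bourbon', 'BOURBON'), ('Brandie', 'Brandies'), ('scotch', 'SCOTCH')]
for pattern, name in category_rules:
    df.loc[df['Category_name'].str.contains(pattern, case=False), 'Category_name'] = name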

Cleaning city column:

# Fixing 'City' column
print('Total number of cities before:',df['City'].nunique())

df['City'] = df['City'].str.upper()

print('Total number of cities after:',df['City'].nunique())
Total number of cities before: 790
Total number of cities after: 416

Cleaning county name column:

# Fixing 'County_name' column
print('Total number of countries before:',df['County_name'].nunique())

df['County_name'] = df['County_name'].str.upper()

print('Total number of countries after:',df['County_name'].nunique())
Total number of countries before: 200
Total number of countries after: 103

Cleaning category name column:

# Fixing 'Category_name' column
print('Total number of Categories before:',df['Category_name'].nunique())

df['Category_name'] = df['Category_name'].str.upper()

print('Total number of Categories after:',df['Category_name'].nunique())
Total number of Categories before: 42
Total number of Categories after: 41

Cleaning vendor name column:

# Fixing 'Vendor_name' column
print('Total number of vendors before:',df['Vendor_name'].nunique())

df['Vendor_name'] = df['Vendor_name'].str.upper()

print('Total number of vendors after:',df['Vendor_name'].nunique())
Total number of vendors before: 384
Total number of vendors after: 348

Now is a good time to check for duplicates again

# Check for duplicated rows again
df.drop_duplicates()
df.shape
(12495974, 24)

The row count is now 12,495,974, which is 95,103 rows fewer than the original 12,591,077. That reduction comes from the earlier dropna() call rather than from duplicates, since drop_duplicates() without inplace=True does not modify df.
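To actually count and remove exact duplicate rows, a minimal sketch:

# Count exact duplicate rows, then drop them in place (sketch)
print('Duplicated rows:', df.duplicated().sum())
df.drop_duplicates(inplace=True)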

# Check how the dataset look like now
df.head()
# Check how the data types look like now
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12495974 entries, 6 to 12591076
Data columns (total 24 columns):
# Column Dtype
--- ------ -----
0 Invoice_number object
1 Date datetime64[ns]
2 Store_number int64
3 Store_name object
4 Address object
5 City object
6 Zip_code object
7 Store_location object
8 County_number float64
9 County_name object
10 Category_number float64
11 Category_name object
12 Vendor_number float64
13 Vendor_name object
14 Item_number int64
15 Item_description object
16 Pack int64
17 Bottle_volume_ml int64
18 State_bottle_cost_($) float64
19 State_bottle_retail_($) float64
20 Bottles_sold int64
21 Sale_price_($) float64
22 Volume_sold_litre float64
23 Volume_sold_gallon float64
dtypes: datetime64[ns](1), float64(8), int64(5), object(10)
memory usage: 2.3+ GB

Working with data

New columns

# Create year column
df['Year'] = df['Date'].dt.year

# Create month column
df['Month'] = df['Date'].dt.month

# Create day column
df['Day'] = df['Date'].dt.day

# Create day_of_week column
df['Day_of_week'] = df['Date'].dt.day_name()

# Creating new column of sales amount per store
df['Sales_per_store_($)'] = df['Sale_price_($)'] / df['Store_number']

# Creating new column of net profit per bottle for each bottle
df['State_profit_per_bottle_($)'] = (df['State_bottle_retail_($)'] - df['State_bottle_cost_($)']) * df['Bottles_sold']

Understanding data

# Printing minimum and the maximum date from dataset.
print(df['Date'].min())
print(df['Date'].max())
# So we have data from 3 January 2012 to 31 October 2017
2012-01-03 00:00:00
2017-10-31 00:00:00

Data visualizations

Vendors with the highest bottles sold

df.groupby('Vendor_name')['Bottles_sold'].sum().sort_values(ascending=False).head(10).plot(kind='barh', figsize=(20,10))
plt.title('Bottles Sold')
plt.show()

Items with the highest bottles sold

df.groupby('Item_description')['Bottles_sold'].sum().sort_values(ascending=False).head(10).plot(kind='barh', figsize=(20,10))
plt.title('Bottles Sold')
plt.show()

Items with the highest sales

df.groupby('Item_description')['Sale_price_($)'].sum().sort_values(ascending=False).head(10).plot(kind='barh', figsize=(20,10))
plt.title('Sale (Dollars)')
plt.show()

Stores with the highest sales

df_store = df.groupby('Store_name').agg({'Sale_price_($)':'sum'}).reset_index().sort_values('Sale_price_($)',ascending = False)
dfs = df_store.head(50)

a1 = (15, 15)
fig, ax = plt.subplots(figsize=a1)
plot= sns.barplot(x="Store_name", y="Sale_price_($)", data=dfs,palette ="deep")
plot.set_xticklabels(ax.get_xticklabels(),rotation=90)
plot.set_title('Best Store Sale ')
ax.set_ylabel('Sale (Dollars)')
ax.set_xlabel('Store Name')

Creating data frame plot for categories visualization

df_plot = df.groupby(['Category_name','Pack','Date'])['Sale_price_($)'].sum().groupby(
['Category_name','Pack']).max().sort_values().groupby(
['Category_name']).sum().sort_values(ascending=False)
all_count = pd.DataFrame(df_plot)
top_count = pd.DataFrame(df_plot.head(10))

Categories of liquors and their sales

fig_reg = px.bar(all_count,x=all_count.index, y='Sale_price_($)',color='Sale_price_($)')
fig_reg.update_layout(
title="Sales of liquor per category",
xaxis_title=" Category Name",
yaxis_title="Sales in dollars",
)
fig_reg.show()

Top 10 categories of liquors that have highest sale

fig_reg = px.bar(top_count,x=top_count.index, y='Sale_price_($)',color='Sale_price_($)')
fig_reg.update_layout(
title="Sales of liquor per category",
xaxis_title=" Category Name",
yaxis_title="Sales in dollars",
)
fig_reg.show()

# Whiskey has the highest sales, perhaps because it is such a popular liquor category

So we know WHISKEY and VODKA are the top 2 categories; let’s compare them

monthly_sales_whis = df[df['Category_name'] == 'WHISKEY'].groupby('Month', as_index=False)['Sale_price_($)'].sum()
#print(monthly_sales_whis)

x = monthly_sales_whis['Month']
y1 = monthly_sales_whis['Sale_price_($)']

ax = sns.pointplot(x=x, y=y1, data=monthly_sales_whis, color="black", markers="H")
plt.axhline(y1.mean(), color= "k", linestyle= "dotted", linewidth= 2)

ax.set_title('WHISKEY monthly sales')
monthly_sales_vod = df[df['Category_name'] == 'VODKA'].groupby('Month', as_index=False)['Sale_price_($)'].sum()
#print(monthly_sales_vod)

x = monthly_sales_vod['Month']
y2 = monthly_sales_vod['Sale_price_($)']

ax = sns.pointplot(x=x, y=y2, data=monthly_sales_vod, color="black", markers="H")
plt.axhline(y2.mean(), color= "k", linestyle= "dotted", linewidth= 2)

ax.set_title('VODKA monthly sales')
data = [y1,y2]
headers = ["WHISKEY", "VODKA"]
top2_sell = pd.concat(data, axis=1, keys=headers)
top2_sell.index = top2_sell.index +1

ax = top2_sell.plot(linewidth=2, fontsize=8);
ax.set_xlabel('Month');
ax.legend(fontsize=8);

Creating data frame plot for cities visualization

df_plot = df.groupby(['City','Pack','Date'])['Sale_price_($)'].sum().groupby(
['City','Pack']).max().sort_values().groupby(
['City']).sum().sort_values(ascending=False)
all_count = pd.DataFrame(df_plot)
top_count = pd.DataFrame(df_plot.head(20))

Sales of liquor by city name

fig_reg = px.bar(all_count,x=all_count.index, y='Sale_price_($)',color='Sale_price_($)')
fig_reg.update_layout(
title="Sales of liquor per city",
xaxis_title=" City Name",
yaxis_title="Sales in dollars",
)
fig_reg.show()

Top 20 sales of liquor by city name

fig_reg = px.bar(top_count,x=top_count.index, y='Sale_price_($)',color='Sale_price_($)')
fig_reg.update_layout(
title="Sales of liquor per city",
xaxis_title=" City Name",
yaxis_title="Sales in dollars",
)
fig_reg.show()

# The city of 'CEDAR RAPIDS' has the highest liquor sales

Creating data frame plot for date (time) visualization

df_plot_year = df.groupby(['Year','Pack','Date'])['Sale_price_($)'].sum().groupby(
['Year','Pack']).max().sort_values().groupby(
['Year']).sum().sort_values(ascending=False)

df_plot_month = df.groupby(['Month','Pack','Date'])['Sale_price_($)'].sum().groupby(
['Month','Pack']).max().sort_values().groupby(
['Month']).sum().sort_values(ascending=False)

df_plot_day = df.groupby(['Day','Pack','Date'])['Sale_price_($)'].sum().groupby(
['Day','Pack']).max().sort_values().groupby(
['Day']).sum().sort_values(ascending=False)

df_plot_day_of_week = df.groupby(['Day_of_week','Pack','Date'])['Sale_price_($)'].sum().groupby(
['Day_of_week','Pack']).max().sort_values().groupby(
['Day_of_week']).sum().sort_values(ascending=False)

all_count_year = pd.DataFrame(df_plot_year)
all_count_month = pd.DataFrame(df_plot_month)
all_count_day = pd.DataFrame(df_plot_day)
all_count_day_of_week = pd.DataFrame(df_plot_day_of_week)

Sales of liquor each year

fig_reg = px.bar(df_plot_year,x=df_plot_year.index, y='Sale_price_($)',color='Sale_price_($)')
fig_reg.update_layout(
title="Sales of liquor per Year",
xaxis_title=" Year Number",
yaxis_title="Sales in dollars",
)
fig_reg.show()

# From the plot, it is clear that 2013 had the highest liquor sales.

Sales of liquor each month

fig_reg = px.bar(df_plot_month,x=df_plot_month.index, y='Sale_price_($)',color='Sale_price_($)')
fig_reg.update_layout(
title="Sales of liquor per Month",
xaxis_title=" Month Number",
yaxis_title="Sales in dollars",
)
fig_reg.show()

# From the plot, it is clear that October has the highest liquor sales.

Sales of liquor each day

fig_reg = px.bar(all_count_day,x=all_count_day.index, y='Sale_price_($)',color='Sale_price_($)')
fig_reg.update_layout(
title="Sales of liquor per Day",
xaxis_title=" Day Number",
yaxis_title="Sales in dollars",
)
fig_reg.show()

# From the plot, it is clear that the 4th day of the month has the highest liquor sales.

Sales of liquor each day of the week

fig_reg = px.bar(all_count_day_of_week,x=all_count_day_of_week.index, y='Sale_price_($)',color='Sale_price_($)')
fig_reg.update_layout(
title="Sales of liquor per Day of week",
xaxis_title="Day of Week",
yaxis_title="Sales in dollars",
)
fig_reg.show()

# From the plot, it is clear that Friday has the highest liquor sales.

Sales of liquor for the whole data

all_sales = df.groupby('Date', as_index=False)['Sale_price_($)'].sum()

daily_sales = go.Scatter(x=all_sales['Date'], y=all_sales['Sale_price_($)'])
layout = go.Layout(title='Daily sales', xaxis=dict(title='Date'), yaxis=dict(title='Sales'))
fig = go.Figure(data=[daily_sales], layout=layout)
iplot(fig)

# On Oct 4, 2013 the daily sales peak at about $3.50M; I wonder why.

Getting the most expensive and cheapest categories

df_plot = df.groupby(['Category_name','Pack','Date'])['Sale_price_($)'].sum().groupby(
['Category_name','Pack']).max().sort_values().groupby(
['Category_name']).sum().sort_values(ascending=False)
all_count1 = pd.DataFrame(df_plot)

df_plot = df.groupby(['Category_name','Pack','Date'])['Volume_sold_litre'].sum().groupby(
['Category_name','Pack']).max().sort_values().groupby(
['Category_name']).sum().sort_values(ascending=False)
all_count2 = pd.DataFrame(df_plot)
# Combining the two aggregations and ordering the categories by total sales
all_count = pd.concat([all_count1 , all_count2],axis=1)
all_count = all_count.sort_values(['Sale_price_($)'],ascending=False)
all_count['Price_per_litre_($)'] = all_count['Sale_price_($)'] / all_count['Volume_sold_litre']
all_count
print(round(all_count['Price_per_litre_($)'].max(), 3))
print(round(all_count['Price_per_litre_($)'].min(), 3))

print('\nThe most expensive category seems to be "BEER" with a price of 145.719$ per litre')
print('The cheapest category seems to be "TRIPLE SEC" with a price of 3.677$ per litre')
145.719
3.677
The most expensive category seems to be “BEER” with a price of 145.719$ per litre.
The cheapest category seems to be “TRIPLE SEC” with a price of 3.677$ per litre.

Creating a data frame plot for finding the best vendor

df_plot = df.groupby(['Vendor_name','Pack','Date'])['Volume_sold_litre'].sum().groupby(
['Vendor_name','Pack']).max().sort_values().groupby(
['Vendor_name']).sum().sort_values(ascending=False)
all_count = pd.DataFrame(df_plot)
top_count = pd.DataFrame(df_plot.head(20))

All vendors

fig_reg = px.bar(all_count,x=all_count.index, y='Volume_sold_litre',color='Volume_sold_litre')
fig_reg.update_layout(
title="Volume sold by vendor name",
xaxis_title=" Vendor Name",
yaxis_title="Liquor sold in Litres",
)
fig_reg.show()

Best 20 vendors

fig_reg = px.bar(top_count,x=top_count.index, y='Volume_sold_litre',color='Volume_sold_litre')
fig_reg.update_layout(
title="Volume sold by vendor name",
xaxis_title=" Vendor Name",
yaxis_title="Liquor sold in Litres",
)
fig_reg.show()

# 'CONSTELLATION WINE COMPANY, INC.' is the top vendor in terms of litres sold

Correlation

df_corr = df[['Bottles_sold', 'Sale_price_($)', 'Volume_sold_litre', 'Volume_sold_gallon',
'Sales_per_store_($)', 'State_profit_per_bottle_($)']]
df_corr
plt.figure(figsize=(10,10))
sns.heatmap(df_corr.corr(), cmap='Blues', annot=True,)
plt.show()

Machine Learning Models

The first thing to do is to split the data and we will work on the training data for now.

rng = np.random.RandomState()

train = df.sample(frac=0.6, random_state=rng)
test = df.loc[~df.index.isin(train.index)]

print(train.shape)
print(test.shape)
(7497584, 30)
(4998390, 30)
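Since RandomState() is created without a seed here, the split changes on every run. A minimal variant of the same 60/40 split with a fixed seed (42 is arbitrary), in case reproducible results are wanted:

# Reproducible 60/40 split (sketch)
train = df.sample(frac=0.6, random_state=42)
test = df.drop(train.index)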

Saving the data, then loading the training data:

train.to_csv('Training_Data.csv')
test.to_csv('Testing_Data.csv')
train_data = pd.read_csv('Training_Data.csv')

Deciding X and y values then splitting the data:

X = train_data[['Bottles_sold', 'Volume_sold_litre', 'Volume_sold_gallon', 'Sales_per_store_($)', 'State_profit_per_bottle_($)']]
y = train_data['Sale_price_($)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.3)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(5248308, 5)
(5248308,)
(2249276, 5)
(2249276,)

Normally I don’t calculate a confusion matrix for regression problems, but I wanted to show one way of doing it: I binarize the actual and predicted sales at a threshold of 36, which is roughly the largest column mean in X, as you can see below.

X.mean()
Bottles_sold 8.134930
Volume_sold_litre 7.494366
Volume_sold_gallon 1.978739
Sales_per_store_($) 0.035755
State_profit_per_bottle_($) 35.634458
dtype: float64

Confusion matrix function with model evaluation results:

def getTPFPTNFN(y_true, y_pred):
    TP, FP, TN, FN = 0, 0, 0, 0
    for s_true, s_pred in zip(y_true, y_pred):
        if s_true <= 36:
            if s_pred <= 36:
                TP += 1
            else:
                FN += 1
        else:
            if s_pred > 36:
                TN += 1
            else:
                FP += 1

    print('TP:', TP)
    print('FP:', FP)
    print('TN:', TN)
    print('FN:', FN)

    print('Accuracy: %{}'.format(round((TP+TN)/(TP+FP+TN+FN)*100,3)))
    print('Sensitivity (Recall): %{}'.format(round(TP/(TP+FN)*100,3)))
    print('Precision: %{}'.format(round(TP/(TP+FP)*100,3)))
    print('F1-Score: %{}'.format(round((2*TP)/(2*TP+FP+FN)*100,3)))
    print('Negative Predictive Value: %{}'.format(round(TN/(TN+FN)*100,3)))
    print('Error values:')
    print('Error Rate: %{}'.format(round((FN+FP)/(TN+FN)*100,3)))
    print('False Negative Rate: %{}'.format(round(FN/(FN+TP)*100,3)))
    print('False Discovery Rate: %{}'.format(round(FP/(FP+TP)*100,3)))
    print('False Alarm Rate: %{}'.format(round(FP/(FP+TN)*100,3)))
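The same binarized comparison can also be handed to scikit-learn; a sketch of a hypothetical helper that reproduces the four counts with metrics.confusion_matrix (positive means a sale of at most 36 dollars):

from sklearn.metrics import confusion_matrix

def binary_counts(y_true, y_pred, threshold=36):
    # Rows of cm are the actual class, columns the predicted class, ordered [True, False]
    cm = confusion_matrix(np.asarray(y_true) <= threshold,
                          np.asarray(y_pred) <= threshold, labels=[True, False])
    TP, FN, FP, TN = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]
    return TP, FP, TN, FN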

Ridge Regression

# Select the best alpha with RidgeCV
alpha_range = 10.**np.arange(-2, 3)
ridgeregcv = RidgeCV(alphas= alpha_range, normalize= True, scoring= 'neg_mean_squared_error')
ridgeregcv.fit(X_train, y_train)
ridgeregcv.alpha_
0.1

0.1 is the best alpha value

ridgereg = Ridge(alpha= 0.1, normalize= True)
ridgereg.fit(X_train, y_train)
y_pred_ridge = ridgereg.predict(X_test)
print(y_pred_ridge)
print('--------')
print('RMSE =', np.sqrt(metrics.mean_squared_error(y_test, y_pred_ridge)))
[ 83.86799649 112.28175927 115.80363882 ... 47.20502303 29.84366043
26.46827913]

--------
RMSE = 59.12017547891498

Coefficients results:

# examine the coefficients
rc = ridgeregcv.coef_
print(X.columns, rc)
Index(['Bottles_sold', 'Volume_sold_litre', 'Volume_sold_gallon',
'Sales_per_store_($)', 'State_profit_per_bottle_($)'],
dtype='object') [ -1.38878664 -1.54097852 -5.83867247 317.4568416 0.46917706]

Model results:

accuracy = metrics.r2_score(y_test, y_pred_ridge)
mse= metrics.mean_squared_error(y_test, y_pred_ridge)
mae=metrics.mean_absolute_error(y_test,y_pred_ridge)
rmse=np.sqrt(metrics.mean_squared_error(y_test, y_pred_ridge))
mape = metrics.mean_absolute_percentage_error(y_test, y_pred_ridge)


print("R Squared Score:", round(accuracy, 3))
print("Mean Squared Error:", round(mse, 3))
print("Mean Absolute Error:", round(mae, 3))
print("Root Mean Squared Error:", round(rmse, 3))
print("Mean Absolute Percentage Error:", round(mape, 3),"\n")
getTPFPTNFN(y_test,y_pred_ridge)

plt.scatter(y_test, y_pred_ridge)
plt.xlabel('Actual Sales')
plt.ylabel('Predicted Sales')
plt.title('Ridge Regression Predictor of Future Sales', fontsize='x-large')
R Squared Score: 0.972
Mean Squared Error: 3495.195
Mean Absolute Error: 24.663
Root Mean Squared Error: 59.12
Mean Absolute Percentage Error: 45399326245084.23

TP: 490406
FP: 29950
TN: 1483641
FN: 245279
Accuracy: %87.764
Sensitivity (Recall): %66.66
Precision: %94.244
F1-Score: %78.088
Negative Predictive Value: %85.813
Error values:
Error Rate: %15.919
False Negative Rate: %33.34
False Discovery Rate: %5.756
False Alarm Rate: %1.979




Text(0.5, 1.0, 'Ridge Regression Predictor of Future Sales')

Linear Regression

lm = LinearRegression()
lm.fit(X_train, y_train)
y_pred_linear = lm.predict(X_test)
print(y_pred_linear)
[ 32.57353443 116.50544971 103.83744958 ... 47.19600768 18.40008394
27.09433383]

Cross-validation scores:

scores = cross_val_score(lm, X_test, y_test, cv=5)
print(scores)
print(np.mean(scores))
[0.98115181 0.98468622 0.98478903 0.98480276 0.98451184]
0.9839883302269227

Model results:

accuracy = metrics.r2_score(y_test, y_pred_linear)
mse= metrics.mean_squared_error(y_test, y_pred_linear)
mae=metrics.mean_absolute_error(y_test,y_pred_linear)
rmse=np.sqrt(metrics.mean_squared_error(y_test, y_pred_linear))
mape = metrics.mean_absolute_percentage_error(y_test, y_pred_linear)


print("R Squared Score:", round(accuracy, 3))
print("Mean Squared Error:", round(mse, 3))
print("Mean Absolute Error:", round(mae, 3))
print("Root Mean Squared Error:", round(rmse, 3))
print("Mean Absolute Percentage Error:", round(mape, 3),"\n")
getTPFPTNFN(y_test,y_pred_linear)

plt.scatter(y_test, y_pred_linear, marker='+', color='r')
plt.xlabel("True Values")
plt.ylabel("Predictions")
R Squared Score: 0.979
Mean Squared Error: 2660.445
Mean Absolute Error: 23.435
Root Mean Squared Error: 51.58
Mean Absolute Percentage Error: 33871835520454.176

TP: 452561
FP: 166750
TN: 1346841
FN: 283124
Accuracy: %79.999
Sensitivity (Recall): %61.516
Precision: %73.075
F1-Score: %66.799
Negative Predictive Value: %82.63
Error values:
Error Rate: %27.6
False Negative Rate: %38.484
False Discovery Rate: %26.925
False Alarm Rate: %11.017



Text(0, 0.5, 'Predictions')

Decision Tree

# create a regressor object
dt_regressor = DecisionTreeRegressor(random_state = 0)

# fit the regressor with X and y data
dt_regressor.fit(X_train, y_train)
y_pred_dt = dt_regressor.predict(X_test)

Cross-validation scores:

scores = cross_val_score(dt_regressor, X_test, y_test, cv=5)
print(scores)
print(np.mean(scores))
[0.99508247 0.99746271 0.95080698 0.97147553 0.99606878]
0.9821792949533045

Model results:

accuracy = metrics.r2_score(y_test, y_pred_dt)
mse= metrics.mean_squared_error(y_test, y_pred_dt)
mae=metrics.mean_absolute_error(y_test,y_pred_dt)
rmse=np.sqrt(metrics.mean_squared_error(y_test, y_pred_dt))
mape = metrics.mean_absolute_percentage_error(y_test, y_pred_dt)


print("R Squared Score:", round(accuracy, 3))
print("Mean Squared Error:", round(mse, 3))
print("Mean Absolute Error:", round(mae, 3))
print("Root Mean Squared Error:", round(rmse, 3))
print("Mean Absolute Percentage Error:", round(mape, 3),"\n")
getTPFPTNFN(y_test,y_pred_dt)

plt.scatter(y_test, y_pred_dt, marker='+', color='r')
plt.xlabel("True Values")
plt.ylabel("Predictions")
R Squared Score: 0.999
Mean Squared Error: 85.576
Mean Absolute Error: 0.325
Root Mean Squared Error: 9.251
Mean Absolute Percentage Error: 0.002

TP: 735152
FP: 445
TN: 1513146
FN: 533
Accuracy: %99.957
Sensitivity (Recall): %99.928
Precision: %99.94
F1-Score: %99.934
Negative Predictive Value: %99.965
Error values:
Error Rate: %0.065
False Negative Rate: %0.072
False Discovery Rate: %0.06
False Alarm Rate: %0.029



Text(0, 0.5, 'Predictions')

Neural Network

# Feature Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Initialising the ANN
model = Sequential()

# Adding the input layer and the first hidden layer
model.add(Dense(32, activation = 'relu'))

# Adding the second hidden layer
model.add(Dense(units = 32, activation = 'relu'))

# Adding the third hidden layer
model.add(Dense(units = 32, activation = 'relu'))

# Adding the output layer

model.add(Dense(units = 1))

#model.add(Dense(1))
# Compiling the ANN
model.compile(optimizer = 'adam', loss = 'mean_squared_error')

Fitting the model (took 50 min approximately):

# Fitting the ANN to the Training set
model.fit(X_train, y_train, batch_size = 10, epochs = 10)

y_pred_nn = model.predict(X_test)
Epoch 1/10
524831/524831 [==============================] - 298s 566us/step - loss: 1725.7399
Epoch 2/10
524831/524831 [==============================] - 292s 557us/step - loss: 578.0598
Epoch 3/10
524831/524831 [==============================] - 312s 595us/step - loss: 401.8799
Epoch 4/10
524831/524831 [==============================] - 291s 555us/step - loss: 452.6916
Epoch 5/10
524831/524831 [==============================] - 297s 565us/step - loss: 439.9620
Epoch 6/10
524831/524831 [==============================] - 294s 560us/step - loss: 319.2640
Epoch 7/10
524831/524831 [==============================] - 297s 566us/step - loss: 326.9521
Epoch 8/10
524831/524831 [==============================] - 295s 561us/step - loss: 252.4429
Epoch 9/10
524831/524831 [==============================] - 299s 570us/step - loss: 263.6143
Epoch 10/10
524831/524831 [==============================] - 295s 562us/step - loss: 281.1537

Model results:

accuracy = metrics.r2_score(y_test, y_pred_nn)
mse= metrics.mean_squared_error(y_test, y_pred_nn)
mae=metrics.mean_absolute_error(y_test,y_pred_nn)
rmse=np.sqrt(metrics.mean_squared_error(y_test, y_pred_nn))
mape = metrics.mean_absolute_percentage_error(y_test, y_pred_nn)


print("R Squared Score:", round(accuracy, 3))
print("Mean Squared Error:", round(mse, 3))
print("Mean Absolute Error:", round(mae, 3))
print("Root Mean Squared Error:", round(rmse, 3))
print("Mean Absolute Percentage Error:", round(mape, 3),"\n")
getTPFPTNFN(y_test,y_pred_nn)

plt.scatter(y_test, y_pred_nn, marker='+', color='r')
plt.xlabel("True Values")
plt.ylabel("Predictions")
R Squared Score: 0.999
Mean Squared Error: 140.079
Mean Absolute Error: 3.509
Root Mean Squared Error: 11.835
Mean Absolute Percentage Error: 12045539362661.217

TP: 726827
FP: 11461
TN: 1502130
FN: 8858
Accuracy: %99.097
Sensitivity (Recall): %98.796
Precision: %98.448
F1-Score: %98.621
Negative Predictive Value: %99.414
Error values:
Error Rate: %1.345
False Negative Rate: %1.204
False Discovery Rate: %1.552
False Alarm Rate: %0.757



Text(0, 0.5, 'Predictions')

Save the optimal model

Both ANN and Decision Tree got decent results, but I believe Decision Tree was the optimal model overall.

with open('DT_model', 'wb') as f:
    pickle.dump(dt_regressor, f)

Load the model:

with open('DT_model', 'rb') as f:
    mod = pickle.load(f)
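joblib is another common way to persist scikit-learn models and is generally recommended for estimators that hold large NumPy arrays; a sketch of the equivalent save and load, assuming joblib is installed (the file name is arbitrary):

import joblib

joblib.dump(dt_regressor, 'DT_model.joblib')  # save
mod = joblib.load('DT_model.joblib')          # load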

Test the model with testing data

test_data = pd.read_csv('Testing_Data.csv')
X = test_data[['Bottles_sold', 'Volume_sold_litre', 'Volume_sold_gallon', 'Sales_per_store_($)', 'State_profit_per_bottle_($)']]
y = test_data['Sale_price_($)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.3)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(3498873, 5)
(3498873,)
(1499517, 5)
(1499517,)

Take numerical columns:

test_data[['Bottles_sold', 'Volume_sold_litre', 'Volume_sold_gallon', 'Sales_per_store_($)', 'State_profit_per_bottle_($)', 'Sale_price_($)']]

Here I’m trying to predict the result of the second row

# ['Bottles_sold', 'Volume_sold_litre', 'Volume_sold_gallon', 'Sales_per_store_($)', 'State_profit_per_bottle_($)']
# Predicted value: Sale_price_($)

mod.predict([[3, 4.5, 1.19, 0.0136, 17.43]])
# Got 52.08, and the actual value was 52.29
array([52.08])
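Newer scikit-learn versions may warn when a model fitted on a DataFrame is asked to predict on a plain list, because the feature names are missing. A sketch of the same single-row prediction with named columns:

# Same prediction, passed as a one-row DataFrame with the feature names (sketch)
row = pd.DataFrame([[3, 4.5, 1.19, 0.0136, 17.43]],
                   columns=['Bottles_sold', 'Volume_sold_litre', 'Volume_sold_gallon',
                            'Sales_per_store_($)', 'State_profit_per_bottle_($)'])
print(mod.predict(row))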

Model results on testing data:

y_pred_test = mod.predict(X_test)
accuracy = metrics.r2_score(y_test, y_pred_test)
mse= metrics.mean_squared_error(y_test, y_pred_test)
mae=metrics.mean_absolute_error(y_test,y_pred_test)
rmse=np.sqrt(metrics.mean_squared_error(y_test, y_pred_test))
mape = metrics.mean_absolute_percentage_error(y_test, y_pred_test)


print("R Squared Score:", round(accuracy, 3))
print("Mean Squared Error:", round(mse, 3))
print("Mean Absolute Error:", round(mae, 3))
print("Root Mean Squared Error:", round(rmse, 3))
print("Mean Absolute Percentage Error:", round(mape, 3),"\n")
getTPFPTNFN(y_test,y_pred_test)

plt.scatter(y_test, y_pred_test, marker='+', color='r')
plt.xlabel("True Values")
plt.ylabel("Predictions")
R Squared Score: 0.964
Mean Squared Error: 5228.589
Mean Absolute Error: 0.379
Root Mean Squared Error: 72.309
Mean Absolute Percentage Error: 0.002

TP: 491229
FP: 308
TN: 1007666
FN: 314
Accuracy: %99.959
Sensitivity (Recall): %99.936
Precision: %99.937
F1-Score: %99.937
Negative Predictive Value: %99.969
Error values:
Error Rate: %0.062
False Negative Rate: %0.064
False Discovery Rate: %0.063
False Alarm Rate: %0.031



Text(0, 0.5, 'Predictions')

In the end, I believe all the models performed very well, especially the ANN and the Decision Tree. However, the Decision Tree was the optimal model in my opinion, and I would recommend it for the final system.

Thanks for viewing my work :D
