Shoe Sales: My first Machine Learning project

learning ai

Published in

Learning Data

5 min read · Aug 24, 2023

Part I: Exploratory Data Analysis

The first part of a Data Science or Machine Learning project is (Exploratory) Data Analysis.

The data for a small project is usually a small csv file.

What do we look for?

Columns or rows with no values
NaN (not a number) values
Null values
Outliers

Missing values or NaN values

Deletion

Missing values, or NaN values, occur when data is not recorded. Rows or columns containing these values can be deleted, but this should be done carefully, as it can lead to loss of information.
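As a minimal sketch of deletion with pandas: the toy data frame below is hypothetical (it borrows the column names of the shoe sales file, but the gaps are invented for illustration). It counts the missing values first and then drops them by row or by column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two months have no recorded sales
toy = pd.DataFrame({
    "YearMonth": ["1980-01", "1980-02", "1980-03", "1980-04"],
    "Shoe_Sales": [85, np.nan, 109, np.nan],
})

# Count the missing values in each column before deciding what to delete
missing_per_column = toy.isna().sum()

# Drop every row that contains at least one missing value
rows_dropped = toy.dropna(axis=0)

# Drop every column that contains at least one missing value
cols_dropped = toy.dropna(axis=1)

print(missing_per_column)
print(rows_dropped.shape, cols_dropped.shape)
```

Dropping by column removes the whole Shoe_Sales column here, which shows how quickly deletion can throw away information.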

Imputation

The values can be replaced by Imputation, where missing values are replaced with estimated values based on other available data. This can be done using mean, median, mode, or more complex methods like regression or machine learning-based imputation.
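The simple imputation strategies mentioned above (mean, median, mode) could look like this in pandas. This is only a sketch on a hypothetical series with invented gaps, not the data from the project.

```python
import numpy as np
import pandas as pd

# Hypothetical sales values with two missing entries
sales = pd.Series([85, np.nan, 109, 95, np.nan, 95], name="Shoe_Sales")

# Replace missing values with the mean of the observed values
mean_filled = sales.fillna(sales.mean())

# Replace missing values with the median of the observed values
median_filled = sales.fillna(sales.median())

# Replace missing values with the most frequent observed value
# (mode() can return several values, so take the first one)
mode_filled = sales.fillna(sales.mode().iloc[0])

print(mean_filled.tolist())
print(median_filled.tolist())
print(mode_filled.tolist())
```

For a time series, filling from neighbouring points (for example with `Series.interpolate()`) may be a better fit than a single global value, but that depends on the data.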

Blank values

Identify: Determine if blanks represent actual missing data or if they have any specific meaning.
Clean or Impute: Depending on the context, you might choose to remove rows or impute values to replace blank entries.
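One way to handle blanks, sketched under the assumption that they arrive as empty or whitespace-only strings in a text column (the toy values below are hypothetical): flag them, convert them to NaN, and then apply the same deletion or imputation choices as for any other missing value.

```python
import pandas as pd

# Hypothetical raw data where sales were read in as text and some cells are blank
raw = pd.DataFrame({
    "YearMonth": ["1980-01", "1980-02", "1980-03", "1980-04"],
    "Shoe_Sales": ["85", "", "   ", "95"],
})

# Identify: a cell is blank if nothing is left after stripping whitespace
stripped = raw["Shoe_Sales"].str.strip()
blank_mask = stripped.eq("")

# Clean: turn blanks into NaN and convert the column to numbers
cleaned = raw.copy()
cleaned["Shoe_Sales"] = pd.to_numeric(stripped.mask(blank_mask))

# From here the blanks can be removed (or imputed) like any other missing value
without_blanks = cleaned.dropna(subset=["Shoe_Sales"])

print(blank_mask.sum(), "blank cells found")
print(without_blanks)
```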

Outliers

Visual Detection: Use graphical methods such as box plots, scatter plots, and histograms to identify outliers.
Statistical Methods: Calculate the z-score or IQR (Interquartile Range) to identify data points that deviate significantly from the mean or median.
Treatment: Depending on the nature of your data and analysis, you can choose to remove outliers, transform them using log or other functions, or handle them separately.
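The statistical checks above can be sketched as follows. The first seven values are the opening rows of the shoe sales file; the value 400 is a hypothetical outlier added for illustration. The 1.5 × IQR fences are the usual convention, while the z-score cut-off of 2 is an assumption chosen because the sample is tiny (3 is more common on larger data).

```python
import numpy as np
import pandas as pd

# First seven sales values from the file plus one hypothetical extreme value
sales = pd.Series([85, 89, 109, 95, 91, 95, 96, 400], name="Shoe_Sales")

# IQR method: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = sales[(sales < lower) | (sales > upper)]

# z-score method: flag points far from the mean in units of standard deviation
z_scores = (sales - sales.mean()) / sales.std()
z_outliers = sales[z_scores.abs() > 2]  # cut-off of 2 is an assumption

# One possible treatment: compress large values with a log transform
log_sales = np.log1p(sales)

print(iqr_outliers.tolist(), z_outliers.tolist())
```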

Starting off with a small project for forecasting Shoe Sales will help us understand the concept.

The project details can be obtained from the following link:

The csv is read as a data frame named df.

The file can be found at this link:

shoesales.csv (docs.google.com)

Preview of the file contents:

YearMonth,Shoe_Sales
1980-01,85
1980-02,89
1980-03,109
1980-04,95
1980-05,91
1980-06,95
1980-07,96
…

The first and last rows of the data frame can be checked with df.head() and df.tail(), and a summary of the data frame can be obtained with df.info() and df.describe().
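A minimal sketch of these checks is shown below. To keep it self-contained it reads the first seven rows of the file from an in-memory string; with the real file you would pass its path to `pd.read_csv` instead (the local path is not given in this article, so that part is left as an assumption).

```python
import io

import pandas as pd

# First seven rows of shoesales.csv, embedded so the example runs on its own
csv_text = """YearMonth,Shoe_Sales
1980-01,85
1980-02,89
1980-03,109
1980-04,95
1980-05,91
1980-06,95
1980-07,96
"""

# Read the CSV into a data frame named df
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())      # first rows
print(df.tail())      # last rows
df.info()             # column names, dtypes and non-null counts
print(df.describe())  # count, mean, std, min, quartiles and max
```

Note that `info` and `describe` are methods, so they need the parentheses to run.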

To visualise the data more easily, the YearMonth column is separated into Year and Month columns.
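One way to do this split, assuming YearMonth is stored as text such as `1980-01` and that `df1` is a working copy of `df` (the article does not show how `df1` was created, so that step is an assumption):

```python
import pandas as pd

# Small stand-in for df, using the first three rows of the file
df = pd.DataFrame({
    "YearMonth": ["1980-01", "1980-02", "1980-03"],
    "Shoe_Sales": [85, 89, 109],
})

# Assumption: df1 is a working copy, so the original df stays untouched
df1 = df.copy()

# Parse the 'YYYY-MM' text into real dates, then pull out the parts
df1["YearMonth"] = pd.to_datetime(df1["YearMonth"], format="%Y-%m")
df1["Year"] = df1["YearMonth"].dt.year
df1["Month"] = df1["YearMonth"].dt.month

print(df1)
```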

The first of our visualisations will give the Shoe Sales by month.

To compare the median and spread of sales in each month across the range of years, we can use box plots, as shown below.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Make sure Month is numeric (1 to 12); df1 is the data frame holding the Year and Month columns
df1['Month'] = pd.to_numeric(df1['Month'], errors='coerce')

month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Draw one box per calendar month, each summarising that month's sales across all years
plt.figure(figsize=(8, 6))
sns.boxplot(data=df1, x='Month', y='Shoe_Sales')
plt.xticks(ticks=range(12), labels=month_names)
plt.xlabel('Month')
plt.ylabel('Shoe Sales')
plt.title('Monthly Sales from years 1980 to 1997')
plt.show()
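A box plot shows the median and spread but not the totals. If the totals per month are wanted as numbers, a grouped summary is one option. The sketch below uses a small hypothetical data frame (the 1981 values are invented) with the same column names as `df1`.

```python
import pandas as pd

# Hypothetical sample: January and February for two years
toy = pd.DataFrame({
    "Year": [1980, 1980, 1981, 1981],
    "Month": [1, 2, 1, 2],
    "Shoe_Sales": [85, 89, 100, 120],
})

# Median and total sales for each calendar month across all years
monthly_summary = toy.groupby("Month")["Shoe_Sales"].agg(["median", "sum"])

print(monthly_summary)
```

The same `groupby` call should work unchanged on `df1` once it has a Month column.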

To see how the sales in each calendar month change over the years, we draw a line plot with one line per month.

df1['Year'] = df1['YearMonth'].dt.year
df1['Month'] = df1['YearMonth'].dt.month

# Create a time series plot for shoe sales by each month
plt.figure(figsize=(10, 6))
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
for month in range(1, 13):
    monthly_data = df1[df1['Month'] == month]
    plt.plot(monthly_data['Year'], monthly_data['Shoe_Sales'], label=month_names[month - 1])

plt.title('Shoe Sales by Each Month')
plt.xlabel('Year')
plt.ylabel('Shoe Sales')
plt.legend(title='Month', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()

The time series plot of all monthly sales over the range 1980 to 1997, along with the mean of the monthly means and the median of the monthly medians, is given by the following.

# Mean of the twelve monthly mean sales
mean_mean_sales_by_month = df1.groupby('Month')['Shoe_Sales'].mean().mean()

# Median of the twelve monthly median sales
median_median_sales_by_month = df1.groupby('Month')['Shoe_Sales'].median().median()

# Plot the full monthly series
plt.figure(figsize=(12, 6))
plt.plot(df1['YearMonth'], df1['Shoe_Sales'], label='Shoe Sales')

# Add the mean and median as horizontal reference lines
plt.axhline(mean_mean_sales_by_month, linestyle='dashed', color='black', label='Mean')
plt.axhline(median_median_sales_by_month, linestyle='dashed', color='red', label='Median')

plt.title('Shoe Sales by Month with Mean and Median of Total Sales')
plt.xlabel('Year-Month')
plt.ylabel('Shoe Sales')
plt.xticks(rotation=45)
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

That is all for the first part. In the second part I will show how to do the forecasting using various methods such as Naive Bayes. I realise there are many more experienced programmers out there, but you have to start somewhere.

Other plots will be needed, so I am hoping to get responses that help me add any missing visualisations. I also hope to get tips and feedback on how to write better code, and questions are welcome too.

Thank you for reading my little article. I look forward to your responses, as this is my first endeavor into Data Science/Machine Learning.