Forecasting Future Sales for Corporación Favorita Grocery: Unlocking the Potential of Predictive Analytics

Alidu Abubakari
Feb 26, 2023 · 34 min read


Corporación Favorita

Corporación Favorita is a large Ecuador-based grocery retailer that operates hundreds of supermarkets carrying thousands of products. Using this dataset, I build several models that predict the unit sales of the thousands of items sold at different Favorita stores. The training data includes dates, store and item information, and whether the item was being promoted.

INTRODUCTION

A key factor in a company's retail performance is the ability to properly estimate sales and manage inventories. The core problem is predicting the sales and inventory requirements for each location so as to avoid both overstocking and understocking, allowing the business to offer the best possible customer service while reducing losses and keeping each store sustainable.

In this project, I use several time series forecasting methods to forecast store sales for the Ecuadorian grocery retailer Corporación Favorita. The training data includes dates, stores, product details, whether the item was on promotion, and sales figures. Separate files provide further information that could be helpful in developing the models.

As usual, I always prefer to use a well-structured approach to understanding and analyzing the Corporación Favorita Grocery data. Therefore I will be using the CRoss Industry Standard Process for Data Mining (CRISP-DM) to help me systematically query the data and extract the right intelligence which can support business growth. CRISP-DM consists of six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.

The Data Science Process (CRISP-DM)

1.0 Business Understanding

The goal of this project is to build a model that accurately predicts the unit sales for items sold at different Favorita stores.

Img Source: Kaggle.com

A critical aspect of business understanding is developing hypotheses to guide the analysis. For this work, my hypotheses are as follows:

H0: There is no significant increase in sales when products are on promotion.

H1: There is a significant increase in sales when products are on promotion.

1.1 BUSINESS QUESTIONS

Business questions help organizations to identify areas of improvement, make better use of their resources, and increase their competitiveness. For the case of Corporación Favorita we are interested in the following questions.

  1. Is the train dataset complete (has all the required dates)?
  2. Which dates have the lowest and highest sales for each year?
  3. Did the earthquake impact sales?
  4. Are certain groups of stores selling more products? (Cluster, city, state, type)
  5. Are sales affected by promotions, oil prices and holidays?
  6. What analysis can we get from the date and its extractable features?
  7. What is the difference between RMSLE, RMSE, MSE (or why is the MAE greater than all of them?)

2.0 Data Understanding

The dataset consists of 6 different files.

  1. Train: The training data, comprising time series of features store_nbr, family, and onpromotion as well as the target sales.
  2. Test: The test data, having the same features as the training data.
  3. Transactions: Contains date, store_nbr and transaction made on that specific date.
  4. Stores: Store metadata, including city, state, type, and cluster.
  5. holidays events: Holidays and Events, with metadata.
  6. oil: Daily oil price which includes values during both the train and test data timeframes. (Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.)

3.0 Data Preparation

3.1 Data Importation

import pandas as pd

#load the data
Train_df = pd.read_csv('train.csv')
stores_df = pd.read_csv('stores.csv')
Transaction_df = pd.read_csv('transactions.csv')
oil_df = pd.read_csv('oil.csv')
submission_df = pd.read_csv('sample_submission.csv')
test_df = pd.read_csv('test.csv')
Holiday_df = pd.read_csv('holidays_events.csv')
#Merge all the datasets
#Use left joins to maintain data consistency (keep every row of the training data)

merged_df = pd.merge(Train_df, stores_df, on='store_nbr', how='left')
merged_df = pd.merge(merged_df,Transaction_df, on=['store_nbr','date'], how='left')
merged_df = pd.merge(merged_df,oil_df, on=['date'], how='left')
merged_df = pd.merge(merged_df,Holiday_df, on=['date'], how='left')
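A quick sanity check (not part of the original notebook, just a sketch) confirms the left joins preserved every training row and shows where they introduced missing values:

# The merged frame should have at least as many rows as Train_df
# (more only if a date matches several holiday records)
print(Train_df.shape, merged_df.shape)

# Columns brought in by the left joins may contain NaNs where no match was found
print(merged_df.isna().sum().sort_values(ascending=False).head(10))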

3.2 Data Cleaning:

This involves identifying and removing missing, duplicate or inconsistent data from the dataset.

3.2.1 Missing Values

#Fill missing oil prices using the backfill method (each gap takes the next available value)
oil_df = oil_df.bfill()
#Dealing with missing values 

#Filling the missing data using the Backfill method
merged_df["transactions"].fillna(method='bfill', inplace=True)

merged_df["dcoilwtico"].fillna(method='bfill', inplace=True)

# Replace missing values in the categorical holiday-related columns with their mode
for col in ['type_y', 'locale', 'description', 'locale_name', 'transferred']:
    mode = merged_df[col].mode()[0]
    merged_df[col].fillna(mode, inplace=True)

3.2.2 Duplicate Values

# Check each dataset for duplicate rows
(Train_df.duplicated().any(),
 stores_df.duplicated().any(),
 oil_df.duplicated().any(),
 test_df.duplicated().any(),
 Holiday_df.duplicated().any())
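If any of these checks returns True, the duplicated rows can be dropped before merging; a one-line sketch:

# Drop exact duplicate rows, keeping the first occurrence (the pandas default)
Train_df = Train_df.drop_duplicates()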

3.2.3 Inconsistent Data Values

#Convert date column to datetime attribute for each dataset with date column 

Train_df["date"] = pd.to_datetime(Train_df["date"], format='%Y-%m-%d')
Transaction_df["date"] = pd.to_datetime(Transaction_df["date"], format='%Y-%m-%d')
oil_df["date"] = pd.to_datetime(oil_df["date"], format='%Y-%m-%d')
Holiday_df["date"] = pd.to_datetime(Holiday_df["date"], format='%Y-%m-%d')

test_df["date"] = pd.to_datetime(test_df["date"], format='%Y-%m-%d')
#Create new attributes 'Year', 'Month' and 'Weekday_Name' from the date column
#(the merge result keeps a default RangeIndex, so we derive these from the column, not the index)
merged_df['Year'] = pd.DatetimeIndex(merged_df['date']).year
merged_df['Month'] = pd.DatetimeIndex(merged_df['date']).month_name()
merged_df['Weekday_Name'] = pd.DatetimeIndex(merged_df['date']).day_name()

3.3 Data Reduction:

This involves reducing the size of the dataset by removing variables or observations that are not relevant to the analysis.

#The full merged dataset is too large for some tools (seaborn, for example),
#so we draw a stratified sample and use it whenever memory or runtime becomes a problem.

frac = 0.1  # fraction of the data to keep
seed = 42   # for reproducibility

# derive the year from the date column (needed for the per-year grouping below)
merged_df['year'] = pd.DatetimeIndex(merged_df['date']).year

# calculate number of rows for each year
year_counts = merged_df.groupby('year').size().reset_index(name='counts')

# calculate the sample size for each year
year_samples = (year_counts['counts'] * frac).astype(int)

# create an empty list to store the samples
samples = []

# loop through each year and sample the required number of rows
for year, sample_size in zip(year_counts['year'], year_samples):
    year_df = merged_df[merged_df['year'] == year]
    sample = year_df.sample(n=sample_size, random_state=seed)
    samples.append(sample)

# concatenate the samples into a single DataFrame
sampled_df = pd.concat(samples)
sampled_df.shape
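As a design note, pandas can do essentially the same per-year fractional sampling in one chained call; a sketch:

# Equivalent one-liner: sample the same fraction within each year group
sampled_df = (merged_df.groupby('year', group_keys=False)
                       .apply(lambda g: g.sample(frac=frac, random_state=seed)))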

3.3.1 Final Dataset

3.4 Exploratory Data Analysis

3.4.1. Univariate Data Analysis

3.4.1.1 Analysis of the family attribute

3.4.1.2 Analysis of the City attribute

3.4.1.3 Analysis of the state attribute

3.4.1.4 Analysis of the Store type attribute

3.4.1.5 Analysis of the cluster attribute

3.4.1.6 Analysis of the transactions attribute

3.4.1.7 Analysis of the oil prices attribute

3.4.1.8 Analysis of the sales attribute

A skewness of 0.410 means that the distribution of the daily sales is slightly skewed to the right (positive skew). This indicates that there are more days with lower sales values and fewer days with higher sales values. A positive skewness means that the tail of the distribution is longer on the right side than on the left. In other words, there are a few days with very high sales that pull the mean to the right. This information can be useful for decision-making or forecasting, as it suggests that the sales values may not be normally distributed and may require additional data analysis or modeling techniques.
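To make this concrete, a quick check (a sketch, assuming the daily sales series is built from the merged data as below) compares the skewness before and after a log1p transform, a common remedy for positive skew:

import numpy as np

# Daily total sales (the column may be named 'day' instead of 'date' in later cells)
daily_sales = merged_df.groupby('date')['sales'].sum()
print(daily_sales.skew())            # positive skew, as discussed above
print(np.log1p(daily_sales).skew())  # usually much closer to symmetric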

3.4.2 Bivariate Data Analysis

3.4.2.1 Analysis of the Sales vs Transactions (Daily)

3.4.2.2 Analysis of the Total Sales by Day

3.4.2.3 Analysis of the Total transactions by Day

3.4.2.4 Analysis of the Percentage of Records with On Promotion by Day

3.4.2.5 Analysis of the total products on promotion over the years

3.4.2.6 Analysis of the Total On-Promotion Count by Store over the years

3.4.2.7 Analysis of On Promotion Count by Day over the years

3.4.2.8 Analysis of On Promotion Count by Day of the Month over the years

3.4.3 Multivariate Data Analysis

3.4.3.1 Correlation Analysis

3.4.3.2 Factor Analysis using correlation matrix

3.4.3.3 Outlier analysis for sales by city

4.0 Modeling

4.1 ANSWERING THE BUSINESS QUESTIONS

4.1.1 Is the train dataset complete (has all the required dates)?

#We first determine the first and last dates 
merged_df.day.min(), merged_df.day.max()
#Calculate the total expected number of days between the first and last days 
min_date = merged_df['day'].min()
max_date = merged_df['day'].max()
days_range = pd.date_range(start=min_date, end=max_date)
num_days = len(days_range)
num_days
#Actual number of days
merged_df['day'].nunique()
#Find the Missing dates
expected_dates = pd.date_range(start=merged_df['day'].min(), end=merged_df['day'].max())
missing_dates = expected_dates.difference(merged_df['day'])
missing_dates

From the result, the train dataset is missing four dates, and they follow a clear pattern: every year is missing December 25th. This date is a public holiday in Ecuador, so the analysis assumes no data was recorded because the stores were closed on Christmas Day each year.
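Optionally (a sketch), the missing Christmas dates can be re-inserted with zero sales so the daily series is continuous, consistent with the stores-closed assumption (this assumes the day column is of datetime type):

full_index = pd.date_range(merged_df['day'].min(), merged_df['day'].max(), freq='D')
daily_sales = merged_df.groupby('day')['sales'].sum().reindex(full_index, fill_value=0)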

4.1.2 Which dates have the lowest and highest sales for each year?

So it’s good to check the total sales per year to identify the highest and lowest year in terms of total sales.

The company recorded the highest sales in 2016, followed by 2015 and then 2014. The lowest sales were recorded in 2013, followed by 2017. However, the 2017 data only covers the period up to 2017/08/15 (roughly the first eight months), which likely explains its low ranking. So I focus on the highest (2016) and lowest (2013) years.

# Check the maximum sales per year 
# Extracting year from the 'day' column and adding it as a new column 'year' in merged_df
merged_df['year'] = pd.DatetimeIndex(merged_df['day']).year
# Grouping merged_df by year and day, and aggregating sales column to find the maximum value
date_data = merged_df.groupby(by=["year", "day"]).agg({"sales":"max"}).reset_index()
# Finding the row which has the maximum sales for each year
max_day = date_data.loc[date_data.groupby(['year'])['sales'].idxmax()]
# Printing the resulting max_day dataframe
print(max_day)
#Plot the day that recorded the maximum sales for each year  
# Add a new column 'day' to the merged_df dataframe, extracting just the date part from the existing day column
merged_df['day'] = pd.DatetimeIndex(merged_df['day']).date
# Group the merged_df dataframe by year and day, and get the maximum sales for each year/day combination
date_data = merged_df.groupby(by=["year", "day"]).agg({"sales":"max"}).reset_index()
# Get the maximum sales for each year, by finding the row with the maximum sales for each year
max_day = date_data.loc[date_data.groupby(['year'])['sales'].idxmax()]
# Use seaborn to create a bar plot, showing the maximum sales for each year, with the x-axis being the day and the y-axis being the sales
sns.barplot(x='day', y='sales', hue='year', data=max_day)
# Label the x-axis as "Day"
plt.xlabel("Day")
# Label the y-axis as "Sales"
plt.ylabel("Sales")
# Rotate the x-axis labels 90 degrees to make them more readable
plt.xticks(rotation=90)
# Show the plot
plt.show()
#Check the day with the lowest recorded sales for each year
merged_df['year'] = pd.DatetimeIndex(merged_df['day']).year
date_data = merged_df.groupby(by=["year", "day"]).agg({"sales":"min"}).reset_index()
min_day = date_data.loc[date_data.groupby(['year'])['sales'].idxmin()]
print(min_day)

4.1.3 Did the earthquake impact sales?

A magnitude 7.8 earthquake struck Ecuador on April 16, 2016. People rallied in relief efforts donating water and other first need products which greatly affected supermarket sales for several weeks after the earthquake.

Although the earthquake struck on April 16, sales rose sharply on April 17 and stayed elevated for several weeks, driven largely by purchases of water and other relief supplies. So the event did not hurt sales; if anything, it boosted them.
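A simple way to check this (a sketch, assuming the day column is of datetime type) is to compare average daily sales in the weeks before and after the earthquake:

quake = pd.Timestamp('2016-04-16')
before = merged_df[(merged_df['day'] >= quake - pd.Timedelta(days=28)) & (merged_df['day'] < quake)]
after  = merged_df[(merged_df['day'] >= quake) & (merged_df['day'] < quake + pd.Timedelta(days=28))]
print("Avg daily sales, 4 weeks before:", before.groupby('day')['sales'].sum().mean())
print("Avg daily sales, 4 weeks after: ", after.groupby('day')['sales'].sum().mean())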

4.1.4 Are certain groups of stores selling more products? (Cluster, city, state, type)

4.1.4.1 Total sales of Each Store

#Total sales of each store
x = merged_df.groupby(['store_nbr'], as_index=False).agg({'sales': 'sum'})

#Get the store with the highest sales
max_sales_store = x[x['sales'] == x['sales'].max()].iloc[0, 0]
max_sales = x[x['sales'] == x['sales'].max()].iloc[0, 1]

#Get the store with the minimum sales
min_sales_store = x[x['sales'] == x['sales'].min()].iloc[0, 0]
min_sales = x[x['sales'] == x['sales'].min()].iloc[0, 1]


print("Store with highest sales:", max_sales_store)
print("Corresponding sales value:", max_sales)
print("")
print("Store with lowest sales:", min_sales_store)
print("Corresponding sales value:", min_sales)

4.1.4.2 Total Sales per product Family

4.1.4.3 Total Sales per city

4.1.4.4 Total Sales per state

4.1.4.5 Total Sales per Store-Type

4.1.4.6 Total Sales per cluster

4.1.4.7 Top Products Sold per Store Number, State, City, Family and Cluster

#store_nbr
# Group data by store number, state, city, family and cluster, and calculate mean sales for each group
top_products = merged_df.groupby(['store_nbr','state','city','family', 'cluster']).agg({'sales':'mean'}).reset_index()

# Sort data by sales in descending order
top_products = top_products.sort_values(by='sales',ascending=False)

# Print top products
print(top_products.head())

4.1.4.8 Lowest Products Sold per State, City, Family and Cluster

# Group data by state, city, family and cluster, and calculate mean sales for each group
Low_products = merged_df.groupby(['state','city','family', 'cluster']).agg({'sales':'mean'}).reset_index()

# Sort data by sales in ascending order
Low_products = Low_products.sort_values(by='sales',ascending=True)

# Print lowest products
print(Low_products.head())

4.1.5 Are sales affected by promotions, oil prices and holidays?

4.1.5.1 Focus on Promotion

#create a new column 'onpromotion_encoded'
merged_df['onpromotion_encoded'] = np.where(merged_df['onpromotion'] > 1, 1, 0)

4.1.5.2 Correlation between sales and onpromotion

4.1.5.3 Average Sales by Product on/off promotion

4.1.5.4 Total Sales for Product on/off promotion

4.1.5.5 Year-by-Year Total Sales for Product on/off promotion

4.1.5.6 Number of Products on Promotion that Optimizes Sales

4.1.5.7 Focus on Oil Prices

4.1.5.7.1 Year-by-Year Changes in Oil Price

4.1.5.7.2 Correlation between Sales and Oil Price

#Determine the correlation between sales and the daily oil price (dcoilwtico)
correlation = sampled_df['sales'].corr(sampled_df['dcoilwtico'])
print(correlation)

4.1.5.8 Focus on Holidays

4.1.6 What analysis can we get from the date and its extractable features?

4.1.6.1 Sales Analysis with Time

4.1.6.1.1 Plotting the average sales per month.

4.1.6.1.2 Average Sales per Quarter

4.1.6.1.3 Average Sales by day of the Month

4.1.6.1.5 Moving average for Week, month and year

4.1.6.2 Transaction Analysis with Time

4.1.6.2.1 Transactions of stores from 2013 to 2017

4.1.6.2.2 Total Transactions of stores from 2013 to 2017

4.1.6.2.3 Average Transactions by day of the week

4.1.6.2.4 Average Transactions by day of the month

4.1.6.2.5 Average Transactions by day of the year

4.1.6.3 Oil Price Analysis with Time

4.1.6.3.1 Sales, Transactions and Oil prices over the years

4.1.6.4. Hypothesis Testing

from scipy.stats import ttest_ind

# Extract the sales of products when they are on promotion and when they are not
sales_on_promotion = merged_df[merged_df['onpromotion_encoded'] == 1]['sales']
sales_not_on_promotion = merged_df[merged_df['onpromotion_encoded'] == 0]['sales']

# Perform t-test
t_stat, p_value = ttest_ind(sales_on_promotion, sales_not_on_promotion)

# Print the t-statistic and p-value
print("t-statistic: ", t_stat)
print("p-value: ", p_value)

4.1.7 What is the difference between RMSLE, RMSE, MSE (or why is the MAE greater than all of them?)

RMSE (Root Mean Squared Error), MSE (Mean Squared Error), and RMSLE (Root Mean Squared Logarithmic Error) are all evaluation metrics used to measure the accuracy of a model’s predictions. Here’s the difference between them:

MSE: Mean Squared Error is the average of the squares of the differences between the predicted values and the actual values. It measures the average magnitude of the error. The larger the MSE, the larger the error, which implies that the model is not making good predictions.

RMSE: Root Mean Squared Error is the square root of the MSE. It provides a more interpretable measure of the error by taking the square root of the mean of the squared differences. This metric is more interpretable than MSE as it provides a measure of the error in the same units as the target variable.

RMSLE: Root Mean Squared Logarithmic Error is similar to RMSE, but it is calculated based on the logarithmic differences between the predicted and actual values. This metric is commonly used in regression problems when the target variable has a high range of values, as it penalizes under-predictions more heavily than over-predictions. The logarithmic transformation makes the difference between large numbers more significant, which is important in cases where the target variable has a wide range of values.

In conclusion, RMSE and RMSLE are used to measure the accuracy of the model’s predictions, with RMSE providing a measure of the magnitude of the error and RMSLE providing a measure of the logarithmic error between the predictions and actual values.
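To make the comparison concrete, here is a small sketch (toy numbers, not from the Favorita data) computing all four metrics side by side:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error, mean_absolute_error

y_true = np.array([10.0, 200.0, 3000.0, 45000.0])
y_pred = np.array([12.0, 180.0, 3500.0, 40000.0])

mse = mean_squared_error(y_true, y_pred)                   # average of squared errors
rmse = np.sqrt(mse)                                        # back in the same units as sales
mae = mean_absolute_error(y_true, y_pred)                  # average absolute error
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))    # error measured on the log(1 + x) scale

print(f"MSE:   {mse:,.2f}")
print(f"RMSE:  {rmse:,.2f}")
print(f"MAE:   {mae:,.2f}")
print(f"RMSLE: {rmsle:.4f}")  # much smaller, because the log compresses the large sales values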

4.2 MODELLING WITH TRADITIONAL MACHINE LEARNING MODELS

4.2.1 Feature Engineering:

This involves creating new variables or transforming existing variables to improve the performance of the machine learning model.

4.2.1.1 Create new attributes ‘Year’, ‘Month’ and ‘Weekday Name’

#Create new attributes 'Year', 'Month' and 'Weekday_Name' from the day column
merged_df['Year'] = pd.DatetimeIndex(merged_df['day']).year
merged_df['Month'] = pd.DatetimeIndex(merged_df['day']).month_name()
merged_df['Weekday_Name'] = pd.DatetimeIndex(merged_df['day']).day_name()

4.2.1.2 Change data attribute to datetime for consistency

#Change data attribute to datetime for consistency 
merged_df["day"] = pd.to_datetime(merged_df["day"], format='%Y-%m-%d')

#Change data attribute to datetime for consistency
merged_test_df["day"] = pd.to_datetime(merged_test_df["day"], format='%Y-%m-%d')

4.2.1.3 Extracting Date Features from Day Column in Merged DataFrame

merged_df['day'] = pd.to_datetime(merged_df['day'])
merged_df['week'] = merged_df['day'].dt.isocalendar().week
merged_df['quarter'] = merged_df['day'].dt.quarter
merged_df['month'] = merged_df['day'].dt.month
merged_df['weekday'] = merged_df['day'].dt.weekday
merged_df['Day'] = merged_df['day'].dt.day

4.2.1.4 Dropping Columns

merged_df = merged_df.drop(['Month', 'Weekday_Name', 'description'], axis=1)

merged_test_df = merged_test_df.drop(['Month', 'Weekday_Name'], axis=1)

4.2.1.5 Lag Features

I created Lag Features: These features involve using previous values of the target variable. By shifting the week column by 1 position and assigning the result to a new column week_lag_1, we effectively create a lagged variable that contains the week of the previous observation.

#Creating lags
#creating a new column called 'week_lag_1' in each dataframe,
#which contains the values of the 'week' column shifted by 1 row

merged_df['week_lag_1'] = merged_df['week'].shift(1)

merged_test_df['week_lag_1'] = merged_test_df['week'].shift(1)

#Creating a lead feature from the next row
#(note: despite the column name, this shifts 'week', not sales)
merged_df['next_day_sales'] = merged_df['week'].shift(-1)

merged_test_df['next_day_sales'] = merged_test_df['week'].shift(-1)
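As a side note, a true lag of the target itself is only possible on the training frame (the test frame has no sales column, which is why the cells above shift week instead). A hedged sketch with a hypothetical feature name:

# Lag of the target, computed within each store/family series (training frame only)
merged_df['sales_lag_1'] = merged_df.groupby(['store_nbr', 'family'])['sales'].shift(1)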

4.2.1.6 Moving Average Features

We also created moving average features, based on the moving average of the target variable. "Moving average" and "rolling average" are two names for the same concept: a statistical technique used to smooth out fluctuations in data over time.

For example, we create features like week_ma_7, week_ma_14 and week_ma_30 to capture longer-term trends in the variable.

for i in [7, 14, 30]:
    merged_df["week_ma_{}".format(i)] = merged_df["week"].rolling(i).mean()

for i in [7, 14, 30]:
    merged_test_df["week_ma_{}".format(i)] = merged_test_df["week"].rolling(i).mean()

4.2.1.7 The rolling method Features

The rolling method is being applied to the “week” column of the merged_df DataFrame with different window sizes (7, 14, and 30), which creates a rolling window object. The mean and std methods are applied to these window objects to calculate the rolling mean and standard deviation, respectively, for each window size.

merged_df['rolling_mean_7'] = merged_df['week'].rolling(7).mean().reset_index(level=0, drop=True)
merged_df['rolling_mean_14'] = merged_df['week'].rolling(14).mean().reset_index(level=0, drop=True)
merged_df['rolling_mean_30'] = merged_df['week'].rolling(30).mean().reset_index(level=0, drop=True)

merged_df['rolling_std_7'] = merged_df['week'].rolling(7).std().reset_index(level=0, drop=True)
merged_df['rolling_std_14'] = merged_df['week'].rolling(14).std().reset_index(level=0, drop=True)
merged_df['rolling_std_30'] = merged_df['week'].rolling(30).std().reset_index(level=0, drop=True)

4.2.1.8 The difference Features

The “.diff()” function is being used to calculate the difference between the current value and the previous value of the “week” column, where the number of periods to shift backward is specified as 1. This means that for each row in the dataframe, the value in the “diff_1” column will be the difference between the current week’s value and the week’s value from the previous row.

merged_df['diff_1'] = merged_df['week'].diff(periods=1)
merged_test_df['diff_1'] = merged_test_df['week'].diff(periods=1)

4.2.1.9 The percentage change Features

This calculates the percentage change between each element of the 'week' column and the previous element. pct_change() is a Pandas method that returns a new series (or dataframe column) containing, for each element, its percentage difference from the preceding element.

merged_df['pct_change'] = merged_df['week'].pct_change()
merged_test_df['pct_change'] = merged_test_df['week'].pct_change()

4.2.1.10 The Min and Max Values Features

The min() and max() methods return the smallest and largest values of the Day column. Because each result is a single number broadcast to every row, the Min and Max columns are constant, holding the minimum and maximum day-of-month values seen in the respective dataframe.

#Min and Max Values
merged_df['Min'] = merged_df['Day'].min()
merged_df['Max'] = merged_df['Day'].max()

#Min and Max Values
merged_test_df['Min'] = merged_test_df['Day'].min()
merged_test_df['Max'] = merged_test_df['Day'].max()

4.2.2 Features Encoding:

We create a binary encoding of the "onpromotion" variable, which is a count of promoted items in the product family on a given day: rows where the count is greater than one are flagged as 1 and all others as 0 (using > 0 instead would flag any promoted item).


#create a new column 'onpromotion_encoded'
merged_df['onpromotion_encoded'] = np.where(merged_df['onpromotion'] > 1, 1, 0)

4.2.2.2 Binning the Onpromotion

We apply binning to the onpromotion column. Binning is a data preprocessing technique used to transform continuous numerical data into categorical data.

# Determine the minimum and maximum values of the 'onpromotion' column
min_val = merged_df['onpromotion'].min()
max_val = merged_df['onpromotion'].max()

# Add more labels
bins = np.arange(min_val, max_val, (max_val - min_val) / 5)
labels = [f"bin_{i}" for i in range(len(bins) - 1)]

merged_df['Promotions'] = pd.cut(merged_df["onpromotion"], bins=bins, labels=labels)

If any missing values remain after binning, we replace them with the mode. This assumption is based on the fact that the onpromotion column is simply a count of the products on promotion.

# Find the mode of the 'Promotions' column
mode = merged_df['Promotions'].mode()[0]
#Lets replace the missing values
merged_df['Promotions'].fillna(mode, inplace=True)

4.2.2.3 Holiday Attribute Encoding

We implement a basic encoding of the Holiday type column to reduce the number of encoded outputs when we use one-hot encoder.

merged_df['Holiday_type'] = np.where(
    merged_df['Holiday_type'].isin(['Holiday', 'Additional', 'Event', 'Transfer', 'Bridge']),
    'Holiday', 'Workday')

If the value of ‘Holiday_type’ is one of the specified strings ‘Holiday’, ‘Additional’, ‘Event’, ‘Transfer’, or ‘Bridge’, then the new value of ‘Holiday_type’ will be ‘Holiday’. If it is not one of those strings, then the new value will be ‘Workday’.

4.2.2.4 Family Attribute Encoding

merged_df['family'].unique()

We decided to create broader categories based on data similarities. This will help reduce the dimension of the data and thereby prevent the curse of dimensionality.

FOOD: bread/bakery, celebration, dairy, deli, eggs, frozen foods, grocery i, grocery ii, meats, poultry, prepared foods, produce, seafood

BEVERAGES: beverages, liquor,wine,beer

HOME AND KITCHEN: home and kitchen i, home and kitchen ii, home appliances, home care

PERSONAL CARE: beauty, baby care, personal care

CLOTHING: ladieswear, lingerie

OTHERS: automotive, books, cleaning, hardware, lawn and garden, magazines, pet supplies, players and electronics, school and office supplies

#Implement the new super grouping of product family on the actual family attribute. 

merged_df['family'] = merged_df['family'].replace({
'AUTOMOTIVE': 'Others',
'BABY CARE': 'Personal Care',
'BEAUTY': 'Personal Care',
'BEVERAGES': 'Beverages',
'BOOKS': 'Others',
'BREAD/BAKERY': 'Food',
'CELEBRATION': 'Food',
'CLEANING': 'Others',
'DAIRY': 'Food',
'DELI': 'Food',
'EGGS': 'Food',
'FROZEN FOODS': 'Food',
'GROCERY I': 'Food',
'GROCERY II': 'Food',
'HARDWARE': 'Others',
'HOME AND KITCHEN I': 'Home and Kitchen',
'HOME AND KITCHEN II': 'Home and Kitchen',
'HOME APPLIANCES': 'Home and Kitchen',
'HOME CARE': 'Home and Kitchen',
'LADIESWEAR': 'Clothing',
'LAWN AND GARDEN': 'Others',
'LINGERIE': 'Clothing',
'LIQUOR,WINE,BEER': 'Beverages',
'MAGAZINES': 'Others',
'MEATS': 'Food',
'PERSONAL CARE': 'Personal Care',
'PET SUPPLIES': 'Others',
'PLAYERS AND ELECTRONICS': 'Others',
'POULTRY': 'Food',
'PREPARED FOODS': 'Food',
'PRODUCE': 'Food',
'SCHOOL AND OFFICE SUPPLIES': 'Others',
'SEAFOOD': 'Food'
})

4.2.2.5 OneHotEncoding

from sklearn.preprocessing import OneHotEncoder

# Fit and transform the 'family', 'Holiday_type', 'Promotions', 'Store_type' columns
features_to_encode = ['family', 'Holiday_type', 'Promotions', 'Store_type']

# Create an instance of the one-hot encoder
one_hot = OneHotEncoder()

# Fit and transform the data using the one-hot encoder
features_encoded = one_hot.fit_transform(merged_df[features_to_encode])

# Concatenate the encoded columns with the original data
# (newer scikit-learn versions use get_feature_names_out instead of get_feature_names)
merged_df_encoded = pd.concat([merged_df.reset_index(drop=True), pd.DataFrame(features_encoded.toarray(), columns=one_hot.get_feature_names(features_to_encode))], axis=1)

# Drop the original columns
merged_df_encoded = merged_df_encoded.drop(features_to_encode, axis=1)
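As a design note, plain pandas can produce the same one-hot columns and drop the originals in a single call; a sketch:

# Equivalent one-hot encoding with pandas
merged_df_encoded = pd.get_dummies(merged_df, columns=features_to_encode)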

Final dataset:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3054348 entries, 0 to 3054347
Data columns (total 37 columns):
# Column Dtype
--- ------ -----
0 day datetime64[ns]
1 store_nbr int64
2 sales float64
3 cluster int64
4 transactions float64
5 dcoilwtico float64
6 week UInt32
7 quarter int64
8 month int64
9 weekday int64
10 week_lag_1 UInt32
11 next_day_sales UInt32
12 week_ma_7 float64
13 week_ma_14 float64
14 week_ma_30 float64
15 diff_1 UInt32
16 pct_change Float64
17 Min int64
18 Max int64
19 onpromotion_encoded int32
20 family_Beverages float64
21 family_Clothing float64
22 family_Food float64
23 family_Home and Kitchen float64
24 family_Others float64
25 family_Personal Care float64
26 Holiday_type_Holiday float64
27 Holiday_type_Workday float64
28 Promotions_bin_0 float64
29 Promotions_bin_1 float64
30 Promotions_bin_2 float64
31 Promotions_bin_3 float64
32 Store_type_A float64
33 Store_type_B float64
34 Store_type_C float64
35 Store_type_D float64
36 Store_type_E float64
dtypes: Float64(1), UInt32(4), datetime64[ns](1), float64(23), int32(1), int64(7)
memory usage: 818.5 MB

4.2.3 Data Transformation:

This involves converting data into a format that is suitable for analysis. This may include normalizing or scaling the data.

4.2.3.1 Scaling the Dataset

Feature scaling is an essential step before modeling. The data used for modeling is typically collected through various means such as:

- Questionnaires
- Surveys
- Research
- Scraping, etc.

As a result, the features sit on very different dimensions and scales, and mixing scales can adversely affect many models and bias their predictions. It is therefore good practice to scale the data prior to modeling.

This is where standardization comes in.

Standardization is a scaling technique that makes the data scale-free by transforming each feature so that it has:

mean = 0
standard deviation = 1

#change the index label of the DataFrame from 'day' to 'date'

Final_data = Final_data.set_index('day')
Final_data.index.rename('date', inplace=True)

#Resample the data to daily frequency, taking the mean of each attribute

Final_data_sampled = Final_data.resample('D').mean()

from sklearn.preprocessing import StandardScaler

# Work on a copy of the resampled data (this step is implied by the later cells)
scaled_data = Final_data_sampled.copy()

# Fill any missing values with 0
scaled_data.fillna(0, inplace=True)

Scaling the categorical (one-hot encoded) columns is generally unnecessary, since scaling does not change the meaning of the categories, but it is also not harmful. Because some algorithms expect all input features on a comparable scale, we scale the numerical and encoded features together here.

# Create an instance of StandardScaler
scaler = StandardScaler()

# Apply StandardScaler to all columns except the first column
scaled_data[scaled_data.columns[1:]] = scaler.fit_transform(scaled_data[scaled_data.columns[1:]])

Filling missing values with 0 assumes that missing values are equivalent to zero or that they represent a lack of occurrence. It’s a common method of imputation used when it’s reasonable to assume that missing values have no influence on the analysis or model being built.

# Fill any missing values with 0
scaled_test_data.fillna(0, inplace=True)

# Create an instance of StandardScaler
# (note: fitting a separate scaler on the test data is a shortcut; reusing the scaler fitted
# on the training data is the usual practice and avoids inconsistent scales)
scaler = StandardScaler()

# Apply StandardScaler to all columns except the first column
scaled_test_data[scaled_test_data.columns[1:]] = scaler.fit_transform(scaled_test_data[scaled_test_data.columns[1:]])
#Create a dataframe for the scaled data and scaled test data

scaled_data_df = pd.DataFrame(scaled_data, columns=scaled_data.columns)

scaled_test_data_df = pd.DataFrame(scaled_test_data, columns=scaled_test_data.columns)

The abs() function computes the absolute value of each element in a DataFrame. Here it converts the negative values produced by standardization back into positive ones, most likely because the RMSLE metric used later relies on logarithms, which are only defined for non-negative values. Keeping everything positive therefore simplifies the downstream calculations.

scaled_data_df = abs(scaled_data_df)

scaled_test_data_df = abs(scaled_test_data_df)

4.2.3.2 Replace spaces in feature names with underscores.

We needed to replace spaces, slashes, and commas in the column names with underscores. This makes the column names easier for the libraries to handle; in particular, LightGBM does not accept special characters such as spaces in feature names, so they must be replaced or removed before training the model.

# Replace spaces in feature names with underscores
scaled_data_df.rename(columns=lambda x: x.replace(" ", "_"), inplace=True)
scaled_data_df.rename(columns=lambda x: x.replace("/", "_"), inplace=True)
scaled_data_df.rename(columns=lambda x: x.replace(",", "_"), inplace=True)

# Replace spaces in feature names with underscores
scaled_test_data_df.rename(columns=lambda x: x.replace(" ", "_"), inplace=True)
scaled_test_data_df.rename(columns=lambda x: x.replace("/", "_"), inplace=True)
scaled_test_data_df.rename(columns=lambda x: x.replace(",", "_"), inplace=True)

4.2.4 Data Splitting:

This involves dividing the dataset into training and validation sets. The training set is used to fit the machine learning models, while the validation set is used to evaluate their performance.

# Calculate the number of rows in the data
n_rows = scaled_data_df.shape[0]

# Calculate the split point (a chronological split: the first 95% of rows for training)
split_point = int(n_rows * 0.95)

# Select the first 95% of the rows as the training data
train_data = scaled_data_df.iloc[:split_point]

# Select the remaining rows as the validation data
validation_data = scaled_data_df.iloc[split_point:]

train_data2 = train_data.reset_index()
X_train = train_data2.drop(columns=["sales", "date"])
y_train = train_data2["sales"]

validation_data = validation_data.reset_index()
X_validation_test = validation_data.drop(columns=["sales", "date"])
y_validation_test = validation_data["sales"]

4.3 Machine learning models

4.3.1 LinearRegression: is a simple linear model for regression analysis that assumes a linear relationship between the dependent variable and the independent variable(s).

4.3.2 RandomForestRegressor: is a tree-based ensemble model that uses a large number of decision trees to make predictions and reduces the variance and overfitting associated with a single decision tree.

4.3.3 XGBRegressor: is an implementation of gradient boosting that uses decision trees as the weak learners and has become very popular due to its high accuracy and speed.

4.3.4 LGBMRegressor: is a gradient boosting framework that uses decision trees as the weak learners and is designed to be efficient and scalable.

4.3.5 CatBoost Regressor: is a gradient boosting framework that uses decision trees as the weak learners and incorporates categorical features in a natural way, without requiring one-hot encoding.

4.3.6 GradientBoosting Regressor: is a gradient boosting framework that uses decision trees as the weak learners and trains each tree in sequence to minimize the loss function.

4.3.7 DecisionTree Regressor: is a tree-based model that uses a tree structure to model the relationships between the independent variables and the dependent variable, making it particularly useful for modeling nonlinear relationships.
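Section 5 compares these models with RMSLE. The training code itself is not shown above, so here is a minimal sketch of the shared fit-and-score loop it implies (model settings and variable names are illustrative assumptions, not the article's exact configuration):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_log_error
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

models = {
    'LinearRegression': LinearRegression(),
    'RandomForestRegressor': RandomForestRegressor(random_state=0),
    'XGBRegressor': XGBRegressor(random_state=0),
    'LGBMRegressor': LGBMRegressor(random_state=0),
    'CatBoostRegressor': CatBoostRegressor(random_state=0, verbose=0),
    'GradientBoostingRegressor': GradientBoostingRegressor(random_state=0),
    'DecisionTreeRegressor': DecisionTreeRegressor(random_state=0),
}

rmsle_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # RMSLE needs non-negative values, so clip any negative predictions at zero
    y_pred = np.clip(model.predict(X_validation_test), 0, None)
    rmsle_scores[name] = np.sqrt(mean_squared_log_error(y_validation_test, y_pred))
    print(f"{name}: RMSLE = {rmsle_scores[name]:.3f}")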

5.0 Evaluation

For the evaluation, I am using the RMSLE. The Root Mean Squared Logarithmic Error (RMSLE) can be computed with a slight modification of sklearn's mean_squared_log_error function, which is itself a variation on the familiar Mean Squared Error (MSE) metric.

The formula for RMSLE is represented as follows:
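RMSLE = sqrt( (1/n) * Σᵢ ( log(ŷᵢ + 1) − log(yᵢ + 1) )² )

where ŷᵢ is the predicted value, yᵢ the actual value and n the number of observations. This is exactly what the evaluation code later computes as np.sqrt(np.mean((np.log(y_pred + 1) - np.log(y_true + 1))**2)).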

5.1 All Evaluation Results

5.1.1 Evaluation for Linear Regression Model

5.1.2 Evaluation for Random Forest Regressor Model

5.1.3 Evaluation for XG Boost Regressor Model

5.1.4 Evaluation for Light GBM Regressor Model

5.1.5 Evaluation for CatBoost Regressor Model

5.1.6 Evaluation for Decision Tree Regressor Model

5.1.7 Evaluation for Gradient Boosting Regressor Model

RMSLE is also robust to outliers. With RMSE, the presence of outliers can inflate the error term to a very high value, whereas with RMSLE the effect of outliers is drastically scaled down by the logarithm, largely neutralising them.

5.2 Machine Learning Results

# Create a list of model names
model_names = ['LinearRegression', 'RandomForestRegressor', 'XGBRegressor', 'LGBMRegressor', 'CatBoostRegressor', 'GradientBoostingRegressor', 'DecisionTreeRegressor']
# Create a list of RMSLE values
rmsle_values = [rmsle_lg, rmsle_RF, rmsle_xgb, rmsle_lgb, rmsle_cb, rmsle_gb, rmsle_dtm]
# Create a dictionary of model names and RMSLE values using the zip function
rmsle_dict = dict(zip(model_names, rmsle_values))
# Find the model with the least RMSLE value using the min function
best_model = min(rmsle_dict, key=rmsle_dict.get)
# Find the model with the biggest RMSLE value using the max function
worst_model = max(rmsle_dict, key=rmsle_dict.get)
sns.barplot(x=list(rmsle_dict.keys()), y=list(rmsle_dict.values()))
plt.xlabel("Models")
plt.ylabel("RMSLE")
plt.title("RMSLE of Different Models")
plt.xticks(rotation=90)
plt.show()
print("Best Model: ", best_model)
print("RMSLE Value: ", rmsle_dict[best_model])
print("")
print("Worst Model: ", worst_model)
print("RMSLE Value: ", rmsle_dict[worst_model])

A low RMSLE value indicates that the model is making accurate predictions, so a very low RMSLE is generally considered desirable. However, a model with an RMSLE that is too low could indicate overfitting, where the model has become too specialized to the training data and is not generalizing well to new, unseen data. In these cases, it may be better to have a slightly higher RMSLE, as this would indicate that the model is making more general predictions that are less likely to be affected by small variations in the training data.

5.3 Feature importance

Feature importance is a technique for understanding the relative importance of features in a model. It involves assigning a score to each feature based on its contribution to the overall performance of the model. The scores can then be used to identify which features are the most important for making accurate predictions and which features may be redundant or even detrimental to the model’s performance.

In the context of regression models, feature importance can be used to improve the RMSLE (root mean squared logarithmic error) by identifying which features have the most impact on the target variable and optimizing the model accordingly. By focusing on the most important features and eliminating or reducing the impact of less important ones, it is possible to improve the model's accuracy and reduce the RMSLE. Additionally, feature importance can help identify which features may be causing overfitting, so the model can be adjusted to improve its generalization performance.

5.3.1 Using the get_feature_importance method from the CatBoostRegressor

Here we use the get_feature_importance method from the CatBoostRegressor class to calculate the feature importances for the model trained on the input data. The feature importances are calculated using the “PredictionValuesChange” type, which indicates the contribution of each feature to the predictions made by the model.
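A minimal sketch of that call (cb_model and the other variable names are assumptions about the original notebook, where the fitted CatBoost model may be named differently):

# Feature importances from the fitted CatBoost model, using PredictionValuesChange
importances = cb_model.get_feature_importance(type='PredictionValuesChange')
cb_importance = (pd.DataFrame({'feature': X_train.columns, 'importance': importances})
                   .sort_values('importance', ascending=False))
print(cb_importance.head(10))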

5.3.2 Select the top features and train the model

# Make predictions on the validation set
y_val_pred = model_important.predict(X_validation_test)

# Calculate the mean squared error (MSE)
mse = mean_squared_error(y_validation_test, y_val_pred)
print("MSE: {:.3f}".format(mse))

# Calculate the root mean squared error (RMSE)
rmse = np.sqrt(mse)
print("RMSE: {:.3f}".format(rmse))


# Calculate the root mean squared logarithmic error (RMSLE)
rmsle_cb_3 = np.sqrt(np.mean((np.log(y_val_pred + 1) - np.log(y_validation_test + 1))**2))
print("RMSLE: {:.3f}".format(rmsle_cb_3))

5.4 Permutation importance

Permutation importance is a method to calculate the importance of individual features in a machine learning model. The idea behind permutation importance is to randomly shuffle the values of a single feature and observe the resulting decrease in the model performance. The decrease in the performance metric indicates the importance of the feature. The more the decrease in the performance, the more important the feature is considered to be.

Permutation importance can be used to improve RandomForestRegressor in the following ways:

Feature selection: By calculating permutation importance, you can identify the most important features for your prediction task. You can then use this information to select the most important features for training your RandomForestRegressor model, which can result in improved performance and reduced training time.

Model improvement: By using permutation importance, you can also identify features that are not contributing much to the model’s performance. By removing such features from the training data, you can improve the performance of the RandomForestRegressor model.
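A sketch of how the feature_importance table used in the next cell could be produced with scikit-learn's permutation_importance (the exact variable names are assumptions about the original notebook; reg is taken to be the RandomForestRegressor already fitted on X_train):

from sklearn.inspection import permutation_importance

# Score the drop in performance when each feature is shuffled on the validation split
result = permutation_importance(reg, X_validation_test, y_validation_test,
                                n_repeats=5, random_state=0, n_jobs=-1)
feature_importance = (pd.DataFrame({'feature': X_validation_test.columns,
                                    'importance': result.importances_mean})
                        .sort_values('importance', ascending=False))
print(feature_importance.head(10))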

5.4.1 Select the top n features to re-train the RandomForestRegressor

# Select the top n features to re-train the RandomForestRegressor in hopes of improving the model

n = 10 # number of top features
X_important = X_train[feature_importance['feature'].head(n)]

# Train the RandomForestRegressor on the selected features
reg_important = RandomForestRegressor(random_state=0)
reg_important.fit(X_important, y_train)

best_model_RFR = reg_important

# Make predictions on the validation set using the new model
y_pred = reg_important.predict(X_validation_test[feature_importance['feature'].head(n)])

# Calculate the RMSLE
rmsle_RF_opt = np.sqrt(mean_squared_log_error(y_validation_test, y_pred))
print("RMSLE: {:.3f}".format(rmsle_RF_opt))

5.5 Using Shapley Values for the feature importance

Shapley values are a feature importance measure used in machine learning to evaluate the contribution of each feature towards the model's prediction. They measure the change in the model's prediction when the feature is excluded or permuted randomly. The higher the Shapley value, the higher the contribution of the feature to the model's prediction.

To use Shapley values to evaluate feature importance, you calculate how the prediction changes when a feature is excluded or permuted and compare it to the prediction of the original model. You can compute the Shapley values for each feature and rank them to determine the most important features in your model.

By using Shapley values to evaluate feature importance, you gain a better understanding of how the features contribute to the model's prediction, and you can use this information to improve the performance of your regressor.
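The next cells rely on a shap_values object; a minimal sketch of how it can be computed with the shap library (the model name and sample size are illustrative assumptions):

import shap

# For an XGBoost model, shap.Explainer selects the tree explainer automatically
explainer = shap.Explainer(xgb_model)
# Compute SHAP values on a sample of the training data to keep the runtime manageable
shap_values = explainer(X_train.sample(10000, random_state=0))
# Visual ranking of feature impact
shap.plots.beeswarm(shap_values)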

5.5.1 Select the top n features and retrain the XGBRegressor

# Select the top n features
n = 10 # number of top features
X_important = X_train.iloc[:, np.argsort(np.abs(shap_values.values).mean(0))[-n:]]

xgc_important = XGBRegressor(random_state=0)
xgc_important.fit(X_important, y_train)

y_val_pred = xgc_important.predict(X_validation_test.loc[:, X_important.columns])

# Calculate the root mean squared logarithmic error (RMSLE)
rmsle_XGB_O = np.sqrt(np.mean((np.log(y_val_pred + 1) - np.log(y_validation_test + 1))**2))
print("RMSLE: {:.3f}".format(rmsle_XGB_O))

5.5.2 Using the get_score method from the XGBRegressor

The get_score method from the XGBRegressor returns a dictionary of feature importance scores based on the number of times each feature is split on across all trees in the model. These scores can be used to identify the most important features in the dataset and can provide insights into which features are most strongly associated with the target variable.

# Use XGBRegressor to obtain the feature importance scores

xgb = XGBRegressor(random_state=0)
xgb.fit(X_train, y_train)

# Get the feature importance scores
importance = xgb.get_booster().get_score(importance_type='weight')

# Convert the feature importance scores to a DataFrame and sort by score in descending order
importance_df = pd.DataFrame(list(importance.items()), columns=['feature', 'score'])
importance_df.sort_values(by='score', ascending=False, inplace=True)

# Select the top n features
n = 10 # specify the number of top features
top_features = importance_df['feature'].head(n)

# Train XGBRegressor on the top n features
xgb_top_features = XGBRegressor(random_state=0)
xgb_top_features.fit(X_train[top_features], y_train)

# Predict using the model trained on top n features
y_pred = xgb_top_features.predict(X_validation_test[top_features])

# Calculate the RMSLE
rmsle_XGB_O2 = np.sqrt(np.mean((np.log(y_pred + 1) - np.log(y_validation_test + 1))**2))
print("RMSLE: ", rmsle_XGB_O2)

5.6 Comparison for all models


print("Best Model: ", best_model)
print("RMSLE Value: ", rmsle_dict[best_model])
print("")
print("Worst Model: ", worst_model)
print("RMSLE Value: ", rmsle_dict[worst_model])

5.7 HYPERPARAMETER TUNING

Hyperparameter tuning can improve the RMSLE (Root Mean Squared Logarithmic Error) of a regression model by selecting the hyperparameters that result in the best performance on the validation set. By finding the optimal set of hyperparameters, the model can generalize better to new, unseen data, which is reflected in a lower RMSLE. Additionally, hyperparameter tuning can help prevent overfitting, which is a common problem in machine learning.

Since GradientBoostingRegressor and linear regressor did not perform too well, I decided to use hyperparameter tuning to improve the model.

5.7.1 GradientBoostingRegressor with hyperparameter tuning

#First step: hyperparameter search for GradientBoostingRegressor

from sklearn.model_selection import RandomizedSearchCV

# Define the hyperparameters to search over
# (in recent scikit-learn versions the loss names are 'squared_error', 'absolute_error', 'huber')
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [3, 4, 5],
    'min_samples_split': [3, 4, 5],
    'learning_rate': [0.01, 0.05, 0.1],
    'loss': ['ls', 'lad', 'huber']
}

# Initialize the GradientBoostingRegressor
reg = GradientBoostingRegressor()

# Initialize the RandomizedSearchCV
random_search = RandomizedSearchCV(reg, param_grid, n_iter=10, scoring='neg_mean_squared_log_error', cv=5)

# Fit the random search to the data
random_search.fit(X_train, y_train)

# Print the best parameters
print("Best parameters found: ", random_search.best_params_)

# Get the best estimator from the random search
best_reg = random_search.best_estimator_

# Predict on the test data using the best estimator
y_pred = best_reg.predict(X_validation_test)
# Calculate the RMSLE
rmsle_gbr_h = np.sqrt(mean_squared_log_error(y_validation_test, y_pred))
print("RMSLE: ", rmsle_gbr_h)

5.7.2 Compare GradientBoostingRegressor with hyperparameter tuning and without

percent_change = (rmsle_gb - rmsle_gbr_h)/rmsle_gb *100

percent_change

This shows that hyperparameter tuning significantly improved the RMSLE, by over 50%.

5.7.3 LinearRegression with hyperparameter tuning

# Second step is LinearRegression

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Define the hyperparameters to be tuned
# (note: the 'normalize' parameter was removed from LinearRegression in scikit-learn 1.2;
# with newer versions, scale the data beforehand instead)
param_grid = {'normalize': [True, False]}

# Initialize the linear regression model
reg = LinearRegression()

# Initialize the grid search object
grid_search = GridSearchCV(reg, param_grid, cv=100, scoring='neg_mean_squared_error')

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

# Print the best hyperparameters
print("Best hyperparameters: ", grid_search.best_params_)

# Predict on the validation test data
y_pred = grid_search.predict(X_validation_test)

# Calculate the RMSLE
rmsle_lg_hyp = np.sqrt(mean_squared_log_error(y_validation_test, y_pred))
print("RMSLE: ", rmsle_lg_hyp)

5.7.4 Linear Regression with Top 10 Features using Grid Search and Cross Validation

# Select the top 10 features based on permutation importance
top_10_features = feature_importance.head(10)['feature']
X_train_10 = X_train[top_10_features]
X_validation_test_10 = X_validation_test[top_10_features]

# Train a Linear Regression model on the top 10 features
lin_reg = LinearRegression()

# Define the parameter grid for GridSearchCV
param_grid = {'normalize': [True, False]}

# Perform Grid Search with Cross Validation
grid_search = GridSearchCV(lin_reg, param_grid, cv=100, scoring='neg_mean_squared_error')
grid_search.fit(X_train_10, y_train)

# Print the best parameters and the best score
print("Best parameters: ", grid_search.best_params_)
print("Best score: ", grid_search.best_score_)

# Make predictions on the validation set
y_pred = grid_search.predict(X_validation_test_10)

# Calculate the mean squared error on the validation set
mse = mean_squared_error(y_validation_test, y_pred)
print("Mean Squared Error on the validation set: ", mse)

# Calculate the RMSLE
rmsle_lg_top = np.sqrt(mean_squared_log_error(y_validation_test, y_pred))
print("RMSLE: ", rmsle_lg_top)

5.8 Compare Linear Regressor with hyperparameter tuning and without

print("Best Model: ", best_model)
print("RMSLE Value: ", rmsle_dict[best_model])
print("")
print("Worst Model: ", worst_model)
print("RMSLE Value: ", rmsle_dict[worst_model])

As can be observed, there was no significant improvement in the RMSLE metric for linear regression, even with hyperparameter tuning and importance-based feature selection.

5.9 MODELLING WITH STATSMODELS (ARIMAX/SARIMAX)

5.9.1 Checking for stationarity

#stationarity test (Augmented Dickey-Fuller)
from statsmodels.tsa.stattools import adfuller

new_df_sampled = new_df_sampled.dropna()
result = adfuller(new_df_sampled)
test_statistic = result[0]
critical_values = result[4]
if test_statistic < critical_values['5%']:
    print('The time series is stationary.')
else:
    print('The time series is non-stationary.')

#differencing, then re-test
new_df_sampled2 = new_df_sampled.diff().dropna()
result = adfuller(new_df_sampled2)
test_statistic = result[0]
critical_values = result[4]
if test_statistic < critical_values['5%']:
    print('The time series is stationary.')
else:
    print('The time series is non-stationary.')
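The next two subsections report the ARIMAX and SARIMAX results. A minimal sketch of the kind of SARIMAX fit involved (the endog/exog choices and the orders are illustrative assumptions, not the tuned values behind the reported results):

from statsmodels.tsa.statespace.sarimax import SARIMAX

endog = Final_data_sampled['sales']                              # assumed daily sales series
exog = Final_data_sampled[['dcoilwtico', 'onpromotion_encoded']] # assumed exogenous regressors

model = SARIMAX(endog, exog=exog, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7))
results = model.fit(disp=False)
print(results.summary())

# Real forecasts need future values of the exogenous series; the last rows are reused
# here purely to illustrate the call signature.
forecast = results.forecast(steps=16, exog=exog.tail(16))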

5.9.2 ARIMAX RESULTS

5.9.3 SARIMAX RESULTS

6.0 Deployment

import joblib
# Save the selected-feature XGBRegressor model to a file using joblib
filename = 'xgb_model_selected_features.joblib'
joblib.dump(xgb_top_features, filename)
# Load the saved XGBoost model from file
loaded_model = joblib.load('xgb_model_selected_features.joblib')

# Load the test data from file
test_data = Final_test_data_

# Select the same top 10 features as used to train the model

n = 10 # number of top features

X_test_important = test_data[importance_df['feature'].head(n)]

# Make predictions on the test data using the loaded model
predictions = loaded_model.predict(X_test_important)

# Display the predictions
print(predictions)

7.0 Conclusion

In conclusion, this sales forecasting project aimed to predict future sales for a retail store chain based on historical data. Through exploratory data analysis, feature engineering, and the application of machine learning models, we were able to build an accurate sales forecasting model.

We used a combination of regression and time-series models, and experimented with different parameters and feature selections to improve the accuracy of our model. Through evaluation of the model’s performance on the test set, we can confidently say that it will help the retail chain make better-informed decisions on inventory management and sales strategies.

Overall, this project highlights the importance of data-driven decision making in the retail industry. By leveraging historical sales data and applying advanced analytics techniques, retailers can better understand consumer behavior and market trends, ultimately driving increased sales and profitability.

Based on my analysis, here are some major conclusions and lessons that Corporación Favorita can adopt to remain competitive:

  1. Understanding customer behavior: By analyzing customer purchase behavior, the grocery shop can better understand their needs and preferences, and tailor their product offerings to better meet their customers’ demands.
  2. Product assortment: Our analysis showed that certain product categories such as dairy, frozen foods, and grocery items have a higher impact on sales. The grocery shop should consider optimizing their product assortment in these categories and ensure that they have a good mix of products that cater to the needs of their customers.
  3. Promotions and discounts: Our analysis showed that promotions and discounts can significantly impact sales. The grocery shop should consider implementing targeted promotions and discounts, based on customer behavior and preferences, to drive sales and customer loyalty.
  4. Inventory management: Our sales forecasting model can help the grocery shop with inventory management by predicting future sales and demand. By optimizing their inventory levels, they can ensure that they have the right products in stock to meet customer demand, while minimizing waste and reducing costs.
  5. Data-driven decision making: This project highlights the importance of data-driven decision making in the retail industry. By leveraging historical sales data and applying advanced analytics techniques, retailers can gain valuable insights into their business operations and make informed decisions to improve their sales and profitability. The grocery shop should continue to invest in data analytics and machine learning to drive business success.
