How to use historical markdown data to predict store sales!

Shritam Kumar Mund
Nov 29, 2019 · 7 min read

A time series is a sequence of historical measurements of an observable variable at equal time intervals. Time series data can be looked at as sequential data.

Time series are studied for several purposes such as the forecasting of the future based on knowledge of the past, the understanding of the phenomenon underlying the measures, or simply a succinct description of the salient features of the series.

In this Kaggle recruitment competition, the task is to use historical markdown data to predict the next year’s sales.

Problem Statement:

We are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and we are tasked with predicting the department-wide sales for each store.

Business Objectives and Constraints:

  • Predict the department-wide sales for each store.
  • No strict latency constraints.

Data Overview:

The data is taken from the Walmart Recruiting challenge on Kaggle. It consists of 4 datasets:

| stores.csv

This file contains anonymized information about the 45 stores, indicating the type and size of the store.

  • Store: the store number
  • Types: the type of the store
  • Size: the size of the store

| train.csv

This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:

  • Store: the store number
  • Dept: the department number
  • Date : the dates of sales
  • Weekly_Sales : sales for the given department in the given store
  • IsHoliday : whether the week is a special holiday week

| test.csv

This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.

| features.csv

This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

  • Store: the store number
  • Date: the week
  • Temperature: the average temperature in the region
  • Fuel_Price: the cost of fuel in the region
  • MarkDown1-5: anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011 and is not available for all stores all the time. Any missing value is marked with an NA.
  • CPI: the consumer price index
  • Unemployment: the unemployment rate
  • IsHoliday: whether the week is a special holiday week

Exploratory Data Analysis:

1. Pie chart of the store types:

2. Box plot of store sizes by store type:

3. Pair plot:

Observations:

  • There are 45 stores in total.
  • There are a total of 3 types of stores: Type A, B, and C.
  • From the box plot and pie chart, we can say that Type A stores are the largest and Type C stores are the smallest.
  • In the box plot, the size ranges of types A, B, and C do not overlap.
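The observations above can be checked numerically as well as visually. Here is a minimal sketch, assuming a stores table with `Type` and `Size` columns; the rows are made-up toy values, not the real stores.csv.

```python
import pandas as pd

# Toy stand-in for stores.csv; the values are invented for illustration.
stores = pd.DataFrame({
    "Store": [1, 2, 3, 4, 5, 6],
    "Type": ["A", "A", "B", "B", "C", "C"],
    "Size": [200000, 150000, 120000, 100000, 42000, 39000],
})

# Share of each store type (the numbers behind a pie chart).
type_share = stores["Type"].value_counts(normalize=True).sort_index()

# Size summary per type (the numbers behind a box plot).
size_range = stores.groupby("Type")["Size"].agg(["min", "median", "max"])

# Size ranges are disjoint if, after sorting types by their minimum size,
# each type's maximum is below the next type's minimum.
ordered = size_range.sort_values("min")
no_overlap = bool((ordered["max"].values[:-1] < ordered["min"].values[1:]).all())

print(type_share)
print(size_range)
print("Size ranges do not overlap:", no_overlap)
```

On the real data, `no_overlap` is a stricter test than comparing the boxes of a box plot, because it also takes outliers into account.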

4. Check sales in holiday and non-holiday weeks:

5. Understand sales by department:

Observations:

  • Sales in holiday weeks are slightly higher than sales in non-holiday weeks.
  • From this plot, we notice that the department with the highest sales lies between Dept 60 and Dept 80.
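Both observations come down to simple group-wise averages. A small sketch on invented rows, using the train.csv column names listed earlier:

```python
import pandas as pd

# Toy stand-in for train.csv; the sales figures are invented.
train = pd.DataFrame({
    "Store": [1, 1, 1, 1, 2, 2],
    "Dept": [1, 1, 72, 72, 1, 72],
    "Weekly_Sales": [100.0, 120.0, 900.0, 1100.0, 80.0, 700.0],
    "IsHoliday": [False, True, False, True, False, False],
})

# Average weekly sales in holiday versus non-holiday weeks.
holiday_mean = train.groupby("IsHoliday")["Weekly_Sales"].mean()

# Average weekly sales per department, highest first.
dept_mean = train.groupby("Dept")["Weekly_Sales"].mean().sort_values(ascending=False)
top_dept = dept_mean.index[0]

print(holiday_mean)
print("Department with the highest average sales:", top_dept)
```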

In total, the competition provides 421,570 rows for training and 115,064 for testing. We will evaluate our models only on the 421,570 training rows, because those are the rows with labels against which performance and accuracy can be measured.

Feature Engineering:

As we have dates in our dataset, we can build some useful DateTime features using pandas.
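As a rough sketch of what such features can look like, the snippet below derives Year, Month, Day, Week, and Day_of_week from a `Date` column. The three dates are only sample inputs, and `dt.isocalendar()` assumes pandas 1.1 or newer.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2010-02-05", "2011-11-25", "2012-12-28"]})
df["Date"] = pd.to_datetime(df["Date"])

# Calendar parts extracted through the .dt accessor.
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month          # January=1, December=12
df["Day"] = df["Date"].dt.day
df["Week"] = df["Date"].dt.isocalendar().week.astype(int)  # ISO week of the year
df["Day_of_week"] = df["Date"].dt.dayofweek  # Monday=0, Sunday=6

print(df)
```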

We can also use the mean of Temperature and Unemployment as features.
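The text does not say how the mean is grouped, so the following is only one possible reading: a per-store mean, broadcast back to every row with `groupby(...).transform`. The grouping key and the toy values are assumptions.

```python
import pandas as pd

# Toy stand-in for features.csv; values are invented.
features = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Temperature": [40.0, 60.0, 70.0, 90.0],
    "Unemployment": [8.0, 7.0, 6.0, 6.5],
})

# Assumption: the mean is taken per store and repeated on each of its rows.
features["Temp_mean"] = features.groupby("Store")["Temperature"].transform("mean")
features["Unemployment_mean"] = features.groupby("Store")["Unemployment"].transform("mean")

print(features)
```

The same pattern would extend to other columns such as Fuel_Price or CPI.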

Next, we merge the train and test sets with the stores and features datasets, so that all the features sit in a single data frame.

As the “IsHoliday” feature appears in more than one of the datasets, it is duplicated after the merge. So let's fix that by removing one of the copies and renaming the other back to the original “IsHoliday” column name.
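A minimal sketch of the merge and the clean-up, on toy frames with invented values. It assumes the join keys are `Store` and `Date` for the features and `Store` for the store metadata; pandas then suffixes the clashing column as `IsHoliday_x` and `IsHoliday_y`.

```python
import pandas as pd

# Toy stand-ins for train.csv, features.csv and stores.csv (invented values).
train = pd.DataFrame({
    "Store": [1, 1], "Dept": [1, 1],
    "Date": ["2010-02-05", "2010-02-12"],
    "Weekly_Sales": [100.0, 150.0],
    "IsHoliday": [False, True],
})
features = pd.DataFrame({
    "Store": [1, 1],
    "Date": ["2010-02-05", "2010-02-12"],
    "Temperature": [42.0, 38.5],
    "IsHoliday": [False, True],
})
stores = pd.DataFrame({"Store": [1], "Type": ["A"], "Size": [150000]})

# Left joins keep every training row even if a feature row is missing.
merged = train.merge(features, on=["Store", "Date"], how="left")
merged = merged.merge(stores, on="Store", how="left")

# Keep one copy of the duplicated column and restore its original name.
merged = merged.drop(columns="IsHoliday_y").rename(columns={"IsHoliday_x": "IsHoliday"})

print(merged.columns.tolist())
```

The test set can be merged in the same way.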

The IsHoliday column should be represented as numeric values. So, let's map ‘False’ to 0 and ‘True’ to 1.

Similarly, let's convert ‘Types’ of the store to numeric values.
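Both conversions are one-liners in pandas. This sketch uses the A = 1, B = 2, C = 3 mapping from the final feature list; the column name `Type` is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"IsHoliday": [False, True, False], "Type": ["A", "C", "B"]})

# Booleans cast directly to 0/1.
df["IsHoliday"] = df["IsHoliday"].astype(int)

# Store types mapped to integer codes.
df["Type"] = df["Type"].map({"A": 1, "B": 2, "C": 3})

print(df)
```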

Now let’s check the correlations between features:

Correlation is a bivariate analysis that measures the strength of association between two variables and the direction of the relationship. In terms of the strength of the relationship, the value of the correlation coefficient varies between +1 and -1.
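To make the +1 and -1 extremes concrete, here is a tiny example with `DataFrame.corr`, which computes Pearson correlation by default. The columns are invented so that one pair is perfectly positively related and another perfectly negatively related.

```python
import pandas as pd

df = pd.DataFrame({
    "Size": [1, 2, 3, 4],
    "Weekly_Sales": [10.0, 20.0, 30.0, 40.0],   # rises with Size
    "Unemployment": [8.0, 6.0, 4.0, 2.0],       # falls as Size rises
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()

print(corr)
```

A matrix like this can then be passed to a heatmap plot for easier reading.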

Fixing Missing Values:

Missing values occur in columns such as the ‘MarkDown’ features. We impute them with zero (no markdown), so here we simply fill all missing values with zero.
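In pandas this is a single `fillna` call. A short sketch on an invented frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Store": [1, 1, 2],
    "MarkDown1": [np.nan, 500.0, np.nan],
    "MarkDown2": [np.nan, np.nan, 20.0],
})

# Count missing cells before imputing, as a sanity check.
missing_before = int(df.isna().sum().sum())

# Treat a missing markdown as "no markdown".
df = df.fillna(0)
missing_after = int(df.isna().sum().sum())

print(missing_before, missing_after)
```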

After testing a few submissions, I found that the mean features improve the score.

Feature Selection:

After a few submissions, I found that the Markdown features do not help much to improve the score. So, I had to drop them.

And as we have already taken the mean of CPI, Unemployment, and Fuel_Price, we can drop the original columns as well. As we saw in the correlation graph, Day_of_week is not stable, so we also need to drop this feature.
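Dropping these columns could look like the sketch below. The exact column names (`MarkDown1` to `MarkDown5`, `Day_of_week`) are assumptions based on the field lists above, and the single row is invented.

```python
import pandas as pd

df = pd.DataFrame([{
    "Store": 1, "Dept": 1, "Size": 150000, "Temperature": 42.0,
    "MarkDown1": 0.0, "MarkDown2": 0.0, "MarkDown3": 0.0,
    "MarkDown4": 0.0, "MarkDown5": 0.0,
    "CPI": 211.0, "Unemployment": 8.1, "Fuel_Price": 2.57,
    "CPI_mean": 215.0, "Unemployment_mean": 7.6, "Fuel_Price_mean": 3.2,
    "Day_of_week": 4,
}])

# Columns judged unhelpful: markdowns, raw columns replaced by their means,
# and the day-of-week feature.
drop_cols = [f"MarkDown{i}" for i in range(1, 6)]
drop_cols += ["CPI", "Unemployment", "Fuel_Price", "Day_of_week"]

df = df.drop(columns=drop_cols)

print(df.columns.tolist())
```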

Now we are completely ready to define our final Train and Test data to train our model.

Train set:

Test set:

Machine Learning Models:

Model to Predict the Next Year’s Sales

Final features that we are using to train our model are as follows:

  • Store: the store number
  • Dept: the department number
  • Week: the week ordinal of the year
  • Month: the month, with January=1 and December=12
  • Year: the year of the DateTime
  • Day: the day of the DateTime
  • Temperature: the average temperature in the region
  • IsHoliday: 1 if the week is a holiday week, else 0
  • Size: the size of the store
  • Types: the type of store, A = 1, B = 2, C = 3
  • Temp_mean: mean value of Temperature
  • Unemployment_mean: mean value of Unemployment
  • Fuel_Price_mean: mean value of the cost of fuel in the region
  • CPI_mean: mean value of CPI

Because the dimensions of the final dataset are not too large, bagged decision tree ensembles such as Random Forest and Extra Trees can be used to estimate the importance of features.

Prediction using our Random Forest model:
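Here is a minimal sketch of fitting a Random Forest, reading its feature importances, and predicting, using scikit-learn's `RandomForestRegressor`. The data is synthetic, only a subset of the final features is used, and the hyperparameters (`n_estimators=50`) are placeholders rather than the settings behind the score reported in this post.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the prepared train set.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Store": rng.integers(1, 46, n),
    "Dept": rng.integers(1, 100, n),
    "Week": rng.integers(1, 53, n),
    "Size": rng.integers(30000, 220000, n),
    "IsHoliday": rng.integers(0, 2, n),
})
# Invented target that depends mostly on Size and Dept.
y = 0.1 * X["Size"] + 50 * X["Dept"] + rng.normal(0, 100, n)

# Simple hold-out split: first 150 rows to fit, last 50 to predict.
X_train, X_test = X.iloc[:150], X.iloc[150:]
y_train = y.iloc[:150]

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Impurity-based importances, one value per feature, summing to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

predictions = model.predict(X_test)

print(importances)
```

Swapping in `ExtraTreesRegressor` from the same module gives the Extra Trees variant with the same interface.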

Finally, the weekly sales prediction CSV file is generated.
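Generating the file could look roughly like this. It assumes the submission expects an `Id` of the form Store_Dept_Date plus a `Weekly_Sales` column, as in the competition's sample submission; the predictions are placeholders, and the CSV is written to an in-memory buffer so the sketch runs without touching disk.

```python
import io
import pandas as pd

# Toy stand-in for the test set and the model's predictions.
test = pd.DataFrame({
    "Store": [1, 1],
    "Dept": [1, 2],
    "Date": ["2012-11-02", "2012-11-02"],
})
predictions = [20000.0, 45000.5]

# Build the Id from the store, department and date triplet.
submission = pd.DataFrame({
    "Id": test["Store"].astype(str) + "_" + test["Dept"].astype(str) + "_" + test["Date"].astype(str),
    "Weekly_Sales": predictions,
})

# Write without the index column; pass a file path instead of the buffer to save it.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```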

Let’s upload our predicted CSV:

TADA!!!

  • This is the complete story of my work on this Kaggle competition, including what I learned and applied along the way.
  • When I uploaded the predicted sales file to Kaggle, I got a score of 2762.09, which is close to rank 41!

Future work:

Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks.

Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
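The five-times weighting of holiday weeks can be made concrete with a weighted mean absolute error. This is a sketch of how I understand the evaluation, with weight 5 for holiday weeks and 1 otherwise; the function name `wmae` and the sample numbers are my own.

```python
import numpy as np

def wmae(y_true, y_pred, is_holiday, holiday_weight=5.0):
    """Weighted mean absolute error with heavier weights on holiday weeks."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Weight is holiday_weight for holiday weeks and 1 for all other weeks.
    weights = np.where(np.asarray(is_holiday, dtype=bool), holiday_weight, 1.0)
    return float(np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights))

# Errors are 10, 20 and 0; the middle week is a holiday week.
score = wmae([100, 200, 300], [110, 180, 300], [False, True, False])
print(score)  # (1*10 + 5*20 + 1*0) / (1 + 5 + 1) = 110 / 7
```

An error in a holiday week costs five times as much as the same error in an ordinary week, which is why holiday-aware features are worth the effort.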

To get a better score, we can use this information to build holiday-specific features. Given the extra weight on holiday weeks, I expect this to improve the score.
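One simple way to start is a 0/1 flag per holiday, built from the dates listed above. The flag names (`IsSuperBowl` and so on) and the sample rows are hypothetical; this is a sketch, not something I have submitted.

```python
import pandas as pd

# Holiday week dates as listed in the competition description.
HOLIDAYS = {
    "SuperBowl": ["2010-02-12", "2011-02-11", "2012-02-10", "2013-02-08"],
    "LaborDay": ["2010-09-10", "2011-09-09", "2012-09-07", "2013-09-06"],
    "Thanksgiving": ["2010-11-26", "2011-11-25", "2012-11-23", "2013-11-29"],
    "Christmas": ["2010-12-31", "2011-12-30", "2012-12-28", "2013-12-27"],
}

df = pd.DataFrame({
    "Date": pd.to_datetime(["2010-02-12", "2010-02-19", "2011-11-25", "2012-12-28"]),
})

# One indicator column per holiday: 1 if the row's week is that holiday week.
for name, dates in HOLIDAYS.items():
    df["Is" + name] = df["Date"].isin(pd.to_datetime(dates)).astype(int)

print(df)
```

From here, features such as the number of weeks until the next holiday could be derived in a similar way.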



If you’ve got something on your mind you think this article is missing, leave a response below.

I hope this has helped you better understand the Walmart Recruiting Store Sales Forecasting problem and that, if you are interested, it helps you compete in a Kaggle data science competition. You can see the current active competitions at kaggle.com!

Thanks for reading so far.

I will see you in my next post!!!

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Written by Shritam Kumar Mund

Computer science engineer and aspiring data scientist. Visit my website: www.ishritam.ml.
