# How to use historical markdown data to predict store sales!

A time series is a sequence of historical measurements of an observable variable taken at equal time intervals; in other words, time series data is sequential data.

Time series are studied for several purposes such as the forecasting of the future based on knowledge of the past, the understanding of the phenomenon underlying the measures, or simply a succinct description of the salient features of the series.

As this is a recruitment competition on Kaggle, our task is to use historical markdown data to predict the next year’s sales.

**Problem Statement:**

We are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and we are tasked with predicting the department-wide sales for each store.

# Business Objectives and Constraints:

- Predict the department-wide sales for each store.
- No strict latency constraints.

# Data Overview:

The data has been taken from the Walmart Recruiting challenge on Kaggle and consists of a total of 4 datasets:

*| stores.csv*

This file contains anonymized information about the 45 stores, indicating the type and size of the store.

- Store: the store number
- Type: the type of the store
- Size: the size of the store

*| train.csv*

This is the historical training data, which covers 2010–02–05 to 2012–11–01. Within this file you will find the following fields:

- Store: the store number
- Dept: the department number
- Date: the dates of sales
- Weekly_Sales: sales for the given department in the given store
- IsHoliday : whether the week is a special holiday week

*| test.csv*

This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.

*| features.csv*

This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:

- Store — the store number
- Date — the week
- Temperature — the average temperature in the region
- Fuel_Price — the cost of fuel in the region
- MarkDown1–5 — anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011 and is not available for all stores all the time. Any missing value is marked with an NA.
- CPI — the consumer price index
- Unemployment — the unemployment rate
- IsHoliday — whether the week is a special holiday week

# Exploratory Data Analysis:

1. Pie chart for the visual representation of store types:

2. Boxplot of store sizes by store type:

3. Pair plot:
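The plots above can be sketched roughly as follows. This is a minimal sketch, not the author's exact notebook: the tiny inline `stores` DataFrame stands in for the real `stores.csv`, and its values are hypothetical.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Tiny inline sample standing in for stores.csv (hypothetical values).
stores = pd.DataFrame({
    "Store": [1, 2, 3, 4, 5],
    "Type":  ["A", "A", "B", "C", "B"],
    "Size":  [200000, 180000, 120000, 40000, 110000],
})

# 1. Pie chart of store types.
type_counts = stores["Type"].value_counts()
type_counts.plot.pie(autopct="%1.1f%%", title="Store types")
plt.savefig("store_types_pie.png")
plt.clf()

# 2. Boxplot of store size by type.
stores.boxplot(column="Size", by="Type")
plt.savefig("size_by_type_box.png")
plt.clf()
```

For the pair plot, `seaborn.pairplot(stores)` would produce the grid of pairwise scatter plots on the numeric columns.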

**Observations:**

- There are 45 stores in total.
- There are a total of 3 types of stores: Type A, B, and C.
- From the boxplot and pie chart, we can say that type A stores are the largest and type C stores are the smallest.
- There is no overlap in size ranges among types A, B, and C.

4. Check holiday sales frequencies:

5. Understand department frequencies:

## Observations:

- Sales in holiday weeks are a little higher than in non-holiday weeks.
- From this plot, we notice that the departments with the highest sales lie between Dept 60 and Dept 80.

In total we have **421,570 rows** for **training** and **115,064 rows** for **testing** as part of the competition. But we will **work only on the 421,570 training rows**, as they have labels with which to test the performance and accuracy of our models.

# Feature Engineering:

As we have dates in our dataset, we can build some beautiful DateTime features using pandas.
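A minimal sketch of those DateTime features with the pandas `.dt` accessor (the three dates below are a hypothetical slice of `train.csv`):

```python
import pandas as pd

# Hypothetical slice of the Date column from train.csv.
train = pd.DataFrame({"Date": ["2010-02-05", "2010-02-12", "2012-11-01"]})
train["Date"] = pd.to_datetime(train["Date"])

# Derive calendar features from the parsed dates.
train["Week"]  = train["Date"].dt.isocalendar().week.astype(int)
train["Month"] = train["Date"].dt.month
train["Year"]  = train["Date"].dt.year
train["Day"]   = train["Date"].dt.day
```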

We can also take the mean of Temperature and Unemployment as features.
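One way to build such mean features, assuming the mean is taken per store (the article does not spell this out), is `groupby(...).transform("mean")`, which broadcasts each store's mean back onto its rows. The sample frame is hypothetical.

```python
import pandas as pd

# Hypothetical slice of features.csv.
features = pd.DataFrame({
    "Store":        [1, 1, 2, 2],
    "Temperature":  [40.0, 60.0, 70.0, 90.0],
    "Unemployment": [8.0, 8.2, 7.0, 7.4],
})

# Per-store mean of each column, aligned back to every row of that store.
for col in ["Temperature", "Unemployment"]:
    features[col + "_mean"] = features.groupby("Store")[col].transform("mean")
```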

## Merge train, test and the features dataset

## Merge all the features in a single data frame:
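The merge can be sketched with `pandas.merge`: join the feature file on (Store, Date), then the store metadata on Store. The miniature DataFrames below are hypothetical stand-ins for the real files.

```python
import pandas as pd

# Hypothetical miniature versions of the three files.
train = pd.DataFrame({"Store": [1, 1], "Dept": [1, 2],
                      "Date": ["2010-02-05", "2010-02-05"],
                      "Weekly_Sales": [24924.5, 50605.3],
                      "IsHoliday": [False, False]})
features = pd.DataFrame({"Store": [1], "Date": ["2010-02-05"],
                         "Temperature": [42.31], "Fuel_Price": [2.572],
                         "IsHoliday": [False]})
stores = pd.DataFrame({"Store": [1], "Type": ["A"], "Size": [151315]})

# Left-join the weekly features on (Store, Date), then store metadata on Store.
df = train.merge(features, on=["Store", "Date"], how="left")
df = df.merge(stores, on="Store", how="left")
```

Note that because `IsHoliday` exists in both frames, pandas renames the copies to `IsHoliday_x` and `IsHoliday_y`.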

As the “IsHoliday” feature is present in each of the merged datasets, it ends up duplicated. Let's correct this by removing one copy and renaming the other back to the original “IsHoliday” column name.

We should represent our IsHoliday column with numeric values, so let's map ‘False’ to 0 and ‘True’ to 1.

Similarly, let's convert ‘Types’ of the store to numeric values.
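The three clean-up steps above can be sketched as follows; the small frame is a hypothetical result of the merge, and the A=1/B=2/C=3 mapping follows the feature list later in the article.

```python
import pandas as pd

# Hypothetical merged frame with the duplicated holiday flag.
df = pd.DataFrame({
    "IsHoliday_x": [False, True],
    "IsHoliday_y": [False, True],
    "Type": ["A", "C"],
})

# Drop one duplicate and restore the original column name.
df = df.drop(columns="IsHoliday_y").rename(columns={"IsHoliday_x": "IsHoliday"})

# Encode booleans and store types as numbers.
df["IsHoliday"] = df["IsHoliday"].astype(int)          # False -> 0, True -> 1
df["Type"] = df["Type"].map({"A": 1, "B": 2, "C": 3})  # A=1, B=2, C=3
```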

**Now let’s check features Correlations:**

Correlation is a bivariate analysis that measures the strength of association between two variables and the direction of the relationship. In terms of the strength of the relationship, the value of the correlation coefficient varies between +1 and -1.
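A quick sketch of checking correlations with `DataFrame.corr()` (Pearson by default); the toy frame is hypothetical, and in a notebook the matrix is usually shown as a seaborn heatmap.

```python
import pandas as pd

# Hypothetical numeric frame after encoding.
df = pd.DataFrame({
    "Weekly_Sales": [100.0, 120.0, 90.0, 150.0],
    "Size":         [150000, 150000, 40000, 200000],
    "IsHoliday":    [0, 1, 0, 1],
})

# Pearson correlation matrix; every coefficient lies in [-1, 1].
corr = df.corr()
print(corr["Weekly_Sales"].sort_values(ascending=False))
```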

# Fixing Missing Values:

For missing values like those in the ‘MarkDown’ columns, we impute zero (meaning no markdown); we can safely fill all of these with zero.

After testing a few submissions, I found that imputing the remaining missing values with the feature mean improves the score.
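A minimal sketch of both imputation strategies, using a hypothetical frame with gaps in MarkDown1 and CPI:

```python
import pandas as pd
import numpy as np

# Hypothetical frame with missing MarkDown and CPI values.
df = pd.DataFrame({
    "MarkDown1": [np.nan, 500.0, np.nan],
    "CPI":       [210.0, np.nan, 214.0],
})

# MarkDown NAs mean "no markdown that week": fill them with zero.
markdown_cols = [c for c in df.columns if c.startswith("MarkDown")]
df[markdown_cols] = df[markdown_cols].fillna(0)

# Other numeric gaps: fill with the column mean.
df["CPI"] = df["CPI"].fillna(df["CPI"].mean())
```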

# Feature Selection:

After a few submissions, I got to know that the MarkDown features are not helping much to improve the score, so I had to drop them.

And as we have already taken the mean of CPI, Unemployment, and Fuel_Price, we can drop the original columns as well. As we saw in the correlation graph, Day_of_week is not stable, so we drop this feature too.

Now we are completely ready to define our final Train and Test data to train our model.

*Train set:*

*Test set:*

# Machine Learning Models:

*Model to Predict the Next Year’s Sales*

Final features that we are using to train our model are as follows:

- Store — the store number
- Dept — the department number
- Week: The week ordinal of the year.
- Month: The month as January=1, December=12.
- Year: The year of the DateTime.
- Day: The days of the DateTime.
- Temperature: the average temperature in the region
- IsHoliday: If Holiday = True == 1, else 0
- Size: the size of the store
- Types: Types of store, A = 1, B = 2, C = 3
- Temp_mean: Mean value Temperature
- Unemployment_mean: Mean value of Unemployment
- Fuel_Price_mean: Mean value of the cost of fuel in the region
- CPI_mean: Mean value of CPI

Since the dimensions of the final dataset are not too large, bagged decision trees like **Random Forest** and **Extra Trees** can be used to estimate the importance of features.

## Prediction using our Random Forest model:
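A sketch of fitting a Random Forest and reading off feature importances. This is not the author's exact model or hyperparameters: the training matrix below is synthetic, with columns standing in for a few of the real features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the final training matrix (hypothetical data).
feature_names = ["Store", "Dept", "Week", "Size", "IsHoliday"]
X = rng.random((200, len(feature_names)))
y = 3 * X[:, 3] + X[:, 2] + rng.normal(0, 0.1, 200)  # Size and Week drive y

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Importances from the bagged trees sum to 1 across features.
for name, imp in sorted(zip(feature_names, model.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```

Swapping `RandomForestRegressor` for `ExtraTreesRegressor` gives the Extra Trees variant with the same API.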

Finally, the weekly sales prediction CSV file is generated.
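Generating that file can be sketched as below, assuming the submission format this competition describes (an `Id` of the form `Store_Dept_Date` plus the predicted `Weekly_Sales`); the test rows and predictions here are hypothetical.

```python
import pandas as pd

# Hypothetical test rows and their model predictions.
test = pd.DataFrame({"Store": [1, 1], "Dept": [1, 1],
                     "Date": ["2012-11-02", "2012-11-09"]})
preds = [36000.0, 22000.0]

# Build the Store_Dept_Date Id and write the submission file.
submission = pd.DataFrame({
    "Id": (test["Store"].astype(str) + "_" +
           test["Dept"].astype(str) + "_" + test["Date"]),
    "Weekly_Sales": preds,
})
submission.to_csv("submission.csv", index=False)
```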

# Let’s upload our predicted CSV:

TADA!!!

- This is the complete story of this Kaggle competition and the things learned and applied along the way.
- Thankfully, when I uploaded the predicted sales file to Kaggle, I got a score of **2762.09**, which is close to rank **41**!!

## Future work:

Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks.

Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13

Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13

Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13

Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13

For a better score, we can use this information to engineer some brilliant features around these holiday weeks. I am sure this will help improve the score.
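One way to sketch such features is a lookup from the holiday dates listed above to a holiday name, one-hot encoded so the model can learn a separate effect per holiday. The exact-date matching and the `Is_*` column names are my assumptions, not the article's implementation.

```python
import pandas as pd

# The holiday weeks listed above, as a date -> holiday-name lookup.
holidays = {
    "2010-02-12": "SuperBowl", "2011-02-11": "SuperBowl",
    "2012-02-10": "SuperBowl", "2013-02-08": "SuperBowl",
    "2010-09-10": "LaborDay",  "2011-09-09": "LaborDay",
    "2012-09-07": "LaborDay",  "2013-09-06": "LaborDay",
    "2010-11-26": "Thanksgiving", "2011-11-25": "Thanksgiving",
    "2012-11-23": "Thanksgiving", "2013-11-29": "Thanksgiving",
    "2010-12-31": "Christmas", "2011-12-30": "Christmas",
    "2012-12-28": "Christmas", "2013-12-27": "Christmas",
}

# Hypothetical sample of weekly dates.
df = pd.DataFrame({"Date": ["2010-02-12", "2010-02-19", "2010-11-26"]})
df["Holiday"] = df["Date"].map(holidays).fillna("None")

# One-hot encode so the model can learn a separate effect per holiday.
df = pd.concat([df, pd.get_dummies(df["Holiday"], prefix="Is")], axis=1)
```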

So we have seen different methods that can be used to check the stationarity of a time series, and a suite of classical time series forecasting methods that we can test and tune on our dataset.

**Here is the GitHub link of the project.**

# Reference

**EDA:**

- https://www.kaggle.com/yepp2411/walmart-prediction-1-eda-with-time-and-space
- https://www.kaggle.com/bnorbert/eda-walmart

**Date time features:**

- https://pandas.pydata.org/pandas-docs/stable/reference/series.html#datetime-properties
- https://stackoverflow.com/questions/33365055/attributeerror-can-only-use-dt-accessor-with-datetimelike-values
- https://stackoverflow.com/questions/25146121/extracting-just-month-and-year-separately-from-pandas-datetime-column


If you’ve got something on your mind you think this article is missing, leave a response below.

I hope this has helped you better understand the Walmart Recruiting — Store Sales Forecasting problem, and if you are interested, it helps you compete in a Kaggle data science competition. You can see the current active competitions at *kaggle.com*!

**Thanks for reading so far.**

I will see you in my next post**!!!**