A Data Science Project For Beginners (Exploratory Data Analysis (EDA))

Saicharan Kr

Published in

Analytics Vidhya

7 min read · Jun 14, 2020

Analysis of Walmart sales data

PROBLEM STATEMENT

One of the leading retail stores in the US, Walmart, would like to predict sales and demand accurately. Certain events and holidays affect sales on particular days. Sales data are available for 45 Walmart stores. The business faces unforeseen demand and sometimes runs out of stock because its current machine learning algorithm is not appropriate. An ideal ML algorithm would predict demand accurately and take into account economic factors such as CPI and the Unemployment Index.

Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data. Historical sales data for 45 Walmart stores located in different regions are available.

Dataset Description

This is the historical data that covers sales from 2010-02-05 to 2012-11-01, in the file Walmart_Store_sales. Within this file you will find the following fields:

Store: the store number
Date: the week of sales
Weekly_Sales: sales for the given store
Holiday_Flag: whether the week is a special holiday week (1 = holiday week, 0 = non-holiday week)
Temperature: temperature on the day of sale
Fuel_Price: cost of fuel in the region
CPI: prevailing consumer price index
Unemployment: prevailing unemployment rate

Holiday Events

Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13

Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13

Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13

Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13

Download Dataset here

Analysis Tasks

Basic Statistics tasks

Which store has maximum sales?
Which store has maximum standard deviation, i.e., the sales vary a lot? Also, find the coefficient of variation.
Which store(s) had a good quarterly growth rate in Q3'2012?
Some holidays have a negative impact on sales. Find the holidays which have higher sales than the mean sales in the non-holiday season for all stores together.
Provide a monthly and semester view of sales in units and give insights.

We will now start our analysis.

STEP 1 :- Start by importing the required libraries.

import pandas as pd 
from datetime import date
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

STEP 2:- Import Data Set

data = pd.read_csv('Datasets/Walmart_Store_sales.csv')

NOTE:- read_csv is the Pandas function used to import a dataset stored in CSV format.

Syntax :- pandas.read_csv('Your Dataset Location')
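As a minimal sketch of what read_csv does, the snippet below parses a small in-memory CSV instead of a file on disk. The two rows are made-up values for illustration only; they are not taken from the Walmart dataset.

```python
import io

import pandas as pd

# Made-up sample rows, for illustration only (not real Walmart data).
csv_text = """Store,Date,Weekly_Sales,Holiday_Flag
1,2010-02-05,1000.50,0
1,2010-02-12,1200.25,1
"""

# read_csv accepts a file path or any file-like object;
# StringIO stands in for the CSV file here.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # datatype pandas inferred for each column
```

Checking the inferred datatypes right after loading is a quick way to spot columns, such as Date, that were read as plain text.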

STEP 3:- Understand the dataset.

data.head()  # by default, displays the first five rows of the dataset

Basic information about our dataset

data.info() # gives basic information about our dataset, such as its dimensions, number of non-null values, and the datatype of each column

data.max() # finds the maximum value in each column

Steps 1, 2 and 3 above are common to the analysis of any data. Now we will start with our analysis tasks.


QUESTION 1 :- Which store has maximum sales in this dataset?

Now we are going to find the answer to this from our data.

# loc returns the rows that meet a condition. Here we look for the row whose Weekly_Sales equals the column maximum, which tells us the store with the highest weekly sales.
data.loc[data['Weekly_Sales'] == data['Weekly_Sales'].max()]

From the figure above we see that Store 14 has the maximum weekly sales.
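The filter above returns the single week with the highest Weekly_Sales. If "maximum sales" is instead read as the highest total over the whole period, a groupby can answer that. The following is a minimal sketch on made-up numbers; the toy frame is an assumption, not the Walmart data.

```python
import pandas as pd

# Made-up numbers for three stores, two weeks each.
toy = pd.DataFrame({
    "Store": [1, 1, 2, 2, 3, 3],
    "Weekly_Sales": [100.0, 150.0, 300.0, 50.0, 200.0, 210.0],
})

# Total sales per store over all weeks.
total_by_store = toy.groupby("Store")["Weekly_Sales"].sum()

# idxmax returns the index label (the store number) of the largest total.
best_store = total_by_store.idxmax()

# Store that owns the single best week, for comparison.
best_week_store = toy.loc[toy["Weekly_Sales"].idxmax(), "Store"]

print(best_store, best_week_store)
```

In this toy frame the two readings give different stores, which is why it helps to state which definition of "maximum sales" is being used.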

QUESTION 2 :- Which store has maximum standard deviation, i.e., the sales vary a lot? Also, find the coefficient of variation (CoV).

In simple terms, CoV is the ratio of the standard deviation to the mean.

Read more about CoV at the Wikipedia link below.

Coefficient of variation

In probability theory and statistics, the coefficient of variation ( CV), also known as relative standard deviation (…

en.wikipedia.org

#Here I am grouping by store and finding the standard deviation and mean of each store.
maxstd=pd.DataFrame(data.groupby('Store').agg({'Weekly_Sales':['std','mean']}))
#Just resetting the index.
maxstd = maxstd.reset_index()
#We know that CoV is std / mean, so we compute this for each store (as a percentage).
maxstd['CoV'] =(maxstd[('Weekly_Sales','std')]/maxstd[('Weekly_Sales','mean')]) *100
#Finding the store with maximum standard deviation.
maxstd.loc[maxstd[('Weekly_Sales','std')]==maxstd[('Weekly_Sales','std')].max()]

From the figure above we can conclude that sales in store 14 vary a lot.
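The same per-store calculation can be sketched with named aggregation, which produces flat column names instead of the two-level columns used above. The toy numbers and the column names sales_std and sales_mean are assumptions made for illustration.

```python
import pandas as pd

# Made-up numbers: store 2 swings far more than store 1.
toy = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 2],
    "Weekly_Sales": [90.0, 100.0, 110.0, 50.0, 100.0, 150.0],
})

# Named aggregation: one flat output column per keyword argument.
stats = (
    toy.groupby("Store")["Weekly_Sales"]
    .agg(sales_std="std", sales_mean="mean")
    .reset_index()
)

# CoV = standard deviation / mean, expressed as a percentage.
stats["CoV"] = stats["sales_std"] / stats["sales_mean"] * 100

# Store with the largest standard deviation.
most_variable = stats.loc[stats["sales_std"].idxmax(), "Store"]

print(stats)
print(most_variable)
```

Because CoV divides by the mean, it allows stores of very different sizes to be compared on the same scale, which the raw standard deviation does not.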

QUESTION 3 :- Which store(s) had a good quarterly growth rate in Q3'2012?

#Converting the data type of the Date column to datetime
data['Date'] = pd.to_datetime(data['Date'])
#Defining the start and end dates of Q3 and Q2
Q3_date_from = pd.Timestamp(date(2012,7,1))
Q3_date_to = pd.Timestamp(date(2012,9,30))
Q2_date_from = pd.Timestamp(date(2012,4,1))
Q2_date_to = pd.Timestamp(date(2012,6,30))
#Collecting the data of Q2 and Q3 from the original dataset (start and end dates included)
Q2data=data[(data['Date'] >= Q2_date_from) & (data['Date'] <= Q2_date_to)]
Q3data=data[(data['Date'] >= Q3_date_from) & (data['Date'] <= Q3_date_to)]
#Finding the sum of weekly sales of each store in Q2
Q2 = pd.DataFrame(Q2data.groupby('Store')['Weekly_Sales'].sum())
Q2.reset_index(inplace=True)
Q2.rename(columns={'Weekly_Sales': 'Q2_Weekly_Sales'},inplace=True)
#Finding the sum of weekly sales of each store in Q3
Q3 = pd.DataFrame(Q3data.groupby('Store')['Weekly_Sales'].sum())
Q3.reset_index(inplace=True)
Q3.rename(columns={'Weekly_Sales': 'Q3_Weekly_Sales'},inplace=True)
#Merging Q2 and Q3 data on Store as a common column
Q3_Growth= Q2.merge(Q3,how='inner',on='Store')
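Two details of this step are worth checking on your own copy of the data. First, pd.to_datetime guesses the date layout; if the Date column is stored day-first (for example 05-02-2010 for 5 February 2010, which is an assumption about the file and not something shown here), passing an explicit format avoids silent month/day swaps. Second, Series.between includes both endpoints by default, so rows dated exactly on the first or last day of a quarter are kept. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows; the day-month-year layout is an assumption.
toy = pd.DataFrame({
    "Date": ["29-06-2012", "06-07-2012", "28-09-2012", "05-10-2012"],
    "Weekly_Sales": [10.0, 20.0, 30.0, 40.0],
})

# Explicit format: day, then month, then four-digit year.
toy["Date"] = pd.to_datetime(toy["Date"], format="%d-%m-%Y")

# Rows whose date falls in Q3 2012, both end dates included.
q3_start = pd.Timestamp("2012-07-01")
q3_end = pd.Timestamp("2012-09-30")
q3 = toy[toy["Date"].between(q3_start, q3_end)]

print(q3)
```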

The growth rate is defined as the difference between the present value and the past value, divided by the past value, and multiplied by 100 (since it is a percentage):

((Present value - Past value) / Past value) * 100
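To make the formula concrete, here is a small worked sketch. The function name growth_rate_percent and the input numbers are hypothetical, chosen only to illustrate the arithmetic.

```python
def growth_rate_percent(present: float, past: float) -> float:
    """Percentage change from the past value to the present value."""
    if past == 0:
        raise ValueError("past value must be non-zero")
    return (present - past) / past * 100


# Made-up example: quarterly sales rise from 120 to 150 units.
rise = growth_rate_percent(150, 120)

# Made-up example: quarterly sales fall from 200 to 194 units.
fall = growth_rate_percent(194, 200)

print(rise, fall)
```

A positive result means sales grew relative to the earlier period; a negative result means they shrank.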

#Calculating the growth rate of each store and collecting it into a dataframe
#(stored as a fraction, so -0.03 means -3%)
Q3_Growth['Growth_Rate'] =(Q3_Growth['Q3_Weekly_Sales'] - Q3_Growth['Q2_Weekly_Sales'])/Q3_Growth['Q2_Weekly_Sales']
Q3_Growth['Growth_Rate']=round(Q3_Growth['Growth_Rate'],2)
#Store with the highest growth rate
Q3_Growth.sort_values('Growth_Rate',ascending=False).head(1)

#Store with the lowest growth rate
Q3_Growth.sort_values('Growth_Rate',ascending=False).tail(1)

From the tables above we can infer that the Q3 growth rate is negative, i.e., the stores made losses relative to Q2.

Store 16 has the smallest loss, 3%, compared with the other stores, and store 14 has the highest loss, 18%.

QUESTION 4:- Some holidays have a negative impact on sales. Find out holidays which have higher sales than the mean sales in non-holiday season for all stores together.

#finding the mean sales of non holiday and holiday 
data.groupby('Holiday_Flag')['Weekly_Sales'].mean()

#marking the holiday dates 
Christmas1 = pd.Timestamp(date(2010,12,31) )
Christmas2 = pd.Timestamp(date(2011,12,30) )
Christmas3 = pd.Timestamp(date(2012,12,28) )
Christmas4 = pd.Timestamp(date(2013,12,27) )

Thanksgiving1=pd.Timestamp(date(2010,11,26) )
Thanksgiving2=pd.Timestamp(date(2011,11,25) )
Thanksgiving3=pd.Timestamp(date(2012,11,23) )
Thanksgiving4=pd.Timestamp(date(2013,11,29) )

#Labor Day and Super Bowl dates as given in the Holiday Events list
LabourDay1=pd.Timestamp(date(2010,9,10) )
LabourDay2=pd.Timestamp(date(2011,9,9) )
LabourDay3=pd.Timestamp(date(2012,9,7) )
LabourDay4=pd.Timestamp(date(2013,9,6) )

SuperBowl1=pd.Timestamp(date(2010,2,12) )
SuperBowl2=pd.Timestamp(date(2011,2,11) )
SuperBowl3=pd.Timestamp(date(2012,2,10) )
SuperBowl4=pd.Timestamp(date(2013,2,8) )

#Selecting the rows that fall on each holiday
Christmas_mean_sales=data[(data['Date'] == Christmas1) | (data['Date'] == Christmas2) | (data['Date'] == Christmas3) | (data['Date'] == Christmas4)]
Thanksgiving_mean_sales=data[(data['Date'] == Thanksgiving1) | (data['Date'] == Thanksgiving2) | (data['Date'] == Thanksgiving3) | (data['Date'] == Thanksgiving4)]
LabourDay_mean_sales=data[(data['Date'] == LabourDay1) | (data['Date'] == LabourDay2) | (data['Date'] == LabourDay3) | (data['Date'] == LabourDay4)]
SuperBowl_mean_sales=data[(data['Date'] == SuperBowl1) | (data['Date'] == SuperBowl2) | (data['Date'] == SuperBowl3) | (data['Date'] == SuperBowl4)]
#Calculating the mean sales during each holiday and during non-holiday weeks
list_of_mean_sales = {
    'Christmas_mean_sales' : round(Christmas_mean_sales['Weekly_Sales'].mean(),2),
    'Thanksgiving_mean_sales': round(Thanksgiving_mean_sales['Weekly_Sales'].mean(),2),
    'LabourDay_mean_sales' : round(LabourDay_mean_sales['Weekly_Sales'].mean(),2),
    'SuperBowl_mean_sales': round(SuperBowl_mean_sales['Weekly_Sales'].mean(),2),
    'Non holiday weekly sales' : data[data['Holiday_Flag'] == 0 ]['Weekly_Sales'].mean()
}
list_of_mean_sales

From the figure above we can infer that the mean sales in Thanksgiving weeks are higher than the mean non-holiday weekly sales.
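The holiday comparison can also be sketched more compactly by keeping the dates in a dictionary and using isin instead of chained == comparisons. The sales figures below are made up, and only the 2010 dates from the Holiday Events list are used to keep the toy frame small.

```python
import pandas as pd

# Made-up sales for the four 2010 holiday weeks plus two ordinary weeks.
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2010-02-12", "2010-09-10", "2010-11-26",
        "2010-12-31", "2010-03-05", "2010-03-12",
    ]),
    "Weekly_Sales": [110.0, 105.0, 150.0, 90.0, 100.0, 100.0],
    "Holiday_Flag": [1, 1, 1, 1, 0, 0],
})

# Holiday name -> list of dates (2010 only in this sketch).
holidays = {
    "Super Bowl": ["2010-02-12"],
    "Labor Day": ["2010-09-10"],
    "Thanksgiving": ["2010-11-26"],
    "Christmas": ["2010-12-31"],
}

# Baseline: mean sales over non-holiday weeks.
non_holiday_mean = toy.loc[toy["Holiday_Flag"] == 0, "Weekly_Sales"].mean()

# Mean sales per holiday.
holiday_means = {
    name: toy.loc[toy["Date"].isin(pd.to_datetime(dates)), "Weekly_Sales"].mean()
    for name, dates in holidays.items()
}

# Holidays whose mean beats the non-holiday baseline.
above = [name for name, mean in holiday_means.items() if mean > non_holiday_mean]
print(above)
```

Adding the dates for further years only means extending the lists in the dictionary; the comparison code stays the same.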

QUESTION 5 :- Provide a monthly, quarterly and semester view of sales in units and give insights.

#Monthly sales
monthly = data.groupby(pd.Grouper(key='Date', freq='1M')).sum() # group by each month
monthly=monthly.reset_index()
fig, ax = plt.subplots(figsize=(10,8))
X = monthly['Date']
Y = monthly['Weekly_Sales']
plt.plot(X,Y)
plt.title('Month Wise Sales')
plt.xlabel('Monthly')
plt.ylabel('Weekly_Sales')

#Quarterly sales
Quaterly = data.groupby(pd.Grouper(key='Date', freq='3M')).sum() # group by each 3 months
Quaterly = Quaterly.reset_index()
fig, ax = plt.subplots(figsize=(10,8))
X = Quaterly['Date']
Y = Quaterly['Weekly_Sales']
plt.plot(X,Y)
plt.title('Quarter Wise Sales')
plt.xlabel('Quarterly')
plt.ylabel('Weekly_Sales')

#Semester sales
Semester = data.groupby(pd.Grouper(key='Date', freq='6M')).sum() # group by each 6 months
Semester = Semester.reset_index()
fig, ax = plt.subplots(figsize=(10,8))
X = Semester['Date']
Y = Semester['Weekly_Sales']
plt.plot(X,Y)
plt.title('Semester Wise Sales')
plt.xlabel('Semester')
plt.ylabel('Weekly_Sales')

1. We can observe from the Monthly Sales graph that the highest sum of sales is recorded between January 2011 and March 2011.

2. We can observe from the Quarterly Sales graph that the highest sum of sales is recorded in Q1 of 2011 and 2012.

3. We can observe from the Semester Sales graph that sales are lowest at the beginning of the 1st semester of 2010 and the 1st semester of 2013.
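As an alternative to grouping by '3M' and '6M' steps, whose bins may not line up with calendar quarters and half-years because they are counted from the start of the data, sales can be labelled by calendar period directly. The sketch below uses made-up numbers; the H1/H2 semester labels are a convention assumed here for illustration.

```python
import pandas as pd

# Made-up weekly sales spread over 2011.
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2011-01-07", "2011-03-25", "2011-04-01", "2011-08-05", "2011-12-30",
    ]),
    "Weekly_Sales": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Calendar-quarter labels such as 2011Q1.
quarterly = toy.groupby(toy["Date"].dt.to_period("Q"))["Weekly_Sales"].sum()

# Semester labels: H1 for January to June, H2 for July to December.
half = toy["Date"].dt.month.le(6).map({True: "H1", False: "H2"})
semester_label = toy["Date"].dt.year.astype(str) + "-" + half
semester = toy.groupby(semester_label)["Weekly_Sales"].sum()

print(quarterly)
print(semester)
```

Labelling by calendar period makes statements such as "Q1 of 2011" unambiguous when reading the resulting tables or plots.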