How-To Guide on Exploratory Data Analysis for Time Series Data

Mansi Choudhary · Published in Analytics Vidhya · Sep 1, 2020 · 14 min read

Why Exploratory Data Analysis?

You might have heard that before proceeding with a machine learning problem, it is good to do an end-to-end analysis of the data by carrying out proper exploratory data analysis. A common question that pops into people's heads after hearing this is: why EDA?

· What is it that makes EDA so important?

· How to do proper EDA and get insights from the data?

· What is the right way to begin with exploratory data analysis?

So, let us see how we can perform exploratory data analysis and get useful insights from our data. For performing EDA, I will take the dataset from Kaggle's M5 Forecasting Accuracy competition.

Understanding the Problem Statement:

Before you begin EDA, it is important to understand the problem statement. EDA depends on what you are trying to solve or find. If your EDA is not aligned with the problem you are trying to solve, it will just be plain plotting of meaningless graphs.

Hence, before you begin, understand the problem statement. So, let us look at the problem statement for this data.

Problem Statement:

Here we have hierarchical sales data for Walmart products from different categories across three states, namely California, Wisconsin and Texas. Looking at this data, we need to predict the sales of the products for the next 28 days. The training data consists of individual daily sales for each product over 1913 days. Using this training data we need to make predictions for the days that follow.

We have the following files provided as part of the competition:

  1. calendar.csv — Contains information about the dates on which the products are sold.
  2. sales_train_validation.csv — Contains the historical daily unit sales data per product and store [d_1 — d_1913]
  3. sample_submission.csv — The correct format for submissions. Reference the Evaluation tab for more info.
  4. sell_prices.csv — Contains information about the price of the products sold per store and date.
  5. sales_train_evaluation.csv — Includes sales [d_1 — d_1941] (labels used for the Public leaderboard)

Using this dataset we need to make the sales prediction for the next 28 days.

Analyzing Dataframes:

Now that you have understood the problem statement well, the first thing to do to begin EDA is to analyze the dataframes and understand the features present in our dataset.

As mentioned earlier, for this data we have 5 different CSV files. Hence, to begin EDA, we will first print the head of each dataframe to get an intuition of the features and the dataset.

Here, I am using Python's pandas library to read the data and print the first few rows. View the first few rows and note down your observations:
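A minimal sketch of this step might look like the following (the file paths and dataframe names are my own choices; adjust them to wherever you keep the competition files):

import pandas as pd

# Read the competition CSVs (paths assume the files sit in the working directory)
calendar_df = pd.read_csv('calendar.csv')
sales_df = pd.read_csv('sales_train_validation.csv')
prices_df = pd.read_csv('sell_prices.csv')

# Print the first few rows of each dataframe to get a feel for the features
print(calendar_df.head())
print(sales_df.head())
print(prices_df.head())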

Calendar Data:

First Few Rows:

Calendar dataframe

Value Counts Plot:

To get a visual idea of our data, we will plot the value counts of each categorical feature of the calendar dataframe. For this we will use the Seaborn library.
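The post's plotting snippet is shown as an image; a rough Seaborn equivalent, reusing the calendar_df loaded above and plotting one categorical column at a time, might look like this:

import matplotlib.pyplot as plt
import seaborn as sns

# Count plot for one categorical column of the calendar dataframe;
# swap 'weekday' for 'month', 'year', 'event_name_1', 'event_type_1', etc.
plt.figure(figsize=(10, 4))
sns.countplot(x='weekday', data=calendar_df)
plt.title('Value counts for each day of the week')
plt.show()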

Code-Snippet for Plotting Value Counts of Each Feature
Value_counts for each day of week
Value_counts for each month
Value_counts for each year
Value_counts for each event based on event_name_1
Value_counts for each event based on event_name_2
Value_counts for type of event in type_1
Value_counts for the type of event in type_2

Observations from Calendar Dataframe:

  1. We have the date, weekday, month, year and event information for each day for which we have forecast information.
  2. Also, we see many NaN values in our data, especially in the event fields, which means that for days with no event we have a missing-value placeholder.
  3. We have data for all the weekdays with equal counts. Hence, it is safe to say we do not have any missing entries here.
  4. We have a higher count of values for the months of March, April and May. For the last quarter, the count is low.
  5. We have data from 2011 to 2016, although we don't have data for all the days of 2016. This explains the higher count of values for the first few months of the year.
  6. We also have a list of events that might be useful in analyzing trends and patterns in our data.
  7. We have more data for cultural events than for religious events.

Hence, by just plotting a few basic graphs we are able to gather some useful information about our dataset that we didn't know earlier. That is amazing indeed. So, let us try the same for the other CSV files we have.

Sales Validation Dataset:

First few rows:

Next, we will explore the validation dataset provided to us:

First five rows of validation data

Value counts plot:

Code-Snippet for count_plot
Value_counts plot for each store
Value_counts plot for each state
Value_counts plot for each category
Value_counts plot for each department

Observations from Sales Data:

  1. We have data for three different categories, which are Household, Foods and Hobbies.
  2. We have data for three different states: California, Wisconsin and Texas. Of these three states, the maximum sales come from California.
  3. Sales are highest for the Foods category.

Sell Price Data:

First few rows:

First 5 rows for Sell Price Data

Observations:

  1. Here we have the sell_price of each item.
  2. We have already seen the item_id and store_id plots earlier.

Asking Questions to your Data:

Till now we have seen the basic EDA plots. The above plots gave us a brief overview of the data that we have. Now, for the next phase, we need to find answers to the questions that we have about our data. This depends on the problem statement that we have.

For Example:

In our data we need to forecast the sales of each product for the next 28 days. Hence, we need to know whether there are any patterns in the sales before those 28 days, because if there are, the sales are likely to follow the same pattern over the next 28 days too.

So, here goes our first question:

What is the Sales distribution in the past?

So, to find out, let us randomly select a few products and see their sales distribution over the 1913 days given in our validation data:
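A hedged sketch of how such a plot can be produced (reusing sales_df from earlier; the random seed and figure size are arbitrary choices of mine):

import matplotlib.pyplot as plt

# The day columns d_1, d_2, ... hold the daily unit sales for each product
day_cols = [c for c in sales_df.columns if c.startswith('d_')]

# Pick one product at random and plot its daily sales
product = sales_df.sample(1, random_state=42)
plt.figure(figsize=(14, 4))
plt.plot(range(1, len(day_cols) + 1), product[day_cols].values.flatten())
plt.title(product['id'].values[0])
plt.xlabel('Day')
plt.ylabel('Units sold')
plt.show()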

Code-snippet for plotting sales of a product
Sales Distribution plot for FOODS_3_0900_CA_3_validation
Sales Distribution plot for HOUSEHOLD_2_348_CA_1_validation
Sales Distribution plot for FOODS_3_325_TX_3_validation

Observations:

  1. The plots are very random and it is difficult to find a pattern.
  2. For FOODS_3_0900_CA_3_validation we see that on day 1 the sales were high, after which they were nil for some time. After that they rose again and have been fluctuating up and down since then. The sudden fall after day 1 might be because the product went out of stock.
  3. For HOUSEHOLD_2_348_CA_1_validation we see that the sales plot is extremely random. It has a lot of noise. On some days the sales are high and on others they drop considerably.
  4. For FOODS_3_325_TX_3_validation we see absolutely no sales for the first 500 days. This means that for the first 500 days the product was not in stock. After that the sales reach a peak roughly every 200 days. Hence, for this food product we see a seasonal dependency.

Hence, by just plotting a few sales graphs at random we are able to draw some important insights from our dataset. These insights will also help us choose the right model for the training process.

What is the Sales Pattern on Weekly, Monthly and Yearly Basis?

We saw earlier that there are seasonal trends in our data. So, next let us break down the time variables and look at the weekly, monthly and yearly sales patterns:
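One way to do this, sketched below, is to reshape a product's sales into long form and join it with the calendar so that every day of sales carries its weekday, month and year (the dataframe names follow the earlier sketches):

import matplotlib.pyplot as plt

day_cols = [c for c in sales_df.columns if c.startswith('d_')]

# One product's sales in long form: one row per day
product = sales_df.sample(1, random_state=0)
long_df = product[day_cols].T.reset_index()
long_df.columns = ['d', 'sales']
long_df['sales'] = long_df['sales'].astype(float)

# Attach weekday / month / year from the calendar, then average per weekday
long_df = long_df.merge(calendar_df[['d', 'weekday', 'month', 'year']], on='d', how='left')
weekly_avg = long_df.groupby('weekday')['sales'].mean()

# The same groupby on 'month' or 'year' gives the monthly and yearly averages
weekly_avg.plot(kind='bar', figsize=(8, 4), title='Average sales by day of week')
plt.show()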

Code-Snippet for Weekly Average Sales Distribution
Weekly Average Distribution for HOUSEHOLD_1_118_CA_3_validation

For this particular product, HOUSEHOLD_1_118_CA_3_validation, we can see that the sales drop after Tuesday and hit a minimum on Saturday.

Code-Snippet for Monthly Average Sales Distribution
Monthly Average Distribution for HOUSEHOLD_1_118_CA_3_validation

The monthly sales drop in the middle of the year, reaching a minimum in the 7th month, that is, July.

Code-Snippet for Yearly Average Sales Distribution
Yearly Average Distribution for HOUSEHOLD_1_118_CA_3_validation

From the above graph we can see that the sales just dropped to zero from 2013 to 2014. This means that the product might have been replaced with a new product version or simply removed from this store. From this plot it is safe to say that for the days we need to predict, the sales should still be zero.

What is the Sales Distribution in Each Category?

We have sales data belonging to three different categories. Hence, it might be good to see whether the sales of a product depend on the category it belongs to. Let us do that now:
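A sketch of such a comparison is given below (summing daily sales over all products within each category; swapping 'cat_id' for 'state_id' gives the per-state version used in the next question):

import matplotlib.pyplot as plt

day_cols = [c for c in sales_df.columns if c.startswith('d_')]

# Total units sold per day for each category (FOODS, HOBBIES, HOUSEHOLD)
cat_daily = sales_df.groupby('cat_id')[day_cols].sum().T

cat_daily.plot(figsize=(14, 4), title='Daily total sales per category')
plt.xlabel('Day')
plt.ylabel('Units sold')
plt.show()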

Code-Snippet for Sales Distribution Category-Wise
Sales-Distribution for each Category

We see that the sales are highest for Foods. Also, the sales curve for Foods does not overlap at all with the other two categories. This shows that on any given day the sales of Foods are higher than those of Household and Hobbies.

What is the Sales Distribution for Each State?

Besides category, we also have the state to which the sales belong. So, let us analyze whether there is a state for which the sales follow a different pattern:

Code-Snippet for Sales Distribution State-Wise
Sales-Distribution for each State

What is the Sales Distribution for Products that belong to the category of Hobbies on a weekly, monthly and yearly basis?

Now, let us look at the sales of randomly selected products from the Hobbies category and see whether their weekly, monthly or yearly averages follow a pattern:

Code-Snippet for plotting sales distribution of products from Hobbies

Observations

From the above plot we see that in mid-week, usually on the 4th and 5th days (Tuesday and Wednesday), the sales drop, especially for the states 'WI' and 'TX'.

Let us analyze the results for individual states to see this more clearly, since we see different sales patterns for different states. And this brings us to our next question:

What is the Sales Distribution for Products that belong to the category of Hobbies on a weekly, monthly and yearly basis for a particular state?

Code-Snippet for selecting Sales of products from Hobbies category and state of Wisconsin
Code-Snippet for selecting few products at random and plotting their distribution
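The snippets above appear as images in the original post; a rough sketch of that filtering and sampling step (using the dataset's 'HOBBIES' and 'WI' values, and reusing the calendar join shown earlier, not the post's exact code) could be:

import matplotlib.pyplot as plt

day_cols = [c for c in sales_df.columns if c.startswith('d_')]

# Keep only Hobbies products sold in Wisconsin stores, then sample a few at random
hobbies_wi = sales_df[(sales_df['cat_id'] == 'HOBBIES') & (sales_df['state_id'] == 'WI')]
sampled = hobbies_wi.sample(3, random_state=1)

# Weekly average sales for each sampled product
plt.figure(figsize=(10, 4))
for _, row in sampled.iterrows():
    s = row[day_cols].reset_index()
    s.columns = ['d', 'sales']
    s['sales'] = s['sales'].astype(float)
    s = s.merge(calendar_df[['d', 'weekday']], on='d', how='left')
    s.groupby('weekday')['sales'].mean().plot(label=row['id'])
plt.legend()
plt.title('Weekly average sales, Hobbies products in WI')
plt.show()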

Observations:

  1. From the above plots, we can see that in the state of Wisconsin, for most of the products, the sales decrease considerably in mid-week.
  2. This also gives us a little sense of the lifestyle of people in Wisconsin: people here do not shop much during days 3–4, which are Monday and Tuesday. This is probably because these are the busiest days of the week.
  3. From the monthly averages we can see that in the first quarter the sales often experience a dip.
  4. For the product HOBBIES_1_369_WI_2_validation, we see that the sales are nil until the year 2014. This shows that the product was introduced after this year, and the weekly and monthly patterns that we see for it come from the period after 2014.

What is the Sales Distribution for Products that belong to the category of Foods on a weekly, monthly and yearly basis?

Now, doing the analysis for Hobbies individually gave us some useful insights. Let us try the same for the category of Foods:

Code-Snippet for making a dataframe with only products of the Foods category
Code-Snippet for plotting weekly, monthly and yearly average sales for food products

Observation:

  1. From the plots above we can say that for food items the purchases are higher early in the week compared to the last two days.
  2. This might be because people are used to buying food supplies at the start of the week and then keeping them for the entire week. These curves show similar behavior.

What is the Sales Distribution for Products that belong to the category of Household on a weekly, monthly and yearly basis?

Code-Snippet for plotting sales distribution of products from Household category

Observation:

  1. From the plots above we can say that for Household items the purchases show a dip on Monday and Tuesday.
  2. At the start of the week people are busy with office work and hardly go shopping. This is the pattern that we see here.

Is there a way to see the sales of products more clearly without losing information?

We saw sales distribution plots earlier for individual products. These were quite cluttered and we couldn't see the patterns clearly. Hence, you might be wondering whether there is a way to do so. And the good news is: yes, there is.

Here is where denoising comes into the picture. We will denoise our dataset and look at the distribution again.

Here we will look at two common denoising techniques: wavelet denoising and moving-average smoothing.

Wavelet Denoising:

From the sales plots of individual products we saw that the sales change rapidly. This is because the sales of a product on a given day depend on multiple factors. So, let us try denoising our data and see if we can find anything interesting.

The basic idea behind wavelet denoising, or wavelet thresholding, is that the wavelet transform leads to a sparse representation for many real-world signals and images. What this means is that the wavelet transform concentrates signal and image features in a few large-magnitude wavelet coefficients. Wavelet coefficients which are small in value are typically noise and you can “shrink” those coefficients or remove them without affecting the signal or image quality. After you threshold the coefficients, you reconstruct the data using the inverse wavelet transform.

For wavelet denoising, we require the pywt (PyWavelets) library.

Here we will use wavelet denoising. To decide the denoising threshold we will use the mean absolute deviation (MAD).

Code-Snippet for Wavelet Denoising
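The snippet itself appears as an image in the original post; a common way to implement this kind of MAD-thresholded wavelet denoising with pywt looks roughly like this (the 'db4' wavelet, the periodic signal extension and the hard threshold are my assumptions, not necessarily the post's exact choices):

import numpy as np
import pywt

def maddest(x):
    # Mean absolute deviation, used as a robust estimate of the noise level
    return np.mean(np.absolute(x - np.mean(x)))

def wavelet_denoise(signal, wavelet='db4', level=1):
    # Decompose the series, shrink the small (noisy) detail coefficients, reconstruct
    coeffs = pywt.wavedec(signal, wavelet, mode='per')
    sigma = (1 / 0.6745) * maddest(coeffs[-level])
    threshold = sigma * np.sqrt(2 * np.log(len(signal)))
    coeffs[1:] = [pywt.threshold(c, value=threshold, mode='hard') for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet, mode='per')

# Example usage on one product's daily sales (sales_df / day_cols as in the earlier sketches):
# denoised = wavelet_denoise(sales_df[day_cols].iloc[0].values.astype(float))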

Observations:

We are able to see the pattern more clearly after denoising the data. It repeats roughly every 500 days, which we were not able to see before denoising.

Moving Average Denoising:

Let us now try a simple smoothing technique. In this technique, we take a fixed window size and move it along our time-series data, calculating the average. We also take a stride value to space the windows accordingly. For example, let's say we take a window size of 20 and a stride of 5. Then our first point will be the mean of the points from day 1 to day 20, the next will be the mean of the points from day 6 to day 25, then day 11 to day 30, and so on.

So, let us try this average smoothing on our dataset and see if we find any kind of patterns here.

Code-Snippet for Moving Window Average Calculation
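The original snippet is an image; a sketch of such a moving-window average, following the window-of-20 / stride-of-5 example described above, might be:

import numpy as np

def average_smoothing(signal, window=20, stride=5):
    # Mean of each fixed-size window, stepping forward `stride` points at a time
    smoothed = []
    for start in range(0, len(signal) - window + 1, stride):
        smoothed.append(np.mean(signal[start:start + window]))
    return np.array(smoothed)

# Example usage on one product's daily sales:
# smoothed = average_smoothing(sales_df[day_cols].iloc[0].values.astype(float))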

Observations:

We see that average smoothing does remove some noise, but it is not as effective as the wavelet denoising.

Do the sales vary overall for each state?

Now, from a broader perspective let us see if the sales vary for each state:
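A sketch of this rollup (total daily sales per store, with a rolling mean to make the trends easier to compare; the 90-day window is an arbitrary choice of mine):

import matplotlib.pyplot as plt

day_cols = [c for c in sales_df.columns if c.startswith('d_')]

# Total units sold per day for each store (CA_1 ... WI_3)
store_daily = sales_df.groupby('store_id')[day_cols].sum().T

# Smooth with a rolling mean so the store-level trends are easier to compare
store_daily.rolling(90).mean().plot(figsize=(14, 5), title='90-day rolling average sales per store')
plt.xlabel('Day')
plt.ylabel('Units sold')
plt.show()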

Code-Snippet for Average Sales in Each state
Sales-pattern for each state
Box-plot for Sales distribution of each state

Observations:

  1. From the above plot we can see that the sales for store CA_3 lie above the sales for all other stores. The same applies to CA_4, where the sales are the lowest. For the other stores the patterns are distinguishable to some extent.
  2. One thing we observe is that all these patterns follow a similar trend that repeats itself after some time. Also, the sales reach higher values as we move along the graph.
  3. As we saw from the line plot, the box plot also shows non-overlapping sales patterns for CA_3 and CA_4.
  4. There is no overlap between the stores of California, even though all of them belong to the same state. This shows high variance across the stores of California.
  5. For Texas, the stores TX_1 and TX_3 have quite similar patterns and intersect a couple of times. But TX_2 lies above them, with maximum sales and more disparity compared to the other two. In the later parts, we see that TX_3 is growing rapidly and approaching TX_2. Hence, from this we can conclude that sales for TX_3 are increasing at the fastest pace.

Conclusion:

Hence, by just plotting a few simple graphs we are able to get to know our dataset quite well. It's just a matter of the questions you want to ask of the data. The plots will give you the answers.

I hope this has given you an idea of how to do simple EDA. You can find the complete code in my GitHub repository.

To know how we can do forecasting on this data using basic statistical approaches, check out this blog.

