In this ‘story’ I want to show my process through this beginner Kaggle competition. I’ve applied some of the main concepts from the book “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow” and from the Time Series course on Kaggle.
Goal of the Competition
In this “getting started” competition, you’ll use time-series forecasting to forecast store sales on data from Corporación Favorita, a large Ecuadorian-based grocery retailer.
Specifically, you’ll build a model that more accurately predicts the unit sales for thousands of items sold at different Favorita stores. You’ll practice your machine learning skills with an approachable training dataset of dates, store, and item information, promotions, and unit sales.
1. Look at the big picture
- Framing the problem: Since all the data is already labeled and we are expected to predict numerical values, we can tell this is a supervised learning problem and, more specifically, a regression task.
- Selecting a performance measure: We are already given the specific evaluation metric for this competition, which is the Root Mean Squared Logarithmic Error (RMSLE).
2. Get the data
- This is one of the most crucial parts of any ML project. Thanks to Kaggle we are provided with 6 csv files (train, test, stores, oil, holidays_events, transactions) containing data about products, stores, promotions, sales, oil prices, holidays and more. Our target: sales.
- We don’t need to set aside a test set, since Kaggle already provides one in test.csv. We can start working directly on the data from train.csv.
We start by fetching the training data from the csv file and creating a pandas DataFrame called train_df:
train_df = pd.read_csv('../input/store-sales-time-series-forecasting/train.csv', parse_dates=['date'], infer_datetime_format=True)
3. Take a Quick Look at the Data Structure
For now we will just consider the data from train.csv to get a basic understanding of its structure, but the data provided in the other csv files will also be useful and necessary.
Let’s start taking a look at the top five rows of the training dataset using the head() method:
Each row represents the sales of a specific product (‘family’) in a specific store (‘store_nbr’) on a single day (‘date’). The entire dataset comprises almost 1800 series recording store sales across a variety of product families from 2013 into 2017.
Let’s get a quick description of the data (total number of rows, each attribute’s type):
All attributes are numerical except the family field (and date, whose type is datetime). The family field’s type is object, in this case a text attribute. Its values are repetitive, which means they are categorical values for the products: “automotive”, “baby care”, “beauty”, etc. We will later have to encode these attributes so the ML models can learn from them.
We can notice 33 different families with 90936 values each.
4. Discover and Visualize the Data to Gain Insights. Prepare the Data and Train a Simple Linear Regression Model
So far we have only taken a quick glance at the data to get a general understanding of it. Now the goal is to go a little bit more in depth.
To begin with, we create a copy of the train_df dataframe and structure it with its date, store number and family as indexes:
Then, compute the average sales grouped by day and store it as a series:
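The reshaping above might look like the following sketch. The tiny frame stands in for the real train.csv (which has around three million rows); the column names follow the competition's files, but the values are made up:

```python
import pandas as pd

# Tiny synthetic stand-in for train.csv (columns per the competition files).
train_df = pd.DataFrame({
    'date': pd.to_datetime(['2017-01-01'] * 2 + ['2017-01-02'] * 2),
    'store_nbr': [1, 2, 1, 2],
    'family': ['AUTOMOTIVE'] * 4,
    'sales': [10.0, 20.0, 30.0, 50.0],
})

# Index by date, store number and family, then average sales per day.
store_sales = train_df.set_index(['date', 'store_nbr', 'family']).sort_index()
average_sales = store_sales.groupby('date')['sales'].mean()
```

The groupby collapses all stores and families into one daily average series, which is what the plots below are based on.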
Let’s see if we can discover some kind of trend in the average sales for the period between 2013–2017.
Time-step Features
We add a time-step feature on the average_sales dataframe and fit a Linear Regression Model. We then plot the fitted values over time:
Time-step features let you model time dependence. In the average sales series, we can predict that sales later in the period are generally higher than sales earlier in the period.
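A minimal sketch of the time-step idea, using a synthetic trending series in place of the real average_sales (the 'time' column name is illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic daily series with a steady upward trend (stand-in for average_sales).
idx = pd.date_range('2017-01-01', periods=60, freq='D')
average_sales = pd.Series(100 + 0.5 * np.arange(60), index=idx, name='sales')

df = average_sales.to_frame()
df['time'] = np.arange(len(df))  # the time-step (dummy time) feature

X = df[['time']]
y = df['sales']

model = LinearRegression()
model.fit(X, y)
fitted = pd.Series(model.predict(X), index=df.index)  # values to plot over time
```

On this perfectly linear toy series the model recovers the 0.5-per-day slope exactly; on the real data the fitted line only captures the overall time dependence.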
Lag features
We now replace the time-step feature with a lag feature and again fit a Linear Regression model. Plot it:
Generally, lag features let you model serial dependence. A time series has serial dependence when an observation can be predicted from previous observations. You can see from the lag plot that sales on one day (sales) are correlated with sales from the previous day (Lag_1); when you see a relationship like this, you know a lag feature will be useful. In Sales, we can predict that high sales on one day usually mean high sales the next day.
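A lag feature is just the series shifted by one step. A minimal sketch on synthetic data (the lag_1 column name is illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic daily series where each day is 1 higher than the previous.
idx = pd.date_range('2017-01-01', periods=30, freq='D')
sales = pd.Series(np.linspace(100, 129, 30), index=idx, name='sales')

df = sales.to_frame()
df['lag_1'] = df['sales'].shift(1)  # yesterday's sales as a feature
df = df.dropna()  # the first row has no previous day, so drop it

X = df[['lag_1']]
y = df['sales']
model = LinearRegression()
model.fit(X, y)
```

Because the toy series grows by exactly 1 per day, the fit recovers sales = lag_1 + 1; on real data the lag plot scatter shows how strong the serial dependence actually is.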
Adapting machine learning algorithms to time series problems is largely about feature engineering with the time index and lags. We will use linear regression for its simplicity, but these features will be useful whichever algorithm you choose for your forecasting task.
The best time series models will usually include some combination of time-step features and lag features.
Moving Average Plots
To see what kind of trend a time series might have, we can use a moving average plot. The idea is to smooth out any short-term fluctuations in the series so that only long-term changes remain.
Let’s make a moving average plot to see what kind of trend this series has. Since this series has daily observations, let’s choose a window of 365 days to smooth over any short-term changes within the year. First use the rolling method to begin a windowed computation and then follow this by the mean method to compute the average over the window.
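The rolling computation could look like this, on a synthetic stand-in series; the 365-day window follows the choice described above, and min_periods keeps the edges of the plot from going missing:

```python
import numpy as np
import pandas as pd

# Two years of synthetic daily data: a linear trend plus a weekly pattern.
idx = pd.date_range('2015-01-01', periods=730, freq='D')
series = pd.Series(
    0.1 * np.arange(730) + np.tile([0, 1, 2, 3, 4, 5, 6], 105)[:730],
    index=idx,
)

# 365-day centered moving average; min_periods=183 (about half a window)
# lets the average be computed near the start and end of the series too.
moving_average = series.rolling(window=365, center=True, min_periods=183).mean()
```

Plotting moving_average over the original series shows the long-term trend with the weekly fluctuations smoothed away.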
As we can see, the trend of average sales appears to be about cubic.
Trend Feature
We’ll use DeterministicProcess to create a feature set for a cubic trend model, and also create features for a 90-day forecast. To make a forecast, we apply our model to “out of sample” features, meaning times outside the observation period of the training data. Here’s how we could make a 90-day forecast:
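A sketch of the trend features with statsmodels' DeterministicProcess, on a synthetic series; order=3 gives the cubic trend terms, and out_of_sample supplies the 90-day forecast features:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.deterministic import DeterministicProcess

# Synthetic stand-in series with a smooth polynomial trend.
idx = pd.date_range('2016-01-01', periods=200, freq='D')
y = pd.Series(0.001 * np.arange(200) ** 2, index=idx)

# order=3 -> constant, linear, quadratic and cubic trend features.
dp = DeterministicProcess(index=idx, constant=True, order=3, drop=True)
X = dp.in_sample()                    # features over the training period
X_fore = dp.out_of_sample(steps=90)   # features for the 90 days after it

model = LinearRegression()
model.fit(X, y)
y_fore = pd.Series(model.predict(X_fore), index=X_fore.index)
```

Because the toy series is itself a polynomial, the 90-day extrapolation continues it almost exactly; on real data the forecast only continues the fitted trend curve.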
You can see a plot of the result:
The trend discovered by our LinearRegression model is almost identical to the moving average plot, which suggests that a cubic trend was the right decision in this case.
These trend models we learned about turn out to be useful for a number of reasons. Besides acting as a baseline or starting point for more sophisticated models, we can also use them as a component in a “hybrid model” with algorithms unable to learn trends (like XGBoost and random forests).
Seasonality
We say that a time series exhibits seasonality whenever there is a regular, periodic change in the mean of the series. Seasonal changes generally follow the clock and calendar — repetitions over a day, a week, or a year are common. Seasonality is often driven by the cycles of the natural world over days and years or by conventions of social behavior surrounding dates and times.
We will learn two kinds of features that model seasonality. The first kind, indicators, is best for a season with few observations, like a weekly season of daily observations. The second kind, Fourier features, is best for a season with many observations, like an annual season of daily observations.
We will include in our analysis the data from the holidays_events.csv file, which includes information about holidays and important events, with metadata:
Seasonal Plots
A seasonal plot shows segments of the time series plotted against some common period, the period being the “season” you want to observe.
We will examine the following seasonal plot to try to discover seasonal patterns:
There is a clear weekly seasonal pattern in this series, with sales higher on weekends.
Periodograms
The periodogram tells you the strength of the frequencies in a time series.
Let’s also take a look at the following periodogram.
From left to right, the periodogram drops off after Monthly (twelve times a year). That is why we’ll choose 12 Fourier pairs to model the annual season. We ignore the Weekly frequency, since it’s better modeled with indicators.
Both the seasonal plot and the periodogram suggest a strong weekly seasonality. From the periodogram, it appears there may be some monthly and biweekly components as well. (In fact, the notes to the Store Sales dataset say wages in the public sector are paid out biweekly, on the 15th and last day of the month — a possible origin for these seasons).
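A periodogram can be computed with scipy.signal.periodogram. Here is a sketch on a synthetic series with a weekly cycle; fs=365 expresses frequencies in cycles per year, so the weekly peak lands near 52:

```python
import numpy as np
from scipy.signal import periodogram

# Two years of synthetic daily data with a weekly cycle plus noise.
n = 730
t = np.arange(n)
series = 10 * np.sin(2 * np.pi * t / 7) + np.random.default_rng(0).normal(0, 1, n)

# fs=365 -> frequencies in cycles per year; these settings mirror the
# "strength per frequency" view described in the text.
freqs, spectrum = periodogram(
    series, fs=365, window='boxcar', detrend='linear', scaling='spectrum'
)
peak_freq = freqs[spectrum.argmax()]  # should sit near 365/7 ≈ 52
```

Plotting spectrum against freqs gives the periodogram; the tallest spike marks the dominant seasonal frequency.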
Seasonal Features
We’ll create our seasonal features using DeterministicProcess, the same utility we used to create trend features.
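A sketch of the seasonal feature set on synthetic data. The idea is to combine weekday indicators (seasonal=True) with 12 annual Fourier pairs; here I use statsmodels' plain Fourier term with a 365.25-day period as a stand-in for a calendar-based annual term:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.deterministic import DeterministicProcess, Fourier

# Synthetic daily series with an annual cycle.
idx = pd.date_range('2016-01-01', periods=400, freq='D')
y = pd.Series(np.sin(2 * np.pi * idx.dayofyear / 365.25), index=idx)

fourier = Fourier(period=365.25, order=12)  # 12 sin/cos pairs for the year
dp = DeterministicProcess(
    index=idx,
    constant=True,
    order=1,                    # linear trend
    seasonal=True,              # day-of-week indicators (period 7 for daily data)
    additional_terms=[fourier], # annual Fourier features
    drop=True,                  # drop terms that would be collinear
)
X = dp.in_sample()
model = LinearRegression().fit(X, y)
```

The annual cycle in the toy series is captured almost perfectly by the first Fourier pair, which is exactly what the 12-pair budget is meant to allow on the real data.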
We then fit a Linear Regression Model
and finally make a 90-day forecast to see how our model extrapolates beyond the training data:
Deseasonalizing or Detrending
Removing from a series its trend or seasons is called detrending or deseasonalizing the series.
Look at the periodogram of the deseasonalized series:
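Deseasonalizing amounts to subtracting the seasonal model's fitted values from the series. A sketch on a synthetic weekly-seasonal series:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.deterministic import DeterministicProcess

# 20 weeks of synthetic daily data: a fixed weekly pattern plus noise.
idx = pd.date_range('2017-01-01', periods=140, freq='D')
rng = np.random.default_rng(1)
weekly = np.tile([0.0, 1.0, 2.0, 3.0, 4.0, 8.0, 9.0], 20)
y = pd.Series(50 + weekly + rng.normal(0, 0.5, 140), index=idx)

# Fit a model on seasonal (day-of-week indicator) features only...
dp = DeterministicProcess(index=idx, constant=True, seasonal=True, drop=True)
X = dp.in_sample()
model = LinearRegression().fit(X, y)

# ...and subtract the fitted seasonal component from the series.
y_deseason = y - model.predict(X)
```

What remains is roughly the noise: the weekly swings are gone, which is why the deseasonalized periodogram loses its weekly spike.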
Holidays Dataframe
We create a dataframe with data about Regional and National Ecuadorian Holidays from 2017.
From a plot of the deseasonalized Average Sales, it appears these holidays could have some predictive power.
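The holiday filtering might be sketched like this on a tiny stand-in frame; the rows are made up, but the columns follow the competition's data description:

```python
import pandas as pd

# Tiny synthetic stand-in for holidays_events.csv.
holidays_events = pd.DataFrame({
    'date': pd.to_datetime(['2017-01-01', '2017-04-01', '2017-05-01']),
    'type': ['Holiday', 'Holiday', 'Holiday'],
    'locale': ['National', 'Regional', 'Local'],
    'locale_name': ['Ecuador', 'Cotopaxi', 'Puyo'],
    'description': ['Primer dia del ano', 'Provincializacion de Cotopaxi', 'Fundacion de Puyo'],
    'transferred': [False, False, False],
}).set_index('date')

# Keep only National and Regional holidays from 2017, as in the text.
holidays = holidays_events[
    holidays_events['locale'].isin(['National', 'Regional'])
].loc['2017']
```

Partial-string indexing with .loc['2017'] works because the frame is indexed by date, so the filter reads close to the prose description.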
Create Holiday Features
We use scikit-learn’s OneHotEncoder to create holiday features
and then join them to the training data we already had.
Now we fit the seasonal model with the addition of the holiday features expecting the fitted values to improve. Plot it:
First Submission
Let’s create a seasonal model of the kind we’ve learned about for the full store sales dataset with all 1800 time series:
Let’s see some of its predictions:
Finally, let’s load the test data, create a feature set for the forecast period, and then create the submission file submission.csv:
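The submission step might be sketched like this; the test rows and predictions are made up, and I clip negative predictions to zero since sales can't be negative (an assumption of mine, not a step stated in the text):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the competition's test.csv (real ids and rows differ).
test_df = pd.DataFrame({
    'id': [0, 1, 2],
    'date': pd.to_datetime(['2017-08-16'] * 3),
    'store_nbr': [1, 1, 1],
    'family': ['AUTOMOTIVE', 'BABY CARE', 'BEAUTY'],
})
y_pred = np.array([4.2, -0.3, 1.7])  # hypothetical model output

# Two columns, id and sales, as the competition expects.
submission = pd.DataFrame({
    'id': test_df['id'],
    'sales': np.clip(y_pred, 0.0, None),
})
submission.to_csv('submission.csv', index=False)
```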
We now have to follow some steps in order to successfully submit the results to the competition!
Conclusion
We have analysed the structure of the data, gained some important insights about trend and seasonality, and created features for both to train basic Linear Regression models.
There’s still more we can do with time series to improve our forecasts. In the next ‘story’ we’ll learn how to use time series themselves as features. Using time series as inputs to a forecast lets us model another component often found in series: cycles.
We will also improve our current models by combining what we’ve got into “Hybrid Models”.