We will be creating an end-to-end case study. I have divided this blog into a series of two blogs.
Table of Contents-
- Problem Statement
- Real-world objective/constraint
- Data source and overview
- Performance Metrics
- Existing Solution
- First cut Approach
- Exploratory Data Analysis
1. Problem statement-
This is a Kaggle competition problem which was held nearly 4 years ago. In this case study, we will be focusing on a time series problem. Let’s quickly define a time series: a time series is a set of observations recorded over regular intervals of time. Time series are beneficial in many fields, like stock market prediction and weather forecasting, and come in handy in many tasks, like analysis, classification, and, most importantly, forecasting. In this case study we will focus on analysis and forecasting.
This case study focuses on predicting future values for multiple time series. Each time series contains the daily traffic on a Wikipedia page for a total of 803 days, from 2015–07–01 to 2017–09–10. We have a total of 145k time series, which means we have data for 145k pages. Our goal is to analyze this data, build a model on it, and predict future traffic on each of the pages for 62 days, from 2017–09–13 to 2017–11–13 (well, it is not actually the future, but when this competition was held, it was!)
2. Real-world objective/constraint-
Minimize the difference between actual and predicted values.
There are no particular latency requirements, but predicting for a particular date shouldn’t take hours; up to 20–30 seconds should be acceptable.
3. Data Overview-
The training dataset has a total of 145k time series for Wikipedia pages, and it contains the daily traffic on those pages from July 2015 to September 2017. We have to predict the daily traffic on each of the pages from 13th September 2017 to 13th November 2017. We are given three files in total: train.csv, key.csv, and sample_submission.csv.
train.csv contains about 145k rows, each of which represents a different Wikipedia page, and it has 804 columns. Except for the first column, each column represents a date and holds the daily traffic on that page for that date. The first column contains the name of the page, which encodes the language of the Wikipedia page (for example, en.wikipedia.org for English, es.wikipedia.org for Spanish, zh.wikipedia.org for Chinese), the type of access (all-access, desktop, mobile-web), and the agent (all-agents, spider). For example, one name is ‘AKB48_zh.wikipedia.org_all-access_spider’.
The second file is ‘key.csv’, which has as many rows as the number of predictions we have to make. There are two columns in this file: ‘page name’ and ‘id’. For each page name present in the ‘train.csv’ file, there are 62 rows in key.csv, corresponding to the 62 days of predictions for that page, and each row has a corresponding id.
4. Performance Metrics-
We will be using SMAPE (Symmetric Mean Absolute Percentage Error) as our performance metric. It is often used in forecasting problems and has a range of [0, 200]. As it is not a built-in metric in Python, we will implement it from scratch and use it as a custom metric to evaluate our models.
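A minimal sketch of the metric we will implement. The convention of counting a pair as 0 error when both the actual and predicted values are 0 is an assumption here (it matches common practice for this competition):

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric Mean Absolute Percentage Error, range [0, 200].

    Pairs where both actual and predicted values are 0 contribute 0,
    which avoids division by zero (an assumed convention).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    diff = np.abs(y_true - y_pred)
    # Where the denominator is 0, both values are 0, so the error is 0.
    ratio = np.where(denom == 0, 0.0, diff / np.where(denom == 0, 1.0, denom))
    return 100.0 * ratio.mean()
```

A perfect prediction gives 0, and predicting 0 for a non-zero actual gives the maximum of 200.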
5. Existing solutions-
In this solution, all values are transformed with log1p. This makes the different time series more comparable, i.e. smaller and larger values are brought closer together, and it pushes the distribution more towards normal.
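A small illustration of this transform, using NumPy’s `log1p` and its inverse `expm1` (the sample values are made up):

```python
import numpy as np

# log1p compresses large values while handling zero counts safely;
# expm1 inverts the transform after prediction.
views = np.array([0.0, 5.0, 120.0, 9800.0])
log_views = np.log1p(views)      # log(1 + x), so 0 stays 0
restored = np.expm1(log_views)   # recovers the original scale
```

Training on `log_views` and applying `expm1` to the model output gives predictions back on the original view-count scale.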
A new feature called ‘page popularity’ is generated: the median of the time series values. This feature helps capture the scale of a time series; for example, one series can have values between 5–20 while another has values between 50–100. This solution used an attention-mechanism-based RNN, a seq2seq model, for prediction. The model has two main parts, an encoder and a decoder, and uses a cuDNN GRU in the encoder rather than the usual TensorFlow RNN cell.
In this solution, medians over different window sizes were used to make predictions. The window sizes follow a Fibonacci-like series starting from 6 and 12, then 18, 30, 48, 78, and so on. To make a prediction, the median of the last 6 days, 12 days, 18 days, …… is taken, and then the median of all these values. If we don’t have enough data, the median of all available data is taken. This solution is simple but performed quite well.
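A minimal sketch of this median-of-medians idea, assuming the Fibonacci-like windows described above (the function name and fallback behaviour are mine, not from the original solution):

```python
import numpy as np

def fib_median_forecast(series):
    """Baseline forecast: median of the medians over trailing windows
    of Fibonacci-like sizes 6, 12, 18, 30, 48, 78, ...

    `series` is a 1-D array of daily views. Windows longer than the
    series are skipped; with too little history, the median of all
    available data is used instead.
    """
    series = np.asarray(series, dtype=float)
    n = len(series)
    windows, a, b = [], 6, 12
    while a <= n:                      # collect window sizes that fit
        windows.append(a)
        a, b = b, a + b
    if not windows:                    # not enough data for even 6 days
        return float(np.median(series))
    medians = [np.median(series[-w:]) for w in windows]
    return float(np.median(medians))
```

The same scalar forecast would then be repeated for each of the 62 future days.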
6. First cut Approach-
We will focus on the feature engineering part, as we will solve this case study with machine learning models first and then try deep learning models. Deep learning models are powerful enough to learn features on their own, but we might need to provide features explicitly for machine learning models. I will try to generate features that capture weekly, quarterly, and yearly patterns. We will start with a basic median-based model as our baseline. After that, we will try models like XGBoost and GBDT, as they have proven to be a good option in almost any setting.
We have a total of 26 months of training data and we have to predict the next 62 days. We will generate features for the last 15 days, including rolling-window features. Other feature engineering techniques that will be tried are as follows-
(i) Median of (t-4 months, t-8 months, t-12 months, t-16 months), where each of these values is itself the median of the values 2 days before and after. This feature will help us capture quarterly trends.
(ii) Median of the last 5 days
(iii) Median of the last 5 same weekdays; this feature will help us capture the weekly trend.
(iv) To capture the yearly trend, we will take the median of t-363, t-364, t-365, t-366, and t-367
(v) Fourier transformation will also be used.
(vi) Taking the median over Fibonacci-series window sizes can be good for our model; this will also be tested.
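A few of the features above can be sketched for a single series as follows. All names are illustrative, and the exact windows may change during the actual implementation:

```python
import numpy as np
import pandas as pd

def basic_features(ts):
    """Hand-crafted features for one daily series `ts`
    (a pd.Series of views indexed by date).
    """
    feats = {}
    # (ii) median of the last 5 days
    feats["median_last_5"] = ts.iloc[-5:].median()
    # (iii) median of the last 5 occurrences of the same weekday
    # as the next day to forecast (one day after the series ends)
    target_day = (ts.index[-1] + pd.Timedelta(days=1)).dayofweek
    same_weekday = ts[ts.index.dayofweek == target_day]
    feats["median_last_5_weekdays"] = same_weekday.iloc[-5:].median()
    # (iv) yearly feature: median of t-363 ... t-367
    if len(ts) >= 367:
        feats["median_year_ago"] = ts.iloc[-367:-362].median()
    return feats
```

Each feature is a single number per series, so the same function can be applied row-wise over the training frame.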
All the above features will be generated after going through basic time series models first. Different methods will also be applied to make the time series more stationary. Detailed EDA will be done and the feature engineering will be adjusted accordingly. Preprocessing like missing value imputation will also be performed.
7. Exploratory Data Analysis-
First, we will fill in all the missing values (if any). So, let’s check whether there are any missing values in the data.
The above image shows the count of missing values for a few dates only, but it gives an idea that there are a lot of missing values in the data. For some pages, data is present only for the last few months and the rest is null. This is likely because those wiki pages were created later, so data is present only for the days after the page was created. So now we have two situations: one where the data is actually missing, and the other where the page did not yet exist. When the data is actually missing, it is missing for a few scattered dates; when the page did not yet exist, there is no data right from the beginning and then suddenly all the data for that page appears.
To fill the missing values in the first situation we will use linear interpolation; for the second situation we will fill them with 0.
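A sketch of this two-case imputation for a single page’s series (the function name is mine; the heuristic treats all NaNs before the first recorded value as the “page did not exist yet” case):

```python
import numpy as np
import pandas as pd

def fill_series(row):
    """Fill missing daily views for one page (a pd.Series over dates).

    Leading NaNs (before the page existed) become 0; gaps inside the
    recorded history are linearly interpolated.
    """
    first_valid = row.first_valid_index()
    if first_valid is None:                 # page has no data at all
        return row.fillna(0)
    # interpolate internal gaps; leading NaNs are left untouched
    filled = row.interpolate(method="linear", limit_direction="forward")
    filled.loc[:first_valid] = filled.loc[:first_valid].fillna(0)
    return filled
```

Applied row-wise over train.csv, this fills real gaps smoothly while keeping pre-creation days at zero.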
As we have more than 145k time series, analyzing all of them individually is just impossible; we have to find a way to divide these time series into groups and then analyze each group. Remember, the name of each page contains the language of the page; we can use it to group different types of pages and then analyze them. So, let’s get started!
Is traffic affected by Page language?
We look for the different languages present in the Wikipedia page names using simple regular expressions, since language might affect the traffic. There are some non-Wikipedia pages (these are Wikimedia pages); we give them the code ‘na’ since we haven’t determined their language. Many of these are things like images that do not really have a language.
There are 7 languages plus the media pages. The languages used here are: English, Japanese, German, French, Chinese, Russian, and Spanish.
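A minimal sketch of the regular expression used for this extraction (the function name is mine):

```python
import re

def get_language(page):
    """Extract the two-letter language code from a page name such as
    'AKB48_zh.wikipedia.org_all-access_spider'.

    Wikimedia pages have no '<lang>.wikipedia.org' part,
    so they get the code 'na'.
    """
    match = re.search(r"([a-z][a-z])\.wikipedia\.org", page)
    return match.group(1) if match else "na"
```

Mapping this function over the ‘Page’ column gives one language label per series, which is enough to build the per-language groups used below.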
We create data frames for the different types of entries and then calculate the sum of all views. Since the data comes from several different sources, the sum will likely double-count some of the views.
Now, let’s see what we get
English shows a much higher number of views per page. There is also a lot more structure here than expected. The English and Russian plots show very large spikes around day 400 (around August 2016), with several more spikes in the English data later in 2016. My guess is that this is the combined effect of the Summer Olympics in August and the election in the US.
There’s also a strange feature in the English data around day 200.
The Spanish data is very interesting too. There is a clear periodic structure there, with an approx 1-week fast period and what looks like a significant dip around every 6 months or so.
Periodic Structure and FFTs
Since it looks like there is some periodic structure here, we plot each of these separately so that the scale is more visible. Along with the individual plots, I will also look at the magnitude of the Fast Fourier Transform (FFT). Peaks in the FFT show us the strongest frequencies in the periodic signal.
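A sketch of how such an FFT magnitude spectrum can be computed with NumPy (the function name is mine; the synthetic 7-day sine is only a sanity check, not real traffic):

```python
import numpy as np

def fft_magnitude(daily_views):
    """Magnitude spectrum of a daily series.

    Returns (freqs, mag) where freqs are in cycles per day;
    a weekly pattern shows up as a peak near frequency 1/7.
    """
    x = np.asarray(daily_views, dtype=float)
    x = x - x.mean()                       # remove the DC component
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0)  # sample spacing = 1 day
    return freqs, mag

# sanity check on a pure 7-day cycle: the spectrum peaks at ~1/7
t = np.arange(700)
freqs, mag = fft_magnitude(np.sin(2 * np.pi * t / 7))
```

Plotting `mag` against `freqs` for each language gives the spectra discussed below.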
Plot for English Wikipedia-
For the first few months there were only 1 or 2 spikes in the data, but after July 2016 there was a rapid rise, after which traffic went down by a large margin.
Small peaks appear in the last quarter.
Plot for Japanese Wikipedia-
Japanese Wikipedia shows some large spikes during January 2016 and we can see that traffic has gone down by a very large margin during September 2015.
It is very similar to the plots we saw earlier, with peaks at 120 and 230 days.
Plot for German Wikipedia-
German Wikipedia has the most spikes among all the languages. We can see that this makes an ‘M’-shaped structure: it goes upward, then downward, then upward again, then downward again.
As expected there are peaks at 120 and 230 days, but surprisingly there is also a peak at around 340 days.
Plot for Wikimedia-
This plot is for media pages like images. We can see that until April 2016 the data doesn’t show any spikes, and after that it has regular spikes, including some very large ones.
Regular peaks are there in the data but no large peaks like in the previous graphs.
Plot for French Wikipedia-
French Wikipedia shows an upward trend. Spikes can also be seen regularly but there is a large spike around March 2016.
This is also very similar to the previous plot, with peaks at 120, 230, and 340 days. At the end of every quarter there were rises.
Plot for Chinese Wikipedia-
Like other languages, this one also has an upward trend during the first few months, and we can see a very large spike during January 2016.
Very similar to the previous language plot except that it has a comparatively smaller peak at 340 days
Plot for Russian Wikipedia-
Russian Wikipedia does not show a large upward or downward trend, but there is a very large spike during Q3 2016. Other than that, it has a few spikes here and there, but not as many as the other languages.
This has a peak at 120 days but no major peak is seen at 230 days.
Plot for Spanish Wikipedia-
We can see a nice weekly pattern in the data, with regular spikes every 7 days. One interesting observation from this plot is that traffic usually goes up during late Q3 or early Q4, and in November 2016 there is a very large spike.
This plot shows peaks at 120 and 230 days, which indicates quarterly periodicity. Small peaks can be seen regularly; this can be because a 7-day lag shows a high correlation.
Looking at both sets of plots, the Spanish data has the strongest periodic features; most of the other languages show some periodicity as well. For some reason, the Russian and media data do not seem to show much. Red lines are plotted where periods of 1, 1/2, and 1/3 week would appear; we see that the periodic features are mainly at 1 and 1/2 weeks. This is not surprising, since browsing habits may differ on weekdays compared to weekends, leading to peaks in the FFTs at frequencies of n/(1 week) for integer n. It also shows that page views are not at all smooth. There is some regular variation from day to day, but large effects can also happen quite suddenly. A model likely will not be able to predict the sudden spikes unless it is fed more information about what is currently going on.
Each article name has the following format: ‘name_project_access_agent’. It would be a good idea to separate out these 4 features to get a better understanding of the data.
As we can see, there are 9, 3, and 2 unique values for project, access, and agent respectively. Let’s look at these values.
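A sketch of this split. Since the article name itself may contain underscores, we split from the right, where the last three underscore-separated pieces are always project, access, and agent (the function name is mine):

```python
def parse_page(page):
    """Split 'name_project_access_agent' into its 4 parts.

    The article name can contain underscores, so we take the
    rightmost 3 underscores as separators.
    """
    name, project, access, agent = page.rsplit("_", 3)
    return name, project, access, agent
```

Applying this to every page name yields the four columns analyzed in the plots that follow.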
Let’s plot the project-wise monthly mean hits.
Now, Let’s see the plot
The English Wikipedia project gets more hits than any other project. Also, Russian Wikipedia has the same rise near August 2016 as English Wikipedia.
With the English projects in the graph, it is hard to visualize the other projects, so let’s separate out the English projects and look for patterns.
It clearly shows that media pages are rarely visited by users, but there are some rises in the Russian pages around June 2016.
The pages are accessed through all access types, but desktop much more heavily than mobile; people prefer to access the pages through the desktop.
We could not see a clear pattern in the above graph. With just two values for agent, let’s plot them in two separate graphs and see how they behave.
The ‘all-access’ and ‘all-agents’ values for access and agent are the sums of the values of the respective attributes. So every value other than all-access contributes to the trend of all-access, and every value other than all-agents contributes to the trend of all-agents.
We have seen the kinds of patterns our data exhibits. Now we will try to answer some questions based on our data which can prove helpful at later stages of our project.
Which month has more visits on average?
From here onwards, for every question we try to answer, the “average” will be based on the median, not the mean.
So, index 4 is for May, which shows that on average May has the most visits. In comparison to July, which has the fewest average visitors, May shows a significant 25% growth.
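A hypothetical sketch of how such median-based monthly averages can be computed, assuming `df` is the train.csv frame with a ‘Page’ column followed by one column per date string (here pandas indexes months 1–12 rather than from 0):

```python
import numpy as np
import pandas as pd

def monthly_medians(df):
    """Median daily views per calendar month.

    Takes the median over pages for each day, then the median
    over days within each month.
    """
    views = df.drop(columns=["Page"])
    views.columns = pd.to_datetime(views.columns)
    daily = views.median(axis=0)                 # one value per day
    return daily.groupby(daily.index.month).median()
```

The same group-then-median pattern works for weekdays (`daily.index.dayofweek`) to answer the weekday question below.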
We can see that on average Sunday has the most visitors, but all the days are pretty close, without much difference.
Does the holiday season affect the number of average visitors?
As the last 3 months of the year have the most festivals all over the world, let’s check whether this affects the average number of visitors.
There is some difference here: there is actually a 7.3% dip in the average number of visitors during the holiday season.
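A sketch of this comparison, assuming the holiday season means October through December and that we already have a per-day median series (the function name is mine):

```python
import numpy as np
import pandas as pd

def holiday_dip(daily_median):
    """Percent change of median traffic in Oct-Dec versus the rest
    of the year. `daily_median` is a pd.Series of per-day median
    views indexed by date; negative output means a dip.
    """
    is_holiday = daily_median.index.month.isin([10, 11, 12])
    holiday = daily_median[is_holiday]
    rest = daily_median[~is_holiday]
    return 100.0 * (holiday.median() - rest.median()) / rest.median()
```

A return value of about -7.3 would correspond to the dip observed above.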
Conclusion- It actually makes a difference; during the holiday season, Wikipedia pages have less traffic in comparison to other months of the year.
This is the end of our EDA. We have analyzed our data in depth, which will be useful at later stages of our case study. With this, we conclude this part of the series.