Hello readers! From the title you must have gotten an idea of what this blog is about: we will be building an end-to-end case study. This is my first time writing a blog; I hope you all like it. I have divided this case study into a series of two blogs.
Table of Contents-
- Problem Statement
- Real world objective/constraint
- Data source and overview
- Performance Metrics
- Existing Solution
- First cut Approach
- Exploratory Data Analysis
Let’s get started!
1. Problem statement-
This is a Kaggle competition problem that was held nearly 3 years ago. In this case study, we will be focusing on a time series problem. For those of you who are not familiar with the term, a time series is a set of observations recorded over regular intervals of time. Time series are beneficial in many fields, such as stock market prediction and weather forecasting, and come in handy for many tasks like analysis, classification and, most importantly, forecasting. In this case study we will focus on analysis and forecasting.
This case study focuses on predicting future values for multiple time series. Each time series contains the daily traffic on a Wikipedia page for a total of 803 days, from 2015-07-01 to 2017-09-10. We have a total of about 145k time series, which means we have data for about 145k pages. Our goal is to analyse this data, build a model on it and predict the future traffic on each page for 62 days, from 2017-09-13 to 2017-11-13 (well, it is not actually the future now, but when this competition was held, it was!).
2. Real world objective/constraint-
Minimize the difference between actual and predicted values.
There are no strict latency requirements, but predicting for a particular date should not take hours; up to 20-30 seconds is acceptable.
3. Data Overview-
As this is a Kaggle competition, the source of the dataset is Kaggle itself. We are given three files in total: train.csv, key.csv and sample_submission.csv.
train.csv contains about 145k rows, each of which represents a different Wikipedia page, and 804 columns. Except for the first column, each column represents a date and holds the daily traffic on that particular Wikipedia page. The first column contains the name of the page, which encodes the language of the Wikipedia page (for example, en.wikipedia.org for English, es.wikipedia.org for Spanish, zh.wikipedia.org for Chinese) + the type of access (desktop, all-access) + the agent (spider, actual traffic). For example, one name is 'AKB48_zh.wikipedia.org_all-access_spider'.
The second file is 'key.csv', which has a number of rows equal to the number of predictions we have to make. There are two columns in this file: 'page name' and 'id'. For each page name present in the 'train.csv' file, there are 62 rows in key.csv; these 62 rows correspond to the 62 days of predictions for that page, and each row has a corresponding id.
Example data point-
4. Performance Metrics-
We will be using SMAPE (Symmetric Mean Absolute Percentage Error) as our performance metric. It is often used in forecasting problems and has a range of [0, 200]. As it is not a built-in metric in Python, we will implement it from scratch and use it as a custom metric to evaluate our models.
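A minimal from-scratch sketch of SMAPE is shown below. The zero-denominator convention (terms where both actual and predicted are 0 count as 0) is an assumption here; check it against the competition's exact definition.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric Mean Absolute Percentage Error, range [0, 200]."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    diff = np.abs(y_true - y_pred)
    ratio = np.zeros_like(diff)
    mask = denom != 0
    ratio[mask] = diff[mask] / denom[mask]   # 0/0 terms stay 0 by convention
    return 100.0 * ratio.mean()
```

A perfect prediction gives 0, while predicting 0 against a non-zero actual gives the maximum of 200.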
5. Existing solutions-
In this solution, all values were transformed with log1p. This helps make every time series similar in scale, i.e. smaller and larger values are brought closer together, and pushes the distribution towards normal.
A new feature called 'page popularity' was generated: the median of the time series values. This feature helps capture the scale of a time series; for example, one series may have values between 5-20 and another between 50-100. This solution used an attention-based RNN.
These are the two key takeaways from this solution.
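Both takeaways can be sketched in a few lines of pandas. The toy frame below is hypothetical, just to show the shape of the transform:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: rows are pages, columns are daily traffic counts.
traffic = pd.DataFrame(
    [[5, 12, 20, 8], [50, 90, 100, 70]],
    index=["page_small", "page_large"],
)

# log1p compresses the scale so small and large series look more alike,
# and handles zero counts safely (log1p(0) == 0).
log_traffic = np.log1p(traffic)

# 'Page popularity': the median of each transformed series,
# capturing the overall scale of the page's traffic.
page_popularity = log_traffic.median(axis=1)
```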
In this solution, medians over different window sizes were used to make predictions. The window sizes follow a Fibonacci-like series starting from 6 and 12, then 18, 30, 48, 78 and so on. To make a prediction, the median of the last 6 days, 12 days, 18 days, ... is taken, and then the median of all these medians. If there is not enough data, the median of all available data is taken. This solution is simple but performed quite well.
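The median-of-medians scheme described above can be sketched as follows. The window list and fallback rule follow the description here; the original kernel's details may differ.

```python
import numpy as np

FIB_WINDOWS = (6, 12, 18, 30, 48, 78)  # Fibonacci-like window sizes

def fib_median_of_medians(series, windows=FIB_WINDOWS):
    """Predict the next value as the median of trailing-window medians.

    A window longer than the available history automatically falls back
    to the median of all available data (slicing clips to the start).
    """
    series = np.asarray(series, dtype=float)
    medians = [np.median(series[-w:]) for w in windows]
    return float(np.median(medians))
```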
This solution is based on Kalman filtering. The data was smoothed in two ways: traffic accessed by spiders was smoothed using the Fibonacci median of medians already discussed above, while non-spider traffic was smoothed using Kalman smoothing. The predictions were made using Kalman filters, and this is what we are going to focus on. When I first read this solution, I had no idea about Kalman filters, so I started reading about them and found that this is a completely classical approach, mainly used in the aerospace field to track location, measure speed or monitor engine temperature. Its basic method resembles the gradient descent algorithm: like gradient descent, it updates its estimate over iterations and converges after some of them. We can apply it to our problem too. Covering this algorithm in depth would take a lot of time, so here is the basic idea of how it works-
X_kp = A·X_(k-1) + B·u_k + w_k
P_kp = A·P_(k-1)·Aᵀ + Q_k
U- Control variable matrix (this accounts for how the data is changing; for example, if the data changes only linearly, this matrix will contain 0).
W- Matrix containing the noise in the predicted output.
Q- Covariance matrix of the process noise.
A- Transition matrix, B- Input effect (control) matrix.
To update the estimate after each iteration, the Kalman gain is used:
Kalman gain (K) = error in estimated value / (error in estimated value + error in measured value), 0 ≤ K ≤ 1
EST_t = EST_(t-1) + K[Measurement − EST_(t-1)]
If K is high (i.e. EST_error > MEAS_error), the equation gives more weight to the difference between the measurement and the estimated value, and vice versa.
All the above matrices are also set according to the problem.
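To make the update loop concrete, here is a minimal one-dimensional sketch with A = 1 and B = 0 (a static-state model). This is an illustration under those assumptions, not the competition solution itself:

```python
def kalman_1d(measurements, process_var=1e-3, meas_var=1.0):
    """Minimal 1-D Kalman filter: smooth a noisy series.

    est     -> current state estimate (EST_t)
    est_err -> error in the estimate (P)
    K       -> Kalman gain, always in [0, 1]
    """
    est, est_err = float(measurements[0]), 1.0
    smoothed = []
    for z in measurements:
        est_err += process_var                 # predict: uncertainty grows by Q
        K = est_err / (est_err + meas_var)     # gain = est error / (est + meas error)
        est = est + K * (z - est)              # EST_t = EST_(t-1) + K[meas - EST_(t-1)]
        est_err = (1 - K) * est_err            # shrink the estimate error
        smoothed.append(est)
    return smoothed
```

On a constant series the filter simply holds the value; on a noisy one it converges towards the underlying level over the iterations, which is the gradient-descent-like behaviour described above.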
6. First cut Approach-
We will focus on the feature engineering part, as we will solve this case study with machine learning models first and then try deep learning models. Deep learning models are powerful enough to learn features on their own, but for machine learning models we may need to provide features explicitly. I will try to generate features that can capture weekly, quarterly and yearly patterns. We will start with a basic median-based model as our baseline. After that we will try models like XGBoost and GBDT, as they have proved to be a good option in almost any setting.
We have a total of 26 months of training data and have to predict the next 62 days. We will generate features for the last 15 days, including rolling window features. Other feature engineering techniques that will be tried are as follows-
(i) Median of the last 5 days.
(ii) Median of the last 5 same weekdays; this feature will help capture the weekly trend.
(iii) Median of (t-4 months, t-8 months, t-12 months, t-16 months), where each of these values is itself the median of the values 2 days before and after. This feature will help capture the quarterly trend.
(iv) To capture the yearly trend, we will take the median of t-363, t-364, t-365, t-366 and t-367.
(v) Taking medians over Fibonacci-series window sizes can be good for our model; this will also be tested.
(vi) Fourier transformation will also be used.
Autocorrelation for yearly, quarterly, monthly and weekly patterns will also be checked, and the features will be adjusted accordingly. All of the above features will be generated after going through basic time series models first. Different methods will also be applied to push the time series towards stationarity. Detailed EDA will be done and the feature engineering changed accordingly. Preprocessing such as missing value imputation will also be performed.
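Features (i), (ii) and (iv) above can be sketched with pandas as follows. The synthetic series below stands in for one page's traffic; everything is shifted so that the target day itself never leaks into its own features.

```python
import pandas as pd

# Synthetic daily series standing in for one page's traffic.
s = pd.Series(range(800), index=pd.date_range("2015-07-01", periods=800))

features = pd.DataFrame(index=s.index)

# (i) Median of the last 5 days (shifted by 1 so 'today' is not leaked).
features["median_5d"] = s.shift(1).rolling(5).median()

# (ii) Median of the last 5 same weekdays -> weekly pattern.
same_weekday = pd.concat([s.shift(7 * k) for k in range(1, 6)], axis=1)
features["median_5_same_weekday"] = same_weekday.median(axis=1)

# (iv) Yearly pattern: median of lags t-363 ... t-367.
yearly = pd.concat([s.shift(k) for k in range(363, 368)], axis=1)
features["median_yearly"] = yearly.median(axis=1)
```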
7. Exploratory Data Analysis-
First, we will fill all the missing values (if any). So, let's check whether there are any missing values in the data.
The image above shows the count of missing values for a few dates only, but it gives an idea that there are a lot of missing values in the data. For some pages, data is present only for the last few months; the rest is null. This can be because the wiki page was created later, so data exists only for the days after the page was created. So we have two situations: one where the data is actually missing, and another where the page did not yet exist. When the data is actually missing, it is missing for a few scattered dates; when the page did not yet exist, there is no data right from the beginning and then, suddenly, data appears for every day after.
To fill the missing values in the first situation we will use linear interpolation; in the second situation we will fill them with 0.
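The two-step fill can be sketched with pandas as below. The row of daily counts is hypothetical; the leading NaNs represent the "page did not yet exist" case.

```python
import numpy as np
import pandas as pd

row = pd.Series([np.nan, np.nan, np.nan, 10, np.nan, 14, 16, np.nan, 20])

# Situation 1: gaps inside the series -> linear interpolation.
# (By default, pandas only fills forward, so leading NaNs stay NaN.)
filled = row.interpolate(method="linear")

# Situation 2: leading NaNs (page not yet created) -> fill with 0.
filled = filled.fillna(0)
```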
As we have more than 145k time series, analysing all of them individually is just impossible; we have to find some way to divide these time series into groups and then analyse each group as a whole. Remember, the page name contains the language of the page; we can use it to combine different types of pages and then analyse them. So, let's get started!
These two-letter codes correspond to different languages:
de-German, en-English, es-Spanish, fr-French, ja-Japanese, ru-Russian, zh-Chinese; nt refers to media pages (Wikimedia)
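Extracting the code from a page name can be sketched with a small regex. Using 'nt' for pages that don't match a Wikipedia domain follows the mapping above; the exact label is just a naming choice.

```python
import re

def page_language(page_name):
    """Return the two-letter Wikipedia language code, or 'nt' for
    Wikimedia / media pages that don't match xx.wikipedia.org."""
    match = re.search(r"([a-z]{2})\.wikipedia\.org", page_name)
    return match.group(1) if match else "nt"
```

For example, the page name from the data overview, 'AKB48_zh.wikipedia.org_all-access_spider', maps to 'zh'.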
Now, let’s see what we get
As anyone would have expected, English Wikipedia has the largest traffic of all languages, but there is a pattern around August 2016, and interestingly Russian Wikipedia shows the same pattern during the same period. There is also a pattern around January 2016 in English Wikipedia, and some spikes can be seen in Japanese Wikipedia during the same period.
Conclusion- Traffic differs based on the language of the page.
Plots for Spanish Language-
We can see a nice weekly pattern in the data: regular spikes appear every 7 days. One interesting observation from this plot is that traffic usually goes down during late Q3 or early Q4, and in November 2016 there is a very large spike.
Conclusion-Data has weekly seasonality
This plot shows a high correlation value at a lag of 7 days, so there is a weekly trend in the data.
Conclusion- Using a lag of 7 days in our AR model can produce good results for the Spanish language.
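Checking the autocorrelation at a given lag is a one-liner per lag with pandas. The synthetic weekly series below is just for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic series with a strong weekly cycle plus noise.
rng = np.random.default_rng(0)
t = np.arange(365)
s = pd.Series(100 + 30 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 5, t.size))

# Autocorrelation at selected lags; a peak at 7 suggests weekly seasonality.
acf = {lag: s.autocorr(lag=lag) for lag in (1, 3, 7, 14)}
```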
This plot shows peaks at 120 and 230 days, which indicates quarterly periodicity. Small peaks can also be seen regularly; this can be because the 7-day lag shows high correlation.
Conclusion- These peaks indicate quarterly seasonality in the Spanish data.
Plot for English Wikipedia-
For the first few months there are only one or two spikes in the data, but after that spikes appear regularly, and after Feb 2017 traffic has gone down by a large margin.
This also shows a high correlation value at a lag of 7 days. Beyond that, the autocorrelation at lags of 7, 14, 21, 28, ... days is always higher than at neighbouring lags, which shows a strong weekly trend.
Conclusion- There is high weekly seasonality.
This also has peaks around the same lags as the previous one, and regular small peaks can be seen.
Conclusion- Just like Spanish, English also shows quarterly seasonality.
Plot for Russian Wikipedia-
Russian Wikipedia does not show a large upward or downward trend, but there is a very large spike during Q3 2016. Other than that, it has a few spikes here and there, but not as many as the other languages.
This one is quite different from the other plots: the autocorrelation value decreases steadily. Using a lag of 7 days might not be a good idea here; a lag of 1 or 2 days could provide better results. We will look at the other language plots too and then decide the optimal lag value for our complete dataset.
Conclusion- This suggests that people in Russia don't care much about weekends, and their rate of accessing Wikipedia pages remains uniform.
This has a peak at 120 days but no major peak is seen at 230 days.
Conclusion- Russian Wikipedia doesn’t show much seasonality.
Plot for Media Pages-
This plot is for media pages, such as images. We can see that until April 2016 the data doesn't show any spikes; after that it has regular spikes, including some very large ones.
This one also has better correlation for a lag of 1 or 2 days, but notice that the 7-day lag is higher in comparison to lags of 5, 6 or 8, 9 days.
Conclusion- It also shows some weekly seasonality but not as much as others.
Regular peaks are there in the data but no large peaks like in the previous graphs.
Conclusion- We can see that there is no seasonality in the media pages.
Plot for Japanese Wikipedia-
Japanese Wikipedia shows some large spikes during January 2016; traffic went down by a very large margin during September 2015, and there is also a downward trend in the data after Feb 2017.
Similar to some other language plots, it also has good correlation for 7 days lag.
Conclusion- There is weekly seasonality in the data.
It is very similar to the plots we saw initially, peaks at 120 and 230 days.
Conclusion- Data has seasonality for 120 days.
Plot for German Wikipedia-
German Wikipedia has the most spikes of all the languages. We can see that the traffic traces an 'M'-shaped structure: it goes up, then down, then up again and down again.
This one also follows the same trend- higher correlation for 7 days lag.
Conclusion- Like many other plots, it also has weekly seasonality.
As expected there are peaks at 120 and 230 days, but surprisingly there is also a peak at around 340 days.
Conclusion- Like previous plots, this also shows 120 days seasonality.
Plots for Chinese Wikipedia-
Like the other languages, this one also shows an upward trend during the first few months, and we can see a very large spike during January 2016.
This plot is no different than others, high value for 7 days lag.
Conclusion- There is weekly seasonality in this data also.
Very similar to previous language plot except that it has a comparatively smaller peak at 340 days.
Conclusion- There is seasonality of 120 and 230 days also.
Plots for French Wikipedia-
French Wikipedia shows an upward trend first and then, after January 2017, goes downward. Spikes can be seen regularly, but there is a large spike around March 2016.
As expected, high correlation at a lag of 7 days.
Conclusion- There is weekly seasonality in data.
This is also very similar to previous plot, peaks at 120,230 and 340 days.
Conclusion- This also shows seasonality for 120 and 230 days.
After analysing traffic for all the languages, we see that most language wikis show an upward trend during the first few months and, similarly, most show a downward trend during the last few months. This doesn't mean that all our time series are similar: as we saw in our very first plot, language does influence the traffic on pages. From this analysis, we can say that using language as a feature can be helpful during the modelling part.
Coming to the correlation plots, most languages showed a better correlation value at a 7-day lag, except Russian Wikipedia and Wikimedia. As we saw in the very first plot of the case study, Wikimedia has very little traffic, so we can make an exception for Russian Wikipedia as well. We therefore conclude that using a lag of 7 days in our basic statistical models should produce good results.
The Fourier transformation also provided some useful insights: most languages had peaks at 120 and 230 days, while some also had peaks at 340 days. This information about the three peaks can be used later during modelling.
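Such peaks can be located with numpy's FFT; the sketch below assumes a regularly sampled daily series with no missing values.

```python
import numpy as np

def dominant_periods(series, top_k=3):
    """Return the top-k periods (in days) by FFT magnitude."""
    values = np.asarray(series, dtype=float)
    values = values - values.mean()            # drop the zero-frequency term
    spectrum = np.abs(np.fft.rfft(values))
    freqs = np.fft.rfftfreq(values.size, d=1.0)
    spectrum[freqs == 0] = 0                   # guard against any residual DC
    idx = np.argsort(spectrum)[::-1][:top_k]   # strongest frequency bins
    return sorted(1.0 / freqs[idx])            # convert frequencies to periods
```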
As already discussed, we also have the agent for each page, so now we will combine the traffic of different pages based on their agent (spider/non-spider).
We can clearly see the pattern in the non-spider data, but the spider data looks almost like a flat line, which it actually should not be.
Conclusion- This implies that the scale of spider and non-spider data is very different.
We saw from the plots above that the scales of spider and non-spider data differ a lot; let's check the medians of both categories, which should give us an idea of how much they differ.
They differ by a factor of about 10², which is a large difference. This may be because, whenever we need information from Wikipedia, we generally access the wiki page directly instead of using a web crawler. Web scraping or a web crawler is generally used only when we need to extract a large amount of data at once.
Now, let’s combine the traffic on pages based on the client used to access that page.
Now, let’s see the plot.
Before plotting this graph, everyone would have expected all of them to have a pattern very similar to the English wiki, or to the non-spider data. We can see that all-access and desktop traffic live up to our expectation, but mobile data shows a very different pattern: around August 2016, all-access and desktop data show a peak just like the English wiki, but mobile data doesn't show any spike and, if we look closely, is actually going down. This is completely opposite to what we expected. Now take another look at all of the language-based plots: you will notice that Spanish and German Wikipedia show a similar structure around August 2016, as their traffic also goes down.
Conclusion- From this we can conclude that most of the mobile traffic during Aug 2016 comes from the Spanish and German wikis, and that very little of the English wiki is accessed on mobile.
We have now seen every kind of pattern that our data exhibits. Next, we will try to answer some questions based on our data which can prove helpful at later stages of the project.
Which month has more visits on an average ?
From here onwards, for every question we try to answer, the average will be based on the median, not the mean.
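With the data in long format (one row per page and date), the month-wise comparison is a single groupby on the median. The frame below is a hypothetical stand-in for the real data:

```python
import pandas as pd

# Hypothetical long-format frame: one row per (page, date) pair.
daily = pd.DataFrame({
    "date": pd.date_range("2015-07-01", periods=803).repeat(2),
    "visits": range(803 * 2),
})
daily["month"] = daily["date"].dt.month

# Median (not mean) visits per calendar month -> robust to traffic spikes.
monthly_median = daily.groupby("month")["visits"].median()
```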
So, index 1 is for February, which shows that on average February has the most visits. Compared to October, which has the fewest average visitors, February shows a significant 25% growth.
Which day of the week has more visits on an average ?
We can see that on average Sunday has the most visitors, but all the days are pretty close and don't differ by much.
Does the holiday season affect the average number of visitors?
As we know, the last three months of the year have the most festivals all over the world. Let's check if this affects the average number of visitors.
We can see there is some difference here: there is actually a 7.3% dip in the average number of visitors during the holiday season.
Conclusion- It actually makes a difference; during the holiday season, Wikipedia pages get less traffic in comparison to other months of the year.
This is the end of our EDA. We have analysed the data in depth, which will be useful at later stages of our case study. With this, we conclude this part of the series.