This is the second part of our series.
In the previous blog, we did a detailed analysis of the data. Now it's time to start feature engineering and model building.
Table of Contents-
- Baseline Model
- Feature Engineering
- Future Work
1. Baseline Model-
First, we will make predictions based only on the medians of the data and use them as our baseline model.
For each page, we computed the median traffic for each weekday and used those medians to make predictions for the next 62 days accordingly.
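The weekday-median baseline can be sketched as below. This is a minimal illustration on a toy frame, assuming the competition layout of one row per page and one column per date (names like `traffic` and `preds` are ours, not from the original code):

```python
import numpy as np
import pandas as pd

# toy frame: one row per page, one column per date (assumed competition layout)
dates = pd.date_range("2017-01-01", periods=28)
traffic = pd.DataFrame(np.arange(28)[None, :].repeat(2, axis=0),
                       index=["page_a", "page_b"], columns=dates)

# median visits per weekday for every page (0 = Monday ... 6 = Sunday)
weekday = traffic.columns.weekday
medians = traffic.T.groupby(weekday).median().T

# predict the next 62 days by looking up each future date's weekday median
future = pd.date_range(dates[-1] + pd.Timedelta(days=1), periods=62)
preds = medians[future.weekday].to_numpy()   # shape: (n_pages, 62)
```

Each future date simply inherits that page's historical median for the matching weekday, which is why this makes a reasonable no-learning baseline.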
We have now calculated the medians for each page and weekday; next we will generate predictions for Kaggle and write them to a CSV file that follows the sample_submission format.
After making a submission on Kaggle for the above predictions, here are our results.
For this submission, we are currently in 310th position out of 1095 teams.
2. Feature Engineering-
Let's start with some feature engineering! We will create features for the last 15 days of data, using the rest of the series to derive them. Since we saw a high correlation at a 7-day lag, the first feature we create is the number of visitors exactly one week earlier (d-7). We will also create three features that hold the total visitors over the previous three-day buckets: threedays1 holds the traffic for (d-1)+(d-2)+(d-3), threedays2 for (d-4)+(d-5)+(d-6), and threedays3 likewise for the bucket before that. Another feature is weekday, which, as the name suggests, contains the day of the week.
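The lag and bucket features above can be sketched for a single page as follows. This is a toy example on a synthetic series; the column names (`lag7`, `threedays1`, ...) follow the text, and the exact bucket boundaries for `threedays3` are our assumption:

```python
import numpy as np
import pandas as pd

# toy series of daily visits for one page
visits = pd.Series(np.arange(1, 101, dtype=float),
                   index=pd.date_range("2017-01-01", periods=100))

rows = []
for day in visits.index[-15:]:               # features for the last 15 days
    d = visits.index.get_loc(day)
    rows.append({
        "weekday": day.weekday(),            # 0 = Monday ... 6 = Sunday
        "lag7": visits.iloc[d - 7],          # visits exactly one week earlier
        "threedays1": visits.iloc[d - 3:d].sum(),      # (d-1)+(d-2)+(d-3)
        "threedays2": visits.iloc[d - 6:d - 3].sum(),  # (d-4)+(d-5)+(d-6)
        "threedays3": visits.iloc[d - 9:d - 6].sum(),  # assumed: (d-7)+(d-8)+(d-9)
        "target": visits.loc[day],
    })
features = pd.DataFrame(rows)
```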
Since language, access type, and client type all have an impact on traffic, using them as features can be useful.
lang_feat — the language of the page.
agent_feat — 1 for spider traffic, 0 for non-spider traffic.
access_feat — 0 for all-access, 1 for desktop, and 2 for mobile.
Since we are creating data for the last 15 days, these features are repeated 15 times for each page.
Now we will create features based on the Fourier transform. During the analysis we saw some unwanted peaks at the start of the spectrum, so after removing the initial values we take the top three peaks and use them as features.
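A minimal sketch of this idea: take the FFT magnitudes of a series, drop the first few bins (which carry the mean and slow trend and produce the unwanted initial peaks), and keep the three largest magnitudes. The function name and the number of skipped bins are our assumptions:

```python
import numpy as np

def fourier_peaks(series, n_peaks=3, skip=2):
    """Top-n FFT magnitudes after dropping the first few (DC/trend) bins."""
    mags = np.abs(np.fft.rfft(series - np.mean(series)))
    mags = mags[skip:]                       # remove the unwanted initial peaks
    return np.sort(mags)[-n_peaks:][::-1]    # largest first

# a noisy series with a weekly cycle, similar in spirit to page traffic
rng = np.random.default_rng(0)
t = np.arange(200)
weekly = 10 * np.sin(2 * np.pi * t / 7) + rng.normal(size=200)
peaks = fourier_peaks(weekly)
```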
We have now created all the features discussed so far and will use them to train the model.
Summary and key takeaways from our work till now
First, we loaded the required libraries and files.
Handling missing values:
We used linear interpolation to fill values that were not missing for many consecutive days. For long runs of consecutive missing values, we used the data from 180 days later.
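That two-step filling strategy can be sketched as below. This is a toy illustration; the interpolation limit of 3 consecutive days is our assumption for what counts as a "short" gap:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(400, dtype=float),
              index=pd.date_range("2016-01-01", periods=400))
s.iloc[50] = np.nan            # a short, isolated gap
s.iloc[100:160] = np.nan       # a long run of consecutive gaps

# short gaps: linear interpolation (fill at most 3 consecutive days)
s = s.interpolate(limit=3)
# remaining long runs: borrow the value from 180 days later
s = s.fillna(s.shift(-180))
```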
- Most of the traffic was on English-language pages.
- There was very little bot traffic compared to actual human traffic.
- May has the highest number of visitors.
- Weekdays and weekends don't actually make much difference.
- The last three months of the year average more visitors than the other months.
- On average, English-language pages have the most visitors.
Feature Engineering :
We used a rolling-window technique to generate features, taking 15 days of data from every time series. These are the features we generated:
- Day of the week
- Number of visitors on day d-7
- Total visitors in each of the last three 3-day buckets
- Language of the page
- Client of the page (all-access, mobile, desktop)
- Spider or actual human traffic
- Top three peaks of the Fourier-transformed data
Making the data model-ready-
Now we will start building a model based on the features we have created. First, we convert all categorical features into one-hot encodings and normalize all continuous features.
Now we will split our data into train and test sets, using a 75/25 split.
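A one-liner with scikit-learn covers this step; `X` and `y` below are toy stand-ins for the real feature matrix and target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy stand-ins for the real feature matrix and target vector
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# 75/25 split; a fixed random_state keeps the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```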
As already discussed, we will evaluate our models with SMAPE. Since Python has no built-in function for it, we have to implement it ourselves and use it as a custom metric.
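A straightforward implementation looks like this (the choice of defining 0/0 as 0 follows the competition's convention for pages with zero actual and predicted traffic):

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric Mean Absolute Percentage Error, in percent (0-200)."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # define 0/0 as 0 to avoid NaNs on all-zero pages
    diff = np.where(denom == 0, 0.0,
                    np.abs(y_true - y_pred) / np.where(denom == 0, 1.0, denom))
    return 100.0 * np.mean(diff)
```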
Let’s start with model building.
These are not good results; we should try a more powerful algorithm and see how it performs.
The results are better than Linear Regression, but overall the model does not seem capable of getting us a good position on the leaderboard.
These results are far better than the Decision Tree, but keep in mind that some of the features used here depend on data we will not have for future dates. We have to predict sequentially, feeding each prediction back in as a feature for the next one, so errors will compound as we predict farther into the future. Therefore we need a much better score here than the score we expect on the leaderboard, because the leaderboard score will be higher than this one.
The score is even higher than the Random Forest and Decision Tree scores.
Summary of performance of all the models:
- Linear Regression — 0.79
- Decision Tree — 0.40
- Random Forest — 0.37
- XGBoost — 0.43
The best-performing model here is Random Forest, so we will use it for the Kaggle predictions; let's hope it performs well on the leaderboard too.
Predictions for Kaggle
We will create all the prediction features that we built during feature engineering, combine them, and start making predictions for Kaggle. We also tried submitting with XGBoost: despite its higher (worse) local score, it performed better than Random Forest on Kaggle.
Here are the results-
This is exactly what I was fearing: the error did indeed grow rapidly as we predicted for farther dates, and this score is worse than our median submission. We have to think of something else.
Deep Learning to the rescue-
Now we will try an LSTM as well. As before, we will not use any feature that depends on predicted data, since that would create problems.
Since we have more than 145k time series, we use only 50% of the data to train the model.
To evaluate the model, we will use Mean Absolute Error (MAE) on log1p of the data, which is quite similar to SMAPE on the original data. After making the final predictions, we apply expm1 to convert them back to the original scale.
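The log1p/expm1 round trip looks like this on toy numbers (log1p handles the zero-traffic days that plain log cannot):

```python
import numpy as np

y_true = np.array([0.0, 10.0, 200.0])   # toy actual visits
y_pred = np.array([1.0, 12.0, 180.0])   # toy predicted visits

# train/evaluate on log1p(visits); MAE here approximates SMAPE on raw counts
mae_log = np.mean(np.abs(np.log1p(y_true) - np.log1p(y_pred)))

# expm1 inverts log1p, mapping predictions back to the original scale
restored = np.expm1(np.log1p(y_pred))
```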
Along with the time series, we will use the three features that we can extract from the page name.
The code above creates all the features we need for the model. Now we label-encode all the categorical features, reshape everything into the proper shapes to feed the model, and split the data into train and test sets.
Our data is ready, so we can start building the model architecture. The model will have four input layers (one per feature). The time-series data goes through an LSTM layer, each label-encoded feature goes through an embedding layer, and the embedding outputs are flattened so that all the branches can be concatenated. After that, we add a dense layer and an output layer.
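A minimal sketch of such a four-input model with the Keras functional API. The window length, layer sizes, and vocabulary sizes below are assumptions for illustration, not the original hyperparameters:

```python
from tensorflow import keras
from tensorflow.keras import layers

WINDOW = 200                            # assumed days of traffic fed to the LSTM
N_LANG, N_ACCESS, N_AGENT = 10, 3, 2    # assumed label-encoding vocabulary sizes

# one input per feature
ts_in = keras.Input(shape=(WINDOW, 1), name="series")
lang_in = keras.Input(shape=(1,), name="lang")
access_in = keras.Input(shape=(1,), name="access")
agent_in = keras.Input(shape=(1,), name="agent")

# time series through an LSTM; categoricals through embeddings, then flattened
x_ts = layers.LSTM(64)(ts_in)
x_lang = layers.Flatten()(layers.Embedding(N_LANG, 4)(lang_in))
x_acc = layers.Flatten()(layers.Embedding(N_ACCESS, 2)(access_in))
x_agt = layers.Flatten()(layers.Embedding(N_AGENT, 2)(agent_in))

# concatenate all branches, then a dense layer and an output layer
x = layers.concatenate([x_ts, x_lang, x_acc, x_agt])
x = layers.Dense(64, activation="relu")(x)
out = layers.Dense(1)(x)                # next visits on the log1p scale

model = keras.Model([ts_in, lang_in, access_in, agent_in], out)
model.compile(optimizer="adam", loss="mae")   # MAE on log1p data, as above
```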
The architecture looks like below:
Now, let’s take a look at the architectural graph of our model-
As discussed, MAE on log1p of the data is quite similar to SMAPE, so we should get a decent score on the leaderboard as well.
After making predictions and submitting the file on Kaggle, here are our results-
The score is good: it takes us into the top 10% of the leaderboard, at position 92/1095.
We tried adding a Conv1D layer to improve accuracy, but the Kaggle score (error) went up. Adding more LSTM layers increased the score as well. This might be due to overfitting, so we added dropout layers, but that did not help much either: the score did not decrease further.
Comparative study of all the models we tried
The tree-based models were unable to perform well; their scores were worse than the median model's. The LSTM proved to be a good choice, probably because LSTMs are specialized for sequence-to-sequence prediction.
What good is a model if it doesn't reach customers? Most case-study blogs skip this part, but we will leave no stone unturned. We will use Flask for deployment and Heroku for hosting.
In the modeling section, we trained models to predict the next 64 days, but here we only need a prediction for one specified date. So I trained the LSTM to take the last 5 days of data as input instead of 200 and predict the next day. The code is pretty similar, but if you still want to see it, you can check my GitHub repository.
Now we have to build a pipeline for preprocessing, feature engineering, and prediction. We can't ask users to enter the traffic for the last 5 days and the page name themselves, so we take an index and a date as input: the index corresponds to a particular page name as listed in the dataset, and the data needed for the requested date is fetched from the CSV file.
Heroku gives us a 500 MB limit, and the TensorFlow module alone takes more than 300 MB, so we can't ship the complete dataset. Instead of 145k pages, we will use only 10k pages for the deployed model.
Let’s get started with the pipeline!
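A hypothetical sketch of such a pipeline function (the name `predict_page`, the frame layout, and the log1p preprocessing are our assumptions; the real pipeline also rebuilds the page-name features):

```python
import numpy as np
import pandas as pd

def predict_page(index, date, traffic, model):
    """Fetch the 5 days before `date` for page number `index` from the
    traffic frame (rows = pages, columns = dates) and predict the next day."""
    date = pd.Timestamp(date)
    window = traffic.loc[traffic.index[index],
                         date - pd.Timedelta(days=5):date - pd.Timedelta(days=1)]
    # same preprocessing as training: log1p, shaped (batch, timesteps, features)
    x = np.log1p(window.to_numpy(dtype=float)).reshape(1, 5, 1)
    # invert log1p to return visits on the original scale
    return float(np.expm1(model.predict(x)[0, 0]))
```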
The output will look like this-
OK, now we are done with the pipeline, so it's time to deploy with Flask. We also have to build the HTML pages (I will keep them very simple).
Now let's proceed to Heroku. Heroku is free and pretty easy to use, and we will deploy our model there from GitHub. There are two extra files we have to create: Procfile (without any extension) and requirements.txt.
Procfile tells Heroku which file to run first, and requirements.txt lists all the modules that need to be installed. Then we just choose "Connect to GitHub" on Heroku, search for our repository, and click Deploy. If all the code is correct, the model will be deployed and a link to your app will be generated.
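For reference, a minimal pair of files might look like this, assuming the Flask entry point lives in app.py and exposes a Flask object named `app`, with gunicorn as the web server (the original repository's exact contents may differ):

```
# Procfile
web: gunicorn app:app

# requirements.txt
flask
gunicorn
tensorflow
pandas
numpy
```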
After completing all the above steps, here is our app. The link may take some time to open, so have patience.
5. Future Work-
- Attention-based models can be tried in order to use information from the distant past to improve the results.
- Different kinds of smoothing and transformations of the data can be tried to improve the results.
- As discussed under existing approaches, there are a few simple statistics-based methods that proved quite good; they can be used to provide extra features to the models.