AI Saturdays Monterrey Chapter Final Project
The Monterrey chapter of AI Saturdays was formed by Eduardo Ramirez, who manages the Data Science & Engineering Monterrey Meetup group. The Data Science Meetup group was interested in starting a study group, and AI Saturdays provided a solid curriculum and a formal framework to implement this.
The participants of AI Saturdays Monterrey are an eclectic group with varied skill sets and interests. Among the group there are graduate students in applied mathematics, freelance developers, programmers, and IT consultants. So, the group has been a success as a networking and idea exchange hub.
The chapter as a group wants to understand not only the mathematical and programmatic foundations of Deep Learning, but also the business context in which the technique is relevant. Thus, to leverage the group's broad skill set, we decided to work together on a project that would illustrate how Deep Learning is used in a business context. In selecting the project, we looked for the following:
- A business problem with real data.
- A quantitative measure of the performance of a given solution.
- A way to benchmark what we learned about Deep Learning using the Fast AI approach against statistical analysis and random forest techniques.
We decided to enter a Kaggle competition to challenge ourselves by competing with other practitioners worldwide. We selected the Store Item Demand Forecasting Challenge. The competition description reads as follows:
This competition is provided as a way to explore different time series techniques on a relatively simple and clean data set.
You are given 5 years of store-item sales data, and asked to predict 3 months of sales for 50 different items at 10 different stores.
What’s the best way to deal with seasonality? Should stores be modeled separately, or can you pool them together? Does deep learning work better than ARIMA? Can either beat xgboost?
This is a great competition to explore different models and improve your skills in forecasting.
Forecasting Using Advanced Methodology
Time Series Analysis
In most branches of science, engineering, and commerce, there are variables measured sequentially in time. Reserve banks record interest rates and exchange rates each day, self-driving cars continuously collect data about how their local environment is changing around them, and so on. These applications rely on a form of data that measures how things change over time. When a variable is measured sequentially in time over or at a fixed interval, known as the sampling interval, the resulting data form a time series.
Widely used statistical methodologies typically approach forecasting by treating time as a linear variable on the horizontal axis, which is the basis of linear regression models. Using that technique, the results we would have obtained would have looked like this:
This technique leads to a polynomial function such as f(t) = at² + bt + c, whose degree depends on how complex we want the model to be and whose results depend only on the data provided, so potential new interactions between factors are missed.
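To make the polynomial trend idea concrete, here is a minimal sketch that fits f(t) = at² + bt + c to a series and extrapolates one step ahead. The data is synthetic and noise-free, and the variable names are our own for illustration; it is not the code used in the project.

```python
import numpy as np

# Synthetic, noise-free series generated from a known quadratic (illustrative only)
t = np.arange(10, dtype=float)
sales = 2.0 * t**2 + 3.0 * t + 5.0

# Least-squares fit of a degree-2 polynomial; coefficients come back highest degree first
a, b, c = np.polyfit(t, sales, deg=2)


def trend(t_new):
    """Evaluate the fitted quadratic trend at time t_new."""
    return a * t_new**2 + b * t_new + c


# Extrapolate one step beyond the observed range
forecast = trend(10.0)
```

Note that time is the only input here, which is why a model of this kind cannot pick up effects such as the day of the week or differences between stores.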
The main advantage of the techniques learned in this course is that the fastai Python library has a module that treats time as another independent input variable. This relatively simple setting represents a breakthrough, because it not only simplifies the analysis setup but also helps identify more complex and complete types of relationships between data variables, complementing the traditional results and leading to conclusions that would never have been found using classical statistical analysis techniques alone.
In the next paragraphs, we’ll explain the new techniques. Pure statistical analysis no longer stands alone; it is now complemented by AI tools.
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements Machine Learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can produce highly accurate models on problems with billions of data points and beyond.
XGBoost has been widely recognized in a number of machine learning and data mining challenges as a very effective technique. For example, on the Kaggle site in 2015, 17 out of 29 winning solutions used XGBoost in some manner, either alone or in combination with neural nets. Among these solutions, eight used XGBoost alone to train the model, while most others combined XGBoost with neural nets in ensembles.
For comparison, the second most popular method, deep neural nets, was used in 11 solutions. The success of the system was also witnessed in KDD Cup 2015, where XGBoost was used by every winning team in the top-10. Moreover, the winning teams reported that ensemble methods outperform a well-configured XGBoost by only a small amount.
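The idea behind gradient boosting can be illustrated with a toy sketch: start from a constant prediction, then repeatedly fit a very small tree (a one-split "stump") to the current residuals and add a shrunken copy of it to the model. This is not XGBoost's implementation, which adds regularization, second-order gradient information, and parallel tree construction; the function names and the synthetic data below are our own assumptions.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.1):
    """Boosting for squared loss: each stump is fit to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return {"base": base, "stumps": stumps, "learning_rate": learning_rate}


def predict(model, x):
    return model["base"] + model["learning_rate"] * sum(
        predict_stump(s, x) for s in model["stumps"])


def mean_squared_error(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Synthetic data: a single feature with a non-linear target
xs = list(range(20))
ys = [float(x * x) for x in xs]

model = fit_boosted_stumps(xs, ys)
baseline_mse = mean_squared_error(ys, [model["base"]] * len(ys))
boosted_mse = mean_squared_error(ys, [predict(model, x) for x in xs])
```

Comparing `boosted_mse` with `baseline_mse` shows how the sum of many weak learners improves on the constant prediction.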
In practice, use of traditional statistical time series tools requires considerable experience and skill to select the appropriate type of model for a given dataset. The great thing about neural networks is that you do not need to specify the exact nature of the relationship (linear, non-linear, seasonality, trend) that exists between the input and output.
The hidden layers of a neural network remove the need to pre-specify the nature of the data generating mechanism. This is because they can approximate extremely complex decision functions.
Our second approach was to use a DNN (Deep Neural Network) with time series features, following the process described in the Rossmann lesson of fastai (lesson 4: structured, time series and language models). This process involves data analysis and cleansing, and the creation of features, or feature engineering. For features that capture seasonality, the fastai library function add_datepart simplifies the process by generating many kinds of time variables from a date column. Another important step is selecting which variables to treat as continuous and which as categorical. Then, to prepare the data for the neural network, the categorical variables are mapped to continuous vectors using embeddings, which are learned at the same time we train the rest of the model.
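As a rough illustration of the kind of time variables this feature engineering step produces, the sketch below expands a date into calendar features using only the standard library. The function and field names are hypothetical and chosen for readability; this is not the fastai function itself.

```python
from datetime import date


def date_parts(d):
    """Expand a single date into calendar features a model can use."""
    return {
        "year": d.year,
        "month": d.month,
        "week": d.isocalendar()[1],           # ISO week number
        "day": d.day,
        "day_of_week": d.weekday(),           # Monday = 0, Sunday = 6
        "day_of_year": d.timetuple().tm_yday,
        "is_month_start": d.day == 1,
        "is_year_start": d.month == 1 and d.day == 1,
    }


# Example date chosen for illustration
features = date_parts(date(2017, 12, 31))
```

Features such as the month or the day of the week would typically be treated as categorical and given embeddings, while a value such as the year could be treated as either categorical or continuous.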
Another important step is to select our validation set and our test set. It is recommended that the validation set be the same size as the expected test set, so we used the last 3 months as our validation set. (With time series we do not use the usual cross-validation approach, in which every portion of the data takes a turn as the validation set, because the model would then be trained on data from after the period it is validated on.)
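A time-based split of this kind can be sketched as follows. The daily records are synthetic, and the date range and cutoff are assumptions made for the example (five years of daily data, with the final three months held out); the helper name is ours.

```python
from datetime import date, timedelta


def split_by_date(rows, cutoff):
    """Rows dated before the cutoff go to training; the rest go to validation."""
    train = [row for row in rows if row[0] < cutoff]
    valid = [row for row in rows if row[0] >= cutoff]
    return train, valid


# Synthetic daily records of (date, sales); values are illustrative only
first_day, last_day = date(2013, 1, 1), date(2017, 12, 31)
n_days = (last_day - first_day).days + 1
rows = [(first_day + timedelta(days=i), 10 + i % 7) for i in range(n_days)]

# Hold out the last three months (October to December) as the validation set
train, valid = split_by_date(rows, cutoff=date(2017, 10, 1))
```

Because every validation date is later than every training date, the validation score reflects how the model behaves when forecasting the future, which is what the competition asks for.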
The third approach we considered for the Store Item Demand Forecasting Challenge uses statistical models known as ARIMA (Autoregressive Integrated Moving Average).
ARIMA models emerged in 1970 with the work of G. E. P. Box and G. M. Jenkins (see reference 2). These models are based on the probabilistic concept of a stochastic process, a mathematical object of some complexity from the field of probability, but one with many statistical applications in areas such as economics and biology.
Despite their age, ARIMA models remain popular, although they are not computationally simple (estimating their parameters requires optimization techniques, just as neural networks do) and sometimes, as in our case, they are not simple to interpret.
ARIMA models (see reference 1) combine two aspects: an autoregressive component, which means that the observation at a given time depends on previous observations, and a moving average component, which means that the phenomenon does not depend strongly on its previous results but is instead the result of chance at each moment.
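The autoregressive idea can be illustrated with the simplest case, an AR(1) process, where each observation is a fraction of the previous one plus random noise. The sketch below simulates such a series and recovers the coefficient by least squares. It is a toy example with hypothetical function names, not the estimation procedure used in the project.

```python
import random


def simulate_ar1(phi, n, seed=0):
    """Simulate x_t = phi * x_{t-1} + noise, with standard normal noise."""
    rng = random.Random(seed)
    series = [0.0]
    for _ in range(n - 1):
        series.append(phi * series[-1] + rng.gauss(0.0, 1.0))
    return series


def estimate_ar1(series):
    """Least-squares estimate of phi from consecutive (previous, current) pairs."""
    numerator = sum(curr * prev for prev, curr in zip(series, series[1:]))
    denominator = sum(prev * prev for prev in series[:-1])
    return numerator / denominator


series = simulate_ar1(phi=0.7, n=5000)
phi_hat = estimate_ar1(series)
```

With a long enough series, `phi_hat` lands close to the true value of 0.7. Full ARIMA models add differencing and moving average terms, and their parameters are usually estimated by maximum likelihood rather than by this shortcut.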
In the project we first tried particular ARIMA models such as AR and MA. However, the results were not very satisfactory, so we resorted to VARIMA models (vector autoregressive integrated moving average), an extension of ARIMA that, instead of dealing with a single time series, treats several time series together as a vector (see reference 3).
“The strength of the team is each individual member. The strength of each member is the team.” — Phil Jackson
When defining the tasks needed to execute the project, the work was distributed according to the interests and competencies of each participant. Jesus Martinez and Adrián Rodríguez (Adrián Alejandro R) worked on the XGBoost algorithm. Juan M. Chapa Z., Arnulfo Pérez (arnulfo perez), Karina Fernanda Pérez and Nazario Benavides worked on the implementation of the Deep Learning version. The statistical analysis using ARIMA was done by José Antonio García. Adrián Rodríguez published the kernels on Kaggle, reaching 5th place for a couple of days with XGBoost.
The results are documented in the team’s repository, available at https://github.com/ai-saturdays-monterrey/DemandForecasting
The work done and future proposals using ARIMA models can be found in the document fastai.md (https://github.com/ai-saturdays-monterrey/DemandForecasting/blob/master/VARIMA-Demand-Forecasting/fastai.md). An error of 49.61861 was achieved using the competition’s metric, and the team’s participation in the competition can be consulted at https://www.kaggle.com/foudifeomorfismo/vecautoreg49.
The challenge description specifies the SMAPE metric, which stands for symmetric mean absolute percentage error, an accuracy measure based on percentage error. The metric was implemented in the code using this definition from Wikipedia:
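Symmetric mean absolute percentage error (SMAPE or sMAPE) is an accuracy measure based on percentage (or relative) errors. It is usually defined as follows:
SMAPE = (100% / n) Σ |F_t − A_t| / ((|A_t| + |F_t|) / 2), with the sum taken over t = 1, …, n, where A_t is the actual value and F_t is the forecast value.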
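A minimal sketch of the metric in Python follows. Each term is the absolute error divided by the mean of the absolute actual and forecast values, and the average is expressed as a percentage. Treating a pair where both values are zero as contributing no error is our assumption for the example; the function name is also ours, and this is not necessarily the exact code in the team's repository.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, on a 0 to 200 scale."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    total = 0.0
    for a, f in zip(actual, forecast):
        denominator = (abs(a) + abs(f)) / 2
        if denominator == 0:
            continue  # assumption: a perfect zero forecast of a zero value adds no error
        total += abs(f - a) / denominator
    return 100.0 * total / len(actual)


perfect = smape([100.0, 100.0], [100.0, 100.0])
half = smape([100.0], [50.0])
worst = smape([10.0], [0.0])
```

A perfect forecast scores 0, and the score grows toward 200 as the forecast and the actual value diverge, so lower is better.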
The best result obtained by the team in the competition came from the Deep Neural Net (DNN), with an error of 14.00123. XGBoost came second, with an error of 14.22953.
For XGBoost, an optimized implementation of a boosting method, parameters were tuned using a widely adopted method, GridSearchCV, in an effort to find the optimal SMAPE score. A total of four attempts were made: the first three focused primarily on parameter tuning, while the last tested whether sorting the data in ascending date order brought any improvement. It is worth mentioning that each of these attempts took no longer than 10 minutes on a typical CPU, and at most 1 minute 30 seconds on a GPU, before being submitted to the Kaggle competition. It also came to our attention that other teams were using another gradient boosting technique, LightGBM, with outstanding and better results.
When using ARIMA models we encountered the same difficulties as with the other two approaches. Parameter estimation is computationally expensive because it requires maximizing likelihood functions in high-dimensional spaces; in our case 4900 parameters had to be estimated, without considering the possible cost of smoothing the series. The result is also not easily interpretable, because the sales of the items in the 10 stores are modelled as a phenomenon that depends simultaneously on sales in all ten stores over the 49 preceding days, for the 50 items. So in our case the traditional techniques have a high computational cost and are difficult to interpret.
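The figure of 4900 parameters is consistent with a vector autoregression over 10 series (one per store) with 49 lags, where each lag contributes a 10 x 10 coefficient matrix. That reading is our inference from the numbers quoted above, and the helper below is a hypothetical illustration of the arithmetic rather than code from the project.

```python
def var_coefficient_count(n_series, n_lags, include_intercept=False):
    """Number of coefficients in a vector autoregression VAR(n_lags).

    Each lag has an n_series x n_series coefficient matrix; an optional
    intercept adds one more coefficient per series.
    """
    count = n_series * n_series * n_lags
    if include_intercept:
        count += n_series
    return count


# Assumed setup: 10 store series, 49 lagged days, no intercept
n_parameters = var_coefficient_count(n_series=10, n_lags=49)
```

The count grows with the square of the number of series, which is why pooling many series into one vector model quickly becomes expensive to estimate and hard to interpret.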
At the first meeting of the Monterrey chapter of AI Saturdays (AI6), everybody had high expectations but no clear understanding of what could be achieved in learning and applying Artificial Intelligence. The journey we started some months ago left us with a satisfying experience that allows us to discuss AI at different levels, from a basic understanding to ninja level. Within a few weeks, some of us were explaining to others concepts that had previously been unknown to us.
Collaboration was the main aspect of the group's workflow. We used many collaboration tools, such as Slack, Coogle, Google Docs, Google Colaboratory, Github, and Medium.
In particular, Google Colaboratory provided a free alternative to Paperspace, with full Deep Learning support, including GPU support and the Fast AI library.
The Monterrey chapter of AI Saturdays (AI6) would like to thank the Nurture.AI group for the AI Saturdays (AI6) program. We also want to express our recognition to Jeremy Howard, Rachel Thomas, and the whole Fast AI team for the great work that allowed us to learn about the theory and practice of Machine Learning and Deep Learning in a practical and straightforward manner. Moreover, we acknowledge the contribution of the universities that make their academic content available to everybody on the Internet.
We named our AI Saturdays chapter Monterrey, yet we have members from different towns in the greater Monterrey metropolitan area who travel every Saturday from their hometowns to participate in our activities.
We extend an open invitation to everyone interested in Artificial Intelligence to join us at the Monterrey chapter, or the nearest chapter of AI Saturdays, and start this great journey towards a better world using AI for social good.
- Andrew V. Metcalfe, Paul S. P. Cowpertwait, Introductory Time Series with R, 2009, Springer, New York.
- Leila Zoubir, A Brief History of Time Series Analysis, 2018, retrieved on 08/11/2018 from https://www.statistics.su.se/english/research/time-series-analysis/a-brief-history-of-time-series-analysis-1.259451
- H. Lütkepohl, New Introduction to Multiple Time Series Analysis, 2006, Springer, New York.
- N. D. Lewis, Neural Networks for Time Series Forecasting with R.