Scalable AutoML for Time Series Prediction Using Ray and Analytics Zoo
A time series is a sequence of data points observed sequentially in time. Time series prediction takes observations from previous time steps as input and predicts the values at future time steps. Many real-world applications (such as network quality analysis in telcos, log analysis for data center operations, and predictive maintenance for high-value equipment) rely on time series prediction. It can also serve as the first step for anomaly detection, where alarms are triggered when the actual values deviate too much from the predicted values.
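To make this framing concrete, here is a minimal sketch in plain Python. It turns a series into (past window, future value) pairs and flags points where the actual value deviates from a prediction by more than a threshold. The function names, the naive "repeat the last value" forecast, the sample numbers, and the threshold are all assumptions made for illustration; none of them come from Analytics Zoo.

```python
def make_windows(series, past_steps, future_steps=1):
    """Split a series into (inputs, targets) pairs for supervised learning."""
    samples = []
    last_start = len(series) - past_steps - future_steps
    for start in range(last_start + 1):
        inputs = series[start:start + past_steps]
        targets = series[start + past_steps:start + past_steps + future_steps]
        samples.append((inputs, targets))
    return samples


def flag_anomalies(actual, predicted, threshold):
    """Return indices where |actual - predicted| exceeds the threshold."""
    return [i for i, (a, p) in enumerate(zip(actual, predicted))
            if abs(a - p) > threshold]


# Made-up series with one spike.
series = [10, 11, 12, 13, 40, 15, 16]
windows = make_windows(series, past_steps=3)

# Naive stand-in forecast: the next value repeats the last observed input.
actual = [targets[0] for _, targets in windows]
predicted = [inputs[-1] for inputs, _ in windows]
anomalies = flag_anomalies(actual, predicted, threshold=5)
```

With this naive forecast the step after the spike is flagged as well, because the forecast simply echoes the spike; a trained model would be expected to produce fewer such false alarms.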
Classical time series forecasting methods usually make predictions by extrapolating from previous data using descriptive (statistical) models. Such methods often involve making assumptions about the underlying distribution and decomposing the time series into components such as seasonality, trend, and noise. Newer machine learning methods make fewer and less strict assumptions about the data; in particular, neural network models often treat time series prediction as a sequence modeling problem and have recently been applied to these problems with success (e.g., [1] and [2]).
On the other hand, building machine learning applications for time series prediction can be a laborious and knowledge-intensive process. To provide an easy-to-use time series prediction toolkit, we have applied Automated Machine Learning (AutoML) to time series prediction. In particular, we have automated the process of feature generation, model selection, and hyper-parameter tuning. The toolkit is built on top of Ray* (a distributed framework for advanced AI applications open-sourced by UC Berkeley RISELab), and is provided as a part of Analytics Zoo (a unified data analytics and AI platform open-sourced by Intel).
What are Ray and RayOnSpark?
Ray provides a general-purpose cluster-computing framework that addresses the new and demanding system requirements for emerging AI technologies. For instance, Ray Tune* is a distributed and scalable hyper-parameter optimization library built on top of Ray, which allows users to easily run many experiments on a large cluster with efficient search algorithms.
Analytics Zoo has recently provided RayOnSpark support; it allows users to directly run new AI applications built on top of Ray in existing big data clusters, which can then be seamlessly integrated into the big data processing and analysis pipeline. In the remainder of this blog, we will describe how we implement the scalable AutoML framework and automatic time series prediction leveraging Ray Tune and RayOnSpark.
AutoML framework in Analytics Zoo
The figure below illustrates the architecture of the AutoML framework in Analytics Zoo.
The AutoML framework uses Ray Tune for hyper-parameter search (running on top of RayOnSpark). In our implementation, hyper-parameter search covers both feature engineering and modeling. For feature engineering, the search engine selects the best subset of features from a set of features that are automatically generated by various feature generation tools (e.g., featuretools*). For modeling, the search engine searches for hyper-parameters such as the number of nodes per layer and the learning rate. For building and training the models, we use popular deep learning frameworks such as TensorFlow and Keras. In addition, we use Apache Spark* and Ray for distributed execution where necessary.
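As a rough illustration of a search that covers both feature engineering and modeling, the sketch below draws one configuration containing a feature subset together with model hyper-parameters. The feature names, value ranges, and the `sample_config` helper are hypothetical and chosen for illustration only; this is not how Analytics Zoo or Ray Tune define their search spaces.

```python
import math
import random

# Hypothetical candidate features, e.g. produced by a feature generation tool.
CANDIDATE_FEATURES = ["hour", "day_of_week", "is_weekend", "month"]

# Hypothetical model hyper-parameter ranges.
UNITS_CHOICES = [16, 32, 64, 128]      # number of nodes per layer
LEARNING_RATE_RANGE = (1e-4, 1e-2)     # sampled log-uniformly


def sample_config(rng):
    """Draw one configuration covering feature selection and modeling."""
    # Feature engineering part: keep each candidate with probability 0.5.
    selected = [f for f in CANDIDATE_FEATURES if rng.random() < 0.5]
    # Modeling part: sample the learning rate on a log scale.
    low, high = LEARNING_RATE_RANGE
    learning_rate = 10 ** rng.uniform(math.log10(low), math.log10(high))
    return {
        "selected_features": selected,
        "units": rng.choice(UNITS_CHOICES),
        "learning_rate": learning_rate,
    }


rng = random.Random(42)
configs = [sample_config(rng) for _ in range(3)]
```

Sampling the learning rate on a log scale is a common choice because useful values often span several orders of magnitude.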
There are currently four basic components in the AutoML framework, namely FeatureTransformer, Model, SearchEngine, and Pipeline.
- A FeatureTransformer defines the feature engineering process, which usually includes a chain of operations like feature generation, rescaling and selection.
- A Model usually defines a model architecture (e.g., a neural network) and a fitting function that uses an optimization algorithm (e.g., SGD or Adam). A Model may also include the procedure of model/algorithm selection.
- During training, a SearchEngine searches for the best set of hyper-parameters for both FeatureTransformer and Model, and guides the actual model fitting process.
- A Pipeline is a convenient utility that integrates FeatureTransformer and Model into a data analysis pipeline. A Pipeline can be easily saved to file and loaded for reuse later elsewhere.
In general, a typical training workflow with the AutoML framework is as follows:
- A FeatureTransformer and a Model are first instantiated. A SearchEngine is then instantiated and configured with the FeatureTransformer and Model, along with search presets (which specify how the hyper-parameters are searched, the reward metric, etc.).
- The SearchEngine runs the search procedure. Each run will generate several trials at a time and distribute the trials in a cluster using Ray Tune. Each trial runs feature engineering and the model fitting process with a different combination of hyper-parameters and returns the specified metrics.
- After all trials complete, the best set of hyper-parameters and the optimized model are retrieved according to the target metrics. They are used to generate the resulting FeatureTransformer and Model, which are in turn used to compose a Pipeline. The Pipeline can then be saved to file and loaded later for inference and/or incremental training.
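The workflow above can be sketched with toy stand-ins in plain Python. Every class and function here (`LagFeatureTransformer`, `WeightedAverageModel`, `run_trial`, `search`) is hypothetical, the toy model has no real fitting step, and a local thread pool stands in for distributing trials with Ray Tune. It is meant only to show the shape of the search loop, not the Analytics Zoo implementation.

```python
import random
from concurrent.futures import ThreadPoolExecutor


class LagFeatureTransformer:
    """Toy FeatureTransformer: builds (lag window, next value) pairs."""
    def __init__(self, lookback):
        self.lookback = lookback

    def transform(self, series):
        n = len(series) - self.lookback
        x = [series[i:i + self.lookback] for i in range(n)]
        y = [series[i + self.lookback] for i in range(n)]
        return x, y


class WeightedAverageModel:
    """Toy Model: predicts a decay-weighted average of the lag window."""
    def __init__(self, decay):
        self.decay = decay

    def predict(self, x):
        preds = []
        for window in x:
            # The most recent value gets weight 1, older values decay.
            weights = [self.decay ** (len(window) - 1 - j)
                       for j in range(len(window))]
            total = sum(w * v for w, v in zip(weights, window))
            preds.append(total / sum(weights))
        return preds


def run_trial(config, series):
    """One trial: feature engineering plus evaluation for one configuration."""
    x, y = LagFeatureTransformer(config["lookback"]).transform(series)
    preds = WeightedAverageModel(config["decay"]).predict(x)
    mse = sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)
    return {"config": config, "mse": mse}


def search(series, num_samples, seed=0):
    """Sample configurations, run all trials, and return (best, all results)."""
    rng = random.Random(seed)
    configs = [{"lookback": rng.choice([2, 3, 4, 5]),
                "decay": rng.uniform(0.1, 1.0)}
               for _ in range(num_samples)]
    # Thread pool as a local stand-in for distributing trials on a cluster.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(lambda c: run_trial(c, series), configs))
    return min(results, key=lambda r: r["mse"]), results


# Made-up periodic series used only to exercise the search loop.
series = [float(i % 7) for i in range(60)]
best, results = search(series, num_samples=20)

# Compose the "pipeline" from the best configuration.
pipeline = (LagFeatureTransformer(best["config"]["lookback"]),
            WeightedAverageModel(best["config"]["decay"]))
```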
Training a TimeSequencePredictor for time series prediction
Before training a TimeSequencePredictor, you need to initialize RayOnSpark first (either with Spark local mode or YARN mode on a cluster), and you can stop RayOnSpark after training is done. Please refer to the RayOnSpark blog for more details.
After successfully initializing RayOnSpark, you can train your time series prediction pipeline as illustrated in the example below. It first instantiates a TimeSequencePredictor object with the necessary arguments, then invokes TimeSequencePredictor.fit to automate the machine learning training process on historical data in a distributed fashion, and finally obtains a TimeSequencePipeline object.
- The input data (train_df) to TimeSequencePredictor is expected to be a pandas* DataFrame that contains a series of records, where each record contains a timestamp (dt_col) and a data point value (target_col) associated with that timestamp. Optionally, each record can also contain a list of additional input features (extra_feature_col). TimeSequencePredictor will then train a TimeSequencePipeline to predict the corresponding target_col for future time steps.
- The recipe argument contains parameters for TimeSequencePredictor that control the search space, the stopping criteria, and the number of samples (i.e., how many samples are generated from the search space) during training. Currently available recipes include SmokeRecipe, RandomRecipe, GridRandomRecipe and BayesRecipe.
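The two inputs described above can be pictured with a small stand-alone sketch. Plain dictionaries stand in for DataFrame rows, and `ToyRandomRecipe` is a hypothetical class showing how a recipe might bundle a search space, stopping criteria, and a sample count. The column names, values, field names, and helper functions are all assumptions for illustration, not the Analytics Zoo API.

```python
import random
from dataclasses import dataclass, field

# Rows standing in for DataFrame records: a timestamp, a target value,
# and one optional extra feature. Column names and values are made up.
records = [
    {"datetime": "2019-01-01 00:00", "value": 1.0, "is_holiday": 1},
    {"datetime": "2019-01-01 01:00", "value": 2.0, "is_holiday": 1},
    {"datetime": "2019-01-01 02:00", "value": 3.0, "is_holiday": 1},
]


def check_records(rows, dt_col, target_col, extra_feature_col=None):
    """Check that required columns exist and rows are in chronological order."""
    required = [dt_col, target_col] + list(extra_feature_col or [])
    has_columns = all(col in row for row in rows for col in required)
    stamps = [row[dt_col] for row in rows] if has_columns else []
    return has_columns and stamps == sorted(stamps)


@dataclass
class ToyRandomRecipe:
    """Hypothetical recipe: search space, stopping criteria, sample count."""
    num_samples: int = 5                # how many configurations to draw
    max_epochs: int = 10                # stop after this many epochs
    target_metric_value: float = 0.01   # or once the metric is this low
    search_space: dict = field(default_factory=lambda: {
        "past_seq_len": [2, 4, 8],
        "batch_size": [32, 64],
    })

    def sample(self, seed=0):
        """Draw num_samples random configurations from the search space."""
        rng = random.Random(seed)
        return [{name: rng.choice(choices)
                 for name, choices in self.search_space.items()}
                for _ in range(self.num_samples)]


def should_stop(recipe, epoch, metric_value):
    """Apply the recipe's stopping criteria to one trial."""
    return (epoch >= recipe.max_epochs
            or metric_value <= recipe.target_metric_value)


recipe = ToyRandomRecipe()
configs = recipe.sample()
valid = check_records(records, "datetime", "value", ["is_holiday"])
```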
After you obtain the TimeSequencePipeline which contains the best hyper-parameter configurations and the trained model returned by the AutoML framework, you may save it to a file and load it back later for evaluation, prediction or incremental training, as illustrated below.
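The save, load, predict, and incremental-training cycle can be sketched with a toy object. `ToyPipeline`, its methods, the JSON file layout, and the moving-average "model" are hypothetical stand-ins chosen to keep the example self-contained; they do not reflect how TimeSequencePipeline is actually stored or trained.

```python
import json
import os
import tempfile


class ToyPipeline:
    """Hypothetical pipeline: best configuration plus model state."""
    def __init__(self, config, last_values):
        self.config = config
        self.last_values = list(last_values)

    def predict(self):
        """Predict the next value as the mean of the last `lookback` values."""
        window = self.last_values[-self.config["lookback"]:]
        return sum(window) / len(window)

    def fit_incremental(self, new_values):
        """Toy incremental training: absorb newly observed values."""
        self.last_values.extend(new_values)

    def save(self, path):
        with open(path, "w") as f:
            json.dump({"config": self.config,
                       "last_values": self.last_values}, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            state = json.load(f)
        return cls(state["config"], state["last_values"])


# Save a pipeline to a temporary file, then restore it elsewhere.
path = os.path.join(tempfile.mkdtemp(), "toy_pipeline.json")
ToyPipeline({"lookback": 3}, [1.0, 2.0, 3.0, 4.0]).save(path)

restored = ToyPipeline.load(path)
before = restored.predict()            # mean of [2.0, 3.0, 4.0]
restored.fit_incremental([5.0, 6.0])
after = restored.predict()             # mean of [4.0, 5.0, 6.0]
```

Storing the configuration together with the model state is what lets the restored object resume prediction or further training without rerunning the search.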
For a more complex AutoML example for time series prediction, you may refer to the notebook at https://github.com/intel-analytics/analytics-zoo/blob/automl/apps/automl/nyc_taxi_dataset.ipynb, which uses historical taxi passenger volume in NYC to predict future demand (similar to the use case described in ). For instance, the figure below shows the predicted taxi passenger volume for the next time step using AutoML.
This blog provides a quick overview of the AutoML and time series prediction support in Analytics Zoo; for additional details, please see https://github.com/intel-analytics/analytics-zoo/tree/automl/pyzoo/zoo/automl.
[1] Guokun Lai, Wei-Cheng Chang, Yiming Yang, Hanxiao Liu. “Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks”
[2] Nikolay Laptev, Slawek Smyl, Santhosh Shanmugam. “Engineering Extreme Event Forecasting at Uber with Recurrent Neural Networks”
*Other names and brands may be claimed as the property of others