Failing Fast with DeepAR Neural Networks for Time-Series

Matt Wheeler
Feb 26, 2019 · 6 min read

Harness machine learning algorithms on AWS to build out reliable time series models

Image for post
Image for post
Photo by Fatos Bytyqi on Unsplash

Fail-Fast Framework

In the world of machine learning, failing fast is crucial. When considering AWS’s DeepAR, a recurrent neural-network (RNN) time series algorithm, there are several questions to ask when sizing up the ability to fail fast:

  1. Is DeepAR appropriate for the problem at hand / are there enough individual time series (at least in the hundreds)?
  2. Can a pipeline be built quickly?
  3. Can you harness hyperparameter settings to create a quality model?
  4. Can the model performance be analyzed efficiently?
  5. Have you read forums to understand nuances of the algorithm?

Is DeepAR Right for Your Problem?

Deep Learning for time series analysis requires many separate series so the neural net can learn from the patterns in each series. As an example, let’s think of consumer purchases from a widgets website. Your organization has just over two years of data. You may want to run a time-series forecast for each zip code in your marketing database and predict purchases at the daily-level. Now, there are 42,000 US zip codes, so as long as your company sells to a few hundred zip codes, then DeepAR is optimal for “learning” from the multiplicity of series.

Can a Pipeline Be Built Quickly?

Using AWS Sagemaker algorithms can be a headache. DeepAR requires a rather niche data structure and that it be stored in JSON format. Hive and PySpark, both hosted on AWS EMR (Elastic MapReduce), could be your ideal tools for getting your data in the required format. Hive allows you to utilize MapReduce to parallelize large joins and perform minor feature engineering. PySpark, with it’s simple syntax and lightning speed, will allow you to quickly transform your data into the array-format that is digestible by DeepAR. Additionally, PySpark is where you can do the bulk of your feature engineering.

Image for post
Image for post

In the context of failing fast, your first model attempt should be as basic as possible (perhaps one categorical feature, such as median income for each Zip Code). Go through an entire modeling cycle and assess the results before returning to the feature engineering stage.

Choosing features should be directly related to the business context — don’t overthink this. DeepAR allows for categorical features (think regional category, median income, and median home price) and dynamic features (think Holiday feature, Sale Event Status, which would be different for each date). The categorical features should be contained within an array that is one-to-one with each series and continuous integer-encoded. All continuous integers need to be present. For instance, if you have zip codes with no income levels for “200k-249k”, then you would need to re-bucket at a higher level. Figure 1 shows how the Median Income value is recoded to integer. Once all three features are integer-encoded, the respective values for each series are housed in an array.

Image for post
Image for post
Figure 1. Categorical encoding example

The dynamic features should be housed in an array of arrays (multiple arrays housed within a master array) with a length equal to training period plus the prediction period. Let’s say we have two dynamic features (Holiday & Sale Event). The training period is four days with a prediction period of three days (not very realistic!), then the array of arrays would look like the following:

Image for post
Image for post
Figure 2. Dynamic Feature array of arrays example

Piecing the entire input together in PySpark, the overall JSON structure for one time series looks like the following:

Image for post
Image for post
Figure 3. Example of DeepAR input JSON structure

If you have a data engineer who is capable of designing the pipeline in this fashion, then the heavy lifting is almost over.

Can You Harness Hyperparameter Settings to Create a Quality Model?

For model training, you will certainly want to fail fast.

The Context Length, the number of time-points that the model gets to see before making the prediction, is something you should set to be about same as the prediction period to ensure optimal model performance. The average series length will need to be greater than the sum of the prediction length plus the context length, but your training job will fail with an error message after a few minutes if this is violated. The DeepAR neural networks are incredibly sensitive to changes in your data, so if you re-aggregate at a different level (say hourly instead of daily), then start from scratch with the default hyperparameter ranges — do not assume the previous parameters will perform well on refactored data.

You should be using a representative subset of your data (perhaps a few regions instead). For training resources, AWS ML-P3 instances (accelerated computing) are ideal for DeepAR since the algorithm is compute intensive and these instances utilize GPU’s for the fasted training jobs possible at this time.

Running a tuning job will maximize the odds of finding optimal hyperparameter ranges. For starting hyperparameter ranges, use the AWS recommended ranges. Choose somewhere between a three and ten training jobs for your overall tuning job. After your training jobs are complete, you can clone the job and choose “warm start” with either the “transfer-learning” option or the “Identical data and algorithm” option. Either option will utilize the results of previous tuning jobs for your cloned job and use Bayesian statistics to further refine the hyperparameters chosen.

Image for post
Image for post
Figure 4. Sagemaker interface for cloning tuning jobs

Can the Model Performance be Analyzed Efficiently?

DeepAR allows two methods for generating predictions. The first method involves setting up an Endpoint to call individual predictions, which is convenient if you want to do real-time analytics. This also requires writing some Python code to loop through all of your series and also post-process the results. The second method, which is used for generating predictions on a large batch of series, is known as Batch Transform.

Even when the use-case is a real-time API-driven, to fail-fast, use Batch Transform for quickly assessing performance across a large sample of time series.

Batch Transform is drastically cheaper since it will not deploy Endpoint instances, which could end up being expensive if left activated. Equally as important, this method will run in a fraction of the time. Another key feature, is the ability to forward your Series ID (in this case, Zip Code) via Environment variables. This is critical if you want to actually use your results without relying on consistent sort orders between data sources.

Image for post
Image for post
Figure 5. Batch Transform request example

After batching predictions, you can write some basic code to convert your batch results into a actionable format such as CSV. You will likely want to visualize the model results and compute some performance metrics such as RMSE (root-mean-square error).

Have You Read Forums to Understand Nuances of the Algorithm?

Since DeepAR is a recent algorithm release, AWS is consistently making improvements. Additionally, other consumers are likely running into the same issues as your organization may encounter. Read through relevant AWS forum posts and don’t be shy about posting your issue. AWS analysts usually respond within a week.


AWS DeepAR probabilistic forecasting can be a perfect solution for time-series forecasting if you have at least a few hundred separate series to predict. Utilizing Hive and Spark when necessary can assist in database joins and fail-fast feature engineering. Relying on the right EC2 instances will help keep your training times short. If you don’t need real-time low latency predictions, then stay away from using an inference endpoint and run a Batch Transform job to batch all of your predictions at once (especially during the model tuning phase). Rely on AWS Forums for support — this can be a game-changer for fail-fast data science using newer algorithms on AWS. Lastly, keep an eye out for the upcoming release of DeepAR+, with Learning rate scheduling, Model averaging, and Weighted sampling — these features will certainly improve your model results.

Special thanks to Mike Creeth.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store