Automated Machine Learning for time series forecasting

Francesca Lazzeri
Data Science at Microsoft
Oct 19, 2021 (7 min read)

By Francesca Lazzeri. This article is an extract from the book Machine Learning for Time Series Forecasting with Python, also by Lazzeri, published by Wiley.

Time series is a type of data that measures how things change over time. In a time series dataset, the time column does not represent a variable per se; instead, it’s more useful to think of it as a primary structure for ordering the dataset. This temporal structure makes time series problems more challenging to work with, because data scientists must apply specific data preprocessing and feature engineering techniques to handle time series data. However, it also represents a source of additional knowledge that data scientists can use to their advantage. In my previous article, I showed how to leverage this temporal information to extrapolate insights from time series data and to make that data easier to model, so that it can be used for future strategy and planning operations in several industries.

Building Machine Learning (ML) models with time series data is often time-consuming and complex, with many tasks to handle, such as iterating through algorithms, tuning ML hyperparameters, and applying feature engineering techniques. These options multiply with time series data, because data scientists must also consider trends, seasonality, holidays, and external economic variables. In this article, I demonstrate how to train a time series forecasting regression model using Automated ML in Azure Machine Learning.

Figure 1: Automated ML in Azure Machine Learning.

Azure Machine Learning

Azure Machine Learning is a cloud service for accelerating and managing the ML project lifecycle. Data scientists and engineers can use it in their day-to-day workflows to train and deploy models and manage MLOps.

For application developers, it provides tools for integrating models into applications or services. For platform developers, a robust set of tools, backed by durable Azure Resource Manager APIs, is available for building advanced ML tooling. Enterprises working in the Microsoft Azure cloud get familiar security and role-based access control (RBAC) for infrastructure, including the ability to set up projects that deny access to protected data and specific operations.

Automated ML on Azure Machine Learning

Automated Machine Learning (Automated ML) is the process of automating the time-consuming, iterative tasks of ML model development. It allows data scientists, analysts, and developers to build ML models with high scale, efficiency, and productivity while sustaining model quality. Automated ML in Azure Machine Learning is based on work from Microsoft Research.

Traditional ML model development is resource intensive, requiring significant domain knowledge and time to produce and compare dozens of models. With automated ML, it’s possible to accelerate the time it takes to get production-ready machine learning models with greater ease and efficiency.
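To make the idea concrete, here is a toy sketch in plain Python of the loop that automated ML takes off your hands: fit several candidate forecasters, score each one on held-out data, and keep the best. The three candidates (last-value, seasonal-naive, and mean forecasters), the sample series, and all function names are illustrative assumptions of mine, not what Azure's Automated ML actually runs.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Candidate forecasters: each maps (history, horizon) -> list of predictions.
def last_value(history, horizon):
    return [history[-1]] * horizon

def seasonal_naive(history, horizon, season=4):
    # Repeat the values observed during the most recent season.
    return [history[-season + (i % season)] for i in range(horizon)]

def mean_forecast(history, horizon):
    return [sum(history) / len(history)] * horizon

def select_best(series, horizon, candidates):
    """Hold out the last `horizon` points, score every candidate, return the winner."""
    train, valid = series[:-horizon], series[-horizon:]
    scores = {name: rmse(valid, fn(train, horizon)) for name, fn in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical series with a seasonal pattern of period 4.
series = [10, 20, 30, 20] * 5
candidates = {
    "last_value": last_value,
    "seasonal_naive": seasonal_naive,
    "mean": mean_forecast,
}
best_name, scores = select_best(series, horizon=4, candidates=candidates)
print(best_name, scores)
```

With real data the candidate list is far longer and each candidate also has hyperparameters to tune, which is where the manual effort goes.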

Azure Machine Learning offers the following two experiences for working with automated ML:

  • For code-experienced customers, Azure Machine Learning Python SDK.
  • For limited/no-code experience customers, Azure Machine Learning studio.

In this article, I show how to use the Azure Machine Learning Python SDK to leverage automated ML on Azure. If you are using a cloud-based Azure Machine Learning compute instance, you are ready to start coding by using either the Jupyter notebook or JupyterLab experience. You can find more information on how to configure a development environment for Azure Machine Learning at https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-environment

You can also visit the following links to configure your Azure Machine Learning workspace and learn how to use Jupyter notebooks on Azure Machine Learning:

Moreover, a compute target is required to execute a remote run of your automated ML experiment. Azure Machine Learning Compute is a managed-compute infrastructure that allows you to create a single- or multi-node compute. To learn more about how to set up and use compute targets for model training, you can visit: docs.microsoft.com/en-us/azure/machine-learning/how-to-set-up-training-targets.

Automated ML for time series scenarios

Designing a forecasting model is like setting up a typical regression model using Automated ML on Azure; however, it is important to understand the configuration options and pre-processing steps that exist for time series data. The most important difference between a forecasting regression task and a regression task within Automated ML is that a forecasting task designates one variable in your dataset as the time stamp column.

The following examples in Python show how to do the following:

  • Prepare data for time series forecasting with Automated ML.
  • Configure specific time series parameters in an AutoMLConfig object.
  • Train the model using AmlCompute, which is a managed-compute infrastructure that allows the easy creation of a single- or multi-node compute.

This Automated ML example uses a New York City energy demand dataset (mis.nyiso.com/public/P-58Blist.htm). The dataset includes consumption data from New York City stored in a tabular format and includes energy demand and numerical weather features at an hourly frequency. The purpose of this experiment is to predict the energy demand in New York City for the next 24 hours by building a forecasting solution that leverages historical energy data from the same geographic region.

Figure 2: Historical energy demand in New York City region.

In case you are interested in exploring additional public datasets and features (such as weather, satellite imagery, or socioeconomic data) and adding them to this energy dataset to improve the accuracy of your ML models, I recommend checking out the Azure Open Datasets catalog (www.aka.ms/AzureOpenDatasetsCatalog). This catalog contains a collection of public datasets that data scientists can leverage for ML model solutions. Incorporating features from curated datasets into your ML models can improve the accuracy of predictions and reduce data preparation time.

For our automated ML experiment, we need to identify the target column, which represents the target variable that we want to forecast. The time column is our time stamp column that defines the temporal structure of our dataset.
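As a minimal sketch of what these two columns mean in practice, the plain-Python snippet below names a time column and a target column, then splits the rows chronologically so that the test set always lies after the training set. The column names ("timeStamp", "demand"), the 24-step horizon, and the synthetic rows are assumptions for illustration; substitute the names and values from your own dataset.

```python
from datetime import datetime, timedelta

# Hypothetical column names and horizon; adjust them to your own dataset.
time_column_name = "timeStamp"
target_column_name = "demand"
max_horizon = 24  # 24 hourly steps = the next 24 hours

def chronological_split(rows, time_col, cutoff):
    """Sort rows by time, then split into rows up to `cutoff` (train) and after it (test)."""
    ordered = sorted(rows, key=lambda r: r[time_col])
    train = [r for r in ordered if r[time_col] <= cutoff]
    test = [r for r in ordered if r[time_col] > cutoff]
    return train, test

# Synthetic hourly rows standing in for the real energy data.
start = datetime(2017, 1, 1)
rows = [
    {time_column_name: start + timedelta(hours=h), target_column_name: 5000.0 + h}
    for h in range(72)
]

# Hold out the last `max_horizon` hours for testing.
cutoff = rows[-1][time_column_name] - timedelta(hours=max_horizon)
train_rows, test_rows = chronological_split(rows, time_column_name, cutoff)
print(len(train_rows), len(test_rows))
```

Splitting by time rather than at random matters here: a random split would let the model see observations from after the period it is asked to predict.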

For forecasting tasks, automated ML uses pre-processing and estimation steps that are specific to time series data. It first detects the time series sample frequency (for example, hourly, daily, weekly) and creates new records for absent time points to make the series continuous. Then it imputes missing values in the columns for target (via forward-fill) and feature (using median column values) and creates grain-based features to enable fixed effects across different series. Finally, it creates time-based features to assist in learning seasonal patterns and encodes categorical variables into numeric quantities (to learn more about this process, visit www.aka.ms/AutomatedML).
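The steps above can be sketched in a few lines of plain Python. This is only an illustration of the general techniques (frequency detection, gap filling, forward-fill, median imputation, calendar features) on a tiny made-up series; it is not the code Automated ML runs, the field names "time", "target", and "feature" are hypothetical, and grain-based features and categorical encoding are left out for brevity.

```python
from collections import Counter
from datetime import datetime, timedelta
from statistics import median

def detect_frequency(timestamps):
    """Return the most common gap between consecutive sorted timestamps."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return Counter(gaps).most_common(1)[0][0]

def fill_gaps(rows, freq):
    """Insert a row with None values for every missing time point."""
    by_time = {r["time"]: r for r in rows}
    t, end = min(by_time), max(by_time)
    filled = []
    while t <= end:
        filled.append(by_time.get(t, {"time": t, "target": None, "feature": None}))
        t += freq
    return filled

def impute(rows):
    """Forward-fill the target; replace missing feature values with the median."""
    feature_median = median(r["feature"] for r in rows if r["feature"] is not None)
    last_target = None
    out = []
    for r in rows:
        target = r["target"] if r["target"] is not None else last_target
        feature = r["feature"] if r["feature"] is not None else feature_median
        last_target = target
        out.append({"time": r["time"], "target": target, "feature": feature})
    return out

def add_time_features(rows):
    """Add simple calendar features that can help a model learn seasonality."""
    return [dict(r, hour=r["time"].hour, weekday=r["time"].weekday()) for r in rows]

# Hypothetical hourly data: the 02:00 record is absent and one feature value is missing.
t0 = datetime(2017, 1, 1)
raw = [
    {"time": t0, "target": 100.0, "feature": 1.0},
    {"time": t0 + timedelta(hours=1), "target": 110.0, "feature": None},
    {"time": t0 + timedelta(hours=3), "target": 130.0, "feature": 3.0},
    {"time": t0 + timedelta(hours=4), "target": 140.0, "feature": 5.0},
]
freq = detect_frequency(sorted(r["time"] for r in raw))
clean = add_time_features(impute(fill_gaps(raw, freq)))
for row in clean:
    print(row)
```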

The AutoMLConfig object defines the settings and data necessary for an Automated ML task: Data scientists must define standard training parameters such as task type, number of iterations, training data, and number of cross-validations. For forecasting tasks, there are additional parameters that must be set and that affect the experiment. The following table summarizes each parameter and its usage:

Table 1: Automated ML parameters to be configured with the AutoMLConfig class.

The code below shows how to set those parameters in Python. Specifically, you use the blocked_models parameter to exclude some models. You can choose to remove models from the blocked_models list and increase the experiment_timeout_hours parameter value to see your Automated ML results:

# Automated ML configuration
import logging
from azureml.train.automl import AutoMLConfig

# Settings specific to time series forecasting
automl_settings = {
    'time_column_name': time_column_name,
    'max_horizon': max_horizon,
}
# Standard training settings plus the forecasting settings above
automl_config = AutoMLConfig(
    task='forecasting',
    primary_metric='normalized_root_mean_squared_error',
    blocked_models=['ExtremeRandomTrees', 'AutoArima', 'Prophet'],
    experiment_timeout_hours=0.3,
    training_data=train,
    label_column_name=target_column_name,
    compute_target=compute_target,
    enable_early_stopping=True,
    n_cross_validations=3,
    verbosity=logging.INFO,
    **automl_settings)
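A note on n_cross_validations: with time series data, validation folds cannot be drawn at random, because that would let the model train on observations from after the period it is validated on. As I understand the Azure documentation, forecasting experiments use a rolling-origin scheme instead. The sketch below shows the general idea in plain Python; the function name and the way folds are stepped forward by one horizon are my own assumptions, not Azure's exact implementation.

```python
def rolling_origin_splits(n_samples, horizon, n_folds):
    """Yield (train_indices, validation_indices) pairs with a moving forecast origin.

    Each fold trains on everything before the origin and validates on the
    next `horizon` points; each later fold pushes the origin forward by `horizon`.
    """
    first_origin = n_samples - horizon * n_folds
    if first_origin <= 0:
        raise ValueError("Not enough samples for the requested folds and horizon")
    for fold in range(n_folds):
        origin = first_origin + fold * horizon
        yield list(range(origin)), list(range(origin, origin + horizon))

# Example: 10 days of hourly data, a 24-step horizon, and 3 folds.
folds = list(rolling_origin_splits(n_samples=240, horizon=24, n_folds=3))
for train_idx, valid_idx in folds:
    print(f"train: 0..{train_idx[-1]}  validate: {valid_idx[0]}..{valid_idx[-1]}")
```

Every validation window sits strictly after its training window, so each fold mimics a real forecast made at the origin.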

We now call the submit method on the experiment object and pass the run configuration. Depending on the data and the number of iterations, this can run for a while. You may specify show_output = True to print currently running iterations to the console:

# Initiate the remote run 
remote_run = experiment.submit(automl_config, show_output=False)
remote_run

Below we select the best model from all the training iterations using the get_output method:

# Retrieve the best model 
best_run, fitted_model = remote_run.get_output()
fitted_model.steps

Now that we have retrieved the best model for our forecasting scenario, it can be used to make predictions on test data. First, we need to remove the target values from the test set:

# Make predictions using test data
X_test = test.to_pandas_dataframe().reset_index(drop=True)
y_test = X_test.pop(target_column_name).values

For forecasting, we use the forecast method of the fitted model, as illustrated in the following sample code:

# Apply the forecast function 
y_predictions, X_trans = fitted_model.forecast(X_test)
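Once you have predictions, a natural next step is to compare them with the actual values you set aside. The sketch below computes a normalized root mean squared error, the same kind of metric used as primary_metric in the configuration. It assumes the normalization is by the range (maximum minus minimum) of the actual values, which is how I read the Azure documentation; the function name and the sample numbers are hypothetical.

```python
from math import sqrt

def normalized_rmse(y_true, y_pred):
    """RMSE divided by the range (max - min) of the actual values."""
    y_true, y_pred = list(y_true), list(y_pred)
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("Inputs must be non-empty and of equal length")
    value_range = max(y_true) - min(y_true)
    if value_range == 0:
        raise ValueError("Actual values are constant, so the range is zero")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return sqrt(mse) / value_range

# Hypothetical actuals and forecasts, each off by 10 units.
actual = [100.0, 200.0, 300.0, 400.0]
forecast = [110.0, 190.0, 310.0, 390.0]
score = normalized_rmse(actual, forecast)
print(round(score, 4))
```

In the article's setting you would pass y_test and y_predictions in place of the sample lists. Lower values are better, and because the error is scaled by the range of the data, scores are easier to compare across series with different magnitudes.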

Conclusion

In this article, I showed how to train a time series forecasting regression model using autoregressive methods and Automated ML in Azure Machine Learning. Automated ML allows data scientists, analysts, and developers to build ML models with high scale, efficiency, and productivity while sustaining model quality.

Traditional ML model development is resource intensive, requiring significant domain knowledge and time to produce and compare dozens of models. With automated ML, it’s possible to accelerate the time it takes to get production-ready ML models with greater ease and efficiency.

I also showed how Automated ML for time series iterates through a portfolio of different ML algorithms for time series forecasting while performing best model selection, hyperparameter tuning, and feature engineering for a given scenario.

In my next article in this series, I review Python libraries for time series data and how open source libraries can help with data handling, time series modeling, and Machine Learning.

Francesca Lazzeri is on LinkedIn and Twitter.

Principal Data Scientist Director @Microsoft ~ Adjunct Professor @Columbia University ~ PhD