Ensemble Models

Credit card receipts, footfall traffic, and web scrapes are early examples of alternative data. New, diverse data sources are rapidly becoming available, expanding the possibilities for analysis. Datamonster™ is a platform for discovering, quantifying, and visualizing relationships amongst the data, while managing the entire ETL process — Datamonster™ makes working with the data easy.

Diverse data sources provide the opportunity to explore, compare, and combine data. However, different models built on different data sources produce different predictions, which can be confusing. To provide a single, coherent analysis from multiple data sources, we employ an averaging algorithm called the Ensemble Model.

Like the blind men in the parable who each misinterpret one part of an elephant, models based on one data source can lead to biased predictions. The ensemble approach combines multiple perspectives to produce a more accurate prediction. Analogously, financial models built from foot traffic, credit card receipts, and website visits may each have their own biases but in combination become more reliable.

The Ensemble Model combines multiple models into one. In this post, we discuss the construction of the Ensemble Model and analyze its advantages in accuracy and precision over using the member models individually.

When to use an Ensemble Model

There are two extremes for creating models from multiple data sources:

  • Single, Complex Model: if there is enough data, one can create a single complex model that takes all the data sources as input and produces a single output. Optimizing the parameters of the model (making sure the model fits the data) can be difficult and computationally expensive. If data is scarce, complex models are at risk of overfitting and lacking predictive power.
  • Ensemble of Multiple, Simple Models: if data is scarce, then it is advisable to create simpler models. Multiple models are required to incorporate multiple data sources, and afterwards we combine those simple models into a single Ensemble Model.
Some models are complex, like the neural network called AlexNet (left). Optimizing the large number of parameters in AlexNet requires extreme amounts of data and computing power. In contrast, single-variable linear regression (right) has two parameters. The linear model may be too simple to describe nuances of the data such as seasonality, but it greatly reduces the risk of overfitting.

By judiciously combining simple models, the Ensemble Model increases accuracy, tightens confidence intervals, and mitigates overfitting. The Ensemble Model is a good choice when data is scarce. Its main limitation is that it cannot model complex interdependencies among the data sources.

For investors’ modeling applications, data is often scarce: think of quarterly earnings reports. This scarcity imposes restrictions on the complexity of models. The Ensemble Model benefits from multiple data sets, mitigates overfitting, and provides one coherent prediction.

Weights of the Ensemble Model

The Ensemble Model is a weighted average of the member (simple) models, where more accurate models are given more weight and noisier models are given less weight. If the noise in the member models is independent, then the optimal weight for each model is proportional to the inverse of that model’s variance, a scheme known as inverse-variance weighting.

The Ensemble Model combines multiple models together to produce an averaged prediction. The variance of the prediction is represented by the radius of the prediction circles. The ensemble prediction theoretically has smaller variance than those of the member models.
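As a minimal sketch of inverse-variance weighting under the independence assumption, the snippet below combines three member predictions. The function names, predictions, and variances are hypothetical values chosen for illustration and do not come from Datamonster™.

```python
import numpy as np

def inverse_variance_weights(variances):
    """Weights proportional to 1/variance, normalized to sum to one."""
    inv = 1.0 / np.asarray(variances, dtype=float)
    return inv / inv.sum()

def combine_independent(predictions, variances):
    """Ensemble mean and variance, assuming independent member noise."""
    variances = np.asarray(variances, dtype=float)
    w = inverse_variance_weights(variances)
    mean = float(np.dot(w, predictions))
    # For independent noise, Var(sum_i w_i x_i) = sum_i w_i^2 * var_i.
    var = float(np.sum(w**2 * variances))
    return mean, var, w

# Hypothetical predictions and variances for three member models.
preds = [102.0, 98.0, 105.0]
variances = [4.0, 1.0, 16.0]
mean, var, w = combine_independent(preds, variances)
print(mean, var, w)
```

In this setting the ensemble variance equals the reciprocal of the summed inverse variances, so it is smaller than the variance of even the most accurate member.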

Ensemble Weights for Correlated Models

For each member model that is part of the ensemble, we have a prediction of a future value and a variance associated with that prediction. 

The Ensemble is a weighted average of the member models. The weights in the Ensemble sum to one.

If the noise model for each member model is independent, then the optimal choice of weights in the Ensemble Model construction is inversely proportional to the variance of the member models:
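In symbols, writing $\hat{y}_i$ for the prediction of member model $i$, $\sigma_i^2$ for its variance, and $n$ for the number of member models (this notation is an assumption made here for illustration), inverse-variance weighting can be written as:

```latex
\hat{y}_{\mathrm{ens}} = \sum_{i=1}^{n} w_i \, \hat{y}_i ,
\qquad
w_i = \frac{1/\sigma_i^2}{\sum_{j=1}^{n} 1/\sigma_j^2} ,
\qquad
\operatorname{Var}\!\left(\hat{y}_{\mathrm{ens}}\right)
  = \frac{1}{\sum_{j=1}^{n} 1/\sigma_j^2} .
```

The last expression holds only under the independence assumption. Each additional model adds a positive term to the denominator, so the ensemble variance shrinks.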

In the case of correlation in the noise of the member models, the weights are chosen to minimize the variance in the Ensemble’s prediction.

The weights are the solution to a constrained optimization problem: minimize the variance of the weighted average, subject to the weights summing to one.
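One way to write this problem, assuming $\Sigma$ denotes the covariance matrix of the member models’ errors and $\mathbf{1}$ the vector of ones (symbols introduced here for illustration), is:

```latex
\min_{w \in \mathbb{R}^n} \; w^{\top} \Sigma \, w
\quad \text{subject to} \quad \mathbf{1}^{\top} w = 1 .
```

If the sum-to-one constraint is the only constraint and $\Sigma$ is invertible, a Lagrange-multiplier argument gives $w = \Sigma^{-1}\mathbf{1} / (\mathbf{1}^{\top}\Sigma^{-1}\mathbf{1})$, which reduces to inverse-variance weighting when $\Sigma$ is diagonal. Additional constraints, such as requiring non-negative weights, or a poorly conditioned estimate of $\Sigma$, are cases where a numerical method may be preferable.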

The optimal weights can be difficult to compute exactly, but they can be numerically approximated using a gradient descent method that starts from the weights that are optimal under independent noise (the inverse-variance weights).
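The following is a rough sketch of such a gradient descent, not the Datamonster™ implementation. It assumes the only constraint is that the weights sum to one, uses a hypothetical covariance matrix, and picks a fixed step size from the largest eigenvalue. Each step removes the mean of the gradient so that the weights keep summing to one.

```python
import numpy as np

def min_variance_weights(cov, steps=5000, tol=1e-12):
    """Approximate weights minimizing w @ cov @ w subject to sum(w) == 1."""
    cov = np.asarray(cov, dtype=float)
    # Start from the inverse-variance weights (optimal for independent noise).
    inv = 1.0 / np.diag(cov)
    w = inv / inv.sum()
    # Fixed step size 1/L, where L = 2 * (largest eigenvalue of cov).
    lr = 1.0 / (2.0 * np.linalg.eigvalsh(cov)[-1])
    for _ in range(steps):
        grad = 2.0 * cov @ w
        grad -= grad.mean()  # keep the update inside the sum(w) == 1 plane
        w_new = w - lr * grad
        if np.max(np.abs(w_new - w)) < tol:
            w = w_new
            break
        w = w_new
    return w

# Hypothetical covariance of three correlated member models.
cov = np.array([[4.0, 1.0, 0.5],
                [1.0, 1.0, 0.2],
                [0.5, 0.2, 2.0]])
inv = 1.0 / np.diag(cov)
w_start = inv / inv.sum()
w_opt = min_variance_weights(cov)
print(w_start @ cov @ w_start, w_opt @ cov @ w_opt)
```

Note that this sketch allows negative weights; enforcing non-negativity would require an extra projection step.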

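Choosing the weights by minimizing the variance ensures that, in theory, the variance of the ensemble’s prediction does not increase as models are added: a new model can always be given zero weight, so including it can only help. The Ensemble Model therefore benefits from including more models.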

Choice of Member Models in the Ensemble

To prevent overfitting when data is scarce, member models are typically on the simpler end of the spectrum, meaning the models have a very small number of parameters.

Simpler Models: Linear Regression

Linear regression models are less likely to overfit but have explicit weaknesses including:

  • No growth over time. Each data point is equally weighted whether the point was from 20 quarters ago or last quarter. For linear regression, the error tends to be smaller near the middle of the time window with larger errors near the ends.
  • Fails to capture acceleration (or deceleration). The relationship between the data source and the metric may be non-linear. 
  • Seasonality. Some relationships change with seasons. Modeling each seasonal component separately may require more data than is available and result in overfitting. A single linear regression will average the seasonal variations.
The same linear regression data is plotted as a time series (left) and as a scatter plot (right). The scatter plot does not show the time dependency of the data, aside from the quarter label. The time series shows the seasonal pattern in the errors and perhaps an acceleration in growth.
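To illustrate how a single linear regression averages over seasonal variation, the sketch below fits a line to simulated quarterly data and groups the residuals by quarter. The data, the seasonal offsets, and the noise levels are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_quarters = 20
quarter = np.arange(n_quarters) % 4  # 0..3 stand for Q1..Q4

# Hypothetical data source and metric with a seasonal offset per quarter.
x = np.linspace(10.0, 30.0, n_quarters) + rng.normal(0.0, 0.5, n_quarters)
seasonal = np.array([-2.0, 0.5, 0.0, 1.5])[quarter]
y = 3.0 * x + 5.0 + seasonal + rng.normal(0.0, 0.5, n_quarters)

# Ordinary least squares fit of a straight line (two parameters).
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Average residual per quarter: a systematic pattern here indicates
# seasonality that the single regression line cannot capture.
mean_resid_by_quarter = np.array(
    [residuals[quarter == q].mean() for q in range(4)]
)
print(mean_resid_by_quarter)
```

The overall residuals average to zero, but the per-quarter averages do not, which is the seasonal error pattern visible in a time series view and hidden in a scatter plot.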

More Complex Models: LASSO and Seasonal ARIMA(X)

More complex models such as LASSO or ARIMA(X) can be used either alone or as member models in the ensemble. These algorithms generally have more parameters and consequently require significantly more data to build than simpler linear models. 

LASSO (Least Absolute Shrinkage and Selection Operator) is a regression model akin to linear regression, and it shares the same major weakness: a lack of time dependence. The difference is that the parameters are chosen not only to minimize the mean-squared error (as in linear regression) but also to keep the parameters small, through a penalty on their absolute values. This regularization lets LASSO learn which data sources matter most by shrinking the coefficients of the others toward zero. When data is scarce, however, LASSO remains susceptible to overfitting.
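For reference, one common way to write the LASSO objective is shown below. The symbols ($y_t$ for the metric, $x_{tj}$ for data source $j$ at time $t$, $\beta$ for the coefficients, $\lambda$ for the penalty strength) are notation assumed here, and the scaling of the squared-error term varies between references.

```latex
\min_{\beta_0,\,\beta} \;
\frac{1}{2n} \sum_{t=1}^{n}
  \Bigl( y_t - \beta_0 - \sum_{j=1}^{p} \beta_j x_{tj} \Bigr)^{2}
\;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert .
```

Setting $\lambda = 0$ recovers ordinary least squares, while larger values of $\lambda$ drive more coefficients exactly to zero.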

ARIMA(X) is a collection of time-dependent models that generally have more parameters than simpler models. It is possible to integrate data sources as exogenous variables (the X in ARIMA(X)), but doing so increases the risk of overfitting.

While both of these models are alternatives to the Ensemble Model, they can also be included as member models in it. We can improve accuracy and reduce variance in our predictions by including a simple seasonal ARIMA model (no exogenous variables) in the Ensemble Model. This choice benefits from the advantages of time-dependent models while mitigating the risk of overfitting; we found that including exogenous variables in the ARIMA model did not improve its accuracy. LASSO is similar in spirit to the weighted averaging of additional data but chooses the weights differently. Its choice of weights accounts for more complex relationships between different data sources, but in our financial models it gives inferior results to the choice of weights in the Ensemble Model.

How does the Ensemble Model compare to choosing the best of multiple models?

Choosing the best of multiple models ignores less accurate models entirely. The Ensemble Model tempers, but still includes, the contribution from less accurate models. 

A direct comparison is difficult because there are a large number of variables, including the accuracy of the simple models, the number of models, the number of data points, and the diversity of the simple models. For a wide range of these variables, the Ensemble is demonstrably more accurate than the best of multiple models. However, the Ensemble’s advantage over more complex models diminishes as data becomes less scarce, although computationally the Ensemble still scales better than more complex models.

Abstract Example

The first example compares the Ensemble Model, created from five single-variable linear regression models, with two alternative models:

  • Choosing the linear regression model with the lowest (best) Mean Squared Error (MSE). The computational cost of this choice scales linearly with the number of data sources.
  • Choosing the two-variable multi-linear regression model with the lowest MSE. The computational cost of this choice scales quadratically with the number of data sources.

For this experiment, we simulate a noisy, linear relationship between a dependent variable and five independent data sources. We build all five single-variable linear regression models and all ten two-variable multi-linear regression models. Each data source is constructed with normally distributed noise to model the relation between the dependent and independent variables. We partition the data into two halves: one for training and one for testing. We compare the Ensemble Model with the best linear and multi-linear models. The plot shows the median ratio of the mean squared error (MSE) of the Ensemble Model to that of the best individual models over 10,000 trials.

Comparing Mean Squared Error (MSE) between Ensemble and Best Single Model of Ordinary Least Squares (OLS). Values less than one mean the ensemble is better. The median ratio of MSE of the ensemble model over the test data is compared to the MSE of the regression models. The blue points compare the single variable regression models and the red points compare the two variable regression models.

The results of the experiment support the utility of the Ensemble Model over choosing the best linear regression model. The computational cost of the Ensemble Model scales linearly with the number of models. In our experiment, the best two-variable model starts to consistently match the Ensemble Model’s accuracy when there are forty or more data points (for the noise models of our data sources).
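A scaled-down sketch of this kind of experiment is given below. It is not the code behind the plot: the noise model (a latent signal observed through five noisy sources), the sample sizes, and the number of trials are assumptions, and the two-variable comparison is omitted for brevity. The ensemble weights are inverse to each member’s training MSE.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with intercept; X has shape (n, k)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return coef[0] + X @ coef[1:]

def one_trial(rng, n_points=20, n_sources=5):
    """Ratio of test MSE: ensemble over best single-variable model."""
    # Assumed noise model: each source is a latent signal plus its own
    # independent, normally distributed noise.
    signal = rng.normal(0.0, 1.0, n_points)
    noise_scale = rng.uniform(0.3, 1.0, n_sources)
    X = signal[:, None] + rng.normal(0.0, 1.0, (n_points, n_sources)) * noise_scale
    y = 2.0 * signal + rng.normal(0.0, 0.2, n_points)

    # First half for training, second half for testing.
    half = n_points // 2
    X_tr, y_tr, X_te, y_te = X[:half], y[:half], X[half:], y[half:]

    coefs = [fit_ols(X_tr[:, [j]], y_tr) for j in range(n_sources)]
    train_mse = np.array(
        [np.mean((y_tr - predict(c, X_tr[:, [j]])) ** 2) for j, c in enumerate(coefs)]
    )
    test_preds = np.array([predict(c, X_te[:, [j]]) for j, c in enumerate(coefs)])

    # Ensemble: inverse-variance weights estimated from training error.
    w = (1.0 / train_mse) / np.sum(1.0 / train_mse)
    ens_mse = np.mean((y_te - w @ test_preds) ** 2)

    # Best single model: lowest training MSE, evaluated on the test half.
    best = int(np.argmin(train_mse))
    best_mse = np.mean((y_te - test_preds[best]) ** 2)
    return ens_mse / best_mse

rng = np.random.default_rng(42)
ratios = np.array([one_trial(rng) for _ in range(500)])
median_ratio = float(np.median(ratios))
print(median_ratio)  # values below one favor the ensemble
```

Varying `n_points` in this sketch is one way to explore how the comparison changes as data becomes less scarce.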

Datamonster™ Example

In Datamonster™, our clients often build models based on several different data sources to help predict a quarterly reported value. The scarcity of the data limits the complexity of the models, which makes the Ensemble Model a good fit. The member models comprise single-variable linear models. In addition to the linear models, we include a seasonal ARIMA model (without exogenous variables) in the Ensemble. This ARIMA model is not based on data sources but rather gives a prediction of the metric based solely on historical values of the metric.

One of the challenges of working with real data is that models are not perfect: the relationship between the data source and what you want to predict may have non-linear components and/or time dependencies that are not captured by the linear models. We include a seasonal ARIMA model as part of the ensemble to capture some time dependency. Another time dependency in the Datamonster™ ensemble is that the weights are computed from recent model performance.

Datamonster™ Ensemble:

  • Step 1: Compute seasonal ARIMA model for metric, using no exogenous data, including the variance for the next two predictions. Adjust for the length of the quarter if appropriate.
  • Step 2: Compute linear regression models between data sources and metric, using all data available.
  • Step 3: Compute the means and covariance of linear model residuals over the last twelve quarters. The choice of twelve quarters is a compromise between enough data to build a good model and having the data be relevant to a prediction.
  • Step 4: Compute the weights proportional to the inverse of the variance of each linear model’s residuals over the last twelve quarters. The weight for the seasonal ARIMA model is proportional to the inverse of the variance of that model’s prediction. Normalize the weights so that they sum to one.
  • Step 5: The ensemble prediction is the weighted sum of the models’ predictions (shifted for the linear models’ biases over the restricted time window).
  • Step 6: The variance of the prediction is the prediction variance of the weighted sum accounting for covariance of residuals.
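Steps 3 through 6 can be sketched as follows. This is an illustration under stated assumptions rather than the Datamonster™ code: the function name and inputs are hypothetical, residuals are taken as actual minus predicted, the ARIMA prediction and its variance (Step 1) and the linear predictions (Step 2) are supplied as inputs, and the ARIMA error is assumed to be uncorrelated with the linear models’ residuals.

```python
import numpy as np

def ensemble_from_residuals(lin_preds, residuals, arima_pred, arima_var):
    """Combine linear-model predictions with an ARIMA prediction.

    lin_preds : predictions of the k linear models for the next quarter
    residuals : array (quarters, k) of actual - predicted, recent window
    """
    lin_preds = np.asarray(lin_preds, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    k = lin_preds.size

    # Step 3: mean (bias) and covariance of residuals over the window.
    bias = residuals.mean(axis=0)
    cov = np.atleast_2d(np.cov(residuals, rowvar=False))

    # Step 4: inverse-variance weights, normalized to sum to one.
    variances = np.append(np.diag(cov), arima_var)
    w = (1.0 / variances) / np.sum(1.0 / variances)

    # Step 5: weighted sum, with linear predictions shifted by their bias.
    preds = np.append(lin_preds + bias, arima_pred)
    prediction = float(w @ preds)

    # Step 6: variance of the weighted sum, including residual covariance.
    # Assumption: ARIMA error is uncorrelated with the linear residuals.
    full_cov = np.zeros((k + 1, k + 1))
    full_cov[:k, :k] = cov
    full_cov[k, k] = arima_var
    variance = float(w @ full_cov @ w)
    return prediction, variance, w

# Hypothetical residuals for three linear models over twelve quarters.
rng = np.random.default_rng(1)
residuals = rng.normal(0.0, [1.0, 2.0, 3.0], size=(12, 3))
prediction, variance, w = ensemble_from_residuals(
    [100.0, 104.0, 97.0], residuals, arima_pred=101.0, arima_var=4.0
)
print(prediction, variance, w)
```

Keeping the covariance in Step 6 matters when data sources move together: ignoring it would understate the variance of the ensemble prediction.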

For this particular example, we build a model to predict quarterly revenues of a company using several data sets. The company and data sources are anonymized throughout the sequence of graphs that demonstrate the ensemble.

First we create a seasonal ARIMA model to predict future values of the metric, which we label KPI (key performance indicator). The ARIMA model depends only on historical values of the metric.

Seasonal ARIMA model

The seasonal ARIMA model serves as a good baseline for comparison. Comparing the ARIMA model to the other member models is a good sanity check. If another member model’s prediction deviates significantly from the ARIMA model’s prediction, then the member model may be expressing valuable insight into the metric, or exposing an issue with the member model (such as lacking time dependency) or with the underlying data source.

The next graph shows the predictions of the ARIMA model and the linear regression model for the “best” data source, as determined by the variance of the residuals over the specified time window. During the holdout period, there is a significant difference between the ARIMA model and the best linear regression. This deviation is partially explained by the shifted 53-week year and shifted Q3 period, which are reflected in the linear regression models but do not affect the ARIMA model. During the holdout period in this example, the linear regression model of the data source moves the prediction in the right direction relative to the ARIMA model.

Seasonal ARIMA and Linear Regression Models.

We plot the regressions for each of the ten “best” alternative data sources in Datamonster™. This plot provides visual context for the models and the deviations from the ARIMA model. During the holdout period, we can see that the “best” data source is on the extreme side of the range of regressions and that other “good” data sources effectively balance this prediction.

Seasonal ARIMA Model and Linear Regression Models. Some of the member models are visually more accurate than others.

When we create the ensemble, we compute the weights using the residuals from the last twelve quarters. The weights are inversely proportional to the variance of these residuals; the ensemble is the weighted average of each of the models shifted by the respective mean residual error over the last twelve quarters. 

Seasonal ARIMA model with Linear Regressions and Ensemble.

With these weights, the Ensemble Model provides a more accurate prediction than either the best regression model or the seasonal ARIMA model. The actual improvement in accuracy and variance depends on the metric and on the member models that make up the ensemble.

Seasonal ARIMA model with Linear Regressions and Ensemble. The blue region is plus/minus one standard deviation from the ensemble mean. The light blue region is plus/minus two standard deviations from the ensemble mean. When the regression models agree more closely, the variance is lower.
Seasonal ARIMA, Best Linear Regression, and Ensemble Models.

There are cases, unlike the presented one, where the Ensemble model is inferior (measured by larger variance) to a linear member model. Possible reasons for these cases include:

  • Inaccurate estimation of the member models’ variances.
  • Dependence in the residuals of member models leads to a non-optimal choice of ensemble weights.
  • All the member models are noisy, including the ARIMA model.

The Ensemble’s choice of weights is typically robust if at least one of the member models is reasonably accurate.


The Ensemble Model combines multiple data sources and mitigates overfitting. Statistical calculations show that, under some independence assumptions, the Ensemble Model improves in accuracy and variance with more data sources. The robust nature of the ensemble properly accommodates noisy (and perhaps superfluous) models.