Forecasting@Meta: Balancing Art and Science

Analytics at Meta
10 min read · May 31, 2024

By Steven Gin and Kevin Birnbaum

Forecasting is often touted as both a science and an art. On one hand, it requires technical modeling and optimization: modern algorithms find complex patterns and cut through human bias. On the other hand, forecasts can require deep domain knowledge and expertise that is not easily encoded into a model. Sometimes, that expertise is the only thing we can rely on in murky spaces with lots of unknowns.

As forecasters, we often find these two schools of thought at odds. Some people feel that black-box algorithms abandon common sense, while others feel that disregarding the mathematical results is succumbing to bias. We believe the key to generating consistently high-quality forecasts lies at the intersection of this dichotomy. In this article, we’re going to discuss how, at Meta, we navigate balancing both the art and the science of forecasting.

We’ll look at how we channel this philosophy through:

A. Validation of the Forecast

B. Incorporation of Product Impact

Validation

Validation is the assessment of a forecast’s quality. We often need to do more than measure the error once reality has caught up to our forecast; we need to evaluate quality without waiting for reality to play out. And even when it does play out, the number of samples we have to assess our accuracy is limited: reality only plays out one way, and the data points within it are correlated.

For instance, let’s imagine we are working on the Facebook user forecast (DAU, or Daily Active Users). We want to generate a forecast predicting next year’s DAU. We obviously don’t yet have the data to compare next year’s predictions against, so we must leverage a combination of approaches to validate whether we have a good forecast. Making this determination is a mixture of metric optimization and human judgment. In this section, we’re going to discuss the mechanics of this process.

Error Metrics and Backtesting

As just mentioned, the most obvious way to assess quality is to measure the difference between reality and the forecast. We can also leverage backtesting to compare against past data, which is a useful check on whether our forecasting approach is sensible.
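
To make the backtesting idea concrete, here is a minimal sketch of an expanding-window backtest. The `fit_and_forecast` callable and the MAPE scoring are illustrative placeholders, not Meta’s actual pipeline; any model (Prophet, ARIMA, etc.) could be plugged in.

```python
import numpy as np
import pandas as pd

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100 * np.mean(np.abs((forecast - actual) / actual))

def backtest(series: pd.Series, fit_and_forecast, horizon: int, n_folds: int) -> pd.DataFrame:
    """Expanding-window backtest: repeatedly truncate the history, forecast the
    next `horizon` points, and score them against what actually happened."""
    rows = []
    for fold in range(n_folds):
        cutoff = len(series) - horizon * (n_folds - fold)
        train = series.iloc[:cutoff]
        actual = series.iloc[cutoff:cutoff + horizon]
        predicted = fit_and_forecast(train, horizon)  # hypothetical model hook
        rows.append({"cutoff": series.index[cutoff - 1],
                     "mape": mape(actual.values, predicted)})
    return pd.DataFrame(rows)
```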

With these two use cases in mind, there are actually many considerations in implementing them:

  1. Error Metric Selection: What is the best error metric to use?
  2. Identifying Representative Testing: Which periods are fair backtesting periods, and how should that inform our parameter/model choices?
  3. Expert Opinions and Codifying Validation: Is there anything else we know about this domain that isn’t captured by simply measuring the error rate? For instance, we might know that specific sporting events like the World Cup will occur in the future, with no record of similar events in our historical data. Or we might know that our company is about to launch a new product and need to make a judgment call on its future impact. These are things that aren’t manifested well in backtesting against historicals.

Error Metric Selection
What is the best error metric to use? Over what aggregations will we measure that error rate? Is the error rate a fair assessment of the forecast objective?

There is a wide array of error metrics to choose from in time series modeling, each with its own pros and cons. A few common ones are sketched below.
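
As a rough illustration, here is a minimal sketch of a few widely used metrics, assuming NumPy arrays of actuals and forecasts (the docstrings summarize the usual pros and cons):

```python
import numpy as np

def mae(actual, forecast):
    """Mean absolute error: reported in the metric's own units, so
    large-magnitude series dominate any cross-series aggregate."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean(np.abs(forecast - actual))

def rmse(actual, forecast):
    """Root mean squared error: penalizes large misses disproportionately."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.sqrt(np.mean((forecast - actual) ** 2))

def mape(actual, forecast):
    """Mean absolute percentage error: scale-free, but undefined at zero
    and harsher on over-prediction than under-prediction."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return 100 * np.mean(np.abs((forecast - actual) / actual))

def smape(actual, forecast):
    """Symmetric MAPE: bounded, and treats over- and under-prediction
    more even-handedly than MAPE."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return 100 * np.mean(2 * np.abs(forecast - actual) / (np.abs(actual) + np.abs(forecast)))
```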

Identifying Representative Testing

Creating a representative test of your forecast often requires a lot of human judgment. Some potential pitfalls include:

  • Selecting a volatile or non-representative period

This can distort outcomes and produce a bad forecast. At Meta, since 2020 and the spread of the pandemic were such a unique time, we generally exclude that period not only from our backtesting set but also from our training data.

  • Choosing an error rate that doesn’t match your forecast objectives

An error rate that measures percent error might be great in some cases but can lead to suboptimal results if you value the accuracy of larger-magnitude time series more. For example, if you’re forecasting the sales of a product, the accuracy of forecasting your highest volume/revenue product might matter far more than that of your smallest volume/revenue product. In these cases, an error rate that deals in absolutes might be more useful.

  • Testing at the right aggregation

Another important factor is the aggregation at which we measure error. For instance, we might be forecasting the number of users in India, and our error metric might weight that country the same as one many orders of magnitude smaller. One possible course of action is to bucket geographically similar countries together to create a more balanced error metric. Another is to measure the error rate of both the global user count forecast and the country-level forecasts, as in the sketch below.
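
A minimal sketch of what checking error at more than one aggregation level could look like; the column names and dataframe layout are illustrative assumptions, not Meta’s actual schema:

```python
import numpy as np
import pandas as pd

def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return 100 * np.mean(2 * np.abs(forecast - actual) / (np.abs(actual) + np.abs(forecast)))

def error_report(df: pd.DataFrame) -> pd.DataFrame:
    """df has one row per (date, country) with 'actual' and 'forecast' DAU columns
    (hypothetical schema). Returns SMAPE per country plus a global row."""
    per_country = (
        df.groupby("country")
          .apply(lambda g: smape(g["actual"], g["forecast"]))
          .rename("smape")
          .reset_index()
    )
    # Global view: sum the countries per date first, then score the aggregate series.
    global_series = df.groupby("date")[["actual", "forecast"]].sum()
    global_row = pd.DataFrame({"country": ["GLOBAL"],
                               "smape": [smape(global_series["actual"], global_series["forecast"])]})
    return pd.concat([per_country, global_row], ignore_index=True)
```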

Ultimately, this is where you employ human judgment to map forecast choices to your use case, and it is a great example of how great forecasters reach beyond pure mathematics and incorporate nuanced choices. All of these choices need to map back to the practical use cases and decisions being based on these projections.

Expert Opinions and Codifying Validation

Gathering the opinions of domain experts and forecast stakeholders is critical as well. Ultimately, your forecast will be used and viewed by stakeholders, so understanding their concerns and use cases can help you optimize on the parts of the forecast that matter most.

In general, we are looking for three things from a domain expert or stakeholder:

  1. Understanding the Use Case
    Stakeholders use, interpret, and act on forecasts very differently. If the forecast is used to monitor for potential anomalies, granular precision might be what you optimize for. If stakeholders are doing long-range planning, then confidence in the trend’s trajectory matters more. This human element dictates concrete methodology choices.
  2. Do we have the right factors/assumptions?
    Which factors we account for in the forecast (launches, events, outages, etc.) is an important choice. We often can’t account for everything, and the choice of which critical factors to include often comes from domain experts. These pieces of information might be critical to the forecast but not present in the historical data alone.
  3. Is the forecast reasonable?
    In general, this is reflected in comments or questions about the reasonableness of the forecast. For instance, does the forecast look similar to a prior year of growth?

These sorts of questions can be built into the validation process with automation to help turn human judgment into a calculated process.
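
For instance, the “does the forecast look similar to a prior year of growth?” question can be codified as an automated guardrail. The sketch below is a simplified illustration; the daily granularity, one-year horizon, and tolerance threshold are all assumptions you would tune to your own forecast.

```python
import pandas as pd

def check_growth_reasonableness(history: pd.Series, forecast: pd.Series,
                                tolerance: float = 0.5) -> dict:
    """Compare the forecast's implied growth with last year's realized growth and
    flag the forecast for human review if the two diverge too much.
    Assumes a daily series with at least a year of history and a ~1-year forecast."""
    last_year_growth = history.iloc[-1] / history.iloc[-365] - 1
    forecast_growth = forecast.iloc[-1] / history.iloc[-1] - 1
    diverges = abs(forecast_growth - last_year_growth) > tolerance * abs(last_year_growth)
    return {
        "last_year_growth": last_year_growth,
        "forecast_growth": forecast_growth,
        "needs_review": bool(diverges),
    }
```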

Returning to the example of Facebook DAU, let’s think about how some of these considerations impact our decisions.

  1. We need an error metric that punishes the forecast equally for over- and under-predicting. SMAPE (see above) is therefore a reasonable option.
  2. We know that during 2020, due to the pandemic, people spent more time on the internet. That period is therefore not a reasonable testing window, as it’s not representative of the future (a sketch of excluding it follows this list).
  3. We might also note that rapid growth of internet infrastructure in a region was the primary driver of its growth over the past two years. Knowing that infrastructure growth will slow down due to new laws or the economic climate builds intuition that the region is unlikely to keep growing at those previously high rates.
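
As a sketch of point 2, excluding a non-representative window from the training data could look like the snippet below. The exact date range and the decision to drop (rather than impute) the window are illustrative assumptions, and some models handle the resulting gap better than others.

```python
import pandas as pd

def drop_window(series: pd.Series, start: str, end: str) -> pd.Series:
    """Remove a non-representative window (e.g., the early-pandemic period)
    from a date-indexed series before fitting or backtesting."""
    mask = (series.index >= pd.Timestamp(start)) & (series.index <= pd.Timestamp(end))
    return series[~mask]

# Hypothetical usage: exclude the pandemic surge before training.
# train = drop_window(dau_series, "2020-03-01", "2020-12-31")
```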

Product Impact

Unlike forecasting the weather, at Meta we have the ability to influence the metric we’re forecasting. For instance, we know our decisions on which features to launch can dramatically influence a metric’s outcome. Sometimes we have a good idea of the impact a given launch will have. Other times, we have a lot of uncertainty around the impact’s magnitude, shape, and duration. For instance, a launch might cause a permanent change to a series’ trajectory, or it might instead cause a level shift of the entire series. We need to carefully assess the impact and how we plan to integrate it.

Inclusion Criteria
When including new factors in the forecast, such as product impact, we need structured inclusion criteria. There’s an inherent tradeoff between accounting for additional factors and the noise/uncertainty we might introduce to the forecast.

Having clear criteria gives us both systematic logic for how to proceed with the forecast and transparency about the assumptions the forecast is built upon. For instance, Meta is very strict about which impacts are included in each forecast, based on the type of experiment used to generate the estimate. Estimates generated with high uncertainty are generally excluded from the forecast; they reduce its usefulness by smothering the precision of other well-known events. On the other hand, very precise estimates (magnitude, timing, and shape) are baked directly in.

Integrating Holdouts

When we have an experiment holdout, we generally have fairly high confidence in the impact of a product feature. These cases are typically incorporated directly into the forecast, in addition to whatever the model predicts.

There is some complexity in disambiguating the model’s predicted trend from the product impact we project we will achieve. These cases require some assessment of our confidence in the measurement so that we don’t double-count the impact. In some cases we will subtract prior product impact from the historical training data; in others we will try to measure the rough incrementality and add that on top, so as to minimize disruption to the fidelity of the model.

Let’s think about a hypothetical product feature we want to launch. The feature causes a small celebratory animation when you log in during your birthday. Currently, for a small number of countries we have this feature turned on. Within those countries, we have a proportion of users in a holdout. This means that some users will not see this feature. We can leverage the differences between these two groups to measure the impact this feature has.
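
A rough sketch of that comparison, assuming we have per-user activity rates for the enabled and holdout groups (this is a simplified difference-in-means with a normal-approximation interval, not Meta’s experimentation stack):

```python
import numpy as np

def estimate_lift(treated: np.ndarray, holdout: np.ndarray, z: float = 1.96):
    """Estimate the feature's effect as the difference in mean activity between
    users with the feature and users in the holdout, with an approximate 95% CI."""
    diff = treated.mean() - holdout.mean()
    se = np.sqrt(treated.var(ddof=1) / len(treated) + holdout.var(ddof=1) / len(holdout))
    return diff, (diff - z * se, diff + z * se)
```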

Now, we use this information to project the impact the feature will have when we launch it to every country. Even with a holdout, these estimates might still have very large confidence bands, and all of that uncertainty carries downstream into our forecast. The complexity of sizing this feature’s impact is therefore also carried into the complexity of the forecast: some countries might respond differently to the feature, or there might be novelty effects and the feature’s impact might regress over time.

Additionally, regardless of how we include that additional growth, some of it will be natively captured by the forecasting algorithm. Do we train the model on data that has the impact of this feature stripped out? These are all complex decisions we need to make based on the product, the way the product impact is measured, and our expectations of how those will carry into the future. Oftentimes the uncertainty is simply too high a cost to pay, and we must therefore exclude many things from the forecast.

High Confidence Cases

Assuming you have strong confidence in the anticipated product impact (across magnitude, timing, shape, etc.), there are multiple ways you could incorporate that impact into your forecast.

There are two methods we tend to use:

  1. Over the Top: Sometimes we prefer simpler approaches. An over-the-top adjustment compares the historical contribution of a product with its expected contribution; the delta is added on top of whatever the model generates. Many times, this simplicity creates safe, robust assumptions that focus on the impact to a metric’s trend. For example, we might plan on shipping features next year that will have a 15% impact on our metric, whereas normally we would only have a 5% impact. The extra 10% is added onto whatever the model predicts.
  2. Modifying History: In cases where product impact has caused severe disruption to a metric’s history, we are concerned with more than just how it will impact future trends. We have additional concerns that the model’s learned seasonal patterns will degrade as a result of changes we might make to the product. In these cases, where training data might be compromised, we might consider removing product impact from the trend’s history and forecasting “organically”. This is contingent on having sufficient data on the decomposed element. The decision of which method to use really depends on the volatility of your past product impact: some of our metrics see big swings in how much product impact influences them, while others are very stable. We also need to be aware that estimates of past product impact are not always perfect; common problems include imperfect measurement of network effects and confidence bands that are too wide for the specific forecasted region.

With the above context in mind, to decide whether it is best to remove past product impact from our time series, we conduct a simple test: we try stripping out the past impact and measure the volatility of the remaining series (you may need to de-trend/deseasonalize, and perhaps difference, depending on your forecasting method and time series). A sketch of this decision rule follows the list below.

  • If we do not see a strong reduction in the variance of our time series, we go with the first method: assume past product impact is implicitly captured in the original time series and add only the incremental product impact on top (e.g., notifications are responsible for +10M DAU per half, but this half the team expects to ship +12M, so +2M is added on top).
  • If we do see a strong reduction in the variance of our time series, we strip out the impact and forecast using the adjusted time series.
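
A condensed sketch of that decision rule is below. The seasonal-differencing proxy for de-trending/deseasonalizing, the “strong reduction” threshold, and the past-impact series are all simplifying assumptions; the real choice depends on your model and data.

```python
import pandas as pd

def residual_variance(series: pd.Series, season: int = 7) -> float:
    """Crude de-trend/deseasonalize via seasonal differencing, then take the variance."""
    return float(series.diff(season).dropna().var())

def choose_method(history: pd.Series, past_impact: pd.Series,
                  reduction_threshold: float = 0.3) -> str:
    """Strip estimated past product impact out of the history; if residual variance
    drops enough, forecast the adjusted series, otherwise add impact over the top."""
    adjusted = history - past_impact.reindex(history.index).fillna(0)
    before = residual_variance(history)
    after = residual_variance(adjusted)
    return "modify_history" if (before - after) / before > reduction_threshold else "over_the_top"

def over_the_top(model_forecast: pd.Series, baseline_impact: float, planned_impact: float) -> pd.Series:
    """Add only the incremental impact the model can't already see,
    e.g. +12M planned vs. +10M historically implied -> +2M on top."""
    return model_forecast + (planned_impact - baseline_impact)
```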

Low Confidence Cases

Sometimes we have to account for low-confidence events. For example:

  • We might be less confident if a specific trend is driven by a one-time external event.
  • We might not be confident in the timing of an important launch.
  • We might know a specific event is coming but be uncertain on the magnitude of the impact.

These cases are particularly tricky because attempting to incorporate these factors could jeopardize the usefulness of the rest of the forecast. Hopefully the event we are assessing is something captured by our inclusion criteria. However, events like this are often unique and need to be uniquely assessed.

The key thing we consider is the use case of the forecast. For instance, if the forecast is just being used for its end of year value, incorporating a factor with unknown launch timing might not damage the usefulness of the forecast. However, if the forecast is used for weekly monitoring, predicting a trend-warping event wrong would sabotage the utility of the forecast. Weighing these different costs/benefits is key to making the forecast as useful as possible.

Closing Thoughts

Producing quality forecasts is a careful mixture of both technical methods and thoughtful judgment. Neither side alone will let us best predict the future. We emphasize the need to recognize where mathematics has blind spots and needs human intervention, as well as where technical methods can help us overcome human bias. Hopefully, these examples can also help you generate better forecasts of your own.

