EXPEDIA GROUP TECHNOLOGY — DATA

Using Synthetic Search Data for Flights Price Forecasting

Expedia Group’s method for ensuring clean and consistent pricing data for machine learning applications

Andrew Reuben
Expedia Group Technology


A couple overlooks a city at dusk
Photo by Nathan Dumlao on Unsplash

Background

To provide accurate flight price forecasting for our travelers, the Travel Insights and Interactions ML team at Expedia Group™ uses synthetic search data for model development and evaluation.

Travelers are presented with a plot of historical prices along with forecasted prices with their search results

What is synthetic search data?

Synthetic search data is generated by an automated process that performs searches with parameters we define in advance. The resulting data is the same kind of data as the organic search data generated when a real traveler searches for a flight on the website. The screenshot below illustrates the parameters of a search on the www.expedia.com site that are considered when generating the synthetic search data.

Screenshot of expedia.com illustrating the parameters of a flight search, with the synthetic search parameters highlighted (roundtrip, origin, destination, travel dates, passenger count)

Why use synthetic search data?

Travelers perform millions of searches a day on our site, generating valuable organic search data, but this data is too sparse for our forecasting purposes. Consider the number of possible combinations of route, trip dates and passenger count. Organic search data covers many of these combinations, but because there are so many of them, a large share are not searched consistently from day to day. An ideal forecasting dataset would contain at least one search a day for every combination of route, trip dates and passenger count. This would give us a complete picture of how flight prices change from day to day, which is invaluable when forecasting prices. Unfortunately, our organic data does not provide this. Even popular routes with high search volume can have data gaps once all possible trip and search dates are considered. The plot below shows that even a popular route like LAX to JFK can be missing a noticeable amount of pricing data for a given set of trip dates when the price time series is built from organic search data. Points without connecting lines indicate missing days of pricing data between those points.

Gaps in organic search data can result in inconsistent pricing data
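
To make the idea of a data gap concrete, here is a minimal sketch of how missing days could be detected for a single itinerary. It is not our production pipeline: the function name `missing_search_days`, the dictionary layout and all prices are hypothetical, chosen only to illustrate the check.

```python
from datetime import date, timedelta


def missing_search_days(observed_days, start, end):
    """Return every day in [start, end] that has no price observation."""
    observed = set(observed_days)
    n_days = (end - start).days + 1
    all_days = (start + timedelta(days=i) for i in range(n_days))
    return [day for day in all_days if day not in observed]


# Hypothetical organic data for one itinerary (fixed route and trip dates):
# search date -> lowest price seen that day. Values are made up.
organic_prices = {
    date(2023, 3, 1): 310.0,
    date(2023, 3, 2): 305.0,
    date(2023, 3, 5): 330.0,
    date(2023, 3, 6): 328.0,
}

window_start, window_end = date(2023, 3, 1), date(2023, 3, 7)
gaps = missing_search_days(organic_prices, window_start, window_end)
window_days = (window_end - window_start).days + 1
coverage = 1 - len(gaps) / window_days

print("missing days:", [d.isoformat() for d in gaps])
print(f"coverage: {coverage:.0%}")
```

In this toy example, three of the seven search days have no price, so the series would show isolated points with no connecting lines around them.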

Our solution to overcome this issue is to use synthetic search data. Since we have control over the search parameters of the synthetic searches and when the searches occur, we can ensure that we have a consistent dataset of flight prices for forecasting.

How do we define our synthetic searches?

Synthetic search data lets us control exactly what data is generated, which is what gives us a consistent dataset of flight prices. Our current process performs synthetic searches daily for a predefined set of our most popular routes, covering all trip start dates within a certain advance purchase window, for both one-way and roundtrip itineraries. For roundtrip searches, trip length is also considered when defining which searches to perform. The table below illustrates the search and booking coverage provided by the synthetic dataset.

The percentage of searches and bookings covered by the scope of the current synthetic dataset
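
The search definition described above can be sketched as a simple enumeration over routes, trip start dates and trip lengths. This is an illustrative sketch only: the function `daily_synthetic_searches`, the field names, the example routes, the three-day advance purchase window and the trip lengths are all assumptions, not the values or interfaces we use in production.

```python
from datetime import date, timedelta
from itertools import product


def daily_synthetic_searches(routes, search_date, max_advance_days,
                             trip_lengths, passengers=1):
    """Yield one search-parameter dict per itinerary to price on search_date."""
    advances = range(1, max_advance_days + 1)
    for (origin, destination), advance in product(routes, advances):
        depart = search_date + timedelta(days=advance)
        base = {
            "origin": origin,
            "destination": destination,
            "depart": depart,
            "passengers": passengers,
        }
        # One one-way search per route and trip start date.
        yield {**base, "trip_type": "oneway", "return": None}
        # One roundtrip search per trip length considered.
        for length in trip_lengths:
            yield {**base, "trip_type": "roundtrip",
                   "return": depart + timedelta(days=length)}


# Hypothetical inputs, for illustration only.
routes = [("LAX", "JFK"), ("SEA", "LAS")]
searches = list(daily_synthetic_searches(
    routes, date(2023, 3, 1), max_advance_days=3, trip_lengths=[3, 7]))

# 2 routes * 3 start dates * (1 one-way + 2 roundtrip) searches
print(len(searches))
```

Running the same enumeration every day is what guarantees one price observation per itinerary per search date.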

Disadvantages of synthetic search data

While synthetic search data has many advantages for our flight price forecasting use case, the approach has drawbacks as well. The biggest drawback of choosing synthetic over organic search data is that it limits the number of routes and trip dates in the dataset, since the system performing the synthetic searches only has the capacity for a limited number of searches per day. The synthetic search data is very valuable, but the main job of the flight search service is to provide a good experience to our travelers, so we need to consider the strain that synthetic searches place on the system and ensure that they do not degrade the onsite experience of travelers searching for flights. Beyond the cost of performing the synthetic searches, there is also the cost of storing all the data they generate. For our initial forecasting models, we decided that the advantages of a consistent and reliable dataset for these routes outweighed the drawbacks of generating, storing and using the synthetic search data.
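
The capacity trade-off can be illustrated with some back-of-the-envelope arithmetic. The numbers below (a 90-day advance purchase window, five trip lengths and a budget of 100,000 searches per day) are invented for the example and do not reflect our actual configuration or capacity.

```python
def searches_per_route(max_advance_days, n_trip_lengths):
    """Daily searches needed for one route.

    Assumes one one-way search plus one roundtrip search per trip length
    for every trip start date in the advance purchase window.
    """
    return max_advance_days * (1 + n_trip_lengths)


def max_routes(daily_capacity, max_advance_days, n_trip_lengths):
    """Number of routes that fit within a daily search budget."""
    per_route = searches_per_route(max_advance_days, n_trip_lengths)
    return daily_capacity // per_route


# Hypothetical values, for illustration only.
per_route = searches_per_route(max_advance_days=90, n_trip_lengths=5)
route_budget = max_routes(daily_capacity=100_000,
                          max_advance_days=90, n_trip_lengths=5)

print(per_route)     # 540 searches per route per day
print(route_budget)  # 185 routes fit in the budget
```

Because the per-route cost multiplies across start dates and trip lengths, widening the advance purchase window or adding trip lengths directly reduces how many routes a fixed budget can cover.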

Moving forward

Now that we have launched our initial flight price forecasting model built on the synthetic search data, we have begun to consider how to extend forecasting coverage beyond the routes currently being forecasted. One obvious way would be to increase the number of routes included in the synthetic search data. From a modeling perspective, this would be the easiest way to scale to additional routes, but it would increase the costs described above. Another option is to move beyond a modeling framework that relies so heavily on a predefined set of routes. This more generalized modeling approach is the only way to scale to all searches performed by travelers, making it the likeliest path forward. This doesn’t mean that the synthetic search data will no longer be used. In fact, it will likely be a critical tool in reaching a more generalized model. To provide accurate forecasts for all searches performed on the site, we will need to build a dataset that is representative of all searches. The ability to curate which routes are included in the synthetic search dataset will help us achieve this goal.
