Large Scale Ad Data Systems at Booking.com using the Public Cloud
Booking.com’s mission is to make it easier for everyone to experience the world. To help people discover destinations, we are a leading travel advertiser on Google Pay Per Click (PPC). Booking Holdings, as a whole, spent $4.7 billion on marketing across all brands in the first nine months of 2022. How do we run PPC efficiently at this scale? In this article, we illustrate our extensive use of the public cloud, specifically Google Cloud Platform (GCP). From data ingestion and data science to ad bidding, GCP is an accelerant in our development cycle, sometimes reducing time-to-market from months to weeks.
What are PPC’s Challenges?
PPC as a business represents a global optimization problem. From a technical perspective, solving it requires machine learning and operational infrastructure at scale: processing performance feedback, assessing historical performance, running algorithms, and communicating the results back to a search engine provider. We’ll touch on some of the key aspects of the technology stack that enables this capability.
Data Ingestion and Analytics at Scale
Ingestion of performance data, whether generated by a search provider or internally, is a key input for our algorithms. We make extensive use of Google BigQuery in PPC due to the scale of our business (up to 2 million room nights booked per day). Our team collectively runs more than 1 million queries per month, scanning more than 2 PB of data. BigQuery saves us substantial time — instead of waiting for hours in Hive/Hadoop, our median query run time is 20 seconds for batch, and 2 seconds for interactive queries.
BigQuery also offers native support for nested and repeated fields in its schemas. We take advantage of this feature in our ad bidding systems to keep data views consistent, from our Account Specialists’ spreadsheets, to our Data Scientists’ notebooks, to our bidding system’s in-memory data. This feature eliminates data-parsing code, lowers our technical debt, and shortens our development time.
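To make the idea of nested and repeated data concrete, here is a minimal Python sketch. It does not use BigQuery; the record types (AdGroup, Keyword, Bid) and their fields are hypothetical and only illustrate how one hierarchical row can be kept whole in memory and flattened on demand, similar in spirit to unnesting a repeated field.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Bid:
    # One element of a repeated field (hypothetical shape).
    device: str
    amount_micros: int


@dataclass
class Keyword:
    # Nested record that carries its own repeated child field.
    text: str
    bids: List[Bid] = field(default_factory=list)


@dataclass
class AdGroup:
    # Top-level row: a scalar column plus a repeated nested record.
    ad_group_id: int
    keywords: List[Keyword] = field(default_factory=list)


def flatten(ad_group: AdGroup) -> Iterator[Dict[str, object]]:
    """Yield one flat row per (keyword, bid) pair."""
    for keyword in ad_group.keywords:
        for bid in keyword.bids:
            yield {
                "ad_group_id": ad_group.ad_group_id,
                "keyword": keyword.text,
                "device": bid.device,
                "amount_micros": bid.amount_micros,
            }


example = AdGroup(
    ad_group_id=1,
    keywords=[
        Keyword("hotel amsterdam", [Bid("mobile", 120000), Bid("desktop", 150000)]),
        Keyword("hostel berlin", [Bid("mobile", 90000)]),
    ],
)
flat_rows = list(flatten(example))
```

Because the hierarchy travels with the row, a consumer that wants the nested view reads it directly, and a consumer that wants a tabular view flattens it, without either side maintaining a separate parser.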
A Unified View for Operational Data
We kept most of our operational data in relational databases, like MySQL. As we evolved, we increasingly needed to interweave that data with analytical data in one unified view. We now use Spanner, a globally distributed SQL database, because it has a number of features that strongly fit our use case:
- Spanner and BigQuery force multiply each other via Federated Queries: we store rapidly changing, relational data in Spanner and write append-oriented, analytical data into BigQuery. When our queries span both datasets, Federated Queries fetch real-time, snapshot-consistent data from both stores immediately. We don’t need to move data, or spend development cycles to set up (and maintain) data pipelines.
- Spanner has powerful schema evolution capabilities: for large tables with evolving advertisement data, Spanner can reduce the time of schema changes to minutes or seconds, compared to days in a more common relational database solution.
- Interleaved tables allow us to store Ads API data separately from our internal data without degrading performance. For structured, hierarchical data, this feature lets us place related rows close together, maximizing spatial locality.
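The locality benefit of interleaving can be illustrated with a small Python sketch. It does not use Spanner (where this layout is declared with the INTERLEAVE IN PARENT clause); the campaign and ad group rows are hypothetical. The only point is that when a child key starts with its parent's key, sorting by key places each parent directly before its own children.

```python
from typing import Dict, List, Tuple

# Hypothetical rows keyed by tuples. A child key starts with its parent's key,
# which is the property that interleaving relies on.
campaigns: Dict[Tuple[int, ...], str] = {(1,): "campaign-1", (2,): "campaign-2"}
ad_groups: Dict[Tuple[int, ...], str] = {
    (2, 10): "ad-group-2-10",
    (1, 11): "ad-group-1-11",
    (1, 10): "ad-group-1-10",
}


def interleaved_layout() -> List[Tuple[Tuple[int, ...], str]]:
    """Return parent and child rows in one key-ordered sequence."""
    rows = {**campaigns, **ad_groups}
    # Tuple comparison sorts (1,) before (1, 10), so each parent row is
    # immediately followed by its own children.
    return sorted(rows.items())


def rows_under(parent_key: Tuple[int, ...]) -> List[str]:
    """Read a parent and its children, which sit next to each other in the layout."""
    layout = interleaved_layout()
    return [value for key, value in layout if key[: len(parent_key)] == parent_key]


layout_keys = [key for key, _ in interleaved_layout()]
campaign_one = rows_under((1,))
```

In this toy layout, reading a campaign together with its ad groups touches one contiguous run of keys rather than two separate tables.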
Data, delivered with quality at the required moment, is at the heart of our system performance. For workflows spanning multiple data systems, we opt to use Google Dataflow, a managed batch/streaming hybrid framework, with native integration to BigQuery and Spanner.
We see two major advantages with this approach:
- Dataflow runs Apache Beam pipelines. Beam is an API that abstracts big data processing from the implementation details of the runtime platform: it lets us focus on business logic and data parallelism, making our development iterations faster.
- It runs as a managed service, with minimal maintenance.
We use Dataflow to run large advertisement data pipelines with complex business logic stages. In one Image Ad Extension project, we had a production-ready version in about 3 weeks, and iterated continuously to handle new business features and corner cases. The same implementation without GCP or Dataflow could have taken us at least three months. In a streaming report project, our Dataflow solution cut the running time from 6 hours on our legacy system, to 10 minutes.
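The staged shape of such a pipeline can be sketched in plain Python. This is not Beam or Dataflow code and does not reflect our production pipelines; the input format (campaign, clicks, cost) and the stage names are assumptions made for illustration. It only mirrors the common structure of element-wise stages, a group-by-key stage, and a per-key combine stage, which is the part a framework can then parallelize.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, object]


def parse(lines: Iterable[str]) -> Iterator[Record]:
    """Element-wise stage: turn 'campaign,clicks,cost' lines into records."""
    for line in lines:
        campaign, clicks, cost = line.strip().split(",")
        yield {"campaign": campaign, "clicks": int(clicks), "cost": float(cost)}


def key_by_campaign(records: Iterable[Record]) -> Iterator[Tuple[str, Record]]:
    """Element-wise stage: attach the grouping key to each record."""
    for record in records:
        yield str(record["campaign"]), record


def group_by_key(pairs: Iterable[Tuple[str, Record]]) -> Dict[str, List[Record]]:
    """Shuffle-like stage: collect all records that share a key."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for key, record in pairs:
        groups[key].append(record)
    return dict(groups)


def summarize(groups: Dict[str, List[Record]]) -> Dict[str, Dict[str, float]]:
    """Per-key combine stage: total clicks and cost for each campaign."""
    return {
        key: {
            "clicks": sum(int(r["clicks"]) for r in records),
            "cost": sum(float(r["cost"]) for r in records),
        }
        for key, records in groups.items()
    }


raw = ["c1,10,2.5", "c2,3,1.0", "c1,5,0.5"]
report = summarize(group_by_key(key_by_campaign(parse(raw))))
```

Each stage only depends on its input elements or on one key's group, so the business logic stays the same whether the stages run in one process, as here, or are distributed by a managed runner.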
Evolving our Infrastructure
Optimizing our GCP operations comes with its own set of challenges. At our scale, small costs add up, so we have to think carefully about infrastructure cost early, at design time. That was as much a cultural change as a process change. As an example, one of our first BigQuery aggregations was a large query that joined statistics data with metadata and then aggregated over the result. Because of the large join and the use of the dynamic function CURRENT_DATE(), which prevents BigQuery from serving cached results, a naive BigQuery implementation would have cost $1 per run, or $1,500 over a day. To lower the cost and improve performance, we used BigQuery for the aggregation, kept the metadata in MySQL, and joined the two tables in application memory. This approach allowed us to cache BigQuery results despite the presence of dynamic functions. We lowered the cost per query run to 1 cent, or about $10 per day, a 150x reduction in daily cost.
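The split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not our production code: the AggregationCache class, the run_query callable, and the row fields are hypothetical, and resolving the date in the application (so the cache key is deterministic) is an assumption about how a dynamic function can be kept out of the cached query.

```python
from datetime import date
from typing import Callable, Dict, List, Tuple

AggRow = Dict[str, object]


class AggregationCache:
    """Caches aggregation results per (query name, resolved date)."""

    def __init__(self, run_query: Callable[[str, date], List[AggRow]]) -> None:
        self._run_query = run_query  # hypothetical function that calls the warehouse
        self._cache: Dict[Tuple[str, date], List[AggRow]] = {}
        self.billed_runs = 0

    def get(self, query_name: str, day: date) -> List[AggRow]:
        key = (query_name, day)
        if key not in self._cache:
            # Only a cache miss reaches the warehouse and incurs cost.
            self._cache[key] = self._run_query(query_name, day)
            self.billed_runs += 1
        return self._cache[key]


def join_in_memory(stats: List[AggRow], metadata: Dict[int, Dict[str, str]]) -> List[AggRow]:
    """Hash join: attach metadata (for example, loaded from MySQL) to each aggregated row."""
    joined = []
    for row in stats:
        meta = metadata.get(int(row["campaign_id"]))
        if meta is not None:  # inner-join semantics: drop rows without metadata
            joined.append({**row, **meta})
    return joined


def fake_run_query(query_name: str, day: date) -> List[AggRow]:
    # Stand-in for the real aggregation query; returns fixed rows.
    return [{"campaign_id": 1, "clicks": 40}, {"campaign_id": 2, "clicks": 7}]


cache = AggregationCache(fake_run_query)
metadata = {1: {"name": "Amsterdam hotels"}}
day = date(2022, 11, 1)
first = join_in_memory(cache.get("daily_stats", day), metadata)
second = join_in_memory(cache.get("daily_stats", day), metadata)
```

The second call is served from the cache, so only one run is billed, while the join against the small metadata table happens in application memory on every call.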
In PPC, we have had tremendous success building our systems on the public cloud, specifically GCP. BigQuery, Dataflow, and Spanner have accelerated our development cycles, shortened our time-to-market, and proved their business value. We would like to thank the fantastic teams at Google for their collaboration and partnership with us. At Booking.com, we have complex, challenging problems to solve. We have an incredibly talented and hardworking PPC team, and we use innovative technology to help people discover travel in fun, creative ways. If this excites you, please consider working with us: apply at jobs.booking.com!
 Booking Q3 2022 10Q: https://s201.q4cdn.com/865305287/files/doc_financials/2022/q3/9fdc627b-3967-4970-aae0-b9581971f96d.pdf
 (Internal document) PPC GCP Projects Retrospective
 (Internal document) PPC BigQuery summary
 Description of Protobuf in Dremel: https://research.google/pubs/pub36632/
 Dremel: A Decade of Interactive SQL Analysis at Web Scale: https://research.google/pubs/pub49489/
 Spanner: Google’s Globally-Distributed Database: https://research.google/pubs/pub39966/
 Spanner: Becoming a SQL System https://research.google/pubs/pub46103/
 F1 Query: https://research.google/pubs/pub47224/
 F1 Lightning HTAP: http://www.vldb.org/pvldb/vol13/p3313-yang.pdf
 F1: A Distributed SQL Database That Scales: https://research.google/pubs/pub41344/
 BQ results caching: https://cloud.google.com/bigquery/docs/cached-results
 (Internal document) BigQuery cost control in PPC: BigQuery Cost Control
 (Internal document) PPC — Moving aggregations from onprem to BigQuery