Rapid Batch Inference in Google Cloud

How we tweaked Google’s new SQL-based ML Framework to scale our inference stack and accelerate our product roadmap

Tanguy
Tinyclues Vision
9 min read · Apr 1, 2022


With BigQuery ML (BQML), Google aims to make Machine Learning even more accessible. The platform allows for the creation and execution of machine learning models within an SQL environment and leverages BigQuery’s serverless architecture to provide scalability and performance.

Since its initial release in 2018, BQML has steadily expanded its capabilities. With the added support for TensorFlow, we saw an opportunity to use it to deploy our pre-trained custom TF models and make BQML a core component of our production inference stack.

As we will show, with BQML, we created a more agile inference stack allowing for faster scoring, easier maintenance, and accelerated feature roll-out.

What’s unique about us

Tinyclues is a leading marketing solution that puts the right marketing offer in front of the right people. To do so, we give every user in a customer base a propensity score to buy (or like, view, …) a given “offer”.

What constitutes an offer will vary depending on the client’s use case. For retailers, it could be a specific SKU, a product category, or a brand. For hospitality clients, it could be a destination or fare category. Offers can also be formed as a combination of other offers, such as a group of products, brands, or destinations.

A popular use case for these propensity scores is the creation of target audiences for customer marketing campaigns, where an offer is promoted only to the customers most likely to purchase it.

Example for a CRM campaign plan with Tinyclues, showing suggested audience size and revenue potential (%) for each campaign.

For each segment we create, we deliver additional insights, such as Suggested Audience Size and Revenue Potential.

Once multiple offers are scored, we can further visualize the overlap in audiences among them and let marketers optimize pressure, i.e., the number of messages received per customer.

Fatigue Management can help reduce unsubscribes caused by over-promoting certain offers.

Providing this level of flexibility in the selection of offers means that we cannot score all offers ahead of time; yet for marketers to use our insights in an agile way, scores should be available shortly after they are requested.

In other words, our solution needs to be able to score tens of millions of users for an offer we don’t know in advance, in the time it takes to get a cup of coffee.

Note: The following article is part of a series describing how we revamped our tech stack to leverage the power of fully-managed services on GCP.

In a previous piece, we described how we transitioned from a non-SQL homegrown data stack to a SQL-native, fully-managed one. This transition is a prerequisite for the SQL-only BigQuery ML architecture described below.

Inference in GCP without BigQuery ML

Before jumping to BigQuery ML, let’s first look at the conventional way of running inference on pre-trained TensorFlow models in GCP using Vertex AI’s prediction service.

Using Vertex AI, our inference loop would look something like this, strongly resembling our legacy infrastructure before switching to BQML.

Inference Stack with Vertex AI Batch Inference

GCP offers two prediction services for running inference on models hosted in Vertex AI:

  • Vertex Online Predictions serve synchronous requests (the inference VMs are constantly up and available) but deliver only one prediction at a time
  • Vertex Batch Predictions serve asynchronous requests (the inference VMs must first be provisioned by GCP) but deliver many predictions at once.

This leaves us with a dilemma. Our use case can involve scoring tens of millions of customers, so clearly, we don’t want to score one customer at a time.

However, using Batch Predictions comes with its own caveats:

  • Each time we run an inference, we would have to wait for the provisioning of machines
  • Data from BQ cannot be pushed directly into Batch Predictions; thus, for each inference call, we would have to export the data (and transformations?) and push the scores back into BQ
  • Keeping virtual machines up all the time is not an option since it would be highly costly and would still require data transfer each time.

Thus, using Batch Predictions would harm the flexible nature of our platform.

Our workflow with BigQuery ML

How can we improve upon this with BigQuery ML?

BQML enables us to call pre-trained TensorFlow models within an SQL environment, using already-provisioned BQ processing power and avoiding any export of data in and out of BQ.

Inference Stack with BQML

Using BQML removes the layers between our models and our data warehouse and allows us to express the entire inference pipe as a number of SQL requests. We no longer have to export data to load it into our models.

Instead, we are bringing our models to our data.

The approach creates three key benefits:

Speed

Avoiding the export of data in and out of BQ and the provisioning and start-up of machines saves significant time. In the diagram below, we compare inference times for a given dataset in BQML to our old stack.

Inference times in BQML compared to our legacy stack (similar to Vertex AI Batch Inferences)
(*) Prediction in the old stack includes provisioning and starting up the inference workers

As can be seen, the reduction in the number of steps leads to a 50% decrease in overall inference time, improving upon each step of the prediction.

The Offer and Audience feature engineering steps are more than twice as fast as in a custom script, and prediction times are cut thanks to the automatic provisioning of workers. Most importantly, as scores are saved directly in BQ, we avoid the data export altogether, which accounts for most of our speed improvement.

Instantaneous Scalability

Thanks to BigQuery’s Serverless Architecture, BQML natively and instantaneously scales to large data workloads. By automatically provisioning the right amount of computing power, BQ guarantees reliable execution for any dataset size.

Below, we charted execution times vs. the log of audience sizes for a sample of campaigns created on a subset of client datasets.

Since the graph mixes inference times for various offers, and inference times vary with offer complexity, there is significant noise in the benchmarks for small audience sizes.

However, inference times seem to scale very nicely beyond 1M customers: a 10x increase in audience size roughly corresponds to a 1-minute increase in inference time.

Single Runtime + SQL everything

Beyond the speed and scalability improvements, having the entire process in a single runtime makes it much easier to maintain and evolve the inference pipeline. Instead of a complex pipe of scripts and services, the inference is run entirely within a SQL environment, which in turn is made possible by two features:

First, BQML lets users call their pre-trained models with a SQL query. As in the example below, models can be loaded directly into BQ with a simple command and score previously queried input data.
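
For illustration, here is a rough sketch of what these two steps can look like; the project, dataset, and model names are hypothetical, and the name of the output column depends on the model’s serving signature:

    -- Import a pre-trained TensorFlow SavedModel stored in Cloud Storage
    CREATE OR REPLACE MODEL `project.models.propensity_model`
      OPTIONS (MODEL_TYPE = 'TENSORFLOW',
               MODEL_PATH = 'gs://our-bucket/propensity_model/*');

    -- Score previously queried input data with the imported model
    SELECT user_id, predicted_score
    FROM ML.PREDICT(
      MODEL `project.models.propensity_model`,
      (SELECT * FROM `project.features.inference_input`));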

To make this logic work for us, we need a second building block: BigQuery scripting allows us to send multiple statements to BigQuery in one request and build complex, custom queries to get our data ready for inference.

We implemented all the feature engineering directly in BigQuery, while the feature transformations (StringLookup, …) are attached to the model itself in TensorFlow: the script starts from the client data in the warehouse and returns scores directly, with no additional runtime.

Together, these features let us express the entire inference loop, from reading data in the warehouse dataset to writing scores into the campaigns_score dataset, in a single SQL request.

As a result, we can replace our legacy orchestrator with BigQuery scripting:
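
As a simplified sketch (all dataset, table, and column names here are hypothetical), such a script chains feature engineering, scoring, and writing the results in one request:

    -- 1. Offer and audience feature engineering from warehouse tables
    CREATE TEMP TABLE inference_input AS
    SELECT u.user_id, u.user_features, o.offer_features
    FROM `warehouse.users` AS u
    CROSS JOIN `warehouse.offer` AS o;  -- single-row offer table, see below

    -- 2. Score every user and write the scores into the campaigns_score dataset
    CREATE OR REPLACE TABLE `campaigns_score.apple_products` AS
    SELECT user_id, predicted_score
    FROM ML.PREDICT(
      MODEL `models.propensity_model`,
      (SELECT * FROM inference_input));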

BQ scripting is also useful for post-processing the computed scores, such as removing customers with a low propensity to buy or providing additional insights, for example aggregated propensity scores for pre-defined customer segments (for instance, loyalty card members).
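
A couple of hypothetical examples of such post-processing (table names, column names, and thresholds are illustrative only):

    -- Keep only customers above a propensity threshold
    SELECT user_id, predicted_score
    FROM `campaigns_score.apple_products`
    WHERE predicted_score >= 0.1;

    -- Aggregate propensity for a pre-defined segment, e.g. loyalty card members
    SELECT u.loyalty_segment, AVG(s.predicted_score) AS avg_propensity
    FROM `campaigns_score.apple_products` AS s
    JOIN `warehouse.users` AS u USING (user_id)
    GROUP BY u.loyalty_segment;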

Current Constraints of BigQuery ML

So far, so good — but what were our experiences implementing BigQuery ML in practice? Let us look at the two biggest challenges we faced and explain how we solved them.

Cross-Join Bottleneck

An offer is defined as an SQL query over the client catalog that returns all the products that match the criteria.

As we want to predict the same offer for all users, we copy the same offer row to every user in the user table: in SQL, this copy can be written as a cross join between the user table and a single-row offer table.

Cross-Join for the offer “Products by Apple.”
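
In SQL, the operation boils down to something like this (with hypothetical table and column names):

    -- Attach the single offer row to every user
    SELECT u.*, o.*
    FROM `warehouse.users` AS u
    CROSS JOIN (
      SELECT ARRAY_AGG(product_id) AS offer_products
      FROM `warehouse.catalog`
      WHERE brand = 'Apple') AS o;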

This might sound like a trivial task, but an offer that consists of many products can have a relatively large feature representation. If you cross-join a user table of 50M customers with an offer row that weighs 20KB, you produce 1TB of data in the process!

An initial test across several large datasets showed that such runs could take several minutes to finish or even fail entirely as all operations are executed on shared resources.

Luckily, there is an easy workaround: materialize these tables before the join, i.e., write them to disk as tables. While doing so is more costly, it produces very consistent results. We believe the difference comes from more computation slots being allocated when data is materialized.
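
Sketched with the same hypothetical names as above, the workaround simply splits the query in two:

    -- Materialize the single-row offer table first...
    CREATE TEMP TABLE offer AS
    SELECT ARRAY_AGG(product_id) AS offer_products
    FROM `warehouse.catalog`
    WHERE brand = 'Apple';

    -- ...then cross-join the materialized table to the user table
    SELECT u.*, o.*
    FROM `warehouse.users` AS u
    CROSS JOIN offer AS o;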

For BQML’s built-in model types, GCP offers an alternative via ML.RECOMMEND, which takes the user and offer tables as input and thus doesn’t require a cross-join beforehand. However, this feature is not (yet) available for custom models.

Memory Limitations

Naturally, given the distributed architecture of BigQuery, individual workers have certain limits on the workloads they can handle. When first testing our models in BQML, we observed that some of our most complex setups hit these limits and raised an error.

However, as our models are small in terms of number of parameters, the reason we hit this limit is not model size but model complexity.

For example, a computer vision model has a single feature, the image, and its size comes primarily from the number of learnable parameters. In our recommender systems, in contrast, the in-memory complexity comes not from the weights but from the large number of features we use, which results in a very complex model architecture.

Reducing the number of features is not possible without harming predictive performance, so we needed a fix that would not change the model itself. We achieved this by optimizing our models’ TensorFlow graphs, as discussed in detail in this companion article.

Conclusion

All in all, BigQuery ML offers great opportunities for deploying TensorFlow models in a streamlined environment. The platform has evolved substantially in recent years and can now enable state-of-the-art Machine Learning pipelines with rapid-speed execution and easy maintenance.

To overcome certain current limitations of BigQuery ML, we presented two workarounds that enabled us to leverage it for our inference pipelines.

We believe that as Google continues to work on the performance and feature set of BigQuery ML, native solutions to such limitations will become available, and we are excited to see the potential of upcoming releases and the use cases they could enable.
