Unifying Machine Learning Model Inference at Thumbtack

Our journey building a scalable model inference service

Oleksandr Pryimak
Thumbtack Engineering
May 14, 2024


At Thumbtack, we use machine learning (ML) to help customers find the right professionals for their home services and to power features such as recommending home service categories (like Plumbing and House Cleaning) to customers. An important step in deploying an ML model to production is creating an endpoint that our backend can call.

In this blog post, we share our journey building a scalable machine learning inference service that works well for Thumbtack, along with the options we considered.

Previous State

Overview

At the end of 2021, each Thumbtack team was working on machine learning independently, and each built its own infrastructure. For example, the Marketplace Matching team (read about their amazing work here: 1, 2, 3) was using AWS SageMaker (Amazon’s ML service) to deploy their models. Several other teams used a different, internally built microservice called Prediction Service. Many teams didn’t use machine learning at all; some of them struggled with the high cost of deploying their first model.

In January 2022, Thumbtack created a Machine Learning Infrastructure team. We were tasked with building company-wide machine learning infrastructure for all teams to use. Our first goal was to address the problem of online ML inference (running predictions for models) at scale. Online, in this context, means that teams wanted to deploy a wide range of models (for example, GBDTs built using XGBoost) as endpoints, call them from our backend microservices, and get results in a timely fashion (typically within 30–300 milliseconds or less).

We wanted to support a range of standard model types (logistic regression, GBDTs and neural networks) using typical runtimes (for example, scikit-learn, XGBoost, TensorFlow). We expected to have approximately 20–50 active models over the following 2 years. Typically a model used a dozen or so features (with some notable exceptions) and took less than 1 GB of memory. Several teams used micro batching — i.e. running inference for around one hundred different data rows per call.

At this point we had a legacy Prediction Service used for a few smaller models. We also successfully integrated SageMaker for one team. So the question was, why not use one of these solutions for everyone?

How was the AWS SageMaker integration implemented?

The Marketplace Matching team built a small library to encapsulate the SageMaker call. It worked well for them because they fully owned the library and no one else needed it. For example, if the Matching team needed to adjust the retry policy, they could simply modify how the client was used and redeploy their service.
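For illustration, here is a minimal sketch of what such a wrapper might look like using boto3; the endpoint name, payload shape, and retry settings are illustrative, not the Matching team’s actual library:

```python
import json

import boto3
from botocore.config import Config

# Illustrative retry/timeout policy: because the team owned this library,
# adjusting it only required changing this config and redeploying their service.
_config = Config(retries={"max_attempts": 3, "mode": "standard"}, read_timeout=0.3)
_runtime = boto3.client("sagemaker-runtime", config=_config)


def predict(endpoint_name: str, rows: list) -> list:
    """Call a SageMaker endpoint with a JSON payload and return its predictions."""
    response = _runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"rows": rows}),
    )
    return json.loads(response["Body"].read())["predictions"]
```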

A year of operation highlighted one major drawback: infrastructure fragmentation.

SageMaker endpoints by default report metrics to AWS CloudWatch and send all logs to AWS CloudWatch Logs. At the same time, the rest of the Thumbtack backend used InfluxDB/Grafana for reporting metrics and ElasticSearch/Kibana for logging. SageMaker and the rest of the backend also had separate deployment pipelines.

Prediction Service

Prediction Service, the other major ML inference solution used at Thumbtack, was a legacy internal service.

Prediction Service design diagram

This service followed a two-layered design.

The first layer was a thin API layer. It implemented common routing logic for traffic: it consulted the model registry and, depending on the model type, called one of two backend services (a rough routing sketch follows the list below).

The second layer consisted of the ML backends:

  1. Scikit-learn serving ran ML inference for scikit-learn based models, mainly GBDTs.
  2. TensorFlow Model Server was used for TensorFlow based models.
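A rough sketch of that routing logic, with hypothetical model names and backend URLs (the real registry was a proper component, not a dictionary):

```python
import requests

# Hypothetical registry contents: each model maps to one of the two backends.
MODEL_REGISTRY = {
    "pro_ranking_gbdt": "http://sklearn-serving:8080/predict/pro_ranking_gbdt",
    "category_recs_nn": "http://tf-model-server:8501/v1/models/category_recs_nn:predict",
}


def route(model_name: str, payload: dict) -> dict:
    """API layer: look up the model in the registry and call the matching backend."""
    url = MODEL_REGISTRY[model_name]
    response = requests.post(url, json=payload, timeout=0.3)
    response.raise_for_status()
    return response.json()
```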

Notable design choices:

  1. Each ML backend had a fixed runtime. Any update would be propagated to all deployed models. For example, a scikit-learn upgrade was only possible if all deployed models supported a new version of scikit-learn.
  2. Python was used as the language of choice for both layers, despite Go being the standard Thumbtack backend language.
  3. There was no way to add any custom model-specific logic. The only thing an applied scientist controlled was the ML artifact they deployed.
  4. For scikit-learn serving, an applied scientist could deploy pickled files as artifacts, which made it very hard to upgrade the scikit-learn library.

Two years of operation with the Prediction Service highlighted several drawbacks:

  1. Hard to upgrade runtime

All runtimes were tied together. This worked for the TensorFlow Model Server backend, which was backward compatible: a new version could load artifacts produced by an older one.

The same could not be said for the scikit-learn backend, which relied heavily on pickled files. To work correctly, pickles need fully matching environments, so the inference runtime had to match the training runtime. To make matters worse, all existing scikit-learn production models at Thumbtack were tied to the same environment.

A major version change would require redeploying all scikit-learn models and coordinating across several teams. Because of this high cost, such upgrades were performed extremely rarely. As a result, by the end of 2021 we were a few years behind on dependency upgrades. This also affected the training environment for new models, hurting applied scientists’ productivity.
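To make the coupling concrete, here is a minimal sketch (with synthetic data and an illustrative version guard, not our actual code) of why a pickled scikit-learn artifact ties the serving runtime to the training runtime:

```python
import pickle

import numpy as np
import sklearn
from sklearn.ensemble import GradientBoostingClassifier

# Training side: the pickled estimator is tied to the exact scikit-learn
# version (and its transitive dependencies) used to train it.
X, y = np.random.rand(100, 5), np.random.randint(0, 2, 100)
model = GradientBoostingClassifier().fit(X, y)
with open("model.pkl", "wb") as f:
    pickle.dump({"sklearn_version": sklearn.__version__, "model": model}, f)

# Serving side: a guard like this makes the implicit coupling explicit, since
# pickled estimators are not guaranteed to work across library versions.
with open("model.pkl", "rb") as f:
    artifact = pickle.load(f)
if artifact["sklearn_version"] != sklearn.__version__:
    raise RuntimeError("serving runtime must match the training runtime for pickled models")
```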

2. The Prediction Service API was implemented in Python

At Thumbtack, we use Go as the primary language for our backend. Any developer who built their microservices in Go had great support from Thumbtack’s Online Services, SRE, and Developer Experience teams, and Thumbtack Engineering has accumulated a lot of expertise in building and operating scalable Go-based microservices.

The Prediction Service was the only surviving production microservice written in Python, so it was up to its owners, with their limited resources, to figure out how to build scalable microservices in Python. While feasible, this put extra strain on them.

Design considerations for a unified solution

We wanted to build a new unified solution and we had the following considerations:

  1. Our solution should be unifying: everyone should be able to use the same API to solve the same problem. We did not want to support 2–3 different solutions for the same problem, which would spread our infrastructure investment thin. Because of that, we also wanted a way to migrate all existing legacy ML inference applications to this new solution.
  2. We also wanted the new solution to be fairly flexible and be able to support new ML frameworks when the need arose.
  3. We wanted to avoid building new infrastructure as much as possible. If we already had a solution for the problem at Thumbtack we wanted to reuse it. For example, we already had a system to define and continuously deploy microservices. We wanted to reuse it instead of building something from scratch.

Designing an architecture for the new solution

Client library vs service

There were two ways forward. The first was to make a client library for product teams to use (like we did with AWS DynamoDB). The second was to make a separate service with RPC to proxy all the inference calls.

Client library vs service

The diagram above shows how these two options compare side by side. Here an ML Engine is a service which runs a specific model; it can be a SageMaker endpoint, a TensorFlow Model Server, or a custom implementation. The diagram is color coded: components in green would be owned by the ML Infra team, and everything else by the appropriate product team.

The benefits of creating a separate service were:

  1. Clear division of ownership: The ML Infra team could have a different release cycle compared to the product teams.
  2. If calling the ML Engine required special permissions, such as permission to call SageMaker, we could grant them to this one service instead of asking product engineers to add the permission to every calling service.
  3. Updating the common logic would require only 1 deploy vs N deploys (as long as we do not change APIs).

The downsides included:

  1. Added complexity: one more service to manage.
  2. Potentially added latency: one more network hop.

Weighing these pros and cons, we decided to build a separate service. Building it also permitted us to automatically integrate ML inference with other MLOps-related services (more on this below).

Inference Service Design

The inference service architecture diagram

Green components would be owned by the ML Infra team. Blue components would be owned by the product teams. The solid line shows an example call flow in our backend. The dashed line on the right shows the deployment pipeline.

These were key elements of our design:

A. All product services would call a single entry point, the inference service. It would contain all common logic, including but not limited to MLOps.

B. Each model would be deployed as a separate endpoint in a separate container. This would allow different initiatives to be fully isolated (see the sketch after this list):

  1. Clients could use different runtimes
  2. Clients could have different resource allocation (memory, CPU, GPU)
  3. Clients could have a different number of containers serving their traffic
  4. Failures in one endpoint would not affect another
  5. Poor latency in one endpoint would not affect another
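To make that isolation concrete, here is a hypothetical sketch of the kind of per-model settings an endpoint could carry; the field names and values are illustrative, not our actual deployment spec:

```python
from dataclasses import dataclass


@dataclass
class EndpointSpec:
    """Hypothetical per-model endpoint settings (names are illustrative)."""
    model_name: str
    template: str       # which ML Infra-owned template to instantiate
    runtime_image: str  # each endpoint pins its own runtime independently
    memory_mb: int
    cpu: float
    replicas: int       # number of containers serving this model's traffic


pro_ranking = EndpointSpec(
    model_name="pro_ranking_gbdt",
    template="python-sklearn",
    runtime_image="ml-serving/sklearn:1.3",
    memory_mb=1024,
    cpu=1.0,
    replicas=3,
)
```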

C. Each model endpoint would be instantiated from one of a small number of templates owned by the ML Infra team. This would let us pass good engineering practices on to product teams and ensure engineers do not waste time on common code such as the following (a sketch of such a wrapper follows the list):

  • reporting a sample of all inputs/outputs to a third-party MLOps solution to ensure data correctness and measure data drift
  • reporting all inputs/outputs to our data lake so we could build a re-training pipeline for model refreshes
  • reporting all latencies to InfluxDB/Grafana to ensure we could reply in a timely manner
  • different latency optimizations like parallelization, request hedging, etc.
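As a sketch of the kind of common code a template could bundle, here is a minimal instrumented wrapper around a model’s predict function; the print calls stand in for our real metrics, MLOps, and data lake sinks:

```python
import functools
import random
import time


def instrumented(sample_rate: float = 0.01):
    """Wrap a predict function with latency reporting and input/output sampling."""
    def decorator(predict_fn):
        @functools.wraps(predict_fn)
        def wrapper(inputs):
            start = time.perf_counter()
            outputs = predict_fn(inputs)
            latency_ms = (time.perf_counter() - start) * 1000.0
            print(f"latency_ms={latency_ms:.2f}")  # stand-in for InfluxDB/Grafana
            if random.random() < sample_rate:
                # stand-in for reporting to the MLOps tool and the data lake
                print({"inputs": inputs, "outputs": outputs})
            return outputs
        return wrapper
    return decorator


@instrumented(sample_rate=0.01)
def predict(inputs):
    return [0.5 for _ in inputs]  # stand-in for the real model
```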

D. Model initialization and deployment would be performed via our Deploy job (in our case, just a special Jenkins job).

E. Models would be stored in the model registry.

The inference service was built and supported by a small team. That is why we preferred standard Thumbtack solutions even when they were not the most efficient option: we preferred simplicity and optimized only when needed.

Most Thumbtack use cases did not need latency or throughput optimization: it was almost always easier and cheaper to over-provision our endpoints than to try to optimize them.** The team still ensured we had enough logging and monitoring to find and fix performance issues if needed. For example, we added extensive garbage collection tracing so we could tell whether Python garbage collection was responsible for poor long-tail latency.

**Note: this tradeoff may be different for other companies and applications.
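As one illustration of the garbage collection tracing mentioned above, Python’s standard gc module supports collection callbacks, so a sketch of such tracing could look like this (the print call stands in for a metric emitted to InfluxDB/Grafana):

```python
import gc
import time

_gc_start = 0.0


def _trace_gc(phase: str, info: dict) -> None:
    """Time each garbage collection pause so long-tail latency can be correlated with GC."""
    global _gc_start
    if phase == "start":
        _gc_start = time.perf_counter()
    elif phase == "stop":
        pause_ms = (time.perf_counter() - _gc_start) * 1000.0
        # stand-in for a metric reported to InfluxDB/Grafana
        print(f"gc generation={info['generation']} collected={info['collected']} pause_ms={pause_ms:.2f}")


gc.callbacks.append(_trace_gc)
```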

Why we didn’t choose SageMaker

The main reason was that SageMaker was not the only solution in use at Thumbtack at that point; we had several models deployed to the Prediction Service. Migrating from the legacy Prediction Service to SageMaker would have been hard: it would have involved deploying legacy models to SageMaker and changing client code on the backend at the same time.

By building a new inference service, we had the option to implement a facade which, for certain model names, would simply route calls to the Prediction Service, as illustrated below.

One way to migrate from Prediction Service to the Inference service

So a potentially complex migration would be broken down into two technical steps:

  1. Migrating the client
  2. Migrating (if needed) the ML engine
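A rough sketch of such a facade inside the inference service; the model names and backend URLs are hypothetical:

```python
import requests

# Models still served by the legacy Prediction Service (hypothetical names).
LEGACY_MODELS = {"legacy_churn_model"}
LEGACY_PREDICTION_SERVICE_URL = "http://prediction-service:8080/predict"
NEW_ENGINE_URL = "http://inference-engines:8080/models/{model}/predict"


def predict(model_name: str, payload: dict) -> dict:
    """Single client-facing entry point; routing hides where the model actually runs."""
    if model_name in LEGACY_MODELS:
        # Step 1: clients already call the inference service, which proxies to
        # the legacy Prediction Service until the model itself is migrated.
        body = {"model": model_name, **payload}
        return requests.post(LEGACY_PREDICTION_SERVICE_URL, json=body, timeout=0.3).json()
    # Step 2 (if needed): the migrated model is served by its own endpoint.
    url = NEW_ENGINE_URL.format(model=model_name)
    return requests.post(url, json=payload, timeout=0.3).json()
```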

Ease of migrations was important for us. We did not want to build a third way to run ML inference at Thumbtack!

Another factor was that SageMaker did not integrate with any Thumbtack internal tools out of the box, and the team deemed implementing this integration to be as difficult as implementing an ML engine from scratch using a previously developed Python web service skeleton.

Besides, our design still permitted us to use SageMaker as an ML engine if needed. All we needed to do was to implement the SageMaker client inside our inference service.

Our design also allowed integration with third-party services: for example, the inference service could call OpenAI. We did the same for the TensorFlow Model Server instance, which we also use to run several legacy models.

Why we didn’t choose the internal Prediction service

Earlier in this post, we described a few drawbacks in how the Prediction service was designed and implemented.

The main one was the decision to have a fixed runtime environment for all clients, which made runtime upgrades effectively impossible. We wanted our new system to be more flexible. This constraint was fairly central to the Prediction Service design, and changing it iteratively was deemed too expensive.

That’s why we decided to implement the new Inference service from scratch.

There is a mental bias that some of us (including me) suffer from: building something new from scratch seems more interesting. Who wouldn’t want to design and implement a new system from scratch instead of incrementally improving a legacy one to meet the same goal?

To ensure we did not fall into this trap, we estimated the effort needed to build a new service versus modifying the existing one. The latter was deemed riskier and at least as expensive.

API

The last design consideration was coming up with a good API. We wanted something that would be easy to map onto existing legacy systems, flexible, and future-proof, but we also did not want to over-engineer it.

After reviewing the existing systems, we noticed a pattern: model inference can be viewed as a function, DataFrame -> DataFrame.

Solutions which used similar APIs include:

  1. Scikit-learn pipelines
  2. Spark ML transformers

So we decided to build our input and output data model inspired by, and compatible with, the very popular pandas DataFrames.
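As an illustration of the shape (field names are hypothetical, not our actual API), a column-oriented request and response map cleanly onto pandas DataFrames on both sides of the call, and naturally support micro batching of many rows per call:

```python
import pandas as pd

# Hypothetical column-oriented request; real calls often carry ~100 rows.
request = {
    "model_name": "pro_ranking_gbdt",
    "inputs": {
        "distance_km": [1.2, 8.4, 0.7],
        "num_reviews": [25, 3, 110],
    },
}

# Inside a model endpoint, inference really is DataFrame -> DataFrame.
features = pd.DataFrame(request["inputs"])
scores = pd.DataFrame({"score": [0.81, 0.12, 0.95]})  # stand-in for model.predict(features)

response = {"outputs": scores.to_dict(orient="list")}  # {"outputs": {"score": [0.81, ...]}}
```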

Conclusion

Building a standard system for online ML inference gave us a single service in which to integrate all the logic needed to run ML successfully in production. It also ensures that each new infrastructure improvement benefits all engineers at Thumbtack. Currently, all new models are deployed to this inference service; major clients have migrated their legacy models, while others are scheduling the migration.

If these types of problems seem interesting to you, check out Thumbtack’s career site!

Acknowledgement

I would like to thank Navneet Rao and Richard Demsyn-Jones for their critical feedback on this post. I would like to thank the entire ML Infra team (Kevin Zhang and James Chan) at Thumbtack for helping design & build this service. It would have been impossible to build such a solution without you.
