Millions of customers visit Thumbtack to find professionals to help maintain and improve their homes. Our search ranking algorithm enables this by ranking professionals based on who best fits the customer’s job. In this post, we’ll discuss how we built the ability to create features with historical search data. We’ll also outline how this sped up iterations on machine learning models that power our search ranking algorithm.
We iterate on machine learning models by retraining with new features. As mentioned in our blog post on transitioning to machine learning, this involves implementing the feature in production, logging it in events data, and waiting for enough samples to accumulate. We illustrate this process in the diagram below. The search ranking service loads feature data from DynamoDB and logs it in search events that feed into training data. A downside is that accumulating training data can take weeks or months, slowing down ranking improvements.
How Did We Address Slow Iteration Cycles?
To address these slow iteration cycles, we used historical data to simulate what new features' values would have been in past searches. For example, suppose we want to use how long a professional has been on Thumbtack as a new feature in our search ranking model. Even though we don't log this feature, we can calculate it for each professional in past search results. We can then train new models on these simulated features, as if we had logged them in search events from the very beginning. If any of these features improved our models, we'd re-implement them in our Spark jobs to make them available for online use.
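To make this concrete, here's a minimal sketch of backfilling such a feature from historical records. All names and values here are hypothetical, not Thumbtack's actual schema:

```python
from datetime import date

def pro_tenure_days(signup_date: date, search_date: date) -> int:
    """Simulate the 'pro tenure' feature for a past search: how long
    the professional had been on the platform at search time."""
    return (search_date - signup_date).days

# Hypothetical records joining past search results with pro signup data.
past_searches = [
    {"pro_id": "a1", "search_date": date(2021, 3, 1), "signup_date": date(2019, 3, 1)},
    {"pro_id": "b2", "search_date": date(2021, 3, 1), "signup_date": date(2021, 1, 30)},
]

# Backfill the feature as if it had been logged at search time.
for row in past_searches:
    row["tenure_days"] = pro_tenure_days(row["signup_date"], row["search_date"])
```

The key point is that the feature is computed relative to each historical search's date, so the training data looks exactly as it would have if we had logged the feature all along.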
This approach was promising. But we noticed discrepancies between feature distributions in offline training and online ranking. Two explanations came to mind. First, the SQL that computes the features offline and the Spark jobs that compute them for online use might implement different logic. Second, while nightly Spark jobs update the datastore, we couldn't determine whether a given search was served with today's updated feature data or yesterday's.
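The first kind of discrepancy is easy to introduce and hard to spot. A toy illustration (the feature and defaults here are invented for the example): two independently maintained implementations of an "average rating" feature can silently disagree on an edge case like a pro with no reviews.

```python
# Offline SQL might COALESCE missing ratings to 0.0 ...
def avg_rating_offline(ratings: list[float]) -> float:
    return sum(ratings) / len(ratings) if ratings else 0.0

# ... while the online Spark job falls back to a neutral default.
def avg_rating_online(ratings: list[float]) -> float:
    return sum(ratings) / len(ratings) if ratings else 2.5

# Both agree on the common case, diverge on the edge case:
common = avg_rating_offline([4.0, 5.0]) == avg_rating_online([4.0, 5.0])
edge = avg_rating_offline([]) == avg_rating_online([])
```

Neither implementation is "wrong" in isolation, which is exactly why the drift only shows up as a shifted feature distribution between training and serving.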
Implementing new features offline, instead of accumulating data, sped up our iteration cycles. But, to make machine learning iterations more successful, we needed to maintain feature consistency between online and offline environments.
What Solutions Already Exist?
Other companies address these issues by building what’s known as a feature store. Before deciding what we wanted to build, we spent time understanding other companies’ solutions. One great resource for this was featurestore.org, which you can visit to learn more about individual companies’ feature store implementations. Here, we’ll summarize some of them before diving into what we chose to do at Thumbtack.
Uber (Michelangelo’s Palette):
Uber’s feature store Palette serves as a centralized database of features for the entire company. That way, different teams can leverage all the data cleaning and transformation work done to create a reliable machine learning feature.
It’s backed by two datastores, one offline (Hive) and one online (Cassandra). Data gets synchronized between the two stores, so if you add new features to the offline store, they also get added to the online store. You can read more about this in this transcript of a 2019 QCon.ai presentation given by Uber engineers.
AirBnB (Zipline):
Zipline is AirBnB’s declarative feature engineering framework. Using Zipline, data scientists can specify how exactly they want their features calculated and all the necessary offline jobs and systems will be set up for them. To motivate Zipline, their engineers mention that only 5% of production machine learning is the actual model implementation. The other 95% ends up being glue to plumb data necessary to power the model.
You can learn more about Zipline through this 2020 Spark+AI Summit presentation from AirBnB.
If you want to learn more about other companies’ feature store solutions and their motivations, this blog post covers it in more detail.
What Did We Choose to Build?
At Thumbtack, one of our core values is “lead with why”. Whether it’s improving the UI of our customer iOS app or tweaking spam detection algorithms, we keep our work’s purpose top of mind. Our main goal was to be able to create new features using historical data to speed up feature engineering. Another priority was having consistent feature values between offline and online environments, so that models that perform well offline are more likely to also do so online. Thus, we built what allowed us to most quickly achieve those goals.
In particular, we restructured our nightly Spark jobs. We created a daily BigQuery snapshot of the feature data based on events data (see “Offline Features” in the diagram below). Offline model training then uses data from this snapshot. For online access, another Spark job takes those features and stores them in DynamoDB. Because we populate DynamoDB from “Offline Features”, feature distributions are identical offline and online. We also added a version field to the feature data we upload to DynamoDB, which is the date we aggregated the data on. We log this version field in search events, letting us know whether the feature data for the search was for today’s aggregation or yesterday’s.
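The version field is a small addition with an outsized payoff. A sketch of the idea, with illustrative names rather than our actual DynamoDB schema:

```python
from datetime import date

def build_feature_row(pro_id: str, features: dict, aggregation_date: date) -> dict:
    """Attach a version (the date the data was aggregated on) to a
    feature row before uploading it to the online store."""
    return {
        "pro_id": pro_id,
        "features": features,
        "version": aggregation_date.isoformat(),
    }

row = build_feature_row("a1", {"tenure_days": 731}, date(2021, 3, 1))
# At serving time, the ranking service logs row["version"] into the
# search event, so training data can tell exactly which day's
# aggregation was used for each search.
```

With the version logged per search, the second source of offline/online discrepancy (not knowing whether a search used today's or yesterday's feature data) disappears.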
To create new features, we can simply backfill these historical snapshots and model training data will include them.
One final piece of functionality we built is a quality-of-life improvement for engineers and data scientists who weren’t as comfortable working with Spark jobs. Thumbtack already has a BigQuery SQL-driven events pipeline that supports backfilling data. Using the offline features BigQuery table (“Offline Features” in the above diagram), we had our Spark jobs store all of this table’s columns in a schema-less key-value field in DynamoDB. That way, even if we wanted to add a new feature (i.e. a new column in the “Offline Features” table), we wouldn’t need to edit the offline Spark job. This not only speeds up the process of adding new features, but also democratizes the ability to add or edit features for those less familiar with our offline Spark jobs.
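The mechanics of that schema-less field look roughly like this (column and key names are illustrative, not our actual table schema):

```python
import json

def pack_features(offline_row: dict, key_column: str = "pro_id") -> dict:
    """Pack every non-key column from an 'Offline Features' row into a
    single schema-less map. Because the upload job never names
    individual feature columns, adding a column upstream requires no
    code change here."""
    features = {k: v for k, v in offline_row.items() if k != key_column}
    return {key_column: offline_row[key_column], "features": json.dumps(features)}

# A newly added column ("response_rate") flows through automatically:
item = pack_features({"pro_id": "a1", "tenure_days": 731, "response_rate": 0.92})
```

The trade-off of this design is that the online store no longer enforces a per-feature schema, so type and presence checks shift to the consumers of the data.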
Did We Benefit from What We Built?
So far, we’ve started to use our version of a feature store to engineer new features for ranking experiments, resulting in many benefits to the team. For example, we experimented with features capturing customer engagement while viewing a professional’s profile. We already had all this information in events data. Thus, with this functionality, we backfilled these values into historical search data and trained models based on that.
Another benefit was that feature implementation bugs became less disastrous to our velocity. One recent ranking iteration made our models aware of a search’s filter selections (e.g. 1 bedroom, 1 bath for house cleaning). The idea was to optimize our ranking algorithm for the customer’s filter selections. Since the logic for this feature is complex, we encountered several bugs in initial implementations. In our old system, encountering bugs like this would mean having to wait again for data to accumulate after fixing the bug. Now, we can fix the logic bug and backfill historical events, thereby preventing more delays.
All this use of the feature store infrastructure revealed various limitations and opportunities for improvement. In the filter selection ranking experiment, we hit DynamoDB read capacity thresholds. We also worried about the latency of serving all this data. In some sense, this functionality enabled us to train on much more ambitious features offline. Now, we need to augment our online infrastructure to keep pace.
Lastly, our decision to focus on the ability to backfill new features from raw events did have its limitations. For instance, we designed some important features (e.g. a pro’s settings or complete text profile) such that they can’t be easily backfilled. This is because we don’t always track enough information to recreate those from events data. We’d like to support features like this eventually. But, since these are a minority of features, we prioritized support for features that we can recreate from events data.
In summary, we’ve benefited a lot from prioritizing the subset of feature store functionality that we needed most. As our need for better machine learning infrastructure grows, we will consider building systems like those found at other companies or using open source solutions like Feast. If you’re interested in thinking about and building systems to power machine learning in production, come join us!