Unlocking our data with a feature store

João Santiago
Billie Engineering Crew
6 min read · Nov 10, 2021

Machine learning can do amazing things, but all that data science is useless if the models are not fed correctly. Consistency between training and serving in production is critical, and millisecond responses are non-negotiable. Feature stores, a concept pioneered by Uber's Michelangelo platform, are an emerging solution to these concerns. This is our take on the concept.

Feature what?

Feature stores calculate and store features, the pieces of data used in machine learning models. A feature can already exist in some database (e.g. an invoice's timestamp), or be derived and aggregated (e.g. the average invoice amount for a certain merchant over a certain period). Since different machine learning projects commonly use the same features, centralising them reduces code duplication and the time spent creating them. A feature store also speeds up data transfer: instead of e.g. sending three months' worth of transactions over the wire to a model-serving API, we can collect the already-aggregated features from the feature store. The magic of caches 😌.

Our feature store

When we first approached this challenge, off-the-shelf solutions relied on heavy infrastructure (most still do). Systems such as Kafka and Spark require careful maintenance due to their complexity, and introduce new DSLs and concepts orthogonal to a data scientist's mission. We are a small team of data scientists and data engineers, so we opted for simplicity and a “use what we already have” approach:

  • Historical/batch features are described as SQL transformations — a common language between engineers and data scientists, which enables cross-team code review
  • Feature code is tracked using git and Snowflake
  • Transformations happen within Snowflake, using Snowflake Tasks, without the need for Spark
  • Events are handled by Fivetran and Snowflake using Snowflake Streams, without the need for Kafka
  • Features are cached in Redis via an AWS Lambda function (triggered by Snowflake Tasks)
  • (WIP) Just-in-time features (those dependent on data from the current transaction) are calculated synchronously by an HTTP server, not by Spark

Our feature store: Snowflake+Lambda+Redis

That's a total of three pieces: Snowflake, AWS Lambda, and Redis (all familiar, simple enough, and easy to maintain), plus Fivetran, which we already use to sync our production MySQL database to Snowflake.

Snowflake is at the core of our feature store, but the interfaces it provides were not enough. Creating Tasks and Streams involves writing boilerplate that our data scientists insta-rejected 🙅. Instead, we created a system in which features are defined in YAML files, with minimal configuration and a few conventions.
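As a purely illustrative sketch (the keys, helper, and SQL below are made up for this example, not our exact schema), a definition and the Python that reads it might look something like this:

```python
import yaml  # pyyaml

# Hypothetical contents of n_orders_per_customer_id.yml. The real files follow
# the same spirit (a SQL transformation plus a few conventions), but these
# exact keys are illustrative.
FEATURE_DEFINITION = """
source_table: orders
entities:
  - customer_id
sql: |
  SELECT COUNT(*) FROM orders WHERE customer_id = :customer_id
"""

def full_feature_name(filename: str) -> str:
    """n_orders_per_customer_id.yml -> n_orders_per_customer_id"""
    return filename.removesuffix(".yml")

definition = yaml.safe_load(FEATURE_DEFINITION)
name = full_feature_name("n_orders_per_customer_id.yml")

# From here, parsing code renders the Snowflake boilerplate (streams, tasks,
# backfill) that nobody wanted to write by hand -- see the next section.
print(name, definition["source_table"], definition["entities"])
```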

The filename is also the feature's "full name" (n_orders_per_customer_id.yml), and the file is saved in the feature store GitHub repository. More generally, the full name of a feature has the form <feature_name>_per_<entity_1>_per_<entity_n>, meaning we can easily create exotic entities on a whim. The declarative nature of this system also makes it easy to inspect existing features and discover new ones.

Snowflake, the colossus in the room

Snowflake does the heavy lifting in our system, and we lean on two of its primitives, Snowflake Streams and Snowflake Tasks:

  1. For each feature, we create a stream containing all INSERTs or UPDATEs to the source table (a sketch of the generated boilerplate follows this list)
  2. We create a corresponding task that computes the feature based on the UUID from the source table (e.g. an order UUID; we call these trigger values)
  3. Each task writes its result into a feature_history table that, as the name implies, stores the value of each calculation
  4. Additionally, another stream-and-task combo collects all INSERTs to the feature_history table and, via an AWS Lambda, pushes the latest value of each feature to Redis (aka the online store)
  5. Finally, to ensure point-in-time accuracy, we calculate the LAGged value of each feature and update its valid_from and valid_to columns. We need this because in some situations, such as the first time an entity is ever requested from Redis, the value the model gets and the value Snowflake stores are different (NULL vs. an actual value, in this case)
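
To make the stream-and-task pairing concrete, here is a rough sketch of the kind of boilerplate generated for a single feature. The object names, warehouse, and SQL are illustrative; in practice this DDL is rendered from the YAML definition rather than written by hand:

```python
import snowflake.connector  # snowflake-connector-python

# Illustrative stream: captures INSERTs/UPDATEs on the source table.
STREAM_DDL = """
CREATE STREAM IF NOT EXISTS orders_stream ON TABLE orders
"""

# Illustrative task: recomputes the feature for every entity touched by the
# stream and appends the result to feature_history.
TASK_DDL = """
CREATE TASK IF NOT EXISTS n_orders_per_customer_id_task
  WAREHOUSE = transforming
  SCHEDULE = '1 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
  INSERT INTO feature_history (feature_name, entity_id, value, computed_at)
  SELECT 'n_orders_per_customer_id', o.customer_id, COUNT(*), CURRENT_TIMESTAMP()
  FROM orders o
  WHERE o.customer_id IN (SELECT customer_id FROM orders_stream)
  GROUP BY o.customer_id
"""

conn = snowflake.connector.connect(account="...", user="...", password="...")
cur = conn.cursor()
cur.execute(STREAM_DDL)
cur.execute(TASK_DDL)
cur.execute("ALTER TASK n_orders_per_customer_id_task RESUME")  # tasks start suspended
cur.close()
```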

Creating features

To add a new feature to the feature store, we first experiment offline. Once we are happy with the results, the query used during experimentation is wrapped in a CREATE FUNCTION statement and we create a feature definition (aka the YAML file). After the feature is added to Snowflake via a Jenkins job, a backfilling procedure runs automatically. It calculates the value of the new feature for each row in the source table (see the sketch below) with point-in-time accuracy, i.e. we calculate what the value of the feature would have been had it been calculated exactly when the order was created. Finally, the new feature is added to the features table via a Jenkins job and Redis is updated.
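
A hedged sketch of those two steps, assuming a UDF that accepts the entity plus a cutoff timestamp (the function signature and column names are illustrative):

```python
# Illustrative only: the experimentation query wrapped in a UDF that accepts a
# cutoff timestamp, so values can be computed "as of" any point in time.
CREATE_FUNCTION_SQL = """
CREATE OR REPLACE FUNCTION n_orders_per_customer_id(cid VARCHAR, cutoff TIMESTAMP_NTZ)
RETURNS NUMBER
AS
$$
  SELECT COUNT(*) FROM orders WHERE customer_id = cid AND created_at < cutoff
$$
"""

# Illustrative backfill: one feature value per row of the source table, computed
# as if the feature had existed when the order was created. In reality this runs
# inside a Snowflake stored procedure kicked off by Jenkins.
BACKFILL_SQL = """
INSERT INTO feature_history (feature_name, entity_id, trigger_uuid, value, computed_at)
SELECT 'n_orders_per_customer_id',
       o.customer_id,
       o.uuid,
       n_orders_per_customer_id(o.customer_id, o.created_at),
       o.created_at
FROM orders o
"""
```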

The only additional code we had to write was the Snowflake procedures and some parsing code in Python. Now, the only steps that need manual intervention are writing the features themselves, code review, and pressing "Merge to main branch". Very lean 😎.

Training models

Our model-training pipeline (perhaps a future post) uses the feature store as the source for historical/aggregate features. As mentioned earlier, point-in-time accuracy is the name of the game here, so every feature is saved with valid_from and valid_to attributes. These make it easy to join the correct value of a feature to the transaction we are interested in — again, just SQL, no new DSL to learn. At some point these queries may be wrapped for convenience (see e.g. https://github.com/jcpsantiago/pargo).
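
For instance, pulling a training set with point-in-time correct feature values might look roughly like this (table and column names are illustrative, the join pattern is the point):

```python
import snowflake.connector

# Point-in-time join: pick the feature value that was valid when the order was
# created, never a value computed later.
TRAINING_SQL = """
SELECT o.uuid,
       o.created_at,
       f.value AS n_orders_per_customer_id
FROM orders o
LEFT JOIN feature_history f
  ON  f.feature_name = 'n_orders_per_customer_id'
  AND f.entity_id    = o.customer_id
  AND o.created_at  >= f.valid_from
  AND o.created_at  <  COALESCE(f.valid_to, CURRENT_TIMESTAMP())
"""

conn = snowflake.connector.connect(account="...", user="...", password="...")
training_df = conn.cursor().execute(TRAINING_SQL).fetch_pandas_all()
```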

If a model uses any of these features, the training pipeline first checks that each feature exists in the feature store, i.e. is available in the features table, and stores a mapping between the model and its features (currently in DynamoDB). Thus, we avoid deploying models that depend on non-existent features, and achieve an almost "hands-free" deployment.
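
A sketch of that check at deployment time, assuming a DynamoDB table for the model-to-features mapping (the table and key names are made up to illustrate the idea):

```python
import boto3
import snowflake.connector

def register_model(model_name: str, feature_names: list[str]) -> None:
    """Refuse to deploy a model that depends on features the store doesn't have;
    otherwise persist the model -> features mapping in DynamoDB."""
    conn = snowflake.connector.connect(account="...", user="...", password="...")
    available = {row[0] for row in conn.cursor().execute("SELECT feature_name FROM features")}

    missing = set(feature_names) - available
    if missing:
        raise ValueError(f"Features not in the feature store: {missing}")

    boto3.resource("dynamodb").Table("models").put_item(
        Item={"model_name": model_name, "features": feature_names}
    )
```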

Real-time inference

Our models are served using custom R SageMaker containers, in a fully serverless architecture.

In production, retrieving features from Redis is a two-step process within the model Lambda (it would also work in a non-serverless architecture, of course):

  1. Get from DynamoDB the list of features to collect from the feature store (saved during training, see above) with models_table.get_item
  2. If any are needed, redis.mget(<list of keys>) all the values from Redis (such a beautiful interface!)

That's it! The historical features are then sent to the respective model for the request.
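
Put together, the lookup inside the model Lambda might look roughly like this (the key layout and table names are illustrative):

```python
import boto3
import redis

dynamodb_models = boto3.resource("dynamodb").Table("models")  # model -> features mapping
feature_cache = redis.Redis(host="our-redis-host", port=6379, decode_responses=True)

def collect_features(model_name: str, customer_id: str) -> dict:
    """Fetch the cached feature values a model needs for a single request."""
    # 1. which features does this model expect? (saved at training time)
    item = dynamodb_models.get_item(Key={"model_name": model_name})["Item"]
    feature_names = item["features"]
    if not feature_names:
        return {}

    # 2. grab all of them from Redis in one round trip
    keys = [f"{name}:{customer_id}" for name in feature_names]
    return dict(zip(feature_names, feature_cache.mget(keys)))
```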

But what is it good for, really?

Since its launch in production, our feature store has enabled us to use more historical data about the entities placing requests, add velocity features to our anti-fraud models, and even cross-use data from different products. Now the center around which data orbits is no longer the product but the entity: if we see the same email or the same company using several of our products, we can be more efficient and reuse that data. This is invaluable not only for fighting fraud, but also for product analytics in general.

Limitations & future plans

  • As seen in the architecture diagram, there is a delay between a datum being created in the prod database and a derived feature being available in Redis (usually under three minutes). Thankfully, most of our use cases cope well with such a delay, and there are still optimisation avenues open before thinking about streaming of any kind.
  • There is a delay between testing a feature and training a model with it. This is true of all feature stores, to a varying degree.
  • There is no "feature registry", beyond exploring the feature store GitHub repository. There is no monitoring either. Yet.
  • We are still working on just-in-time features. Once that is released, a data scientist training a model should not need to go anywhere beyond the feature store to fetch data, nor write any code to clean it up.

Signing off

It took us (Bogdan Ile, the rockstar data engineer and co-author of this post; our VP Data, Igor Chtivelband; and me, João Santiago) six months of prototyping until the first feature was live and in use by a model.

Now more data scientists are using it, and we have started to see the kind of "cross-pollination" between projects we hoped for, while keeping the Data Engineering workload stable.

We hope our "lightweight" feature store inspires you to build awesome things too! Reach out if you have questions or comments; we love hearing what others are doing in this exciting space. 🙂

Acknowledgements

We drew inspiration from Feast, an open-source feature store started by the engineering team at GoJek.
