Chronon — A Declarative Feature Engineering Framework

Nikhil Simha
The Airbnb Tech Blog
8 min read · Jul 11, 2023


Chronon is a framework for developing production-grade features for machine learning models. The purpose of this blog post is to provide an overview of Chronon’s core concepts.


Background

Airbnb uses machine learning in almost every product, from ranking search results to intelligently pricing listings and routing users to the right customer support agents.

We noticed that feature management was a consistent pain point for the ML Engineers working on these projects. Rather than focusing on their models, they were spending a lot of their time gluing together other pieces of infrastructure to manage their feature data, and still encountering issues.

One common issue arose from the log-and-wait approach to generating training data, where a user logs feature values from their serving endpoint, then waits to accumulate enough data to train a model. This wait period can be more than a year for models that need to capture seasonality. This was a major pain point for machine learning practitioners, hindering them from responding quickly to changing user behaviors and product demands.

A common approach to address this wait time is to transform raw data in the warehouse into training data using ETL jobs. However, users encountered a critical problem when they tried to launch their model to production — they needed to write complex streaming jobs or replicate ETL logic to serve their feature data, and often could not guarantee that the feature distribution for serving model inference was consistent with what they trained on. This training-serving skew led to hard-to-debug model degradation and worse-than-expected model performance.

Chronon was built to address these pain points. It allows ML practitioners to define features and centralize the data computation for both model training and production inference, while guaranteeing consistency between the two.

Introducing Chronon

This post is focused on the Chronon API and capabilities. At a high level, these include:

  • Ingesting data from a variety of sources — Event streams, fact/dim tables in the warehouse, table snapshots, Slowly Changing Dimension tables, Change Data Streams, etc.
  • Transforming that data — It supports standard SQL-like transformations as well as more powerful time-based aggregations.
  • Producing results both online and offline — Online, as low-latency endpoints for feature serving, or offline, as Hive tables for generating training data.
  • Flexible choice for updating results — You can choose whether the feature values are updated in real-time or at fixed intervals with an “Accuracy” parameter. This also ensures the same behavior even while backfilling.
  • Using a powerful Python API — that treats time-based aggregation and windowing as first-class concepts, along with familiar SQL primitives like Group-By, Join, and Select, while retaining the full flexibility and composability offered by Python.

API Overview

First, let’s start with an example. The code snippet below computes the number of times an item is viewed by a user in the last five hours from an activity stream, while applying some additional transformations and filters. This uses concepts like GroupBy, Aggregation, and EventSource.
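The sketch below is a minimal illustration of such a definition; the module paths, table and topic names, and exact constructor arguments are assumptions based on the open-source Chronon Python API and may differ by version.

```python
from ai.chronon.api.ttypes import Source, EventSource
from ai.chronon.query import Query, select
from ai.chronon.group_by import GroupBy, Aggregation, Operation, Window, TimeUnit

# Activity stream: a Hive table for offline backfills plus a Kafka topic for
# real-time updates (both names are hypothetical).
view_events = Source(events=EventSource(
    table="data.item_views",
    topic="events.item_views",
    query=Query(
        selects=select("user_id", "item_id"),
        wheres=["event_type = 'view'"],  # keep only view events
        time_column="ts",                # millisecond-precision event time
    ),
))

# Count how many times each user viewed each item over the trailing five hours.
item_views_v1 = GroupBy(
    sources=[view_events],
    keys=["user_id", "item_id"],
    aggregations=[Aggregation(
        input_column="item_id",
        operation=Operation.COUNT,
        windows=[Window(length=5, timeUnit=TimeUnit.HOURS)],
    )],
    online=True,  # also maintain a low-latency serving endpoint
)
```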

In the sections below we will demystify these concepts.

Understanding accuracy

Some use-cases require derived data to be as up-to-date as possible, while others allow for updating at a daily cadence. For example, understanding the intent of a user’s search session requires accounting for the latest user activity. To display revenue figures on a dashboard for human consumption, however, it is usually adequate to refresh the results at fixed intervals.

Chronon allows users to express whether a derivation needs to be updated in near real-time or in daily intervals by setting the ‘Accuracy’ of a computation — which can be either ‘Temporal’ or ‘Snapshot’. In Chronon this accuracy applies both to online serving of data via low-latency endpoints and to offline backfilling via batch computation jobs.
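As a hedged illustration (the ‘Accuracy’ enum and the ‘accuracy’ parameter shown here are assumptions based on the open-source Python API), accuracy is just a parameter on a definition like the GroupBy sketched above:

```python
from ai.chronon.api.ttypes import Accuracy

# Passed to a GroupBy (or Join) definition, e.g. GroupBy(..., accuracy=...):
#   Accuracy.TEMPORAL -> near real-time refresh; values are correct as of each
#                        query's millisecond timestamp, online and in backfills
#   Accuracy.SNAPSHOT -> daily refresh; values are correct as of the latest
#                        daily snapshot
```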

Understanding data sources

Real-world data is ingested into the data warehouse continuously. There are three kinds of ingestion patterns. In Chronon these ingestion patterns are specified by declaring the “type” of a data source.

Event data sources

Timestamped activity such as views, clicks, sensor readings, or stock prices, published into a data stream like Kafka.

In the data lake these events are stored in date-partitioned tables (Hive). Assuming timestamps are millisecond precise and the data ingestion is partitioned by date, a date partition ‘2023-07-04’ of click events contains the click events that happened between ‘2023-07-04 00:00:00.000’ and ‘2023-07-04 23:59:59.999’. Users can configure the date partition column based on their warehouse convention, once globally, as a Spark parameter:

--conf “spark.chronon.partition.column=date_key”

In Chronon you can declare an EventSource by specifying two things, a ‘table’ (Hive) and optionally a ‘topic’ (Kafka). Chronon can use the ‘table’ to backfill data — with Temporal accuracy. When a ‘topic’ is provided, we can update a key-value store in real-time to serve fresh data to applications and ML models.

Entity data sources

Attribute metadata related to business entities. A few examples for a retail business would be user information (with attributes like address and country) or item information (with attributes like price and available count). This data is usually served to applications online via OLTP databases like MySQL. These tables are snapshotted into the warehouse, usually at daily intervals, so a ‘2023-07-04’ partition contains a snapshot of the item information table taken at ‘2023-07-04 23:59:59.999’.

However, these snapshots can only support ‘Snapshot’-accurate computations; they are insufficient for ‘Temporal’ accuracy. If you have a change data capture mechanism, Chronon can utilize the change data stream with table mutations to maintain a near real-time refreshed view of computations. If you also capture this change data stream in your warehouse, Chronon can backfill computations at historical points in time with ‘Temporal’ accuracy.

You can create an entity source by specifying up to three things: a ‘snapshotTable’, and optionally a ‘mutationTable’ and a ‘mutationTopic’ for ‘Temporal’ accuracy. When you specify ‘mutationTopic’ — the data stream with mutations corresponding to the entity — Chronon will be able to maintain a real-time updated view that can be read with low latency. When you specify ‘mutationTable’, Chronon will be able to backfill data at historical points in time with millisecond precision.
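A minimal sketch of such a declaration is shown below, assuming the ‘EntitySource’ constructor and module paths from the open-source Python API; the table and topic names are hypothetical.

```python
from ai.chronon.api.ttypes import Source, EntitySource
from ai.chronon.query import Query, select

# Item attributes served from an OLTP database and snapshotted daily into the warehouse.
items = Source(entities=EntitySource(
    snapshotTable="data.items",              # daily snapshots of the online table
    mutationTable="data.items_mutations",    # captured change data, for Temporal backfills
    mutationTopic="events.items_mutations",  # change data stream, for real-time serving
    query=Query(
        selects=select("item_id", "price", "available_count"),
    ),
))
```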

Cumulative Event Sources

This data model is typically used to capture the history of values for slowly changing dimensions. Entries in the underlying database table are only ever inserted and never updated, except for a surrogate column (as in SCD2).

They are also snapshotted into the data warehouse using the same mechanism as entity sources. But because each snapshot tracks all changes, the latest partition alone is sufficient for backfilling computations, and no ‘mutationTable’ is required.

In Chronon you can specify a Cumulative Event Source by creating an event source with ‘table’ and ‘topic’ as before, but also by enabling a flag ‘isCumulative’. The ‘table’ is the snapshot of the online database table that serves application traffic. The ‘topic’ is the data stream containing all the insert events.
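A hedged sketch follows, assuming the ‘isCumulative’ flag sits on the ‘EventSource’ constructor as in the open-source Python API; the table and topic names are hypothetical.

```python
from ai.chronon.api.ttypes import Source, EventSource
from ai.chronon.query import Query, select

# Insert-only price history: the latest snapshot partition already contains the
# full history, so it alone is enough for backfills.
price_history = Source(events=EventSource(
    table="data.item_price_history",    # snapshot of the online, insert-only table
    topic="events.item_price_inserts",  # stream of insert events
    isCumulative=True,
    query=Query(
        selects=select("item_id", "price"),
        time_column="ts",
    ),
))
```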

Understanding computation contexts

Chronon can compute in two contexts, online and offline, with the same compute definition.

Offline computation is done over warehouse datasets (Hive tables) using batch jobs. These jobs output new datasets. Chronon is designed to deal with datasets that change — data newly arriving into the warehouse as Hive table partitions.

Online, the usage is to serve application traffic at low latency (~10ms) and high QPS. Chronon maintains endpoints that serve features that are updated in real-time, by generating “lambda architecture” pipelines. You can set a parameter “online = True” in Python to enable this.

Under the hood, Chronon orchestrates pipelines using Kafka, Spark/Spark Streaming, Hive, Airflow, and a customizable key-value store to power serving and training data generation.

Understanding computation types

All Chronon definitions fall into three categories — a GroupBy, a Join, or a StagingQuery.

GroupBy — an aggregation primitive similar to SQL’s group-by, with native support for windowed and bucketed aggregations. It supports computation in both online and offline contexts and in both accuracy models — Temporal (realtime refreshed) and Snapshot (daily refreshed). GroupBy has a notion of keys by which the aggregations are performed.

Join — joins together data from various GroupBy computations. In online mode, a join query containing keys is fanned out into one query per GroupBy (and external service), and the results are joined together and returned as a map. In offline mode, the left side of a join can be thought of as a list of queries at historical points in time, against which the results need to be computed in a point-in-time-correct fashion. If the left side is an entity source, we always compute responses as of midnight.
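A minimal sketch of a Join is shown below, assuming the ‘Join’ and ‘JoinPart’ helpers from the open-source Python API; the left-side table name is hypothetical and ‘item_views_v1’ refers to the GroupBy sketched earlier.

```python
from ai.chronon.api.ttypes import Source, EventSource
from ai.chronon.join import Join, JoinPart
from ai.chronon.query import Query, select

# Left side: one timestamped row per model inference we want to backfill features for.
search_events = Source(events=EventSource(
    table="data.searches",
    query=Query(selects=select("user_id", "item_id"), time_column="ts"),
))

# Attach the five-hour view counts (item_views_v1, sketched earlier) to each search,
# point-in-time correct offline and fanned out per GroupBy online.
training_set = Join(
    left=search_events,
    right_parts=[JoinPart(group_by=item_views_v1)],
)
```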

StagingQuery — allows for arbitrary computation expressed as a Spark SQL query, computed offline daily to produce partitioned datasets. It is best suited for data pre- or post-processing.
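A hedged sketch, assuming a ‘StagingQuery’ constructor with a ‘query’ field and the templated date macros described in the open-source docs; the table name and start partition are hypothetical.

```python
from ai.chronon.api.ttypes import StagingQuery

# Arbitrary Spark SQL, recomputed daily over the templated date range to produce
# a partitioned, cleaned-up dataset that downstream GroupBys can consume.
cleaned_views = StagingQuery(
    query="""
        SELECT user_id, item_id, ts, ds
        FROM data.raw_item_views
        WHERE ds BETWEEN '{{ start_date }}' AND '{{ end_date }}'
          AND user_id IS NOT NULL
    """,
    startPartition="2023-01-01",
)
```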

Understanding Aggregations

GroupBys in Chronon essentially aggregate data by given keys. There are several extensions to the traditional SQL group-by that make Chronon aggregations powerful.

  1. Windows — Optionally, you can choose to aggregate only recent data within a window of time. This is critical for ML since un-windowed aggregations tend to grow and shift in their distributions, degrading model performance. It is also critical to place greater emphasis on recent events over very old events.
  2. Bucketing — Optionally you can also specify a second level of aggregation on a bucket, besides the Group-By keys. The output of a bucketed aggregation is a column of map type, with the bucket values as keys and the aggregates as values.
  3. Auto-unpack — If the input column contains data nested within an array, Chronon will automatically unpack it.
  4. Time-based aggregations — like first_k, last_k, first, last, etc., when a timestamp is specified in the data source.

You can combine all of these options flexibly to define very powerful aggregations. Chronon internally maintains partial aggregates and combines them to produce features at different points-in-time. So using very large windows and backfilling training data for large date ranges is not a problem.
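As a hedged illustration of combining these options (the ‘windows’ and ‘buckets’ fields on ‘Aggregation’ are assumptions based on the open-source Python API, and ‘purchase_events’ stands in for an EventSource like the one sketched earlier):

```python
from ai.chronon.group_by import GroupBy, Aggregation, Operation, Window, TimeUnit

# Per user: purchase totals over 7-day and 30-day windows, broken out per item
# category. The bucketed output is a map column of {category -> windowed sum}.
purchase_features = GroupBy(
    sources=[purchase_events],  # a hypothetical EventSource with price/category columns
    keys=["user_id"],
    aggregations=[Aggregation(
        input_column="price",
        operation=Operation.SUM,
        windows=[
            Window(length=7, timeUnit=TimeUnit.DAYS),
            Window(length=30, timeUnit=TimeUnit.DAYS),
        ],
        buckets=["category"],
    )],
)
```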

Putting Everything together

As a user, you need to declare your computation only once, and Chronon will generate all the infrastructure needed to continuously turn raw data into features for both training and serving. ML practitioners at Airbnb no longer spend months manually implementing complex pipelines and feature indexes. They typically spend less than a week generating new sets of features for their models.

Our core goal has been to make feature engineering as productive and as scalable as possible. Since the release of Chronon, users have developed over ten thousand features powering ML models at Airbnb.

Sponsors: Dave Nagle Adam Kocoloski Paul Ellwood Joy Zhang Sanjeev Katariya Mukund Narasimhan Jack Song Weiping Peng Haichun Chen Atul Kale

Contributors: Varant Zanoyan Pengyu Hou Cristian Figueroa Haozhen Ding Sophie Wang Vamsee Yarlagadda Evgenii Shapiro Patrick Yoon

Partners: Navjot Sidhu Xin Liu Soren Telfer Cheng Huang Tom Benner Wael Mahmoud Zach Fein Ben Mendler Michael Sestito Yinhe Cheng Tianxiang Chen Jie Tang Austin Chan Moose Abdool Kedar Bellare Mia Zhao Yang Qi Kosta Ristovski Lior Malka David Staub Chandramouli Rangarajan Guang Yang Jian Chen
