Using Log-Time Denormalization for Data Wrangling at Meta
At Meta, we generate and interact with data at a uniquely large scale. Our workforce is also highly data literate, even outside of analytics teams. In addition, “big data” infra has evolved to a point where building and maintaining large-scale data pipelines is very accessible for the average internal data consumer at Meta. While this evolution has made analytics more efficient and accessible, one consequence can be unwieldy data sprawl — so much data that it requires deep contextual knowledge to know what data is high quality and how to avoid any potential pitfalls of using data that you might not fully understand.
We’ve developed a variety of patterns to address this at Meta, but by far the most impactful has been log-time denormalization. Below, we’ll share some examples of how (and why) we’ve migrated more of our data sources to this approach in order to build rich, descriptive single sources of truth that can serve use cases that previously required highly customized data pipelines.
Our Example: Facebook Impressions
One of our larger logging datasets at Meta is “impressions” — we log a row each time a person sees a piece of content on Facebook: posts, photos, videos, comments, and so forth. With billions of users, that ends up being a lot of rows per day.
This data serves many use cases, a few of which I’ll highlight below:
(Use Case 1) Experimentation (A/B test) metrics
Teams filter and aggregate impression event data into metrics used to evaluate experiments. For instance, our Reels team might create a metric counting the number of Reels impressions per user per day. These metrics are ingested by our experimentation platforms and allow teams to run A/B tests and understand the performance of those tests relative to these metrics. Conceptually, it looks like this:
SELECT
day,
userid,
COUNT(*)
FROM impressions
…
WHERE <<impression was on a Reels video>>
GROUP BY day, userid
Dozens or hundreds of teams at Meta need to measure the performance of their products and features (videos, groups, marketplace, etc.). As a result, there are dozens of variants of pipelines like this one, each tailored to a specific purpose, team, and product.
(Use Case 2) Longitudinal reporting
In order to understand the performance of Facebook products over time, we create time-series reports and dashboards that support dimensional slicing. For instance, we need to understand what types of content (links, photos, videos) and which types of posters (users, pages) get the most impressions over time. This knowledge empowers us to deliver the content and connections our users care about when they use our products. We also need to slice by additional dimensions like geography, app surface (News Feed, Groups, Marketplace), and more.
SELECT
day,
poster_type,
content_type,
<<<dozens more>>>,
COUNT(*)
FROM impressions
…
GROUP BY day, poster_type, etc
(Use Case 3) Training Machine Learning (ML) Models
We use ML at Meta as one way to create a better user experience. ML models help us determine the degree to which a piece of content is relevant to a user, and as a result, how much value they’ll get out of seeing it. The ML models use hundreds of features about the post — content type, topic predictions, poster attributes, and more — so we need to join our impression events with these features to train the models.
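As a rough, hypothetical sketch of what assembling such training data can look like, the query below joins impressions to per-post and per-author feature tables. The table and column names (post_features, author_features, and so on) are illustrative, not actual Meta schemas:

SELECT
  i.userid,
  i.post_id,
  f.content_type,
  f.topic_predictions,
  a.poster_attributes
FROM impressions i
LEFT OUTER JOIN post_features f ON i.post_id = f.post_id       -- per-post features
LEFT OUTER JOIN author_features a ON f.author_id = a.author_id -- poster attributes
WHERE i.day = <<training date>>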
(Use Case 4) Ad hoc analytics
At Meta, thousands of employees write ad hoc queries on a daily basis in order to answer specific questions and test hypotheses about how people use our products. With so many people focused on the safety of our platform, integrity queries are a common example of ad hoc analytics — we’re always identifying and anticipating new ways in which people may attempt to abuse our platforms. To do this job effectively, we require rich and descriptive data about how Facebook users interact with our platforms.
For instance, when dealing with foreign interference in elections, we look for posts being created by actors who claim that their home location is in the US, but where their IP address or other signals indicate they’re actually from another country.
SELECT
*
FROM impressions
WHERE
<<actor claims they’re from the US>>
AND <<models predict user is NOT from the US>>
AND <<post contains political content>>
AND <<viewer is in the US>>
A Hypothetical Snowflake Schema for Impressions
The examples above aren’t exhaustive — we use data in many other ways in order to create the best experiences for the people using our technologies — but a common thread between most of them is the need for rich, descriptive dimensional metadata about the content that was seen.
In a traditional snowflake schema, we’d start with an event table with all impressions and a variety of dimension tables representing interesting features of the impression: the viewed post, the author of the post, the attached photos/videos/URLs, and so forth. It would look something like the sketch below.
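Here is a minimal, hypothetical version of that schema expressed as DDL. The table and column names are illustrative, and a real schema would include many more dimension tables and attributes:

-- Fact table: one row per impression event
CREATE TABLE impressions (
  day       VARCHAR,
  viewer_id BIGINT,   -- joins to users.user_id
  post_id   BIGINT,   -- joins to posts.post_id
  surface   VARCHAR   -- e.g. 'feed', 'groups', 'marketplace'
);

-- Dimension tables, joined in as needed
CREATE TABLE posts (
  post_id      BIGINT,
  author_id    BIGINT,  -- joins to users.user_id
  content_type VARCHAR  -- e.g. 'link', 'photo', 'video'
);

CREATE TABLE photos (
  photo_id BIGINT,
  post_id  BIGINT       -- a post can have many attached photos
);

CREATE TABLE photo_topic_predictions (
  photo_id BIGINT,
  topic    VARCHAR,
  score    DOUBLE
);

CREATE TABLE users (
  user_id                BIGINT,
  stated_home_country    VARCHAR,
  predicted_home_country VARCHAR
);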
However, this classical approach quickly breaks down at Meta’s scale.
1) Performance and practicality — most use cases require merging many features about posts, photos, authors, attached comments, and so forth. This, in turn, leads to many dimension joins in common analytical queries. For instance, to find impressions of US-based posts that contain cat photos, we require at least a handful of dimension joins:
SELECT
…
FROM impressions
LEFT OUTER JOIN posts p…
LEFT OUTER JOIN photos …
LEFT OUTER JOIN photo_topic_predictions …
LEFT OUTER JOIN users ON p.author_id …
Impressions are huge, and there are billions of posts per day on our platform, so the first join alone compares trillions of records.
In transactional databases such as those used by Facebook’s production applications, these joins are typically backed by indices and perform well for single point lookups such as fetching information about a single post to show to a user.
However, for large-scale analytics queries, such joins require re-shuffling data. Even in the best-case scenario with well-constructed bucketing, this re-shuffling is expensive and time-consuming. As a result, these multi-join queries are inefficient and impractical for both ad hoc analytics queries and data pipelines.
These use cases would benefit from denormalization.
2) Duplication: Teams create many copies of partially denormalized data
As a result of the performance and practicality constraints described above, teams will typically pre-join commonly accessed dimensional attributes into denormalized tables serving their use cases. Several years ago, it was common to see “copies” of our impressions log, one serving each team.
Here, each team builds data pipelines to enrich impressions with the metadata they frequently query. This can be problematic for a few reasons:
- Wasted compute + storage resources — each of these pipelines performs the same join to our Users table, and storing similar data N times is inefficient.
- Anonymized Footprint — Storing N copies of impression events multiplies the effort required to ensure these copies are properly anonymized.
- Silos — Multiple tables for a range of topics can make it inefficient to compare two different but related topics. Suppose I have a table to understand Comment impressions and a table to understand Ads impressions. How do I find impressions of comments on ads?
- Duplication creates inconsistencies — in the Comment and Ads pipelines, the two teams have performed the “User” join slightly differently. The Comments pipeline is reporting our “country prediction” model as the user’s country. The Ads pipeline is reporting the country that the user stated in their Facebook profile. For a variety of reasons, many of which are benign but some of which may be malicious, the home country listed in a user’s profile may not accurately reflect where they are actually located.
An average user of these tables is highly unlikely to spot the difference. As a result, a user querying Comment impressions for an integrity-related analysis may make a critical error by using the user’s stated location instead of a more trustworthy location prediction.
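To make the pitfall concrete, here is a hypothetical sketch of how the two pipelines’ enrichment queries might differ; the table and column names are illustrative:

-- Comments pipeline (hypothetical): reports the model-predicted country as "country"
SELECT
  i.*,
  u.predicted_home_country AS country
FROM comment_impressions i
LEFT OUTER JOIN users u ON i.viewer_id = u.user_id

-- Ads pipeline (hypothetical): reports the profile-stated country as "country"
SELECT
  i.*,
  u.stated_home_country AS country
FROM ad_impressions i
LEFT OUTER JOIN users u ON i.viewer_id = u.user_id

Both outputs expose a column named country, so a downstream consumer has no obvious signal that the two values are derived differently.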
Full denormalization and its pitfalls
At this point, several teams asked themselves — why don’t we perform all of these joins once, upstream, and combine all of the data into a wide, denormalized impressions dataset that everyone can use?
This is the right idea. However, a traditional warehouse enrichment pipeline would take dozens (or hundreds!) of dimensional tables owned by many teams and join them into our impression logs. This is slow, expensive, and creates an unacceptable dependency risk — when one upstream is late, it delays the whole process.
Full log-time denormalization
Instead, we had to consider alternatives. Denormalizing data right when we log it would be a simple approach, but oftentimes it can be incredibly expensive. Fortunately, we had a clue as to how we might be able to implement a scalable solution — when a user sees a piece of content on Facebook, the impression log kicks off a number of downstream processes that already perform point lookups to our production databases to fetch attributes of the viewed content. With a bit of careful caching, we could re-use these point lookups to fetch the metadata we needed for analytics with only modest additional cost.
We built a configuration system that allows Data Engineers, Product Engineers, and Data Scientists to configure which metadata attributes they need from logging, using a config language similar to GraphQL. When the production backend queries our Ent framework during log handling, we also query for the configured analytics metadata and append it onto the impression log. Because these point lookups are already being performed, the incremental cost of enriching the log with analytics metadata is modest.
The final result is an impression event log in the data warehouse with all required analytics metadata denormalized. And it doesn’t require complex data pipelines or large and risky upstream dependency graphs.
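As a rough sketch (using Presto-style nested types; the exact field names are hypothetical and chosen to match the example query below), the denormalized log might look like:

CREATE TABLE impressions (
  day VARCHAR,
  viewer ROW(
    userid                 BIGINT,
    predicted_home_country VARCHAR
  ),
  post ROW(
    post_id      BIGINT,
    content_type VARCHAR,
    author ROW(
      userid                 BIGINT,
      stated_home_country    VARCHAR,
      predicted_home_country VARCHAR
    ),
    attached_photos ARRAY(ROW(
      photo_id         BIGINT,
      topic_prediction ARRAY(VARCHAR),          -- e.g. ['cat', 'animal']
      entity_match     ROW(name ARRAY(VARCHAR)) -- e.g. ['Beyonce', 'Jay Z']
    ))
  )
);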
How this affects common analytics queries
As a result, most common analytics queries can be performed directly against our impression logs without any joins. Using a toy example, to find impressions, by US-based users, of posts whose photos show Beyonce and Jay Z with their new kitten, we may write something analogous to:
SELECT *
FROM impressions
WHERE
  -- one of the post's attached photos is predicted to contain cats, AND
  -- it is also predicted to contain Beyonce + Jay Z
  ANY_MATCH(
    post.attached_photos,
    photo -> CONTAINS(photo.topic_prediction, 'cat')
      AND CONTAINS(photo.entity_match.name, 'Beyonce')
      AND CONTAINS(photo.entity_match.name, 'Jay Z')
  )
  AND viewer.predicted_home_country = 'US'
The data is represented as deeply nested structs, similar to what you may find in Google BigQuery. The structs closely mirror the structure of Meta’s Social Graph (similar to the open-source Ent ORM for Go).
Because the social graph is infinitely recursive (think “viewer.best_friend.best_friend.best_friend …”), the logging projections define how to break that recursion into a denormalized schema for analytics (e.g. we might denormalize viewer.best_friend but not viewer.best_friend.best_friend). It’s very similar to how GraphQL projections break recursion of Ent Schemas.
The nested data model ensures that highly complex relationships across our social graph can be extracted in SQL without performing expensive joins. For instance, the example above traverses the one-to-many edge between posts and photos without any joins.
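As another illustrative example, a longitudinal slice like those described earlier can also be computed straight from the nested log, with the one-to-many traversal handled by array functions instead of joins (again, the field names are hypothetical):

SELECT
  day,
  viewer.predicted_home_country AS country,
  COUNT(*) AS cat_photo_impressions
FROM impressions
WHERE ANY_MATCH(
  post.attached_photos,
  photo -> CONTAINS(photo.topic_prediction, 'cat')
)
GROUP BY 1, 2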
Conclusion + Takeaways
The key feature of log-time denormalization is that the so-called “raw” logs can be used to serve most analytics queries directly, with minimal post-hoc enrichment.
Compared to traditional approaches, this provides several benefits that are critical at our scale:
- By combining features that were previously scattered among single purpose datasets, we can answer more nuanced questions than we could before. This, in turn, enables data scientists to focus more time on analysis and less time on data wrangling.
- Meta employees can perform more of their analyses directly from the logs instead of relying on customized, single purpose forks of the logs. This helps us build a community of data users who can speak the same language, share knowledge, and re-use each other’s work.
- By reducing the number of forked logs, it’s easier to ensure we’re appropriately anonymizing our data, and we also save on warehouse compute and storage.
- By allowing teams to enrich a single common logging source, we dramatically improve developer productivity.
The approach also presents challenges. We’ve developed (and are actively improving) a variety of techniques for handling these constraints.
- “As-is” semantics + backfills: Denormalization provides an “as-was” view of the data — what was the state of a post at the time it was viewed? But what happens when we want an “as-is” view instead — what is the state of the post now, even if it changed after being viewed? This is where star and snowflake schemas traditionally shine. We’ve developed systems to seamlessly fall back to “as-is” semantics via more expensive snowflake joins when needed (a sketch follows this list).
- Aggregations + Sampling: It’s not practical to perform every analysis directly against event logs. We use standard patterns like sampling and pre-aggregated data + cubes with a variety of dimensional cuts. These are tailored for specific analytics patterns. For instance, we maintain “narrow” unsampled versions of our impressions data, with minimal analytics enrichment, which we retain for long time periods to support forensic analysis. And we maintain “wide” sampled versions of our impressions data which we use for long-term longitudinal reporting.
- Logging consistency: We must ensure that other denormalized logging events beyond impressions use the same dimensional models and enrichment methodology. We’ve built direct integrations with our logging framework to ensure that product engineers can easily and consistently represent dimensional attributes across many logging datasets.
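To illustrate the “as-is” fallback mentioned in the first item, here is a hypothetical sketch: when an analysis needs the current state of a post rather than its state at view time, the denormalized log can be joined back to a conventional dimension table (posts_current and its columns are illustrative names):

SELECT
  i.day,
  p.content_type AS current_content_type,  -- "as-is": the post's state today
  COUNT(*) AS impressions
FROM impressions i
LEFT OUTER JOIN posts_current p            -- hypothetical dimension with the latest post state
  ON i.post.post_id = p.post_id
GROUP BY 1, 2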
We’ll cover these in more detail in future posts.
The approach outlined above is somewhat atypical compared with traditional data warehousing techniques. However, the scale at which we operate introduces non-traditional constraints and optimization opportunities. A key tenet of Data Engineering at Meta — and something we look for in DE candidates — is flexibility and creativity. Standard approaches often break down at our scale, and we need to flex our creativity and think outside the box to find solutions.
Visit our Analytics Career Portal to learn more!
Author: Jeremy Nuger