Perspectives: Talabat’s Data Aggregation Framework

Moaz ElNashar
talabat Tech · Mar 8, 2022

Introduction

At Talabat, we rely on metrics to inform decisions in all aspects of our product, operations, and business. Consequently, there is a large set of metrics that needs to be computed reliably and consistently at the appropriate scope (e.g. per city) and frequency (e.g. daily).

While we have a reliable data platform that contains well modeled and high-quality data, people needed to write aggregation queries to compute metrics for their reporting, analysis, and experimentation needs. This had two drawbacks:

  1. The metric definitions were written in every aggregation, leading to inconsistency in results and difficulty in modifying definitions globally.
  2. Developing aggregations manually was time-consuming and error-prone (e.g. joining fact and dimension tables on the wrong keys).

To alleviate these drawbacks, the Talabat Data Team designed and developed an in-house system to:

  1. Centralize the definitions of metrics into a single source of truth.
  2. Reduce the tedious and error-prone process of developing aggregation queries to a simple config file.

We named this system Perspectives, inspired by the following observation: an aggregation is a slightly different perspective on the data, aimed at a certain purpose. Two aggregations can compute the same metrics at different granularities, for instance, making them different perspectives on the same data.

Perspectives takes a config file that tells it which fact and dimension tables to work with, and which metrics to compute. It leverages a centralized metric library and a central repo of how different fact and dimension tables should be joined to automatically produce optimized aggregation code and schedule a corresponding Airflow DAG.

In this post, we will describe Perspectives and its core features in more detail. In publishing this we hope to give others some useful ideas on how to solve similar problems at their organizations.

Workflow Overview

At a high level, the Perspectives workflow consists of the following input, assets, output, and flow:

Input:

A YAML file (which we call a View File) containing the required information for Perspectives to auto-generate an aggregation query. Its contents are described later in this post.

Assets:

  • Metric Definition Library: Config files containing all metric definitions.
  • Join Logic Repo: Config files describing how fact and dimension tables should be joined.

Output:

The result of the query generated by Perspectives is inserted into a BigQuery table.

Flow:

  • Perspectives will automatically generate an Airflow DAG for each view file added and schedule it to run based on the time scope defined in this view file.
  • Each DAG contains a task that uses the Perspectives code to parse its view file configuration and generate the aggregation query.
  • The DAG will execute the query and insert the results into a BigQuery table defined in the view file metadata.

Metric Definition Library

Metrics are defined as YAML files that are structured into groups for easier reference to a set of related metrics (e.g. conversion rates across the funnel).

Each metric should have a name, abbreviation (the way it appears on output tables), description, data type, and SQL logic.

As an example, consider a metrics file defining KPIs needed in aggregation queries.
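A minimal sketch of what such a file might look like, assuming hypothetical group and field names and metric SQL (the exact schema differs in production):

```yaml
# Illustrative sketch of a metric definitions file.
# The group name, field names, and SQL logic below are assumptions, not the exact schema.
group: order_kpis
metrics:
  - name: Total Orders
    abbreviation: total_orders          # how the metric appears on output tables
    description: Number of distinct orders.
    data_type: INT64
    sql: COUNT(DISTINCT order_id)       # order_id is a hypothetical column
  - name: Gross Merchandise Value
    abbreviation: gmv
    description: Sum of order values before discounts.
    data_type: FLOAT64
    sql: SUM(order_value)               # order_value is a hypothetical column
```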

Any change to these metrics will automatically reflect on all queries generated using them, without the need for any manual change. Following this approach helps centralize the metric definitions across the data platform.

Join Logic Repo

In almost any aggregation query, we need to join the fact table with other dimensions to add conditions and specifications to our metrics.

To unify the join logic with the dimension table across all the aggregation queries, we provide a dimension file: an SQL template that describes how to join a base table with the dimension table (it may also join helper tables that are only needed for the primary join to happen). We use the “{{base_table}}” macro to generalize the base table name. We can also define which columns from the dimension table should be exposed for use in other parts of the query.

As an example, consider a dimension file (SQL template) that contains the join logic for joining the base table with dim_location and dim_type, taken from a Talabat sample dataset.
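A simplified sketch of what such a template might contain (the join keys and exposed columns are assumptions):

```sql
-- Illustrative join-logic template; Perspectives replaces the {{base_table}} macro
-- with the actual base table. Join keys and exposed columns are assumptions.
SELECT
    base.*,
    loc.country_name,
    loc.country_code,
    loc.city_name,
    typ.order_type
FROM {{base_table}} AS base
LEFT JOIN tlb_data.dim_location AS loc
    ON base.location_id = loc.location_id
LEFT JOIN tlb_data.dim_type AS typ
    ON base.type_id = typ.type_id
```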

Any change to the join logic in these files will automatically reflect on all queries generated using them without the need for any manual change.

View Development

To generate an aggregation query, Perspectives depends on a view config, which is the base input of the system. It is a YAML configuration file that contains the information Perspectives needs to generate and schedule the aggregation. This file represents a table that is updated daily with the KPIs and is made up of several parameters that Perspectives maps to the desired outputs.

Before we go through the process of writing a view file, it helps a lot to already have a sample draft of the query that describes the aggregation logic.

So, we are going to use the following query as a running example and write the corresponding view file step by step in the coming parts.
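A hand-written draft of such an aggregation might look like the sketch below; column names beyond those mentioned in this post, and the exact KPIs, are illustrative assumptions:

```sql
-- Hypothetical draft of the aggregation we want Perspectives to generate.
SELECT
    DATE(o.order_date)               AS date,           -- order_date is an assumed column
    loc.country_name,
    loc.country_code,
    loc.city_name,
    COUNT(DISTINCT o.order_id)       AS total_orders,
    SUM(o.order_value)               AS gmv
FROM tlb_data.fact_order AS o
LEFT JOIN tlb_data.dim_location AS loc
    ON o.location_id = loc.location_id                   -- join keys are assumed
LEFT JOIN tlb_data.dim_type AS typ
    ON o.type_id = typ.type_id
WHERE o.user_id IS NOT NULL
  AND NOT o.is_test
GROUP BY 1, 2, 3, 4
```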

Step 1: Base table

From the example query, we can see that tlb_data.fact_order is the central entity that we are computing metrics for. So, we will add it as the base table of the aggregation.
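In the view file, this might be expressed as a single entry (the key name is an assumption about the View File schema):

```yaml
base_table: tlb_data.fact_order   # hypothetical key name for the base fact table
```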

Step 2: Dimensions

Now we have the central entity coming from the fact table tlb_data.fact_order, but other entities are needed to complete the picture of the main entity. These come from what we call dimension tables.

We simply refer to the join definitions created earlier as part of the Join Logic Repo.
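For our example, this might look like the following, where each entry points to a join definition in the Join Logic Repo (the key name and entry names are assumptions):

```yaml
dimensions:          # hypothetical key name
  - dim_location     # join logic defined in the Join Logic Repo
  - dim_type
```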

Step 3: Metrics

After adding the required entities for KPIs, we can use them by referring to the metrics defined earlier as part of the Metric Definition Library.
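In our example, the metrics section might look like the following sketch (the key name and the metric abbreviations are assumptions):

```yaml
metrics:             # hypothetical key name
  - total_orders     # defined in the Metric Definition Library
  - gmv
```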

Step 4: Narrow downs

Sometimes we need to narrow down the entities that we’re creating the metrics for. In our example, we want to make sure that no order has a NULL user_id, and to exclude is_test orders. This is translated into a WHERE clause that filters the entities based on these conditions.
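A sketch of how these conditions could be expressed (the key name and exact syntax are assumptions):

```yaml
narrow_downs:               # hypothetical key name; rendered into the WHERE clause
  - user_id IS NOT NULL
  - NOT is_test
```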

Step 5: Breakdowns

This is probably the part where most of the aggregates differ from each other. It represents all the columns that we would like to GROUP BY on. As you probably expected, we can use any columns from the base table or the joined dimension tables. Any column in the breakdowns will automatically appear on the output table.

So, we will add all the columns in the GROUP BY clause as breakdowns. We would also like to include country_code in the output, which is covered in Step 6: Enrichments. Working with dates is covered in Step 7: Time Scope.

A couple of interesting features with breakdowns:

  • We automatically add a wildcard “all” value to the possible values of each column to make the aggregate table richer. For example, since our query adds country_name as a breakdown to the order aggregates table, orders will be grouped by country_name values such as ‘Egypt’, ‘UAE’, ‘Kuwait’, etc., plus ‘all’.
  • We can implement hierarchies on breakdowns, so that wildcard values don’t appear on parent columns without appearing on child columns. For example, if we add country_name and city_name as breakdowns, we need to specify that city_name is a child of the country_name breakdown, so that if country_name = ‘all’, city_name has to be ‘all’ as well.
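Putting both ideas together, the breakdowns section of our example might look like the following sketch (the key name and the hierarchy syntax are assumptions):

```yaml
breakdowns:                   # hypothetical key name; rendered into the GROUP BY
  - country_name
  - city_name:
      parent: country_name    # hierarchy: city_name is 'all' whenever country_name is 'all'
```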

Step 6: Enrichments

Enrichments are mostly like breakdowns; however, they don’t change the granularity of the aggregation. They should have a 1:1 mapping with one of the breakdowns explained above. In our example, country_name is a breakdown, so country_code is an enrichment rather than a breakdown. One could argue that enrichments can be treated as regular breakdowns, but this distinction enables us to significantly optimize queries. There is no limit on the number of enrichments; however, as mentioned, each needs a 1:1 relation to one of the breakdowns and must come from the base table or the dimensions.
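In our example, the enrichments section might simply be (the key name is an assumption):

```yaml
enrichments:
  - country_code        # 1:1 with the country_name breakdown; comes from dim_location
```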

Step 7: Time scope

A common component of most aggregations is time. Daily aggregates are the most common, but we also have use cases for weekly and monthly aggregates. So, for any aggregation query, there is a component that specifies a primary timestamp column for the time scope definition.

In our case, we want to do a daily aggregation on tlb_data.fact_order using the date column.
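A sketch of the time scope section (key names are assumptions):

```yaml
time_scope:             # hypothetical key names
  column: date          # primary timestamp column on tlb_data.fact_order
  frequency: daily      # also drives the generated DAG's schedule
```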

Step 8: Metadata

An important part of any view file is the metadata, which contains the dataset and view name of the aggregates table, as well as a description, the owner’s name, and an email address.
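A sketch of the metadata section (key names are assumptions; the owner and email are placeholders):

```yaml
metadata:
  dataset: tlb_data
  view_name: agg_order_facts
  description: Daily order KPIs broken down by country and city.
  owner: Jane Doe                  # placeholder owner name
  email: jane.doe@example.com      # placeholder contact email
```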

And that’s it! We now have a view file ready to be used within Talabat's Perspectives Framework.

DAG Generation

For each View File in the system, Perspectives generates an Airflow DAG with scheduling intervals based on the Time Scope step specified in the View File. This DAG will use Perspectives (implemented in Python) to parse the View File configuration with the help of the Metric Definition Library and the Join Logic Repo to generate the SQL aggregation query. Then we run a load job to the BigQuery table with the name provided in the view config metadata.

We also use Airflow sensor tasks to ensure aggregation pipelines wait for the source data to be updated before running the aggregation logic.

So, for the example used, the following query will be generated after parsing the View File.
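A sketch of what the generated query might look like, including one possible way of producing the wildcard ‘all’ rows (the SQL Perspectives actually emits, and how it handles the wildcard, may well differ):

```sql
-- Hypothetical sketch of the generated aggregation; column names, join keys, and
-- the UNION ALL approach to the 'all' wildcard are assumptions.
-- The dim_type join from the dimension file is omitted here for brevity.
WITH base AS (
  SELECT
      DATE(o.order_date) AS date,
      loc.country_name,
      loc.city_name,
      loc.country_code,
      o.order_id,
      o.order_value
  FROM tlb_data.fact_order AS o
  LEFT JOIN tlb_data.dim_location AS loc
      ON o.location_id = loc.location_id
  WHERE o.user_id IS NOT NULL
    AND NOT o.is_test
)
-- Finest granularity: one row per date, country, and city
SELECT date, country_name, city_name, ANY_VALUE(country_code) AS country_code,
       COUNT(DISTINCT order_id) AS total_orders, SUM(order_value) AS gmv
FROM base
GROUP BY date, country_name, city_name
UNION ALL
-- Wildcard rows: city_name = 'all' while country_name stays specific
SELECT date, country_name, 'all', ANY_VALUE(country_code),
       COUNT(DISTINCT order_id), SUM(order_value)
FROM base
GROUP BY date, country_name
UNION ALL
-- Wildcard rows: country_name = 'all' implies city_name = 'all' (hierarchy)
SELECT date, 'all', 'all', 'all',
       COUNT(DISTINCT order_id), SUM(order_value)
FROM base
GROUP BY date
```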

A BigQuery load job is then triggered daily to insert the query results into the table tlb_data.agg_order_facts, or to create it automatically if it does not exist yet.

For this example, the Airflow DAG generated by Perspectives contains a sensor task that waits for the source data, followed by the task that builds the aggregation query from the View File and loads the results into BigQuery.

Perspectives parses the configuration file at runtime, so any changes to the View File configuration are applied automatically in the next run. We do have to keep in mind that such changes must not violate the constraints of the database we are using; for example, a BigQuery column’s type cannot be altered after table creation.

Perspectives is developed in Python 3.x and is currently compatible with Apache Airflow ≥ 1.10 and Google BigQuery. To use Perspectives, we simply add its package to the same environment where Airflow runs.

Conclusion

At Talabat, we’ve built our in-house aggregation framework and metric library, named Perspectives, to tackle the problems of inconsistent, duplicated, and error-prone metric definitions.

Perspectives was designed and implemented to enable a great degree of flexibility, convenience, and consistency. Data Analysts, Data Scientists, and metric owners can create a wide variety of metrics effectively with a few configurations and centralized common definitions. It has been widely adopted by our teams and has transformed how we define our aggregations.

Acknowledgments

Thanks to everyone who contributed to making Perspectives become a reality and deliver value to its end users. Thanks to Ahmed Ossama for laying the foundation of the design and implementation, to Salma Amr and Yusuf Saber for their guidance, and to Amr Abouhussein, Amr Habib, and Iyad Al Aqel for giving feedback to improve the system.
