YTsaurus: The open-source analysis platform for 42 million monthly active users in Yandex Go

Maxim Pchelin
Yandex
12 min read · Dec 6, 2023

Yandex Go, an integrated service within the broader Yandex ecosystem, poses a data management challenge as tough as any you will encounter in the industry. It’s a single app that brings multiple essential services together: users can request rides, rent scooters, and order food, groceries, and other items for delivery. From a data management perspective, this translates to 42 million monthly active users (MAU) whose data we have to handle and process. So how do we manage it?

I’m Maxim Pchelin. I led the product team for Yandex Go’s Data Management Platform (DMP). I co-authored this article with Vladimir Verstov, the head of DMP development at Yandex Go. Our goal is to illustrate how we manage massive data flows and to serve as a guide for those looking to do the same. We’ll discuss our data architecture and tech stack and introduce you to YTsaurus, the data management core that made this seemingly impossible task possible for Yandex Go. We’ll explore how we use YTsaurus, what advantages it brings to our work, and how it can benefit yours, regardless of the scale of your operation.

Our data management architecture is built on open-source foundations. After reading this article, we invite you to experiment and discover how these technologies can serve your unique requirements.

Here’s a brief outline of the article to help you navigate:

- A closer look at Yandex Go and its challenges
- The history from problem to solution
- Our data warehouse (DWH) architecture
- ETL framework
- The structure of the DMP
- YTsaurus benefits
- Spark over YTsaurus
- New YTsaurus features today and tomorrow
- Are you interested in giving it a try?
- Key points

A closer look at Yandex Go and its challenges

At Yandex Go, we deal with a lot of raw data and need pipelines to collect information from over 2,000 sources. We also manage data for multiple business units, such as the taxi and delivery services, each with its own architecture and requirements. We serve hundreds of analysts, DWH and BI developers, and managers who rely on data and data services, and we’re expected to make that data available to them quickly and cost-effectively.

The history from problem to solution

Why did we decide to do something about our data management system in the first place? Initially, our analysts queried the data they needed from MongoDB backend replicas using Jupyter notebooks. However, Yandex Go services grew exponentially, and the old approach could no longer keep up.

We decided to implement a Data Lake: Python scripts scheduled with cron on a Linux virtual machine. Although this approach worked well for a while, the business was becoming increasingly complex and the number of data sources kept growing. We realized we needed a new architecture. It was time to build a Data Warehouse (DWH).

Our data warehouse (DWH) architecture

The DWH architecture we settled on has a layered structure. Each layer has its responsibilities and requirements.


Raw data

We replicate all source data into the RAW layer, applying only the modifications that are strictly required. We accumulate everything from the sources, much like running a simple SQL query such as “SELECT * FROM some_table.” We take this approach because, when we start working with a new data source, we rarely know in advance exactly which fields we will need. The data on the RAW layer is stored unstructured, so even source changes that would break a schema don’t break the load. Analysts can then use raw data in ad-hoc queries, so they don’t need DWH support for new launches. Additionally, we can use raw data as “cold” storage for backend services.
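To make this concrete, here is a minimal sketch of a RAW-layer load using the YTsaurus Python client (yt.wrapper). The cluster address, table path, and source reader are all hypothetical; the point is that we store only a primary key plus the untouched source document:

```python
# A minimal RAW-layer ingestion sketch with the YTsaurus Python client.
# Cluster address, paths, and the source reader are hypothetical.
import yt.wrapper as yt

yt.config["proxy"]["url"] = "my-cluster"  # hypothetical cluster

RAW_TABLE = "//home/go-dwh/raw/orders"    # hypothetical path

def read_source_rows():
    # Stand-in for a real extractor: yields source documents as-is,
    # the moral equivalent of SELECT * FROM some_table.
    yield {"id": "order-1", "doc": {"status": "finished", "price": 10}}
    yield {"id": "order-2", "doc": {"status": "cancelled"}}

# Only a primary key plus the raw document are stored, so schema
# changes on the source side never break the load.
yt.write_table(yt.TablePath(RAW_TABLE, append=True), read_source_rows())
```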

Operational Data Store (ODS)

The Operational Data Store (ODS) is the projection of the source data domain inside our Data Warehouse. Here, we partially normalize and clean the RAW data and enforce a consistent naming convention for its fields. Analysts can use the ODS for near real-time analytics and ad-hoc queries.

Detail Data Store (DDS)

When dealing with multifaceted entities, we build a domain in the Detail Data Store (DDS), where we draw data from various sources and track changes. This domain is highly normalized: our highly Normalized hybrid Model (hNhM) supports the Data Vault and Anchor models as well as hybrids of the two.

Common Data Mart (CDM)

Highly normalized data models are easy to maintain but difficult to use, so we also build data marts to make analysts’ lives easier. The majority of analytical queries should hit this layer. Data marts can be normalized (star or snowflake models) or denormalized for better performance and usability.

With the DWH architecture covered, let’s say a few words about our ETL framework.

ETL framework

Given the complexity of our data model, the multitude of data sources, and the size of our DWH team, we needed to create our own framework. We made it code-driven and designed it to cover the entire data flow from various sources, such as backend services, databases, message queues, and many APIs and dashboards. The framework includes features for data quality, continuous integration/continuous deployment (CI/CD) processes, maintenance, and more. It also provides a uniform approach to working with different data storage solutions and computational engines.

Framework Concepts

Our framework is built around two main entities: the table and the task. A task represents a process that moves data from one table to another. Tables are where data engineers set storage parameters and schema; they are defined much like models in an Object-Relational Mapper (ORM).
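As a rough illustration (every class and field name below is invented for this article, not our real API), a code-driven definition of these two entities might look like this:

```python
# A hedged sketch of the two core framework entities. Everything here is
# illustrative; the real framework is internal and far richer than this.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Table:
    # Declared the way an ORM declares a model: storage location plus schema.
    path: str
    schema: dict                        # column name -> type
    layer: str = "ods"                  # raw / ods / dds / cdm
    partition_by: Optional[str] = None  # e.g. a creation-date column

@dataclass
class Task:
    # A task moves data from a source table to a target table.
    source: Table
    target: Table
    transform: Callable[[dict], dict] = lambda row: row

orders_raw = Table("//home/dwh/raw/orders", {"id": "string", "doc": "yson"}, layer="raw")
orders_ods = Table("//home/dwh/ods/orders", {"order_id": "string", "status": "string"})
load_orders = Task(orders_raw, orders_ods,
                   transform=lambda r: {"order_id": r["id"], "status": r["doc"]["status"]})
```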


We’ve described what’s under our data management platform’s hood. Let’s talk about the technology that keeps it up and running.

The structure of the DMP

Our architecture has several layers, each with unique requirements. How does it work? For the basic Data Lake functions (our RAW, ODS, and CDM layers), we use the core of our tech landscape: YTsaurus.


YTsaurus

YTsaurus (YT) is a distributed storage and processing platform developed and open-sourced by Yandex. It’s comparable to Hadoop, but with better scalability, multitenancy, and GPU-aware compute scheduling.

We use it in Yandex Go for many reasons. The key one is that YT stores all the data at Yandex. Yes, all of it: the largest clusters hold more than an exabyte. It’s also cheaper and easier for us to maintain YT than our Hadoop cluster. In short, YTsaurus is huge, inexpensive, and easy to use.

DataLens

The second technology in our stack is DataLens, the Business Intelligence (BI) system behind our analytical dashboards. DataLens was developed by Yandex as part of its cloud services and recently became open-source. It has a user-friendly, no-code interface and easy-to-use data modeling features. Our team prefers DataLens because it natively integrates with YTsaurus, ClickHouse, and other databases, and it’s easy for newcomers to get started with.

Why we don’t limit ourselves to using YTsaurus alone

As you might have noticed, we don’t use YTsaurus for all our DWH layers. While YT is a powerful cornerstone of our data infrastructure, we want users to have the solutions that best fit their specific needs, so we’ve added a few more systems to our stack.

Greenplum

The first technology is Greenplum, which we deploy within our Data Warehouse. Greenplum is a Massively Parallel Processing (MPP) database for analytics; in other words, a robust, high-performance take on Postgres. We use Greenplum to host our DDS layer, with its many highly normalized tables that require enormous joins during data processing.

ClickHouse

ClickHouse is an open-source, column-oriented database for OLAP use cases. We added it to our stack because we needed a high-performance database to handle the ‘hot’ DWH segments. We use ClickHouse for our CDM layer for filtering and aggregation tasks. It lets us generate BI reports faster and ensures fault tolerance through its cross-data center clusters.

YTsaurus benefits

Let’s explore the advantages of YTsaurus from the viewpoint of ETL developers, DWH maintainers, and analysts.

YTsaurus for ETL developers

The first useful YTsaurus feature for ETL is Cypress. Cypress is a distributed file and metadata storage system and a core component of YT. It supports tables, including column-oriented ones, which provide a significant performance boost for analytical workloads. Cypress also supports transactions: users can modify several Cypress objects within a single transaction, so we can update or even rebuild a group of tables atomically. This makes Cypress an excellent foundation for reliability and consistency.
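Here is a minimal sketch of such an atomic swap with the yt.wrapper Python client (the paths are hypothetical): either both staged tables replace their production counterparts, or neither does.

```python
# Atomically publish two staged tables under one Cypress transaction.
# Cluster address and paths are hypothetical.
import yt.wrapper as yt

yt.config["proxy"]["url"] = "my-cluster"

with yt.Transaction():
    yt.move("//home/dwh/cdm/orders_staged", "//home/dwh/cdm/orders", force=True)
    yt.move("//home/dwh/cdm/riders_staged", "//home/dwh/cdm/riders", force=True)
# If anything inside the block raises, the transaction aborts and
# readers keep seeing the previous versions of both tables.
```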

Static tables

YTsaurus has two types of tables: static and dynamic. Static tables are ideal for storing data that rarely changes. They are physically divided into parts called chunks. You can update a static table only through bulk operations, such as merging or overwriting it, or appending new rows at its end. Static tables can be sorted or unsorted. Sorted tables let you scan ranges efficiently because you can filter data on the sort key, which is why we use sorted tables on the CDM layer for BI reports.
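For illustration, creating a sorted static table and reading a key range from it might look like this in yt.wrapper (the schema format follows the client’s conventions; paths and column names are hypothetical):

```python
# Create a sorted static table, then read only one key range from it.
import yt.wrapper as yt

schema = [
    {"name": "utc_order_dt", "type": "string", "sort_order": "ascending"},
    {"name": "order_id", "type": "string", "sort_order": "ascending"},
    {"name": "revenue_value", "type": "double"},
]
yt.create("table", "//home/dwh/cdm/fct_order", attributes={"schema": schema})

# Because the table is sorted, this reads only the chunks covering
# one week of data instead of scanning the whole table.
rows = yt.read_table(yt.TablePath(
    "//home/dwh/cdm/fct_order",
    lower_key=["2023-12-01"],
    upper_key=["2023-12-08"],
))
```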

Dynamic tables

Dynamic tables, in contrast, function as key-value stores, so you can easily update them. We use dynamic tables for the RAW and ODS layers in our DWH. Sorted dynamic tables are great for uploading captured data changes keyed by a primary key.
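A minimal upsert into a sorted dynamic table might look like this (the table is assumed to already exist, be sorted by order_id, and be mounted; the path is hypothetical):

```python
# Upsert change records into a sorted, mounted dynamic table by primary key.
import yt.wrapper as yt

path = "//home/dwh/ods/orders_dyn"  # hypothetical dynamic table
yt.insert_rows(path, [{"order_id": "order-1", "status": "finished"}])

# Point lookup by primary key:
row = list(yt.lookup_rows(path, [{"order_id": "order-1"}]))
```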

Data schemas

Every YTsaurus table has a schema, and YT offers a rich type system with both primitive and complex types. A schema can be strict or weak. We prefer the strict approach and employ constraints to improve the reliability of our data pipelines. We use schemas on all layers, even RAW, where each table consists of a primary key (possibly composite) and a doc field in YSON format. YSON is a JSON-like data format that we use widely in YTsaurus. The main differences from JSON are scalar types, a binary representation, and the ability to attach attributes to any literal inside a document. We use attributes to annotate a source data type when it has no native equivalent in YSON.
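For example, here is a hedged illustration of attaching such an attribute with the yt.yson module (the annotation convention shown is invented for this article):

```python
# Annotate a YSON literal with its original source type.
import yt.yson as yson

value = yson.YsonString(b"2023-12-06T00:00:00Z")
value.attributes["source_type"] = "datetime"  # illustrative annotation

# Serializes roughly to: <"source_type"="datetime";>"2023-12-06T00:00:00Z"
print(yson.dumps(value))
```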

Partitioning

YTsaurus doesn’t have native partitioning, but you can simulate it with Cypress folders: each folder represents an entity, and the tables inside it are treated as partitions. When the sort key aligns with the primary key, partitioning by entry creation date becomes a practical way to avoid full table scans.
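A sketch of what this looks like in practice (folder layout and names are hypothetical): one table per day under an entity folder, so reading a week touches only seven “partitions.”

```python
# Folder-based "partitioning": one table per day under an entity folder.
import yt.wrapper as yt

folder = "//home/dwh/ods/orders"  # hypothetical entity folder
wanted = {"2023-12-0%d" % day for day in range(1, 8)}

# Select only the daily tables we need instead of scanning everything.
partitions = [
    f"{folder}/{name}" for name in yt.list(folder) if str(name) in wanted
]
```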

YTsaurus for DWH maintainers

Cypress and the scheduler offer valuable features for DWH maintenance.

Access control

The first is an enterprise-grade access control system. With it, we can easily set up a detailed access control list (ACL) for each object in Cypress. We can even apply ACLs to individual columns, reducing the risk of exposing sensitive data.
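For instance, appending a column-level allow rule can be as simple as this (group and column names are hypothetical):

```python
# Append a column-level ACL entry to a Cypress table.
import yt.wrapper as yt

yt.set("//home/dwh/cdm/fct_order/@acl/end", {
    "action": "allow",
    "subjects": ["go-analysts"],
    "permissions": ["read"],
    "columns": ["revenue_value"],  # the grant applies to this column only
})
```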

Compute pool hierarchy

YTsaurus uses a hierarchical fair-share scheduler, meaning resources like CPU or RAM are dynamically allocated between users based on their guarantees and priority parameters in the pool tree, such as weight, maximum operation count, and others.

There are various types of pools, from ordinary strong quotas to integral quotas. For example, with integral quotas, you can allocate a certain amount of CPU time daily to users, who can choose when to use it. This flexibility is beneficial for heavy but not permanent ML workloads.
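Creating a pool is itself a lightweight operation. A sketch (tree and pool names are hypothetical; the attribute names follow the YTsaurus pool documentation):

```python
# Create a scheduler pool with a weight and an operation limit.
import yt.wrapper as yt

yt.create("scheduler_pool", attributes={
    "name": "go-dwh-etl",        # hypothetical pool name
    "pool_tree": "physical",     # hypothetical pool tree
    "parent_name": "go-dwh",     # hypothetical parent pool
    "weight": 2.0,               # fair-share weight relative to siblings
    "max_operation_count": 50,   # cap on concurrent operations
})
```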

Account hierarchy

A similar mechanism exists for storage quotas. These quotas are called accounts. With accounts, you can set limits for HDD or SSD storage, Cypress nodes (data folders), and chunks (files). Just like compute pools, accounts have a hierarchical structure.

For a DWH maintainer, being able to configure such pools and accounts easily is marvelous. It lets us promote best practices in the DWH, such as guiding users toward efficient data storage and ETL methods: after reviewing their content or pipelines, we can assign generous pools and accounts to users who meet the requirements and stricter ones to those who don’t.
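As an illustration of how lightweight account configuration is, capping an account’s storage comes down to setting a couple of attributes (the account name and numbers are hypothetical):

```python
# Limit an account's disk space on the default medium and its node count.
import yt.wrapper as yt

account = "go-dwh"  # hypothetical account
yt.set(f"//sys/accounts/{account}/@resource_limits/disk_space_per_medium/default",
       10 * 1024**4)  # 10 TiB
yt.set(f"//sys/accounts/{account}/@resource_limits/node_count", 200_000)
```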

Compression and TTL parameters

Last but not least, from the maintainer’s point of view, comes a set of capabilities for effective storage: compression codecs, erasure codecs, and time-to-live (TTL) parameters. We combine these features with partitioning to optimize disk space usage.

For instance, we store several of the latest partitions without significant compression to enhance performance. We apply erasure codecs and compression for older partitions to save disk space. We can also delete old and unimportant logs from YT using TTL parameters.
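A sketch of tiering one old partition (the codec names are real YTsaurus options; the path and numbers are hypothetical):

```python
# Apply heavier compression, an erasure codec, and a TTL to an old partition.
import yt.wrapper as yt

old = "//home/dwh/ods/orders/2022-01-15"  # hypothetical old partition
yt.set(old + "/@compression_codec", "brotli_8")
yt.set(old + "/@erasure_codec", "lrc_12_2_2")

# Rewrite the chunks so the new codecs actually take effect:
yt.run_merge(old, old, spec={"force_transform": True})

# Let YT delete the partition automatically after ~400 days (value in ms):
yt.set(old + "/@expiration_timeout", 400 * 24 * 3600 * 1000)
```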

YTsaurus for analysts

We have three primary options for using data stored in YTsaurus.

YQL

The first is YQL, an SQL-based language. With it, you can use plain MapReduce with some additional features, or the in-memory engine called DQ (Distributed Query). DQ offers significantly faster performance but comes with limitations, such as restrictions on dataset size. We typically use DQ for ad-hoc work and regular fast queries like data quality checks. DQ isn’t in the open-source version of YTsaurus yet, but it should be added in the future.

CHYT

The second is CHYT, a ClickHouse cluster integrated into YT. With CHYT, you keep all the advantages of ClickHouse and use the same compute quotas available in YTsaurus, with no need to transfer data anywhere. However, data reading in CHYT is somewhat slower than in standalone ClickHouse.

We use CHYT for high-performance scenarios, such as fast ad-hoc or BI dashboards. We do, however, also have another option for high-performance use cases. It’s called Spark.

Spark over YTsaurus

Apache Spark is one of the most widely used engines for scalable computing. It supports numerous storage options, deployment configurations, programming languages, and workloads.

The critical difference between Spark and the conventional MapReduce model is that Spark keeps intermediate results in memory. This minimizes I/O costs, typically the slowest part of the entire computation.

Why Spark over YTsaurus?

The deployment option was the first thing we had to decide on at the start of our DMP journey, and we chose YTsaurus. Why? First, YT can host services much like Kubernetes does and has a scheduler similar to YARN. Second, every Yandex DWH or analytics department typically already has computing resources in YT, so teams can easily switch between YQL, MapReduce, and Spark, depending on their requirements.

How we contributed to SPYT

The idea was born in our team at Yandex Go, and after three months of coding Spark over YT, the project was rolled out across Yandex. We called it SPYT.

How SPYT works: client mode

A standalone Spark cluster comprises a master, workers, and several auxiliary services. In client mode, all Spark components run inside a YT Vanilla operation, which provides an easy way to run distributed processes consisting of several job types. The Spark components use Cypress to discover everything they need to work.
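Working with such a cluster from Python looks roughly like this, following the SPYT client API (the table and column names are hypothetical):

```python
# Read a YT table through SPYT and run a regular Spark aggregation on it.
import spyt
from pyspark.sql import functions as F

with spyt.spark_session() as spark:
    df = spark.read.yt("//home/dwh/cdm/fct_order")  # hypothetical path
    df.groupBy("city_name").agg(F.sum("revenue_value")).show()
```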


How SPYT works: cluster mode

There is another launch option called cluster mode. Here, the driver runs in the cluster along with the executors. To work in cluster mode, the Spark standalone master has to launch the driver from the application code, so the application code must be uploaded to YT before you call the spark-submit-yt utility.


New YTsaurus features today and tomorrow

As a result, Spark has become an integral part of the YTsaurus project. SPYT offers many features, such as performance optimizations and basic integration with the YT scheduler.

The DMP team has no plans to halt improvements to YTsaurus on our platform. Currently, we are deepening the integration of YTsaurus and Greenplum to enhance transfer speed and access control features. We will achieve this through the use of SPYT or Greenplum PXF. We also plan to integrate YTsaurus and Apache Flink to evolve our streaming scenarios for real-time analysis.

Are you interested in giving it a try?

You might think, “All this sounds great, but aren’t these Yandex-exclusive technologies out of my reach?” Fortunately, that’s not the case.

Greenplum, ClickHouse, DataLens, and even YTsaurus are now open-source. You’re free to adopt and experiment with these tools immediately. If you want to know more about our ETL framework or Flink connectors to YTsaurus, please contact Vladimir. If you’re interested in Managed YTsaurus in the cloud, contact me.

Key points

We have discussed our data stack and platform, and the best part is that all the main components are open and free to use. We have also looked at YTsaurus, a versatile technology noted for its scalability and multitenancy. You can use YTsaurus’s convenient API to extend it and integrate its capabilities with other big data tools. We encourage you to embark on your own exploration of the data stack landscape; it’s the best way to find solutions that perfectly fit your unique requirements.


Maxim Pchelin has over 10 years of experience on BI and DWH teams. He is currently focused on the product side of data services and leads the product stream of the analytical platform at Nebius.