YTsaurus: Exabyte-Scale Storage and Processing System Is Now Open Source

Maxim Babenko
Yandex
Mar 20, 2023

Hello, my name is Maxim Babenko, I’m the head of the distributed computing technologies department at Yandex. Today we’re pleased to announce that we have released the YTsaurus platform as open source. YTsaurus is one of the key infrastructure big data systems developed at Yandex and was previously known as YT.

After almost a decade of hard work, we want to share YTsaurus with the world. In this article, we’ll take you through the history of YT’s development, explain why YTsaurus is needed, describe its main features, and outline the areas for which it is best suited.

The GitHub repository contains the server code for YTsaurus, the deployment infrastructure using k8s, a web interface for the system, and client SDKs for popular programming languages such as C++, Java, Go, and Python. Everything is Apache 2.0-licensed, which means that anyone can download and modify it to suit their needs.

How YT became Yandex’s primary big-data system

The story begins in 2006. By that time, Yandex had become a fairly large company. The question of where to store and how to process the company’s data was no longer a simple one. At that time, the focus was on logs from multiple services. Log processing involved a variety of analytics that could address a wide range of tasks, from improving machine learning models to analyzing user behavior when functional or interface changes were made to the services.

The idea was already in the air: a scalable, elastic storage system that could run parallel computations without users having to worry about the physical location of the data or the fault tolerance of individual cluster components.

In 2004, Google’s Jeffrey Dean and Sanjay Ghemawat published MapReduce: Simplified Data Processing on Large Clusters. It largely predicted the evolution of the distributed computing industry for the next decade. It’s no surprise that a similar implementation of the MapReduce paradigm emerged at Yandex, called YAMR — Yet Another MapReduce.

YAMR was built from scratch in record time and undoubtedly had a tremendous impact on the development of the company’s internal infrastructure. Over time, however, it became clear that many of the design choices originally made in YAMR were not allowing the system to evolve and scale effectively. For example, the YAMR master server was a single point of failure and did not scale.

At first glance, the decision to build our own infrastructure might look like a typical case of NIH syndrome, as if an out-of-the-box solution like Apache Hadoop had not even been considered. But that is not entirely true. In September 2015, a group of Yandex engineers went to California to meet with teams that were running the Hadoop stack in production. They asked questions about limitations, operational peculiarities, and how Hadoop was expected to evolve.

Those conversations made it clear that the Hadoop stack lagged significantly behind even YAMR, which already supported erasure coding and IPv6 connectivity. And these were by no means the only issues.

After analyzing everything, we decided to abandon the idea of using Hadoop. That left a choice between evolving YAMR and writing a new system from scratch, and we chose the latter. Five years before these events, a small group of enthusiasts, of which I was fortunate enough to be a part, had begun working on a project codenamed YT. With proper refinement, YT had every chance of replacing YAMR.

It is important to understand that there was no immediate way to replace YAMR. At its peak, this system managed clusters totaling thousands of nodes, and a large amount of application code was based on the YAMR API. As a result, the process of refining YT and migrating from YAMR took many years. The details of this story are interesting in their own right and probably warrant a separate post.

Since 2017, Yandex has had a single MapReduce system, the development of which, both in terms of scale and capabilities, continues to this day. Today, the company operates several YT clusters, ranging in size from a few machines to tens of thousands of servers. The largest installations store exabytes of total data, using millions of CPU cores and thousands of GPU cards for round-the-clock computing.

YTsaurus: origins of the name

It has taken us almost seven years to answer the question, “Will YT be open source?” Here is the answer: YT will not be open source, but YTsaurus will be!

The system we originally developed was called “YT.” The same abbreviation appears in many parts of the code base. Word of mouth within Yandex has it that the abbreviation “YT” was meant to stand for “Yandex Table,” possibly inspired by Google’s well-known Bigtable system, but we haven’t been able to find any reliable evidence to support this theory.

When we decided to release the system as open source, we found it difficult to keep the original name. The problem was not only that this two-letter combination is often associated with a certain popular video hosting platform, but also that short product names that are still up for grabs are hard to find.

We finally settled on the name “YTsaurus.” It has the same dear and familiar “YT” prefix, and our team has always treated the project as a living creature. Now we finally know its species!

In our codebase and texts, we’ll often shorten “YTsaurus” to “YT.” We’re still getting used to the full name ourselves :)

Capabilities of the system

We designed the system to be flexible and scalable, and currently, its capabilities are not limited to classic MapReduce technology. In this section, I will describe the main technical capabilities available in the open-source version of YTsaurus, from low-level storage to high-level compute primitives.

Cypress: reliable and efficient data storage

The core of any big data system is the storage of various logs, statistics, indexes, and other structured or unstructured data. YTsaurus is built on top of Cypress, a fault-tolerant tree-based storage whose capabilities can be briefly described as follows:

  • A tree-like namespace with directories, tables (structured or semi-structured data), and files (unstructured data) as nodes
  • Transparent sharding of large tabular data into chunks, allowing the table to be treated as a single entity without worrying too much about the details of physical storage
  • Support for columnar and row-based storage mechanisms for tabular data
  • Support for compressed storage using various encoding codecs, such as lz4 and zstd, with varying levels of compression
  • Support for erasure coding using various erasure codecs that differ in their parity calculation strategies, redundancy parameters, and tolerated loss types
  • Expressive data schematization with support for hierarchical types and sort-order annotations
  • Background replication and repair of erased data without manual intervention
  • Transactional semantics with support for nested transactions and snapshot, shared, and exclusive locks
  • Transactions that can affect many Cypress objects and last indefinitely
  • A flexible quota accounting system
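To make the above more concrete, here is a minimal sketch of working with Cypress through the Python SDK (yt.wrapper); the cluster address, paths, and codec choices are hypothetical:

import yt.wrapper as yt

yt.config["proxy"]["url"] = "my-cluster.example.com"  # hypothetical cluster proxy

# Create a directory and a schematized table under it.
yt.create("map_node", "//home/demo", recursive=True, ignore_existing=True)
yt.create(
    "table",
    "//home/demo/events",
    attributes={
        "schema": [
            {"name": "timestamp", "type": "uint64", "sort_order": "ascending"},
            {"name": "message", "type": "string"},
        ],
        "compression_codec": "zstd_5",  # compressed storage
        "erasure_codec": "lrc_12_2_2",  # erasure coding instead of plain replication
    },
)

# Tree structure and attributes are inspected with get/list/set.
print(yt.get("//home/demo/events/@compression_codec"))

# Cypress modifications can be grouped into transactions.
with yt.Transaction():
    yt.set("//home/demo/events/@my_annotation", "nightly import")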

At the heart of Cypress is a replicated and horizontally scalable master server that stores metadata about the Cypress tree structure and the composition and location of chunk replicas for all tables on the cluster. Master servers are implemented as replicated state machines based on Hydra, a proprietary consensus algorithm similar to Raft.

Cypress implements a fault-tolerant elastic data layer that is used in virtually all aspects of the system described below.

MapReduce computing and a general-purpose scheduler

Although MapReduce is no longer considered new or unusual, its implementation in our system deserves some attention. We still use it for computations over petabytes of data where high throughput is required.

MapReduce in YTsaurus has the following features:

  • A rich base model of operations: classic MapReduce (with different shuffle strategies and support for multi-phase partitioning), Map, Erase, Sort, and some extensions of the classic model that take into account the “sortedness” of input data
  • Horizontal scalability of computations: operations are divided into jobs that run on separate servers
  • Support for hundreds of thousands of jobs in a single operation
  • A flexible model of hierarchical compute pools with instant and integral guarantees, as well as fair-share distribution of underutilized resources among consumers without guarantees
  • A vector resource model that allows different compute resources (CPU, RAM, GPU) to be requested in different proportions
  • Job execution on compute nodes in containers isolated by CPU, RAM, file system, and process namespace using the Porto containerization mechanism
  • A scalable scheduler that can serve clusters with up to a million concurrent tasks
  • Virtually all compute progress is preserved in the event of updates or scheduler node failures

YT supports not only the execution of MapReduce operations but also the deployment of arbitrary user-provided code on the cluster.

In YT terminology, running arbitrary code with unspecified side effects is achieved using “vanilla” operations. We use this capability for a number of other components of our platform, which I will discuss below.
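To make this more concrete, here is a minimal word-count sketch using the Python SDK (yt.wrapper); the table paths are hypothetical:

import yt.wrapper as yt

def mapper(row):
    # Emit one record per word in the "message" column.
    for word in row.get("message", "").split():
        yield {"word": word, "count": 1}

def reducer(key, rows):
    # Rows arrive grouped by "word"; sum the partial counts.
    yield {"word": key["word"], "count": sum(r["count"] for r in rows)}

yt.run_map_reduce(
    mapper,
    reducer,
    source_table="//home/demo/events",
    destination_table="//home/demo/word_counts",
    reduce_by=["word"],
)

The SDK packages the user functions, ships them to the cluster, and splits the operation into jobs that run in isolated containers on the compute nodes.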

Dynamic k-v storage tables

The MapReduce paradigm is virtually unsuitable for building interactive compute pipelines with sub-second response times. The problem lies not only in how data is processed but also in how it is stored.

YT’s static tables, like a set of files in HDFS, can serve as inputs and outputs for MapReduce computations. However, they cannot be used in an interactive scenario because they are tied to a slow persistent storage medium. For interactive scenarios, applications typically use key-value stores. They can scale horizontally while providing low-latency read and write access.

Fortunately, in 2014 we started working on dynamic tables within the YT framework, partly inspired by the Apache HBase model. They scale horizontally and use our distributed file system as the underlying storage. Unlike Apache HBase, however, dynamic tables are organically integrated into the overall ecosystem: they are Cypress nodes and can be used in many scenarios where static tables are expected.

For example, in YT you can create a dynamic table as the result of a MapReduce operation and use it for fast key-based lookups and inserts. At the same time, you can run a background MapReduce process that samples data from the dynamic table and computes statistics over it.

The main features of dynamic tables are:

  • Data storage in the MVCC model: users can look up values by key or by timestamp
  • Scalability: dynamic tables are split into tablets (shards by key ranges) that are served by separate servers
  • Transactionality: dynamic tables are OLTP storage, and a single transaction can modify many rows across different shards of different tables
  • Fault tolerance: failure of a single node serving a tablet causes that tablet to be moved to another node with no loss of data
  • Isolation: nodes serving tablets are grouped into bundles that reside on separate machines, ensuring load isolation
  • Conflict checking at the level of an individual key or even an individual value
  • Hot data served from RAM
  • A built-in SQL-like language for scan queries and analysis

In addition to dynamic tables with a k-v storage interface, the system supports dynamic tables that implement the message queue abstraction, namely topics and streams. These queues can also be considered tables: they consist of rows and have their own schema. Within a single transaction, you can modify rows in both a k-v dynamic table and a queue, which makes it possible to build stream processing on top of YT’s dynamic tables with exactly-once semantics.
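Here is a minimal sketch of the dynamic table workflow via the Python SDK (yt.wrapper); the path and schema are hypothetical, and the cluster is assumed to have tablet cells configured:

import yt.wrapper as yt

path = "//home/demo/profiles"

# A dynamic table is an ordinary Cypress node created with "dynamic": True.
yt.create("table", path, attributes={
    "dynamic": True,
    "schema": [
        {"name": "user_id", "type": "uint64", "sort_order": "ascending"},
        {"name": "region", "type": "string"},
    ],
})
yt.mount_table(path, sync=True)  # the table must be mounted before use

# Transactional writes and low-latency reads by key.
yt.insert_rows(path, [{"user_id": 42, "region": "eu"}])
print(list(yt.lookup_rows(path, [{"user_id": 42}])))

# The built-in SQL-like language for scan queries.
for row in yt.select_rows("region, sum(1) as users from [{}] group by region".format(path)):
    print(row)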

YQL

YQL is an SQL-based query language; it is the first high-level primitive built on top of YT. YQL occupies roughly the same position in relation to YT that Hive occupies in relation to Hadoop. This technology allows users to write simple queries in SQL rather than building a sequence of MapReduce operations with custom code. Here’s an example of such a query:

SELECT
    region,
    AVG(age) AS avg_age_in_region,
    COUNT(DISTINCT ip) AS ips_count
FROM `//home/production/users`
GROUP BY region
ORDER BY avg_age_in_region;

Today, many big data tasks can be succinctly formulated as SQL queries. Without YQL, our ecosystem would be incomplete. It is one of the most popular tools for both ad hoc analysis of large datasets and regular production calculations.

YQL benefits include:

  • A powerful graph execution engine that can build MapReduce pipelines with hundreds of nodes and adapt during computation
  • Ability to build complex data processing pipelines in SQL, storing subqueries in variables and chaining dependent queries and transactions
  • Predictable parallel execution of queries of any complexity
  • Efficient implementation of joins, subqueries, and window functions with no restrictions on their topology or nesting
  • Extensive function library
  • Support for custom functions in C++, Python, and JavaScript
  • Support for using machine learning models via CatBoost and TensorFlow
  • Automatic execution of small parts of queries on prepared compute instances, bypassing MapReduce to reduce latency

CHYT

It goes without saying that most of my readers have heard of ClickHouse. In 2016, this DBMS became a pioneer among Yandex’s open-source technologies and proved so successful that in 2021 it became a separate company called ClickHouse Inc.

Today, ClickHouse is one of the most popular analytical databases with an incredibly efficient column-based execution engine and a variety of integrations with BI systems. One of the nice features of ClickHouse is the good separation of storage and compute parts in the source code, which allowed us to build CHYT in 2018 — an integration of the ClickHouse compute engine with YTsaurus as storage.

In the YTsaurus ecosystem, CHYT provides the following capabilities:

  • Fast analytical queries on static tables in YT with sub-second latency
  • Reusing existing data in the YTsaurus cluster without having to copy it to a separate ClickHouse cluster
  • Ability to integrate (e.g. with third-party visualization systems) via ClickHouse’s native ODBC and JDBC drivers

It is worth noting that the integration is done at a fairly low level. This allows us to use the full potential of both YTsaurus and ClickHouse, namely:

  • Support for reading both static and dynamic tables
  • Partial support of the YTsaurus transactional model
  • Support for distributed inserts
  • CPU-efficient conversion of columnar data from the internal YTsaurus format to the in-memory ClickHouse representation
  • Aggressive data caching, which in some cases allows query execution data to be read exclusively from instance memory

The ClickHouse server code runs in the above-mentioned vanilla operations, using the same compute resources that can be used for MapReduce computations. In this sense, the YTsaurus cluster acts as a compute cloud with respect to the CHYT clusters within.

This allows different users or user teams to run multiple CHYT clusters on a single YT cluster, completely isolated from each other, solving the problem of resource separation in a cloud-like manner.

SPYT

In 2019, Yandex introduced SPYT, a system that integrates Apache Spark as a compute engine for data stored in YT. Similar to CHYT, vanilla YTsaurus operations provide computational resources for the Spark cluster. Apache Spark was originally designed to make it easy to connect to third-party storage as a data source.

SPYT is also well-established in the YTsaurus ecosystem. Thanks to its rich integration capabilities with third-party systems, it is one of the main ways to write ETL processes. Under the hood, Spark uses a flexible distributed computing optimizer that keeps intermediate data in memory as much as possible and can implement compute pipelines with multiple joins.
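As an illustration, here is a minimal PySpark sketch in the SPYT style; the session setup and table paths are assumptions based on the project documentation rather than verbatim production code:

import spyt
from pyspark.sql.functions import col

with spyt.spark_session() as spark:
    # Tables in Cypress are addressed by their YT paths.
    orders = spark.read.yt("//home/demo/orders")
    daily = (
        orders
        .where(col("status") == "completed")
        .groupBy("date")
        .count()
    )
    daily.write.yt("//home/demo/orders_daily")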

Various SDKs

SDKs for a system in a given language are often generated automatically or written by someone in the user community, and then go unmaintained for long periods. In our case, we develop all the APIs for popular languages (C++, Python, Java, Go) ourselves, and in each case the nuances of interacting with the system are carefully thought through.

Our client libraries, written in different languages, can retry requests, including reading or writing large amounts of data, despite possible network failures and other errors. When creating each library, we took into account the peculiarities of the languages and used them to make interaction with the system as convenient and simple as possible.
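For example, basic I/O through the Python SDK looks roughly like this (a sketch; the proxy address and paths are hypothetical), with retries of interrupted reads and writes handled inside the client library:

import yt.wrapper as yt

client = yt.YtClient(proxy="my-cluster.example.com")

# Write a small static table, then stream it back row by row.
client.write_table("//home/demo/cities", [
    {"city": "Amsterdam", "population": 900000},
    {"city": "Belgrade", "population": 1200000},
])

for row in client.read_table("//home/demo/cities"):
    print(row["city"], row["population"])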

Web interface

A user-friendly web interface is a must for a system used by thousands of users. Moreover, we deliberately did not create separate web interfaces for users and administrators, which saved us from the common situation where an administrative web interface is hastily created by enthusiasts: after all, the user side is more important, and there is no need to be embarrassed in front of the administrators :)

Here is what you can do with the YTsaurus web interface:

  • Navigate through Cypress to view files, tables, and other objects
  • Create, rename, or delete Cypress objects, and change their attributes
  • Execute and view MapReduce computations
  • Execute and view the history of SQL queries across all engines: YQL, CHYT, and the dynamic table query language
  • Administer the system: monitor cluster component health, create, delete, or ban users, manage access rights and quotas, view cluster component versions, and more

The technical side of YTsaurus

Most of the server-side code is written in C++. We love this language for its rich functionality and efficient code. After releasing YTsaurus as open source, we hope to share a large number of developments that may be useful as separate C++ primitives.

The server-side code is built using the clang compiler and the CMake build system.

Individual parts of the system are written in Go, Python, and Java. There is also an API for developing applications that work with YTsaurus in the four programming languages mentioned above.

The code base is automatically synced with the internal repository. Thus, an up-to-date version of YTsaurus is always available externally.

YTsaurus runs on x86-64 Linux.

Deployment and administration

Within Yandex, there are more than 20 YTsaurus installations. They vary greatly in size and configuration, from 5 to 20K+ hosts in a single cluster. YTsaurus is also integrated with several internal Yandex systems, including authentication, access control, auditing, monitoring, hardware management, and container orchestration. All these systems allow us to manage clusters with minimal effort.

For the convenience of users, we have invested in developing a level 2 Kubernetes operator for automated deployment of YTsaurus clusters, with support for the standard mechanism of upgrading to a new version (with downtime). The operator lets you deploy your own YTsaurus cluster in a few minutes on a local machine in minikube, in a public cloud, or in your own on-premises Kubernetes installation.

Cluster configuration can be managed on the fly by modifying system nodes in the metadata tree (Cypress). Using basic Cypress commands such as list, get, set, and remove, you can create an account, add a user or compute pool, grant access to a catalog, or retire cluster nodes.
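A sketch of what this looks like through the Python SDK; the names and quota values are hypothetical:

import yt.wrapper as yt

# Users, accounts, and pools are ordinary Cypress objects.
yt.create("user", attributes={"name": "alice"})
yt.create("account", attributes={"name": "analytics"})

# Quotas are plain attributes on those objects.
yt.set("//sys/accounts/analytics/@resource_limits/node_count", 10000)

# Access control lists are attributes as well.
yt.set("//home/analytics/@acl", [
    {"action": "allow", "subjects": ["alice"], "permissions": ["read", "write"]},
])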

Of particular note is the ability to dynamically configure individual components: by changing special attributes, you can adjust cache sizes, heartbeat periods, or logging settings on nodes.

YTsaurus is a compute platform, so execution of user code is implied. To run and isolate untrusted code, YTsaurus uses Porto, a containerization system developed at Yandex. For full user isolation in a multi-tenant cluster, we recommend installing Porto as a Kubernetes CRI. This unlocks the full range of YTsaurus capabilities for job isolation and the use of custom environments in different operations.

And, of course, the operation of a large distributed system is impossible without observability tools — logging, quantitative monitoring, and tracing. YTsaurus writes structured logs for auditing and monitoring user actions, as well as detailed debugging logs for deeper problem diagnostics. In addition, the system supports metric export in the Prometheus format and trace delivery via the Jaeger gRPC protocol.

What can be built on top of YTsaurus?

Let’s look at a few examples of how our system is used at Yandex.

One of the most revealing and typical use cases of YTsaurus is building data warehouses (DWH). For example, orders from Yandex Taxi, Yandex Eats, Yandex Deli, and Yandex Delivery arrive in YTsaurus dynamic tables in raw form with minimal delay. The volume of data reaches hundreds of terabytes per month.

Then the orders are processed using various tools: for example, most of the analytical data marts are prepared using YQL and SPYT. The total data volume exceeds 6 PB. CHYT is used for ad hoc analysis, and various visualizations are created in Yandex DataLens. Similar use cases exist for other Yandex services such as Yandex Market, Yandex Music, and Yandex Travel.

There are also very specific use cases. For example, all three Yandex supercomputers are managed by the YTsaurus scheduler. Many nodes with different types of GPUs are connected to YT and distributed across different pool trees. This allows users to explicitly specify the required GPU model and use the data stored in YTsaurus.

Currently, the dynamic tables in YTsaurus store petabytes of data, and a large number of interactive services are built on top of them. One of the largest internal customers is the Yandex Advertising team. At the HighLoad++ 2022 conference, my colleagues talked about their approach to building interactive stream processing on top of YTsaurus.

In place of a conclusion

YTsaurus is a big project with a rich history. We invite all curious people to take a look at YTsaurus and find something useful for themselves. Maybe you’ll appreciate the technical solutions we’ve implemented in code, or find an opportunity to deploy a YTsaurus installation and try it out in practice.

If you’re interested in helping us develop the system, that would be great. Share your feedback in the Telegram chat, or better yet, send us your pull requests.
