YDB Is Now Available as Open-Source Project

Published in Yandex, Jun 23, 2022

On April 19, 2022, as a result of our multi-year effort in developing data storage and processing systems, Yandex released the YDB database as open source. We have published the source code, documentation, SDK, and all the relevant DB tools on GitHub under the Apache 2.0 license.

YDB is a DBMS designed for business-critical operational data processing that directly affects end users. We developed it with all the key requirements of Yandex services in mind, including disaster tolerance (services keep running without degradation even when one of the database’s three data centers is down), read/write scalability to tens of thousands of servers, and strong data consistency.

YDB has been used at Yandex for over five years. To minimize the resources needed for maintenance, we provide our users with a fully managed service on multi-tenant clusters, which host databases with totally different workloads and data access patterns. In practice, we can accommodate hundred-fold database growth, from gigabytes to hundreds of terabytes, and explosive workload increases from thousands to millions of requests per second (RPS). The DBMS resolves scalability and fault-tolerance issues automatically, so you don’t have to handle them in the application code.

It’s also noteworthy that YDB has been widely adopted not just for high-load projects. Although Yandex development teams are free to choose the stack that best fits their tasks, audience, and other specifics, many projects choose YDB for its fault tolerance, even when their current data loads are small and can be handled by a single server. These teams know that they can accommodate any spike in data load by adding resources, without modifying the application code or manually re-sharding the database.

Inspired by the high levels of internal adoption of YDB at Yandex, its immense growth in popularity as a Yandex Cloud service, and numerous requests from users, we decided to disclose our source code and make YDB available free-of-charge.

Background

The internet’s increasing penetration into our lives over the past decades, which has now reached previously simple devices like lightbulbs, watches, and vacuum cleaners, has sparked significant growth in the volumes of data that data centers must store and process. In light of this, OLTP DBMSs like MySQL and PostgreSQL have risen in popularity.

It’s hard to imagine that the internet could have grown so fast without these databases. How many start-ups were able to use free database management systems to process ever increasing loads? Of course, there were “heavy” commercial solutions offering vertical scalability, but only large companies with big budgets could afford them.

By the mid-2000s, a range of methods for logical database sharding had gained popularity. These methods split a database across multiple nodes based on the business logic of the application and its constituent entities. The application itself (or an intermediate layer between the application and the database) selects the database node to put the data on. Although widely adopted, this approach substantially complicates the application logic and the support of multiple clusters. Queries that need a consistent result combining data from different nodes add further complexity. This problem can be tackled by exporting data to a separate system or by making the application logic even more sophisticated. Each such step, however, compromises the maintainability and technological effectiveness of the system. It also doesn’t eliminate the need to re-shard data when some shards start to experience higher workloads, and it’s not easy to change the initially chosen sharding approach as your app faces new business requirements. Meanwhile, data loads continue to grow, and the cost of failures increases.
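The routing step that logical sharding pushes into the application can be sketched in a few lines of shell. This is only an illustration under stated assumptions: the function names shard_for_key and node_for_key, the node names, and the use of the POSIX cksum CRC as the hash are all hypothetical, and a real application would use its own hash and node registry.

```shell
#!/bin/sh
# Hypothetical helper: map a record key to a shard number in [0, shard_count).
# cksum (a POSIX CRC) stands in for whatever hash a real application uses.
shard_for_key() (
    key=$1
    shard_count=$2
    # Reading from stdin, cksum prints "<crc> <size>"; keep only the CRC.
    crc=$(printf '%s' "$key" | cksum | cut -d ' ' -f 1)
    echo $(( crc % shard_count ))
)

# Hypothetical helper: pick the node that serves a key.
# Usage: node_for_key KEY NODE0 NODE1 ...  (node i serves shard i)
node_for_key() (
    key=$1
    shift
    idx=$(shard_for_key "$key" "$#")
    # Drop the first idx nodes so that the chosen node becomes $1.
    shift "$idx"
    echo "$1"
)

# Example usage (node names are made up):
# node_for_key "user:42" db0 db1 db2
```

Calling shard_for_key for the same keys with a shard count of 4 and then 5 shows why re-sharding is painful with this scheme: many keys map to a different shard once the count changes, so their data has to be moved.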

At the end of the 2000s, NoSQL solutions became popular. They brought scalability and fault tolerance, but sacrificed the SQL dialect and JOIN functionality and accepted eventual consistency, where every replica in every availability zone or region receives a consistent data set sooner or later. Key-value databases like Redis and AWS DynamoDB grew in popularity, along with column-family databases like Cassandra and HBase and document databases like MongoDB and Couchbase.

We followed the same path, with the history of the OLTP DBMS development at Yandex basically repeating the evolution of operational databases. YDB’s precursor was a NoSQL database (KV) that has been in development since 2008.

As NoSQL databases fail to support ACID transactions, developers have to emulate them, making their application code more complex as they have to expand the database functionality at the application code level. First of all, this is very difficult, and, second, you need to reinvent the wheel and add duct tape and hacks to all your projects.

The problem remained relevant, and in the early 2010s a new term, NewSQL, emerged. Michael Stonebraker and Andy Pavlo use it in their descriptions of the new requirements that OLTP DBMSs were expected to meet.

NewSQL databases should have the scalability and fault tolerance typical of NoSQL systems, but at the same time provide ACID guarantees for transactions and support an SQL dialect. A bit later, the term NewSQL evolved into a new name for scalable, fault-tolerant databases offering SQL support and strong consistency: Distributed SQL Databases. Inspired by the NewSQL idea, in 2012 we made our first commit to YDB, our Distributed SQL Database.

Why Use YDB?

The DBMS market has been growing for a long time, offering a number of well-known, mature products. Let’s look at advantages that YDB offers in comparison to other solutions, and why it is worth adopting.

Relational Databases

Manual sharding is one of the methods used to scale up relational databases. It means that, when deploying the database, you need to set up multiple database instances and decide which instance will be accessed by your application. If you need to access multiple database instances at a time, you need to implement distributed transactions in your code. YDB provides read/write scalability out of the box. For this, you just need to add more hardware capacity to the cluster. In practice, we run databases storing hundreds of terabytes and handling millions of requests per second.

NoSQL Databases

NoSQL databases scale up very well, but their functionality is significantly curtailed compared to relational databases. For example, supporting a high rate of transactional SQL updates against multiple tables is quite an issue with NoSQL databases.

Distributed Open-Source SQL Databases

Some systems have very similar capabilities compared to YDB. In our view, YDB has the following advantages:

Having Yandex, with its large number of high-load services and large amounts of data, as a client gives YDB the opportunity to demonstrate all its properties in practice.
On top of the YDB platform, we have implemented such services as time series storage to empower Yandex Monitoring, persistent queues to enable Yandex LogBroker data bus, and Network Block Store to implement virtual disks for Yandex Cloud on YDB.
With its rich functionality offering federated queries and streaming queries based on YQL, YDB can become your data management ecosystem.

Proprietary Distributed SQL Databases

All cloud systems offered today by the world’s leading providers are closed-source and only available for internal use or as a cloud service from a certain provider. Some of these systems are locked in to specialized equipment. As a result, clients of such systems have no options for local or multi-cloud deployment.

In contrast, you can deploy YDB anywhere you want by using the provided Kubernetes operator or manually. You don’t need any specialized equipment to deploy the database. YDB is also available as a managed service in Yandex Cloud.

Success Stories

Yandex Cloud

It is no exaggeration to say that YDB is a key component in Yandex Cloud. The platform has a hyperconverged architecture, meaning that the storage layer and the compute layer run on the same hardware but are separated and independent from each other. The control plane also runs on the same equipment. YDB serves as the data storage layer in Yandex Cloud: it stores the network drives as well as the data and metadata of the infrastructure and platform services provided by Yandex Cloud. Numerous services have been implemented on top of YDB to provide their own data tools: Yandex Monitoring collects and visualizes application metrics, Message Queue enables messaging between applications, YDS is a scalable service for managing data flows in real time, and Yandex Cloud Logging aggregates and reads logs from custom applications and Yandex Cloud resources.

Alice Voice Assistant

The Alice team considered YDB when preparing for increased workloads and storage volumes. After migrating, we abandoned manual data sharding and gained strong out-of-the-box consistency in the 3DC cluster. In our previous DB, we noticed undesirable effects when switching the master replica between data centers. After extensive routine maintenance, replicas that had lagged for many hours had to be carefully brought back in sync by our DevOps engineers. Having moved to YDB, we decreased the DevOps effort involved. Alice currently uses YDB in a number of scenarios. Here are a few examples:

Context

To maintain a natural dialogue with the user without asking clarifying questions, Alice stores the dialog context. For example, if you ask Alice what Gorky Park in Moscow is and then, after she gives you a factual answer, ask her to build a route “there,” Alice will build a route to Gorky Park. The dialog context runs through many Alice scenarios, and many of them are natively contextual, e.g. the city chain game, the multimedia game, and others. Alice stores her context in YDB.

Yandex Smart Home

When we connect smart home devices to Alice, we can give them names. YDB is also used to store links between the device activation phrases and their IDs or locations inside the user’s home.

Logs

Operational logs from Alice’s development framework are stored in YDB.

Verticals (Auto.ru and Yandex Realty)

Yandex Verticals use the microservice architecture extensively. Our colleagues approached the YDB team when they saw that the existing backends for the Jaeger trace database had gotten too costly in terms of traces processed per server core. It was a challenge for the YDB team, and we implemented a special API (BulkUpsert) to write logs and traces, optimizing the database for writing traces. By leveraging high YDB performance, we reached three-fold savings in terms of computing resources and managed to write all traces without sampling.

After YDB successfully proved itself as a cost-effective, efficient, and fault-tolerant database for storing traces (at the data load of 500,000 spans/sec), Yandex Verticals also started using YDB as a relational database.

Yandex Metrica

Yandex Metrica collects data on user sessions on websites. To do this, you need to store the history of all events and merge them together on the fly. For this, we had previously used a distributed pipeline with a local storage developed in-house and its own logic of replication and sharding. As the workloads grew, our colleagues from Yandex Metrica were capped by the performance of the in-house storage, so it was quite painful to add more shards without a fundamental redesign of the architecture.

We experimented with and load-tested YDB as a new session repository offering transparent scalability by location and by load. Since then, we have continued increasing the load on our Sessions database, which now holds more than 100 TB of data and withstands a load of more than 1,000,000 RPS.

Yandex Market

A shopping cart is a key component of any marketplace, online store, or online service that sells something to its users. After we received a product requirement to show a cart widget on the Yandex home page (and others), it became clear to us that we needed to withstand a hundred-fold higher workload while meeting our SLA guarantees for the response time.

We considered multiple NoSQL solutions, but opted for YDB. After successful testing, we migrated our Cart to YDB, spending one developer man-month on it. We didn’t stop at that, and after the successful implementation of YDB for the Cart, we started using it in other Yandex Market services.

In addition to the success stories already mentioned, we are proud that YDB is getting increasingly popular within Yandex, given that development teams are free to choose their technology stacks.

Why Are We Going Open Source

We firmly believe that the technological development seen over recent decades was made possible thanks to the open-source culture that stimulates people’s interest in technologies and makes them available to everyone. The one thing needed is a desire to get acquainted with the technology. One can’t possibly imagine today’s internet without such databases as MySQL, PostgreSQL, such web servers as Apache and nginx, and other open-source solutions. The examples are numerous.

Everyone benefits from going open source: it’s a win-win situation. First, we give the community the option to leverage the unique developments that Yandex has invested hundreds of man-years in, to get familiar with the code, and to run and develop YDB-based solutions free of charge. Technologies that allowed Yandex to scale up and grow faster, and accommodate increasing data loads more quickly, are now available to anyone. Second, we can attract more diverse users to the system, so we can get feedback from the global community and make our product even better. We want to break down the barrier for users who are interested in this technology but still have doubts because of its proprietary nature or a lack of support on their equipment or in their clouds.

How to Evaluate YDB

We have tried to provide maximum flexibility for YDB users. For local testing or debugging, you can use a Docker container or run a Kubernetes cluster in Minikube. You can set up and run the YDB cluster locally on your own using our build. For production purposes, we recommend deploying YDB with Kubernetes or a fully managed Yandex Cloud service. And, of course, you can always build YDB from the source code.

Working With a Docker Image of YDB

For convenience, the Docker image has the YDB console client pre-installed. In the Docker image, a database named /local is launched by default.

The following default ports are used:

Port 2136: For interacting with the YDB API over gRPC without TLS.
Port 2135: For interacting with the YDB API over gRPC with TLS. Certificates are generated automatically. To use certificates, you need to mount the /ydb_certs directory of the Docker container on the host system.
Port 8765: For the built-in monitoring and introspection tools.

Pull the current public version of the Docker image:

docker pull cr.yandex/yc/yandex-docker-local-ydb:latest

Run the YDB Docker container and mount the directories:

docker run -d \
  --rm \
  --name ydb-local \
  -h localhost \
  -p 2135:2135 \
  -p 8765:8765 \
  -p 2136:2136 \
  -v $(pwd)/ydb_certs:/ydb_certs -v $(pwd)/ydb_data:/ydb_data \
  -e YDB_DEFAULT_LOG_LEVEL=NOTICE \
  -e GRPC_TLS_PORT=2135 \
  -e GRPC_PORT=2136 \
  -e MON_PORT=8765 \
  cr.yandex/yc/yandex-docker-local-ydb:latest

Note:

-d: Run the Docker container in the background.
--rm: Remove the container after its operation is completed.
--name: Name of the container.
-h: Host name of the container.
-p: Publish container ports on the host system.
-v: Mount the host system directories to the container.
-e: Set the environment variables.
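Because the container is started with -d and --rm, a container that exits at startup is removed without any message in your terminal. A minimal sketch of a check you might run before connecting is shown below. It assumes the container name ydb-local used above; container_running is a hypothetical helper name, not a Docker command.

```shell
#!/bin/sh
# container_running is a hypothetical helper, not part of Docker.
# It succeeds only if a running container has exactly the given name.
container_running() (
    name=$1
    # `docker ps` lists running containers; --format prints one name per line.
    # grep -x matches whole lines, -F treats the name as a fixed string.
    docker ps --format '{{.Names}}' | grep -qxF -- "$name"
)

# Example usage:
# container_running ydb-local && echo "ydb-local is up"
```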

To read more about the options for launching the YDB Docker image, see the documentation.

YDB CLI Console Client

We’ll use the YDB CLI console client as our main tool for making queries and applying the test load.

Install the YDB CLI by following the instructions.

To check the database connection, run the query against the YDB database from the Docker container:

ydb -e grpcs://localhost:2135 --ca-file $(pwd)/ydb_certs/ca.pem -d /local table query execute -q 'select 1;'

Note:

-e: Database endpoint.
--ca-file: Path to the TLS certificate.
-d: Database name. The command and its query parameters follow.

As a result, you should see the message:

┌─────────┐
| column0 |
├─────────┤
| 1       |
└─────────┘

This means that the connection to the database has been established and the query has been executed successfully.
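Every command in this walkthrough repeats the same endpoint, certificate, and database options. As a convenience, you could wrap them in a small shell function. This is only a sketch: ydb_local is a hypothetical name and not part of the YDB CLI, and it assumes you run it from the directory that contains ydb_certs.

```shell
#!/bin/sh
# ydb_local is a hypothetical convenience wrapper, not part of the YDB CLI.
# It passes the connection options used in this walkthrough and forwards
# all remaining arguments to the ydb client unchanged.
ydb_local() {
    ydb -e grpcs://localhost:2135 \
        --ca-file "$(pwd)/ydb_certs/ca.pem" \
        -d /local \
        "$@"
}

# Example usage (the same connectivity check as above):
# ydb_local table query execute -q 'select 1;'
```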

Using the Yandex Query Language

Below are brief instructions on how to use the YQL syntax. You can learn more about the syntax and its usage examples in the YQL documentation.

Create a table using CREATE TABLE:

ydb -e grpcs://localhost:2135 --ca-file $(pwd)/ydb_certs/ca.pem -d /local scripting yql -s 'CREATE TABLE series (series_id Uint64, title Utf8, PRIMARY KEY (series_id));'

Verify that the table has actually been created using the database objects list command scheme ls:

ydb -e grpcs://localhost:2135 --ca-file $(pwd)/ydb_certs/ca.pem -d /local scheme ls

To view the properties of the table you’ve created, use the describe command:

ydb -e grpcs://localhost:2135 --ca-file $(pwd)/ydb_certs/ca.pem -d /local scheme describe series

Add data to the table using INSERT INTO:

ydb -e grpcs://localhost:2135 --ca-file $(pwd)/ydb_certs/ca.pem -d /local scripting yql -s 'INSERT INTO series (series_id, title) VALUES (1, "IT Crowd"), (2, "Silicon Valley"), (3, "Fake Series");'

Read data from the table using the SELECT statement:

ydb -e grpcs://localhost:2135 --ca-file $(pwd)/ydb_certs/ca.pem -d /local scripting yql -s 'SELECT * FROM series;'

Update data in the table using the UPDATE statement:

ydb -e grpcs://localhost:2135 --ca-file $(pwd)/ydb_certs/ca.pem -d /local scripting yql -s 'UPDATE series SET title="Fake Series Updated" WHERE series_id = 3;'

Delete data from the table using the DELETE statement:

ydb -e grpcs://localhost:2135 --ca-file $(pwd)/ydb_certs/ca.pem -d /local scripting yql -s 'DELETE FROM series WHERE series_id = 3;'

Delete the table using DROP TABLE:

ydb -e grpcs://localhost:2135 --ca-file $(pwd)/ydb_certs/ca.pem -d /local scripting yql -s 'DROP TABLE series;'

Shutting Down the Docker Container

Once complete, stop the Docker container:

docker kill ydb-local

Please see the documentation for other ways to get started with YDB.

What’s next?

Releasing the source code was a pivotal point for us, but we won’t stop there: this is only the beginning of our path towards becoming a global leader in this segment. At this stage, we plan to focus on expanding the analytical features of YDB. We are working on letting developers “cool down” their table data to reduce storage costs. We also plan to prepare drivers for popular benchmarks such as TPC-C and YCSB, so that we can run regular tests against YDB and compare it with equivalents. We plan to improve our data export and import tools and support the CDC functionality (generating event streams about database changes and writing them to several supported event systems to obtain a consistent state when processed at the receiver end). And, of course, making the system more efficient is a top priority for us.

If you have any questions, feel free to ask them on Stack Overflow with the ‘YDB’ tag. Check out the ydb.tech website to find out more about functionality and usage options, read the documentation, and download all the necessary files and tools.