userver 2.0: Major framework release for I/O-bound applications
Just over six months have passed since the last release of the C++ userver framework. During this time, we’ve accomplished a lot:
- Significantly optimized the framework’s performance, surpassing our main competitors in high-performance framework benchmarks
- Made configuration much easier
- Added install, Docker images, Yandex Cloud image, and DEB packages
- Added new functionality, including server middleware for HTTP and a YDB driver
- Changed to a new monthly release schedule and streamlined versioning
Top 15 in TechEmpower benchmarks
Over the past six months, we’ve optimized many parts of userver and moved higher into the Top 15, leaving many well-known competitors behind.
Even in these synthetic benchmarks, which have little in common with production workloads, we keep handling network fluctuations (unlike most of the top contenders) and database failures, while code written with userver remains intuitive, linear, and free of callbacks.
We’re eager to share some of our creative optimizations, so…
Secret features of the PostgreSQL protocol
When a client program uses libpq to pass queries to the PostgreSQL server, the libpq library requests a (D)escribe for each SQL query. As a result, the server transmits field names and other operational information. Such data can occupy over half the total size of the server’s response if the query returns many columns and few rows.
Since its first versions, userver turns all requests into Prepared Statements in a way that is transparent to the user and only sends query arguments in binary format to the server (while libpq uses a less compact text format). This saves network traffic and allows the server to execute queries a little faster.
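For illustration, here is a minimal sketch of what a typical query looks like in application code (assuming the usual storages::postgres::Cluster::Execute API; the table and column names are made up for the example). The user just passes the SQL text and the arguments, while the prepared-statement and binary-format machinery stays under the hood:
#include <cstdint>
#include <userver/storages/postgres/cluster.hpp>

namespace pg = userver::storages::postgres;

// Hypothetical helper over a made-up `users` table
pg::ResultSet FetchUser(pg::Cluster& cluster, std::int64_t id) {
  // The query text becomes a cached prepared statement, and `id`
  // travels to the server as a binary parameter
  return cluster.Execute(pg::ClusterHostType::kSlave,
                         "SELECT name, email FROM users WHERE id = $1", id);
}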
In version 2.0, the mechanism has been improved: the (D)escribe of the query is now cached, and the queries themselves go to the server without requesting metadata. As a result, we’ve roughly doubled network efficiency while decreasing server and client CPU load.
We are currently working to upstream our libpq extensions so that all frameworks and libraries that use libpq can benefit from this optimization.
Extended control over PostgreSQL pipelining
The PostgreSQL protocol allows multiple queries to be consolidated into a single network round trip, and we’ve been using this since version 1.0. As a result, if the application code issues three SQL queries: starting a transaction (BEGIN;), setting timeouts, and performing a SELECT/INSERT, then only one request is sent to the database server, saving two network trips.
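As a rough sketch (assuming the usual Cluster/Transaction API; the `events` table is made up), the three queries above look like ordinary sequential code, while the framework pipelines BEGIN, the timeout settings, and the first statement into one round trip:
#include <string>
#include <userver/storages/postgres/cluster.hpp>

namespace pg = userver::storages::postgres;

void RecordEvent(pg::Cluster& cluster, const std::string& name) {
  // BEGIN, the timeout settings, and the INSERT below are pipelined
  // into a single request to the server
  auto trx = cluster.Begin(pg::ClusterHostType::kMaster,
                           pg::TransactionOptions{});
  trx.Execute("INSERT INTO events(name) VALUES ($1)", name);
  trx.Commit();
}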
In the TechEmpower benchmark, there are situations where many SELECT queries are made in a row. For cases like this, we added a class named storages::postgres::QueryQueue to userver 2.0.
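Here is a rough sketch of how such a queue can batch independent SELECTs into one pipeline. The way the queue is obtained and the method names below (CreateQueryQueue, Push, Collect) are our approximation of the API, so check the reference documentation for the exact signatures:
#include <chrono>
#include <vector>
#include <userver/storages/postgres/cluster.hpp>
#include <userver/storages/postgres/query_queue.hpp>

namespace pg = userver::storages::postgres;

std::vector<pg::ResultSet> FetchUsers(pg::Cluster& cluster,
                                      const std::vector<int>& ids,
                                      pg::CommandControl cc) {
  // All pushed queries are sent to the server together, in one pipeline
  auto queue = cluster.CreateQueryQueue(cc);
  for (int id : ids) {
    queue.Push(cc, "SELECT name FROM users WHERE id = $1", id);
  }
  // Wait for all the results at once
  return queue.Collect(std::chrono::seconds{2});
}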
Optimization of lock add instructions
Now let’s talk about low-level operations. CPUs provide atomic instructions that perform memory operations atomically. For instance, if you want to count the number of requests currently being handled by the server, there should be an atomic variable somewhere, and each new request should modify it:
#include <atomic>
#include <cstddef>

struct Request;

std::atomic<std::size_t> requests_count{0};

void DoProcess(const Request& request) noexcept;

// Called concurrently from a multitude of system threads
void Process(const Request& request) {
  ++requests_count;  // atomic increment, compiles to a lock add on x86
  DoProcess(request);
  --requests_count;  // atomic decrement
}
Atomic instructions are the most basic low-level primitives, and most high-level synchronization primitives are built on them and on system calls: mutexes, semaphores, condition variables, RCU, and more.
However, these instructions start to slow down under heavy load. An innocent ++requests_count; turns into a lock add machine instruction, and if different processor cores attempt to execute it simultaneously, they queue up and execute it sequentially. As a result, some CPU cores sit idle, patiently waiting for the others to complete their work. On larger systems, certain workloads can push the latency of such instructions into the microsecond range.
A robust framework should provide users with detailed information about its status, enabling them to quickly identify and troubleshoot production environment issues. In userver, we have many metrics… lots and lots of metrics, in fact. And most of them are backed by atomic variables. However, given our processing speeds and workloads, the overhead of atomic operations shows up clearly in performance analyses (perf and flame graphs).
That’s where rseq (restartable sequences) steps in. You can have the system keep an ordinary, non-atomic variable per CPU core, let each core update only its own copy, and fall back to an atomic operation if the restartable sequence fails. Oddly enough, this approach gives a performance gain even on a single core, reducing the cost of incrementing a variable from ~4.42 ns to ~1.67 ns. The effect is more pronounced with multiple threads: on 64 threads, the time drops from 411 ns to 7 ns.
If you want to view the code or even use a similar mechanism in your own project, please see the userver source code.
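To illustrate the idea in plain C++, here is a simplified sharded counter. It is not the actual rseq-based code from userver (which uses a true per-CPU slot and a restartable, non-atomic increment), but it shows why spreading increments across cache lines removes most of the contention:
#include <array>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>

// Simplified illustration only: per-shard counters that are summed on read
class ShardedCounter final {
 public:
  void Increment() noexcept {
    // Pick a shard by thread id; rseq instead uses the exact CPU index
    // and a restartable non-atomic increment
    const auto index =
        std::hash<std::thread::id>{}(std::this_thread::get_id()) % kShards;
    shards_[index].value.fetch_add(1, std::memory_order_relaxed);
  }

  std::size_t Read() const noexcept {
    std::size_t total = 0;
    for (const auto& shard : shards_) {
      total += shard.value.load(std::memory_order_relaxed);
    }
    return total;
  }

 private:
  static constexpr std::size_t kShards = 64;
  struct alignas(64) Shard {  // one cache line per shard
    std::atomic<std::size_t> value{0};
  };
  std::array<Shard, kShards> shards_{};
};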
Other optimizations
Of course, these are not all the optimizations, and we can’t cover everything in one article. We use a coarse-grained clock, have many tricks to speed up exception handling (we’ll even share them at C++ Russia), use image rendering techniques to create synchronization primitives, apply asymmetric fences, enable compilers to build containers for us (for example, here’s a story about utils::TrivialBiMap), and do many other interesting things.
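As an example of the “compilers build containers for us” point, here is roughly what utils::TrivialBiMap looks like in use (the header path and method names are our best recollection, so check the userver docs): the mapping is described by a constexpr lambda, and the compiler generates the lookup code.
#include <userver/utils/trivial_map.hpp>

// The mapping below is a made-up example
constexpr userver::utils::TrivialBiMap kStateNames = [](auto selector) {
  return selector()
      .Case("idle", 0)
      .Case("running", 1)
      .Case("stopped", 2);
};

// kStateNames.TryFind("running") yields an optional containing 1;
// lookups in the opposite direction are available as well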
At the same time, we’re working on optimizing RAM consumption. In six months, we reduced memory consumption for certain types of workloads by hundreds of megabytes. Measurements from users in our support chat show that we consume 4x less CPU resources and a bit less RAM than a similar service that is written in Go.
Simplified configuration
Configuring modern systems is a complex task. Many server configurations can take up multiple screens and be dispersed across numerous files. We are no exception.
But you can often find good defaults that are suitable for most applications or even completely avoid setting parameters by calculating them on the fly.
In userver 2.0, we made significant efforts to streamline this process. The size of the tutorials has been almost halved thanks to the streamlined configuration.
The most significant change is the redesign of dynamic configurations. With userver, you can create configurations that can be modified on the fly, without restarting the service. Dynamic configurations are often used as kill switches, hold timeouts for various handlers and requests, or toggle experimental features on and off.
Our framework is full of such configurations, and new ones are added often. The defaults are usually suitable for most applications and require no modification when prototyping solutions or under low workloads.
As a result, the revised dynamic configurations now have default values embedded in the code, which can be viewed from the command line. A separate fallback file is no longer required, which means all service templates have been simplified (see the example).
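As a sketch of what that looks like in code (assuming the dynamic_config::Key API; the config name and default below are made up for illustration), the default value now lives right next to the key definition, and the current value can be re-read on the fly:
#include <userver/dynamic_config/snapshot.hpp>
#include <userver/dynamic_config/source.hpp>

// Hypothetical config: name and default are for illustration only
inline constexpr userver::dynamic_config::Key<int> kSampleDbTimeoutMs{
    "SAMPLE_DB_TIMEOUT_MS", 200};

int GetDbTimeoutMs(const userver::dynamic_config::Source& source) {
  // The snapshot reflects updates applied without restarting the service
  const auto snapshot = source.GetSnapshot();
  return snapshot[kSampleDbTimeoutMs];
}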
Docker, install, .deb, and Yandex Cloud
The first issue that C++ developers encounter when using any framework is compilation. There are numerous compilers and operating systems, hundreds of compilation options, thousands of versions of dependent libraries… All this makes adding required dependencies and assembling a project quite challenging.
To enhance the user experience, over the span of six months we added the ability to build packages for distributions, added installation support, and created ready-to-use images for development. These include:
- ghcr.io/userver-framework/ubuntu-22.04-userver-pg:latest: An image with a preset framework, an extended set of repositories for developers, and the PostgreSQL server. In other words, everything you need for assembling, debugging, and prototyping. All service templates have been migrated to this image, and the image is rebuilt weekly with the latest version of userver.
- ghcr.io/userver-framework/ubuntu-22.04-userver-base:latest: An image with build dependencies. Designed for those who prefer to include userver as a subdirectory in CMake.
- userver on Yandex Cloud Marketplace: Create a virtual machine with userver in Yandex Cloud in a couple of clicks. Designed for those who prefer to develop their solutions on high-performance cloud-based hardware.
New functionality
Over six months, we’ve improved many things in the framework.
The PostgreSQL driver can now automatically calculate the number of connections for the current cluster pod and no longer requires complex configuration in most cases. We also added support for LISTEN/NOTIFY, so you can subscribe to events and notifications via the PostgreSQL database.
We further improved diagnostics in many parts of the framework to make development even easier and more intuitive, and added more documentation and use cases; many parts of the framework now rely on separate middleware components for more flexible configuration. A great number of new features, a landing page, bug fixes, and enhancements have been added by external contributors. Thank you so much! You are awesome!
By the way, userver now features a YDB driver and an alpha version of a Kafka driver! Which brings us to the next topic…
Release cycle and future plans
We realized (with a little push from our users) that releasing every six months is not convenient, while a three-digit version number in SemVer is neither trendy nor reasonable.
Therefore, we plan to make releases almost monthly. Version 2.1 is coming soon, followed by version 2.2, … In about six months, when the number of changes is large enough, we’ll release version 3.0.
We’ll certainly continue working on the framework: we have many users outside Yandex, including abroad. As we can see, people find the framework interesting, which encourages us to continue our work.
Our immediate plans include finalizing the Kafka driver, significantly expanding the documentation on Kafka and YDB, and improving the tutorial. In addition, we’re about to add support for code generation that converts JSON Schema definitions into C++ parsers, serializers, and structures. We also have ideas on how to further improve performance.
That’s it for today! Stay tuned!