High concurrency analytics with GoodData

Today, I am going to share more details about the very heart of the GoodData platform, the Extensible Analytics Engine, which we familiarly call the XAE. This core component performs every insight execution on our platform.

Let me demonstrate the power of the XAE through one of our top customers, who provides analytics to more than 29 thousand companies. This large-scale analytics solution involves 22.7 TB of data analyzed by 880k users. The data and user loads are not evenly spread: the largest of these companies alone serves almost 750 GB to 13.5k users.

The analytic insights are embedded inside a responsive application that is used every day. Adoption (usage) is very high by analytics standards: as of November 1st, 2018, more than 28 thousand of the 29 thousand companies actively use the embedded analytics. This translates to 4,554,678 insight executions and almost 20 million queries per day (one insight executes roughly 4 queries on average), peaking at 15k insights per minute. The total time to serve these queries is 8,217,294 s, roughly 2,300 hours of computation.
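Here is a quick back-of-the-envelope check of those numbers (a small sketch; the per-query average on the last line is derived from the figures above, not a separately reported metric):

```python
# Sanity-check the daily numbers quoted above.
insights_per_day = 4_554_678
queries_per_insight = 4                        # "roughly 4 queries on average"
total_query_seconds = 8_217_294

queries_per_day = insights_per_day * queries_per_insight
print(f"{queries_per_day:,} queries/day")                 # ~18.2M, i.e. almost 20 million
print(f"{total_query_seconds / 3600:,.0f} hours/day")     # ~2,283, i.e. roughly 2,300 hours
print(f"{total_query_seconds / queries_per_day:.2f} s")   # ~0.45 s of compute per query
```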

When analytics is embedded inside a daily-use application, users become impatient if a response takes more than a second. This is a very different expectation from the old, slow analytical dashboards of yesterday!

Remember, this is just one GoodData customer. There are many more customers with similar bandwidth requirements, and our platform serves them all in a multi-tenant way. To make things even more interesting, our customers double the amount of data and users every 6–9 months on average.

I can hear you asking: HOW (TF) are we able to serve all of these analytical insights with such low latency and high concurrency? Let's dive into the details!

Firstly, we save a lot of CPU cycles through sophisticated caching. Our XAE engine is based on some serious math: it decomposes every multidimensional query into core algebraic operations that are hierarchically organized in a tree (we call it a Query Tree, or QT). We cache the result of each base operation represented by a node in the QT. The XAE is able to reuse more than 50% of these cached results across executions of individual insights. This alone saves us roughly half of the query bandwidth, reducing the 2,300 hours of computation time to approximately 1,100 hours for that example customer. Stay tuned, we are going to write a separate post on the XAE's algebraic magic soon.
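To give a flavor of the idea, here is a minimal sketch of node-level query-tree caching (the class and function names are illustrative, not the real XAE internals): each node's cache key is a structural hash, so an identical sub-tree showing up in two different insights resolves to the same cached result.

```python
import hashlib

CACHE = {}  # structural hash of a QT node -> its computed result

class QTNode:
    """One algebraic operation in a (simplified) query tree."""
    def __init__(self, op, children=(), table=None):
        self.op = op              # e.g. "scan", "filter", "join", "aggregate"
        self.children = list(children)
        self.table = table        # leaf nodes reference a physical table

    def key(self):
        # Hash the operation plus the keys of all children, so equal
        # sub-trees across different insights share one cache entry.
        payload = "|".join([self.op, self.table or ""] +
                           [child.key() for child in self.children])
        return hashlib.sha1(payload.encode()).hexdigest()

    def execute(self, compute):
        k = self.key()
        if k in CACHE:                          # cache hit: reuse the sub-result
            return CACHE[k]
        inputs = [c.execute(compute) for c in self.children]
        CACHE[k] = compute(self.op, self.table, inputs)   # cache miss: compute & store
        return CACHE[k]
```

The real XAE caches are of course far more elaborate, but the memoization principle is the same.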

QT examples — can you see a peacock or a dog on a leash? ;-)

The granular XAE caching is great, but it is still not enough to achieve the high concurrency and low latency that our customers require.

Tomáš Janoušek — one of the XAE authors as captured by lolcommits

We don’t believe in miracles and silver bullets, so let me first rule out what you are probably guessing. Am I right? Do you think we just plug in some super-duper analytical database that separates compute from storage and we are done? That would be too easy. Unfortunately, these new unicorn products melt down at considerably lower concurrencies (around two dozen users) and can’t get anywhere close to the few-millisecond latencies that we need. Moreover, their “throw more hardware at it” approach does not scale economically with a large concurrent user base. Long story short, these analytical databases and warehouses are designed for a few super-fast queries over large amounts of data, not for high concurrency and low latency. Challenge me and tell me I’m wrong in the comments; I always want to learn something new!

In order to achieve the extremely high levels of concurrency and low latency that we require, we do the exact opposite of the cloud data warehouses: we move the data as close to the CPU as possible and bet on large-scale, shared-nothing, parallel execution. We distribute the data from each of the 29k companies across roughly 500 hardware nodes, and we make sure we squeeze as much juice from each of them as possible. The CPU utilization rarely drops below 85%, and the storage latency is very low because we use SSDs.
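For illustration only, the placement can be thought of as something like the sketch below; the actual assignment logic isn't covered in this post, and the hash-based scheme, node names, and replica count here are purely hypothetical.

```python
import hashlib

NODES = [f"node-{i:03d}" for i in range(500)]   # ~500 hardware nodes in the pool

def nodes_for_tenant(tenant_id, replicas=2):
    """Map one tenant (one of the ~29k companies) to the nodes holding its data,
    so every query for that tenant runs where the data already sits."""
    digest = int(hashlib.sha256(tenant_id.encode()).hexdigest(), 16)
    first = digest % len(NODES)
    # Keep replicas on neighbouring nodes; a shared-nothing query never
    # needs to leave this small shard set.
    return [NODES[(first + r) % len(NODES)] for r in range(replicas)]

print(nodes_for_tenant("acme-corp"))
```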

Compute node with 72 CPUs — 24x7 average 87% utilization!

GoodData uses large-scale virtualization (a highly customized OpenStack) on top of pools of physical compute nodes. Every node has the same architecture and the same CPU, RAM, and storage setup. Connectivity between compute nodes is provided by a simple, fast, low-latency switched Layer 2 (Ethernet) network with 2x25 Gbit/s links. All L3 TCP/IP routing and security is handled by a local gateway available at each compute node.

GoodData compute node pool

One resource pool contains 15,240 CPUs, 120 TB of HugePages RAM, and 2 PB of storage space. Each compute node is based on the Intel Xeon Gold 6154 CPU, which has 18 physical cores and sustains full turbo frequency across all cores under integer load, and carries 740 GB of HugePages RAM. In total, the resource pool delivers an impressive 335,710 in the SPECint2006 rate benchmark (Standard Performance Evaluation Corporation).

All physical connections use dedicated data paths: the NUMA topology is respected, and the SAS/SATA cabling runs without expanders. Each SSD device has its own SAS/SATA data line and is attached via PCIe directly to its local NUMA node and that node's local memory modules. This ensures the best and most predictable IO throughput.

We pay a lot of attention to low-level BIOS and operating system configuration to optimize for virtualization: for example, IO threads, CPU pinning, CPU power management, and task scheduling.
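As a tiny taste of what CPU pinning means in practice, here is a user-space sketch using Linux's scheduler affinity API (the core IDs are made up, and our production setup pins vCPUs at the hypervisor level rather than with this call):

```python
import os

PINNED_CORES = {0, 1, 2, 3}   # hypothetical cores reserved for one worker process

def pin_current_process():
    """Restrict this process to a fixed set of cores so the scheduler never
    migrates it across sockets/NUMA nodes in the middle of a query."""
    os.sched_setaffinity(0, PINNED_CORES)          # pid 0 == the calling process
    print("running on cores:", sorted(os.sched_getaffinity(0)))

if __name__ == "__main__":
    pin_current_process()   # Linux-only; the call is not available elsewhere
```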

The compute node OS is fine-tuned for non-blocking parallel IO. The Linux ecosystem gives us a lot of flexibility to pick the most performant block multi-queue (blk-mq) storage drivers. A single node tops out at 260,000 IO operations per second with fully encrypted storage.
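To make the "parallel IO" part concrete, the sketch below fires random reads from several threads and reports a rough operations-per-second figure. It is only a toy: real measurements are done with a purpose-built tool such as fio using O_DIRECT, whereas this version will mostly hit the page cache. The file path, block size, and thread count are arbitrary examples.

```python
import os, random, threading, time

PATH, BLOCK, THREADS, SECONDS = "/tmp/testfile", 4096, 8, 5
counts = [0] * THREADS

def worker(idx):
    fd = os.open(PATH, os.O_RDONLY)
    size = os.fstat(fd).st_size
    rng = random.Random(idx)
    deadline = time.monotonic() + SECONDS
    while time.monotonic() < deadline:
        # Pick a block-aligned offset and issue one read == one IO operation.
        offset = rng.randrange(0, max(size - BLOCK, 1)) // BLOCK * BLOCK
        os.pread(fd, BLOCK, offset)
        counts[idx] += 1
    os.close(fd)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("approx ops/s:", sum(counts) // SECONDS)
```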

Summary

We’ve learned our lessons hosting a fully managed analytics platform for 11 years. Providing this super-fast analytics 24x7 with a high SLA is hard and expensive, so we constantly innovate and evaluate new technologies at both the software and hardware levels. So far we haven’t found anything better than thorough optimization and massive parallelism.

I’m convinced that widespread data partitioning that moves the data as close to the CPU as possible, combined with thorough optimization at the HW and OS levels, is the only way to achieve this performance, scale, and cost efficiency.

This approach is an order of magnitude better, in both performance and cost, than any public-cloud-based solution or commercially available component we have evaluated.

Do you do something similar and have different experiences? Let me know; I always like learning new things!