Utilizing Intel® QuickAssist Technology to Enhance TLS Performance in Trendyol CDN

Tufan Karadere
Published in Trendyol Tech
Dec 18, 2023 · 11 min read

At Trendyol, we use a MultiCDN architecture to deliver content to our users in a fast, secure, and highly available way. This means we manage our own CDN infrastructure in addition to working with 3rd party CDNs. For more information about the general architecture of Trendyol CDN, or Girdap as we call it internally, along with several optimizations aimed at increasing overall efficiency, refer to two informative articles by Levent CENGIZ: Trendyol CDN — 1 and Trendyol CDN — 2.

When your infrastructure handles a substantial volume of traffic, as Trendyol CDN does, any performance improvement, even one that seems modest when tested with a small amount of data or traffic, takes on far greater significance and impact in the context of the overall traffic volume.

Intel® QuickAssist Technology is one of the important optimizations incorporated into Trendyol CDN to boost the TLS handshake performance of each individual CDN node.

This article aims to provide detailed information about the installation steps of Intel® QuickAssist Technology (Intel® QAT), focusing on the QAT Software (QAT SW) mode used to enable it in our Proof of Concept (PoC) environment, along with the results of the benchmark tests conducted.

General Information

Intel® QuickAssist Technology (Intel® QAT) improves the performance of cryptographic operations, such as symmetric and asymmetric encryption, authentication, and digital signatures (RSA, DH, ECC), through hardware acceleration.

There are primarily two modes of operation, depending on the platform:

qat_hw

qat_hw is hardware acceleration utilizing the QAT driver, for which one of the following hardware platforms is required:

  • Intel® Xeon® with Intel® C62X Series chipset (except C621A, which does not have QAT support)
  • Intel® Atom™ processor
  • Intel® Communications Chipset 8925 to 8955 Series (QAT accelerator adapter)

qat_sw

qat_sw is acceleration using the Crypto-NI instruction set available in 3rd Generation Intel® Xeon® Scalable Processors. This set adds new instructions, such as Vectorized AES and Integer Fused Multiply Add, on top of the Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI) that Intel® Xeon® Scalable Processors already have.
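Before installing the libraries below, it may help to confirm that the CPU actually exposes these extensions. A minimal check, assuming the flag names as Linux reports them (vaes, vpclmulqdq, avx512ifma):

# Sketch: look for the CPU flags behind Vectorized AES and Integer Fused Multiply Add
lscpu | grep -o -w -e vaes -e vpclmulqdq -e avx512ifma | sort -u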

To enable QAT support in 3rd Generation Intel® Xeon® Scalable Processors (qat_sw), the following libraries need to be installed:

  • Intel® Crypto Multi-buffer library (for Asymmetric PKE), available in the ipp-crypto repo
  • Intel® Multi-Buffer Crypto for IPsec Library, available in the intel-ipsec-mb repo
  • Intel® QuickAssist Technology (QAT) OpenSSL* Engine, available in the QAT_Engine repo

Installation

Installation was done following the Nginx HTTPS with Crypto-NI tuning guide, which was slightly dated but still easy to follow.

Crypto-NI

Crypto-NI consists of the IPP Cryptography Library and the Multi-Buffer Crypto for IPsec Library. They can be installed using the following steps:

IPP Cryptography Library

git clone --recursive https://github.com/intel/ipp-crypto.git
cd ipp-crypto
git checkout ipp-crypto_2021_5 # Different from the guide
cmake . -Bbuild -DCMAKE_INSTALL_PREFIX=/usr
cd build
make -j
sudo make install

Multi-Buffer Crypto for IPSec Library

git clone https://github.com/intel/intel-ipsec-mb.git
cd intel-ipsec-mb
git checkout v1.1 # Different from the guide
make -j SAFE_DATA=y SAFE_PARAM=y SAFE_LOOKUP=y
sudo make install
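At this point it can be useful to confirm that both libraries ended up on the default linker path. A minimal sketch, assuming the library names produced by these builds (libcrypto_mb and libIPSec_MB):

# Refresh the linker cache, then look for the freshly installed libraries
sudo ldconfig
ldconfig -p | grep -i -e crypto_mb -e ipsec_mb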

OpenSSL

The operating system package can be used as long as its version is 1.1.0 or later.
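A quick check before building the engine:

# Should report version 1.1.0 or later
openssl version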

QAT Engine

git clone https://github.com/intel/QAT_Engine.git
cd QAT_Engine
./autogen.sh
./configure --enable-qat_sw --with-qat_sw_install_dir=/usr # Different from the guide
make
sudo make install

After the installation, algorithms supported by qatengine can be seen using the following command:

$ openssl engine -t -c -vvvv qatengine
(qatengine) Reference implementation of QAT crypto engine(qat_sw) v0.6.11
[RSA, id-aes128-GCM, id-aes192-GCM, id-aes256-GCM, X25519, SM2]
[ available ]
ENABLE_EXTERNAL_POLLING: Enables the external polling interface to the engine.
(input flags): NO_INPUT
POLL: Polls the engine for any completed requests
(input flags): NO_INPUT
ENABLE_HEURISTIC_POLLING: Enable the heuristic polling mode
(input flags): NO_INPUT
GET_NUM_REQUESTS_IN_FLIGHT: Get the number of in-flight requests
(input flags): NUMERIC
INIT_ENGINE: Initializes the engine if not already initialized
(input flags): NO_INPUT

Tengine

We have used both tengine and nginx with the asynch_mode_nginx extension in our tests. The nginx installation guide can be found on the nginx website, and the asynch_mode_nginx installation guide can be found in the asynch_mode_nginx repo. Both nginx and tengine showed very similar results.

Installation

curl -L -O https://tengine.taobao.org/download/tengine-2.3.2.tar.gz
tar -xf tengine-2.3.2.tar.gz
cd tengine-2.3.2
./configure $(nginx -V 2>&1 | awk -F':' '/configure arguments:/ { print $2 }') --with-openssl-async --with-ld-opt='-Wl,-rpath=/usr/lib -lssl'
make
sudo make install

The nginx -V part in ./configure retrieves the existing compile parameters from the running nginx process. Your parameters might be different; the important part is --with-openssl-async, which enables OpenSSL async mode.
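To confirm the resulting binary was actually built with async support, the configure arguments can be inspected. The binary path below assumes the default /usr/local/nginx prefix; adjust it to wherever make install placed tengine:

# Should print --with-openssl-async if the build picked up the flag
/usr/local/nginx/sbin/nginx -V 2>&1 | grep -o -- '--with-openssl-async'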

Configuration

  • Add ssl_async on; to a server section in nginx.conf
  • Add the following content to /etc/ssl/openssl.cnf (a quick handshake check is sketched after the snippet):
openssl_conf = openssl_def
[openssl_def]
engines = engine_section
[engine_section]
qatengine = qatengine_section
[qatengine_section]
engine_id = qatengine
default_algorithms = ALL
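With both changes in place and tengine reloaded, a quick sanity check is to complete a handshake using the same cipher as in the benchmarks below. A minimal sketch ($server_ip is a placeholder for the test server):

# Expect a successful TLS 1.2 handshake with AES128-GCM-SHA256
openssl s_client -connect $server_ip:443 -tls1_2 -cipher AES128-GCM-SHA256 </dev/null 2>/dev/null | grep -e 'New,' -e 'Cipher'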

Benchmarks

Methodology

Our tests covered two areas: the algorithm speed tests and the actual tengine (and nginx) TLS handshake tests. The algorithm speed test is easy to conduct on the server itself, and the speed increase can easily be observed.

For the TLS handshake tests, the information at nginx+OpenSSL+QATEngine+QAT|Core Benchmarking Methodology has been used. In short, it involves using external servers to initiate multiple instances of the openssl s_time command towards the target server running nginx/tengine. Three different sets of benchmark tests have been done in total:

  • The first set consists of testing the server by running the benchmark script on the server itself. Although this might not be the best way to perform a benchmark, as server resources are also consumed by the benchmark script, it still helped us observe the speed increase provided by QAT SW in a relatively easy way. We used CPU affinity configuration to dedicate separate cores to tengine and to the script processes.
  • The second set of tests involved using 3 external servers with a total of 48k clients (16k clients per server).
  • The final set of tests involved simulating actual production load, using our own Ares test platform with more than 100 worker instances, simulating around 100k users trying to connect to the server at the same time.

Configuration

The base configuration used for asynch_mode_nginx and Tengine (without QAT enabled), with 64 workers (32 cores / 64 hyperthreads), is:

QAT disabled asynch_mode_nginx configuration:

worker_processes 64;
worker_cpu_affinity auto 11111111111111111111111111111111111111111111111111111111111111110000000000000000000000000000000000000000000000000000000000000000;
worker_rlimit_nofile 1048576;
events {
    worker_connections 8192;
    use epoll;
    accept_mutex on;
}

http {
    server {
        listen <IP_Address_of_the_test_server>:443 ssl reuseport backlog=131072 so_keepalive=off rcvbuf=65536 sndbuf=65536;
        keepalive_timeout 0s;
        ssl_verify_client off;
        ssl_session_tickets off;
        access_log off;
        ssl_session_timeout 300s;
        ssl_protocols TLSv1.2;
        ssl_ciphers AES128-GCM-SHA256;
        ssl_prefer_server_ciphers on;
        ssl_certificate server.crt;
        ssl_certificate_key server.key;

        location / {
            root html;
            index index.html index.htm;
        }
    }
}

QAT enabled asynch_mode_nginx configuration:

load_module modules/ngx_ssl_engine_qat_module.so;

ssl_engine {
    use_engine qatengine;
    default_algorithms ALL;
    qat_engine {
        qat_offload_mode async;
        qat_notify_mode poll;
        qat_poll_mode heuristic;
    }
}

worker_processes 4;
worker_cpu_affinity auto 11111111111111111111111111111111111111111111111111111111111111110000000000000000000000000000000000000000000000000000000000000000;
worker_rlimit_nofile 1048576;
events {
    worker_connections 8192;
    use epoll;
    accept_mutex on;
}

http {
    server {
        listen <IP_Address_of_the_test_server>:443 ssl reuseport backlog=131072 so_keepalive=off rcvbuf=65536 sndbuf=65536;
        keepalive_timeout 0s;
        ssl_verify_client off;
        ssl_session_tickets off;
        access_log off;
        ssl_asynch on; #### enable async mode
        ssl_session_timeout 300s;
        ssl_protocols TLSv1.2;
        ssl_ciphers AES128-GCM-SHA256;
        ssl_prefer_server_ciphers on;
        ssl_certificate server.crt;
        ssl_certificate_key server.key;

        location / {
            root html;
            index index.html index.htm;
        }
    }
}

For tengine too, we used a simple configuration with the addition of the ssl_async on; directive in the server block. Tengine reads the /etc/ssl/openssl.cnf file during startup, which enables qatengine. Thus, this file was renamed and tengine restarted to enable or disable QAT support, as sketched after the file content below.

/etc/ssl/openssl.cnf:

openssl_conf = openssl_def
[openssl_def]
engines = engine_section
[engine_section]
qatengine = qatengine_section
[qatengine_section]
engine_id = qatengine
default_algorithms = ALL
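A minimal sketch of this toggle in practice (the service name tengine is an assumption; use whatever unit or init script manages your instance):

# Disable QAT: hide the config that enables qatengine, then restart tengine
sudo mv /etc/ssl/openssl.cnf /etc/ssl/openssl.cnf.disabled
sudo systemctl restart tengine

# Re-enable QAT: restore the config and restart again
sudo mv /etc/ssl/openssl.cnf.disabled /etc/ssl/openssl.cnf
sudo systemctl restart tengine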

For parallel openssl s_time tests, a modified version of the benchmark script found on the methodology page was used.
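The exact script is not reproduced here, but the sketch below shows the idea it implements: launch many openssl s_time clients in parallel against the target and sum the connection counts they report. The cipher and defaults are illustrative, not the exact values from the script:

#!/bin/bash
# Simplified stand-in for the parallel s_time benchmark, not the exact script used.
SERVER=${1:?usage: $0 host:port [clients] [seconds]}
CLIENTS=${2:-100}
DURATION=${3:-40}
CIPHER="AES128-GCM-SHA256"   # TLS 1.2 cipher; TLS 1.3 suites would need -ciphersuites

for i in $(seq "$CLIENTS"); do
    # -new forces a full handshake per connection; -time bounds the run length
    openssl s_time -connect "$SERVER" -new -cipher "$CIPHER" -time "$DURATION" \
        > "/tmp/s_time.$i.log" 2>&1 &
done
wait

# Each log ends with a line like "NNN connections in M real seconds, ..."
TOTAL=$(awk '/connections in .* real seconds/ { sum += $1 } END { print sum }' /tmp/s_time.*.log)
echo "total connections: $TOTAL (~$((TOTAL / DURATION)) handshakes/s across all clients)"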

Results

Algorithm speed tests

For the algorithm speed tests, the openssl speed command was used.

Public key and key exchange algorithms (values are operations per second: sign/s, verify/s, and ops/s as reported by openssl speed):

openssl speed -elapsed $algo
openssl speed -elapsed -engine qatengine -async_jobs 16 $algo
+-----------------------+----------+-----------+----------+------------+
| public key algorithm  | sign     | QAT sign  | verify   | QAT verify |
+-----------------------+----------+-----------+----------+------------+
| rsa2048               |   1554.9 |    8867.2 |  54975.3 |   174442.4 |
| ecdsap256             |  38589.0 |  139024.4 |  12291.3 |    14252.0 |
| ecdsap384             |   1005.2 |   41650.4 |   1413.4 |     1351.6 |
+-----------------------+----------+-----------+----------+------------+
+-------------------------+----------+----------+
| key exchange algorithm  | ops      | QAT ops  |
+-------------------------+----------+----------+
| ecdhp256                |  18461.6 |  67283.2 |
| ecdhp384                |   1067.4 |  16924.8 |
| ecdhx25519              |    27722 | 153282.3 |
+-------------------------+----------+----------+

Cipher algorithms (values as reported by openssl speed, in 1000s of bytes per second):

openssl speed -elapsed -evp $algo
openssl speed -elapsed -evp -async_jobs 16 $algo
+-------------------+---------------+---------------+---------------+---------------+---------------+---------------+
| cipher algorithm  | 64b           | 1024b         | 16384b        | QAT 64b       | QAT 1024b     | QAT 16384b    |
+-------------------+---------------+---------------+---------------+---------------+---------------+---------------+
| aes-128-gcm       |  1,671,446.51 |  5,236,453.03 |  5,964,414.98 |  1,644,199.87 |  9,103,219.03 | 12,855,159.47 |
| aes-256-gcm       |  1,580,357.50 |  4,513,494.70 |  5,154,717.70 |  1,845,422.04 |  9,147,968.17 |  11,502,906.0 |
+-------------------+---------------+---------------+---------------+---------------+---------------+---------------+

TLS Handshake tests

These tests have been performed using the following command:

tls-benchmark.sh -r $server_ip:443 --cipher $cipher --number-of-clients $num_of_clients --time 40

TLS Handshake — 1 server (test server itself)

TLS 1.2 | AES128-GCM-SHA256:


+---------------+-----------------+-----------------+-----------------+-----------------+
| nginx workers |     1 client    |   100 clients   |   2000 clients  |   8000 clients  |
|               |  nginx |   QAT  |  nginx |   QAT  |  nginx |   QAT  |  nginx |   QAT  |
+---------------+--------+--------+--------+--------+--------+--------+--------+--------+
| 2             |    880 |    104 |   2243 |   6822 |   2470 |   6804 |   2812 |   7535 |
| 8             |    926 |    210 |   8615 |  21612 |   9518 |  24317 |   9029 |  26357 |
| 16            |    920 |    279 |  16939 |  21745 |  18530 |  42794 |  23442 |  45986 |
| 32            |    847 |    380 |  28961 |   9687 |  32710 |  53630 |  36791 |  57524 |
| 64            |    838 |    467 |  41314 |  10289 |  56948 |  94644 |  64636 | 103959 |
| 128           |    839 |    513 |  48772 |  11312 |  65018 |  92947 |  91332 | 112527 |
+---------------+--------+--------+--------+--------+--------+--------+--------+--------+

TLS Handshake — 3 external servers, 16k clients per server

TLS 1.3 | TLS_AES_256_GCM_SHA384

+----------------+--------+--------------+----------+---------------+
| nginx workers  | nginx  | nginx & QAT  | tengine  | tengine & QAT |
+----------------+--------+--------------+----------+---------------+
| 64             |  62449 |       145325 |    62650 |        141208 |
| 128            |  79442 |       175402 |    79743 |        174243 |
+----------------+--------+--------------+----------+---------------+

TLS 1.2 | AES128-GCM-SHA256

+----------------+--------+--------------+----------+---------------+
| nginx workers  | nginx  | nginx & QAT  | tengine  | tengine & QAT |
+----------------+--------+--------------+----------+---------------+
| 64             |  62449 |       145325 |    62650 |        141208 |
| 128            |  79442 |       175402 |    79743 |        174243 |
+----------------+--------+--------------+----------+---------------+

Ares results

Ares is our in-house testing platform, which can be used to test a wide variety of scenarios, including simulations of real-world workloads. The following tests have been conducted using the Ares platform to simulate real-world traffic on our production servers.

A configuration comparable to our production setup was used on the test server, with the exception of specifying and modifying the TLS version and cipher suite. When the test server experiences heavy load with QAT enabled, we observed a reduction in CPU load, an increase in the number of responses per second served, and a decrease in response times. Each of these improvements varied approximately between 25% and 50% compared to the same server with QAT disabled.

TLS 1.2 | AES128-GCM-SHA256

[Charts: CPU Load, Responses Per Second, Response Time]

TLS 1.3 | TLS_AES_256_GCM_SHA384

[Charts: CPU Load, Responses Per Second, Response Time]

Conclusion

Our PoC installation and benchmark tests, on a server with a hardware and software configuration comparable to a production server, have shown that enabling Intel® QuickAssist Technology significantly increases performance in cryptographic operations, with a specific focus on TLS handshakes. We observed a reduction in CPU load and response time, coupled with an increase in the number of responses per second, each improvement falling in the 25-50% range.

These effects have had the following practical consequences in production since we enabled QAT:

  • The total amount of CPU power required for handling TLS traffic was reduced to as low as half of what was needed when QAT was disabled. Considering that the CPU is one of the more expensive components of a server, if not the most expensive, we are now able to handle more traffic with less CPU, without any increase in costs.
  • We can respond much more quickly to the connection requests under heavy load.
  • We can handle more connection requests per second per server without scaling up the hardware.
  • Causing a denial of service on a server due to high CPU load, whether from legitimate traffic or otherwise, has become more difficult.

As mentioned previously, in the context of a high-traffic environment like Trendyol CDN, we continually look for areas that can be improved. Even optimizations that might seem small in the context of one server or a small amount of data are researched to assess their overall impact in real-world traffic scenarios. With that in mind, enabling QAT was one of the crucial optimizations that empowered us to use hardware resources more efficiently, resulting in a more performant architecture without additional costs.

Want to know more? Or contribute by researching it yourself?

We’re building a team of the brightest minds in our industry. Interested in joining us? Visit the pages below to learn more about our open positions.
