OS Level Latency Optimization on Forex Systems

Published in Swissquote Tech Blog, Nov 27, 2020

Hi, I’m Raphaël, a Senior System Engineer at Swissquote working in our SRE team.

I will present the results of my work on optimizing our GNU/Linux systems to reduce jitter and latency on our Forex trading applications, as well as the choice and configuration of the JVM.

The scope of the project was to find a setup that optimizes the CPU/memory response times of our critical Forex applications by tuning the app itself, the BIOS, the kernel and the OS configuration.

I will walk you through the various benchmarks and experiments I ran to get this result. Keep in mind that this covers only CPU/memory optimization and has consequences for the rest of the system; I will cover those at the end of the article.

What we were able to achieve (TL;DR)

Results represent the max peak jitter time, and the summary is based on the hiccups* of the JVM.

Overall winner: max jitter reduced by a factor of 38 using an optimized OS and the Zing v11 LTS JVM.

*Hiccup measurement: how long it takes a separate application thread to do absolutely nothing.

Thanks to Cédric Munger, Gwenaël Yap and the Unix team for their help and integration tests.

For the whole technical part, please see this GitHub repo.

Objectives

We needed to guarantee to a high degree that jitter will not severely impact our Forex applications.

The first vector is CPU & Memory latency. For every modification I ran tests and benchmarks to determine its impact on the underlying system.

Jitter can come from many factors, so the research was done as follows.

Research: Read articles & papers on low latency system optimization
Baseline: Run benchmarks on our default Ubuntu installation, determine current latency/performance/throughput.
BIOS: Modify BIOS parameters based on official manufacturer documentation (server, CPU, hardware)
Kernels: Compare default generic kernel and low latency kernels.
OS & HW: Tweak parameters needed to reduce latency at the system, BIOS & Kernel levels.
JVM: Compare different GCs and configurations, and finally the JVM itself, with Zulu v13 and Zing v11 LTS

Benchmark Tools Used

Many tools exist for system performance benchmarking. Here are the ones I used in this process:

Interbench is designed to measure the effect of changes in Linux kernel configuration such as CPU, I/O scheduler and filesystem options.
More information here

Cyclictest accurately and repeatedly measures the difference between a thread’s intended wake-up time and the time at which it actually wakes up in order to provide statistics about the system’s latencies. It can measure latencies in real-time systems caused by the hardware, the firmware, and the operating system.
More information here

SysBench is a modular, cross-platform and multi-threaded benchmark tool for evaluating OS parameters that are important for a system running a database under intensive load.
More information here

The Phoronix Test Suite is a testing and benchmarking platform that allows for carrying out tests in a fully automated manner from test installation to execution and reporting.
More information here

jHiccup is an open source tool designed to measure the pauses and stalls (or “hiccups”) associated with an application’s underlying Java runtime platform. The new tool captures the aggregate effects of the Java Virtual Machine (JVM), operating system, hypervisor (if used) and hardware on application stalls and response time.
More information here
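To make the idea of a "hiccup" concrete, here is a minimal sketch of the measurement principle: a thread asks to sleep for a fixed interval and records how much later than requested it actually wakes up. This is not jHiccup's implementation, only an illustration in Python; the function names `measure_hiccups` and `summarize`, the sample count and the 1 ms interval are assumptions chosen for the example.

```python
import time


def measure_hiccups(samples=200, interval_s=0.001):
    """Return the wake-up overshoot, in milliseconds, of each idle sleep."""
    overshoots_ms = []
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(interval_s)  # the thread does "absolutely nothing"
        elapsed = time.perf_counter() - start
        # Anything beyond the requested interval is delay added by the
        # runtime, the OS scheduler, the hypervisor or the hardware.
        overshoots_ms.append(max(0.0, elapsed - interval_s) * 1000.0)
    return overshoots_ms


def summarize(overshoots_ms):
    """Return (median, max) of the recorded overshoots."""
    ordered = sorted(overshoots_ms)
    return ordered[len(ordered) // 2], ordered[-1]


if __name__ == "__main__":
    median_ms, max_ms = summarize(measure_hiccups())
    print(f"median hiccup: {median_ms:.3f} ms, max hiccup: {max_ms:.3f} ms")
```

The max value is the one that matters for jitter: a single long stall is invisible in the median but is exactly what a trading application feels.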

Tests & Benchmarking

Test Naming Conventions

Default-config-12cpus → The default server installation for Ubuntu 18.04 LTS provided for our production machines.
KernelLL-optimized-12cpus-node1 → Optimized server installation (configuration in the GitHub repo), using the low latency kernel.
KernelGeneric-optimized-12cpus-node1 → Optimized server installation with the same configuration, using the generic kernel.

Full Test Results Documents

For the sake of brevity I will not reproduce all the results here but rather link to PDFs with the detailed results:

Interbench, Cyclictest, SysBench: Low latency stats — FX.pdf
Phoronix test suite: Phoronix test suite results.pdf

The most striking results are obtained when the system is loaded. According to these results, it seems possible to guarantee that the latency stays roughly the same whether the system is heavily loaded or not, and to prevent excessive jitter!

The tests confirm that while the performance of the system is decreased, the latency is drastically reduced, which is what we want for our FX applications.

Results by Test

Benchmark result with my custom script focused on the CPU/Latency
*Loaded = simulated whole system load of around 250–280

Phoronix Test Suite Runs

With the short summaries here, we can see the most important impact on performance for the file system:

Write latency impact on disk -59.7%

Write impact on disk -62.29%

R/W impact on disk -61.37%

Global stats

Global stats from Phoronix show that the winner for the highest performance is not the one optimized for latency:

WINS:
default-config-12cpus:                 38   [62.3%]
kernelgeneric-optimized-12cpus-node1:  20   [32.8%]
kernel-ll-optimized-12cpus-node1:      3    [4.9%]

LOSSES:
kernel-ll-optimized-12cpus-node1:      41   [67.2%]
default-config-12cpus:                 13   [21.3%]
kernelgeneric-optimized-12cpus-node1:  7    [11.5%]
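The bracketed percentages can be checked from the counts alone: in both lists the counts sum to 61, and each percentage is that configuration's share of the total. The short sketch below redoes the arithmetic; the helper name `shares` is an assumption made for this example, and the counts are copied from the Phoronix summary above.

```python
def shares(counts):
    """Return each entry's share of the total, in percent, to one decimal."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


wins = {
    "default-config-12cpus": 38,
    "kernelgeneric-optimized-12cpus-node1": 20,
    "kernel-ll-optimized-12cpus-node1": 3,
}
losses = {
    "kernel-ll-optimized-12cpus-node1": 41,
    "default-config-12cpus": 13,
    "kernelgeneric-optimized-12cpus-node1": 7,
}

for label, table in (("WINS", wins), ("LOSSES", losses)):
    print(label)
    for name, pct in shares(table).items():
        print(f"  {name}: {pct}%")
```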

External OpenBenchMarking Kernel Tests

Benchmarking of some different Ubuntu 17.10 Kernels (Generic, lowlatency and liquorix)

Ubuntu 17.10 Kernel Tests January 2018

Tuning impact on JVM

Now that we have tuned the lowest levels, up to the kernel, we’ll try to improve the latency of the application itself.

Java app analysis

First, let’s gather the current numbers. With the default OS installation, I set up the test as follows:

Default Ubuntu 18.04 LTS installation
App affinity set to CPUs 12 to 23 (the second physical CPU)
Default ‘nice’ (priority) of the application set at 0
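As an illustration of what "affinity set to CPUs 12 to 23" means in practice, here is a minimal sketch that pins the current process to a CPU range from Python. It is not the script used for these tests; `cpu_range` and `pin_current_process` are hypothetical helper names, and `os.sched_setaffinity` is only available on Linux, so the sketch returns `None` elsewhere or when the requested CPUs do not exist on the machine.

```python
import os


def cpu_range(first, last):
    """Return the set of CPU ids from first to last, inclusive."""
    return set(range(first, last + 1))


def pin_current_process(cpus):
    """Pin this process to the given CPUs and return the resulting affinity.

    Returns None when the Linux-only affinity API is missing or when the
    kernel rejects the request (for example, CPUs that are not present).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    try:
        os.sched_setaffinity(0, cpus)  # pid 0 means "the calling process"
    except OSError:
        return None
    return os.sched_getaffinity(0)
```

Calling `pin_current_process(cpu_range(12, 23))` at start-up would restrict the process to the same twelve CPUs as in the test above, much like launching it through `taskset`.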

The maximum jitter that we see is around 32ms over ~2.5h.

If we zoom in between two major jitter peaks, the max we see is between 1.5ms and 1.9ms, often at ~1.25ms. Time-frame: ~10min.

Zing

Instead of the default JVM we currently use, we’ll try a JVM specifically tailored for performance. As its vendor describes it:

Zing® is a better JVM with better metrics that is certified fully compliant with the Java SE 11, 8, or 7 specification. Zing enables your production applications to operate in business real time.
Zing Java for the Real Time Business

It’s a business solution and needs a license.

Testing Zing

We will first test this solution with the default OS installation, meaning with an unoptimized server, and we will then move on to the optimized installation.

We can see a lot of improvement just by switching the JVM to Zing, which decreased the max jitter to ~7.9ms: that is -75% jitter, a factor of 4 compared to the non-optimized setup. With my optimized configuration on top of that, it is -97.4% jitter, a factor of 38 (compared to the highest jitter value).

The best we got is -98.6%, a factor of 75 (using the median jitter) compared to the default installation.

Non-optimized + Zing (6 hours)

Max jitter peak (yellow) at ~7.9ms (-75.31% compared to the non-optimized setup, or a factor of 4)
The max (yellow) without the huge jitter peaks is usually between 1ms and 3ms
The samples (blue) show some dispersion, between ~938/ops and ~958/ops

Non-optimized + Zing (2 hours)

Max jitter peak (yellow) at ~3.4ms

The best result - Optimized server + Zing (6 hours)

The max (yellow) is incredibly stable, with hardly any jitter, and peaks at around 780us (-97.56% compared to the non-optimized setup, or a factor of 41)
The samples (blue) show almost no dispersion, between ~949/ops and ~951/ops
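The percentages and factors quoted for the two 6-hour views follow directly from the ~32ms max jitter measured on the default installation. The sketch below redoes that arithmetic; the helper name `reduction` is an assumption made for this example, and the inputs are the approximate values read from the graphs.

```python
def reduction(baseline_ms, observed_ms):
    """Return (percent change, improvement factor) of observed vs. baseline."""
    percent_change = (observed_ms - baseline_ms) / baseline_ms * 100.0
    factor = baseline_ms / observed_ms
    return round(percent_change, 2), round(factor, 1)


BASELINE_MS = 32.0  # max jitter on the default installation

# Zing on the non-optimized server: max peak at ~7.9 ms
print(reduction(BASELINE_MS, 7.9))
# Zing on the optimized server: max peak at ~780 us
print(reduction(BASELINE_MS, 0.78))
```

With these inputs the helper gives -75.31% (a factor of about 4) and -97.56% (a factor of about 41), matching the figures in the two lists.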

The best result - Optimized server + Zing (2 hours)

The max jitter in the 2h view is at ~340us (roughly -99% compared to the non-optimized server with Zulu, i.e. divided by ~100)

Drivers analysis and testing options

Legends

#1 optimized server with perf enabled on the application (bad impact)
#2 optimized server without perf enabled on the application (good, but with a little jitter)
#3 optimized server with ZGC (Z Garbage Collector) (good result!)
#4 optimized server with the default kernel scheduler (SCHED_OTHER), the default priority (0), and the Shenandoah GC
#5 optimized server with the Shenandoah GC
#6 non-optimized server with the default system installation

Using ZGC

ZGC provided one of the best results (except for Zing), as you can see in #3 above. As far as quick wins go, this seems to be a good one if Zing is not an option for you.

Optimized server + Zulu JVM (v13) + ZGC (1h view)

The max jitter is around ~1.1ms compared with ~12ms to ~15ms with the default G1GC, which is a division by 10!
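For readers who want to try the same comparison, switching collectors on a JDK 13 build is a matter of JVM flags. On that release ZGC and Shenandoah were still experimental, so both need `-XX:+UnlockExperimentalVMOptions`. The sketch below only assembles the command line; the helper name `java_command` and the jar name `fx-app.jar` are placeholders invented for this example, not the real application or the launch script used in these tests.

```python
# Collector flags as they apply to JDK 13, where ZGC and Shenandoah
# are experimental features that must be unlocked explicitly.
GC_FLAGS = {
    "g1": ["-XX:+UseG1GC"],
    "zgc": ["-XX:+UnlockExperimentalVMOptions", "-XX:+UseZGC"],
    "shenandoah": ["-XX:+UnlockExperimentalVMOptions", "-XX:+UseShenandoahGC"],
}


def java_command(gc, jar="fx-app.jar"):
    """Build the java launch command for the chosen collector.

    The jar name is a hypothetical placeholder.
    """
    return ["java", *GC_FLAGS[gc], "-jar", jar]


for gc in GC_FLAGS:
    print(" ".join(java_command(gc)))
```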

Integrating it to Swissquote

Now that we know what the best configuration for our use-case is, how do we apply it to our production?

I tried to implement it inside our environment and inside an LXD container, which is what we use on the IT side.

On the application side, the best way is to let each application, through its own configuration, claim the CPUs it needs at start-up. Because we need to specify which cores to use and which priority is needed, there is some setup involved at the start of the application.

I tried to implement that on LXD, but it’s currently not possible (for a production application) to do that inside a container, and clearly not maintainable on the infrastructure side. There are several reasons for that; here is a short list:

We need to modify the default cpuset cgroup of LXD, and the problem is that LXD overwrites the cgroup config at every modification of the “daemon”. For example, if you install or stop another LXD container on the host, it will break the CPU affinity inside the container. LXD cannot see the isolated CPUs because these CPUs are hidden on the kernel scheduler side.
We cannot assign other cgroups inside a specific container, nor a sub-group of its own, since cpuset is also rewritten if the DB of LXD is modified
We need a privileged container (which is not an issue, but important to know)

So my recommendation is to do it on a bare metal machine only.

Conclusion

This setup is only for specific applications which need the lowest possible latency. It is not a setup for every application, for reasons I describe below.
The result is mostly based on the hiccups of the process, which are measured while your application is actually running: the measurement captures how long it takes a separate application thread to do absolutely nothing.

The main goal of this project was to eliminate, or reduce to a minimum, the jitter on our Forex applications and the possible freezes it could cause.

I was able to reduce the latency of the running process on CPUs with the isolated CPUs method and by setting some parameters in the BIOS, in the kernel, and the system.
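To check which CPUs are actually isolated on a machine, one option is to read the list the kernel exposes in sysfs. The sketch below assumes the isolation is done at boot (for example with the `isolcpus` kernel parameter, which this article does not spell out) and that the kernel provides `/sys/devices/system/cpu/isolated`; the helper names `parse_cpu_list` and `isolated_cpus` are invented for this example. The file uses the usual kernel CPU-list format, such as `12-23` or `0-1,4`.

```python
from pathlib import Path

# sysfs file listing isolated CPUs on kernels that expose it
ISOLATED_PATH = Path("/sys/devices/system/cpu/isolated")


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-1,4' into [0, 1, 4]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means no CPU in the list
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def isolated_cpus(path=ISOLATED_PATH):
    """Return the isolated CPU ids, or an empty list if unavailable."""
    try:
        return parse_cpu_list(path.read_text())
    except OSError:
        return []


if __name__ == "__main__":
    print("isolated CPUs:", isolated_cpus())
```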

With the parameters described above enabled, the latency was reduced as follows, with minimal jitter.

Kernel flavor

Winner: Generic kernel

We can clearly see that the optimized server installation with the generic kernel wins.

Why this difference between the low latency kernel and the generic one? It seems that the most important improvements from the real-time “preempt-rt” work were merged into the generic kernel, and we can use the flags and options out of the box to enable them, which is awesome!

Theoretical tests with my benchmark

On a loaded server (load of around 250–280), comparing the default configuration with the optimized configuration, the theoretical latency measured with the benchmark tools was reduced as follows:

The reduction of the max latency is huge. This result was not expected, and it’s good news!

But keep in mind that this is only a reduction of the max jitter, not an improvement in the performance or throughput of the machine, both of which are lower than on a default installation.

Our application without optimization

So, starting from the current default installation on java-13 with G1GC, moving to the optimized server installation reduces the max jitter by 45% to 84%, a factor of ~4. That’s just with the BIOS/OS/kernel improvements.

ZGC

By just changing the garbage collector to ZGC on an optimized server, we can reduce the jitter to ~1.2ms, which is -96%, a factor of ~25 compared to the default installation.

Zing JVM

Changing to the Zing JVM, the highest jitter peak is at 780us, which is -97.4%, a factor of ~38 compared to the default installation. All the optimizations added to the OS are even more striking with Zing.

Optimization Impact Summary

Results represent the max peak jitter time, and the summary is based on the hiccups* of the JVM.

*Hiccup measurement: how long it takes a separate application thread to do absolutely nothing.

Overall winner: max jitter reduced by a factor of 38 using an optimized OS and the Zing v11 LTS JVM.

Warning

On the other hand, there are side effects. Here are a few of them:

This increases the power consumption of the machine to the maximum, as the ACPI is set to the maximum performance scale.
This increases the CPU usage by a lot, as the CPU never goes into idle/sleep mode, which shortens its lifespan.
We lose between 59.70% and 62.29% of disk throughput. We could get some performance back with some tweaks, but that wasn’t the goal of this project, as we don’t need disk read/write.
Context-switching performance on the OS also drops by around 67.54%.

So, to stay focused on the objective: dedicate some applications to some specific CPUs only; if no application needs these CPUs, dedicate them.

Finally, latency optimization really depends on the application’s needs and is clearly specific. It is not a configuration to be applied on every machine. But it can be applied (in some of our cases) to other Forex applications which have the same need for low latency response and don’t need lots of disk R/W.

This work targeted CPU & RAM latency. Some R&D can still be done, such as network improvements, hugepage settings, and using numad and numactl for CPU affinity instead of the taskset currently used.

Thanks for reading my article, I hope it was as interesting for you to read it as it was for me to do this research.

Written by Raphaël Prétôt