Building a High Performance Trading System in the Cloud

A discussion on why and how running a low-latency trading system in the cloud is feasible

Prerak Sanghvi
Proof Reading
Jan 6, 2022


A few months ago, we wrote a couple of blog posts detailing the low-latency algo trading system that we built from scratch in the cloud. The response to these posts has been nothing short of overwhelming. As of this writing, we have over 31K views on the posts and have received numerous inbound communications from interested customers, investors, candidates, vendors, journalists, and others. The conversations they have sparked have been wide-ranging, but at least half of them have included some version of the question: “In the cloud? Is that really workable?”

In this technical post, we want to go over why and how running a low-latency trading system in the cloud is feasible, and try to address a range of technical concerns.

If you have not already read our post “Proof Engineering: The Algorithmic Trading Platform”, I urge you to at least glance through it first so that this post makes more sense in context. Below are the high-level topics we will touch upon in this post:

  • Performance considerations
  • Resilience considerations
  • Cost considerations
  • General cloud advantages


Performance Considerations

Achieving low latency in the cloud

A lot of the low-latency engineers we heard from did not believe that it was possible to achieve low latency at all when running on cloud infrastructure. When talking about latency, a few things need to be straightened out upfront:

  • What do you mean by low latency? Definitions vary — it may range from sub-millisecond for a full-fledged electronic trading platform, to ~10 microseconds for a market-making/DMA/exchange trading system, to sub-microsecond for a market data processing component. For us, an agency electronic trading platform, we consider ~200 microseconds to be low latency, which is what our platform typically takes to “process an action” (see next point). Just for context, we’re aware of multiple bank algo trading platforms where the agency OMS/Algo system latency is measured in milliseconds (some of these are “legacy systems”, but not all).
  • Whenever someone mentions latency, your first response should be: “latency from what to what and how is it measured?” (the second one should be — “what percentile is that?”). Measuring latency in a distributed system is an art form in itself (or is it science? topic of a future post!), and there are just too many pitfalls. Latency can be measured as: “through latency” (the time it takes for a component to read an input and produce an output, measured using the component’s local clock — this misses any latency associated with queuing), “wire latency” (same as through latency, but measured using network taps — better), “round-trip latency” (where a natural round-trip pattern exists — much better, typically the most complete measurement), “one-way latency” (this typically involves using multiple clocks for measuring latency, and it is nearly impossible to get right; don’t use this and don’t believe anyone who does, other than if they’re halving round-trip latency). What this means is that our 100 microseconds may be the same as someone else’s 10 microseconds when we dive into what is being measured and how. For us, when we say ~200 microseconds, that’s the 90th-percentile round-trip latency between our sequencer and the OMS/Algo engine (note: this includes all business processing including order state management, algo spinning out child orders, and risk checks; median ~100μs, P99 ~400μs).
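
To make the round-trip measurement concrete, here is a minimal sketch (not our production tooling) of computing latency percentiles from matched send/receive timestamps captured on a single clock; the sample timestamps are made up:

```python
import numpy as np

def latency_percentiles(send_ns, recv_ns, percentiles=(50, 90, 99)):
    """Compute round-trip latency percentiles, in microseconds, from matched
    send/receive timestamps captured on the *same* clock (avoiding the
    clock-sync pitfalls of one-way measurements)."""
    rtt_us = (np.asarray(recv_ns) - np.asarray(send_ns)) / 1_000.0
    return {f"p{p}": float(np.percentile(rtt_us, p)) for p in percentiles}

# Example: timestamps recorded in nanoseconds, e.g. via time.monotonic_ns()
send = [0, 1_000_000, 2_000_000]
recv = [95_000, 1_210_000, 2_180_000]
print(latency_percentiles(send, recv))  # {'p50': ..., 'p90': ..., 'p99': ...}
```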

Here are some of the decisions we made or things we learned when building our low-latency system in AWS:

  • We run our system on dedicated VM instances (”dedicated” describes the AWS Tenancy type). This costs more but ensures that we don’t compete for resources with other tenants (in fact, no other customers share the underlying hardware). We do not use dedicated hosts as they require more planning and reduce flexibility when setting up. We do not use Kubernetes or containers for the trading system as they negatively impact compute performance as well as network latency between instances (we do use containers for our UX system).
  • All things being equal, we found that running a small number of large instances is better than running many small instances. Running multiple applications on a single large (e.g. 48 vCPU, 96GB) server gives us multiple benefits: (1) minimal latency between those applications (2) higher overall network bandwidth available (3) higher EBS bandwidth available. Of course, you lose all of the applications when this server fails, so this only works if the system has been divided into redundant partitions and you’re ok with losing any individual partition.
  • We found AWS EC2 placement groups to be useful. For the highest network performance, place instances in a cluster placement group (a minimal provisioning sketch follows this list). An often-overlooked fact of EC2 networking is that a single stream of data (identified by a 5-tuple) is limited to 5Gbps, regardless of the overall bandwidth available to an EC2 instance. When using a cluster placement group, this limit is increased to 10Gbps. For better redundancy, though, use partition placement groups (instances in one partition do not share racks or power with instances in another partition).
  • We observed that communication across availability zones (AZs) is significantly slower than within a single AZ (round-trip latency of roughly 50μs within an AZ vs ~400μs across AZs). But in light of recent AWS failures, it seems worth the performance hit to use multiple AZs. Communication across regions is of course many times slower, driven primarily by geographic distance.
  • All current-generation AWS EC2 instances support ENA (Elastic Network Adapter). If you’re using Amazon Linux (which we do), you already have the appropriate driver, but if you are using a non-Amazon OS, you may need to install the driver to get lower inter-instance latencies and higher packet rates.
  • We found AWS EBS volumes to be generally usable even for low-latency tasks (thanks to aggressive OS caching, we think). Except for some extreme use-cases (such as journaling every market data message), we have not had any issues with using EBS volumes as our primary disks. If you are writing a high volume of data or want to synchronously flush disk files, you may be better off using instance stores. As an example, we use c5d.12xlarge instances for our market data processing, which provide access to local 2x1TB NVMe SSDs (warning: instance stores are ephemeral and are wiped once you stop the instance).
  • We were pleasantly surprised to find that most of the usual tricks used in building low-latency systems continue to apply in the cloud. This is at least true in AWS, where dedicated VMs behave much more like bare-metal machines (our GCP tests were not as promising). If you’re used to tuning your network stack, pinning processes/threads/interrupts to CPU cores, or using busy-spin threads, you can continue to do so with the expected results (a small pinning/busy-spin sketch follows this list). Here is a handy tuning guide (YMMV, always consider your individual use-cases before tuning the OS).
  • Some of the tricks are, uh, trickier. If you’re used to bypassing the Linux kernel for your network communications, it is no longer as easy as using VMA or OpenOnload (although OpenOnload now seems to work with any OS providing decent AF_XDP support, possibly even in AWS, so if you’re using Linux kernel version ≥ 5.3, maybe it is that easy?). You can still use DPDK if you’re really keen on bypassing the kernel, but consider whether those extra couple of microseconds and the higher packet rates are worth the effort. Then there’s also the Elastic Fabric Adapter, which bypasses the kernel, but it is not widely available and, again, it seems like a lot of work.
  • Still other tricks are sadly unavailable. The big one in this category is UDP multicast support. AWS actually supports multicast using their Transit Gateways, but our understanding is that while the functional support is there, it is not suitably low latency. It also seems pricey. If we get around to trying this out, and things look different, we’ll be sure to update this paragraph.
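
As referenced above, here is a minimal provisioning sketch using boto3 that combines dedicated tenancy with a cluster placement group. The AMI id, group name, and instance type are placeholders, not our actual configuration:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Cluster placement groups pack instances close together on the network,
# which lowers inter-instance latency and raises the single-flow bandwidth cap.
ec2.create_placement_group(GroupName="trading-cluster", Strategy="cluster")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="c5.12xlarge",
    MinCount=1,
    MaxCount=1,
    Placement={
        "GroupName": "trading-cluster",
        "Tenancy": "dedicated",  # dedicated VM instances, not dedicated hosts
    },
)
```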
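
And here is the small pinning/busy-spin sketch mentioned above, just to illustrate that the technique carries over unchanged. It is Linux-only, and the core id and in-memory queue are illustrative stand-ins for a real lock-free ring buffer:

```python
import os
import queue

def process(msg) -> None:
    pass  # placeholder for real business logic

def busy_spin_consumer(inbound: queue.SimpleQueue, core_id: int = 2) -> None:
    """Pin this process to one core and busy-spin on an in-memory queue."""
    os.sched_setaffinity(0, {core_id})  # CPU pinning, exactly as on bare metal
    while True:
        try:
            msg = inbound.get_nowait()  # non-blocking poll
        except queue.Empty:
            continue  # no sleep: burn the core in exchange for minimal wake-up latency
        process(msg)
```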

Achieving high throughput in the cloud

Throughput and latency are often competing concerns: throughput can typically be increased at the expense of latency (through pipelining, parallelization, or batching). So it’s important to have separate design goals for throughput and latency.

We’ll note that throughput isn’t a huge concern for us except in certain subsystems such as the market data tickerplant, where we need to be able to handle massive amounts of data. Our sequencer and certain other framework components are capable of processing a message in as little as 100 nanoseconds, so the main bottleneck for us is network throughput.

A common trick to increase throughput is to batch messages. Often, the real constraint on network throughput is the packet rate. By batching, one can transmit fuller packets, thereby sending more messages at the same packet rate and increasing network throughput (at the expense of latency, since messages have to be queued). Using full jumbo frames (MTU 9001) is an extreme example of this; we use jumbo frames wherever possible.

In general, when someone quotes a high throughput number for their system (e.g. 6 million messages/second), your first response should be “what is the average message/packet size?” In our case, when using full jumbo frames, our system can max out the available 5Gbps single-stream bandwidth. Using a message size of 88 bytes, we have seen rates of ~6.4M msgs/sec in the lab, which is very close to the theoretical max when accounting for TCP/IP overhead. The real-world throughput would undoubtedly be lower, since we can almost never send full packets.
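
For the curious, here is the back-of-envelope arithmetic behind that claim; the header overheads are approximate, which is part of why the measured number lands a bit below the computed ceiling:

```python
# Theoretical message rate for 88-byte messages batched into jumbo frames
# over a 5 Gbps single stream; header sizes are approximate.
LINK_BPS     = 5_000_000_000
MTU          = 9001
IP_TCP_HDRS  = 20 + 32        # IPv4 header + TCP header with typical options
ETH_OVERHEAD = 14 + 4 + 20    # Ethernet header + FCS + preamble/inter-frame gap
MSG_SIZE     = 88

msgs_per_frame      = (MTU - IP_TCP_HDRS) // MSG_SIZE
wire_bits_per_frame = (MTU + ETH_OVERHEAD) * 8
frames_per_sec      = LINK_BPS / wire_bits_per_frame

print(f"~{frames_per_sec * msgs_per_frame / 1e6:.1f}M msgs/sec theoretical ceiling")
```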

Resiliency Considerations

Some readers commented that it is very hard to build a resilient trading system in the cloud. While there are nuances, we think this claim is mostly wrong. The considerations when building a resilient cloud trading system are almost exactly the same as when building an on-prem system.

First off, infrastructure failures in the cloud are no more common than in data centers (ok, I know that the recent multiple AWS outages are not helping my argument here, but in general, we still believe this). Second, the trading system design needs to be resilient to hardware failure anyway, because on-prem hardware fails too. Third, when hardware does fail, the recovery is arguably much easier in the cloud. You can simply force-stop the instance and start it again on new hardware, retaining all of the disk contents, the IP address, and the network settings. Try doing that with an on-prem server!

In terms of designing resiliency/redundancy, if you’re used to thinking of data center racks, cages, and disaster recovery (DR) locations, you can now use the parallel concepts of EC2 placement groups, availability zones, and regions respectively, and you can generally create an equally-or-more resilient design in the cloud.

Cost Considerations

One of the readers commented, “if you’re running this in the cloud, you must have a very large budget”. This, again, strikes us as very wrong. I mean, sure, if you use cloud infrastructure in exactly the same way as your on-prem infrastructure (always-on servers), then your cloud costs will be significantly more than if you just bought the servers and kept them for a few years. However, if you understand the fundamentally different nature of cloud infrastructure — that these are transient, on-demand, elastic resources — then you can do much better than if you bought the hardware outright (don’t forget to factor in the cost of the network switches and the team that racks and manages the physical assets; oh, and also all the spare parts you need to stock for replacing failed disks, SFP cables, and network cards).

Infrastructure Life-Cycle

When using cloud infrastructure, you need to think not only about what you need (CPU/RAM/storage), but also about the life-cycle of those resources. For example, in our case, the trading servers are only needed during business hours — they run for about 10 hours a day on weekdays, so we pay for roughly 50 hours a week, or less than 30% of the cost of running the servers 24x7. (Fun sidebar: yes, we sometimes need to access the disks outside trading hours, and through the magic of the cloud, that’s actually possible without turning the servers on, as long as the data lives on EBS volumes.)
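
For illustration, the start/stop pattern can be as simple as the following boto3 sketch, run on a schedule. The "role=trading" tag and the 06:30-16:30 window are assumptions for the example, not our exact setup:

```python
import boto3
from datetime import datetime, time

ec2 = boto3.client("ec2", region_name="us-east-1")

def trading_instance_ids():
    """Find instances tagged as trading servers (tag name is illustrative)."""
    ids = []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(Filters=[{"Name": "tag:role", "Values": ["trading"]}]):
        for reservation in page["Reservations"]:
            ids.extend(i["InstanceId"] for i in reservation["Instances"])
    return ids

def on_schedule(now: datetime) -> None:
    ids = trading_instance_ids()
    in_hours = now.weekday() < 5 and time(6, 30) <= now.time() <= time(16, 30)
    if in_hours:
        ec2.start_instances(InstanceIds=ids)  # no-op for instances already running
    else:
        ec2.stop_instances(InstanceIds=ids)   # EBS volumes (and their data) persist
```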

This life-cycle concept extends to non-production environments and ad-hoc projects as well. For example, we don’t need a development or a UAT environment running all the time, and at other times we need several of them at once! Being able to describe your infrastructure needs in code (Infrastructure as Code, or IaC) makes working with such dynamic demand significantly easier. This approach is much more cost-efficient than buying spare servers, keeping them around, and allocating them as needed.

Capacity Planning

Another subtle and overlooked advantage of working with elastic infrastructure shows up in the capacity planning process of any system. The traditional capacity planning process goes something like this: figure out what resources the system will need under peak load, and double that to provide a sufficient margin of safety. What ends up happening is that for perhaps 95% of the time, the hardware runs at 10% utilization, and only occasionally gets anywhere near 50%. As soon as utilization approaches 50%, even briefly, the capacity planning mandate requires going out and purchasing even more capacity, just to expand the headroom. This is extremely wasteful.

What if you could allocate only 2x the average capacity demand (instead of peak), and then add capacity as you go? Disks on cloud servers can be expanded on demand while the server is running! Don’t try that in your data center. More CPU or memory capacity can be added overnight with minimal reconfiguration of the system.
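
Growing a disk really is a single API call while the server keeps running. Here is a hedged sketch (volume id and sizes are illustrative), with the in-OS filesystem growth noted as a follow-up step:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Grow an attached EBS volume in place, e.g. from 200 GiB to 500 GiB.
ec2.modify_volume(VolumeId="vol-0123456789abcdef0", Size=500)

# Once the modification is underway, the partition/filesystem still needs to be
# grown inside the OS, e.g.:
#   sudo growpart /dev/nvme0n1 1 && sudo resize2fs /dev/nvme0n1p1
```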

This change in the capacity planning thought process can itself result in significant cost savings. If you don’t exceed your planned capacity for an extended amount of time, you don’t have a depreciating asset sitting unused in your data center (you may be shocked to realize that even an unplugged server costs about the same as a server you actually use in a data center — you are probably even paying for the rack space and the power). Also, what is the useful life of a server? For a trading system, two years at most, or even one, if you’re an organization that likes to run the latest technology. That’s worth keeping in mind when making these wasteful purchases and should factor into the CapEx vs OpEx debate, regardless of which side you fall on.

Commitments

A final point in this section, which is relevant for small firms like us, is that you don’t have to sign up for lengthy contracts when using cloud infrastructure. We pay AWS month to month, and we don’t owe them anything if we decide to take our business elsewhere or to scale down significantly for some reason. Most companies that use on-prem infrastructure don’t actually own data centers; they rent space in a third-party data center, and there are relatively long commitments involved. The same goes for servers: many firms lease their on-prem servers for a fixed committed period.

For us, the end result of using cloud infrastructure is that we can run our entire trading system (including the databases) for ~$10K/month (on most days, this includes about 35 servers of varying sizes). This spend will certainly increase as we grow and expand our footprint, but as a start, we don’t think we could have done much better.

General Cloud Advantages

We’ve touched upon some of the advantages of using cloud infrastructure, but we think it is worth spelling out more of the details, just because of the sheer number of non-believers we have met.

Sizing Flexibility

As an engineer, I’ve often been asked to provide capacity requirements that will determine how many servers my trading system will be run on, or what kind of hardware it will get. Do I need more single-threaded CPU speed or a higher core count? Do I need more RAM or faster RAM? Do I need large storage or fast storage?

Here’s a dirty little secret — very often, these requirements are no more than educated guesses. You can have an elaborate model for capacity planning, but if it rests on finger-in-the-air estimates, all you can hope for is that your padding is big enough and you don’t run out of capacity. Once you make that purchase based on the capacity planning model, it is not easy to change your requirements and get new servers.

What if, instead, you could change your mind as new information comes in? The pressure would be off, both on those who provide the requirements/metrics and those who’re in charge of the capacity planning and procurement. Not only would you save money by using smaller padding in light of the more accurate metrics, but you would also be using hardware that is better suited to your needs.

Cloud infrastructure allows exactly this flexibility. We can and do resize servers overnight as new metrics come in.
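
The mechanics of such a resize are straightforward, because the disks and network settings survive a stop/start. A sketch with an illustrative instance id and target type:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
instance_id = "i-0123456789abcdef0"

# The instance type can only be changed while stopped; EBS disks and the
# private IP are retained across the stop/start.
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

ec2.modify_instance_attribute(InstanceId=instance_id, InstanceType={"Value": "c5.18xlarge"})

ec2.start_instances(InstanceIds=[instance_id])
```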

Upgradability

This ties in with the previous point — new processor and server technology is being released constantly, and the aforementioned commitment/lock-in for 1–2 years can be inconvenient. As an engineer, I want access to the latest technology, and if my upgrade cycle only rolls around every 2 years, I’m constantly working with not-the-latest tech. For example, the AVX-512 instruction set could be super handy for fast processing of streaming data, or PCIe 4.0/5.0 bus could bring new I/O possibilities, or DDR5 RAM could add impressive amounts of memory bandwidth. How long before you can take advantage of these advancements? This might come as a surprise, but the hardware available via AWS EC2 instances is often more up-to-date than what many organizations are using.

Location Flexibility

If you’re running an on-prem US equities trading system, you probably run the system in a data center in NY/NJ. Your DR data center is perhaps in another city in NJ or Chicago. What if you wanted to change your primary location or your DR location? That would be a multi-year and multi-million dollar event. We can do this in the cloud with about a week’s notice and no real cost. (A fun anecdote: I know of a bank that had picked another city in NJ for its DR data center, only to find during Hurricane Sandy that the primary and DR data centers were located in the same flood plain. Oops.)

Here’s an interesting example: we run the system in the us-east-1 region of AWS, which is physically located in N. Virginia. However, AWS recently announced a Local Zone in NY, which would be much closer to the major trading centers in NJ. What will it take for us to move? Almost nothing at all — once we sort out a couple of things with our AWS Direct Connect lines (to make sure they don’t round-trip via Virginia), it will be trivial for us to stand up another environment in the NY local zone and just like that, we’ll have relocated our entire trading stack.

Managed Infrastructure

An under-appreciated fact about the cloud, when compared to on-prem systems, is that the infrastructure is managed.

Here are some of the things we didn’t do:

  • We’re not network or storage engineers and we haven’t had to hire one. Our system runs in 3 separate networks, with multiple subnets for further logical separation, but we didn’t set up a single switch or VLAN or any of that stuff.
  • Like every company, we have shared drives on our computers (and even servers), but we didn’t set up expensive file servers. We use S3 buckets.
  • We’ve never had to consider whether a hard drive in one of our servers may be failing, or whether there is a loosely seated SFP cable in a network card or switch.
  • We’ve never had to worry about whether there is enough power in our data center racks to add another server, or if there is still a switch port available.

To be sure, this doesn’t mean that you don’t need any knowledge of these things — we did set up the networks and ACLs and routing tables, but the point is that most of the nitty-gritty is abstracted away into simple CLI commands (or even a UI if that helps). And to top it all off, the entire infrastructure provisioning and build can be automated using IaC tools such as AWS CloudFormation and Ansible.

Closing Thoughts

We believe the Proof algo trading system is blazing a trail for the industry when it comes to building high-performance trading systems in the cloud (despite the recent hype around the Google/CME and AWS/Nasdaq deals, those moves are still years away, especially for the matching engines). The cloud landscape is fast-moving and new capabilities seem to come online every day — it is an exciting time to be building this system in the cloud. If you think we’re doing cool stuff and you would like to contribute, reach out to us at careers @ prooftrading.com. See current open roles here.

If you have counter-arguments or feedback on any of the things we are doing, please reach out to me on Twitter: @preraksanghvi, or drop us a line at info @ prooftrading.com.
