Proof Engineering: The Algorithmic Trading Platform

How we built an institutional-grade algorithmic trading platform in the cloud

Why Build It?

A High-Performance Trading System in the Cloud

What about latency in the cloud?

  • Most of the arms race is in reactive trading. As an agency trading platform, we can design our algorithms so that we don’t necessarily need to react to events. Because we trade on behalf of our clients, our goals include getting the best price without leaking too much information; we are not looking to harvest rebates.
  • If somehow the tick-to-trade time did matter, we believe that getting colo space, the 40Gbps lines, and the fastest prop feeds would still not help. It is nearly impossible for agency brokers to compete with the fastest participants on the street, simply because of the nature of our trading activity (risk checks, for one!). These latency-sensitive scenarios tend to be winner-take-all, and being just good is not good enough. There is no prize for second place.
  • Most trading strategies can be divided into the macro-strategy (time horizon of minutes or hours) and the micro-tactics (time horizon of microseconds or milliseconds). The macro part (the “algo”) is not latency-sensitive and is where the high-level trajectory of the order is computed, including order schedules, market impact estimates, etc. Using this intelligence, the algo decides when and how much of the order should be sliced, and which tactics should be used to execute those slices. Some of these tactics are indeed latency-sensitive, and for those, we intend to use existing facilities on the street that we trust (e.g. the IEX router routinely captures ~100% of displayed liquidity when sweeping the street; we know this because we built it at IEX!).

The Equities Ecosystem

A schematic diagram of the Proof trading system and the ecosystem it is embedded in

Cloud Selection (Our Choice: AWS)

  • Performance: not just raw performance, but consistency; we wanted our VMs to feel as close to bare-metal as possible. Our ranking: AWS, then GCP, then Azure.
  • Usability: administrative console and command-line interface (scriptability). Our ranking: GCP, then AWS, then Azure (which was just… no; Azure lost big on this point).
  • Pricing: they’re basically all similar. Compute resources are a bit cheaper on AWS, but network (egress) is more expensive. This ended up not being a deciding factor.

Performance

AWS

  • Replication across servers in the same availability zone with attached disks: 6.3M msgs/sec (it did not matter whether the machines were in a Cluster or Partition Placement Group)
  • Replication across servers in different availability zones with attached disks: 3.3M msgs/sec
  • EBS vs attached storage: EBS storage could keep up with attached NVMe storage in terms of throughput, but NVMe was more consistent and had lower latency
  • UDP (unicast) worked out of the box (slightly slower than TCP)
  • Ping time between boxes: ~50μs for same availability zone, and ~400μs across availability zones. Using an Echo benchmark, where two applications send messages back and forth, we were able to replicate this ping time as the RTL (round trip latency) for messages between two servers over TCP as well as UDP.
  • We could easily pin threads or interrupts to specific cores (including HT peer cores), with predictable behaviors as we would expect from a physical machine.
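
For context, the echo benchmark mentioned above is conceptually very simple. Below is a minimal, hypothetical sketch of the client side in Java over TCP (the peer simply echoes back every byte it reads). The host, port, message size, and iteration counts are illustrative; the real benchmark also covers UDP and records full latency distributions rather than just an average.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.Socket;

// Hypothetical echo-benchmark client: sends a small fixed-size message to an
// echo server and measures the round-trip latency (RTL) using System.nanoTime().
public class EchoBenchmarkClient {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "10.0.0.2"; // illustrative peer address
        int port = 9999, warmup = 100_000, measured = 1_000_000;

        try (Socket socket = new Socket(host, port)) {
            socket.setTcpNoDelay(true); // avoid Nagle batching for small messages
            DataOutputStream out = new DataOutputStream(socket.getOutputStream());
            DataInputStream in = new DataInputStream(socket.getInputStream());

            byte[] payload = new byte[64]; // small, fixed-size message
            long totalNanos = 0;
            for (int i = 0; i < warmup + measured; i++) {
                long start = System.nanoTime();
                out.write(payload);
                out.flush();
                in.readFully(payload); // block until the peer echoes the message back
                if (i >= warmup) totalNanos += System.nanoTime() - start;
            }
            System.out.printf("average RTL: %.1f us%n", totalNanos / (measured * 1000.0));
        }
    }
}
```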

GCP

  • Replication benchmark performance using TCP: 6.3M msgs/sec
  • UDP didn’t work out of the box
  • Noticed throttling on the boxes — rates would degrade systematically over the duration of the test
  • Ping time between boxes: ~160μs
  • When pinning threads to specific cores, the benchmark did not complete

Azure

  • Benchmark performance using TCP: started at 6.5M msgs/sec, but fell to 3M msgs/sec before the halfway point
  • UDP worked out of the box and roughly same perf as TCP
  • Ping time between boxes: inconsistent, ranging from 600μs to 2ms
  • After running the first test, the machine became unstable and had to be restarted
  • During the 2nd run, we had a 135s pause in the middle of the test (unacceptable!)
  • Of all 3, Azure was the worst experience when trying to get help from tech support. It was clear that they are set up for enterprise customers, not start-ups.

Extranet (Our Choice: TNS)

Market Data (Our Choice: Exegy)

  1. Legit low latency provider: They have to be a professional market data vendor that processes market data in single-digit microseconds with extreme reliability and is used on the street for order routing. This is in contrast to a ton of web-based real-time vendors in existence today that stream market data using WebSockets/SSE or even cloud-hosted Kafka instances. From what we understand, the latency on these ranges from tens of milliseconds to multiple seconds under load, which is not suitable for trading.
  2. Hosted solution: We needed this to be a hosted solution because we couldn’t install any market-data appliances, or specialized network cards, or even run any ticker-plant feed handler software. In fact, given that all of the common market data feeds are UDP-multicast based, we couldn’t extend those feeds to the cloud at all.

System Architecture

The Sequenced Stream conceptual diagram
  • Perfect synchronization: Every node in the distributed system receives an identical stream of inputs in identical order. If the nodes of the distributed system are written to be completely deterministic and not rely on any external inputs (not even local time!), it is possible to achieve perfect synchronization (consensus) among an unlimited number of nodes. This benefit alone is worth all of the trouble; in contrast to other traditional designs, the different applications in the system are never out of sync and never need reconciliation.
  • Perfect observability: If the sequenced stream is persisted to a durable medium reliably, which it is in our case, we can achieve perfect observability. Imagine a situation where you observe some unexpected behavior in the system. The usual mode of debugging may be to check the logs and the database and conduct a forensic exercise to piece together the cause of the behavior. In our system, we can just retrieve the sequenced stream file and replay it through the same code that is deployed in production, but inside a debugger session. We can see the exact flow of logic and even the individual variable values, exactly as they were in production. We don’t have to spend hours attempting to reproduce a race condition; we have perfect reproducibility every time.
    This enhanced observability also extends to aspects such as performance monitoring. The apps as well as the sequencer add enough telemetry to the sequenced messages to be able to precisely locate bottlenecks and queueing in the system.
  • Perfect auditability: With perfect observability comes perfect auditability. We never have to guess or piece together what the state of the market was at the time an order was sent, or how that state came to be. We have definitive answers to such questions.
  • Perfectly streamlined processing: Since all of the system inputs and outputs are recorded on the sequenced stream, it is trivial to delegate housekeeping tasks like logging and database insertions to separate non-critical apps. Critical path processing of orders and market data is streamlined to the point that an individual application can process events within microseconds. And all this while maintaining perfect observability, as outlined above.
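
To make the determinism point above concrete, here is a heavily simplified, hypothetical sketch of what an application on the sequenced stream looks like. All of the names are invented for illustration; the essential idea is that the application's only input is the sequenced stream, and even its notion of time is derived from sequencer timestamps rather than the local clock.

```java
// Hypothetical shape of a deterministic application on the sequenced stream.
// Names (SequencedMessage, SequencedApp, etc.) are invented for illustration.
interface SequencedMessage {
    long sequence();        // gap-free sequence number assigned by the sequencer
    long timestampNanos();  // timestamp stamped by the sequencer, not by the local host
    int type();             // e.g. NEW_ORDER, MARKET_DATA, TIMER, ...
}

abstract class SequencedApp {
    private long streamTimeNanos; // the only notion of "now" the app is allowed to use

    /** Called for every message, in sequence order, whether live or during replay. */
    public final void onMessage(SequencedMessage msg) {
        streamTimeNanos = msg.timestampNanos(); // time advances only with the stream
        handle(msg);
    }

    /** Never System.currentTimeMillis() or System.nanoTime() in business logic. */
    protected final long now() {
        return streamTimeNanos;
    }

    /** Business logic: must be a pure function of the messages seen so far. */
    protected abstract void handle(SequencedMessage msg);
}
```

Because the handler never consults the wall clock or any other external input, replaying a day's stream file through the same class, whether in production, in a test, or in a debugger, reproduces the exact same behavior, timestamps and all.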

The Tech Stack

  • Operating System: Amazon Linux 2
  • Programming languages: Java (SE of course, nobody in their right mind still uses EE), Python, TypeScript
  • DevOps: Jira, Bitbucket, Confluence/Notion
  • Databases: SingleStore (fka MemSQL) for the trading system, Redis for UX, OneTick for historical research
  • Clustering: None for the trading system, AWS ECS for UX
  • Operational Tools: AWS CloudFormation, Ansible, Jenkins, DataDog
  • UX Technologies: Node.js, React/Redux, AG Grid

High Performance Java

  • All code is single-threaded. Java has perfectly usable concurrency primitives, but keeping things single-threaded is not only safer (no synchronization bugs), but also much more performant (a simple thread-context-switch resulting from a call to Object.wait() can cost as much as 20ms in Linux). If we must use multiple threads for some reason, we use a Disruptor-based ring buffer to pass messages across threads.
  • Avoid garbage collection. Java is infamous for unpredictable stop-the-world garbage collections. The best way to avoid GC is to not create garbage in the first place. This topic could fill a book, but the primary ways to do that are: (a) Do not create new objects in the critical path of processing; create all the objects you’ll need upfront and cache them in object pools. (b) Do not use Java strings; they are immutable objects and a common source of garbage, so we use pooled custom strings based on java.lang.StringBuilder. (c) Do not use standard Java collections (more on this below). (d) Be careful about boxing/unboxing of primitive types, which can happen when using standard collections or during logging. (e) Consider using off-heap memory buffers where appropriate (we use some of the utilities available in chronicle-core).
  • Avoid standard Java collections. Most standard Java collections use a companion Entry or Node object that is created and destroyed as items are added and removed. Also, every iteration through these collections creates a new Iterator object, which contributes to garbage. Lastly, when used with primitive data types (e.g. a map of long → Object), garbage will be produced with almost every operation due to boxing/unboxing. When possible, we use collections from agrona and fastutil (and rarely, guava).
  • Write deterministic code. We’ve alluded to determinism above, but it deserves elaboration, as this is key to making the system work. By deterministic code, we mean that the code should produce the exact same output each time it is presented with a given sequenced stream, down to even the timestamps. This is easier said than done, because it means that the code may not use constructs such as external threads, or timers, or even the local system clock. The very passage of time must be derived from timestamps seen on the sequenced stream. And it gets weirder from there — like, did you know that the iteration order of some collections (e.g. java.util.HashMap) is non-deterministic because it relies on the hashCode of the entry keys?!
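
As a rough illustration of the pooling and primitive-collection points above, here is a hypothetical sketch (the class and field names are invented, and real order state is far richer). It pre-allocates order objects into a pool and keys them by order ID in Agrona's Long2ObjectHashMap, so the critical path neither allocates nor boxes.

```java
import org.agrona.collections.Long2ObjectHashMap;
import java.util.ArrayDeque;

// Hypothetical sketch of allocation-free order bookkeeping in the critical path.
final class ChildOrder {
    long orderId;
    long quantity;
    long priceNanos; // prices kept as scaled longs, not BigDecimal, to avoid garbage

    void reset() { orderId = 0; quantity = 0; priceNanos = 0; }
}

final class OrderBookkeeper {
    // Pre-allocated pool: objects are created once, up front, and recycled forever.
    private final ArrayDeque<ChildOrder> pool = new ArrayDeque<>();
    // Primitive-keyed map from Agrona: no boxing of the long order ID, no Entry garbage.
    private final Long2ObjectHashMap<ChildOrder> liveOrders = new Long2ObjectHashMap<>();

    OrderBookkeeper(int capacity) {
        for (int i = 0; i < capacity; i++) pool.addLast(new ChildOrder());
    }

    void onNewChildOrder(long orderId, long quantity, long priceNanos) {
        ChildOrder order = pool.pollLast();      // reuse, don't allocate
        if (order == null) throw new IllegalStateException("pool exhausted");
        order.orderId = orderId;
        order.quantity = quantity;
        order.priceNanos = priceNanos;
        liveOrders.put(orderId, order);
    }

    void onChildOrderDone(long orderId) {
        ChildOrder order = liveOrders.remove(orderId);
        if (order != null) {
            order.reset();
            pool.addLast(order);                 // return to the pool for reuse
        }
    }
}
```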

Cloud Setup and InfoSec

  • Our entire system is deployed in private subnets, which means all hosts in the system use private IP space. There are no public IP addresses exposed to the internet.
  • The system is accessed, whether for administrative purposes or for UX purposes, over a VPN connection. We have a completely offline machine with our Root CA, which is used to sign the VPN client certificates.
  • Our system follows the NIST reference architecture, which prescribes having a management network separate from our production network. We use 3 separate VPCs (Virtual Private Clouds) in AWS — one for our trading system, one for our web/UX system, one for the management network.
  • Access to the trading system or the web system servers is through a pair of jump hosts (aka control hosts, or management hosts), access to which is tightly controlled, and the list of authorized keys is refreshed every 12 hours to remove any inadvertent grants of access.
  • Access to every individual server is protected using AWS Security Groups, NACLs, and a distinct set of authorized SSH keys, also refreshed every 12 hours.
  • We use the principle of least privilege, not just for users, but even for the apps. Even within our private network, we limit the communication between servers to specific TCP/UDP ports (e.g. we disallow SSH between hosts).
Proof trading system layout in the AWS cloud

The OMS and the Algo Engine

  • OMS: All algos need extensive OMS features such as accepting and validating orders, amendments, and cancel requests, as well as sending out child orders, amendments, and cancel requests. In addition, when fills are received from the venue, they need to be relayed back upstream.
  • Market Data: The algo container allows the strategy to subscribe to and receive market data from the rest of the system. In our system, a strategy can subscribe to market data for any reference security, not just the security of the order.
  • Risk Checks / Validation Engine: The algo container ensures compliance with certain invariants at all times. For example, at no time may the open child orders for a strategy add up to more quantity than the parent order quantity. Similarly, the algo container will ensure that no child orders or amendment requests violate the parent limit.
  • Static Data / Algo Ref Data: The algo container will facilitate loading and access to static data such as security master, venue configurations, destination preferences, symbol statistics, volume curves, and any trading models. The system is set up to allow the strategy to load any delimited file as ref data, without any code changes to the core system.
  • Timer Service: This doesn’t sound like anything worth writing about, but we mention it because it is an important algo “trigger”. In our system, a strategy can request an unlimited number of wakeup calls, each accompanied by a token payload that serves as a reminder for why the wakeup was requested. These wakeups/timers fire in accordance with time as observed on the sequenced stream (in a deterministic fashion).
  • Child Order Placement: The algo container provides facilities to help the strategy with venue and destination selection (which are two separate concerns). The container keeps track of which venues are accessible via which destinations, whether the venue or the destination is down, and even round-robin across connections when sending to a particular venue.
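
Putting a few of these together, here is a rough, hypothetical sketch of the kind of contract the algo container exposes to a strategy. The interfaces and method names are invented for illustration, and the real container surface is much larger, but the shape is similar: the strategy reacts to triggers (parent orders, market data, timers, fills) and asks the container to act on its behalf, while the container enforces the invariants described above.

```java
// Hypothetical container/strategy contract, for illustration only.
interface AlgoContainer {
    /** Subscribe to market data for any reference security, not just the order's symbol. */
    void subscribeMarketData(long securityId);

    /** Request a wakeup at a future stream time, with a token recording why it was requested.
        Timers fire off the time observed on the sequenced stream, so they are deterministic. */
    void requestTimer(long wakeupTimeNanos, long token);

    /** Place a child order. The container validates invariants (open child quantity must not
        exceed the parent quantity, child prices must respect the parent limit) and handles
        venue/destination selection, down detection, and round-robin across connections. */
    void sendChildOrder(long parentOrderId, long quantity, long priceNanos, int venueId);
}

interface Strategy {
    void onParentOrder(AlgoContainer container, long parentOrderId);
    void onMarketData(AlgoContainer container, long securityId);
    void onTimer(AlgoContainer container, long token);
    void onFill(AlgoContainer container, long childOrderId, long fillQuantity, long fillPriceNanos);
}
```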

FIX Gateways

FIX Engine

Client Connectivity

Venue Connectivity

User Experience

UX Data Flow
  • Ability to dock views using drag and drop, with the ability to maximize views
  • Ability to open the same view multiple times (e.g. open Alerts window and filter by “ERROR”, then open another Alerts window and filter by “Order Rejected”; now you have two custom Alert views)
  • Ability to save derived views (e.g. filter Orders view by client=XYZ, then save the filtered view; now you have a dedicated Orders view for a specific client)
  • Support for multiple types of views side-by-side (e.g. order blotter and a volume curve chart)
  • Ability to detect and notify the user when data in the views may be stale
  • Ability to export/import the workspace layout, including properties of individual views such as column positions / widths and sort orders
  • Chrome Push Notifications
Proof Trading UX

AG Grid

  • Fast updates, multiple times a second. There should be no lag when navigating around the blotter as a thousand rows are being updated each second.
  • Virtual rows/columns — this is a performance optimization useful for blotters with a large number of records. If the grid has a million records, but the viewport will only allow 10 rows to be visible, it is beneficial for the grid to only render the rows/columns that are in the current view. This is tricky to maintain, of course, as the user scrolls or filters or navigates to rows outside the view. AG Grid does a great job of making this seamless.
  • Ability to select, sort, filter, and group on columns
  • Ability to quickly search through the entire grid (not at a column level, just everywhere). We’ve supercharged this feature to allow Regex searches, as well as expression searches (e.g. symbol=’IBM’ && client=’XYZ’)
  • Ability to see summary and counts in a status bar
  • Ability to use custom renderers or specify CSS class definitions per row/cell based on conditional logic (e.g. if an Order is fully filled, turn it green)
  • Ability to select rows or cells and copy selection
  • Ability to add custom actions to the right-click menu (e.g. Orders view has a right-click option for Cancel Order)
  • Ability to save/load current configuration of the blotter. This ties in with the workspace layout export/import feature mentioned above.
Proof Trading UX blotter based on AG Grid

Infrastructure and DevOps

  • We use BitBucket / Git to store all of our source code
  • Bitbucket Pipelines are configured as our CI (continuous integration) tool. For each commit, it builds all projects, runs hundreds of unit and integration tests, and passes or fails the build.
  • Official builds are created from check-ins to the main branch, and stored in a Maven repo in an S3 bucket. All official builds are tagged in the Git repo.
  • Once an official build is available, and a Change Management ticket has been approved by relevant stakeholders, the release is deployed first to the jump host, and from there, to the entire cluster (using ansible). So far, our policy is that all parts of the system must run the same build on any given trading day. The workflow for UX builds is a bit different, since we’re deploying the build to AWS ECS in that case.
  • Ansible is used for release and configuration management, as well as administrative tasks (e.g. cleaning up old logs), both at the host level and the application level.
  • Jenkins is used as a scheduler to orchestrate all production jobs such as deployments, system start/stop, post-trade, archive, and even patch management (yes, yes, Jenkins is not necessarily the best tool for this; we could be using other tools better suited for ops jobs, but we’re developers at heart, so Jenkins it is for us!)
  • We use AWS CloudFormation for provisioning nearly all of our AWS resources. It works reasonably well, at least among the available options, though we certainly have our niggles with it (nested CFN stacks are impossible to manage; drift detection is broken; launching EC2 instances doesn’t work well over time; it sometimes wants to delete and recreate resources for the simplest of updates).
  • Monitoring: We use DataDog for infrastructure as well as application monitoring. We push all of our application logs to DataDog using their agent, and are able to set up monitors for specific keywords (e.g. ERROR or “Order Received”). We also monitor CPU, memory, and disk space on the servers, and process monitoring allows us to receive an alert if a critical process dies. DataDog can be a bit expensive and their contract process was a bit weird, but otherwise, we can wholeheartedly recommend them over the hosted ELK providers we looked at.
  • Automation: We have automated nearly all of the operations tasks. Here’s a typical day in the automated life of our trading system: In the morning, a reference data set is produced and deployed to the relevant servers. The system starts up, connects to clients and venues, and begins accepting orders. It trades throughout the day with no human intervention (unless a supervisor deems it necessary), and at the end of the day, all open orders are canceled back to the clients. The system shuts down, performs all post-trade and regulatory tasks, and archives all of the logs/data, including to WORM storage for 17a-4 purposes. All of this without a human so much as clicking a button. We have appropriate safeguards at each step of the way, and a licensed human is always watching, but otherwise, we have automated humans out of the manual processes they usually perform.

Backoffice

  • Reference Data: We create our security master by combining FINRA CAT symbol master, NASDAQ security master, and IEX Cloud ref data / prices [Shout out to IEX Cloud — I mean, really, is there anything as good and cost-effective for financial data as IEX Cloud? I asked around, and as an example, Intrinio was 80x more expensive for what we needed].
    In addition to the security master, we generate symbol statistics and volume curves on a daily basis, and our dynamic VWAP model reference data periodically (using OneTick/Python). These are combined with static data files such as client/venue connections, to produce a reference data set for the trading day.
  • Clearing & Settlement: This involves: (1) dropping our trades to Apex throughout the day via a drop copy FIX connection, and (2) sending trades and allocations to Apex via a REST API at the end of the day after appropriate checks/reconciliations.
  • CAT: We produce CAT records in-house at the end of the day and publish files to FINRA via SFTP (FINRA handily supports an AWS PrivateLink connection).
  • OATS: We produce OATS records in-house at the end of the day and publish files to FINRA via IFT (Internet File Transfer). A lot of people don’t know this, but IFT is fully scriptable, meaning there is a REST API that can be used to automate uploads as well as collect feedback. If you have access to FINRA IFT, point your browser at its API documentation and be amazed at the Swagger API docs!
  • Supervisory reporting: Currently, we produce these reports: (1) a Daily Activity Report, that summarizes trading activity and flags any issues such as trades outside NBBO or possible market manipulation attempts (2) a Time Synchronization Report, which checks that our server clocks are synced to within 50ms of an acceptable reference clock. (Sidebar: the Amazon Time Sync Service is an incredible no-effort way of synchronizing clocks to within a millisecond of GPS time) (3) 606 reports which are generated upon client request and cover the preceding 6 months of trading data

Closing Thoughts
