10 Weeks of Live Trading at Proof: Tech Releases in Production

Published in Proof Reading, Jun 8, 2021

Launch day at a startup is of course very exciting. In our case, we spent two years assembling the team, designing and building the tech, fundraising, building customer relationships, integrating with the street, and obtaining regulatory approval. Two years of hard work all leading up to a single launch day.

Unlike at IEX (which similarly took about 2 years to build), Proof’s launch did not need to be a big-bang event. IEX is a marketplace, which means we needed a critical mass of customers on day one so that they could interact with each other and have a half-decent experience. At Proof, we are building an agency broker, and we can provide an excellent experience with even just a single customer. That means we can open the floodgates gradually, and keep an eagle eye on every individual order to make sure each feature of the product functions properly as we slowly ramp up trading activity and risk.

Both cases, though, did involve a mad dash leading up to launch: testing like crazy, chasing down every last bug and corner case (that we knew of!), tying up loose ends, etc. There are inevitably things that you push off until later as well as problems that don’t reveal themselves until after you go live. The hope is that you’ve built enough guardrails and alerts that when something does go wrong, at least you realize quickly and the damage is controlled, but you never really know how it will go until you take the leap and flip the switch.

At both Proof and at IEX, launch day was exhilarating and stressful, but in retrospect both launches went surprisingly smoothly. There is an unbelievable amount of complexity to a trading system — so many moving parts interacting with each other, so many edge cases and race conditions. As many times as we’ve done this before, and despite the thousands of hours of testing, it is still a surprise when things mostly just work as intended.

This post is an overview of all the changes we deployed to our trading software over our first ten weeks in full production. Why publish these to the blog? First, we are proud of our ability to build and deploy new features at a rapid pace. Our algo and platform have come a long way in a short time, and if we are able to maintain this pace of iteration and improvement, we believe we will have one of the most compelling offerings in the space before long.

Second, even with a strong, experienced tech team like ours, bugs inevitably make their way to production. Nobody is perfect, and we don’t shy away from admitting our mistakes. So far at least, every issue we have encountered was identified quickly, caused modest damage, and was fully remediated within 24 hours. This is no coincidence: we have spent a great deal of energy building layers of guardrails, alerting mechanisms, and investigative tools. When fires do arise, we have the tools and experience to attack them with focus, precision, and speed. This post is an opportunity to reveal the nature of the problems that arise in a nascent trading system and what it takes to address them.

Finally, almost everyone on the team, myself included, spends the majority of their time on quantitative research and software development, so we love to showcase those efforts any chance we get. Hopefully this detailed look behind the curtain is helpful or at least interesting to other folks in the industry, especially those embarking on their own startup journeys. Also, please note that many of the descriptions below contain jargon and terminology specific to our trading system. They are meant to give a sense of the scope and scale of the changes we have made since launching, rather than to provide a self-contained description of our trading platform. We plan to soon release a more accessible overview of our entire technology stack as part of our series on How To Build a Broker-Dealer (1, 2, 3, 4, 5).

High level summary

Date range: 3/29/21–6/4/21 (10 Weeks)
Total number of releases: 34
Major feature releases: 2
Bugfixes: 5
Commits: 345

Major Releases

Dynamic VWAP Model Paradigm Shift (04/21/21)

In the original version of our VWAP algo, each interval of time was treated independently: the dynamic volume prediction model would predict how much volume would trade in the upcoming X minutes relative to the rest of the day, and the algo would schedule that proportion of its remaining shares to trade. More specifically, the volume model outputs where on the volume curve it believes the stock to be right now, and where it believes it will be X minutes in the future, and the algo extrapolates the rest. We noticed that in some cases the algo would trade multiple heavy intervals in a row, or multiple light intervals, and get a bit ahead or behind. For example, at a given point intraday, the model might say it believes 50% of the day’s volume has traded so far and 60% will have traded by the end of the upcoming interval. In this case, the algo would schedule 20% of its remaining shares (10 points out of the 50 still to come). At the start of the next interval, the model might have revised its predictions to 55% now and 64% by the end of that interval, and the algo would again schedule 20% of the then-remaining shares (9 points out of 45). In total, that is 36% of the shares that remained at the start of the first interval. Looking at the two intervals together, though, this seems suboptimal: the latest prediction says only 14 points of volume (50% to 64%) trade over that span, which is 28% of what remained, so the algo has gotten ahead.

We decided to change the algo logic to work more gracefully and in sync with the dynamic volume model; instead of comparing the upcoming interval prediction to just the remainder of the day, the algo now also takes into account its prior trading activity. In the above example, the new algo logic would try to have the entire order 60% complete by the end of the first example interval, and then 64% complete by the end of the second (assuming it was a full-day order), as it is aware of the fact that it has gotten ahead and will try to fall back in line with the latest prediction. This was a fairly complicated change, especially with all of the corner cases around partial-day orders and optional auction participation.
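The arithmetic in the example above can be checked with a short sketch. This is a simplified illustration, not the production logic: the function names and the 10,000-share full-day order (already 50% filled) are assumptions made up for the example, and the real algo also has to handle partial-day orders and auction participation.

```python
def old_interval_shares(remaining_shares, cur_pct, next_pct):
    """Original logic: trade the interval's proportion of the rest of the day."""
    fraction_of_remaining = (next_pct - cur_pct) / (1.0 - cur_pct)
    return round(remaining_shares * fraction_of_remaining)


def new_interval_shares(total_shares, filled_shares, next_pct):
    """Revised logic: aim for cumulative completion equal to the prediction."""
    target_filled = round(total_shares * next_pct)
    # If the order is already ahead of the target, schedule nothing.
    return max(0, target_filled - filled_shares)


TOTAL = 10_000   # hypothetical full-day order
FILLED = 5_000   # 50% complete when the example starts

# Old logic: 20% of remaining, then 20% of the then-remaining shares.
old_first = old_interval_shares(TOTAL - FILLED, 0.50, 0.60)               # 1000
old_second = old_interval_shares(TOTAL - FILLED - old_first, 0.55, 0.64)  # 800

# New logic: be 60% complete after the first interval, 64% after the second.
new_first = new_interval_shares(TOTAL, FILLED, 0.60)               # 1000
new_second = new_interval_shares(TOTAL, FILLED + new_first, 0.64)  # 400
```

Under the old logic the order ends the second interval 68% complete (1,800 of the 5,000 remaining shares, or 36%), while the revised logic ends it at the predicted 64%.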

Order Amendment Functionality (05/24/21)

This was by far the largest release we have made so far. For our initial launch, we decided to defer order replace functionality to a later date — i.e. the ability for a customer to amend the quantity or limit price of a live order. Order amendments are a standard use case, but the mechanics involved are actually quite complicated. One common approach is to treat an amendment as a “cancel/new” behind the scenes: in other words, cancel out the previously live order and create a new order for the balance with the adjusted parameters. One challenge with this approach is what do you do if you can’t cancel the original order, for example if it has a live child order currently locked into the closing auction? The cancel/new approach is also suboptimal in that child orders automatically lose their queue position on the street as they get cancelled and reissued.

We knew from the start that we wanted to implement “true” cancel/replaces, where child order priority is preserved and all the different “stuck child order” scenarios are handled as gracefully as possible, but doing so is far more complicated, so we put it on the back burner.

On April 29th, our pilot customer requested to amend an order, and we told them it wasn’t yet implemented, but this was the kick we needed to get our act together and finally build the thing. What ensued was a marathon of code changes and testing across several different applications in the system: the OMS, the algo, the UI, and every single post trade application (OATS, CAT, clearing, regulatory reporting) required major changes to accommodate amendment functionality. While we were at it, we modified the algo logic to send true replaces downstream as well, another significant lift. All in all, it took about 4 weeks from beginning this project to deploying it to production. It was exhausting but rewarding, and we are relieved to have it behind us. In our experience, making a similar scale change to the production trading system at any of our previous firms would likely have taken a year or longer.

Full list of changes by application

Algo

3/31/21: bugfix for an issue where, at the start of the trading day, the algo would sometimes see the shares it had exposed to the opening auction and confuse these with the shares it was supposed to send for the first interval in the continuous market (so instead of sending additional shares, it might just do nothing and wait). The fix was for the algo to always wait until the opening auction was complete before determining its course of action for the first interval.

4/6/21: several features and enhancements added:

1. We incorporated our information leakage model such that a large order marked “need-not-complete” will only schedule what it considers to be a reasonable number of shares so as not to cause undue market impact.
2. Previously the algo would split-post between IEX D-Limit and Nasdaq when attempting to post passively; this change replaced Nasdaq with the security’s primary exchange.
3. In the case of a large and sudden, but sustained, price shift, certain passive orders might have previously sat in the market limited-away. This change allows the worker layer of the algo to re-price a limited-away post router if the parent order’s limit price has additional room to do so.
4. Introduced configurable auction collars to allow the algo to send orders to the auctions with more aggressive limits than our standard intra-day order collars.

4/7/21: bugfix; the implementation of #4 in the above change had a bug where in some cases the standard order collar would still be applied to auction orders (in addition to the new auction order collar). This bug had no impact in production and was fixed for the following day’s trading session.
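A rough sketch of the intended behavior, where exactly one collar applies to any given child order. The function name, the percentage-around-a-reference-price formulation, and the 2% and 5% values are all assumptions for illustration; the actual collar definitions are not described here.

```python
def collared_limit(side, parent_limit, reference_price, is_auction,
                   intraday_collar_pct=0.02, auction_collar_pct=0.05):
    """Return the child order limit after applying a single collar.

    Hypothetical model: a collar is a percentage band around a reference
    price, and auction orders get a wider band than intraday orders.
    """
    # Choose one collar or the other, never both. Applying the standard
    # collar on top of the auction collar would silently cancel out the
    # wider auction band.
    collar_pct = auction_collar_pct if is_auction else intraday_collar_pct

    if side == "buy":
        collar_price = reference_price * (1.0 + collar_pct)
        return min(parent_limit, collar_price)

    collar_price = reference_price * (1.0 - collar_pct)
    return max(parent_limit, collar_price)
```

With a reference price of 100 and a parent buy limit of 110, this yields a limit of about 105 for an auction order and about 102 for an intraday order. If both collars were applied to the auction order, it would end up at 102, which is the shape of the bug described above.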

4/12/21: two new algo enhancements:

Introduced randomness into interval timing.
Smoother scheduling of remaining shares in the order near the end of the day after sending a slice to the closing auction.
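As a minimal sketch of what randomized interval timing can look like: the symmetric uniform jitter, the 20% bound in the usage note, and the function name are assumptions for illustration, not a description of how the production algo randomizes its intervals.

```python
import random


def jittered_interval_seconds(base_seconds, max_jitter_fraction, rng):
    """Return an interval length perturbed by up to +/- max_jitter_fraction."""
    jitter = rng.uniform(-max_jitter_fraction, max_jitter_fraction)
    return base_seconds * (1.0 + jitter)


# A dedicated generator keeps the randomness reproducible in tests.
rng = random.Random(2021)
sample = [jittered_interval_seconds(300.0, 0.2, rng) for _ in range(5)]
```

Here each nominal 300-second interval lands somewhere between 240 and 360 seconds, so the times at which the algo begins a new slice are harder to predict from the outside.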

4/21/21: two new algo enhancements:

Dynamic VWAP Model Paradigm Shift (described in the Major Releases section above).
Introduced dynamic scaling of interval duration throughout the day: slightly longer intervals in the morning, slightly shorter intervals toward the close. The longer the interval duration, the longer the algo will attempt to pick up passive fills before requiring an interval’s shares to be completed (and potentially crossing the spread). Longer intervals mean there’s more variance around how far ahead/behind the algo may be relative to its target in a given moment. Because spreads are wider at the start of the day, this change gives the algo a little more opportunity to get passive fills and avoid paying those full spreads at the trade-off of slightly increased variance.
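One simple way to express this kind of scaling is a linear interpolation across the session. The sketch below is illustrative only: the linear shape, the 360-second and 240-second endpoints, and the function name are assumptions; the 390-minute figure is the length of the regular US equity session (9:30 to 16:00).

```python
def interval_seconds(minutes_since_open, session_minutes=390,
                     open_seconds=360.0, close_seconds=240.0):
    """Interpolate interval duration from longer (open) to shorter (close)."""
    # Clamp progress to [0, 1] so pre-open and post-close inputs are safe.
    progress = min(max(minutes_since_open / session_minutes, 0.0), 1.0)
    return open_seconds + (close_seconds - open_seconds) * progress
```

With these made-up endpoints, the interval is 360 seconds at the open, 300 seconds at midday, and 240 seconds at the close.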

4/22/21: bugfix; as part of the VWAP model paradigm shift above, we introduced a bug where need-not-complete orders constrained by the information leakage model still used the total order shares, and not the smaller number of scheduled shares, when calculating the portion of the order that had traded so far. Affected orders still traded the appropriate number of shares, but they front-loaded the schedule. We identified the issue on the day of the release and fixed it for the following day.
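To see why the wrong denominator front-loads the schedule, consider a small worked example. The numbers and the function name are made up for illustration: a 100,000-share need-not-complete order of which the information leakage model schedules only 40,000 shares.

```python
def apparent_shortfall(filled_shares, denominator_shares, target_pct):
    """How far behind its target the order appears, as a fraction."""
    return target_pct - filled_shares / denominator_shares


TOTAL_SHARES = 100_000      # full size of the hypothetical order
SCHEDULED_SHARES = 40_000   # portion the leakage model chose to schedule
FILLED_SHARES = 10_000
TARGET_PCT = 0.30           # where the volume model says we should be

# Buggy: progress measured against the total, so the order looks 10% done.
buggy = apparent_shortfall(FILLED_SHARES, TOTAL_SHARES, TARGET_PCT)
# Fixed: progress measured against scheduled shares, so it is 25% done.
fixed = apparent_shortfall(FILLED_SHARES, SCHEDULED_SHARES, TARGET_PCT)
```

In this example the buggy calculation makes the order look about 20 points behind target when it is really only 5 points behind, so the algo trades more aggressively early on than it should.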

4/28/21: operational enhancement to log additional information about the VWAP schedule.

5/24/21: upstream and downstream order amendments (see Major Releases section above).

OMS

5/24/21: Upstream and downstream order amendments (see Major Releases section above).

5/28/21: bugfix for a race condition where a full fill received while a child order amendment to a larger quantity was in-flight would not be properly relayed upstream. The way we implemented child order replaces to a higher quantity, the OMS internally acknowledges the replace request immediately. This means when this full fill race condition occurs, the OMS either needs to send a new child order downstream for the increased quantity, or it must cancel the child order upstream. We elected to cancel the child order, but the bug was that the cancel was sent upstream out of order, prior to the relayed fill message. This issue happened in production on 5/27 and resulted in our broker dealer subsidiary’s first ever trading incident (the broker-dealer took on a small error position; no impact to the client). We identified the bug and fixed it for the following day’s trading session.
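The importance of message ordering here can be shown with a toy model. Everything in this sketch is hypothetical: the tuple-based message format, the consumer that ignores fills on closed orders, and the identifiers are assumptions chosen to illustrate the general failure mode, not a reconstruction of the actual OMS or algo code.

```python
def apply_upstream(messages):
    """Toy upstream consumer: counts fills only for orders it believes are open."""
    open_orders = {"child-1"}
    filled_shares = 0
    for kind, child_id, quantity in messages:
        if kind == "CANCEL":
            open_orders.discard(child_id)
        elif kind == "FILL" and child_id in open_orders:
            filled_shares += quantity
    return filled_shares


# Correct ordering: relay the fill first, then cancel the child order.
correct_order = [("FILL", "child-1", 100), ("CANCEL", "child-1", 0)]
# Buggy ordering: the cancel overtakes the fill.
buggy_order = [("CANCEL", "child-1", 0), ("FILL", "child-1", 100)]

shares_seen_correct = apply_upstream(correct_order)  # fill is recorded
shares_seen_buggy = apply_upstream(buggy_order)      # fill is dropped
```

In this toy model, the out-of-order cancel causes the consumer to lose track of 100 executed shares, which is the kind of mismatch that can leave a broker holding an unintended position.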

Market data

4/8/21: treat opening auction prints as regular way and volume/price setting for the purpose of determining price collars and tracking volume in the VWAP model.

4/27/21: on 4/22, our market data provider dropped its subscription to one exchange in one symbol, resulting in a stale/crossed quote. The stale quote fortunately had no material impact on trading that particular order. On 4/27, we introduced a command into the ticker plant to allow us to manually re-initiate a specific market data subscription, which would have remediated that issue on that day. We also added monitoring around stale quotes and crossed quotes. We have not experienced any dropped subscriptions since then, so we haven’t had to use this new command yet.
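Monitoring of this kind can be sketched as two small checks over the latest quote per exchange. The `Quote` structure, the function names, and the 5-second staleness threshold are assumptions for illustration; a real monitor would also need to account for symbols that legitimately quote infrequently.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    bid: float
    ask: float
    last_update_ts: float  # seconds, on any consistent clock


def crossed_market(quotes):
    """True if the best bid across exchanges exceeds the best ask."""
    best_bid = max(q.bid for q in quotes.values())
    best_ask = min(q.ask for q in quotes.values())
    return best_bid > best_ask


def stale_exchanges(quotes, now_ts, max_age_seconds=5.0):
    """Exchanges whose latest quote is older than the (assumed) threshold."""
    return sorted(
        exchange for exchange, q in quotes.items()
        if now_ts - q.last_update_ts > max_age_seconds
    )


# Exchange "B" stopped updating while the market moved down on "A".
quotes = {
    "A": Quote(bid=10.00, ask=10.02, last_update_ts=100.0),
    "B": Quote(bid=10.05, ask=10.07, last_update_ts=40.0),
}
```

In this example the stale quote on "B" makes the consolidated market appear crossed, and the staleness check points directly at the subscription that needs to be re-initiated.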

5/3/21: when we receive a new order, the algo engine subscribes to market data in that symbol (if it hasn’t already) and receives a snapshot of the latest quote and trade from each exchange. Previously odd lot trades were not treated as “regular way” (which was defined to include all “Last Price Setting” trades, plus certain outside-hours trades), and if every last trade per exchange coincidentally happened to be an odd lot, the algo would not capture the current market volume at the start of the order, affecting its subsequent calculations for how much market volume traded throughout the life of the order. With this change, the algo engine treats those odd lot trades as regular-way to better handle this scenario.

5/25/21: performance enhancement where network interrupts are now pinned to a specific core.

Client and Venue Gateways

4/5/21: more graceful handling of IEX D-Limit order restatement FIX messages. Previously these would cause an alert to be logged which was harmless but distracting from a support perspective.

4/13/21: small bugfix for mistranslation of LastLiquidityIndicator on ExecutionReports (no trading impact).

4/20/21: CG enhancement to accept and record FIX tag 5700 (LocateBroker) on short sale orders.
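As a small illustration of what reading this tag involves, the sketch below parses a raw FIX tag=value message and extracts tag 5700 on short sales. The function names and the sample message are made up for the example; the side codes are standard FIX (tag 54, where 5 is sell short and 6 is sell short exempt). Storing fields in a plain dictionary is a simplification that would not handle repeating groups.

```python
SOH = "\x01"  # standard FIX field delimiter


def parse_fix(raw_message):
    """Parse a FIX tag=value string into a {tag: value} dictionary."""
    fields = {}
    for pair in raw_message.strip(SOH).split(SOH):
        tag, _, value = pair.partition("=")
        fields[int(tag)] = value
    return fields


def locate_broker(fields):
    """Return LocateBroker (tag 5700) for short sale orders, else None."""
    if fields.get(54) in ("5", "6"):  # sell short / sell short exempt
        return fields.get(5700)
    return None


# Hypothetical short sale order carrying a locate broker.
sample_order = SOH.join(["35=D", "55=XYZ", "54=5", "38=100", "5700=ABCD"]) + SOH
```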

6/4/21: new feature to allow SSL properties to be loaded from the config file.

Infrastructure / Framework

3/30/21: operational enhancements including the ability to inject messages into the sequenced stream via a JSON file (previously, we only supported CSV) and a new admin command allowing risk limits to be updated by the UI.

3/31/21: more graceful handling of scenario where an application fails to send a message to the sequenced stream and requires a resend.

4/8/21: added additional source information on messages generated by the sequencer. Additionally introduced improvements to the logging application to more gracefully log messages containing delimiter characters in their payload.
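The delimiter problem is a general one for delimited log formats: a payload that contains the delimiter will be split into the wrong number of fields unless it is quoted or escaped. The sketch below shows the quoting approach using Python's standard `csv` module. It assumes a comma-delimited log line purely for illustration and does not describe the actual format or fix used in our logging application.

```python
import csv
import io


def format_log_line(fields, delimiter=","):
    """Serialize fields into one line, quoting any that contain the delimiter."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter,
                        quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(fields)
    return buffer.getvalue()[:-1]  # drop the trailing line terminator


def parse_log_line(line, delimiter=","):
    """Recover the original fields from a line written by format_log_line."""
    return next(csv.reader([line], delimiter=delimiter))


# The third field contains the delimiter inside its payload.
record = ["42", "NewOrder", "text=hello, world"]
line = format_log_line(record)
```

A naive `line.split(",")` on the result would return four pieces, whereas `parse_log_line(line)` returns the original three fields.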

4/22/21: enhancements to the sequenced stream playback mechanism, plus a safeguard that prevents applications from attempting to write new data to the stream after system stop.

5/18/21: performance enhancement where the writer threads on the OMS and algo engine servers now busy spin.

5/20/21: performance enhancement where we increased the initial and maximum memory allocation pool in the JVM.

5/27/21: performance enhancement to the database writer where market data records are now inserted into the MemSQL database via a secondary aggregator to allow for load balancing.

UI

4/1/21: improved client-side logging in the UI to help investigate potential issues.

4/7/21: session security enhancements in the UI backend.

4/7/21: performance enhancement to restrict data snapshot pushes only to currently subscribed views.

4/14/21: performance enhancements to reduce lag in the case of large numbers of orders in the UI (e.g. 500k+ child orders) and to reduce network time on initial loading of the UI.

4/16/21: improved management and storage of secret keys for UI deployment through SSM.

4/30/21: UI backend performance enhancements: optimized large snapshot delivery by kickstarting with a bulk payload and trickling in subsequent chunks, which prevents large snapshots from bogging down the UI; and reversed the insertion order of message streaming so that the most recent data is prioritized for painting.
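The delivery pattern described here (one bulk payload first, smaller chunks after, newest data first) can be sketched with a generator. The function name, the sizes, and the assumption that records arrive in insertion order (oldest first) are illustrative; the real UI backend is not written this way as far as this post describes.

```python
def snapshot_chunks(records, bulk_size, chunk_size):
    """Yield a large first payload, then smaller chunks, newest records first."""
    newest_first = records[::-1]  # assumes records are stored oldest-first
    yield newest_first[:bulk_size]
    for start in range(bulk_size, len(newest_first), chunk_size):
        yield newest_first[start:start + chunk_size]


# Ten records numbered in insertion order; 9 is the most recent.
payloads = list(snapshot_chunks(list(range(10)), bulk_size=4, chunk_size=3))
```

The first payload carries the four most recent records so the UI can paint something useful immediately, and the older records trickle in afterward in chunks of three.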

5/4/21: introduced a powerful new feature called “canned queries” which allows us to input an SQL query and generate a persistent view of the query’s results directly in the UI.