On the Power and Peril of Examples

Allison Bishop
Published in Proof Reading
May 9, 2022

We like to talk about our algos. We especially like to talk about ideas for making our algos better, ideas for making our measurements of algos better, and how our current algos work. We’ve published the research behind our flagship Proof algo here and discussed the lower level tactics of our algos here. But understandably, most potential clients ask for the short pitch version instead of the 50-page whitepaper version.

Here’s the thing about short algo pitches though — they don’t usually make much sense. Suppose we were to claim something like: “we use AI to beat the slippage vs. arrival of our competitors by 5 bps!” First, how could we know that? Our competitors don’t publish what they’re doing or share their data. Second, even measuring our own performance is a bit of a chicken and egg problem. It takes a lot of trading to build up enough sample size to get meaningful results on noisy metrics like slippage vs. arrival. So if most people insist on such results before trading with us, we’re caught in a circular trap.

This is a conundrum that’s not particularly unique to Proof — it’s generally true for any new company in the algorithmic trading space. But there aren’t very many new companies in the algorithmic trading space, so as an industry we don’t have many recent models for navigating this well. The approach we’ve been taking so far is: ask to be evaluated on the strength of our experience (members of our team built Dpeg at IEX, as well as algos at RBC before that), the openness and quality of our research, and low-level metrics like short-term markouts that are robust at our current sample sizes. We are very grateful to the early clients who are using us based on these reasons, but such brave souls can be hard to find.

There are a few other approaches we could take when it comes to selling our algos. One is rather common but also scientifically fraught — extrapolating a case from a small number of examples. When we are developing and testing algos, we frequently look at such examples — making sure that our high level understanding of what the algos are doing checks out when we play things out for a specific order. Examples are great for building our intuition, keeping us grounded in reality, and potentially revealing our blind spots. They are great for inspiring ideas about how we might do things differently. What they are not great for, unfortunately, is evaluating the overall quality of an algo in a noisy market environment.

We might look at an example and say for instance: well, in this case, we wish the algo had traded more aggressively. This is a great jumping off point for research, but it’s probably not a great idea to just go and change the algo code to behave more aggressively. Because that code change would then affect algo behavior in lots of different situations, and we have no idea how representative our example was. Maybe in the majority of these situations, we were better off with our prior, less aggressive behavior.

The temptation to extrapolate a comprehensive story from a small example is fundamentally human. It is how our brains have evolved to operate in a world that would otherwise be overwhelming in its distinct complexities. That is why selling an algo on an example can work — humans are very inclined to believe that the example is representative. They’ll even do the sketchiest part of the work for you. You can say, “see, in this particular situation, we saved you X bps!” At this point, you have made a true statement. A demonstrably true statement, even. The client then does the math themselves. “So if you save me X bps on every order, and I trade Y orders a month… then wow! I will save so much by trading with you!”

This is a victory for every middle school math teacher who told you that “word problems really are useful in real life.” And it is correct math — but it is bad science. The client has made an assumption that the example is representative — that their set of Y orders trading in a noisy market can be expected to behave as a collection of very similar examples. Even if Y is large and noise can be somewhat canceled out for the performance of these orders on the whole, the sample size of the example is small. In the best case, the example was selected somewhat randomly, and there is some chance that it reflects the algo’s average case performance on orders like the client’s. Presumably though, if we had good information about the average case performance of the algo, we would be using that directly instead of extrapolating from an example. So what we’re doing here is taking a stat that isn’t stable because the sample is too small, and making the sample even smaller by narrowing in on a single example. This is not a strategy that should inspire confidence. And in the worst case, the example was cherry-picked, whether intentionally or unintentionally.

Nonetheless, we can learn a lot from examples — we just shouldn’t leap to an overall estimation of an algo’s performance. But examples can give us insights into the relative importance of different kinds of algo decision-making. So here, let’s go through the exercise of using an example to gain a better understanding of the impact minimization model that is part of Proof’s flagship algo.

First, a bit of context: Proof’s main algo has two components, a liquidity-seeking component and an impact minimization component. The goal of the liquidity seeker is to find large blocks to trade, presumably against natural counterparties. The goal of the impact minimization component is to trade a reasonable volume in a way that minimizes market impact. Its main strategy for doing this is to choose a schedule that minimizes our own model of expected market impact as a function of our algo’s posting and spread crossing activities.

The liquidity seeker is the flashier piece that everyone wants to talk about. What dark pools are you connected to? What min quantities do you use? These things are well known variables and relatively easy to talk about. But it’s harder to be a topic of gossip when you’re an impact-minimizing scheduler. If you do your job well, nobody in the market really notices you. And there isn’t as much concrete common ground in the discussion around impact minimization. There are terms like “impact” and “reversion,” but their exact definitions shift, and no one bothers to fully define them most of the time anyway. And so our impact minimization scheduler, which is perhaps one of the most unique things about our product offering, languishes in the shadow of its more popular liquidity-seeking sibling. [As I write this, I feel like a mom telling the less popular of her children, “it doesn’t matter that your sister is going to prom and you aren’t. At least I think you’re special.” Did you hear that? The impact minimization scheduler just groaned in embarrassment.]

Proof’s impact model operates on a basic premise: in order to trade, the algo does things that are visible to the market. In our case, the algo posts orders at the NBBO and sometimes crosses the spread. Posted orders that join the NBBO and aggressive orders that take across the spread may cause the NBBO to move. So if we want to model how our algo is going to affect prices, we can do this by modeling how our scheduling decisions tend to translate into a mix of these actions, and then modeling how the market generally responds to such increases in NBBO-joining and spread-crossing behaviors. Our model of how our scheduling decisions translate into expected amounts of joining and crossing is built from our own recent trade data, while our model of how joining and crossing imbalances affect prices is built from historical market data. [This use of historical market data is currently a necessity, as our own trading data is too small to support a robust price impact model fully on its own.]
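For readers who think in code, here is a minimal sketch of that two-stage composition. To be clear, the joining/crossing split and the sensitivity numbers below are made-up placeholders for illustration, not anything from our production model:

```python
# Minimal sketch of a two-stage impact model. Every number here is a
# made-up placeholder, not Proof's production model.

def expected_join_cross(scheduled_units: float) -> tuple[float, float]:
    """Stage 1 (fit from our own recent trade data in the real model):
    map a scheduled quantity for an interval to the expected amounts of
    NBBO-joining and spread-crossing it generates."""
    join_rate, cross_rate = 0.7, 0.3  # hypothetical split of child-order activity
    return scheduled_units * join_rate, scheduled_units * cross_rate

def expected_price_impact(own_joining: float, own_crossing: float) -> float:
    """Stage 2 (fit from historical market data in the real model): map a buy
    order's joining and crossing activity to an expected upward price move,
    in arbitrary units. Both terms add pressure in this toy version."""
    beta_join, beta_cross = 0.4, 1.0  # hypothetical sensitivities
    return beta_join * own_joining + beta_cross * own_crossing

# Composing the two stages: the modeled footprint of buying 2 units in one interval.
join, cross = expected_join_cross(2.0)
print(expected_price_impact(join, cross))
```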

Even with the large sample sizes of historical market data, building a model of price behavior is challenging due to high levels of noise. But it helps if we keep the time scales relatively short. So we build our model by examining 10-minute intervals, and we consider the price movement in the current interval as a noisy function of the pressures exerted by NBBO-joining and spread-crossing behaviors in the current interval and the one before it. On average, we see the kinds of price impacts you would intuitively expect. If there is currently more spread-crossing to trade at the offer than at the bid, prices have a tendency to move up. If there is currently more joining on the offer than the bid, prices have a tendency to move down. We can also see reversion effects: if there was previously more trading at the offer and joining at the bid, but now things have stabilized, prices will have a tendency to partially revert.
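For the structurally inclined, here is an illustrative version of that estimation, fit on synthetic data rather than real historical market data. The coefficients baked into the synthetic data are arbitrary assumptions; the point is just the shape of the regression (this interval’s price move as a function of current and previous-interval joining/crossing imbalances):

```python
# Illustrative only: fit a two-interval linear model of price moves on
# joining/crossing imbalances, using synthetic data in place of real
# historical market data.
import numpy as np

rng = np.random.default_rng(0)
n = 5000  # number of 10-minute intervals in the synthetic training set

# Imbalances are (offer-side minus bid-side) activity in each interval.
cross_imb = rng.normal(size=n)  # spread-crossing imbalance
join_imb = rng.normal(size=n)   # NBBO-joining imbalance

# Behavior baked into the synthetic data: crossing toward the offer pushes
# prices up, joining at the offer pushes them down, and part of the previous
# interval's crossing pressure reverts.
price_move = np.zeros(n)
price_move[1:] = (1.0 * cross_imb[1:] - 0.4 * join_imb[1:]
                  - 0.3 * cross_imb[:-1]
                  + rng.normal(scale=2.0, size=n - 1))

# Regress each interval's price move on current and previous-interval imbalances.
X = np.column_stack([cross_imb[1:], join_imb[1:], cross_imb[:-1], join_imb[:-1]])
coeffs, *_ = np.linalg.lstsq(X, price_move[1:], rcond=None)
print(coeffs)  # recovers roughly (+1.0, -0.4, -0.3, ~0) despite the noise
```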

Human traders know this and often take it into account. If they think they have traded too aggressively and caused the price to move, they will likely slow down or sit out for a bit, hoping for the price to revert a bit before they resume behaving more aggressively. We have designed our impact-minimizing scheduler to do the same thing, but with one key difference: it doesn’t look at the current price in the stock to decide if it should back off or not. Instead, it looks at its own recent behavior (how much spread-crossing and NBBO-joining did it actually do in the last interval?) and computes what it expects to be an optimal continuation, given its constraints and assuming average market conditions.
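Here is a rough sketch of what that re-optimization looks like, with a deliberately made-up cost model standing in for the impact model described above. The thing to notice is the inputs: the algo’s own realized activity and its remaining constraints, never the current price.

```python
# Sketch of re-optimizing a schedule from the algo's own recent behavior.
# The cost model here is a made-up stand-in, not Proof's impact model.
from itertools import product

def modeled_cost(prev_units: int, schedule: tuple) -> float:
    """Toy cost: impact grows with how much you trade in an interval, and
    trading right on the heels of an aggressive interval costs extra
    because prices haven't had a chance to revert."""
    cost, last = 0.0, prev_units
    for q in schedule:
        cost += 1.0 * q ** 2 + 0.5 * q * last
        last = q
    return cost

def best_continuation(realized_last_interval: int, units_left: int,
                      intervals_left: int, max_per_interval: int = 3) -> tuple:
    """Brute-force search over the remaining feasible schedules."""
    feasible = (s for s in product(range(max_per_interval + 1), repeat=intervals_left)
                if sum(s) == units_left)
    return min(feasible, key=lambda s: modeled_cost(realized_last_interval, s))

# e.g. we actually got 3 units done last interval and still owe 7 over 4 intervals:
print(best_continuation(realized_last_interval=3, units_left=7, intervals_left=4))
```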

Why do we assume average market conditions, you ask? Well, this is definitely something we’d like to improve upon as we iterate on the model. But for now at least, if we split our model training data into smaller pieces based on various market conditions, the noise begins to overwhelm the signal.

So what does this look like in practice? We’re finally ready to take a look at an example. Let’s imagine that we want to buy 1% of the ADV in a given stock during the last hour of the trading day, which we’ve broken up into six 10-minute time intervals. Our algo will typically consider trading arbitrary amounts of round lots within reasonable constraints, but for simplicity of illustration, let’s say we’re going to choose how many shares to buy in each time interval from a small list of possibilities, namely 0, 0.1%, 0.2%, or 0.3% of the ADV. In other words, we’ll trade in units of 0.1% ADV, and we’ll schedule up to 3 units per interval. We have 10 units to trade overall, so a TWAP-ish schedule might look something like: 2,1,2,2,1,2. A VWAP-ish schedule might look something like: 1,1,1,2,2,3.
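To pin down what that toy search space actually is, here is the enumeration in code. The constraints come straight from the setup above; there is no impact model in this snippet, just counting:

```python
# Enumerate the toy schedule space from the example: 10 units of 0.1% ADV
# spread over six 10-minute intervals, at most 3 units per interval.
from itertools import product

schedules = [s for s in product(range(4), repeat=6) if sum(s) == 10]

print(len(schedules))                   # 546 feasible schedules
print((2, 1, 2, 2, 1, 2) in schedules)  # the TWAP-ish schedule
print((1, 1, 1, 2, 2, 3) in schedules)  # the VWAP-ish schedule
```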

Our impact model takes in a stock-specific volatility parameter, so we’ll set that equal to what we computed for SPY as of yesterday. [As I write this, it is May 5 and the markets had a very bad day today, so we’ll take a slightly older historical calculation that did not include today, and which produced a more typical value for recent times.] With this parameter set, we can run our impact model to estimate the relative expected costs of various schedule choices for this trading activity. Turns out it thinks the optimal schedule in average market conditions with these clunky constraints is: 3,0,0,3,1,3. This seems to reflect the intuition above: we trade relatively aggressively for a bit, then wait for prices to revert, then trade aggressively again.

So what does our model think the expected cost of this schedule is, as compared to say our TWAP-ish or VWAP-ish one? Our cost units are a bit weird here and not worth explaining, but the ratios of our estimates are meaningful. The cost estimate for the TWAP-ish schedule is 10.09, the cost estimate for the VWAP-ish schedule is 10.085, and the cost estimate for the “trade-and-wait”-ish strategy is 10.045.
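Since only the ratios matter, here is the quick arithmetic on the relative gaps between those three estimates (just division of the numbers above, nothing model-related):

```python
# Relative comparison of the reported cost estimates. The units are arbitrary,
# so only the ratios are meaningful.
twap_ish, vwap_ish, trade_and_wait = 10.09, 10.085, 10.045

print((twap_ish - trade_and_wait) / twap_ish)  # ~0.0045
print((vwap_ish - trade_and_wait) / vwap_ish)  # ~0.0040
```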

I don’t know how to shout this loudly enough in blog post form, but THIS SHOULD NOT BE TAKEN AS AN ESTIMATE OF HOW OUR ALGO WILL PERFORM! IT IS ONE EXAMPLE, AND THERE ARE MORE CAVEATS HERE THAN A PHARMACEUTICAL AD. [*side effects may include: flawed assumptions of average case market behavior, shaky extension of a two-interval model to a six-interval scheduling problem, choice of an example that may not be representative … Use only as directed.]

However, what we can see here is the scale of the stakes. Tens of bps are on the line when we make these scheduling decisions. On average, over time — that can really matter.

So what should you do with this information if you’re a buyside firm trying to decide which trading product to use? You probably already knew these decisions can matter, and you are in one of two positions: 1. you trade enough that you can get sufficient sample sizes to evaluate your brokers reasonably on metrics like slippage vs. arrival, or 2. you don’t trade enough to do that.

Many (if not most) buyside firms are in position 2. This is a tough spot to be in — you know these decisions matter, and you also know that you do not have a firm quantitative basis for making them. It might seem like it makes your job easier when brokers can pitch: “we use AI to beat the slippage vs. arrival of our competitors by 5 bps!” But actually it makes it harder, because now you have to figure out who is lying, who is cutting corners, and who is making leaps that don’t pan out in reality. In this scenario, evaluating a transparent broker like Proof on new criteria like the quality of our research process may sound like a lot of work you don’t have time to do. But here’s the thing about making people show more of their process: it’s usually not that hard to tell who has done their homework and who hasn’t.

Now let’s imagine you are in position 1. Lucky you! If you want, you can just try new products and evaluate them effectively to decide if you should keep them. You shouldn’t really need us to say: “we use AI to beat the slippage vs. arrival of our competitors by 5 bps!” You can measure that better than we could anyway. What you might need instead is: a reason to support new products in general, and then a reason to support Proof in particular. Why should you support new products, you ask, when your effective policing of your existing brokers gives you excellent execution already? Well, perhaps because very few firms are like you. So your ability to police your brokers effectively doesn’t mean the general state of competition among brokers is healthy. And your ability to compare brokers doesn’t guarantee that anyone in the overall set of brokers you have is actually innovating and getting anywhere close to theoretically optimal execution.

If we believe that true competition drives good results, then we should believe in the importance of occasionally having new entrants come in and try new approaches. And we should be realistic about the challenges new entrants face: if it were easy, we’d see more than a handful of new brokers every decade or so.

In the context of algos, single trading examples are easily available but low on meaning. But in the context of brokers, we believe that a single transparent broker can provide a rare but powerful example that can help raise the bar of accountability for everyone. So I’m not going to say that “we use AI to beat the slippage vs. arrival of our competitors by 5 bps.” I just can’t bring myself to do it. I hope you’ll consider trading with us anyway.
