The Value of Things that Don’t Work

Allison Bishop
Published in Proof Reading
Jun 23, 2021

There is a joke that history likes to play on young scientists. It goes like this: you are staring at a whiteboard in a windowless shared grad student office (probably sitting under a ceiling tile that leaks a mysterious brown liquid from time to time, which you have decided it’s better not to investigate). Suddenly, you have an idea! It is a glorious idea, entirely original and cool. No one in the history of ideas has had an idea as simultaneously clever and retrospectively elegant as this idea. And it works! You race down the hall to tell your research advisor. Your advisor’s desk is buried in stacks of printed conference proceedings and half-graded final exams. She is hunched over her laptop, where her inbox displays a daunting gazillion unread messages. She doesn’t look up while you describe your idea. She wordlessly reaches deep into one of the many stacks of printed papers and hands you something that is at least 80 pages long and may be written in another language. The pages are yellowed from wear, and probably coffee. For a brief moment you optimistically and whimsically wonder if maybe the ceiling tile is just leaking coffee.

“It’s already been done,” she says flatly. “See appendix D.”

The joke is that this doesn’t stop happening as you become an older scientist. But you do grow to expect it. Ideas that work are etched into the scientific record, which is vast and disorganized and tends to swallow up your greatest achievements and spit them back as unoriginal.

Ideas that don’t work, however, often stay in the realm of private fantasy. They are much less likely to be written down and shared with others, and hence much less likely to harshly encounter the larger scientific context. “If that had worked,” you privately think, “I could have been famous.”

But surely our ideas that don’t work are as common (and sometimes as insightful!) as our ideas that do work. By not adding them to the shared scientific record, we forfeit key opportunities to reduce duplication of effort and crowd source solutions. And worse, we distort the true record of positive results, robbing each other of crucial context for understanding and interpretation. We also rob each other of inspiration — failed ideas often spur ultimately successful ones, and hoarding them can be just as counterproductive to scientific communities as hoarding successes.

I have long been an advocate of publishing and discussing failed ideas more openly. In fact, I founded an annual conference that publishes and celebrates failed research attempts in cryptology (the science of designing and breaking codes). Similarly, I believe that Proof’s promise of transparency and collaboration is not fully satisfied if we are only forthcoming about our successful ideas and omit the unsuccessful ones (or “not-yet-successful” ones, as they may prefer to be called).

So today I’d like to share the story of some research that … drum roll please … did not improve our VWAP algo.

To describe this research, we need to establish a little bit of context first. A central object of study for designing VWAP algos is the “volume curve.” We can think of the volume curve for a given symbol on a given trading day as a graph, where the x-axis represents time and the y-axis represents what fraction of the trading day’s total volume has traded so far. This graph will always start at 0 at the beginning of the trading day and end at 1 at the end of the trading day.
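
To make this concrete, here is a minimal sketch of how such a curve could be computed from one day’s trade tape. The function name and the (timestamp, size) input format are my own assumptions for illustration, not anything from our production code.

```python
import numpy as np

def volume_curve(trade_times, trade_sizes):
    """Cumulative fraction of the day's total volume traded as of each trade.

    trade_times: trade timestamps (e.g. minutes since the open), in order.
    trade_sizes: corresponding trade sizes in shares.
    Returns (times, fractions), where fractions climbs from near 0 up to 1.0.
    """
    sizes = np.asarray(trade_sizes, dtype=float)
    cumulative = np.cumsum(sizes)
    fractions = cumulative / cumulative[-1]  # normalize by the day's total volume
    return np.asarray(trade_times, dtype=float), fractions

# Toy example: three trades in a 390-minute session.
times, fracs = volume_curve([30, 180, 370], [1_000, 3_000, 1_000])
# fracs -> [0.2, 0.8, 1.0]
```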

At any given point during the trading day, we know how much volume has traded in an absolute sense (the number of shares traded so far) but we don’t know what fraction of the total day’s volume this represents, as the day isn’t over yet. Hence, in real-time, today’s volume curve is something we have some information about, but is not something we can fully know until the day is over. If we did somehow know the volume curve in real-time, we could shape our trading to fit it, and we would expect to achieve an average price for our trades very close to the market’s volume-weighted average price. Hence, it makes sense to try to make good predictions of the volume curve as the day progresses in order to shape our own trading towards our current “best guess” of what the volume curve will be.

A very common approach is to guess that today’s volume curve in a symbol will follow an average curve computed from recent completed trading days (e.g. a rolling 20-day or 30-day historical average). We’ll refer to such averages as “historical volume curves.” By themselves, these curves do not consider any data from the current trading day, so predictions using this method do not change as we receive real-time market data.
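
As a rough sketch (the common one-minute-style sampling grid is my own assumption here, purely for illustration), a 20-day historical volume curve is just an average of the last 20 completed daily curves, each sampled on the same time grid:

```python
import numpy as np

def historical_volume_curve(daily_curves):
    """Average recent completed daily curves into one historical curve.

    daily_curves: array of shape (n_days, n_bins); each row is one completed
    day's cumulative volume fractions sampled on a common time grid
    (starting near 0 and ending at 1).
    """
    return np.asarray(daily_curves, dtype=float).mean(axis=0)

# e.g. a rolling 20-day historical curve for one symbol:
# hist_curve = historical_volume_curve(per_day_curves[-20:])
```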

Right now (as of June 2021), our VWAP algo uses real-time volume data combined with historical volume curves and historical average daily volume (ADV) information to make continuously adjusting, real-time predictions of the volume curve in a symbol we are trading. (See our whitepaper on this topic for more details.) The inputs to this calculation (the real-time volume, the ADV, and the historical volume curve) are computed individually by symbol, but the model that translates these inputs into an updated prediction is one-size-fits-all.

When we ask questions like: “This symbol’s volume curve expects 43% of today’s volume to have traded by now, and we’ve actually observed volume in this symbol that represents about 60% of the ADV. What do we think is going on?” we consult a model that is trained broadly across symbols, weighted by notional value. “In my vast experience across stock trading,” we can imagine it replying, “that typically means that closer to 46% of today’s volume has traded by now. You want to revise the volume curve estimate up a bit, but don’t overreact.”
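
To give the flavor of this kind of adjustment, here is a purely illustrative toy blend (not the actual model from our whitepaper) that nudges the historical expectation toward the naive signal of observed volume divided by ADV. The blending weight below is chosen only so the toy roughly reproduces the 43% to roughly 46% example above.

```python
def toy_adjusted_fraction(expected_frac, observed_shares, adv, alpha=0.2):
    """Purely illustrative toy adjustment; NOT the production model.

    expected_frac:   fraction of today's volume the historical curve expects by now.
    observed_shares: shares actually traded so far today.
    adv:             historical average daily volume, in shares.
    alpha:           how far to lean toward the naive signal observed_shares / adv
                     (0.2 is picked only to roughly match the 43% -> ~46% example).

    If today turned out to be exactly an ADV-sized day, observed_shares / adv would
    itself be today's fraction traded so far; this toy moves the historical
    expectation a little way toward that signal rather than overreacting to it.
    """
    naive_frac = observed_shares / adv
    return (1 - alpha) * expected_frac + alpha * naive_frac

# The example from the text: expecting 43% of the day's volume, observing 60% of ADV.
toy_adjusted_fraction(0.43, observed_shares=600_000, adv=1_000_000)  # -> 0.464
```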

We might ask: why not train such a model for each symbol, just like we compute a volume curve for each symbol? There are a few reasons why this is not desirable. One is sheer complexity — instead of one model, we’d have about 10,000 of them. That’s 10,000 opportunities to get a bad model. And worse — each model would be based on dramatically fewer data points. This would make us much more susceptible to over-fitting and getting thrown off by unusual, outlier observations. From a conceptual standpoint, training a model per symbol is like pretending that trading behavior across symbols is a set of roughly 10,000 completely distinct and independent phenomena that must each be approached from scratch. This is intuitively sub-optimal from a learning perspective. Sure, there are differences in behaviors between symbols, but there are also similarities. Would we really train a native Spanish speaker in French with a mantra of: “first forget everything you know about Spanish”?

On the other hand, having a one-size-fits-all model means that the model will reflect average case behavior that may not be a great fit in non-average situations. In some sense, our one-size-fits-all model acts as a force pushing us towards the average behavior, coaxing us away from trying to chase idiosyncrasies of particular volume curves more narrowly.

Is this good or bad? We definitely believe it is good on average, or our model’s predictions wouldn’t test as more accurate on average than the basic approach of sticking with the historical volume curves. However, just because this approach seems good “on average” does not mean that it is the best approach for any particular situation, and there may still be considerable improvements to be found in customization. To get a sense of what customizations might be useful, it is worth dissecting a bit where our performance gains might be coming from.

Better performance “on average” can mean many things — it can mean we make dramatically better predictions on a small number of symbols, and only marginally better or even worse predictions on many others. It could mean we make modestly improved predictions on most symbols. In general, averaging will conflate scenarios that might have differing implications. If it is not the case that our model makes better predictions than the historical volume curves on most symbols, for example, then we may want to restrict our use of the model to the symbols where it seems to be an improvement.

To gauge this, we looked at the model’s performance on each symbol over a test data set spanning most of the month of April 2021. (We ran this analysis on April 29, so it included historical data from April 1 through April 28. The model we were testing was trained on data from January through March of 2021.) For each symbol, we sampled predictions of cumulative volume every 10 minutes, using both our model and the basic historical volume curves, and we computed the average error of each method of prediction relative to the true volume percentages. In an overwhelming majority of symbols, the predictions from our model outperformed the historical volume curves. In fact, our model’s predictions outperformed the historical curves in all but 565 of the roughly 10,000 symbols. And for those 565 symbols where the historical volume curve beat our model as a predictor, the performance gap was very tiny in all but 383 symbols. Those 383 symbols combined represented only about 3.5% of the total notional value traded in our test data. This suggests that our model is an improvement on the status quo of historical volume curves not just for the “average” symbol, but rather for nearly all symbols. So there is very little to be gained here by adjustments like turning the model off and reverting to the historical volume curves alone for certain symbols.
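
For readers who want the shape of that comparison, here is a hedged sketch of the per-symbol evaluation loop. The data layout (per-symbol arrays on a common 10-minute grid) and the use of squared error are my assumptions; the paragraph above only specifies “average error.”

```python
import numpy as np

def symbols_where_historical_wins(truth, model_preds, hist_preds):
    """Find symbols where the historical curve beat the model on this test set.

    Each argument maps symbol -> array of cumulative volume fractions sampled
    every 10 minutes across the test days ('truth' holds the realized values).
    Mean squared error is an assumption for this sketch.
    """
    winners = set()
    for sym in truth:
        actual = np.asarray(truth[sym], dtype=float)
        model_err = np.mean((np.asarray(model_preds[sym], dtype=float) - actual) ** 2)
        hist_err = np.mean((np.asarray(hist_preds[sym], dtype=float) - actual) ** 2)
        if hist_err < model_err:
            winners.add(sym)
    return winners
```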

Nonetheless, there may still be something to be gained by customization. In particular, we might try to find a happy medium between having one model trained over and applied to all symbols vs. having a separate model trained over and applied to each symbol. A natural way to split the difference is to divide symbols into a modest number of groups, and then train and apply a separate model over each group.

There are many ways that people typically group symbols: by ADV, by volatility, by average spread, by market cap, etc., or any combination thereof. Most intuitive for our task, however, would be a grouping that accounts for typical volume curve shapes. If we think that our one model might be sub-optimal in its grouping together of volume curves that have differing shapes, it makes sense for us to define a feature that captures volume curve shape and then group symbols based on that feature. There are many plausible ways to define such a feature, but we chose to calculate the area under the volume curve. Here we are thinking of the volume curve as a graph of the cumulative volume percentage for a day, which starts at 0% and climbs to 100% by the end of the trading day. This feature collapses curves to single numbers such that curves with similar shapes will typically have similar numbers, and curves with very different shapes will typically have very different numbers.

In particular, curves that ramp up early in the day will have higher areas under their graphs than curves that are more heavily weighted toward the end of the day. This is because the same percentage of volume contributes a greater area under the curve when it is added earlier in the day rather than later, since area involves a multiplication by time.
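
One plausible way to compute this feature (the trapezoidal rule here is my choice for the sketch, not necessarily what we used) also makes the front-loaded vs. back-loaded contrast easy to see:

```python
import numpy as np

def curve_area(times, fractions, day_length=1.0):
    """Area under a cumulative volume curve, via the trapezoidal rule.

    times: sample times, normalized so the session runs from 0 to day_length.
    fractions: cumulative volume fractions at those times (rising from 0 to 1).
    Dividing by day_length keeps the feature in [0, 1]: front-loaded days score
    closer to 1, back-loaded days closer to 0.
    """
    t = np.asarray(times, dtype=float)
    f = np.asarray(fractions, dtype=float)
    area = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(t))
    return float(area) / day_length

grid = np.linspace(0.0, 1.0, 101)
early_ramp = curve_area(grid, np.sqrt(grid))  # volume arrives early -> about 0.67
late_ramp = curve_area(grid, grid ** 2)       # volume arrives late  -> about 0.33
```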

We calculated the area under the volume curve each day for each symbol over a training data set, and then grouped symbols into three categories by their average area: low, medium, and high. The low category covered about 25% of the symbols, the medium category covered about 50% of the symbols, and the high category covered about 25% of the symbols (this part was not done very scientifically; we just picked something that seemed reasonable to get a proof of concept). For each group, we trained a model exclusively from data on that group. This gave us three models that together covered all symbols. We then tested the models’ performance on the test data, meaning that for each symbol we applied the model of its group in order to generate cumulative volume percentage predictions.
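
A sketch of that bucketing step, assuming we already have each symbol’s average area from the training window; the exact percentile cutoffs and the train_model / rows_for_group helpers are hypothetical placeholders, not our actual pipeline:

```python
import numpy as np

def assign_area_groups(avg_area_by_symbol):
    """Split symbols into low / medium / high buckets by average curve area.

    avg_area_by_symbol: dict mapping symbol -> average daily area under its
    volume curve over the training window (e.g. computed with curve_area above).
    Uses a rough 25% / 50% / 25% split, mirroring the text.
    """
    symbols = list(avg_area_by_symbol)
    areas = np.array([avg_area_by_symbol[s] for s in symbols])
    lo_cut, hi_cut = np.percentile(areas, [25, 75])
    groups = {}
    for sym, area in zip(symbols, areas):
        groups[sym] = "low" if area <= lo_cut else ("high" if area > hi_cut else "medium")
    return groups

# One model is then trained per group, on that group's data only, e.g.:
# models = {g: train_model(rows_for_group(g)) for g in ("low", "medium", "high")}
```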

To get a proper sense of the results, let’s first look at just how the historical volume curves did as predictors of cumulative volume percentages over this period of testing data (aggregating over symbols proportionally to notional value traded). The average squared error (i.e. the average value of (prediction − truth)²) was about 0.00557. Taking a square root to get a rough sense of how far off these predictions were, we get about 0.0747. Since these calculations are in units where 1 = 100% of the daily volume, this means our estimates of the cumulative volume percentages tend to be off by about ±7.47% of daily volume on average if we just use the historical volume curves. Doing the same calculation for our one model trained on all symbols and then applied to all symbols yields an average squared error of 0.00437, which corresponds to errors of about ±6.61% of daily volume. Using our three models applied over the three symbol groups yields an average squared error of 0.00430, which corresponds to errors of about ±6.56%.
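
For anyone who wants to reproduce the arithmetic, the square-root step is just the usual conversion from mean squared error to root mean squared error (small differences in the last digit come from the MSEs above already being rounded):

```python
import math

for label, mse in [("historical curves", 0.00557),
                   ("one model, all symbols", 0.00437),
                   ("three grouped models", 0.00430)]:
    print(f"{label}: RMSE ~ {math.sqrt(mse):.4f}")
# historical curves: RMSE ~ 0.0746
# one model, all symbols: RMSE ~ 0.0661
# three grouped models: RMSE ~ 0.0656
```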

All in all, this seems to suggest: our one-size-fits-all model applied to symbol-specific inputs represents a meaningful improvement in prediction quality over relying solely on historical volume curves for the vast majority of symbols, and training it more specifically on groups of symbols, or restricting its use to particular symbols, does not appear to yield improvements sizeable enough to make it worth the additional complication that such approaches bring. So all of this work was to … ultimately decide to leave the VWAP model alone for now. At least until we have better ideas. *cough* oh sorry. I don’t necessarily mean better ideas. I mean “ideas that happen to work” ;)
