A Taste of Simplicity

Allison Bishop
Published in Proof Reading
Jul 9, 2021

[TL/DR: We have released a whitepaper detailing a newer methodology for our distilled impact metric. It shows a meaningful improvement over our initial approach. This work was led by Matt Schoenbauer, an early-career quantitative researcher who has recently left Proof to broaden his scientific horizons. We wish him well on his quest!]

It’s probably only a matter of time before the AI hype comes for cocktails. Some MIT grads have apparently “solved” wine, so cocktails are the next frontier. “Our robot bartender will craft something special and never-before-seen, just for you,” they will say. And I will stare at my lavender, curry, mango monstrosity and apologize to the bourbon underneath it. “Sorry you had to go out this way, buddy,” I’ll whisper sadly. “You were meant for better things.”

Automation often brings with it a crisis of choice. Computers are great at searching through a vast universe of options very quickly, but we have to give them criteria for what makes a “good” option so they can recognize one when they find it. This is at the heart of several challenges with using machine learning to solve problems. Our intuition about what a “good” solution is can be hard to formalize and communicate to a machine. Also, our experience of “good” and “bad” solutions is usually based on a pretty narrow realm of typical examples. As we venture further away from things we’ve tried before, our intuition is likely to get less reliable in anticipating what will be “good” vs. “bad.” As a result, the automated search might be using criteria of success that don’t make much sense on large swaths of its search space. Problems like these are likely to murder a few good bourbons in the name of progress.

In a previous post (https://medium.com/prooftrading/distilling-bourbon-and-markets-ba6e8f326340), we discussed our initial approach for trying to remove some general market noise and “distill” stock price movements down to the impact of trading in that particular symbol. In that initial version, we used our human intuition to pick a small set of ETFs that we felt represented some general trends that we would want to remove. We then used historical market data for each symbol to find a linear function of price movements in those ETFs (ultimately 6 of them) that best fit the price movements of that symbol. Subtracting those linear functions from the price movements of individual symbols reduced the variance on average, suggesting that this was a meaningful and helpful correction overall. However, a lot of noise still remained, and we vowed to continue investigating new approaches and iteratively improving our methodology.
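To make the mechanics concrete, here is a minimal sketch of that first distilling step in Python (using numpy and scikit-learn). The function and variable names are ours for illustration, not Proof’s actual code, and it assumes the symbol’s and ETFs’ price movements are already aligned as arrays.

```python
# Minimal sketch: regress a symbol's price movements on a fixed basket of six
# ETFs' movements, subtract the fitted combination, and compare variances.
import numpy as np
from sklearn.linear_model import LinearRegression

def distill(symbol_moves: np.ndarray, etf_moves: np.ndarray) -> np.ndarray:
    """symbol_moves: shape (T,), one symbol's price movements.
    etf_moves: shape (T, 6), contemporaneous movements of the six chosen ETFs."""
    model = LinearRegression().fit(etf_moves, symbol_moves)
    return symbol_moves - model.predict(etf_moves)  # the "distilled" movement

# Did subtracting the ETF component actually remove noise?
# distilled = distill(symbol_moves, etf_moves)
# print(np.var(distilled) / np.var(symbol_moves))  # below 1.0 means variance was reduced
```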

This first approach is kind of like building a robot bartender who is constrained to use the same ingredients every time but can vary the proportions to customer taste. Want your whiskey sour a little more sour or a little more sweet? It’s got you covered. Want a pina colada with a tiny umbrella instead? Get out of here with that silliness.

Giving the robot bartender firm constraints is especially important when there is very limited and/or very noisy data on each customer. Focusing on the narrow goal of learning how they like their whiskey sours makes success at that goal more likely. The improvement our metric shows on historical market data overall, compared to adjusting only for movement in SPY, is perhaps akin to saying: on average, customers prefer a whiskey sour mixed to their taste over a standardized version of the single most popular drink.

However, I’ve been told some people prefer margaritas to whiskey sours. Those people are wrong, but ok.

Naturally we asked: what additional degrees of freedom can our robot bartender/distilled impact metric handle well? Definitely not too many of them, we discovered. To test this, we opened up the set of ETFs that could be used as proxies to help “explain” price movements in individual symbols. We gave our model the freedom to choose different subsets of these ETFs for each symbol, and then asked it to find a linear function of the chosen ETFs’ price movements that best approximated the symbol’s price movements over a training data set. We then tested the power of that linear function as a prediction of price movement in that symbol over a fresh testing data set. The quality of the results on the testing data degraded immediately as we increased the number of ETFs the model was allowed to use.
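For the curious, here is roughly what an experiment like that could look like. This is a simplified sketch: picking the top k ETFs by correlation on the training window is our stand-in for the selection step, not necessarily the exact procedure in the whitepaper, and all names are placeholders.

```python
# Sketch of the generalization test: let the model pick its k favorite ETFs on
# a training window, fit a linear function there, then score it on held-out data.
import numpy as np
from sklearn.linear_model import LinearRegression

def out_of_sample_score(train_sym, train_etfs, test_sym, test_etfs, k):
    # Rank candidate ETFs by absolute correlation with the symbol on the training window.
    corrs = [abs(np.corrcoef(train_etfs[:, j], train_sym)[0, 1])
             for j in range(train_etfs.shape[1])]
    chosen = np.argsort(corrs)[-k:]  # this symbol's top-k ETFs
    model = LinearRegression().fit(train_etfs[:, chosen], train_sym)
    return model.score(test_etfs[:, chosen], test_sym)  # R^2 on fresh data

# for k in range(1, 11):
#     print(k, out_of_sample_score(train_sym, train_etfs, test_sym, test_etfs, k))
# The post's finding: out-of-sample quality falls off as k grows.
```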

What does this mean? Well, machine learning practitioners would call it over-fitting or a failure of generalization. It means that the modeling on the training set has enough freedom to start shaping itself around things that turn out to be coincidental, or at least not stable over time. This causes the model to begin failing as the details of the context change, because some of the things it has “learned” no longer hold true. In other words, the amount of data we have about each symbol is not really deep and stable enough to serve as a strong foundation for choosing among so many options. As we evaluate more potential choices, the odds of some looking “good” merely due to random noise increase, and we can easily be led astray.

Nonetheless, we did find that allowing the model to choose just one ETF from a larger list of 50 choices as a “best match” for each symbol performed reliably better in reducing noise, compared to our initial distilling method. This is a pretty simple model, but it’s the best we’ve found so far. We will be updating our distilled impact metric to use this method going forward, rather than sticking with linear functions of the same six ETFs for each symbol. The full details of this phase of our research can be found in our new whitepaper.
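A simplified sketch of that single-best-ETF selection is below. Here the matching criterion is in-sample fit on the training window and the candidate list is a placeholder; the whitepaper has the exact details.

```python
# Sketch: for each symbol, pick the one ETF (out of ~50 candidates) whose
# movements best explain the symbol's movements, then distill against only it.
import numpy as np
from sklearn.linear_model import LinearRegression

def best_match_etf(train_sym, train_etfs):
    fits = [LinearRegression().fit(train_etfs[:, [j]], train_sym)
                              .score(train_etfs[:, [j]], train_sym)
            for j in range(train_etfs.shape[1])]
    return int(np.argmax(fits))  # index of the single best-fitting ETF

# j = best_match_etf(train_sym, train_etfs)
# model = LinearRegression().fit(train_etfs[:, [j]], train_sym)
# distilled = test_sym - model.predict(test_etfs[:, [j]])
```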

So our robot bartender has a larger menu now. I guess you can order a margarita. Or a Manhattan. Or even a pina colada. No substitutions though.
