
How We Think About Data At SDI Partners

Jun 3, 2025


Algorithms are crucial in algorithmic trading, as the name suggests. The most basic way to understand an algorithm is to think of it as a black box that receives data and produces trading decisions. As researchers, we dedicate significant time to exploring what occurs within that box. This leads to a wide variety of internal mechanisms in trading strategies.

However, there is one thing that all algorithms have in common: if the data they ingest is fundamentally flawed, they are destined to fail. “Garbage in, garbage out” sums it up. In this article, we share the broad strokes of how we think about data quality and provenance in order to give our trading strategies a fighting chance to succeed.

What do we want from data?

There is an objective standard for evaluating any dataset: how much better an optimal trading strategy performs when conditioned on the dataset than when it is not. However, applying this standard is costly. Developing trading strategies requires significant time and effort, and finding one that qualifies as ‘optimal’ is even more challenging. We want to identify which datasets hold value so we can concentrate on them and avoid wasting time on ineffective ones. Therefore, we use heuristics to form preliminary assessments of how datasets are likely to perform under closer examination.

Relevance

Let’s start with the most prominent heuristic: is there a reason to believe the data could be predictive of future asset prices? What makes it relevant? There are lots of datasets we can confidently dismiss in this spirit: sunspot events, UFO sightings, etc. In fact, if we didn’t dismiss them out of hand, our real discoveries might get lost in a sea of spurious correlations.

These fanciful examples are clear-cut, but most datasets we’re considering aren’t quite as obvious. For example, say you’re presented with a dataset of all Reddit posts and comments. Should that dataset be relevant to pricing US stocks? Maybe? Good researchers develop simplified mental models of how the world works that help them answer questions like this one. The best researchers (in addition to being technically strong) constantly refine their mental models as they make sense of disparate observations.

Uniqueness

Our job at SDI is to model asset prices and help the market reflect the “right” prices. To that end, we compete with other trading firms and sophisticated investors who have the same goal. If a dataset is already being widely used by our competitors, we can’t expect to add much value by using the same data in the same way. What does this mean for evaluating datasets? Simple datasets that are widely used — even if indisputably relevant — might not fare very well.

For example, the Price-to-Earnings (P/E) Ratio is a widely reported metric about companies. If I Google “AAPL stock,” Google displays Apple’s P/E ratio above the fold. The P/E ratio is the quotient of the share price and the earnings per share, and roughly speaking, it captures how “expensive” the company is. It’s incredibly relevant to investors. Yet, if you wanted to make a trading strategy solely using P/E ratios, you might be underwhelmed by its performance given the importance of the metric. We could say it’s already been “priced in,” which is to say that this is information so many people have extracted that it doesn’t provide our strategies a competitive advantage.
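As a toy illustration, here is what that naive strategy looks like in code. This is a sketch with made-up tickers, prices, and earnings; nothing here reflects real figures.

```python
import pandas as pd

# Made-up end-of-day snapshot; tickers, prices, and earnings are placeholders.
snapshot = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD"],
    "price": [120.0, 45.0, 310.0, 18.0],
    "eps_ttm": [6.0, 1.5, 31.0, 2.4],  # trailing twelve-month EPS
})

# P/E is the quotient of the share price and the earnings per share.
snapshot["pe"] = snapshot["price"] / snapshot["eps_ttm"]

# Naive value strategy: long the cheapest names, short the richest.
# Everyone can compute this, which is exactly why it's likely priced in.
ranked = snapshot.sort_values("pe")
print(ranked)
```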

The most exciting datasets are novel or complex enough that we can hope to extract something from them that our competitors do not.

Avoiding lookahead

Let’s think through a hypothetical. Say you’re approached by a newspaper that has a new data product targeted at investors: Market Moving News Articles™ (MMNA). The newspaper has recently realized that most of its articles don’t matter to investors, and it wants to add value by delivering *tagged* articles. In addition to the regular article contents, such as the text, time of publication, author’s name, headline, etc., whenever a new article is published, subscribers to MMNA will get three extra fields annotated by the newspaper’s stock analyst: affected_stock_ticker, is_market_moving, and is_good_news. They graciously offer you a trial of five years of historical data. You backtest a simple strategy of buying (selling) the stocks with good (bad) market-moving articles and it looks great. What’s wrong with this picture?
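For concreteness, the naive backtest might look something like the sketch below. The three tagged fields come from the MMNA description above; the file name and the next_day_return helper are hypothetical stand-ins for your own data plumbing.

```python
import pandas as pd

# Hypothetical file of five years of tagged articles.
articles = pd.read_csv("mmna_history.csv", parse_dates=["published_at"])
signals = articles[articles["is_market_moving"]].copy()

# Buy on good news (+1), sell on bad news (-1).
signals["side"] = signals["is_good_news"].map({True: 1, False: -1})

# next_day_return is a hypothetical helper: the stock's return over the
# day following the article's publication, from your own returns data.
signals["pnl"] = [
    side * next_day_return(ticker, ts)
    for side, ticker, ts in zip(
        signals["side"],
        signals["affected_stock_ticker"],
        signals["published_at"],
    )
]
print("average per-article return:", signals["pnl"].mean())
```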

How did they produce tags for five years’ worth of articles if they only recently had the idea to tag articles? If they had the foresight five years ago to start tagging articles in real time, then there’s nothing wrong. But let’s say they didn’t. Instead, they had their analyst look through past articles and retroactively tag them. This, though unintentional, is incredibly dangerous! There are all sorts of ways in which the analyst might benefit from hindsight to make better annotations than would have been made in the moment. The MMNA historical dataset is rendered effectively unusable.

Lookahead can manifest in subtle ways. It’s imperative as researchers to stay vigilant and understand the provenance of the data we work with to dodge these traps.

Sample size and noise

Otherwise useful data could be disqualified if there is too little of it or if it’s too noisy. Thanks to simplifying assumptions and statistics, we can usually get a reasonable idea of whether this will afflict us even without looking at the dataset.

The US Bureau of Labor Statistics (BLS) publishes closely watched monthly statistics on employment (such as the unemployment rate). Say a hypothetical survey company constructs a high-fidelity random sample one day before the BLS release to try to independently estimate the unemployment rate. However, constructing true random samples is expensive, so they only survey 250 people (and report the proportion who are unemployed). They are offering to sell you the survey result for $X, but won’t show you their data before you pay. Are you interested? (Assume you believe their methodology is airtight.)

Our best tool to answer this sort of question is some back-of-the-envelope statistics. How much uncertainty is there in the new unemployment rate the BLS will announce? We can upper bound this by fitting a simple auto-regressive model to the unemployment rate. Let’s say our model has prediction errors with 0.5 percentage points of standard deviation. What is the standard error of the proportion in the survey? Modeling the survey count as a binomial distribution and assuming unemployment is around 5% yields

SE = √(p(1 − p)/n) = √(0.05 × 0.95 / 250) ≈ 1.38 percentage points

If we assume the survey is an unbiased estimator of the BLS figure and its error is uncorrelated with our autoregressive model’s errors, then combining the two estimates (weighting each by its precision) would improve our error standard deviation to

σ = (1/0.5² + 1/1.38²)^(−1/2) ≈ 0.47 percentage points

We could then use some further back-of-the-envelope math to decide whether our trading strategy benefits from an extra 0.03 percentage points of accuracy on the unemployment rate by more than the cost of the data (we’d need some model of the sensitivity of asset prices to the unemployment rate).
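Here is the same arithmetic as a minimal sketch, using the assumptions from above (5% unemployment, a 250-person sample, 0.5 percentage points of model error):

```python
import math

p, n = 0.05, 250        # assumed unemployment rate and survey sample size
sigma_model = 0.5       # AR model prediction error std, in percentage points

# Standard error of a binomial proportion, converted to percentage points.
sigma_survey = math.sqrt(p * (1 - p) / n) * 100

# Precision-weighted combination of two unbiased, uncorrelated estimates.
sigma_combined = (1 / sigma_model**2 + 1 / sigma_survey**2) ** -0.5

print(f"survey standard error: {sigma_survey:.2f} pp")    # ~1.38
print(f"combined error std:    {sigma_combined:.2f} pp")  # ~0.47
```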

Where do we get data from?

We’ve talked about how to evaluate the potential of certain datasets, but where do we actually get data from?

Market data

Financial markets generate a prodigious amount of data, primarily in the form of orders and trades made on exchanges. The data is unquestionably relevant and complex; in other words, it checks all the boxes for data we could expect to be useful.

Let’s consider one possible use of market data. Companies do not exist in a vacuum — they are all interconnected in various ways. A chip-maker sells to a phone-maker; an app-maker sells on the phone’s platform; an advertiser buys ads on the app; the list of relationships goes on. Stock prices have to somehow reflect these relationships. A simple model for these relationships is that similar stocks should have similar stock returns. Thus, you could conceivably use the data of AAPL stock returns to predict, say, MSFT stock returns. And of course, things can get arbitrarily more complex than this basic “pairs trade.”
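A minimal sketch of that idea, using synthetic prices in place of real market data (on random data the fitted relationship will, of course, be near zero):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for real daily price series.
prices = pd.DataFrame(
    np.exp(rng.normal(0.0002, 0.01, size=(500, 2))).cumprod(axis=0),
    columns=["AAPL", "MSFT"],
)
returns = prices.pct_change().dropna()

# Simplest possible "related stocks" model: regress today's MSFT return
# on yesterday's AAPL return.
x = returns["AAPL"].shift(1).dropna()
y = returns["MSFT"].loc[x.index]
beta = np.cov(x, y)[0, 1] / x.var()
print(f"lagged beta of MSFT on AAPL: {beta:+.4f}")
```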

Alternative data

Widespread computer usage has made it easier to collect granular data about all sorts of things that aren’t securities but could be related to them. This type of data is the hardest to describe because the datasets are proprietary and no two datasets are exactly alike. For example, you could purchase data on foot traffic to various stores, capture Twitter streams, or even track wind patterns in various geographies.

It is worth noting a recurrent commonality in this type of data: the relationships can be delightfully creative and/or surprising! To provide an example I’ve not personally verified but which sounds plausible: there is evidence to suggest that angry reviews claiming Yankee Candles are scentless coincide with surges in COVID-19 prevalence.

Data Transformations

This may seem like cheating, but data transformations can sometimes so fundamentally change a dataset that in effect you’ve made a new dataset.

Say you start with a dataset of all trades in US stocks. As is, this data is already plenty interesting. However, let’s consider a character-changing data transformation on it. Roughly speaking, US regulation (SEC Rule 612) prohibits exchanges from quoting prices in increments smaller than one cent while allowing sub-penny price improvement (subject to some limitations). This sub-penny price improvement is particularly characteristic of wholesalers who provide these small “discounts” to retail customers. Thus, if you transform the data into the proportion of trades with sub-penny prices, you’ve made a proxy of retail interest across stocks and across time.
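As a rough sketch of that transformation (the file and column names are hypothetical, and real trade data demands care with floating-point prices):

```python
import pandas as pd

# Hypothetical trades table: one row per trade print,
# with ticker, timestamp, and execution price columns.
trades = pd.read_parquet("us_equity_trades.parquet")

# A trade is sub-penny if its price is not a whole number of cents.
cents = (trades["price"] * 100).round(8)
trades["sub_penny"] = (cents % 1) != 0

# Proportion of sub-penny trades per ticker per day: a rough proxy
# for retail participation across stocks and across time.
retail_proxy = (
    trades
    .assign(date=trades["timestamp"].dt.date)
    .groupby(["date", "ticker"])["sub_penny"]
    .mean()
)
print(retail_proxy.head())
```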

There are all sorts of research questions you can ask of a dataset of retail participation that you couldn’t directly ask of a dataset of trades. New data!

Parting thoughts

This was just a short, high-level overview of how we think about data at SDI and its profound impact on the quality of our strategies. Notably, there are many important things not covered here. For example, good engineering practices are critical both to researching with data and to productionizing it, particularly when the datasets get large. There are also many nuances particular to the different types of data (whether time-series, textual data, etc.). If these types of problems sound interesting to you, take a look at our current openings!


Written by SDI Partners

SDI Partners is an award-winning, AI-integrated quantitative investment manager with a strong history of sound risk management and outsized, consistent returns.
