Building the Dataset — Part 3: Cleaning

Nick Wienholt (quantivfy.eth)
Jul 3, 2022



Even with the precautions outlined in Part 2 of this series, it is still possible, and in many domains extremely likely, to arrive at the point where modelling is ready to commence with a dataset that contains missing and messy data. It is at this point that a final clean of the dataset needs to be conducted. The cleaning can be divided into a number of distinct facets: conceptual cleaning, logical cleaning and physical cleaning. This article will cover each of these activities.

Conceptual Dataset Validation and Cleaning

Before a dataset is thrown at a model, it is critical to validate that the final dataset is conceptually correct for the task being modelled. During the initial stages of the dataset build, it's easy to get lost in the weeds and focus on simply getting a dataset where the maximum number of rows and columns are filled out, jumping between different data sources and vendors to get a more complete back history. Once you have built this dataset, and have a concrete modelling exercise to conduct, it is critical to step back and ask whether the dataset has the appropriate data for the task.

The prototypical mistake in equities modelling is to take the current set of companies in the S&P 500, ASX 200 or any other market benchmark grouping and to model a strategy on these companies since their IPO. The Dow 30, one of the oldest and smallest index groupings, has no member that has been present continuously since inception: General Electric was part of the original class of 1896 and had a stellar run of over a century before failing the cut in 2018. Taking today's Dow 30 components and backtesting their stock performance since IPO will deliver a wonderful result with even the most trivial or clumsy of strategies. This is classic survivorship bias: a company has already experienced incredible growth to make it eligible for Dow membership, and finding Microsoft at IPO is a much trickier proposition than finding it when it entered the Dow.

Strategies based on stocks that exist in an index grouping need to be accurate for each re-balancing, and for a Dow 30 strategy that means now-bankrupt companies like Bethlehem Steel need to be present in the historical dataset. Even testing the whole market based on today's traded stocks is not valid. Dozens and dozens of companies go broke each year, and by excluding the Enron and Lehman Brothers blow-ups from the test set, an overly optimistic set of results will be achieved. The only valid way to backtest is to include the actual participants in the market or index on each day or period of trading, which is what a point-in-time membership filter does (sketched below).
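As a rough illustration of what that point-in-time filter can look like in pandas, the sketch below joins a price table against a hypothetical membership table with entry and exit dates. The tickers, dates and prices are illustrative placeholders rather than real constituent data.

```python
import pandas as pd

# Hypothetical inputs: daily prices plus a point-in-time membership table
# recording when each ticker entered and left the index. All values below
# are illustrative placeholders.
prices = pd.DataFrame({
    "date": pd.to_datetime(["2018-06-25", "2018-06-25", "2018-06-26", "2018-06-26"]),
    "ticker": ["OLD", "NEW", "OLD", "NEW"],
    "close": [100.0, 50.0, 99.0, 51.0],
})
membership = pd.DataFrame({
    "ticker": ["OLD", "NEW"],
    "start": pd.to_datetime(["1990-01-01", "2018-06-26"]),
    "end": pd.to_datetime(["2018-06-26", "2100-01-01"]),
})

# Keep only the rows where the ticker was actually an index member on that date.
merged = prices.merge(membership, on="ticker")
point_in_time = merged[(merged["date"] >= merged["start"]) & (merged["date"] < merged["end"])]
print(point_in_time[["date", "ticker", "close"]])
```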

Frank Sobotka is saddened by Bethlehem Steel’s demise and its impact on equity strategy backtesting

While it's a huge topic and best left for a separate post, market microstructure knowledge is critical when conceptually validating a dataset too. In Australian horse racing, each state has separate laws on the minimum amount online bookmakers are legally obliged to offer to a customer for a particular bet type. For example, bookies must legally accept bets from all accounts to win AUD2000 on a Saturday metro race in NSW, known as the Minimum Bet Limit (MBL). Prior to 19 Nov 2021, Western Australia had no MBL, and bookies could partially fill a $500 bet on an even-money shot all the way down to zero. Bet365 were notorious in this regard, often offering a fill amount below $10 and earning themselves the well-deserved nickname "Bet3.65". The consequence of this weird betting situation is that Bet365 could offer very sloppy, lucrative odds for WA racing and simply limit winning punters to trivial amounts. A naive modeller could easily find great potential profits when looking at historical Bet365 data and assume bets would be filled to today's MBL laws, but that simply isn't the case.

To conceptually clean a set of Australian fixed-odds racing data for WA prior to November 2021, the only valid choice is to remove all online bookies except TAB (a sketch of this filter follows below). TAB have an exclusive license in Australia for retail off-course betting, and this allows large and anonymous bets to be placed. With enough leg-work, building large positions is possible, with the exception of COVID lockdowns when retail outlets were shut or extremely limited.
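As a minimal sketch of that conceptual clean, assuming a DataFrame with hypothetical state, bookmaker and race_date columns, the whole operation amounts to a row filter:

```python
import pandas as pd

# Date WA introduced its Minimum Bet Limit (per the discussion above).
WA_MBL_START = pd.Timestamp("2021-11-19")

def conceptually_clean_wa(df: pd.DataFrame) -> pd.DataFrame:
    """Drop WA fixed-odds rows for online bookies before the WA MBL start date,
    keeping TAB prices, which could actually be bet into at size."""
    pre_mbl_wa = (df["state"] == "WA") & (df["race_date"] < WA_MBL_START)
    keep = ~pre_mbl_wa | (df["bookmaker"] == "TAB")
    return df[keep]
```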

Logical Dataset Cleaning

Logical cleaning is similar to conceptual cleaning, but differs in the scope of the events being considered. While conceptual cleaning deals with big and wide issues like the makeup of indexes, trading laws and the universe of things to consider for inclusion and exclusion in a dataset, logical cleaning will typically focus on very narrow events and situations, and will generally impact a narrower range of rows and columns. Logical cleanup will focus on specific events and scenarios where markets went weird, and which will inevitably have significant impacts on any trading or investing strategy.

A clear and simple example of logically cleaning a dataset that is going to be modelled for equity short selling is to exclude periods of extreme market sell-offs where short selling was banned. This is a no-brainer and the only correct way to deal with the data. Rather than delete the rows for those days, it is simpler and cleaner to set the number of shares available to short-sell to zero, as sketched below.
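A minimal pandas sketch of this, assuming a hypothetical shortable_shares column and using the approximate window of the 2008 US financial-stock short-selling ban as the example period:

```python
import pandas as pd

# Approximate window of the 2008 SEC ban on shorting financial stocks,
# used here purely as an illustrative example.
BAN_START, BAN_END = pd.Timestamp("2008-09-19"), pd.Timestamp("2008-10-08")

def zero_short_availability(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the rows, but record that nothing could be shorted during the ban."""
    out = df.copy()
    in_ban = (out["date"] >= BAN_START) & (out["date"] <= BAN_END)
    out.loc[in_ban, "shortable_shares"] = 0
    return out
```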

A very contentious area of logical cleaning is how to treat extreme market events that are "impossible" today. The Black Monday crash of 1987, where the market fell by more than 20% in a single day, can't happen in the same form today, as market-wide circuit breakers are now in place:

Market-wide circuit breakers provide for cross-market trading halts during a severe market decline as measured by a single-day decrease in the S&P 500 Index. A cross-market trading halt can be triggered at three circuit breaker thresholds — 7% (Level 1), 13% (Level 2), and 20% (Level 3). These triggers are set by the markets at point levels that are calculated daily based on the prior day’s closing price of the S&P 500 Index.

A market decline that triggers a Level 1 or Level 2 circuit breaker before 3:25 p.m. will halt market-wide trading for 15 minutes, while a similar market decline “at or after” 3:25 p.m. will not halt market-wide trading. A market decline that triggers a Level 3 circuit breaker, at any time during the trading day, will halt market-wide trading for the remainder of the trading day.
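Since the triggers are recalculated daily from the prior close, they are easy to reproduce for a backtest. The helper below is a simple sketch based on the thresholds quoted above, not an exchange-published calculation:

```python
def circuit_breaker_levels(prior_close: float) -> dict:
    """Index levels at which the three market-wide circuit breakers trigger,
    given the prior day's S&P 500 close."""
    return {
        "level_1_7pct": prior_close * (1 - 0.07),
        "level_2_13pct": prior_close * (1 - 0.13),
        "level_3_20pct": prior_close * (1 - 0.20),
    }

# With a prior close of 4000, Level 1 triggers at 3720, Level 2 at 3480
# and Level 3 at 3200.
print(circuit_breaker_levels(4000.0))
```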

As with 1987, the 2010 flash crash resulted in a new set of rules where "market makers have to post bids and offers within 8 percent of the National Best Bid and Offer (NBBO)" and "clearly erroneous" trades have a formal set of rules to allow for their cancellation. In rough terms, a trade 30% off the established price caused by a short-term market disruption can be cancelled.
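For modelling purposes, a very rough way to flag trades that might have fallen under such cancellation rules is to measure their deviation from a reference price. The column names below are assumptions, and the 30% figure is the rough threshold mentioned above rather than the full rule:

```python
import pandas as pd

def flag_possible_busts(trades: pd.DataFrame, threshold: float = 0.30) -> pd.Series:
    """Boolean mask of trades that deviate from the reference price by more
    than the threshold, and so might have been cancelled as 'clearly erroneous'."""
    deviation = (trades["price"] - trades["reference_price"]).abs() / trades["reference_price"]
    return deviation > threshold
```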

As Nassim Nicholas Taleb has spent a lifetime arguing, removing all these “once off events” (that tend to occur with alarming regularity) from a dataset will make the world appear to be a much safer place than it really is. The recent run on many alt-coins is just another example of deja vu all over again.

If a strategy that relied on stop-losses for equity protection had been trading during the 2010 flash crash, those stop-loss orders would have been triggered. What happens next is difficult to determine: it's definitely not a case of the trade being cancelled a few minutes later and life going on happily. Working out what to do with modelling in the event of these catastrophic events is very hard.

By removing the worst 10 minutes from a beef cattle’s life, it is very easy to establish a backtest that proves that they live happily into old age and retire contentedly to a lush grassy pasture. However, for real world beef cattle, those 10 minutes at the abattoir are quite significant in relation to their longevity expectations, and removing them from models may lead to inaccurate conclusions.

The critical point to be aware of is that when past "anomalies" and disasters are removed, any model developed is naturally very fragile to new surprises. The best approach is to leave the disasters in place, and develop explicit model safeguards to deal with extreme price-movement events without relying on regulators to help. This is especially critical for dealing with margin limits: the whole Long-Term Capital Management blow-up and its logical follow-up in the GFC show how critical margins, leverage and liquidity are in a crisis.

Physical Dataset Cleaning

Once the very hard work of logical and conceptual validation and cleaning has been done, dealing with data that is simply missing or damaged would seem like the easiest task of all. There are dozens of resources available on pandas describing the steps necessary to inspect a dataset for anomalies and unexpected values: the describe function is a great first step, and charting the data from each column can quickly surface big issues like a week of expected data that has been zeroed out.
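A minimal starting point, assuming a hypothetical prices.csv with a date column and close/volume fields:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical input: a daily OHLCV file with a 'date' column.
df = pd.read_csv("prices.csv", parse_dates=["date"], index_col="date")

# Quick statistical overview: zero minimums, absurd maximums and mismatched
# counts between columns all stand out here.
print(df.describe())

# Charting each column makes a zeroed-out week very hard to miss.
df[["close", "volume"]].plot(subplots=True, figsize=(10, 6))
plt.show()
```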

When data is missing, the first course of action is to validate that the missing data time range is actually an error — there are many historical occasions where exchanges are closed or experience major technical disruptions. The NYSE was closed for nearly a week after 9/11, and market failures due to technical issues are still regular occurrences. The distributed nature of the US market offers more resilience, but as this FT article from 2021 details, outages are still very much a real thing for most exchanges.

It is important not to fill a dataset to compensate for real-world events like market outages. If a strategy is found to be crippled by not being able to trade on those days, that's a great thing to find in backtesting and not during live trading. It might be appropriate to introduce other instruments on other exchanges that are available during outages to offset the impact of being unable to trade on the strategy's primary market. Finding this out before all hell breaks loose is desirable.

If the data is just missing due to a technical issue with the data provider (or the capture of the data at the original source), pandas offers some well-known ways to fill in the data with derived values. As Matt Harrison points out in an excellent tweet, care needs to be taken because some methods will leak data backwards and this can poison backtest results.

The various techniques that can be used (sketched in the example after this list) are:

  • ffill, which uses the last valid value in a Series to fill holes.
  • bfill, which uses the next valid value in a Series to fill holes, and hence leaks the future into the past.
  • interpolate, which uses both past and future values to fill in the missing values, and hence leaks the future into the past.
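The toy Series below shows the difference between the three on a small gap; note how bfill and interpolate both pull future information back into the hole:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0],
              index=pd.date_range("2022-01-03", periods=5, freq="B"))

print(s.ffill())        # holes take the last known value: 1, 2, 2, 2, 5
print(s.bfill())        # holes take the next value: 1, 2, 5, 5, 5 (future leaks backwards)
print(s.interpolate())  # holes blended from both sides: 1, 2, 3, 4, 5 (also leaks)
```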

It is also possible to use a moving average to fill data, but for trending data this will result in weird temporary drops or rises in the filled data as the MA hasn’t “caught up” with the most recent data.
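A small sketch of that effect, using a trailing three-period mean on an illustrative upward-trending Series:

```python
import numpy as np
import pandas as pd

# A steadily rising series with one hole.
s = pd.Series([100.0, 102.0, 104.0, 106.0, np.nan, 110.0])

# Filling with a trailing moving average lags the trend: the hole is filled
# with 105, below the previous observation of 106, creating a temporary dip.
ma = s.rolling(3, min_periods=1).mean()
print(s.fillna(ma))
```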

Rather than simply filling with data intra-Series, it is also possible to use a related Series with a high correlation and apply a scale factor to get a reasonable approximation. For example, the New Zealand Stock Exchange (NZSE) has a very high correlation with the ASX (0.94 according to Macro Axis), so repairing missing NZSE daily OHLC values with the scaled equivalent from the ASX is perfectly valid. The public holidays for the two countries aren't aligned, so care needs to be taken not to fill in a day that should legitimately have no value.
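A rough sketch of this cross-market repair, using illustrative index levels rather than real NZSE or ASX closes:

```python
import numpy as np
import pandas as pd

# Illustrative daily closes for two highly correlated markets.
idx = pd.date_range("2022-03-07", periods=5, freq="B")
asx = pd.Series([7100.0, 7140.0, 7120.0, 7180.0, 7200.0], index=idx)
nzse = pd.Series([12100.0, 12160.0, np.nan, np.nan, 12260.0], index=idx)

# Estimate a scale factor from days where both markets traded, then fill the
# NZSE holes with the scaled ASX value. Genuine holidays should be excluded
# from the fill rather than repaired.
scale = (nzse / asx).mean()
print(nzse.fillna(asx * scale))
```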

What's Next

With all the cleaning and validation of a dataset done, it's time to move on to the stage of finding something interesting in the data, and that's what the next posts will focus on.
