Building the Dataset — Part 2: Merging

Nick Wienholt quantivfy.eth
8 min read · Jul 1, 2022



Merging datasets is a desirable and inevitable element of the data exploration and feature engineering journey. As ‘big’ and non-traditional datasets become more prevalent, the challenge of merging on hard-to-define keys becomes more common.

Merging datasets can be tackled in at least three different ways:

  1. The use of synthetic/surrogate keys from an industry body or data provider. This approach generally runs out of steam pretty fast. For equities and forex you might get away with it, but as soon as you get near debt, derivatives or sporting markets, expecting an industry body to assign a surrogate key to each unique entity, and everyone else in the industry to reference that key in their data, is neither feasible nor likely. In Australian horse racing, for example, it is often tempting to use the AAP/Mediality Racing ID for the race as the key, as many data providers will also include this as a field for each row, but the approach breaks down when you need to incorporate trials and jump outs in your training set and have to go searching for a new master key (trials are race-like events where betting is not permitted, run with registered jockeys, while a jump out is a less official race simulation, often ridden by track-work jockeys). There are many other examples where the synthetic key approach is unworkable — anything OTC is going to be hard to tie back to a data provider key.
  2. A synthetic key assigned internally at import time. This seems like a good idea, but it is hard to imagine the ‘librarian’ who maintains all the foreign key relationships. You would need to maintain a central master list of entities and map every incoming dataset to that list, to avoid the O(n²) alternative where every set is individually mapped to every other set. It feels like premature engineering, as you never know when, or if, a relationship will be needed.
  3. Natural keys are identified and used. These are hard and messy to keep perfect, and subject to weirdness. Even in the cleaner world of equities and forex, natural keys can get messy — listed companies can change their name and stock ticker symbol; the ASX list indicates it is roughly a weekly event. While a dataset can be cleaned by back-propagating the current code into historical data, it’s also worth considering that a ticker change can indicate a substantial change in business activities, all the way up to a backdoor listing where the shell of a listed non-operational company is transferred to a new business — the proverbial gold exploration company becoming a dotcom or crypto company — so caution is needed when extending continuity. Forex is generally very neat — when a currency like the Russian ruble or Zimbabwean dollar is re-denominated, a new currency symbol is created. The 1992–1998 ruble has the symbol RUR, while the current symbol is RUB.

Natural Keys in Betting Markets

A reasonable first pass for a sports event might be to nominate a natural key made up of the date of the event, the normalised home team symbol and the normalised away team symbol. Even that simple technique can fail — one set of historical data for the “Los Angeles” Angels, who play in the nearby but distinct city of Anaheim, may have them as the California Angels/CA (the official name from 1965 to 1996), the Anaheim Angels/ANA (1997–2004), the stupendously named Los Angeles Angels of Anaheim/LAAA (2005–2015), or the current Los Angeles Angels/LAA (2016–). Normalising everything to the current name makes the most sense.
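
A minimal sketch of that normalisation, assuming a hand-maintained alias table (the codes and helper below are illustrative, not an official mapping):

```python
# Illustrative only: collapse the Angels' historical codes to the current 'LAA'
# before building the (date, home, away) natural key.
ANGELS_ALIASES = {"CA": "LAA", "ANA": "LAA", "LAAA": "LAA"}

def normalise_team(code: str) -> str:
    """Map a historical team code to its current, normalised form."""
    return ANGELS_ALIASES.get(code, code)

# A 2004 game recorded as ANA v OAK keys to the same franchise as a 2016 LAA game.
key = ("2004-06-01", normalise_team("ANA"), normalise_team("OAK"))
# -> ('2004-06-01', 'LAA', 'OAK')
```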

Baseball also has the problem of double-headers, which come in five different flavours. In addition, there are frequent make-up/moved games when a washout (or a strike-impacted season) results in a scheduling change — the dataset needs to be clear whether the date part of the key refers to the originally scheduled game date or the date of the make-up game, and which game of the day is being referenced.

Make-up games are much rarer in the NFL, but one famously occurred in 2010 when snow collapsed the roof of the Metrodome in Minnesota and the Vikings’ game against the New York Football Giants was played in Detroit a day later than scheduled. Datasets for the COVID-impacted 2020 and 2021 seasons are also extremely difficult to deal with — the obvious factor of playing in empty stadiums negates a significant amount of home team advantage, and when modelling events in these seasons, a new variable capturing the percentage of crowd capacity allowed should be introduced.

Moving franchises also causes issues for long-term models. The Cleveland Browns team that moved to Baltimore and became the Ravens was physically the same team despite not being the same franchise. The owner/president was the same, QB Vinny Testaverde was on both teams, and a number of the coaching staff made the move (and though a young gentleman named Bill Belichick was fired as head coach at the end of 1995, he has found some success subsequently). The 1999 Cleveland Browns are a new team, and a multi-season NFL model would need to treat them as an expansion team despite it being their 47th season in the NFL.

Using a city name and sport code doesn’t solve the issue — the team that played American football in Baltimore in 1995 (the year prior to the Browns relocation) was a Canadian Football League (CFL) team called the Baltimore Stallions, the only US-based team ever to win the CFL’s Grey Cup. All these teams are unrelated to the Baltimore Colts, who left in 1984 to become the Indianapolis Colts.

The NFL data engineer is gladdened by the likes of the Green Bay Packers, who have never moved city and have done nothing weirder than play 2–3 home games a year in Milwaukee from 1933 to 1994.

The natural key will be dependent on the data being examined — while the 1996 Baltimore Ravens are officially an expansion team, any examination of draft picks, coaching changes and ownership continuity needs to treat them as a relocated franchise. For a long-term NFL model, the correct way to treat the Ravens is as a ‘moved’ franchise, with the only wrinkle being that all franchise records were reset at the start of the 1996 season. All other new franchises in the modern era have been allocated extra draft picks and have started with empty rosters, so comparisons with the Ravens are not valid.

How much attention a data engineer wants to give to anomalies is a judgement call. The huge messiness of Baltimore in the mid-1990s can easily be discounted as interesting trivia, but the team was successful quickly and won the Super Bowl in the 2000 season. With fewer than 60 Super Bowl winners to model, throwing out the 2000 Baltimore Ravens as too weird doesn’t make sense if Super Bowl betting and Super Bowl futures are being considered. In addition, the first two players drafted by Baltimore in the 1996 draft (Jonathan Ogden, 4th overall, and Ray Lewis, 26th overall — both first round) went to the Hall of Fame, so the team is definitely an interesting datapoint. Incredibly, the former Baltimore team (the Colts) also drafted a future Hall of Famer in the same round in Marvin Harrison.

A hallmark of a great data engineer is an obsession with getting 1990s Baltimore right. Not because Baltimore is that important (though from The Star Spangled Banner to Edgar Allan Poe to The Wire to Cal Ripken Jr., the city punches above its weight culturally), but because once Baltimore works, nearly everything else will work. God is definitely in the details for data engineering.

When you get to betting products on these events, the data becomes even harder to wrangle — Betfair is often used as the gold standard for efficient odds, but it has four or more extended planned outages per year, typically on a Tuesday, and on those days Australian horse racing will not have a Betfair Starting Price (BSP) struck for a race.

TAB, the dominant bookmaker for Australian betting, had a major outage on a peak racing day, Saturday 7 November 2020, during which it was offline for a number of hours. Some races were delayed, some were run anyway without much betting, fixed bets made with TAB prior to the outage stood but tote (pool) bets were refunded, and many bookies pulled their fixed prices, as those were simply a blended derivative of TAB’s price. Modelling anything to do with pricing on this day is a nightmare, and it’s probably worth just excluding it from any dataset. It’s not complex data — it’s rubbish data.

Merging Datasets

It is critical not to lose data when merging datasets (or at least to understand why you did). Going back to the baseball example, suppose we have two pandas DataFrames for MLB — one with the opening day payroll, and one with the final score for a number of games. The final goal of the data investigation is to get the correlation of winning percentage against payroll (this article will use trivial mock data and focus on the dataset matching aspect).

These are the steps we need to take to merge the data:

  1. The opening day payroll will be an annual figure. An MLB season is always contained within a calendar year, so we can pad out the dataset with a value for every day of the calendar year, as sketched below. The full notebook for this example is also available on GitHub.
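
The padding step might look something like this; the payroll figures, team codes and column names are made up for illustration and are not the notebook's actual data:

```python
import pandas as pd

# Hypothetical opening-day payrolls (USD millions), one row per team per season.
payroll = pd.DataFrame({
    "team": ["LAA", "LAA", "OAK", "OAK"],
    "year": [2004, 2005, 2004, 2005],
    "payroll_m": [101.1, 97.7, 59.4, 55.4],
})

def pad_to_daily(annual: pd.DataFrame) -> pd.DataFrame:
    """Expand each annual payroll row into one row per calendar day,
    so it can later be joined against per-game rows on (team, date)."""
    rows = []
    for _, r in annual.iterrows():
        days = pd.date_range(f"{r.year}-01-01", f"{r.year}-12-31", freq="D")
        rows.append(pd.DataFrame({"team": r.team, "date": days, "payroll_m": r.payroll_m}))
    return pd.concat(rows, ignore_index=True)

daily_payroll = pad_to_daily(payroll)
```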

2. Next, add some very contrived game data.
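
For example (entirely contrived, and deliberately leaving the 2004 row on the historical ANA code to set up the mismatch discussed below):

```python
# Three contrived games. The 2004 row uses the old 'ANA' code,
# while the payroll data above is keyed on the current 'LAA'.
games = pd.DataFrame({
    "date": pd.to_datetime(["2004-06-01", "2005-06-01", "2005-06-02"]),
    "home": ["ANA", "LAA", "OAK"],
    "away": ["OAK", "OAK", "LAA"],
    "home_score": [5, 2, 1],
    "away_score": [3, 4, 6],
})
```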

3. And finally, merge the datasets, joining on the team acronym and the date.
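
One way to express the double join (payroll onto the home side, then onto the away side), assuming the toy frames built above:

```python
# Inner-join the daily payroll onto both the home and the away team of each game.
merged = (
    games
    .merge(
        daily_payroll.rename(columns={"team": "home", "payroll_m": "home_payroll_m"}),
        on=["home", "date"], how="inner",
    )
    .merge(
        daily_payroll.rename(columns={"team": "away", "payroll_m": "away_payroll_m"}),
        on=["away", "date"], how="inner",
    )
)
# The 2004 ANA game silently drops out: 'ANA' never appears in the payroll data.
```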

As typically happens when an inner merge/join is performed on datasets, some of the data is lost — in this case one of the games from 2004 drops out. While this is trivial to find in a three-row dataframe, when there are tens of thousands of rows in both datasets, manual scanning for mismatches isn’t practical. To find the rows that weren’t matched, an outer join is performed, and the null (mismatched) rows in the merge are then identified. Here we can see the row that failed the merge on the inner side (this example is slightly more complex as we are joining twice — once for home teams, and once for away teams):
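
A sketch of that check on the toy data above, using the indicator flag that pandas merge provides (shown for the home-side join only):

```python
# Repeat the home-side join as an outer merge, keeping an indicator column
# that records which side each row came from.
check = games.merge(
    daily_payroll.rename(columns={"team": "home", "payroll_m": "home_payroll_m"}),
    on=["home", "date"], how="outer", indicator=True,
)

# Rows flagged 'left_only' are games that found no payroll match.
unmatched = check[check["_merge"] == "left_only"]
print(unmatched[["date", "home", "away"]])  # the 2004 ANA game
```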

And finally, bringing this back to the outer join side, we see where the key was logically misaligned:
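
Filtering the same outer join for right-only rows on the affected date shows the payroll row that should have matched, sitting under LAA rather than ANA (again on the toy data above):

```python
# The payroll side does have a row for that date, but under the current 'LAA'
# code rather than the historical 'ANA' used in the game data.
misaligned = check[
    (check["_merge"] == "right_only") & (check["date"] == "2004-06-01")
]
print(misaligned[["date", "home", "home_payroll_m"]])
```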

Quick Summary

  • Merges are hard, and natural key merges are messy but inevitable.
  • Deep knowledge of the data is required. Events impacting the data, like Cleveland moving to Baltimore or multiple outbreaks of Zimbabwean hyperinflation, need to be known, and their meaning for the data being merged and modelled needs to be captured.
  • Using pandas to correctly and quickly identify mismatched rows between datasets is a valuable skill to develop, and isn’t particularly hard.
