Live Sports Data: Things We’ve Learned Consuming Real-Time APIs
Akash Barve, Software Engineer II, Master Data Services @ Fanatics
You’re enjoying a nice Sunday afternoon watching the game, your favorite team is giving their all on the field... and they win! Minutes later, you get an email from Fanatics with the latest fan gear suggestions to celebrate the occasion. Wow, that was fast! How did they do that?
The data that enables those automated messages comes from an API, of course. However, consuming and storing sports data can be a bit (or a whole lot) different than other APIs because of its very dynamic nature. And then you need to figure out what to do with it!
The diagram below shows a simplified version of our ingestion process: automated jobs deliver the incoming response to the application server, which then unmarshals the JSON, stores the data, and makes it available to consumers in various ways.
Our data sources provide live details on a huge range of sports: events, athlete stats, rankings, and more. To ingest this data successfully, we’ve had to constantly adapt and improve our automation approach. This isn’t limited to Fanatics and sports data, of course; there are many applications for real-time data across all industries. In this post, we will cover some of the gotchas we’ve learned from working with real-time APIs.
Define a canonical model
Model your data based on your use case, and then transform what the external API sends to match your desired structure. This allows you to switch data providers more easily, without instantly breaking everything. The challenging part isn’t ingesting the payload; it is getting the data model right and then setting up the tasks that load the data. The tips below will help you with both.
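As a rough illustration of the idea above, here is a minimal sketch of a canonical game record plus one provider-specific adapter. The provider name and its field names ("id", "home", "away", "scheduled") are invented for this example and do not describe any real vendor or our actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CanonicalGame:
    """Our own shape for a game, independent of any single provider."""
    provider: str
    provider_game_id: str
    home_team: str
    away_team: str
    starts_at: datetime  # always stored in UTC


def from_provider_a(payload: dict) -> CanonicalGame:
    """Translate one hypothetical provider's payload into the canonical model."""
    # The payload keys used here are assumptions made for this sketch.
    starts_at = datetime.fromisoformat(payload["scheduled"])
    return CanonicalGame(
        provider="provider_a",
        provider_game_id=str(payload["id"]),
        home_team=payload["home"]["name"],
        away_team=payload["away"]["name"],
        starts_at=starts_at.astimezone(timezone.utc),
    )
```

Switching providers then means writing another small adapter function, while everything downstream keeps working against CanonicalGame.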
Look for global keys and map them
Each data provider assigns a globally unique identifier to each league, team, game, venue, and athlete. These allow your application to map primary IDs from that specific vendor to your org’s identifiers. A perfect example is storing NFL teams’ external IDs as properties on your internal team records. This way, a property lookup against the ID in the payload matches it to the correct team record.
A case to watch out for: a data source could return multiple names (and IDs) that are really the same athlete, such as “Alameda Ta’amu” and “Alameda Ta’Amu” (note the different capitalization). Fuzzy string matching lets you flag likely matches for review and avoid creating duplicates. Also, some data providers may just use an ID and not include the team name.
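A minimal sketch of this lookup, using only Python's standard difflib module. The ID values, the two lookup tables, and the 0.9 threshold are all made up for illustration; a real system would read the mappings from its own store and tune the threshold.

```python
from difflib import SequenceMatcher

# Hypothetical mapping tables; the IDs are invented for this sketch.
VENDOR_TO_INTERNAL = {"vendor-123": "athlete-0001"}
KNOWN_NAMES = {"athlete-0001": "Alameda Ta'amu"}


def normalize(name: str) -> str:
    """Lowercase, unify apostrophes, and collapse whitespace."""
    return " ".join(name.casefold().replace("’", "'").split())


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def resolve_athlete(vendor_id: str, name: str, threshold: float = 0.9):
    """Return (internal_id or None, needs_review)."""
    if vendor_id in VENDOR_TO_INTERNAL:
        return VENDOR_TO_INTERNAL[vendor_id], False
    # Unknown vendor ID: look for a likely duplicate by name and flag it
    # for a human to confirm rather than silently creating a new athlete.
    best_id, best_score = None, 0.0
    for internal_id, known_name in KNOWN_NAMES.items():
        score = similarity(name, known_name)
        if score > best_score:
            best_id, best_score = internal_id, score
    if best_score >= threshold:
        return best_id, True
    return None, True
```

Anything returned with needs_review set is a candidate for the notification approach described later in this post.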
Learn the sport
A no-brainer, but a few things I wish I’d known before spending a lot of time debugging are that MLB has “triple headers” and the NHL has “overtime losses.” Or that you can leverage the fact that American football teams play one game a week and avoid processing repetitive data. It also helps to know that NFL “Week 18” in the payload data is actually post-season “Week 1.” Oh well, live and learn. Knowing details about those sports in advance would have saved me a few headaches.
Time is your best friend
You may have a situation where the endpoint returns new, updated records over the season, but the values themselves don’t change. Standings or rankings might do this. You still want to keep those records in time order.
In cases where the sport has no concept of a week, you could try to determine if it’s time for a new record based on the total number of games played. But a “time ingested at” field would make this much simpler. Including this field in your model helps your consumers understand how to use data like this.
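One way to carry such a field, sketched with a dataclass. The StandingsSnapshot name and its fields are assumptions for this example, not our actual model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def utc_now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class StandingsSnapshot:
    """One standings record as seen at a particular ingest run."""
    team_id: str
    rank: int
    # Stamped at ingest time, so identical values from different
    # runs can still be ordered and told apart.
    ingested_at: datetime = field(default_factory=utc_now)


def latest_first(snapshots):
    """Order snapshots from newest to oldest ingest time."""
    return sorted(snapshots, key=lambda s: s.ingested_at, reverse=True)
```

Consumers can then ask for the most recent snapshot, or replay a season in order, without guessing from game counts.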
Always set defaults
The world of sports data will definitely call for mapping one value to another. A payload might give a player’s position with an abbreviation like “QB”, but you want “Quarterback.” Season types can do this too: “1” is pre-season, “2” is regular season, and “3” post-season.
You may want to map these to their full string representations before ingesting, which is a good idea. But what if you get an unexpected value? Including an additional default to fall back on avoids problems.
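A small sketch of a lookup with a fallback. The table contents beyond the values mentioned above, and the "Unknown" default, are choices made for this example.

```python
POSITION_NAMES = {
    "QB": "Quarterback",
    "RB": "Running Back",
    "WR": "Wide Receiver",
}

SEASON_TYPES = {
    "1": "Pre-season",
    "2": "Regular season",
    "3": "Post-season",
}

UNKNOWN = "Unknown"


def map_value(raw, table, default=UNKNOWN):
    """Map a raw payload value to its display string, or fall back to a default."""
    # str() lets the same table handle "2" and 2, since providers
    # are not always consistent about types.
    return table.get(str(raw).strip(), default)
```

For example, map_value("QB", POSITION_NAMES) gives "Quarterback", while an unexpected code such as "XX" gives "Unknown" instead of raising a KeyError mid-ingest.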
Turn errors and exceptions into notifications
No matter what, you will run into situations where the vendor’s ID for a record is not mapped to your internal one. A default value allows you to continue, but you will want to know that it happened. Try to capture these scenarios and push them into something like a Slack or email notification. Then your team will know that the new venue “XYZ” could not be mapped to a venue in your system and that they should add it.
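Here is one possible shape for this, as a sketch. The notify argument is any callable that accepts a message string; in practice it might post to a chat webhook or send an email, but that wiring is left out and the record fields are assumptions.

```python
from typing import Callable, Dict, Iterable, List


def map_venues(
    records: Iterable[dict],
    venue_map: Dict[str, str],
    notify: Callable[[str], None],
    default: str = "unmapped",
) -> List[dict]:
    """Attach internal venue IDs, falling back to a default and reporting misses."""
    mapped, misses = [], set()
    for record in records:
        internal_id = venue_map.get(record["venue_id"])
        if internal_id is None:
            # Keep ingesting with the default, but remember what was missing.
            misses.add(f'{record["venue_name"]} ({record["venue_id"]})')
            internal_id = default
        mapped.append({**record, "internal_venue_id": internal_id})
    if misses:
        # One message per run, not one per record, to avoid flooding a channel.
        notify("Could not map venues: " + ", ".join(sorted(misses)))
    return mapped
```

Batching the misses into a single message per ingest run keeps the notification useful when a new season introduces many unmapped records at once.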
Be ready to handle unique key collisions
You might be using an event name as a unique key, like “Home Team vs Away Team League Season Date Time.” Sure, games get canceled and rescheduled all the time, but that should still be fine, right?
However, the provider might change the global ID for this game record but neglect to update the name. The older ID might get a new status, or might not be sent in the payload at all. Either way, when your code fetches the updated data, your system now has two records with the same unique key. Oops.
Eventually, the vendor should correct the error, but in the meantime, you may not be able to use that event’s data. Try to anticipate such situations, even if they are rare, and ideally come up with an automated fix should the cruel stars align again.
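A minimal sketch of detecting this case at write time. It uses a plain dict as the store and a list as a quarantine area, both stand-ins for whatever database and review queue you actually use; the field names are assumptions.

```python
def upsert_event(store: dict, quarantine: list, event: dict) -> str:
    """Insert or update an event keyed by name, setting aside ID collisions."""
    key = event["name"]
    existing = store.get(key)
    if existing is not None and existing["provider_id"] != event["provider_id"]:
        # Same unique key, different provider ID: do not overwrite and
        # do not insert a duplicate. Park it for review or an automated fix.
        quarantine.append(event)
        return "collision"
    store[key] = event
    return "updated" if existing is not None else "inserted"
```

What to do with quarantined events is a business decision. One option is to prefer the record whose ID still appears in the latest payload, but that is a policy choice rather than something this sketch decides.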
Flexible client configurations
In addition to the huge number of records we handle, the data in each record is super rich in parameters. Maybe you didn’t include all of them in your model, but later a business user wants to add one.
An annotated protocol buffer schema allows us to add new entities or new attributes to existing entities. Then a code generator creates most of the assets for our application, including UI components for our tools.
Flexibility also includes filtering out records we know we won’t need. For example, Covid-19 means all NBA games are currently played at a single venue. Ingest might filter games based on the venue, so it doesn’t need to process ones that aren’t happening. Or if processing all college basketball events is so large a task that it times out, filtering to just Division I and Division II would still provide sufficient business value in less time.
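Such filters can live in configuration rather than code. A small sketch follows; the IngestFilter name, the field names, and the convention that an empty set means "no restriction" are all assumptions for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestFilter:
    """Per-league ingest filter; an empty set means no restriction."""
    venues: frozenset = frozenset()
    divisions: frozenset = frozenset()

    def accepts(self, game: dict) -> bool:
        if self.venues and game.get("venue") not in self.venues:
            return False
        if self.divisions and game.get("division") not in self.divisions:
            return False
        return True


def filter_games(games, ingest_filter: IngestFilter):
    """Keep only the games this ingest run should process."""
    return [g for g in games if ingest_filter.accepts(g)]
```

A college basketball ingest could then be configured with IngestFilter(divisions=frozenset({"I", "II"})), and loosened later without a code change.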
Focus on what to group and separate in the long run
The cool thing about sport is that it can be so different, yet the same in many ways. A team’s score might be goals, touchdowns + field goals, runs, or baskets, but captured with the same data type. One athlete may have “rushing yards” as a stat and another “rebounds.” Standings and rankings might change even when the team does not play a game. However you decide to set up your tables or columns, it is wise to give some thought to how you want to group or separate different types of values.
Automate to not miss out
If you are dealing with tournaments, like the World Cup or Champions League, the venues and teams will definitely change from season to season, unlike more consistent leagues such as the NFL or NHL. Appropriately built automation is flexible enough to handle both. And if it isn’t, your data validation notifications will let you know the adjustments needed to continue ingesting data.
You will be doing an absurd number of lookups, so cache wherever possible to make life easier. A response payload might contain the results of all NFL athlete stats for a day, and you could make an API call for the global key lookup once for each record. Or you could fetch all the matching athletes at once and cache them. This reduces the number of network calls and increases overall throughput.
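The batch-then-cache approach might look like the sketch below. Here fetch_many stands in for whatever bulk lookup your ID service offers; its name and its "list of IDs in, dict out" shape are assumptions.

```python
from typing import Callable, Dict, Iterable, List, Optional


def resolve_athletes(
    records: Iterable[dict],
    fetch_many: Callable[[List[str]], Dict[str, str]],
    cache: Dict[str, str],
) -> List[Optional[str]]:
    """Resolve internal athlete IDs with at most one bulk lookup per payload."""
    records = list(records)
    missing = {r["athlete_id"] for r in records if r["athlete_id"] not in cache}
    if missing:
        # One network call for every uncached ID, instead of one per record.
        cache.update(fetch_many(sorted(missing)))
    # IDs the lookup could not resolve come back as None.
    return [cache.get(r["athlete_id"]) for r in records]
```

Passing the same cache dict into successive runs means a second payload for the same athletes triggers no lookups at all.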
Leverage cloud services to operate efficiently
We use serverless tasks for ingestion instead of a traditional long-running service, allowing us to reduce our compute costs. When Covid-19 hit and leagues were canceling or postponing on short notice, we were able to respond quickly and selectively turn off ingests. It also meant we could spin each ingest back up when the sport returned, and only pay for processing when actually needed. Serverless tasks do come with memory and duration limitations, though, so you will need to determine whether that works for your use case.
We hope these tips are helpful for managing your own real-time API data sources. This is just a small subset of what Fanatics does with external APIs, and we have some really cool integrations we hope to share with you in the future. Until then, enjoy that game!
We would like to extend our appreciation to Rob Wong, Andrea Longo, and Matthias Spycher for contributing to this article.