Tiers of Data

No one ever really talks about the types of data used in predictive models, but we thought we would give it a whirl because it’s integral to our ability to offer an industry-leading 1% margin. It turns out that the granularity of the underlying data matters a great deal when you make predictions with a statistical model.

For this example, we’re going to look at MLB data, which comes in four tiers.

A) Game Data

B) Inning Data

C) Event Data (Each At Bat)

D) Spatio-Temporal Physical Data

As data gets more granular, the predictive quality increases, but so does the complexity of the calculation. These calculations are computationally expensive and complex because you have to generate random pitches from a cohort of past data, and the count matters. Features like balls, strikes, outs and number of pitchers drastically affect the outcome of each simulated PITCHf/x interaction. The count is intimately connected to the outcome of each simulated pitch, whereas men on base and left-right stances have almost no bearing on accurate pitch result predictions.
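To make the count dependence concrete, here is a minimal sketch of simulating one plate appearance pitch by pitch. The outcome weights are made-up placeholders, not real MLB frequencies, and the names `outcome_weights` and `simulate_plate_appearance` are hypothetical. A real model would draw each pitch from a cohort of historical pitches, and this toy version also ignores foul balls.

```python
import random

def outcome_weights(balls, strikes):
    """Made-up placeholder weights for one pitch, conditioned on the count.

    Assumption for illustration only: pitchers throw more strikes when
    behind in the count. Real weights would come from historical data.
    """
    ball_w = 0.40 - 0.05 * balls + 0.03 * strikes
    strike_w = 0.42 + 0.05 * balls - 0.03 * strikes
    in_play_w = 1.0 - ball_w - strike_w
    return {"ball": ball_w, "strike": strike_w, "in_play": in_play_w}

def simulate_plate_appearance(rng):
    """Simulate pitches until a walk, a strikeout, or a ball put in play."""
    balls, strikes, pitches = 0, 0, 0
    while True:
        weights = outcome_weights(balls, strikes)
        # Every draw depends on the current count, which is why
        # pitch-level simulation costs more than event-level simulation.
        outcome = rng.choices(list(weights), weights=list(weights.values()))[0]
        pitches += 1
        if outcome == "ball":
            balls += 1
            if balls == 4:
                return "walk", pitches
        elif outcome == "strike":
            strikes += 1
            if strikes == 3:
                return "strikeout", pitches
        else:
            return "in_play", pitches

rng = random.Random(42)  # fixed seed so repeated runs match
print(simulate_plate_appearance(rng))
```

The point of the sketch is the loop: every pitch needs its own draw from a distribution that changes with the count, so a single at-bat can cost several conditional lookups instead of one.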

If you think of these calculations in the context of In-Play betting, we want to calculate a new win probability after every pitch, and this computationally intensive task has to be completed within roughly 9 to 12 seconds. It is therefore much easier to use Event data, as it reduces the need for exhaustive probabilistic computation. Eliminating the combinatorial explosion of accounting for the pitch count makes the computations much faster.
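One rough way to see the saving is to count the situations a simulator has to distinguish. The sketch below assumes a deliberately simple state made of outs plus the ball-strike count; the variable names are hypothetical and it illustrates scale only, not our production model.

```python
from itertools import product

OUTS = range(3)     # 0, 1 or 2 outs
BALLS = range(4)    # 0 to 3 balls
STRIKES = range(3)  # 0 to 2 strikes

# Event-level data only sees the situation at the start of each at-bat.
event_states = list(OUTS)

# Pitch-level data must track every count within every situation.
pitch_states = list(product(OUTS, BALLS, STRIKES))

print(len(event_states))  # 3
print(len(pitch_states))  # 36
```

Even in this stripped-down state, tracking the count multiplies the number of situations by twelve, and each of those situations needs its own outcome distribution.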

Although you gain great speed by moving up the data chain to at-bat level data, you quickly find that predictive power drops drastically, and the resulting model may even have negative expected value (-EV). Event data loses all the predictive information available in the actual physics of each play and in the game situations that influence behaviour.

This is true of all sports: significant predictive correlations can be found between player positioning, ball position and game situation. Parsing play-by-play data to find statistically significant predictors of success and feeding them into pre-baked machine learning calculations results in computationally intense predictions, especially when 30,000 or more simulations are required for predictive convergence. Even getting that data and breaking it down is a challenge in itself, and it’s likely too difficult for a syndicate to do, especially when data streams like the NBA API record player movements 25 times a second with the rafter cameras.
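As a rough guide to why tens of thousands of simulations are needed, the uncertainty in a simulated win probability shrinks only with the square root of the number of runs. The sketch below uses the textbook binomial standard error, sqrt(p(1-p)/n), and assumes the simulations are independent; the function name `win_prob_std_error` is hypothetical.

```python
import math

def win_prob_std_error(p, n_sims):
    """Standard error of a win probability estimated from n_sims independent runs."""
    return math.sqrt(p * (1.0 - p) / n_sims)

# Compare a small run with the 30,000 simulations mentioned above,
# using p = 0.5 because that is where the error is largest.
for n in (1_000, 30_000):
    half_width = 1.96 * win_prob_std_error(0.5, n)
    print(f"{n:>6} sims: 95% interval of about +/- {half_width:.4f}")
```

For an evenly matched game this gives about ±0.031 at 1,000 simulations and about ±0.0057 at 30,000, so the larger run narrows the estimate to well under one percentage point.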

At FansUnite, we believe that ensembles of best-in-class algorithms acting on sports data are the future of bookmaking and will displace high-volume models. We look forward to employing the techniques honed by SaberCruncher in baseball and applying them to set sharper lines across all major sports.

TL;DR -> Spatio-temporal data is large and difficult to use in generative predictions, but the predictive reward is significant and worthwhile, especially when using machine learning. Bookmakers who utilize these techniques will be able to shape sharper lines and manage with much tighter margins.