Tiers of Sports Data
No one ever really talks about the types of data in MLB, but I thought I would give it a whirl, because it really matters when you make predictions with some statistical process.

Data in the MLB comes in four forms.
A) Game Data
B) Inning Data
C) At-Bat (Event) Data
D) Spatio-Temporal Physical Data
As data gets more granular, the predictive quality goes up, but the calculation gets a lot more complex, and the return on compute time is not linear. Currently it's not really possible to serve in-play predictions with D), because the computations are too intense even for a single GPU and require distributed approaches (and likely a team of hackers as well).
The reason is that you have to create random pitches from a cohort of past data, and the count matters. Features like balls, strikes, outs and number of pitchers drastically affect the outcome of each simulated PITCHf/x interaction. The count is intimately connected to the individual outcome of each simulated pitch, as opposed to men on base and left-right stances, which have almost no bearing on accurate pitch result predictions. This means that the cost of exhaustive predictive pre-computation is something like 4000*3*4*3*12 calls to predict_proba in sklearn, plus overhead, per pitcher, compared to 1500*150*5 calls when just running random draws from the cohort of 4000 historical pitches itself.
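To make those two figures concrete, here is a rough sketch of the arithmetic. The post does not label the factors, so reading 4, 3 and 3 as balls, strikes and outs is my assumption, and the 12 is left as an unnamed multiplier.

```python
from itertools import product

# Assumption: 4, 3, 3 are read as balls (0-3), strikes (0-2), outs (0-2).
# The post does not label these factors, nor the trailing 12.
BALLS = range(4)
STRIKES = range(3)
OUTS = range(3)
OTHER_FACTOR = 12      # unlabeled in the post
COHORT_SIZE = 4000     # historical pitches in the cohort, from the post

# Every (balls, strikes, outs) state that a simulated pitch can start from.
count_states = list(product(BALLS, STRIKES, OUTS))

# Exhaustive pre-computation: one predict_proba call per pitch, state and factor.
precompute_calls = COHORT_SIZE * len(count_states) * OTHER_FACTOR

# The post's figure for drawing straight from the cohort (factors unlabeled).
random_draw_calls = 1500 * 150 * 5

print(len(count_states))   # 36
print(precompute_calls)    # 1728000
print(random_draw_calls)   # 1125000
```

Note that the first total is per pitcher, so it multiplies again across every pitcher you want to model.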
If you think of this in the context of in-play, after every pitch we want to quickly calculate a new win probability (WP) and make a great in-play product. The problem is that it takes 10 minutes to pump out 3000 game simulations in slow Python after each at-bat. It is therefore much easier to use Event data, as it drastically reduces the number of calls needed for exhaustive probabilistic computation by eliminating the combinatorial explosion of accounting for the count.
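As a minimal sketch of what "calculate a new WP from simulations" means: the win probability is just the share of simulated games won, and its precision depends on how many simulations you can afford. The toy_simulate_game function below is a hypothetical stand-in (a biased coin flip), not a real pitch-by-pitch simulator.

```python
import math
import random

def estimate_win_probability(simulate_game, n_sims, rng):
    """Return (win probability, standard error) from n_sims simulated games."""
    wins = sum(1 for _ in range(n_sims) if simulate_game(rng))
    wp = wins / n_sims
    # Binomial standard error of the estimated proportion.
    se = math.sqrt(wp * (1.0 - wp) / n_sims)
    return wp, se

def toy_simulate_game(rng, true_wp=0.55):
    # Hypothetical stand-in: a real simulator would play out every pitch.
    return rng.random() < true_wp

rng = random.Random(42)
wp, se = estimate_win_probability(toy_simulate_game, 3000, rng)
print(f"WP ~ {wp:.3f} +/- {se:.3f}")

# Time budget implied by the post: 10 minutes for 3000 simulations.
seconds_per_sim = 10 * 60 / 3000
print(seconds_per_sim)  # 0.2
```

At 3000 simulations the standard error is a bit under one percentage point, so cutting the simulation count to save time directly costs precision on the WP.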
Although you gain great speed by moving up the data chain to at-bat level data, you quickly find out that predictive power drops drastically and is perhaps even negative expected value (-EV). Modellers who use refactored Steamer data lose all the amazing information in the actual pitch physics and count data, which really tell you a lot about the tendency of a pitcher to get into unfavourable situations.
This is a similar problem for all sports, particularly the NBA and FIFA. Learning the habits of on-pitch movement, spacing and ball movement is a tremendous advantage for bettors who want to beat the bookie. Even getting that data and breaking it down is its own challenge, and it's likely too difficult for a syndicate to do, especially when you note that data streams like the NBA API literally record player movements 25 times a second with the rafter cameras.
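For a sense of scale, here is a back-of-the-envelope count of position samples for one NBA game at 25 frames a second. The 11 tracked objects (10 players on court plus the ball) and the use of regulation game clock only are my assumptions, so treat the result as a rough lower bound.

```python
FRAMES_PER_SECOND = 25         # from the post
TRACKED_OBJECTS = 11           # assumption: 10 players on court plus the ball
REGULATION_SECONDS = 48 * 60   # four 12-minute quarters, game clock only

frames = FRAMES_PER_SECOND * REGULATION_SECONDS
positions = frames * TRACKED_OBJECTS

print(frames)     # 72000
print(positions)  # 792000
```

That is roughly 800,000 position samples for a single game before any feature engineering, stoppages or overtime.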
TL;DR -> Spatio-Temporal data is really large and difficult to use in generative predictions, but the predictive reward is evident, especially when using machine learning. The trade-off is speed and the inability to partake in the in-play market (for now).
I hope that in the future we bettors can solve some of the problems of speed and optimisation in pitch-by-pitch level computations, like Second Spectrum has done for basketball, but it's still an open research problem and a hard one at that.
