Data science can feel like a mysterious world of algorithms and models thrown together to get a creative solution. But it’s not all smoke and mirrors… getting to a worthwhile conclusion can be difficult without a well thought-out process. Why?
There is so much you can do with data and SO MUCH OF IT.
Because everything is better with an example…
I’ll walk you through a model I created to recommend pitches to the Cubs in games against the Cardinals, and the steps I took to get there. (Technically, this model could help any team, or any talented pitcher quite frankly when throwing pitches against Cardinals players… but my model is dedicated to my Cubs).
Step 1: Identify your Problem & Goal
“Your scientists were so preoccupied with whether or not they could that they didn’t stop to think if they should.”
-Dr. Ian Malcolm (Jurassic Park)
Quite frankly, if the problem you’re trying to solve isn’t actually solving a problem… no one cares.
Identify why what you’re trying to solve for is important… Make it useful. Try to tackle the most important piece of the puzzle... Be focused. There are so many problems to solve, and so many ways to approach them.
Let’s say the overarching objective is ensuring The Cubs will win games against the Cardinals. Imagine every way you could approach that problem — would boosting attendance help? Changing the placements of players? Investigating the effects of travel? Weather? As a cognitive science major, I’d love to dive into neuroscouting…
The list is endless, but time isn’t endless. With the dataset I had I could have spent the next year looking at every pitch, every pitcher, batter, condition of the field…
But I had two weeks, so I kept my problem focused.
How can we reduce the number of balls the Cardinals put in play?
What do you want to achieve? What is everything you’re doing from this point forward going to be in pursuit of? If you don’t keep this light guiding you, you’ll easily get off track. There is just so much to look at! Not to discourage exploring data to stumble upon something new, but alas, you should often ask yourself, “does this help me get to where I’m going?”
Goal: Predict whether a pitch to the Cardinals will be missed/fouled, or put in play.
Step 2: Gather and clean your data
Easy! Well… no. Expect that this will take 80% of the total time. Clean and ready data is very hard to come by.
Before collecting, ask yourself: what data do I need to solve this problem? Ultimately, my model only used three inputs (pitch zone, pitch type, batter), but before I got to modeling, I wanted access to all the information about every pitch to the Cardinals in every game this season. The data I was in search of would have information like location, inning, score, who was at bat, pitch speed, and weather. It would also have information about the results of each pitch, like hit speed, hit zone, hit type, pitch outcome…
Luckily, I found an amazing play-by-play dataset from Sportradar’s API that had all the information I could have possibly dreamed of (thanks Sportradar!)
But the fun didn’t stop there. Each game was an 80+ page JSON file, and the information I wanted was nested in the middle. It was the world’s worst Wonder Ball. Trying to get to the candy when there were 20+ layers of foil and like 15 boxes. It was hard. But you don’t care. Because we haven’t solved the problem yet.
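To make the unwrapping concrete, here is a minimal sketch of flattening a nested game file into one row per pitch. The JSON layout and every field name below are invented for illustration; they are not Sportradar's actual schema, which is not shown in this post.

```python
import json

# Hypothetical game file. The nesting and key names are assumptions,
# not the real Sportradar play-by-play format.
RAW = """
{"game": {"innings": [
  {"number": 1, "halfs": [
    {"events": [
      {"at_bat": {"hitter": "Batter A", "pitcher": "Pitcher X", "pitches": [
        {"type": "fastball", "zone": 5, "outcome": "foul"},
        {"type": "slider", "zone": 11, "outcome": "in_play"}
      ]}},
      {"lineup_change": {"description": "not a pitch"}}
    ]}
  ]}
]}}
"""

def flatten_pitches(game):
    """Walk inning -> half -> event -> at-bat -> pitch and emit one flat row per pitch."""
    rows = []
    for inning in game["game"]["innings"]:
        for half in inning.get("halfs", []):
            for event in half.get("events", []):
                at_bat = event.get("at_bat")
                if at_bat is None:  # skip events that are not at-bats
                    continue
                for pitch in at_bat.get("pitches", []):
                    rows.append({
                        "inning": inning["number"],
                        "hitter": at_bat.get("hitter"),
                        "pitcher": at_bat.get("pitcher"),
                        "pitch_type": pitch.get("type"),
                        "pitch_zone": pitch.get("zone"),
                        "outcome": pitch.get("outcome"),
                    })
    return rows

rows = flatten_pitches(json.loads(RAW))
print(len(rows), "pitches found")
```

Using `.get()` with defaults keeps the walk from crashing when a layer is missing, which matters when the candy is buried under that much foil.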
Then, it’s time to clean. Cleaning more or less means snooping around for anything sketchy (and likely tossing those pieces out). 15 rows where there was no pitcher or hitter listed? Hm… doesn’t sound like a pitch to me. How about the 10 rows where the pitch zone was 0? Yeah… not totally buying that’s a real thing. Little things like this happen all the time. With so many rows of data, it’s only natural something was entered or interpreted weirdly. Just make sure they don’t end up in your final dataset!
You can clean out other things too — for example, I only focused on batters that had at least 50 plate appearances in my dataset. At the end of the day, you’re the architect. Think about what will get you the strongest model, with the goal in mind.
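The cleaning rules above can be sketched in a few lines. The rows and column names here are made up, and the sketch counts rows per hitter as a stand-in for plate appearances, with the cutoff shrunk from 50 to 2 so the tiny sample still shows the effect.

```python
from collections import Counter

# Hypothetical flat rows, one per pitch. Column names are assumptions.
rows = [
    {"hitter": "Batter A", "pitcher": "Pitcher X", "pitch_zone": 5},
    {"hitter": None, "pitcher": "Pitcher X", "pitch_zone": 3},        # no hitter listed
    {"hitter": "Batter A", "pitcher": "Pitcher Y", "pitch_zone": 0},  # zone 0 looks bogus
    {"hitter": "Batter B", "pitcher": "Pitcher X", "pitch_zone": 7},
    {"hitter": "Batter A", "pitcher": "Pitcher Y", "pitch_zone": 9},
]

def clean(rows, min_rows_per_hitter):
    """Drop sketchy rows, then drop hitters with too little data left."""
    # Rule 1 and 2: a pitch needs a pitcher and a hitter, and zone 0 is not trusted.
    valid = [r for r in rows if r["hitter"] and r["pitcher"] and r["pitch_zone"] != 0]
    # Rule 3: keep only hitters with enough remaining rows to learn from.
    counts = Counter(r["hitter"] for r in valid)
    return [r for r in valid if counts[r["hitter"]] >= min_rows_per_hitter]

cleaned = clean(rows, min_rows_per_hitter=2)
print(len(rows), "rows before,", len(cleaned), "rows after")
```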
Step 3: Get to know your data
Often, this is as simple as doing counts. How many Fastballs were thrown? What percentage of balls put in play resulted in legitimate hits? How many times was each player at the plate? You’re going to be spending a lot of time with this data, so you’d better know and like it first.
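Counts like these need nothing fancier than a tally. A minimal sketch, using a handful of invented pitches rather than the real dataset:

```python
from collections import Counter

# Hypothetical cleaned pitches as (pitch_type, outcome) pairs.
pitches = [
    ("fastball", "missed"), ("fastball", "in_play"), ("slider", "missed"),
    ("fastball", "missed"), ("curveball", "in_play"), ("slider", "missed"),
]

# How many of each pitch type were thrown?
type_counts = Counter(pitch_type for pitch_type, _ in pitches)
# How often did each outcome happen?
outcome_counts = Counter(outcome for _, outcome in pitches)
# What share of pitches were put in play?
in_play_share = outcome_counts["in_play"] / len(pitches)

print(type_counts.most_common())
print(f"put in play: {in_play_share:.0%}")
```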
Visualizations are also a great way to catch patterns in your data you may not have noticed otherwise. I created a visualization that showed all of the balls put in play for each player over the first 31 Cardinal games this season, and where the pitches landed. Shapes represented pitch type, and colors represented pitch outcomes. A visualization helped me ultimately see that my three chosen model inputs “pitch type”, “pitch zone” and “batter” were creating patterns in the data, and may have power to predict whether or not a pitch would be missed or put in play. (I mean… duh. But it would be bad to take anything for granted).
(I promise it’s more visual than the link…)
Step 4: Picking your model
This visual does a great job of recommending which model is best based on your situation.
For me, I followed the path all the way down to “Ensemble Classifiers”.
1. I had more than 50 rows of data (5,000+ plate appearances, in fact).
2. I was predicting a category outcome (missed/put in play) versus a number outcome (hit speed for example).
3. My data was “labeled”, meaning it included the results of previous pitches that my model could use to find patterns in and learn from.
4. I had fewer than 100K samples.
5. Linear SVC model did not prove great…
6. I had “text” inputs (like name of player and pitch type), not just number inputs (like pitch-zone).
7. KNN model was also not great…
8. Ensemble classifier worked just right!
The model I ultimately used was a “Catboost model”.
Catboost models use decision trees. So, just like the decision tree in the model-graphic above, my model took steps. When given a “pitch type” it would say “do we think this ball was hit or missed?” Then, on to the next input… based on the pitch zone do we think it will be hit or missed? And finally, the batter. With these three answers, it would make a guess, check its work, and learn what to expect in similar situations in the future.
The model I used is called a “Cat”-“Boost” because the inputs are categorical (i.e. Hitter name and Pitch type), and it uses gradient boosting, meaning it builds many small trees one after another, each one correcting the mistakes of the ones before it, and adds their results together. On their own, each of these features isn’t very predictive, but when reviewed many times and grouped together, their predictive power gets a lot stronger.
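To show the "many weak guesses added together" idea, here is a toy sketch of gradient boosting with squared error, where each weak learner is a one-question "stump" on a single category. This is not how CatBoost works internally (it is far more elaborate), and the eight pitches, batter names, and settings are all invented.

```python
from statistics import mean

# Hypothetical pitches: (pitch_type, pitch_zone, batter) -> 1 if missed, 0 if put in play.
FEATURES = [
    ("fastball", 5, "A"), ("fastball", 11, "A"), ("slider", 5, "A"), ("slider", 11, "B"),
    ("fastball", 5, "B"), ("curve", 5, "B"), ("curve", 11, "A"), ("fastball", 11, "B"),
]
LABELS = [0, 1, 1, 1, 0, 1, 1, 1]

def fit_stump(features, residuals):
    """Pick the single (column, value) question whose two group means best fit the residuals."""
    best = None
    for col in range(len(features[0])):
        for value in sorted({row[col] for row in features}):
            inside = [r for row, r in zip(features, residuals) if row[col] == value]
            outside = [r for row, r in zip(features, residuals) if row[col] != value]
            if not outside:
                continue
            m_in, m_out = mean(inside), mean(outside)
            error = sum((r - m_in) ** 2 for r in inside) + sum((r - m_out) ** 2 for r in outside)
            if best is None or error < best[0]:
                best = (error, col, value, m_in, m_out)
    return best[1:]

def boost(features, labels, rounds=20, learning_rate=0.5):
    """Start from the average label, then repeatedly fit a stump to what is still unexplained."""
    base = mean(labels)
    preds = [base] * len(labels)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(labels, preds)]  # negative gradient of squared error
        col, value, m_in, m_out = fit_stump(features, residuals)
        stumps.append((col, value, m_in, m_out))
        preds = [p + learning_rate * (m_in if row[col] == value else m_out)
                 for row, p in zip(features, preds)]
    return base, learning_rate, stumps

def predict(model, row):
    """Sum the base guess and every stump's small correction."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        m_in if row[col] == value else m_out for col, value, m_in, m_out in stumps)

def squared_error(model, features, labels):
    return sum((y - predict(model, row)) ** 2 for row, y in zip(features, labels))

one_round = boost(FEATURES, LABELS, rounds=1)
model = boost(FEATURES, LABELS)
print(squared_error(one_round, FEATURES, LABELS), squared_error(model, FEATURES, LABELS))
```

No single stump separates the hittable pitches here, but twenty small corrections stacked together fit the toy data noticeably better than one.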
Step 5: How do I know if my model is good?
There are many different metrics one could use to evaluate a model, but it’s best to choose the one that makes the most sense for what you’re trying to do.
First, you must understand how models are tested. A common way to test a model is by doing a “Train-Test-Split” — splitting a piece of the data off, training on the rest, and seeing how good your model is at predicting the outcomes of the split set. This test is representative of how your model will perform in the future, predicting outcomes that are not yet known.
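A train-test split can be sketched with nothing but the standard library. The 20% hold-out share and the fixed seed below are assumptions for illustration; the post does not say which proportions were used.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then hold out the last test_fraction of them."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]

rows = list(range(100))  # stand-in for 100 pitch rows
train, test = train_test_split(rows)
print(len(train), "rows to learn from,", len(test), "rows held back for testing")
```

Shuffling before splitting keeps the held-out rows from all coming from the same stretch of games.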
I primarily focused on Precision as my metric of success. To understand why, we must look at the confusion matrix for my test data:
True Positive: Times my model guessed a miss and it was.
False Negative: Times my model guessed the ball would be put in play and it was missed.
False Positive: Times my model guessed a miss and it was put in play.
True Negative: Times my model guessed it would be put in play and it was.
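Tallying those four cells, and the precision that comes out of them, can be sketched as follows. The six pitches are invented, not the real test set, and "miss" is treated as the positive class to match the definitions above.

```python
def confusion_counts(actual, predicted, positive="miss"):
    """Count the four confusion-matrix cells, treating `positive` as the positive class."""
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts

def precision(counts):
    """Of all the times the model said 'miss', how often was it right?"""
    guessed_positive = counts["tp"] + counts["fp"]
    return counts["tp"] / guessed_positive if guessed_positive else 0.0

# Hypothetical outcomes and guesses for six pitches.
actual    = ["miss", "miss", "play", "miss", "play", "miss"]
predicted = ["miss", "play", "miss", "miss", "play", "miss"]

counts = confusion_counts(actual, predicted)
print(counts, precision(counts))
```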
Now, you may say… 374 times the model guessed it would be hit and it was missed… that’s terrible! If we were interested in predicting every time the ball would be hit, it would be bad. Really really bad. But, consider the purpose of this model: to not throw pitches that result in hits. So if the model thinks a pitch will result in a hit, don’t throw it! No problem there.
In fact, it got to the point where I purposefully increased the number of false negatives, and reduced the overall accuracy of the model, in order to decrease the “false positive” rate as much as I could. What you REALLY want to avoid is predicting a pitch will be a miss, and it’s actually a hit. Not good. By reducing the number of false positives, I increased the model’s “precision”.
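One common way to make that trade is to only call a pitch a "miss" when the model is very sure. The post does not say which mechanism was actually used, so treat the threshold approach below as an assumption, and the ten scored pitches as invented.

```python
# Hypothetical model outputs: (predicted probability of a miss, what actually happened).
scored = [
    (0.95, "miss"), (0.90, "miss"), (0.85, "miss"), (0.80, "play"),
    (0.70, "miss"), (0.65, "miss"), (0.60, "play"), (0.55, "miss"),
    (0.40, "play"), (0.30, "miss"),
]

def evaluate(scored, threshold):
    """Predict 'miss' only when the probability reaches the threshold, then score the result."""
    tp = sum(1 for p, actual in scored if p >= threshold and actual == "miss")
    fp = sum(1 for p, actual in scored if p >= threshold and actual == "play")
    fn = sum(1 for p, actual in scored if p < threshold and actual == "miss")
    tn = sum(1 for p, actual in scored if p < threshold and actual == "play")
    return {
        "false_positives": fp,
        "false_negatives": fn,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "accuracy": (tp + tn) / len(scored),
    }

default = evaluate(scored, threshold=0.5)
strict = evaluate(scored, threshold=0.85)
print("default:", default)
print("strict: ", strict)
```

On this toy data the stricter threshold wipes out the false positives and lifts precision, at the cost of more false negatives and lower overall accuracy, which is the same trade described above.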
And my model ended up with 91% precision. Up from 83% without the model. Pretty good! (That’s a B- to an A-… the world to a nerd like me).
Step 6: Improving your model
I was pretty content with my final precision score, but this wasn’t what I got on my first shot… I had to reconsider and refine a number of times. Here are some of the steps you should consider when trying to improve a model:
1. Add more data!: I had originally only looked at the five games between the Cubs and Cardinals this year. As one might guess, it just wasn’t enough data to make a strong prediction. So I included all the games the Cardinals have played this season.
2. Reduce the number of features: I spent a considerable amount of time collecting and cleaning my data, and got a ton of advice from the biggest baseball fans I know. I wanted to include it all: pitcher, batter and team statistics, weather, wind, humidity, altitude of the field, inning, home/away game, time of game, who was on base, the score, number of strikes/balls/outs… But my poor model got confused with so many inputs, and a limited number of plate appearances to learn from. I had to tone it down, and ultimately decide on three: pitch type, pitch zone & batter.
3. Reduce the number of outputs: If I had it my way, I would predict not only whether the ball was put in play, but what type of hit it was and where on the field it landed. Once again, over ambitious for the data I had access to. It was best to stick to only two predicted outcomes: miss/put in play.
4. Try a different model: I tried many classification models. But the Catboost model came out on top!
5. Change your parameters: Each model has certain parameters you can change, kind of like dials on a washing machine. What might be good for shirts might not be good for towels. It takes trial and error and knob turning to figure out the best setting. For example, my Catboost model used “weighted classes”, a dial to even out the fact that there are way more misses than hits in my dataset.
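As a rough illustration of what a class-weight dial does, the sketch below computes "balanced" weights, where each class is weighted by total rows divided by (number of classes times rows in that class). The 80/20 split is invented, and the exact weights used for the model in this post are not stated.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count), so rarer classes count more."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical imbalance: far more misses than balls put in play.
labels = ["miss"] * 80 + ["play"] * 20
weights = balanced_class_weights(labels)
print(weights)
```

With weights like these, every mistake on a rare "play" row costs the model four times as much as a mistake on a "miss" row, so it can no longer score well just by guessing "miss" every time. Libraries such as CatBoost accept weights of this kind through a class-weights setting.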
Step 7: Make your model useable
You’ve made it to the end! Congratulations. Hopefully this provided a little more context on the data science process.
If you’re interested in learning more, check out my GitHub!