Machine learning and baseball pitch type classification

Ning Lee
Published in Analytics Vidhya
3 min read · Feb 3, 2020

This brief article is about how I applied machine learning to MLB game data. The goal is to use PITCHf/x data, which includes velocity, spin rate, spin direction, and position, to classify pitch types. You can visit my GitHub for detailed information, results, and code for this project.

1. Data and visualization:
The data is from Kaggle; it includes every pitch thrown and every at-bat during the 2015–2018 MLB regular seasons. Data for each game and every player's name can be found in other CSV files. The pitch data contains 40 columns recording velocity, spin rate, spin direction, pitch position, and the situation on the field.

Fig 1. The total amount of each pitch type

Here is a figure showing the total number of times each pitch was thrown. FF (Four-seam Fastball) is by far the most common, whereas FO, PO (Pitchout), EP (Eephus), and SC (Screwball) are rarely used in games. Therefore, these pitch types will be removed during preprocessing. IN (Intentional ball) will also be removed, because there is little point in classifying such a pitch and it may not provide enough data for the algorithms either. Except for the neural network, I fed only part of the data into the models to keep computational costs down.
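A minimal sketch of this filtering step, assuming the labels live in a `pitch_type` column (the column name and the toy data below are illustrative, not the real Kaggle schema):

```python
import pandas as pd

# Toy stand-in for the Kaggle pitch data; only the label column matters here.
pitches = pd.DataFrame({
    "pitch_type": ["FF"] * 5 + ["SL"] * 4 + ["FO", "PO", "EP", "SC", "IN"],
})

# Pitch types too rare (or pointless) to classify.
RARE_TYPES = {"FO", "PO", "EP", "SC", "IN"}

filtered = pitches[~pitches["pitch_type"].isin(RARE_TYPES)].reset_index(drop=True)
print(filtered["pitch_type"].value_counts().to_dict())  # {'FF': 5, 'SL': 4}
```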

2. Preprocessing and model evaluation:

As mentioned above, rare pitches are removed from the data because there are too few examples of them. Based on correlation analysis, several features were excluded; for instance, "end_speed" is highly collinear with "start_speed". Finally, features recording the situation on the field are removed as well, since they simply have no correlation with the ball's movement.
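One common way to drop collinear features, sketched here on synthetic columns (the 0.9 threshold and the toy data are assumptions, not values from the project):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
start_speed = rng.normal(93, 2, 200)
df = pd.DataFrame({
    "start_speed": start_speed,
    "end_speed": start_speed - rng.normal(8, 0.3, 200),  # nearly collinear
    "spin_rate": rng.normal(2200, 150, 200),              # independent
})

corr = df.corr().abs()
# Keep only the upper triangle so each feature pair is checked once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df = df.drop(columns=to_drop)
print(to_drop)  # ['end_speed']
```

Dropping one column of each highly correlated pair keeps the information while removing the redundancy that can destabilize linear models.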

I used a pipeline to evaluate different classifiers, including SVM, Random Forest, Gaussian Naive Bayes, and Logistic Regression. The neural network model is evaluated separately. The f1-scores are around 0.7: SVM and Random Forest have the highest score of 0.75, while Gaussian Naive Bayes has the lowest at 0.65.
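A sketch of this kind of pipeline comparison with scikit-learn, on synthetic data standing in for the pitch features (the exact preprocessing and model settings used in the project are not shown here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the numeric pitch features.
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifiers = {
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gaussian NB": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, clf in classifiers.items():
    # Scaling + classifier in one pipeline keeps the evaluation uniform.
    pipe = make_pipeline(StandardScaler(), clf)
    pipe.fit(X_train, y_train)
    scores[name] = f1_score(y_test, pipe.predict(X_test), average="weighted")
    print(f"{name}: weighted f1 = {scores[name]:.2f}")
```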

However, I did not search for the best hyper-parameters for these models. The neural network has the best f1-score, 0.78; for it, I tried different network structures and related hyper-parameters. Because the neural network performs best, the following discussion will focus on it only.
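The post does not specify the neural network framework or architecture, so here is only a sketch using scikit-learn's `MLPClassifier` on synthetic data; the layer sizes and activation below are placeholders for the hyper-parameters being searched:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=800, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hidden layer sizes and activation are the kind of hyper-parameters
# searched over; these values are illustrative only.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                  max_iter=500, random_state=0),
)
mlp.fit(X_train, y_train)
nn_f1 = f1_score(y_test, mlp.predict(X_test), average="weighted")
print(f"weighted f1 = {nn_f1:.2f}")
```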

3. Result:

The classification report indicates that some pitch types have relatively low precision and recall: FC (Cutter), FS (Splitter), FT (Two-seam Fastball), KC (Knuckle curve), and SI (Sinker), with f1-scores ranging from 0.31 (FS) to 0.60 (FT). These classes obviously drag down the overall accuracy of the model, and significant improvement seems hard to achieve, even after searching for the best hyper-parameters and trying different network structures and activation functions.

Table 1. Classification report of neural network

However, what if we tried to classify a single pitcher rather than all pitchers in the league?

I chose Clayton Kershaw, Gerrit Cole, and Zack Greinke to test the idea, and the models improved significantly. Every overall f1-score is at least 0.93: Clayton Kershaw's model has the best f1-score at 0.99, while Zack Greinke's and Gerrit Cole's are 0.97 and 0.93, respectively.

Fig 2. Confusion matrix of Gerrit Cole

However, some pitch types still have quite low precision and recall. Looking further into Gerrit Cole's confusion matrix, many FT (36%) and SI (23%) pitches were classified as FF. Even though we can tell them apart on video, the algorithm is still unable to do so reliably. One possible solution is to combine different features into values more directly related to the trajectory or movement of the ball; after all, the main difference between FF and FT is some extra movement.
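One way that combination might look, assuming the data exposes the PITCHf/x horizontal and vertical movement fields under the names `pfx_x` and `pfx_z` (the column names and the toy values below are assumptions): collapse the two components into a movement magnitude and direction.

```python
import numpy as np
import pandas as pd

# Toy rows; pfx_x / pfx_z stand for horizontal and vertical movement
# in inches — whether a given dataset uses these names is an assumption.
pitches = pd.DataFrame({
    "pitch_type": ["FF", "FT"],
    "pfx_x": [-4.0, -8.5],
    "pfx_z": [9.0, 6.0],
})

# Magnitude and angle of movement may separate FF from FT better
# than either raw component alone.
pitches["move_mag"] = np.hypot(pitches["pfx_x"], pitches["pfx_z"])
pitches["move_angle"] = np.degrees(np.arctan2(pitches["pfx_z"], pitches["pfx_x"]))
print(pitches[["pitch_type", "move_mag", "move_angle"]])
```

Derived features like these encode the physics directly, so the classifier no longer has to learn the combination from the raw components itself.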
