Clustering MLB Hitters from the 2023 Season

Jaron Richman
INST414: Data Science Techniques
4 min read · Apr 30, 2024

MLB hitters are typically grouped into three categories: power hitters, contact hitters, or a combination of the two. Hitters are usually placed into these groups on the basis of batting average and home runs. But what if there were other statistics to look at when grouping players? And what if it turns out there are more than three common groups? With Baseball Savant, we are able to collect and dissect almost any stat we want. To start my analysis, I took two different datasets from Baseball Savant: one that contains batted ball data, and another that contains plate discipline data. All players included were qualified hitters in 2023. After merging the two on each player's name and ID, I had one dataset that encompassed the entire skill set of a hitter.

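import pandas as pd

# Load the two Baseball Savant exports (batted ball and plate discipline data)
data1 = pd.read_csv("/Users/Jaron/Desktop/INST414 Files/exit_velocity.csv")
data2 = pd.read_csv("/Users/Jaron/Desktop/INST414 Files/stats.csv")

# Combine the two datasets on player ID and name
df = data1.merge(data2, how="left", on=["player_id", "last_name, first_name"])

# Keep only the statistics used in the analysis
df = df[['last_name, first_name', 'avg_hit_angle', 'max_hit_speed', 'avg_hit_speed', 'brl_percent', 'k_percent',
         'bb_percent', 'z_swing_percent', 'oz_swing_percent', 'iz_contact_percent']]

# Index by player name and drop any rows with missing values
df = df.set_index('last_name, first_name')
df = df.dropna()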

Cleaning the data was fairly easy, since it was official MLB data without any inconsistencies. All I needed to do was drop rows with null values and condense the dataset to only the statistics I wanted to include in my analysis.

To measure similarity, I used standard Euclidean distance in my model. First, though, I had to select a value for k. I did that by creating an elbow curve and evaluating which value of k looked best. After careful analysis, I settled on k = 4.
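For readers who want to try the elbow approach, here is a minimal sketch of how the curve's values can be computed. It assumes scikit-learn's KMeans (which clusters on Euclidean distance) and uses randomly generated stand-in data rather than the real Baseball Savant table. Standardizing the columns first is also an assumption of this sketch, made because Euclidean distance is sensitive to the scale of each statistic.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the merged hitter table (NOT real Baseball Savant data).
rng = np.random.default_rng(0)
columns = ["avg_hit_angle", "max_hit_speed", "avg_hit_speed", "brl_percent",
           "k_percent", "bb_percent", "z_swing_percent", "oz_swing_percent",
           "iz_contact_percent"]
demo_df = pd.DataFrame(rng.normal(size=(130, len(columns))), columns=columns)

# Assumption: put every statistic on the same scale before measuring distance.
X = StandardScaler().fit_transform(demo_df)

# Fit k-means for a range of k values and record the inertia
# (the within-cluster sum of squared Euclidean distances).
inertias = {}
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k:2d}  inertia={inertia:.1f}")
```

Plotting inertia against k gives the elbow curve. The "elbow" is the point where adding another cluster stops reducing inertia by much, and picking it is a judgment call.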

With k = 4 set, I went ahead and created my clusters. The result was four clusters of 31, 34, 25, and 40 players. The first cluster consisted of players who had a blend of power and the ability to hit for average. These are your above-average players who are exciting to watch and excel in multiple aspects of the game. They have the capability to hit home runs, but will also go gap to gap and collect extra-base hits in droves.
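As a rough illustration of this step, the sketch below fits k-means with k = 4 and counts the players in each cluster. It again assumes scikit-learn's KMeans and standardized features, and it runs on synthetic data with hypothetical player labels, so the cluster sizes it prints will not match the ones above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical player labels (NOT the real dataset).
rng = np.random.default_rng(1)
demo_df = pd.DataFrame(
    rng.normal(size=(130, 3)),
    columns=["avg_hit_speed", "k_percent", "bb_percent"],
    index=[f"Player {i}" for i in range(130)],
)

# Assumption: standardize so that no single statistic dominates the distance.
X = StandardScaler().fit_transform(demo_df)

# Fit k-means with four clusters and attach each player's cluster label.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
demo_df["cluster"] = kmeans.labels_

# Count how many players landed in each cluster.
cluster_sizes = demo_df["cluster"].value_counts().sort_index()
print(cluster_sizes)
```

Note that k-means numbers its clusters arbitrarily, so the label a group receives can change between runs unless `random_state` is fixed.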

Sample of players who were in Cluster #1

The second cluster consists more of your “pure hitters”. These are hitters who might not have the highest exit velocities, but have great plate discipline and do not strike out much. They tend to hit for a high average and are typically a tough out at the plate.

Sample of players who were in Cluster #2

Cluster #3 contains your power hitters. They hit the ball hard and hit a lot of home runs, but they also strike out a lot.

Sample of players who were in Cluster #3

The final cluster is more or less everyone else. These players are similar to Cluster #1, but with better plate discipline and a little more power. There are more superstars in this cluster than in any of the others, which makes sense, as these are the most well-rounded players.

Sample of players who were in Cluster #4
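One way to check cluster descriptions like these against the numbers is to average each statistic within each cluster. The sketch below shows the idea with pandas on a tiny table of made-up values and hypothetical player labels. It is not the real 2023 data, and the cluster numbers are illustrative only.

```python
import pandas as pd

# Made-up values for six hypothetical hitters, already labeled with a cluster.
demo = pd.DataFrame(
    {
        "cluster": [0, 0, 1, 1, 2, 2],
        "avg_hit_speed": [90.0, 92.0, 87.0, 88.0, 93.0, 95.0],
        "k_percent": [20.0, 22.0, 12.0, 14.0, 30.0, 28.0],
    },
    index=["Hitter A", "Hitter B", "Hitter C", "Hitter D", "Hitter E", "Hitter F"],
)

# Average every statistic within each cluster to get a per-cluster profile.
profile = demo.groupby("cluster").mean()
print(profile)
```

In this toy table, cluster 2 has the highest average exit velocity and strikeout rate, which is the kind of pattern that would suggest a "power hitter" label.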

Because I only used qualified hitters from 2023, hitters with smaller sample sizes who could have contributed to these clusters were left out. The same goes for using only 2023 data; if I had combined multiple seasons, I would have had more data points. Additionally, my choice of stats was subjective, so someone building a similar model with their own metrics could get different results. The ones I chose were stats I felt were important in evaluating a hitter, but other people may have different opinions.

I have included a link to my GitHub containing my code for this article:

https://github.com/jaronrichman/INST414-Module-Assignments
