America’s Favorite Pastime: Similarity Assessment of Popular Philadelphia Phillies Players

Dhruvit Patel
INST414: Data Science Techniques
May 1, 2024
Photo by Kirk Thornton on Unsplash

Introduction:

Ballpark food, summer vibes, and competition: there are many reasons baseball is great, but most of all it’s the players. Hitting an MLB pitch is often said to be one of the most difficult things to do in all of sports, which makes good hitters extremely valuable to MLB stakeholders. We will conduct a similarity assessment of three of the Philadelphia Phillies’ hardest hitters: Nick Castellanos, Kyle Schwarber, and Bryce Harper. This analysis should be useful to general managers, whose primary responsibility is to field the best possible starting lineup within a certain budget.

Question and Stakeholder:

The key question this assessment answers is: which players in professional baseball exhibit performance characteristics most similar to our players? If a current player closely resembles a retired player, scouts and managers can research what the retired player did to improve and apply it to the current player. Alternatively, if the current player’s contract is ending soon, this assessment can help management replace them with a similar player from another team, perhaps for less money. General managers can use this assessment to inform roster and player-development decisions.

Data:

This all-time MLB hitting dataset was found on Kaggle. It has 2508 records (players) and 18 columns. Some of the columns in this dataset are player name, position, games played, at-bats, runs, hits, doubles, triples, home runs, strikeouts, stolen bases, batting average (AVG), on-base percentage (OBP), slugging percentage (SLG), on-base plus slugging (OPS), and more. The three columns we will use in our assessment are batting average, on-base percentage, and slugging percentage. These three statistics are among the most widely used indicators of how good a batter is. Batting average is the number of hits divided by the number of at-bats. On-base percentage is how frequently a batter reaches base per plate appearance. Slugging percentage is the total number of bases a batter earns on hits per at-bat. Together, they show how productive a batter is, and thus how valuable they are. If a player is similar to one of the target players based on the similarity assessment, it means they have similar hitting profiles.
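As a rough illustration of how the three statistics are derived from counting stats, here is a minimal Python sketch using the standard formulas. The function names and the sample stat line are made up for illustration and do not come from the Kaggle dataset.

```python
# Standard formulas for the three rate statistics.
# Function names and the sample numbers below are hypothetical.

def batting_average(hits, at_bats):
    """Hits divided by at-bats."""
    return hits / at_bats

def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """Times on base divided by at-bats plus walks, hit-by-pitches, and sacrifice flies."""
    reached = hits + walks + hit_by_pitch
    chances = at_bats + walks + hit_by_pitch + sac_flies
    return reached / chances

def slugging_percentage(singles, doubles, triples, home_runs, at_bats):
    """Total bases on hits divided by at-bats."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats

# A made-up stat line: 150 hits (93 singles, 30 doubles, 2 triples, 25 home runs)
# in 500 at-bats, with 60 walks, 5 hit-by-pitches, and 5 sacrifice flies.
avg = batting_average(150, 500)
obp = on_base_percentage(150, 60, 5, 500, 5)
slg = slugging_percentage(93, 30, 2, 25, 500)
print(round(avg, 3), round(obp, 3), round(slg, 3))
```

Note that all three are rates, so each one is already scaled by the batter's number of opportunities.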

This data was loaded and cleaned using Python’s pandas library. The dataset from Kaggle is a CSV file and was read into a DataFrame using pd.read_csv(). Some rows had missing values, so they were dropped using df.dropna(), which reduced the total number of records from 2508 to 2488.
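The loading and cleaning step might look roughly like the sketch below. Because the real CSV is not reproduced here, it reads a tiny inline stand-in instead. The column names (Player, AVG, OBP, SLG) and the rows are assumptions for illustration, not the dataset's actual contents.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Kaggle CSV; column names and rows are assumptions.
csv_text = """Player,AVG,OBP,SLG
Player A,0.300,0.380,0.520
Player B,0.250,,0.410
Player C,0.275,0.350,0.480
"""

# With the real file, a path would be passed to pd.read_csv instead of StringIO.
df = pd.read_csv(io.StringIO(csv_text))
rows_before = len(df)

# Drop every row that has at least one missing value.
df = df.dropna()
rows_after = len(df)

print(rows_before, rows_after)
```

One design choice worth considering: df.dropna() removes a row if any column is missing, while passing subset=["AVG", "OBP", "SLG"] would remove only rows missing one of the three columns actually used.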

Similarity Measurement:

The similarity metric I chose for this analysis was cosine similarity. I used it instead of Euclidean distance because I did not want the number of games played (time) to affect the result: cosine similarity compares the direction of two players’ statistic vectors rather than their magnitude. Every player has played a different number of games and has therefore had a different number of opportunities to affect their hitting numbers. Some players may also have played less because of injury, and I did not want to hold that against them when calculating similarities, although stakeholders should still take it into consideration when making decisions.
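To make the difference between the two metrics concrete, here is a small NumPy sketch. The two vectors are made up: the second is the first scaled by two, so they point in the same direction but have different magnitudes.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors: y is x scaled by 2 (same direction, larger magnitude).
x = np.array([0.280, 0.390, 0.520])
y = 2 * x

cos = cosine_similarity(x, y)          # depends only on direction
dist = float(np.linalg.norm(x - y))    # Euclidean distance grows with the magnitude gap

print(cos, dist)
```

Cosine similarity treats the two vectors as essentially identical, while Euclidean distance reports a clear gap between them.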

Findings:

The 10 most similar players in the dataset to each of the three query players (Bryce Harper, Kyle Schwarber, and Nick Castellanos) are shown above. Every one of the other players listed in each column has a cosine similarity of at least 0.999, indicating a very strong similarity.
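A ranking like this can be produced along the following lines. This is a sketch under assumptions, not the notebook's actual code: the helper most_similar, the column names, and the four rows are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset; names and numbers are made up for illustration.
df = pd.DataFrame({
    "Player": ["Query", "A", "B", "C"],
    "AVG": [0.280, 0.281, 0.240, 0.300],
    "OBP": [0.390, 0.391, 0.300, 0.340],
    "SLG": [0.520, 0.522, 0.380, 0.600],
})

def most_similar(df, name, features, k=10):
    """Return the k players with the highest cosine similarity to `name`."""
    X = df[features].to_numpy(dtype=float)
    # Scale each row to unit length so a dot product equals cosine similarity.
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    query = unit[(df["Player"] == name).to_numpy()][0]
    scores = pd.Series(unit @ query, index=df["Player"])
    # Exclude the query player, then sort from most to least similar.
    return scores.drop(name).sort_values(ascending=False).head(k)

top = most_similar(df, "Query", ["AVG", "OBP", "SLG"], k=2)
print(top)
```

Because all three features are positive and fall in a similar range, cosine similarities between hitters tend to cluster close to 1, so small differences in the third or fourth decimal place are what separate the rankings.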

To answer our question, let’s take a look at the players similar to Bryce Harper. The player most similar to Harper is J Foxx (Jimmie Foxx), who played from the mid-1920s to the mid-1940s. Even though this was a very different era for baseball, general managers can study what helped and didn’t help Foxx and potentially apply it to Harper. Since they have similar performances, analyzing the past might be the key to improving today.

Furthermore, Matt Chapman, who is currently playing for the San Francisco Giants, also appears among the players similar to Harper. He and Harper are both 31 years old. Chapman entered the MLB in 2017 and has played a total of 731 games, whereas Harper has been in the MLB since 2012 and has played 1382 games. Even so, they have similar statistics for batting average, on-base percentage, and slugging percentage. If cosine similarity were not used, these two players would not be similar due to Harper playing 651 more games than Chapman. This information can be very helpful to general managers because Harper is considered a ‘superstar’ in baseball. If he were to get injured for an extended period or reach the end of his contract, a general manager could acquire Chapman to replace him for far less money, and the performance would, in theory, be roughly the same.

Limitations:

One limitation of my analysis is that it does not take into account the human aspect of baseball. There are times when players fall into slumps or never get back to their old selves after an injury. For some players, having more games to play can help them find a rhythm and start performing well. Also, as I mentioned before, Jimmie Foxx played in an entirely different era of baseball. It wasn’t until 1947 that African American players were allowed to enter the MLB. Because the dataset covers only MLB players, it does not include those who were excluded. Many African American players in other leagues before 1947 could be similar to the three Philadelphia Phillies players we based our assessment on.

Github:

https://github.com/dhruvitpatel5/Mod3HittingStatsMLB/blob/main/mod3MLB.ipynb
