Finding Similarity in Professional Volleyball Players

Alejandro Tarazona
INST414: Data Science Techniques
4 min read · May 6, 2022

As a Peruvian, I take pride in our women’s volleyball team. They were one of the first teams to take us far in the Olympics, so without a doubt I decided to focus on a volleyball CSV file. I ran into several files that I could have used. I started with an MLS stats CSV file, but the report wouldn’t work correctly. Then I moved on to an NCAA women’s volleyball stats CSV file, which was too large to work with: it had over 700,000 rows, so my computer took a long time to process it. I went back to Kaggle to look for more CSV files and found an international volleyball stats CSV file. It was easy to upload and manipulate, and the data was specifically on professional women’s volleyball players in the FIVB.

Data Collection

I started by uploading the file into Google Colab. I wanted to see what the dataset looked like, so I ran a simple pd.read_csv, which loaded the file and displayed its contents as a table. This was an easy process that took less than a minute.

Data Analysis

Then I wanted to work with just the raw numbers. I ran pd.read_csv once again and sorted the values by the “name” column. From there, I wanted only the player stats, so I set a variable, player_info, equal to df._get_numeric_data(). This pulled the numerical columns out of the dataset, leaving me with just numbers.
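The loading and filtering steps above can be sketched roughly as follows. This is a minimal sketch, not the notebook's exact code: the inline CSV text and its column names (season, kills, blocks, aces) are made-up stand-ins for the Kaggle file, and it uses the public select_dtypes(include="number") in place of the private helper mentioned above. The reset_index call is also my own choice, so that row positions match the sorted order.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Kaggle CSV; the real file has different
# columns and many more rows.
csv_text = """name,season,kills,blocks,aces
Player C,2019,210,30,12
Player A,2019,180,45,20
Player B,2018,150,22,9
"""

df = pd.read_csv(io.StringIO(csv_text))

# Sort by the "name" column and renumber the rows to match the new order.
df = df.sort_values("name").reset_index(drop=True)

# Keep only the numeric columns; the text column "name" is dropped.
player_info = df.select_dtypes(include="number")

print(player_info)
```

Note that any numeric column survives this step, including ones like season that are not really performance stats, so it is worth checking which columns end up in the result.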

I looked through the CSV file because I wanted to see all the players in this dataset. The stats are great to see, especially since some sites do not always update their players’ stats. I then ran a quick df.iloc[38] to look at row 38, Angela Leyva.

I selected Angela Leyva as my target player. Looking around online and at my peers’ Medium posts, I noticed that Grace O’Neill used a pairwise distance function, so I decided to take the same route to find similar players. I imported the scikit-learn Python library, which offers many different tools. The main one I was looking for was the pairwise_distances() function, which let me compute the distance between every player’s raw stats and leyva_info.

I started by making an array of stats for each player in this dataset. I wanted to include everyone since the data was not massive. So I used the all-numerical dataset created above as player_info and kept Angela Leyva’s stats as leyva_info = [player_info.iloc[38]].
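To show what the distance step might look like, here is a small sketch using pairwise_distances from sklearn.metrics. The three rows of stats are invented for illustration, and the target is row 0 rather than row 38. The function expects two-dimensional inputs, which is why the post wraps the target row in a list; the sketch gets the same shape by slicing with stats[[0]].

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical stats (kills, blocks, aces) for three players.
stats = np.array(
    [
        [180.0, 45.0, 20.0],  # target player
        [150.0, 22.0, 9.0],
        [210.0, 30.0, 12.0],
    ]
)

# Keep the target as a 2D array with one row, as the function requires.
target_info = stats[[0]]

# Euclidean distance (the default metric) from every player to the target.
distances = pairwise_distances(stats, target_info, metric="euclidean")

# One column, one distance per player; the target is at distance 0 from itself.
print(distances.ravel())
```

Because the result always includes the target's own row at distance zero, the closest "match" in any ranking will be the target player.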

Next, I created a dictionary with names as keys and distances as values. I sorted the data by name in ascending order. I then made a list of the first ten dictionary items to read the results quickly.

I generated a list of the ten players most similar to Angela Leyva by taking the smallest distance values. These players were the closest statistical matches to her. Some players appear more than once because each season is a separate record, which is why another Angela Leyva record shows up as a match.
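One way to rank the results is sketched below. The records and distances are invented, and the key format is an assumption on my part: since a plain dictionary keyed on name alone would overwrite a player's earlier seasons, the sketch combines name and season into one key so that repeated players stay visible.

```python
# Hypothetical (name, season, distance-to-target) records.
records = [
    ("Player A", 2019, 0.0),   # the target's own record
    ("Player B", 2018, 39.4),
    ("Player A", 2018, 5.2),   # same player, different season
    ("Player C", 2019, 31.6),
]

# Key on name plus season so multi-season players do not collide.
distance_by_player = {
    f"{name} ({season})": dist for name, season, dist in records
}

# Sort by distance, smallest first, and keep at most the ten closest.
closest = sorted(distance_by_player.items(), key=lambda item: item[1])[:10]

for player, dist in closest:
    print(f"{player}: {dist}")
```

Sorting on the distance value rather than the key is what puts the most similar players at the top of the list.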

Problems, Issues, and Bugs

I ran into many problems when picking the right CSV file. Some files were so large that they took up to five minutes to upload into Google Colab. Finding a good balance between the amount of data and how specific it was made this a difficult pick. From there, I had to look around for help on what type of graphs I needed to generate. I wanted something simple that would let me understand what my data was showing. In sports like soccer and volleyball, it is hard to find a similar player since everyone is different. There are many positions someone can play, and you also have to consider what the coach is asking of them. A player who is not getting kills or scoring may instead be expected to assist. Finding the right balance within a dataset was what took the longest.

Conclusion

I learned that, statistically speaking, some players can be similar even though we think certain players are better than others when we watch a sport. My MLS stats also showed me that some players can be very similar in stats. If we judged all the players on their stats alone, things might change for many of us. I have recently watched more highlight videos of Coraima Gomez and Sara Bonifacio. Although they do not play the same way Angela Leyva plays, they still get about the same number of kills. This project has opened my eyes to other players.

Link to GitHub : https://github.com/Atarazona11/INST-414/blob/main/international_vb_players.ipynb
