FIFA 21 Data Manipulation Practice

Ozan Güner
Published in The Startup
6 min read · Nov 24, 2020

In this week’s post, I would like to show how I prepared and analyzed the FIFA 21 dataset. This post will be more practical than informative. Of course I will share some information, but I want this article to encourage further research, because I think you learn more by researching on your own.

Image source: www.essentiallysports.com

Before I start, here is the link where you can download the dataset I worked on. I will be using Jupyter Notebook for this practice. You can download Jupyter Notebook from here.

Let’s begin:)

First Look

After downloading the dataset to your computer, you can create a new folder on your desktop and move the dataset into it.

I created a folder called “Medium Series” and moved the dataset into it. Then I clicked the “New” button and chose “Python 3” to create a new notebook called “Fifa_2021”.

I need to import some popular libraries for data analysis. Pandas and NumPy are two of them. You can follow the links on their names to install them.

After the installation, I opened the “Fifa_2021.ipynb” notebook and imported the libraries that I would like to use.

Then I need to read the file into a data frame using the “read_csv()” function from pandas. I generally work on a copy of the original data, made with the “copy()” method, to protect the original against accidental changes. My data frame is now called “df”. I looked at the first 5 rows with the “head()” method to get a first impression of the dataset.

With the “tail()” method I can also look at the last 5 rows of the dataset.
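The loading steps above can be sketched roughly as follows. This is a minimal illustration, not the original notebook: the inline CSV text, the player names, the team names and the comma separator are all made-up stand-ins so that the snippet runs without the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded CSV. With the real file you would
# pass its path to pd.read_csv() instead of an in-memory buffer.
csv_text = """player_id,name,position,overall,age,team
1,Player A,ST|CF,93,33,Team X
2,Player B,GK,91,27,Team Y
3,Player C,CB,90,28,Team X
4,Player D,CAM|CM,91,29,Team Z
5,Player E,LW,89,25,Team Y
6,Player F,RB,85,24,Team Z
"""

raw = pd.read_csv(io.StringIO(csv_text))  # read the data into a data frame
df = raw.copy()                           # work on a copy, leave raw untouched

print(df.head())  # first 5 rows
print(df.tail())  # last 5 rows
```

Both “head()” and “tail()” accept a number, so “df.head(10)” would show the first 10 rows instead.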

I got the number of rows (observations) and columns (features) of the dataset with “shape”. Then I checked for missing values with “.isnull().sum()” and saw that there are none.
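As a small sketch of these two checks, assuming a tiny made-up data frame in place of the real dataset:

```python
import pandas as pd

# Made-up sample rows; only the column names mirror the ones used in this post.
df = pd.DataFrame({
    "player_id": [1, 2, 3],
    "team": ["Team X", "Team Y", "Team X"],
    "overall": [93, 91, 90],
})

n_rows, n_cols = df.shape                # shape is (rows, columns)
missing_per_column = df.isnull().sum()   # number of missing values per column

print(n_rows, n_cols)
print(missing_per_column)
```

Note that “shape” is an attribute, not a method, so it is written without parentheses.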

Every player must have a unique “player_id”. So I used the “drop_duplicates()” method to remove any duplicated player ids and make sure each player appears only once.
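One way this could look is shown below. The sample rows are invented, and restricting the check to “player_id” through the “subset” parameter is an assumption on my side; calling “drop_duplicates()” without arguments would only drop rows that are identical in every column.

```python
import pandas as pd

# Made-up sample in which player_id 2 appears twice.
df = pd.DataFrame({
    "player_id": [1, 2, 2, 3],
    "name": ["Player A", "Player B", "Player B", "Player C"],
})

rows_before = len(df)
df = df.drop_duplicates(subset="player_id")  # keep the first row per player_id
rows_removed = rows_before - len(df)

print(rows_removed)
```

Comparing the row count before and after is a quick way to see whether there were any duplicates at all.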

To get information about the data types of the columns, I used the “info()” method. It also shows that the dataset has 17981 rows and 9 columns, with index numbers from 0 to 17980.

To trim spaces at the beginning and end of the values in the “team” column, I used the “str.strip()” method. If you want more information about string methods, you can take a look at my “Most Common String Methods in Python” post.
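A minimal sketch of the trimming step, using invented team names with stray spaces:

```python
import pandas as pd

# Made-up team names with leading and trailing spaces.
df = pd.DataFrame({"team": [" Team X ", "Team Y  ", "  Team X"]})

# str.strip() removes whitespace at both ends of every value in the column.
df["team"] = df["team"].str.strip()

print(df["team"].tolist())
```

Without this step, “ Team X ” and “Team X” would be treated as two different teams when grouping.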

Number Of Players Per Team

I would like to see the number of players on each team. Using the “groupby()”, “count()” and “sort_values()” methods, I counted the players per team and sorted the result. I assigned it to “df_sum_of_plyrs”.
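The counting step might look roughly like this. The sample data is made up, and counting the “player_id” column with a descending sort is my assumption about the details.

```python
import pandas as pd

# Made-up sample: three players on Team X, two on Team Y, one on Team Z.
df = pd.DataFrame({
    "player_id": [1, 2, 3, 4, 5, 6],
    "team": ["Team X", "Team Y", "Team X", "Team Z", "Team X", "Team Y"],
})

# Count the players in each team and sort from largest to smallest squad.
df_sum_of_plyrs = (
    df.groupby("team")["player_id"]
      .count()
      .sort_values(ascending=False)
)

print(df_sum_of_plyrs)
```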

I would like to focus on teams with more than 12 players, so that teams with very few players don’t distort the averages. Therefore I filtered “df_sum_of_plyrs” accordingly.
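A sketch of this filter with boolean indexing, assuming “df_sum_of_plyrs” is a Series of player counts indexed by team (the counts here are invented):

```python
import pandas as pd

# Made-up player counts per team.
df_sum_of_plyrs = pd.Series(
    {"Team X": 33, "Team Y": 28, "Team Z": 12, "Team W": 5},
    name="player_id",
)

# Keep only the teams with more than 12 players.
df_sum_of_plyrs = df_sum_of_plyrs[df_sum_of_plyrs > 12]

print(df_sum_of_plyrs)
```

Because the comparison is strict, a team with exactly 12 players is dropped as well.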

As you can see, 211 players have free agent status. Therefore I separated them from the dataset and assigned them to “free_agents”. I assigned the players that already have a team to “df_team_player”.
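The split can be sketched with a boolean mask. The label “Free Agents” in the “team” column is an assumption for illustration; check the actual value in your copy of the dataset.

```python
import pandas as pd

# Made-up sample; "Free Agents" is an assumed label for players without a team.
df = pd.DataFrame({
    "player_id": [1, 2, 3, 4],
    "team": ["Team X", "Free Agents", "Team Y", "Free Agents"],
})

is_free_agent = df["team"] == "Free Agents"

free_agents = df[is_free_agent]       # players without a team
df_team_player = df[~is_free_agent]   # players that already have a team

print(len(free_agents), len(df_team_player))
```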

Using the “mean()” method, I calculated the average number of players per team as “24,95…”. You can also see the five teams with the most players by using “sort_values(ascending=False).head()”.
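As a rough sketch under the same assumptions as before (invented rows, players counted through “player_id”), the average squad size and the largest teams could be computed like this:

```python
import pandas as pd

# Made-up sample of players that already have a team.
df_team_player = pd.DataFrame({
    "player_id": [1, 2, 3, 4, 5, 6],
    "team": ["Team X", "Team Y", "Team X", "Team Z", "Team X", "Team Y"],
})

players_per_team = df_team_player.groupby("team")["player_id"].count()

avg_players = players_per_team.mean()  # average squad size
top_teams = players_per_team.sort_values(ascending=False).head()  # largest squads

print(avg_players)
print(top_teams)
```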

Overall Average Per Team

Let’s take a look at the overall average of each team. I calculated it with the “mean()” method and then sorted the teams from highest to lowest.

Let’s take a look at the 10 teams with the highest and the 10 teams with the lowest overall average.
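The team averages and the two rankings can be sketched as follows. The ratings are made up, and taking “head(10)” and “tail(10)” of the sorted result is one possible way to get the top and bottom teams.

```python
import pandas as pd

# Made-up overall ratings for players of three teams.
df_team_player = pd.DataFrame({
    "team": ["Team X", "Team X", "Team Y", "Team Y", "Team Z"],
    "overall": [90, 80, 70, 74, 60],
})

# Average overall rating per team, sorted from highest to lowest.
team_overall = (
    df_team_player.groupby("team")["overall"]
                  .mean()
                  .sort_values(ascending=False)
)

top_10 = team_overall.head(10)     # teams with the highest average
bottom_10 = team_overall.tail(10)  # teams with the lowest average

print(top_10)
print(bottom_10)
```

With fewer than 10 teams, as in this toy sample, both selections simply return every team.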

Extracting Base Positions of Players from Positions

I used the “str.strip()” method again to trim spaces at the beginning and end of the values in the “position” column. Then I defined a new column called “base_position” and filled it with each player’s base position using the “str.extract()” method.
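Here is one possible sketch of the extraction. It assumes that positions are stored as uppercase codes separated by “|” and that the first code is the base position; both the sample values and the regular expression are illustrative assumptions, not necessarily the pattern used in the original notebook.

```python
import pandas as pd

# Made-up position strings; the "|" separator is an assumption.
df = pd.DataFrame({"position": [" ST|CF ", "GK", "CAM|CM ", " CB"]})

df["position"] = df["position"].str.strip()

# Capture the leading run of capital letters, i.e. the first listed position.
# expand=False returns a Series instead of a one-column data frame.
df["base_position"] = df["position"].str.extract(r"^([A-Z]+)", expand=False)

print(df)
```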

I got the number of players for each base position in the whole dataset with the “value_counts()” method.
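A short sketch of the counting step on invented base positions:

```python
import pandas as pd

# Made-up base positions.
df = pd.DataFrame({"base_position": ["ST", "CB", "CB", "GK", "ST", "CB"]})

# value_counts() returns the counts sorted from most to least frequent.
position_counts = df["base_position"].value_counts()

print(position_counts)
```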

Average Overall and Average Age by Base Position

I selected the columns that I would like to use from “df” and created a new data frame called “df_pos”.

Then I calculated the average overall and the average age for each base position and sorted the result by average overall from highest to lowest.
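These two steps could be sketched like this. The rows are invented, the choice of the three columns is my assumption, and “pos_avg” is a hypothetical name for the grouped result.

```python
import pandas as pd

# Made-up players with a base position, an overall rating and an age.
df = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C", "Player D", "Player E"],
    "base_position": ["ST", "ST", "CB", "CB", "GK"],
    "overall": [90, 80, 82, 78, 75],
    "age": [30, 24, 28, 26, 31],
})

# Keep only the columns needed for this analysis.
df_pos = df[["base_position", "overall", "age"]]

# Average overall and age per base position, sorted by average overall.
pos_avg = (
    df_pos.groupby("base_position")[["overall", "age"]]
          .mean()
          .sort_values("overall", ascending=False)
)

print(pos_avg)
```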

I imported two important libraries for data visualization, matplotlib and seaborn. You can find installation instructions at this link and this link :)

And finally I visualized the data. You can see the average age and average overall rating of each position.
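The original chart is not reproduced here, so the following is only a sketch of one way to plot such averages. It uses matplotlib through the pandas plotting interface on invented numbers; the grouped bar chart, the labels and the name “pos_avg” are my assumptions, and a seaborn bar plot would work just as well.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere

import matplotlib.pyplot as plt
import pandas as pd

# Made-up averages per base position.
pos_avg = pd.DataFrame(
    {"overall": [85.0, 80.0, 75.0], "age": [27.0, 27.0, 31.0]},
    index=pd.Index(["ST", "CB", "GK"], name="base_position"),
)

fig, ax = plt.subplots(figsize=(8, 4))

# One group of bars per position: average overall next to average age.
pos_avg.plot(kind="bar", ax=ax)

ax.set_xlabel("Base position")
ax.set_ylabel("Average")
ax.set_title("Average overall and age by base position")
fig.tight_layout()
```

In a Jupyter notebook you would normally leave out the “matplotlib.use("Agg")” line so that the figure is rendered inline below the cell.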
