Master Pandas For Data Science Using These Four Functions

Yash Chauhan
Published in Accredian
7 min read · Sep 23, 2022

(Differentiate yourself from the crowd by mastering advanced Pandas functions)

Introduction

When talking about data analytics, few would argue that Python has become synonymous with the field, and a big reason is Pandas: a much-beloved library that has become the favourite tool in the shed of every data scientist.

Though the Pandas library is easy to pick up and start using, what really draws so many data scientists to it is its flexibility and functionality: simple API calls that can be combined to solve many complex problems.

But the reality is that most beginners never really go beyond pd.read_csv, which has become a severe bottleneck for the community. So in this article, we will go over the four most important functions of the Pandas library that everyone needs to master to become a better data scientist.

Dataset used

Throughout this article, I will use the “Air Quality” dataset. You can access the data via the GitHub URL used in the code below.

The data contains daily recorded values of industrial waste, vehicle waste, and the Air Quality Index (AQI) for various cities in India, covering a span of five years, from 2015 to 2020.

Importing the Data

Before talking about the advanced functions, let’s load the data using the good old pd.read_csv and look at the first five rows using head().

# Library
import pandas as pd
# Loading Data
path = 'https://github.com/ITrustNumbers/Medium_Data/raw/master/Article:%20Four%20Function%20To%20Master%20Pandas/Air_Quality_Data.csv'
df = pd.read_csv(path)
df.head()

Output

First Five Rows of the Dataset

Functions to Master Pandas

1. unique() and nunique()

Whenever we have categorical features in the dataset (for example, the “City” and “Air_Quality” features in the example dataset), the first question we should ask is how many unique values, or levels, the feature has and what those values are.

This is where the unique() and nunique() functions come in handy. As the name suggests, when used on a specific column, the unique() function returns an array of all the unique values in that column.

The nunique() function, on the other hand, can be read as “n-unique”, or “number of unique”: it returns the count of unique values in each column.

Using unique()

df.City.unique()

Output

array(['Amaravati', 'Amritsar', 'Chandigarh', 'Delhi', 'Gurugram',
       'Hyderabad', 'Kolkata', 'Patna', 'Visakhapatnam'], dtype=object)

Since the number of unique values in a new dataset can be very high, we should ideally run nunique() on the entire dataset before calling unique() on a column, to avoid printing an unnecessarily large array.

Using nunique()

df.nunique()

Output

Output from nunique() | Number of Unique Values in Each Column

2. isnull()

The next thing to check for is missing (null) values in the dataset. There are multiple ways to do this; you can even find libraries (e.g., missingno) dedicated to plotting or counting the null values in a dataset.
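For context, the dedicated-library route looks roughly like the following. This is only a minimal sketch, assuming the missingno package (not part of Pandas) is installed:

# Minimal sketch using the third-party missingno package
import missingno as msno
# Draws a matrix plot where white gaps mark the missing values
msno.matrix(df)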

But pulling in a whole library for such a small task is overkill and inefficient; the built-in Pandas functions convey the same information just as effectively.

Using only isnull()

df.isnull()

Output

Output from isnull()

When we use isnull(), the function returns a boolean copy of the dataframe in which each cell indicates whether the corresponding value is missing.

This is called a map, so isnull() returns a null-value map of the entire dataframe. Now, we can chain this with sum() to count the missing values in each column.

Using isnull() and sum()

df.isnull().sum()

Output

Number of Missing Values in each Column

Hence, we can conclude that there are 36, 28 and 36 missing values in the Industrial_Waste, Vehicle_Waste and Air_Quality columns, respectively.
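If all we need is a single overall figure, we can chain sum() once more to collapse the per-column counts into one total. A quick sketch:

# Total number of missing values in the whole dataframe
df.isnull().sum().sum()
# 100 for the counts above (36 + 28 + 36)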

Bonus

We can also easily calculate the percentages of missing values in each column by dividing this output by the number of rows present in the data and multiplying it by 100.

We can also pretty up the output by rounding the percentages to two decimal places.

Null Value in Percentage

((df.isnull().sum()/df.shape[0]) * 100).round(2)

Output

Percentage of Missing Values in each Column

3. loc[] and iloc[]

The most fundamental aspect of Pandas is indexing: the act of extracting the relevant data from the entire dataset using some logical filter or query. Indexing lets us zero in on the data points of interest in a big dataframe, which is very useful while analysing any dataset.

So indexing is fundamental in Pandas, and the best methods for indexing a dataframe are loc[] and iloc[]. The name loc refers to the location of the requested data, while iloc (“i-loc”) refers to its integer location, i.e. its index position.

The fundamental difference between the two is that loc uses the names/labels of the rows and columns to identify a location, while iloc uses their integer positions.
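The distinction is easiest to see on a dataframe whose row labels are not the default 0, 1, 2, … positions. A minimal sketch with a made-up three-row frame (the demo frame and its values are hypothetical):

import pandas as pd

# Hypothetical frame whose row labels differ from the row positions
demo = pd.DataFrame({'x': [10, 20, 30]}, index=[100, 101, 102])

demo.loc[100, 'x']      # 10 -> loc matches the row LABEL 100
demo.iloc[0, 0]         # 10 -> iloc matches the row POSITION 0
demo.loc[100:101, 'x']  # label slices INCLUDE the end label (2 rows)
demo.iloc[0:1, 0]       # position slices exclude it, as in plain Python (1 row)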

Some examples of loc

# 1. Find the 96th row data in the Industrial_Waste Column
df.loc[96,'Industrial_Waste']

Output:

40.97

# 2. Find the 6th to 10th row data in the Vehicle_Waste Column
df.loc[6:10,'Vehicle_Waste']

Output

6     236.41
7     297.09
8     266.57
9     272.81
10    261.65
Name: Vehicle_Waste, dtype: float64

Some examples of iloc

# 1. Find the 42nd row data in the Date(1st) Column
df.iloc[42,0]

Output

2018-01-06

# 2. Find the data at row positions 56 to 59 in the Vehicle_Waste (4th)
#    and Air_Quality (6th) columns (iloc slices exclude the end position)
df.iloc[56:60,[3,5]]

Output

We can create some advanced queries using loc[] and iloc[]. For example, we can update data in the dataframe conditionally.

Let's look at the first 8 rows of data.

df.head(8)
First Eight Rows of Data

What if we want to change Air_Quality to “Poor” for every row where the AQI is above 185? We can do that with loc[].

Conditionally Update Air_Quality

# Conditional Update
df.loc[df['AQI'] > 185.00, 'Air_Quality'] = 'Poor'
# Looking at the result
df.head(8)

Output

Updated Dataset

Notice how Air_Quality is updated using our specified condition. This is the power of indexing using loc[] and iloc[].
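Conditions can also be combined inside loc[] with the element-wise operators & and |. A quick sketch using this dataset’s columns (each condition needs its own parentheses, because & binds tighter than the comparisons):

# Rows for Delhi where the AQI exceeds 185
df.loc[(df['City'] == 'Delhi') & (df['AQI'] > 185), ['Date', 'AQI', 'Air_Quality']]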

4. groupby

groupby(), as the name suggests, is a Pandas function that lets us club data together using some key and then apply aggregating operations to each group.

Let’s take an example:

Three friends, Shivam, Gauri and Ujwal, take the same test multiple times. Shivam takes the test 3 times, Gauri takes it 5 times, whereas Ujwal takes it only once. The data looks something like this.

Randomly Generated Data
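If you want to follow along, here is one way to rebuild that frame yourself. The scores below are made-up placeholders, since the original figure was generated randomly:

import pandas as pd

# Hypothetical scores: 3 attempts for Shivam, 5 for Gauri, 1 for Ujwal
df_marks = pd.DataFrame({
    'Friend': ['Shivam'] * 3 + ['Gauri'] * 5 + ['Ujwal'],
    'Score':  [72, 85, 91, 64, 78, 88, 70, 93, 81]
})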

Now, we want to find the maximum score attained by each friend. We can use the groupby() function to achieve this.

First, we will club the data using the Friend column as a key and then use max() as the aggregation operation to find the maximum score for each friend.

Using groupby and max

df_marks.groupby(['Friend']).max()

Output

Results from Groupby and Max

As we can see, we got the desired result. We can also use a variety of aggregation functions like mean(), min(), max(), count(), etc.
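Several of these can even be computed in a single pass with agg(). A small sketch on the same marks frame:

# Multiple aggregations at once
df_marks.groupby('Friend')['Score'].agg(['mean', 'min', 'max', 'count'])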

For example, in our original air-quality dataset, we can find the average AQI for each city listed in the dataset using groupby() with City as the key and mean() as the aggregate function.

Finding Average AQI

# Select the AQI column before aggregating, so the non-numeric
# columns (Date, City, Air_Quality) don't break mean()
df.groupby('City')[['AQI']].mean()

Output

Average AQI Using Groupby and Mean
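As a small variation, sorting the grouped result makes the ranking easier to read. A quick sketch:

# Cities ranked from highest to lowest average AQI
df.groupby('City')['AQI'].mean().sort_values(ascending=False)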

Conclusion

  • We saw some of the essential functions in the Pandas library, but there are many variations and flavours you can add to these functions, and a ton of other functions in the library.
  • I encourage you to read through the Pandas documentation and note some exciting API calls or functions to start playing with the library.
  • Up next, I’ll be covering more functions and libraries that are important for data science, like NumPy, scikit-learn, and TensorFlow.
  • Follow me for more upcoming Data Science, Machine Learning, and Artificial Intelligence articles.

Final Thoughts and Closing Comments

There are some vital points many people fail to understand while pursuing their Data Science or AI journey. If you are one of them and are looking for a way to fill those gaps, check out the certification programs provided by INSAID on their website. If you liked this article, I recommend the Global Certificate in Data Science & AI, as it covers the foundations, machine learning algorithms, and deep neural networks (basic to advanced).


Yash Chauhan
Accredian

Trying to juggle my Passion for Data Science and my Love for Literature, Sculpting a part of me through every word I write.