Master Pandas For Data Science Using These Four Functions
(Differentiate yourself from the crowd by mastering advanced Pandas functions)
Introduction
When talking about data analytics, few would argue that Python has become synonymous with the field, and that is largely because Python has a beloved library like Pandas in its arsenal, a favourite tool in every data scientist’s shed.
Though the Pandas library is easy to pick up and start using, what really draws so many data scientists to it is its flexibility and functionality: simple API calls that can be combined to solve many complex problems.
But the reality is that most beginners never go beyond the “pd.read_csv” of the Pandas library, which has become a severe bottleneck for the community. So in this article, we will go over the four most important functions of the Pandas library that everyone needs to master to become a better data scientist.
Dataset used
Throughout this article, I will use the “Air Quality” Dataset. You can access the data from this link:
The data contains the daily recorded values of Industrial Waste, Vehicle Waste, and the Air Quality Index for various cities in India, covering a five-year span from 2015 to 2020.
Importing the Data
Before talking about the advanced functions, let’s load the data using the good old “pd.read_csv” and look at the first five rows using “head()”.
# Library
import pandas as pd

# Loading Data
path = 'https://github.com/ITrustNumbers/Medium_Data/raw/master/Article:%20Four%20Function%20To%20Master%20Pandas/Air_Quality_Data.csv'
df = pd.read_csv(path)
df.head()
Output
Function to Master Pandas
1. unique() and nunique()
Whenever we have categorical features in a dataset (for example, the “City” and “Air Quality” features in our example dataset), the first questions we should ask are how many unique values, or levels, the feature has and what those unique values are.
This is where the “unique()” and “nunique()” functions come in handy. As the name suggests, when used on a specific column, “unique()” returns an array of all the unique values in that column.
“nunique()”, on the other hand, can be read as “n-unique”, i.e. “number of unique”: it returns the number of unique values in each column.
Using unique()
df.City.unique()
Output
array(['Amaravati', 'Amritsar', 'Chandigarh', 'Delhi', 'Gurugram', 'Hyderabad', 'Kolkata', 'Patna', 'Visakhapatnam'], dtype=object)
Since the number of unique values in a new dataset can be very high, we should ideally run “nunique()” on the entire dataset before using “unique()”, to avoid printing an unnecessarily large array.
Using nunique()
df.nunique()
Output
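To see the two functions side by side, here is a minimal sketch on a small hypothetical frame (the cities and labels below are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature frame to illustrate the difference
toy = pd.DataFrame({
    "City": ["Delhi", "Patna", "Delhi", "Kolkata"],
    "Air_Quality": ["Poor", "Good", "Poor", "Good"],
})

print(toy["City"].unique())   # array of distinct labels, in order of appearance
print(toy["City"].nunique())  # just the count: 3
print(toy.nunique())          # count per column: City 3, Air_Quality 2
```

Note that “unique()” preserves the order in which values first appear, which can itself be useful when inspecting a new dataset.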
2. isnull()
The next thing we can check for is missing (null) values in the dataset. There are multiple ways to do this; you can even find libraries (e.g., missingno) dedicated to plotting or counting the null values in a dataset.
But pulling in a whole library for such a small task is overkill and inefficient. The default Pandas functions can relay the same information effectively.
Using only isnull()
df.isnull()
Output
When we use “isnull()”, the function returns a copy of the dataframe where each cell holds a boolean indicating whether the value in that cell is missing.
This is called a map, so “isnull()” returns a null-value map of the entire dataframe. We can then chain it with “sum()” to find the total number of missing values in each column.
Using isnull() and sum()
df.isnull().sum()
Output
Hence, we can conclude that there are 36, 28 and 36 missing values in Industrial_Waste, Vehicle_Waste and Air_Quality columns, respectively.
Bonus
We can also easily calculate the percentages of missing values in each column by dividing this output by the number of rows present in the data and multiplying it by 100.
We can also pretty up the output by rounding the percentages to two decimal places.
Null Value in Percentage
((df.isnull().sum()/df.shape[0]) * 100).round(2)
Output
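Since the mean of a boolean column is simply the fraction of True values, the same percentage can also be computed in one step with “mean()”. A quick check on a hypothetical frame with known gaps (the columns below just mirror the article’s dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a known number of missing values
demo = pd.DataFrame({
    "Industrial_Waste": [1.0, np.nan, 3.0, np.nan],  # 2 of 4 missing -> 50%
    "Vehicle_Waste": [2.0, 2.5, np.nan, 4.0],        # 1 of 4 missing -> 25%
})

# mean() of booleans is the fraction of True values,
# so this matches (isnull().sum() / shape[0]) * 100
pct = (demo.isnull().mean() * 100).round(2)
print(pct)
```

Both forms give identical results; the “mean()” version just saves a division.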
3. loc[] and iloc[]
The most fundamental aspect of Pandas is indexing: the act of extracting relevant data from the dataset using some logical filter or query. We can use indexing to zero in on the data points of interest in a big dataframe, which is very useful while analysing any dataset.
So indexing is fundamental in Pandas, and the best methods to index a dataframe are loc[] and iloc[]. “loc” refers to the location of the requested data, while “iloc” refers to its integer location, i.e. its index position.
The fundamental difference between the two is that “loc” uses the names/labels of the rows and columns to identify the location, while “iloc” uses their integer positions. Note also that “loc” slices include the end label, whereas “iloc” slices exclude the end position.
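The label-vs-position distinction is easiest to see on a frame with string row labels. A minimal sketch (the frame below is hypothetical, not from the article’s dataset):

```python
import pandas as pd

# Hypothetical frame with string row labels to contrast the two methods
scores = pd.DataFrame(
    {"AQI": [120, 185, 210]},
    index=["Delhi", "Patna", "Kolkata"],
)

print(scores.loc["Patna", "AQI"])  # by label    -> 185
print(scores.iloc[1, 0])           # by position -> 185

# loc slices are inclusive of the end label,
# while iloc slices exclude the end position
print(len(scores.loc["Delhi":"Patna"]))  # 2 rows
print(len(scores.iloc[0:1]))             # 1 row
```

This inclusivity difference explains why, in the examples below, loc[6:10] returns five rows while iloc[56:60] returns four.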
Some examples of loc
# 1. Find the 96th row data in the Industrial_Waste Column
df.loc[96,'Industrial_Waste']
Output:
40.97
# 2. Find the 6th to 10th row data in the Vehicle_Waste Column
df.loc[6:10,'Vehicle_Waste']
Output
6     236.41
7     297.09
8     266.57
9     272.81
10    261.65
Name: Vehicle_Waste, dtype: float64
Some examples of iloc
# 1. Find the 42nd row data in the Date(1st) Column
df.iloc[42,0]
Output
2018-01-06
# 2. Find the 56th to 60th row data in the Vehicle_Waste(4th) and Air_Quality(6th) Column
df.iloc[56:60,[3,5]]
Output
We can create some advanced queries using “loc[]” and “iloc[]”. For example, we can update the data in the dataframe conditionally.
Let's look at the first 8 rows of data.
df.head(8)
What if we want to change the Air Quality to Poor for each row where AQI is more than 185? We can do that by using the “loc[]” function.
Conditionally Update Air_Quality
# Conditional Update
df.loc[df['AQI'] > 185.00, 'Air_Quality'] = 'Poor'

# Looking at the result
df.head(8)
Output
Notice how Air_Quality is updated using our specified condition. This is the power of indexing using loc[] and iloc[].
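Conditions can also be combined for more targeted updates, using “&” (and) or “|” (or) with parentheses around each clause. A minimal sketch on a hypothetical frame whose column names mirror the article’s dataset:

```python
import pandas as pd

# Hypothetical frame; column names mirror the article's dataset
aq = pd.DataFrame({
    "City": ["Delhi", "Patna", "Kolkata"],
    "AQI": [210.0, 150.0, 190.0],
    "Air_Quality": ["Moderate", "Good", "Moderate"],
})

# Update only rows that satisfy BOTH conditions
# (parentheses around each clause are required)
aq.loc[(aq["AQI"] > 185.0) & (aq["City"] == "Delhi"), "Air_Quality"] = "Poor"
print(aq)
```

Here only the Delhi row is relabelled “Poor”; Kolkata also has AQI above 185 but fails the city condition, so it is left untouched.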
4. groupby
“groupby()”, as the name suggests, is a Pandas function that lets us group data together using some key and then apply aggregation operations to each group.
Let’s take an example:
Three friends, Shivam, Gauri and Ujwal, take one test multiple times. Shivam takes the test three times, Gauri takes it five times, whereas Ujwal takes it only once. The data looks something like this.
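The original table isn’t reproduced here, but a hypothetical reconstruction matching that description (the scores are made up) might look like:

```python
import pandas as pd

# Hypothetical reconstruction of the friends/scores data:
# Shivam took the test 3 times, Gauri 5 times, Ujwal once
df_marks = pd.DataFrame({
    "Friend": ["Shivam"] * 3 + ["Gauri"] * 5 + ["Ujwal"],
    "Score": [72, 85, 78, 60, 66, 91, 70, 88, 95],
})
print(df_marks)
```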
Now, we want to find the maximum score attained by each friend. We can use the “groupby()” function to achieve this.
First, we will club the data using the Friend column as a key and then use “max()” as the aggregation operation to find the maximum score for each friend.
Using groupby and max
df_marks.groupby(['Friend']).max()
Output
As we can see, we got the desired result. We can also utilize a variety of aggregation functions like “mean()”, “min()”, “max()”, “count()”, etc.
For example, in our original Air_Quality dataset, we can find the average AQI for each city by using “groupby()” with City as the key and “mean()” as the aggregation function.
Finding Average AQI
df.groupby(['City']).mean()[['AQI']]
Output
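Beyond slicing a single column after “mean()”, groupby also supports “agg()” for computing several statistics at once. A minimal sketch on a hypothetical slice of the data (the city names and values below are made up):

```python
import pandas as pd

# Hypothetical slice of the air-quality data
aq_sample = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Patna", "Patna"],
    "AQI": [200.0, 220.0, 140.0, 160.0],
})

# agg() computes several statistics per group in one pass
summary = aq_sample.groupby("City")["AQI"].agg(["mean", "min", "max"])
print(summary)
```

This yields one row per city with a column for each statistic, which is often more convenient than calling “mean()”, “min()”, and “max()” separately.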
Conclusion
- We saw some of the essential functions in the Pandas library, but there are a lot of variations or flavours that you can add to these functions and a ton of other functions in the Pandas library.
- I encourage you to read through some of the “Pandas documentation” and note some exciting API calls or functions to start playing with the library.
- Up next, I’ll be covering more functions and libraries that are important for data science, like “NumPy”, “scikit-learn”, and “TensorFlow”.
- Follow me! for more upcoming Data Science, Machine Learning, and Artificial Intelligence articles.
Final Thoughts and Closing Comments
There are some vital points many people fail to understand while pursuing their Data Science or AI journey. If you are one of them and are looking for a way to fill these gaps, check out the certification programs provided by INSAID on their website. If you liked this article, I recommend the Global Certificate in Data Science & AI, as it covers the foundations, machine learning algorithms, and deep neural networks (basic to advanced).