Pandas: First Step Towards Data Science (Part 3)

InfiniX · Published in The Startup · Sep 5, 2020

Data is almost never perfect. Data scientists spend more time preprocessing datasets than creating models. Often we come across a scenario where some data is missing from the dataset. Such data points are represented as NaN (Not a Number) in Pandas, so it is very important to discover columns with NaN/null values early while analyzing the data.
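
For example, a missing entry shows up as NaN as soon as the data is loaded. Here is a minimal sketch (the values are made up purely for illustration) :

>> import pandas as pd
>> import numpy as np
>> ages = pd.Series([22.0, np.nan, 35.0])  # one missing value
>> print(ages)
output :
0 22.0
1 NaN
2 35.0
dtype: float64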

We have covered many methods of the Pandas library so far, and if you haven't read the previous articles, I recommend going through them first to stay in the flow. But if you have been following from the beginning, let's get started.

In this article, we are going to learn:

  1. What is NaN?
  2. How to find NaN in a dataset?
  3. How to deal with NaN as a beginner?
  4. Finally, some methods to make the dataframe more readable.

How to find NaN in a dataset?

To check for NaN values in a column or in the entire dataframe, we use isnull() or isna(). Both work the same way, so we will use isnull() in this article. If you want to understand why there are two methods for the same task, you can learn about it here. Let's begin by checking for null values in the entire dataset.
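
(If you want to quickly convince yourself that the two methods behave the same, a check like the one below works, assuming titanic_data is the Titanic dataframe loaded in the earlier parts of this series.)

>> print(titanic_data.isnull().equals(titanic_data.isna()))
output :
True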

>> print(titanic_data.info())
output :
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Here you can see some valuable information about the dataset, but the information we are interested in is the Non-Null Count column, which shows the number of non-null data points in each column. The first line of the output shows that there are 891 entries in total, i.e. 891 data points. We can also directly check the number of non-null entries in each column using the count() method.

>> print(titanic_data.count())
output :
PassengerId 891
Survived 891
Pclass 891
Name 891
Sex 891
Age 714
SibSp 891
Parch 891
Ticket 891
Fare 891
Cabin 204
Embarked 889
dtype: int64

From here we can conclude that Age, Cabin and Embarked are the columns with null values. There is another way to get this result, using the isnull() method we discussed earlier.

>> print(titanic_data.isnull().any())
output :
PassengerId False
Survived False
Pclass False
Name False
Sex False
Age True
SibSp False
Parch False
Ticket False
Fare False
Cabin True
Embarked True
dtype: bool
>> print(titanic_data.isnull().sum())
output :
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64

As we can see, this result is much better if we are solely interested in null values.

How to deal with NaN as a beginner?

It is important to know the number of null values in a column, as it helps us decide how to deal with them. If there is only a small number of null values, as in Embarked, we can simply remove those entries from the dataset. However, if most of the values are null, as in Cabin, it is better to skip that column while creating the model.
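
As a rough sketch, these two strategies can be done with dropna() and drop() (the variable names below are only for illustration) :

>> no_embarked_nulls = titanic_data.dropna(subset=["Embarked"])
>> without_cabin = titanic_data.drop(columns=["Cabin"])
>> print(no_embarked_nulls.shape)
>> print(without_cabin.shape)
output :
(889, 12)
(891, 11)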

There is another case where the number of null values is not large enough to skip the column but too large to simply remove the entries, as with Age here. For such cases there are many ways to deal with null values, but as a beginner we will learn just one trick here: filling them with a value, using the fillna() method.

>> titanic_data.Age.fillna("Unknown", inplace = True)
>> print(titanic_data.Age.isnull().any())
output :
False
# The Age column now has no null values

We used the inplace argument so that the changes are applied to the dataframe calling the method. If we do not pass this argument, or keep it False, the changes will not appear in our dataset; fillna() simply returns a new Series. We can also check whether a specific column has null values in the same way as we did for the whole dataset.
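
For example, the fill above could equally have been written without inplace, by assigning the result back to the column :

>> titanic_data["Age"] = titanic_data["Age"].fillna("Unknown")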

We can also replace values in a column that are not NaN, using the replace() method.

>> titanic_data.Sex.replace("male","M",inplace = True)
>> titanic_data.Sex.replace("female","F",inplace = True)
>> print(titanic_data.Sex)
output :
0 M
1 F
2 F
3 F
4 M
..
886 M
887 F
888 F
889 M
890 M
Name: Sex, Length: 891, dtype: object

Some methods to make the dataset more readable

  1. rename() : There might be a situation when we realize that a column name is not suitable for our requirements. We can use the rename() method to change a column name.
>> titanic_data.rename(columns={"Sex":"Gender"},inplace=True)
>> print(titanic_data.Gender)
output :
0 M
1 F
2 F
3 F
4 M
..
886 M
887 F
888 F
889 M
890 M
Name: Gender, Length: 891, dtype: object

2. rename_axis() : It is a simple method and, as the name suggests, is used to provide names for the axes.

>> titanic_data.rename_axis("Sr.No",axis='rows',inplace=True)
>> titanic_data.rename_axis("Catergory",axis='columns',inplace=True)
>> print(titanic_data.head(2))
output :
Category PassengerId Survived Pclass .....
Sr.No
0 1 0 3
1 2 1 1
[2 rows x 12 columns]

With this, we come to the end of this article and of the series on Pandas. I believe the methods we came across in this series are very helpful for analyzing data before we start training models. This is just a small fraction of the methods in the Pandas library and only the beginning of data exploration and preprocessing, but as a beginner I think these are enough to get started on the Data Science journey. I hope you found this series valuable. Thank you for reading. Keep practicing. Happy Coding ! 😄
