Deep Dive in Machine Learning with Python

Part — VII: All about Pandas

Rajesh Sharma
Analytics Vidhya
8 min read · Dec 24, 2019


Image by WWF(World Wildlife Fund)

Welcome to the seventh blog of Deep Dive in Machine Learning with Python. So far we have covered the Python basics. If you want to revisit an earlier topic, see the blogs below:

Python and ML Fundamentals: Deep Dive in ML with Python — Part-I

Jupyter Notebook: Deep Dive in ML with Python — Part-II

Strings: Deep Dive in ML with Python — Part-III

Lists: Deep Dive in ML with Python — Part-IV

Tuples & Dictionaries: Deep Dive in ML with Python — Part-V

Functions & List Comprehensions: Deep Dive in ML with Python — Part-VI

Today, we will work with Pandas, one of the most extensively used data science libraries, which was built for data cleansing, data wrangling (including data munging and transformation), data analysis, and similar activities.

In this blog, I’ll be using the popular Gapminder dataset. You can either download it from the provided link or install the gapminder package using pip (e.g. python -m pip install gapminder). I already have this package installed on my system, so I will use it in this blog. In future blogs, we will explore the Pandas functionality for reading data from files.

At the end of this blog, I’ll also share some bonus tips related to Pandas.

Import the gapminder package

Import the package and read its help

Problem-1: How to view the attributes of the ‘gapminder’ package?

Use dir(), a built-in function that returns a list of the attributes and methods of any object (e.g. pandas dataframes, modules, functions, strings, lists, dictionaries, and others).

Solution-1
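
As a minimal sketch of the call described above, the snippet below runs dir() on a module. It uses pandas as a stand-in, on the assumption that the gapminder package may not be installed everywhere; the same call should work on any imported module.

```python
import pandas as pd

# dir() returns a list of attribute names as strings
attributes = dir(pd)

# Names starting with an underscore are internal; filter them out for readability
public_attributes = [name for name in attributes if not name.startswith("_")]

print(len(attributes))
print(public_attributes[:10])
```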

Problem-2: What happens when ‘dir’ is executed without any parameters?

Called without arguments, dir() returns the list of names in the current local scope. This includes every module, variable, and function defined or imported so far, not only the most recent ones.

CASE-I

Solution-2.1

CASE-II: Import more modules

Solution-2.2

Thus, ‘numpy’ and ‘statsmodels’ are also included in the names returned by dir().
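
The two cases can be sketched by comparing the names before and after an import. This example imports the standard-library module statistics purely as a stand-in (an assumption made so that the snippet runs without numpy or statsmodels installed).

```python
# Names visible in the current scope before the new import
names_before = set(dir())

import statistics  # standard-library module, used only for illustration

# Names visible in the current scope after the import
names_after = set(dir())

# The difference contains 'statistics' plus any variable created in between
new_names = names_after - names_before
print(sorted(new_names))
```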

Pandas mainly comprises two data structures:

DataFrame: A two-dimensional container of data arranged in rows and columns, which may or may not include headers.

Series: A one-dimensional labeled array. A single column of a dataframe is a Series.
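
To make the distinction concrete, here is a small sketch with a hand-made table. The name toy_df and its values are placeholders invented for illustration, not real gapminder data.

```python
import pandas as pd

# A tiny hand-made table; the values are placeholders
toy_df = pd.DataFrame({
    "country": ["Country A", "Country B"],
    "year": [2002, 2007],
})

# Selecting one column of the DataFrame gives a Series
year_column = toy_df["year"]

print(type(toy_df).__name__)
print(type(year_column).__name__)
```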

Problem-3: Create a Pandas DataFrame of ‘gapminder’ data

Solution-3

In the above example, we created the Pandas DataFrame ‘gapminder_df’, and the row and column structure shown above is its representation. This dataframe contains 1704 rows and 6 columns (named ‘country’, ‘continent’, ‘year’, ‘lifeExp’, ‘pop’, and ‘gdpPercap’).

NOTE: The leftmost running sequence of numbers from 0 to 1703 is the index of the dataframe.
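
The sketch below shows how the shape and the default index can be inspected. The commented import is an assumption about how the gapminder package exposes its data, and toy_df is a made-up placeholder with the same six columns so that the snippet runs without the package.

```python
import pandas as pd

# Assumption: with the package installed, the data loads roughly like this
#   from gapminder import gapminder
#   gapminder_df = gapminder.copy()

# Placeholder rows with the same six column names (all values are made up)
toy_df = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C"],
    "continent": ["Asia", "Europe", "Africa"],
    "year": [1997, 2002, 2007],
    "lifeExp": [60.0, 70.0, 80.0],
    "pop": [1000, 2000, 3000],
    "gdpPercap": [500.0, 1500.0, 2500.0],
})

print(toy_df.shape)         # (number of rows, number of columns)
print(list(toy_df.index))   # default index: integers starting at 0
```

On the real data, shape would report the 1704 rows and 6 columns mentioned above.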

Problem-4: How to view the columns/features/variables of the gapminder dataset?

Solution-4

By using the ‘columns’ attribute of the dataframe i.e. ‘gapminder_df’ we can view its column names.
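
A minimal sketch of the ‘columns’ attribute, using a one-row placeholder frame (toy_df and its values are invented for illustration):

```python
import pandas as pd

# One made-up row with the gapminder column names
toy_df = pd.DataFrame({
    "country": ["Country A"],
    "continent": ["Asia"],
    "year": [2007],
    "lifeExp": [60.0],
    "pop": [1000],
    "gdpPercap": [500.0],
})

# columns is an Index object; tolist() converts it to a plain Python list
column_names = toy_df.columns.tolist()
print(column_names)
```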

Problem-5: How to view the first 5 records of the dataset?

Solution-5

Called without arguments, the head() method returns the first 5 records of the dataframe.

Problem-6: How to view the first n records with head()?

Solution-6

Here, you can pass head() the number of records, let’s say 20, that you want to retrieve from the top of the dataframe.
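
Both head() calls can be sketched on a placeholder frame of 30 rows (toy_df is a made-up stand-in for gapminder_df):

```python
import pandas as pd

# Placeholder frame with 30 rows and the default 0..29 index
toy_df = pd.DataFrame({"year": range(1990, 2020), "pop": range(30)})

first_five = toy_df.head()       # no argument: first 5 rows
first_twenty = toy_df.head(20)   # first 20 rows

print(len(first_five), len(first_twenty))
```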

Problem-7: How to view the last 5 records of a DataFrame?

Solution-7

Called without arguments, the tail() method returns the last 5 records of the dataframe.

Problem-8: How to view the last n records of a DataFrame with tail()?

Solution-8

Here, you can pass tail() the number of records, let’s say 15, that you want to retrieve from the bottom of the dataframe.
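
The tail() calls work the same way from the other end. Again, toy_df is a made-up 30-row placeholder:

```python
import pandas as pd

# Placeholder frame with 30 rows and the default 0..29 index
toy_df = pd.DataFrame({"year": range(1990, 2020), "pop": range(30)})

last_five = toy_df.tail()        # no argument: last 5 rows
last_fifteen = toy_df.tail(15)   # last 15 rows

# The original index labels are preserved in the result
print(list(last_five.index))
```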

Problem-9: How to find a summary of the DataFrame?

Solution-9

So, by using the info() function, we learn that of the six variables in gapminder_df, four are quantitative and two are qualitative.

  • Quantitative: Variables that are numeric and represent a measurable quantity. For example, the number of people in a city is measurable, so population is a quantitative variable.
  • Qualitative: Variables that accept values like names or labels. The color of a ball (e.g., red, green, blue) or the countries in the world are typical examples of qualitative or categorical variables.
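
One way to check this split in code is sketched below. It calls info() and then uses select_dtypes() to separate numeric columns from the rest; toy_df and its values are placeholders, not the real dataset.

```python
import pandas as pd

# Placeholder rows with the gapminder column names (values are made up)
toy_df = pd.DataFrame({
    "country": ["Country A", "Country B"],
    "continent": ["Asia", "Europe"],
    "year": [2002, 2007],
    "lifeExp": [60.0, 70.0],
    "pop": [1000, 2000],
    "gdpPercap": [500.0, 1500.0],
})

# Prints the column names, non-null counts, and dtypes
toy_df.info()

# Numeric dtypes correspond to the quantitative variables
quantitative = toy_df.select_dtypes(include="number").columns.tolist()
qualitative = toy_df.select_dtypes(exclude="number").columns.tolist()
print(quantitative)
print(qualitative)
```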

Problem-10: How to find the basic statistics of DataFrame features?

Solution-10

In the above example, although gapminder_df has six features, the statistics for the ‘country’ and ‘continent’ features were not displayed by the describe() method.

This is because, by default, describe() summarizes only the quantitative (numeric) variables.
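
A short sketch of this default behaviour on a placeholder frame (toy_df and its values are invented for illustration):

```python
import pandas as pd

# Two numeric columns and one text column, all values made up
toy_df = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C"],
    "year": [1997, 2002, 2007],
    "lifeExp": [60.0, 70.0, 80.0],
})

summary = toy_df.describe()

# Only the numeric columns appear in the summary
print(summary.columns.tolist())
print(summary.index.tolist())
```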

Problem-11: How to include Qualitative variables in describe() method for finding their basic statistics?

Solution-11

So, in the above example, we can see that three new statistics (‘unique’, ‘top’, and ‘freq’) were added specifically for the qualitative variables.
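
The original solution cell is not reproduced in this text, so the following is only one possible approach: passing include="all" to describe(). The frame toy_df is a made-up placeholder.

```python
import pandas as pd

# One text column and one numeric column, all values made up
toy_df = pd.DataFrame({
    "country": ["Country A", "Country A", "Country B"],
    "lifeExp": [60.0, 70.0, 80.0],
})

# include="all" summarizes every column, whatever its dtype
full_summary = toy_df.describe(include="all")

print(full_summary.index.tolist())
print(full_summary.loc["top", "country"])
```

Statistics that do not apply to a column (for example ‘mean’ for ‘country’) are filled with NaN.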

Problem-12: How to rename a column of the DataFrame?

Solution-12.1
Solution-12.2
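
The two solution cells are not reproduced in this text. As a hedged sketch, here are two common ways to rename a column, which may or may not match the original solutions; toy_df and the new name ‘population’ are hypothetical.

```python
import pandas as pd

# Placeholder frame (values are made up)
toy_df = pd.DataFrame({"country": ["Country A"], "pop": [1000]})

# Option 1: rename() returns a new DataFrame and leaves the original untouched
renamed_df = toy_df.rename(columns={"pop": "population"})

# Option 2: overwrite the full list of column names on a copy
relabelled_df = toy_df.copy()
relabelled_df.columns = ["country", "population"]

print(renamed_df.columns.tolist())
print(toy_df.columns.tolist())
```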

Problem-13: How to add a new column to the existing DataFrame?

Solution-13.1

Let’s add a column ‘Planet’ to the above DataFrame i.e. ‘gapminder_df’ with value ‘Earth’.

Solution-13.2

In this way, we can add a new column to the DataFrame.
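
A minimal sketch of adding the ‘Planet’ column, using a placeholder frame (toy_df and its values are made up):

```python
import pandas as pd

# Placeholder frame (values are made up)
toy_df = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C"],
    "year": [1997, 2002, 2007],
})

# Assigning a single value fills every row of the new column with it
toy_df["Planet"] = "Earth"

print(toy_df.columns.tolist())
print(toy_df["Planet"].tolist())
```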

Indexing and Slicing

Indexing and slicing mean any one of the following:

  • Selecting all the rows and some of the columns
  • Selecting some of the rows and all of the columns
  • Selecting some of the rows and some of the columns

Indexing is also referred to as subset selection from a DataFrame.

Problem-14: How to select the first column of the DataFrame?

Solution-14.1

As we are selecting a single column, the returned type is Series.

Solution-14.2

Again, we have selected a single column, so the returned type is Series.
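
The solution cells are not reproduced in this text, so here is a sketch of two ways to pick the first column, by name and by position. These are assumptions about what the solutions may have looked like; toy_df is a made-up placeholder.

```python
import pandas as pd

# Placeholder frame (values are made up)
toy_df = pd.DataFrame({
    "country": ["Country A", "Country B"],
    "continent": ["Asia", "Europe"],
    "year": [2002, 2007],
})

by_name = toy_df["country"]       # square brackets with the column name
by_position = toy_df.iloc[:, 0]   # all rows, column at position 0

print(type(by_name).__name__)
print(by_name.equals(by_position))
```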

Problem-15: How to select the first 3 columns with all rows from a DataFrame?

Solution-15.1

As we have selected multiple columns, the returned type is DataFrame.
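
Selecting several columns with a list of names can be sketched as follows (toy_df is a made-up placeholder; note the double square brackets):

```python
import pandas as pd

# Placeholder frame (values are made up)
toy_df = pd.DataFrame({
    "country": ["Country A", "Country B"],
    "continent": ["Asia", "Europe"],
    "year": [2002, 2007],
    "pop": [1000, 2000],
})

# The inner brackets build a list of column names; the outer ones do the selection
first_three = toy_df[["country", "continent", "year"]]

print(type(first_three).__name__)
print(first_three.shape)
```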

Solution-15.2.1

What does iloc stand for?

Everyone has their own reading of iloc. Some call it ‘integer location’, others ‘index location’; I prefer ‘index integer location’. Whatever the name, iloc selects rows and columns by their integer position.

Solution-15.2.2

What does loc stand for?

This one is simple: loc means ‘label-based indexing’. Instead of integer positions, you specify row index labels and column names. It comes in very handy when the index of your DataFrame is labeled.

With both iloc and loc, the value before the comma (i.e. ‘,’) within the square brackets selects rows and the value after it selects columns.
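
The rows-then-columns pattern can be sketched with both accessors. The frame toy_df is a made-up placeholder, and the two selections are written to pick the same first three columns:

```python
import pandas as pd

# Placeholder frame (values are made up)
toy_df = pd.DataFrame({
    "country": ["Country A", "Country B"],
    "continent": ["Asia", "Europe"],
    "year": [2002, 2007],
    "pop": [1000, 2000],
})

# iloc: all rows (:), columns at integer positions 0, 1, and 2
by_position = toy_df.iloc[:, 0:3]

# loc: all rows (:), columns from the label 'country' through 'year'
by_label = toy_df.loc[:, "country":"year"]

print(by_position.columns.tolist())
print(by_position.equals(by_label))
```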

Problem-16: How to select records from 20th to 30th index position for the last 3 columns?

Solution-16.1

If you look closely at the above cell result, you will find the records from the 20th to the 29th index position. If you remember slicing with lists, this will feel familiar, as Pandas follows the same concept: the end position is excluded.

Solution-16.2

This time records were displayed for only 2 columns (i.e. excluding the ‘Planet’ column), because Pandas applies the same rule to columns as well.

Solution-16.3

You might scratch your head after seeing the above result: with loc, the row range 20:30 gave the same result that we got from the range 20:31 with iloc.

So, to clear up the confusion, this is the basic difference between iloc and loc:

loc includes both boundary labels of a slice, whereas iloc excludes the end boundary position.
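
The boundary behaviour can be checked on a placeholder frame with 40 rows and the default integer index (toy_df and its values are made up):

```python
import pandas as pd

# Placeholder frame with the default index 0..39
toy_df = pd.DataFrame({
    "year": range(1960, 2000),
    "pop": range(40),
    "Planet": "Earth",
})

by_position = toy_df.iloc[20:30]   # positions 20 to 29; the end is excluded
by_label = toy_df.loc[20:30]       # labels 20 to 30; the end is included

print(len(by_position), len(by_label))
print(by_label.equals(toy_df.iloc[20:31]))
```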

Congratulations, we have come to the end of this blog. In the upcoming blogs, we will explore the SQL-like functions and advanced functionality of Pandas.

As I promised at the start of the blog, here are some bonus tips related to Pandas:

BONUS Tips

1. Future warning — .ix is deprecated

Bonus Tip-1

Always read such warnings carefully, as they tell you which features might not exist in future releases. (.ix was removed entirely in pandas 1.0; use loc or iloc instead.)

2. Single or multiple column selection

It is advisable to use square bracket notation whenever you select single or multiple columns from a DataFrame, for the following reasons:

  1. Gives good code readability
  2. Provides good understanding to the developer who might work on your code in future

3. Deep copy and Shallow copy

Let’s say you create a child object from a parent object. Then:

  • Shallow copy: The child shares the parent’s underlying data, so a change made through the child can be reflected in the parent object
  • Deep copy: The child is a full copy of its parent without any shared reference, so each is independent and a change in the child is not reflected in the parent

In Pandas, plain assignment (e.g. df2 = df1) makes no copy at all: it only creates a new variable that refers to the same object. To get an independent copy, call df1.copy(), which makes a deep copy by default.
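
The difference between a plain reference and an independent copy can be sketched as follows. The names parent_df, alias_df, and independent_df are hypothetical, and the values are made up.

```python
import pandas as pd

# Placeholder parent frame
parent_df = pd.DataFrame({"country": ["Country A", "Country B"]})

alias_df = parent_df                # plain assignment: same object, no copy
independent_df = parent_df.copy()   # copy() makes a deep copy by default

# A column added through the alias shows up in the parent
alias_df["Planet"] = "Earth"

# A column added to the independent copy does not
independent_df["Galaxy"] = "Milky Way"

print(alias_df is parent_df)
print(parent_df.columns.tolist())
```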

If you want to download the Jupyter Notebook of this blog, you can find it in the GitHub repository below:

https://github.com/Rajesh-ML-Engg/Deep_Dive_in_ML_Python

Thank you and happy learning!!!

Blog-8: Several PANDAS operations
