Deep Dive in Machine Learning with Python

Part — VIII: Several PANDAS operations

Rajesh Sharma
Analytics Vidhya
8 min read · Dec 26, 2019


Pandas logo from pydata.org

Welcome to the eighth blog of Deep Dive in Machine Learning with Python. In the last blog, we covered how to start working with Pandas using the ‘gapminder’ dataset. In today’s blog, we will focus on how to perform various Pandas operations on a dataset.

As I have a great interest in medical data, in this blog we will work with the Autism Spectrum Disorder Adolescent data, which is available on the UCI Machine Learning Repository. You can download this data from the provided link (it will be downloaded as an ARFF file).

At the end of this blog, I will also share one Bonus tip.

What is the ARFF File?

ARFF stands for Attribute-Relation File Format. It is an ASCII text file that contains data instances sharing a set of attributes. ARFF files were developed at the Department of Computer Science of the University of Waikato for use with the Weka machine learning software.

If you want to know more about ARFF files then access this link.

What is Autism Spectrum Disorder?

Autism spectrum disorder (ASD) is a developmental disorder that affects communication and behavior. It is described as a ‘developmental disorder’ because its symptoms generally appear in the first two years of life; however, it can be diagnosed at any age.

If you want to learn more about ASD, then I recommend checking out this link from the NIMH (National Institute of Mental Health), where you will find details about ASD’s signs, symptoms, causes, risks, treatments and clinical trials.

Dataset Description

If you open the downloaded ARFF file in a text editor such as Notepad, you will find three blocks/tags:

  • Relation
  • Attribute
  • Data

Refer to the below image:

ARFF file opened in notepad editor

An ARFF file mainly contains two components:

  • Header
  • Data

The Relation and Attribute tags together make up the header section, and the Data section contains the @data declaration line followed by the actual data rows.
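
For illustration, a tiny, made-up ARFF file with the three tags could look like the snippet below (the relation name, attributes and values here are hypothetical and not the actual contents of the ASD file):

    @relation asd_adolescent_example

    @attribute age numeric
    @attribute gender {f, m}
    @attribute result numeric

    @data
    12, m, 6.0
    14, f, 9.0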

Features of the dataset

Kindly refer to the below image for the description of the dataset’s features:

Feature description in doc file downloaded from UCI

You can also go through this description (doc) file downloaded along with the ARFF file.

Problem-1: How to import/load the ARFF file?

Solution-1.1

Here, in the above cell, we imported the ARFF file and displayed its data.

Solution-1.2

We have loaded the file into the ‘adr_data’ object, which is of the ‘tuple’ datatype.
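
Since the loaded object is a tuple and the non-numeric values show up as byte strings, the import was most likely done with scipy.io.arff.loadarff. Here is a minimal sketch, assuming the downloaded file is named ‘Autism-Adolescent-Data.arff’ (your actual filename may differ):

    from scipy.io import arff

    # loadarff returns a tuple: (structured data array, metadata about the attributes)
    adr_data = arff.loadarff('Autism-Adolescent-Data.arff')

    print(type(adr_data))      # <class 'tuple'>
    print(adr_data[0][:5])     # display the first five data records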

Problem-2: How to create a Pandas DataFrame from the above ADR_DATA object?

Solution-2.1

If you observe the above output closely, you will find the character ‘b’ attached to the values of the columns that are not numeric (i.e., every column except ‘age’ and ‘result’). This is Python’s way of displaying byte strings, and it indicates that you are dealing with ASCII-encoded values.

The character ‘b’ is only for representation and it is not a part of the data.
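
A minimal sketch along the lines of Solution-2.1 above: the first element of the ‘adr_data’ tuple is a NumPy structured array, which Pandas can turn into a DataFrame directly.

    import pandas as pd

    # build the DataFrame from the structured array returned by loadarff
    adr_data_df = pd.DataFrame(adr_data[0])

    adr_data_df.head()    # non-numeric values appear as b'...' byte strings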

Solution-2.2

So, here you get the dataset with no additional characters (refer to the bonus tip for more details).
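
One way to strip the byte-string prefix is to decode every object column; a short sketch (the bonus tip at the end shows an alternative using apply and lambda):

    # decode each byte-string (object) column into plain Python strings
    for col in adr_data_df.select_dtypes(include='object').columns:
        adr_data_df[col] = adr_data_df[col].str.decode('utf-8')

    adr_data_df.head()    # values are now shown without the b'' prefix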

Problem-3: How to check the datatype of a column?

CASE-I

Solution-3.1

Here, in the above example, we got the output ‘O’, which refers to the ‘object’ dtype in Pandas and typically means string data.

CASE-II

Solution-3.2

In the above example, we got the output ‘float64’, which is the Pandas dtype for floating-point values.
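
A sketch of both checks; ‘ethnicity’ and ‘result’ are used here as example columns of this dataset:

    # CASE-I: a string column reports the 'object' dtype, displayed as dtype('O')
    print(adr_data_df['ethnicity'].dtype)    # object

    # CASE-II: a numeric column reports a numeric dtype such as float64
    print(adr_data_df['result'].dtype)       # float64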

Problem-4: How to apply where conditions on the pandas dataframe?

Solution-4

CASE-I

Solution-4.1

In the above example, we applied the condition to have the records where ‘relation’ is ‘Parent’.

CASE-II

Solution-4.2

In the above example, we applied two conditions separated by AND(&) to filter the records.

CASE-III

Solution-4.3

In the above example, we applied two conditions separated by OR(|) to filter the records.

CASE-IV

Solution-4.4

In the above example, we applied a negative condition by using ~ (tilde) to filter the records.

CASE-V

Solution-4.5

In the above example, we applied the condition to filter the records by using a list of values.
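
A sketch of the five filtering patterns covered above; apart from relation == ‘Parent’, the columns and values below are only illustrative:

    # CASE-I: single equality condition
    adr_data_df[adr_data_df['relation'] == 'Parent']

    # CASE-II: two conditions combined with AND (&)
    adr_data_df[(adr_data_df['relation'] == 'Parent') & (adr_data_df['gender'] == 'f')]

    # CASE-III: two conditions combined with OR (|)
    adr_data_df[(adr_data_df['relation'] == 'Parent') | (adr_data_df['relation'] == 'Self')]

    # CASE-IV: negate a condition with ~ (tilde)
    adr_data_df[~(adr_data_df['relation'] == 'Parent')]

    # CASE-V: filter using a list of values
    adr_data_df[adr_data_df['ethnicity'].isin(['Asian', 'Latino'])]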

Problem-5: How to sort the pandas DataFrame based on a column?

CASE-I

Solution-5.1

In the above example, we sorted the DataFrame by using the column ‘gender’.

CASE-II

Solution-5.2

In the above example, we sorted the DataFrame based on two columns, ‘ethnicity’ and ‘gender’.
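
A short sketch of both sorts using sort_values:

    # CASE-I: sort by a single column
    adr_data_df.sort_values(by='gender')

    # CASE-II: sort by two columns, first 'ethnicity' and then 'gender'
    adr_data_df.sort_values(by=['ethnicity', 'gender'])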

Problem-6: How to group the pandas dataframe?

CASE-I

Solution-6.1

In the above example, we grouped ‘adr_data_df’ to find the count of each ‘ethnicity’ category for the columns ‘gender’ and ‘relation’.

CASE-II

Solution-6.2

In the above example, we grouped ‘adr_data_df’ to find the count of each ‘ethnicity’ category across all the columns.

CASE-III

Solution-6.3

Another way to count the categories of a column is by using the Series function value_counts().
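
A sketch of the three variants:

    # CASE-I: count of each 'ethnicity' category for the 'gender' and 'relation' columns
    adr_data_df.groupby('ethnicity')[['gender', 'relation']].count()

    # CASE-II: count of each 'ethnicity' category across all the columns
    adr_data_df.groupby('ethnicity').count()

    # CASE-III: value_counts() on the Series gives the category counts directly
    adr_data_df['ethnicity'].value_counts()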

Problem-7: How to change the datatype of an existing DataFrame column?

Before changing datatype

Converting float column to integer

After changing the datatype of ‘result’ column
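
A sketch of the conversion with astype, assuming the ‘result’ column contains no missing values (astype(int) fails on NaN):

    # before: the column is stored as floats
    print(adr_data_df['result'].dtype)              # float64

    # convert the float column to integer
    adr_data_df['result'] = adr_data_df['result'].astype(int)

    # after: the column is now integer
    print(adr_data_df['result'].dtype)              # int64 (int32 on some platforms)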

Problem-8: How to convert the STR column values to UPPERCASE?

Solution-8

In this example, we changed the ‘ethnicity’ column values to UPPERCASE.
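
A one-line sketch using the .str accessor:

    # convert every value of the 'ethnicity' column to uppercase
    adr_data_df['ethnicity'] = adr_data_df['ethnicity'].str.upper()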

Problem-9: How to add a new column in the DataFrame by applying an operation on an existing column?

Solution-9

In the above example, we added a new column ‘caps_autism’ having values from column ‘austim’.
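
A sketch of adding the column; judging by its name, ‘caps_autism’ holds the uppercased values of ‘austim’ (the column name is misspelled like this in the source data), though the exact operation in the original cell may differ:

    # derive a new column from the existing 'austim' column
    adr_data_df['caps_autism'] = adr_data_df['austim'].str.upper()

    adr_data_df[['austim', 'caps_autism']].head()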

Problem-10: How to delete the column from a DataFrame?

CASE-I

Solution-10.1

In the above example, we dropped the newly added column ‘caps_autism’; however, this deletion was not actually applied to the DataFrame.

Solution-10.2

As you can see in the above cell, ‘caps_autism’ is still in the DataFrame. Now, to get this column removed from the DataFrame you need to use the ‘inplace’ parameter which performs the ‘drop’ operation on the DataFrame.

Solution-10.3

Here you go, now the column ‘caps_autism’ has been deleted from the DataFrame. You might be wondering what the role of ‘axis=1’ is: this parameter tells Pandas that ‘caps_autism’ lies along the column axis of the DataFrame, i.e., it is a column and not a row label (refer to the below image).

Solution-10.4
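
A sketch covering the four solutions above:

    # Solution-10.1: drop() returns a new DataFrame; the original stays untouched
    adr_data_df.drop('caps_autism', axis=1)

    # Solution-10.2: 'caps_autism' is therefore still present
    adr_data_df.columns

    # Solution-10.3: inplace=True applies the drop to the DataFrame itself;
    # axis=1 tells Pandas that 'caps_autism' is a column, not a row label
    adr_data_df.drop('caps_autism', axis=1, inplace=True)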

Problem-11: How to set the new index of the DataFrame?

Solution-11

So, we set a new index on the DataFrame, i.e., ‘gender’.
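
A one-line sketch with set_index:

    # make 'gender' the new index of the DataFrame
    adr_data_df = adr_data_df.set_index('gender')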

Problem-12: How to drop the duplicates from the DataFrame?

Solution-12

Here, we dropped the single duplicate record from the ‘adr_data_df’.
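
A one-line sketch with drop_duplicates:

    # remove duplicate rows, keeping the first occurrence of each
    adr_data_df = adr_data_df.drop_duplicates()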

Problem-13: How to reset the index of the DataFrame?

Solution-13

In the above example, we reset the index of the DataFrame, and the ‘gender’ column was added back to the DataFrame as a regular (non-index) column. If you don’t want this column added back to the DataFrame, then pass the ‘drop=True’ parameter to reset_index.
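
A sketch of both variants of reset_index:

    # move the 'gender' index back into a regular column
    adr_data_df = adr_data_df.reset_index()

    # or discard the index entirely instead of turning it into a column
    # adr_data_df = adr_data_df.reset_index(drop=True)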

Problem-14: How to rank the data of the DataFrame?

CASE-I

Solution-14.1

In the above example, we ranked the DataFrame based on the column ‘ethnicity’ and used method=‘min’. ‘min’ assigns each tied record the lowest rank of its group, so it gave rank 1 to the first 6 records with ethnicity ‘?’ and then rank 7 to the records with ethnicity ‘ASIAN’.

CASE-II

Solution-14.2

In the above example, we ranked the DataFrame based on the column ‘ethnicity’ and used method=‘dense’. ‘dense’ works like ‘min’, but the rank always increases by 1 between groups, so it gave rank 1 to the first 6 records with ethnicity ‘?’ and then rank 2 to the records with ethnicity ‘ASIAN’.
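
A sketch of both ranking calls on the ‘ethnicity’ column (the original cells may have sorted the DataFrame first so that the ranks read in order):

    # CASE-I: 'min' gives every tied record the lowest rank of its group
    adr_data_df['ethnicity'].rank(method='min')      # '?' -> 1.0, 'ASIAN' -> 7.0, ...

    # CASE-II: 'dense' is like 'min', but ranks increase by exactly 1 between groups
    adr_data_df['ethnicity'].rank(method='dense')    # '?' -> 1.0, 'ASIAN' -> 2.0, ...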

Problem-15: How to fill the missing values in the dataframe?

CASE-I: IsNull()

Solution-15.1

CASE-II: ISNA()

Solution-15.2

Thus, in this dataset there are no NULL or NA records; however, there are some records with ‘?’.

Solution-15.3

In the ‘ethnicity’ column, we have 6 records with ‘?’. So, how can we fill such values?

Solution-15.4
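
A sketch of the checks plus one possible way to fill the ‘?’ entries; replacing them with the most frequent category is an assumption here, and the original cell may have chosen a different fill value:

    import numpy as np

    # CASE-I and CASE-II: both report zero missing values for this dataset
    adr_data_df.isnull().sum()
    adr_data_df.isna().sum()

    # treat '?' as missing and fill it, e.g. with the most frequent category
    adr_data_df['ethnicity'] = adr_data_df['ethnicity'].replace('?', np.nan)
    adr_data_df['ethnicity'] = adr_data_df['ethnicity'].fillna(adr_data_df['ethnicity'].mode()[0])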

Congratulations, we have come to the end of this blog. To summarize, we covered various Pandas operations that we perform on a dataset.

BONUS Tip

1. Data encoding in pandas

Bonus Tip-1

Thus, by applying the user-defined function on the dataframe we changed the data encoding.

Don’t worry about the ‘apply’ and ‘lambda’ that we used in the ‘apply_decode’ function; we will focus on these topics in the upcoming blogs.
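
A sketch of what such an ‘apply_decode’ function could look like; this reconstructs the idea with apply and lambda and is not necessarily the exact code from the original cell:

    def apply_decode(df):
        # decode byte-string values in every object column to regular strings
        str_df = df.copy()
        for col in str_df.select_dtypes(include='object').columns:
            str_df[col] = str_df[col].apply(
                lambda val: val.decode('utf-8') if isinstance(val, bytes) else val
            )
        return str_df

    adr_data_df = apply_decode(adr_data_df)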

Thank you and happy learning!!!

If you want to download the Jupyter Notebook of this blog, then kindly access the below GitHub repository:

https://github.com/Rajesh-ML-Engg/Deep_Dive_in_ML_Python

Blog-9: Advanced PANDAS operations
