A Step-by-Step Coding Guide to the Pandas Library

Prashant Ojha
May 30, 2021

An introduction to Python’s Pandas library for beginners: a simple, comprehensive, step-by-step coding guide to getting started in data science.

If you are a newbie, you are welcome to download my code file and follow along with the article for better understanding.

What is Pandas? Pandas is a software library written for the Python programming language for data manipulation and data analysis. It offers data structures and operations for manipulating numerical tables and time series.

Pandas provides a powerful, fast DataFrame object for data manipulation and data analysis.

For awesome analytics cartoons: https://timoelliott.com/blog/cartoons/more-analytics-cartoons

We will cover the following:

  1. Overview
  2. Importing the library and loading the datasets
  3. Accessing the main DataFrame components
  4. Understanding data types in Python
  5. Selection and operation | Series & DataFrame
  6. Method Chaining
  7. Filling missing values
  8. Comparing missing values
  9. Renaming row and column names
  10. Creating and Deleting Columns
  11. Ordering Columns for better use
  12. Transposing the direction of a DataFrame

I. Overview

There are two widely used data structures in pandas: Series and DataFrame. It is vital to understand every component of both.

Let’s start with a quick, non-comprehensive overview of the fundamental data structures in pandas. The fundamental behaviours of data types, indexing, and axis labelling apply across all of these objects.

Visually, the displayed output of a pandas DataFrame (in a Jupyter Notebook) appears to be nothing more than an ordinary table of data consisting of rows and columns. Hiding beneath the surface are the three components — the index, columns, and data (also known as values) — that you must be aware of in order to exploit the DataFrame’s full potential. Examine the labelled anatomy of the DataFrame:

[Figure: Anatomy of a DataFrame]

Now, let us understand what methods and attributes are in Python; knowing the difference between the two will help you understand the basics better. A method is a function defined in a class. An attribute is an instance variable defined in a class.

For a quick simplification let’s take an example of an elephant.

Attribute == Characteristics

What are the characteristics of an elephant? It has a trunk for a nose, large fan-like ears, and ivory tusks, and it is herbivorous, etc.

Method == Actions

What are the methods (actions) of an elephant? It can walk, run, eat grass, drink water with its trunk, etc.

Before deep-diving into pandas, it is vital to discuss the approach so that the basics are clear. Here, I’ll be using two datasets: Netflix and Spotify. The two primary goals of the following exercise are to understand the anatomy and workings of pandas, and to learn the different methods for approaching any dataset in order to understand it better. I am diverting from the traditional approach of choosing one dataset and doing exploratory data analysis to extract business insights from that dataset alone. Rather, we will focus on the workings of pandas and its methods and attributes, so that you can use them on any dataset in the future. Therefore, I’ll be switching from one dataset to the other as required.

Also, I’ll cover exploratory data analysis in the next exercise, where we’ll see how one can use the following methods and attributes to gain business insights from a dataset.

II. Importing the library and loading the datasets
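
As a minimal sketch of this step (the file names netflix.csv and spotify.csv are assumptions; substitute the paths to your own copies of the datasets):

```python
import pandas as pd

# File names are assumptions -- point these at your own copies of the datasets.
netflix = pd.read_csv("netflix.csv")
spotify = pd.read_csv("spotify.csv")
```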

III. Accessing the main DataFrame components

Each of the three DataFrame components — the index, columns, and data — may be accessed directly from a DataFrame. Each of these components is itself a Python object with its own unique attributes and methods. It will often be the case that you would like to perform operations on the individual components and not on the DataFrame.

The output of the columns attribute appears to be just a sequence of the column names. The fully qualified class name of the object for the variable columns and index is pandas.core.indexes.base.Index.

Here, pandas is the package name, followed by the module path, core.indexes.base, and it ends with the name of the type, Index.

As you can see, the values attribute of the DataFrame returns a NumPy n-dimensional array. Most of pandas relies heavily on this n-dimensional array.
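
As a sketch, the three components can be pulled out of the DataFrame like this (netflix comes from the loading step above; note that with a default index, read_csv gives a RangeIndex, a subclass of Index):

```python
index = netflix.index      # row labels -- a RangeIndex by default after read_csv
columns = netflix.columns  # column labels -- pandas.core.indexes.base.Index
data = netflix.values      # the underlying NumPy n-dimensional array

print(type(columns))  # <class 'pandas.core.indexes.base.Index'>
print(type(data))     # <class 'numpy.ndarray'>
```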

IV. Understanding data types in python

[Figure: Data types in Python]

In very broad terms, data may be classified as either continuous or categorical. Continuous data is always numeric and represents some kind of measurement. Categorical data, on the other hand, represents discrete, finite values.

Pandas does not broadly classify data as either continuous or categorical. Kindly visit the Python documentation website to see the distinct, precise definitions.
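
As a toy illustration (not one of our datasets), pandas assigns each column exactly one data type, and continuous measurements versus discrete categories end up with different dtypes:

```python
import pandas as pd

toy = pd.DataFrame({
    "duration": [90, 120, 95],              # continuous measurement -> int64
    "rating": [7.9, 8.3, 6.5],              # continuous measurement -> float64
    "genre": ["Drama", "Comedy", "Drama"],  # categorical text -> object
})
print(toy.dtypes)
```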

V. Selection and operation | Series & DataFrame

You can pass a column name using dot notation to select a Series of data. Alternatively, you can do the same task with netflix[“director”]. It is interesting to see that, with dot notation, the column has actually become an attribute of the DataFrame, and the column name becomes the name of the Series.
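
A sketch of both selection styles, assuming the Netflix dataset has a director column:

```python
directors = netflix.director      # dot notation
directors = netflix["director"]   # indexing operator -- also works for names with spaces

print(directors.name)  # 'director' -- the column name became the Series name
```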

> to_frame( ) It is possible to turn this Series into a one-column DataFrame with the to_frame method. This method will use the Series name as the new column name.
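
A sketch, continuing with the director Series from above:

```python
director_df = netflix["director"].to_frame()
print(type(director_df))    # <class 'pandas.core.frame.DataFrame'>
print(director_df.columns)  # Index(['director'], dtype='object')
```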

> head( )/tail( ) Once we read the dataset into a pandas DataFrame, we take an overview of the data; the head or tail method helps us get this overview in tabular form.

> info( ) This helps us get a concise summary of the dataset; it comes in handy when doing exploratory data analysis of the data.

> dtypes It shows the data type of each column; in pandas, one column can have only one data type.

> type( ) This function returns the class type of the argument (object) passed as a parameter; it is mostly used for debugging purposes.
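
A sketch of these overview methods on the Netflix dataset:

```python
netflix.head()   # first 5 rows (pass an integer for more or fewer)
netflix.tail(3)  # last 3 rows
netflix.info()   # index, columns, dtypes, non-null counts, memory usage
netflix.dtypes   # the single data type of each column
type(netflix)    # <class 'pandas.core.frame.DataFrame'>
```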

> value_counts( ) The data type of the Series usually determines which of the methods will be the most useful. For instance, one of the most useful methods for the object data type Series is value_counts, which counts all the occurrences of each unique value.

So value_counts is most useful for Series with the object data type, but it can sometimes also provide insight into a numeric data type.
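
A sketch, again assuming a director column:

```python
netflix["director"].value_counts()                # occurrences of each unique value
netflix["director"].value_counts(normalize=True)  # the same counts as proportions
```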

> shape, size and len( ) The shape and size attributes return the count of elements, and we can also use the len( ) function for the same purpose. All three count both missing and non-missing values.

> count( ) This method returns only the number of non-NaN (non-missing) values, so remember that it is different from the ones above.
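
A sketch contrasting the two behaviours:

```python
col = netflix["director"]

col.shape    # (n,) -- number of elements, missing values included
col.size     # n, missing values included
len(col)     # n, same as size for a Series
col.count()  # number of non-missing values only
```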

> min( ), max( ), mean( ), median( ), sum( ) These are some of the basic summary statistics methods provided by pandas. They can help you check for errors in the data distribution and find outliers in the dataset. They are mostly used with numeric data types, but when used with the object data type they yield a totally different output.

> describe( ) This dedicated stats method calculates the summary statistics of the data, including some quantiles, all at once; called on a Series, it returns a Series of statistics.

> quantile( ) This method is used to calculate quantiles of numeric data; we can pass a list of values like [0.1, 0.2, 0.3, 0.4, 0.5] to study the distribution of the data. It returns a scalar value when passed a single value and a Series when passed a list.
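
A sketch of these statistics; the numeric column name duration_ms is a hypothetical stand-in for whichever measurement column your Spotify dataset has:

```python
dur = spotify["duration_ms"]  # hypothetical numeric column name

dur.min(), dur.max()      # quick range check for outliers
dur.mean(), dur.median()  # centre of the distribution
dur.sum()                 # total
dur.describe()            # count, mean, std, min, quartiles, max at once
dur.quantile(0.5)                        # a single value returns a scalar
dur.quantile([0.1, 0.2, 0.3, 0.4, 0.5])  # a list returns a Series
```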

VI. Method Chaining

Method chaining, also known as named parameter idiom, is a common syntax for invoking multiple method calls in object-oriented programming languages. In Python, every variable is an object, and all objects have attributes and methods that refer to or return more objects.

Each method returns an object, allowing the calls to be chained together using dot notation in a single statement, without requiring variables to store the intermediate results. This style helps the developer avoid the cognitive burden of naming variables and keeping them in mind.

One of the most common methods to append to a chain is the head method, which suppresses long output. For shorter chains, there isn’t a great need to place each method on a different line.

Instead of calling the notnull() or isnull() method and getting Boolean values in return, we can sum the result, which provides insight into the data; we can also chain mean(), median(), count(), etc., for better interpretation and understanding of the data.
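
A sketch of such chains:

```python
# Each call returns an object, so the next call chains on with dot notation.
netflix["director"].value_counts().head(3)  # head() suppresses long output

# Chaining an aggregation after isnull() summarizes missing data per column.
netflix.isnull().sum()   # number of missing values in each column
netflix.isnull().mean()  # fraction of missing values in each column
```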

VII. Filling missing values

> isnull( ).value_counts( ) The count( ) method returns a value less than the total number of elements when there are missing values in the data. The isnull( ) method helps find whether each value is missing or not.

> fillna( ).count( ) It is possible to replace all missing values within a Series with the fillna method.

> dropna( ).count( ) In some cases, the number of missing values in a particular column is large, so we can drop that column.
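
A sketch of the three chains above, assuming a director column with missing values; the fill value "Unknown" is an arbitrary choice:

```python
directors = netflix["director"]

directors.isnull().value_counts()    # how many values are missing vs present
directors.fillna("Unknown").count()  # fill the gaps, then count -- no NaNs remain
directors.dropna().count()           # or drop the missing values instead
```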

VIII . Comparing missing values

Here, I want to discuss an unusual object in Python: NumPy’s NaN (np.nan) object. Pandas uses this to represent a missing value.

Even Python’s None object evaluates as True when it is compared to itself.

Interestingly, np.nan is not equal to itself, and any other comparison against np.nan also returns False.
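
A sketch of these comparisons:

```python
import numpy as np

None == None      # True  -- Python's None is equal to itself
np.nan == np.nan  # False -- NaN is not equal to anything, not even itself
np.nan > 5        # False -- every ordering comparison against NaN is False
```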

Above, I created a copy of the Netflix dataset using the copy() method and then compared it to the original Netflix dataset. The element-wise comparison works as expected but becomes problematic whenever you attempt to compare DataFrames with missing values: in the director column, the first instance evaluates to False because it is a NaN value. This problem occurs because NaN is not equal to NaN.

To solve this problem, the correct way to compare two entire DataFrames with one another is not with the equals operator (==) but with the equals method, as shown below.
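
A sketch of both comparisons:

```python
netflix_copy = netflix.copy()

(netflix == netflix_copy).all().all()  # False -- element-wise == fails wherever NaN meets NaN
netflix.equals(netflix_copy)           # True  -- treats NaNs in the same positions as equal
```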

IX. Renaming row and column names

One of the most basic and common operations on a DataFrame is to rename the row or column names. Good column names help us better associate with the data they contain.

Here I am just showing how to change lowercase letters to uppercase, but you can very well change the names of the columns entirely.

There are multiple ways to rename row and column labels. It is possible to extract the row and column labels into Python lists and then reassign the index and columns attributes directly. This assignment works when each list has the same number of elements as the corresponding labels. The following code uses the tolist() method on each Index object to create a Python list of labels. It then modifies a couple of values in the list and reassigns the lists to the index and columns attributes.
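
A sketch of both approaches; the replacement label SHOW_ID is hypothetical:

```python
# Option 1: the rename method returns a new DataFrame with mapped labels.
netflix_upper = netflix.rename(columns=str.upper)

# Option 2: pull the labels out as a list, edit it, and reassign the attribute.
columns = netflix.columns.tolist()
columns[0] = "SHOW_ID"     # hypothetical new name for the first column
netflix.columns = columns  # the list must match the number of columns exactly
```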

X. Creating and Deleting Columns

During data analysis, it is extremely likely that you will need to create new columns to represent new variables. Commonly, these new columns will be created from previous columns already in the dataset. The simplest way to create a new column is to assign it a scalar value. Place the name of the new column as a string into the indexing operator.
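
A sketch; the column name platform and its value are hypothetical:

```python
netflix["platform"] = "Netflix"  # every row receives this scalar value
```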

Several columns contain numeric data about a song, each some kind of measurement value. We can also create a new column by adding a few existing columns and dividing them by another existing column — this may help you feature-engineer the dataset, eliminating multiple columns in favour of just a few new ones.

In a DataFrame, using the insert method, one can insert a new column into a specific place, not just at the end. The insert method takes the integer position of the new column as its first argument, the name of the new column as its second, and the values as its third. You will need to use the get_loc index method to find the integer location of a column name. The insert method modifies the calling DataFrame in place, so there is no assignment statement.
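
A sketch of both ideas; the Spotify column names energy, danceability, and tempo, and the two new columns, are hypothetical:

```python
# A derived column built from existing measurement columns.
spotify["energy_ratio"] = (spotify["energy"] + spotify["danceability"]) / spotify["tempo"]

# Insert a new column at a specific position rather than at the end.
loc = spotify.columns.get_loc("energy")  # integer position of an existing column
spotify.insert(loc, "has_lyrics", True)  # modifies spotify in place -- no assignment
```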

XI. Ordering Columns for better use

We get a first view of the data with the head method, which here shows only 3 rows of data.

Here we create a new DataFrame by passing a list of multiple columns to the dataset’s indexing operator. There are also instances when a single column of a DataFrame needs to be selected as a DataFrame; this is done by passing a single-element list to the indexing operator.

Passing a long list inside the indexing operator might cause readability issues. To help with this, you may save all your column names into a list first, as sketched below.
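
A sketch; the column names are hypothetical:

```python
song_info = ["song_name", "artist", "album"]  # save the names in a list first
songs = spotify[song_info]                    # a list selects a new DataFrame

one_col = spotify[["song_name"]]  # a single-element list gives a one-column DataFrame
```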

Here, we group the columns on the basis of data type using the dtypes.value_counts() method-chaining technique.

Above we have seen that columns can be selected by passing strings to the indexing operator, but you can also select columns by data type and create a new DataFrame from them. An alternative method for selecting columns is the filter method. This method is flexible and searches column names (or index labels), depending on which parameter is used. Here, we use the like parameter to search for all column names that contain the string ‘song’.

The filter method also allows columns to be searched through regular expressions with the regex parameter. Here, we search for all columns that have an underscore (“_”) somewhere in their name.
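
A sketch of grouping and selecting by data type, plus both filter searches (select_dtypes is the standard pandas method for dtype-based selection, assumed here to match the original screenshots):

```python
spotify.dtypes.value_counts()            # how many columns of each data type
spotify.select_dtypes(include="number")  # new DataFrame of just the numeric columns

spotify.filter(like="song")  # columns whose name contains the substring 'song'
spotify.filter(regex="_")    # columns whose name matches the regular expression
```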

The columns don’t appear to have any logical ordering to them. By organizing the names sensibly, we can get more understanding of the data. For this, I stored the column names as strings in new variables, dividing the columns into 4 logical groups.

Concatenating all the lists together gives the final column order.

Also, ensure that this list contains all the columns of the original, and then pass the list with the new column order to the indexing operator of the DataFrame to reorder the columns.
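
A sketch with hypothetical column groups:

```python
cols_id = ["song_name", "artist"]        # identifying information
cols_audio = ["energy", "danceability"]  # audio measurements

new_order = cols_id + cols_audio               # concatenate the lists
assert set(new_order) == set(spotify.columns)  # nothing missing, nothing extra
spotify = spotify[new_order]                   # reorder via the indexing operator
```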

XII. Transposing the direction of a DataFrame

Many DataFrame methods have an axis parameter. This important parameter controls the direction in which the operation takes place. The axis parameter can only be one of two values, either 0 or 1, aliased by the strings ‘index’ and ‘columns’, respectively.

Changing the axis parameter to 1 (or ‘columns’) transposes the operation so that each row of data gets a count of its non-missing values. Keeping this trick in mind helps you get more information out of the data.
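
A sketch of both directions:

```python
netflix.count(axis=0)  # or axis='index': non-missing values per column (the default)
netflix.count(axis=1)  # or axis='columns': non-missing values per row
```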

Hey there, I am looking forward to your comments and suggestions.
Kindly clap, subscribe and share.

This is the first post of my upcoming Data Wrangling series; please look forward to it.
