Python Pandas — Tutorial

Richa Goel

Published in IntelliPaat · Jan 2, 2019

Pandas is a simple yet powerful Python library for Data Analysis and one of the most widely used Python packages. It comprises many data structures and tools for effective Data Manipulation and analysis. Python with Pandas is used everywhere, including commercial and academic settings such as economics, finance, analytics, and statistics. And if you are going to work with data using Python, you need to learn Pandas as well.

Some of the key features of Pandas are:

· It provides a fast and efficient DataFrame object with default and customized indexing.

· There are tools for loading data from different file formats into in-memory data objects.

· It is easy to perform data alignment and integrated handling of missing data in Pandas.

· It is very simple to perform pivoting and reshaping of data sets in Pandas.

· It also provides indexing, label-based slicing, and sub-setting of large data sets.

· We can easily insert and delete columns from a data structure.

· Data aggregation and transformations can be done using group by.

· High-performance merge and join of data can be done using Pandas.

· It also provides Time Series functionality.

In this tutorial, we will use Pandas to analyze data on product reviews from Amazon, a popular e-commerce website. The dataset consists of information about various product reviews on the Amazon website, including:

· Product_Review_Phrase — Description of a product according to its review

· Product_Title — Product name

· Website_URL — Link to the product on the website

· Platform — Whether the product is available on the website, the mobile app, or both

· Product_Rating — Average product rating as per customers

· Product_Category — Category of the product

· Is_Amazon_Advantage_Product — Whether a product is a premium product or not

· Product_Launch_Year — Year the product was launched on the website

· Product_Launch_Month — Month the product was launched on the website

· Product_Launch_Day — Day the product was launched on the website

While analyzing the product reviews, we will learn how to implement key Pandas concepts like indexing, sorting, filtering, and plotting.

The data is in the CSV (Comma Separated Values) format — each value in a record is separated by a comma "," and records are separated by a new line. There are approximately 1,841 rows, including a header row, and 10 columns in the file.

Before we get started, a quick note on the software environment: in this tutorial, we will be using Python 3.5, and our examples will be run in a Jupyter notebook.

Content:

· Importing Conventions for Pandas

· Pandas Series Objects

· DataFrames in Pandas

· Importing Data with Pandas

· Indexing DataFrames with Pandas

· Sorting DataFrames with Pandas

· Pandas DataFrame Methods

· DataFrame Math Operations with Pandas

· Filtering in Pandas

· Pandas Data Visualization with Data Plotting

So, let's begin.

Importing Conventions for Pandas:

The very first and most important thing to keep in mind is to import the Pandas library properly.

import pandas as pd

By convention, we abbreviate it as pd while importing.

Series Objects and creating Series Objects:

A Series is a one-dimensional labeled array that can contain any type of data, including mixed types. Now let's have a look at how we can create Series objects in Pandas with some examples.

Example:

series1 = pd.Series([1, 2, 3, 4])

series1

Just to make sure that the object we have created is indeed a Series object, we can use type() on it.

type(series1)

Example 2:

Further, we can specify the index of the Series object, as shown below:

series2 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

series2
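As a small added illustration (not part of the original example set), once a Series has an index, individual values can be selected by those labels:

series2['b']         # selects the value stored at label 'b', i.e. 2

series2[['a', 'd']]  # selects a smaller Series containing only labels 'a' and 'd'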

DataFrames and creating DataFrames:

DataFrames in pandas are defined as two-dimensional labeled data structures with columns of potentially different types.

Create a DataFrame from a List: Let's take a list of integers and then create a DataFrame by using that list.

list1 = [1,2,3,4,5]

df = pd.DataFrame(list1)

df

Create a DataFrame from a List of Lists: Now we will create a DataFrame by using a list of lists.

list_of_lists = [['apple', 10], ['mango', 12], ['banana', 13]]

df = pd.DataFrame(list_of_lists, columns=['fruit', 'count'], dtype=int)

df

Create a DataFrame from a Dict: We can also create DataFrames with the help of a dictionary.

dict1 = {'fruit': ['apple', 'mango', 'banana'], 'count': [10, 12, 13]}

df = pd.DataFrame(dict1)

Note: Now that we are familiar with DataFrames and Series objects, keep in mind that each column in a DataFrame is a Series object.
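As a quick check on the df we just built from dict1 (this snippet is an added illustration, not part of the original walkthrough), selecting a single column returns a Series:

type(df['fruit'])   # pandas.core.series.Series

df['fruit']         # the 'fruit' column, displayed as a Series with its own index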

Importing Data with Pandas

Here, we will first read the data. The data is stored as a CSV (comma-separated values) file, Amazon_Products_Review.csv, where each value in a row is separated by a comma "," and each row is on a new line.

Each row in the data represents a single product that was reviewed on Amazon. There is also a leading column that contains row index values. We will not discuss this column now; later we'll dive into what index values are. To work with the data in Python, the first step is to import the file into a Pandas DataFrame. A DataFrame is nothing but a way to represent and work with tabular data, and tabular data has rows and columns.

Our file is in .csv format, so the pd.read_csv() function will help us read the data stored in that file. This function takes a CSV file as input and returns a DataFrame.

import pandas as pd

Product_Review = pd.read_csv("Amazon_Products_Review.csv")

Let's inspect the type of Product_Review by using the type() function.

type(Product_Review)

For file types other than .csv, the importing conventions are mentioned below:

· pd.read_table(“filename”)

· pd.read_excel(“filename”)

· pd.read_sql(query, connection_object)

· pd.read_json(json_string)
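As a rough sketch of how these are called (the file names, JSON string, table name, and connection below are placeholders for illustration, not files used in this tutorial):

df_txt = pd.read_table("products.txt")    # delimited text file

df_xl = pd.read_excel("products.xlsx")    # Excel workbook (requires an Excel engine such as openpyxl)

df_js = pd.read_json('{"fruit": ["apple", "mango"], "count": [10, 12]}')  # JSON string (newer pandas versions prefer a file path or buffer here)

import sqlite3                            # pd.read_sql() also needs a database connection

conn = sqlite3.connect("products.db")

df_sql = pd.read_sql("SELECT * FROM reviews", conn)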

Working with DataFrames:

Now that the DataFrame is ready, let’s have a look at some of the operations in pandas.

· Product_Review.head() — This will print the first 5 rows of the DataFrame.

· Product_Review.tail() — This will print the last 5 rows of the DataFrame.

· Product_Review.shape — Gives the number of rows and columns. In our DataFrame we have 1840 rows and 11 columns.

· Product_Review.info() — This will give us the index, data types, and memory usage of the DataFrame.

· Product_Review.describe() — Summary statistics for the numerical columns.

Did you notice? Everything has been read in properly — we have 1840 rows and 11 columns.

One of the big advantages of Pandas over NumPy is that Pandas allows us to have columns with different data types. Here, Product_Review has columns that store float values, like Product_Rating, string values, like Product_Review_Phrase, and integers, like Product_Launch_Year.
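To see this for yourself, the dtypes attribute lists the data type of each column (the exact types will depend on your copy of the dataset):

Product_Review.dtypes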

Now that the data is read properly, we will next work on indexing Product_Review so that we can get the rows and columns we need.

Indexing the DataFrames with Pandas:

Now, let’s say we want to select and have a look at a chunk of data from our DataFrame. There are two ways of achieving the same.

First, selecting by position and second, selecting by label.

Selecting by Position: Using iloc we can retrieve rows and columns by position. Here we need to specify the positions of the rows and columns.

Example1:

Suppose we want only the first column out of the DataFrame. Then we would use iloc on the DataFrame as shown below:

Product_Review.iloc[:,0]

This snippet of code shows that we want to look at all the rows of the first column. Keep in mind that the position of the first column (or first row) always starts with 0. As we wanted all the rows, we specified just a colon ":" without mentioning any position.

Example 2:

Again, say we want to have a look at the first 5 rows of the 4th column. We need to specify the position of the rows as 0:5, which means that we want to view the first 5 rows, from position 0 to 4 (note that position 5 is excluded).

Product_Review.iloc[0:5,4]

Also, in the example shown above, instead of writing 0:5 we can leave off the first position value and just write :5; this has the same meaning. But if we write 0:, it means indexing from the 0th position to the last position.

Similar examples:

· Product_Review.iloc[:,:] -> View the entire DataFrame

· Product_Review.iloc[6:,4:] -> View from row 6 and column 4 onwards

Now let’s update our DataFrame by removing the first column, which contains no useful information.
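One minimal way to do this, assuming the unwanted leading column sits at position 0, is to keep every column from position 1 onwards; an alternative would be to pass index_col=0 to pd.read_csv() when loading the file.

Product_Review = Product_Review.iloc[:, 1:]   # keep all rows, drop only the first column

Product_Review.shape                          # the column count drops from 11 to 10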

Now, if you are aware of NumPy indexing methodologies, you might have noticed that they are quite similar to Pandas indexing by position. But unlike NumPy, each of the columns and rows in Pandas has a label. Yes, selecting by position is easy, but for large DataFrames, keeping track of columns and their corresponding positions becomes complicated. That's when our next method of indexing comes in handy.

Selecting by Label:

The second method is selecting by Label of the columns. .loc allows us to index using labels instead of positions. Let’s illustrate this method with some examples.

Example1:

(Selecting some rows of one column)

Display the first five rows of Product_Title using the .loc method. Note that, unlike iloc, loc includes the end label, so :4 selects row labels 0 through 4:

Product_Review.loc[:4, "Product_Title"]

Example2:

(selecting some rows of more than one column)

Display the first five rows of the columns Product_Title and Product_Rating.

Product_Review.loc[:4, ["Product_Title", "Product_Rating"]]

Sorting DataFrames with Pandas:

Apart from indexing, another very simple yet useful feature offered by Pandas is sorting of DataFrames. To get a clear idea of the sorting feature, let's look at the following examples.

To sort the DataFrame based on the values of a column, say we want to sort the Product_Rating column by its values:

Product_Review.sort_values(by='Product_Rating')

Now, if you want to sort the Product_Rating column by its values in descending order:

Product_Review.sort_values('Product_Rating', ascending=False)

Pandas DataFrame methods:

There are some special methods available in Pandas which make our calculations easier. Let's apply those methods to our Product_Review DataFrame.

1. Mean of all the columns in our DataFrame

Product_Review.mean()

2. Median of each column in our DataFrame

Product_Review.median()

3. Standard deviation of each column in our DataFrame.

Product_Review.std()

4. Maximum value of each column in our DataFrame.

Product_Review.max()

5. Minimum of each column in our DataFrame.

Product_Review.min()

6. Number of non-null values in each DataFrame column

Product_Review.count()

7. Summary statistics for numerical columns

Product_Review.describe()

Mathematical Operations in Pandas:

We can also perform mathematical operations on Series Objects or DataFrame objects.

For example, we can divide every value in the Product Rating column by 2.

Product_Review["Product_Rating"] / 2

Note: All the common mathematical operators that work in Python, like +, -, *, /, and ** (exponentiation), will also work on a DataFrame or a Series object.
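For instance, here are a few quick illustrations on the Product_Rating column (these lines are added for clarity and are not steps from the original analysis):

Product_Review["Product_Rating"] + 1    # add 1 to every rating

Product_Review["Product_Rating"] * 2    # double every rating

Product_Review["Product_Rating"] ** 2   # square every rating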

Filtering DataFrames:

Now that we have learnt about mathematical operations in Pandas, let’s have a look at the filtering methods available in Pandas and use them in our DataFrames.

Say we want to find all the Footwear products that have a Product Rating greater than 3.

First, let us generate a Boolean Series with our filtering condition and see the first 5 results.

filter1 = Product_Review["Product_Rating"] > 3

filter1.head()

Now that we have the Boolean Series, we use it to select only the rows of the DataFrame where the Series contains the value True, so that we get the rows in Product_Review where Product_Rating is greater than 3:

filtered_new = Product_Review[filter1]

filtered_new.head()

Let's make it a bit more complex by adding more than one condition. Since we want to look at Footwear with a Product Rating greater than 3, we will add our second condition on the Product_Category column of the DataFrame.

filter2 = (Product_Review["Product_Rating"] > 3) & (Product_Review["Product_Category"] == "Footwear")

filtered_review = Product_Review[filter2]

In the example shown above, we have seen filtering conditions with the AND Boolean operator (&). Similarly, the OR operator (|) can be applied when necessary, as in the sketch below.
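As a sketch of the OR case (the category name "Books" is only assumed here for illustration and may not appear in the dataset), the same pattern applies with | in place of &:

filter3 = (Product_Review["Product_Category"] == "Footwear") | (Product_Review["Product_Category"] == "Books")

Product_Review[filter3].head()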

Till now, we have learnt how to do data manipulation using the Pandas library. Pandas also offers data visualization features for a better understanding of the data. Let us see how data visualization with Pandas works.

Data Visualization using Pandas:

Data Visualization with Pandas is carried out in the following ways:

1. Histogram

2. Scatter Plot

Note: Call %matplotlib inline to set up plotting inside the Jupyter notebook.

1. Histogram:

%matplotlib inline

Product_Review[Product_Review["Product_Category"] == "Footwear"]["Product_Rating"].plot(kind="hist")

In the histogram shown above, we can see the frequency of Footwear products based on their Product Rating.

Analysis: So, let us analyze the histogram. It appears that Footwear with a Product Rating of 5 is the most frequent. Or, we can also say that Footwear with a low Product Rating is very small in number.

2. Scatter Plot:

Now we will have a look at the scatter plot of the Product Ratings based on the Product Launch Year.

Note: Here both the x and y columns need to be numeric.

Product_Review.plot.scatter(x="Product_Launch_Year", y="Product_Rating")

Analysis: From the scatter plot shown above, we can analyze how the Product Rating of the products launched in the year 2016 varies. It appears that low Product Ratings are less common, as the density near the low Product Rating values is lower.

This brings us to the end of the Python Pandas library tutorial. In this tutorial, we have learnt how different features of the Pandas library work and how to use them for a better understanding of our data.

Refer to our Pandas Cheat Sheet.

Yes, the Pandas library plays a very important role in Data Science and Data Analysis. To gain a deep understanding of Python libraries like Pandas, NumPy, etc., our Data Science with Python course is a must, as it not only covers the various techniques of how Python is deployed for Data Science but also works with various libraries for Data Cleaning, Data Manipulation, Data Analysis, and much more in depth.

Originally published at www.intellipaat.com on December 19, 2018.
