Python Pandas Guide - Learn Pandas For Data Analysis

Published in Edureka · 9 min read · Apr 5, 2018

In this blog, we will discuss data analysis using Pandas in Python. Before talking about Pandas, one must understand the concept of NumPy arrays. Why? Because Pandas is an open-source software library built on top of NumPy. In this Python Pandas tutorial, I will take you through the following topics, which will serve as fundamentals for the upcoming blogs:

1. What is Pandas?
2. Pandas Operations
2.1 Slicing the data frame
2.2 Merging & Joining
2.3 Concatenation
2.4 Changing the index
2.5 Change Column Headers
2.6 Data Munging
3. Use-Case: Analyze youth unemployment data

Let’s get started. :-)

What is Python Pandas?

Pandas is used for data manipulation, analysis, and cleaning. Python pandas is well suited for different kinds of data, such as:

Tabular data with heterogeneously-typed columns
Ordered and unordered time series data
Arbitrary matrix data with row & column labels
Unlabelled data
Any other form of observational or statistical data sets

How to install Pandas?

To install Python Pandas, go to your command line/terminal and type “pip install pandas”. If you have Anaconda installed on your system, type “conda install pandas” instead. Once the installation is completed, go to your IDE (Jupyter, PyCharm etc.) and import it by typing: “import pandas as pd”
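To confirm that the installation worked, a quick check like the one below can help. It is a minimal sketch that only assumes pandas is importable in the active environment:

```python
import pandas as pd

# Printing the installed version confirms that the import works
print(pd.__version__)
```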

Moving ahead in Python pandas tutorial, let’s take a look at some of its operations:

Python Pandas Operations

Using Python pandas, you can perform a lot of operations with series, data frames, missing data, group by etc. Some of the common operations for data manipulation are listed below:

Slicing the data frame
Merging & joining
Concatenation
Changing the index
Changing the column headers
Data munging

Now, let us understand all these operations one by one.

Slicing the Data Frame

In order to perform slicing on data, you need a data frame. Don’t worry, a data frame is simply a 2-dimensional data structure and the most common pandas object. So first, let’s create a data frame.

Refer to the code below for its implementation in PyCharm:

import pandas as pd

XYZ_web = {"Day": [1, 2, 3, 4, 5, 6],
           "Visitors": [1000, 700, 6000, 1000, 400, 350],
           "Bounce_Rate": [20, 20, 23, 15, 10, 34]}
df = pd.DataFrame(XYZ_web)
print(df)

Output:

   Bounce_Rate  Day  Visitors
0           20    1      1000
1           20    2       700
2           23    3      6000
3           15    4      1000
4           10    5       400
5           34    6       350

The code above converts a dictionary into a pandas data frame, with the index shown on the left. Now, let us slice the first two rows from this data frame using head(). Refer to the image below:

Slicing in Pandas - Python Pandas Tutorial

print(df.head(2))

Output:

   Bounce_Rate  Day  Visitors
0           20    1      1000
1           20    2       700

Similarly, if you want the last two rows of the data, type in the below command:

print(df.tail(2))

Output:

   Bounce_Rate  Day  Visitors
4           10    5       400
5           34    6       350
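Beyond head() and tail(), columns and arbitrary rows can be sliced as well. The sketch below reuses the same visitor data; the variable names (visitors, subset, middle_rows, busy_days) and the threshold of 900 are illustrative choices, not part of the original example:

```python
import pandas as pd

df = pd.DataFrame({"Day": [1, 2, 3, 4, 5, 6],
                   "Visitors": [1000, 700, 6000, 1000, 400, 350],
                   "Bounce_Rate": [20, 20, 23, 15, 10, 34]})

# Select a single column (returns a Series)
visitors = df["Visitors"]

# Select several columns (returns a DataFrame)
subset = df[["Day", "Visitors"]]

# Select rows by position: positions 1 and 2 (the end position is excluded)
middle_rows = df.iloc[1:3]

# Select rows by condition: days with more than 900 visitors
busy_days = df[df["Visitors"] > 900]

print(visitors)
print(subset)
print(middle_rows)
print(busy_days)
```

Square brackets with a column name pick columns, while iloc picks rows by position, so the two can be combined to cut out any rectangular part of the data frame.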

Next, in this Python Pandas tutorial, let us perform merging and joining.

Merging & Joining

In merging, you can merge two data frames to form a single data frame. You can also decide which columns you want to make common. Let me implement that practically: first I will create two data frames, each holding some key-value pairs, and then merge them together. Refer to the code below:

Code:

import pandas as pd

df1 = pd.DataFrame({"HPI": [80, 90, 70, 60], "Int_Rate": [2, 1, 2, 3], "IND_GDP": [50, 45, 45, 67]},
                   index=[2001, 2002, 2003, 2004])
df2 = pd.DataFrame({"HPI": [80, 90, 70, 60], "Int_Rate": [2, 1, 2, 3], "IND_GDP": [50, 45, 45, 67]},
                   index=[2005, 2006, 2007, 2008])
merged = pd.merge(df1, df2)
print(merged)

Output:

   HPI  IND_GDP  Int_Rate
0   80       50         2
1   90       45         1
2   70       45         2
3   60       67         3

As you can see above, the two data frames have merged into a single data frame. Now, you can also specify the column which you want to make common. For example, I want the “HPI” column to be common and for everything else, I want separate columns. So, let me implement that practically:

df1 = pd.DataFrame({"HPI": [80, 90, 70, 60], "Int_Rate": [2, 1, 2, 3], "IND_GDP": [50, 45, 45, 67]},
                   index=[2001, 2002, 2003, 2004])
df2 = pd.DataFrame({"HPI": [80, 90, 70, 60], "Int_Rate": [2, 1, 2, 3], "IND_GDP": [50, 45, 45, 67]},
                   index=[2005, 2006, 2007, 2008])
merged = pd.merge(df1, df2, on="HPI")
print(merged)

Output:

   HPI  IND_GDP_x  Int_Rate_x  IND_GDP_y  Int_Rate_y
0   80         50           2         50           2
1   90         45           1         45           1
2   70         45           2         45           2
3   60         67           3         67           3
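When only “HPI” is made common, the remaining shared columns appear once per source data frame, and pandas tells them apart with a suffix. The sketch below makes that explicit; the suffix names "_df1" and "_df2" are illustrative choices, not something the original example uses:

```python
import pandas as pd

df1 = pd.DataFrame({"HPI": [80, 90, 70, 60], "Int_Rate": [2, 1, 2, 3], "IND_GDP": [50, 45, 45, 67]},
                   index=[2001, 2002, 2003, 2004])
df2 = pd.DataFrame({"HPI": [80, 90, 70, 60], "Int_Rate": [2, 1, 2, 3], "IND_GDP": [50, 45, 45, 67]},
                   index=[2005, 2006, 2007, 2008])

# Merge on "HPI" only; the other shared columns get one suffix per source frame
merged = pd.merge(df1, df2, on="HPI", suffixes=("_df1", "_df2"))
print(merged)

# merge() matches on column values, so the year indexes are not carried over
print(merged.index.tolist())
```

Note that the result is numbered from 0 rather than by year: a column-based merge discards the original indexes of both inputs.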

Next, let us understand joining. It is yet another convenient method to combine two differently indexed data frames into a single result data frame. This is quite similar to the “merge” operation, except that the joining operation works on the “index” instead of the “columns”. Let us implement it practically.

df1 = pd.DataFrame({"Int_Rate": [2, 1, 2, 3], "IND_GDP": [50, 45, 45, 67]},
                   index=[2001, 2002, 2003, 2004])
df2 = pd.DataFrame({"Low_Tier_HPI": [50, 45, 67, 34], "Unemployment": [1, 3, 5, 6]},
                   index=[2001, 2003, 2004, 2004])
joined = df1.join(df2)
print(joined)

Output:

      IND_GDP  Int_Rate  Low_Tier_HPI  Unemployment
2001       50         2          50.0           1.0
2002       45         1           NaN           NaN
2003       45         2          45.0           3.0
2004       67         3          67.0           5.0
2004       67         3          34.0           6.0

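As you can notice in the above output, for the year 2002 (index) there is no value attached to the columns “Low_Tier_HPI” and “Unemployment”, because 2002 is missing from the second data frame's index; therefore it has printed NaN (Not a Number). The year 2004 appears twice in the second data frame's index, so the 2004 row of the first data frame is repeated once for each matching row, with the respective values.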
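By default, join() keeps every row of the calling data frame. If the NaN rows are not wanted, the how parameter of join() can restrict the result to index values present in both data frames. A minimal sketch, reusing the data from above (the variable names left_joined and inner_joined are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({"Int_Rate": [2, 1, 2, 3], "IND_GDP": [50, 45, 45, 67]},
                   index=[2001, 2002, 2003, 2004])
df2 = pd.DataFrame({"Low_Tier_HPI": [50, 45, 67, 34], "Unemployment": [1, 3, 5, 6]},
                   index=[2001, 2003, 2004, 2004])

# Default (left) join: every row of df1 is kept, unmatched values become NaN
left_joined = df1.join(df2)

# Inner join: only index values present in both data frames are kept
inner_joined = df1.join(df2, how="inner")

print(left_joined)
print(inner_joined)
```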

Moving ahead in Python pandas tutorial, let us understand how to concatenate two data frames.

Concatenation

Concatenation basically glues the data frames together. You can select the dimension on which you want to concatenate. For that, just use “pd.concat” and pass in the list of data frames to concatenate together. Consider the below example.

df1 = pd.DataFrame({"HPI": [80, 90, 70, 60], "Int_Rate": [2, 1, 2, 3], "IND_GDP": [50, 45, 45, 67]},
                   index=[2001, 2002, 2003, 2004])
df2 = pd.DataFrame({"HPI": [80, 90, 70, 60], "Int_Rate": [2, 1, 2, 3], "IND_GDP": [50, 45, 45, 67]},
                   index=[2005, 2006, 2007, 2008])
concat = pd.concat([df1, df2])
print(concat)

Output:

      HPI  IND_GDP  Int_Rate
2001   80       50         2
2002   90       45         1
2003   70       45         2
2004   60       67         3
2005   80       50         2
2006   90       45         1
2007   70       45         2
2008   60       67         3

As you can see above, the two data frames are glued together into a single data frame, where the index runs from 2001 all the way up to 2008. You can also specify axis=1 in order to concatenate along the columns instead of the rows. Refer to the code below:

df1 = pd.DataFrame({"HPI": [80, 90, 70, 60], "Int_Rate": [2, 1, 2, 3], "IND_GDP": [50, 45, 45, 67]},
                   index=[2001, 2002, 2003, 2004])
df2 = pd.DataFrame({"HPI": [80, 90, 70, 60], "Int_Rate": [2, 1, 2, 3], "IND_GDP": [50, 45, 45, 67]},
                   index=[2005, 2006, 2007, 2008])
concat = pd.concat([df1, df2], axis=1)
print(concat)

Output:

       HPI  IND_GDP  Int_Rate   HPI  IND_GDP  Int_Rate
2001  80.0     50.0       2.0   NaN      NaN       NaN
2002  90.0     45.0       1.0   NaN      NaN       NaN
2003  70.0     45.0       2.0   NaN      NaN       NaN
2004  60.0     67.0       3.0   NaN      NaN       NaN
2005   NaN      NaN       NaN  80.0     50.0       2.0
2006   NaN      NaN       NaN  90.0     45.0       1.0
2007   NaN      NaN       NaN  70.0     45.0       2.0
2008   NaN      NaN       NaN  60.0     67.0       3.0

As you can see above, there are a bunch of missing values. This happens because the two data frames share no index values: each one has no data for the years that appear only in the other. Therefore, you should make sure that the indexes line up correctly when you join or concatenate along the columns.
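To illustrate the point about indexes lining up, the sketch below concatenates two data frames that share the same index, so no NaN values appear. The second data frame here is a hypothetical variation of the example data, and ignore_index=True is shown as one way to renumber rows when stacking:

```python
import pandas as pd

df1 = pd.DataFrame({"HPI": [80, 90, 70, 60]}, index=[2001, 2002, 2003, 2004])
# Hypothetical second frame that shares df1's index
df2 = pd.DataFrame({"Int_Rate": [2, 1, 2, 3]}, index=[2001, 2002, 2003, 2004])

# The indexes match, so concatenating along the columns leaves no gaps
side_by_side = pd.concat([df1, df2], axis=1)
print(side_by_side)

# When stacking rows, ignore_index=True renumbers the result from 0
stacked = pd.concat([df1, df1], ignore_index=True)
print(stacked)
```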

Change the index

Next, in this Python pandas tutorial, we’ll understand how to change the index values in a data frame. For example, let us create a data frame with some key-value pairs in a dictionary and make one of its columns the index. Consider the example below:

import pandas as pd

df = pd.DataFrame({"Day": [1, 2, 3, 4], "Visitors": [200, 100, 230, 300], "Bounce_Rate": [20, 45, 60, 10]})
df.set_index("Day", inplace=True)
print(df)

Output:

     Bounce_Rate  Visitors
Day
1             20       200
2             45       100
3             60       230
4             10       300

As you can notice in the output above, the index value has been changed with respect to the “Day” column.
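Changing the index does not have to modify the data frame in place. A minimal sketch of the alternative is given below: set_index() returns a new data frame when inplace is not used, and reset_index() undoes the change. The variable names indexed and restored are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"Day": [1, 2, 3, 4], "Visitors": [200, 100, 230, 300], "Bounce_Rate": [20, 45, 60, 10]})

# Without inplace=True, set_index returns a new data frame and leaves df untouched
indexed = df.set_index("Day")

# Rows can now be looked up by their "Day" label
print(indexed.loc[3])

# reset_index moves "Day" back into an ordinary column
restored = indexed.reset_index()
print(restored)
```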

Change the Column Headers

Let us now change the headers of the column in this python pandas tutorial. Let us take the same example, where I will change the column header from “Visitors” to “Users”. So, let me implement it practically.

import pandas as pd

df = pd.DataFrame({"Day": [1, 2, 3, 4], "Visitors": [200, 100, 230, 300], "Bounce_Rate": [20, 45, 60, 10]})
df = df.rename(columns={"Visitors": "Users"})
print(df)

Output:

   Bounce_Rate  Day  Users
0           20    1    200
1           45    2    100
2           60    3    230
3           10    4    300

As you see above, column header “Visitors” has been changed to “Users”. Next, in python pandas tutorial, let us perform data munging.
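The same rename() call can change several headers at once, or apply a function to every header. The sketch below shows both; the new name “Bounce” and the choice of lower-casing are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({"Day": [1, 2, 3, 4], "Visitors": [200, 100, 230, 300], "Bounce_Rate": [20, 45, 60, 10]})

# Rename several columns at once with a mapping of old name to new name
renamed = df.rename(columns={"Visitors": "Users", "Bounce_Rate": "Bounce"})
print(renamed)

# Or apply a function to every header, here lower-casing them
lowered = df.rename(columns=str.lower)
print(lowered)
```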

Data Munging

In data munging, you convert data from one format into another. For example, if you have a .csv file, you can convert it into .html or any other data format as well. So, let me implement this practically.

import pandas as pd

country = pd.read_csv("D:\\Users\\Aayushi\\Downloads\\world-bank-youth-unemployment\\API_ILO_country_YU.csv", index_col=0)
country.to_html('edu.html')

Once you run this code, an HTML file named “edu.html” will be created. You can copy the path of the file and paste it into your browser, which displays the data in HTML format. Refer to the screenshot below:

Output in HTML Format - Python Pandas Tutorial
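The same conversion idea can be tried without any file on disk. The sketch below uses a small data frame whose values are made up for illustration; when no path is passed, to_html(), to_csv() and to_json() return the converted text as a string:

```python
import pandas as pd

# Small illustrative data frame; the values are made up for this example
df = pd.DataFrame({"Day": [1, 2], "Visitors": [200, 100]})

# With no file path given, each method returns the converted text as a string
html_text = df.to_html()
csv_text = df.to_csv(index=False)
json_text = df.to_json(orient="records")

print(html_text)
print(csv_text)
print(json_text)
```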

Next, in python pandas tutorial, let’s have a look at a use-case which talks about the global youth unemployment.

Use Case to Analyze Youth Unemployment Data

Problem Statement:

You are given a dataset which comprises the percentage of unemployed youth globally from 2010 to 2014. You have to use this dataset and find the change in the percentage of unemployed youth for every country from 2010 to 2011.

First, let us understand the dataset, which contains the columns Country Name, Country Code and one column per year from 2010 to 2014. Using pandas, we will read the .csv file with “pd.read_csv”.
Refer to the screenshot below:

Snapshot of csv file - Python Pandas Tutorial

Let us move ahead and perform the data analysis, in which we are going to find the change in the percentage of unemployed youth between 2010 and 2011. Then we will visualize the result using Matplotlib, which is a powerful library for visualization in Python. It can be used in Python scripts, the Python shell, web application servers, and GUI toolkits.

Now, let us implement the code in PyCharm:

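import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style

style.use('fivethirtyeight')

# Read the dataset, using the first column as the index
country = pd.read_csv("D:\\Users\\Aayushi\\Downloads\\world-bank-youth-unemployment\\API_ILO_country_YU.csv", index_col=0)
# Work with the first five rows, indexed by country code
df = country.head(5)
df = df.set_index(["Country Code"])
# Keep only the 2010 and 2011 columns
sd = df.reindex(columns=['2010', '2011'])
# Difference along the columns: the '2011' column holds 2011 minus 2010
db = sd.diff(axis=1)
db.plot(kind="bar")
plt.show()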

As you can see above, I have performed the analysis on the top 5 rows of the country data frame. Next, I have set the index to “Country Code” and re-indexed the columns to keep only 2010 and 2011, which gives the data frame sd. Then, we have one more data frame, db, which holds the difference between the two columns, i.e. the change in the percentage of unemployed youth from 2010 to 2011, measured in percentage points. Finally, I have plotted a bar plot using the Matplotlib library in Python.

In the resulting plot, you can notice that in Afghanistan (AFG) there has been a rise in unemployed youth of approx. 0.25 percentage points between 2010 and 2011. In Angola (AGO), there is a negative trend, which means that the percentage of unemployed youth has gone down. Similarly, you can perform analysis on different sets of data.
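The difference computed by diff(axis=1) is a change in percentage points, which is not the same as a relative (percent) change. The sketch below contrasts the two on a tiny data frame; the country codes "AAA" and "BBB" and all the numbers are made up for illustration and do not come from the World Bank dataset:

```python
import pandas as pd

# Made-up unemployment percentages for two hypothetical country codes
sd = pd.DataFrame({"2010": [20.0, 10.0], "2011": [25.0, 5.0]},
                  index=["AAA", "BBB"])

# diff(axis=1) subtracts each column from the one to its right;
# the first column has nothing to its left, so it becomes NaN
diffed = sd.diff(axis=1)

# Change in percentage points: 2011 minus 2010
point_change = sd["2011"] - sd["2010"]

# Relative change in percent: the point change divided by the 2010 value
relative_change = point_change / sd["2010"] * 100

print(diffed)
print(point_change)
print(relative_change)
```

Which of the two measures is appropriate depends on the question being asked; the use case above plots the first one.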

So, folks, with this we come to an end to this article. If you wish to check out more articles on the market’s most trending technologies like Artificial Intelligence, DevOps, Ethical Hacking, then you can refer to Edureka’s official site.

Do look out for other articles in this series which will explain the various other aspects of Python and Data Science.

1. Python Tutorial
2. Python Programming Language
3. Python Functions
4. File Handling in Python
5. Python Numpy Tutorial
6. Scikit Learn Machine Learning
7. Matplotlib Tutorial
8. Tkinter Tutorial
9. Requests Tutorial
10. PyGame Tutorial
11. OpenCV Tutorial
12. Web Scraping With Python
13. PyCharm Tutorial
14. Machine Learning Tutorial
15. Linear Regression Algorithm from scratch in Python
16. Python for Data Science
17. Python Regex
18. Loops in Python
19. Python Projects
20. Machine Learning Projects
21. Arrays in Python
22. Sets in Python
23. Multithreading in Python
24. Python Interview Questions
25. Java vs Python
26. How To Become A Python Developer?
27. Python Lambda Functions
28. How Netflix uses Python?
29. What is Socket Programming in Python
30. Python Database Connection
31. Golang vs Python
32. Python Seaborn Tutorial
33. Python Career Opportunities

Originally published at www.edureka.co on April 5, 2018.

Written by Aayushi Johari