Python Pandas for Beginners — A Complete Guide (Part 2)

Python Geeks
Oct 25 · 9 min read

In the previous post of the series, we covered the basic concepts of Pandas, such as what Pandas is, Series, and DataFrame. If you don't remember, click here to get back to Part 1.

In this article, we continue learning Python Pandas. We will learn how to read a DataFrame from a file and the most important Pandas operators for beginners.

How to read and write DataFrame

It is quite simple to read data with the support of the Pandas library. We can load data from various sources such as CSV, JSON, or Excel files. For learning purposes, we will try loading data from all of these sources.

JSON file

Assume that we have a JSON file containing product inventory data as below.

{
"CAN": {
"Headphone": 1,
"Laptop": 5
},
"SGN": {
"Headphone": 3,
"Laptop": 8
},
"SIN": {
"Headphone": 0,
"Laptop": 5
}
}

Then we can read the JSON file easily with the read_json function of Pandas.

import pandas as pd
data = pd.read_json("data.json")
print(data)
# Result:
#            CAN  SGN  SIN
# Headphone    1    3    0
# Laptop       5    8    5

Writing data to JSON file

data.to_json("exported_data.json")
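If you need a different JSON layout, to_json accepts an orient parameter; 'columns' (the default for a DataFrame) mirrors the nested structure we read in, while 'records' emits one object per row. A small sketch on a throwaway frame matching the inventory data:

```python
import pandas as pd

data = pd.DataFrame({'CAN': [1, 5], 'SGN': [3, 8]},
                    index=['Headphone', 'Laptop'])

# Default orient='columns' mirrors the structure of the file we read in
print(data.to_json())

# orient='records' produces a list of row objects instead (the index is dropped)
print(data.to_json(orient='records'))
```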

CSV file

We use the same demo data as in the previous example, but in CSV format. The data in the file is as below:

,CAN,SGN,SIN
headphone,1,3,0
laptop,5,8,5

Loading data into a DataFrame using read_csv:

import pandas as pd
data = pd.read_csv("data.csv")
print(data)
# Result:
# Unnamed: 0 CAN SGN SIN
# 0 headphone 1 3 0
# 1 laptop 5 8 5

In the above code, read_csv generates an index column by default. However, we want the first column (headphone, laptop) to be the index. So we pass the index_col parameter to read_csv to tell it which column to use as the index.

import pandas as pd
data = pd.read_csv("data.csv", index_col=0)
print(data)
# Result:
# CAN SGN SIN
# headphone 1 3 0
# laptop 5 8 5

Writing DataFrame to CSV file

data.to_csv("exported_data.csv")
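One detail worth knowing: to_csv writes the index as the first column by default, which is why the exported file reproduces the unnamed first column we saw above. Passing index=False omits it. A quick sketch on a small throwaway frame (when to_csv is called without a path, it returns the CSV text, which makes the difference easy to inspect):

```python
import pandas as pd

data = pd.DataFrame({'CAN': [1, 5], 'SGN': [3, 8]},
                    index=['headphone', 'laptop'])

# With the default index=True, the index labels become the first CSV column
with_index = data.to_csv()

# index=False drops them, leaving only the data columns
without_index = data.to_csv(index=False)
print(without_index.splitlines()[0])  # 'CAN,SGN'
```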

Excel file

Reading data from an Excel file using the read_excel function:

import pandas as pd
data = pd.read_excel("data.xlsx", index_col=0)
print(data)
# Result:
# CAN SGN SIN
# headphone 1 3 0
# laptop 5 8 5

Writing a DataFrame to an Excel file:

data.to_excel('exported_data.xlsx', sheet_name='Sheet1')

Important Pandas Operators

Before learning the Pandas operators, we will create a DataFrame to use in the examples of this section.

import pandas as pd
import numpy as np
dates = pd.date_range('20190101', periods=50)
data = pd.DataFrame(np.random.randn(50, 4), index=dates, columns=list('ABCD'))
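As a quick sanity check on what the constructor above gives us: date_range builds a DatetimeIndex of 50 consecutive days, so the frame runs from 2019-01-01 to 2019-02-19 (which matches the .tail() output later in this post).

```python
import pandas as pd
import numpy as np

dates = pd.date_range('20190101', periods=50)
data = pd.DataFrame(np.random.randn(50, 4), index=dates, columns=list('ABCD'))

# 50 daily timestamps starting at 2019-01-01
print(data.shape)       # (50, 4)
print(data.index[0])    # 2019-01-01 00:00:00
print(data.index[-1])   # 2019-02-19 00:00:00
```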

Viewing your data

Now we have a DataFrame with 50 rows. That is too big to print. In real-life situations, the number of rows will often be many times larger. Therefore, the first thing we should know is how to print out a few rows as a visual reference.

We will use .head() to display the first 5 rows of the data:

data.head()
# Result:
# A B C D
# 2019-01-01 -1.005372 0.142613 -0.181516 1.036709
# 2019-01-02 0.790087 0.294033 -0.602744 1.035578
# 2019-01-03 -1.703856 0.126258 1.080593 -0.421066
# 2019-01-04 -0.558818 -2.923537 -1.721127 -0.275644
# 2019-01-05 -1.408334 -0.860980 0.052589 1.104063

What do we do if we would like to print the first 10 rows instead of 5? In this case, we can pass a number to .head() to define the number of rows to print. For example:

data.head(10) # Result: 
# A B C D
# 2019-01-01 -1.005372 0.142613 -0.181516 1.036709
# 2019-01-02 0.790087 0.294033 -0.602744 1.035578
# 2019-01-03 -1.703856 0.126258 1.080593 -0.421066
# 2019-01-04 -0.558818 -2.923537 -1.721127 -0.275644
# 2019-01-05 -1.408334 -0.860980 0.052589 1.104063
# 2019-01-06 0.418509 0.607834 0.017931 0.748909
# 2019-01-07 0.336740 -0.406930 1.420234 -1.702862
# 2019-01-08 0.739592 1.051292 -0.757623 -1.156324
# 2019-01-09 -0.225760 0.416810 0.128996 -1.450296
# 2019-01-10 0.527811 1.771893 -0.551995 1.101953

We can use .tail() to get the last 5 rows of the data. As with .head(), we can pass a number to determine how many rows will be printed.

data.tail(7) # Result: 
# A B C D
# 2019-02-13 0.834528 2.912336 -0.957908 0.758701
# 2019-02-14 -0.866577 -0.886605 -0.339376 1.296223
# 2019-02-15 0.324452 -1.030220 0.854473 1.471936
# 2019-02-16 2.657040 -1.169546 -1.746896 -0.745877
# 2019-02-17 1.494073 -0.709933 -0.086347 -0.512125
# 2019-02-18 -1.455421 -0.370378 1.475331 -0.867604
# 2019-02-19 -1.542814 0.355690 -0.705522 0.069457

Showing data info

.info() will show you summary information about the data:

data.info()
# Result
# <class 'pandas.core.frame.DataFrame'>
# DatetimeIndex: 50 entries, 2019-01-01 to 2019-02-19
# Freq: D
# Data columns (total 4 columns):
# A    50 non-null float64
# B    50 non-null float64
# C    50 non-null float64
# D    50 non-null float64
# dtypes: float64(4)
# memory usage: 2.0 KB
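Besides .info(), a couple of quick attributes are handy for the same purpose; a small sketch on a throwaway frame built the same way as above:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(50, 4), columns=list('ABCD'))

# .shape gives (rows, columns) without printing any data
print(data.shape)         # (50, 4)

# .dtypes lists the type of each column
print(data.dtypes['A'])   # float64
```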

DataFrame selection, addition, and deletion

So far, we have learned about the Pandas data structures and how to create, load, and write a DataFrame. In this section, we will learn the selection, addition, and deletion methods that you'll need to use constantly.

By Column

The simplest way to select a column of data is by using brackets. For example:

data['A'].head() # Result 
# 2019-01-01 -1.005372
# 2019-01-02 0.790087
# 2019-01-03 -1.703856
# 2019-01-04 -0.558818
# 2019-01-05 -1.408334
# Freq: D, Name: A, dtype: float64

We can select multiple columns inside the brackets.

data[['A', 'D']].head()
# Result
# A D
# 2019-01-01 -1.005372 1.036709
# 2019-01-02 0.790087 1.035578
# 2019-01-03 -1.703856 -0.421066
# 2019-01-04 -0.558818 -0.275644
# 2019-01-05 -1.408334 1.104063

By Row

Another choice for indexing data is using .loc (label-based) and .iloc (integer position-based).

If we pass only one indexer to .loc or .iloc, it is applied to the rows by default. Now we try to get the row labeled "2019-02-19":

data.loc['2019-02-19'] # Result 
# A -1.542814
# B 0.355690
# C -0.705522
# D 0.069457
# Name: 2019-02-19 00:00:00, dtype: float64

Now we will select only specific columns in the result:

data.loc['2019-02-19', ['A', 'B']] # Result 
# A -1.542814
# B 0.355690
# Name: 2019-02-19 00:00:00, dtype: float64

Because the row labels are timestamps, we can also select rows by time range.

data.loc['2019-02-02':'2019-02-05'] # Result 
# A B C D
# 2019-02-02 1.343458 -0.940805 -0.671073 -0.102026
# 2019-02-03 0.815890 1.190464 -0.479341 -0.105754
# 2019-02-04 -0.256803 0.595940 0.279049 2.010431
# 2019-02-05 0.311328 -0.111790 -0.515169 -0.014552

Similar to .loc, we can use .iloc to select one or more specific index positions. Below is an example of selecting multiple index positions:

data.iloc[[1,3],:] # Result 
# A B C D
# 2019-01-02 0.790087 0.294033 -0.602744 1.035578
# 2019-01-04 -0.558818 -2.923537 -1.721127 -0.275644

Or selecting a slice of the indexes:

data.iloc[10:13] # Result 
# A B C D
# 2019-01-11 0.530191 -1.301325 -0.050186 0.267728
# 2019-01-12 -0.652778 -0.611594 0.291840 -0.935883
# 2019-01-13 1.602400 -0.137409 1.002766 0.820419

Be aware that we’re slicing [10:13] but what we get is data from the 10th row to the 12th row. The 13th row is not in the result.

Slicing with .iloc follows the same rule as slicing Python lists: the object at the last index is not included in the result.
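Label-based slicing with .loc behaves differently: it includes both endpoints. A minimal sketch of the difference, using a small throwaway frame with string labels rather than the dates example:

```python
import pandas as pd

df = pd.DataFrame({'A': range(5)}, index=list('abcde'))

# Positional slicing excludes the stop position, like Python lists
print(len(df.iloc[1:3]))     # 2 rows: positions 1 and 2

# Label-based slicing includes BOTH endpoints
print(len(df.loc['b':'d']))  # 3 rows: 'b', 'c' and 'd'
```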

Selection by callable

.loc, .iloc, and [] indexing can accept a callable as well. The callable must take a single argument (the DataFrame being indexed) and return valid output for indexing. In practice, Python developers usually use a lambda. You can refer to our article for more detail about the lambda function (Python Lambda Function).

data.loc[lambda df: df.D > 0] # Result: 
# A B C D
# 2019-01-01 -0.862970 1.312603 -0.427154 1.222134
# 2019-01-03 0.173703 -0.258924 -0.014464 0.320602
# 2019-01-04 0.252102 -1.914325 0.649628 0.885115
# 2019-01-06 -0.461437 0.063366 0.569461 0.427437
# 2019-01-08 -0.956210 0.067605 -2.273661 0.833149
# 2019-01-09 0.340620 -1.976085 -1.144538 0.425282
# 2019-01-11 0.530191 -1.301325 -0.050186 0.267728
# 2019-01-13 1.602400 -0.137409 1.002766 0.820419
# 2019-01-15 -1.083804 1.790858 -0.301093 0.674832
# 2019-01-16 -0.143779 2.104730 -1.228123 1.568482
# 2019-01-22 -0.821942 1.381137 -1.812166 0.785756
# 2019-01-23 -1.404428 0.563167 2.138703 2.387186
# 2019-01-28 0.271000 -0.633658 -0.839952 0.816997
# 2019-01-29 -0.259307 1.638976 -0.648043 0.715521
# 2019-02-01 0.109348 -0.624031 -0.283261 0.138264
# 2019-02-03 -1.319024 -0.131162 2.011317 1.312116
# 2019-02-04 0.910435 -0.636221 -0.857388 0.456990

Handling duplicated data

Handling duplicated data is a common task when working with large data sets. We could iterate over each row and check whether it duplicates another, but that would be very slow and complicated.

Fortunately, the DataFrame class provides functions to help developers handle duplicated data easily. We can find duplicated data either by row or by column. Now we will see how they work.

  • duplicated returns a Boolean Series with True at each duplicated row.
  • drop_duplicates removes duplicate rows.

By default, the first observed row is considered unique. However, each method has a keep parameter to specify which rows are kept.

  • keep='first': the default value; mark rows as duplicates except for the first occurrence.
  • keep='last': mark rows as duplicates except for the last occurrence.
  • keep=False: mark all duplicated rows.

Before doing the examples, we are going to create the data below:

import pandas as pd
employees = [
('Mark', 34, 'Toronto'),
('Kana', 30, 'Delhi'),
('Tam', 26, 'Ha Noi'),
('Kana', 30, 'Delhi'),
('Kana', 30, 'Delhi'),
('Kana', 30, 'Delhi'),
('Hashima', 40, 'London'),
('Rook', 30, 'Delhi')
]
data = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])
print(data)
# Name Age City
# 0 Mark 34 Toronto
# 1 Kana 30 Delhi
# 2 Tam 26 Ha Noi
# 3 Kana 30 Delhi
# 4 Kana 30 Delhi
# 5 Kana 30 Delhi
# 6 Hashima 40 London
# 7 Rook 30 Delhi

By Row

Finding the duplicated rows:

# The same with data.duplicated(keep='first')
data.duplicated()
# Result
# 0 False
# 1 False
# 2 False
# 3 True
# 4 True
# 5 True
# 6 False
# 7 False
# dtype: bool

The result is a Series in which each row is marked as duplicated or not. If we use the result for selecting data, we will get the duplicated rows.

data[data.duplicated()] # Result 
# Name Age City
# 3 Kana 30 Delhi
# 4 Kana 30 Delhi
# 5 Kana 30 Delhi

Dropping the duplicated rows, keeping the first observed row by default:

data.drop_duplicates() # Result 
# Name Age City
# 0 Mark 34 Toronto
# 1 Kana 30 Delhi
# 2 Tam 26 Ha Noi
# 6 Hashima 40 London
# 7 Rook 30 Delhi

Now let's see what the difference is with keep='last':

data.drop_duplicates(keep='last')
# Result
# Name Age City
# 0 Mark 34 Toronto
# 2 Tam 26 Ha Noi
# 5 Kana 30 Delhi
# 6 Hashima 40 London
# 7 Rook 30 Delhi

And now keep=False

data.drop_duplicates(keep=False) # Result 
# Name Age City
# 0 Mark 34 Toronto
# 2 Tam 26 Ha Noi
# 6 Hashima 40 London
# 7 Rook 30 Delhi

By Column

We are able to select specific columns to determine whether data is duplicated:

data[data.duplicated(['Age', 'City'])] # Result 
# Name Age City
# 3 Kana 30 Delhi
# 4 Kana 30 Delhi
# 5 Kana 30 Delhi
# 7 Rook 30 Delhi

Other examples of drop_duplicates with selected columns are available in our GitHub. Check it for more detail if you need it.
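As a quick sketch of the idea, drop_duplicates accepts the same column selection through its subset parameter (shown here on a slightly trimmed version of the employees data above):

```python
import pandas as pd

employees = [
    ('Mark', 34, 'Toronto'),
    ('Kana', 30, 'Delhi'),
    ('Tam', 26, 'Ha Noi'),
    ('Kana', 30, 'Delhi'),
    ('Hashima', 40, 'London'),
    ('Rook', 30, 'Delhi'),
]
data = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# Drop rows that share the same Age and City, keeping the first occurrence.
# 'Rook' is removed too, because (30, 'Delhi') already appeared with 'Kana'.
deduped = data.drop_duplicates(subset=['Age', 'City'])
print(deduped['Name'].tolist())  # ['Mark', 'Kana', 'Tam', 'Hashima']
```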

Applying Function

Now, for example, we would like to get the employees who are over 30 years old. Don't iterate over each employee to check whether they are over 30. Instead, take advantage of the Pandas apply() function.

data[data['Age'].apply(lambda x: x > 30)]
# Result 
# Name Age City
# 0 Mark 34 Toronto
# 6 Hashima 40 London
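For a simple comparison like this, the same filter can also be written with a plain boolean mask, which is shorter and avoids calling a Python function per element. A quick sketch on data shaped like the employees frame above:

```python
import pandas as pd

data = pd.DataFrame({
    'Name': ['Mark', 'Kana', 'Tam', 'Hashima', 'Rook'],
    'Age': [34, 30, 26, 40, 30],
})

# The vectorized comparison produces the same boolean Series as the lambda
over_30 = data[data['Age'] > 30]
print(over_30['Name'].tolist())  # ['Mark', 'Hashima']
```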

Summary

Exploring, cleaning, transforming, and visualizing data with Pandas in Python is an essential skill in data science. This series has covered only very basic knowledge of Python Pandas for beginners. Find full examples from this series in our GitHub project.

If you found useful information in this series, please share it with others, leave a comment, and subscribe to get notifications of future posts. We really appreciate your taking the time to read the post and share it with other Python Geeks.

Originally published at https://pythongeeks.net on October 25, 2019.
