Pandas Library

Published in

Analytics Vidhya

10 min readJun 23, 2021

Pandas is one of the most popular python library used for working with data of all types. It was first released in January of 2008 by Wes McKinney. Pandas library is open source and has a huge community backing it. It is written in C and Cython which makes it very fast and optimized for performance with the features of high-level languages like simpler syntax.

Why Pandas?

There is mainly two feature offered by pandas that makes it very compelling to learn it for any developer. The features are as follows: It allows reading and working on data from a wide range of file formats like CSV, JSON, HTML, XML, MS Excel, SQL, HDF5, etc. Because it being an open-source library with a huge community, it provides many methods for data filtering, manipulating, merging, grouping and visualizing. Pandas also have many methods for doing statistical analysis.

Installing Pandas

We can install pandas library just like any other python library. Use any of the following commands to install pandas:

Anaconda users can use the following command:

Importing Pandas

After installing pandas, the first thing we need to do is to import it. The following command can be used:

import pandas as pd

Pandas can be imported as anything but it is most common to import it as ‘pd’.

Data Structures offered by Pandas

To master the basics of pandas library first you need to get familiar with two of the data structures it provides.

Series
DataFrame

Series

A series is basically a one-dimensional array that stores a sequence of data. Data stored in the panda series is associated with data labels called index.

Let’s start by defining our own series in pandas

my_series = pd.Series([-2, 0, 2,4,6,8,10])
my_series0    -2
1     0
2     2
3     4
4     6
5     8
6    10
dtype: int64

Let's check the data type

type(my_series)pandas.core.series.Seriesprint("Data values: ", my_series.values, "and data index: ", my_series.index)Data values:  [-2  0  2  4  6  8 10] and data index:  RangeIndex(start=0, stop=7, step=1)

Since we did not provide any index for this series, pandas has by default assigned index from 0 to N-1 where N is the length of data.

We can explicitly define the index as follows:

my_series2 = pd.Series([0,1,2,3,4,5,6], index=['a','b','c', True, True, 2, 3])

The index for a series can be of any datatype.

print("Data values: ", my_series2.values, "and data index: ", my_series2.index)Data values:  [0 1 2 3 4 5 6] and data index:  Index(['a', 'b', 'c', True, True, 2, 3], dtype='object')

Indexes can be used to select one or a set of data from series.

my_series2['a']0my_series2[True]True    3
True    4
dtype: int64

A list of indexes must be provided while selecting a set of values

my_series2[['b', True, 3]]b       1
True    3
True    4
3       6
dtype: int64

Operations on pandas series

numbers = pd.Series(range(10,20))
numbers0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int64

Data filtering

numbers[numbers % 2 == 0]0    10
2    12
4    14
6    16
8    18
dtype: int64

Scalar manipulation

numbers + 1, numbers *2(0    11
 1    12
 2    13
 3    14
 4    15
 5    16
 6    17
 7    18
 8    19
 9    20
 dtype: int64,
 0    20
 1    22
 2    24
 3    26
 4    28
 5    30
 6    32
 7    34
 8    36
 9    38
 dtype: int64)

Mathematical and custom functions

Pandas support many methods provided by the NumPy library

import numpy as np   
np.sin(numbers)0   -0.544021
1   -0.999990
2   -0.536573
3    0.420167
4    0.990607
5    0.650288
6   -0.287903
7   -0.961397
8   -0.750987
9    0.149877
dtype: float64

We can define our own functions and use the ‘.apply’ method provided by pandas to use our custom function on the pandas series

def power2(ds):
    return ds * dsnumbers.apply(power2)0    100
1    121
2    144
3    169
4    196
5    225
6    256
7    289
8    324
9    361
dtype: int64

Some other useful functions

numbers['null'] = None  # Adding None value and index null to the existing series

Pandas function .isnull() and notnull() can be used to find missing datas in series.

This is another important feature of the pandas library. Pandas make it very easy to work with missing data.

numbers.isnull()0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
null     True
dtype: boolpd.notnull(numbers)0        True
1        True
2        True
3        True
4        True
5        True
6        True
7        True
8        True
9        True
null    False
dtype: bool

Combining 2 Series

alphabets = pd.Series(['a','b','c','d','e'])
alphabets0    a
1    b
2    c
3    d
4    e
dtype: objectjoined = alphabets.append(numbers)
joined0          a
1          b
2          c
3          d
4          e
0         10
1         11
2         12
3         13
4         14
5         15
6         16
7         17
8         18
9         19
null    None
dtype: objectjoined2 = pd.concat([alphabets, numbers], axis = 0)  # axis = 0 is defined by default so not need to write it againjoined20          a
1          b
2          c
3          d
4          e
0         10
1         11
2         12
3         13
4         14
5         15
6         16
7         17
8         18
9         19
null    None
dtype: object

We can create new pandas series by combining 2 existing data series by any of the above methods

joined3 = pd.concat([alphabets, numbers], axis = 1)joined3

Note that combining data series with axis = 1 produces a different result than two of the methods above.

This brings us to another data structure offered by pandas called DataFrame

DataFrame

DataFrame is the rectangular arrangement of data in a table.

Source: https://jovian.ai/aakashns/python-pandas-data-analysis

Another way of thinking about data frame is as a python dictionary having multiple values associated with a single key. The data stored in the data frame can be of any type like integer, string, boolean, etc.

Let’s start by creating a data frame

df = pd.DataFrame({'country':['Nepal', 'Canada', 'Australia', 'India', 'USA'],
                  'capital': ['Kathmandu', 'Ottawa','Canberra', 'New Delhi', 'Washington'],
                  'population (in million)': [28.09, 37.09, 24.99, 1353, 328.2],
                  'area (in sqkm)': [147516, 9.985*1e6, 7.692*1e6, 3.287*1e6, 9.834*1e6]})   # 1e6 is scientific notation for writing a million

df

.info() method helps us to get some basic information about the dataframe.

df.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   country                  5 non-null      object 
 1   capital                  5 non-null      object 
 2   population (in million)  5 non-null      float64
 3   area (in sqkm)           5 non-null      float64
dtypes: float64(2), object(2)
memory usage: 288.0+ bytes

.describe() method provides statistical information about the dataframe.

df.describe()

count: sum std: standard deviation

Retrieving data from dataframe

This is a simple example of the datasets that you will be working on using pandas

We can use the methods like .head() and .tail() to select 5 elements of the data frame from top and bottom respectively.

df.head()

We can provide a parameter n, where n is the number of rows to select.

df.tail(3)

df.columns can be used to get the columns of the dataframe.

df.columnsIndex(['country', 'capital', 'population (in million)', 'area (in sqkm)'], dtype='object')

We can retrieve one or more columns from a dataframe using two of the methods below.

df['country']0        Nepal
1       Canada
2    Australia
3        India
4          USA
Name: country, dtype: objectdf.country0        Nepal
1       Canada
2    Australia
3        India
4          USA
Name: country, dtype: object

One column of a data frame is a series.

type(df.country)pandas.core.series.Series

We need to provide column names as a python list to retrieve more than one column.

df[['country', 'capital']]

type(df[['country', 'capital']])pandas.core.frame.DataFrame

.shape methods return the size/shape of the table/dataframe.

df.shape(5, 4)

.sample() method can be used to retrieve a random sample of data from dataframe.

df.sample(3)

.at method can be used to retrieve the element from a particular row and column directly.

df.at[3, 'capital']'New Delhi'df.at[4, 'population (in million)']328.2

.loc method can be used to retrieve a particular row from the dataframe.

df.loc[3]country                        India
capital                    New Delhi
population (in million)         1353
area (in sqkm)             3.287e+06
Name: 3, dtype: objectdf.loc[0]country                        Nepal
capital                    Kathmandu
population (in million)        28.09
area (in sqkm)                147516
Name: 0, dtype: object

To access multiple rows;

df.loc[1:3]

df.at[0, 'capital']'Kathmandu'new_df = dfnew_df

new_df.at[0, 'capital']'Kathmandu'new_df.at[0, 'capital'] = 'Paris'df.at[0, 'capital'], new_df.at[0, 'capital']('Paris', 'Paris')

Changing the element of the new_df data frame changes the element of the df data frame. This is because df and new_df both access the same data on the computer’s memory. We can mitigate this problem by using the .copy() method. This creates a new data frame in a different memory address. Hence, changing values in one data frame does not change the values in the different data frame.

df = new_df.copy()df.at[0, 'capital'] = "Kathmandu"df.at[0, 'capital'], new_df.at[0, 'capital']('Kathmandu', 'Paris')

Performing basic analysis on data

.sum() method calculates the sum of all data in a particular column.

df['population (in million)'].sum()1771.3700000000001df.sum()country                                  NepalCanadaAustraliaIndiaUSA
capital                    KathmanduOttawaCanberraNew DelhiWashington
population (in million)                                       1771.37
area (in sqkm)                                            3.09455e+07
dtype: object

Calculating the overall population density of all the countries.

pop_dens = df['population (in million)'].sum()*1e6/df['area (in sqkm)'].sum()
print("Population Density: " + str(round(pop_dens, 1)) + " per sqkm.")Population Density: 57.2 per sqkm.

Create a new column called population density in the original dataframe.

df['population_density'] = df['population (in million)']*1e6/df['area (in sqkm)']df

Get the countries whose population density is greater than 100 per sqkm.

high_dens = df.population_density > 100
high_dens0     True
1    False
2    False
3     True
4    False
Name: population_density, dtype: bool

Boolean values can be used to select rows in dataframe.

df[high_dens]

Population density gives us a better understanding of the country rather than population. So we can drop the population column.

inplace needs to be True to permanently delete a column from a data frame.

df.drop(columns=['population (in million)'], inplace=True)df

Sorting

.sort_values() method can be used to sort values in a dataframe.

df.sort_values('population_density', ascending=False)

Merging multiple dataframes

city_df = pd.DataFrame({
    'country':['Nepal', 'Canada', 'Australia', 'India', 'USA'],
    'city': ['Pokhara', 'Toronto', 'Sydney', 'Mumbai', 'Los Angeles']
})
city_df

To merge two data frames, they must have at least one common columns. ‘on’ is the column to join on two data frames. In this case, both of the data frames have a country column in common.

merged_df = df.merge(city_df, on='country')
merged_df

df = merged_df
df

Data aggregation and grouping

continent_df = pd.DataFrame({
    'country':['Nepal', 'Canada', 'Australia', 'India', 'USA'],
    'continent': ['Asia', 'North America', 'Australia', 'Asia', 'North America']
})
continent_df

df = df.merge(continent_df, on='country')
df

We can use the groupby function to create a group for each continent, select the columns we wish to aggregate, and aggregate them using the .sum() method.

df.groupby('continent').sum()

Data visualization in pandas

Pandas provide a .plot() method to plot different types of graphs like line, bar, pie, hist, scatter, area, etc which can become very handy to plot simple graphs without importing any other libraries.

df.population_density.plot()<AxesSubplot:>

df['area (in sqkm)'].plot(kind='pie', )<AxesSubplot:ylabel='area (in sqkm)'>

df.population_density.plot(kind='bar')<AxesSubplot:>

Reading and Writing to different file formats in pandas

We can read and write from/to different file formats in pandas by using pandas defined methods like .read_csv(‘filepath’) and .to_csv(‘filepath).

Example:

my_csv_file = pd.read_csv(‘../my_file.csv’)

While using file path format in windows, write ‘r’ (lowercase r) before the quotation starts.

my_csv_file = pd.read_csv(r’D:\Users\aarya\files\my_file.csv’)