Pandas Library
Pandas is one of the most popular python library used for working with data of all types. It was first released in January of 2008 by Wes McKinney. Pandas library is open source and has a huge community backing it. It is written in C and Cython which makes it very fast and optimized for performance with the features of high-level languages like simpler syntax.
Why Pandas?
There is mainly two feature offered by pandas that makes it very compelling to learn it for any developer. The features are as follows: It allows reading and working on data from a wide range of file formats like CSV, JSON, HTML, XML, MS Excel, SQL, HDF5, etc. Because it being an open-source library with a huge community, it provides many methods for data filtering, manipulating, merging, grouping and visualizing. Pandas also have many methods for doing statistical analysis.
Installing Pandas
We can install pandas library just like any other python library. Use any of the following commands to install pandas:
Anaconda users can use the following command:
Importing Pandas
After installing pandas, the first thing we need to do is to import it. The following command can be used:
import pandas as pd
Pandas can be imported as anything but it is most common to import it as ‘pd’.
Data Structures offered by Pandas
To master the basics of pandas library first you need to get familiar with two of the data structures it provides.
- Series
- DataFrame
Series
A series is basically a one-dimensional array that stores a sequence of data. Data stored in the panda series is associated with data labels called index.
Let’s start by defining our own series in pandas
my_series = pd.Series([-2, 0, 2,4,6,8,10])
my_series0 -2
1 0
2 2
3 4
4 6
5 8
6 10
dtype: int64
Let's check the data type
type(my_series)pandas.core.series.Seriesprint("Data values: ", my_series.values, "and data index: ", my_series.index)Data values: [-2 0 2 4 6 8 10] and data index: RangeIndex(start=0, stop=7, step=1)
Since we did not provide any index for this series, pandas has by default assigned index from 0 to N-1 where N is the length of data.
We can explicitly define the index as follows:
my_series2 = pd.Series([0,1,2,3,4,5,6], index=['a','b','c', True, True, 2, 3])
The index for a series can be of any datatype.
print("Data values: ", my_series2.values, "and data index: ", my_series2.index)Data values: [0 1 2 3 4 5 6] and data index: Index(['a', 'b', 'c', True, True, 2, 3], dtype='object')
Indexes can be used to select one or a set of data from series.
my_series2['a']0my_series2[True]True 3
True 4
dtype: int64
A list of indexes must be provided while selecting a set of values
my_series2[['b', True, 3]]b 1
True 3
True 4
3 6
dtype: int64
Operations on pandas series
numbers = pd.Series(range(10,20))
numbers0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
dtype: int64
Data filtering
numbers[numbers % 2 == 0]0 10
2 12
4 14
6 16
8 18
dtype: int64
Scalar manipulation
numbers + 1, numbers *2(0 11
1 12
2 13
3 14
4 15
5 16
6 17
7 18
8 19
9 20
dtype: int64,
0 20
1 22
2 24
3 26
4 28
5 30
6 32
7 34
8 36
9 38
dtype: int64)
Mathematical and custom functions
Pandas support many methods provided by the NumPy library
import numpy as np
np.sin(numbers)0 -0.544021
1 -0.999990
2 -0.536573
3 0.420167
4 0.990607
5 0.650288
6 -0.287903
7 -0.961397
8 -0.750987
9 0.149877
dtype: float64
We can define our own functions and use the ‘.apply’ method provided by pandas to use our custom function on the pandas series
def power2(ds):
return ds * dsnumbers.apply(power2)0 100
1 121
2 144
3 169
4 196
5 225
6 256
7 289
8 324
9 361
dtype: int64
Some other useful functions
numbers['null'] = None # Adding None value and index null to the existing series
Pandas function .isnull() and notnull() can be used to find missing datas in series.
This is another important feature of the pandas library. Pandas make it very easy to work with missing data.
numbers.isnull()0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
null True
dtype: boolpd.notnull(numbers)0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
null False
dtype: bool
Combining 2 Series
alphabets = pd.Series(['a','b','c','d','e'])
alphabets0 a
1 b
2 c
3 d
4 e
dtype: objectjoined = alphabets.append(numbers)
joined0 a
1 b
2 c
3 d
4 e
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
null None
dtype: objectjoined2 = pd.concat([alphabets, numbers], axis = 0) # axis = 0 is defined by default so not need to write it againjoined20 a
1 b
2 c
3 d
4 e
0 10
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
null None
dtype: object
We can create new pandas series by combining 2 existing data series by any of the above methods
joined3 = pd.concat([alphabets, numbers], axis = 1)joined3
Note that combining data series with axis = 1 produces a different result than two of the methods above.
This brings us to another data structure offered by pandas called DataFrame
DataFrame
DataFrame is the rectangular arrangement of data in a table.
Source: https://jovian.ai/aakashns/python-pandas-data-analysis
Another way of thinking about data frame is as a python dictionary having multiple values associated with a single key. The data stored in the data frame can be of any type like integer, string, boolean, etc.
Let’s start by creating a data frame
df = pd.DataFrame({'country':['Nepal', 'Canada', 'Australia', 'India', 'USA'],
'capital': ['Kathmandu', 'Ottawa','Canberra', 'New Delhi', 'Washington'],
'population (in million)': [28.09, 37.09, 24.99, 1353, 328.2],
'area (in sqkm)': [147516, 9.985*1e6, 7.692*1e6, 3.287*1e6, 9.834*1e6]}) # 1e6 is scientific notation for writing a million
df
.info() method helps us to get some basic information about the dataframe.
df.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 5 non-null object
1 capital 5 non-null object
2 population (in million) 5 non-null float64
3 area (in sqkm) 5 non-null float64
dtypes: float64(2), object(2)
memory usage: 288.0+ bytes
.describe() method provides statistical information about the dataframe.
df.describe()
count: sum std: standard deviation
Retrieving data from dataframe
This is a simple example of the datasets that you will be working on using pandas
We can use the methods like .head() and .tail() to select 5 elements of the data frame from top and bottom respectively.
df.head()
We can provide a parameter n, where n is the number of rows to select.
df.tail(3)
df.columns can be used to get the columns of the dataframe.
df.columnsIndex(['country', 'capital', 'population (in million)', 'area (in sqkm)'], dtype='object')
We can retrieve one or more columns from a dataframe using two of the methods below.
df['country']0 Nepal
1 Canada
2 Australia
3 India
4 USA
Name: country, dtype: objectdf.country0 Nepal
1 Canada
2 Australia
3 India
4 USA
Name: country, dtype: object
One column of a data frame is a series.
type(df.country)pandas.core.series.Series
We need to provide column names as a python list to retrieve more than one column.
df[['country', 'capital']]
type(df[['country', 'capital']])pandas.core.frame.DataFrame
.shape methods return the size/shape of the table/dataframe.
df.shape(5, 4)
.sample() method can be used to retrieve a random sample of data from dataframe.
df.sample(3)
.at method can be used to retrieve the element from a particular row and column directly.
df.at[3, 'capital']'New Delhi'df.at[4, 'population (in million)']328.2
.loc method can be used to retrieve a particular row from the dataframe.
df.loc[3]country India
capital New Delhi
population (in million) 1353
area (in sqkm) 3.287e+06
Name: 3, dtype: objectdf.loc[0]country Nepal
capital Kathmandu
population (in million) 28.09
area (in sqkm) 147516
Name: 0, dtype: object
To access multiple rows;
df.loc[1:3]
df.at[0, 'capital']'Kathmandu'new_df = dfnew_df
new_df.at[0, 'capital']'Kathmandu'new_df.at[0, 'capital'] = 'Paris'df.at[0, 'capital'], new_df.at[0, 'capital']('Paris', 'Paris')
Changing the element of the new_df data frame changes the element of the df data frame. This is because df and new_df both access the same data on the computer’s memory. We can mitigate this problem by using the .copy() method. This creates a new data frame in a different memory address. Hence, changing values in one data frame does not change the values in the different data frame.
df = new_df.copy()df.at[0, 'capital'] = "Kathmandu"df.at[0, 'capital'], new_df.at[0, 'capital']('Kathmandu', 'Paris')
Performing basic analysis on data
.sum() method calculates the sum of all data in a particular column.
df['population (in million)'].sum()1771.3700000000001df.sum()country NepalCanadaAustraliaIndiaUSA
capital KathmanduOttawaCanberraNew DelhiWashington
population (in million) 1771.37
area (in sqkm) 3.09455e+07
dtype: object
Calculating the overall population density of all the countries.
pop_dens = df['population (in million)'].sum()*1e6/df['area (in sqkm)'].sum()
print("Population Density: " + str(round(pop_dens, 1)) + " per sqkm.")Population Density: 57.2 per sqkm.
Create a new column called population density in the original dataframe.
df['population_density'] = df['population (in million)']*1e6/df['area (in sqkm)']df
Get the countries whose population density is greater than 100 per sqkm.
high_dens = df.population_density > 100
high_dens0 True
1 False
2 False
3 True
4 False
Name: population_density, dtype: bool
Boolean values can be used to select rows in dataframe.
df[high_dens]
Population density gives us a better understanding of the country rather than population. So we can drop the population column.
inplace needs to be True to permanently delete a column from a data frame.
df.drop(columns=['population (in million)'], inplace=True)df
Sorting
.sort_values() method can be used to sort values in a dataframe.
df.sort_values('population_density', ascending=False)
Merging multiple dataframes
city_df = pd.DataFrame({
'country':['Nepal', 'Canada', 'Australia', 'India', 'USA'],
'city': ['Pokhara', 'Toronto', 'Sydney', 'Mumbai', 'Los Angeles']
})
city_df
To merge two data frames, they must have at least one common columns. ‘on’ is the column to join on two data frames. In this case, both of the data frames have a country column in common.
merged_df = df.merge(city_df, on='country')
merged_df
df = merged_df
df
Data aggregation and grouping
continent_df = pd.DataFrame({
'country':['Nepal', 'Canada', 'Australia', 'India', 'USA'],
'continent': ['Asia', 'North America', 'Australia', 'Asia', 'North America']
})
continent_df
df = df.merge(continent_df, on='country')
df
We can use the groupby function to create a group for each continent, select the columns we wish to aggregate, and aggregate them using the .sum() method.
df.groupby('continent').sum()
Data visualization in pandas
Pandas provide a .plot() method to plot different types of graphs like line, bar, pie, hist, scatter, area, etc which can become very handy to plot simple graphs without importing any other libraries.
df.population_density.plot()<AxesSubplot:>
df['area (in sqkm)'].plot(kind='pie', )<AxesSubplot:ylabel='area (in sqkm)'>
df.population_density.plot(kind='bar')<AxesSubplot:>
Reading and Writing to different file formats in pandas
We can read and write from/to different file formats in pandas by using pandas defined methods like .read_csv(‘filepath’) and .to_csv(‘filepath).
Example:
my_csv_file = pd.read_csv(‘../my_file.csv’)
While using file path format in windows, write ‘r’ (lowercase r) before the quotation starts.
my_csv_file = pd.read_csv(r’D:\Users\aarya\files\my_file.csv’)