Get Started With Pandas In 10 Mins

Learn Pandas in 10 minutes

Ektamanvar
Analytics Vidhya
9 min read · Jun 21, 2020


This article is divided into four parts:

  1. Introduction to pandas
  2. What pandas is capable of
  3. Practical Implementation
  4. Conclusion

1. Introduction to Pandas

In simple terms, pandas is a fast, powerful, flexible, and easy-to-use software library written for the Python programming language for data analysis and data manipulation.

2. What pandas is capable of

  1. Importing and Exporting Data
  2. Viewing and Inspecting Data
  3. Selecting and Filtering Data
  4. Grouping and Sorting Data
  5. Handling Missing and Duplicate Data
  6. Setting and Resetting Index
  7. Reshaping Data
  8. Merging Data
  9. Date Formatting

3. Practical Implementation

Let’s start by importing the pandas library. Below is the syntax to import it: import is the keyword, pandas is the library name, and pd is the alias. Through this alias all of pandas’ functions are accessed.

import pandas as pd

1. Importing and Exporting Data

#Importing data
data = pd.read_csv('titanic.csv')
#Exporting data
data.to_csv('output.csv')

I am using the Titanic dataset from Kaggle, which you can find here.

The first step in a data science project is to import the dataset. Often you will work with a comma-separated values (CSV) file. The read_csv() function is used to import the dataset and to_csv() is used to export it.

Here data is just a variable name used to store the dataset; any valid variable name can be used. Other file formats can be imported and exported as well, such as Excel, HTML, and JSON.
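Since the Titanic file may not be on your disk, here is a minimal, self-contained sketch of the same import/export round trip using an in-memory buffer instead of a file; the small Name/Age frame is made up for illustration.

```python
import io

import pandas as pd

# A tiny stand-in dataset (made up for this sketch)
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [30, 25]})

# Export: to_csv() can write to a path or any file-like object
buf = io.StringIO()
df.to_csv(buf, index=False)  # index=False skips writing the row index
buf.seek(0)

# Import: read_csv() reads it back into a DataFrame
restored = pd.read_csv(buf)
print(restored.equals(df))  # True: the round trip preserves the data
```

Passing `index=False` is a common choice; otherwise the row index is written as an extra unnamed column and comes back as such on re-import.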

This is how the imported dataset looks.

2. Viewing and Inspecting Data

#Returns by default top five rows
data.head()
#Returns by default last five rows
data.tail()
#Returns column names
data.columns
#Returns column data types
data.dtypes
#Returns (Rows, Columns) of the dataset
data.shape
#Returns total count of non NA values for each column 
data.count()
#Age, Cabin, Embarked columns have NA values
#Returns basic statistical information of numeric columns only
data.describe()
#Returns information of string type columns
data.describe(include=object)
#Returns information about dataframe like index dtype and column dtypes, non-null values and memory usage
data.info()
#Returns distinct count of observations for each column
data.nunique()
#Returns count of unique values for series(columns in our case)
data['Pclass'].value_counts()
#Rename the column name
data.rename(columns={"Name": "Full Name"})
#Drop columns
data.drop(columns=['Name','Survived','Pclass'])
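To see a few of these inspection calls run end to end without the Titanic file, here is a sketch on a tiny made-up frame. Note that rename() and drop() return new DataFrames by default; the original stays unchanged.

```python
import pandas as pd

# A small made-up frame for illustration
df = pd.DataFrame({
    "Pclass": [1, 3, 3, 2],
    "Name": ["A", "B", "C", "D"],
})

print(df.shape)                     # (4, 2): four rows, two columns
print(list(df.columns))             # ['Pclass', 'Name']
print(df["Pclass"].value_counts())  # class 3 appears twice, 1 and 2 once

# rename() returns a new DataFrame; df itself is unchanged
renamed = df.rename(columns={"Name": "Full Name"})
print(list(renamed.columns))        # ['Pclass', 'Full Name']
print(list(df.columns))             # still ['Pclass', 'Name']
```

To make a rename or drop stick, assign the result back (or pass `inplace=True`).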

3. Selecting and Filtering Data

#Selects one column data
data['Name']
#Selects more than one column data
data[['Sex','Name']]
#Filters data based on condition
data[data['Age']>50]
#Filters all rows and 3 columns
#loc() function filters by labels
data.loc[:, ['Sex','Pclass','Age']]
#Filters 100 to 400 rows and 2 columns
data.loc[100:400, ['Survived','Sex']]
#Filters all rows and columns from survived to sex columns
data.loc[:, 'Survived':'Sex']
#Filters rows based on condition (sex=female) and all columns
data.loc[data['Sex']=='female', :]
#Filters all rows and 1 to 4 columns
#iloc() function filters by index
data.iloc[ :,1:4]
#index starts from zero and ignores last index while filtering
#Filters all rows and 1,4,6 columns
data.iloc[:,[1,4,6]]
#Filters rows from 150 to 400 and 2,7,3 columns
data.iloc[150:400, [2,7,3]]
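The key difference between the two accessors is worth seeing on a concrete frame: loc slices by label and includes the end label, while iloc slices by position and excludes the end position. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Made-up frame with non-default row labels to show the difference
df = pd.DataFrame(
    {"Survived": [0, 1, 1], "Sex": ["male", "female", "female"], "Age": [22, 38, 26]},
    index=[10, 20, 30],
)

# loc slices by label and INCLUDES the end label (rows 10 and 20)
by_label = df.loc[10:20, "Survived":"Sex"]
print(by_label.shape)  # (2, 2)

# iloc slices by position and EXCLUDES the end position (positions 0 and 1)
by_pos = df.iloc[0:2, 0:2]
print(by_pos.shape)    # (2, 2)

# Boolean filtering with loc: condition for rows, ':' for all columns
females = df.loc[df["Sex"] == "female", :]
print(len(females))    # 2
```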

4. Grouping and Sorting Data

#Returns Survived column wise count
data.groupby('Survived').count()
#Returns max age based on sex column
data.groupby('Sex')['Age'].max()
#Returns min age based on sex, survived columns
data.groupby(['Sex','Survived'])['Age'].min()
#Multiple aggregation functions  
#Returns min,max of age based on parch & survived
data.groupby(['Parch','Survived'])['Age'].agg(['min', 'max'])
#Sorts the dataset based on specified column(default it sorts in ascending order)
data.sort_values(by='Name')
#Sorts the dataset based on specified column in descending order
data.sort_values(by='Name', ascending=False)
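Here is the same grouping and sorting pattern run on a small made-up frame, so the outputs can be checked by eye:

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22, 38, 26, 35],
})

# Max age per group: female -> 38, male -> 35
print(df.groupby("Sex")["Age"].max())

# Several aggregations at once produce one column per function
agg = df.groupby("Sex")["Age"].agg(["min", "max"])
print(agg.loc["female", "min"])  # 26

# sort_values returns a new, reordered DataFrame
print(df.sort_values(by="Age", ascending=False)["Age"].tolist())  # [38, 35, 26, 22]
```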

5. Handling Missing Data

#Returns True or False 
data.isnull()
data.isna()
#True -> NA
#False -> Not NA
#Fills all NA's with zero for both string and numeric columns
#NA values can be filled with mean,median or mode as well.
data.fillna(0)
#Drops rows if the row has at least one NA value
data.dropna(how='any')

See how the rows decreased from 891 to 183; be cautious when you use the dropna() function because it can drop a lot of valuable data.

#Drops rows if the row has all NA values
data.dropna(how='all')

In our case, we don't have any such row that has all NA values.

#Drops columns if the column has at least one NA value
data.dropna(axis='columns', how='any')

Age, Cabin, Embarked columns dropped.

#Checks if the whole row appears elsewhere with the same values. 
data.duplicated()
#Checks if there are any duplicate values of a particular column
data.duplicated('Age')
#Drops duplicate records 
data.drop_duplicates()

In our case, we don't have any duplicate records.

#Drops duplicates from particular column
data.drop_duplicates('Age')

See how the rows decreased from 891 to 89; be cautious when you use the drop_duplicates() function because it can drop a lot of valuable data. In our case many passengers of the same age obviously traveled on the Titanic, so this is for illustration only.
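A compact sketch of the missing-data and duplicate-handling calls on a made-up three-row frame, where the effects are easy to trace:

```python
import numpy as np
import pandas as pd

# Made-up frame: one NA in Age, two in Cabin, one duplicated Age value
df = pd.DataFrame({"Age": [22.0, np.nan, 22.0], "Cabin": [None, "C85", None]})

print(df.isna().sum().tolist())         # [1, 2]: NA count per column
print(df.fillna(0).isna().sum().sum())  # 0 after filling

# dropna(how='any') keeps only fully populated rows;
# here every row has at least one NA, so nothing survives
print(len(df.dropna(how="any")))        # 0

# duplicated()/drop_duplicates() on a single column
print(df.duplicated("Age").tolist())    # [False, False, True]
print(len(df.drop_duplicates("Age")))   # 2: the second 22.0 row is dropped
```

Note that fillna(), dropna(), and drop_duplicates() all return new DataFrames; `df` itself is untouched unless you assign the result back.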

6. Setting and Resetting Index

#Sets index based on specified column
data.set_index('Sex')
#Sets index based on specified columns
data1 = data.set_index(['Sex','PassengerId'])
#Resets index
data1.reset_index()

Now Sex, PassengerId columns are reset back to normal columns.
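The round trip is easiest to see on a small made-up frame: set_index() moves columns into the index, and reset_index() moves them back out.

```python
import pandas as pd

# Made-up frame for illustration
df = pd.DataFrame({"Sex": ["male", "female"], "PassengerId": [1, 2], "Age": [22, 38]})

# set_index() returns a new frame; assign it to keep the result
indexed = df.set_index(["Sex", "PassengerId"])
print(list(indexed.index.names))   # ['Sex', 'PassengerId']
print("Sex" in indexed.columns)    # False: it moved into the index

# reset_index() moves the index levels back to ordinary columns
restored = indexed.reset_index()
print(list(restored.columns))      # ['Sex', 'PassengerId', 'Age']
```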

7. Reshaping Data

#melt() function is used to convert dataframe from wide format to long format
data.melt(id_vars='PassengerId')
#melt() function can be used for particular columns as well
data.melt(id_vars='PassengerId', value_vars=['Survived','Sex'], var_name='Columns', value_name='Column_values')
#pivot() function is used to reshape dataframe based on index/column values. Results in a multiindex
#Doesn't support aggregation functions
data.pivot(index='PassengerId', columns='Sex', values=['Survived','Pclass','Age'])
#pivot_table is used for data aggregation
#Observations are filled with sum of age values
data.pivot_table(index='PassengerId', columns='Sex', values='Age', aggfunc='sum')
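A small sketch of both reshaping directions on made-up data: melt() turns a wide frame into one row per (id, variable) pair, and pivot_table() aggregates when index/column pairs repeat (which plain pivot() cannot do).

```python
import pandas as pd

# Made-up wide frame: two rows, two value columns
df = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Sex": ["male", "female"],
})

# Wide -> long: 2 rows x 2 value_vars = 4 rows, 3 columns
long_df = df.melt(id_vars="PassengerId",
                  value_vars=["Survived", "Sex"],
                  var_name="Columns", value_name="Column_values")
print(long_df.shape)  # (4, 3)

# pivot_table aggregates repeated keys: the two 'male' ages are summed
scores = pd.DataFrame({"Sex": ["male", "male", "female"], "Age": [20, 30, 40]})
table = scores.pivot_table(index="Sex", values="Age", aggfunc="sum")
print(table.loc["male", "Age"])  # 50
```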

8. Merging Data

Let's create two data frames to demonstrate the merge and concat functions.

df1 = pd.DataFrame({"x1": ["a","b","c","d"], "x2": [12.0, 23.2, 56, 45.4]})
df1
df2 = pd.DataFrame({"x1": ["a","b","c","e"], "x3": [9.5, 37.0, 77, 38.9]})
df2
#Returns common data based on x1 column: inner join (a,b,c)
pd.merge(df1,df2,on="x1")

#Merges both data based on x1 column: outer join (a,b,c,d,e)
pd.merge(df1,df2,on="x1", how="outer")
#Merges common data from both dataset and remaining data from left dataset
pd.merge(df1,df2,on=”x1", how=”left”)
#Merges common data from both dataset and remaining data from right dataset
pd.merge(df1,df2,on="x1", how="right")

Please note that while merging, any missing data is filled with NaN values.

#Concat function by default performs outer join and works row wise
pd.concat([df1,df2])
#Axis will be labeled 0, …, n - 1
pd.concat([df1,df2], ignore_index=True)
#Concatenates column wise
pd.concat([df1,df2], axis=1)
#Keeps only the columns common to both frames (x1 here)
pd.concat([df1,df2], join="inner")
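Running the df1/df2 example above end to end makes the join behavior concrete: the inner merge keeps only keys present in both frames, the outer merge keeps all keys and fills gaps with NaN, and concat simply stacks frames.

```python
import pandas as pd

# Same two frames as in the article
df1 = pd.DataFrame({"x1": ["a", "b", "c", "d"], "x2": [12.0, 23.2, 56, 45.4]})
df2 = pd.DataFrame({"x1": ["a", "b", "c", "e"], "x3": [9.5, 37.0, 77, 38.9]})

inner = pd.merge(df1, df2, on="x1")               # keys a, b, c
outer = pd.merge(df1, df2, on="x1", how="outer")  # keys a, b, c, d, e
print(len(inner), len(outer))                     # 3 5

# Unmatched cells in the outer join are filled with NaN
print(outer["x3"].isna().sum())                   # 1: the 'd' row has no x3

# concat stacks row-wise by default; the column set is the union (x1, x2, x3)
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked.shape)                              # (8, 3)
```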

9. Date Formatting

I am using the Apple stock price dataset from the Google Finance website, which you can find here.

#Load the dataset 
data1 = pd.read_csv('applestock.csv')
data1.head()
#Check the datatypes
data1.dtypes
#date column is object type, not datetime format
#to_datetime() converts any format to datetime format
data1['Date'] = pd.to_datetime(data1['Date'])
data1.dtypes
#Extract year, month, day from date column
data1['Year'] = data1['Date'].dt.year
data1['Month'] = data1['Date'].dt.month
data1['day'] = data1['Date'].dt.day
data1.head()
#pd.DatetimeIndex() sets date as index
data1.index = pd.DatetimeIndex(data1['Date'])
#Once we set date as index, we need to delete the date column
data1.drop(columns=['Date'], inplace=True)
data1
#resample() resamples time series data based on specified frequency
data1['High'].resample('M').sum()
#date_range() creates an array of datetimes
#Creates 10 dates starting from 2020-01-05 with WEEK frequency
date1 = pd.date_range(start='2020-01-05', periods=10, freq='W')
#Creates 10 dates ending on 2020-03-10 with MONTH frequency
date2 = pd.date_range(end='2020-03-10', periods=10, freq='M')
#Creates dates from 2020-01-01 to 2020-06-01 with SEMI-MONTH frequency
date3 = pd.date_range(start='2020-01-01', end='2020-06-01', freq='SM')
pd.DataFrame({"Date1": date1, "Date2": date2, "Date3": date3})
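Since the Apple stock file is not included here, a self-contained sketch of the same datetime tools on made-up dates; note that newer pandas versions prefer the alias 'ME' over 'M' for month-end frequency.

```python
import pandas as pd

# to_datetime() parses strings into a datetime64 Series
s = pd.to_datetime(pd.Series(["2020-01-05", "2020-02-10"]))
print(s.dt.year.tolist())   # [2020, 2020]
print(s.dt.month.tolist())  # [1, 2]

# date_range() builds an evenly spaced DatetimeIndex
# (freq='W' defaults to week-ending-Sunday; 2020-01-05 is a Sunday)
weekly = pd.date_range(start="2020-01-05", periods=3, freq="W")
print(len(weekly))          # 3

# resample() aggregates a time series per period; 'M' is month-end
ts = pd.Series([1, 2, 3], index=pd.date_range("2020-01-01", periods=3, freq="D"))
print(ts.resample("M").sum().iloc[0])  # 6: all three days fall in January
```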

4. Conclusion

Once you go through this post you will be able to do data analysis using pandas: loading, inspecting, selecting, filtering, grouping, sorting, and much more. If you want to practice the above examples, here is the link to my Jupyter notebook. These are only basic pandas functions; you can do more advanced data analysis as well.

Thank you for reading :)
