Pandas: Python’s library

Bhargav Sharma
3 min readOct 10, 2022

Pandas: It is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool.

#can be installed using below command
pip install pandas

Loading data using pandas:

Throughout this article, we will be using the below table. Feel free to download small practice tables from this link for practice purposes.

BankWages.csv
import pandas as pd
data=pd.read_csv("BankWages.csv") #reading csv files
data_excel=pd.read_excel("BankWages.xlsx") #reading excel files
data_tab=pd.read_csv("BankWages.txt",delimeter='\t') #reading text files.
data.head() #output: returns top 5 rows of the table by default ------(Fig.image1)
data.head(3) #output: returns top 3 rows of the table
Fig.image1

Accessing the data:

df.columns #output:returns all columns name as output.
df['job'] or df.job #output: returns a specific column as output.
df['job','education'] #output: returns columns specified inside.
df.iloc[1:3] #iloc stands for integer location we can find any of the data through indices. returns rows from index 1 to 2 where as 3 is esclusive.
df.iloc[2,1] #output:'admin' ->fetches the particular data at index 1 in 2nd row
df.loc[df.job=='admin'] #output: returns all the rows having job as 'admin' ->loc works opposite of iloc it doesn't require indices in loc searching can be done using conditions,texts,rows and columns.

Reading each row:

for index,row in df.iterrows():
print(index,row['job'])

Sorting/Describing Data:

df.describe() #output: returns statistical analysis on the data
--(Fig.image2)
Fig.image2
df.sort_values(list_of_values,ascending="True/False")
df.sort_values('education',ascending=True) #sorts the whole table with respect to education column.

Manipulation in data:

df['total']=216 #output: it adds up a new column to data with 216 as it's default value.
df['total']=df.iloc[:,2:4].sum(axis=1) #hypothetical for the current data but we can do the sum of the columns for a new column named 'total' where as the axis=1 means we have to add them up in horizontal derection.
df=df.drop(columns=['total']) #output: delets the particular column specified.

Exporting the data :

df.to_csv('my_new_file.csv',index=False)df.to_excel('my_new_file.xlsx',index=False)df.to_csv('my_new_file.txt',index=False,sep='\t')

Filtering data:

df.loc[(df.job=='admin') & (df.gender=='male')] #output: returns the rows with job as admin and gender as male.--Note:In pandas in place of and,or,not we have to use &,|,~ otherwise it will lead to an error--(Fig.image3)df=df.reset_index() #output:just to reset the indexing and starts it with 0 for our convenience
Fig.image3
#using regex
import re
df.loc(df.job.str.contains('^a[A-Z]'*),flags=re.I,regex=True)
#contains used for finding a particular substring in string where as flag shows that regex will work irrespective of case(a==A)

Conditional Changes:

df=df.loc[df['job']=='admin','job']='HR' #output: changes the job to HR where the job was admin.
--(Fig.image4)
#we can also modify multiple values on specific conditions
df=df.loc[df['job']=='admin',['job','gender']]=['HR','female']
Fig.image4

Group By:

df.groupby(['job']).mean() #aggregate the data according to job and the find out the mean for specific field.
--(Fig.image3)
df.groupby(['job']).count() #aggregate the data by job and count the total rows.
Fig.image3

Dealing with large datasets:

for df in pd.read_csv('file.csv',chunksize=5): 
print(df)
#chunksize is the number of rows we want to read

--

--

Bhargav Sharma

A student with a passion for technology and data-driven solutions.