Data Manipulation Using Pandas
Why we need Data Manipulation ?
Real world data is so messy , we by doing certain operations make data meaningful based on one’s requirement this process of transforming messy data into insightful information can be done by data manipulation.There are various language that do support data manipulation (eg:-sql,R,Excel..etc). In this blog we will broadly discuss Pandas for data manipulation.In this section I will take titanic dataset for broader understanding.
1. Load Dataset :
Seaborn will load example dataset that is present in online repository.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = sns.load_dataset("titanic")
2. Read first/last five rows :
It is so hectic to go through each and every row of dataset so for the Cursory Glance we see first /last five row.
df.head()
By default it will show first five rows of given datset. df.tail()
By default it will show last five rows of given dataset.
3. Description of each features :
df.dtypes
It gives you proper understanding about each columns and it’s datatype.
Note:- String in python is similar to object in pandas
4. Statistical description :
It will Describe all numerical variable on statistical point of view like mean,median,percentile,quantile,minimum,maximum etc.This can be done using df.describe()
5. Description of Categorical data :
We cannot directly describe categorical data statistically,we will have to do several steps to do the same.
a=(df.dtypes=="object")
print(a)
Upper code return True
if dtype will be categorical else it will return False
Now we will do statistical description of categorical data.
b = df.dtypes[a] # give the output if "a" returns True value
b.describe()
6. Printing all columns :
When we are dealing with large dataset we don’t have any idea about each and every column so df.columns
prints all the columns that is present in given dataset.
7. Null values Operations :
When we are doing cursory glance of any dataset we need to make sure no any null value should be present in the given dataset.df.isnull().sum()
count the null values from each features and gives final counted value as an output.
Well we have detected null values so our next task is to eliminate all these null values permanently. df.dropna(inplace=True)
if inplace will be True it will eliminate null value permanently, if inplace will be False then it will eliminate null value temprorarily.
df.dropna(axis=1,thresh=2,inplace=True
suppose you have a dataframe and you don’t want to eliminate the column which has atleast two non nan value then we use df.dropna(thresh=2,axis=1)
for the same.
8. Filling null Values :
nan simply means there is no any data present at that datapoint so we can replace the missing value by adding some data manually.this can be done by df.fillna(value=3)
it will replace all nan value by 3.
9. Returning specific index :
Suppose You want to return the index of maximum fare this can be done by using below code;
import numpy as np
import pandas as pd
a=np.where(df["fare"]==max(df["fare"]))
print("index of maximum fare will be =: ",a)[out]>> index of maximum fare will be =: (array([136, 152], dtype=int64),)
10. Selecting Specific columns :
There is a task given to you to count the the people with age below 40 years old present in titanic.This will be done in two step.First of all you will select the feature contains age and an then apply sum()
on it.
a=df["age"] #Printing specific column age
b= a < 40 #People with age less than 40 years old
b.sum() #total number of people with age less than 40 years old[out] >> 114
11. Creating new column in given dataframe :
Suppose your boss has given a task to add a column with first letter of embarked town of each passenger present in titanic.
embark_town=df["embark_town"].astype("str")
df["new_column"]=[i[0] for i in embark_town]
df.columns
12. Creating dataframe in matrix format :
from numpy.random import randn as rn
np.random.seed(101)
matrix_data=rn(5,4)
row_label=["A","B","C","D","E"]
column_label=["P","Q","R","S"]
df1=pd.DataFrame(matrix_data,row_label,column_label)
df1
Why I have taken np.random.seed(101)
?
when we take random value without seed it will give different-different value as an output after every execution to get similar(constant) value as an output after every execution i have used seed .
13. Dropping rows and columns :
Suppose you don’t want adult male column in your dataset you can remove it (temproraily or permanently) using df.drop(inplace=True/false)
.
df2=df.drop(“adult_male”,axis=1) #axis=1 represent removal of data #from column and axis=0 represent removal of data from row
print(df2)
14. Index Set and Reset :
user has all the authority to alter various index related operations i.e:-changing index name,setting default index,remove column header etc.
df1.reset_index() #Gives default index value but previus index value
#will also be present after reseting
df1.reset_index(drop=True,inplace=True) #drop the previous index
#value and assign the default indexnew_default=["k","L","M","N","P"]
df1.set_index("new_default") #setting the index values
15. Multi-indexing :
multi indexing is hierarchical indexing in this situation index inside index will be present.
from numpy.random import randn
np.random.seed(101)
matrix_data=randn(12,6)
list1=["P","P","P","Q","Q","Q","R","R","R","S","S","S"]
list2=[1,2,3,1,2,3,1,2,3,1,2,3]
multi_index=tuple(zip(list1,list2))
print("printing multi index value \n","-"*120,"\n",(multi_index))Multi_index=pd.MultiIndex.from_tuples(multi_index)
df3=pd.DataFrame(matrix_data,index=multi_index,columns["J","K","L","M","N","O"])
print(df3)
Acessing Data from hierarchical index ;
a=df3.loc["P"].loc[[3],["A","C","E"]]
print(a)[out] >> A C E
3 0.188695 -0.933237 0.190794
16. Groupby Operation :
When we will have similar set of data we will use Groupby operation.this operation cannot be done independently we will have to do some mathematical or aggregation operation along with groupby to get required output.
data={"Company":["GooGle","GooGle","Microsoft","Microsoft","FaceBook","FaceBook"], "Person":["Akhil","Anand","Kspa","Abhishek","shiv","Aashutosh"],"Sales":[100,200,300,400,500,600]}
df6=pd.DataFrame(data)
df6.groupby("Company") #If we only do groupby we won't get output
#for getting output we will have to associate some aggregate #operation along with groupby
print(df6.groupby("Company").mean())[out]>> Sales
Company
FaceBook 550
GooGle 150
Microsoft 350
groupby swarm the similar data into single category for furthur mathematical operation.
df=df6.groupby("Company").describe()
df
Here all similar data has been collected into their respective group then statistical description has been done.
17. Accessing Row inside Row and column inside column index :
i.>Accessing column inside column index :
let’s understand from given image what does it means from column inside column.
We will access it by following code;
new_d["Sales"].loc[["GooGle"],["std","mean"] #Here i have accessed #"GooGle" column which is present inside "Sales" column
ii.> Accessing row inside row index :
Take a keen eye on it because the method of accessing row inside row has some difference from accessing column inside column index.
df1.loc["Sales"].loc[["count","mean"],["FaceBook","GooGle"]]
#here we are acessing "count" and "means" row index which is place d #inside "Sales" row.
18. Concatenation :
Concatenation can be done in two ways one is horizontal concatenation (axis=0) another is vertical concatenation(axis=1).
let’s understand it by an example;
19. Merge Operation :
Merge operation’s functonality is similar to joins in SQL. It is of four types which are:-
i.> Left merge :- It will give every attribute from left table.
ii.> Right merge :- It will give every from right table.
iii.>Inner merge :- It will give common attribute occured between left table and right table as an output.
iv.>Outer merge :- It will give combined attribute of left table and right table as an output.
20. Joins :
it’s Function and working are quiet similar to merge operation but;
merge :- It will join two dataframe based on their common column values.
join:- It will join two dataframe based on their common index values.
21. apply() :
It is used to reduce perplexity of existing column by doing certain opertions. This method is used to add multiple column by doing various complex mathematical operations without changing original nature of existing dataframe.let’s take an example to understand this concept precisely.
Suppose Any company which has been selling more than 18000 product is considered as top notch company else company will be considered as mediocre. Now we will do apply function for the same
what is the difference between apply() and map() ?
map() does better operations with list and apply() does better operations with Series and DataFrame.
22. iloc() and loc() :
iloc() used to extract/access data by it’s defauult location. loc() used to extract/access data by it’s column name.
Conclusion:
This is all from my Side . Please give your valueable feedback for further improvement of my blog.