Let’s do Some Manipulation of CSV(comma-separated values)using Pandas Library.
Published in
3 min readAug 2, 2021
What is Pandas?
Pandas is defined as an open-source library that provides high-performance data manipulation in Python. The name of Pandas is derived from the word Panel Data, which means Econometrics from Multidimensional data.
Benefits of using CSV file:
- CSV files are plain-text files, making them easier for the website developer to create
- Since they’re plain text, they’re easier to import into a spreadsheet or another storage database, regardless of the specific software you’re using.
- To better organize large amounts of data
I hope now you understood about the pandas & CSV file.
Now let’s do some manipulation of CSV using Pandas.
- First of all, we have to import pandas, and read CSV files, for that use this code
import pandas as pddf = pd.read_csv('data.csv') #Here iam using a csv file named "data"
- using “to_string()” we can print the entire DataFrame.
import pandas as pddf = pd.read_csv('data.csv')print(df.to_string())
- using “fillna() ” we can replace the empty space with a value.
import pandas as pddf = pd.read_csv('data.csv')df.fillna(5000,inplace=True)
#Here iam giving '5000' to replace empty values.print(df.to_string())
- using “dropna()” method it will remove all rows containg NULL values.
- we can also replace the empty cells by calculate the mean, median or mode value of the column.
# Mean = the average value (the sum of all values divided by number of values).x = df["Calories"].mean()df["Calories"].fillna(x,inplace=True)print(df.to_string())
------------------------------------------------------------# Mode = the value that appears most frequently.x = df["Calories"].mode()[0]df['Calories'].fillna(x, inplace=True)print(df.to_string())------------------------------------------------------------
# Median = the value in the middle, after you have sorted all values ascending.x = df["Calories"].median()df["Calories"].fillna(x,inplace=True)print(df.to_string())
- Cells with data of wrong format , we can convert it into a correct format by using “to_datetime()”
df['Date'] = pd.to_datetime(df['Date'])print(df.to_string())
If we want to make changes in the original DataFrame, use the "inplace = True”
argument.
- We can remove the rows that contain wrong data.
# Removing rows# The loc property is used to access a group of rows and columns by label(s) or a boolean array.for x in df.index: if df.loc[x, "Duration"] > 120: df.drop(x, inplace = True)print(df.to_string())
- Using the “
duplicated()"
method we can find out the duplicate values present in our data set.
import pandas as pddf = pd.read_csv('data.csv')print(df.duplicated())
- Using “drop_duplicates()” we can remove the duplicates values.
import pandas as pddf = pd.read_csv('data.csv')df.drop_duplicates(inplace = True)print(df.to_string())
Hope you found the article useful!
Thanks for Reading!!