Let’s do Some Manipulation of CSV(comma-separated values)using Pandas Library.

Published in

featurepreneur

3 min readAug 2, 2021

What is Pandas?

Pandas is defined as an open-source library that provides high-performance data manipulation in Python. The name of Pandas is derived from the word Panel Data, which means Econometrics from Multidimensional data.

Benefits of using CSV file:

CSV files are plain-text files, making them easier for the website developer to create
Since they’re plain text, they’re easier to import into a spreadsheet or another storage database, regardless of the specific software you’re using.
To better organize large amounts of data

I hope now you understood about the pandas & CSV file.

Now let’s do some manipulation of CSV using Pandas.

First of all, we have to import pandas, and read CSV files, for that use this code

import pandas as pddf = pd.read_csv('data.csv')  #Here iam using a csv file named "data"

using “to_string()” we can print the entire DataFrame.

import pandas as pddf = pd.read_csv('data.csv')print(df.to_string())

using “fillna() ” we can replace the empty space with a value.

import pandas as pddf = pd.read_csv('data.csv')df.fillna(5000,inplace=True)
#Here iam giving '5000' to replace empty values.print(df.to_string())

using “dropna()” method it will remove all rows containg NULL values.

we can also replace the empty cells by calculate the mean, median or mode value of the column.

# Mean = the average value (the sum of all values divided by number of values).x = df["Calories"].mean()df["Calories"].fillna(x,inplace=True)print(df.to_string())
------------------------------------------------------------# Mode = the value that appears most frequently.x = df["Calories"].mode()[0]df['Calories'].fillna(x, inplace=True)print(df.to_string())------------------------------------------------------------
# Median = the value in the middle, after you have sorted all values ascending.x = df["Calories"].median()df["Calories"].fillna(x,inplace=True)print(df.to_string())

Cells with data of wrong format , we can convert it into a correct format by using “to_datetime()”

df['Date'] = pd.to_datetime(df['Date'])print(df.to_string())

before using “to_string()” date in row 4 is wrong.

after using “to_string()” function ,4'th row date fixed.

If we want to make changes in the original DataFrame, use the "inplace = True” argument.

We can remove the rows that contain wrong data.

# Removing rows# The loc property is used to access a group of rows and columns by label(s) or a boolean array.for x in df.index: if df.loc[x, "Duration"] > 120:  df.drop(x, inplace = True)print(df.to_string())

here we have a wrong data in row 3 (Duration=450)

Using the “duplicated()" method we can find out the duplicate values present in our data set.

import pandas as pddf = pd.read_csv('data.csv')print(df.duplicated())

Returns ‘True’ for every row that is a duplicate, othwerwise ‘False’.

Using “drop_duplicates()” we can remove the duplicates values.

import pandas as pddf = pd.read_csv('data.csv')df.drop_duplicates(inplace = True)print(df.to_string())

Hope you found the article useful!

Thanks for Reading!!

Let’s do Some Manipulation of CSV(comma-separated values)using Pandas Library.

What is Pandas?

Written by Aswin Satheesh