Let’s do Some Manipulation of CSV(comma-separated values)using Pandas Library.

Aswin Satheesh
featurepreneur
Published in
3 min readAug 2, 2021

What is Pandas?

Pandas is defined as an open-source library that provides high-performance data manipulation in Python. The name of Pandas is derived from the word Panel Data, which means Econometrics from Multidimensional data.

Benefits of using CSV file:

  • CSV files are plain-text files, making them easier for the website developer to create
  • Since they’re plain text, they’re easier to import into a spreadsheet or another storage database, regardless of the specific software you’re using.
  • To better organize large amounts of data

I hope now you understood about the pandas & CSV file.

Now let’s do some manipulation of CSV using Pandas.

  • First of all, we have to import pandas, and read CSV files, for that use this code
import pandas as pddf = pd.read_csv('data.csv')  #Here iam using a csv file named "data"
  • using “to_string()” we can print the entire DataFrame.
import pandas as pddf = pd.read_csv('data.csv')print(df.to_string())
  • using “fillna() ” we can replace the empty space with a value.
import pandas as pddf = pd.read_csv('data.csv')df.fillna(5000,inplace=True)
#Here iam giving '5000' to replace empty values.
print(df.to_string())
before using fillna() method.
after using fillna() method.
  • using “dropna()” method it will remove all rows containg NULL values.
before using dropna() method.
after using dropna() method.
  • we can also replace the empty cells by calculate the mean, median or mode value of the column.
# Mean = the average value (the sum of all values divided by number of values).x = df["Calories"].mean()df["Calories"].fillna(x,inplace=True)print(df.to_string())
------------------------------------------------------------
# Mode = the value that appears most frequently.x = df["Calories"].mode()[0]df['Calories'].fillna(x, inplace=True)print(df.to_string())------------------------------------------------------------
# Median = the value in the middle, after you have sorted all values ascending.
x = df["Calories"].median()df["Calories"].fillna(x,inplace=True)print(df.to_string())
  • Cells with data of wrong format , we can convert it into a correct format by using “to_datetime()”
df['Date'] = pd.to_datetime(df['Date'])print(df.to_string())
before using “to_string()” date in row 4 is wrong.
after using “to_string()” function ,4'th row date fixed.

If we want to make changes in the original DataFrame, use the "inplace = True” argument.

  • We can remove the rows that contain wrong data.
# Removing rows# The loc property is used to access a group of rows and columns by label(s) or a boolean array.for x in df.index: if df.loc[x, "Duration"] > 120:  df.drop(x, inplace = True)print(df.to_string())
here we have a wrong data in row 3 (Duration=450)
we have removed the 3rd row.
  • Using the “duplicated()" method we can find out the duplicate values present in our data set.
values in row 1 and 2 are equal.
import pandas as pddf = pd.read_csv('data.csv')print(df.duplicated())
Returns ‘True’ for every row that is a duplicate, othwerwise ‘False’.
  • Using “drop_duplicates()” we can remove the duplicates values.
import pandas as pddf = pd.read_csv('data.csv')df.drop_duplicates(inplace = True)print(df.to_string())
row 2 has been removed from the result.

Hope you found the article useful!

Thanks for Reading!!

--

--