Data Cleaning With Python

Arowolo Christopher Olalekan
8 min read · Nov 9, 2023


How I cleaned one of the dirtiest datasets you can find on the internet using Pandas and NumPy

Introduction

What is Data Cleaning?


Before we define data cleaning, here is a simple truth every data professional knows: data is rarely available in the form we all dream about. As a data professional in Nigeria, data is hardly ever handed to you on a platter; you either gather it through surveys or scrape it off the internet. Neither method reliably gives you clean data, which is why data cleaning is one of the most important steps to take after you have successfully gotten the data you need. Now, what really is data cleaning?

Data Cleaning is the process of identifying and correcting errors, inconsistencies, and inaccuracies in a dataset to improve its quality and reliability. It is a critical step in data preprocessing that helps ensure that data is accurate, complete, and consistent for effective analysis and decision-making.

Scope Of Data Cleaning

Data cleaning involves several processes; here are a few of them, with a quick pandas sketch after the list:

  1. Handling Missing Data: Addressing missing values by either imputing them using statistical techniques or removing the rows or columns containing missing data.
  2. Removing Duplicates: Identifying and eliminating duplicate records to avoid redundancy and ensure data accuracy.
  3. Standardizing Data: Converting data into a consistent format to facilitate comparisons and analysis.
  4. Handling Outliers: Identifying and dealing with data points that deviate significantly from the rest of the dataset to prevent them from skewing analysis results.
  5. Resolving Inconsistencies: Ensuring that the data type is uniform and conforms to the predefined type.
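
To make these concrete, here is a minimal pandas sketch of the five operations on a tiny DataFrame (the column names and values are made up, purely for illustration):

import pandas as pd

df = pd.DataFrame({
    'city': [' Lagos', 'lagos', 'Abuja', None],
    'sales': [100.0, 100.0, None, 2500.0],
})

#1. handle missing data: impute numeric nulls with the mean, drop rows with missing text
df['sales'] = df['sales'].fillna(df['sales'].mean())
df = df.dropna(subset=['city'])

#2. remove duplicates
df = df.drop_duplicates()

#3. standardize text: strip whitespace and unify the case
df['city'] = df['city'].str.strip().str.title()

#4. handle outliers: keep values within 1.5 * IQR of the quartiles
q1, q3 = df['sales'].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df['sales'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

#5. resolve inconsistencies: enforce the expected data type
df['sales'] = df['sales'].astype(float)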

Cleaning The Movie Data

Data cleaning can be done with several tech stacks, like SQL, MS Excel, or business intelligence tools like Power BI and Tableau. For this project, I decided to use Python because it offers a variety of libraries that can be deployed to fix a problem; two of the most popular are Pandas and NumPy. The data was obtained from kaggle.com and was originally scraped off the IMDb, Netflix, and TV shows websites.

Movie data before cleaning.

So it began:

  1. Loaded the CSV file into my Jupyter notebook.
#first, i would import the libraries i need into my jupyter notebook
import pandas as pd
import numpy as np

file_path = "C:/Users/user/Desktop/archive122/movies.csv"
#read the data into the notebook
movie_df = pd.read_csv(file_path)
#print the dataset
movie_df
Data after importing into my notebook.

2. I took a sneak peek at the dataset using the .info() method.

#checking the dataset
movie_df.info()
A sneak peek at the data.

3. Handled the null values by removing rows with nulls in the text columns, and replacing nulls with the column mean in the numerical columns. Different strokes for different folks.. lol, yeah? Here is what it looked like:

#the first step in my cleaning process is to get rid of null values.
#the info() output showed a drop in the number of non-null entries, meaning there are null values in there. let's take a look
movie_df.isna()

#this supports the claim, but let's count the null cells per column
movie_df.isna().sum()
#there were quite a few null values in the dataset.

#let us treat them one after the other.
#for the year column, we will drop the rows where year is null.

movie_df = movie_df.dropna(subset = ['YEAR'])
#dropping rows with null values made more sense for an unbiased analysis.
#print the dataset
movie_df

#next is the genre column. a movie without a genre classification should be dropped to avoid confusion
movie_df = movie_df.dropna(subset = ['GENRE'])
#print the dataset
movie_df

#Next is the rating column.
#null ratings can be replaced with the average rating in the column.
movie_df['RATING'] = movie_df['RATING'].fillna(movie_df['RATING'].mean())
#assigning back instead of using inplace=True avoids the pandas chained-assignment warning i was hit with.
#print dataset
movie_df

#round the imputed ratings to 1 decimal place
movie_df['RATING'] = movie_df['RATING'].round(1)
#print the dataset
movie_df

#up to the next column, VOTES. i need to replace the null values with the mean value, however there is a problem: the column is a string,
#so i need to convert it first before finding the mean value.
try:
    #remove the thousands separator ',' and convert to float
    movie_df['VOTES'] = movie_df['VOTES'].str.replace(',', '').astype(float)
except ValueError as e:
    #print the error if any value cannot be converted
    print(f'Error: {e}')
#print the dataset
print(movie_df)
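
As an aside, pd.to_numeric with errors='coerce' would be a more forgiving way to do the same conversion as the try/except above: anything that still fails to parse becomes NaN instead of raising an error, and those NaNs are then picked up by the mean imputation below. A minimal sketch:

#alternative conversion: coerce unparseable values to NaN instead of raising
movie_df['VOTES'] = pd.to_numeric(movie_df['VOTES'].str.replace(',', ''), errors='coerce')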

#next, i'd get the mean value and replace the null values with it.
movie_df['VOTES'] = movie_df['VOTES'].fillna(movie_df['VOTES'].mean())
#print to show changes
movie_df
#next, round to 1 decimal place.
movie_df['VOTES'] = movie_df['VOTES'].round(1)
#print the dataset
movie_df

#next is the runtime column
movie_df['RunTime'] = movie_df['RunTime'].fillna(movie_df['RunTime'].mean())
#print the dataset
movie_df
#next, round off to 1 decimal place
movie_df['RunTime'] = movie_df['RunTime'].round(1)
#print the dataset
movie_df

#the Gross column has a lot of null values, so i would replace the nulls with zero.
movie_df['Gross'] = movie_df['Gross'].fillna(0)
movie_df
#job done on that. next, i would change the data types to the appropriate ones.
After handling null values.

4. Handled incorrect data types. Some columns, like the year and gross columns, were not stored in the appropriate type and had to be converted.

Some columns needed to be cleaned before they could be converted, so I cleaned them before changing the data type.

#the year column is a string; it should be changed to a date data type.
movie_df['YEAR']
#to achieve that, first i must strip the column of any character that is not a digit

#the regex keeps only the digits in each value
new_date = movie_df['YEAR'].str.replace('[^0-9]', '', regex = True)
#print the dataset
new_date
#some rows have extra digits besides the year, as shown below.

#trim the strings by keeping only the first 4 digits
trimmed_dates = [date[:4] for date in new_date]
#display the trimmed strings
for trimmed_date in trimmed_dates:
    print(trimmed_date)

#replace the column with the trimmed date values
movie_df.loc[:, 'YEAR'] = trimmed_dates
#print the dataset
movie_df

#to double check the column
movie_df['YEAR'].unique()

#next, i will change the data type to datetime
#pd.to_datetime parses the 4-digit years; astype('datetime64') without a unit errors out in newer pandas
movie_df['YEAR'] = pd.to_datetime(movie_df['YEAR'], format='%Y', errors='coerce')
#print the dataset
movie_df
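
A quick aside: the list comprehension above works, but pandas can do the same trim in one vectorized step with the .str accessor, which also keeps the original index, so no .loc assignment is needed. A minimal sketch of what that step could have looked like:

#vectorized alternative to the loop: keep only the first 4 characters of each value
movie_df['YEAR'] = new_date.str[:4]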

#next, the genre column.
#let me get the distinct values first.
movie_df['GENRE'].unique()
#this shows the extent of the dirty data.

#first i will get rid of the white spaces.
movie_df['GENRE'] = movie_df['GENRE'].str.strip()
#print the dataset
movie_df

#i want to split the keywords into columns
#get the unique values first
movie_df['GENRE'].unique()

#split the keywords into separate columns using get_dummies
#get_dummies one-hot encodes the genres: 1 means the genre is present, 0 means otherwise
genre_df = movie_df['GENRE'].str.get_dummies(', ')
#print the new genre columns
genre_df

#change the dataframe to a categorical data type.
genre_df = genre_df.astype('category')
#print the dataset
genre_df.info()

#next, i'll drop the old column and replace it with this new data frame
movie_df.drop('GENRE', axis = 1, inplace = True)
#print the dataset
movie_df

#then i would concat the new dataframe to the old dataframe
movie_df = pd.concat([movie_df, genre_df], axis=1)
#print the dataset after joining the new columns
movie_df
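
If get_dummies is new to you, here is a tiny self-contained example (the values are made up) showing what the split looks like:

#toy example of str.get_dummies on made-up values
demo = pd.Series(['Action, Comedy', 'Drama', 'Comedy, Drama'])
print(demo.str.get_dummies(', '))
#   Action  Comedy  Drama
#0       1       1      0
#1       0       0      1
#2       0       1      1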

#next is the one-line column
#trim the whitespaces off
movie_df['ONE-LINE'] = movie_df['ONE-LINE'].str.strip()
#print the dataset
movie_df

#next is the STARS column
#first, get rid of the white spaces
movie_df['STARS'] = movie_df['STARS'].str.strip()
#print the dataset
movie_df

#lets take a closer look at the column again
movie_df['STARS']
#i need to remove the '\n' characters and replace the '|' with a space
#regex = False matters for '|': treated as a regex, it would match between every character
movie_df['STARS'] = movie_df['STARS'].str.replace('\n', '').str.replace('|', ' ', regex = False)
movie_df['STARS']
#okay, this looks better.

#next is the Gross column, let's take a look
movie_df['Gross'].unique()

#i will change the data type, but first i have to strip the '$' and 'M' off.
movie_df['Gross'] = movie_df['Gross'].str.replace('$M', '')

#check for the changes made
movie_df['Gross'].unique()
#that did not go as planned: '$' and 'M' sit at opposite ends of each value, so the literal substring '$M' never appears. i would try again

movie_df['Gross'] = movie_df['Gross'].str.replace('$','').str.replace('M','')
#the values are now plain numbers (in millions of dollars)
movie_df['Gross']
#okay, let's look at the unique values

movie_df['Gross'].unique()
#next, i would change the data type to float

#first attempt: movie_df['Gross'] = movie_df['Gross'].astype('numeric')
#that raises an error: 'numeric' is not a valid dtype. pd.to_numeric with errors='coerce' does the job, turning any cell that still won't parse into NaN

movie_df['Gross'] = pd.to_numeric(movie_df['Gross'], errors='coerce')
#round the Gross column to 1 decimal place
movie_df['Gross'] = movie_df['Gross'].round(1)

#the zeros from earlier came back as NaN after the string operations, so i would change the NaN back to zero.
movie_df['Gross']
movie_df['Gross'] = movie_df['Gross'].fillna(0)
#print the dataset
movie_df
After the data type changes and column cleaning.

5. Handled duplicated values. There were 3,034 duplicated rows in the dataset. I dropped them all before checking by column, because I noticed that some values in the MOVIES column were duplicated names; since that column is the unique identifier in the table, duplicated names are not allowed, so I dropped the rows with duplicated names.


#show the duplicated rows
movie_df[movie_df.duplicated()]
#drop duplicates
movie_df = movie_df.drop_duplicates(keep = 'first')
#view the changes made in the dataframe
movie_df

#several rows have been dropped, but to ensure there are no duplicates, we check by column
movie_df['MOVIES'].duplicated()
#drop duplicates within the column
movie_df['MOVIES'] = movie_df['MOVIES'].drop_duplicates(keep = False)
#careful: assigning the shorter series back reindexes it, so the dropped positions turn into NaN in MOVIES
#check for duplicate
movie_df['MOVIES'].duplicated()

#i want to show the duplicated values in the column movies, for a closer look
duplicate = movie_df[movie_df.duplicated(subset = 'MOVIES', keep = False)]
duplicate
#the MOVIES column now contains NaN (from the reindexing above); that's unacceptable, as the column is a unique identifier.

#i want to drop the NaN values from the dataset, but i would take a closer look before i do that
movie_df['MOVIES']

#there are definitely NaN values there, so i would drop them now by dropping every row whose MOVIES value is duplicated (the NaNs count as duplicates of each other).
movie_df = movie_df.drop_duplicates(subset = 'MOVIES', keep = False)
movie_df

#check to see if the changes were okay
movie_df['MOVIES'].duplicated()
Data after handling the duplicated rows.

After handling the duplicated data, I took a sneak peek at the dataset before making some quick data type corrections, as shown below.

#next, i want to round VOTES to a whole number
movie_df['VOTES'] = movie_df['VOTES'].round(0)

#then i want to round the runtime column to a whole number
movie_df['RunTime'] = movie_df['RunTime'].round(0)
movie_df

#VOTES should be an int, not a float
movie_df['VOTES'] = movie_df['VOTES'].astype('int64')
movie_df

#RunTime is in minutes; the right data type for that is int.
movie_df['RunTime'] = movie_df['RunTime'].astype('int64')
movie_df

Final Look at the Data

movie_df.info()
Final peek at the data attributes.
#final shape of the dataset
movie_df.shape
Final shape of the dataset.
#final look at the dataset
movie_df
What the data looked like after cleaning.

I saved the cleaned CSV file to my local machine.

#finally, i would save the dataset to my desktop
file_path = 'C:/Users/user/Desktop/cleaned_movie_data.csv'
movie_df.to_csv(file_path, index = False)
#see you next time! ciao!
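
If you want to be extra sure the file landed intact, a quick sanity check (purely optional, not part of the cleaning itself) is to read it back and compare shapes:

#optional sanity check: read the saved file back and confirm the shape matches
check_df = pd.read_csv(file_path)
print(check_df.shape, movie_df.shape)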

This is what it looks like in MS Excel:

After cleaning the data.

And it's a wrap!

To follow along with the process, my GitHub repository contains a proper run-through in my Jupyter notebook to help you relate better to each step.

Thank you for reading through. Please let me know your thoughts and corrections.

Kindly follow me on X:

@ogbeniherooo
