Data Cleaning Using Python
Why does data need to be cleaned?
Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data — Wikipedia.
The data we get is not always clean, most of it tends to be messy and inconsistent. We have to clean up the data. Hence, having clean data will get higher quality information and be very helpful in making decisions.
For this article, I am working with the Dubizzle used Car Sales Dataset from Kaggle which has inconsistent data.
Step 1: Import the Library
Step 2: Import the dataset
Import the dataset that does it used in cleaning data. The Dubizzle used Car Sales Dataset can be accessed here.
Step 3: Look into your data
Before cleaning the data, you must know what type of data you are working with and how the data is structured, and you can find out what data is not representative in the analysis phase that you will do. Here, you can eliminate certain columns that are not needed in your analysis.
Based on the output, we can see that certain columns have a different number of rows than other and the variable types don’t match.
Step 4: Handling Missing Values
You can use the function “isna()” to detect missing values. It can detacting NAN values in data.
We get output as boolean types “True” and “False”. “True” means it contains the missing value in that variable and “False” doesn’t contain it. To find out how many missing values each variable has, you can use the “sum” function.
You can use several methods for handling missing values, namely “drop” or “fill”. Those methods have done by deleting or filling in columns or rows with values. It depends on how the data we have.
Step 5: Convert Data Type
Before you change the type of variable “price_in_aed” you need to replace “,” from each of the rows. You can use the “str.replace” function since that variable type is a string.
For “price_in_aed” and “kilometers”, you can use “astype(‘float’)” to change numeric type and “year” as “astype(‘int64’)”. In addition, use the datetime function to change the variable type to “year-month-day”.
Step 6: Remove Outliers
Using the boxplot to find out outliers in your data. Boxplot is useful, showing the minimum and maximum values, median, interquartile of the data. In the chart, the outliers are shown as points which those make easy to see.
To detect outliers in the data, use interquantile range(IQR). IQR tells us the variation in the dataset. Before getting the IQR, define Q1 and Q3 using the “.quantile” function. The minimum and maximum value defined to know any point outside this range is outliers.
The above plot shows outliers that are outside of the maximum value. After removing outliers, you can use the “len” function to know how much different data is.
And that’s it. You have successfully cleaned the dataset. It must be remembered that everyone has their own data cleaning method and each dataset has a different character. The most important thing is understanding your dataset and try to explore it further. I made the steps to clean the data, I hope this article can help you clean the dataset which is the initial phase in the analysis process. You can also check the full notebook here.