Step-by-step Basic Data Cleaning in R
When working with data, your insights are only as good as your data.
If we have a lot of garbage values in your data, this will affect the analytics and the outputs a lot. For this, it is very important to properly clean and structure the data we are going to work on.
What is data cleaning?
Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted.
When we load a dataset, especially data collected on our own, we get a lot of undesirable values in a dataset, values may be in a format on which we cannot perform computations or there may be a significant proportion of missing values.
All of these data cleanings can be easily done in R. To get started, we take an easy dataset on which we can perform the data cleaning. You can find the data set in the link given below.
1. Loading the Dataset
data <- read.csv(“TempByCity.csv”)
head(data,5)
2. Viewing the Dataset
We start with viewing the basic structure of the dataset. This is important because we want to assess how to proceed with the cleaning and what all data or values are required to be cleaned.
## dimension of the data
dim(data)## structure of the data
str(data)## summary of the data
summary(data)
3. Getting into Data cleaning
a) Handle the missing data
You can’t ignore missing data because many algorithms will not accept missing values. In a dataset, if there are fewer rows containing missing values, then we can simply remove them. Else, we can go for imputation methods.
To find the total number of NA values
sum(is.na(data))
[1] 274
This is a large number of rows, hence we go for data imputation — filling the missing values.
sum(is.na(temp$AverageTemperature))
[1] 124## checking the mean and median values of Avg. Temperature
temp %>%
summarize(mean=mean(AverageTemperature,na.rm=TRUE),median=median(AverageTemperature,na.rm=TRUE))mean median
1 17.05889 19.28
Since the mean and median are close by, we replace the missing values with the median because it is more accurate.
temp <- temp %>%
mutate(AverageTemperature=replace(AverageTemperature,is.na(AverageTemperature),median(AverageTemperature,na.rm = TRUE)))## check again if any missing value is left
sum(is.na(temp$AverageTemperature))
[1] 0
Similarly, we replace the missing values with the median in AvgTemp
# checking for na values
sum(is.na(temp$Avg.Temp.Unc))
unique(temp$Avg.Temp.Unc)
[1] 0
NULLtemp <- temp %>% mutate(Avg.Temp.Unc=replace(Avg.Temp.Unc,is.na(Avg.Temp.Unc),median(Avg.Temp.Unc,na.rm = TRUE)))sum(is.na(temp$Avg.Temp.Unc))
[1] 0
The rest of the columns do not have any missing values. So we proceed onto our next step.
b) Fix Structural Errors
In this dataset, the date column is in ‘character’ format. We cannot infer much from such a data type. Hence, we split the date into day, month and year , then change them to numeric format.
library(tidyverse)## checking for na values
sum(is.na(temp$dt))
[1] 0temp <- temp %>%
separate(dt,c("year","month","day"))## changing them to integer
temp$year <- as.integer(temp$year)
temp$month <- as.integer(temp$month)
temp$day <- as.integer(temp$day)head(data,3)
c) Removing unwanted values from the data
In the latitude and longitude, we have the data with coordinates and hence, the E,W, N and S values. But, we cannot feed such data into many models and nor can we perform numerical computations on them. Hence we remove the letters towards the end and split them into different data frames when we want to perform statistics on them.
## checking for NA values
sum(is.na(temp$Latitude))## filtering the rows with N values
tempN<- temp %>%
dplyr::filter(endsWith(Latitude,'N'))
head(tempN,5)## removing n form tempN dataframe
tempN <- tempN %>%
mutate(Latitude=gsub("N","",Latitude))head(tempN$Latitude,5)
[1] 49.03 8.84 18.48 23.31 53.84# changing the datatype of Latitude to numeric
tempN$Latitude <- as.numeric(tempN$Latitude)
sum(is.na(tempN$Latitude))
[1] 0## for S values
## filtering the rows with S valuestempS<- temp %>%
dplyr::filter(endsWith(Latitude,'S'))## removing n form tempN dataframe
tempS <- tempS %>%
mutate(Latitude=gsub("S","",Latitude))head(tempS$Latitude,5)
[1] "4.02" "7.23" "7.23" "21.70" "15.27"tempS$Latitude <- as.numeric(tempS$Latitude)
sum(is.na(tempS$Latitude))
[1] 0
Now, you have a clean and error-free dataset to start with your statistics and feed them into models.
Hope you learned something new and found the story useful. Connect with me on LinkedIn.