Step-by-step Basic Data Cleaning in R

Joyeeta Dey
4 min readJan 12, 2022

--

When working with data, your insights are only as good as your data.

If we have a lot of garbage values in your data, this will affect the analytics and the outputs a lot. For this, it is very important to properly clean and structure the data we are going to work on.

What is data cleaning?

Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted.

When we load a dataset, especially data collected on our own, we get a lot of undesirable values in a dataset, values may be in a format on which we cannot perform computations or there may be a significant proportion of missing values.

All of these data cleanings can be easily done in R. To get started, we take an easy dataset on which we can perform the data cleaning. You can find the data set in the link given below.

1. Loading the Dataset

data <- read.csv(“TempByCity.csv”)
head(data,5)
dataset

2. Viewing the Dataset

We start with viewing the basic structure of the dataset. This is important because we want to assess how to proceed with the cleaning and what all data or values are required to be cleaned.

## dimension of the data
dim(data)
## structure of the data
str(data)
## summary of the data
summary(data)
dimension and structure of the dataframe
summary of the data

3. Getting into Data cleaning

a) Handle the missing data

You can’t ignore missing data because many algorithms will not accept missing values. In a dataset, if there are fewer rows containing missing values, then we can simply remove them. Else, we can go for imputation methods.

To find the total number of NA values

sum(is.na(data))
[1] 274

This is a large number of rows, hence we go for data imputation — filling the missing values.

sum(is.na(temp$AverageTemperature))
[1] 124
## checking the mean and median values of Avg. Temperature
temp %>%
summarize(mean=mean(AverageTemperature,na.rm=TRUE),median=median(AverageTemperature,na.rm=TRUE))
mean median
1 17.05889 19.28

Since the mean and median are close by, we replace the missing values with the median because it is more accurate.

temp <- temp %>%
mutate(AverageTemperature=replace(AverageTemperature,is.na(AverageTemperature),median(AverageTemperature,na.rm = TRUE)))
## check again if any missing value is left
sum(is.na(temp$AverageTemperature))
[1] 0

Similarly, we replace the missing values with the median in AvgTemp

# checking for na values
sum(is.na(temp$Avg.Temp.Unc))
unique(temp$Avg.Temp.Unc)
[1] 0
NULL
temp <- temp %>% mutate(Avg.Temp.Unc=replace(Avg.Temp.Unc,is.na(Avg.Temp.Unc),median(Avg.Temp.Unc,na.rm = TRUE)))sum(is.na(temp$Avg.Temp.Unc))
[1] 0

The rest of the columns do not have any missing values. So we proceed onto our next step.

b) Fix Structural Errors

In this dataset, the date column is in ‘character’ format. We cannot infer much from such a data type. Hence, we split the date into day, month and year , then change them to numeric format.

library(tidyverse)## checking for na values
sum(is.na(temp$dt))
[1] 0
temp <- temp %>%
separate(dt,c("year","month","day"))
## changing them to integer
temp$year <- as.integer(temp$year)
temp$month <- as.integer(temp$month)
temp$day <- as.integer(temp$day)
head(data,3)

c) Removing unwanted values from the data

In the latitude and longitude, we have the data with coordinates and hence, the E,W, N and S values. But, we cannot feed such data into many models and nor can we perform numerical computations on them. Hence we remove the letters towards the end and split them into different data frames when we want to perform statistics on them.

## checking for NA values
sum(is.na(temp$Latitude))
## filtering the rows with N values
tempN<- temp %>%
dplyr::filter(endsWith(Latitude,'N'))
head(tempN,5)
## removing n form tempN dataframe
tempN <- tempN %>%
mutate(Latitude=gsub("N","",Latitude))
head(tempN$Latitude,5)
[1] 49.03 8.84 18.48 23.31 53.84
# changing the datatype of Latitude to numeric
tempN$Latitude <- as.numeric(tempN$Latitude)
sum(is.na(tempN$Latitude))
[1] 0
## for S values
## filtering the rows with S values
tempS<- temp %>%
dplyr::filter(endsWith(Latitude,'S'))
## removing n form tempN dataframe
tempS <- tempS %>%
mutate(Latitude=gsub("S","",Latitude))
head(tempS$Latitude,5)
[1] "4.02" "7.23" "7.23" "21.70" "15.27"
tempS$Latitude <- as.numeric(tempS$Latitude)
sum(is.na(tempS$Latitude))
[1] 0

Now, you have a clean and error-free dataset to start with your statistics and feed them into models.

Hope you learned something new and found the story useful. Connect with me on LinkedIn.

--

--

Joyeeta Dey

-CSE undergraduate, Front End and Back-End Developer, Machine Learning and Deep Learning Enthusiast , Highly interested in Data Structures and Algorithms.