How to Manipulate Your Data in R Using the dplyr Package
There are two obvious truths in life:
First, cats know something that we don’t know.
And second, no data is perfect.
Without further arguing the first one, let’s jump straight to the second. As a data professional, or an aspiring one, almost every single day you will encounter a dataset that you have to work on to carve something meaningful out of it. But before working on it, you have to mold it into something workable (I have legitimate doubts that a word like “workable” even exists, but anyway).
This process is called data manipulation. It is simply a way of interacting with your data to reach an end, like producing a predictive model, analyzing boring sales data, or whatever it is. There are various tools for data manipulation. You could pick Excel, you could pick Google’s spreadsheet products, you could pick Python, etc. But in this article, you and I will go with R.
Assumption: Before diving into the manipulation part, you have already gotten acquainted with your data. You imported the dataset into the R environment, you completed a decent EDA process, you even petted your neighbor's cute Husky by tapping his head, chanting “good boii”.
So I’ll skip that part.
Alright. Let’s get things done. I’ll use the “dplyr” package and its lovely functions to manipulate the starwars dataset. starwars ships with dplyr, so once the package is loaded you can call it simply by typing its name. By the way, for those who have no idea about dplyr: first you have to install it.
install.packages("dplyr")
And then load it.
library(dplyr)
Call the dataset and view it.
starwars
View(starwars)
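View() opens a spreadsheet-style viewer, which is handy in RStudio but not much use in a script or a plain terminal. Here is a minimal sketch of a console-friendly alternative, assuming only that dplyr is loaded: glimpse(), which dplyr makes available, prints every column with its type and first few values.

```r
library(dplyr)

# Print one line per column: name, type, and the first few values
glimpse(starwars)

# Base R helpers work on the same object
dim(starwars)    # number of rows and columns
names(starwars)  # column names
```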
Lovely, isn’t it? Let’s utilize our package. The mighty dplyr.
We’re gonna use the following functions to play with our data (all of them are dplyr verbs except na.omit(), which comes with base R):
select()
filter()
mutate()
arrange()
na.omit()
rename()
sample_n()
Let’s review them one by one. But first, let me introduce our magical operator: the pipe. It looks like this: %>% and has some unknown dark powers. For now, just think of it as part of the dplyr grammar and read it as “and then” in your head. It takes whatever is on its left and hands it to the function on its right as the first argument. Just remember to put it at the end of each line (except the last one) and you’re gonna be okay. Trust me, I’m a data scientist.
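To make the “and then” reading concrete, here is a small sketch on a made-up data frame (toy and its values are invented for illustration, not part of starwars). It shows that piping a data frame into a verb gives the same result as passing it as the first argument.

```r
library(dplyr)

# A tiny made-up data frame, used only for illustration
toy <- data.frame(name = c("a", "b", "c"), height = c(150, 90, 200))

# Without the pipe: the data frame is the first argument
nested <- filter(toy, height > 100)

# With the pipe: "take toy, and then filter it"
piped <- toy %>%
  filter(height > 100)

# Both spellings produce the same data frame
identical(nested, piped)
```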
starwars_2 <- starwars %>%
  select(where(is.numeric))
This code returns only the numeric variables in your dataset. Who needs nominal variables, right?
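select() can pick columns by name as well as by type. A quick sketch on a made-up data frame (toy is invented for illustration):

```r
library(dplyr)

# Made-up data: one character column and two numeric columns
toy <- data.frame(
  name   = c("a", "b", "c"),
  height = c(150, 90, 200),
  mass   = c(60, 30, NA)
)

# Keep columns by name
by_name <- toy %>%
  select(name, height)

# Keep columns by type: every column for which is.numeric() is TRUE
numeric_only <- toy %>%
  select(where(is.numeric))

names(by_name)
names(numeric_only)
```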
starwars_2 <- starwars %>%
  select(where(is.numeric)) %>%
  na.omit()
na.omit() helps you get rid of the observations containing NA values. Remember, we hate them.
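One thing worth knowing: na.omit() drops a row if any of its columns is NA, so it matters whether you call it before or after select(). A small sketch with invented data (toy and its columns are hypothetical):

```r
library(dplyr)

# Made-up data: row 2 has a missing species, row 3 a missing height
toy <- data.frame(
  name    = c("a", "b", "c"),
  species = c("x", NA, "y"),
  height  = c(150, 90, NA)
)

# Dropping NAs first looks at every column, including species
all_columns <- toy %>%
  na.omit()

# Selecting first means only the numeric columns are checked for NAs
numeric_only <- toy %>%
  select(where(is.numeric)) %>%
  na.omit()

nrow(all_columns)
nrow(numeric_only)
```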
starwars_2 <- starwars %>%
  select(where(is.numeric)) %>%
  na.omit() %>%
  arrange(desc(height))
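The arrange() function sorts the rows by the variable you give it. On its own it sorts in ascending order; wrapping the variable in desc() flips it to descending, so the tallest characters come first.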
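A short sketch of both sort directions, using a made-up data frame (toy is invented for illustration):

```r
library(dplyr)

# Made-up heights in centimetres
toy <- data.frame(name = c("a", "b", "c"), height = c(150, 90, 200))

# Ascending order is the default
shortest_first <- toy %>%
  arrange(height)

# desc() reverses the order for that variable
tallest_first <- toy %>%
  arrange(desc(height))

shortest_first$height
tallest_first$height
```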
starwars_2 <- starwars %>%
  select(where(is.numeric)) %>%
  na.omit() %>%
  arrange(desc(height)) %>%
  filter(height > 100)
But it’s not enough. We despise characters who are 100 cm or shorter, so we kick them out with filter(), which keeps only the rows that satisfy the condition. Come and sit with the big boys.
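filter() also accepts several conditions at once; separating them with commas means all of them must hold. A minimal sketch on invented data (toy and the thresholds are made up for illustration):

```r
library(dplyr)

# Made-up heights (cm) and masses (kg)
toy <- data.frame(
  name   = c("a", "b", "c", "d"),
  height = c(150, 90, 120, 200),
  mass   = c(60, 30, 80, 140)
)

# One condition: keep rows taller than 100 cm
tall <- toy %>%
  filter(height > 100)

# Two conditions separated by a comma: both must be TRUE
tall_and_light <- toy %>%
  filter(height > 100, mass < 100)

tall$name
tall_and_light$name
```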
starwars_2 <- starwars %>%
  select(where(is.numeric)) %>%
  na.omit() %>%
  arrange(desc(height)) %>%
  filter(height > 100) %>%
  mutate(hm_index = height / mass)
The mutate() function adds a brand new variable at the end of the dataset. It lets you take existing variables and compute a new one from them; here hm_index is simply height divided by mass.
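mutate() can create more than one variable in a single call, and later ones can use earlier ones. A small sketch on made-up data (toy, height_m and bmi are names invented for this example, not part of the article's pipeline):

```r
library(dplyr)

# Made-up heights (cm) and masses (kg)
toy <- data.frame(height = c(150, 200), mass = c(60, 80))

with_new_columns <- toy %>%
  mutate(
    height_m = height / 100,     # convert centimetres to metres
    bmi      = mass / height_m^2 # uses the column created one line above
  )

with_new_columns
```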
starwars_2 <- starwars %>%
  select(where(is.numeric)) %>%
  na.omit() %>%
  arrange(desc(height)) %>%
  filter(height > 100) %>%
  mutate(hm_index = height / mass) %>%
  rename(year = birth_year)
rename(). I think it is self-evident. Just mind the order: the new name goes on the left, the old name on the right.
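A tiny sketch of the new_name = old_name order, on made-up data (toy is invented for illustration):

```r
library(dplyr)

# Made-up data with a column we want to rename
toy <- data.frame(name = c("a", "b"), birth_year = c(19, 112))

# New name on the left, existing name on the right
renamed <- toy %>%
  rename(year = birth_year)

names(renamed)
```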
starwars_2 <- starwars %>%
  select(where(is.numeric)) %>%
  na.omit() %>%
  arrange(desc(height)) %>%
  filter(height > 100) %>%
  mutate(hm_index = height / mass) %>%
  rename(year = birth_year) %>%
  sample_n(20)
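And sample_n(). It picks random observations from your dataset (20 of them here) and presents them to you. How kind, isn’t it? Since dplyr 1.0.0, sample_n() has been superseded by slice_sample(n = 20), though it still works.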
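Because the rows are picked at random, you get a different sample on every run. If you want the same rows each time, one common approach is to fix the random seed first. A minimal sketch on made-up data, using slice_sample() (the seed value 42 and the toy data frame are arbitrary choices for illustration):

```r
library(dplyr)

# Made-up data: ten rows to sample from
toy <- data.frame(id = 1:10, height = seq(110, 200, by = 10))

# Fixing the seed makes the random draw repeatable
set.seed(42)
first_draw <- toy %>%
  slice_sample(n = 3)

# Same seed, same rows
set.seed(42)
second_draw <- toy %>%
  slice_sample(n = 3)

identical(first_draw, second_draw)
```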
Voila!
Congrats, you just took something out of the world and replaced it with a distorted one!
Joking. You just completed a data manipulation process in R using the dplyr package. Run the code below to see what it looks like:
View(starwars_2)
Okay. My job here is done, mate. Take care. Bye.