Getting Started with Data Manipulation in R with dplyr

Let’s face it! Most of our time and effort in the journey from data to insights is spent in data manipulation and clean-up. If you’re using R as a part of your data analytics workflow, then the dplyr package is a life saver. dplyr was created to enable efficient manipulation of data with the advantages of speed and simplicity of coding.
With the ‘verb’ functions and chaining (pipe operator) it’s easier to perform complex data manipulation steps. In this article, we’ll look at the main functions within dplyr and their usage.
Before we dive into the details, let’s look at a quick example that highlights the capabilities of dplyr. For illustration, we’re using the ‘hflights’ dataset, which includes data on all flights that departed Houston, TX in 2011.
Let’s say that you want to perform the following operations on the data:
- Step 1: Filter for flights originating from IAH airport
- Step 2: Count total flights and delayed flights by each carrier
- Step 3: Convert it to a delayed-per-hundred (DPH) metric, i.e. the percentage of each carrier’s flights that were delayed
- Step 4: Sort the result by DPH in descending order
library(dplyr)
library(hflights)
d <- hflights  # assumption: d is the hflights data frame used throughout
dresult <- d %>%
  filter(Origin == "IAH") %>%                                        # Step 1
  mutate(FDelayed = DepDelay > 0) %>%                                # TRUE if late; NA stays NA
  group_by(UniqueCarrier) %>%
  summarise(No = n(), NumDelayed = sum(FDelayed, na.rm = TRUE)) %>%  # Step 2
  mutate(DPH = 100 * (NumDelayed / No)) %>%                          # Step 3
  arrange(desc(DPH))                                                 # Step 4
As you can see, these few lines of code quickly got us the answers we were looking for.
Piping
The concept of “piping” is a game changer!
“Piping” commands together creates R code that is concise, which makes it easier to write and easier to understand.
How does it work? The result set from each “transformation” (filter, mutate, select, etc.) gets passed to the next piped function below it, avoiding the need to save the data from each step to a new or existing data frame.
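To make that concrete, here is a minimal sketch that runs the same two steps twice, once as nested calls and once as a pipe. The small flights data frame is a made-up stand-in for hflights (invented rows, real column names), so treat it as an illustration only.

```r
library(dplyr)

# Toy stand-in for hflights: the rows are invented for illustration
flights <- data.frame(
  UniqueCarrier = c("AA", "UA", "UA", "WN"),
  Origin        = c("IAH", "IAH", "HOU", "IAH"),
  DepDelay      = c(5, -2, 10, 30)
)

# Nested calls: the steps have to be read from the inside out
nested <- arrange(filter(flights, Origin == "IAH"), desc(DepDelay))

# Piped: the same steps, read top to bottom, with no intermediate variables
piped <- flights %>%
  filter(Origin == "IAH") %>%
  arrange(desc(DepDelay))

# Both approaches should give the same data frame
identical(nested, piped)
```

Each %>% hands the data frame on its left to the function on its right as the first argument, which is why filter() and arrange() no longer need the data named explicitly.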
Now let’s take a step back and look at each of the main functions.
slice function - selects rows based on their “row position”. It works much like the built-in head() function, but allows much more flexibility because you can specify any range of rows in your dataset.
Quicktip : Use negative numbers to “exclude” rows. In the example below, the first line keeps rows 1 to 10, and the second excludes the first 10 rows and keeps the remaining rows in the dataset.
# keep only the first 10 rows
d1 <- slice(hflights, 1:10)
# drop the first 10 rows, keep the rest
slice(hflights, -(1:10))
select function - will refine your column selections. Columns can also be renamed within the select function, using the form new_name = old_name. You can drop columns by putting “-” before the column names.
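As a minimal sketch of dropping columns with “-”, the snippet below uses a tiny made-up data frame (the rows are invented for illustration; only the column names come from hflights):

```r
library(dplyr)

# Toy stand-in for hflights: the rows are invented for illustration
flights <- data.frame(
  UniqueCarrier = c("AA", "UA"),
  FlightNum     = c(428, 1),
  TaxiIn        = c(7, 6),
  TaxiOut       = c(13, 9)
)

# Drop the two taxi columns and keep everything else
no_taxi <- select(flights, -TaxiIn, -TaxiOut)

# Equivalent form: negate a vector of column names
no_taxi2 <- select(flights, -c(TaxiIn, TaxiOut))

names(no_taxi)
```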

d2 <- select(d, CarrierName = UniqueCarrier, FlightNum, DepDelay, ArrDelay)
filter function is similar to the SQL “where” clause. It returns the subset of rows that match the given filter criteria. Comparison and logical operators such as <, >, !=, ==, |, &, is.na(), !is.na(), and %in% can be used in the filter statement.
d3 <- filter(d,Origin == "IAH" & DepDelay>0 & UniqueCarrier=="UA")
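The other operators work the same way. Here is a small sketch of %in% and !is.na(), again on a made-up data frame that borrows hflights column names (the values are invented):

```r
library(dplyr)

# Toy stand-in for hflights: the rows are invented for illustration
flights <- data.frame(
  UniqueCarrier = c("AA", "UA", "WN", "UA"),
  DepDelay      = c(5, NA, 12, -3)
)

# Conditions separated by commas are combined with AND:
# keep two carriers, and only rows where the delay was recorded
known <- filter(flights, UniqueCarrier %in% c("AA", "UA"), !is.na(DepDelay))

# A comparison against NA evaluates to NA, and filter() drops those rows
positive <- filter(flights, DepDelay > 0)
```

Separating conditions with commas is shorthand for joining them with &, so the d3 example could be written either way.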
mutate function creates new variables in your data frame based on existing variables. Unlike base R’s transform(), if you’re creating multiple variables within the same mutate(), a later variable can refer to one created earlier in the same call.
Quicktip : mutate() adds new variables and preserves the existing ones, while transmute() adds new ones and drops the existing ones.
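Both points can be sketched on a tiny made-up data frame. The taxiShare variable is a hypothetical name introduced here only to show a later variable reusing an earlier one; the rows are invented.

```r
library(dplyr)

# Toy stand-in for hflights: the rows are invented for illustration
flights <- data.frame(
  TaxiIn  = c(7, 6),
  TaxiOut = c(13, 9),
  AirTime = c(40, 45)
)

# taxiShare (hypothetical) reuses totalTaxing, created one line earlier
with_cols <- mutate(
  flights,
  totalTaxing = TaxiIn + TaxiOut,
  taxiShare   = totalTaxing / (totalTaxing + AirTime)
)

# transmute() keeps only the newly created variable
only_new <- transmute(flights, totalTaxing = TaxiIn + TaxiOut)

names(with_cols)
names(only_new)
```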

d4 <- mutate(d, totalTaxing = TaxiIn + TaxiOut)
group_by and summarise functions - The group_by() function groups rows that share the same value in a column. It is most often used together with an aggregate function such as sum(), mean(), median(), etc. The summarise() function (also spelled summarize()) condenses multiple values into a single one.
Frequently used summary functions include n(), sum(), mean(), median(), min(), max(), and sd().

d6 <- d%>%
group_by(UniqueCarrier)%>%
summarise(TotalFlights=n())
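Several summaries can be computed in one summarise() call. The sketch below assumes a small made-up data frame with hflights-style column names; MeanArrDelay and MaxArrDelay are illustrative names, not columns of the real dataset.

```r
library(dplyr)

# Toy stand-in for hflights: the rows are invented for illustration
flights <- data.frame(
  UniqueCarrier = c("AA", "AA", "UA", "UA", "UA"),
  ArrDelay      = c(10, NA, -5, 20, 0)
)

by_carrier <- flights %>%
  group_by(UniqueCarrier) %>%
  summarise(
    TotalFlights = n(),                            # rows per carrier, NAs included
    MeanArrDelay = mean(ArrDelay, na.rm = TRUE),   # ignore missing delays
    MaxArrDelay  = max(ArrDelay, na.rm = TRUE)
  )

by_carrier
```

Note that n() counts every row in the group, while na.rm = TRUE makes mean() and max() skip missing values, so the two can be based on different numbers of flights.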
arrange function will sort rows in the data frame based on a column value. This is similar to “order by” in SQL.
d5 <- arrange(d4, desc(ArrDelay))
dplyr may feel a bit overwhelming at first, but with just a little practice you will soon master its most useful components. You may find that every R script or notebook you write is better with dplyr: your data processing code will be more concise and easier to understand, and development time will drop noticeably.
So, the next time you want to perform data manipulation in R, dplyr is the way to go!

