Day 9 : Dplyr package — Manipulate data in R like you do with sql

SaiGayatri Vadali
4 min readDec 31, 2017

--

This article is the ninth one in the series, Getting started with data science in 30 days using R programming. You can read all other articles here.

The reason ,behind the immense popularity of sql among the software community , is invariably the best and easy data manipulation techniques which it provides with commands such as select(), groupby().

Did you ever try to do similar tasks with your data sets in R like selecting only a few rows or grouping them? If you have done that, you must be knowing about ‘dplyr’

‘Dplyr’ package, developed and maintained by Hadley Wickham, provides some of the best data manipulation methodologies which makes our data analysis very simple. Also, it has essential backend with itself. Hence , needs little involvement from computer. In the view of above , it is needed that every data scientist know how to use it. All it takes 5 minutes to know about it. So let’s dive in.

Before that, let’s get the package using install.packages() command and load it into the workspace. Also, get a sample data set and load into a data frame. I am using pima dataset in some examples.

>install.packages("dplyr") # downloading package into system
> library("dplyr")# loading it into workspace

Dplyr provides following commands

  1. filter
  2. select
  3. arrange
  4. mutate
  5. summarise
  6. group_by
  7. sample_n and sample_fac

Filter()

filter() helps in selecting required rows from the data set. It is like “select” command in sql. The first argument is the name of the data frame. Here in the example, it is ‘pima_dataframe1’ .The subsequent arguments are conditions.

> head(filter(pima_dataframe1, pima_dataframe1$X148> 165))
X6 X148 X72 X35 X0 X33.6 X0.627 X50 X1
1 8 183 64 0 0 23.3 0.672 32 1
2 2 197 70 45 543 30.5 0.158 53 1
3 10 168 74 0 0 38.0 0.537 34 1
4 1 189 60 23 846 30.1 0.398 59 1
5 5 166 72 19 175 25.8 0.587 51 1
6 7 196 90 0 0 39.8 0.451 41 1

select()

select() helps in selecting only required columns from a data set.

> head(select(pima_dataframe1,X6))
X6
1 1
2 8
3 1
4 0
5 5
6 3

arrange()

Arrange() helps in arranging the rows in accordance with sorting order of the given column.

> head(arrange(pima_dataframe1,pima_dataframe1$X148))
X6 X148 X72 X35 X0 X33.6 X0.627 X50 X1
1 1 0 48 20 0 24.7 0.140 22 0
2 1 0 74 20 23 27.7 0.299 21 0
3 1 0 68 35 0 32.0 0.389 22 0
4 5 0 80 32 0 41.0 0.346 37 1
5 6 0 68 41 0 39.0 0.727 41 1
6 5 44 62 0 0 25.0 0.587 36 0

We can arrange the rows in descending order also. Then the rows get sorted with respect to descending order of the mentioned column. Here it is ‘X148’ .

>View(arrange(pima_dataframe1, desc(pima_dataframe1$ X148)))

I am adding head() to give better visualisation of the data.

mutate()

Mutate() function enables us to add columns to the existing data frame. This can be extremely useful when we need to add some mathematical results like total sum of marks of students.

I added new column named ‘X123’ with values X6*2

> pima_dataframe2<-mutate(dataframe, X123= X6*2)
> head(pima_dataframe2)
X6 X148 X72 X35 X0 X33.6 X0.627 X50 X1 X123
1 1 85 66 29 0 26.6 0.351 31 0 2
2 8 183 64 0 0 23.3 0.672 32 1 16
3 1 89 66 23 94 28.1 0.167 21 0 2
4 0 137 40 35 168 43.1 2.288 33 1 0
5 5 116 74 0 0 25.6 0.201 30 0 10
6 3 78 50 32 88 31.0 0.248 26 1 6

summarise()

summarise function enables us to get a single row summarised result about the data set. summarise() is more useful when we use group_by() command.

group_by()

> dest <- group_by(pima_dataframe1,X6)
> summarise(dest,X148=n_distinct(X148),X1=n())
# A tibble: 17 x 3
X6 X148 X1
<int> <int> <int>
1 0 64 111
2 1 76 135
3 2 60 103
4 3 55 75
5 4 51 68
6 5 46 57
7 6 39 49
8 7 36 45
9 8 34 38
10 9 24 28
11 10 19 24
12 11 10 11
13 12 8 9
14 13 9 10
15 14 2 2
16 15 1 1
17 17 1 1

sample_n and sample_frac:

These are used for getting samples of data. sample_n is used for fixed number and sample_frac for fractional number.

> head(sample_n(pima_dataframe1,3))
X6 X148 X72 X35 X0 X33.6 X0.627 X50 X1
137 0 93 60 25 92 28.7 0.532 22 0
210 2 81 60 22 0 27.7 0.290 25 0
763 10 101 76 48 180 32.9 0.171 63 0

sample_frac syntax is given below

>sample_frac(pima_datafram1,0.1)

%>% chaining operator:

This %>% operator allows us to do different operations all at once. Let’s see how we can do that.

> flights %>%
group_by(year, month, day) %>%
select(arr_delay, dep_delay) %>%
summarise(
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
) %>%
filter(arr > 30 | dep > 30)

We have done grouping first and then select() operation on the resultant data frame obtained from the first step. On the resultant data frame, summarise() and filter operations have been done subsequently.

These are the important functionalities which ‘dplyr’ package provides.

Hope you liked this article!! Let me know what you felt in the response section!!

Happy New Year!!

--

--

SaiGayatri Vadali

An inquisitive Machine Learning Engineer, yoga trainer, fitness freak and a passionate writer!