Day 9 : Dplyr package — Manipulate data in R like you do with sql

4 min readDec 31, 2017

This article is the ninth one in the series, Getting started with data science in 30 days using R programming. You can read all other articles here.

The reason ,behind the immense popularity of sql among the software community , is invariably the best and easy data manipulation techniques which it provides with commands such as select(), groupby().

Did you ever try to do similar tasks with your data sets in R like selecting only a few rows or grouping them? If you have done that, you must be knowing about ‘dplyr’

‘Dplyr’ package, developed and maintained by Hadley Wickham, provides some of the best data manipulation methodologies which makes our data analysis very simple. Also, it has essential backend with itself. Hence , needs little involvement from computer. In the view of above , it is needed that every data scientist know how to use it. All it takes 5 minutes to know about it. So let’s dive in.

Before that, let’s get the package using install.packages() command and load it into the workspace. Also, get a sample data set and load into a data frame. I am using pima dataset in some examples.

>install.packages("dplyr") # downloading package into system
> library("dplyr")# loading it into workspace

Dplyr provides following commands

filter
select
arrange
mutate
summarise
group_by
sample_n and sample_fac

Filter()

filter() helps in selecting required rows from the data set. It is like “select” command in sql. The first argument is the name of the data frame. Here in the example, it is ‘pima_dataframe1’ .The subsequent arguments are conditions.

> head(filter(pima_dataframe1, pima_dataframe1$X148> 165))
  X6 X148 X72 X35  X0 X33.6 X0.627 X50 X1
1  8  183  64   0   0  23.3  0.672  32  1
2  2  197  70  45 543  30.5  0.158  53  1
3 10  168  74   0   0  38.0  0.537  34  1
4  1  189  60  23 846  30.1  0.398  59  1
5  5  166  72  19 175  25.8  0.587  51  1
6  7  196  90   0   0  39.8  0.451  41  1

select()

select() helps in selecting only required columns from a data set.

> head(select(pima_dataframe1,X6))
  X6
1  1
2  8
3  1
4  0
5  5
6  3

arrange()

Arrange() helps in arranging the rows in accordance with sorting order of the given column.

> head(arrange(pima_dataframe1,pima_dataframe1$X148))
 X6 X148 X72 X35 X0 X33.6 X0.627 X50 X1
1 1 0 48 20 0 24.7 0.140 22 0
2 1 0 74 20 23 27.7 0.299 21 0
3 1 0 68 35 0 32.0 0.389 22 0
4 5 0 80 32 0 41.0 0.346 37 1
5 6 0 68 41 0 39.0 0.727 41 1
6 5 44 62 0 0 25.0 0.587 36 0

We can arrange the rows in descending order also. Then the rows get sorted with respect to descending order of the mentioned column. Here it is ‘X148’ .

>View(arrange(pima_dataframe1, desc(pima_dataframe1$ X148)))

I am adding head() to give better visualisation of the data.

mutate()

Mutate() function enables us to add columns to the existing data frame. This can be extremely useful when we need to add some mathematical results like total sum of marks of students.

I added new column named ‘X123’ with values X6*2

> pima_dataframe2<-mutate(dataframe, X123= X6*2)
> head(pima_dataframe2)
  X6 X148 X72 X35  X0 X33.6 X0.627 X50 X1 X123
1  1   85  66  29   0  26.6  0.351  31  0    2
2  8  183  64   0   0  23.3  0.672  32  1   16
3  1   89  66  23  94  28.1  0.167  21  0    2
4  0  137  40  35 168  43.1  2.288  33  1    0
5  5  116  74   0   0  25.6  0.201  30  0   10
6  3   78  50  32  88  31.0  0.248  26  1    6

summarise()

summarise function enables us to get a single row summarised result about the data set. summarise() is more useful when we use group_by() command.

group_by()

> dest <- group_by(pima_dataframe1,X6)
> summarise(dest,X148=n_distinct(X148),X1=n())
# A tibble: 17 x 3
      X6  X148    X1
   <int> <int> <int>
 1     0    64   111
 2     1    76   135
 3     2    60   103
 4     3    55    75
 5     4    51    68
 6     5    46    57
 7     6    39    49
 8     7    36    45
 9     8    34    38
10     9    24    28
11    10    19    24
12    11    10    11
13    12     8     9
14    13     9    10
15    14     2     2
16    15     1     1
17    17     1     1

sample_n and sample_frac:

These are used for getting samples of data. sample_n is used for fixed number and sample_frac for fractional number.

> head(sample_n(pima_dataframe1,3))
    X6 X148 X72 X35  X0 X33.6 X0.627 X50 X1
137  0   93  60  25  92  28.7  0.532  22  0
210  2   81  60  22   0  27.7  0.290  25  0
763 10  101  76  48 180  32.9  0.171  63  0

sample_frac syntax is given below

>sample_frac(pima_datafram1,0.1)

%>% chaining operator:

This %>% operator allows us to do different operations all at once. Let’s see how we can do that.

> flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)

We have done grouping first and then select() operation on the resultant data frame obtained from the first step. On the resultant data frame, summarise() and filter operations have been done subsequently.

These are the important functionalities which ‘dplyr’ package provides.

Hope you liked this article!! Let me know what you felt in the response section!!

Day 9 : Dplyr package — Manipulate data in R like you do with sql

Happy New Year!!

Written by SaiGayatri Vadali