DPLYR AND THE PIPE!

Shivam Verma
Incedge & Co.
Published in
7 min readFeb 9, 2019

Welcome R enthusiast, this article is the follow up to the previous article Introduction to DPLYR.

In this article, we will learn more about the dplyr and the pipe operator.

In the last article, we learn about some functions of dplyr. Now, we will learn more about the function of the dplyr. Further, we will also learn about the Pipe operator from the magrittr package is used.

In this article, we will use the National Stock Exchange of India Dataset, which you can download from kaggle.

In [1]:

# Dplyr and pipe

In [2]:

data = read.csv(file.choose(),header = T)

In [3]:

dim(data) #dimensions of the dataset

Out[3]:

1. 846404
2. 13

Here we can see that the dataset has 846404 observation and 13 variables.

In [4]:

names(data) # names of the column

Out[4]:

'SYMBOL'
'SERIES'
'OPEN'
'HIGH'
'LOW'
'CLOSE'
'LAST'
'PREVCLOSE'
'TOTTRDQTY'
'TOTTRDVAL'
'TIMESTAMP'
'TOTALTRADES'
'ISIN'

Let’s us see what those variables mean. According to kaggle.com

SYMBOL: Symbol of the listed company.
SERIES: Series of equity. Values are [EQ, BE, BL, BT, GC and IL]
OPEN: The opening market price of the equity symbol on the date.
HIGH: The highest market price of the equity symbol on the date.
LOW: The lowest recorded market price of the equity symbol on the date.
CLOSE: The closing recorded price of the equity symbol on the date.
LAST: The last traded price of the equity symbol on the date.
PREVCLOSE: The previous day closing price of the equity symbol on the date.
TOTTRDQTY: Total traded quantity of the equity symbol on the date.
TOTTRDVAL: Total traded volume of the equity symbol on the date.
TIMESTAMP: Date of record.
TOTALTRADES: Total trades executed on the day.
ISIN: International Securities Identification Number.

Summarise Function

It is used in order to obtain the summary of the dataset. There is some summary function which can be used in order to obtain different results. The syntax of the summary function is:

summary( dataset, summary_function)

The summary function can be Max, Min, First, nth, n_distinct, mean, median, var, sd. We will see how to use these functions.

In [5]:

library(dplyr)

In [6]:

summarise(data, average = mean(data$OPEN))

Out[6]:

average
561.2588

In [7]:

summarise(data, middle_value = median(data$OPEN))

Out[7]:

middle_value
139.2

In [8]:

summarise(data, first_val = first(data$OPEN))

Out[8]:

first_val
37.8

In [9]:

summarise(data, last_val = last(data$OPEN))

Out[9]:

last_val
4.3

In [10]:

summarise(data, variance = var(data$OPEN))

Out[11]:

variance
4023076

In [11]:

summarise(data, std_dev = sd(data$OPEN))

Out[11]:

std_dev
2005.761

In [12]:

summarise(data, n_distinct = n_distinct(data$OPEN))

Out[12]:

n_distinct
58241

In [13]:

summarise(data, nth_value = nth(data$OPEN,5))

Out[13]:

nth_value
546.1

Let’s get started with the pipe operator.

The pipe operator passes the result after executing the first argument into the next argument. By using the pipe operator, we can execute complex queries easily. The pipe operator is denoted by %>%. The pipe operator helps us to combine multiple functions inside the same statement.

Pipe with the Select function.

In [14]:

# Pipe operator

In [15]:

data %>% select(OPEN,HIGH, CLOSE)%>% sample_n(5) # using select function

Out[15]:

Pipe with the Filter function.

In [16]:

data %>% filter(OPEN >50) %>% sample_n(5) # using filter function

Out[16]:

Pipe with the Mutate function.

In [17]:

data %>% mutate(Final = OPEN + HIGH + LOW + CLOSE) %>% select(Final, everything()) %>% sample_n(5) # using mutate function

Out[17]:

Pipe with the group by function.

In [18]:

data %>% group_by(data$SYMBOL)%>% filter(HIGH>100000 & CLOSE >100000)

Out[18]:

Pipe with Summarise function.

Summarise_all: Summarise all variables

This function needs one argument to operate which it will apply to the columns in the dataset. We can use the predefined function such as mean or create our custom function using the funs() or tilde.

In [19]:

# summarise all

In [20]:

data %>% summarise_all(mean, na.rm=TRUE)

Out[20]:

Summarise_if: It helps to get a summary as a result based on a condition. It requires two arguments:

  1. The information about the Columns which will be provided with the help of the function that returns a boolean value (TRUE or FALSE).
  2. Next, you need to provide the context on how to summarise the data, which also need to be a function.

In [21]:

data %>% summarise_if(is.numeric, mean, na.rm = T)

Out[21]:

Summarise_at: It also requires two arguments to operate.

  1. Columns to be considered.
  2. information about how to summarise that data.

In [22]:

data %>% summarise_at(vars(OPEN:TOTTRDVAL), mean ,na.rm = TRUE)

Out[22]:

The difference between these three functions is that the first summarise_all the variables in the dataset, summarise_if only summarises the variables which pass through the function, and Summarise_at summarises only the variables mentioned in the vars.

Pipe with Join function.

Join Function

Join function is used to join the two datasets. There are diiferent kinds of joins that can be execeuted. They all have the same syntax to follow:

inner_join(Dataset1, Dataset2, by = variable(primary key))
left_join(Dataset1, Dataset2, by = variable(primary key))
right_join(Dataset1, Dataset2, by = variable(primary key))
full_join(Dataset1, Dataset2, by = variable(primary key))
semi_join(Dataset1, Dataset2, by = variable(primary key))
anti_join(Dataset1, Dataset2, by = variable(primary key))

In [23]:

# JOINS

In [24]:

data_1 = data.frame(ID = c(101, 102, 103, 104, 105),
NAME = c('TOM', 'CHRIS', 'WALTER',
'JESSE', 'FREDDY'),
WIFE = c('SKYLER', 'MAGGIE', 'JULIE',
'SCOTTIE', 'MEGHAN'),
CHILDREN =c(5,2,7,1,3),
CAR = c(2,2,1,2,2),
RANDOM =rnorm(5))

In [25]:

data_1

Out[25]:

In [26]:

data_2 = data.frame(ID = c(101,103, 106,107, 108),
NAME = c('TOM', 'WALTER','BRYCE', 'HARRY','ROM'),
WIFE = c('SKYLER', 'JULIE','RACHAEL',
'OLIVIA', 'SOPHIA'),
CHILDREN =c(5,3,7,5,6),
CAR = c(2,1,1,1,1),
RANDOM =rnorm(5))

In [27]:

data_2

Out[27]:

In [28]:

# LEFT JOIN

In [29]:

left_join(data_1, data_2, by = "ID")

Out[29]:

In [30]:

# RIGHT JOIN

In [31]:

right_join(data_1, data_2, by = "ID")

Out[31]:

In [32]:

#INNER JOIN

In [33]:

inner_join(data_1, data_2, by = "ID")

Out[33]:

In [34]:

# FULL JOIN

In [35]:

full_join(data_1, data_2, by = "ID")

Out[35]:

In [36]:

# SEMI JOIN

In [37]:

semi_join(data_1, data_2, by = "ID")

Out[37]:

In [38]:

# ANTI JOIN

In [39]:

anti_join(data_1, data_2, by = "ID")

Out[39]:

UNION AND INTERSECTION

Well according to the Wikipedia, the union (denoted by ∪) of a collection of sets is the set of all elements in the collection. or in simple words, union will contain all the elements in the different datasets

The intersection of two sets A and B is the set that contains all elements of A that also belong to B (or equivalently, all elements of B that also belong to A), but no other elements.

In [40]:

# union

In [41]:

union(data_1, data_2)

Out[41]:

In [42]:

# UNION ALL FUNCTION

In [43]:

union_all(data_1, data_2)

Out[43]:

In [44]:

# intersection

In [45]:

intersect(data_1, data_2)

Out[45]:

Well, in this case, due to the difference in the value of the random variable we have no value in the intersection.

That marks the completion of this article. You can find the code here. In the next article, we will see how to use the ggpolt2 package and data cleaning in R.

Subscribe to our website for more interesting articles.

--

--