DPLYR AND THE PIPE!
by — Shivam Verma (our Incedge & Company technical head)
Welcome R enthusiast, this article is the follow up to the previous article Introduction to DPLYR.
In this article, we will learn more about the dplyr and the pipe operator.
In the last article, we learn about some functions of dplyr. Now, we will learn more about the function of the dplyr. Further, we will also learn about the Pipe operator from the magrittr package is used.
In this article, we will use the National Stock Exchange of India Dataset, which you can download from kaggle.
# Dplyr and pipe
data = read.csv(file.choose(),header = T)
dim(data) #dimensions of the dataset
Here we can see that the dataset has 846404 observation and 13 variables.
names(data) # names of the column
Let’s us see what those variables mean. According to kaggle.com
SYMBOL: Symbol of the listed company.
SERIES: Series of equity. Values are [EQ, BE, BL, BT, GC and IL]
OPEN: The opening market price of the equity symbol on the date.
HIGH: The highest market price of the equity symbol on the date.
LOW: The lowest recorded market price of the equity symbol on the date.
CLOSE: The closing recorded price of the equity symbol on the date.
LAST: The last traded price of the equity symbol on the date.
PREVCLOSE: The previous day closing price of the equity symbol on the date.
TOTTRDQTY: Total traded quantity of the equity symbol on the date.
TOTTRDVAL: Total traded volume of the equity symbol on the date.
TIMESTAMP: Date of record.
TOTALTRADES: Total trades executed on the day.
ISIN: International Securities Identification Number.
It is used in order to obtain the summary of the dataset. There is some summary function which can be used in order to obtain different results. The syntax of the summary function is:
summary( dataset, summary_function)
The summary function can be Max, Min, First, nth, n_distinct, mean, median, var, sd. We will see how to use these functions.
summarise(data, average = mean(data$OPEN))
summarise(data, middle_value = median(data$OPEN))
summarise(data, first_val = first(data$OPEN))
summarise(data, last_val = last(data$OPEN))
summarise(data, variance = var(data$OPEN))
summarise(data, std_dev = sd(data$OPEN))
summarise(data, n_distinct = n_distinct(data$OPEN))
summarise(data, nth_value = nth(data$OPEN,5))
Let’s get started with the pipe operator.
The pipe operator passes the result after executing the first argument into the next argument. By using the pipe operator, we can execute complex queries easily. The pipe operator is denoted by %>%. The pipe operator helps us to combine multiple functions inside the same statement.
Pipe with the Select function.
# Pipe operator
data %>% select(OPEN,HIGH, CLOSE)%>% sample_n(5) # using select function
Pipe with the Filter function.
data %>% filter(OPEN >50) %>% sample_n(5) # using filter function
Pipe with the Mutate function.
data %>% mutate(Final = OPEN + HIGH + LOW + CLOSE) %>% select(Final, everything()) %>% sample_n(5) # using mutate function
Pipe with the group by function.
data %>% group_by(data$SYMBOL)%>% filter(HIGH>100000 & CLOSE >100000)
Pipe with Summarise function.
Summarise_all: Summarise all variables
This function needs one argument to operate which it will apply to the columns in the dataset. We can use the predefined function such as mean or create our custom function using the funs() or tilde.
# summarise all
data %>% summarise_all(mean, na.rm=TRUE)
Summarise_if: It helps to get a summary as a result based on a condition. It requires two arguments:
- The information about the Columns which will be provided with the help of the function that returns a boolean value (TRUE or FALSE).
- Next, you need to provide the context on how to summarise the data, which also need to be a function.
data %>% summarise_if(is.numeric, mean, na.rm = T)
Summarise_at: It also requires two arguments to operate.
- Columns to be considered.
- information about how to summarise that data.
data %>% summarise_at(vars(OPEN:TOTTRDVAL), mean ,na.rm = TRUE)
The difference between these three functions is that the first summarise_all the variables in the dataset, summarise_if only summarises the variables which pass through the function, and Summarise_at summarises only the variables mentioned in the vars.
Pipe with Join function.
Join function is used to join the two datasets. There are diiferent kinds of joins that can be execeuted. They all have the same syntax to follow:
inner_join(Dataset1, Dataset2, by = variable(primary key))
left_join(Dataset1, Dataset2, by = variable(primary key))
right_join(Dataset1, Dataset2, by = variable(primary key))
full_join(Dataset1, Dataset2, by = variable(primary key))
semi_join(Dataset1, Dataset2, by = variable(primary key))
anti_join(Dataset1, Dataset2, by = variable(primary key))
data_1 = data.frame(ID = c(101, 102, 103, 104, 105),
NAME = c('TOM', 'CHRIS', 'WALTER',
WIFE = c('SKYLER', 'MAGGIE', 'JULIE',
CAR = c(2,2,1,2,2),
data_2 = data.frame(ID = c(101,103, 106,107, 108),
NAME = c('TOM', 'WALTER','BRYCE', 'HARRY','ROM'),
WIFE = c('SKYLER', 'JULIE','RACHAEL',
CAR = c(2,1,1,1,1),
# LEFT JOIN
left_join(data_1, data_2, by = "ID")
# RIGHT JOIN
right_join(data_1, data_2, by = "ID")
inner_join(data_1, data_2, by = "ID")
# FULL JOIN
full_join(data_1, data_2, by = "ID")
# SEMI JOIN
semi_join(data_1, data_2, by = "ID")
# ANTI JOIN
anti_join(data_1, data_2, by = "ID")
UNION AND INTERSECTION
Well according to the Wikipedia, the union (denoted by ∪) of a collection of sets is the set of all elements in the collection. or in simple words, union will contain all the elements in the different datasets
The intersection of two sets A and B is the set that contains all elements of A that also belong to B (or equivalently, all elements of B that also belong to A), but no other elements.
# UNION ALL FUNCTION
Well, in this case, due to the difference in the value of the random variable we have no value in the intersection.
That marks the completion of this article. You can find the code here. In the next article, we will see how to use the ggpolt2 package and data cleaning in R.