DPLYR AND THE PIPE!

Feb 9 · 7 min read

In the last article, we learn about some functions of dplyr. Now, we will learn more about the function of the dplyr. Further, we will also learn about the Pipe operator from the magrittr package is used.

In [1]:

`# Dplyr and pipe`

In [2]:

`data = read.csv(file.choose(),header = T)`

In [3]:

`dim(data) #dimensions of the dataset`

Out[3]:

`1. 8464042. 13`

Here we can see that the dataset has 846404 observation and 13 variables.

In [4]:

`names(data) # names of the column`

Out[4]:

`'SYMBOL''SERIES''OPEN''HIGH''LOW''CLOSE''LAST''PREVCLOSE''TOTTRDQTY''TOTTRDVAL''TIMESTAMP''TOTALTRADES''ISIN'`

Let’s us see what those variables mean. According to kaggle.com

SYMBOL: Symbol of the listed company.
SERIES: Series of equity. Values are [EQ, BE, BL, BT, GC and IL]
OPEN: The opening market price of the equity symbol on the date.
HIGH: The highest market price of the equity symbol on the date.
LOW: The lowest recorded market price of the equity symbol on the date.
CLOSE: The closing recorded price of the equity symbol on the date.
LAST: The last traded price of the equity symbol on the date.
PREVCLOSE: The previous day closing price of the equity symbol on the date.
TOTTRDQTY: Total traded quantity of the equity symbol on the date.
TOTTRDVAL: Total traded volume of the equity symbol on the date.
TIMESTAMP: Date of record.
ISIN: International Securities Identification Number.

Summarise Function

It is used in order to obtain the summary of the dataset. There is some summary function which can be used in order to obtain different results. The syntax of the summary function is:

summary( dataset, summary_function)

The summary function can be Max, Min, First, nth, n_distinct, mean, median, var, sd. We will see how to use these functions.

In [5]:

`library(dplyr)`

In [6]:

`summarise(data, average = mean(data\$OPEN))`

Out[6]:

`average561.2588`

In [7]:

`summarise(data, middle_value = median(data\$OPEN))`

Out[7]:

`middle_value139.2`

In [8]:

`summarise(data, first_val = first(data\$OPEN))`

Out[8]:

`first_val37.8`

In [9]:

`summarise(data, last_val = last(data\$OPEN))`

Out[9]:

`last_val4.3`

In [10]:

`summarise(data, variance = var(data\$OPEN))`

Out[11]:

`variance4023076`

In [11]:

`summarise(data, std_dev = sd(data\$OPEN))`

Out[11]:

`std_dev2005.761`

In [12]:

`summarise(data, n_distinct = n_distinct(data\$OPEN))`

Out[12]:

`n_distinct58241`

In [13]:

`summarise(data, nth_value = nth(data\$OPEN,5))`

Out[13]:

`nth_value546.1`

Let’s get started with the pipe operator.

The pipe operator passes the result after executing the first argument into the next argument. By using the pipe operator, we can execute complex queries easily. The pipe operator is denoted by %>%. The pipe operator helps us to combine multiple functions inside the same statement.

Pipe with the Select function.

In [14]:

`# Pipe operator`

In [15]:

`data %>% select(OPEN,HIGH, CLOSE)%>% sample_n(5) # using select function`

Out[15]:

Pipe with the Filter function.

In [16]:

`data %>% filter(OPEN >50) %>% sample_n(5) # using filter function`

Out[16]:

Pipe with the Mutate function.

In [17]:

`data %>% mutate(Final = OPEN + HIGH + LOW + CLOSE) %>% select(Final, everything()) %>% sample_n(5) # using mutate function`

Out[17]:

Pipe with the group by function.

In [18]:

`data %>% group_by(data\$SYMBOL)%>% filter(HIGH>100000 & CLOSE >100000)`

Out[18]:

Pipe with Summarise function.

Summarise_all: Summarise all variables

This function needs one argument to operate which it will apply to the columns in the dataset. We can use the predefined function such as mean or create our custom function using the funs() or tilde.

In [19]:

`# summarise all`

In [20]:

`data %>% summarise_all(mean, na.rm=TRUE)`

Out[20]:

Summarise_if: It helps to get a summary as a result based on a condition. It requires two arguments:

1. The information about the Columns which will be provided with the help of the function that returns a boolean value (TRUE or FALSE).
2. Next, you need to provide the context on how to summarise the data, which also need to be a function.

In [21]:

`data %>% summarise_if(is.numeric, mean, na.rm = T)`

Out[21]:

Summarise_at: It also requires two arguments to operate.

1. Columns to be considered.
2. information about how to summarise that data.

In [22]:

`data %>% summarise_at(vars(OPEN:TOTTRDVAL), mean ,na.rm = TRUE)`

Out[22]:

The difference between these three functions is that the first summarise_all the variables in the dataset, summarise_if only summarises the variables which pass through the function, and Summarise_at summarises only the variables mentioned in the vars.

Pipe with Join function.

Join Function

Join function is used to join the two datasets. There are diiferent kinds of joins that can be execeuted. They all have the same syntax to follow:

inner_join(Dataset1, Dataset2, by = variable(primary key))
left_join(Dataset1, Dataset2, by = variable(primary key))
right_join(Dataset1, Dataset2, by = variable(primary key))
full_join(Dataset1, Dataset2, by = variable(primary key))
semi_join(Dataset1, Dataset2, by = variable(primary key))
anti_join(Dataset1, Dataset2, by = variable(primary key))

In [23]:

`# JOINS`

In [24]:

`data_1 = data.frame(ID = c(101, 102, 103, 104, 105),                 NAME = c('TOM', 'CHRIS', 'WALTER',                          'JESSE', 'FREDDY'),                 WIFE = c('SKYLER', 'MAGGIE', 'JULIE',                          'SCOTTIE', 'MEGHAN'),                 CHILDREN =c(5,2,7,1,3),                 CAR = c(2,2,1,2,2),                 RANDOM =rnorm(5))`

In [25]:

`data_1`

Out[25]:

In [26]:

`data_2 = data.frame(ID = c(101,103, 106,107, 108),                 NAME = c('TOM', 'WALTER','BRYCE', 'HARRY','ROM'),                 WIFE = c('SKYLER', 'JULIE','RACHAEL',                          'OLIVIA',     'SOPHIA'),                 CHILDREN =c(5,3,7,5,6),                 CAR = c(2,1,1,1,1),                 RANDOM =rnorm(5))`

In [27]:

`data_2`

Out[27]:

In [28]:

`# LEFT JOIN`

In [29]:

`left_join(data_1, data_2, by = "ID")`

Out[29]:

In [30]:

`# RIGHT JOIN`

In [31]:

`right_join(data_1, data_2, by = "ID")`

Out[31]:

In [32]:

`#INNER JOIN`

In [33]:

`inner_join(data_1, data_2, by = "ID")`

Out[33]:

In [34]:

`# FULL JOIN`

In [35]:

`full_join(data_1, data_2, by = "ID")`

Out[35]:

In [36]:

`# SEMI JOIN`

In [37]:

`semi_join(data_1, data_2, by = "ID")`

Out[37]:

In [38]:

`# ANTI JOIN`

In [39]:

`anti_join(data_1, data_2, by = "ID")`

Out[39]:

UNION AND INTERSECTION

Well according to the Wikipedia, the union (denoted by ∪) of a collection of sets is the set of all elements in the collection. or in simple words, union will contain all the elements in the different datasets

The intersection of two sets A and B is the set that contains all elements of A that also belong to B (or equivalently, all elements of B that also belong to A), but no other elements.

In [40]:

`# union`

In [41]:

`union(data_1, data_2)`

Out[41]:

In [42]:

`# UNION ALL FUNCTION`

In [43]:

`union_all(data_1, data_2)`

Out[43]:

In [44]:

`# intersection`

In [45]:

`intersect(data_1, data_2)`

Out[45]:

Well, in this case, due to the difference in the value of the random variable we have no value in the intersection.

That marks the completion of this article. You can find the code here. In the next article, we will see how to use the ggpolt2 package and data cleaning in R.

Written by

Incedge

Incedge offers interactive R, Python, Machine Learning and Deep Learning blogs. Our blogs will help you get ahead in field of Data Science. Practical Exposure. Come along with us to make a world a better place to live.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just \$5/month. Upgrade