Dplyr is the new manipulator

Shivam Verma
Incedge & Co.
Published in
8 min readJan 30, 2019

Welcome, R enthusiasts. Today, we will learn data manipulation using R language. So, let’s get started.

According to the R documentation, Dplyr provides a flexible grammar of data manipulation. It’s the next iteration of plyr, focused on tools for working with data frames (hence the d in the name).

It has three main goals:

  • Identify the most important data manipulation verbs and make them easy to use from R.
  • Provide blazing fast performance for in-memory data by writing key pieces in C++ (using Rcpp)
  • Use the same interface to work with data no matter where it’s stored, whether in a data frame, a data table or database.

Dplyr is also known as the Grammer of data manipulation. It is a powerful data manipulation package that is used extensively in the Data Science and statistics Community. Dplyr contains a set of verbs( or function). These verbs are used in order to transform the raw data into meaningful insights. These verbs are:

  1. Select
  2. Mutate
  3. Filter
  4. Summarise
  5. Join
  6. Group by

In this post, will learn how to use dplyr and its function and transform our data in order to gain meaningful insights.

First, we will load the dplyr package.

We will use the Google play store dataset in this tutorial. This dataset contains the details of the application of the google play store. You can download the dataset from here.

First, we will analyse the structure of the dataset.

In [1]:

# INTRODUCTION TO DPLYR

In [2]:

install.packages("dplyr")
library(dplyr)

In [3]:

# read the dataset using the read.csv function

In [4]:

data=read.csv(file.choose(), header = T)

In [5]:

# check the structure of the dataset using the str function

In [6]:

str(data)

Out[6]:

‘data.frame’: 10841 obs. of 13 variables:

 $ App           : Factor w/ 9660 levels "- Free Comics - Comic 
$ Category : Factor w/ 34 levels "1.9","ART_AND_DESIGN",..:
$ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 .
$ Reviews : Factor w/ 6002 levels "0","1","10","100",..:
$ Size : Factor w/ 462 levels "1,000+","1.0M",..: 55 30
$ Installs : Factor w/ 22 levels "0","0+","1,000,000,000+",
$ Type : Factor w/ 4 levels "0","Free","NaN",..: 2 2 2
$ Price : Factor w/ 93 levels "$0.99","$1.00",..: 92 92
$ Content.Rating: Factor w/ 7 levels "","Adults only 18+",..: 3
$ Genres : Factor w/ 120 levels "Action","Action;Action &
$ Last.Updated : Factor w/ 1378 levels "1.0.19","April 1, 2016".
$ Current.Ver : Factor w/ 2834 levels "","0.0.0.2","0.0.1",..:
$ Android.Ver : Factor w/ 35 levels "","1.0 and up",..: 17 17

It has 13 columns(or variables) and 10841 rows (or observation).

This dataset is based on the applications in the Google Play Store. Each column in the dataset represents the significant attribute of the application. The column names are explained below:

App: Name of the application.

Category: Category in which it is defined ( Arts, Business, Books, Games, Medical, Social etc).

Rating: Rating received by the application (1 to 5).

Reviews: customer reviews of the application.

Size: Total size of the application.

Installs: Total number of installation of the application on devices.

Type: Paid or free app.

Price: Price in $.

Content: Rating App target audience (teenagers, all etc).

Genres: It is the category of the application.

Last: Updated App was last updated on “date”.

Current.Ver: The current version of the app available for download.

Android.Ver: Android version for which the app is defined.

Using the head function we can have a glimpse of the dataset.

In [7]:

head(data) # top 5 observation of the dataset

Out[7]:

The Select() Function.

The select verb is used to select the data which matches the condition from the original dataset.

The syntax of select() function: select(dataset , Condition).

In [8]:

# select the App column from the dataset

In [9]:

app_name = select(data,App)

In [10]:

head(app_name)

Out[10]:

In [11]:

# check the dimension of the output using the dimension function

In [12]:

dim(app_name)

Out[12]:

1. 10481
2. 1

In [13]:

# Display the data by removing the columns from rating to size

In [14]:

select(data,-(Rating:Size))

Out[14]:

You can view the full result here.

There are some built-in functions which helps us to select the data in the select function.

Function Description

starts_with() : Starts with a prefix.

ends_with() : Ends with a prefix.

contains() : Contains a literal string.

matches() : Matches a regular expression.

num_range() : Numerical range like x01, x02, x03.

everything() : All the variables.

Select the variable in which the variable name (column name starts with “A”).

In [16]:

Name_Start_A = select(data, starts_with("A")) 
Name_Start_A

Out[16]:

You can view the full results here.

Select the data in which the variable name (column name ends with “s”)

In [18]:

Name_End_s = select(data, ends_with("s")) 
head(Name_End_s)

Out[18]:

You can view the full results here.

Select the variable which contains the name “app”.

In [19]:

# select the variable where the column name that contains with "app"

In [20]:

Name_App = select(data, contains("app")) 
Name_App
You can view the full results here.

Select the data in which the name matches “Rating”.

In [21]:

# select the data where the column name that matches with "rating"

In [22]:

Name_App = select(data, matches("rating")) # column name that matches the keyword rating
Name_App

Out[22]:

Select the data range from the 5 to 10.

In [23]:

# select the data where the column range from column 5 to column 10

In [24]:

Name_range = select(data, 5:10)
Name_range

Out[24]:

Select the data where the rating comes first then the rest of the data.

In [25]:

# Display the data where the first column is RATING and then the rest of the columns.

In [26]:

Name_everything = select(data, Rating, everything() ) # column name that contains the keyword app
Name_everything

Out[26]:

Some other functions regarding the selection.

  1. Distinct

The distinct function is used to remove the duplicate from the dataset.

In [27]:

# select the distinct data from the dataset

In [28]:

x_distinct = distinct(data)
x_distinct

Out[28]:

In [29]:

dim(x_distinct)

Out[29]:

1. 10358
2. 13

The output shows 10358 observation and 13 variables. It has been reduced from 10841 observation. There are many observation that contains the duplicates.

The .keep_all function in the distinct argument is used to retain all other variables in the output data.frame.

In [30]:

# distinct with .keep_all argument

In [31]:

x2 = distinct(data, Index, .keep_all= TRUE)
x2

Out[31]:

2. Sample_frac

Sample_frac(data, x%)

It returns the x% of the data.

In [32]:

#sample_frac

In [33]:

sample_frac(data,0.8)

Out[33]:

Filter Function

The filter is used to filter out the dataset which matches the condition

We can filter rows and retain only those values in which rating is equal to 4.0.

In [33]:

# filter function

In [34]:

rating_4 = filter(data, Rating == 4)
rating_4

Out[34]:

In [35]:

dim(rating_4)

Out [35]:

1. 568
2. 13

Or we can filter out data which contain only the category as the game.

In [36]:

Category_game = filter(data, Category == "GAME")
Category_game

In [37]:

dim(Category_game)

Out [37]:

1. 1144
2. 13

We can also use multiple conditions in a single statement.

In [38]:

and_filter = filter(data, Category == "GAME" & Rating == 4)
and_filter

Out[38]:

In [39]:

dim(and_filter)

Out [39]:

1. 55
2. 13

In [40]:

or_filter = filter(data, Category == "GAME" | Rating == 4)
or_filter

In [41]:

dim(or_filter)

Out [41]:

1. 1657
2. 13

The %in% operator

It can be used to select specific items in a column.

In [42]:

In_function = filter(data, Category %in% c("BEAUTY", "BUSINESS"))
In_function

Pattern Matching Using The filter function

GREPL function is used for searching the keyword in the dataset for pattern matching.

In [43]:

Grepl_funs = filter(data, grepl("one", App))
Grepl_funs

Out[43]:

In [44]:

dim(Grepl_funs)

Out [44]:

1. 163
2. 13

Sort Data

In order to sort the data, we use the arrange() function. The default condition is to sort in ascending order. We can specify to sort in descending order. In this example, we are sorting data by multiple variables. It’s syntax is:

arrange(dataset, variable name)

In [45]:

# sort the data

In [46]:

#Ascending order sorting

In [47]:

arrange(data, Rating, App) # here we sort using multiple column names

Out[47]:

Suppose you need to sort by descending order.

In [48]:

arrange(data,desc(Rating))

Mutate() Function :

It is used in order to create new columns or variables with custom values. Its syntax is:

mutate(dataset , conditions)

New rating system

We will create a new rating system where the rating is given between 0 to 0.5.

For this, we will use the mutate function.

In [49]:

# Mutate function

In [50]:

New_data = mutate(data, Rating_new = Rating/10)
# display the new rating and then the rest of the table
select(New_data, Rating_new, everything())

Out [50]:

group_by() function :

It is used to group the dataset by the column. Its syntax is:

group_by(data, variables)

In [51]:

# Group by function

In [52]:

#group by price

In [53]:

group_Price = group_by(data, Price = 0) 
group_Price

Out[53]:

In [54]:

#group by category

In [55]:

group_design = group_by(data, Category = "ART_AND_DESIGN")
group_design

Out [55]:

Thank you for sticking till the end. This is the end of this article but not about the dplyr. In the Dplyr 2.0, you will learn about the Pipe operator and summarise and join function. Stay tuned to Incedge.com.

Source code can be found on Github. I am happy to hear any questions or feedback.

Hope you like this article!! Don’t forget to like this article and share with others.

Thank You

--

--