Dplyr is the new manipulator
Welcome, R enthusiasts. Today, we will learn data manipulation using R language. So, let’s get started.
According to the R documentation, Dplyr provides a flexible grammar of data manipulation. It’s the next iteration of plyr, focused on tools for working with data frames (hence the d in the name).
It has three main goals:
- Identify the most important data manipulation verbs and make them easy to use from R.
- Provide blazing fast performance for in-memory data by writing key pieces in C++ (using Rcpp)
- Use the same interface to work with data no matter where it’s stored, whether in a data frame, a data table or database.
Dplyr is also known as the Grammer of data manipulation. It is a powerful data manipulation package that is used extensively in the Data Science and statistics Community. Dplyr contains a set of verbs( or function). These verbs are used in order to transform the raw data into meaningful insights. These verbs are:
- Select
- Mutate
- Filter
- Summarise
- Join
- Group by
In this post, will learn how to use dplyr and its function and transform our data in order to gain meaningful insights.
First, we will load the dplyr package.
We will use the Google play store dataset in this tutorial. This dataset contains the details of the application of the google play store. You can download the dataset from here.
First, we will analyse the structure of the dataset.
In [1]:
# INTRODUCTION TO DPLYR
In [2]:
install.packages("dplyr")
library(dplyr)
In [3]:
# read the dataset using the read.csv function
In [4]:
data=read.csv(file.choose(), header = T)
In [5]:
# check the structure of the dataset using the str function
In [6]:
str(data)
Out[6]:
‘data.frame’: 10841 obs. of 13 variables:
$ App : Factor w/ 9660 levels "- Free Comics - Comic
$ Category : Factor w/ 34 levels "1.9","ART_AND_DESIGN",..:
$ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 .
$ Reviews : Factor w/ 6002 levels "0","1","10","100",..:
$ Size : Factor w/ 462 levels "1,000+","1.0M",..: 55 30
$ Installs : Factor w/ 22 levels "0","0+","1,000,000,000+",
$ Type : Factor w/ 4 levels "0","Free","NaN",..: 2 2 2
$ Price : Factor w/ 93 levels "$0.99","$1.00",..: 92 92
$ Content.Rating: Factor w/ 7 levels "","Adults only 18+",..: 3
$ Genres : Factor w/ 120 levels "Action","Action;Action &
$ Last.Updated : Factor w/ 1378 levels "1.0.19","April 1, 2016".
$ Current.Ver : Factor w/ 2834 levels "","0.0.0.2","0.0.1",..:
$ Android.Ver : Factor w/ 35 levels "","1.0 and up",..: 17 17
It has 13 columns(or variables) and 10841 rows (or observation).
This dataset is based on the applications in the Google Play Store. Each column in the dataset represents the significant attribute of the application. The column names are explained below:
App: Name of the application.
Category: Category in which it is defined ( Arts, Business, Books, Games, Medical, Social etc).
Rating: Rating received by the application (1 to 5).
Reviews: customer reviews of the application.
Size: Total size of the application.
Installs: Total number of installation of the application on devices.
Type: Paid or free app.
Price: Price in $.
Content: Rating App target audience (teenagers, all etc).
Genres: It is the category of the application.
Last: Updated App was last updated on “date”.
Current.Ver: The current version of the app available for download.
Android.Ver: Android version for which the app is defined.
Using the head function we can have a glimpse of the dataset.
In [7]:
head(data) # top 5 observation of the dataset
Out[7]:
The Select() Function.
The select verb is used to select the data which matches the condition from the original dataset.
The syntax of select() function: select(dataset , Condition).
In [8]:
# select the App column from the dataset
In [9]:
app_name = select(data,App)
In [10]:
head(app_name)
Out[10]:
In [11]:
# check the dimension of the output using the dimension function
In [12]:
dim(app_name)
Out[12]:
1. 10481
2. 1
In [13]:
# Display the data by removing the columns from rating to size
In [14]:
select(data,-(Rating:Size))
Out[14]:
There are some built-in functions which helps us to select the data in the select function.
Function Description
starts_with() : Starts with a prefix.
ends_with() : Ends with a prefix.
contains() : Contains a literal string.
matches() : Matches a regular expression.
num_range() : Numerical range like x01, x02, x03.
everything() : All the variables.
Select the variable in which the variable name (column name starts with “A”).
In [16]:
Name_Start_A = select(data, starts_with("A"))
Name_Start_A
Out[16]:
Select the data in which the variable name (column name ends with “s”)
In [18]:
Name_End_s = select(data, ends_with("s"))
head(Name_End_s)
Out[18]:
Select the variable which contains the name “app”.
In [19]:
# select the variable where the column name that contains with "app"
In [20]:
Name_App = select(data, contains("app"))
Name_App
Select the data in which the name matches “Rating”.
In [21]:
# select the data where the column name that matches with "rating"
In [22]:
Name_App = select(data, matches("rating")) # column name that matches the keyword rating
Name_App
Out[22]:
Select the data range from the 5 to 10.
In [23]:
# select the data where the column range from column 5 to column 10
In [24]:
Name_range = select(data, 5:10)
Name_range
Out[24]:
Select the data where the rating comes first then the rest of the data.
In [25]:
# Display the data where the first column is RATING and then the rest of the columns.
In [26]:
Name_everything = select(data, Rating, everything() ) # column name that contains the keyword app
Name_everything
Out[26]:
Some other functions regarding the selection.
- Distinct
The distinct function is used to remove the duplicate from the dataset.
In [27]:
# select the distinct data from the dataset
In [28]:
x_distinct = distinct(data)
x_distinct
Out[28]:
In [29]:
dim(x_distinct)
Out[29]:
1. 10358
2. 13
The output shows 10358 observation and 13 variables. It has been reduced from 10841 observation. There are many observation that contains the duplicates.
The .keep_all function in the distinct argument is used to retain all other variables in the output data.frame.
In [30]:
# distinct with .keep_all argument
In [31]:
x2 = distinct(data, Index, .keep_all= TRUE)
x2
Out[31]:
2. Sample_frac
Sample_frac(data, x%)
It returns the x% of the data.
In [32]:
#sample_frac
In [33]:
sample_frac(data,0.8)
Out[33]:
Filter Function
The filter is used to filter out the dataset which matches the condition
We can filter rows and retain only those values in which rating is equal to 4.0.
In [33]:
# filter function
In [34]:
rating_4 = filter(data, Rating == 4)
rating_4
Out[34]:
In [35]:
dim(rating_4)
Out [35]:
1. 568
2. 13
Or we can filter out data which contain only the category as the game.
In [36]:
Category_game = filter(data, Category == "GAME")
Category_game
In [37]:
dim(Category_game)
Out [37]:
1. 1144
2. 13
We can also use multiple conditions in a single statement.
In [38]:
and_filter = filter(data, Category == "GAME" & Rating == 4)
and_filter
Out[38]:
In [39]:
dim(and_filter)
Out [39]:
1. 55
2. 13
In [40]:
or_filter = filter(data, Category == "GAME" | Rating == 4)
or_filter
In [41]:
dim(or_filter)
Out [41]:
1. 1657
2. 13
The %in% operator
It can be used to select specific items in a column.
In [42]:
In_function = filter(data, Category %in% c("BEAUTY", "BUSINESS"))
In_function
Pattern Matching Using The filter function
GREPL function is used for searching the keyword in the dataset for pattern matching.
In [43]:
Grepl_funs = filter(data, grepl("one", App))
Grepl_funs
Out[43]:
In [44]:
dim(Grepl_funs)
Out [44]:
1. 163
2. 13
Sort Data
In order to sort the data, we use the arrange() function. The default condition is to sort in ascending order. We can specify to sort in descending order. In this example, we are sorting data by multiple variables. It’s syntax is:
arrange(dataset, variable name)
In [45]:
# sort the data
In [46]:
#Ascending order sorting
In [47]:
arrange(data, Rating, App) # here we sort using multiple column names
Out[47]:
Suppose you need to sort by descending order.
In [48]:
arrange(data,desc(Rating))
Mutate() Function :
It is used in order to create new columns or variables with custom values. Its syntax is:
mutate(dataset , conditions)
New rating system
We will create a new rating system where the rating is given between 0 to 0.5.
For this, we will use the mutate function.
In [49]:
# Mutate function
In [50]:
New_data = mutate(data, Rating_new = Rating/10)
# display the new rating and then the rest of the table
select(New_data, Rating_new, everything())
Out [50]:
group_by() function :
It is used to group the dataset by the column. Its syntax is:
group_by(data, variables)
In [51]:
# Group by function
In [52]:
#group by price
In [53]:
group_Price = group_by(data, Price = 0)
group_Price
Out[53]:
In [54]:
#group by category
In [55]:
group_design = group_by(data, Category = "ART_AND_DESIGN")
group_design
Out [55]:
Thank you for sticking till the end. This is the end of this article but not about the dplyr. In the Dplyr 2.0, you will learn about the Pipe operator and summarise and join function. Stay tuned to Incedge.com.
Source code can be found on Github. I am happy to hear any questions or feedback.
Hope you like this article!! Don’t forget to like this article and share with others.
Thank You