Welcome, R enthusiasts. Today, we will learn data manipulation using R language. So, let’s get started.
According to the R documentation, Dplyr provides a flexible grammar of data manipulation. It’s the next iteration of plyr, focused on tools for working with data frames (hence the d in the name).
It has three main goals:
- Identify the most important data manipulation verbs and make them easy to use from R.
- Provide blazing fast performance for in-memory data by writing key pieces in C++ (using Rcpp)
- Use the same interface to work with data no matter where it’s stored, whether in a data frame, a data table or database.
Dplyr is also known as the Grammer of data manipulation. It is a powerful data manipulation package that is used extensively in the Data Science and statistics Community. Dplyr contains a set of verbs( or function). These verbs are used in order to transform the raw data into meaningful insights. These verbs are:
- Group by
In this post, will learn how to use dplyr and its function and transform our data in order to gain meaningful insights.
First, we will load the dplyr package.
We will use the Google play store dataset in this tutorial. This dataset contains the details of the application of the google play store. You can download the dataset from here.
First, we will analyse the structure of the dataset.
# INTRODUCTION TO DPLYR
# read the dataset using the read.csv function
data=read.csv(file.choose(), header = T)
# check the structure of the dataset using the str function
‘data.frame’: 10841 obs. of 13 variables:
$ App : Factor w/ 9660 levels "- Free Comics - Comic
$ Category : Factor w/ 34 levels "1.9","ART_AND_DESIGN",..:
$ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 .
$ Reviews : Factor w/ 6002 levels "0","1","10","100",..:
$ Size : Factor w/ 462 levels "1,000+","1.0M",..: 55 30
$ Installs : Factor w/ 22 levels "0","0+","1,000,000,000+",
$ Type : Factor w/ 4 levels "0","Free","NaN",..: 2 2 2
$ Price : Factor w/ 93 levels "$0.99","$1.00",..: 92 92
$ Content.Rating: Factor w/ 7 levels "","Adults only 18+",..: 3
$ Genres : Factor w/ 120 levels "Action","Action;Action &
$ Last.Updated : Factor w/ 1378 levels "1.0.19","April 1, 2016".
$ Current.Ver : Factor w/ 2834 levels "","0.0.0.2","0.0.1",..:
$ Android.Ver : Factor w/ 35 levels "","1.0 and up",..: 17 17
It has 13 columns(or variables) and 10841 rows (or observation).
This dataset is based on the applications in the Google Play Store. Each column in the dataset represents the significant attribute of the application. The column names are explained below:
App: Name of the application.
Category: Category in which it is defined ( Arts, Business, Books, Games, Medical, Social etc).
Rating: Rating received by the application (1 to 5).
Reviews: customer reviews of the application.
Size: Total size of the application.
Installs: Total number of installation of the application on devices.
Type: Paid or free app.
Price: Price in $.
Content: Rating App target audience (teenagers, all etc).
Genres: It is the category of the application.
Last: Updated App was last updated on “date”.
Current.Ver: The current version of the app available for download.
Android.Ver: Android version for which the app is defined.
Using the head function we can have a glimpse of the dataset.
head(data) # top 5 observation of the dataset
The Select() Function.
The select verb is used to select the data which matches the condition from the original dataset.
The syntax of select() function: select(dataset , Condition).
# select the App column from the dataset
app_name = select(data,App)
# check the dimension of the output using the dimension function
# Display the data by removing the columns from rating to size
There are some built-in functions which helps us to select the data in the select function.
starts_with() : Starts with a prefix.
ends_with() : Ends with a prefix.
contains() : Contains a literal string.
matches() : Matches a regular expression.
num_range() : Numerical range like x01, x02, x03.
everything() : All the variables.
Select the variable in which the variable name (column name starts with “A”).
Name_Start_A = select(data, starts_with("A"))
Select the data in which the variable name (column name ends with “s”)
Name_End_s = select(data, ends_with("s"))
Select the variable which contains the name “app”.
# select the variable where the column name that contains with "app"
Name_App = select(data, contains("app"))
Select the data in which the name matches “Rating”.
# select the data where the column name that matches with "rating"
Name_App = select(data, matches("rating")) # column name that matches the keyword rating
Select the data range from the 5 to 10.
# select the data where the column range from column 5 to column 10
Name_range = select(data, 5:10)
Select the data where the rating comes first then the rest of the data.
# Display the data where the first column is RATING and then the rest of the columns.
Name_everything = select(data, Rating, everything() ) # column name that contains the keyword app
Some other functions regarding the selection.
The distinct function is used to remove the duplicate from the dataset.
# select the distinct data from the dataset
x_distinct = distinct(data)
The output shows 10358 observation and 13 variables. It has been reduced from 10841 observation. There are many observation that contains the duplicates.
The .keep_all function in the distinct argument is used to retain all other variables in the output data.frame.
# distinct with .keep_all argument
x2 = distinct(data, Index, .keep_all= TRUE)
It returns the x% of the data.
The filter is used to filter out the dataset which matches the condition
We can filter rows and retain only those values in which rating is equal to 4.0.
# filter function
rating_4 = filter(data, Rating == 4)
Or we can filter out data which contain only the category as the game.
Category_game = filter(data, Category == "GAME")
We can also use multiple conditions in a single statement.
and_filter = filter(data, Category == "GAME" & Rating == 4)
or_filter = filter(data, Category == "GAME" | Rating == 4)
The %in% operator
It can be used to select specific items in a column.
In_function = filter(data, Category %in% c("BEAUTY", "BUSINESS"))
Pattern Matching Using The filter function
GREPL function is used for searching the keyword in the dataset for pattern matching.
Grepl_funs = filter(data, grepl("one", App))
In order to sort the data, we use the arrange() function. The default condition is to sort in ascending order. We can specify to sort in descending order. In this example, we are sorting data by multiple variables. It’s syntax is:
arrange(dataset, variable name)
# sort the data
#Ascending order sorting
arrange(data, Rating, App) # here we sort using multiple column names
Suppose you need to sort by descending order.
Mutate() Function :
It is used in order to create new columns or variables with custom values. Its syntax is:
mutate(dataset , conditions)
New rating system
We will create a new rating system where the rating is given between 0 to 0.5.
For this, we will use the mutate function.
# Mutate function
New_data = mutate(data, Rating_new = Rating/10)
# display the new rating and then the rest of the table
select(New_data, Rating_new, everything())
group_by() function :
It is used to group the dataset by the column. Its syntax is:
# Group by function
#group by price
group_Price = group_by(data, Price = 0)
#group by category
group_design = group_by(data, Category = "ART_AND_DESIGN")
Thank you for sticking till the end. This is the end of this article but not about the dplyr. In the Dplyr 2.0, you will learn about the Pipe operator and summarise and join function. Stay tuned to Incedge.com.
Source code can be found on Github. I am happy to hear any questions or feedback.
Hope you like this article!! Don’t forget to like this article and share with others.