Intro to data wrangling and scraping using R

Mochamad Kautzar Ichramsyah · Published in CodeX · 6 min read · Feb 20, 2024

Background

In 2020, I was contacted by my former lecturer when I studied Statistics at Gadjah Mada University, Indonesia. He invited me to participate as an instructor in a workshop at The Tenth International Conference and Workshop on High-Dimensional Data Analysis (ICW-HDDA-X) 2020. I was surprised at that time because we had not communicated for eight years. However, at the same time, I also felt flattered because he still remembered me and felt that I was capable and worthy of being an instructor at the workshop. The material that I decided to convey at that time was as stated in the title of this post. Without further ado, let’s go!

Photo by Choong Deng Xiang on Unsplash

What is data wrangling?

Data wrangling is everything you need to do to prepare data before analyzing it. Most of the time, we have to do these things:

  1. Spot variables and observations
  2. Derive new variables and observations
  3. Reshape the data into the best format
  4. Join multiple datasets
  5. Summarize data group-wise

An article title published by The New York Times

Quoting from an article published by The New York Times: “Data scientists, according to interviews and expert estimates, spend from 50% to 80% of their time in this matter, before it can be explored.” It is therefore worth making the data-wrangling process efficient and effective, so that more of our time goes into the analysis itself.

Data wrangling using R

For this material, I use the R programming language with the RStudio IDE. You can download R here and RStudio there. The packages that I used for data wrangling at that time were `tidyr` and `dplyr`, both of which are part of the `tidyverse`.

# installing the packages for the first time
install.packages(c('tidyr', 'dplyr'))

# load the packages using the library function
library(tidyr)
library(dplyr)

There are also two additional packages that I use, namely `devtools` and `EDAWR`. `EDAWR` is not available on CRAN, so we install it from GitHub using `devtools`.

# installing and loading the `devtools` packages for the first time 
install.packages('devtools')
library(devtools)

# installing the EDAWR using install_github from devtools
install_github('rstudio/EDAWR')
library(EDAWR)

Finally, we can use several datasets available in the EDAWR package, such as `storms`, `cases`, `pollution`, and `tb`.

# get the help of each dataset to know about the background 
?storms
?cases
?pollution
?tb
Documentation of `storms` dataset.
Documentation of the `pollution` dataset.

Tidying data using `tidyr`

What is “tidy data”? “Tidy data” is data that has the following characteristics:

  1. Each variable is saved in its own column.
  2. Each observation is saved in its own row.
  3. Each type of observation is saved in a single table.

The goal is to make the data easier to access and to preserve the observations.

`tidyr` is a package for reshaping the layout of tables. Its two main functions are gather() and spread(). (In recent versions of `tidyr`, these have been superseded by pivot_longer() and pivot_wider(), but the older functions still work.)

How to use gather() and spread()

Example usage of gather()
# collapses multiple columns into two columns
gather(cases, 'year', 'count', 2:4)
Explanation of gather() function
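To make the call above concrete, here is a minimal sketch on a toy data frame shaped like the `cases` dataset (the countries and counts are made up for illustration):

```r
library(tidyr)

# toy wide-format data frame shaped like the `cases` dataset
cases_toy <- data.frame(
  country = c("FR", "DE", "US"),
  `2011` = c(7000, 5800, 15000),
  `2012` = c(6900, 6000, 14250),
  `2013` = c(7000, 6200, 13000),
  check.names = FALSE
)

# collapse the three year columns (positions 2:4) into key/value pairs
cases_long <- gather(cases_toy, 'year', 'count', 2:4)
head(cases_long)
```

Each country/year combination now gets its own row, which satisfies the "one observation per row" rule of tidy data.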
Example usage of spread()
# generates multiple columns from two columns
spread(pollution, size, amount)
Explanation of spread() function
How spread() and gather() connected
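A matching sketch for spread(), on a toy data frame shaped like the `pollution` dataset (cities and amounts are made up):

```r
library(tidyr)

# toy long-format data frame shaped like the `pollution` dataset
pollution_toy <- data.frame(
  city = c("New York", "New York", "London", "London"),
  size = c("large", "small", "large", "small"),
  amount = c(23, 14, 22, 16)
)

# spread the size/amount key-value pair into one column per size
pollution_wide <- spread(pollution_toy, size, amount)
pollution_wide
```

spread() is the inverse of gather(): applying gather() to `pollution_wide` would take you back to the long layout.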

How to use separate() and unite()

Example usage of separate()
# splits a column by a character string operator
separate(storms, date, c('year', 'month', 'day'), sep = '-')
Example usage of unite()
# unites columns into a single column
# (here `y` stands for the output of the separate() call above)
unite(y, 'date', year, month, day, sep = '-')
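The two functions are inverses of each other. Here is a minimal round trip on a toy data frame standing in for `storms` (the storm names and dates are made up):

```r
library(tidyr)

# toy data frame with a date stored as one string column
storms_toy <- data.frame(
  storm = c("Alberto", "Alex"),
  date = c("2000-08-03", "1998-07-27")
)

# separate(): split `date` into three columns on the '-' separator
y <- separate(storms_toy, date, c("year", "month", "day"), sep = "-")

# unite() reverses it: glue the three columns back into one
z <- unite(y, "date", year, month, day, sep = "-")
```

After the round trip, `z` holds the same `date` column we started with.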
Recap of using ‘tidyr’ to create a tidy dataset

Manipulate data using `dplyr`

`dplyr` is a package for transforming tabular data.

How to access data information using `dplyr` functions
Example usage of select()
Example usage of filter()
Example usage of mutate()
Example usage of summarise()
Example usage of arrange()
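The five verbs above can be sketched on a toy data frame standing in for the `storms` dataset (the storm names and values are made up for illustration):

```r
library(dplyr)

# toy data frame standing in for the EDAWR `storms` dataset
storms_toy <- data.frame(
  storm = c("Alberto", "Alex", "Allison", "Ana"),
  wind = c(110, 45, 65, 40),
  pressure = c(1007, 1009, 1005, 1013)
)

# select(): keep only some columns
sel <- select(storms_toy, storm, pressure)

# filter(): keep only rows that match a condition
fil <- filter(storms_toy, wind >= 50)

# mutate(): derive a new variable from existing ones
mut <- mutate(storms_toy, ratio = pressure / wind)

# summarise(): collapse the table into summary statistics
summ <- summarise(storms_toy, mean_wind = mean(wind), max_wind = max(wind))

# arrange(): reorder rows, here from strongest to weakest wind
arr <- arrange(storms_toy, desc(wind))
```

Each verb takes a data frame as its first argument and returns a new data frame, which is what makes them easy to chain with the pipe operator in the next section.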

How to use the `pipe` operator

The pipe operator %>% chains multiple operations together.

Example usage of `pipe` operator
Example usage of `pipe` operator and group_by()
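As a sketch, here is the same group-wise summary written both as nested calls and as a pipe chain, on a toy data frame (cities and amounts are made up):

```r
library(dplyr)

# toy data: two measurements per city
pollution_toy <- data.frame(
  city = c("New York", "New York", "London", "London"),
  amount = c(23, 14, 22, 16)
)

# the same operation written as nested calls ...
summarise(group_by(pollution_toy, city), mean_amount = mean(amount))

# ... reads more naturally as a pipe chain, left to right
res <- pollution_toy %>%
  group_by(city) %>%
  summarise(mean_amount = mean(amount))
```

In RStudio, the keyboard shortcut Ctrl+Shift+M (Cmd+Shift+M on macOS) inserts the %>% operator.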
Shortcut to write `pipe` operator
Recap of using `dplyr` to transform tabular data

Joining data using `dplyr`

I’m sure some of you are thinking, “dplyr is a lot like SQL.” Yes, you are right! This time we will explore the “join” functions in dplyr, which are very similar in use to “join” in SQL.

Example usage of bind_cols()
Example usage of bind_rows()
Example usage of union()
Example usage of intersect()
Example usage of setdiff()
Example usage of left_join()
Example usage of inner_join()
Recap of using `dplyr` to “join” tables
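As a minimal sketch of the two most common joins, here are two toy tables sharing a `name` key (the tables and values are made up for illustration):

```r
library(dplyr)

# two toy tables sharing a `name` key
songs <- data.frame(
  song = c("Across the Universe", "Come Together"),
  name = c("John", "John")
)
artists <- data.frame(
  name = c("John", "Paul"),
  plays = c("guitar", "bass")
)

# left_join(): keep every row of the left table,
# filling unmatched rows with NA
lj <- left_join(songs, artists, by = "name")

# inner_join(): keep only rows whose key appears in both tables,
# so Paul (who has no songs here) is dropped
ij <- inner_join(artists, songs, by = "name")
```

As in SQL, the `by` argument plays the role of the ON clause; omitting it makes dplyr join on all shared column names and print a message.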

Data scraping using `rvest`

# installing the package for the first time 
install.packages('rvest')
library(rvest)

# other packages needed
install.packages(c('selectr', 'xml2', 'jsonlite', 'stringr'))
library(selectr)
library(xml2)
library(jsonlite)
library(stringr)

For this example, we are going to scrape data from this page: https://sidata-ptn.ltmpt.ac.id/ptn_sb.php?ptn=361

Screenshot of the page we are going to scrape
# assign url 
url <- 'https://sidata-ptn.ltmpt.ac.id/ptn_sb.php?ptn=361'

# using read_html function to read the url then assign to webpage
webpage <- read_html(url)
Details we need
# create the "nama_kolom" vector from the table headers
nama_kolom <- webpage %>%
  html_nodes('#jenis1 th') %>%
  html_text() %>%
  as.vector()

# create the "kolom_no" vector from the first table column
kolom_no <- webpage %>%
  html_nodes('#jenis1 td:nth-child(1)') %>%
  html_text() %>%
  as.integer()

# create the "kolom_kode" vector from the second table column
kolom_kode <- webpage %>%
  html_nodes('#jenis1 td:nth-child(2)') %>%
  html_text() %>%
  as.integer()
Merge all data
Example usage of simple aggregation using the scrapped data
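The last two steps can be sketched as follows. Since the live page may change or be unreachable, the vectors below are stand-ins with made-up values, and the `kolom_daya` capacity column is hypothetical:

```r
library(dplyr)

# stand-ins for the vectors scraped above (values are made up)
kolom_no <- 1:3
kolom_kode <- c(3612015, 3612023, 3612031)
kolom_daya <- c(40, 60, 50)  # hypothetical capacity column

# merge all scraped columns into one data frame
hasil <- data.frame(no = kolom_no, kode = kolom_kode, daya = kolom_daya)

# simple aggregation on the scraped data
agg <- summarise(hasil, total_daya = sum(daya), n_prodi = n())
```

The same pattern scales to however many columns the page's table has: scrape each `td:nth-child(i)` into a vector, then bind them into a single data frame for analysis.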

Conclusion

Data wrangling requires our ability as analysts to see, digest, and understand the data we have. That understanding ultimately tells us which changes, additions, removals, and other adjustments are needed before starting further data analysis.

In case you want the PDF version of my workshop material, you can access the file here. Or you can just read it directly below:

I hope this post is useful for your journey to do data wrangling. Thank you for reading my post!

Mochamad Kautzar Ichramsyah is a data analytics professional with 10 years of experience at tech companies in Indonesia.