Intro to data wrangling and scraping using R
Background
In 2020, I was contacted by a former lecturer of mine from when I studied Statistics at Gadjah Mada University, Indonesia. He invited me to participate as an instructor in a workshop at The Tenth International Conference and Workshop on High-Dimensional Data Analysis (ICW-HDDA-X) 2020. I was surprised at the time because we had not communicated for eight years. At the same time, I felt flattered that he still remembered me and considered me capable of being an instructor at the workshop. The material I decided to present is what the title of this post says. Without further ado, let’s go!
What is data wrangling?
Data wrangling is everything you need to do before doing the actual data analysis. Most of the time, we have to do these things:
- Spot variables and observations
- Derive new variables and observations
- Reshape the data into the best format
- Join multiple datasets
- Summarize the data group-wise
Quoting from an article published by The New York Times: “Data scientists, according to interviews and expert estimates, spend from 50% to 80% of their time in this matter, before it can be explored.” It would therefore be good if we could carry out this data-wrangling process efficiently and effectively, so that we can spend more of our time on the analysis itself.
Data wrangling using R
For this material, I use the R programming language with the RStudio IDE. You can download the R programming language here and the RStudio IDE here. The packages that I used for data wrangling at that time were `tidyr` and `dplyr`, both of which are part of the `tidyverse`.
# installing the packages for the first time
install.packages(c('tidyr', 'dplyr'))
# load the packages using the library function
library(tidyr)
library(dplyr)
There are also several additional packages that I use, namely `devtools` and `EDAWR`. `EDAWR` in particular is not available on CRAN, so we have to install it from GitHub with the help of `devtools`.
# installing and loading the `devtools` packages for the first time
install.packages('devtools')
library(devtools)
# installing the EDAWR using install_github from devtools
install_github('rstudio/EDAWR')
library(EDAWR)
Finally, we can use several datasets available in the EDAWR package, such as `storms`, `cases`, `pollution`, and `tb`.
# get the help of each dataset to know about the background
?storms
?cases
?pollution
?tb
Tidying data using `tidyr`
What is “tidy data”? “Tidy data” is data that has the following characteristics:
- Each variable is saved in its own column.
- Each observation is saved in its own row.
- Each type of observation is saved in a single table.
The goal is to make the data easier to access and to preserve the relationships between observations.
`tidyr` is a package to reshape the layout of tables. Its two main functions are `gather()` and `spread()`.
How to use `gather()` and `spread()`
# collapses multiple columns into two columns
gather(cases, 'year', 'count', 2:4)
# generates multiple columns from two columns
spread(pollution, size, amount)
How to use `separate()` and `unite()`
# splits a column by a character string separator
storms2 <- separate(storms, date, c('year', 'month', 'day'), sep = '-')
# unites several columns back into a single column
unite(storms2, 'date', year, month, day, sep = '-')
Manipulate data using `dplyr`
`dplyr` is a package to transform tabular data. Its main verbs are `select()`, `filter()`, `mutate()`, `arrange()`, and `summarise()`.
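To get a feel for these verbs, here is a minimal sketch using the `storms` dataset from `EDAWR`, assuming its `storm`, `wind`, and `pressure` columns:
# pick columns by name
select(storms, storm, pressure)
# pick rows that match a condition
filter(storms, wind >= 50)
# derive a new variable from existing ones
mutate(storms, ratio = pressure / wind)
# sort rows, here from the strongest wind down
arrange(storms, desc(wind))
# collapse all rows into a single summary
summarise(storms, mean_wind = mean(wind))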
How to use the pipe operator
The pipe operator `%>%` chains multiple operations together by passing the result of one step as the first argument of the next.
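For example, here is the same summary written first with nested calls and then with the pipe, plus a group-wise version; a minimal sketch that again assumes the `storms` dataset from `EDAWR`:
# without the pipe: read from the inside out
summarise(filter(storms, wind >= 50), max_pressure = max(pressure))
# with the pipe: read from top to bottom
storms %>%
  filter(wind >= 50) %>%
  summarise(max_pressure = max(pressure))
# group-wise summarize: one summary row per storm
storms %>%
  group_by(storm) %>%
  summarise(max_wind = max(wind))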
Joining data using `dplyr`
I’m sure some of you are thinking, “dplyr is a lot like SQL.” Yes, you are right! This time we will explore the “join” functions in dplyr, which work very much like joins in SQL.
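As a minimal sketch, here are the main join functions applied to two hypothetical toy data frames (`band` and `instrument` are made up purely for illustration):
# toy data frames to illustrate the joins
band <- data.frame(name = c('Mick', 'John', 'Paul'),
                   band = c('Stones', 'Beatles', 'Beatles'))
instrument <- data.frame(name = c('John', 'Paul', 'Keith'),
                         plays = c('guitar', 'bass', 'guitar'))
# keep all rows of band, add matching columns from instrument
left_join(band, instrument, by = 'name')
# keep only the rows that appear in both tables
inner_join(band, instrument, by = 'name')
# keep all rows of both tables
full_join(band, instrument, by = 'name')
# keep rows of band that have no match in instrument
anti_join(band, instrument, by = 'name')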
Data scraping using `rvest`
# installing the package for the first time
install.packages('rvest')
library(rvest)
# other packages needed
install.packages(c('selectr', 'xml2', 'jsonlite', 'stringr'))
library(selectr)
library(xml2)
library(jsonlite)
library(stringr)
On this occasion, we are going to scrape data from this page: https://sidata-ptn.ltmpt.ac.id/ptn_sb.php?ptn=361
# assign url
url <- 'https://sidata-ptn.ltmpt.ac.id/ptn_sb.php?ptn=361'
# using read_html function to read the url then assign to webpage
webpage <- read_html(url)
# extract the table header texts into "nama_kolom" (the column names)
nama_kolom <- webpage %>%
  html_nodes('#jenis1 th') %>%
  html_text() %>%
  as.vector()
# extract the first table column into "kolom_no" as integers
kolom_no <- webpage %>%
  html_nodes('#jenis1 td:nth-child(1)') %>%
  html_text() %>%
  as.integer()
# extract the second table column into "kolom_kode" as integers
kolom_kode <- webpage %>%
  html_nodes('#jenis1 td:nth-child(2)') %>%
  html_text() %>%
  as.integer()
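Finally, we can combine the scraped vectors into a single data frame. This is a sketch under the assumption that both vectors have the same length and that the first two scraped header texts correspond to these two columns:
# combine the scraped vectors into a data frame
hasil <- data.frame(kolom_no, kolom_kode)
# reuse the scraped header texts as column names (assumed to be the first two)
names(hasil) <- nama_kolom[1:2]
head(hasil)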
Conclusion
Data wrangling requires our ability as analysts to see, digest, and understand the data we have. That understanding is what tells us which changes, additions, removals, and other adjustments are needed before starting further data analysis.
In case you want the PDF version of my workshop material, you can access the file here.
I hope this post is useful for your journey to do data wrangling. Thank you for reading my post!
References
- Tidyverse https://www.tidyverse.org/
- Dplyr https://dplyr.tidyverse.org/
- Tidyr https://tidyr.tidyverse.org/
- RStudio Data Wrangling with dplyr and tidyr Cheat Sheet https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf