How to easily access web data with R — part 1 of “R for Applied Economics” guide

Dima Diachkov
8 min read · Jan 22, 2023


How to get data from the internet and put it into your R environment?

Are you ready to learn more about R and its applications in Economics & Finance? No theory classes are needed; we will get hands-on experience today. Let's jump right in.

"It's the job that's never started as takes longest to finish." — Sam Gamgee

Sam Gamgee by MidJourney AI (not red-haired, but nevertheless I like it)

Please don't be scared to start doing something practical with R. I want to remind you that you are here because you are already familiar with R (at least at a basic level, such as being aware of data structures and syntax), which is a powerful and widely used programming language for data analysis and visualization. Let me also remind you that one of its key strengths is how easily it can access and work with data from various sources, including the web. And today we are going to do exactly that.

Accessing web data with R is a highly valuable and versatile skill for data analysts and scientists.

On one hand, the ability to easily import data from various web sources allows for a wider range of data to be utilized in analysis and visualization. This can lead to a deeper understanding of the subject matter and more accurate conclusions. Additionally, R’s wide range of packages and functions for web scraping and parsing HTML and XML data make it a powerful tool for data extraction from the web.

On the other hand, there are also concerns about the ethical and legal implications of web scraping. Without proper permissions and considerations, web scraping can be a violation of a website’s terms of service and can also potentially lead to the mishandling of sensitive information. It’s important to ensure that data is obtained legally and ethically before using it in any analysis or visualization.

Hence, we will use only publicly available and free-to-use data (preferably official statistical and financial data).

There are several ways to access web data in R: built-in functions such as 'read.csv()' can read files directly from a URL, while specialized packages such as 'rvest' and 'httr' cover web scraping of HTML pages and HTTP requests, respectively.
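For instance, when a website exposes data as a plain CSV file, base R can read it straight from the address. A minimal sketch (the URL below is a placeholder, not a real dataset):

# reading a CSV directly from a URL (placeholder address, for illustration only)
url <- "https://example.com/data.csv"
df <- read.csv(url)
head(df)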

Other formats that can be accessed in R include JSON and APIs. The 'jsonlite' and 'RJSONIO' packages are useful for working with JSON data, and 'httr' can be used to interact with APIs. Practically every popular format can be parsed or downloaded through R.
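To give a flavour of the JSON route, here is a minimal sketch with 'jsonlite'; the endpoint below is a placeholder, not a real API:

# a minimal sketch of the JSON route (placeholder endpoint, not a real API)
library(jsonlite)
api_url <- "https://example.com/api/data.json"
json_data <- fromJSON(api_url) # jsonlite downloads and parses in one call
str(json_data)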

When it comes to web data access, some of the most commonly used libraries in R are 'rvest', 'httr', 'RCurl', 'XML', and 'xml2'. This should be enough to begin with. Further and more specific details will be covered later throughout the guide.
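If any of these are missing on your machine, a one-liner installs them all (install only what you actually need):

# install the web-access packages in one go (run once)
install.packages(c("rvest", "httr", "RCurl", "XML", "xml2"))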

Business case 👇

Credits: https://www.fakewhats.com/generator

Practice

Well, in a case like this you have plenty of ways to solve the problem. But let's pretend that you don't have the data at hand, so you decide to download it first, and you also have to be prepared to update it next week/month/quarter/year.

Conceptually, the task consists of these steps, which remind me of the ETL approach:

  1. Find the right data and the way to parse or download it;
  2. Manipulate the data (if needed);
  3. Do whatever you were asked to do (charts, tables, etc.).

Step 1. Find the right data and the way to parse or download it

Just google it (if you don't know the right place yourself) and use your deduction skills to search for it.

If you need official (which means — trusted) data you can either use government/agency websites (quite often they are data owners or data stewards) OR you can go with commercial data providers (like Capital IQ, Eikon, Bloomberg, etc.). I can say that both types of data sources are fine. It is up to you what to use! But in the simulation, we pretend that we do not have any access to the commercial data providers. We have just received a simple task and we need to act quickly with very limited resources.

So, we need to find data for inflation in the EU. I would go to a place which I trust fully: the ECB website. Let's just browse through the statistics section, and voilà, we see the link to the ECB Statistical Data Warehouse (https://sdw.ecb.europa.eu/). That should be it, right? Of course, we will use it in other projects as well, so keep it. The main data items are on the front page, so just click on Inflation. What do we see next?

ECB Statistical Data Warehouse (https://sdw.ecb.europa.eu/)

This is the page that we will parse. And it is very common to parse webpages when you don’t have a convenient link to data export in the desired format. So you will see many pages like that.

Please look at the web address.

The highlighted area is the URL, and I am not the one to tell you all the techie stuff about it. But what I do know is that the right-hand part of the address carries parameters (actually just one in this case, SERIES_KEY).

So this page responds to a request for the time series "122.ICP.M.U2.N.000000.4.ANR", which means inflation plus some other related details.

Hence, if you know another time series key, just replace it in the browser, reload the page, and you will see the new data (you can try it out: use this key for USD/EUR, "120.EXR.D.USD.EUR.SP00.A").

Do you see where this is going? Yeah, once we prepare our R code to parse the inflation data, we will be able to reuse it for other datasets (e.g. the USD/EUR series mentioned above). Awesome, right? I will show it to you soon.
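Just as a teaser, here is a minimal sketch of how such reuse could look; the helper function name is mine, purely for illustration:

# a hypothetical helper that builds the quickview URL for any series key
build_sdw_url <- function(series_key) {
  paste0("https://sdw.ecb.europa.eu/quickview.do?SERIES_KEY=", series_key)
}
build_sdw_url("122.ICP.M.U2.N.000000.4.ANR") # inflation
build_sdw_url("120.EXR.D.USD.EUR.SP00.A")    # USD/EUR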

So basically now we have a URL for the data. That is all we need. Keep the page open, as we are going to copy the URL from it later.

Let's run our RStudio (or whatever you use to work with R). Please start by attaching the needed library, 'rvest' (please do not forget to install it beforehand if you have not done so earlier).

# chunk 1
# attaching packages / settings
library(rvest)

Okay. Now we are ready to parse the inflation rates, so let's set the path to the data, parse the HTML, and check what we got.

# chunk 2
# input / extraction
link <- "https://sdw.ecb.europa.eu/quickview.do?SERIES_KEY=122.ICP.M.U2.N.000000.4.ANR"
parsed_data <- read_html(link)
class(parsed_data)
parsed_data
Output for chunk 2

Well, the output is some HTML structure, while the class is "xml_document". It means that we are working with a tree-like data structure. And we know that the page we saw had the table we need. Let's look for it inside.

# chunk 3
# looking for the right table inside
parsed_table <- html_table(parsed_data)
class(parsed_table)
parsed_table
Output for chunk 3

Seems like we have a list of tables extracted from the XML structure. Just scroll down your R console. As you may have guessed, when you parse a whole page you receive every table-like structure in the HTML notation, and the table we need is usually the last one, here at index 6. Congrats! We made it this far, and we can proceed with it further.
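A small aside: instead of hard-coding index 6, you could take the last element programmatically, which is slightly more robust if the page layout shifts (a sketch, assuming the data table stays last):

# pick the last table without hard-coding its index
last_table <- parsed_table[[length(parsed_table)]]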

Step 2. Manipulate the data

Let's clean up the mess. We usually have issues with formats, headers, and redundant columns and rows. Run the str() function to have a look at the first rows and the formats.

# chunk 4
# manipulation / transformation
raw_df <- as.data.frame(parsed_table[[6]])
str(raw_df)
Output for chunk 4

What do you see? No headers (first issue), wrong formats (second issue), and the first couple of lines are populated with metadata (third issue).

Where do I see it? Well, you'll get there. Before that, let's use a more comfortable way of looking at a dataframe: RStudio's viewer.

RStudio preview of raw_df
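If you are not in RStudio, the console gives you a similar quick look:

# quick inspection without the RStudio viewer
head(raw_df, 5) # first five rows
# View(raw_df)  # opens the spreadsheet-like viewer in RStudio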

Now it should be simple to fix. I will be radical and rename the columns and drop the first two rows (as now I know what is inside and how to name it properly).

# chunk 5
# manipulation / transformation
clean_df <- raw_df[3:nrow(raw_df), 1:2] # subset raw_df: keep rows from 3 onward and only the first 2 columns
names(clean_df) <- c("Period", "Inflation") # rename only those columns that are relevant

# start with the time measure
clean_df$Period <- paste0(clean_df$Period, "-01") # the data have no day component, so add one for convenience
clean_df$Period_formatted <- as.Date(clean_df$Period, format = "%Y-%m-%d") # new column parsed as a proper Date

# then fix the value column
clean_df$Inflation_formatted <- as.numeric(clean_df$Inflation)

# now we check what we have got
str(clean_df) # it is all good, now we drop the redundancies
clean_df <- clean_df[, 3:4]
names(clean_df) <- c("Period", "Inflation") # rename the newly created columns
Output for chunk 5

Pretty straightforward, right? Seems like step 2 is done.

Step 3. Do whatever you were asked to do (charts, tables, etc.)

Now the dataframe is available for analysis. Let’s just plot something super-simple so we know what has been achieved.

# chunk 6
# output / loading
plot(clean_df, type = "l") # "l" draws a line chart; type = "line" would only trigger a warning
Output for chunk 6

Nice, it seems real. I believe that we can make this chart better in many ways and tune it for various purposes (e.g. add other data to cross-check it).
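For instance, a minimal sketch of the same base-R chart with a title and axis labels (the label wording is mine):

# the same chart, slightly tuned with a title and axis labels
plot(clean_df, type = "l",
     main = "Euro area inflation (ECB SDW)",
     xlab = "Period", ylab = "Annual rate of change, %")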

Conclusion

At this stage, we can conclude part 1, since all the steps are done and the data is prepared and at hand. So far, we have learned:

  • how to use a URL for parsing;
  • how to parse data in R with the 'rvest' package;
  • how to extract data from HTML/XML documents and transform it into manageable dataframes;
  • and, more importantly, you have written code that parses inflation data from the ECB website and delivers it to you in a proper form.

Next time we will draw something more complicated and add some economics to it, as we need to help out our buddy Gandalf. Do you remember that we promised him really nice charts to support the evidence? He (and you) will love it!

By the way, the FULL code is available in the designated GitHub repo for your convenience: https://raw.githubusercontent.com/TheLordOfTheR/R_for_Applied_Economics/main/Part1.R

Please clap 👏 and subscribe if you want to support me. Thanks!❤️‍🔥
