Web Scraping Historical Sovereign Credit Ratings

Using BeautifulSoup and Python

Fernando Aguilar
8 min read · Nov 15, 2019

Introduction

I am currently working on a project inspired by a recent World Bank hackathon event I had the opportunity to attend. For this project I need historical long-term sovereign credit ratings, ideally the last 10 years of data, for as many countries as possible. While searching for this information I stumbled upon the following website: https://countryeconomy.com/. Among its various economic indicators, it has a section devoted solely to sovereign ratings, which is exactly what I need for my project.

For this post I am assuming basic familiarity with HTML tags on your part, so I will not cover them. In case you need to brush up on HTML, I recommend this short post:

Although not strictly necessary for this post, I would like to define what a sovereign credit rating is for those of you not familiar with the term. From the website:

“Sovereign credit rating, is an evaluation made by a credit rating agency and evaluates the credit worthiness of the issuer (country or government) of debt.”

The Website

Once on the website, the layout is pretty straightforward. It lists all of the countries in its database with the latest ratings, in this case 2019 ratings, in a single table. Each country has a ‘[+]’ sign at the end, meaning that more information about the country is available by following the link. If you click on a country, you will be taken to that country’s page with its historical sovereign ratings. Alright, now that I know I can get all of the information I need for my project from this website, how do I get it into a pandas dataframe?

Let’s start at the first page containing a table with all of the countries.

Depending on the browser you are using, the way to access the developer tools to inspect the website’s code might differ. The following link shows how to access the developer tools in your browser:

Since I am using Chrome, I usually highlight the element I’m most interested in, right click, and select the ‘Inspect’ option. This opens the developer tools and shows that particular element’s location in the page’s code.

Highlighted the first country in the table (United States) and inspected it in DevTools.

Great! Now we know that each row in the table is a country. The first cell of each row is a link, an <a> tag, with the href attribute pointing to the country’s page. Example: <a href="/ratings/usa">United States [+]</a>

It makes sense to store each country’s name and link as a tuple in a list covering all of the countries. Example: country_links = [('United States', '/ratings/usa'), …]. That way we can iterate through the list and get each country’s ratings from its respective webpage.

Getting the Tuples List

First, we need to import the required libraries: BeautifulSoup, SoupStrainer, requests, and pandas. BeautifulSoup is a Python library for pulling data out of HTML and XML files. The SoupStrainer class allows you to choose which parts of an incoming document are parsed. Requests allows you to send HTTP/1.1 requests and access the response data, the website’s content in this case.

Let’s start by requesting the ratings page containing the table with all the countries and their respective links. As seen in the HTML below, the table containing the information has the id ‘tb1T’. Since this is the only element we need from the whole webpage, it will be the SoupStrainer’s parameter. Cutting off unnecessary information at the start makes finding the information you DO need easier down the line.

In the gist below, I import the libraries described above, set the correct parameters, and store the contents of the element with id ‘tb1T’ in my soup variable.

Requesting the table from the website and storing the contents in ‘soup’.
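A minimal sketch of what that gist likely contains; the exact URL of the listing page is my assumption, inferred from the site’s link structure:

import requests
from bs4 import BeautifulSoup, SoupStrainer

# Page listing all countries with their latest ratings
# (assumed path, based on links like '/ratings/usa')
url = 'https://countryeconomy.com/ratings'

# Only parse the element with id='tb1T'; everything else is discarded
only_ratings_table = SoupStrainer(id='tb1T')

response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml', parse_only=only_ratings_table)

print(soup.prettify())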
<!DOCTYPE html>
<table class="table tabledat table-striped table-condensed table-hover" id="tb1T">
<thead>
<tr class="tableheader">
<th style=" width:19%;">
</th>
<th style=" width:27%;">
<a href="/ratings/moodys">
Moody's ratings [+]
</a>
</th>
<th style=" width:27%;">
<a href="/ratings/standardandpoors">
S&amp;P ratings [+]
</a>
</th>
<th style=" width:27%;">
<a href="/ratings/fitch">
Fitch ratings [+]
</a>
</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<a href="/ratings/usa">
United States [+]
</a>
</td>
<td>
<span class="graph_hbar" style="background-color: #8DEEEE; width: 77%;">
</span>
<span class="padleft">
Aaa
</span>
</td>
<td>
<span class="graph_hbar" style="background-color: #00D600; width: 74%;">
</span>
<span class="padleft">
AA+
</span>
</td>
<td>
<span class="graph_hbar" style="background-color: #8DEEEE; width: 77%;">
</span>
<span class="padleft">
AAA
</span>
</td>
</tr>
<tr>
<td>
<a href="/ratings/uk">
United Kingdom [+]

These are the first lines printed from the above gist; I only displayed the lines up to the second country. We do not need any of the table headers, <thead>, and it is evident that all of the countries are listed inside the <tbody> tag. Furthermore, we are only interested in the first cell of each table row. So, now, we store the table body’s, <tbody>, contents in a variable called table.

table = soup.find('tbody')

The only link tags in the table, <a>, are the countries we are interested in, so it is easier to iterate over the links one by one rather than going to the first cell of each row, row by row. We do that with a for loop over each link in the table, table('a'). We also want the country names clean, without the ‘[+]’ sign.

Get the country and its link as a tuple in a list
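A sketch of that loop, continuing from the table variable above (the exact cleanup of the ‘[+]’ suffix is my assumption):

country_links = []

# Every <a> tag inside the table body is a country link
for link in table('a'):
    # Drop the trailing ' [+]' from the country name
    name = link.get_text().replace('[+]', '').strip()
    country_links.append((name, link.get('href')))

print('Number of countries: ', len(country_links))
country_links[:5]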
Number of countries:  143
[('United States', '/ratings/usa'),
('United Kingdom', '/ratings/uk'),
('Germany', '/ratings/germany'),
('France', '/ratings/france'),
('Japan', '/ratings/japan')]

We have successfully gathered the links to the historical sovereign credit ratings pages for 143 countries. Before scraping the data for each country, it is prudent to verify that all of the links work. Using the requests library and a for loop, we request each of the links; if an error arises, the country’s name is added to our error_country list.

Check for invalid links
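One way to write that check, assuming the base URL of the site is prepended to each relative link:

base_url = 'https://countryeconomy.com'
error_country = []

# Request each country page; collect any that fail
for country, link in country_links:
    try:
        r = requests.get(base_url + link)
        r.raise_for_status()
    except requests.exceptions.RequestException:
        error_country.append(country)

if not error_country:
    print('All links are working')
else:
    print('Broken links:', error_country)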
All links are working

The Country’s Webpage

Now that we have all of the countries and their respective links in a list, we are going to look at a country’s webpage to see how to get the information we need from it. Since every country has the exact same page layout, looking at one arbitrarily chosen page gives us all the information we need to build our scraper. The webpage for the United States looks like the following image:

United States Historical Sovereign Credit Rating Webpage

After inspecting the webpage using the developer tools, I concluded that all of the ratings data is contained in the div element with id="myTabContent". That makes sense, since the information is organized in tabs, the three tabs on top of the table. Each tab corresponds to one of the three rating agencies: Moody’s, S&P, and Fitch.

<!DOCTYPE html>
<div class="tab-content col-sm-12" id="myTabContent">
<div class="tab-pane fade in active" id="moodys">
<a id="MOODYS">
</a>
<div class="tabletit">
Rating Moody's United States
</div>
<div class="table-responsive">
<table class="table tabledat table-striped table-condensed table-hover" id="tb0_963">
<thead>
<tr class="tableheader">
<th class="wborder" colspan="4">
Long term Rating
</th>
<th class="wborder" colspan="4">
Short term Rating
</th>
</tr>
...

After inspecting the HTML code, I found that the ratings are located inside <tbody> tags. Now, let’s look at how many of those are in the whole page.
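A quick sketch of that check, straining the page down to the tab container (the /ratings/usa path comes from the table we scraped earlier):

import requests
from bs4 import BeautifulSoup, SoupStrainer

# Parse only the tab container holding the agencies' rating tables
only_tabs = SoupStrainer(id='myTabContent')

response = requests.get('https://countryeconomy.com/ratings/usa')
soup = BeautifulSoup(response.content, 'lxml', parse_only=only_tabs)

tables = soup.find_all('tbody')
print('Number of tables:', len(tables))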

Number of tables: 3

Great! There is a table for each of the tabs, so we can iterate through the tables and get each agency’s ratings. First, I will create a list named ratings to store all of the scraped ratings. At the same time I will create a list with the names of the three credit rating agencies, which will help label the ratings down the line. The whole scraper can be explained by dividing it into three for loops, put together in the sketch after this list.

  1. The first for loop iterates through the country_links list, which contains the country-link tuples. It gets the content from the country’s webpage and parses only the element with id="myTabContent" into the soup. Then we store the three <tbody> elements in a variable called tables.
    Optional: Some websites are aware of web scrapers and may deny your requests if they feel targeted by the amount and speed of requests coming from your IP address. Hence, it is advisable to use the sleep function from the time library. It is entirely optional; in this example I set it to 1 second, so every loop waits 1 second before making the next request.
  2. The second for loop is nested in the first one. It iterates through each of the <tbody> elements in the soup, pairing it with its agency label.
  3. The third and last for loop is nested in the second one. It iterates through each of the rows in the <tbody> element. First we test whether there is text in the first cell of the row; if not, we move on to the next row. The first cell of each row, a <td> element, is the date; the second <td> element is the rating. In this loop we get the date as a string from the first <td> element and the rating from the second.

Last but not least, we append the date, country, rating, and agency to the ratings list. After all of the iterations are complete, we turn the ratings list into a pandas dataframe. The following code puts it all together:
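A sketch of the full scraper, assuming the country_links list from before; pairing each <tbody> with its agency label by position, and the dataframe column names, are my choices:

import requests
import pandas as pd
from time import sleep
from bs4 import BeautifulSoup, SoupStrainer

ratings = []
agencies = ['moodys', 'sp', 'fitch']
base_url = 'https://countryeconomy.com'
only_tabs = SoupStrainer(id='myTabContent')

# Loop 1: one request per country
for country, link in country_links:
    sleep(1)  # optional: pause to avoid hammering the server
    response = requests.get(base_url + link)
    soup = BeautifulSoup(response.content, 'lxml', parse_only=only_tabs)
    tables = soup.find_all('tbody')

    # Loop 2: one <tbody> per rating agency, paired by position
    for agency, tbody in zip(agencies, tables):
        print(f'Scraping {country} {agency} sovereign credit ratings')

        # Loop 3: one row per rating change
        for row in tbody.find_all('tr'):
            cells = row.find_all('td')
            # Skip rows whose first cell is empty (e.g. spacer rows)
            if len(cells) < 2 or not cells[0].get_text(strip=True):
                continue
            date = cells[0].get_text(strip=True)    # first cell: date
            rating = cells[1].get_text(strip=True)  # second cell: rating
            ratings.append((date, country, rating, agency))

    print(f'Web scraping for {country} completed')

# Turn the accumulated tuples into a pandas dataframe
ratings_df = pd.DataFrame(ratings, columns=['date', 'country', 'rating', 'agency'])
print('Ratings dataframe ready.')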

...
Scraping Zambia moodys sovereign credit ratings
Scraping Zambia sp sovereign credit ratings
Scraping Zambia fitch sovereign credit ratings
Web scraping for Zambia completed

Ratings dataframe ready.

A quick reminder: save your data as a CSV for future reference, so you don’t have to scrape it again every time you need it.
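For example, with pandas (the filename is just a placeholder):

# Persist the scraped ratings to disk for later reuse
ratings_df.to_csv('sovereign_ratings.csv', index=False)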

Resulting dataframe

Awesome! Now that we have the data in long format, you can clean and scrub it to fit your project’s needs.


Fernando Aguilar

Data Analyst at Enterprise Knowledge, currently pursuing an MS in Applied Statistics at Penn State, and a Flatiron Data Science Bootcamp graduate.