Extracting significant content from a web page using the Arc90 Readability algorithm
In the vast, unstructured world of information we live in, it can be hard to locate the significant content of a given piece of media.
Fear no more: in this article, I will get you acquainted with the Readability algorithm and how it can be used to extract significant information from a web page.
I learned about this tool when I was tasked with crawling and scraping relevant information (phone numbers, email addresses, main goals, and vision) from the web pages of companies in the hydraulics field. This article will focus on presenting an overview, and an example, of the Arc90 Readability algorithm; a full article about the web scraping project will come soon.
About the algorithm
The Readability algorithm was developed in 2010 by Arc90 Labs. Its main goal was to offer media consumers the option to get rid of the junk on a website and focus only on the relevant content. Here's a demo of the algorithm: https://www.youtube.com/watch?v=jnn1D5UOmlU
In a nutshell, the algorithm is run on the HTML code of a web page and grants points (positive or negative) to elements that are characteristic of the page's main content. The elements that are inspected and scored can be text in h or p tags, but also class or ID names.
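To make this concrete, here is a toy sketch of the scoring idea in Python. This is my own simplification for illustration, not the actual Arc90 implementation: the function name, the hint lists, and the point values are all invented. Paragraphs earn points for length and punctuation, their parent containers accumulate those points, and class/ID names like "content" or "sidebar" nudge the score up or down.
import re
from bs4 import BeautifulSoup

# Class/ID hints in the spirit of Arc90's lists (the real lists are longer)
POSITIVE = re.compile('article|body|content|entry|main|post|text', re.I)
NEGATIVE = re.compile('comment|footer|menu|nav|sidebar|banner', re.I)

def best_container(html):
    """Return the tag that accumulates the highest content score."""
    soup = BeautifulSoup(html, 'html.parser')
    scores, tags = {}, {}
    for p in soup.find_all('p'):
        text = p.get_text()
        score = 1                           # base point for being a paragraph
        score += text.count(',')            # commas suggest real sentences
        score += min(len(text) // 100, 3)   # up to 3 points for raw length
        parent = p.parent
        hint = ' '.join(parent.get('class', [])) + ' ' + (parent.get('id') or '')
        if POSITIVE.search(hint):
            score += 25
        if NEGATIVE.search(hint):
            score -= 25
        key = id(parent)                    # parents collect their paragraphs' points
        scores[key] = scores.get(key, 0) + score
        tags[key] = parent
    if not scores:
        return None
    return tags[max(scores, key=scores.get)]
On a typical article page, a sketch like this tends to return the container wrapping the body text; the real algorithm then cleans that container further before presenting it.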
Application
Here’s a web page of a company:
At first glance, it is fairly hard to pinpoint the important information. Let's run the Readability algorithm.
Here are some imports you will need:
import requests
from readability.readability import Document  # For the Arc90 Readability algorithm
from bs4 import BeautifulSoup as bs
import re
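(If you need to install these, the Document class used here matches the readability-lxml package on PyPI; pip install readability-lxml requests beautifulsoup4 should cover all three imports. Other Python ports of the algorithm exist, so adjust accordingly if you use a different one.)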
Here is the code to extract the title and main content from the page:
# Regex to clean the HTML tags and entities
clean = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

# Get the HTML code of the page
response = requests.get('http://www.dyrhoff.co.uk/')

# Initialize the Readability algorithm with Document
document = Document(response.text)

# Get the title of the web page
doc = document.title()

# Get the main content of the web page
elem = document.summary()

# Print out the output
print('Title: ')
print(re.sub(clean, '', doc))
print('\n')
print('Content: ')
print(re.sub(clean, '', elem))
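Note that document.summary() returns an HTML fragment rather than plain text, which is why the regex is applied one more time to strip the leftover tags before printing.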
Output:
Title:
Dyrhoff UK : Inflatable Rubber Dams and Spillway Gates

Content:
Dyrhoff Limited is celebrating fifteen years of successful operation from its base in Folkestone, Kent. Recognised globally as one of the leaders in the rubber dam and pneumatic gate field, Dyrhoff Ltd has, since 2005, designed, supplied and supervised the installation of 136 rubber dams and spillway gates, across 87 sites in 29 countries across Europe, the Americas, Asia and Africa. In addition, Dyrhoff Ltd is currently working on a further 28 projects for completion in the next three years. Dyrhoff AS began its association with rubber dams in 1989 as the Scandinavian agent for Sumitomo Electric. At the time the Japanese company was the leading manufacturer and supplier of rubber dams in the world. In 2003, Sumitomo granted Dyrhoff a worldwide licence to sell rubber dams to the Sumitomo design. Dyrhoff Ltd was established in June 2005 to develop this side of the business. Whilst taking its technical knowhow from the Japanese specialist, Dyrhoff Ltd developed partnerships with suppliers around the world, enabling it to respond to very specific client requirements and remain competitive in this ever-evolving market.
Now let’s do a quick inspection of the page:
As we can see, the page contains consecutive paragraph tags, which were (very likely) scored positively and extracted by the algorithm.
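You can run a similar inspection in code instead of the browser's developer tools. This quick snippet (my own addition, reusing the bs import from earlier) lists each paragraph's length, its parent's class, and a preview of its text:
soup = bs(response.text, 'html.parser')
for p in soup.find_all('p'):
    text = p.get_text(strip=True)
    if text:
        # Long paragraphs clustered under the same parent are prime candidates
        print(len(text), p.parent.get('class'), text[:60])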
Conclusion
It is important to note that this method won't produce perfect results: depending on the web page you run it on, you might get a blank output or unrelated data, so it is worth validating what comes back (a small sketch of such a check follows below). That said, this tool can still come in handy when dealing with large amounts of content, and it can save time during the reading process.
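As a minimal example of that check (my own sketch, with an arbitrary length threshold):
content = re.sub(clean, '', document.summary()).strip()
if len(content) < 100:  # threshold is arbitrary; tune it for your pages
    print('Readability found little or no main content on this page.')
else:
    print(content)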
Thank you for reading!