Serverless News Extraction

A Comparative Study of Three Open Source Libraries

Piotr Orzeszek
7 min read · Jul 23, 2020

Introduction

Serverless technologies are having a major impact on the world of computing. They allow IT teams to dynamically scale computing resources without directly administering computing power and storage. As a result, users pay less for services, and developers are free to concentrate on application features rather than on the run-time environment. Serverless environments are suitable for deploying a wide range of applications, including but not limited to news extraction from web content.

To extract content from web pages, a variety of methods can be applied. These include data mining, information retrieval, API endpoints, and semantically enriched content extraction. Further, a range of tools can be used to extract and curate information from the websites of newspapers and magazines.

In this article, we take a closer look at three Python libraries: Newspaper, Goose3 and news-please. The libraries are used to compile information associated with articles, including author name, publication date, site domain, prominent images and video content embedded in the article, keywords describing the content, a summary and full text.
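To make the list of extracted fields concrete, here is a minimal sketch of a record that could hold them. The `ArticleRecord` class and its field names are hypothetical and do not belong to Newspaper, Goose3 or news-please; each library exposes these values through its own article object.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional
from urllib.parse import urlparse


@dataclass
class ArticleRecord:
    """Hypothetical container for the fields listed above (not a library type)."""
    url: str
    title: str = ""
    authors: List[str] = field(default_factory=list)
    publish_date: Optional[datetime] = None
    top_image: Optional[str] = None
    videos: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    summary: str = ""
    text: str = ""

    @property
    def domain(self) -> str:
        """Site domain derived from the article URL."""
        return urlparse(self.url).netloc
```

A record created with `ArticleRecord(url="https://www.example.com/news/story.html")` reports `www.example.com` as its domain; the remaining fields would be filled from whichever library performs the extraction.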

The purpose of this article is to review the extraction speed and accuracy of these three libraries in a serverless environment. Readers can use our research and benchmarking as a reference for how to set up a serverless solution and use it as the first step in their data processing pipeline. (We leave more sophisticated topics, such as different methods of storing extracted data points, or data validation, cleaning and enrichment, for another time.)

This research relies on Python Cloud Importer, a solution for importing libraries directly from cloud storage and automating package optimisation. The importer was developed as a part of the Cloud AI Operating System (CAIOS) project overseen by BST LABS, the software engineering unit of BlackSwan Technologies.

Libraries Used in our Benchmark

We compare three Python libraries: Newspaper, Goose3 and news-please. For more information about each one, visit its project page and repository.

It is important to recognise that, in serverless environments, code size and memory footprint matter. We expect libraries with fewer dependencies and smaller memory requirements to extract news faster and to bootstrap more quickly.
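One rough way to see how a library contributes to bootstrapping time is to time its import. The sketch below is only an approximation: the helper name `time_import` is our own, and on a warm interpreter any dependencies that are already cached are not imported again, so a real cold start in Lambda will usually take longer.

```python
import importlib
import sys
import time


def time_import(module_name: str) -> float:
    """Return the seconds needed to import a module by name."""
    # Drop a cached copy of the top-level module so the import runs again.
    # Dependencies that are already in sys.modules remain cached.
    sys.modules.pop(module_name, None)
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start
```

For example, `time_import("goose3")` would return the import time of Goose3 in an environment where that package is installed.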

In the graph below we present the list of dependencies needed by each library we included in our research:

Fig1 News extraction libraries dependencies.

The following code sample illustrates how simple it is to write individual benchmarks and run tests with the help of CAIOS tools.

import pkgutil
import time
from typing import Tuple

from goose3 import Goose

TIMEOUT = 50  # [seconds]

# Titles that Goose3 returns in place of raising an exception
ERRORS = (
    'Access Denied',
    'Site Not Available',
    'Page not found',
    'Unavailable for legal reasons',
    '404 Error'
)


def extract(g: Goose, url: str, extracted: int, missed: int, elapsed: float) -> Tuple[int, int]:
    '''Extract one article and return the updated (extracted, missed) counters'''
    try:
        article = g.extract(url=url)
        # Strange as it looks, Goose3 returns the error message as article.title
        # rather than raising an exception. Error titles are checked first so that
        # a long one is not also counted as a properly extracted article.
        if article.title in ERRORS:
            missed += 1
        elif len(article.title) > 20:
            extracted += 1
        print(f'{extracted}/{missed}', ' ', int(elapsed), ' ', article.title)
    except Exception as exc:
        print(f'{url}: {extracted}/{missed}', ' ', int(elapsed), 'exc: ', exc)
        missed += 1
    return extracted, missed


class Goose3Benchmark:
    '''Goose3 Python library test service'''

    def _run_single_thread(self) -> Tuple[int, int]:
        '''Run the Goose3 benchmark in one thread'''
        extracted = 0
        missed = 0
        # The URL list is packaged with the service as data/urls.txt
        urls = pkgutil.get_data('data', 'urls.txt').decode('utf-8').split('\n')
        start = time.time()

        with Goose() as g:
            for url in urls:
                # Stop once the time budget is used up
                elapsed = time.time() - start
                if elapsed > TIMEOUT:
                    break
                extracted, missed = extract(g, url, extracted, missed, elapsed)
        return extracted, missed

That is all; nothing else is required. To build the Lambda function, deploy it and run it in the cloud, one just needs to type:

caios test_remote run

Run-time Environment and Deployment

The cloud landscape is dominated today by four services: Amazon AWS, Google Cloud, Microsoft Azure, plus Alibaba Cloud rising rapidly in the Far East. For this analysis, we focus on Amazon AWS, specifically AWS Lambda, but our results could be valid with minimal changes using other providers.

The main advantage of AWS Lambda is that it lets you run code without provisioning or managing servers. Compared to AWS EC2, it allows for building smaller, on-demand, responsive serverless applications. It is a computing service that runs code in response to events and automatically provisions the computing resources required by that code. Users pay only for the compute time their application consumes. Whereas Amazon EC2 is priced by the hour but metered by the second, AWS Lambda is metered in 100-millisecond increments.
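The effect of 100-millisecond metering can be illustrated with a little arithmetic. The sketch below rounds a run time up to the next increment and multiplies it by the configured memory. The price constant is a placeholder assumed for illustration only, and per-request fees are ignored; consult the provider's current price list for real figures.

```python
import math

BILLING_INCREMENT_MS = 100  # metering granularity described in the text
# Assumption: placeholder price per GB-second, for illustration only.
PRICE_PER_GB_SECOND = 0.0000166667


def billed_duration_ms(duration_ms: float) -> int:
    """Round a run time up to the next 100 ms increment."""
    return math.ceil(duration_ms / BILLING_INCREMENT_MS) * BILLING_INCREMENT_MS


def estimated_cost(duration_ms: float, memory_mb: int) -> float:
    """Estimate the compute charge for one invocation (request fees excluded)."""
    gb_seconds = (memory_mb / 1024) * (billed_duration_ms(duration_ms) / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND
```

With these definitions, a run of 101 ms is billed as 200 ms, which is why very short functions benefit from staying just under an increment boundary.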

With Lambda, you can run code for virtually any type of application or back-end service. You simply upload the code, and Lambda takes care of everything required to operate and scale your service with high availability. Nothing prevents you from starting multiple Lambda instances in parallel and terminating them as soon as their tasks are completed.

Each AWS Lambda instance is a kind of container, bootstrapped with Amazon Linux and equipped with 128–3008 MB of RAM (in 64 MB increments) and 512 MB of disk-like storage. Data persists there only for the duration of the instance life; it is discarded after all the tasks running in the instance complete. The Lambda container has a configurable execution time from 1 to 900 seconds.
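The limits quoted above can be captured in a small sanity check that runs before deployment. This is only a sketch based on the figures in this article; the function name is our own, and the limits themselves may change over time.

```python
# Limits as quoted in this article; they may differ in newer Lambda releases.
MIN_MEMORY_MB, MAX_MEMORY_MB, MEMORY_STEP_MB = 128, 3008, 64
MIN_TIMEOUT_S, MAX_TIMEOUT_S = 1, 900


def validate_lambda_settings(memory_mb: int, timeout_s: int) -> None:
    """Raise ValueError if the settings fall outside the quoted limits."""
    if not MIN_MEMORY_MB <= memory_mb <= MAX_MEMORY_MB:
        raise ValueError(f"memory must be {MIN_MEMORY_MB}-{MAX_MEMORY_MB} MB")
    if (memory_mb - MIN_MEMORY_MB) % MEMORY_STEP_MB != 0:
        raise ValueError(f"memory must grow in {MEMORY_STEP_MB} MB steps")
    if not MIN_TIMEOUT_S <= timeout_s <= MAX_TIMEOUT_S:
        raise ValueError(f"timeout must be {MIN_TIMEOUT_S}-{MAX_TIMEOUT_S} s")
```

Calling `validate_lambda_settings(1024, 300)` passes silently, while a memory size of 130 MB or a timeout of 901 seconds raises a `ValueError`.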

Using data science and data-processing packages in projects like news extraction can be a real challenge in serverless environments, as they require a lot of storage space to load their components. Amazon recently announced a shared file system for Lambda, partly to address this. However, the biggest difficulties in designing solutions based on serverless technologies turn out to be automation and complicated configurations.

Python Cloud Importer is a valuable tool used in this research that addresses storage space as well as the other challenges. It opens a lot of new possibilities and lets developers focus on application features rather than tuning complicated aspects of the cloud environment. The importer has been developed as part of the Cloud AI Operating System (CAIOS) project at BST LABS, a subsidiary of BlackSwan Technologies. Additional CAIOS tools made this benchmarking experiment practical, particularly Service Template Compiler, which automates conversion of high-level programming language code, such as Python, into service deployment scripts.

Speed Benchmark

We based our speed evaluation on three different benchmarks. A very simple, single-threaded, minimalistic extraction loop was run for each library for 60 seconds, 300 seconds and 840 seconds of execution time. The input data was a list of URLs for English language articles from major newspapers and news portals. We collected data from ten samples on how many news articles were extracted properly and what percentage was missed (due to exceptions or server-side errors). We also collected some data about the memory footprint for each execution. Then, we calculated an accuracy factor to compare speeds and to take into account how much was missed in each sample.
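A time-boxed extraction loop of this kind can be sketched in a library-agnostic way. In the example below, `extract_one` stands for any function that returns True when an article was extracted properly, and `remaining_ms` is any callable that reports the time left; inside a Lambda handler, the context object's `get_remaining_time_in_millis` method can play that role. The function name and the safety margin are our own assumptions, not part of the benchmark code.

```python
from typing import Callable, Dict, Iterable

SAFETY_MARGIN_MS = 5_000  # assumption: time reserved for returning the result


def run_until_deadline(
    urls: Iterable[str],
    extract_one: Callable[[str], bool],
    remaining_ms: Callable[[], int],
    safety_margin_ms: int = SAFETY_MARGIN_MS,
) -> Dict[str, int]:
    """Process URLs one by one until the time budget is nearly exhausted."""
    extracted = 0
    missed = 0
    for url in urls:
        # Stop before the platform terminates the function.
        if remaining_ms() <= safety_margin_ms:
            break
        try:
            ok = extract_one(url)
        except Exception:
            ok = False  # an exception counts as a missed article
        if ok:
            extracted += 1
        else:
            missed += 1
    return {"extracted": extracted, "missed": missed}
```

In a handler this might be called as `run_until_deadline(urls, my_extractor, context.get_remaining_time_in_millis)`, where `my_extractor` is a hypothetical wrapper around one of the three libraries.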

Fig2 Newspaper3k Statistics
Fig3 Goose3 Statistics
Fig4 news-please Statistics

These tables show that the benchmark is effectively a competition in two categories: extraction speed and accuracy. Newspaper is the clear winner for speed, while Goose3 wins on the percentage of properly extracted articles.

Accuracy Measurements

The ratio of properly extracted articles to total processed URLs was calculated for each library for 60 seconds, 300 seconds and 840 seconds execution time periods. Final results are expressed as accuracy percentage factors in a table below.

Fig5 Accuracy
Fig6 Newspaper3k Hit/Miss Ratio
Fig7 Goose3 Hit/Miss Ratio
Fig8 news-please Hit/Miss Ratio
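The accuracy factor described in this section can be computed with a few lines of Python. The sketch assumes that the total number of processed URLs equals extracted plus missed articles; the function names are our own.

```python
from statistics import mean
from typing import Iterable, Tuple


def accuracy_percent(extracted: int, missed: int) -> float:
    """Share of properly extracted articles among all processed URLs, in percent."""
    total = extracted + missed
    if total == 0:
        return 0.0  # nothing was processed in this sample
    return 100.0 * extracted / total


def average_accuracy(samples: Iterable[Tuple[int, int]]) -> float:
    """Average the accuracy over several (extracted, missed) samples."""
    return mean(accuracy_percent(e, m) for e, m in samples)
```

Each benchmark run contributes one `(extracted, missed)` pair, so the ten samples collected per library and time period reduce to a single average per table cell.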

The single most notable observation to emerge from the collected data is the relationship between the accuracy factor and the execution time configured for the Lambda function. The samples taken for each library follow a visible pattern: for each library, the average accuracy is highest at 300 seconds of execution time. This preliminary analysis may indicate that Python's garbage collection (GC) mechanism does not manage memory very effectively over long-running executions. At the same time, the 60-second runs may be influenced by minor delays in how the Lambda container's operating system allocates CPU time to the Python process.

The precise mechanism of this accuracy/execution time correlation remains to be elucidated. There are several possible explanations. A natural follow-on to our work here would be to analyse potential memory leaks in Newspaper, Goose3 and news-please code. However, it is hard to believe that all three libraries have a memory management issue expressed in such similar behaviour and impact on accuracy.
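One possible starting point for such a follow-on analysis is the standard library's `tracemalloc` module. The sketch below measures how much traced memory grows while a task is repeated; the helper name is our own, and `tracemalloc` only sees allocations made through Python's allocator, so memory held by native extensions may not show up.

```python
import tracemalloc
from typing import Callable


def memory_growth_bytes(task: Callable[[], None], iterations: int) -> int:
    """Return the growth in traced Python memory across repeated calls to task."""
    tracemalloc.start()
    try:
        task()  # warm-up call so that one-off allocations are not counted
        before, _peak = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            task()
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before
```

A task that extracts the same article repeatedly should show little growth if memory is released properly; steady growth over many iterations would point to a leak worth investigating.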

Conclusions

The main goal of our current efforts was to determine how one can use serverless platforms to complete news extraction tasks. The most obvious finding to emerge from this research is that such tasks can be performed in serverless environments with efficiency similar to that of a traditional deployment using isolated Python environments (Virtualenv, Anaconda, etc.). The Newspaper library is the clear winner in speed. Goose3 should also be considered seriously for projects because of its accuracy.


Piotr Orzeszek

Enthusiastic and self-motivated software developer with experience in data science, real-time big data analytics, machine learning and serverless computing.