Experimental web-scraping using the Google Cloud Platform Vision offering

Flaviu Vadan
Published in Vendasta
Aug 30, 2019
The Google Cloud Platform logo (source)

Web-scraping is a method for extracting information from websites. Approaches to web-scraping [1] range from humans manually visiting a website and copying information, to fully automated collection by web-scrapers.

Web-scrapers are programs that access websites and collect information automatically. An approach sometimes used by web-scrapers is to load websites and save their page sources (raw HTML). Other programs can then attempt to extract information such as names, phone numbers, and addresses by performing pattern matching, or by looking for known ID attributes [2] that point to the information to be saved.

One of the pitfalls of web-scraping by saving page sources is the maintenance of the web-scrapers themselves. For example, a web-scraper might expect to find an element with the ID greeting_id_1. If the website's developers later change the ID to greeting_id_2, the scraper loses its ability to extract the desired information:
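A minimal sketch of that failure mode, assuming a BeautifulSoup-based scraper (the library choice is ours for illustration); the IDs come from the paragraph above:

```python
# A hypothetical BeautifulSoup-based scraper, used only to illustrate the point.
from bs4 import BeautifulSoup

# Page source the scraper was originally written against.
OLD_HTML = '<html><body><div id="greeting_id_1">Hello, world!</div></body></html>'
# The same page after the site's developers renamed the ID.
NEW_HTML = '<html><body><div id="greeting_id_2">Hello, world!</div></body></html>'

def extract_greeting(page_source):
    """Return the text of the element the scraper expects, or None if it is gone."""
    element = BeautifulSoup(page_source, "html.parser").find(id="greeting_id_1")
    return element.get_text() if element is not None else None

print(extract_greeting(OLD_HTML))  # Hello, world!
print(extract_greeting(NEW_HTML))  # None -- the scraper silently breaks
```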

This is a simple example, but it illustrates a common problem that breaks web-scrapers and causes a lot of developer toil.

Another potential approach to web-scraping is to use computer vision algorithms to extract information from images, the way humans would. This approach falls within the realm of object detection, a central problem in computer vision that has witnessed important progress in the last decade. With these advances, multiple commercial offerings for object detection are now accessible, one of them being Google's Vision [3]. The GCP Object Detection API is still in beta, but it showed promising results the first time we applied it to a web-scraping experiment.

The web-scraping experiment consisted of:

  1. Creating a set of images of websites (screenshots) from which we would extract information;
  2. Submitting a data labeling job using GCP's Data Labeling offering [4];
  3. Training an Object Detection model;
  4. Performing predictions via API;
  5. Cutting out the pieces of the screenshots where bounding boxes are predicted;
  6. Extracting the desired information from the smaller, cropped images.

Creating a set of images

To create a set of images for labeling and training, we used Puppeteer [5] to access multiple websites and take screenshots at a set resolution of 1280 x 960:
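The script itself used Puppeteer in Node; the following is a rough Python equivalent using pyppeteer, a Python port of Puppeteer, with placeholder URLs and output paths:

```python
# A rough Python equivalent of the Puppeteer screenshot step, using pyppeteer
# (a Python port of Puppeteer). URLs and output paths are placeholders.
import asyncio
import os

from pyppeteer import launch

URLS = ["https://example.com"]  # placeholder list of websites to screenshot

async def capture(urls):
    os.makedirs("screenshots", exist_ok=True)
    browser = await launch()
    page = await browser.newPage()
    # Fix the viewport so every screenshot has the same 1280 x 960 resolution.
    await page.setViewport({"width": 1280, "height": 960})
    for i, url in enumerate(urls):
        await page.goto(url, {"waitUntil": "networkidle2"})
        await page.screenshot({"path": f"screenshots/{i}.png"})
    await browser.close()

asyncio.get_event_loop().run_until_complete(capture(URLS))
```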

Submitting a data labeling job

This step required following Google’s documentation on how to structure images in cloud buckets, creating instructions for labelers, and submitting a data labeling request.
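As a sketch of the bucket-structuring part of that step, the snippet below uploads the screenshots to Cloud Storage and writes out a CSV of their gs:// URIs; the bucket name is a placeholder, and the exact CSV layout expected by the labeling request is the one described in Google's documentation:

```python
# Upload the screenshots to a Cloud Storage bucket and list their gs:// URIs.
# The bucket name is a placeholder; the CSV consumed by the labeling request
# must follow the format described in Google's documentation.
import csv
import glob
import os

from google.cloud import storage

BUCKET_NAME = "my-screenshot-bucket"  # placeholder

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

uris = []
for path in glob.glob("screenshots/*.png"):
    blob = bucket.blob(f"screenshots/{os.path.basename(path)}")
    blob.upload_from_filename(path)
    uris.append(f"gs://{BUCKET_NAME}/{blob.name}")

with open("images.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    for uri in uris:
        writer.writerow([uri])
```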

Training an object detection model

Once we had built and structured the training set according to Google's documentation, we created a dataset from all the images we collected. With the dataset in place, we were a few clicks away from training and deploying the model.
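We did this through the Cloud Console, but for reference, the rough equivalent through the beta AutoML client library looks something like the sketch below; the project ID, dataset name, and CSV path are placeholders, and the client surface may have changed since the beta:

```python
# Rough equivalent of the console steps via the beta AutoML client (we used the
# UI). Project ID, dataset name, and CSV path are placeholders; the client
# surface may have changed since the beta.
from google.cloud import automl_v1beta1 as automl

PROJECT_ID = "my-project"  # placeholder

client = automl.AutoMlClient()
parent = client.location_path(PROJECT_ID, "us-central1")

# Create an empty image object detection dataset.
dataset = client.create_dataset(parent, {
    "display_name": "website_screenshots",
    "image_object_detection_dataset_metadata": {},
})

# Import the labeled screenshots listed in a CSV stored in Cloud Storage;
# this returns a long-running operation that we block on.
import_config = {"gcs_source": {"input_uris": ["gs://my-screenshot-bucket/labels.csv"]}}
client.import_data(dataset.name, import_config).result()
```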

Performing predictions via API

There are two options for performing predictions once a model is trained: the Python client library, or curl. We performed the experiment via curl with the following script:

Notice the calls to extract.py and tesseract: the first is a helper script linked in this blog, while the second is a CLI for optical character recognition. For completeness, here is Google's example of performing predictions with the Python library:
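A minimal sketch along the lines of Google's documented example, using the automl_v1beta1 prediction client; the project and model IDs are placeholders:

```python
# Request predictions from the deployed model. Project and model IDs are
# placeholders; the response structure follows the beta AutoML Vision API.
from google.cloud import automl_v1beta1 as automl

PROJECT_ID = "my-project"   # placeholder
MODEL_ID = "IOD1234567890"  # placeholder

client = automl.PredictionServiceClient()
model_name = client.model_path(PROJECT_ID, "us-central1", MODEL_ID)

with open("screenshot.png", "rb") as image_file:
    payload = {"image": {"image_bytes": image_file.read()}}

response = client.predict(model_name, payload)

# Each annotation carries a label (e.g. "address") plus a bounding box whose
# vertices are normalized to [0, 1] relative to the image dimensions.
for annotation in response.payload:
    detection = annotation.image_object_detection
    vertices = [(v.x, v.y) for v in detection.bounding_box.normalized_vertices]
    print(annotation.display_name, detection.score, vertices)
```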

Cutting pieces of the original screenshot

PIL (the Python Imaging Library) [6] exposes numerous image-processing functions; we used it to cut out the boxes suggested via the API with the following script:
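A minimal sketch of that cropping step, assuming the normalized bounding-box vertices returned by the prediction call above; file names are placeholders:

```python
# Crop a predicted region out of the original screenshot using PIL (Pillow).
# The normalized vertices are assumed to come from the prediction call above.
from PIL import Image

def crop_box(screenshot_path, normalized_vertices, output_path):
    """Cut the region described by normalized [0, 1] vertices out of a screenshot."""
    image = Image.open(screenshot_path)
    width, height = image.size
    xs = [v.x * width for v in normalized_vertices]
    ys = [v.y * height for v in normalized_vertices]
    # PIL's crop takes a (left, upper, right, lower) pixel box.
    image.crop((int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys)))).save(output_path)

# Example (hypothetical): crop the first predicted box from the response above.
# crop_box("screenshot.png",
#          response.payload[0].image_object_detection.bounding_box.normalized_vertices,
#          "address.png")
```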

Extracting desired information

This part was performed using PyTesseract [7], a Python wrapper around the Tesseract OCR engine.
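A minimal sketch, assuming the cropped images produced above; the DPI and page-segmentation options are the kind of knobs discussed under the disadvantages below:

```python
# OCR one of the cropped regions with PyTesseract. The --dpi and --psm options
# are passed straight to the tesseract CLI; output quality is sensitive to them.
import pytesseract
from PIL import Image

text = pytesseract.image_to_string(
    Image.open("address.png"),      # placeholder path to a cropped region
    config="--dpi 300 --psm 6",
)
print(text.strip())
```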

Results

After training the model on approximately 4.3k screenshots, the GCP dashboard showed the following prediction performance at an intersection over union (IoU) [8] threshold of 0.8:

Given the significant differences between the screenshots we trained the model on, the numbers looked promising! Here, precision is the fraction of predicted boxes that are actually relevant, TP / (TP + FP), so it reflects the model's ability to avoid false positives; recall is the fraction of relevant, ground-truth boxes that the model manages to find, TP / (TP + FN), so it reflects the model's ability to avoid false negatives.

Based on a Vision API prediction, we extracted crops of the screenshot illustrated above containing the business name, the address, the phone number, and the opening hours of the business on the day the screenshot was taken.

Finally, the output from the bash script linked above is:

Advantages of this approach

  1. In some instances, using an offering such as GCP Vision is better than building something from scratch. Building a complex network requires expertise not only in software development but also some background in statistics to justify the choices made when implementing the model (e.g., cost function, filter size, network size). Abstracting those choices away is an advantage, because it opens the door for many users who have ideas for interesting products, projects, and experiments;
  2. The GCP Vision offering is an accessible solution that only requires curated and structured data. Other than providing the required images and instruction sets (e.g., consistent sizes, well-drawn boxes, documentation on edge cases), there is not much to training a model. In short, GCP Vision is easy to use and offers the potential to build interesting things on top of it;
  3. The focus shifts from maintaining web-scrapers to building infrastructure that extracts features from images and cleans the text returned by PyTesseract (in this case). This may be more intuitive and easier to maintain, because images are easier for humans to reason about than patterns hunted down in page sources.

Disadvantages of this approach

  1. Currently, there are a lot of hidden variables when using the Vision beta offering, and training a model does not offer much control over how training occurs. For example, under the arguably safe assumption that it uses a variation of a convolutional neural network, we do not know the number of layers, the types of layers, the sizes of the applied filters, the cost function, and so on. Knowing these would make it easier to justify why a model trained by Google's Vision offering is a better choice than something built on a personal machine (e.g., YOLO v3 [9]);
  2. PyTesseract's performance seems to be highly dependent on the chosen pixels-per-inch setting (PPI/DPI) and the quality of the images it performs OCR on. The result illustrated above is a "good result"; in other instances, changing the tesseract PPI produced a jumble of characters instead of a clean line representing a business's opening hours (e.g., Sat 10:00 am to 6:00 pm vs. Sat 10);
  3. The maintenance burden shifts from fragile web-scrapers to implementing and maintaining a "screenshot-taker" program. Taking consistently good screenshots requires logic for cookie-consent pop-ups, expandable sections that hide desirable information (e.g., phone numbers behind a "Show Phone Number" link), ads (some ads show phone numbers, which can cause false positives, though the ads can be closed), and so on;
  4. Preparing a good training dataset takes time, which can delay time to delivered value. The preparation includes data collection, curation, labeling, and creating labeling tasks for a labeling service. All of these can delay the time to value of a feature backed by a screenshot-based web-scraper, yet the dataset is possibly the single most important contributor to the success of this approach.
