Efficient Selection of DOM Elements for Data Extraction

Penn Wu
Aug 18, 2017

The traditional March Madness bracket pool system lacks interaction or direct head-to-head competition. So for March Madness this year, we used a Calcutta Auction to auction individual teams to players and distribute winnings from the pool based on how well each team did. Each matchup was exciting and personal between the owners of the teams competing. It was a hit.

Maintaining it was a pain, though. We put together a Google Sheet to automate payout calculation, but we had to manually enter the results of each game, which was slow and tedious. We looked into hooking the sheet up to a sports API, but the available ones are either closed (e.g. ESPN) or well beyond the budget of our $5 pool.

Another option for getting programmatic access to the data was to create our own server, which would scrape game results from a sports site and serve them as an API that the sheet could access. We would have to set up infrastructure, build server-side routing and develop a tested, reliable algorithm for data extraction. There are a lot of resources out there now that make this process easier, but each has limits. NodeJS has libraries that abstract away the scraping algorithms, but the infrastructure and the creation of the scrape definitions are still on you. Open source tools give you an interface for setting up scrapes, but each scrape runs once and has to be triggered manually by the user. Commercial software provides end-to-end solutions, but you’re limited to how and when the vendor provides data, and it costs money.

So we built our own server.

Least-Most Specific (LMS) Algorithm

Finding reliable CSS Selectors for the HTML elements you want to scrape is critical to obtaining good data. During implementation of our web scraping server, we went through many iterations of our element selection algorithm to accurately pinpoint the elements we selected.

While it’s easy to identify elements by their id or class list, many elements lack an id or share classes with other elements in the DOM. In addition, popular websites like Reddit add ids and pseudo-id classes to their HTML to prevent scraping.

To make your scraper versatile, it’s best to use a combination of element tags and nth-child positions to consistently grab elements. Luckily, you won’t have to do this from scratch! In the following example, we will walk through the Least-Most Specific Algorithm, a performant CSS selector algorithm for identifying elements on webpages.

Scraping from Reddit

Let’s say you are looking to build a web scraper for Reddit. To be specific, you are looking to get titles of all posts on the Reddit Homepage. Your initial thought is to use the id and classes to grab the titles.

That is, until you run into ids like thing_t3_6uc76y, which Reddit uses to prevent you from extracting titles from their website.

Your next idea is to get the entire DOM Path of the element.

The problem with extracting the entire DOM Path is that it includes redundant elements, such as the html and body elements. In most scenarios, only a fraction of the DOM Path is needed to find the unique element.
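To make the redundancy concrete, here is a minimal sketch of building a full DOM Path out of tags and nth-child positions. It is an illustration only, not LiveAPI's actual code: the helpers `el`, `step` and `fullPath` are hypothetical names, and a tree of plain objects stands in for the real DOM so the snippet runs under Node without a browser.

```javascript
// Minimal stand-in for a DOM element (assumption: plain objects, not real DOM nodes).
function el(tag, ...children) {
  const node = { tag, parent: null, children };
  for (const child of children) child.parent = node;
  return node;
}

// One selector step: the tag plus the node's 1-based position among its siblings.
function step(node) {
  if (!node.parent) return node.tag; // the root has no siblings to count
  const index = node.parent.children.indexOf(node) + 1;
  return `${node.tag}:nth-child(${index})`;
}

// Walk from the node up to the root, prepending one step per ancestor.
function fullPath(node) {
  const steps = [];
  for (let cur = node; cur; cur = cur.parent) steps.unshift(step(cur));
  return steps.join(" > ");
}

// A tiny hypothetical page: html > (head, body > div > p > a).
const title = el("a");
el("html", el("head"), el("body", el("div", el("p", title))));

console.log(fullPath(title));
// html > body:nth-child(2) > div:nth-child(1) > p:nth-child(1) > a:nth-child(1)
```

Even on this tiny page the path carries the html and body steps, which add nothing to identifying the link.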

Instead of traversing down from html, we can traverse up the DOM Path from the current element. Before moving up to the next parent, we check whether the relative path built so far already matches a unique element.
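The bottom-up traversal can be sketched as follows. This is a hedged illustration under stated assumptions, not the LiveAPI implementation: a plain-object tree stands in for the DOM, `countMatches` stands in for checking `document.querySelectorAll(selector).length` in a browser, and every function name here is hypothetical.

```javascript
// Minimal stand-in for a DOM element (assumption: plain objects, not real DOM nodes).
function el(tag, ...children) {
  const node = { tag, parent: null, children };
  for (const child of children) child.parent = node;
  return node;
}

// One selector step: the tag plus the node's 1-based position among its siblings.
function step(node) {
  if (!node.parent) return node.tag;
  return `${node.tag}:nth-child(${node.parent.children.indexOf(node) + 1})`;
}

// Every node in the tree rooted at `root`, in document order.
function allNodes(root) {
  return [root, ...root.children.flatMap((child) => allNodes(child))];
}

// Relative path made of the node's own step plus up to `depth - 1` ancestor steps.
function relativePath(node, depth) {
  const steps = [];
  for (let cur = node; cur && steps.length < depth; cur = cur.parent) {
    steps.unshift(step(cur));
  }
  return steps.join(" > ");
}

// How many nodes the relative selector matches (stand-in for querySelectorAll).
function countMatches(root, selector, depth) {
  return allNodes(root).filter((n) => relativePath(n, depth) === selector).length;
}

// Climb one parent at a time and stop as soon as the relative path is unique.
// The loop always ends: the path all the way up to the root matches one node only.
function leastSpecificSelector(root, target) {
  let depth = 1;
  let selector = relativePath(target, depth);
  while (countMatches(root, selector, depth) > 1) {
    depth += 1;
    selector = relativePath(target, depth);
  }
  return { selector, depth };
}

// Hypothetical page with three posts, each shaped like div > p > a.
const posts = [1, 2, 3].map(() => el("div", el("p", el("a"))));
const root = el("html", el("head"), el("body", ...posts));
const target = posts[1].children[0].children[0]; // title link of the second post

console.log(leastSpecificSelector(root, target).selector);
// div:nth-child(2) > p:nth-child(1) > a:nth-child(1)
```

In this example the search stops after three steps and never visits body or html.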

This approach results in the least specific selectors needed to get our post. Compared to the entire DOM Path shown above, we reduce traversing by 7 iterations.

But wait, that’s not all! We can use this same path to get the most specific selector that still matches all the elements related to our selection. This is done by removing the last traversal.
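As a rough sketch of that last step, the snippet below drops the step added by the final upward traversal (the leftmost one) and checks which paths the shortened selector still matches. The helper names, the example selector and the list of paths are all hypothetical, and a simple suffix comparison stands in for real selector matching in a browser.

```javascript
// Remove the step added by the last upward traversal (the leftmost step).
function dropLastTraversal(selector) {
  const steps = selector.split(" > ");
  return steps.length > 1 ? steps.slice(1).join(" > ") : selector;
}

// A path matches when it equals the selector or ends with it on a step boundary.
function pathMatches(path, selector) {
  return path === selector || path.endsWith(" > " + selector);
}

// Hypothetical full paths for three post titles plus one unrelated paragraph.
const paths = [
  "html > body:nth-child(2) > div:nth-child(1) > p:nth-child(1) > a:nth-child(1)",
  "html > body:nth-child(2) > div:nth-child(2) > p:nth-child(1) > a:nth-child(1)",
  "html > body:nth-child(2) > div:nth-child(3) > p:nth-child(1) > a:nth-child(1)",
  "html > body:nth-child(2) > div:nth-child(1) > p:nth-child(1)",
];

// Selector that uniquely identifies the second title (assumed to be found earlier).
const unique = "div:nth-child(2) > p:nth-child(1) > a:nth-child(1)";
const shared = dropLastTraversal(unique);
const related = paths.filter((path) => pathMatches(path, shared));

console.log(shared);         // p:nth-child(1) > a:nth-child(1)
console.log(related.length); // 3
```

Dropping the div step removes the only part of the selector that tied it to one post, so all three titles match while the unrelated paragraph does not.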

By traversing the DOM from the selected child up toward html, we get a performant algorithm that reduces the number of traversals needed to find CSS selectors for our selected element and its related elements.

In addition, we avoid the pain of dealing with websites that add layers of ids and pseudo-id classes to prevent web scraping.

LiveAPI

Data extraction is a complicated and labor-intensive process with a high barrier to entry for developers. Some paid solutions tackle the challenge of simplifying data extraction and serving the data, but for anyone who doesn’t have the money to spend on expensive API access or the time and technical skill to extract data manually, there are not many free, open source alternatives.

To solve this, we built LiveAPI, a simple, end-to-end, open-source data extraction solution. With a short bash command and a few clicks you can serve your own APIs locally or on an AWS server. To get started, set up the server, install the Chrome Extension, browse to a website and design your own JSON data to serve. Teams can even configure their server(s) with multiple users who are authorized to design endpoints. Nested detail scraping and pagination navigation are coming soon to the intuitive UI, which guides users through the endpoint creation process directly, and unobtrusively, on the target URL page.

We’re Brett Beekley, Melissa Schwartz and Penn Wu. For more information on LiveAPI, please check us out on GitHub. We appreciate your suggestions, issues and pull requests.

Netscape

A community dedicated to those who use JavaScript every day.