My go-to tool for data collection is the SelectorLib library. It is an easy-to-use, quick alternative to setting up a scraping solution from scratch. There are many ways to use the library, and I will share my workflow. I encourage anyone interested to also take a look at the documentation on the website, because it provides tutorials and guides that spell things out clearly.
To use this module, you need to install the Python package and the Chrome extension.
pip install selectorlib
You can think of the process of using SelectorLib as applying a filter to HTML output. You are picking all the pieces of HTML you want, and discarding the rest. SelectorLib is used to easily build the filter.
Step 1: Using the Chrome extension to select what you want.
The first thing we need to do is identify the information we want to obtain. For this article, let's assume we want to scrape the BBB complaints for Facebook. Here is the layout of the complaints page:
As we can see, the page is designed using a card layout. Most modern websites are designed in this way, with “cards” that all contain similar information. In the above example, each complaint is contained within its own card. Further down the page, it becomes clear that company responses are also added to the same cards:
The first step is to use SelectorLib to select the outermost card. Right-click on the page, open up developer tools, and hit the double-arrow symbol.
Then click selectorlib in the dropdown menu.
Then we want to create a new template, name it, and click create template again when prompted.
Then click add:
Now you will be presented with a form; you want to name your selections appropriately. I usually name the outermost selection card. After giving your selection a name, click select element next to the type of selection you want to make. There are two options: CSS Selector and XPath. XPath can be very useful if you need to grab those hard-to-reach items, but I tend to use CSS Selectors because they break less often; an XPath can change whenever the layout of the page changes. You also want to select the "Multiple" radio button, because we want to grab all of the complaints on the page.
After clicking select element, elements will light up green as you hover your cursor over the page. Start clicking the cards; each selected card turns red, and you will see all of the information within it in the preview pane. Once you have selected the outermost cards containing all of the information you want to obtain, click save.
Now that we have selected the outermost cards on the page, we want to start selecting the information within each card that we want to grab. After clicking the plus sign, we just follow the same procedure, adding the elements within the card (and naming them appropriately).
After we finish, the page should look something like this:
Now that we have the information we want, we need to get the YAML text by clicking on this icon:
We want to copy this YAML output for use later in our Python script. The YAML is the template that selectorlib will use to select the correct elements on the page.
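To give a sense of what the extension produces, a template for a card layout like this one might look something like the sketch below. The selector strings here are hypothetical placeholders; the extension generates the real ones for you.

```yaml
complaints:
    css: 'div.card'            # outermost card (hypothetical selector)
    multiple: true             # the "Multiple" radio button from earlier
    children:
        complaint_text:
            css: 'p.complaint-text'
            type: Text
        company_response:
            css: 'p.response'
            type: Text
```

Each top-level key becomes a key in the dictionary selectorlib returns, with child selections nested inside it.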
Step 2: Scraping the information in Python.
Now that we have our template, fire up your favorite code editor and import the selectorlib module. You can use Selenium, the requests module, or urllib to grab the page HTML from within your script and extract the information. I like to use Selenium because it lets me handle browser interaction such as pagination or scrolling down to reveal content on a site with infinite scroll. Below is a basic outline of the process; you can wrap other logic around this code to build some very efficient scraping tools very quickly.
I hope you found this tutorial helpful. SelectorLib is my go-to framework for extracting data from a website, and it has worked in most of my scraping use cases. SelectorLib is not limited to text, either: you can also extract links, images, HTML, and attributes with this tool. Just remember to respect the robots.txt file of any website you scrape.
💻 Feel free to check out my website.