In this post we are going to talk about what web scraping is, its uses and when you should choose another option. We are also going to go over some applications of this technique.
What is web scraping?
Web scraping is a technique for extracting information from websites by using software to automate the process. These programs usually make HTTP requests directly from code or simulate human behavior by embedding a browser inside the application. A well-known example is Googlebot, Google’s web crawler, which gathers information from across the web in order to classify and rank sites for its search engine.
Let's scrape some data!
In the following example, we are going to extract data from a well-known online marketplace that we could later use to feed a machine learning algorithm.
Before writing code
Before we start coding our scraper, it’s necessary to understand the page that we are going to be working on, its flow and the elements that our code is going to interact with.
Let’s say that our machine learning algorithm should predict a product’s category from its title alone. In order to train it, we are going to need lots of product titles and their corresponding categories.
- From here we can infer that our scraper will need to write a term in the search bar (red circle) and then press the search button (green circle).
- This will take us to a view where we can see the categories that our search was related to (red circle) and the product title for each of the found results (green line). We will need our scraper to extract that information.
- Only one page of product titles wouldn’t be enough data to feed a robust AI algorithm, so our scraper needs to find the pagination buttons (red line), visit page by page and extract all the titles until the next button (green line) is no longer visible (that means that we are already on the last page).
- Download Google’s ChromeDriver. This is the executable that lets our script control the Chrome browser.
- After downloading it, move the executable file to the /bin folder:
$ sudo mv chromedriver /bin
This way, you won’t need to specify the driver’s path in your code later.
- Install the Selenium web driver:
$ gem install selenium-webdriver
- Install interactor (we will use this gem to keep our code nice and tidy):
$ gem install interactor
- Or, if you use Bundler as I will in this example, simply create a Gemfile listing both gems and run:
$ bundle install
We will start by defining our main class. It will contain our configuration, such as what browser we will use, what website we will visit, what we will search for and for how long we will wait until we consider something went wrong.
We will then call the class that orchestrates our scraper’s logic.
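As a rough sketch, the configuration and entry point described above might look like the following. The names `ScraperConfig` and `Scraper`, the placeholder URL and the search term are all assumptions made for illustration, not the original code. The organizer is passed in so that the snippet runs on its own.

```ruby
# Hypothetical settings holder; every field name and value is an assumption.
ScraperConfig = Struct.new(:browser, :base_url, :search_term, :timeout, keyword_init: true)

class Scraper
  DEFAULT_CONFIG = ScraperConfig.new(
    browser: :chrome,                     # which browser Selenium should drive
    base_url: 'https://www.example.com',  # placeholder for the marketplace URL
    search_term: 'guitar',                # what we type into the search bar
    timeout: 10                           # seconds to wait before giving up
  ).freeze

  attr_reader :config

  # organizer: any object responding to #call(config), e.g. ScrapingOrganizer.
  def initialize(organizer, config = DEFAULT_CONFIG)
    @organizer = organizer
    @config = config
  end

  # Hands the configuration over to the class that orchestrates the steps.
  def run
    @organizer.call(@config)
  end
end
```

With the real gem, the browser itself would typically be started with `Selenium::WebDriver.for(config.browser)` once `selenium-webdriver` has been required.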
ScrapingOrganizer will only be in charge of calling three other classes in order. First, it will search for an item, then find the categories related to it and then start getting the titles of all the items found. It will repeat this last action page by page.
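The orchestration can be sketched in plain Ruby as below. This is a stand-in that mimics the pattern without depending on the interactor gem (with the gem, the same idea is expressed by including `Interactor::Organizer` and declaring `organize SearchItem, GetCategories, GetTitles`). The lambdas, the hash used as a shared context and the sample values are assumptions for illustration only.

```ruby
class ScrapingOrganizer
  # steps: objects responding to #call(context), run in the order given.
  def initialize(*steps)
    @steps = steps
  end

  # Passes one shared context through every step and returns it.
  def call(context = {})
    @steps.each { |step| step.call(context) }
    context
  end
end

# Stand-in steps with made-up values; the real ones would drive the browser.
search_item    = ->(ctx) { ctx[:searched] = ctx[:search_term] }
get_categories = ->(ctx) { ctx[:categories] = ['Musical Instruments'] }
get_titles     = ->(ctx) { ctx[:titles] = ["#{ctx[:searched]} #1", "#{ctx[:searched]} #2"] }

organizer = ScrapingOrganizer.new(search_item, get_categories, get_titles)
result = organizer.call({ search_term: 'guitar' })
```

Keeping the organizer free of scraping details means each step can be replaced or tested separately.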
Now let’s see how each of the three steps described before works:
SearchItem, as its name says, will:
- Find the search bar.
- Write the configured search term into the search bar.
- Find the submit button.
- Click the submit button.
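The four steps above can be sketched as follows. The CSS selectors are hypothetical and would need to match the real page. The `driver` argument is assumed to be a Selenium driver (for example one created with `Selenium::WebDriver.for :chrome`), although any object offering `find_element` works, which keeps the snippet runnable without a browser.

```ruby
class SearchItem
  # Hypothetical selectors; inspect the target page to find the real ones.
  SEARCH_BAR_SELECTOR    = 'input[name="q"]'
  SUBMIT_BUTTON_SELECTOR = 'button[type="submit"]'

  def initialize(driver)
    @driver = driver
  end

  def call(search_term)
    # 1. Find the search bar.
    search_bar = @driver.find_element(css: SEARCH_BAR_SELECTOR)
    # 2. Write the configured search term into it.
    search_bar.send_keys(search_term)
    # 3. Find the submit button.
    submit_button = @driver.find_element(css: SUBMIT_BUTTON_SELECTOR)
    # 4. Click it.
    submit_button.click
  end
end
```

On a slow page, a real scraper would probably wrap the lookups in `Selenium::WebDriver::Wait.new(timeout: ...).until { ... }` so that they are retried until the elements appear.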
GetCategories will parse the page shown after the search, look for the category breadcrumbs, and extract and print their text.
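A minimal sketch of that step, assuming the breadcrumb entries are links inside an element with a `breadcrumb` class (a hypothetical selector), could look like this. The `output:` parameter is an addition made here so that the printing can be redirected.

```ruby
class GetCategories
  # Hypothetical selector for the breadcrumb links.
  BREADCRUMB_SELECTOR = '.breadcrumb a'

  def initialize(driver, output: $stdout)
    @driver = driver
    @output = output
  end

  # Returns the category names and prints one per line.
  def call
    categories = @driver.find_elements(css: BREADCRUMB_SELECTOR)
                        .map { |element| element.text.strip }
                        .reject(&:empty?)
    categories.each { |category| @output.puts(category) }
    categories
  end
end
```

`find_elements` returns an empty array when nothing matches, so a page without breadcrumbs simply yields no categories instead of raising an error.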
Now that we have the categories, GetTitles extracts all of the titles from the current page. When it is done, it moves on to the next page, and repeats until there are no more pages to visit.
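The pagination loop might be sketched as below. Both selectors are hypothetical, and the `max_pages` cap is an extra safety net added here rather than something from the original code.

```ruby
class GetTitles
  # Hypothetical selectors for a result title and the "next page" button.
  TITLE_SELECTOR = '.item-title'
  NEXT_SELECTOR  = '.pagination-next'

  def initialize(driver, max_pages: 50)
    @driver = driver
    @max_pages = max_pages # assumption: guards against endless pagination
  end

  # Collects the titles of every page until no next button is visible.
  def call
    titles = []
    @max_pages.times do
      titles.concat(@driver.find_elements(css: TITLE_SELECTOR).map(&:text))

      next_button = @driver.find_elements(css: NEXT_SELECTOR).find(&:displayed?)
      break unless next_button # last page reached

      # A real scraper would also wait here for the next page to load.
      next_button.click
    end
    titles
  end
end
```

Using `find_elements` for the next button avoids the exception that `find_element` raises when the element is missing on the last page.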
This is how our program would look while running:
You can find the complete code at:
When is web scraping useful?
- Web scraping is extremely useful when you don’t have access to an API because it gives you the keys to all the publicly accessible data on the internet. This is especially important these days, when we are constantly improving and implementing algorithms that use data as their raw material.
- If you have an API available, always choose it over scraping because a) it’s probably going to be easier to consume than writing a web scraping script, and b) web scraping is very susceptible to structural changes in the website you are working on. This means that a change in the site’s HTML structure could break your program completely.
- Any tedious job in a website requiring human labor can potentially be replaced by an automation script.
- Quality assurance checks can be automated by using Selenium’s actions to verify whether styles, properties or HTML elements change, hold a particular value or satisfy a comparison before or after an event in the DOM.
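As an illustration of the quality assurance case, the helper below compares a CSS property before and after an action. The helper name `css_changes_after?` and its parameters are hypothetical; `find_element` and `css_value` are the Selenium calls it relies on.

```ruby
# Returns true when the given CSS property of the element matched by
# `selector` has a different value after the block has run.
def css_changes_after?(driver, selector:, property:)
  element = driver.find_element(css: selector)
  before = element.css_value(property)

  yield element # the event under test, e.g. a click

  # Look the element up again in case the DOM node was replaced.
  after = driver.find_element(css: selector).css_value(property)
  before != after
end
```

A check could then read `css_changes_after?(driver, selector: '.buy-button', property: 'color') { |button| button.click }`, where the selector is again only an example.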
In this post we have learned how to get data out of a website, even a single-page application, by using Selenium WebDriver together with Ruby. We also went through some of the most popular uses of this technique and reviewed some cheat sheets for the most commonly used selectors and actions.