Scrape nested contents using Ruby and Mechanize

Han Lee
Jul 23, 2017 · 3 min read

When you scrape a website, you sometimes need to click a link to reach the information you want to extract. Ruby and Nokogiri are useful tools for scraping a web page. However, Nokogiri alone cannot submit a form or click a link to take you to the page that contains the information you need.

For example, chemists check the Sigma-Aldrich website to find the properties of a chemical. Let’s say you want to find the molecular weight, boiling point and density of toluene. First, you go to the Sigma-Aldrich home page ( http://www.sigmaaldrich.com/ ) and type toluene in the search box, which leads you to a page with the long URL http://www.sigmaaldrich.com/catalog/search?term=toluene&interface=All&N=0&mode=match%20partialmax&lang=en&region=US&focus=product.
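The search URL is just a base address plus a query string, so a script can build it from any chemical name. Here is a minimal sketch using Ruby's standard uri library. The helper name search_url is hypothetical, and the parameter values are copied from the URL quoted in this article rather than from any documented Sigma-Aldrich API.

```ruby
require 'uri'

# Base address of the search page, taken from the URL quoted in the article.
SEARCH_BASE = 'http://www.sigmaaldrich.com/catalog/search'

# Hypothetical helper: build the search URL for a chemical name.
# Parameter names and values are copied from the example URL.
def search_url(chemical)
  params = {
    'term'      => chemical,
    'interface' => 'All',
    'N'         => '0',
    'mode'      => 'match partialmax',
    'lang'      => 'en',
    'region'    => 'US',
    'focus'     => 'product'
  }
  "#{SEARCH_BASE}?#{URI.encode_www_form(params)}"
end

puts search_url('toluene')
```

Note that URI.encode_www_form writes the space in the mode value as + rather than %20. Both forms decode to a space in a query string, so either one should lead to the same search results page.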

However, this page only gives you the formula, molecular weight and CAS number of toluene. You need to click one of the product links (e.g. 24511) and move to the product page to get the boiling point, melting point and density of toluene.

Mechanize is a Ruby gem that lets you interact with websites from your Ruby code, so you can follow links and submit forms. Let’s write a Ruby app that accepts the name of a chemical and prints its properties by extracting information from the Sigma-Aldrich website, which is one of the most extensive and reliable resources for finding properties of chemicals.

First, you need to have Ruby installed on your computer. Currently, I have Ruby version 2.3.0. Next, install Mechanize by typing gem install mechanize in your CLI (command line interface).
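If you want to confirm the setup before going further, a short script along these lines reports the Ruby version and whether the gem can be loaded. The method name mechanize_available? is made up for this sketch.

```ruby
# Hypothetical helper: true if the mechanize gem can be loaded.
def mechanize_available?
  require 'mechanize'
  true
rescue LoadError
  false
end

puts "Ruby #{RUBY_VERSION}"
if mechanize_available?
  puts "Mechanize #{Mechanize::VERSION} is installed"
else
  puts 'Mechanize is missing; run: gem install mechanize'
end
```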

Next, we will create a file called “scrape_chem.rb”.

scrape_chem.rb

Here is a description of the code above:

  • require 'mechanize' loads the Mechanize library into the script.
  • agent = Mechanize.new instantiates a new Mechanize agent.
  • page = agent.get("http://www.sigmaaldrich.com/…") fetches the page at that URL and returns it as a page object.
  • name = page.at("a .name").text does two things: 1) the #at method takes a CSS selector and returns the first matching element, and 2) the #text method returns the text inside that element.

When you run ruby scrape_chem.rb in your CLI:

CLI

Now you have the name of the chemical, which is not that exciting. Let’s try to get the formula and molecular weight of the chemical.

scrape_chem.rb

In line 15, the #css method is used instead of the #at method. The #at method returns the first matching Node, while #css or #search returns a NodeSet, which behaves much like an array of Nodes.

To collect the melting point, boiling point and density of toluene, we need to click the first product link to navigate to the product page. This can be done with Mechanize. More instructions regarding Mechanize can be found at docs.seattlerb.org.

scrape_chem.rb

In line 29, the #link_with method finds the first link whose href matches the given criterion, and the #click method follows that link and returns the product page.

Let’s try the app. In your CLI, type ruby scrape_chem.rb, and when the app prompts you, type toluene.

CLI

You can see that all the properties are extracted successfully. You can check the Ruby file on my GitHub, and join me in developing this app. There are still many issues with this app, and I will continue to work on those in the future. Happy coding!
