Beginner’s Guide to Website Scraping with Mechanize Ruby Gem

Katana Tran
5 min read · Sep 10, 2019

Ever looked for an API for your application, only to find that none of them has quite the data you hoped for? After running into this problem countless times, I decided it was time to add website scraping to my toolkit.

Notes

  • I’ll be using Mechanize (Ruby gem) which relies on Nokogiri (Ruby gem) as its default HTML parser.
  • I am using Merriam-Webster’s ‘Word of the Day’ for demonstration purposes.
  • The pros of scraping include a continual supply of new data (depending on the website you are scraping), being able to choose your data, and being able to format your data.
  • However, the main con of web scraping is that if the layout of the page you’re scraping changes, your code may stop working correctly. Web scraping may also be against the terms of use of some websites.

Let’s get Started!

  1. First, create a new folder with a Gemfile. In the Gemfile, include gem 'mechanize' and run bundle in the terminal to install it and create a Gemfile.lock. Then create a new Ruby file and add require 'mechanize' at the top of the file.
  2. In that file, add mechanize = Mechanize.new and page = mechanize.get('https://www.merriam-webster.com/word-of-the-day'). The page comes back as a Mechanize::Page object, which we can then call methods on.
  3. Open https://www.merriam-webster.com/word-of-the-day in your browser and open the inspector in the web developer tools (Mac: cmd + option + j). Using the box-with-a-cursor icon in the upper left of the developer tools panel, click the elements you want in order to find them in the HTML. Then write code to select each element of the page:
  1. Getting the title: puts page.title prints the page title “Word of the Day: Pell-mell | Merriam-Webster” to the terminal. We can limit this to “Word of the Day: Pell-mell” by slicing the string: puts page.title[0..-19] cuts off the 18-character suffix “ | Merriam-Webster”.
  2. Searching for a specific word: page.search works much like querySelectorAll in JavaScript and lets us grab the elements we want. In this case I am grabbing the word highlighted in the inspector: word_of_the_day = page.search('div.word-and-pronunciation h1').text.strip. Calling .text removes the tags from the search results. (The image below shows the output without calling .text.)
  3. Search and receive array results: I am grabbing the definition of the word and the ‘Did You Know?’ section with word_descriptions = page.search('div.wod-definition-container p')[0..-4]. The search comes back as an array-like Nokogiri NodeSet with extra items at the end, so I used [0..-4] to discard the last 3 items, as they were random tidbits of information.
  4. Search for the first matching result: .at selects the first node that matches the selector. date = page.at('article div span').text.strip gives me the date shown above the word (September 10, 2019).
If you forget to call .text, the tags will show in the output.

To crawl the Website and click links:

I wanted to add more functionality to my program. Instead of just scraping today’s ‘Word of the Day’, I can use Mechanize to click links on the page and then scrape the pages they lead to. I use this in my program to scrape past ‘Word of the Day’ pages by ‘clicking’ the back arrow link.

I used link = page.link_with(text: 'Prev') to select the link I want to click, querying for a link whose text is ‘Prev’. I found this by using my inspector tools and highlighting the back arrow next to the current word of the day. To tell my program to actually click the link I used page = link.click. After this I called my function create_word_of_day again with create_word_of_day(page.uri) to apply the same formatting to the new page. Here page.uri is the URL of the page reached by clicking the Prev link.
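
As a rough sketch of the click-through step, the helper below follows the ‘Prev’ link a fixed number of times and collects the URI of each page it lands on. previous_uris and its limit argument are names made up for this example; only link_with, click, and uri come from Mechanize. Since link_with returns nil when nothing matches, the loop stops quietly instead of raising if the link text ever changes.

```ruby
# Hypothetical helper: walks backwards through 'Prev' links.
# `page` is expected to behave like a Mechanize::Page, i.e. respond to
# link_with(text: ...) and uri; the link it returns must respond to click.
def previous_uris(page, limit: 3)
  uris = []
  limit.times do
    link = page.link_with(text: 'Prev')
    break if link.nil? # no 'Prev' link on this page: stop quietly

    page = link.click  # "clicking" returns the next page object
    uris << page.uri.to_s
  end
  uris
end
```

With a live page this could be called as previous_uris(Mechanize.new.get('https://www.merriam-webster.com/word-of-the-day'), limit: 3). Keeping limit small avoids sending the site more requests than you need.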

Mechanize also supports filling out forms for you. For details, refer to the Mechanize gem documentation listed under Further Resources at the bottom of this post!

End Result

To show extra functionality, I formatted the results and added or removed words to build my own ‘Word of the Day’ function. In the code, I call create_word_of_day from inside the method itself, which makes it recursive. As is, the function will continue to output a new ‘Word of the Day’ to the terminal until stopped with ctrl + c (Mac)! You can see how easily this information can then be used in your own website or app without any hardcoding. Happy scraping!

Final code (left) to obtain output (right)
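
The screenshot of the final code is not reproduced here. As a stand-in, the sketch below shows one way the recursive idea could look, with a counter so that it stops by itself rather than needing ctrl + c. It is not the author's code: the remaining and out parameters are assumptions added for illustration, the output format is simplified, and it takes a page object rather than page.uri.

```ruby
# Hypothetical recursive version. `page` must behave like a Mechanize::Page
# (search, at, link_with); `remaining` caps how many pages are printed.
def create_word_of_day(page, remaining = 3, out: $stdout)
  return if page.nil? || remaining <= 0 # base case: nothing left to print

  word = page.search('div.word-and-pronunciation h1').text.strip
  date = page.at('article div span').text.strip
  out.puts "#{date}: #{word}"

  # Recurse on the previous day's page, if the current page links to one.
  link = page.link_with(text: 'Prev')
  create_word_of_day(link.click, remaining - 1, out: out) if link
end
```

Passing the page itself means each page is fetched only once, by link.click. The cap also keeps the call stack shallow, since every recursive call adds a frame until the chain ends.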

EDITED 9/27/19 : Additional Tools

1. Grabbing an Img Src or Attribute

To grab an image, I used a query selector for the class ‘rec-photo’ and then asked for attr('src'). You can use the same approach to read other attributes, such as ‘alt’.

img = page.search('img.rec-photo')
img.attr('src')

2. Selecting for an Attribute

In this example I am looking for an ‘a’ tag with the attribute data-a="1":

a = page.search('a[data-a="1"]')

3. Getting Links from your Pages

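require 'mechanize'

mechanize = Mechanize.new
url = 'https://www.merriam-webster.com/word-of-the-day' # the page to scrape
page = mechanize.get(url)

# Print the href of every link on the page
page.search('a').each do |link|
  puts link['href']
end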

Further Resources

  1. Mechanize Ruby gem documentation HERE.
  2. Nokogiri gem usage documentation HERE.
  3. Nokogiri ‘cheat sheet’ HERE.
  4. Mechanize gem usage documentation HERE.
  5. How to query CSS selectors HERE.


Katana Tran

Currently learning the ropes on programming through Flatiron School. Here on Medium to document my journey through a newfound joy!