Creating new Model Objects with Nokogiri

Brittany Hartmire
Published in Code Journal
Jan 31, 2019

This blog post is a short tutorial on scraping a website with Nokogiri and creating new model objects in Rails from the scraped data. I assume basic knowledge of Ruby on Rails and ActiveRecord.

What’s Nokogiri?

Nokogiri is the Japanese word for a fine-toothed saw used in woodworking. It’s also a Ruby gem that lets us parse HTML, ripping through a massive string and giving us access to the finer nested nodes within it.

I recently used the Nokogiri gem in a web application with a Redux/React frontend and a Rails API backend. My app allows dancers in Los Angeles to manage their dance class schedules and favorite dance instructors in the city. (GitHub repo here.) I include dance classes and instructors from three real dance studios: Millennium Dance Complex, Movement Lifestyle, and The Playground LA. This tutorial will walk you through my process of creating new DanceClass model objects by scraping the Millennium website with Nokogiri.

Here’s a look at my DanceClass and Instructor tables:

Notice that my dance class table includes a foreign key for instructors. Next, I establish this basic has_many/belongs_to relationship in the Instructor and DanceClass models.
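A minimal sketch of that association, assuming the models are named Instructor and DanceClass and that the dance_classes table has an instructor_id column (the exact column names are an assumption). These fragments only load inside a Rails app, so they are not runnable on their own:

```ruby
# app/models/instructor.rb
class Instructor < ApplicationRecord
  # One instructor teaches many dance classes.
  has_many :dance_classes
end

# app/models/dance_class.rb
class DanceClass < ApplicationRecord
  # Relies on the instructor_id foreign key on the dance_classes table.
  belongs_to :instructor
end
```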

If you haven’t added the Nokogiri gem to your Gemfile, bundled, and run your ActiveRecord migrations, do that now. Time for the fun part.

Let’s create a new file in our models directory called dance_class_scraper.rb

Add a DanceClassScraper model and require ‘nokogiri’ and ‘open-uri’ at the top of the file.

Let’s define a new custom model method inside our Dance Class Scraper that will grab the HTML string at my desired URL:

Whenever you use Nokogiri, Chrome DevTools are a godsend. If the HTML elements on the webpage you’re trying to scrape have distinguishing attributes, your life will be so much easier. Unfortunately, the elements on my page don’t, but where there’s creativity, there’s a way.

Each day of the week is represented by an <h2> element. Under each day there is a div containing four side-by-side <ul> elements with a class name of “pricing-table”, each holding several <li> elements. Together they give the start and end time, name, level, and instructor of each dance class for that day. I’m going to ignore the level for the purposes of my app, but I definitely want the rest of that data, including the day of the week. Because the structure is identical for each day of the week, I’m going to collect an array of the <h2> elements (i.e., the days) and iterate through each day to scrape its dance class data. Let’s create a new method in our scraper model:

I slice my array of <h2> elements, because the webpage contains an additional <h2> element at the bottom of the page about something unrelated. Now I want to define a separate method that takes in a day as an argument with the sole concern of scraping the dance class data for that day:

I know it looks like there’s a lot going on here, but the logic in this method is actually quite simple. The webpage contains a total of 28 unordered lists with a class name of “pricing-table”, 4 for each day of the week. Each group of 4 holds the time, name, level, and instructor for that day’s scheduled dance classes. All of the <ul> elements are accessible in the data array defined at the top of the method. I then assign a hash to an information variable; it will contain arrays of all the times, names, and instructors for that day. Based on the argument value, the method assigns to each hash key an array of the <li> elements in the corresponding <ul>. The method returns this information hash. Now let’s get back to #make_dance_classes.

If I invoke #get_elements(day) in my #each iterator, I will receive a hash containing three arrays of all the times, names, and instructors for the dance classes on that day. Now I need to grab the corresponding time, name, and instructor for every dance class and create a new DanceClass model object with that data. Luckily, the arrays are all in the same order. In other words, information['times'][5], information['names'][5], and information['instructors'][5] will return three pieces of data about the same dance class.

Understanding this, I’ll define a counter variable and until loop which will extract corresponding data from my information hash and use that to create new dance classes.
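The counter-and-until pattern can be tried in plain Ruby before wiring it to the database. In this sketch, build_class_attributes is a hypothetical helper that collects one hash per dance class; inside a Rails app each hash would presumably be handed to something like DanceClass.create instead (an assumption). The sample data is made up.

```ruby
# Walks the three parallel arrays by index and pairs up the values that
# describe the same dance class.
def build_class_attributes(information, day)
  rows = []
  counter = 0
  until counter == information['names'].length
    rows << {
      day: day,
      time: information['times'][counter],
      name: information['names'][counter],
      instructor: information['instructors'][counter]
    }
    counter += 1
  end
  rows
end

# Hypothetical information hash for one day.
information = {
  'times'       => ['12:00 PM - 1:30 PM', '6:00 PM - 7:30 PM'],
  'names'       => ['Jazz Funk', 'Contemporary'],
  'instructors' => %w[Sam Jordan]
}

build_class_attributes(information, 'Tuesday').each { |row| p row }
```

Array#zip would pair the arrays without a counter, but the explicit loop mirrors the description above.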

I want my dance class model objects to have separate attributes for starting time and ending time. This way it will be easier to sort them chronologically later. I split each time string at the ‘ — ‘ symbol and assign a separate start_time and end_time variable with the two values of the returned array.
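A small sketch of that split, assuming time strings shaped like "10:00 AM - 11:30 AM". Because the exact dash character on the page may vary, the hypothetical split_time_range helper accepts a hyphen, an en dash, or an em dash, with or without surrounding spaces.

```ruby
# Matches a hyphen, en dash (U+2013), or em dash (U+2014), plus any spaces around it.
TIME_SEPARATOR = /\s*[-\u2013\u2014]\s*/

# Splits "start - end" into separate start_time and end_time values.
def split_time_range(time_range)
  start_time, end_time = time_range.strip.split(TIME_SEPARATOR, 2)
  { start_time: start_time, end_time: end_time }
end

p split_time_range('10:00 AM - 11:30 AM')
```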

Now we’re done with our scraper class! The final step is deciding when to scrape. I would like to do it in my dance_classes#index controller method. I instantiate a new Dance Class Scraper object and call our custom #make_dance_classes instance method on it.
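A sketch of what that controller action might look like. The Api namespace is an assumption inferred from the /api/dance_classes URL used later in the post, and the fragment only runs inside a Rails app that defines DanceClassScraper and DanceClass:

```ruby
# app/controllers/api/dance_classes_controller.rb
module Api
  class DanceClassesController < ApplicationController
    def index
      # Scrape first, then render whatever is in the table.
      DanceClassScraper.new.make_dance_classes
      render json: DanceClass.all
    end
  end
end
```

In a sketch like this, the scraper runs on every request to the index action, so the same classes would be inserted again on each visit unless the scraper guards against duplicates, for example with ActiveRecord's find_or_create_by.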

In order to render classes in JSON, I need to make a DanceClass serializer. More information about serializers can be found here. Let’s make ours inside app/serializers/dance_class_serializer.rb

Because I want my Dance Class JSON object to nest its corresponding instructor object in JSON format as well, I need to create an Instructor Serializer inside app/serializers/instructor_serializer.rb
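The two serializers could be sketched as follows, assuming the active_model_serializers gem. The attribute names (name, day, start_time, end_time) are assumptions based on the data described in this post, and the fragments only load inside a Rails app with that gem installed:

```ruby
# app/serializers/dance_class_serializer.rb
class DanceClassSerializer < ActiveModel::Serializer
  attributes :id, :name, :day, :start_time, :end_time
  # Nests the instructor, rendered through InstructorSerializer.
  belongs_to :instructor
end

# app/serializers/instructor_serializer.rb
class InstructorSerializer < ActiveModel::Serializer
  attributes :id, :name
end
```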

Now define a corresponding dance_classes#index route. If you start your server and visit this route, you should see a JSON array of all the dance classes from the Millennium website! Soon I’ll write another tutorial that builds on this one and shows how to render this JSON data with a React/Redux frontend.
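A route fragment along these lines would serve /api/dance_classes. The api namespace is an assumption based on the URL used in this post, and the file is Rails configuration rather than a standalone script:

```ruby
# config/routes.rb
Rails.application.routes.draw do
  namespace :api do
    # Only the index action is needed for this tutorial.
    resources :dance_classes, only: [:index]
  end
end
```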

When I start my Rails server and visit http://localhost:3000/api/dance_classes, this is what I see! I can see the dance classes being scraped in my server log, and all of the dance class objects (over 50!) are rendered in a giant JSON array.

My plan was to use Nokogiri to scrape the dance class schedule data from three websites, not one. Unfortunately, the Movement Lifestyle and Playground LA schedules are JS-rendered, so I was unable to scrape them with Nokogiri. Instead, I manually seeded the regular weekly dance class schedules for those two studios. Figuring out how to scrape those websites will be a future project to improve my app. If you have any advice on how to get started, please let me know!

A tip when using Nokogiri: To view what’s scrapable on any webpage, disable JavaScript in Chrome Developer Tools. JS-rendered HTML will no longer be visible on the page, and anything remaining can be scraped with Nokogiri.
