Nokogiri is a powerful Ruby library (a gem) that lets coders parse and search through HTML or XML documents with ease. Combined with Ruby, a powerful programming language in its own right, it gives coders a sort of “grey area” access to the vast amount of information readily available on the World Wide Web.
The First Step
When using Nokogiri, we must first tell Ruby to load it. We accomplish this with three require statements, one each for ‘nokogiri’, ‘open-uri’, and ‘json’.
Require is Ruby’s way of telling the current program to load a file or library that its code needs in order to work. In the first line of code, we tell Ruby to load Nokogiri; in the lines that follow, we have Ruby load two other libraries, ‘open-uri’ and ‘json’. With ‘open-uri’, Ruby can open http, https, or ftp URLs as though they were regular files, which can be incredibly useful. The last thing we require is ‘json’, which allows Ruby to parse and generate JSON, a lightweight data-interchange format. JSON is easy for humans to read and write and easy for machines to parse and generate, making it well suited to storing and exchanging structured data.
By combining these libraries with a bit of Ruby code, we can collect the information we need from a website for use in our own. The next step is to have open-uri open the URL we wish to retrieve data from as if it were a file, hand the result to Nokogiri for parsing, and store the parsed document in a variable called doc.
Replacing ‘INSERT-DESIRED-URL-HERE’ with the URL you want lets the code open that URL as a file and parse it with Nokogiri. After that, it is just a matter of digging through the document Nokogiri returned in the variable doc. If you are unsure what specifically you’re digging into doc for, you can use Google Chrome’s inspect feature, which lets users view the code behind a webpage. By viewing that code, we can determine where the data sits within the document returned by Nokogiri. We then use this location in our code, replacing the ‘location’ placeholder with the element’s tag name (div, span, etc.) followed by a period and then its class name. Together these form a CSS selector such as div.headline, where headline is a made-up class name.
The Last Step
The last step in the process is to take the information that was collected and write it to a file for later use. A coder starts by opening that file with File.open, which is the first line of the block of Ruby code for this step.
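As a sketch of what that block might look like, the example below writes a small stand-in array to temp.json. The contents of collection are invented here; in practice it would hold the values gathered from the page.

```ruby
require 'json'

# Stand-in data; in the article this is the scraped result.
collection = ['First story', 'Second story']

# "w" opens the file for writing, creating it or emptying an existing one.
# The block form closes the file automatically when the block ends.
File.open('temp.json', 'w') do |f|
  f.write(collection.to_json)
end
```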
File.open("temp.json", "w") is part of Ruby’s core File class, so it is available without requiring anything; it does not come from ‘open-uri’. This line tells Ruby to create a file called "temp.json", or to open and empty it if it already exists, and the "w" tells Ruby to open it for writing. If a coder wanted to read from a file instead, they would replace "w" with "r", which gives back a File object to read from. To get the contents directly as a string, File.read("temp.json") is the simpler choice. Now that we have a file open for writing, it is time to write to it. This is done by the line inside the do block, f.write(collection.to_json).
This line of code, f.write(collection.to_json), tells Ruby to write the contents of the variable collection to the "temp.json" file currently open for writing. The .to_json call attached to collection converts its contents to a JSON string before writing. Using this combination of Nokogiri, open-uri, and json, a coder can collect data from a website, format it as JSON, and store it in a file for later use.
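"Later use" usually means reading the file back in. The sketch below shows one way to do that, using File.read and JSON.parse. It recreates temp.json with invented data first so that it runs on its own.

```ruby
require 'json'

# Recreate the file so this sketch is self-contained (data is made up).
File.write('temp.json', ['First story', 'Second story'].to_json)

raw = File.read('temp.json')   # the whole file as a String
collection = JSON.parse(raw)   # back to a Ruby Array of Strings

puts collection.first
```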
All in all, this process is usually referred to as ‘scraping’, and it is incredibly useful when building applications or when the data we need isn’t readily available via an API. That being said, scraping can fall into a “grey area”, since the data being collected doesn’t technically belong to us in the first place. Because of this, it is important to give credit where credit is due and to mention where the data was collected from when using it.