I’ll share the process I followed to create Libri and publish it on RubyGems.org, along with some technical roadblocks that I faced during development. This project focuses on scraping, which is the act of retrieving data from the HTML of a web page, usually by locating elements with CSS selectors. Here is a walkthrough video of how Libri works:
After sifting through several scraping ideas (including scraping Noti.st, 80,000 Hours’ Problem Profiles, or Adafruit’s Raspberry Pi projects), I settled on going back to a theme that can be simple, meaningful, and usable by many: books. In searching for a website to scrape, I had several options: the Man Booker website, Goodreads’ Awards section, and Penguin’s Award Winners list.
In the end, I chose to scrape Barnes & Noble’s awards webpage, as it seemed to be the most comprehensive and was also quite up to date.
To build a gem using Bundler, I started by running `bundle gem libri` in the terminal at the Libri working directory. This creates the file structure (called a scaffold directory) for our gem, so we can start coding right away.
I also made sure that the following dependencies were installed on my computer:
- Rake, used to build a local copy of our gem, which we’ll use to push and publish to RubyGems.org
- OpenURI, used to open a URL as if it were a file
- Nokogiri, used to parse HTML and XML values from a webpage
- Pry, used as a local sandbox and a debugging tool
- Colorize, used to style text in the terminal using different colours
Now, for Libri, I wanted to make 3 things work on my terminal:
- Display the various awards
- Display the books belonging to a chosen award
- Display the information of a chosen book
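The three-level flow above (awards, then books, then one book’s details) could be sketched roughly as follows. This is only an illustration under stated assumptions: the `AWARDS` data is invented, the `choose` helper is hypothetical, and the input/output objects are injected so the sketch can run without a live terminal. It is not Libri’s actual code.

```ruby
require 'stringio'

# Invented sample data; the real gem would fill this in by scraping.
AWARDS = {
  'Award A' => { 'Book One' => 'About Book One.', 'Book Two' => 'About Book Two.' },
  'Award B' => { 'Book Three' => 'About Book Three.' }
}.freeze

class CLI
  def initialize(input: $stdin, output: $stdout)
    @input = input
    @output = output
  end

  # Walks through the three levels: awards, then books, then one book's details.
  def call
    award = choose('Choose an award:', AWARDS.keys)
    return if award.nil?

    book = choose("Books for #{award}:", AWARDS[award].keys)
    return if book.nil?

    @output.puts AWARDS[award][book]
  end

  private

  # Hypothetical helper: prints a numbered list and returns the chosen item,
  # or nil when the user types "exit" (or the input ends).
  def choose(prompt, items)
    @output.puts prompt
    items.each_with_index { |item, i| @output.puts "#{i + 1}. #{item}" }
    loop do
      answer = @input.gets
      return nil if answer.nil? || answer.strip == 'exit'

      index = answer.to_i - 1
      return items[index] if index.between?(0, items.size - 1)

      @output.puts 'Please try again.'
    end
  end
end

# Demo with scripted input: award 1, then book 2.
out = StringIO.new
CLI.new(input: StringIO.new("1\n2\n"), output: out).call
puts out.string
```

In a real terminal session, `CLI.new.call` would read from the keyboard instead of the scripted `StringIO`.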
To do this, I structured my `lib` folder in this manner, separating the
Each class is responsible for different parts of the gem:
- `CLI` class is responsible for the terminal interface that interacts with the user
- `scraper` class scrapes text-based contents off the webpage
- `awards` class creates new instances of `Awards` object from the hash values returned by scraping
- `books` class creates new instances of `Books` object from the hash values returned by scraping
- `book` class creates new instances of `Books` object from the hash values returned by scraping
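As a rough sketch of the hash-to-object step described above: the attribute names, the sample data, and the `create_from_collection` helper are assumptions made for illustration, not taken from Libri’s source.

```ruby
# Minimal sketch: build objects from an array of hashes.
class Awards
  attr_reader :name, :url

  # Every instance created so far (stored in a class-level instance variable).
  def self.all
    @all ||= []
  end

  # Hypothetical helper: creates one Awards instance per hash.
  def self.create_from_collection(hashes)
    hashes.map { |attributes| new(attributes) }
  end

  def initialize(attributes)
    @name = attributes[:name]
    @url  = attributes[:url]
    self.class.all << self
  end
end

# Made-up data in the shape a scraper might return.
scraped = [
  { name: 'Award A', url: 'https://example.com/awards/a' },
  { name: 'Award B', url: 'https://example.com/awards/b' }
]

Awards.create_from_collection(scraped)
puts Awards.all.map(&:name).inspect # prints ["Award A", "Award B"]
```

Keeping the scraper’s output as plain hashes and letting each class build its own instances means the scraping code never needs to know about the objects that consume it.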
This stage took the longest to complete, but all in all, it was a success, and I have several notes to make:
- I learned to use a multi-line string via `HEREDOC`, which itself has several forms that achieve the same thing (e.g.
- Initially, whenever an `exit` command was called, the `Please try again.` message would also be displayed. This was fixed by using a single-level `if/else...end` conditional rather than a `while input != 'exit'...end` loop.
- I knew that I wanted to access several levels of information, scraping from various URLs and passing in a different URL based on the user’s input. For example, if the user chooses the Pulitzer Awards, the `Scraper.scrape_award()` method must return information from the Pulitzer Awards URL; if the user chooses the Man Booker Prize, the return value should come from the Man Booker URL. I therefore needed to pass the URL in as an argument to the `Scraper.scrape_award()` method. To do this, I included a `:url` key in the top-level `awards` hash, whose value is passed in to `Scraper.scrape_award()`. The second-level `books` hash can then be scraped from the passed-in URL, and the same concept applies when we scrape a third-level URL for individual book information. I wasn’t sure if this was workable, as the labs I had worked on previously hadn’t used a multi-level website updated in real time, and therefore had no need for this flow. But it was! This was the best revelation I had while building this project: versatility can be built into code.
- I couldn’t access HTML values for attributes other than `href`. The rating values on the B&N website were stored within the `aria-label` attribute, which did not return a value when I attempted to access it. I also couldn’t access the books listed under the Customers Who Bought This Item Also Bought section, which returned nothing as well. I’m still searching for answers.
- Originally, upon scraping, I realised that I could access hash values and display them directly via `Hash[:key]`, even without instantiating new objects and assigning them their arguments / attributes. This led to an oversight, where I published the working gem without practicing Ruby’s object relationships, such as has-many. This was fixed by editing the `book` classes accordingly. Now we can access hash values, such as
- At one point, after the terminal displayed a list of books and I went back to select another award, the displayed list of books accumulated, resulting in 20, then 40, then 60… books. This was a disaster, and I had almost given up. However, I soon realised that the bug was caused by the `CLI#make_book(award)` method being called every time `CLI#menu_award` was called, with each call adding a new array of books onto those already instantiated. `#make_book(award)` is needed to instantiate our Book object and to access its various attributes, so we can’t drop it. To fix this, a method that clears the previously instantiated objects is now called before `#make_book(award)`, thus resetting the `Books.all` return value for each menu call.
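Two of the notes above, passing the award’s URL into the scraper and clearing previously instantiated books before building new ones, can be sketched together. This is a simplified illustration, not Libri’s code: the `Scraper` here is a stand-in that looks pages up in a hash instead of fetching them, the URLs and book data are invented, and `Books.clear` is a hypothetical name, since the post does not say what the real reset method is called.

```ruby
# Stand-in for the real scraper: a lookup table instead of a network request,
# so the sketch runs offline. URLs and book data are invented.
class Scraper
  PAGES = {
    'https://example.com/awards/a' => [{ title: 'Book One' }, { title: 'Book Two' }],
    'https://example.com/awards/b' => [{ title: 'Book Three' }]
  }.freeze

  # The URL argument decides which page's books come back.
  def self.scrape_award(url)
    PAGES.fetch(url, [])
  end
end

class Books
  attr_reader :title

  # Every Books instance created since the last reset.
  def self.all
    @all ||= []
  end

  # Hypothetical reset method; the post does not name the real one.
  def self.clear
    all.clear
  end

  def initialize(attributes)
    @title = attributes[:title]
    self.class.all << self
  end
end

# Loosely mirrors CLI#make_book(award): reset, then rebuild from the award's URL.
def make_book(award)
  Books.clear
  Scraper.scrape_award(award[:url]).each { |attributes| Books.new(attributes) }
  Books.all
end

award_a = { name: 'Award A', url: 'https://example.com/awards/a' }
award_b = { name: 'Award B', url: 'https://example.com/awards/b' }

make_book(award_a)
puts Books.all.size # 2
make_book(award_b)
puts Books.all.size # 1, not 3
```

Removing the `Books.clear` line reproduces the accumulation described above: the second call would leave three books in `Books.all` instead of one.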
All in all, I wouldn’t have been able to overcome these challenges without talking through my code line by line, component by component, flow by flow, as suggested by Dakota.
I talked out my thought process based on this rough flow:
- What am I trying to do?
- Is Ruby doing what I’m expecting her to do? (Y/N)
- If no, what’s happening instead, and why do we think it’s happening?
- If it’s happening because of X—then, if we change Y, we expect Z to happen.
- We test our hypothesis by changing Y, and we see if Z happens.
- If Z happens, based on our understanding of X, we should know how to fix it and achieve what we were trying to do.
- If Z doesn’t happen, don’t give up! Read up and look for help, and test different understandings to find the one that works with Ruby.
This is a simple project. However, with several different components interacting with one another, it was quite easy to lose track of one of them (e.g. how best to access and display every single piece of information, or at which stage objects have and have not been instantiated). When I lost track of one, I soon lost sight of the big picture and had to start all over again. So here’s a reminder to keep practicing, and to practice right!
Lastly, to publish a gem for the first time, I followed these simple steps:
- Edit the Gemspec file and update the `Summary` as well as the `Description` specification. Make sure that every `todo` in the file has been rewritten to prevent any potential errors when publishing. Next, comment out the entire code block that says `Prevent pushing this gem to RubyGems.org`, otherwise we won’t be able to push our gem.
- Add dependencies via
- Update the `version.rb` file if necessary, following semantic versioning standards. There are many guides out there, including this and this.
- Update the `README.md` file as well. This gives users an overview of the gem, as well as instructions on how to install and run it.
- Make sure that your GitHub repo has all its files updated (latest commit and push).
- Make sure that `rake` is installed, so we can run `rake build` followed by `rake release`, which will push our latest gem version onto RubyGems.org for others to use! Alternatively, I also tried using `gem push libri-0.x.x.gem` to a similar effect. Another alternative is to install the `gem-release` gem, which provides several methods for helping with gem development that I will explore in future projects.
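To make the Gemspec and version steps above more concrete, here is a trimmed-down sketch of what those two files might contain, combined into one runnable snippet. Everything specific in it is an assumption: the version number, author details, summary wording, and the split between runtime and development dependencies are placeholders, and the scaffold’s push-guard block is simply left out. It is not a copy of Libri’s actual Gemspec.

```ruby
require 'rubygems'

# Normally lives in lib/libri/version.rb; the number here is hypothetical.
module Libri
  VERSION = '0.1.0'
end

# Normally lives in libri.gemspec at the project root.
spec = Gem::Specification.new do |s|
  s.name        = 'libri'
  s.version     = Libri::VERSION
  s.authors     = ['Your Name']        # placeholder
  s.email       = ['you@example.com']  # placeholder
  s.summary     = 'Browse award-winning books from the terminal.' # placeholder wording
  s.description = 'A command-line gem that lists book awards, their books, and book details.' # placeholder wording
  s.files       = Dir['lib/**/*.rb']

  # Assumed split: gems needed at runtime vs. only while developing.
  s.add_dependency 'nokogiri'
  s.add_dependency 'colorize'
  s.add_development_dependency 'rake'
  s.add_development_dependency 'pry'
end

puts "#{spec.name} #{spec.version}"
puts spec.runtime_dependencies.map(&:name).inspect
```

Reading the version from `Libri::VERSION` means a release only requires bumping the number in one place before running `rake build`.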
I hope you enjoyed this post and that it makes sense to you! Drop in any suggestions for the gem and I’ll work on them. Happy coding!