Data Scraping in Ruby
Data science has quickly become one of the fastest-growing sub-fields in tech. As we have become more and more connected through social media and smartphones, the sheer amount of data we create is staggering. Unsurprisingly, this data can teach us a lot about how we think, how we communicate, and what we might be interested in next. Data analysis combined with machine learning is extremely powerful.
Big data firms use advanced algorithms to reach their conclusions, but much of it comes back to one of the most fundamental parts of sifting through data: web scraping. Web scraping is essentially gathering unorganized data, splitting it into relevant pieces, and making it more readable to humans. It allows us to quickly parse through websites and collect specific pieces of information; the basic idea is to scan through a page and compile a list of organized data. Manually collecting this information would be both tedious and wildly inefficient, which is why web scrapers can be incredibly useful.
In an attempt to better understand the basics of how this all works, I decided to practice my skills by making a simple scraping application. As a more hands-on learner, after some brief background research I decided to just start writing code in Atom and figure it out as I went.
Nokogiri
Nokogiri is an open-source Ruby library for parsing HTML and XML documents, and it is one of the most popular tools for web scraping. This is one of the first gems I've used on my own outside of my Flatiron labs, and thankfully, despite a few guides warning of potential installation issues, the process was quick and painless. Installation required little more than running gem install nokogiri in the terminal and requiring the gem in my Ruby file.
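For reference, the whole setup is just those two steps:

```ruby
# One-time install from the terminal:
#   gem install nokogiri

# Then pull the gem into your Ruby file:
require 'nokogiri'
```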
OpenURI
OpenURI is another popular gem used in web scraping. It wraps the HTTP request cycle in a single open call, which gives you one less thing to worry about.
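As a minimal sketch (the URL here is just an example, and newer Ruby versions spell the call URI.open rather than a bare open):

```ruby
require 'open-uri'

# OpenURI handles the whole HTTP request/response cycle in one call and
# returns an IO-like object we can read the response body from.
html = URI.open("https://en.wikipedia.org/wiki/Web_scraping").read
puts html[0, 200] # peek at the first 200 characters of raw HTML
```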
The first test was to see if I could pull the information from the Wikipedia entry on Web scraping. While there was no filtering involved in this first test, the goal was to go through the process of making the HTTP request through OpenURI and returning information in my Terminal.
I defined a variable called get_page to make the initial request and connect to the server. To do this you pass the result of OpenURI's open call (the page you want to pull from) into Nokogiri::HTML. After defining this, you can call the .css method on that variable to filter the page down. In this case, I used Chrome's inspect tool to find the class for the main page content, .mw-body-content, and grabbed everything within the <p> tags (the main readable content).
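Pieced together, the request and the filter look roughly like this; the variable names get_page and info follow the post, while the exact URL and the URI.open spelling (newer Rubies deprecate the bare open form) are my own assumptions:

```ruby
require 'nokogiri'
require 'open-uri'

# Make the HTTP request and parse the response into a Nokogiri document.
get_page = Nokogiri::HTML(URI.open("https://en.wikipedia.org/wiki/Web_scraping"))

# Filter down to the paragraph tags inside the main body content.
info = get_page.css(".mw-body-content p")
puts info
```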
When you run the app in the terminal, you get back the raw node set, HTML tags and all.
But after adding .text to the end of our info variable, the result is a much more neatly organized version of the page contents.
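In code, the change is a single call on the same info variable from the sketch above:

```ruby
# .text strips out the HTML tags and returns only the readable text
# of the selected paragraphs.
puts info.text
```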
And lastly, if we wanted to grab just the titles of the sections, we could filter by H3 headings.
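Swapping the selector in the earlier sketch gives something like this (h3 follows the post; whether every section title on the page lives in an h3 tag is an assumption):

```ruby
# Filter on the h3 heading tags to pull out just the section titles.
headings = get_page.css(".mw-body-content h3")
puts headings.text
```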
I first tried using .split(" ") to separate the words and make the output more readable, but the result was not at all what I was looking for. When I ran the app I got a list of all the header names along with the edit link that Wikipedia includes next to every title, and the header titles themselves were split apart rather than grouped together. Because Wikipedia appends that marker after each and every header, my workaround was to use the pattern to my advantage and simply replace it with a line break. Thinking back to old labs, I remembered the gsub method. Gsub takes two arguments: the first is the piece of the string you want to match, and the second is what you want to replace it with. The line of code below replaces each edit marker with a line break, giving us each heading on its own line. Much easier to read.
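A minimal sketch of that replacement, assuming the marker comes through in the scraped text as the literal string "[edit]":

```ruby
# Replace Wikipedia's "[edit]" marker with a newline so each section
# title lands on its own line.
puts headings.text.gsub("[edit]", "\n")
```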
To clean things up a bit further, we should put each piece into its own method so that we can call them more easily.
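Here is one way that refactor could look; the method names body_text and section_titles are placeholders of my own, not necessarily what the original code used:

```ruby
require 'nokogiri'
require 'open-uri'

WIKI_URL = "https://en.wikipedia.org/wiki/Web_scraping"

# Returns the full readable body text of the article.
def body_text
  page = Nokogiri::HTML(URI.open(WIKI_URL))
  page.css(".mw-body-content p").text
end

# Returns each section title on its own line.
def section_titles
  page = Nokogiri::HTML(URI.open(WIKI_URL))
  page.css(".mw-body-content h3").text.gsub("[edit]", "\n")
end
```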
Now we can call either of the two methods to get a neatly ordered list of titles, or the entire body text to read through in the terminal. With these pieces of data narrowed down, we can use them later to compute other attributes such as character count (.length) or how many times a specific word comes up (using .select-style methods), and apply the same approach to other sites.
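For instance, using the body_text method sketched above (the word being counted is an arbitrary example):

```ruby
puts body_text.length                                                 # character count
puts body_text.split.select { |word| word.downcase == "data" }.count # word frequency
```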
One last test I did was to grab the usernames of the Twitter users who had recently used the #flatironschool hashtag. Much like in the first example, the goal is to parse through the page and grab the data we want. Running the recent_flatiron_tweeters method prints the list of usernames in the terminal.
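A sketch of what recent_flatiron_tweeters could look like. The hashtag URL and the ".username b" selector are assumptions based on Twitter's old server-rendered markup; today's twitter.com builds its pages with JavaScript, so both would need to be checked against whatever the site actually serves.

```ruby
require 'nokogiri'
require 'open-uri'

# Hypothetical sketch: the selector and URL assume Twitter's old static markup.
def recent_flatiron_tweeters
  page = Nokogiri::HTML(URI.open("https://twitter.com/hashtag/flatironschool"))
  page.css(".username b").each do |username|
    puts username.text
  end
end
```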
In addition to their usernames, I wanted to return their full names as well. Unfortunately, this is where I ran into a lot of issues. Some names were just a first name, others were a first and last name, and others included emojis. The emoji factor was the largest problem I ran into and one I still haven't fully fixed.
To get the full names of the users I had to target a different CSS selector. This was fairly simple and just involved swapping out some names from the first Twitter method I wrote. The problems occurred when trying to format the names within a list.
The emojis made it much more difficult to separate the names because there was no consistent pattern that all profiles followed. The next step was to find the Unicode values for the emojis so that they could be substituted out with gsub. I created a variable called regex holding all of those identifiers. The regex variable is an instance of the Regexp class, Ruby's Regular Expression object, which is used to match specific characters or patterns.
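A sketch of that cleanup step. The ".fullname" selector for display names and the particular emoji Unicode ranges below are assumptions of mine, not the post's exact code:

```ruby
require 'nokogiri'
require 'open-uri'

# Hypothetical selector and URL, as in the previous sketch.
page = Nokogiri::HTML(URI.open("https://twitter.com/hashtag/flatironschool"))

# Character class covering a handful of common emoji blocks (approximate).
regex = /[\u{1F300}-\u{1F5FF}\u{1F600}-\u{1F64F}\u{1F680}-\u{1F6FF}\u{2600}-\u{27BF}]/

# Strip the emoji from each display name before printing the list.
page.css(".fullname").each do |name|
  puts name.text.gsub(regex, "").strip
end
```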
In short, there are endless ways of gathering data from the web. Sites like Google scrape the entire web at all times, but on a smaller scale we can see how these building blocks come together. I was happy to learn how to return specific pieces of data from pages, and I definitely learned a lot from my issues. Different sites are of course built very differently, so what works for one most likely won't work for another. I'd like to sit down and spend some more time fine-tuning what I've learned and applying it to other websites. Although I didn't get to write complete methods for everything I wanted to, I'd consider this an overall successful experiment: I came away with a better understanding of web scraping principles and an introduction to other CS concepts such as Regular Expressions.