Scraping vs APIs — necessity is the mother of invention
When building a web application, some of the most interesting features are often built on the back of pre-existing services. Whether that means incorporating Twitter for easy social media connectivity or Google Maps for easy-to-use map data, building on pre-existing data is often the best way to make your website awesome.
But what if you need information from a website that is small or doesn't offer its own API? There's still a way: scraping!
Scraping should always be done with caution — if a website does not offer an API, that sometimes means the site doesn't want you to use its data. But there are sites out there that haven't built an API and have no problem with you finding and using their data.
So what does scraping look like? One way to do it is to create a scraping class, then initialize an object of that class that holds the scraped data.
require 'nokogiri'
require 'open-uri'

class RedditScraper
  attr_accessor :html, :url

  def initialize(theme_id)
    # Pick a random subreddit for the given theme and fetch its HTML
    @url = REDDIT_URLS[theme_id].sample.url
    @html = Nokogiri::HTML(URI.open(@url))
  end
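The REDDIT_URLS constant comes from elsewhere in my application and isn't shown here. One possible shape for it, assuming each theme key maps to an array of subreddit entries that respond to url, might look like this (the themes and struct name are made up for illustration):

```ruby
# Hypothetical shape for REDDIT_URLS: each theme key maps to an array
# of entries responding to #url, so .sample can pick one at random.
SubredditSource = Struct.new(:url)

REDDIT_URLS = {
  funny: [SubredditSource.new("https://www.reddit.com/r/funny/")],
  aww:   [SubredditSource.new("https://www.reddit.com/r/aww/")]
}
```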
Now, our RedditScraper class will create objects with both a url and html. Because we are using multiple subreddits, we've abstracted the actual website we are scraping behind the REDDIT_URLS lookup, which picks a random subreddit for a given theme. Then we have to format the data that comes in. To do that, you can build a scrape method.
  # Returns a hash mapping each post title to its link
  def scrape
    content_hash = {}
    self.html.css(".title").each_with_index do |post, index|
      if post.name == "p" && index > 3
        link = post.children[0].attributes["href"].value
        # Relative subreddit links need the domain prepended
        if link[0..2] == "/r/"
          link = link.prepend("https://www.reddit.com")
        end
        title = post.children[0].text
        content_hash[title] = link
      end
    end
    content_hash
  end
end
Now we have formatted content to display. It's still in the form of a hash, so we need to get the content out of it. If we want random content from the hash, we can use the sample method. Because my application is a Ruby on Rails application, I'm displaying the information in an html.erb file:
<% post_title = content.keys.sample %>
<% post_url = content[post_title] %>
Now you can put that content in formatting divs, or do more formatting as needed. Either way, you now have consistent access to the scraped site's content.
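The same selection can be sketched in plain Ruby, with a made-up content hash standing in for real scraped results:

```ruby
# A made-up title => link hash standing in for RedditScraper#scrape output
content = {
  "A very good dog"       => "https://www.reddit.com/r/aww/comments/abc/",
  "Cat knocks over glass" => "https://www.reddit.com/r/funny/comments/def/"
}

post_title = content.keys.sample  # pick a random title
post_url   = content[post_title]  # look up its link
puts "#{post_title}: #{post_url}"
```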
This example is incredibly specific — we needed to scrape both the text and the link within that text. But the general idea will be the same with most scraping. Find the needed information, use Nokogiri to grab it, and format it as needed. Here’s another example of using Nokogiri, from the founder’s tutorial:
require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(URI.open("http://www.threescompany.com/"))