Scraping with Nokogiri and wget

A friend of mine called me two days ago and asked for my help. The job description was simple: scrape Angel.Co. (I hope it is not illegal to scrape data from Angel.Co.) Anyway, it sounded like a challenge, so I took the job.

In this article, I will describe, step by step, how I managed to scrape data from Angel.Co using wget and Nokogiri.

OK, we will definitely use Nokogiri, but first we need to fetch the page source. The open-uri library looks perfect for the job: it is easy and handy. Let’s give it a try.

require 'open-uri'
# Works on the Ruby 2.3 used here; on Ruby 3.0+ call URI.open instead.
uri = open('https://angel.co/slack')

You will see that this call fails with a 404 Not Found error. Yeah, Angel.Co is somehow blocking Ruby's Net::HTTP. You can try changing the user agent, and you will still get the same response.

OpenURI::HTTPError: 404 Not Found
from /home/bunyamin/.rbenv/versions/2.3.1/lib/ruby/2.3.0/open-uri.rb:359:in `open_http'

We have a great tool in the Linux world called wget. You tell it what you want, and it gets it for you. That’s what I did. I told it to get me the Angel.Co page, and it did not disappoint me. So I wrote the following Ruby function to download a given web page using wget.
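
The function itself is not reproduced in this text. What follows is only a minimal sketch of how such a helper might look, assuming it takes an output filename and a URL (as the call further down suggests) and returns the downloaded HTML as a string. The wget_command helper, the timeout, and the retry count are illustrative choices, not taken from the original.

```ruby
# Builds the wget argument list. Passing the arguments separately (instead of
# one shell string) avoids shell-quoting problems with unusual URLs.
# The --tries and --timeout values are assumptions chosen for this sketch.
def wget_command(filename, url)
  ['wget', '--quiet', '--tries=1', '--timeout=10',
   "--output-document=#{filename}", url]
end

# Downloads url into filename with wget and returns the HTML as a string.
# Raises a RuntimeError if wget is missing or exits with a non-zero status.
def fetch(filename, url)
  ok = system(*wget_command(filename, url))
  raise "wget failed for #{url}" unless ok
  File.read(filename)
end
```

Returning the file contents rather than the filename means the result can be handed straight to Nokogiri::HTML.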

Let’s give it a try. Open your Ruby console and paste in the function above. After that, paste the following code block, and you will see it download the web page. Smooth!

host = "https://angel.co"
file = fetch("result.html", "#{host}/slack")

You have the page source now. You can scrape any information you want from it with Nokogiri.

require 'nokogiri'
page = Nokogiri::HTML(file)

Let’s fetch the Funding part!

If you followed every step correctly, you should see the following output.

Raised: $200,000,000 (NoStage) (Apr1,2016)
Raised: $160,000,000 (SeriesE) (Apr16,2015)
Raised: $120,000,000 (SeriesD) (Oct31,2014)
Raised: $42,750,000 (SeriesC) (Apr27,2014)
Raised: $10,700,000 (SeriesB) (Apr1,2011)
Raised: $5,000,000 (SeriesA) (Apr1,2010)
Raised: $1,500,000 (Seed) (Jan1,2009)