Web Scraping with Node and RabbitMQ

Repository link: https://github.com/red-starter/RabbitMQWebScraper

RabbitMQ is a message broker. It accepts messages from producers and gives them to consumers. In between we can buffer, persist and even route the messages as we please.

The queue lives inside RabbitMQ. Messages flow through it and can only be stored inside a queue, which is essentially an infinite buffer. Many producers can send messages to a single queue, and many consumers can receive messages from it.

Side note: the consumers, the queue and the producers can be distributed across systems. They do not have to live on the same machine.

Work queues are useful for avoiding resource-intensive tasks being run synchronously. We can encapsulate a task and send it to the queue; a worker process running separately from our main application grabs tasks off the queue and executes them. Moreover, we can run many workers at the same time and share the tasks between them. This makes it easy to parallelise work: to scale, we simply add more worker processes.
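Before bringing in RabbitMQ, the work-queue pattern can be sketched with a plain in-memory array (a toy illustration only; in the real setup the queue lives in the broker and is shared across processes):

```javascript
// A toy in-memory work queue: producers push tasks, workers shift them off.
// RabbitMQ replaces this array with a durable, networked queue.
var queue = [];

// Producer: enqueue some tasks
for (var i = 0; i < 6; i++) {
  queue.push('task-' + i);
}

// Workers share the same queue; each grabs the next available task.
function makeWorker(name, results) {
  return function () {
    var task;
    while ((task = queue.shift()) !== undefined) {
      results.push(name + ' did ' + task);
    }
  };
}

var results = [];
makeWorker('worker-1', results)();
makeWorker('worker-2', results)();
// worker-1 drains the queue before worker-2 runs, since this is
// single-threaded; with RabbitMQ the broker round-robins tasks
// between concurrent worker processes instead.
console.log(results.length); // 6
console.log(queue.length);   // 0
```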

An example of how producers and consumers interact via the queue:

http://vichargrave.com/wp-content/uploads/2013/01/producer-consumer-model.png

RabbitMQ speaks multiple protocols and has clients for many languages. This tutorial uses the AMQP 0-9-1 protocol and the amqp.node client. The three external dependencies are:

"dependencies": {
  "amqplib": "^0.4.0",
  "cheerio": "^0.19.0",
  "request": "^2.67.0"
}

Cheerio is an implementation of jQuery designed specifically for the server. We will use it to pluck the relevant DOM information from the returned HTML. The request library is the simplest way to make HTTP calls. fs is Node's core file I/O module; we will use it to write JSON files containing the movie information (title, release and rating).

This is the producer; we will send 100 messages (URLs) to the work queue.

// we need to require the library first:
var amqp = require('amqplib/callback_api');

// simple function that generates valid IMDB url strings
function getMovieByIndex(index) {
  index = 350000 + index;
  return 'http://www.imdb.com/title/tt0' + index;
}

// connect to the RabbitMQ server
amqp.connect('amqp://localhost', function(err, conn) {
  // create a channel, which is where the API for getting things done lives
  conn.createChannel(function(err, ch) {
    // declare the queue we publish messages to
    var q = 'movieUrls';
    // a queue will only be created if it doesn't already exist
    ch.assertQueue(q, {durable: false});
    for (var i = 0; i < 100; i++) {
      // create a hundred url strings which will be sent to the queue
      var urlString = getMovieByIndex(i);
      // (on newer versions of Node, Buffer.from(urlString) is preferred)
      ch.sendToQueue(q, new Buffer(urlString));
      console.log('Sent ' + urlString);
    }
  });
  // close the connection and exit
  setTimeout(function() { conn.close(); process.exit(0); }, 500);
});

Our receiver gets messages pushed from RabbitMQ, so unlike the sender which publishes a single message, it will run continuously. Open a connection and a channel and declare the queue to consume from. The queue name matches the queue that sendToQueue publishes to.

var amqp = require('amqplib/callback_api');
var request = require('request');

amqp.connect('amqp://localhost', function(err, conn) {
  conn.createChannel(function(err, ch) {
    var q = 'movieUrls';
    // the queue is declared again, in case the consumer starts first
    ch.assertQueue(q, {durable: false});
    ch.consume(q, function(msg) {
      var url = msg.content.toString();
      request(url, processAndWrite);
    }, {noAck: true});
  });
});

The receiver might start before the sender, so we declare the queue in the consumer as well to make sure it exists before we try to consume from it.

We're telling the server to deliver messages from the queue. This happens asynchronously, so we provide a callback that RabbitMQ invokes whenever it pushes a message to the consumer. The callback then requests the page at the URL carried in the message.

ch.consume(q, function(msg) {
  var url = msg.content.toString();
  request(url, processAndWrite);
}, {noAck: true});
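Note that `{noAck: true}` tells RabbitMQ to consider a message handled the moment it is delivered, so a worker that crashes mid-task silently loses that message. If that matters for your workload, a common variation (sketched here, not part of the original code) is to turn acknowledgements on and ack explicitly once the work is done:

```javascript
// Fragment: the same consume call, but with explicit acknowledgements.
ch.consume(q, function(msg) {
  var url = msg.content.toString();
  request(url, function(error, response, html) {
    processAndWrite(error, response, html);
    // Only now tell RabbitMQ the message is done; messages that were
    // never acked are redelivered if this worker dies.
    ch.ack(msg);
  });
}, {noAck: false});
```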

The callback triggers an HTTP request, and the fetched HTML is fed to the processAndWrite callback. The HTML is loaded with Cheerio and the processed output is written to a JSON file in the movies directory.

var cheerio = require('cheerio');
var fs = require('fs');

var processAndWrite = function(error, response, html) {
  if (!error) {
    var $ = cheerio.load(html);
    var name, release, rating;
    var movie = { name: '', release: '', rating: '' };
    // the .header element holds the title and release year
    $('.header').filter(function() {
      var data = $(this);
      name = data.children().first().text();
      release = data.children().last().children().text();
      movie.name = name;
      movie.release = release;
    });
    // the .star-box-giga-star element holds the rating
    $('.star-box-giga-star').filter(function() {
      var data = $(this);
      rating = data.text();
      movie.rating = rating;
    });

    fs.writeFile('movies/' + name + '.json', JSON.stringify(movie, null, 4), function(err) {
      if (err) throw err;
      console.log('File successfully written!');
    });
  }
};

You can now spawn as many workers as you like. In the following example I spawned three workers.
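Assuming the producer and worker live in files named send.js and worker.js (the filenames are illustrative), spawning three workers is just a matter of starting the worker script three times:

```shell
# Each worker process opens its own connection to RabbitMQ;
# the broker round-robins messages between them.
node worker.js &
node worker.js &
node worker.js &

# Then publish the 100 URLs:
node send.js
```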
