Portuguese news crawler in Node.js

Sep 2, 2014

For the past few weeks I’ve been coding a crawler that parses Portuguese news from the main websites, plus an API endpoint that opens the database to the world. It’s written in JavaScript on the great Node platform, with MongoDB for storage.

The project is currently in an early alpha phase. I’m adding new features and fixing issues all the time, but it’s already quite usable.

The crawler keeps a list of news sites to crawl, and each site has its own module that processes its pages and extracts the relevant content:

Title
Caption (if available)
Publish date
Single image (if available)
Category
Keywords (if available)
Main text

Crawled sites:

I’ve decided to write a specific parsing module for each website. It’s the more painful way, but the content retrieved is a clean match for every website. The markup differs too much from site to site for a single generic selector like $("article") to work everywhere.
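To make the one-module-per-site idea concrete, here is a minimal sketch of how the dispatch could look. It is not the project's actual code: the hostnames, the regexes, and the names firstMatch, siteParsers and parseArticle are all made up for illustration, and a real module would more likely use an HTML parser than regular expressions.

```javascript
// Return the first capture group found in the HTML, or null when absent.
// Regexes keep this sketch dependency-free; they are an assumption, not the project's method.
function firstMatch(html, regex) {
  const match = regex.exec(html);
  return match ? match[1].trim() : null;
}

// Hypothetical per-site modules, keyed by hostname.
// Each one knows the markup of exactly one (imaginary) site.
const siteParsers = {
  'site-a.example': (html) => ({
    title: firstMatch(html, /<h1[^>]*>([^<]*)<\/h1>/),
    caption: firstMatch(html, /<p class="lead">([^<]*)<\/p>/),
    mainText: firstMatch(html, /<div class="body">([^<]*)<\/div>/),
  }),
  'site-b.example': (html) => ({
    title: firstMatch(html, /<h2 class="headline">([^<]*)<\/h2>/),
    caption: null, // this imaginary site never publishes a caption
    mainText: firstMatch(html, /<section id="story">([^<]*)<\/section>/),
  }),
};

// Pick the parser that matches the article's hostname and run it.
function parseArticle(url, html) {
  const host = new URL(url).hostname;
  const parser = siteParsers[host];
  if (!parser) {
    throw new Error(`No parser registered for ${host}`);
  }
  return { site: host, url, ...parser(html) };
}
```

Adding a new site then means adding one more entry to the registry, and a layout change on one site only touches that site's parser.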

API endpoint

Right now the API is hosted on Heroku at newsptapi.herokuapp.com, and there are a few endpoints that can be accessed:

Last news

/api/noticias/ultimas/(optional number of news to retrieve)

News from specific site

/api/jornal/(site name)

e.g. /api/jornal/Publico

Random news

/api/noticias/random/(optional number of random news to retrieve)
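As a rough sketch, the three endpoints above can be wrapped in small URL-building helpers. The helper names are hypothetical, and the http scheme and the trailing slash when no count is given are assumptions. The post doesn't describe the response format, so the sketch stops at building the URL.

```javascript
// Base address from the post; the "http://" scheme is an assumption.
const BASE_URL = 'http://newsptapi.herokuapp.com';

// Append an optional positive-integer count to an endpoint path.
function withOptionalCount(path, count) {
  if (count === undefined) {
    return `${BASE_URL}${path}/`;
  }
  if (!Number.isInteger(count) || count < 1) {
    throw new RangeError('count must be a positive integer');
  }
  return `${BASE_URL}${path}/${count}`;
}

// Latest news, optionally limited to `count` items.
const latestNewsUrl = (count) => withOptionalCount('/api/noticias/ultimas', count);

// Random news, optionally limited to `count` items.
const randomNewsUrl = (count) => withOptionalCount('/api/noticias/random', count);

// News from one specific site, e.g. siteNewsUrl('Publico').
const siteNewsUrl = (site) => `${BASE_URL}/api/jornal/${encodeURIComponent(site)}`;
```

For example, latestNewsUrl(5) builds the address for the five most recent news items, and siteNewsUrl('Publico') matches the /api/jornal/Publico example above.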

Goals

My main goal is to do some news processing to find interesting stuff. While doing this I created this API, and I want to open it up to all Portuguese developers so they can make interesting stuff as well. Along the way I’m also learning more and more about JavaScript and Node.js, a platform that can be useful for me in the future.

The source code is on GitHub (see below). It might not be the most structured or best-coded Node.js app; I’m still learning ;)