This post is a bit different: instead of talking about product analytics, I decided to share a few scripts that allow me to mine the GDELT project’s data.
Here’s a link to a GitHub repository with a few scripts that will let you easily consume new GDELT events as they are published every 15 minutes.
A little bit on GDELT
Don’t know what GDELT is? It’s one of the most ambitious projects I’ve encountered in the past few years: it scrapes newspaper articles from around the world, stores them in a massive database and provides an impressive number of descriptive values to analyze that stream of data.
And that’s just the start, as they are constantly improving the database. For example, they now let you scan the day’s top trending topics compiled from national television stations’ closed captioning. Amazing!
It’s massive and supported by Google on their BigQuery platform.
The technical requirements
So I’m working on a project with a friend that aims to monitor worldwide protests in real time. And of course, one of the main sources of data for our project is the GDELT Event database.
We could use Google BigQuery, as they provide a really cheap (or is it even free? I can’t remember) way to query that massive database. But that doesn’t really fit the requirements of the discursus.io project. We want to do live monitoring, so we’d like to plug ourselves directly into the stream.
And GDELT, in all its awesomeness, easily lets you do that.
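Concretely, GDELT publishes a small lastupdate.txt file every 15 minutes that points to the newest batch of files. Here’s a minimal Python sketch of that polling logic; the line layout (size, checksum, URL) matches what I’ve seen from the 2.0 feed, but verify it against the live file:

```python
from urllib.request import urlopen  # only needed when polling the live feed

LASTUPDATE_URL = "http://data.gdeltproject.org/gdeltv2/lastupdate.txt"

def parse_lastupdate(text):
    """Return the URL of the latest events export from a lastupdate.txt body.

    Each line holds three whitespace-separated fields: size, checksum, URL.
    We want the line whose URL ends in ".export.CSV.zip" (the event table);
    the other lines point to the mentions and GKG files.
    """
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2].endswith(".export.CSV.zip"):
            return parts[2]
    return None

# To poll the live feed (network call):
# latest_url = parse_lastupdate(urlopen(LASTUPDATE_URL).read().decode("utf-8"))
```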
Once we’ve downloaded those new events, we want to keep only the ones we care about and then commit them to our database. It’s worth mentioning that each event’s encoding is very detailed; just look at the GDELT Event codebook for a description of all the fields.
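Since our project tracks protests, the filtering step boils down to keeping events whose CAMEO root code is 14 (the “Protest” category). A sketch of that filter; the tab-delimited layout and the column position of EventRootCode come from my reading of the GDELT 2.0 event codebook, so double-check them against your copy:

```python
import csv

# Column index of EventRootCode in the GDELT 2.0 event table
# (per my reading of the event codebook -- verify against your copy).
EVENT_ROOT_CODE_COL = 28
PROTEST_ROOT_CODE = "14"  # CAMEO root code for "Protest" events

def filter_protest_events(lines):
    """Yield tab-delimited event rows whose CAMEO root code marks a protest."""
    reader = csv.reader(lines, delimiter="\t")
    for row in reader:
        if len(row) > EVENT_ROOT_CODE_COL and row[EVENT_ROOT_CODE_COL] == PROTEST_ROOT_CODE:
            yield row
```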
So, with all that being said, we now have 3 scripts in our GDELT Mining repository.
- The gdelt.sql script is only provided to give you an idea of how I structured my own MySQL table to store the events I’m interested in.
- gdelt_miner.sh is a bash script that simply goes through the steps to download the latest CSV file from the GDELT 2.0 Event database.
- gdelt_transfer.py is a Python script that reads the latest CSV file, filters the events and commits the ones I’m interested in to my database.
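To give an idea of the transfer script’s last step, here’s a sketch of committing filtered rows with mysql-connector-python. The table and column names below are placeholders matching the spirit of my own schema, not something prescribed by GDELT:

```python
def build_insert_sql(table, columns):
    """Build a parameterized INSERT statement for the given table and columns."""
    placeholders = ", ".join(["%s"] * len(columns))
    return "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(columns), placeholders)

# Usage with mysql-connector-python (requires a live database; names are
# placeholders for my own schema):
# import mysql.connector
# conn = mysql.connector.connect(host="localhost", user="...", database="...")
# sql = build_insert_sql("gdelt_events", ["global_event_id", "event_date"])
# conn.cursor().executemany(sql, filtered_rows)
# conn.commit()
```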
For the miner and transfer scripts, I’ve added cron jobs that trigger them every 15 minutes (of course, the transfer one fires 2–3 minutes after the miner one).
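For reference, the crontab entries look something like this; the paths are placeholders, and the 2-minute offset gives the miner time to finish downloading:

```
# Miner runs on every quarter hour; transfer lags by 2 minutes
# so the CSV download is done first. Paths are placeholders.
*/15 * * * * /path/to/gdelt_miner.sh
2-59/15 * * * * /usr/bin/python3 /path/to/gdelt_transfer.py
```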
Easy! And that’s part of what makes this project so great: how easy it is to adapt it to your own requirements.
If you have any suggestions regarding how to make those scripts better, please send those my way.