How GDELT is cataloging and analyzing the entire planet

Tom Krazit
Published in Structure Series
Mar 1, 2016 · 5 min read

(Editor’s note: This post was written by Kalev Leetaru, founder of the GDELT Project and Structure Data 2016 speaker. Leetaru will be presenting a version of this essay on March 9th at Structure Data; register here.)

What would it look like to use massive computing power to see the world through others’ eyes, to break down language and access barriers, facilitate conversation between societies, and empower local populations with the information and insights they need to live safe and productive lives? By quantitatively codifying human society’s events, dreams and fears, can we map happiness and conflict, provide insight to vulnerable populations, and even potentially forecast global conflict in ways that allow us as a society to come together to deescalate tensions, counter extremism, and break down cultural barriers?

That’s the goal of the GDELT Project.

All locations mentioned in media coverage monitored by GDELT February to July 2015 colored by the primary language of coverage mentioning that location.

The vision is to create a realtime open-data index over our global human society, inventorying the world’s events, emotions, images and narratives as they happen through mass live data mining of the world’s public information streams. GDELT takes advantage of the full power of today’s cloud by creating one of the world’s largest social-science datasets, spanning news, television, images, books, academic literature and even the open web itself; mass machine-translating it all from 65 languages; codifying millions of themes and thousands of emotions; and applying algorithms ranging from simple keyword matching to massive statistical models and deep-learning approaches.

Monitoring the world starts with data. The many streams that feed into GDELT each come with unique processing requirements. Digitized historical books suffer from OCR error, television captioning combines typographical and homophone errors with a lack of capitalization and punctuation, PDFs with complex layouts can result in mangled text, and online news requires specialized algorithms capable of separating article text from surrounding navigation bars, insets, headers and footers. Monitoring local news outlets across the world requires ongoing collaborations with libraries, governments, NGOs, journalism consortiums and citizens to maintain and update a global list of news outlets. The decline of RSS feeds requires a massive crawling infrastructure capable of dynamically learning the structure of each news website and browsing for new articles like a human reader, adjusting its recrawl rate, crawl characteristics and requests in realtime to accommodate network congestion, update rates and site changes.
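
To make that last point concrete, here is a minimal sketch in Python of what an adaptive recrawl policy can look like. It is an illustration of the idea rather than GDELT’s actual crawler: a site’s revisit interval shrinks while crawls keep turning up new articles, and backs off when crawls come back empty or the site responds slowly.

```python
import time


class SiteScheduler:
    """Toy illustration of an adaptive recrawl policy (not GDELT's actual code).

    The revisit interval for a news site shrinks when recent crawls found new
    articles and grows when crawls come back empty or the site is slow,
    bounded between a minimum and maximum interval.
    """

    def __init__(self, url, min_interval=300, max_interval=86400):
        self.url = url
        self.min_interval = min_interval   # never recrawl more than every 5 minutes
        self.max_interval = max_interval   # never wait longer than 24 hours
        self.interval = 3600               # start with an hourly revisit
        self.next_crawl = time.time()

    def record_crawl(self, new_articles, response_seconds):
        # New articles from a responsive site -> crawl sooner;
        # nothing new, or a slow, congested site -> back off exponentially.
        if new_articles > 0 and response_seconds < 5:
            self.interval = max(self.min_interval, self.interval / 2)
        else:
            self.interval = min(self.max_interval, self.interval * 2)
        self.next_crawl = time.time() + self.interval

    def due(self):
        return time.time() >= self.next_crawl
```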

And the majority of this data is written in a language other than English. Processing all of this content requires a mass machine-translation infrastructure capable of translating news articles on nearly every topic imaginable from 65 languages, dictionaries that constantly update with new names, words and colloquial expressions, and the ability to absorb sudden massive surges of foreign-language content in the aftermath of major events like the terrorist attacks in Paris.
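
As one hedged illustration of how a pipeline might absorb such surges (an assumption for this sketch, not a description of GDELT’s actual translation system), a dispatcher can trade translation fidelity for throughput whenever the backlog of untranslated articles grows too large:

```python
from collections import deque


class TranslationDispatcher:
    """Sketch of a surge-tolerant translation queue (illustrative only).

    Incoming foreign-language articles queue up; when the backlog grows past
    a threshold, the dispatcher switches to a faster, lower-fidelity
    translation path so a sudden surge of coverage does not stall the pipeline.
    """

    def __init__(self, surge_threshold=10000):
        self.queue = deque()
        self.surge_threshold = surge_threshold

    def submit(self, article_id, language, text):
        self.queue.append((article_id, language, text))

    def translate_next(self, full_translator, fast_translator):
        if not self.queue:
            return None
        article_id, language, text = self.queue.popleft()
        # Under normal load use the slower, higher-quality model; during a
        # surge fall back to the faster path to keep up with the stream.
        translator = (fast_translator
                      if len(self.queue) > self.surge_threshold
                      else full_translator)
        return article_id, translator(text, source_language=language)
```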

So how do we do it? GDELT brings together several thousand different software packages, models and algorithms written in languages including Perl, Python, Java, C, C++, R, PHP, JavaScript and bash. All of these tools must interact seamlessly in realtime across a vast data fabric and scale transparently, even though many of them were never designed for realtime, real-world datasets. Forty different sentiment-mining packages collectively assess thousands of complex emotions in each article, such as “anxiety” and “goal-based motivation.” There is little precedent for representing up to 4.5 billion emotion scores per day in a portable, compact, yet rapidly parsable format, so many lessons have been learned on the fly.
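
For flavor, here is one way such a record could be packed: a sketch of a sparse key:value encoding, not a specification of GDELT’s actual file format, and the dimension codes in the example are hypothetical.

```python
def encode_scores(scores):
    """Pack a sparse {dimension_code: value} dict of emotion scores into a
    single comma-delimited string of code:value pairs.

    Illustrative sketch of a compact, rapidly parsable encoding. Only
    non-zero dimensions are written, which keeps records small when most of
    the thousands of emotions are absent from a given article.
    """
    return ",".join(f"{code}:{value:g}" for code, value in sorted(scores.items()))


def decode_scores(field):
    """Reverse of encode_scores: parse 'code:value' pairs back into a dict."""
    result = {}
    for pair in field.split(","):
        if pair:
            code, value = pair.split(":")
            result[code] = float(value)
    return result


# Example with hypothetical dimension codes for "anxiety",
# "goal-based motivation" and word count.
record = encode_scores({"anx": 0.12, "goal_motiv": 3.0, "wc": 412})
assert decode_scores(record)["goal_motiv"] == 3.0
```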

Most recently, GDELT has been exploring the world of images. To date, the majority of work on news analysis has focused on textual news, yet imagery plays a critical role in how we understand global events. With the help of deep-learning algorithms, more than half a million images per day are cataloged, identifying objects and activities, logos, text, facial sentiment and even image-based geolocation.
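
The acknowledgments below credit the Google Cloud Vision API, so a reasonable sketch of what annotating a single news image looks like is a call to its public v1 REST endpoint; the API key and image URL here are placeholders.

```python
import requests

# Minimal sketch of annotating one image with the Google Cloud Vision API.
# The endpoint and feature types are the public v1 REST API; the API key and
# image URL are placeholders.
VISION_URL = "https://vision.googleapis.com/v1/images:annotate"


def annotate_image(image_url, api_key):
    request = {
        "requests": [{
            "image": {"source": {"imageUri": image_url}},
            "features": [
                {"type": "LABEL_DETECTION"},     # objects and activities
                {"type": "LOGO_DETECTION"},      # corporate and brand logos
                {"type": "TEXT_DETECTION"},      # text appearing in the image
                {"type": "FACE_DETECTION"},      # faces and facial sentiment
                {"type": "LANDMARK_DETECTION"},  # image-based geolocation
            ],
        }]
    }
    response = requests.post(VISION_URL, params={"key": api_key}, json=request)
    response.raise_for_status()
    return response.json()["responses"][0]


# annotations = annotate_image("https://example.com/news-photo.jpg", "YOUR_API_KEY")
```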

A visualization of the network of people mentioned together in Russian media coverage of the Russian economic sanctions

Of course, once you have all of this data, the real problem is how to actually analyze it. While GDELT offers raw CSV files of all of its computed metadata, it turns out that few organizations have the technical capacity to work with even a fraction of the data. In the early days of GDELT it took hours to days to perform even the most rudimentary extracts. Today, using cloud analytics, 1.2 billion location mentions in a 1.5TB table can be aligned with their corresponding narratives and exported to a cloud mapping platform in less than 60 seconds, co-occurrence networks spanning billions of connections can be visualized in tens of seconds, and millions of books can be sentiment-mined at 340 million words per second.
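
The acknowledgments also credit BigQuery, which hosts public copies of the GDELT tables. As a sketch of what such an extract looks like from Python, the table and column names below are those of the public gdelt-bq dataset, used here for illustration rather than as a description of GDELT’s internal pipeline.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

# Aggregate location mentions from GDELT's public event table so the result
# can be exported to a mapping platform such as CartoDB (credited below).
QUERY = """
SELECT ActionGeo_FullName AS place,
       ActionGeo_Lat      AS lat,
       ActionGeo_Long     AS lon,
       COUNT(*)           AS mentions,
       AVG(AvgTone)       AS avg_tone
FROM `gdelt-bq.gdeltv2.events`
WHERE SQLDATE BETWEEN 20150201 AND 20150731
  AND ActionGeo_Lat IS NOT NULL
GROUP BY place, lat, lon
ORDER BY mentions DESC
LIMIT 10000
"""


def top_locations(project_id):
    client = bigquery.Client(project=project_id)
    return [dict(row) for row in client.query(QUERY).result()]
```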

Most excitingly, it becomes possible to identify patterns visible only at whole-of-dataset scale, using a single line of SQL to perform a million correlations a minute to quantify the patterns of world history as seen through the news. Scale is no longer a limiting factor, with one analysis last month combining 860 billion emotional scores, 1.48 billion location mentions, 89 million events and 1.4 million photographs from 200 million news articles in 65 languages from every country on earth to map global happiness in 2015.
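
The shape of that “single line of SQL” is essentially a self-join plus BigQuery’s CORR() aggregate. Here is a toy version of the pattern, assuming a hypothetical table daily_theme_counts with columns day, theme and mentions; the real analyses run the same shape of query over billions of rows.

```python
# Pairwise correlation of every theme's daily timeline against every other
# theme's, expressed as one SQL statement (table name is hypothetical).
PAIRWISE_CORRELATIONS = """
SELECT a.theme AS theme_a,
       b.theme AS theme_b,
       CORR(a.mentions, b.mentions) AS r
FROM daily_theme_counts AS a
JOIN daily_theme_counts AS b
  ON a.day = b.day AND a.theme < b.theme
GROUP BY theme_a, theme_b
ORDER BY r DESC
"""
```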

The cloud makes all of this possible. GDELT makes extensive use of Google Cloud Platform, using Google Compute Engine both to run its production systems and to rapidly spin up large customized clusters on demand for specialized analyses. In contrast to the large HPC systems of the academic world, the VM model of the cloud, with its fully customizable software stack, its ability to “build” machines to precise specifications and its option to “rent” thousands of processors for seconds at a time, makes possible the rapid prototyping model GDELT relies on, while the stability of the cloud means successful prototypes can transition instantly to production without a single change.

Yet, perhaps most critically of all, despite this immense power, the fully managed world of the cloud makes it possible for a one-person open data initiative like GDELT to monitor the entire planet, focusing on changing the world, not changing hard drives.

The author would like to thank Google and Google Ideas for their support and use of the Google Cloud Platform and the BigQuery and Cloud Vision API teams for all of their help, along with CartoDB for the use of their mapping platform.
