A Big-Data Approach to Uncovering National Secrets


You’d think there’d be room enough in our nation’s warehouses, filing cabinets, and hard drives to store the classified letters, cables, and tapes that make up the stories and histories of the United States. Remember that big warehouse at the end of X-Files (or, if you prefer, Raiders)? Surely our diplomatic history is worth devoting one of those huge storerooms to.

My friend Matt Connelly, a history professor at Columbia University, says that isn’t actually the case, and we’re in a bit of a national historical crisis. The United States government produces a growing mountain of classified documents each year (nearly 77 million in 2010), while the budget to handle declassification and storage is shrinking.

So what happens when a short-staffed government agency (the National Archives) has exponentially more boxes of classified documents than it can possibly process and declassify? Well, as Connelly describes it, the staff takes a small sampling of the documents, and then a sampling of that sampling, and skims them.

Important? Keep. Not so important? Shred.

Ninety-five percent of the stored documents meet the latter fate. There simply isn't funding to store them until they can be analyzed, and in any case the pile is growing far faster than any team of humans could keep up with.

And so, our national history is shredded. How many screenplays, Mother Jones articles, and political thrillers have been shredded already? Who knows, but a lot more will be.

Connelly is leading a group of data scientists and historians at Columbia to develop a big-data solution to this problem. It’s called The Declassification Engine, and it parses language patterns and word usage to discover relevant islands of data (the stories) in the swelling sea of documents.

For instance, users can look for spikes in cable messages that may or may not correspond with news coverage — looking for the behind-the-scenes of media stories, or instances where national events weren’t uncovered by the media at all. From a spike in diplomatic cables, the user can develop a word cloud to find the most prominently used terms, and then dig in for instances of those terms.
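To make that workflow concrete, here's a toy sketch of the two steps described above — flagging months with unusual spikes in cable volume, then counting the most frequent terms in those cables (the raw material for a word cloud). All data and function names here are hypothetical illustrations, not the actual Declassification Engine code:

```python
# Toy sketch: spike detection and term frequency over diplomatic cables.
# All data and names below are hypothetical, for illustration only.
from collections import Counter

# Hypothetical cable counts per month
cable_counts = {
    "1973-01": 410, "1973-02": 395, "1973-03": 1250,
    "1973-04": 430, "1973-05": 405,
}

def find_spikes(counts, threshold=2.0):
    """Flag months whose cable volume exceeds `threshold` times the mean."""
    mean = sum(counts.values()) / len(counts)
    return [month for month, n in counts.items() if n > threshold * mean]

# Hypothetical cable texts from a flagged month
spike_cables = [
    "embassy reports unrest near the capital",
    "embassy requests guidance on unrest",
]

def top_terms(cables, k=3):
    """Count word frequencies across cables -- the input for a word cloud."""
    words = Counter(w for cable in cables for w in cable.split())
    return [w for w, _ in words.most_common(k)]

print(find_spikes(cable_counts))   # months with anomalous cable volume
print(top_terms(spike_cables))     # most prominent terms in those cables
```

A real system would of course use more robust anomaly detection and filter out common stopwords, but the shape of the analysis — find the spike, then surface the terms driving it — is the same.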

It’s a fascinating way to quickly uncover untold stories from our nation’s past, but its efficacy depends on the size of the data set. The Declassification Engine has a ravenous appetite for documents to chew on, and it needs to be fed to work.

The engineers behind this project have started by compiling more than a million government documents, and are looking for more. The long-range goal is to create a platform where everyone can upload their own finds, use the tools to make discoveries, and then tell the world.

I love the idea of a boom in Hollywood political dramas, as 25 more Argos are found in the big pile of declassified documents. And I love the idea that The Declassification Engine is re-imagining that X-Files warehouse as an archive where everything is findable by anyone.

If you’d like to contribute to the development of The Declassification Engine, there’s an Indiegogo page set up for donations. And full disclosure: I’m helping out on The Declassification Engine on a pro-bono basis. You should too!
