Using Dataflow in Clojure to process Google’s huge new WikiReading dataset

Alistair Roche
Google Cloud - Community
Feb 26, 2017

Yesterday I was exploring the new WikiReading dataset, and managed to get its 208GB of uncompressed JSON down to about 50GB by simplifying the structure of the objects — basically removing a bunch of denormalised fields. I used a simple command-line tool: jq. But the files are still a little too big to slurp into a Clojure REPL on my laptop.

Today I want to move from the 18.8 million triples of (document, property, values) to a map of ~4.7 million documents, each associated with a set of (property, values) tuples. This should drastically reduce the size once again.
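To make the target shape concrete, here's the reshaping on a toy example in plain Clojure (made-up triples, not real WikiReading data — and on the full dataset this is exactly the step that won't fit in memory on one machine):

```clojure
(def triples
  [{:document "Paris" :property "country"    :values ["France"]}
   {:document "Paris" :property "population" :values ["2.2 million"]}
   {:document "Kyoto" :property "country"    :values ["Japan"]}])

;; Group the triples by document, keeping (property, values) tuples.
(->> (group-by :document triples)
     (map (fn [[doc ts]]
            [doc (set (map (juxt :property :values) ts))]))
     (into {}))
;; => {"Paris" #{["country" ["France"]] ["population" ["2.2 million"]]},
;;     "Kyoto" #{["country" ["Japan"]]}}
```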

Technically I could do it with jq, but there would be no good way to parallelise the operation, unlike the mapping we did yesterday (where each line is a separate JSON object, and transforming it depends on just that line). To do a “group by” over all the data with jq, we’d have to read the entire data structure into memory and then work on it on a single core. I’d have to spin up a special super-high-memory instance on Compute Engine, and it’d still take ages to process.

This might be a good use for Cloud Dataflow. There’s a great library called datasplash which wraps the Dataflow 1.x SDK. I used it even though Google have moved on to recommending Apache Beam, because clj-beam just isn’t there: when I tried to run the most basic examples, they failed due to cryptic interop issues. Moreover, the API isn’t nearly as high-level as datasplash’s.

The pipeline I’m trying to create is very simple:

  • Read all the JSON files for a given subset (training / validation / testing), turn them into a big collection of Clojure maps (with keys document, property, values)
  • Group by the document
  • Write to a single JSON file for each subset

Compared to what I’ve done in the last few days with Dataflow, it’s simple — linear, and only a few steps. Take a look at datasplash’s examples for some meatier pipelines.
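In outline, the code for this one looks something like the sketch below — bucket paths and step names are placeholders, and I’ve simplified the option handling, so treat it as a skeleton rather than the finished thing:

```clojure
(ns wikireading.group-by-document
  (:require [datasplash.api :as ds]
            [cheshire.core :as json]))

(defn build-pipeline
  "Read line-delimited JSON triples, group them by :document,
   and write one JSON line per document."
  [str-args]
  (let [p (ds/make-pipeline str-args)]
    (->> p
         ;; Each input line is one {document, property, values} triple.
         (ds/read-text-file "gs://my-bucket/wikireading/train-*.json"
                            {:name :read-triples})
         (ds/map #(json/parse-string % true) {:name :parse-json})
         ;; The distributed equivalent of clojure.core/group-by.
         (ds/group-by :document {:name :group-by-document})
         (ds/map (fn [[document triples]]
                   (json/generate-string
                    {:document document
                     :properties (mapv (juxt :property :values) triples)}))
                 {:name :format-output})
         ;; One shard so we end up with a single file per subset.
         (ds/write-text-file "gs://my-bucket/wikireading/train-grouped"
                             {:name :write-grouped :num-shards 1}))
    p))

(defn -main [& args]
  (ds/run-pipeline (build-pipeline args)))
```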

Here’s a heavily commented gist:

This is how I run it:

And this is how it looks in the monitoring interface while it’s running:

Here’s how long it took:

And here’s the output in Cloud Storage:

We’ve gone from 208GB → 50GB → 6.8GB without losing any of the information needed to replicate the results of the paper. Alright! I might even be able to play with this dataset on my laptop.

There’s something satisfying about watching a bunch of computers dance for you without having to write much code. Dataflow takes care of all of the implementation details of the distributed computation. The code I wrote looks similar to how I’d write it for running on a single core, on a single machine — the main difference is that I’m using datasplash’s map, group-by and I/O functions instead of Clojure’s built-ins.
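To see what I mean, here’s roughly what the single-machine version would look like with the built-ins (file names are placeholders, and this is the version that would blow the heap on the full dataset):

```clojure
(require '[cheshire.core :as json]
         '[clojure.java.io :as io])

;; Same logic, one core, one machine: this is what Dataflow distributes for me.
(with-open [rdr (io/reader "train-triples.json")
            out (io/writer "train-grouped.json")]
  (->> (line-seq rdr)
       (map #(json/parse-string % true))
       (group-by :document)          ; realises the whole dataset in memory
       (map (fn [[document triples]]
              (json/generate-string
               {:document document
                :properties (mapv (juxt :property :values) triples)})))
       (run! #(.write out (str % "\n")))))
```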

I do wish that the pipeline construction was based on assembling data structures (like in Onyx) rather than mutating a Pipeline object. In fact, I would use Onyx, except that I’d have to do all my own devops, and learn what Apache Zookeeper is. Another day, perhaps! Or maybe one day Onyx will have its own fully-managed service like Dataflow.

In any case, I’m now much closer to being able to quickly iterate in my exploration of WikiReading. Onwards!
