Using Dataflow in Clojure to process Google’s huge new WikiReading dataset
Yesterday I was exploring the new WikiReading dataset, and managed to get its 208 GB of uncompressed JSON down to about 50 GB by simplifying the structure of the objects, basically removing a bunch of…