Analyzing Big Data with grep and awk

Nick Mitchell · Published in Cloud Computer · Aug 11, 2021

The Cloud has many features of a large, distributed, and very hard-to-use computer. The Cloud indeed offers to manage storage and compute resources for us. Great! But these gems are hidden behind a mess of APIs and YAML. The Cloud is sadly quite unlike the beautiful simplicity of UNIX.

What would a “UNIX way” be for such a Cloud computer? UNIX pipelines are a proven way to handle large amounts of data. After all, this approach of streaming through data with simple and composable utilities was designed precisely for the situation where input size greatly exceeds available local memory. Sounds familiar!

Without any programming, Super analyzes five CommonCrawl data files in parallel to classify web pages by the hosting web server.
The same, this time analyzing 100 CommonCrawl files in parallel.

With UNIX pipelines, one can stream through data with cat and grep and sed, all highly tuned dynamos. It is also easy to combine arbitrary code, written in any language. There is no language or API lock-in.

Our previous blog introduced Super, a way to “bash the Cloud”. Super is a CLI that lets you run standard UNIX pipelines, but in parallel against Cloud data. Super boils the complexity of the Cloud down to a simple super run.

Download Super now! https://supe.run

How effective is this approach at data preparation and analytics tasks? Can we use awk and grep against giant data sets? In this blog, we will demonstrate the affirmative, using CommonCrawl as our corpus.

Example: Word Count

If you have downloaded the Super tool, try issuing super example 1 to see the full UNIX pipeline for this example. Check here for details on two more examples.

Our first example classifies the web by the words used on web pages. As a data source, we use the CommonCrawl WET files, which house the extracted text from each web page and are stored as a collection of gzip-compressed files (the CommonCrawl “segment” files).

A portion of a CommonCrawl WET file.

As the excerpt shows, the file format includes a mix of metadata lines and lines with the extracted text.
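For orientation, each WET record starts with a short block of metadata headers followed by the extracted text. Roughly (the header names follow the WET specification; the values below are made up for illustration):

WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://example.com/index.html
WARC-Date: 2021-08-11T00:00:00Z
Content-Type: text/plain
Content-Length: 1234

...the extracted text of the page, one or more lines...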

Once we have Fetched-and-Decoded the stream of data, we need to Filter out the metadata lines, then Split the remaining text-only lines so that each line contains a single word.

We now have a set of classes, where each line represents a classification by words used. We can then summarize the findings by Aggregating the classifications into a histogram.

We can use UNIX utilities to do all of this. Indeed if /s3 were a mounted filepath, this UNIX pipeline would run unchanged on your laptop!

cat /s3/.../segment*.wet.gz | gunzip -c - |   # Fetch-and-Decode
  grep -Ev '^WARC|Content-' |                 # Filter
  tr ' ' '\012' |                             # Split
  sort | uniq -c                              # Aggregate

Behind the scenes, Super takes care of implementing what it means to cat a Cloud-based filepath like /s3. It will also automatically schedule one parallel job per file matched by the “glob” segment*.wet.gz.

To filter out irrelevant lines, grep -Ev '^WARC|Content-' works well. We challenge others to find an approach that combines the performance and simplicity of that pipeline stage. To have each line contain only one word, a simple tr ' ' '\012' does the trick; \012 is the octal representation of newline. We can then pipe this into sort | uniq -c, which nicely tallies up the word classes into a histogram.
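As a toy illustration of that Aggregate stage (independent of CommonCrawl; the three input words here are made up), uniq -c prefixes each distinct line with its count:

printf 'web\nserver\nweb\n' | sort | uniq -c
#   1 server
#   2 web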

Beyond this initial pipeline, the fun as a data analyst begins. For example, it may be important to eliminate uninteresting words. To exclude words shorter than 5 characters, add grep ..... to your pipeline. Desire case insensitivity? Try piping in tr '[:upper:]' '[:lower:]'.

And so on! We can add as many stages as needed, to iteratively refine our classification.
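Putting those refinements together, the pipeline might grow like this (a sketch; the case-folding and length-filter stages can go anywhere after the Split):

cat /s3/.../segment*.wet.gz | gunzip -c - |   # Fetch-and-Decode
  grep -Ev '^WARC|Content-' |                 # Filter
  tr ' ' '\012' |                             # Split
  tr '[:upper:]' '[:lower:]' |                # fold to lower case
  grep ..... |                                # keep words of 5+ characters
  sort | uniq -c                              # Aggregate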

Finally, we must join the histograms that we have created, in parallel against each file, into a consolidated histogram. Super allows us to use normal UNIX pipeline syntax to do this, too: pipe the results into a final | sort | uniq -c. This will combine the individual per-input-file histograms into a histogram across all files. We have just done a streaming join, again without the need to code to any API!
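If you ever want to merge such per-file counts by hand, a one-line awk script is enough (a sketch, not Super's internal mechanism; it assumes the concatenated uniq -c outputs, lines of the form "count word", arrive on stdin):

# Sum the per-file counts for each word, then order by total frequency.
awk '{ counts[$2] += $1 } END { for (w in counts) print counts[w], w }' |
  sort -rn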

With “Super Bash”, Performance is Surprisingly Great

This approach not only gives you the flexibility to mix and match utilities written in whichever language you prefer, but also achieves very high computational density. The performance extracted from each CPU is high.

The Super “bash the Cloud” approach shows consistently higher performance per file, across three classification use cases — all without coding to any parallel API!

These figures compare the classification throughput (in MiB/sec) of: Super (the series labeled “Bash”, in cyan); Ray (in yellow); and plain Python (in white).

While both Ray and Super can scale across multiple files and multiple nodes, to highlight the computational density of each approach, these figures show the per-node performance of analyzing one large input file. The figures then explore how well each approach utilizes the cores on that node, by scaling from 1 core to 8 cores per node — i.e. here we are highlighting the “scale up” parallelism of each approach.

Unsurprisingly, the plain Python curve is flat as we increase core count. This is expected; without some help from an extra API and runtime, Python cannot parallelize. Ray can help with this, and nicely scales up as core count increases on the node; but the figures show that Ray seems to have overheads relative to Super’s plain bash approach.

Note how a plain bash approach benefits from multiple cores, despite no explicit coding to a parallel API. This is because UNIX pipelines express “pipeline parallelism”; e.g. in a | b | c, UNIX runs the three tasks concurrently, and manages the flow control between them for us.
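You can observe this on your laptop with time (a quick sketch, assuming a local copy of a single segment file named segment.wet.gz): the reported user plus sys CPU time can exceed the wall-clock time, because gunzip, grep, tr, and sort are running concurrently on separate cores.

# Each stage is a separate process; the kernel schedules them across
# cores, and the pipes provide the flow control.
time (gunzip -c segment.wet.gz \
      | grep -Ev '^WARC|Content-' \
      | tr ' ' '\012' \
      | sort | uniq -c > /dev/null)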

Join us in bashing the Cloud: https://supe.run
