Exploring Big Data with a CLI

Published in The Graphical Terminal, Jun 21, 2021

When working with large data sets, fruitful analysis benefits from a healthy dose of exploration and discovery. We need to know the shape of the directory structure, we need exemplar file paths against which we can dry-run our analyses, and we need to learn the encoding and schema of the data files. With a few graphical enhancements via the Kui CLI framework, we believe that exploring big data with a CLI can be a joy: rapid exploration yields discoveries about the data that we can leverage. We will use the CommonCrawl data set to drive the discussion.

Using Kui to navigate my S3 buckets via quick point-and-click.

These discovery tasks are well suited to a CLI-driven approach, but bash needs some help. Data files are deeply nested and have long, cryptic names. One file from CommonCrawl is stored in commoncrawl/crawl-data/CC-MAIN-2021-17/segments/1618039626288.96/warc/ and is named CC-MAIN-20210423011010-20210423041010-00639.warc.gz.

There are a number of command line tools that can help with this task. For example, the aws CLI from Amazon supports a shell-like syntax for exploring S3 buckets. The Minio client mc supports a similar syntax. With these CLIs, you may enumerate the contents of a bucket via aws s3 ls myBucket/.

Note the trailing slash. With these tools, there is an uncanny valley: the syntax and command set are indeed bash-like. However, both aws and mc differ in a number of ways from bash; e.g. in the importance of this trailing slash.
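One way to sidestep the trailing-slash difference is to normalize paths before handing them to the tool. The following is a minimal sketch, not something that ships with aws or mc: the helper names with_trailing_slash and s3ls are hypothetical, and the --no-sign-request flag assumes a publicly readable bucket such as commoncrawl.

```shell
#!/usr/bin/env bash

# Hypothetical helper: append a trailing slash unless one is already present,
# so that a listing shows a directory's contents rather than the directory entry.
with_trailing_slash() {
  case "$1" in
    */) printf '%s\n' "$1" ;;
    *)  printf '%s/\n' "$1" ;;
  esac
}

# Hypothetical wrapper around the real `aws s3 ls` command.
# --no-sign-request skips request signing, which is enough for public buckets.
s3ls() {
  aws s3 ls "s3://$(with_trailing_slash "$1")" --no-sign-request
}
```

With these definitions, s3ls commoncrawl/crawl-data and s3ls commoncrawl/crawl-data/ should behave the same way, which is closer to what a bash user expects from ls.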

CommonCrawling via the AWS and Minio CLIs. This discovery task took 51 seconds.

The GIF above shows a scenario of navigating the CommonCrawl directory structure to find an exemplar file, and then studying its schema. We start with the aws CLI, using its s3 ls command to list the contents of directories. After a few rounds of copy and paste, we arrive at a candidate file. At this point, we switch to the Minio client. It has a handy cat operation, which we use to pipe the compressed file through gunzip and into a standard UNIX head command. This process took 51 seconds.
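The cat-then-head step can be captured in a small reusable function. This is a sketch under stated assumptions: preview_gz is a hypothetical helper name, and the remote example in the comment assumes an mc alias named s3 has already been configured for AWS S3.

```shell
#!/usr/bin/env bash

# Hypothetical helper: decompress gzip data arriving on stdin and
# print only the first N lines (default 20).
preview_gz() {
  gunzip -c | head -n "${1:-20}"
}

# Example against a remote object (assumes an mc alias named "s3" exists):
#   mc cat s3/commoncrawl/crawl-data/CC-MAIN-2021-17/segments/1618039626288.96/warc/CC-MAIN-20210423011010-20210423041010-00639.warc.gz | preview_gz 10
```

Because head exits as soon as it has printed its lines, the pipeline stops early and only a small prefix of the compressed file needs to be transferred.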

This experience is, we feel, at least in the right ballpark. There is no need to invent a way to navigate a directory structure. The UNIX ls/cat/head/gunzip paradigm has proven itself over the decades. We just need to tweak it a bit; 51 seconds is pretty high, just to find one file. What if I need to backtrack to explore other files or directories? These latencies add up quickly.

Performance (milliseconds) of directory listing of cc-index from CommonCrawl using three tools: the AWS CLI, the Minio CLI, and Kui.

Note from the animated GIF that the aws s3 commands are quite slow. We confirmed that we were using the latest version of Amazon’s CLI. This table compares aws s3 with the Minio client mc for the task of ls cc-index/. mc is considerably faster, especially at the 90th percentile (denoted p90 in the table) and above.

Both are still tediously slow. Waiting upwards of a second for every directory hop adds to the annoyance factor of the experience.
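For readers who want to gather similar numbers themselves, the following sketch shows one way to time a command repeatedly and report a percentile. It is not the harness used for the measurements in this article. The helper names percentile and time_ms are hypothetical, the percentile uses the nearest-rank method, and the timing loop assumes GNU date (for the %N nanosecond format).

```shell
#!/usr/bin/env bash

# Hypothetical helper: read one number per line on stdin and print the
# p-th percentile using the nearest-rank method (rank = ceil(p/100 * N)).
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      exact = NR * p / 100
      rank = int(exact)
      if (rank < exact) rank++      # round up to get the ceiling
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# Hypothetical helper: run a command N times and print each wall-clock
# duration in milliseconds, one per line. Assumes GNU date, which supports %N.
time_ms() {
  local n="$1"; shift
  local i start end
  for ((i = 0; i < n; i++)); do
    start=$(date +%s%N)
    "$@" > /dev/null 2>&1
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
  done
}
```

For example, time_ms 20 aws s3 ls s3://commoncrawl/cc-index/ --no-sign-request | percentile 90 would print a p90 latency in milliseconds for twenty listings of cc-index/.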

We feel that this style of exploration could benefit from a small number of focused improvements, listed here in what we consider priority order:

1. Eliminate the uncanny valley. Almost-bash can lead to aggravation.
2. Focus on latency. Annoyance starts at around 200ms.
3. Support mouse navigation to further reduce the latency of exploration.
4. Provide quick previews of compressed data.
5. When essential, render the schematic structure of common file formats.

CommonCrawling via point-and-click with Kui, with preview of large documents. This scenario took 14 seconds, more than 3.5 times faster than using the conventional CLIs.

And so we added these features to the Kui tool. Kui is a framework for extending normal CLI experiences, and is part of the Kubernetes suite of tools.

In the GIF shown here, we use Kui to perform the same directory exploration task. Note how Kui is fast, and lets us click to navigate the directory structure. These long sequences of numbers are hard to type, and copying long strings is cumbersome. If we are already using a mouse to copy some random-looking directory name, we might as well move the mouse and just click. With Kui, the task completes in 14 seconds, versus 51 seconds with the conventional CLIs: more than 3.5x faster.

When we finally arrive at an exemplar, a final point-and-click yields a preview of the compressed content. At the time of this writing, Kui does not yet recognize the syntax of these WARC files. We think that would be a great future addition. It should slot in seamlessly, further enhancing the experience.

Download Kui: https://github.com/kubernetes-sigs/kui/releases. After you have downloaded Kui, launch it and execute ls /s3/aws/commoncrawl. You should see a directory listing of the top level of the CommonCrawl data sets. If you have already set up aws s3 access to your AWS buckets, try ls /s3/aws, and you should see a directory listing of your own buckets. If you are a user of IBM Cloud Object Storage, try up --ibm to verify your connection.

Written by Nick Mitchell