Exploring Big Data with a CLI
When working with large data sets, fruitful analysis benefits from a healthy dose of exploration and discovery. We need to know the shape of the directory structure, we need exemplar file paths against which we can dry-run our analyses, and we need to learn the encoding and schema of the data files. With a few graphical enhancements via the Kui CLI framework, we believe that exploring big data with a CLI can be a joy: rapid exploration yields discoveries about the data that we can leverage. We will use the CommonCrawl data set to drive the discussion.
These discovery tasks are well suited to a CLI-driven approach, but bash needs some help. Data files are deeply nested, and have long and cryptic names. One file from CommonCrawl is stored in commoncrawl/crawl-data/CC-MAIN-2021-17/segments/1618039626288.96/warc/ and is named CC-MAIN-20210423011010-20210423041010-00639.warc.gz.
There are a number of command line tools that can help with this task. For example, the aws CLI from Amazon supports a shell-like syntax for exploring S3 buckets. The Minio client mc supports a similar syntax. With these CLIs, you may enumerate the contents of a bucket via aws s3 ls myBucket/.
Note the trailing slash. These tools sit in an uncanny valley: the syntax and command set are bash-like, yet both aws and mc differ from bash in a number of ways, e.g. in the importance of this trailing slash.
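One way to see why the trailing slash can matter: object stores such as S3 keep a flat namespace of keys, and a "directory listing" is a prefix match that rolls up everything past the next delimiter. The following is a minimal Python sketch of that model, not the implementation of aws or mc. The function name list_prefix and the keys are hypothetical, chosen only for illustration.

```python
def list_prefix(keys, prefix, delimiter="/"):
    """Mimic a delimiter-based object-store listing over a flat list of keys.

    Returns (common_prefixes, objects). Keys that continue past the next
    delimiter are rolled up into a single directory-like entry.
    """
    common_prefixes, objects = [], []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        cut = rest.find(delimiter)
        if cut == -1:
            # No further delimiter: this key is a direct child.
            objects.append(key)
        else:
            rolled_up = prefix + rest[:cut + 1]
            if rolled_up not in common_prefixes:
                common_prefixes.append(rolled_up)
    return common_prefixes, objects


# Hypothetical keys, loosely modeled on the layout described above.
KEYS = [
    "crawl-data/CC-MAIN-2021-17/example-a.gz",
    "crawl-data/CC-MAIN-2021-17/example-b.gz",
    "crawl-data/CC-MAIN-2021-10/example-c.gz",
    "cc-index/example-d.gz",
]

without_slash = list_prefix(KEYS, "crawl-data")
with_slash = list_prefix(KEYS, "crawl-data/")
print(without_slash)  # only the directory entry itself
print(with_slash)     # the entries inside the directory
```

Under this model, omitting the slash matches the directory name as a prefix and returns just that one entry, whereas bash's ls would have descended into it.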
The GIF to the left shows a scenario of navigating the CommonCrawl directory structure to find an exemplar, and then to study its schema. We start with the aws CLI, using its s3 ls command to list the contents of directories. After a few rounds of copy and paste, we arrive at a candidate file. At this point, we switch to the Minio client. It has a clever cat operation that we employ to pipe the compressed file to a standard UNIX head command. This process took 51 seconds.
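The idea behind piping a compressed file into head is that only the beginning of the stream needs to be decompressed. A rough Python sketch of the same idea follows. The helper head_gz is a hypothetical name, and an in-memory buffer of synthetic lines stands in for the remote file.

```python
import gzip
import io
from itertools import islice


def head_gz(stream, n=10):
    """Return the first n decoded lines of a gzip-compressed byte stream."""
    # gzip.open decompresses lazily, so stopping after n lines avoids
    # decompressing the rest of the stream.
    with gzip.open(stream, mode="rt", encoding="utf-8", errors="replace") as text:
        return [line.rstrip("\r\n") for line in islice(text, n)]


# Stand-in for a remote object: synthetic lines compressed in memory.
payload = "".join(f"record {i}\n" for i in range(100_000)).encode("utf-8")
compressed = io.BytesIO(gzip.compress(payload))

print(head_gz(compressed, n=3))
```

The errors="replace" setting is a deliberate choice for previews: a file whose encoding we have not yet learned should still yield something readable.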
This experience is, we feel, at least in the right ballpark. There is no need to invent a way to navigate a directory structure. The UNIX ls/cat/head/gunzip paradigm has proven itself over the decades. We just need to tweak it a bit; 51 seconds is a long time just to find one file. What if we need to backtrack to explore other files or directories? These latencies add up quickly.
Note from the animated GIF that the aws s3 commands are quite slow, even though we confirmed that we were using the latest version of Amazon’s CLI. This table shows a comparison of aws s3 to the Minio client mc for the task of ls cc-index/. The latter is quite a bit better, especially at the 90th percentile (denoted p90 in the table) and above.
Both are still tediously slow. Waiting upwards of a second for every directory hop adds to the annoyance of the experience.
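For readers unfamiliar with the p90 notation, a percentile summarizes the slow tail of a set of repeated timings: p90 is a latency that 90% of the runs met or beat. Here is a small Python sketch using the nearest-rank definition. The sample latencies are made up for illustration and are not the measurements from the table.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in milliseconds, NOT the measurements from the table.
latencies_ms = [310, 295, 420, 350, 1200, 305, 330, 980, 315, 340]

for p in (50, 90, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

A handful of slow outliers barely moves the median but dominates p90 and p99, which is why the tail is the better indicator of how an interactive session feels.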
We feel that this style of exploration could benefit from a small number of focused improvements, listed in what we consider to be priority order:
- Eliminate the uncanny valley. Almost-bash can lead to aggravation.
- Focus on latency. Annoyance starts at around 200ms.
- Support mouse navigation to further reduce the latency of exploration.
- Provide quick previews of compressed data.
- When essential, render the schematic structure of common file formats.
And so we added these features to the Kui tool. Kui is a framework for extending normal CLI experiences, and is part of the Kubernetes suite of tools.
In the GIF shown here, we use Kui to perform the same directory exploration task. Note how Kui is fast, and lets us click to navigate the directory structure. These long sequences of numbers are hard to type, and copying long strings is cumbersome. If we are using a mouse to copy some random-looking directory name, we might as well move the mouse and just click. With Kui, the task completes in 14 seconds, about 3.6x faster than the 51 seconds above.
When we finally arrive at an exemplar, a final point-and-click yields a preview of the compressed content. At the time of this writing, Kui does not yet recognize the syntax of these WARC files. We think that would be a great future addition. It should slot in seamlessly, further enhancing the experience.
Download Kui: https://github.com/kubernetes-sigs/kui/releases. After you have downloaded Kui, launch it and execute ls /s3/aws/commoncrawl. You should see a directory listing of the top level of the CommonCrawl data sets. If you have already set up aws s3 access to your AWS buckets, try ls /s3/aws, and you should see a directory listing of your own buckets. If you are a user of IBM Cloud Object Storage, try up --ibm to verify your connection.