Bash the Cloud

Nick Mitchell
Published in Cloud Computer · 5 min read · Aug 11, 2021

Cloud is the anti-UNIX. It is a world where nothing really is a file, and computation must be allocated, planned, and utilized by careful employment of a hundred miserable CLIs and APIs.

That is not to say we are anti-Cloud. We love that resources can be pooled, and that the costs of idling and management can be amortized across many and disparate consumers. The Cloud truly can be a modern-day operating system, interpreting our desires and managing hardware with precision and transparency.

What would the UNIX way be for a Cloud computer? These animated GIFs capture our desire: UNIX pipelines, but against Cloud data and compute resources. We have implemented this approach in a tool called Super.

Example 1: Using plain UNIX pipeline syntax, we can classify the processors running in our Cloud.
Example 2: Using plain UNIX cp src dst syntax, but auto-parallelized and run against the Cloud. We have transparently used Cloud compute resources to mediate copies between two Cloud providers.

You can download it now: https://supe.run

UNIX Pipelines in the Cloud

We believe that a rich subset of Cloud tasks is amenable to a lightweight and bash-like approach to analyzing data.

When one crafts a UNIX pipeline for local operations, the focus is at a pretty high level: on the data. We are concerned with where it originates (so we can cat the right set of filepaths), where it must eventually settle (so we can > redirect to the proper destination filepath), and how the schema of the data can be manipulated with existing off-the-shelf tools. APIs are largely a secondary concern.
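For instance, a typical local pipeline (the file names here are hypothetical, just for illustration) reads like a description of the data flow rather than of any API:

# originate: cat the matching logs; manipulate: grep and cut; settle: redirect to a file
cat logs/2021-08-*.log | grep ERROR | cut -f 3 -d ' ' | sort | uniq -c > error-histogram.txt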

We desire the same for the Cloud: freedom from the burdens of allocating and scheduling resources, of keeping data and computation close, of acquiring and propagating data authorization to the compute engines, of flow control, chunking, and caching, and of deciding when to scale up versus scale out. All of this, along with the APIs needed to direct each of these disparate concerns, should be hidden behind a simple utterance: super run.

With some careful thought put into the tooling story, we have found that this is indeed possible. We can compile high-level utterances into executions that have both scale-up and scale-out parallelism, without coding to a parallel API such as Spark or Ray.

Surprisingly, we have found that the resulting executions also have high computational density. By always streaming data and leveraging fast C-coded utilities, the MiB/sec/core of such pipelines can often exceed most other approaches. This knock-on effect has been observed by others. We will detail this in a subsequent blog.

Example 1: Classify the processors in the Cloud

The first animated GIF shown at the start of this blog illustrates a pipeline that classifies the CPU types of a Cloud provider. This example fires off a command line for execution in the Cloud, and presents the result on the console; it includes both fork/join parallelism and (a bit of) pipeline parallelism:

❯ super run -p100 -- 'lscpu | grep "Model name" | cut -f 2 -d ":"' | sort | uniq -c
91 Intel Core Processor (Broadwell, IBRS)
9 Intel Xeon Processor (Cascadelake)

Interpretation: fork 100 pipelines that execute the quoted portion of the pipeline, and join the results into a histogram via sort | uniq.

UNIX pipelines extract a fair degree of pipeline parallelism for free!
Furthermore, we can automatically extract fork/join parallelism, entirely avoiding the need to code to a parallel API.

Users of this approach will need some way to express which portions of the pipeline are done in the Cloud (i.e. in the fork), and which are done on our laptops (after the join). In this example, the quoted part is forked and executed in the Cloud, and the streaming output of those jobs is fed into a local pipeline: | sort | uniq -c. The final output is presented on the user's console, as per usual with UNIX pipelines. GNU Parallel adopts a similar syntactic ploy.
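For comparison, here is a sketch of how GNU Parallel expresses the same split, syntax-wise (run locally, every fork would of course report the same CPU model, so this is only to show the quoting convention):

# the quoted part is forked 100 times; sort | uniq -c joins the streamed results locally
seq 100 | parallel -N0 'lscpu | grep "Model name" | cut -f 2 -d ":"' | sort | uniq -c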

Example 2: Globbing the Cloud

The second animated GIF at the start of the blog illustrates a Cloud-based copy. By leveraging the “globbing” capability of UNIX shells, the set of matched files can represent an implicit fork:

super run -- cp /s3/src/*.txt.gz /s3/dst

Interpretation: expand the glob pattern, and fork a copy job for every matched source file.
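Conceptually, this fork behaves like the local loop below, except that each iteration becomes an independent Cloud job rather than a local process (a sketch only; the /s3 paths are not actually mounted on your laptop):

# fork: one copy job per matched source file
for f in /s3/src/*.txt.gz; do
  cp "$f" /s3/dst/ &
done
wait   # join: wait for every copy to finish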

Even simple copy tasks may benefit greatly from taking place entirely in the Cloud. Doing so avoids downloading and re-uploading the data. Furthermore, Cloud providers often do not charge for in-Cloud data access.

Big Idea 1: Bash helps us to be Optimal, without Optimizing

Optimization and parallelization are hard, unforgiving jobs. A bash pipeline, in contrast, benefits from data prefetching and pipeline parallelism, without any extra coding. We have found that common pipelines against CommonCrawl data have a net pipeline parallelism of 2–3. This is zero-code parallelism, brought to you by UNIX.
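As a hypothetical illustration (the file name is ours, not from a real crawl), the three stages below run as separate processes, so decompression overlaps with filtering and counting at no extra coding cost:

# decompress, filter, and count run concurrently; the kernel handles the overlap
gunzip -c pages.warc.wet.gz | grep 'WARC-Type: conversion' | wc -l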

Big Idea 2: Bash is Anti-Viral

When everything is a stream, utilities written in any language, whether C, Python, Perl, or Bash, can be composed into a pipeline. This approach also allows us to leverage UNIX standards such as grep and sed and awk, which are versatile and perform amazingly well. They are also backed by a large corpus of examples and guidance on StackOverflow and the like.
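For example, a Python one-liner can sit between two C-coded utilities with no glue code (the file and field layout here are hypothetical):

# grep is C, the middle stage is Python, and sort | uniq -c aggregates; all speak plain streams
grep -v '^#' events.tsv \
  | python3 -c 'import sys; [print(l.rstrip("\n").split("\t")[0]) for l in sys.stdin]' \
  | sort | uniq -c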

Big Idea 3: Bash has a simple Introspection Model

It is easy to splice in debugging functionality at any point in a pipeline. For example, one can insert tee or tools like the pipeline viewer pv where needed: gunzip -c input.txt.gz | pv | … tells you the rate of decompression with only a minor syntactic change. This spliced pipeline has the same output, and nearly indistinguishable performance.
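Similarly, tee lets us capture an intermediate stream for inspection without changing the pipeline's behavior (file names hypothetical):

# save the decompressed stream to a file while the rest of the pipeline proceeds unchanged
gunzip -c input.txt.gz | tee decompressed.txt | grep ERROR | wc -l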

Better yet, the pv utility works with any pipeline. There is no need to find the Go/NodeJS/Python variant of this functionality, code to its API, find a way to disable it when in production, etc.

Join us in bashing the Cloud: https://supe.run
