Published in graalvm

Babashka: How GraalVM Helped Create a Fast-Starting Scripting Environment for Clojure

This is a guest blog post by Michiel Borkent.

Introduction

Babashka is a fast-starting scripting environment for Clojure. It has become a popular choice among Clojure developers for their scripting needs due to the fast startup time and convenience it offers. It is compiled as a standalone binary using GraalVM Native Image.

Clj-kondo (a linter and analyzer for Clojure) and clojure-lsp (an LSP server for Clojure) are also available as standalone fast-starting binaries. GraalVM Native Image has been a real game-changer for Clojure tooling!

In this article we will describe how we created a fast-starting native executable version of Clojure, babashka, that is ideal for scripting. We will also look at a few caveats and trade-offs that we made.

Babashka allows scripting using Clojure. Clojure is a dynamically typed Lisp that emphasises functional programming and immutability. It runs on the JVM. It is a general-purpose programming language: everything you would use Java for, you can also do in Clojure.

The JVM is a powerful platform and Clojure is a great language, but if you are interested in running scripts, the startup time of the JVM might not be a good fit. Babashka gives us the best of both worlds: a high-level, productive language such as Clojure and very fast startup times.

Fast Startup for Scripting with GraalVM Native Image

Babashka supports a large subset of the Clojure language, but it doesn’t require a JVM since it’s built as a standalone binary using GraalVM Native Image. It contains several useful built-in libraries and Java classes.

Typical use cases of babashka include build scripts, command-line utilities, small web applications, task runners, git hooks, AWS Lambda functions, and anywhere else you may want to use Clojure where fast startup and/or low resource usage matters.

To get a glimpse of what you can do with babashka and why it also may be interesting for Java developers, let’s evaluate an expression on the command line. The babashka binary is called bb.

$ bb -e '(.exists (new java.io.File "README.md"))'
true

In the above example we do some interop on the JVM class java.io.File: we call the constructor with the string "README.md" and then invoke the instance method .exists. In Java you would write this as new File("README.md").exists() but since babashka is a Clojure runtime, we use prefix notation. Note that the number of parentheses is exactly the same though :). To get a notation which is closer to method chaining, you can use the -> (thread-first) macro:

(-> (new java.io.File "README.md") (.exists))
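
As a small illustration that is not part of the original article, the sketch below chains a few more method calls and then uses macroexpand-1 to show that -> only rewrites the chain into nested prefix calls. The longer chain (getAbsoluteFile, getParentFile, isDirectory) is just an example choice, and expanded is a hypothetical name.

```clojure
;; A longer chain reads left to right, much like Java method chaining:
;; new File("README.md").getAbsoluteFile().getParentFile().isDirectory()
(-> (new java.io.File "README.md")
    (.getAbsoluteFile)
    (.getParentFile)
    (.isDirectory))

;; -> is a macro: it rewrites the chain into nested prefix calls
;; before the code is evaluated.
(def expanded
  (macroexpand-1 '(-> (new java.io.File "README.md") (.exists))))

(prn expanded)
;; prints: (.exists (new java.io.File "README.md"))
```

So the threaded form and the nested form from the first example are the same program; the macro only changes how it reads.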

Don’t worry if you don’t understand the Clojure code in this article in detail. The intent is to give you a general idea of why babashka might be useful if you are using Clojure.

Of course we could evaluate the above expression using JVM Clojure too:

$ time clj -M -e '(.exists (new java.io.File "README.md"))'
true
0.75s user 0.06s system 166% cpu 0.491 total

But note that this takes almost half a second (on a MacBook Air M1) and with more complicated programs and dependencies, it will take significantly longer. To see a pronounced example of this, we can run babashka itself as a JVM Clojure program. Because it has a large number of built-in libraries and classes that are loaded at startup, it takes a while:

$ time clj -M -m babashka.main -e '(.exists (new java.io.File "README.md"))'
false
14.77s user 0.54s system 214% cpu 7.123 total

That’s seven seconds. With babashka compiled using GraalVM Native Image, the same example runs a lot faster:

$ time bb -e '(.exists (new java.io.File "README.md"))'
false
0.01s user 0.01s system 83% cpu 0.022 total

Just 22 milliseconds! Babashka scripts start about as fast as bash and Python scripts, and the Clojure community generally prefers writing Clojure to writing bash scripts. Babashka aims to fill this gap in the Clojure ecosystem.

Let’s look at a more useful example. To deal with file operations, babashka comes with the fs library and for spawning processes there is the process library. In some Clojure programs, we have to compile Java classes to be able to use them. You can do that as follows:

compile.clj:

#!/usr/bin/env bb

(ns javac
  (:require [babashka.fs :as fs]
            [babashka.process :refer [shell]]))

(when (seq (fs/modified-since "MyClass.class" ["MyClass.java"]))
  (println "Compiling Java")
  (shell "javac MyClass.java"))

This script checks if the MyClass.java source file is newer than the MyClass.class file and if so, it runs javac to compile the class. If not, the script does nothing.

The #!/usr/bin/env bb line indicates that the shell should run this script with the bb interpreter, and Clojure supports this syntax by treating it as a comment. This allows you to put scripts on your shell's path and treat them as global utilities. As this isn't supported on all platforms, babashka provides the bbin utility for installing scripts globally.

Running Clojure without the Clojure Compiler in a Native Executable

The JVM Clojure compiler turns programs into JVM bytecode and this JVM bytecode can be compiled to a fast-starting standalone binary with GraalVM Native Image. This means that nearly any Clojure program can achieve fast startup once it’s been compiled with the GraalVM native-image tool. The purpose of babashka, however, is to provide a tool that can run arbitrary Clojure code without having to run it through this two-step compilation process.

Babashka solves this problem by both pre-compiling many broadly useful Clojure libraries, and including an interpreter for executing arbitrary Clojure. The interpreter is called SCI, which stands for Small Clojure Interpreter.

To create the babashka binary, we configure the interpreter so that it can access pre-compiled library functions. We do this for two reasons: 1) so that the interpreter can look up functions used in scripts; and 2) so that the native-image tool does not eliminate those functions as a result of its reachability analysis.

Below is an example of how we could configure an SCI-based environment. It’s a stripped down version of how babashka is actually written:

(ns babashka.main
  (:require
   [cheshire.core :as json]
   [sci.core :as sci]))

(def ctx (sci/init
          {:namespaces {'cheshire.core {'parse-string json/parse-string
                                        'generate-string json/generate-string}}}))

(defn -main [_ expr]
  (let [evaluated (sci/eval-string* ctx expr)]
    (prn evaluated)))

The ctx is the environment for the interpreter. The interpreter doesn't give you access to the host, but only what you provide to it via the ctx argument. We provide as one of the namespaces cheshire.core, a JSON library, and expose two of its functions: parse-string and generate-string.

We can then run this on the JVM:

$ clj -M -m babashka.main -e "(+ 1 2 3)"
6

and also use the built-in JSON library:

$ clj -M -m babashka.main -e "(require '[cheshire.core :as json]) (json/generate-string [1 2 3])"
"[1,2,3]"

When we want to let users use classes, we have to explicitly provide them to the context. When we don’t do that, we see:

$ clj -M -m babashka.main -e '(new java.io.File "README.md")'
Execution error (ExceptionInfo) at sci.impl.utils/throw-error-with-location (utils.cljc:39).
Unable to resolve classname: java.io.File

When we change the context with the file added:

(def ctx (sci/init
          {:namespaces {'cheshire.core {'parse-string json/parse-string
                                        'generate-string json/generate-string}}
           :classes {'java.io.File java.io.File}}))

we see that the previous example works:

$ clj -M -m babashka.main -e '(new java.io.File "README.md")'
#object[java.io.File 0x68ee7b3b "README.md"]

The constructor call (new java.io.File "README.md") is implemented in SCI via reflection. So the java.io.File constructor is looked up dynamically at runtime and then invoked via the Java reflection APIs. To make this work, you have to add a file named reflect-config.json to configure the native-image tool. In babashka this is automated from a large list of built-in classes which then get added to the SCI context and the generated reflect-config.json file.
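
To make the two ideas in this paragraph concrete, here is a minimal sketch, not babashka's or SCI's actual code. It assumes a tiny hypothetical class list and derives from it both a :classes map for the SCI context and the entries for reflect-config.json (kept as Clojure data here). It then performs a reflective constructor call through clojure.lang.Reflector, which ships with Clojure. The names classes, sci-classes, reflect-config and construct are made up for illustration, and SCI may perform reflection differently internally.

```clojure
(import '(clojure.lang Reflector))

;; Hypothetical, tiny stand-in for the large list of built-in classes.
(def classes '[java.io.File java.lang.String])

;; 1) The :classes map for the SCI context: symbol -> Class object.
(def sci-classes
  (into {} (map (fn [sym] [sym (resolve sym)])) classes))

;; 2) The entries for reflect-config.json, still as Clojure data.
;;    A JSON library such as cheshire could serialize this to the file.
(def reflect-config
  (mapv (fn [sym]
          {:name (str sym)
           :allPublicConstructors true
           :allPublicMethods true})
        classes))

;; A reflective constructor call: the class is only known at runtime,
;; so the matching constructor is looked up from the argument values.
(defn construct [klass & args]
  (Reflector/invokeConstructor klass (object-array args)))

(def file (construct (get sci-classes 'java.io.File) "README.md"))

(prn (.getName file))
;; prints: "README.md"
```

Deriving both artifacts from one list keeps the interpreter context and the native-image configuration from drifting apart: a class that scripts can name is also a class the binary can reflect on.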

To build babashka, first an “uberjar” is created (also known as a “fat” JAR file, a JAR file containing all the program’s dependencies) and then native-image is called like this:

$ native-image -jar babashka-standalone.jar \
    --no-fallback \
    --initialize-at-build-time=clojure,cheshire \
    bb

And then we have our fast-starting interpreter:

$ time ./bb -e "(+ 1 2 3)"
6
0.01s user 0.01s system 81% cpu 0.016 total

A Whirlwind Tour of Babashka

Libraries

Babashka comes with the libraries that you would expect in a scripting environment. It has HTTP client libraries (including java.net.http), a web server, templating, command line parsing, java.time, etc. It also supports loading libraries written in babashka from Maven, Clojars (the Clojure community equivalent of Maven Central), or git. Examples include the awyeah-api library for interacting with AWS and the deep-diff2 library for pretty-printing differences between data. Babashka even supports adding libraries dynamically at runtime:

#!/usr/bin/env bb

(require '[babashka.deps :as deps])
(deps/add-deps '{:deps {lambdaisland/deep-diff2 {:mvn/version "2.7.169"}}})
(use 'lambdaisland.deep-diff2)
(pretty-print (diff {:a 1 :b 2} {:a 1 :b (range 10)}))
;; {:a 1, :b -2 +(0 1 2 3 4 5 6 7 8 9)}

Cross Platform

One benefit of writing scripts in babashka, unlike in bash, is that they become automatically cross-platform, since the JVM is cross-platform and GraalVM Native Image targets all the major platforms.

REPL-Driven Development

Another benefit is that you can use REPL-driven development, which Clojure developers have come to rely on. Unlike interactive shells in scripting languages such as Python or Ruby, Clojurians use an editor-connected REPL to drive the whole development process. It allows incremental construction and precise inspection of your running program. For a good explanation of the REPL-driven approach, watch the talk Stop writing dead programs by Jack Rusher.

Concurrency

Since Clojure is a language which makes concurrency easy, and native executables produced by GraalVM Native Image support multi-threading, babashka supports it too. There is no global interpreter lock (GIL) like in some other scripting languages, such as Python. With virtual threads around the corner, this will become even more interesting.
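
As a rough sketch of what this looks like in a script (not taken from the original article), the example below runs five blocking tasks in parallel with Clojure's built-in future. The function slow-square is a hypothetical stand-in for real work such as an HTTP request.

```clojure
;; Simulate a slow, blocking task (for example an HTTP call).
(defn slow-square [n]
  (Thread/sleep 200)
  (* n n))

;; mapv starts all futures eagerly, each on its own thread;
;; deref then blocks until each result is ready.
(def results
  (let [futures (mapv #(future (slow-square %)) (range 5))]
    (mapv deref futures)))

(prn results)
;; prints: [0 1 4 9 16]
```

Because the tasks run on separate threads, the whole batch should take roughly as long as one task rather than five.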

Task Runner

Babashka also comes with a task runner that works similarly to make, but you get to use Clojure instead of a bash-like DSL.
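
Tasks are declared in a bb.edn file. The following is a minimal sketch that reuses the javac example from earlier in this article; the task names clean and build are made up for illustration.

```clojure
;; bb.edn, placed in the project root
{:tasks
 {:requires ([babashka.fs :as fs])

  clean {:doc "Remove the compiled class file"
         :task (fs/delete-if-exists "MyClass.class")}

  build {:doc "Compile MyClass.java from scratch"
         :depends [clean]
         :task (shell "javac MyClass.java")}}}
```

With this file in place, bb tasks lists the available tasks and bb build runs clean first and then build, much like a make target with a prerequisite.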

Pods

For code that babashka cannot run from source (for example, code that relies on non-built-in classes), you can write a pod. Pods can be written in Clojure and compiled using the native-image tool, but they can also be implemented in other languages, as long as they implement the pod protocol. Pods are compiled and distributed as standalone binaries. Communication back and forth between babashka and the pod is performed via JSON or other serialization formats. Examples of pods are the filewatcher pod and the sqlite3 pod. A pod is started only once in the lifetime of a babashka program and is then communicated with on every pod function call.
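
From the script's point of view, using a pod looks roughly like the sketch below. It assumes the sqlite3 pod is published in the pod registry as org.babashka/go-sqlite3 at version 0.1.0 and exposes the pod.babashka.go-sqlite3 namespace; check the pod's README for the current coordinates. The database path and table are hypothetical.

```clojure
#!/usr/bin/env bb

(require '[babashka.pods :as pods])

;; Assumption: this pod name and version are available in the pod registry.
;; The pod binary is downloaded on first use and started once.
(pods/load-pod 'org.babashka/go-sqlite3 "0.1.0")

;; After loading, the pod's namespaces can be required like normal ones.
(require '[pod.babashka.go-sqlite3 :as sqlite])

(def db "/tmp/example.db") ; hypothetical path

;; Each call below is sent to the running pod process.
(sqlite/execute! db ["create table if not exists notes (id integer primary key, body text)"])
(sqlite/execute! db ["insert into notes (body) values (?)" "hello from a pod"])
(prn (sqlite/query db ["select body from notes"]))
```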

Challenges

Build-Time Initialization

Clojure code currently needs to be initialized at build time due to how the Clojure compiler and runtime are set up: a lot happens in static initializers and this work cannot be delayed to runtime. This is why we need the --initialize-at-build-time=clojure,cheshire build option. Every compiled Clojure namespace needs to be added to this list, which is automated using the graal-build-time library. This leads to the following semantics for natively compiled Clojure programs:

(ns my-namespace)

;; this happens at build time
(def random-number (rand-int 1000))

(defn foo []
  ;; this happens at run time
  (let [random-number (rand-int 1000)]
    (inc random-number)))

If we compile the above program with native-image the var random-number will always have the same value every time we run the binary. This can be fixed by delaying the initialization using delay:

(def random-number (delay (rand-int 1000)))

Note that this is good practice anyway, since similar issues may occur with Clojure AOT compilation.
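
A short sketch of how the delayed var is then used: only the delay object is created when the namespace is initialized, and the body runs the first time it is dereferenced. The function foo mirrors the earlier example, rewritten here as an illustration.

```clojure
;; Only the delay object is created when the namespace is initialized
;; (at build time in a native image); its body has not run yet.
(def random-number (delay (rand-int 1000)))

(defn foo []
  ;; @ (deref) runs the body on first use, at run time, and caches the result.
  (inc @random-number))

(prn (realized? random-number)) ;; prints: false
(prn (= (foo) (foo)))           ;; prints: true, the value is computed once per process
(prn (realized? random-number)) ;; prints: true
```

Every run of the binary now computes its own number, while repeated calls within one run still see the same value.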

Performance

SCI can run a combination of pre-compiled code and interpreted code. SCI is written in Clojure itself and is agnostic about GraalVM: it is not a Truffle interpreter, so it won’t benefit from Truffle optimizations and JIT compilation. For typical scripts, code that is run with SCI is generally fast enough, in the same range as Python and bash. The sweet spot for babashka + SCI is therefore mostly startup time and the batteries-included aspect. When programs contain lots of hot loops, it may be better to use Clojure on the JVM:

$ time bb -e "(time (loop [val 0 cnt 10000000] (if (pos? cnt) (recur (inc val) (dec cnt)) val)))"
"Elapsed time: 651.339667 msecs"
10000000
0.66s user 0.02s system 99% cpu 0.681 total

$ time clj -M -e "(time (loop [val 0 cnt 10000000] (if (pos? cnt) (recur (inc val) (dec cnt)) val)))"
"Elapsed time: 10.641333 msecs"
10000000
1.03s user 0.06s system 185% cpu 0.588 total

Note that bb performs slightly better than Python here:

loop.py:

x = 10000000
val = 0
while (x > 0):
    x = x - 1
    val = val + 1
print(val)

$ time python3 /tmp/loop.py
10000000
0.85s user 0.02s system 97% cpu 0.894 total

Note: it is possible to build a custom version of babashka with a different set of libraries included, so you can precompile code to get better performance for the parts where it matters.

Custom Types

In JVM Clojure you can write custom types that implement Java interfaces. Those compile down to new JVM classes. In babashka, we don’t have the ability to create new JVM types at runtime, since all JVM types have already been created ahead of time.

For example, in Clojure you can create an anonymous class that implements a Java interface, such as java.io.FileFilter like this:

(def file-filter
  (reify
    java.io.FileFilter
    (accept [this f]
      (.isDirectory f))))

(map str (.listFiles (java.io.File. ".") file-filter))
;;=> ("./.clj-kondo" "./.lsp" "./.git")

To support this specific use case, babashka contains an implementation of java.io.FileFilter that is compiled ahead of time and then dispatches to the accept function given by the user at runtime. This approach requires a pre-selected list of reify-able implementations and combinations of interfaces. Currently SCI doesn't support implementing Java interfaces on defrecord and deftype which is the biggest source of incompatibility with existing Clojure libraries.
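
The dispatch idea can be sketched in plain Clojure as follows. This is an illustration under simplifying assumptions, not SCI's actual implementation: a factory that is compiled ahead of time creates the FileFilter, and the accept logic is an ordinary function handed in later. The names file-filter-from-fn, user-accept and dirs are hypothetical.

```clojure
;; Compiled ahead of time into the binary: the JVM class behind this reify
;; exists at build time, but its behavior comes from a function passed in later.
(defn file-filter-from-fn [accept-fn]
  (reify java.io.FileFilter
    (accept [_ f]
      (boolean (accept-fn f)))))

;; Supplied at runtime, for example by an interpreted script:
(def user-accept (fn [f] (.isDirectory f)))

(def dirs
  (.listFiles (java.io.File. ".") (file-filter-from-fn user-accept)))

(prn (every? #(.isDirectory %) dirs))
;; prints: true
```

No new JVM class is needed per script, only a new function, which is why each supported interface (or combination of interfaces) has to be chosen before the binary is built.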

Combining AOT and Interpretation

It could be interesting to explore Clojure on Truffle or a Clojure run inside Java on Truffle, but the combination of pre-compiled libraries and code running inside a Truffle context while having good startup time poses its own challenges. A benefit of how SCI is currently set up is that it’s easy to combine pre-compiled and interpreted code. As SCI is implemented in Clojure it was also easy to support ClojureScript, which is the dialect of Clojure that compiles to JavaScript. SCI on JavaScript enabled writing a Node.js version of babashka, called nbb and a browser version called scittle.

Binary Size

When including libraries and classes in a native executable, it’s a good idea to keep an eye on the size of the executable. When libraries dynamically require Clojure namespaces (not on the top level, so not at build time), the size of the executable can become much larger than necessary, even if you load those namespaces beforehand. So sometimes libraries need some light patching before they are compiled with the native-image tool to keep the binary as small as possible. See the dynaload library which helps with this problem. There is also the trade-off of which libraries and classes are actually going to be useful in the long term and whether they are worth the additional binary size. Babashka is currently around 75MB (20MB zipped).

Targeting Multiple Platforms

Users of babashka expect the bb binary to be available on all major platforms (Linux, macOS, and Windows) and architectures (AMD64 and ARM64). On Linux you can compile binaries as static or dynamic executables. Only musl-compiled static binaries work with Alpine Docker images. Babashka currently releases seven different pre-compiled binaries to cater to these needs, built on four different CI platforms: CircleCI, AppVeyor, Cirrus and GitHub Actions. In 2019, when babashka first got started, AppVeyor was one of the few available options for building Windows executables. There are more options now. At the time of writing, Cirrus is one of the few platforms offering support for building macOS aarch64 images.

Conclusion

Without GraalVM Native Image there would be no babashka. It is an essential tool to make babashka a fast-starting scripting environment for Clojure!

Thanks to Alex Miller, Rahul Dé, Daniel Higginbotham, Martin Kavalar, Anthony Caumond and Eugen Stan for proofreading this article and providing feedback.

About the author

Michiel Borkent, also known as @borkdude on the web, is an open source software developer who loves Clojure. He is the author of clj-kondo, babashka, SCI, nbb and other popular Clojure developer tools. He uses GraalVM Native Image to apply Clojure in new contexts such as scripting.
