Many research organizations are behind closed doors, restricted from openly participating with the wider community by their commercial parent companies. The Allen Institute for AI (AI2) is a different sort of place, founded on the idea that open science for the common good is the way to move technology forward. As a philanthropic non-profit, we openly share our research. Naturally, we use the web as a vehicle for this — what better way to distribute content to a global audience?
With several research teams working across multiple domains in AI since 2014, it’s no surprise we’ve ended up with a lot of small web applications. Many of these were initially ill-maintained, collecting cobwebs in some corner of a datacenter in Amazon or Google’s backyard. We weren’t thrilled about this and decided last year that we needed a solution.
But wait, websites are simple…right? Can’t you just open your favorite text editor, plop down a few `<marquee>` tags, and drop the result in an S3 bucket? You’re not wrong — in fact, when a static site can do the trick, it’s the way to go. The gotcha is when things get a bit more complicated. Most of our applications include:
- Some dynamic code (probably Python) that processes input and does something interesting with a model. This code likely uses a lot of memory, and might even need a GPU to provide a prediction in real-time.
- A pre-trained model — it’s probably a moderately sized file sitting in AWS S3 or GCS. You may or may not want to embed it into your application at build time. It depends.
- A database that the application both reads and writes to.
- A complex visualization with a lot of bells and whistles — explaining the inner workings of AI is much trickier than making sure an ad renders correctly.
So…while we may still throw that prediction in a `<marquee>` tag (I mean, who wouldn’t?), there’s more we need to consider. Throw some smart, motivated people into the mix and it’s no surprise that you end up with a lot of applications that are similar but also just a little bit different. Now imagine having 50+ of these apps, spread across multiple AWS accounts, each with its own Load Balancer, DNS entries, TLS certs, and more — yeah, ok, now I get it.
The plethora of potential solutions is in itself overwhelming. People often criticize the front-end community for moving too quickly — lately it feels like DevOps moves at a similar pace. Again, it’s not too surprising, as each cloud vendor is trying to provide something that’ll cause people to flock to their service — especially if they’re not the current ringleader (AWS).
To narrow down the possible solutions we started by assessing what we needed:
- Our solution had to be flexible. As an AI research institute we’re naturally pushing boundaries and trying things that require unique technical solutions. A rigid, inflexible one would stifle innovation — we knew we couldn’t do that.
- The typical boilerplate, like TLS certs, DNS, rate-limiting, and other general aspects of operating an HTTP server, needed to be taken care of without any effort from the user. Our researchers and engineers had more important things to worry about — they shouldn’t be bothered with these details.
- Administration needed to be easy. We’re a small engineering organization that does a lot of big things. Our solutions need to be simple to operate and maintain so that we can continue to try new things. We don’t have an army to throw at the problems we tackle — we have to enable a small, adept group of ICs to operate at a level far beyond what’s typical.
While evaluating potential solutions, I found myself nostalgic for prior times spent running a small Apache web server. In high school, my friends and I each maintained a handful of small personal websites. We ran a single Apache server to host them — some used PHP, others were completely static — and relied on virtual hosting to route traffic to the right place. We each had our little slice of the server and managed to write all sorts of terrible code to our heart’s content without disrupting one another. An Apache web server wouldn’t work for our use case, of course, but I wanted to replicate the same notion of isolation and centralization. None of us were worried about how requests got to our tiny little slice — we just focused on delivering the magic we were so enamored with.
Another source of inspiration was the success of Docker both across the tech industry and internally at AI2. Another team at our organization had long before identified the value of running model training and evaluation in containers and distributing them using Kubernetes. This solution was wrapped up and dubbed Beaker — after the muppet that we all know is a big fan of moving fast and breaking things. The product was and is very successful at AI2 — to this day people run experiments in Beaker, both in the cloud and on our on-premise clusters. I wanted to replicate that success and make developing public-facing web applications an easy, seamless process.
A few design documents and helpful conversations later, we ultimately arrived at the solution we’ve been running with great success for the last year — called Skiff. And while the name doesn’t stick with the muppet theme, it launched a nautical metaphor that the team and I have extended and continue to extend. Turns out there’s a much larger collection of nautical terms than there are muppets — how’s that for designing for the future?
The beauty of Skiff is that it’s quite simple. So much so that it’s hard for me to even call it a platform — really, it’s just a bunch of great open-source software glued together to provide a standard archetype for developing and deploying web applications. Turns out all those smart people I was lucky to be mentored by over the years, who told me to “keep it simple, stupid,” were on to something.
“Simplicity is the ultimate sophistication.” — Leonardo da Vinci
Here’s a 10,000-foot view, and one of the many diagrams that were part of the design exercises we went through:
Skiff starts by utilizing containers, something that (as noted) was previously well established at AI2. This is really the only requirement we ask our users to adhere to — if their application can be packaged up by `docker`, then we can run it.
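For instance, a containerized Skiff app needs little more than an ordinary Dockerfile. This is a generic sketch, not our actual template — the base image, paths, and entrypoint are all hypothetical:

```dockerfile
# Illustrative Dockerfile for a small Python demo app.
FROM python:3.8-slim

WORKDIR /app

# Install dependencies first so this layer is cached between builds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy in the application code (and any bundled model artifacts).
COPY . .

# Skiff only needs the container to listen on a port; TLS, DNS, and
# routing are all handled outside the container.
EXPOSE 8000
CMD ["python", "app.py"]
```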
This was an immediate benefit in that we were able to take many applications that were already containerized and quickly transition them to the new solution. Just like that, a number of EC2 nodes were turned off with the flick of a wrist. I could picture Bezos shaking his fist at us as those cobweb-laden, low-utilization VMs were shut down.
For new applications we help people get started by offering a template that includes a Flask API, a TypeScript and React-based UI, and an NGINX reverse proxy. The proxy serves the UI in production from disk and routes traffic to `webpack`'s development server in local environments. It’s also a nice way to avoid the `Access-Control-Allow-Origin: *` header that we see all too often.
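The proxy arrangement boils down to a couple of `location` blocks. Here’s a hedged sketch of the production side — the ports and paths are assumptions, not our actual template:

```nginx
# Illustrative NGINX config: one origin serves both the UI and the API,
# so the browser never needs a permissive CORS header.
server {
    listen 8080;

    # In production, serve the compiled UI straight from disk.
    location / {
        root /usr/share/nginx/html;
        try_files $uri /index.html;
    }

    # Route API calls to the Flask backend over the same origin.
    location /api/ {
        proxy_pass http://localhost:8000;
        proxy_set_header Host $host;
    }
}
```

In local development the first `location` would instead `proxy_pass` to webpack’s dev server, keeping the two environments otherwise identical.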
The next piece of the puzzle involves building images and getting them to Kubernetes. We chose and continue to use Google Cloud Build. As users push changes to `master` we take their code, package things up in Docker, and push the image to Google’s container registry. Our system then uses Jsonnet to generate the Kubernetes config required for running the application and `kubectl apply`'s it. Shortly thereafter the user’s workload is humming away in the cloud.
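To give a flavor of the Jsonnet step, here’s a simplified sketch — not our actual templates, and the names and values are made up:

```jsonnet
// Illustrative: a tiny per-app config expands into full Kubernetes objects.
local app = 'my-demo';
local image = 'gcr.io/my-project/' + app + ':latest';

{
  deployment: {
    apiVersion: 'apps/v1',
    kind: 'Deployment',
    metadata: { name: app },
    spec: {
      replicas: 2,  // a few replicas smooths over restarts and OOMs
      selector: { matchLabels: { app: app } },
      template: {
        metadata: { labels: { app: app } },
        spec: {
          containers: [{
            name: app,
            image: image,
            ports: [{ containerPort: 8080 }],
          }],
        },
      },
    },
  },
}
```

The win is that each application declares a handful of values while the boilerplate lives in one shared place.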
Google Cloud Build has done the job we’ve asked of it for the last year, but could definitely be better. Their build triggers still show up as opaque hashes in GitHub’s UI, they lack support for `git` features like `submodules`, and the YAML they require for declaring one’s build is clunky and verbose. We’re actively looking at porting to GitHub Actions — if that solution had existed prior to Skiff’s conception we probably would’ve jumped on the bandwagon.
Once the application gets to the cluster there’s a bunch of machinery that gets things set up correctly for the user. Luckily all of this is transparent to them and remarkably easy for my teammates and me to maintain. We use GKE as our Kubernetes provider — Google’s support is, simply put, best in class. We use cert-manager to provision Let’s Encrypt TLS certificates in minutes. And last, but surely not least, the Kubernetes Ingress NGINX Controller handles TLS termination and forwards requests to the right place.
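Wiring an application into that machinery amounts to a single Ingress object. A hedged sketch — the host, issuer name, and service details are placeholders:

```yaml
# Illustrative Ingress: the NGINX controller terminates TLS and routes
# traffic, while cert-manager watches the annotation and provisions a
# Let's Encrypt certificate into the named Secret.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-demo
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod  # issuer name is a placeholder
spec:
  tls:
    - hosts: [my-demo.example.org]
      secretName: my-demo-tls  # cert-manager writes the certificate here
  rules:
    - host: my-demo.example.org
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-demo
                port:
                  number: 8080
```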
The setup described served us well for the first few months — in fact, after getting things rolling our team did nothing but transition people to the new stack. We grew quickly — as a small internal team we were having our “startup moment”. Suddenly teams we hadn’t even anticipated were asking if they could use our solution. It must’ve been that nautical metaphor that was all too easy to extend, or maybe they’d all been struck by an inevitable TLS certificate expiration outage and wanted to say goodbye to that problem forever.
As the number of workloads grew we realized we needed a few more bits and pieces to provide an excellent user experience, so we wrote some additional software to help people effortlessly launch new applications.
The last piece of the puzzle is an application we call the Bilge Pump. This small, reliable piece of machinery is responsible for scanning the cluster for ephemeral environments and removing them. We let users create new application environments from any Github branch via the click of a button. These environments have a configurable expiration, after which the Bilge pumps them back out to sea. This has proven to be a vital mechanism for fast iteration — now code reviews can include a live demo and a chance for product managers and others throughout the org to review and suggest changes.
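At its heart, the Bilge Pump is just a TTL check over the cluster’s ephemeral environments. Here’s a simplified sketch of that decision logic in Python, assuming (hypothetically) that each environment carries its creation time and a TTL — not our actual implementation:

```python
from datetime import datetime, timedelta

def is_expired(created_at: datetime, ttl_hours: float, now: datetime) -> bool:
    """Return True once an ephemeral environment has outlived its TTL."""
    return now >= created_at + timedelta(hours=ttl_hours)

def environments_to_remove(envs, now):
    """Given (name, created_at, ttl_hours) tuples, pick the ones to pump back out to sea."""
    return [name for name, created_at, ttl in envs
            if is_expired(created_at, ttl, now)]
```

A real version would list the environments via the Kubernetes API and delete the expired ones; the pure decision logic above is the part worth unit testing.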
There are, of course, a few pieces I’m leaving out — as with any system it’s impossible to describe all the details. That said, what’s important is that this solution has allowed a team of 3 engineers to run a diverse collection of over 90 web applications that handle millions of requests per month over the last year with 99.95% availability. What’s even more important is that the pace of development has continued to increase throughout the year — signaling that we’re enabling exactly what we intended to.
You might be tempted to dismiss this as a hype-infused post from YAKF (yet another Kubernetes fan). Sure, I’ll admit I’m a big proponent of the technology. But I’m also not shy about admitting that it’s been a complex bit of software to fully understand and operate. We made just about every mistake in the book, and if it weren’t for GKE’s handling of some of the low-level details I’d probably be telling a very different story. That said, I can also say that the resulting power, flexibility, and resiliency afforded by Kubernetes has been essential to the success of Skiff. Turns out if you automatically restart someone’s application when it OOMs and run a bunch of replicas, you can greatly improve the end-user experience.
I’m really excited to see what else we bring to Skiff in the coming months and years, and the impactful, forward-thinking applications our researchers develop using it. I also can’t wait to continue to give nautically themed presentations to the company — the good ol’ Captain’s Log never gets old.
⛵️ Smooth sailing out there friends. I think I’m going to go spin up an Apache server, as this post has me (again) feeling nostalgic. I might abstain from writing PHP though as that’s something I don’t miss too much.
Sam Skjonsberg is an engineer on the ReViz team at AI2, building tools and infrastructure that help teammates share their work in new and compelling ways. When he’s not spinning up Apache web servers, you’ll find him riding his bike or adventuring in the PNW with his wife, two dogs, and soon their son (they’re about to welcome a new little one!).