Engineering a fast feedback infrastructure

Olivier Corradi · Snips Blog · Jun 25, 2015 · 9 min read

A tech company’s potential to create value comes from its ability to prototype quickly and iterate fast: the infrastructure shouldn’t be a hurdle in that process. In fact, it should do exactly the opposite: give us the means to go even faster. At Snips, we believe that everyone in the team should be able to run and monitor any code on any server at the press of a button.

The time and effort needed to go from an idea to a prototype running in production should be as small as possible. Prototypes provide the insights that tell us what works and what needs to be improved, which lets us avoid premature optimization and focus on what matters. Making logs and run-time metrics straightforward to record and explore therefore goes a long way toward making this iterative process more efficient and enjoyable.

In this post, we want to share the first steps we have taken in the direction of a true infrastructure as a service approach, using exclusively open-source tools. We will touch upon:

  • how we run one-off or recurrent jobs
  • how we run long-running services
  • how we inspect and monitor services and jobs
  • how we push services in production

A Docker-based infrastructure

When we move code from local machines to the shared infrastructure, we must guarantee that the deployed code will work exactly like it does on our development machines despite potential differences in package versions, OS distributions and hardware configurations. This is why we use Docker to build a standardized environment.

Docker is an open platform for creating and running software containers. A container has its own isolated user space, network interface, file system and processes, a bit like a virtual machine. Since isolation happens at the OS level, it is less strict than in a virtual machine, but instantiating a container is very fast because there is no separate OS to boot.

A container is created from an image, which defines the initial content of the container in which the process will run. An image is built from a script called a Dockerfile, which starts from a base image (such as a raw Ubuntu distribution) and lists the commands to run to install the packages and files this particular image needs.
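
As a minimal sketch, a Dockerfile might look like this (package and file names are illustrative):

FROM ubuntu:14.04
# Install the packages this image needs
RUN apt-get update && apt-get install -y python git
# Add the files the job will run
ADD run.sh /opt/docker/run.sh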

Docker is smart about space and uses a concept of image hierarchy to only save diffs (called “layers”) when it makes sense. When building several images based on the same base image, Docker will only store the original image once. It then only stores the differences from the original image. At Snips, we have created our own base image including the packages we use most and our internal libraries, and use it for most of our builds. This allows us to save a lot of space when building hundreds of derived images which mostly consist of adding a few additional packages and a script to the base image.
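
For instance, two images built from the same base share all of its layers on disk; only their final layers differ (image and package names below are hypothetical):

> cat job-a/Dockerfile
FROM snips/base
RUN pip install numpy

> cat job-b/Dockerfile
FROM snips/base
RUN pip install pandas

> docker build -t job-a job-a/ && docker build -t job-b job-b/
# The snips/base layers are stored once; docker history <image>
# lists the layers each image is made of.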

We believe each Docker image should describe as atomic a piece of functionality as possible. For instance, it is best to run a database and an application in two separate containers, linked through the facilities Docker provides. Each service is thus isolated, which makes maintenance and scaling easier. This extends the micro-services philosophy that we apply to our internal and public applications.
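
Our sky tool wraps this for us, but with raw Docker such a link would look something like this (container names are illustrative):

# Start the database in its own container
> docker run -d --name db cassandra
# Start the application linked to it; Docker injects the
# database address into the application's environment
> docker run -d --link db:db my-application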

All of our Docker images are stored in a private repository, the private registry, which is shared by all our servers. This lets us push an image once and use it everywhere.
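
The workflow is plain Docker (the registry host name below is hypothetical):

# Tag the image with the registry address, then push it once
> docker tag my-application registry.snips.internal:5000/my-application
> docker push registry.snips.internal:5000/my-application
# Any server in the infrastructure can now pull and run it
> docker pull registry.snips.internal:5000/my-application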

Everything at Snips runs in a Docker container.

Developers and data scientists maintain their own images. At its core, maintaining the infrastructure means making sure Docker works. Provisioning a new machine is therefore easy: install Docker ;-)
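
On a fresh Linux server, Docker's official convenience script does the job:

> curl -sSL https://get.docker.com/ | sh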

Running a one-off job

We have a small home-made Docker wrapper which enables us to start a new instance of our base image very easily. A simple sky container in the command prompt gives you a new instance (container) running in our cloud. It’s more or less like SSH’ing into a random machine of our infrastructure, except the environment is completely isolated and standardized:

> sky container
7ffb9a3b28e3# pwd
/opt/docker
7ffb9a3b28e3# ps
PID TTY TIME CMD
1 ? 00:00:00 zsh
50 ? 00:00:00 ps

You can then git clone a repository, run some long-running code, and get notified once the process has finished (using a small in-house tool called snitch):

7ffb9a3b28e3# snitch --notify-email snipster@snips.net -c "sleep 5; echo done"
2015-02-25 11:19:06.788241[57]: Started "sleep 5; echo done"
done
2015-02-25 11:19:12.791770[57]: Command successfully finished
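
snitch itself is in-house, but the mechanics are simple. Here is a minimal Python sketch of a snitch-like wrapper (it assumes a local SMTP relay; addresses are placeholders):

import subprocess, sys, smtplib
from email.mime.text import MIMEText

def run_and_notify(email, command):
    # Run the command, streaming its output to the terminal
    status = subprocess.call(command, shell=True)
    # Then email the exit status to whoever asked to be notified
    subject = 'Command %r exited with status %d' % (command, status)
    msg = MIMEText(subject)
    msg['Subject'] = subject
    msg['From'] = 'snitch@snips.net'
    msg['To'] = email
    smtplib.SMTP('localhost').sendmail(msg['From'], [email], msg.as_string())
    return status

if __name__ == '__main__':
    # e.g. python snitch.py snipster@snips.net "sleep 5; echo done"
    sys.exit(run_and_notify(sys.argv[1], sys.argv[2]))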

You can even detach from the container and re-attach later whenever you want to check on it. This is great for running one-off jobs, but what happens if you want to run a long-running service like a REST API?

Deploying a service

You have a piece of code working on your laptop and you wish to deploy it so the rest of the company can start testing it. It can be anything from a new algorithm to a new API, a new dashboard or even a new database.

The first step is to build a Docker image containing all the required binaries and code. A simple Dockerfile extending our custom base image does the trick in a few lines (see the sketch after the list below). Once the image is built, we need to describe where the container instance will be started and how it will be connected to the rest of the infrastructure. For this purpose, we use a standardized service configuration file containing:

  • the service maintainer (name and email)
  • the service docker image and image version
  • the service interfaces (ports, DNS…)
  • the service dependencies (databases, file system volumes…)
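
Because the base image already contains most of our dependencies, the Dockerfile stays short; a hypothetical example for a Python API:

FROM snips/base
# Add the application code and install its dependencies
ADD . /opt/docker/my-application
WORKDIR /opt/docker/my-application
RUN pip install -r requirements.txt
# The command the container runs at startup
CMD ["python", "server.py"]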

The configuration format is inspired by Fig (now Docker Compose) and Maestro. Here is an example for a server requiring one Cassandra database:

my-application:
  maintainer: snipster@snips.net
  image: my-application
  requires: [ database ]
  instances:
    my-application-prod:
      version: 1.0-SNAPSHOT
      env:
        RUNTIME_ENV: production
        VIRTUAL_HOST: my-application.snips.net
      limits:
        memory: 1g
    my-application-dev:
      version: 1.1-dev
      env:
        RUNTIME_ENV: staging
        VIRTUAL_HOST: my-application-dev.snips.net
      limits:
        memory: 1g

database:
  image: cassandra
  version: 1.0-SNAPSHOT
  env:
    RUNTIME_ENV: production
  limits:
    memory: 10g
The instances key in the my-application section lists all the instances of the application to run (here, a production and a development instance). Each instance inherits the parent properties of the configuration file (here maintainer, image, and requires), so each instance spawns its own database alongside it. To connect the application to its database, the sky tool injects environment variables into the application container describing which address and port to connect to.
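
For a linked Cassandra container, the application can then simply read those variables; the exact names below assume Docker's link convention with a "db" alias and Cassandra's native port 9042:

import os

# Address and port injected into the container's environment
host = os.environ['DB_PORT_9042_TCP_ADDR']
port = int(os.environ['DB_PORT_9042_TCP_PORT'])

# The application opens its database connection as usual,
# without hardcoding where the database actually runs.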

The VIRTUAL_HOST environment variable is particularly handy: it binds a container application to a public or private (on our VPN) URL by simply adding a line to the config file. Requests are also load-balanced across instances sharing the same VIRTUAL_HOST value.
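
This is the pattern popularized by nginx-proxy-style reverse proxies: a proxy container watches Docker events and routes requests according to each container's VIRTUAL_HOST. With raw Docker it amounts to:

# Two instances with the same VIRTUAL_HOST get load-balanced
> docker run -d -e VIRTUAL_HOST=my-application.snips.net my-application
> docker run -d -e VIRTUAL_HOST=my-application.snips.net my-application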

Because the maintainer's email is in the configuration file, alerts triggered by warnings or errors are sent directly to the person who can fix them.

Once the service configuration file has been written and the Dockerfile built, starting the service is as easy as:

> sky service start my-application
Starting my-application-prod-database.. DONE
Starting my-application-prod.. DONE
Starting my-application-dev-database.. DONE
Starting my-application-dev.. DONE

and the running process logs can be obtained by running sky service logs my-application.

A dashboard on our private intranet also lets us monitor the status of services: gather logs, and interrupt or restart failing ones. This makes it straightforward for new team members to understand at a glance how the Docker infrastructure works and to inspect what is going on with their containers.

We plan to extend the configuration format with lifecycle checks (an HTTP check, a port check or a custom command) to ensure each service runs smoothly and is restarted as soon as it goes down.
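
As a sketch, such a check could be declared alongside an instance (the syntax is hypothetical, since the feature did not exist yet at the time of writing):

my-application-prod:
  version: 1.0-SNAPSHOT
  check:
    http: /health   # restart the instance if this endpoint stops answering
    interval: 30s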

Running a recurrent job

Recurrent jobs are not that different from services and thus are expressed in the same configuration framework. For example, sending out a report email periodically would look like:

still-alive:
  maintainer: snipster@snips.net
  image: base:0.7
  command: echo "Hey I'm still alive!"
  every: 1 day at 17:00
  notify-on-completion: true

A “scheduler” service watches for configuration changes and launches Docker containers according to the schedules in this config.
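
A minimal Python sketch of such a scheduler loop, with the config parsing and schedule format heavily simplified:

import subprocess
import time
import yaml  # PyYAML

def job_is_due(every, now):
    # Simplified: only understands "1 day at HH:MM" schedules
    at = every.split('at')[-1].strip()
    hour, minute = at.split(':')
    return now.tm_hour == int(hour) and now.tm_min == int(minute)

def run_job(job):
    # Each run is a throwaway container executing the job's command
    subprocess.call(['docker', 'run', '--rm', job['image'],
                     'sh', '-c', job['command']])

def main(config_path):
    while True:
        # Re-read the config on every tick so changes are picked up
        with open(config_path) as f:
            jobs = yaml.safe_load(f)
        now = time.localtime()
        for name, job in jobs.items():
            if job_is_due(job['every'], now):
                run_job(job)
        time.sleep(60)  # tick once a minute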

We can of course trigger any of these recurrent jobs outside their normal schedule, for instance after an error has required a restart. Forcing a run of the “still-alive” job is simply sky job start still-alive.

Inspecting services and jobs

Simplicity of use is not only about ease of deployment, but also about how simple it is to debug and improve your code. Two things are very important to get meaningful insights into your code: access to logs and run-time metrics.

Logs give precise details about what happened and when. Metrics quantify how fast and how often the code ran. Both are critical to understanding how applications and algorithms behave on production-sized data. Feedback about how services and jobs are running is essential, which means measuring as much as you can, for at least two reasons:

  • You can’t optimize properly if you don’t measure properly. You risk optimizing the wrong part of the code, or optimizing prematurely.
  • It lets you quantify, compare and learn. Why is my code running so slowly compared to others? Why am I using so much RAM? Asking the right question is already halfway to a solution. Develop a culture of speed and efficiency!

When a container runs on our infrastructure, it is automatically monitored. Anyone can then inspect its resource usage.

Each container is monitored using CollectD and Graphite containers. The results are gathered in infrastructure-wide dashboards which let us investigate the resource usage of each container.

Application-level metrics and logs are handled with a set of homemade wrappers included in our base image. These tools are written in the most common backend languages used at Snips (Python and Scala), and give us a standardized way of evaluating code performance and exploring logs. Logs and alerts are handled by logstash, while metrics are handled by Graphite and StatsD.

These home-made Python and Scala libraries give us a standardized way of defining application metrics, which we introspect through dashboards automatically generated for each containerized web server API. We can then investigate slow queries and explore time series.
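
Our wrappers themselves are in-house, but the underlying mechanics are standard; here is a sketch using the common statsd Python client (the daemon address is an assumption):

import statsd

# Metrics are fire-and-forget UDP packets sent to the StatsD daemon,
# which aggregates them and forwards them to Graphite
stats = statsd.StatsClient('statsd.snips.internal', 8125)

@stats.timer('my_application.handle_request')
def handle_request(request):
    # The decorator times each call; Graphite then lets us
    # plot rates and percentiles over time
    stats.incr('my_application.requests')
    ...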

Switching to production

All of our production traffic is duplicated and redirected to services in staging. This lets us test services that have not yet reached production maturity on real production data. Because everything is measured, we can quickly assess the impact of changes and identify errors or bottlenecks.

Since Docker images are tagged by version and stored, rolling back simply means reverting to the previous service configuration file, which points to the previous image versions.

We use Strider as our Continuous Integration system. On each GitHub commit, services that pass their tests can be deployed directly. This is especially useful for iterating quickly in a staging environment to correct mistakes that have slipped through.

As a consequence of having a uniform infrastructure, running code in production is not fundamentally different from running a prototype in staging. The same toolchain and processes are used throughout.

Closing words

Fast iteration only becomes possible when you have substantially reduced the time and effort needed to deploy and inspect services on an infrastructure. Tightening the feedback loop enables richer ideas to be conceived, and higher quality prototypes to be deployed.

Fewer errors are introduced when the same toolchain is used for development, prototyping and production. The infrastructure then becomes a high-quality service for all of its users.

An infrastructure is in essence no different from a traditional interface. Its true purpose is to hide complexity, in order to let us do what we do best: be creative.

Snips is hiring!

If you care about creating products that will change the way we use our devices in our daily lives, take a look at our jobs page! We would love to hear about what makes you tick, your own personal projects, and discuss how we could work together!
