How to build stable systems

— An incomplete opinionated guide.

13 min readFeb 7, 2016

Preparation

The first decision is easily the most important. It is one of ideology: the developers are in control of the software. Not the other way around. Managers are not in control of the software. Product Owners are not in control of the software. Developers are. The only people who control the software are they who write it.

The second decision is to have small units of work which you can control. It is better to solve a minuscule subset of the full problem space first and get that deployed to production than it is to have a large project die. Use the initial small units of work as ruthless test beds for later exploration.

It is the responsibility of the developer to always be in control of the software. This is the staple by which everything else is measured. When bugs occur, their dictate the work schedule of the developers (fix it!). When deadlines draw close, they dictate the work schedule of the developers and the software quality (finish it!). Thus, when the deadline is hit, you deploy the part of the software you control fully, and roll back the things which you don’t.

Any change to the software is made succinctly and quickly and moves the system from one stable point to the next. Err on the side of doing fewer things, but do them well. Once mistakes are deployed, they can alter production data and this makes them prohibitely expensive to fix. In the worst case you have to rewrite years worth of data which is wasted time.

The software is written as a series of projects. Each project is small, self-contained and has a known point at which it will be done. There are a few people on a project, no more than 6. A project is never longer than 2.5 months of time. Every project has a win condition. A success criteria which makes that project able to do its work.

A project starts with a small seed of 24–72 hours of concentrated work. This lays the core kernel for the rest of the project, and it prototypes the viability of the system as a whole. If the seed fails, the project is aborted and you try again with the new knowledge. Everything which can be simplified or cut will be, for the seed kernel. The goal is to show the feasability of the project as a whole and make everyone confident in the project.

A project usually have a single gamble only. Doing something you’ve never done before or has high risk/reward is a gamble. Picking a new programming language is a gamble. Using a new framework is a gamble. Using some new way to deploy the application is a gamble. Control for risk by knowing where you have gambled and what is the stable part of the software. Be prepared to re-roll (mulligan) should the gamble come out unfavorably.

Every project starts with a list of “things we don’t solve in this project”. Many of the things on the list will sound desirable. Limiting scope will help build the needed focus. Define what is the future extensions to the project. Put them into a separate project later on.

Know the position of the project in your infrastructure. Components close to the core requires more testing, slower development and more attention to error. Know the general level of testing in the business. If there is none, then aim for getting smoke-tests in place for components before adding more tests. And test core components more than leaf components.

Know your experiments, or your whole project will be an experiment.
— Mike Williams

Make Experiments. Carry them out as small analyses before starting the project proper. Flag them as analysis pre-planning work to the real project. Tell everyone you are doing R&D into the right solution.

Know the code quality of parts you will interact with. Be on guard and look out for bad APIs.

Know the data quality of data you will interact with. If data requires several passes through the laundromat before use, chances are the project should be abandoned until data is clean.

Any project on top of an existing system needs a transition plan: how do we gradually get from our current point into the new system? Big bang deployments tend to be associated with lots of risk for good reason. Don’t gamble this in a well-established setting. Understand how data sources are going to be updated and how you can gradually slide from one data source to the next. Connecting to multiple data sources and on-demand porting them over is definitely an option.

The developers are always in control.

Process & People

You will only pick the parts of Agile/XP/SCRUM/Kanban/ThisYearsFad which works for the team. You will kill everything else.

Development is digital. People working from home are as efficient as people working from the office. Avoid methods which require physical presence at the office. Thank me in 5 years when everyone hires all over the world, or your company has 3 offices in India, Germany and SF.

Prefer asynchronous communication: email, IRC, Slack, …, which avoids interruptions and makes it easier to catch up later. Create hiding spots at the office for people who wants to work undisturbed for a while, so manager-droids can’t disturb them.

Not everyone drinks coffee. Respect people are different. Some people love to do pair programming and solving things at the same keyboard. Other people just tire out of such interaction. Know how different people on your team likes to work with the code base.

Hours in the chair in front of the keyboard does not equate hours of productivity. Many solutions will come to you when you are away from the keyboard. Flexibility in work hours and work place is a must for productive people. Brilliant people have no off-switch in their brain they flip once they leave work. This holds for the single parent of three and for the triathlon-zealot.

In any team, individual members will make mistakes. Let them. Learn from the mistakes. Get on with solving the problem.

System Planning

The system is built for production. That is, the system is not built as a toy you then accidentally throw over the fence and put into production later. The system is built for production consumption. You will think about how you configure the system in production. You will think about its external dependencies and limit them. You will make the system easy to operate and maintain.

You build your system as a 12 factor app.

Your system is a flat set of modules with loose coupling. Each module have one responsibility and manages that for the rest of the software. Modules communicate loosely via a protocol, which means any party in a communication can be changed, as long as they still speak the protocol in the same way. Design protocols for future extension. Design each module for independence. Design each module so it could be ripped out and placed in another system and still work.

Avoid deep dependency hierarchies. They breed tight coupling. Avoid monster-modules and break them apart. Avoid the mess of microscopic modules as well. Always remember the power of copying a function and thus breaking a dependency. Fewer dependencies can some times be a win.

In a communication chain, the end points have intelligence and the intermediaries just pass data on. That is, exploit parametricity on mulitple levels: build systems in which any opaque blob of data is accepted and passed on. Avoid intermediaries parsing and interpreting on data. Code and data changes over time, so by being parametric over the data simplifies change.

You will have a supervision/restart strategy. If you write Erlang you already have this. If you do not write Erlang, you have to build such scaffolding either inside the application (for fine granularity) or via the operating system (for coarse granularity).

Your system will prefer ratcheting methods via idempotence. From a known stable state the system will attempt to ratchet to the next step in the computation. If it succeeds, it will verify consistency and then stabilize at that step. Failure will abort and you can try again. Essentially the system is stateless between computations and stateful only on the ratchet flanks. This is especially important as a best-effort delivery mechanism in distributed systems. A unique ID on all messages means you can always retry said message in case of a timeout and be sure it won’t be rerun by the receiving system, if the receiver keeps a log of what it has already done.

Construct systems such that they always “catch up” to a point in time. This avoids building a system with separate modes for on-line processing, and off-line catching up — duplicating and complicating the code path.

Follow the UNIX principle: each tool does one thing well. Avoid the temptation to add more and more functionality. Build a sibling tool instead.

Define the capacity of the system up front: what amount of load are you targeting for normal engineered operation? What load is the peak at which the system can operate nominally?

Setup

First you build an empty project. You add this empty project to continuous integration. You deploy the empty project into staging. If it is a new project, and there are no users, you can also make a deploy directly into the machine which will be the production setup down the line. Once this works, you start building the application. As needs arise, you add the necessary configuration into the deployment chain as well.

Continuous integration produces the artifact. The artifact is the built code in a self-contained fashion with no reliance on the host environment save simple setup. Seek to preconfigure your systems so you need no external dependencies when deploying. This saves you in a fiery situation later. The same artifact is deployed to staging and production. It picks up a context from the environment and this context configures it. It can be a configuration file on disk, consul, etcd, DNS or one downloaded from S3. Err on the simple side here and don’t use advanced technology too early on.

The artifact is reproducible. Lock dependencies to specific tags/versions. Make upgrading dependencies a decision on your part. Aim for the build to be reproducible. Vendor everything. Control for outside factors making sudden changes to applications you use and messing up your build.

The artifact contains everything for running the software. It is either a binary, or a directory tree containing the binary. The binary is statically linked. Go binaries, OCaml binaries, Haskell (GHC) binaries, or Erlang/Elixir releases are good examples of artifacts. The artifact is also packed with deployment information which is read by the deployment system you are using.

Try to make a production deploy take less than 1 minute from button-push-to-operational-on-the-first-instance.

Build a default library you include in every application you write. This library contains debugging/tracing utilities, tools for gathering and exporting metrics, ways for the application to become a bot in a chat network and so on. Let every application use the same library.

Development

Correctness is more important than fast.
Elegant is more important than fast.
Code Quality is more important than fast.
Fast is not really important.

There is a good-enough point which you should define before starting and don’t optimize beyond that point. Most software doesn’t have to be fast, and the fast parts are often separable from the mundane parts.

Measure before optimization of algorithms and data structures. Know if your change had the desired impact.

Build your system to collect metrics about itself as it runs. Ship metrics to a central point for further analysis. Read Gil Tene’s work on HdrHistogram and use that tool all the time.

Use any dirty trick with respect to tooling you have up your sleeve: unit test, property based test, type systems, static analysis, and profiling. There is really no reason to avoid tools that can help alleviate bugs early on. Late bugs are exponentially more expensive to remove, when they’ve had the chance of altering production data.

The software is built to run on multiple different environments, preferably UNIX. Embrace a diverse culture as the landscape of what to run on changes all the time and lock-down to a specific platform is often a problem.

If you only ever run on windows, you are screwed. You are locked in to a single vendor, and you live and die by the quality of their tooling.

Discussions about code formatting is mostly pointless. One defines a standard haphazardly and everyone follow suit. In Go, everyone just runs ‘go fmt’, for instance, and this discussion is over.

Understand where the error kernel in the system is: what parts absolutely have to be correct and what parts are more lenient in correctness. Separate the error kernel and isolate it. Focus testing efforts accordingly.

Use load regulation in the border of the system in order to avoid overload situations. Don’t regulate load on the inside. Reject work if too much is going on. It is better to give service to a few select than to fail giving service to everyone.

Use circuit breakers to break cascading dependency failure. These are also indispensable when maintenance is needed since they can be manually tripped.

Picking a database

ed(1) is the default text editor. postgresql is the default database.

Unless you have a dataset above 10 Terabytes, you pick postgresql. If you need MongoDB-like functionality you create a jsonb column. You learn about your data while they are in postgres and move them out when you have learned. Postgres is your authoritative storage. You export to elasticsearch from postgres. You preheat your other data stores from postgres. Until load increases you will run read replicas of postgres. You will use pg_bouncer.

Most new databases give dubious consistency and security guarantees. Especially the immature variants. They work in a “call me maybe” fashion, where they may accidentally fail to store you data. This is particularly true when there is a network partition in a distributed database. Consult the latest findings of Kyle Kingsbury in his Jepsen project and use this is a guide for your needed guarantees.

Many newcomers to the database market have narrow performance profiles: they work well within a narrow band of use, but fail miserably once you try to use them outside this narrow band or put more load/pressure on them than they can handle.

Be vary if you start using very complex transactional behavior. It makes it very hard to move away from your database design in the long run, especially if you need distributed operation. Isolate complex transactional interactions to a few parts of the store, usually having to do with money. Look for idempotent ratcheting methods as an alternative if possible.

Picking a programming language

To get a robust system, you have to pick Erlang somewhere inside the system. No other language supports the robustness principles needed for stable operation.

If you don’t pick Erlang, you will have to reimplement the ideas of Erlang in your Weapon-of-Choice™.

You will avoid the monoculture. Having all of your code written in C# or Java means that there are projects which you will be able to do easily and other projects which will become incredibly hard. Try to have a healthy mix of different languages that offer varying trade-offs. This maximizes your chance of picking a language which fares well in your problem space.

Know the weaknesses of a language: Python is not very well suited for massive concurrency. Erlang doesn’t work for problems requiring raw computational power. OCaml has an immature parallel execution story. Go doesn’t fare well on problems where you need good abstraction capabilities, or with complex failure modes.

A language can only succeed if you have a way to automatically deploy it with ease. The deployment tooling must be in place before use.

All projects, language notwithstanding, use the same tool for configuring and building themselves: make(1). Make can call into the given languages choice of build tool, but the common language for continuous integration and deployment is make(1). Use the same make targets for all projects in the organization. This makes it easy to onboard new people and they can just replay the work the CI tool is doing. It also “documents” how to build the software.

Configuration

The artifact comes with a default configuration. Everything which needs to be different is picked up from the context. Prefer ’12 factor apps’ and pick up configuration from environment variables. Persistent data lives outside of the artifact path, on a dedicated disk with dedicated quota. The application logs to a default location and will never log up to more than a constant amount of disk space which is predetermined. Rotation will be set up.

The artifact path is not writable by the application.

Use different credentials in production and staging. This avoids misconfiguration. Isolate the network of staging and production so a staging deploy cannot kill the production environment. Deny developers laptops easy access to the production environment and make them jump some hoops.

Avoid the temptation of too early etcd/Consul/chubby setups. Unless you are large, you don’t need fully dynamic configuration systems. A file on S3 downloaded at boot will suffice in many cases.

Operations

Optimize for sleep. The system must avoid waking people up in the middle of the night at all costs. Operations as well as developers.

The system must be able to gracefully degrade. The perception of a partially degraded system is often that it works, whereas the perception of a non-responsive system is that it is failed. Use this to avoid waking people up in the middle of the night.

The system runs out of monit, supervise, upstart, systemd, rcNG, SMF, or the like. It never runs in tmux/screen. The operating system gracefully restarts the failing application for a while before giving up.

Consider using a split stack of your own software, away from the system stack. This avoids having to work with old software and makes it easier to move your stack between machine types.

It is always safe to kill the application. You can’t control for events where Amazon decides to ward off and destroy 37 racks randomly. The application must gracefully stop and start if given the command to do so. Booting off active requests is not an option. Time must be spent to make this work gracefully, by stopping internal parts in the opposite order of which they were started.

Developers usually never log into the production hosts. Every log file is shipped and indexed outside of the system. Every interesting metric too. Developers work on staging hosts. There should be enough information shipped such that you can reconstruct the error without production access most (> 90%) of the time.

Metrics often show failures before they occur. The odd-one-out customer who abuse the system in a certain weird way is often how everyone will use the system down the line. Understand why the system spikes around that particular customer can make fault handling into a proactive venture. As load increases, extreme values become common.

The only way to make changes to a production host is to redeploy. The only way to make changes to a staging host is to redeploy.

The world is now elastic. Spinning up a new machine, jail, or zone is cheap so use this for operations. Rebuild the data center and flip a switch in the load balancer. Make it easy to roll back and downgrade a deployment. Gradually switch traffic to new code with high risk to see how it fares before letting it take full load. Overprovision machines around a risky deployment, fix problems and then scale down later when the software operates in a better way.

It is tempting to handle production deployments with errors by rolling forward all the time. This risks spinning out of control, so always have a way to take a step back. If you want to deploy to production many times a day, then have a group of stable hosts on hand for a rollback.

Docker is not mature (Feb 2016). Avoid it in production for now until it matures. Currently Docker is a time sink not fulfilling its promises. This will change over time, so know when to adopt it.