Ran Tavory · May 3, 2015

Writing Deployable Code (part two)

This is the second part of the series about writing deployable code. In part one I described some of the motivation for writing deployable, or deployment-friendly, code. In this second part of the three-part series I will list the basic properties and practices that deployment-friendly code should support. In the third and last part I will cover some of the more advanced topics for making your code deployable.

What does writing deployable code mean? Put another way: what does a developer have to do to make their code deployment-friendly? Here's what I've learned; I'm happy to learn more.

  1. Self test. A self-test is the concept of a service that runs a few simple tests upon startup and reports its readiness. When the self-tests pass, it means the service is ready to operate. For example, if a service needs the file system or a message queue in order to operate, the self-test would include a simple access to the file system (to make sure it has the correct permissions and enough disk space) and a simple access to the message queue (to make sure it's up and that we have the correct ACLs). The usual drill is to run the series of self-tests when a backend process is started, as part of the deployment script; if any of these tests fail, the deployment is considered bad and needs to be rolled back. Usually there's a reserved URL for that same purpose, for example /self-test (sometimes referred to as a health check). In the past I'd written my own framework for self-tests, but nowadays there are plenty, so I'd recommend simply using one of the existing libraries if you can find one that fits your stack. For example, in Java we have Dropwizard Metrics (https://github.com/dropwizard/metrics), which includes, among other things, health checks. A minimal health-check sketch appears after this list.
  2. Immutable services and application-tests. Similar to self-tests are application-tests. Application-level tests (sometimes referred to as “testing in production”) are application-logic tests that run on actual production servers. They can be viewed as an extension of self-tests: if self-tests run a few sanity checks (can I access the file system? can I access the message queue?), then application-tests run more complex scenarios and actually execute real API calls to determine whether their results are correct. Another way to look at them is as end-to-end tests, but instead of running in a dedicated staging environment they run on actual production servers. There are a few things to consider when writing and running application-tests in production. One: how do you determine that they were actually successful? If they run on real, unsynthesized data, how will you be able to tell whether the result is correct? This usually depends on your own logic, but many times the result is indeed predictable (for example, a login API should allow login), and in cases where there isn't a simple predefined response, the least you could do is run some fuzzy checks, for example that the structure of the response is as expected, or at the very least that the response code is 200. The second concern is: what if by running the tests I change state on the server and therefore damage production? The answer is either to make your services immutable (more about immutable services here), which isn't exactly simple but is a good practice regardless of application-tests, or to be careful not to run state-changing API calls, or at least to change state in a way that would not be reflected to users. Lastly, you need to make sure that by running a series of application-tests you do not accidentally melt your own servers; leave enough air for the servers to breathe. A sketch of such a test appears after this list.
  3. Monitoring. Ops surely know what monitoring is. Developers usually do too. The challenge is connecting the two. There are a few topics to discuss when it comes to monitoring best practices; I will not cover the entire subject here, but I will try to at least provide the deployment-related angle. The main purpose of monitoring is to indicate whether the service is operating correctly. There are many layers to monitoring: the server level (e.g. disk space, CPU, networking), the application level, which is where we concentrate here, and then there's analytics (e.g. Google Analytics and such), user monitoring (Mixpanel, Intercom etc.) and a few more.
    The title of the post reads Writing Deployable Code, but keep in mind that by deployable code we do not only mean that the application *starts* and *operates* correctly a minute or two afterwards, but that it also runs as expected for the long haul: 1 hour later, 1 week later, 1 year later. This is sometimes referred to as “post-deployment”; post-deployment events are, for example, high load which should trigger adding more capacity, or a service crash which should trigger service replacement. None of this can be achieved without proper monitoring. Many of the containers provide some basic level of monitoring (containers meaning, e.g., Linux itself, or AWS and other cloud services, which provide CPU monitoring, traffic monitoring, error rates etc.), but to really understand what's going on in your service, as a developer you have the power to whiten the black box for ops. You do that by providing descriptive and accurate application-level monitoring. As Allspaw once said, “If it moves, graph it; if it matters, alert it”.
    Monitoring and alerting need to be part of the usual flow for developers, just like unit tests are (or, hopefully, are). Just like testable code, which makes for better code, monitorable code also makes for better code. When you start thinking about which numbers you need to keep an eye on, it also makes you think about modularity, and you design better systems (otherwise it would be difficult to extract those numbers). For example, if you have an internal in-memory queue, you need to keep an eye on the size of this queue and perhaps also on the incoming/outgoing rates. If you do, you may learn about memory consumption in this queue, about bottlenecks, and perhaps eventually decide whether a queue is the correct way to handle the load at all. Speculating about performance is very hard; measuring performance is much easier and many times leads to surprising results. Hence, monitoring your code, and even setting up thresholds and alerts, needs to be part of the usual dev drill. For example, a fellow developer once told me his team used to annotate each of its database access methods with @ExpectedLatency, e.g. @ExpectedLatency(ms=50): if the method runs for more than 50ms, an alert goes off. This is very effective in making developers think about database performance as well as, of course, monitoring it. A popular monitoring library for Java is Codahale's Metrics (already mentioned above), which lets devs add application-level metrics, expose them through an HTTP interface, and send them to popular monitoring services such as Graphite, Ganglia and more: https://github.com/dropwizard/metrics. A short metrics sketch appears after this list.
  4. Logs. This is rather trivial; every developer understands the importance and utility of logging, so I'm going to keep this short. The only real advice I got from an old programming guru is this: logs are supposed to tell a story, the story of your software. When you read the logs it's as if the program is telling you its story; you shouldn't need to read code to understand what's going on, just read the logs and they will tell you what's happening. A small illustration of story-like logging appears after this list.
  5. Crash only. Crash-only software is software that expects failure and embraces it. Instead of trying to clean up neatly when the service is stopped (e.g. listening to SIGTERM and handling it), simply let the program die, and then, when the service is restarted, deal with that possible past failure. For example, a typical database writes an operation log, and when recovering (e.g. at startup) it runs all the operations in the log that were not yet marked as run. Online services may adopt a similar approach (but there's a caveat, read on!). The line of thought is this: you're going to crash anyway. Sure, if you run an expected maintenance (a restart or a deployment) you can control the process and send a SIGTERM before nuking the service, but you're not always that fortunate, and more often than not you'll have to deal with crash scenarios. So, since you already have to deal with crash scenarios, why not regard all scenarios as crash scenarios and, rather than investing your time in graceful shutdown (which, as mentioned, you won't always get the chance to perform), invest your time in smart recovery? A rough sketch of this idea appears after this list.
    To be precise, crash-only software refers to something more extreme than this: systems (such as those built with Erlang) which, instead of dealing with errors, simply crash and restart by themselves. That's also nice, but not my point.
    Another point worth mentioning is that in a cloud scenario, if a service crashes, the new instance of the service will often not be started on the same physical or virtual host; many times it will be started on a different host. Therefore maintaining local state (as a typical database does) is of little utility, and one needs an external checkpointing facility to maintain state between crashes. Such a checkpoint facility might be just a database, but usually a very simple one (DynamoDB, for example).
  6. Zero touch deployment. If you have a wiki page for “how to deploy” — game over. Nuff said.
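To make the self-test idea in item 1 concrete, here is a minimal sketch using the health-check API from the Dropwizard Metrics library mentioned above. The file-system probe and the wiring in main are assumptions for illustration; a real service would register whatever checks its dependencies call for (message queue, database, disk space) and expose them behind a /self-test style URL.

```java
import com.codahale.metrics.health.HealthCheck;
import com.codahale.metrics.health.HealthCheckRegistry;

import java.io.File;
import java.io.IOException;

// A minimal sketch, assuming Dropwizard Metrics is on the classpath.
public class SelfTests {

    // Verifies we can actually create (and delete) a file where the service expects to write.
    static class FileSystemCheck extends HealthCheck {
        private final File dataDir;

        FileSystemCheck(File dataDir) {
            this.dataDir = dataDir;
        }

        @Override
        protected Result check() throws IOException {
            File probe = File.createTempFile("selftest", ".tmp", dataDir);
            if (!probe.delete()) {
                return Result.unhealthy("Could not delete probe file in " + dataDir);
            }
            return Result.healthy();
        }
    }

    public static void main(String[] args) {
        HealthCheckRegistry registry = new HealthCheckRegistry();
        registry.register("filesystem", new FileSystemCheck(new File("/tmp")));

        // Run all checks at startup; a deployment script would hit the /self-test
        // endpoint (or the library's servlet) and roll back if anything is unhealthy.
        registry.runHealthChecks().forEach((name, result) -> {
            if (!result.isHealthy()) {
                System.err.println("Self-test failed: " + name + " - " + result.getMessage());
                System.exit(1);
            }
        });
        System.out.println("All self-tests passed, service is ready.");
    }
}
```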
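For item 2, an application-test might look something like the sketch below: a real, read-only HTTP call against a production endpoint followed by fuzzy checks on the response. The URL and the expected field name are hypothetical placeholders; the point is that the call is a GET (so it does not change state) and that the assertions are deliberately loose, checking the status code and response structure rather than an exact payload.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// A minimal sketch of an application-test that runs against production.
// The URL and the expected response fragment are hypothetical.
public class LoginApplicationTest {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.example.com/v1/login/status"))
                .GET()   // read-only: no production state is changed
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // Fuzzy checks: we can't always predict the exact payload on real data,
        // but we can at least verify the status code and the response structure.
        if (response.statusCode() != 200) {
            fail("Unexpected status code: " + response.statusCode());
        }
        if (!response.body().contains("\"status\"")) {
            fail("Response body is missing the expected 'status' field");
        }
        System.out.println("Application-test passed");
    }

    private static void fail(String reason) {
        System.err.println("Application-test failed: " + reason);
        System.exit(1);
    }
}
```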
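For the monitoring item (3), below is a hedged sketch of application-level metrics with Dropwizard Metrics: a gauge on an internal in-memory queue and a timer around a database call. The queue and the fetchUser method are made up for illustration, and the @ExpectedLatency annotation from the anecdote was that team's own custom annotation, not something the library ships; with this library you would attach thresholds and alerts in the monitoring backend (Graphite, Ganglia etc.) instead.

```java
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A minimal sketch, assuming Dropwizard Metrics; the queue and the
// "database call" below are stand-ins for your real components.
public class QueueMonitoring {

    private final MetricRegistry registry = new MetricRegistry();
    private final BlockingQueue<String> workQueue = new LinkedBlockingQueue<>();
    private final Timer dbTimer;

    public QueueMonitoring() {
        // Keep an eye on the size of the internal in-memory queue.
        registry.register(MetricRegistry.name(QueueMonitoring.class, "queue", "size"),
                (Gauge<Integer>) workQueue::size);
        // Measure the latency of database access; a reporter (Graphite, Ganglia,
        // JMX, HTTP) would ship these numbers out of the process.
        dbTimer = registry.timer(MetricRegistry.name(QueueMonitoring.class, "db", "fetchUser"));
    }

    public String fetchUser(String id) {
        try (Timer.Context ignored = dbTimer.time()) {
            // ... real database call goes here ...
            return "user-" + id;
        }
    }
}
```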
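The "logs tell a story" advice in item 4 is easiest to show by example. The sketch below uses SLF4J (an assumption; any logging facade works) and an invented order-processing scenario; the idea is simply that reading the log lines in sequence explains what the program did, without opening the code.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// A minimal sketch: log lines that read like the story of what the
// program is doing, rather than cryptic one-word breadcrumbs.
public class OrderProcessor {

    private static final Logger log = LoggerFactory.getLogger(OrderProcessor.class);

    public void process(String orderId, int itemCount) {
        log.info("Received order {} with {} items, starting validation", orderId, itemCount);
        // ... validation ...
        log.info("Order {} validated, reserving inventory", orderId);
        // ... inventory reservation ...
        log.info("Inventory reserved for order {}, handing off to billing", orderId);
    }
}
```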
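For item 5, here is a rough sketch of the operation-log idea: on startup the service replays any operations that were not yet marked as applied, so an unclean crash and a clean restart go through exactly the same code path. The file names and the one-operation-per-line format are invented for the example. As the item also notes, in a cloud setting the same log would have to live in an external store (a simple database such as DynamoDB) rather than on the local disk, since a replacement instance may come up on a different host.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// A rough sketch of crash-only style recovery: no graceful-shutdown hook,
// just an append-only operation log that is replayed on every startup.
public class CrashOnlyService {

    private final Path opLog = Path.of("oplog.txt");       // pending operations, one per line
    private final Path applied = Path.of("applied.txt");   // operations already applied

    public void start() throws IOException {
        recover();   // same path whether the last run crashed or exited cleanly
        // ... start serving traffic ...
    }

    private void recover() throws IOException {
        if (!Files.exists(opLog)) {
            return;
        }
        List<String> appliedOps = Files.exists(applied)
                ? Files.readAllLines(applied, StandardCharsets.UTF_8)
                : List.of();
        for (String op : Files.readAllLines(opLog, StandardCharsets.UTF_8)) {
            if (!appliedOps.contains(op)) {
                apply(op);
                // Mark as applied only after the operation succeeded, so a crash
                // mid-recovery just means the operation is replayed next time.
                Files.writeString(applied, op + System.lineSeparator(),
                        StandardCharsets.UTF_8,
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        }
    }

    private void apply(String op) {
        // ... idempotent re-application of the operation ...
    }
}
```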

That’s it for now. In this part I listed some of the basic measures that help make your code deployable. In the next part I’ll continue with a description of some more advanced techniques. Stay tuned…

Ran Tavory

The voice @reversim, head of Data Science at AppsFlyer