Writing Deployable Code (part three)
This is the third and last part of my “Writing Deployable Code” series. In part one I described the motivation for writing deployable code, and in part two some of the more basic steps for making your code deployment-friendly. In this part I continue with some more advanced (but very often still necessary) techniques.
As a reminder, let me briefly list the topics covered previously, and then I will elaborate on the new ones.
- Self-test (mentioned in part 2)
- Immutable Services (mentioned in part 2)
- Monitoring (mentioned in part 2)
- Logging (mentioned in part 2)
- Crash Only (mentioned in part 2)
- Zero touch deployment (mentioned in part 2)
- Service discovery. Service discovery affects the deployment itself as well as the entire post-deployment lifetime of the service. When a service is deployed it needs to announce its ready state in order to start accepting traffic, right after its self-tests pass (usually through a service discovery registry such as Consul or ZooKeeper). And if the service uses other backend services it needs to discover them, also through the service discovery mechanism. The reason I bother to mention this here is that in the past service discovery was more or less static, sometimes embedded in static configuration files or in DNS, whereas today it needs to be much more flexible and robust, to the level that if a service backend changes, all its users need to know about the change with sub-second latency. There are a few ways to go about it. I mentioned Consul and similar tools that may help (etcd and ZooKeeper, to name two more), but it’s also possible to implement service discovery by means of a load balancer, such that the calling services need only talk to the load balancers and the backend services announce their availability to the load balancers (HAProxy is a good LB for this purpose). There are other interesting methods such as SmartStack from Airbnb, which runs an HAProxy on each node and routes all of that host’s traffic through the local HAProxy. In short, when topology changes as often as it does with frequent deployments, service discovery is a crucial part of the deployment toolchain.
- Unified deployment interface. The deployed artifact needs to implement an interface. This interface likely supports methods such as Start, Stop, Test, Configure, Unconfigure, Announce etc., depending on your needs. Some operations might be no-ops in some cases, but the important part is that all services implement the same API, so that it’s easy to write a unified deployment script (or use a deployment tool) that can deploy your frontend servers as well as API servers, data processors etc. in a unified way. One possible way to achieve a unified deployment interface is to adhere to an already established standard, for example by packaging all artifacts as Docker containers, as RPMs on Red Hat-like systems, as deb packages and so on. What you need to keep in mind is that just as your services have an API, and just as your classes have an API, your deployments need an API too. That’s the key to any automation and, needless to say, good automation is required for deployments.
- Deployment events. This is somewhat related to the deployment API mentioned before. If you’ve developed or used a UI component before, you’d know that most components have an API, e.g. commands you send to the component, such as Paint or Dispose, but there are also events, which are callbacks through which the component notifies you, such as onClick, onKeyup etc. So the API is inbound and events are outbound. Deployments are no different. As mentioned before, a deployable artifact needs to have an API, which is a way to send it incoming commands such as Start, Stop etc. But this artifact also needs a way to notify that it is ready to serve, or that it has a fault and needs to stop serving. Of course, by virtue of load balancers or other error detection mechanisms it is also possible to detect these changes externally, even if the service does not bother exposing them. But do note that only “catastrophe” events are easily noticed by external entities, not the fine-grained ones such as CPU level, queue length etc. Plus, it is much harder to debug a system by collecting signals from all around, and much easier if the same process that experiences these events notifies you of them. You’re probably already doing it, at least to some extent. It is called Monitoring and you’re damn right that you should keep doing it. However, what I’m referring to as deployment events is a more semantic level that aggregates the different signals and noises and reports a more coherent and semantic state change such as Ready, SteppingDown, OverCapacity etc. So in a sense “events” are similar to monitoring events, only more semantic and more actionable. The name and the meaning of the events are in some cases specific to the service, but just like an API, the more you standardize them, the easier it is to automate the service at deployment and post-deployment. How do you report these events and who should subscribe to them?
We already mentioned monitoring as being some sort of “events”. However, monitoring can be noisy, and what you want is clear and actionable events. Therefore a very basic and simple method to implement outbound events is to report them on a message queue/topic, such that all interested parties subscribe to it and get notified. Note that there’s a latency issue with queues, so when you need realtime handling, such as in the case of service Up and Down events, a service registry tool with sub-second notifications is also in order (plus the service registry is useful for providing the full current state, which message queues aren’t good at, and it is also able to notice “dead nodes” that just stopped reporting and can’t even send events). These events are useful for other services and for the deployment tool itself in order to implement robust automation. Even if they are not used for automation, at the very least they should be logged and presented on the deployment tool’s interface for visibility.
- Immutable servers. The concept of immutable servers means that whenever you deploy a new version of your service you do it on clean servers; you do not “reuse” old servers. This can be achieved in several ways. One is that each time you deploy a new version of the service you bootstrap a new virtual host/instance on a hosting service such as AWS, either using a stock image and then running some configuration tool (e.g. Chef/Puppet/Ansible) or using a pre-baked image that already contains your artifacts (as Netflix does). Another way would be to use a container such as Docker, which encapsulates at least some of the aspects that may change during a host’s lifetime and brings you pretty close to an immutable server. The motivation for using immutable servers (not to be confused with immutable services) is that during its lifetime a service may change its environment, e.g. write to disk, change permissions etc., and deploying a new version is one great way to “start fresh” and make sure you did not count on (or fail due to) any rubbish previously present on the server. This makes for a clean and robust deployment process, such that if you need to add 50 more servers at peak time you’d be able to happily do so without concern about “what happens on the new servers”, because all servers are always new, by definition.
- Cattle, not pets. The previous topic brings me to the statement made by others that you should treat your servers as cattle, not pets. If you remember your servers’ IPs or names, if someone in the office is worried because server X (one specific server X, not service X) is having trouble, then you’re treating your servers as pets: you care too much about specific servers. You shouldn’t. The reality is that servers come and go; what you should be worried about is Services, not Servers. Each service typically runs on multiple servers, and if one of the servers happens to crash then another server is (or preferably should be) automatically bootstrapped to take its place. This is implemented well in systems such as Hadoop and MapReduce, which assume failure and automatically restart jobs on other instances when that happens, and there’s no reason why it should not happen with online services. So in short, your servers are cattle: you only care about their headcount, not about one specific instance, you do not know them by name and you shouldn’t care if one particular server dies, because there should be an automated replacement for it.
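To make the service discovery item above a bit more concrete, here is a minimal Python sketch of the announce/discover cycle, including the “dead node” case where an instance simply stops reporting. It is an in-memory toy, not a client for Consul, etcd or ZooKeeper; the `ToyRegistry` class, its method names and the TTL-based expiry are all my own assumptions for illustration.

```python
import time
from typing import Callable, Dict, List, Tuple


class ToyRegistry:
    """In-memory stand-in for a service registry (hypothetical, illustration only)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        # (service name, address) -> time of the last announcement/heartbeat
        self._last_seen: Dict[Tuple[str, str], float] = {}

    def announce(self, service: str, address: str) -> None:
        """Register an instance, or refresh its heartbeat if already known."""
        self._last_seen[(service, address)] = self._clock()

    def withdraw(self, service: str, address: str) -> None:
        """Explicitly remove an instance, e.g. when it is stepping down."""
        self._last_seen.pop((service, address), None)

    def discover(self, service: str) -> List[str]:
        """Return the addresses whose last heartbeat is within the TTL."""
        now = self._clock()
        return sorted(
            address
            for (name, address), seen in self._last_seen.items()
            if name == service and now - seen <= self._ttl
        )
```

A service would call `announce` right after its self-tests pass and then keep calling it as a heartbeat; callers use `discover` instead of a static configuration file. An instance that crashes without calling `withdraw` drops out of the results once its TTL expires, which is the part a plain message queue cannot give you.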
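The unified deployment interface and the deployment events from the list above can be sketched together in a few lines of Python. This is only one possible shape, under my own assumptions: the `Deployable` base class, the `DeploymentEvent` enum, the `ApiServer` example and the `deploy` function are hypothetical names, and a plain list of in-process callbacks stands in for the message queue/topic or service registry you would use in practice.

```python
from abc import ABC, abstractmethod
from enum import Enum
from typing import Callable, List


class DeploymentEvent(Enum):
    """Outbound, semantic state changes (names taken from the examples above)."""
    READY = "Ready"
    STEPPING_DOWN = "SteppingDown"
    OVER_CAPACITY = "OverCapacity"


Listener = Callable[[str, DeploymentEvent], None]


class Deployable(ABC):
    """Inbound API that every deployed artifact implements (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        # Stand-in for subscribing to a queue/topic.
        self._listeners.append(listener)

    def emit(self, event: DeploymentEvent) -> None:
        for listener in self._listeners:
            listener(self.name, event)

    @abstractmethod
    def configure(self) -> None: ...

    @abstractmethod
    def start(self) -> None: ...

    @abstractmethod
    def test(self) -> bool: ...

    @abstractmethod
    def stop(self) -> None: ...

    def announce(self) -> None:
        self.emit(DeploymentEvent.READY)


class ApiServer(Deployable):
    """Toy service; a real one would start processes, open ports and so on."""

    def __init__(self, name: str) -> None:
        super().__init__(name)
        self.running = False

    def configure(self) -> None:
        pass  # a no-op for this service, but the method still exists

    def start(self) -> None:
        self.running = True

    def test(self) -> bool:
        return self.running

    def stop(self) -> None:
        self.emit(DeploymentEvent.STEPPING_DOWN)
        self.running = False


def deploy(service: Deployable) -> bool:
    """One deployment routine that works for any Deployable."""
    service.configure()
    service.start()
    if not service.test():
        service.stop()
        return False
    service.announce()
    return True
```

The point of the sketch is that `deploy` never needs to know whether it is handling a frontend server, an API server or a data processor, and that anything interested in the service (the deployment tool, a dashboard, other services) only has to subscribe to a small, standardized set of events.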
To sum things up: I’ve listed some guidelines that could help make your services easier to deploy. By following at least some of these practices you’d be able to achieve a higher level of automation and more robust deployment (and post-deployment); sleep better at night; live longer; prosper.