Running a Modern Infrastructure Stack

This post originally appeared on the now-defunct Barricade blog in October 2015. Republished here for posterity.

Joe Beda has done a great job of documenting an anatomy of a modern production stack, and since we were already working on ours before his post, you might consider this a field report of what it’s like to actually build and run one of these systems.

A thought I always have when I read posts like this is, “I wonder how big their team is? Can I reproduce something similar?” At Barricade, there are two of us working full time on infrastructure development, with a further eight developers, designers and data scientists working on other parts of the stack, and another three people in various locations who need access to dashboards or other non-critical path services. That might help frame things.

Figuring out where services are, and communicating that, is a problem being tackled by service discovery tools like etcd, Consul and others. This has proven useful for machines coordinating with other machines, but past a handful of services, the complexity of actually pointing humans to those resources becomes apparent. We don’t want to be telling people to visit some random IP X.X.X.X on port YYYY, and to have that change every time a service restarts.

DNS routing and addressing is still, it seems, a manual process in many environments, and exposing operational idiosyncrasies to customers (whether internal or external) is not good enough.

We manage our DNS records in Route 53 with Terraform, linked explicitly to a separate (non-ELB) routing layer, with health checks to handle router outages or topological changes. At the moment there’s still a separate, manual DNS component, but it’s being refactored out.
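For a concrete sense of what those records look like: each router gets a weighted A record gated by a health check, so unhealthy routers drop out of DNS. A minimal sketch via the API (the zone ID, name and IP are invented, and in our case the equivalent is declared in Terraform rather than created imperatively like this):

```python
import boto3

route53 = boto3.client("route53")

# Health check against a (hypothetical) router's /health endpoint.
check = route53.create_health_check(
    CallerReference="router-1-check",
    HealthCheckConfig={
        "IPAddress": "203.0.113.10",
        "Port": 443,
        "Type": "HTTPS",
        "ResourcePath": "/health",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

# Weighted record for that router, tied to the health check above.
route53.change_resource_record_sets(
    HostedZoneId="ZEXAMPLE123",
    ChangeBatch={"Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "api.example.com.",
            "Type": "A",
            "SetIdentifier": "router-1",
            "Weight": 10,
            "TTL": 60,
            "HealthCheckId": check["HealthCheck"]["Id"],
            "ResourceRecords": [{"Value": "203.0.113.10"}],
        },
    }]},
)
```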

Related to, but distinct from, addressing is routing. With lots of services and containers popping in and out of existence, connecting services could really use a transport-agnostic broker capable of reconnections, pooling, health checking, and SSL termination. This actually sounds like sort of a tall order until you remember that HAProxy exists.

Often, people are reaching for ELB here, but since routing logic is so tightly coupled to applications, I think it’s better to keep it as part of the core stack.

It’s also a good bit of forward planning if you intend to have multi-provider availability at some point in the future, or to be able to reproduce a microcosm of your stack locally.

Note: taking ownership of routing may also mean taking ownership of SSL termination (unless you’re leaning on something like Cloudflare). Running an SSL server requires staying abreast of best practices in the area. This is not difficult, but it is an ongoing effort and needs to be taken into consideration.

In a stack like this, there are a lot of places we could put health checks — Route 53, HAProxy, Consul, even Zookeeper client libraries perform their own health checking.

Defining the boundaries for these and keeping a clean separation of responsibility keeps logs and services from being overloaded with health check spam, as well as avoiding weird race conditions when health checks disagree.

Our rule of thumb: Addressing checks Routing, and Routing checks Services. The Routing layer is the safest and most flexible point in the stack to make an informed decision about whether a service is healthy. Consul can then simply be used for hosts and services explicitly joining and parting.
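To illustrate the layering: the routing layer exposes a single /health endpoint for Route 53 to probe, and it is the only thing probing the services behind it. In reality HAProxy’s own checks do this work; the toy below, with invented backends and ports, just shows the idea.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Hypothetical services that only the routing layer health-checks.
BACKENDS = [
    "http://127.0.0.1:9001/health",
    "http://127.0.0.1:9002/health",
]

def backends_healthy():
    for url in BACKENDS:
        try:
            if urlopen(url, timeout=2).getcode() != 200:
                return False
        except OSError:
            return False
    return True

class Health(BaseHTTPRequestHandler):
    """The single endpoint Route 53's health check points at."""
    def do_GET(self):
        self.send_response(200 if backends_healthy() else 503)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), Health).serve_forever()
```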

Monitoring we handed off entirely to Datadog. We don’t need the hassle, and they are very good at it. I tried several products in this category when we started out earlier this year, and Datadog won hands down for scope (system-level statistics, CloudWatch integration, process monitoring) as well as for being super ops friendly, having Slack integration, and an excellent web interface.

Their agent even runs a statsd-compatible interface (for which there are plenty of client libraries), meaning application statistics can also be pushed here, enriching alerting and monitoring even further.
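Because the wire format is plain statsd over UDP, pushing a metric doesn’t strictly need a library at all. A small sketch, with invented metric names and the agent assumed on its default port 8125:

```python
import socket
import time

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
AGENT = ("127.0.0.1", 8125)  # the agent's statsd port

def incr(metric, value=1):
    # statsd counter: "name:value|c"
    sock.sendto(("%s:%d|c" % (metric, value)).encode(), AGENT)

def timing(metric, ms):
    # statsd timer: "name:value|ms"
    sock.sendto(("%s:%d|ms" % (metric, ms)).encode(), AGENT)

start = time.time()
# ... handle a request ...
incr("api.signups")
timing("api.request_time", int((time.time() - start) * 1000))
```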

I thought logging was a sewn up market, but everything I looked at was too expensive or too operationally cumbersome, with assumptions about IP addresses and hosts remaining static or slow moving.

Our logs are for production debugging and postmortem analysis, a tertiary service that needs to be low maintenance and sporadically high performance, with a good interface. Datadog have raised the bar in what I expect from software like this.

Eventually we decided to use rsyslog to ship to in-stack Elasticsearch services, interfaced via Kibana, and rotated out regularly to S3. Less than ideal, but it’s ok for now.
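The payoff is that ad-hoc production debugging is one HTTP query away. Something like the following, where the hostname, logstash-style index naming and field names are all assumptions about a fairly standard rsyslog pipeline:

```python
import json
import requests

# "Show me the last hour's errors" against the in-stack Elasticsearch.
query = {
    "query": {"bool": {"must": [
        {"match": {"severity": "error"}},
        {"range": {"@timestamp": {"gte": "now-1h"}}},
    ]}},
    "size": 20,
    "sort": [{"@timestamp": "desc"}],
}

resp = requests.get(
    "http://elasticsearch.internal:9200/logstash-*/_search",
    data=json.dumps(query),
    headers={"Content-Type": "application/json"},
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"].get("message"))
```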

I can’t be the only one who thinks this area is lacking its Stripe equivalent.

Configuration management is lagging in the wake of containerization. Most current configuration solutions expect to be full stack, controlling everything from provisioning to deployment, and are often too heavy for the evolving use case.

In truth, configuration management is losing, has perhaps already lost, its central role in the infrastructure pipeline.

I’ve never found a configuration management solution I liked, but Ansible is the one I least dislike for this purpose. I disagree with Ansible’s philosophy on things like testing (I think the idea that command exit codes are enough for testing is somewhat naive), but it’s a convenient abstraction on top of shell, with a pluggable orchestration layer that doesn’t get in the way. We have a small Python shim which translates Terraform tfstate data into an Ansible dynamic inventory.
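The shim is only a few dozen lines. A cut-down sketch of the idea (the tfstate layout shown matches Terraform’s format at the time, and grouping hosts by resource name is just our convention):

```python
#!/usr/bin/env python
import json
import sys

def main():
    # Ansible calls inventory scripts with --list or --host <name>; all
    # host vars are returned under _meta, so --host can return nothing.
    if "--host" in sys.argv:
        print("{}")
        return

    with open("terraform.tfstate") as f:
        state = json.load(f)

    inventory = {"_meta": {"hostvars": {}}}
    for module in state.get("modules", []):
        for name, resource in module.get("resources", {}).items():
            if resource.get("type") != "aws_instance":
                continue
            attrs = resource["primary"]["attributes"]
            ip = attrs.get("public_ip") or attrs.get("private_ip")
            group = name.split(".")[1]  # "aws_instance.router.0" -> "router"
            inventory.setdefault(group, {"hosts": []})["hosts"].append(ip)
            inventory["_meta"]["hostvars"][ip] = {
                "instance_id": attrs.get("id"),
                "private_ip": attrs.get("private_ip"),
            }

    json.dump(inventory, sys.stdout)

if __name__ == "__main__":
    main()
```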

The term “orchestration” seems to be used by projects like Kubernetes as shorthand for scheduling, but it’s not a usage I agree with. To me, an orchestration system should control the entire provisioning process, turning a plan defined in code into a production system.

As a result I (perhaps unhelpfully) don’t believe anyone has built an orchestration system, and it’s more likely to come from the direction of something like Deis or even Terraform. Our own is cobbled together between Ansible, Terraform, and Mesos.

I think separating out schedulers from orchestration is useful, and helps articulate what’s happening a little better. The likes of Mesos, Fleet and Kubernetes, to varying degrees, supply an abstraction over heterogeneous server resources (different instance types, for example) to give applications a consistent API for compute, memory and disk, and to ensure that those resources are distributed amongst client services.

We run a mixed environment: leaning on wrapper frameworks developed by Mesosphere for some services, and using containers where it makes sense, usually for our own code. Frameworks enable more complex service patterns, where there may be leader/follower relationships, or where the ability to dynamically spin up workers across multiple hosts is useful (Storm, Jenkins, etc.).
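For the containerized services, the scheduler interaction is typically just an HTTP call. As an illustration, submitting a long-running service to Marathon (Mesosphere’s scheduler for exactly this) looks roughly like the following; the URL, image and resource figures are placeholders:

```python
import requests

app = {
    "id": "/example-worker",
    "cpus": 0.5,
    "mem": 256,
    "instances": 3,
    "container": {
        "type": "DOCKER",
        "docker": {"image": "registry.internal/example-worker:latest"},
    },
    # Marathon restarts tasks that fail this check.
    "healthChecks": [{"protocol": "TCP", "gracePeriodSeconds": 30,
                      "intervalSeconds": 10}],
}

resp = requests.post("http://marathon.internal:8080/v2/apps", json=app)
resp.raise_for_status()
```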

Unlike the folks at Segment, I’m not confident that Route 53 is enough for service discovery. On the surface, DNS seems like an intuitive answer, but it leaves a lot to be desired due to varying levels of cache aggression in consuming applications.

It also means building another service to handle node participation, something I think Consul handles quite well. Additionally, consul-template and confd allow for speedy, event driven configuration changes, which I think ends up having much broader compatibility than DNS level service coordination.
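Consul’s HTTP API is also pleasant to consume directly. Asking for the currently-passing instances of a service, the same data consul-template renders into config files, is one request (the address and service name here are placeholders):

```python
import requests

resp = requests.get(
    "http://127.0.0.1:8500/v1/health/service/api",
    params={"passing": "1"},  # only instances whose checks pass
)
for entry in resp.json():
    service = entry["Service"]
    # Fall back to the node address if the service didn't register one.
    address = service["Address"] or entry["Node"]["Address"]
    print("%s:%d" % (address, service["Port"]))
```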

Many of the services we use rely on Zookeeper for coordination. Between Zookeeper, Consul, and the Mesos masters, we have the makings of a “coordination layer” that sits at the core of any stack. Redundancy in all of these services is useful, and they scale slowly, in more or less the same increments (3- and 5-node quorums are common).

We do our instance and network level provisioning with Terraform, which has been a very useful tool, but restrictive in some ways. It’s awkward to use with teams (Atlas is the proposed solution to this), and we’ve tended to use it to statically allocate resources, rather than have a more dynamic, elastic architecture. The latter is not a fault with Terraform, but I would prefer a tool that encouraged elasticity.

As convenient as it is, one thing I cannot abide is going into the AWS interface and picking AMIs, booting up instances, customizing values, etc. It leaves no record of intent, and makes reproduction and auditing difficult.

I think there’s room for a new piece here, though I’m not quite sure what it is. I like the idea of a service, with an API and a good permissions system, that can handle requests from Slack or a resource scheduler.
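A toy version of that idea, to make it concrete: an endpoint or Slack command that runs Terraform on your behalf and reports back, so there is always a record of intent. Everything here (the webhook URL, the workspace path, and especially the absent permissions layer) is hand-waved, and the hard part is precisely what is missing.

```python
import subprocess
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def plan(workspace_dir):
    # Run a plan and post the tail of the output back to the team.
    result = subprocess.run(
        ["terraform", "plan"],
        cwd=workspace_dir,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    requests.post(SLACK_WEBHOOK, json={
        "text": "terraform plan for %s (exit %d):\n%s"
                % (workspace_dir, result.returncode, result.stdout[-2000:]),
    })
    return result.returncode

if __name__ == "__main__":
    plan("./infrastructure")
```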

We want resource pooling for optimal utilization of compute, but resource isolation for security and to avoid stampeding herd effects.

Effectively, we want resources shared when things are going well, and isolated when things are going bad.

These goals would seem somewhat in conflict, and they are, but it’s possible to group and separate systems so each can be optimized for one or the other.

Instances provision slowly, have fewer problems with “noisy neighbors” than containers, and in AWS, at least, can avail of network-level security (security groups).

Containers usually provision significantly faster; however, we need to be more careful about how services are co-located, and as of today they can only make use of process-level security.

In Linux security terms it’s probably useful to think of containers as being isolated by AppArmor or SELinux configurations, whereas instances can be isolated on network boundaries.

Given this, it makes sense to use instances when optimizing for resource isolation, and containers when optimizing for resource pooling.

Barricade is an operations product for security, so naturally I want to take a pragmatic approach to security, and find a good balance of risk against cost and operational complexity.

An unintentional side effect of containerization is the potential re-emergence of a single security boundary — you’re either inside the firewall and trusted, or outside and untrusted.

This leads to services with low security requirements or larger attack surfaces potentially being run alongside those with sensitive data, or which would otherwise have smaller attack surfaces.

It’s a tricky one to resolve, because it usually means defining boundaries behind which to segment services. These service boundaries are often better known as APIs. APIs are usually unstable in nascent companies and products, and adding them can slow down development if introduced too early.

It also means duplicating the coordination layer, and the consequential orchestration complexities. As a result, I find it useful to define infrastructural boundaries based on data sensitivity. This tends to split into larger, slower changing chunks than application logic, and can help highlight areas where there should be more controlled access.

Services operating on customer data go behind one network boundary; those interacting with the public go behind another; some services (such as SSH or dashboards) are exposed only inside a VPN; some only within their cluster or host, and so on.

As a side note, I would recommend creating a VPN as early as possible. In the brave new world of devops, a VPN is part of your infrastructure, and not something left to your (often non-existent) I.T. department. Despite the upfront cost in figuring it all out, it’s a lot more convenient (and secure) than whitelisting IP addresses or relying on an office network being secure. It will also stand to you when you have remote employees, or need to work remotely.

Still the elephant in the room, how to deploy and manage stateful services is often conspicuously missing from many treatises on containerized infrastructure. If you look beneath the surface you tend to find a completely orthogonally managed data layer, sometimes manually provisioned.

Data services are highly coupled to their applications, and it’s safe to assume that the loss of one means the loss of all the services built atop and around it. It makes sense, therefore, to shard as early as possible and to logically isolate those shards (application-aware sharding), to at least avoid total unavailability.

S3 is looking more and more like the place where data needs to end up, with a global namespace, limitless capacity, and an impressive uptime record. Of course, adapting to S3’s particular performance and consistency characteristics may not be reasonable for many workloads, but for data mining it certainly seems much more attractive for smaller teams than building out an HDFS cluster. In a recent blog post about HDFS from Twitter’s engineering team, it’s very apparent that creating a global namespace was one of the largest challenges.

There’s a lot more work going on recently, certainly in the Mesos community (thanks to EMC), on improving how persistent storage is handled. We’re exploring options at the moment, and so far have been biting the bullet and just having “data nodes” to which stateful services are allocated. Not ideal, and I’m looking forward to seeing how we can build on top of the work being done at EMC.

We try to have most services bundled as containers, since it pushes software dependency definitions closer to the code that uses them. Additionally, the easiest way to configure software becomes environment variables, which is fine in some circumstances and a bit of a chore in others (say, where part of your config is a map structure).

It feels like a move in the right direction, but it can mean quite an overhaul to your deploy pipeline, hardcoding values in a repo you’d prefer to be configurable, or the introduction of some gnarly wrapper scripts.
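One way to cope when part of the config is a map: JSON-encode it into a single variable and decode it at startup. The variable name and values below are invented for the example.

```python
import json
import os

# e.g. FEATURE_FLAGS='{"new_dashboard": true, "beta_alerts": false}'
FLAGS = json.loads(os.environ.get("FEATURE_FLAGS", "{}"))

if FLAGS.get("new_dashboard"):
    print("serving the new dashboard")
```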

We use Ansible for deploys, mostly as a holdover from pre-container deployments, where most of the deployment logic (such as ensuring at least x services remained healthy) was located.
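The core of that logic, stripped of Ansible syntax, is a gate between restarts: never touch the next host unless enough backends are still passing. A sketch, where the hosts, port and restart step are placeholders and the real version lives in our playbooks:

```python
import time
from urllib.request import urlopen

HOSTS = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]
MIN_HEALTHY = 2  # never drop below this many healthy backends

def healthy(host):
    try:
        return urlopen("http://%s:9001/health" % host, timeout=2).getcode() == 200
    except OSError:
        return False

def deploy(host):
    print("deploying to %s" % host)  # placeholder: pull image, restart service

for host in HOSTS:
    others = [h for h in HOSTS if h != host]
    if sum(healthy(h) for h in others) < MIN_HEALTHY:
        raise SystemExit("not enough healthy backends, aborting deploy")
    deploy(host)
    while not healthy(host):
        time.sleep(5)
```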

This could quite easily live, like the provisioning service mentioned earlier, in a small service with an HTTP API and permissions system, accessible via Slack, etc. To be investigated.

Testing all this is hard. I don’t have any good answers past what Segment are doing with the dev/staging/prod split. It’s the best of a bad situation. Initially, I thought containers would make local development easier, but that has proven not to be the case so far.

It seems reasonably easy to use containers for either local dev or production, but doing both involves a bunch of duplicated effort. Maybe a cross-platform language for defining network topologies / service relationships would help. Libnetwork might spearhead some work in that direction.

I speculated in March that Terraform might become the basis for a testing framework, but we have yet to realize that goal.

Some things — like the choice of host operating system or container engine — I haven’t touched on because the options are probably more dependent on the development team than on objective characteristics.

I chose Ubuntu, for example, because I’ve been building production Debian and Ubuntu based infrastructures since 2009 (FreeBSD before that). I picked Docker for the container engine because, for practicality’s sake, I don’t see a good reason to pick anything else.

I also find it a good idea not to innovate in too many directions at once. Having a solid baseline to work from means you can move faster in the direction you do choose to innovate.

I’m @duggan on Twitter, and ross@duggan.ie.

October 20, 2015


Originally published at blog.barricade.io.