How we cook Mesos

Putting everything together

Denis Parchenko
14 min read · Mar 14, 2016

There are many articles out there describing Mesos in general, but very few showing a complete working infrastructure ready for production use. In this article, I'll describe our approach to configuring and using the different components in order to achieve a continuous delivery pipeline and a fault-tolerant runtime platform with Mesos. The machine provisioning scripts, however, are specific to our own infrastructure, so they are not published in full; the parts necessary for understanding are provided.

Overview

We are not going to go through the actual installation and configuration of the machines, but let's look at the following diagram to get an idea of the software involved:

Currently, we run 3 master nodes and 3 slave nodes and provision them using Saltstack.

Here is a high-level overview of the complete process, from building and deploying the code to configuring and running it.

In the following parts, we’ll cover how all components work and interact.

Preparing Docker images

Although Mesos deals perfectly with any kind of executable using its default containerizer, we'll stick to running dockerized applications here. We'll need a few additional bits in every deployed Docker image for easy integration into the runtime platform.

First, it will be very handy to know the actual IP address of the Docker host machine. While people are still asking for a special Docker flag for this, it can easily be achieved with a custom entrypoint script:
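
Something along these lines does the job (a minimal sketch; it assumes the default bridge network, where the Docker host is the container's default gateway):

#!/bin/sh
# entrypoint.sh: make the Docker host reachable as "dockerhost" from inside
# the container, then hand control over to the actual service command.
DOCKER_HOST_IP=$(ip route | awk '/^default/ { print $3 }')
echo "$DOCKER_HOST_IP dockerhost" >> /etc/hosts
exec "$@"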

Now the containerized application can reliably refer to the "dockerhost" hostname whenever it needs to reach the Docker host machine.

Second, for service configuration management (which is discussed later) we'll need one more script to be present in the Docker image: service-wrapper. We'll also have to install its dependency, jq, and if you use minimalistic Linux images (like Alpine), make sure curl is available as well. This yields the following base Docker image for Node.js applications:
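
A sketch of such a base image (the base image version, package names and paths are illustrative):

# Base image for Node.js services
FROM alpine:3.3

# Node.js runtime plus curl and jq required by entrypoint.sh and service-wrapper.sh
RUN apk add --update nodejs curl jq && rm -rf /var/cache/apk/*

COPY entrypoint.sh /entrypoint.sh
COPY service-wrapper.sh /service-wrapper.sh
RUN chmod +x /entrypoint.sh /service-wrapper.sh

ENTRYPOINT ["/entrypoint.sh"]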

This image should be pushed to a Docker registry accessible from the CI server. In our example, it will be a private registry available at the registry.local host.

Building and deployment

For continuous integration we use Drone. It's a rather simple but very promising piece of software based on the great idea of executing builds in Docker containers. All of its plugins are also just regular Docker containers which are fed the build information payload and are expected to process it in a plugin-specific way. And you guessed it: even repository cloning is done by a separate Docker container.

As we use a pretty diverse range of technologies, Drone simplifies setting up our build server a lot: we just prepare separate Docker images for each technology stack and tell Drone to use them for builds. To name a few, there are images for running Node.js, Java and Mono builds. And of course, they are also built with Drone.

For the sake of the demo, I've created a Node.js web application called test-server which simply shows its environment variables upon request; you may find it on GitHub. We'll refer to it in the rest of the article and will see how to build a Docker image with test-server and deploy it to Marathon. It should be noted that a service should strive to follow the principles of the twelve-factor app to integrate smoothly with this runtime platform.

Before we proceed with building, it's worth noting that we use Marathon to schedule our long-running services. It is the cornerstone of the whole runtime platform: it controls service availability and is used for service discovery. It has a pretty straightforward curl-able API, and we leverage it to automate deployments. In order to deploy a service with Marathon, one needs to call its API and post a JSON payload describing the service, similar to the following:
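
Roughly along these lines (the port, resources, health check and SERVICE_TAGS values here are illustrative):

{
  "id": "/test-server",
  "instances": 4,
  "cpus": 0.25,
  "mem": 256,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "registry.local/test-server:master-ef5a154e25b268c611006d08a78a3ec0a451e7ed-56",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 3000, "hostPort": 0, "protocol": "tcp" }
      ]
    }
  },
  "env": {
    "SERVICE_TAGS": "production,internal-listen-http-3000"
  },
  "healthChecks": [
    { "protocol": "HTTP", "path": "/", "portIndex": 0 }
  ]
}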

It instructs Marathon to schedule 4 instances of a Docker-based service from the image registry.local/test-server tagged with master-ef5a154e25b268c611006d08a78a3ec0a451e7ed-56. Let's call it a service definition. We're not going into the details of its specification; there is a good Marathon API reference for that. What is more important here is the automatic generation of this file during the build. For every deployable service there is a service definition template containing placeholders which start with '$'. The aforementioned file is generated from the following template:
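
The service.json template looks roughly like this (non-placeholder values are, again, illustrative):

{
  "id": "/test-server",
  "instances": $instances,
  "cpus": 0.25,
  "mem": 256,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "registry.local/test-server:$tag",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 3000, "hostPort": 0, "protocol": "tcp" }
      ]
    }
  },
  "env": {
    "SERVICE_TAGS": "$environment,internal-listen-http-3000"
  },
  "healthChecks": [
    { "protocol": "HTTP", "path": "/", "portIndex": 0 }
  ]
}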

Note the $tag, $environment, and $instances placeholders. During the build, the $tag value is generated and substituted into service.json with the following simple script:
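
A sketch of that step, assuming Drone's standard build variables (DRONE_BRANCH, DRONE_COMMIT, DRONE_BUILD_NUMBER); the script name is arbitrary:

#!/bin/sh
# generate-service-definitions.sh
# Compose the image tag from branch, commit SHA and build number.
tag="$DRONE_BRANCH-$DRONE_COMMIT-$DRONE_BUILD_NUMBER"

# Produce two service definitions: one pinned to the exact version
# and one that simply tracks the latest build of the branch.
sed "s/\$tag/$tag/g" service.json > "service-$tag.json"
cp "service-$tag.json" "service-$DRONE_BRANCH.json"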

It creates 2 service definitions: one named after the full version and one named after the branch, service-master-ef5a154e25b268c611006d08a78a3ec0a451e7ed-56.json and service-master.json respectively. In both files $tag is replaced with master-ef5a154e25b268c611006d08a78a3ec0a451e7ed-56, but $environment and $instances are left intact; we'll need them later during the deployment phase. Let's take a look at the Drone build configuration to see what happens next (the complete syntax is described here):
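
A condensed .drone.yml along these lines (the build image is ours, and the exact plugin names, option keys and variable interpolation depend on your Drone version, so treat this as a sketch):

build:
  image: registry.local/node-build
  commands:
    - npm install
    - npm test
    - ./generate-service-definitions.sh

publish:
  docker:
    registry: registry.local
    repo: registry.local/test-server
    tag: $$BRANCH-$$COMMIT-$$BUILD_NUMBER
  s3:
    bucket: build-artifacts
    source: service-*.json
    target: /test-server/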

Then CI is instructed to publish build artifacts:

  • Create a Docker image and push it to the registry
  • Publish the generated service definitions to S3

It's a good idea to keep some number of the last published Docker image tags in your Docker registry, or even keep all of them if disk space is not a concern. This allows for easy rollbacks with Marathon if something goes wrong with the newly deployed version.

Now we are ready to deploy our shiny new build. In order to do that we have to post the service definition JSON to Marathon's /v2/apps API endpoint. But first we need to replace the remaining placeholders: $environment and $instances. Although these are really simple things, we still want to automate them with the help of the marathon-deploy utility. It downloads the service definition template, replaces the placeholders with values and creates/updates the app in Marathon. It should be called like this:

marathon-deploy.sh <service-template-url> <environment> [instances-count]

where service-template-url is the URL of the service definition artifact published to S3; environment can be anything you like to separate different runtime environments (staging, production, A/B split test groups, etc.); and instances-count specifies the number of service instances to start, 1 by default. For example:

marathon-deploy.sh https://build-artifacts.s3.amazonaws.com/test-server/service-master.json staging

will deploy the latest build of the master branch to the staging environment and run a single instance. Or:

marathon-deploy.sh https://build-artifacts.s3.amazonaws.com/test-server/service-master-ef5a154e25b268c611006d08a78a3ec0a451e7ed-56.json production 4

will deploy a specific build to the production environment and run 4 instances.
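
A simplified sketch of what such a script can do (the Marathon address and the PUT-to-create behavior are assumptions of this sketch):

#!/bin/sh
# marathon-deploy.sh <service-template-url> <environment> [instances-count]
set -e
template_url="$1"
environment="$2"
instances="${3:-1}"

# Fetch the published definition and fill in the remaining placeholders.
definition=$(curl -fsSL "$template_url" \
  | sed -e "s/\$environment/$environment/g" -e "s/\$instances/$instances/g")

# PUT to /v2/apps/<id> updates an existing app and creates a missing one
# (an initial deployment could equally POST the payload to /v2/apps).
app_id=$(echo "$definition" | jq -r '.id')
echo "$definition" | curl -fsS -X PUT -H "Content-Type: application/json" \
  --data-binary @- "http://localhost:18080/v2/apps$app_id"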

In our case, we copy the marathon-deploy script to all Mesos slave machines when they are provisioned and start deployments via Chronos. We have preconfigured jobs for deploying our key services to different environments, so a deployment requires just a single click. The great thing about Chronos is that it can chain jobs together, so one can configure specific one-off tasks to be executed before and after the actual deployment to prepare the runtime environment.

Service discovery and load balancing

Services in most distributed systems come and go frequently for different reasons: starting and stopping, scaling, or failures. Unlike a static load balancer configuration, which works for servers with well-known IP addresses and host names, load balancing for Marathon-managed services requires much more sophistication: services must be registered and deregistered with the balancer on the fly. For that, we need some kind of service registry which holds information about registered services and provides it to clients. This concept is known as service discovery, and it's a key component of most distributed systems.

Here Consul comes to the rescue. As its website states, "Consul makes it simple for services to register themselves and to discover other services via a DNS or HTTP interface". In addition, it has other useful features we'll make use of later. Now that we have a service registry, we need to let it know which services are started and where they are located (hostname and port), and optionally provide other useful meta information about them. One way of doing that is to make the services themselves use the Consul API directly, but that would obviously require every service to implement its own registration logic, and while it's trivial for one or two services, it becomes a burden if you have many of them, especially when they are written in different programming languages. Another way is to have some kind of third-party utility which monitors services and reports them to Consul. We use a program called marathon-registrator; it tightly integrates with Marathon and can register any kind of service Marathon runs. Another option is the Gliderlabs registrator, if you only have services in Docker containers. You just need to run an instance of such a utility on each Mesos slave host.
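
With the Gliderlabs registrator, for example, that is roughly a one-liner per host, assuming a local Consul agent listening on port 8500:

docker run -d --name=registrator --net=host \
  -v /var/run/docker.sock:/tmp/docker.sock \
  gliderlabs/registrator:latest consul://localhost:8500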

Once services are registered, other services should be able to locate them. Again, they could communicate directly with the Consul API or DNS to get this information (client-side discovery), but there is an alternative: a load balancer such as HAProxy (server-side discovery).

Server-side service discovery with HAProxy has many benefits over client-side discovery:

  • Load balancing for free.
  • Immediate propagation of changes from the service registry to consumers. HAProxy is reconfigured and ready to route requests to new instances right after a change occurs.
  • Extremely flexible configuration. To name a few features: load balancing strategies, healthchecks, ACLs, A/B testing, logging, and statistics.
  • No need for service to implement additional discovery logic.

But how does HAProxy keep track of the services registered in Consul? Normally, its configuration is static, with all backends known in advance. But it can also be built dynamically with an external tool such as consul-template. This tool monitors Consul for changes and generates arbitrary text files from provided Go templates, so it's not limited to configuring HAProxy and can be used with anything that is configurable with text files (nginx, varnish, apache, etc.). Its templating language documentation is comprehensive and can be found in its README.

As you might have noticed from the overview chart, we run two different HAProxy configurations: one for internal and one for external load balancing. Internal instances provide the actual service discovery and balance traffic across back-end services. External instances, in addition to service discovery, expose TCP port 80 and accept requests from the outside, thus balancing the load of front-end services.

For these two kinds of HAProxy instances we manage two separate consul-template template files, which are themselves built by another templating engine (Jinja2) during machine provisioning performed by Saltstack. This was done mainly to keep everything DRY and to populate some parts with data from the machine configuration software. Let's look at the external balancer configuration template. Note the raw/endraw markers used throughout these templates, which make the Jinja engine disregard Go template curly braces and render the enclosed contents "as is":
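
Boiled down, the external template is mostly a composition of the parts discussed next (the top-level file name and layout are illustrative):

# haproxy-external.ctmpl.jinja
{% include 'haproxy-defaults.ctmpl.jinja' %}
{% include 'haproxy-wellknown-services.ctmpl.jinja' %}
{% include 'haproxy-internal-frontend.ctmpl.jinja' %}
{% include 'haproxy-external-frontend.ctmpl.jinja' %}
{% include 'haproxy-backends.ctmpl.jinja' %}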

It includes several dependencies. haproxy-defaults.ctmpl.jinja is the regular static part found in many HAProxy config examples. haproxy-internal-frontend.ctmpl.jinja is more interesting: this is where the internal service discovery configuration is done.

The idea is to come up with a well-known port number for every discoverable service and create an HAProxy front-end which listens on that port. We'll make use of the meta information stored along with every registered service. Consul allows specifying a list of tags associated with a service, and marathon-registrator reads them from the service environment variable called SERVICE_TAGS. See the service.json template of test-server: it contains two tags separated by a comma, $environment and internal-listen-http-3000. The latter is used in the consul-template template to mark services which expose a port (3000 in our case) for service discovery. The following snippet automatically generates the necessary HTTP front-ends:
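
A condensed version of that snippet, wrapped in raw/endraw (whether regexMatch and replaceAll are available depends on the consul-template version; the real template also handles the extra markers described below):

{% raw %}
{{ range services }}{{ $name := .Name }}{{ range .Tags }}{{ if . | regexMatch "^internal-listen-http-" }}
{{ $port := . | replaceAll "internal-listen-http-" "" }}
# Production front-end listens on the well-known port itself.
frontend {{ $name }}-production
    bind *:{{ $port }}
    mode http
    default_backend {{ $name }}-production

backend {{ $name }}-production
    mode http{{ range service (printf "production.%s" $name) }}
    server {{ .Node }}_{{ .Port }} {{ .Address }}:{{ .Port }} check{{ end }}

# Staging front-end gets the port prefixed with "1" (3000 becomes 13000).
frontend {{ $name }}-staging
    bind *:1{{ $port }}
    mode http
    default_backend {{ $name }}-staging

backend {{ $name }}-staging
    mode http{{ range service (printf "staging.%s" $name) }}
    server {{ .Node }}_{{ .Port }} {{ .Address }}:{{ .Port }} check{{ end }}
{{ end }}{{ end }}{{ end }}
{% endraw %}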

Its outer loop lists all Consul services, while the inner loop iterates over each service's tags and tries to find a match with internal-listen-http-<port>. For every match, an HTTP front-end section is created. Every service here has two hardcoded environments, production and staging; to differentiate them, the port number for staging is prefixed with "1", so the production front-end listens on 3000 and the staging one on 13000.

Additional if statements allow specifying multiple discoverable ports on a single service. For that, just place additional internal-listen-http-<port> markers in the tags list, like

$environment,internal-listen-http-3000,internal-listen-http-3010

Now you'll need to add the newly exposed port to the container.docker.portMappings array of the service definition file in order for Marathon to properly configure your container's network. Note that in this case marathon-registrator will register two separate services, test-server-3000 and test-server-3010, so that they resolve independently and there is no name ambiguity.

You may come up with other predefined markers to implement other kinds of logic in the templates: for example, introduce internal-listen-tcp-<port> to generate TCP front-ends, or control the balancing strategy with something like balance-roundrobin or balance-leastconn.

This template configures HAProxy in such a way that every machine can reach every service known to Consul by connecting to localhost:<well-known-port>, thus solving the service discovery problem.

In haproxy-wellknown-services.ctmpl.jinja we specify more or less statically managed services like Marathon, Consul, and Chronos for easy discovery. They are started by systemd/upstart/etc. during machine provisioning. For example, the following snippet allows very convenient access to the Marathon instance by simply contacting localhost:18080 from any machine in the cluster, and localhost:14400 and localhost:18500 for Chronos and Consul respectively (the master_nodes collection comes from the configuration management software in this case):
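
Roughly (assuming master_nodes is a list of master host names coming from the Salt pillar data):

# Marathon: well-known port 18080 -> 8080 on the master nodes
listen marathon
    bind *:18080
    mode http
{% for host in master_nodes %}
    server {{ host }} {{ host }}:8080 check
{% endfor %}

# Chronos: well-known port 14400 -> 4400
listen chronos
    bind *:14400
    mode http
{% for host in master_nodes %}
    server {{ host }} {{ host }}:4400 check
{% endfor %}

# Consul: well-known port 18500 -> 8500
listen consul
    bind *:18500
    mode http
{% for host in master_nodes %}
    server {{ host }} {{ host }}:8500 check
{% endfor %}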

haproxy-external-frontend.ctmpl.jinja describes the HTTP and HTTPS front-ends. It contains several Jinja macros which define ACL rules for domain name matching and bind back-ends to those rules:
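
A trimmed-down illustration of those macros (domain names and certificate paths are placeholders):

{% macro domain_acl(name, domain) -%}
    acl is-{{ name }} hdr(host) -i {{ domain }}
{%- endmacro %}

{% macro route(name) -%}
    use_backend {{ name }} if is-{{ name }}
{%- endmacro %}

frontend http-in
    bind *:80
    mode http
    {{ domain_acl('test-server', 'test.example.com') }}
    {{ route('test-server') }}

frontend https-in
    bind *:443 ssl crt /etc/haproxy/certs/
    mode http
    {{ domain_acl('test-server', 'test.example.com') }}
    {{ route('test-server') }}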

And finally, there is the haproxy-backends.ctmpl.jinja file. It lists the available service instances referred to by the previous sections. All backends here are crafted manually since they might have very specific requirements in terms of health checking or load balancing configuration:
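
For example, the back-end for test-server might look like this (the health check path and balancing strategy are illustrative), again wrapped in raw/endraw so consul-template gets the Go template intact:

{% raw %}
backend test-server
    mode http
    balance leastconn
    option httpchk GET /
    {{ range service "production.test-server" }}
    server {{ .Node }}_{{ .Port }} {{ .Address }}:{{ .Port }} check
    {{ end }}
{% endraw %}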

The internal balancer configuration file is a bit simpler; it only needs to route connections to internally accessible services:
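
So its top-level template is just a shorter composition of the same parts (file name, again, illustrative):

# haproxy-internal.ctmpl.jinja
{% include 'haproxy-defaults.ctmpl.jinja' %}
{% include 'haproxy-wellknown-services.ctmpl.jinja' %}
{% include 'haproxy-internal-frontend.ctmpl.jinja' %}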

Zero downtime deployment

A good property of every dynamic runtime system is the ability to deploy services with zero downtime. When you deploy often, you want to be sure that service availability is not affected. Let's see how to achieve this in our system.

The idea is to run multiple service instances serving incoming requests and, during deployment, replace them one by one: only when a new instance catches up can we shut down an old one. Such a rolling restart is handled completely by Marathon, but in order to maintain constant uptime we need to make sure that the load balancer stops routing traffic to instances which are about to shut down.

Let's first discuss how Mesos stops Docker-based service instances. Like other Unix process control systems, it tries to stop the service gracefully. First, SIGTERM is sent to the process; the process is then expected to finish processing pending requests, clean up and quit by itself. If it does not quit within the allotted time, SIGKILL is sent to forcibly shut it down. Mesos effectively executes the following:

docker stop -t=<TIMEOUT> <CONTAINER>

(see https://docs.docker.com/engine/reference/commandline/stop/)

Now it's clear that our service should correctly handle termination signals (this is also one of the twelve-factor app principles, http://12factor.net/disposability). After receiving SIGTERM, it should somehow tell the load balancer to stop routing traffic to it. The simplest way to do that is to fail all subsequent health checks but still serve other requests normally. Once the load balancer discovers that the instance is unhealthy, it will stop routing traffic there. To make the whole thing work, we need to be sure that the forced shutdown timeout is large enough for the load balancer to notice the health change and for the service to finish processing pending requests. By default, the Mesos slave is configured to wait for 5 seconds, but it's possible to change that with the Mesos slave command line parameters:

--executor_shutdown_grace_period=60secs --docker_stop_timeout=30secs

An example of handling SIGTERM can be found at https://github.com/x-cray/test-server/blob/master/server.js.

A separate note should be made for dockerized services. In order to allow signals to reach the container's process, make sure to specify the executable with ENTRYPOINT (or CMD) in exec form rather than in shell form in the Dockerfile. When the shell form is used, the executable is wrapped by a shell process which doesn't forward any signals (see the Docker documentation for details).
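
To illustrate the difference:

# Shell form: the command is started via "/bin/sh -c", and the shell
# (PID 1 in the container) does not forward SIGTERM to node.
CMD node /app/server.js

# Exec form: node itself is PID 1 and receives SIGTERM directly.
ENTRYPOINT ["node", "/app/server.js"]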

Basically, that should be it for achieving zero-downtime deployment. But with HAProxy, this is only true on *BSD systems. On Linux there is a small issue: when HAProxy reloads its configuration, there is a short window (around 20–50 ms) when it is not listening for incoming connections, and all requests arriving during this gap will obviously fail. The reason is the way the Linux kernel handles the SO_REUSEPORT socket option (more details at http://lwn.net/Articles/542629/). There is a great article by Yelp on how this issue can be mitigated. We ended up using the simple SYN-packet-dropping approach, which is quite sufficient for our current use. This is how consul-template now reloads our HAProxy instances:
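
The reload command configured for the template looks roughly like this (the port list, paths and sleep duration are ours):

#!/bin/sh
# haproxy-reload.sh: briefly drop incoming SYNs so that connection attempts
# made during the restart window are retransmitted by the client's TCP stack
# instead of being refused.
iptables -I INPUT -p tcp --dport 80 --syn -j DROP
sleep 0.2
# Start a new HAProxy process and tell the old one (-sf) to finish gracefully.
haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)
iptables -D INPUT -p tcp --dport 80 --syn -j DROP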

Service configuration management

Every service you run requires its own configuration: connection strings, API keys, etc. Normally, it should neither be hardcoded nor derived from the build configuration (e.g. Debug/Release), because you'll want different configurations for the same executable in different environments (production/staging/split testing groups).

Service configuration normally comes from different sources, each with its own priority: defaults, a file, environment variables, argv. If the service can be configured via environment variables (see http://12factor.net/config), it opens a number of possibilities for managing it in a centralized manner: all we have to do is prepare the correct environment before starting it. The good thing is that we already have a centralized configuration store for free: Consul provides built-in key-value storage (Consul KV) which can be used to keep configuration values. There just needs to be a way to forward those values into the service environment.

There is a utility called envconsul which reads Consul KV data and passes it to the service environment, restarting its service instance when the KV data is updated. However, it doesn't play well with rolling restarts that maintain service availability: it restarts all the instances at once, which may lead to dropped requests. An alternative is to instruct Marathon to restart the service when KV data is updated, since we already have everything in place for flawless restarts when they are done by the scheduler. That's why instead of envconsul we use a small shell script called service-wrapper, which reads data from Consul KV, sets the service environment and restarts the service via Marathon when KV changes occur; it also forwards a received SIGTERM to the underlying process for graceful shutdown. The script should be placed in the Docker image along with the service being deployed. It makes use of curl and jq, so make sure they are also present in the Docker image. This is how it should be used:

service-wrapper.sh <marathon-host:port> <consul-host:port> <prefix> <command>

where marathon-host:port and consul-host:port are self-descriptive, the prefix value by convention should be the Marathon app ID, and command is the actual service executable with its arguments. In order to set service environment variables, one has to put the values under the prefix path in Consul KV storage. The following is an example of running service-wrapper:

service-wrapper.sh dockerhost:18080 dockerhost:18500 app/staging/test-server node /app/server.js

Notice dockerhost:18080 and dockerhost:18500: these are examples of the server-side internal service discovery discussed earlier.
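
For example, putting a couple of values under the app/staging/test-server prefix (the key names here are made up) surfaces them as environment variables inside the service:

curl -X PUT -d 'postgres://db.staging.local/test' http://dockerhost:18500/v1/kv/app/staging/test-server/DATABASE_URL
curl -X PUT -d 'debug' http://dockerhost:18500/v1/kv/app/staging/test-server/LOG_LEVEL

so the test-server process will see DATABASE_URL and LOG_LEVEL in its environment.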

From now on, if you add, remove or modify any value under the prefix, the service will be gracefully restarted with the new configuration applied.

Conclusion

The presented Mesos-based architecture gives a great ability to mix and match different components to get a complete working system suitable for one's needs. For example, for load balancing we could use HAProxy or NGINX with configuration dynamically generated by consul-template, or even replace consul-template with something that talks to Consul on one side and configures IPVS load balancing on the other. We could replace the entire service discovery/load balancing layer with Netflix Eureka or write a custom solution based on etcd or ZooKeeper. Building such a system helps to understand the processes behind a distributed runtime platform and its core components.

This platform is highly portable: hosting it is not limited to a particular cloud/IaaS provider but can also be done on-premise or even in hybrid (cloud and on-premise) environments. There are even efforts to port Mesos to Windows.

Update: after some bad experience with HAProxy and long-lived TCP connections, we actually switched away from load balancing TCP traffic with it.
