Do we need a cloud chassis?

Prior to the emergence of containers like Docker, we used tarballs and Puppet to streamline our deployment pipelines. In a private data center you can evolve conventions for where log streams are sent and how applications receive traffic from load balancers.

We replaced a small number of monolithic Java applications with thousands of microservices and deployed them in multiple cloud environments via container schedulers. Conventions are often tribal knowledge and hard to scale when the number of deployments and the size of your development community grow exponentially.

As we embraced a polyglot development community we saw a proliferation of application frameworks. In the past we could share modules and libraries to define the contract between infrastructure and applications. Common libraries become a significant overhead if they have to support multiple languages. We quickly went from only supporting Java to adding support for various other JVM-based languages, Node.js, Python, and Go. Moving to containers, we were able to retain conventions for handling logging and configuration, but new interactions became necessary, including service discovery and routing via a service mesh. This gave us a reason to map out the way application code interacts with underlying infrastructure.

Defining a Cloud Container Chassis

Cloud Container Chassis

Chassis Components

  • Application - code that drives your application, e.g., “hello world”:
const http = require('http');
http.createServer(function (request, response) {
  response.writeHead(200, {'Content-Type': 'text/plain'});
  response.end('Hello World\n');
}).listen(8081);
  • Framework - libraries that support efficient re-use of common functions, e.g., express.js:
const express = require('express')
const app = express()
const port = 3000
app.get('/', (req, res) => res.send('Hello World!'))
app.listen(port, () => console.log(`Example app listening on port ${port}!`))
  • Config and Secrets - external sources of application configuration and sensitive values, e.g., database passwords.
  • Logging - debug output from application code.
console.log('hello world!');
  • Tracing and Metrics - support for collecting application metrics and distributed-tracing data.
  • Discovery - a registry of services that can be discovered via APIs and DNS.
  • Mesh - provides access to applications that have registered with the Discovery service.
“A service mesh is a dedicated infrastructure layer for making service-to-service communication safe, fast, and reliable.” — William Morgan
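As a concrete illustration of the Config and Secrets component, here is a minimal sketch of pulling configuration from the container environment rather than baking it into the image. The variable names (`PORT`, `LOG_LEVEL`, `DB_PASSWORD`) are illustrative, not a prescribed contract:

```javascript
// Read configuration from the container environment rather than
// hard-coding it in the image; names below are illustrative only.
function loadConfig(env) {
  return {
    port: parseInt(env.PORT || '3000', 10), // plain config value with a default
    logLevel: env.LOG_LEVEL || 'info',      // plain config value
    dbPassword: env.DB_PASSWORD || null,    // secret, injected at deploy time
  };
}

const config = loadConfig(process.env);
console.log(`starting on port ${config.port} (log level: ${config.logLevel})`);
```

Keeping configuration external is what lets the same container image run unchanged across data centers and vendors, with only the injected environment differing.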

Defining the cloud container chassis has proven to be a useful design pattern, but iterating beyond a marchitecture diagram has been a tricky undertaking. We tried RFCs and other forms of standardization, but migrating from a file-based strategy for log output quickly led to a debate about log formats… JSON or not? There are other areas where we definitely needed to provide governance, such as application metrics. Some developers love metrics, and their apps can emit thousands of uniquely named metrics if tags are not used. Tags are an invaluable tool for reducing the number of unique metric names, and they support application-specific views of infrastructure metrics such as request latency observed by load balancers.
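To make the tagging point concrete, here is a minimal sketch. The emitter and metric names are hypothetical stand-ins, not our actual telemetry library; the point is that dimensions travel as tags rather than being baked into the metric name:

```javascript
// Anti-pattern: dimensions baked into the name explode the series count:
//   request_latency_checkout_us_east_200
//   request_latency_billing_us_west_500
//   ... one uniquely named metric per combination.
//
// Tagged form: one metric name, dimensions carried as tags.
function emitMetric(store, name, value, tags) {
  // store is an in-memory map standing in for a real metrics backend
  const key = `${name}|${JSON.stringify(tags)}`;
  (store[key] = store[key] || []).push(value);
  return key;
}

const store = {};
emitMetric(store, 'request_latency_ms', 42, { service: 'checkout', region: 'us-east', status: 200 });
emitMetric(store, 'request_latency_ms', 87, { service: 'billing', region: 'us-west', status: 500 });
// One metric name; the backend can slice by service, region, or status.
```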

Exploring application discovery and observability

Discovery


Adding container-based applications to a service registry is a prerequisite for ensuring that they are discoverable. This is essential when cloud compute is spread across multiple data centers and vendors because it ensures that systems and humans have a consistent way of finding applications once they are deployed. One challenge with service registration is that it is harder to accomplish outside the container itself. We currently use self-registration via an init agent, but decoupling container images from compute schedulers would simplify the chassis contract. This is particularly important if you want to write more portable applications that can be shared in the open-source community.
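Our init-agent registration is roughly of the following shape. The sketch builds only the registration payload; the field names are assumptions modeled on common registries such as Consul, not our exact contract:

```javascript
// Build a service-registration payload from container metadata.
// Field names are illustrative; real registries (e.g., Consul) differ in detail.
function registrationPayload({ name, address, port }) {
  return {
    Name: name,
    Address: address,
    Port: port,
    Check: {
      // Registries typically deregister instances whose health checks fail.
      HTTP: `http://${address}:${port}/health`,
      Interval: '10s',
    },
  };
}

const payload = registrationPayload({ name: 'hello-world', address: '10.0.0.7', port: 8081 });
// An init agent would PUT/POST this payload to the local registry agent
// before handing control to the application process.
```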

A service mesh works in concert with the registry to proxy traffic from application to application. We have used a sidecar solution for the last few years, and it has cleaned up the interactions between microservices in our cloud environments. Before adopting sidecar proxies we experimented with service registry–based DNS and internal load-balancer pools.
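The sidecar model simplifies the application's view of the network: instead of resolving peers itself, the app sends every outbound call to a local proxy, which consults the registry on its behalf. A sketch of the difference, with a hypothetical proxy port and routing header (real meshes vary):

```javascript
// Without a mesh, the application resolves the peer itself.
function directUrl(lookup, service, path) {
  const { address, port } = lookup(service); // lookup = DNS SRV or a registry API call
  return `http://${address}:${port}${path}`;
}

// With a sidecar, the application always talks to the local proxy, which
// handles discovery, retries, and mTLS on the app's behalf.
function meshRequest(service, path, proxyPort = 15001) { // proxy port is illustrative
  return {
    url: `http://127.0.0.1:${proxyPort}${path}`,
    headers: { Host: service }, // the proxy routes on the service name
  };
}
```

The application code no longer changes when a peer moves between data centers or vendors; only the registry and proxy configuration do.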

We have observed a tradeoff between the separation of concerns evident in using different solutions for service registry and mesh, and the challenges of operating multiple interlinked distributed systems at enterprise scale. I prefer to eschew lock-in to specific solutions but our chassis would not exist without the pioneering work of vendors realizing the vision of Cloud Native Computing. This composable platform ensures that we can engage different cloud compute vendors with ease.

Observability


Prior to embracing polyglot software development, we spent over a decade instrumenting Java applications and emitting metrics. Today, open standards for instrumenting applications and infrastructure are on the horizon; until quite recently, infrastructure offered only proprietary options for collecting logs and metrics.

Modern infrastructure promises the ability to provide insight into applications without requiring direct instrumentation of application code. If we can prove that the service mesh can offer application-specific metrics and traces, then the need for applications to emit these signals themselves diminishes. Last year we adopted a stack that supports tracing using open standards, and it quickly became apparent that adding tracing instrumentation to hundreds of legacy microservices is time-consuming and impacts multiple development teams. Infrastructure-derived trace graphs do not yet offer a viable alternative, because automated tagging (based on the service registry) is required to make them comprehensible to application developers.

Lighting up the cloud chassis

Less is…

We have found the chassis pattern useful for delineating the conventions and interfaces between container-based applications and cloud compute infrastructure. The biggest challenge has been reducing complexity. Is it possible to have a chassis with fewer components? Stay tuned for the next installment, where we share our journey to a simpler solution. I’ll leave you with a splendid song about container drivers…

The Fall — Container Drivers