It was November 1st, 2016. I’d spent the better part of the day at Brian Ketelsen’s training on Go + Distributed Computing. O’Reilly had offered me a free ticket in exchange for feedback. During the training, I’d got to re-familiarize myself with a number of concepts around distributed systems in general and the Go microservices ecosystem in particular. On the first day the topics the training covered were service discovery, consensus, authorization (authz) and authentication (authn).
After the training had ended, I remember feeling very tired and had entertained the idea of heading home but had got a reminder on my calendar that I had an event to attend at 6:30pm. It was a talk on Envoy at the Reactive Systems and Microservices meetup. Looking back, I’m glad I didn’t give the event a miss as I’d been planning to because it ended up being the best talk I’d been to in 2016. It was a particularly memorable day for me because I’d spent the first half of the day learning about how to build traditional microservices and then later that evening learned about the myriad shortcomings of actually operating said traditional architecture, turning everything I’d learned earlier that day on its head.
Envoy had been launched a few weeks (maybe months? I don’t recall) prior to the talk and there’d been quite a bit of excitement in certain operations and systems engineering Slack groups I’m a part of about how Envoy could potentially replace things like nginx, HAProxy and OpenResty. I’d briefly skimmed through the Envoy docs a couple of days before the talk and I was pretty familiar with most of the functionality (routing, rate-limiting, health-checking, load balancing, service discovery etc.) that Envoy provided but not the specific implementation or design decisions. It was also just as well I’d gone to this meetup straight after the Go training, since I had my notebook with me and I ended up taking about 8 pages of notes, which was a first for me.
What is Envoy?
The first thing to establish here is just what even is Envoy? The docs state that:
Envoy is an L7 proxy and communication bus designed for large modern service oriented architectures.
L7 proxy? Communication bus? Large ‘modern’ (as opposed to staid?) service oriented architectures?
The docs also mention:
At its core, Envoy is an L3/L4 network proxy.
There’s quite a bit to unpack there.
The first thing one needs to understand is that Envoy is not a library or a framework like Finagle; it’s a server. Envoy runs as a separate process that sits in front of application servers.
If this sounds similar to nginx or HAProxy, it’s because by design Envoy aims to replace these services by providing the functionality they do. In other words, Envoy can be run as an edge proxy instead of nginx for TLS termination and can also replace HAProxy as a load balancer.
Which begs the question — why build a replacement for something as battle-tested and ubiquitous as nginx and HAProxy?
Why not nginx?
Well, for one thing, the creators of Envoy argue that for the most part an edge proxy does the same things as a service proxy — request routing, load balancing and the whole caboodle. Ergo running Envoy as the edge proxy in addition to running it as a service proxy becomes operationally easier. You then only have one thing to reason about, update, deploy, debug and instrument. Let’s hold this thought for now since it’s something I revisit later in this post.
Second, while nginx only supports H2 for downstream connections (inbound requests coming from hosts that connect to nginx such as web browsers, mobile devices, CDNs), Envoy can do bidirectional H2 proxying since it supports H2 for upstream connections (connections originating from Envoy to another host) as well. Bidirectional H2 support becomes pretty germane inasmuch as gRPC has emerged as the leading RPC framework in recent years, and Envoy’s ability to act as a transparent H2 proxy for gRPC calls between services certainly gives it an edge over similar edge proxies.
Third, as anyone who’s worked with nginx for anything other than the most trivial of tasks would attest, nginx scripting is not the easiest to test, gotchas around rewrites in nginx (if is evil) especially so. The nginx codebase itself isn’t one that was written with an eye towards making testability and maintainability as optimal as possible. We are definitely seeing this where I work where certain hard/impossible to test nginx frontends and proxies are being replaced by standalone levee services. I asked during the talk if Envoy could compete with nginx when it came to raw performance and was told that Envoy was primarily built with the goal of making it easily testable and extendable, as opposed to trying to eke the most performance out of it.
nginx might be an acquired taste for many but almost no Operations engineer I’ve met has had anything but love for HAProxy. I’m sure there exist people who dislike HAProxy but I’m yet to meet them.
HAProxy is incredibly performant. It’s single threaded like nginx and Redis, and it uses single buffering and zero copy message proxying done mostly in user space with fixed sized memory pools. It makes use of elastic binary trees to keep timers and the runqueue ordered, to manage round-robin and least-conn queues, and to look up ACLs or keys in tables with only an O(log(N)) cost (if you want to learn more, here’s a link to the HAProxy source, which has both a pretty decent explanation of how this works and a not too hard to grok C implementation). It also allows for blazing fast header parsing.
So then. Why use Envoy as a load balancer instead of HAProxy or ELB/ALB (if you’re on AWS)?
The docs list a number of reasons, but support for H2, hot restarts and the multithreaded architecture stood out as the biggest differentiators. Hot restarts are an especially neat feature, since dynamic HAProxy config reloads, while possible, can’t always be achieved without bending over backwards or introducing a ton of additional complexity. Envoy achieves this using shared memory and communication over a Unix domain socket, in an approach which bears some similarities to GitHub’s tool for zero downtime HAProxy reloads.
It’s hard to make a case for a multithreaded architecture in an era where evented async or m:n scheduling a la Go or Erlang seems to be the concurrency paradigm du jour for building systems software, but Envoy is premised upon the notion that it is “substantially easier to operate and configure circuit breaking settings when a single process is deployed per machine versus potentially multiple processes”.
The Kitchen Sink
The biggest draw of Envoy is that its sole purpose isn’t to act as a reverse proxy or a load balancer, but also to provide a glut of other functionality required to operate a service oriented architecture: service discovery, authorization, authentication, circuit breaking, retries, timeouts, load shedding, rate limiting, protocol conversion, and first class observability in the form of metrics collection, logging and distributed tracing.
All this functionality compiled into a single binary.
When I heard this my immediate reaction was one of scepticism. It flies in the face of the Unix philosophy of doing one thing well. Envoy aims to do everything that nginx, HAProxy, Consul/etcd/Zookeeper, Redis and so forth do in unison. This was another question I’d posed during the talk and the answer I was given was that, again, it’s operationally simpler to have all this functionality in Envoy as opposed to having to operate a multitude of systems responsible for each of these tasks.
Ease of operation happened to be the reason cited for most of Envoy’s design decisions. I’m not an SRE and have never operated even tens of services at scale, let alone hundreds or thousands, but when I spoke about this with an SRE the day after the talk, he agreed that yes, it was absolutely much simpler to operate and reason about something like Envoy as opposed to deploying a small army of services to operate an even bigger army of microservices. He added that if it weren’t for the fact that he’d been using HAProxy and friends for over a decade now, they wouldn’t be his first choice today, since by his own admission, writing nginx and HAProxy configs and rules was one of his least favorite things to do.
So then. What is Envoy?
At heart, it’s a network proxy, making the network transparent to applications. I’d imagine an architecture with Envoy to look something like the following:
Envoy, as stated previously, adopts an out of process architecture. This isn’t something new — the sidecar approach is pretty common and is used at several other companies. Netflix built Prana. Uber, I’m told, has something called muttley. New Relic built a service which they very inventively named sidecar.
Every service that sits behind Envoy listens on localhost for requests. Envoy supports writing pluggable filters which can be chained together. Most of the functionality that Envoy offers is implemented as filters. It can act as both an L3/L4 proxy and a layer 7 proxy. What this means is that Envoy can be used for things like MongoDB rate limiting at the wire protocol level, as well as H2 to H1 conversion at layer 7. As a bonus, filters for the MySQL and Kafka wire protocols (or the Kinesis API) would be pretty awesome too.
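To make the idea of chained filters a little more concrete, here’s a minimal Python sketch. It is emphatically not Envoy’s filter API (Envoy’s filters are written in C++), and every name in it (Request, make_rate_limit_filter, add_request_id, run_chain) is hypothetical, made up purely for illustration. The only thing it tries to show is the shape of the pattern: each filter sees the request in order, and any filter can short-circuit the chain before the request reaches the upstream.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Request:
    path: str
    headers: Dict[str, str] = field(default_factory=dict)


# A filter may mutate the request. Returning a string short-circuits the
# chain with that response; returning None passes control to the next filter.
Filter = Callable[[Request], Optional[str]]


def make_rate_limit_filter(max_requests: int) -> Filter:
    """Hypothetical filter: reject everything after max_requests calls."""
    seen = 0

    def rate_limit(req: Request) -> Optional[str]:
        nonlocal seen
        seen += 1
        if seen > max_requests:
            return "429 Too Many Requests"
        return None

    return rate_limit


def add_request_id(req: Request) -> Optional[str]:
    """Hypothetical filter: tag the request, never short-circuits."""
    req.headers.setdefault("x-request-id", "hypothetical-id")
    return None


def run_chain(filters: List[Filter], req: Request,
              upstream: Callable[[Request], str]) -> str:
    for f in filters:
        response = f(req)
        if response is not None:
            return response  # a filter answered on the upstream's behalf
    return upstream(req)


def echo_upstream(req: Request) -> str:
    return f"200 OK {req.path}"


chain = [make_rate_limit_filter(max_requests=2), add_request_id]
responses = [run_chain(chain, Request("/ride"), echo_upstream) for _ in range(3)]
```

With a limit of two, the first two requests reach the upstream and the third is answered by the rate limiting filter without the application ever seeing it.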
Eventually Consistent Service Discovery
One of the biggest takeaways for me from the talk was how Envoy approached service discovery in an eventually consistent manner.
This was a topic that had come up just the day before the meetup on the dist-sys Slack where Kelly Sommers had been arguing that service discovery was fundamentally an eventually consistent problem and that existing tools such as etcd/Consul (both of which are backed by the Raft log) and Zookeeper (backed by ZAB, a variant of Paxos, another fully consistent algorithm) which used a strong consistency model were barking up the wrong tree.
When I was chatting with one of my friends who works on infrastructure at Uber, he told me that he routinely saw 1/8th of all DNS queries fail on any given day. Using a strongly consistent model would require every single process in the network to have a fully consistent view of every other process. While there are definitely times that call for such strong consistency guarantees, is service discovery really one of those, especially in the current era of cloud native applications where failure is the norm?
Where I work, Consul is used for service discovery, and leader flapping, sluggish Raft appends and high Raft churn have often led to excess polling. The default behavior in Consul is that queries are ‘almost’ strongly consistent, which means these queries hit the quorum leader (I say ‘almost’ and not ‘always’ because in Consul, the concept of leader leasing might mean that during a network partition, there might be two leaders and an old leader could service a read, resulting in a stale value). Fortunately, with Consul one can weaken consistency guarantees for service discovery by querying the local Consul agent with the stale=true query param for the state of the world, which might well be the eventually correct state.
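For illustration, here’s roughly what building such a query against a local agent could look like in Python. This is a sketch under assumptions: the agent address (Consul’s default HTTP port, 8500, on localhost), the service name "rides" and the helper name health_query_url are all mine, and the code only constructs the URL rather than talking to a real agent.

```python
from urllib.parse import urlencode


def health_query_url(service: str,
                     agent: str = "http://127.0.0.1:8500",
                     stale: bool = True) -> str:
    """Build a health query URL for a (hypothetical) local Consul agent.

    With stale=True the query string carries the stale parameter, asking
    Consul to let any server answer rather than forwarding to the leader.
    """
    url = f"{agent}/v1/health/service/{service}"
    query = urlencode({"stale": "true"}) if stale else ""
    return f"{url}?{query}" if query else url


url = health_query_url("rides")
```

The tradeoff is the one described above: the answer may lag behind the leader’s view, which is usually acceptable for deciding where to send the next request.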
As Kelly observes, the point of service discovery is to know how to communicate at all times, so that your service discovery system being down won’t stop you from scaling up or operating normally. To this end, Envoy takes the approach of creating an overlay routing mesh using a combination of active health checking and service discovery data, where health checking data has primacy over that reported by service discovery.
The question was posed as to how services register with the discovery service at Lyft. The initial mesh is formed by having every host check into the discovery service once every minute. Only if the health checks fail and the process isn’t in the service discovery database is the process deleted from the mesh. A motivating example was provided during a talk at SRECon where this method of service discovery kept Lyft up during the 2017/02/28 S3 outage. Pretty neat.
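The rule above is small enough to write down as a truth table. The following Python sketch is my reading of what was described in the talk, not code lifted from Envoy; HostState, keep_in_mesh and route_traffic are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostState:
    health_check_ok: bool  # result of the proxy's own active health check
    in_discovery: bool     # host is still present in the discovery database


def keep_in_mesh(host: HostState) -> bool:
    """A host is deleted only when BOTH signals say it is gone."""
    return host.health_check_ok or host.in_discovery


def route_traffic(host: HostState) -> bool:
    """Health checking has primacy: only healthy mesh members get traffic."""
    return keep_in_mesh(host) and host.health_check_ok


decisions = {
    (ok, present): (keep_in_mesh(HostState(ok, present)),
                    route_traffic(HostState(ok, present)))
    for ok in (True, False)
    for present in (True, False)
}
```

The interesting row is the one where a host passes health checks but is missing from discovery: it stays in the mesh and keeps receiving traffic, which is what lets the mesh ride out an outage of the discovery service itself.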
Treating service discovery as an eventually consistent problem is something that has got increasingly popular in the last couple of years. Netflix’s Eureka adopted this approach years ago but many non-JVM organizations are loath to introduce any Java based services into their stack. Weaveworks’ Weave Mesh, which is written in Go, offers an eventually consistent network leveraging CRDTs and gossip. Watch this fantastic talk by Peter Bourgon and Matthias Radestock to learn more about Weave Mesh; it’s pretty cool. This presentation by Martin Kleppmann provides a good introduction and overview of how conflict resolution works in order to guarantee eventual consistency. New Relic’s sidecar offers gossip based service discovery, again built upon health checking. Where I work we have a policy of always setting stale=true when long polling Consul for changes. It’s fairly clear from an empirical standpoint that service discovery is an eventually consistent problem and should be treated as one.
This does come at a certain cost, though. Having every side car also be tasked with health checking leads to an O(N²) problem. As another engineer who has previously worked with the sidecar pattern stated on a Slack:
Instead of a central or per-datacenter health check, every sidecar did its own health check, and suddenly the logs were full of health check spam from five hundred bazillion sidecars asking the same question.
Logging health check information (or for that matter, anything) that isn’t actionable is an anti-pattern. Centralizing health checks can circumvent this problem, but that might, however, introduce other problems in its own right, leaving the onus on the engineering organization to choose a tradeoff reasonable for specific use-cases.
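A quick back-of-the-envelope calculation shows why the logs fill up. The sketch below assumes the simplest possible model, which is mine and not something from the talk: N sidecars, each health checking every other host once per interval, compared against a single central checker that visits each host once.

```python
def mesh_checks_per_interval(n_hosts: int) -> int:
    """Every sidecar checks every other host: N * (N - 1) checks."""
    return n_hosts * (n_hosts - 1)


def central_checks_per_interval(n_hosts: int) -> int:
    """One central checker visits each host once: N checks."""
    return n_hosts


n = 500  # hypothetical fleet size
mesh = mesh_checks_per_interval(n)
central = central_checks_per_interval(n)
ratio = mesh // central
```

At 500 hosts that works out to 249,500 checks per interval for the mesh versus 500 for the central checker, so each host answers (and potentially logs) 499 times as many health checks.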
Load balancing is most commonly done using a per service level load balancer like AWS ELB or HAProxy sitting behind a reverse proxy like nginx. Since Envoy aims to run as a service mesh in addition to functioning as an edge proxy, it’s tasked with implementing service to service load balancing as well as graceful degradation in the face of upstream or downstream failures instead of delegating this work to applications. In the docs, Envoy’s support for “advanced load balancing” is oft-cited as its advantage over the alternatives.
I’ve wondered in the past when rate limiting (and other forms of load balancing) should be implemented in something like nginx/HAProxy as opposed to having each individual application also shoulder the burden of implementing this logic (which almost invariably these days involves integrating Redis). I got several answers from a variety of folks. Of course, running a proxy comes with its own overhead and there might be many cases where this is less than ideal. I was learning about Algolia’s stack over the weekend where they directly inject latency critical C++ search code into nginx. But if SLA guarantees can be met while operating a service mesh, requiring applications to be network aware seems like a violation of the separation of concerns principle.
Envoy supports zone aware least request routing (which I’d imagine would be great for cloud deployments across different AZs) in addition to consistent hashing, round robin and random, with a configurable panic threshold which, when triggered, causes Envoy to discard health check status and uniformly load balance across all upstreams. I doubt this approach is perfect for all situations since it’s nearly impossible to get perfect load distribution (here’s a fantastic talk by Fastly’s CTO, for those interested in learning more), but for most use cases I suppose these approaches are more than sufficient. It’d be interesting to learn if specific filters can be written to integrate with Varnish’s VCL or Squid.
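The panic threshold is easier to picture with a few lines of code. This Python sketch is a simplification of the behaviour described above, not Envoy’s implementation: the function names are hypothetical, the 0.5 default is just a placeholder value for the example, and plain random selection stands in for whichever balancing policy is configured.

```python
import random
from typing import Dict, List


def eligible_hosts(health: Dict[str, bool],
                   panic_threshold: float = 0.5) -> List[str]:
    """Return the hosts a load balancer would choose between.

    Normally only healthy hosts are eligible. If the healthy fraction drops
    below panic_threshold, health status is ignored and every host becomes
    eligible, so the few survivors are not crushed by the full load.
    """
    hosts = sorted(health)
    if not hosts:
        return []
    healthy = [h for h in hosts if health[h]]
    if len(healthy) / len(hosts) < panic_threshold:
        return hosts  # panic mode: spread load across everything
    return healthy


def pick_host(health: Dict[str, bool], rng: random.Random,
              panic_threshold: float = 0.5) -> str:
    """Uniformly pick one eligible host (assumes at least one host exists)."""
    return rng.choice(eligible_hosts(health, panic_threshold))


normal = eligible_hosts({"a": True, "b": True, "c": False})
panicking = eligible_hosts({"a": True, "b": False, "c": False})
chosen = pick_host({"a": True, "b": False, "c": False}, random.Random(0))
```

In the first case two of three hosts are healthy, so only those two are eligible. In the second only one of three is healthy, which is below the threshold, so all three are back in rotation.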
Additionally, Envoy ships with support for dark traffic testing (the docs officially call this shadowing), where a certain percentage of traffic is replayed on to a test cluster, as well as outlier detection, outer timeouts (retries and per try timeouts) and inner timeouts (only per try timeout) which are configurable via headers (so clients can define a per service timeout header), circuit breaking and blue-green/canary deployments.
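Here’s how I think about the interplay between retries, a per try timeout and an overall timeout, written as a small simulation in Python. It is only a sketch of my reading: the function and parameter names are hypothetical, time is simulated rather than measured, and none of this is taken from Envoy’s source.

```python
from typing import List, Tuple


def simulate_request(attempts: List[Tuple[float, bool]],
                     overall_timeout: float,
                     per_try_timeout: float,
                     max_retries: int) -> Tuple[str, float]:
    """Simulate one logical request made of several tries.

    attempts[i] is (latency_seconds, succeeded) for the i-th try.
    Returns (outcome, simulated_seconds_elapsed).
    """
    elapsed = 0.0
    for latency, succeeded in attempts[: max_retries + 1]:
        remaining = overall_timeout - elapsed
        if remaining <= 0:
            break  # the overall budget is spent, no more tries
        budget = min(per_try_timeout, remaining)
        if latency <= budget:
            elapsed += latency
            if succeeded:
                return ("ok", elapsed)
            # the try failed quickly; fall through and retry
        else:
            elapsed += budget  # the try was cut off by a timeout
    if elapsed >= overall_timeout:
        return ("overall timeout", elapsed)
    return ("retries exhausted", elapsed)


# First try hangs and is cut off at 0.5s, the retry succeeds in 0.25s.
recovered = simulate_request([(2.0, True), (0.25, True)],
                             overall_timeout=1.0, per_try_timeout=0.5,
                             max_retries=2)
# Every try hangs, so the overall timeout caps the damage at 1.0s.
gave_up = simulate_request([(2.0, True)] * 3,
                           overall_timeout=1.0, per_try_timeout=0.5,
                           max_retries=2)
```

The per try timeout is what makes a retry possible at all: without it, the first hung attempt would eat the entire overall budget.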
Polyglot architectures and Separation of Concerns
Still with me? Good.
Kafka was just an example. Hadoop is another. The same is, to a lesser extent, true with other systems like Cassandra and pretty much everything else in the “Big Data” ecosystem. On the other hand, if one were to opt into what I call a “Cloud Native” stack — you’re most likely going to end up with most of your services written in Go. One of the biggest challenges of using a distributed system is that the quality, performance and reliability of clients of said system varies vastly from language to language and runtime to runtime.
And of course, no company I’ve worked at has been without its fair share of legacy code written in slower languages like Python or Ruby, and from what I hear from my friends, this is equally true at bigger companies as well. Sometimes, rewriting applications in a more performant language yields significant performance and cost gains. However, many a time rewriting legacy glue services or non-performance critical services is more trouble than it’s worth, so what many companies end up doing is running a polyglot architecture.
Polyglot environments differ vastly not just in the language in which they are written but also in terms of code quality, library support for integration with other shared architectural components, quality of observability and ease of deployment (for instance, deploying Python code with a tangle of native dependencies is still an unsolved problem unless one opts into sophisticated tooling like conda or nix, which most companies don’t for just deploying Python web applications).
Even in the hypothetical scenario where all the languages in use at a company are uniform and consistent when it comes to the aforementioned desiderata, a standalone Node.js application that worked perfectly well in isolation could end up becoming the bottleneck in a service oriented architecture unless all of its upstream and downstream service dependencies employed graceful degradation techniques like rate-limiting, circuit breaking, retries, timeouts and what have you. What most service oriented architectures end up doing is marrying the very distinct domains of application logic/business logic and networking logic.
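To give a sense of the kind of networking logic that ends up tangled into application code, here’s a bare-bones circuit breaker in Python. It is the generic textbook pattern (trip after a number of consecutive failures, allow probes after a cool-down), not a description of how Envoy implements circuit breaking, and the class name, parameter names and default values are all hypothetical.

```python
from typing import Optional


class CircuitBreaker:
    """Trips open after consecutive failures, allows probes after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def allow(self, now: float) -> bool:
        """Should a call be attempted at time `now` (seconds)?"""
        if self.opened_at is None:
            return True
        # Once the cool-down has passed, calls are let through again as probes.
        return now - self.opened_at >= self.reset_after

    def record(self, ok: bool, now: float) -> None:
        """Report the outcome of a call that was attempted at time `now`."""
        if ok:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open (or re-open) the circuit


breaker = CircuitBreaker(failure_threshold=2, reset_after=10.0)
breaker.record(ok=False, now=0.0)
breaker.record(ok=False, now=1.0)      # second consecutive failure trips it
blocked = breaker.allow(now=2.0)       # still inside the cool-down
probe_allowed = breaker.allow(now=11.0)
```

Multiply this by every language and every client library in a polyglot stack and it becomes clearer why pushing the logic into a proxy is appealing.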
Envoy treats these as separate concerns by decoupling networking logic from business logic, which in turn empowers your application developers to build services that do one thing and one thing well, the Unix way. Not too shabby a tradeoff, in my opinion. One could, I suppose, say that Envoy is built not for the soon to be passé microservices era but for the glorious nanoservices future where application services encapsulate purely business logic and leave the rest to Envoy.
As an aside, I personally believe that the Unix philosophy, quite like Unix itself, originated at a time when the computing landscape was vastly different to the current era, where the ability to make the right tradeoffs and reason about edge cases and failure modes tends to matter more than ideological purity.
DRYing up Observability
Microservices inherently embody a tradeoff.
The truth about service oriented architectures is that they trade application complexity for tooling, infrastructure and operational complexity. Here’s a fantastic HackerNews comment which summarizes the basic requirements you absolutely need in place to successfully run such an architecture. However, what must your engineering organization look like in order to make that happen?
In an ideal world, you’re going to need the following to successfully adopt and execute such an architecture:
- have operationally-savvy developers across the board equipped with best in class observability tools to help debug their own services
- a well-staffed team of world class operations engineers, systems engineers and SREs always ready to consult and assist in debugging any issue that might arise at any layer of the stack
- possibly embed an SRE or two per application team
Decentralizing the ability to debug is impossible without centralizing and unifying the main pillars of observability: the troika of logging, metrics and tracing. Having every single application emit stats or log to a file descriptor, possibly in a different format, is hardly ideal. Letting Envoy be in charge of handling these things allows your applications to be built without having to integrate instrumentation libraries or code into them. For instance, Envoy allowed Lyft to integrate distributed tracing into its stack without requiring any of the developers to add any extra code to their applications.
I see this as the DRYing up of instrumentation which has a great number of benefits, not least because it lends itself exceptionally well to swapping out one spec/protocol for another (syslog for socklog, for instance, or maybe plaintext logging for structured logging) or making the switch from one vendor to another or to an open source solution (Prometheus!) without requiring application developers to change their application code or packaging or deployment. I can also see how this approach can help provide a global view of the state of the health of the entire infrastructure without having to collect and aggregate stats from different services individually.
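As a toy illustration of what instrumenting at the proxy buys you, here’s a Python sketch in which a proxy object records request counts, error counts and latency for whatever handler sits behind it. The handler contains no instrumentation code at all. The class and attribute names are hypothetical and this is in no way how Envoy collects its stats; it only shows where the bookkeeping lives.

```python
import time
from collections import defaultdict
from typing import Callable, DefaultDict


class InstrumentedProxy:
    """Forwards requests and keeps per-service stats on the application's behalf."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self.clock = clock
        self.request_count: DefaultDict[str, int] = defaultdict(int)
        self.error_count: DefaultDict[str, int] = defaultdict(int)
        self.total_latency: DefaultDict[str, float] = defaultdict(float)

    def forward(self, service: str, handler: Callable[[str], str],
                request: str) -> str:
        start = self.clock()
        try:
            return handler(request)
        except Exception:
            self.error_count[service] += 1
            raise  # the caller still sees the failure
        finally:
            # Runs on success and on failure, so every request is counted.
            self.request_count[service] += 1
            self.total_latency[service] += self.clock() - start


def rides_handler(request: str) -> str:
    """Pure business logic: no stats, logging or tracing code in here."""
    return request.upper()


proxy = InstrumentedProxy()
reply = proxy.forward("rides", rides_handler, "ping")
```

Swapping the stats backend then means changing InstrumentedProxy once, rather than touching every handler in every language.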
Another question I’d posed during the meetup was how would one run Envoy itself? The answer was that Lyft used runit to run Envoy. Other process managers like systemd, supervisord, circus, s6 would do the job as well.
Running sidecar containers in Kubernetes is definitely a pattern that I’m seeing emerge. While Lyft wasn’t running containers/Docker in production at the time of the meetup (things could’ve changed now), there’s nothing about Envoy that looks like it won’t translate well to be used in conjunction with the standard container tooling, paving the way for those who desire to run a Kubernetes style infrastructure but want to leverage the benefit of Envoy’s features.
Saved this for last.
Envoy is written in C++.
When I heard about Envoy, I’d expected it to be written in Go, since Go is emerging as the lingua franca of the cloud native era. The choice of language was something that was discussed during the talk and I learned that the main reasons why C++ was chosen were performance and developer productivity.
Performance — absolutely. Envoy is a network proxy and it’s hard to argue that there could be anything more performant than native code for things like encoding/decoding data in the most resource efficient way, protocol parsing, zero copy reads and so forth. I’m not going to go into the reasons as to why — but if you’re curious, there’s a good blog post on this I’d recommend you peruse.
GC’ed languages like Go can result in pretty hard to pin down high tail latencies especially in cloud environments, and while there is some interesting research happening in recent years that might help improve GC induced tail latencies, at the end of the day it defeats the purpose of a central service like Envoy if it can’t be entirely relied upon to accurately report the data pertaining to the state of the network as seen from Envoy’s coign of vantage.
Developer productivity? C++? This certainly raised some eyebrows.
I haven’t programmed in C++ since 2010 and I’ve barely kept up with the latest and greatest so I really am in no position to comment on developer productivity. What I can say is that the last time I looked at C++11 briefly in 2014, it did have some pretty neat features. While I still believe that Go is better suited for developer productivity, I can’t deny that native code will reign supreme for most performance critical systems software for decades to come.
I ran into a friend of mine who was also present at the meetup sometime in early December. He’d been really excited about Envoy when it had launched and had felt that it could’ve (nay, should’ve) been a commercial offering, since there was a real need out there for something like Envoy that companies would be glad to buy rather than build or operate themselves. He wanted to use something like Envoy where he worked but was a tad apprehensive that Envoy was just something “a team of, like, 4 people” was working on.
Well, it now has 42 contributors (including, incidentally, an nginx core contributor) which might allay my friend’s fears a bit, but having the backing of an open source foundation like the Cloud Native Computing Foundation would definitely help adoption. I see Envoy as a very integral piece of infrastructure that works well with many of the existing tools in the cloud native ecosystem (gRPC, Kubernetes, OpenTracing). Envoy’s similarity to linkerd (a CNCF project) shouldn’t, in my opinion, affect its prospects of getting accepted since Envoy can stand on its own two feet.
A lot of the features Envoy provides out of the box are things many companies are inventing on their own and I do wonder if this presents a good opportunity for standardization. Going forward, I think the industry would benefit greatly from OpenTracing-esque standardization as opposed to more reinvention of the wheel in slightly different diameters. While I don’t see companies chomping at the bit to replace nginx and HAProxy any time soon, I’m pretty excited for Envoy’s approaches to service discovery, load balancing and observability in particular to gain more traction.