The universal data plane API

As I’ve said before, the excitement around and uptake of Envoy in such a relatively short period of time has been both awesome and humbling. I often ask myself: what aspects of Envoy have led to the exceptional community growth that we have seen? Although Envoy has a lot of compelling features, ultimately I think there are three primary attributes that together have driven uptake:

  1. Performance: Along with a large number of features, Envoy provides extremely high throughput and low tail latency variance, while consuming relatively little CPU and RAM.
  2. Extensibility: Envoy provides rich pluggable filtering capabilities at both L4 and L7, allowing users to easily add functionality not present in the OSS distribution.
  3. API configurability: Perhaps most importantly, Envoy provides a set of management APIs that can be implemented by control plane services. If a control plane implements all of the APIs, it becomes possible to run Envoy across an entire infrastructure using a generic common bootstrap configuration. All further configuration changes are delivered seamlessly and dynamically via the management server in such a way that Envoy never needs to be restarted. This makes Envoy a universal data plane that, when coupled with a sophisticated enough control plane, vastly reduces overall operational complexity.

There are existing proxies that have ultra high performance. There are also existing proxies that have a high level of extensibility and dynamic configurability. The union of performance, extensibility, and dynamic configurability is what in my opinion makes Envoy so compelling to many.

In this post I will outline the history and motivation behind the Envoy dynamic configuration APIs, discuss their evolution from v1 to v2, and end by encouraging the wider load balancing, proxy, and control plane community to consider supporting these APIs in their products.

The history of the v1 Envoy APIs

One of the original design goals of Envoy was to utilize an eventually consistent service discovery system. To this end, we developed a very simple discovery service and a Service Discovery Service (SDS) REST API that returns upstream cluster membership. This API overcomes some of the limitations of DNS based service discovery (record limits, lack of additional metadata, etc.) and allowed us to quickly achieve high reliability.
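
To make the shape of that API concrete, here is a minimal sketch of a v1-style SDS response builder. The field names ("hosts", "ip_address", "port", "tags") reflect the v1 SDS wire format as I describe it above, but treat the details as illustrative rather than authoritative:

```python
def sds_response(endpoints):
    """Build the JSON body a v1 SDS server might return for
    GET /v1/registration/<service_name>."""
    return {
        "hosts": [
            {
                "ip_address": ip,
                "port": port,
                # Optional metadata (e.g., zone) — the kind of additional
                # information that DNS-based discovery cannot easily carry.
                "tags": tags,
            }
            for ip, port, tags in endpoints
        ]
    }

body = sds_response([
    ("10.0.0.1", 8080, {"az": "us-east-1a"}),
    ("10.0.0.2", 8080, {"az": "us-east-1b"}),
])
```

Because the response is just a flat membership list, almost any control plane can produce it, which is a big part of why so many deployments implemented it themselves.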

When Envoy was initially open sourced, we received quite a few queries about supporting other service discovery systems such as Consul, Kubernetes, Marathon, DNS SRV, etc. I was concerned that our lack of direct support for these systems would limit uptake. The code is written in such a way that it’s not too difficult to add new discovery adapters and I hoped that interested parties would implement new ones. What has actually happened in the past year? Not a single new adapter has been contributed to the code and yet we have seen incredible uptake. Why?

It turns out that almost everyone went ahead and implemented the SDS API in a way that made sense for their deployment. The API itself is trivial, but I don’t think that’s the only reason people did it. Another reason is that the further you get from the data plane, the more opinionated things naturally become. Consumers of Envoy generally end up wanting to integrate service discovery into site specific workflows. The simplicity of the API allows for easy integration into almost any control plane system. Even users of systems like Consul (see for example Nelson) have found it useful to have an intermediate API that can do more intelligent processing around membership and naming. Thus, even at this early stage we were seeing the beginnings of a desire for a universal data plane API: a simple API that abstracts the data plane from the control plane.

In the past year, multiple v1/REST management APIs have been added to Envoy. They include:

  • Cluster Discovery Service (CDS): returns the upstream clusters that Envoy can route traffic to.
  • Route Discovery Service (RDS): returns the HTTP route tables that map requests to clusters.
  • Listener Discovery Service (LDS): returns the listeners (addresses, ports, and filter stacks) that Envoy binds to.

When a control plane implements SDS/CDS/RDS/LDS virtually all aspects of Envoy can be dynamically configured at runtime. Istio and Nelson are both examples of extremely feature rich control planes that have been built on top of the v1 APIs. By having a relatively simple REST API in place, Envoy can iterate rapidly on performance and data plane features, while still supporting a variety of different control plane scenarios. At this point the universal data plane concept is moving closer to reality.

Downsides of the v1 APIs and introduction of v2

The v1 APIs are JSON/REST only and polling in nature. This has a number of downsides:

  • Although Envoy uses JSON schema internally, the APIs themselves are not strongly typed, and it’s difficult to safely write generic servers that implement them.
  • Although polling works fine in practice, more capable control planes would prefer a streaming API in which updates can be pushed to each Envoy as they are ready. This might lower update propagation time from 30–60s to 250–500ms even in extremely large deployments.
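
The propagation numbers above follow from simple arithmetic: with polling, an update that lands just after a poll is not observed until the next poll fires, so the refresh interval is a hard bound on worst-case propagation, while a streaming push only pays network and processing latency. A back-of-envelope sketch (the specific numbers are illustrative):

```python
def polling_delay(update_at_s, poll_interval_s):
    """Seconds until the next poll observes an update made at update_at_s,
    assuming polls fire at t = 0, interval, 2 * interval, ..."""
    polls_elapsed = int(update_at_s // poll_interval_s)
    next_poll = (polls_elapsed + 1) * poll_interval_s
    return next_poll - update_at_s

# An update landing 1s into a 30s poll cycle waits 29s before any Envoy
# sees it; a pushed update would arrive in roughly the 250-500ms range.
```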

In strong collaboration with Google, we have been hard at work over the last several months on a new set of APIs that we are calling v2. The v2 APIs have the following properties:

  • The new API schemas are specified using proto3 and implemented as both gRPC and REST+JSON/YAML endpoints. Additionally, they are defined in a new dedicated source repository called envoy-api. The use of proto3 means that the APIs are strongly typed while still supporting JSON/YAML variants via proto3’s JSON/YAML representations. The use of a dedicated repository means that it will be substantially easier for projects to consume the API and generate stubs in all of the languages that gRPC supports (we will continue to support REST based JSON/YAML variants for those that wish to use them).
  • The v2 APIs are an evolution on v1, not a revolution, with a superset of v1 capabilities. Users of v1 will find that v2 maps very closely to what they are already using. In fact, we have been implementing v2 inside Envoy in a way that will allow v1 to continue to be supported likely in perpetuity (albeit with an ultimately frozen feature set).
  • Opaque metadata has been added to various API responses, allowing for a large amount of extensibility. For example, metadata in an HTTP route, metadata attached to an upstream endpoint, and a custom load balancer can be used to build site specific label-based routing. Our goal is to make it easy to plug in rich functionality on top of the default OSS distribution. Look for more robust documentation on writing Envoy extensions in the future.
  • For API consumers that use gRPC (vs. JSON/REST) for v2, bidirectional streaming allows for some interesting enhancements that I will discuss more below.
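
To sketch what the streaming exchange looks like, here is a simplified model of the v2 discovery flow, with plain Python dataclasses standing in for the proto3 request/response messages. The field names loosely mirror the real API, but this is an illustration of the pattern (client reports its last accepted version; server pushes only when the client is behind), not the actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class DiscoveryRequest:
    type_url: str                # which resource type (e.g., clusters, routes)
    version_info: str = ""       # last version the client accepted (the ACK)
    resource_names: list = field(default_factory=list)

@dataclass
class DiscoveryResponse:
    type_url: str
    version_info: str
    resources: list

def handle_request(req, snapshot, version):
    """Server side of one stream turn: push a typed response only when the
    client's acknowledged version is behind the current snapshot."""
    if req.version_info == version:
        return None  # client is up to date; hold the stream open
    wanted = req.resource_names or list(snapshot)
    return DiscoveryResponse(
        type_url=req.type_url,
        version_info=version,
        resources=[snapshot[n] for n in wanted if n in snapshot],
    )
```

The strong typing comes from proto3 in the real API; the bidirectional stream is what lets the server push the moment the snapshot changes rather than waiting for a poll.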

The v2 APIs are composed of:

  • Endpoint Discovery Service (EDS): This is the replacement for the v1 SDS API. SDS was an unfortunate name choice so we are fixing that in v2. Additionally, the bidirectional streaming nature of gRPC will allow load/health information to be reported back to the management server, opening the door for global load balancing capabilities in the future.
  • Cluster Discovery Service (CDS): No substantial change from v1.
  • Route Discovery Service (RDS): No substantial change from v1.
  • Listener Discovery Service (LDS): The only major change from v1 is that we now allow a listener to define multiple concurrent filter stacks that may be selected based on a set of listener routing rules (e.g., SNI, source/destination IP matching, etc.). This is a cleaner way of handling the “original destination” policy routing required for transparent data plane solutions such as Istio.
  • Health Discovery Service (HDS): This API will allow an Envoy to become a member of a distributed health checking network. A central health checking service can use a set of Envoys to health check endpoints and report status back, thus mitigating the N² health checking problem in which every Envoy potentially has to health check every other Envoy.
  • Aggregated Discovery Service (ADS): Envoy has been designed in general to be eventually consistent. This means that by default each of the management APIs runs concurrently and does not interact with the others. In some cases, it is beneficial for a single management server to handle all of the updates for a single Envoy (for example if updates need to be sequenced in such a way as to avoid traffic drops). This API allows all other APIs to be marshalled over a single gRPC bidirectional stream from a single management server, thus allowing for deterministic sequencing.
  • Key Discovery Service (KDS): This API has not yet been defined, but we will be adding a dedicated API for delivery of TLS key material. This will enable decoupling primary listener and cluster configuration delivery via LDS/CDS from key material delivery via a dedicated key management system.
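
The N² claim behind HDS is worth making concrete. In a full mesh, every Envoy checks every other Envoy, so total checks grow quadratically with fleet size; delegating each endpoint to a small set of checkers (as HDS envisions) makes the total grow linearly. A rough sketch, with the subset size of 3 chosen purely for illustration:

```python
def full_mesh_checks(n):
    # Every Envoy health checks every other Envoy: n * (n - 1) active checks.
    return n * (n - 1)

def delegated_checks(n, checkers_per_endpoint=3):
    # A central service assigns a small, fixed set of checkers per endpoint,
    # so total checks grow linearly with fleet size.
    return n * checkers_per_endpoint

# With 1,000 Envoys: 999,000 checks in a full mesh vs. 3,000 delegated.
```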

In aggregate, we call all of the above APIs xDS. The move from JSON/REST to well-typed proto3 APIs that can be more easily consumed is extremely exciting and in my opinion will further increase uptake of both the APIs themselves as well as Envoy.

A multi-proxy multi-control plane API?

The service mesh / load balancing space is very active right now. Proxies include Envoy, Linkerd, NGINX, HAProxy, Traefik, software load balancers from all major cloud providers, as well as physical appliances from traditional hardware vendors such as F5 and Cisco. The control plane space is also heating up with solutions like Istio, Nelson, integrated cloud solutions, and forthcoming products from many vendors.

Speaking of Istio specifically, Linkerd has already announced support, which means that at least at some level it already implements the v1 Envoy APIs. Others are likely to follow. In this new world of rapid development of both data planes and control planes, we are going to see mixing and matching of components; data planes will work with many control planes and vice versa. As an industry, would we benefit from a general purpose API that would allow this mixing and matching to happen more easily? How would this help?

In my opinion, over the next several years, the data plane itself is going to become mostly commoditized. Much of the innovation (and by extension commercial opportunity) will actually become part of the control plane. Using the v2 Envoy APIs, control plane capabilities can range from a flat endpoint namespace utilizing N² health checking, all the way to an extremely rich global load balancing system which does automatic subsetting, load shedding and balancing, distributed partial health checking, zone aware routing, automatic percentage based deploys and rollbacks, etc. Vendors will compete over providing the most seamless microservice operational environment, and automated control over routing is a major part of that.

In this new world, a common API that data planes can use to talk to control planes is a win for everyone involved. Control plane providers can offer their services to any data plane that implements the API. Data planes can compete on features, performance, scale, and robustness. Furthermore, decoupling allows control plane providers to offer SaaS solutions that do not require also owning data plane deployment, which is a major pain point.

An invitation to collaborate on the Envoy API

Although it’s hard to know what will happen over the next several years, we are extremely excited by the uptake of both Envoy and its associated APIs. We see value in a common set of universal data plane APIs that can bridge disparate systems. Along these lines, we invite the larger community of data plane and control plane vendors and users to collaborate with us in the envoy-api repository (note that when Envoy moves into the CNCF and transitions to a dedicated envoyproxy GitHub organization we will rename this repository data-plane-api). We can’t promise that we will add every conceivable feature, but we would like to see other systems use these APIs and help us evolve them to meet their needs. It is our view that the commoditization of the data plane will provide tremendous benefit to end users by increasing the speed of iteration and competition in the control plane space, which is where the majority of innovation will occur over the coming years.
