
We rolled out Envoy at Geckoboard

It made us stop worrying about our gRPC traffic.

Poolski
16 min read · Apr 15, 2019


In this post we’re talking about why we made the decision to start using Envoy and how we went about rolling it out in production. The whole process took us about a month — here’s how it went.

The services underpinning the Geckoboard platform are constantly evolving. Over the past 7 years, we’ve been moving from monolithic Ruby apps towards single-purpose microservices written in Go. This has allowed engineers to work on specific functionality with confidence that their changes will be isolated to a single service.

Historically, all of our services have communicated using RESTful APIs over HTTP or by using queues that producers/consumers would write to and read from.

Over time we also found that the type of work our newer services perform can’t as easily be modelled using HTTP RESTful APIs, so more and more of our services rely instead on gRPC to communicate.

Load balancing HTTP traffic is fairly straightforward and we’ve been able to build services which communicate using HTTP APIs while handling the load balancing using Nginx or HAProxy as well as Consul’s round-robin DNS.

gRPC makes it much easier to automatically generate client libraries from protobuf definitions for interacting with the data your services expose. By contrast, creating a client for a ‘traditional’ API would require a developer to implement every function of the API by hand in the client. Ensuring that changes to a service’s protocol are backwards-compatible is also much easier with gRPC than with an HTTP API.

While it’s possible to load balance gRPC out of the box, that logic needs to be built into the gRPC clients and none of our apps had it. Our gRPC traffic wasn’t being load balanced at all.

To begin with, this was fine because the volume of gRPC requests was low, but it was only going to grow and eventually become a problem, so we decided to make the changes long before circumstances forced us to.

Making sure that load is distributed evenly across multiple instances of the same service requires either writing your own logic to handle it in your applications, or using a load balancer (also sometimes referred to as a proxy). A load balancer takes care of splitting traffic evenly across your backends, allowing you to concentrate on more important work.

Rather than badly paraphrasing the experts on load balancing and how gRPC traffic over HTTP/2 is different to RESTful API requests over HTTP/1.x, here are some really useful links if you want to dig deeper.

Load balancing options

Our existing tool stack wasn’t well-suited to handling multiplexed gRPC traffic over HTTP/2. We had been using Nginx and HAProxy for a few years, but back in November/December 2018 they didn’t seem like a good fit for this project since their gRPC proxying capability wasn’t well-tested yet. We needed a better solution.

Several tools for load balancing gRPC requests over HTTP/2 are predicated on using a containerised architecture. Since Geckoboard runs on virtual machine instances rather than containers, those options weren’t feasible for us.

One option that we considered was LinkerD. LinkerD v2 required Kubernetes, which we don’t use. LinkerD v1 was an option, but in the end we discarded it because it required installing a JVM and we don’t have the in-house Java expertise to be confident in managing a mission-critical Java service.

We took another look at Nginx, which had announced expanded support for gRPC proxying, but based on our research we felt that its support wasn’t mature or widely used enough for our needs.

The last candidate was Envoy, a proxy developed by Lyft during their transition away from monolithic services, where many functions were colocated within the same app, to a more modern microservice-based architecture. After a lot of research, we found that Envoy was almost perfectly suited to our needs.

Why Envoy?

The biggest appeal of Envoy for Geckoboard was that it was purpose-built by Lyft for proxying both gRPC and HTTP traffic, with a strong focus on gRPC. That, combined with the fact that it ships as a pre-built binary, makes it a lot easier to deploy without needing to install platform-specific dependencies.

By design, Envoy enables application developers to create a service mesh where the applications they write no longer need to be aware of the locations of other services that they might need to talk to.

As part of our aim to help developers spend more time on meaningful work, being able to free them from needing to write logic to handle load-balancing, discovery and network connectivity was also great.

Another reason why we chose Envoy was that, by the time we started researching it, it had already been thoroughly tested in production at Lyft.

Configuration Options

One aspect of Envoy that we found immensely appealing is just how flexible its configuration is, although that flexibility can be a mixed blessing: there is an awful lot you can configure.

Unlike proxies such as Nginx, which have their own configuration languages, Envoy is configured using a single YAML (or JSON) configuration file, which meant that there was no need to spend extra time working out the idiosyncrasies of yet another language.

The downside, though, is that Envoy doesn’t support using multiple configuration files and including them through an include conf.d/* style pattern, as used in many other services. This can make the config file unwieldy — even more so if using JSON.

When every resource in Envoy is configured using a single static YAML file, it can be a challenge to make changes to running config. To make Envoy pick up the changes, it must be reloaded or restarted, meaning in-flight requests may fail or be lost entirely.

To mitigate this, Envoy is designed to be able to dynamically configure itself based on data it receives from a set of discovery services, of which there are five:

  • Cluster Discovery Service. A cluster represents a single service configured within Envoy which can respond to client requests.
  • Endpoint Discovery Service. An endpoint is a single instance of a given service, running on a specific host and port.
  • Route Discovery Service. Routes are what define which cluster a given request is routed to, or whether the request is dropped entirely.
  • Listener Discovery Service. An Envoy listener listens on a given address and port for incoming requests.
  • Secrets Discovery Service. This can be used to centralise distribution and rotation of secrets like TLS certificates.

These can be used independently or together to dynamically provision Envoy. A running instance of Envoy will periodically poll its configured xDS endpoints to get up-to-date configs.

The truly exciting thing about this for us is that it makes it possible to be certain that every Envoy instance in your infrastructure is running with the correct configuration, without relying on traditional file-based configuration management.

Not that file-based configuration is necessarily bad, but it can be inconsistent both between services and sometimes even within the same service. As an example, sometimes an HAProxy configuration change requires a reload of the service. But sometimes a change requires a restart — as is the case with TLS certificate updates. Abstracting this configuration away to a dedicated service and allowing Envoy to dynamically update running configuration is much more flexible and consistent.

Envoy configuration can fall on a spectrum — all the way from fully-static config files to a fully dynamic xDS-based setup with only minimal configuration to tell Envoy where to find the xDS. Which approach you take ultimately depends on the use case and time/resources available.
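
At the dynamic end of that spectrum, the bootstrap file can contain little more than a pointer to the management server that will supply listeners and clusters. The sketch below is illustrative only (the xds_cluster name and management server address are made up), but it shows the general shape:

dynamic_resources:
  lds_config:
    api_config_source:
      api_type: GRPC
      grpc_services:
      - envoy_grpc:
          cluster_name: xds_cluster
  cds_config:
    api_config_source:
      api_type: GRPC
      grpc_services:
      - envoy_grpc:
          cluster_name: xds_cluster
static_resources:
  clusters:
  # The one cluster that still has to be defined statically: where to find the xDS server.
  # The hostname and port are placeholders.
  - name: xds_cluster
    connect_timeout: 1s
    type: strict_dns
    http2_protocol_options: {}
    hosts:
    - socket_address:
        address: xds.internal.example.com
        port_value: 18000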

Having looked at the plethora of options available for configuring Envoy, we went with the simplest: a flat file. The main reason for this was that we wanted to slice the work in a sensible way, and we felt that getting bogged down in rolling our own xDS would be a poor use of our time to start with.

The documentation for Envoy is extensive but structured in a way that isn’t immediately intuitive to anyone not used to reading proto3 protocol buffer definitions. Envoy accepts configuration in the form of:

  • YAML (make sure you use .yaml rather than .yml — we learned the hard way that Envoy won’t start with .yml)
  • JSON
  • Proto3 Protocol Buffer definition files.

It can also be unusually specific about the formatting/data types of certain values. For example, timeouts can be specified in one of two ways:

timeout:
  seconds: 15

or

timeout: 15s

However, timeout: 15 will throw an error, even though the documentation doesn’t make it clear that the trailing s is important.

Due to the way Envoy is built, if you want to be able to hot-reload it without dropping connections, it needs to be run via a Python script.

Lyft wrote this script to hand Envoy’s open sockets off to the new instance, which complicated matters a bit. There’s a detailed write-up of exactly how the script does this, but for us the challenge was turning it into a systemd service, which luckily Envoy already had a solution for!

Stats and Metrics

Envoy was designed from the very beginning to allow engineers insight into the traffic flowing through it without the need for any third-party instrumentation.

Since we work with metrics and data, this was an incredibly refreshing design decision to see. With the other two most popular proxies, Nginx and HAProxy, stats are much trickier to get at and parse/digest.

We use the built-in statsd sink to send metrics generated by Envoy directly to statsite, ready to be shipped elsewhere. Envoy also supports a number of other metrics backends.

One thing to be aware of, though, is that Envoy generates a lot of metrics, constantly. If you’re using a SaaS metrics provider like DataDog or Librato, it pays to audit your data and tune your Envoy instances to only emit the stats you care about. If you don’t, you’ll be billed for storing null or zero values you have no interest in. Envoy allows you to exclude certain metrics namespaces from being shipped.

stats_sinks:
- name: envoy.statsd
  config:
    address:
      socket_address:
        address: 127.0.0.1
        port_value: 8125
stats_config:
  stats_matcher:
    exclusion_list:
      patterns:
      - regex: cluster.([^.]*).update_attempt$
      - regex: cluster.([^.]*).update_no_rebuild$
      - regex: cluster.([^.]*).max_host_weight$
      - regex: cluster.([^.]*).health_check.attempt
      - regex: cluster.([^.]*).membership_total

We also made a change to our statsite rewrite sink to allow us to drop the metrics we didn’t want to ship and to tag metrics we were shipping so we could better analyse them.

The Implementation

The Container Problem

Envoy is designed to be run as a sidecar process (one instance of Envoy alongside every app), but that approach is somewhat predicated on running in a containerised environment. In our case, every VM runs an instance of Envoy which is aware of every gRPC service available.

This isn’t as elegant as it could be but it also means that any app running on any server can communicate with any gRPC service by connecting to its local Envoy instance, rather than trying to discover the gRPC service itself.

The first and most frustrating challenge was how to deploy the Envoy binary to our infrastructure, given that pre-built binaries weren’t available to download outside of the official Docker images.

Ultimately, we ended up going down the path of least resistance and built a CI pipeline to extract the binary from the official Docker image and re-package it for our use. Although the official guidelines suggest building your own, we felt that at least for the first iteration, this approach was sound.

Service Configuration

Having configured the listeners, routes and associated virtual hosts, the next things to configure were the clusters that Envoy would route traffic to.

static_resources:
  listeners:
  - name: internal-grpc-traffic
    address:
      socket_address:
        address: 127.0.0.1
        port_value: 8888

Once again, xDS can automate this to a certain extent, but after some thought we realised we already had this information available to us. Consul has been part of our stack for a long time and we’ve been using it for this exact purpose the whole time.

This meant that we could make use of the data that Consul had on our services to populate the clusters with their respective endpoints. We did this by making use of the strict_dns cluster type in Envoy. strict_dns clusters periodically and asynchronously look up the configured hostnames (once every 10 seconds or so) and update the list of endpoints in each cluster based on the DNS response.

clusters:
- name: my-awesome-service-grpc
  connect_timeout: 0.5s
  http2_protocol_options: {}
  type: strict_dns
  lb_policy: round_robin
  health_checks:
  - grpc_health_check:
      authority: my-awesome-service.prod.example.com
    timeout: 1s
    interval: 2s
    interval_jitter: 1s
    healthy_threshold: 3
    unhealthy_threshold: 3
    event_log_path: "/var/log/envoy/healthcheck.log"
  tls_context:
    sni: my-awesome-service.prod.example.com
  hosts:
  - socket_address:
      address: my-awesome-service.service.consul
      port_value: 5100

The DNS discovery method isn’t ideal for all scenarios. One of the biggest limitations is that DNS responses over UDP are limited to 512 bytes.

This means that for scenarios where there are many instances of a service registered with DNS, not all of them are guaranteed to be returned by a query. This can lead to endpoints never being added to a cluster because they weren’t included in the DNS response.

In our case, though, this wasn’t an issue as the number of records returned by Consul DNS lookups was well within the 512-byte limit because we have relatively few instances of a given service.

Request Routing

As we’ve discussed before, Envoy uses internal route configuration to tell it where a given request should go (or whether to drop the request for various reasons). This can be implemented similarly to Nginx or Apache vhosts which usually use hostname-based matching to work out what to do with the traffic.

It’s also possible to use path-based routing in combination with host-based routing in Envoy but that was unnecessary levels of granularity for us.

Every gRPC request has an :authority header (the equivalent of the HTTP/1.1 Host header), so we use that header to route to the correct cluster by matching on the domain of each virtual_host in Envoy.

listeners:
- name: internal-grpc-traffic
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 8888
  filter_chains:
  - filters:
    - name: envoy.http_connection_manager
      config:
        access_log:
        - name: envoy.file_access_log
          config:
            path: "/var/log/envoy/access.log"
        stat_prefix: ingress_http
        codec_type: AUTO
        route_config:
          name: local_route
          virtual_hosts:
          - name: my-awesome-service
            domains:
            - my-awesome-service.envoy.staging.example.com
            routes:
            - match:
                prefix: "/"
                grpc: {}
              route:
                cluster: my-awesome-service-grpc
                timeout:
                  seconds: 15

Retries, Timeouts And Circuit Breaking

Another benefit of Envoy is the ability to handle upstream errors and timeouts in a much more pragmatic way. In the event that an upstream service starts to fail, Envoy will detect this and begin to apply back pressure to clients trying to send requests.

If a cluster reaches its configured maximum request count, Envoy will prevent new connections from being established to the cluster, rather than allowing the requests through and overloading the cluster.

In addition, Envoy will automatically retry a request on a client’s behalf a set number of times, before returning an error, which helps mitigate short-lived transient upstream failures. This can be coupled with configurable timeouts to ensure requests complete within a specific time period, rather than hanging indefinitely.

What’s fantastic about these features is that they can be configured on a per-route level, meaning that you can very precisely tune these values and use them to shape how requests flow between your apps. It also frees up engineers from having to build in the circuit breaking and retry handling logic into their apps, relying instead on a single consistent set of rules for these values.
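
As a rough sketch of what that can look like in config (the retry conditions and thresholds below are illustrative, not the values we actually use), a route can carry its own timeout and retry policy, while the cluster it points at defines circuit-breaking thresholds:

routes:
- match:
    prefix: "/"
    grpc: {}
  route:
    cluster: my-awesome-service-grpc
    timeout:
      seconds: 15
    retry_policy:
      # gRPC statuses that are worth retrying automatically
      retry_on: "unavailable,deadline-exceeded"
      num_retries: 3
      per_try_timeout:
        seconds: 5

clusters:
- name: my-awesome-service-grpc
  # ... rest of the cluster config as shown earlier ...
  circuit_breakers:
    thresholds:
    - priority: DEFAULT
      max_connections: 1024
      max_pending_requests: 256
      max_requests: 256
      max_retries: 3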

Encryption

gRPC strongly encourages you to use encrypted connections between services. In fact, if you wanted to use gRPC over plaintext, you would have to explicitly state you were creating an insecure connection.

conn, _ := grpc.Dial("localhost:50051", grpc.WithInsecure())

Or in Ruby:

stub = Helloworld::Greeter::Stub.new('localhost:50051', :this_channel_is_insecure)

To ensure that TLS was enforced at every step along the way, we generated a new set of wildcard internal certificates specifically for use with Envoy. That way we could run Envoy on a separate internal subdomain like ${service}.envoy.${environment}.example.com to make switching clients to using Envoy easier. Having a separate subdomain for Envoy also made debugging easier as it was clear how clients were configured to send their traffic — directly to gRPC services or via Envoy.

Using wildcards also meant that engineers could create new gRPC services without worrying about also needing to generate yet more TLS certificates for the new service.

listeners:
- name: internal-grpc-traffic
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 8888
  filter_chains:
  - ...
    tls_context:
      common_tls_context:
        alpn_protocols: h2
        tls_certificates:
        - certificate_chain:
            filename: "/etc/envoy/ssl/wildcard.envoy.crt"
          private_key:
            filename: "/etc/envoy/ssl/wildcard.envoy.key"

This snippet configures the listener to use the wildcard.envoy.crt certificate and key pair for all requests.

On The Way To Production

To begin with, we aimed to get Envoy up and running locally on our laptops so that we could experiment with it.

Since Envoy is designed with containerised environments in mind, we also had to work out how to run it as a fully-fledged service under a process manager. As systemd already manages a bunch of services inside our infrastructure, it made sense to create another systemd unit for Envoy.

Once that was up and running and we had configured a gRPC cluster (its endpoint’s address and port) plus the vhosts and routing rules, we saw that the gRPC service we were proxying for had started receiving health check requests from Envoy.

This was amazing to see! After that, a config change in one of our gRPC client apps to point it to Envoy was all that was necessary for that app to start sending its traffic through Envoy.

Going To Production

One of the greatest things about rolling out Envoy was that since it wasn’t used by anything yet, we could deploy it everywhere without worrying that it would interfere with anything. We could then point individual clients at it and test how it worked with staging and production traffic in a controlled way.

We automated the entire provisioning and configuration process by leveraging our existing config management pipeline built on top of Chef. This meant that Envoy could be deployed exactly the same way to all of our nodes, with minimal human involvement.

One difficulty lay in generating the YAML configuration file itself — because everything had to be contained in a single file, it ended up being almost 300 lines long.

Chef made this work easier, as we were able to define all the resources we needed as Ruby hashes and do a certain amount of templating. This was all compiled into a single YAML file using Ruby’s YAML library.

Here’s a template that contains a fairly complete configuration for an Envoy instance to proxy traffic to a single gRPC service.
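
(The version below is a condensed sketch assembled from the snippets above rather than our actual template; hostnames, ports and certificate paths are illustrative.)

# A single-service Envoy config, condensed from the snippets in this post.
static_resources:
  listeners:
  - name: internal-grpc-traffic
    address:
      socket_address:
        address: 127.0.0.1
        port_value: 8888
    filter_chains:
    - tls_context:
        common_tls_context:
          alpn_protocols: h2
          tls_certificates:
          - certificate_chain:
              filename: "/etc/envoy/ssl/wildcard.envoy.crt"
            private_key:
              filename: "/etc/envoy/ssl/wildcard.envoy.key"
      filters:
      - name: envoy.http_connection_manager
        config:
          stat_prefix: ingress_http
          codec_type: AUTO
          access_log:
          - name: envoy.file_access_log
            config:
              path: "/var/log/envoy/access.log"
          route_config:
            name: local_route
            virtual_hosts:
            - name: my-awesome-service
              domains:
              - my-awesome-service.envoy.staging.example.com
              routes:
              - match:
                  prefix: "/"
                  grpc: {}
                route:
                  cluster: my-awesome-service-grpc
                  timeout:
                    seconds: 15
  clusters:
  - name: my-awesome-service-grpc
    connect_timeout: 0.5s
    type: strict_dns
    lb_policy: round_robin
    http2_protocol_options: {}
    tls_context:
      sni: my-awesome-service.prod.example.com
    hosts:
    - socket_address:
        address: my-awesome-service.service.consul
        port_value: 5100
stats_sinks:
- name: envoy.statsd
  config:
    address:
      socket_address:
        address: 127.0.0.1
        port_value: 8125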

Making the switch

When changing production apps over to Envoy, it was imperative to us that we didn’t affect customers’ experience with timeouts and errors while the switch was made. What made the whole experience a lot smoother was the fact that we were able to set up Envoy ahead of time without having to connect any clients to it at the same time.

An excellent side effect of this is that we were also able to test Envoy’s latency and throughput in our Staging environment when proxying to different gRPC services — something we hadn’t really ever quantified.

To be on the safe side, though, we opted to switch one gRPC client app at a time and monitor its metrics for a few hours to confirm that everything was working as intended before moving on to the next one. This was also born from the desire to change as few things as possible at every step so that we could more easily diagnose faults.

The great thing was, though, that the switch to Envoy was as simple as changing a single line of config for every client service and restarting it — not dissimilar to a regular deployment. To make sure that we didn’t run into issues with dropped requests, especially high-traffic services were reconfigured manually and their restarts were staggered so that as few requests as possible had to be retried due to the services restarting.

Wrapping Up

Having rolled Envoy out everywhere, we are now much more confident in creating new gRPC services and integrating them into our stack. Our engineers no longer have to worry about writing load balancing logic into gRPC client apps or that a single instance of a gRPC service is going to get overwhelmed by requests.

The process, from start to finish, took about a month. It was exciting and challenging, but also frustrating at times. There were moments of anger at unexpected behaviours, sometimes caused by a mistyped variable (see the timeout example above). Still, we’re really excited about having added Envoy to our stack as it opens up a whole lot of opportunities for even more improvement further down the line.

Some final observations

  • It feels like Envoy really wants you to use xDS over statically-defined clusters, routes, endpoints, etc. Given that it’s designed to be run in immutable sidecar containers, that makes sense and it’s certainly something that we might look to implement.
  • It took a bit of tweaking to get the data we wanted out of Envoy in its logs and to have it log in the format we wanted (JSON) rather than plaintext; a sketch of that kind of access log configuration follows this list. JSON logs make it easier to quickly check that all is well. One thing to be aware of, though, is that the logs do not appear in real-time — events are written to file in batches.
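
For reference, switching the file access log to JSON is done through the json_format field; the set of fields below is an illustrative guess rather than our production format:

access_log:
- name: envoy.file_access_log
  config:
    path: "/var/log/envoy/access.log"
    # Each key becomes a field in the emitted JSON object.
    json_format:
      start_time: "%START_TIME%"
      authority: "%REQ(:AUTHORITY)%"
      method: "%REQ(:METHOD)%"
      path: "%REQ(:PATH)%"
      response_code: "%RESPONSE_CODE%"
      grpc_status: "%RESP(GRPC-STATUS)%"
      duration_ms: "%DURATION%"
      upstream_host: "%UPSTREAM_HOST%"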

The Future

One of the logical next steps is to move towards a more dynamic xDS-based infrastructure where routes and clusters can be managed by a separate service, rather than relying on static configuration files.

This would further empower engineers to make changes to infrastructure, such as adding new services, and add the capacity for blue/green or canary deployments using more fine-grained routing in future.

To help our frontend engineers work faster and better, we already use a lightweight version of our Vagrant development environment which runs only the bare minimum of services required for frontend development and hooks into a remote environment for the rest. It would be awesome if we could extend the same flexibility and portability to our backend folks.

Lyft already use Envoy internally to allow engineers to access remote services from their local development environments.

This allows them to only run the components that they need to work on locally. The rest is run in a cloud environment and Envoy is used to transparently and securely proxy that traffic between the engineers’ laptops and the Cloud.

This could free our engineers from the need to run the entirety of the Geckoboard stack on their laptops, allowing them to work against a consistent set of services, rather than falling prey to “It works on my machine”.

Finally, Envoy is an excellent piece of groundwork on the way to possibly moving to a more container-based service architecture in future.

If you liked this, you’ll likely love a bunch of the other posts at Geckoboard Under The Hood. Go and check them out!
