SRE: Observability: No Friction Application Observability Using Envoy
Envoy provides a no-friction approach to observing application-level network traffic. It gives visibility into all external calls a service makes: their rates, their latencies, and their HTTP status codes. Envoy is significantly easier to deploy than language-level libraries that expose external-call metrics, and its static configuration is a great entry point into observability without requiring a service mesh or complicated infrastructure. This post covers how Envoy can be used to gain visibility into any outgoing HTTP-based connection.
Problem
In many of the projects I’ve worked on, if observability and resiliency primitives aren’t already available in frameworks or libraries, they get deferred until the very end of the project, putting them at risk of being dropped completely. To illustrate, consider a service that makes requests to the Facebook API:
Pretend that there have been a number of issues around using Facebook: requests are randomly rate limited, error rates are high at certain times of day, and sometimes requests time out. In order to better understand these issues, the Service should capture the following request metrics:
- Rate of requests
- Status Codes
- Latencies
- Timeouts
Common solutions to this involve adding log statements around external calls, instrumenting requests inside of the service, or instrumenting a shared Facebook client library for the language the service is written in. All of these approaches require modifying application-level code in order to add the metrics/log statements. If a client library is instrumented, it naturally creates a tight coupling between itself and the code:
This coupling exists at both the language level (i.e. a Node Facebook client cannot be used by a Python service) and the component level (i.e. the library needs to be careful not to introduce breaking changes, and clients now need to track library releases, bug fixes, etc.). Finally, a code-level coupling requires a redeploy of the service whenever updates become available.
Envoy
Envoy mitigates these issues by running external to the Service as a separate process (aka a sidecar) and having the Service send some or all of its requests through Envoy:
Envoy provides observability out of the box for all traffic flowing through it. The amazing part is that only selective traffic needs to flow through Envoy! This means the Service can start by sending only its Facebook traffic through Envoy without changing the destination of any other requests it makes. This greatly reduces the risk/friction involved with adopting Envoy by allowing a service to migrate traffic gradually. And since Envoy is a separate process, no code-level changes need to be made to the Service to gain HTTP observability!
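In practice, the change can be as small as repointing the service’s Facebook base URL at the local Envoy listener. A minimal sketch, assuming a hypothetical FACEBOOK_API_BASE setting that the service reads at startup:

$ # Before: the service calls Facebook directly
$ export FACEBOOK_API_BASE=https://graph.facebook.com

$ # After: only Facebook traffic is redirected through the local Envoy
$ # listener; every other outgoing request keeps its original destination
$ export FACEBOOK_API_BASE=http://localhost:10000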
Envoy provides a number of benefits:
- Language agnostic
- Requires minimal application changes (config only)
- Ships with many different metric sinks (Prometheus, statsd, Datadog); see the sketch after this list
- Battle tested: Envoy handles enormous volumes of traffic at Lyft, Amazon, and Google, and powers Istio
- Gigantic community (Lyft, Amazon, Google, Microsoft; check out the “Used By” section of Envoy’s GitHub repo)
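As a taste of the metric-sink support, here is what pointing Envoy at a statsd daemon could look like, using the same v2 API style as the configs later in this post. This is a minimal sketch; the 127.0.0.1:8125 address assumes a locally running statsd:

stats_sinks:
  - name: envoy.statsd
    typed_config:
      "@type": type.googleapis.com/envoy.config.metrics.v2.StatsdSink
      address:
        socket_address:
          address: 127.0.0.1
          port_value: 8125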
Consider the alternative of adding metrics to HTTP calls at the code level. Each call site needs to be instrumented in order to capture request information and surface it; hopefully this can at least be done in a library and distributed to all applications. With Envoy, the only change required is to direct outgoing Facebook traffic to Envoy, a much shallower configuration change instead of “deeper” code-level changes:
Modifying the software involves a high level of risk and requires retesting and reverifying the calls that are instrumented at the code level. Even if the metrics are encapsulated in a client library, the client calls still need to be modified and carry this risk. Contrast this with introducing Envoy: it only requires configuration and verifying the integration between the Service and Envoy, and between Envoy and its target (Facebook). If an issue arises with the integration at any time, the Service can trivially be reverted to route traffic directly to Facebook. Envoy shifts the dependency graph from the code level to the infrastructure level:
The benefit of Envoy’s sidecar approach should be obvious. The alternative consists of creating a language-specific library and integrating it into each service. Every time the library requires an update, all services need to pull in the change and be redeployed, and work on a library for one language doesn’t extend to other languages. Since Envoy is a separate process, it can be used with any language, and when newer versions of Envoy become available it can be upgraded without the service even being aware of the change.
The rest of the post shows a hands-on example of how Envoy can be used to add observability to a service.
Example
This example will set up a Service which makes outgoing requests to “Facebook”, and then show how to use Envoy to route traffic to “Facebook”. All of the configuration is available on GitHub.
In order to illustrate this we’ll use a local HTTP server to emulate “Facebook”:
$ busybox httpd -p 127.0.0.1:8080 -h /home/vagrant/
The client will be represented by vegeta, a CLI for load testing that supports making HTTP requests at a configurable rate (100 requests/second in our case):
$ echo "GET http://localhost:8080/test.txt" | vegeta attack -rate=100 -duration=0 > /dev/null
Running this provides no visibility into the number of requests that are succeeding or failing; the service is flying completely blind. Is the client actually able to make 100 requests per second? At what latencies? Are all of the requests succeeding?
In order to answer these questions, we’ll put Envoy between the Service and “Facebook”. To do this, Envoy needs to be configured to match requests and forward them to “Facebook” correctly. We’ll create an Envoy listener, which Envoy uses to bind to a port, and expose rules to match requests using a virtual host:
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 10000
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager
          use_remote_address: true
          stat_prefix: ingress_http
          codec_type: auto
          route_config:
            name: facebook_api
            virtual_hosts:
            - name: facebook_api
              domains:
              - "*"
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: facebook
          http_filters:
          - name: envoy.router
This shows that the Envoy listener will bind to 0.0.0.0:10000 and that all HTTP requests on that port will be routed to the facebook cluster. The next step is to define the facebook cluster:
  clusters:
  - name: facebook
    connect_timeout: 1s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: facebook
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 0.0.0.0
                port_value: 8080
The specifics of this configuration are described in Envoy’s awesome documentation. The critical parts for this tutorial are that we’re using Envoy to impose a connection timeout of 1 second (connect_timeout: 1s) and that Envoy should send all traffic for the facebook cluster to 0.0.0.0:8080 (remember our fake Facebook HTTP service from above).
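Before putting Envoy in front, it’s worth a quick sanity check that the fake upstream responds directly (assuming test.txt exists in the directory busybox httpd is serving):

$ curl localhost:8080/test.txt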
The final step is to configure envoy to expose stats, which is done through its “admin” interface:
admin:
  access_log_path: "/dev/null"
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901
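With the listener, cluster, and admin sections in place, Envoy can be started against the combined file (assuming it has been saved as envoy.yaml):

$ envoy -c envoy.yaml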
After starting Envoy, any HTTP request to Envoy’s listener (`0.0.0.0:10000`) should be routed to the facebook cluster (`0.0.0.0:8080`):
$ curl localhost:10000/test.txt -v
* Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 10000 (#0)
> GET /test.txt HTTP/1.1
> Host: localhost:10000
> User-Agent: curl/7.58.0
> Accept: */*
>
< HTTP/1.1 200 OK
< content-type: text/plain
< date: Sun, 14 Apr 2019 20:00:47 GMT
< accept-ranges: bytes
< last-modified: Sun, 14 Apr 2019 13:11:43 GMT
< content-length: 0
< x-envoy-upstream-service-time: 0
< server: envoy
<
* Connection #0 to host localhost left intact
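The admin interface configured above also exposes raw counters directly; even before wiring up Prometheus, the stats for the facebook cluster can be spot-checked by filtering Envoy’s /stats endpoint:

$ curl -s localhost:9901/stats | grep cluster.facebook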
Next, in order to visualize these metrics, local Prometheus and Grafana instances will be used. Prometheus needs to be configured to scrape metrics from the Envoy instance:
  - job_name: "envoy"
    scrape_interval: "15s"
    metrics_path: /stats/prometheus
    static_configs:
    - targets: ['host.docker.internal:9901']
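The host.docker.internal target implies Prometheus running in Docker on the same machine; one way to launch it (a sketch, assuming the scrape config above is saved as prometheus.yml in the current directory):

$ docker run -p 9090:9090 \
    -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
    prom/prometheus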
The Envoy admin port (9901) is being used to gather the metrics. Finally, the Service is updated to make requests to “Facebook” through Envoy:
$ echo "GET http://localhost:10000/test.txt" | vegeta attack -rate=100 -duration=0
Envoy exposes many metrics for free. The request rate shows whether the client is actually achieving its 100 requests/second. HTTP status codes enable calculating the availability of Facebook from the Service’s perspective. And last but not least (Envoy exposes a TON of metrics), timing information for HTTP transactions is surfaced as well.
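In Prometheus, these three views correspond to queries along the lines of the following (a sketch using Envoy’s Prometheus metric naming; the envoy_cluster_name="facebook" label matches the cluster defined above):

# Request rate (requests/second)
rate(envoy_cluster_upstream_rq_total{envoy_cluster_name="facebook"}[1m])

# Availability: share of 2xx responses
sum(rate(envoy_cluster_upstream_rq_xx{envoy_cluster_name="facebook", envoy_response_code_class="2"}[1m]))
  / sum(rate(envoy_cluster_upstream_rq_total{envoy_cluster_name="facebook"}[1m]))

# p99 request latency (milliseconds)
histogram_quantile(0.99, rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="facebook"}[1m]))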
In addition to the above metrics, Envoy surfaces detailed information about the upstream cluster’s (facebook’s) health and detailed TCP connection stats.
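A quick way to see this cluster-level detail is Envoy’s admin /clusters endpoint, which reports per-endpoint health along with connection and request counters:

$ curl -s localhost:9901/clusters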
Conclusion
Making external requests through Envoy provides a huge amount of visibility that would traditionally have required many risky application-level changes. Additionally, once traffic is being routed through Envoy, it opens up many possibilities including timeouts, retries, circuit breakers, and rate limiting. What’s more, Envoy doesn’t require a full service mesh or containers to start using, making it a compelling choice whenever application-level network visibility is required.
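As a taste of those resiliency features, the route from the listener configuration above could be extended with a request timeout and a retry policy. A minimal sketch using the same v2 route config; the specific values here are illustrative, not recommendations:

route:
  cluster: facebook
  timeout: 2s
  retry_policy:
    retry_on: "5xx"
    num_retries: 3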
Links
- https://www.envoyproxy.io/
- https://medium.com/dm03514-tech-blog/sre-resiliency-bolt-on-sidecar-rate-limiting-with-envoy-sidecar-5381bd4a1137
- https://docs.microsoft.com/en-us/azure/architecture/patterns/sidecar
- https://github.com/dm03514/sre-tutorials/commit/02c7649f95091f1b6ea3225e13f5d91f49d4c7f7
Hands-on migration using a similar approach:
The great thing was, though, that the switch to Envoy was as simple as changing a single line of config for every client service and restarting it — not dissimilar to a regular deployment.