Implementing Observability in a Service Mesh

Jude Dsouza
The PayPal Technology Blog
8 min read · Sep 28, 2020


A Command-and-Control Center image taken from www.appliedglobal.com

I work as an SRE at iZettle, and at times during on-call it can be challenging to understand the contributing factors of incidents when they occur, or the nature of failures as they happen. For anyone running a microservices architecture, it's not an easy task to set up a decent level of observability into the system so that you can quickly understand what's going on.

In this article I'd like to share my experience of how one can improve the observability of services through the Consul Connect service mesh, and later I'll explain my justification for choosing it over the alternatives. As a quick recap, a service mesh routes HTTP traffic through proxies that are deployed as sidecar containers next to each service. These proxies can then expose metrics, logs and traces that give you rich visibility into the system, making it easier to debug applications quickly. Consul Connect additionally provides service-to-service authorization and encryption using mutual TLS, and uses Envoy as its data plane. In terms of observability, I will cover basic Layer 7 metrics, access logs and distributed tracing using Prometheus, Grafana, AWS CloudWatch Logs and AWS X-Ray.

All the code for this demo can be found in my GitHub repo.

Prerequisites

This article will focus mainly on the implementation details, so the reader is expected to have a basic understanding of Consul's service mesh feature. If you aren't already familiar with it, you can learn more by following this guide.

The Service Mesh Environment

This demo is designed to run on AWS and uses AWS ECS as the container orchestrator of choice. The following diagram depicts the high-level architecture of the service mesh implementation.

High level Architecture of the Service Mesh Implementation using Consul Connect running on ECS Fargate

It comprises the following:

  • A Consul cluster.
  • A dashboard and counter service running on AWS ECS Fargate. The dashboard service has an upstream connection with the counter service and communicates securely with it to increment a counter.
  • Each service's Task Definition has sidecar container definitions for the Envoy proxy and a Consul agent that communicates with the Consul server.
  • A Consul managed Ingress Gateway for external services to communicate with services inside the mesh.

Since Consul is the control plane, it is responsible for configuring each of the Envoy proxies. It does this via the same service definition entries that are used to register services into Consul. Following is an example config:
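A minimal sketch, modeled on the demo's counter service (service names and ports are illustrative):

service {
  name = "counter"
  port = 9001

  connect {
    # Register a sidecar proxy for this service with default values.
    sidecar_service {}
  }
}

The dashboard service's definition looks the same, except that its sidecar_service block also lists counter under proxy.upstreams so that the dashboard can reach it on a local port.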

Here, the connect.sidecar_service field tells Consul to register a proxy service with default values. For more details on how the sidecar service can be registered, read the documentation here.

You can also define global default settings for the proxy via proxy-defaults configuration entries. And if that's not enough, Consul gives you the flexibility to define more advanced, custom configurations in the form of Escape Hatches. For a full list of options, check out the Envoy integration page.

Throughout the remainder of this article, I will discuss a couple of these configuration options, in particular the ones that help us set up observability so that we can get HTTP Layer 7 metrics, logs and tracing.

Implementing Observability

Envoy is just awesome! It has a comprehensive set of configuration options and tons of metrics to share. It is also the component that gives you rich access logs and HTTP request tracing. As discussed above, Consul is the control plane through which you configure Envoy with these options. In this section I will zoom in on the implementation details of how this can be done.

Implementing Observability in a Consul Connect Service Mesh using Prometheus, AWS CloudWatch and AWS X-Ray

HTTP Layer 7 Metrics

In this setup I've used Prometheus to scrape metrics from all the sidecar proxies. By setting the envoy_prometheus_bind_addr field in the service definition's proxy.config block, Envoy can easily be configured to listen for scrape requests, in this case on port 9102.
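For example, extending the counter service's definition from above (a sketch; the bind address just has to match what Prometheus scrapes):

service {
  name = "counter"
  port = 9001

  connect {
    sidecar_service {
      proxy {
        config {
          # Expose Envoy's Prometheus metrics endpoint on port 9102.
          envoy_prometheus_bind_addr = "0.0.0.0:9102"
        }
      }
    }
  }
}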

Further, Prometheus can discover the proxies automatically using its Consul service discovery mechanism, as shown in the example config below:

Prometheus config that discovers Envoy Proxies via Consul SD
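A minimal sketch of such a scrape config (the Consul server address is an assumption for this setup; 9102 matches the envoy_prometheus_bind_addr above):

scrape_configs:
  - job_name: 'consul-connect-envoy'
    consul_sd_configs:
      # Assumption: the Consul HTTP API is reachable at this address.
      - server: 'consul.example.internal:8500'
    relabel_configs:
      # Keep only the sidecar proxies, which Consul registers as "<service>-sidecar-proxy".
      - source_labels: [__meta_consul_service]
        regex: '.*-sidecar-proxy'
        action: keep
      # Point the scrape target at the metrics port exposed by Envoy.
      - source_labels: [__meta_consul_address]
        regex: '(.*)'
        replacement: '${1}:9102'
        target_label: __address__
      - source_labels: [__meta_consul_service]
        target_label: service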

These metrics can then be visualized in Grafana.

Screenshot for Overview of Basic Service Health Layer 7 Metrics

Access Logs

Access Logs can also be configured in the Envoy proxy using the Escape-Hatch Override option. This option allows you to define a custom listener configuration that enables access logging by setting the envoy_public_listener_json field as part of the proxy.config definition. You can also define it as part of the global proxy-defaults configuration entry as shown below:

Consul Connect `proxy-defaults.hcl` config for HTTP Tracing and Access Logs in Envoy
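At its core, the listener override enables Envoy's file access logger. The access_log portion looks roughly like this (an abridged sketch; the full listener JSON in the repo also carries the routes and filters Consul normally generates), with json_format keys chosen to produce exactly the fields in the sample log that follows:

"access_log": [
  {
    "name": "envoy.access_loggers.file",
    "typedConfig": {
      "@type": "type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog",
      "path": "/dev/stdout",
      "log_format": {
        "json_format": {
          "start_time": "%START_TIME%",
          "method": "%REQ(:METHOD)%",
          "origin_path": "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%",
          "protocol": "%PROTOCOL%",
          "response_code": "%RESPONSE_CODE%",
          "response_flags": "%RESPONSE_FLAGS%",
          "bytes_recv": "%BYTES_RECEIVED%",
          "bytes_sent": "%BYTES_SENT%",
          "duration": "%DURATION%",
          "upstream_service_time": "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%",
          "x_forward_for": "%REQ(X-FORWARDED-FOR)%",
          "user_agent": "%REQ(USER-AGENT)%",
          "request_id": "%REQ(X-REQUEST-ID)%",
          "authority": "%REQ(:AUTHORITY)%",
          "upstream": "%UPSTREAM_HOST%",
          "downstream_remote_addr_without_port": "%DOWNSTREAM_REMOTE_ADDRESS_WITHOUT_PORT%"
        }
      }
    }
  }
]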

With this config in place (the tracing-related parts are covered in the next section), access logs are written directly to /dev/stdout, picked up by the CloudWatch Logs driver and pushed to AWS CloudWatch Logs. You should then see logs similar to the following, and of course you can add more fields depending on your requirements:

{
  "method": "GET",
  "request_id": "420e2950-50be-43c3-8d41-9b8e9ac9937b",
  "bytes_sent": "188",
  "origin_path": "/",
  "authority": "dashboard.ingress",
  "x_forward_for": "10.140.21.77",
  "protocol": "HTTP/1.1",
  "upstream_service_time": "10",
  "duration": "10",
  "downstream_remote_addr_without_port": "10.140.21.77",
  "user_agent": "curl/7.61.1",
  "upstream": "127.0.0.1:8080",
  "response_code": "200",
  "bytes_recv": "0",
  "response_flags": "-",
  "start_time": "2020-09-03T11:47:01.662Z"
}

The request_id field is important here: with it you can correlate traces of a specific request, for example in tools like AWS X-Ray, and even in application logs.

HTTP Request Tracing

An instance of an AWS X-Ray HTTP Trace between the Dashboard and Counter service communication

Finally, HTTP request tracing can be configured in Envoy by setting the envoy_tracing_json and envoy_extra_static_clusters_json fields of the proxy.config service definition. The former defines the tracing provider, in my case AWS X-Ray, and the latter defines the corresponding static cluster configuration containing AWS X-Ray's endpoint so that tracing data can be sent to it.
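The tracing-provider half of that pair might look roughly like the following sketch, which uses Envoy's X-Ray tracer (the segment name is illustrative, and 127.0.0.1:2000 is the default UDP endpoint of the X-Ray daemon sidecar described later):

envoy_tracing_json = <<EOF
{
  "http": {
    "name": "envoy.tracers.xray",
    "typedConfig": {
      "@type": "type.googleapis.com/envoy.config.trace.v3.XRayConfig",
      "segment_name": "dashboard-sidecar-proxy",
      "daemon_endpoint": {
        "protocol": "UDP",
        "address": "127.0.0.1",
        "port_value": 2000
      }
    }
  }
}
EOF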

In order to preserve end-to-end tracing between all services in the mesh, the services themselves require code that extracts tracing data from the request headers on every inbound connection and forwards it on every outbound connection. Additionally, tracing data, in particular the trace_id and request_id, must be included in every log message for traceability and easier debugging. With AWS X-Ray, this is possible using their language-specific SDKs. In the case of Python, you can achieve this with the aws-xray-sdk; similarly, there are SDKs for Go, Node.js, Java, Ruby and .NET.
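A minimal sketch of what this looks like for a Flask service (assuming the aws-xray-sdk, flask and requests packages; service names and ports are illustrative, following the dashboard-to-counter demo):

import logging

import requests
from flask import Flask

from aws_xray_sdk.core import patch, xray_recorder
from aws_xray_sdk.ext.flask.middleware import XRayMiddleware

app = Flask(__name__)

# Inbound: the middleware reads incoming trace headers and opens a segment
# for every request, shipping the data to the local X-Ray daemon.
xray_recorder.configure(service='dashboard', daemon_address='127.0.0.1:2000')
XRayMiddleware(app, xray_recorder)

# Outbound: patching `requests` forwards the trace headers on every
# outgoing call, preserving the end-to-end trace.
patch(['requests'])

@app.route('/')
def index():
    # Attach the trace_id to application logs for correlation.
    logging.info('incrementing counter, trace_id=%s',
                 xray_recorder.current_segment().trace_id)
    # Call the counter service through the sidecar's local upstream listener.
    return requests.get('http://127.0.0.1:8080/increment').text

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    app.run(host='0.0.0.0', port=9002)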

Putting it all together

For a single service running on AWS ECS, at least three sidecar containers are needed for it to be fully integrated into the Consul Service Mesh: the Consul agent, the Envoy proxy and the AWS X-Ray daemon. The Dockerfile for each of these can be found in the GitHub repo linked above. An example list of ECS container definitions may look like the one below:

AWS ECS Container Definitions for a Service to adapt to the Consul Service Mesh
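An abridged sketch of such a list for the dashboard service (image URIs, the agent's environment variable and ports are placeholders; consul_service_config_b64 carries the Base64-encoded service definition, as explained below):

[
  {
    "name": "dashboard",
    "image": "<account>.dkr.ecr.<region>.amazonaws.com/dashboard:latest",
    "portMappings": [{ "containerPort": 9002 }]
  },
  {
    "name": "consul-agent",
    "image": "<account>.dkr.ecr.<region>.amazonaws.com/consul-agent:latest",
    "environment": [
      { "name": "CONSUL_SERVER_ADDR", "value": "<consul-server-address>" }
    ]
  },
  {
    "name": "envoy-proxy",
    "image": "<account>.dkr.ecr.<region>.amazonaws.com/consul-envoy:latest",
    "environment": [
      { "name": "consul_service_config_b64", "value": "<base64-encoded service definition>" }
    ],
    "dependsOn": [{ "containerName": "consul-agent", "condition": "START" }]
  },
  {
    "name": "xray-daemon",
    "image": "amazon/aws-xray-daemon",
    "portMappings": [{ "containerPort": 2000, "protocol": "udp" }]
  }
]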

For simplicity, these containers are configured to run on AWS ECS Fargate with awsvpc task networking enabled, so that all containers share the same network namespace and can communicate with each other over the localhost interface.

  • The first container in the list is the Docker image for the application service itself. The service is completely unaware of Consul or the proxy and focuses only on its own business logic.
  • The second is the local Consul agent, which communicates with the Consul server and is responsible for feeding information about other services in the mesh into Envoy for its dynamic service discovery. It also provides a way for services to register themselves in Consul.
  • The third is the Envoy proxy, the data plane, which intercepts all inbound and outbound connections between services. In this setup the container, upon initialization, receives the Consul service definition and its related Connect proxy configuration through the Base64-encoded consul_service_config_b64 variable, registers it with Consul via the local agent and then starts the Envoy proxy.
  • Lastly, we need distributed tracing at the HTTP level and at the application logging level. The amazon/aws-xray-daemon sidecar container achieves this by allowing the application and the Envoy proxy to send tracing data to it locally over the localhost interface.

Conclusion

If you're still reading this, awesome! I hope you've enjoyed my walk-through of how one can implement an observability setup using a Consul Connect service mesh, and that this humble attempt to share my experience was useful to you.

You might ask: why Consul Connect? I think it's a really great product from HashiCorp, and its simple design as a control plane was the winning factor in my choice of service mesh. That simplicity is one reason I chose not to use Istio, along with Istio's tight coupling with Kubernetes. AWS App Mesh is also a good and simple alternative, but as of this writing its use of Route 53 for service discovery has, in my experience, resulted in delayed DNS propagation, and it is limited when it comes to more custom Envoy configuration. For now I think Consul Connect has more to offer, though I'll keep a close eye on App Mesh and see how things go.

With regards to AWS CloudWatch Logs, I've used it mainly for simplicity; alternatively, one can use Splunk, SumoLogic or the ELK stack. AWS X-Ray, on the other hand, is quite a decent tool for tracing and is simple to set up, but because Consul provides a pluggable mechanism for tracing providers, one can use Jaeger as well.

Lastly, I’m a big fan of Site Reliability Engineering and if you loved reading this post, I’d be happy to hear from you and your recent experiences with implementing and managing a Service Mesh in the comments below. Feel free to connect with me on LinkedIn as well. Huge shout-out to Nataliya, Davy and Krishna for their assistance in proofreading this article.

Jude Dsouza
SRE @ Zettle by PayPal | Cloud Architect | DevOps Enthusiast