Implementing Observability in a Service Mesh

Jude Dsouza
Sep 28 · 8 min read
A Command-and-Control Center image taken from www.appliedglobal.com

I work as an SRE at iZettle and, at times, during on-call it can be challenging to understand the contributing factors of incidents or the nature of failures as they happen. For anyone running a microservices architecture, setting up a decent level of observability into the system so you can quickly understand what's going on is not an easy task.

In this article I'd like to share my experience of how one can improve the observability of services with the Consul Connect service mesh, and later I'll explain why I chose it over the alternatives. As a recap, a service mesh routes HTTP traffic through proxies that are deployed as sidecar containers alongside each service. These proxies can then expose metrics, logs and traces that give you rich visibility into the system, making it easier to debug applications quickly. Consul Connect, in addition, provides service-to-service authorization and encryption using mutual TLS, and uses Envoy as the data plane. In terms of observability, I will cover basic Layer 7 metrics, access logs and distributed tracing using Prometheus, Grafana, AWS CloudWatch Logs and AWS X-Ray.

All the code for this demo can be found in my GitHub repo.

Prerequisites

The Service Mesh Environment

High level Architecture of the Service Mesh Implementation using Consul Connect running on ECS Fargate

The environment comprises the following:

  • A Consul cluster.
  • A dashboard and counter service running on AWS ECS Fargate. The dashboard service has an upstream connection with the counter service and communicates securely with it to increment a counter.
  • Each service's Task Definition has sidecar container definitions for the Envoy proxy and for a Consul agent that communicates with the Consul server.
  • A Consul managed Ingress Gateway for external services to communicate with services inside the mesh.

Since Consul is the control plane, it is responsible for configuring each of the Envoy proxies. It does this via the same Service Definition entries that are used to register services into Consul. The following is an example config:
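For example, a minimal sketch of such a service definition for the dashboard service could look something like this (the port numbers and the upstream binding are illustrative assumptions rather than the exact values used in the demo):

service {
  name = "dashboard"
  port = 9002

  connect {
    # An empty sidecar_service block is enough to register an Envoy sidecar proxy
    # with default values; here the dashboard also declares the counter service as
    # an upstream, reachable locally through Envoy on port 8080.
    sidecar_service {
      proxy {
        upstreams = [
          {
            destination_name = "counter"
            local_bind_port  = 8080
          }
        ]
      }
    }
  }
}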

Here, the connect.sidecar_service field tells Consul to register a proxy service with default values. For more details on how sidecar service registration works, read the documentation here.

You can also define global default settings for the proxy via proxy-defaults configuration entries. And if that’s not enough, it gives you the flexibility to define more advanced and custom configurations in the form of Escape Hatches. For a full list of options, check out the Envoy integration page here for more details.
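As a minimal illustrative sketch (assuming a global scope), a proxy-defaults entry written to a file and applied with consul config write could look like this:

Kind = "proxy-defaults"
Name = "global"

Config {
  # Treat service traffic as HTTP so that Envoy can emit Layer 7 metrics,
  # access logs and traces instead of plain TCP statistics.
  protocol = "http"
}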

Throughout the remainder of this article, I will be discussing a couple of these configuration options, in particular the ones that can help us set up observability so that we can get HTTP Layer 7 metrics, logs and tracing.

Implementing Observability

Implementing Observability in a Consul Connect Service Mesh using Prometheus, AWS CloudWatch and AWS X-Ray

HTTP Layer 7 Metrics

Envoy can expose its metrics on a Prometheus endpoint (enabled through the envoy_prometheus_bind_addr option in the proxy config), and Prometheus can then discover the proxies automatically using its Consul service discovery mechanism, as shown in the example config below:
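(In the sketch below, the Envoy metrics port 9102, the Consul address and the -sidecar-proxy naming convention are assumptions; adjust them to match your own proxy configuration.)

scrape_configs:
  - job_name: "envoy-sidecars"
    metrics_path: /metrics
    consul_sd_configs:
      # Address of the Consul HTTP API used for service discovery
      - server: "localhost:8500"
    relabel_configs:
      # Keep only the Connect sidecar proxies registered by Consul
      - source_labels: [__meta_consul_service]
        action: keep
        regex: .*-sidecar-proxy
      # Point the scrape at Envoy's Prometheus endpoint (envoy_prometheus_bind_addr)
      # rather than the proxy's service port
      - source_labels: [__meta_consul_address]
        regex: (.*)
        target_label: __address__
        replacement: "${1}:9102"
      # Carry the Consul service name over as a label for dashboards
      - source_labels: [__meta_consul_service]
        target_label: service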

Prometheus config that discovers Envoy Proxies via Consul SD

These metrics can then be visualized in Grafana.
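For example, assuming the relabelling above has copied the Consul service name into a service label, PromQL queries along these lines can drive a basic service-health dashboard (the metric names are Envoy's standard Prometheus stats):

# Per-service request rate, broken down by response code class (2xx, 4xx, 5xx)
sum(rate(envoy_cluster_upstream_rq_xx[1m])) by (service, envoy_response_code_class)

# Success rate over the last 5 minutes
sum(rate(envoy_cluster_upstream_rq_xx{envoy_response_code_class="2"}[5m])) by (service)
  /
sum(rate(envoy_cluster_upstream_rq_total[5m])) by (service)

# 99th percentile upstream request latency in milliseconds
histogram_quantile(0.99, sum(rate(envoy_cluster_upstream_rq_time_bucket[5m])) by (service, le))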

Screenshot of an overview of basic service-health Layer 7 metrics

Access Logs

Consul Connect `proxy-defaults.hcl` config for HTTP Tracing and Access Logs in Envoy

With this config (skipping over the tracing-related settings, which are covered in the next section), access logs are written directly to /dev/stdout, where they are collected by the awslogs log driver and pushed to AWS CloudWatch Logs. You should then see logs similar to the following; of course, you can add more fields depending on your requirements:

{
"method": "GET",
"request_id": "420e2950-50be-43c3-8d41-9b8e9ac9937b",
"bytes_sent": "188",
"origin_path": "/",
"authority": "dashboard.ingress",
"x_forward_for": "10.140.21.77",
"protocol": "HTTP/1.1",
"upstream_service_time": "10",
"duration": "10",
"downstream_remote_addr_without_port": "10.140.21.77",
"user_agent": "curl/7.61.1",
"upstream": "127.0.0.1:8080",
"response_code": "200",
"bytes_recv": "0",
"response_flags": "-",
"start_time": "2020-09-03T11:47:01.662Z"
}

The request_id field is important here: with it you can correlate specific request traces in tools like AWS X-Ray and even in application logs.

HTTP Request Tracing

An instance of an AWS X-Ray HTTP trace between the dashboard and counter services

Finally, HTTP request tracing can be configured in Envoy by setting the envoy_tracing_json and envoy_extra_static_clusters_json fields of the proxy.config service definition. The former defines the tracing provider, in my case AWS X-Ray, and the latter defines the corresponding static cluster configuration that contains AWS X-Ray's endpoint so that HTTP request tracing data can be sent to it.
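As a sketch (the exact JSON schema depends on the Envoy version, and the segment and cluster names here are assumptions), the relevant config block under the sidecar proxy of the dashboard's service definition could look roughly like this:

# Sits under connect.sidecar_service.proxy in the service definition shown earlier.
config {
  # Configure Envoy's AWS X-Ray tracer to send trace segments to the local
  # X-Ray daemon over UDP.
  envoy_tracing_json = <<-EOF
  {
    "http": {
      "name": "envoy.tracers.xray",
      "typedConfig": {
        "@type": "type.googleapis.com/envoy.config.trace.v3.XRayConfig",
        "segment_name": "dashboard",
        "daemon_endpoint": {
          "protocol": "UDP",
          "address": "127.0.0.1",
          "port_value": 2000
        }
      }
    }
  }
  EOF

  # A static cluster describing the X-Ray daemon endpoint.
  envoy_extra_static_clusters_json = <<-EOF
  {
    "name": "xray_daemon",
    "type": "STATIC",
    "connect_timeout": "1s",
    "load_assignment": {
      "cluster_name": "xray_daemon",
      "endpoints": [{
        "lb_endpoints": [{
          "endpoint": {
            "address": {
              "socket_address": {
                "address": "127.0.0.1",
                "port_value": 2000,
                "protocol": "UDP"
              }
            }
          }
        }]
      }]
    }
  }
  EOF
}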

In order to preserve end-to-end tracing between all services in the service mesh, the services themselves require code that extracts tracing data from the request headers on every inbound connection and forwards it on every outbound connection. Additionally, tracing data, in particular the trace_id and request_id, must also be included in every log message for traceability and better debugging. With AWS X-Ray, this is possible using their language-specific SDKs. In the case of Python, you can achieve this with the aws-xray-sdk; similarly, there are SDKs for Go, Node.js, Java, Ruby and .NET.
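For instance, a minimal sketch of a Python service wired up with the aws-xray-sdk might look like the following (the use of Flask, the service names and the upstream port are assumptions of this sketch, not details from the demo):

import logging

import requests
from flask import Flask
from aws_xray_sdk.core import xray_recorder, patch
from aws_xray_sdk.ext.flask.middleware import XRayMiddleware

app = Flask(__name__)

# Name the X-Ray segment and point the SDK at the local X-Ray daemon.
xray_recorder.configure(service="dashboard", daemon_address="127.0.0.1:2000")

# Extracts the trace header from every inbound request and opens a segment.
XRayMiddleware(app, xray_recorder)

# Patches the requests library so outbound calls carry the trace header downstream.
patch(["requests"])

@app.route("/")
def index():
    # The upstream counter service is reached on the port Envoy bound locally.
    resp = requests.get("http://127.0.0.1:8080/")
    # Include the trace id in log messages for correlation with the access logs.
    logging.info("fetched counter, trace_id=%s", xray_recorder.current_segment().trace_id)
    return resp.text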

Putting it all together
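Each service's ECS task bundles four containers: the application itself, a local Consul agent, the Envoy sidecar proxy and the AWS X-Ray daemon. A rough sketch of the container definitions is shown below; the images, ports, log group, region and the Consul join address are placeholders rather than the exact values used in the demo:

[
  {
    "name": "dashboard",
    "image": "<your-registry>/dashboard:latest",
    "essential": true,
    "portMappings": [{ "containerPort": 9002 }],
    "environment": [
      { "name": "AWS_XRAY_DAEMON_ADDRESS", "value": "127.0.0.1:2000" }
    ]
  },
  {
    "name": "consul-agent",
    "image": "consul:1.8.4",
    "essential": true,
    "command": ["agent", "-retry-join", "<consul-server-address>", "-data-dir", "/consul/data"]
  },
  {
    "name": "envoy-proxy",
    "image": "<your-registry>/consul-envoy:latest",
    "essential": true,
    "environment": [
      { "name": "consul_service_config_b64", "value": "<base64-encoded service definition>" }
    ],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/dashboard-envoy",
        "awslogs-region": "eu-west-1",
        "awslogs-stream-prefix": "envoy"
      }
    }
  },
  {
    "name": "xray-daemon",
    "image": "amazon/aws-xray-daemon",
    "essential": true,
    "portMappings": [{ "containerPort": 2000, "protocol": "udp" }]
  }
]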

AWS ECS Container Definitions for a Service to adapt to the Consul Service Mesh

For simplicity, in this setup, these containers run on AWS ECS Fargate with awsvpc task networking enabled, so all containers in a task share the same localhost and can communicate with each other over that interface.

  • The first container in the list is the docker image for the application service itself. The service is completely unaware of Consul or the Proxy and is meant to focus only on its own business logic.
  • The second is the local Consul agent, which communicates with the Consul server and is responsible for getting information about other services in the mesh to feed into Envoy for its dynamic service discovery. It also provides a way for services to register themselves into Consul.
  • The third is the Envoy proxy, which is the data plane and is responsible for intercepting all inbound and outbound connections between services. In this setup, the container, upon initialization, receives the Consul service definition and its related Consul Connect proxy configuration through the consul_service_config_b64 variable (encoded in Base64) and registers it with Consul via the local Consul agent. It then starts the Envoy proxy (a sketch of this initialization follows the list below).
  • Lastly, we need distributed tracing at the HTTP level and at the application logging level. The amazon/aws-xray-daemon sidecar container helps achieve this by allowing the application and the Envoy proxy to send tracing data to it locally over the localhost interface.
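As a rough sketch of what that initialization could look like inside the Envoy proxy container (the file path and the SERVICE_NAME variable are assumptions; only the consul_service_config_b64 variable comes from the setup described above):

#!/bin/sh
set -e

# Decode the Base64-encoded Consul service definition passed in via the environment.
echo "$consul_service_config_b64" | base64 -d > /tmp/service.hcl

# Register the service and its sidecar proxy with the local Consul agent.
consul services register /tmp/service.hcl

# Bootstrap and start Envoy as the Connect sidecar for the registered service.
exec consul connect envoy -sidecar-for "$SERVICE_NAME"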

Conclusion

You might ask: why Consul Connect? I think it's a really great product from HashiCorp, and its simple design as a control plane was the winning factor in my choice of service mesh. This is one of the reasons I chose not to use Istio, along with Istio's tight coupling with Kubernetes. AWS App Mesh is also a good and simple alternative, but as of this writing its use of Route 53 for service discovery has, in my experience, resulted in delayed DNS propagation, and it is more limited when it comes to custom Envoy configuration. I think Consul Connect has much more to offer, though I'll still keep a close eye on App Mesh and see how things go.

With regard to AWS CloudWatch Logs, I've used it mainly for simplicity, but alternatively one could use Splunk, Sumo Logic or the ELK stack. AWS X-Ray is quite a decent tool for tracing and is simple to set up, but because Consul provides a pluggable mechanism for tracing providers, one could use Jaeger as well.

Lastly, I’m a big fan of Site Reliability Engineering and if you loved reading this post, I’d be happy to hear from you and your recent experiences with implementing and managing a Service Mesh in the comments below. Feel free to connect with me on LinkedIn as well. Huge shout-out to Nataliya, Davy and Krishna for their assistance in proofreading this article.
