CBI’s DevOps Needed A New Microservice Mesh. Here’s How We Tackled The Challenge — CB Insights Research

Zach Hanmer
Published in CBI Engineering
7 min read · Feb 5, 2020
Photo by Taylor Vick on Unsplash

As network architecture improves, it also gets more complex. Here’s how the engineers at CBI dug into this challenge and created a best-in-class network architecture.

The engineering team and overall product complexity at CB Insights have grown significantly over the past year as we’ve added new features and broken up legacy services; the number of jobs and services has tripled in that time.

As the number of services has grown, so has the complexity in how these services interact with each other and legacy systems.

Microservice architecture has been key to implementing all these new tools. Microservices are a collection of loosely coupled services that together make up an application, which means the application is, by definition, highly dependent on the network layer.

And with the adoption of all these services and microservice architecture, there’s a new challenge: improving service-to-service discovery and communication.

Here are some examples of how this challenge plays out:

  • Instead of debugging function and module calls within a monolithic application, engineers now need insight into network traffic between multiple services.
  • Instead of focusing only on common problems such as exception handling and bad input, engineers must also consider network request behavior and defend against it appropriately with retry handling and granular route control.

Since the majority of our operational problems were ultimately grounded in networking and observability, we looked to Envoy to ease these pain points.

Envoy is an open source edge and service-to-service network proxy designed to run alongside each service in your service mesh. With all service traffic flowing through the Envoy mesh, it becomes much easier to observe, tune, and control that traffic in a single place.

In this brief, we explain why and how CB Insights chose Envoy, as well as how we approached and managed its deployment. It reflects the comprehensive, collaborative approach our team takes to every engineering challenge.

The “Dumb Pipe”

Ever since we started building out new services and splitting features out from our monoliths, CB Insights’ backend infrastructure has been operating via a very basic service mesh.

A service mesh is essentially a network of microservices that make up a larger application, along with the interactions between those microservices, as opposed to a monolith, which keeps the majority of its code in a single service.

In the case of our service mesh architecture, a service that comes up registers itself with Consul, our service registry, and is then discoverable by other services via DNS queries. This is commonly referred to as a “Dumb Pipe” design, which focuses on simplicity and assumes that the network is “dumb.” In this design, communication and networking features are defined as close to the application as possible, leaving the application to do all of the heavy lifting for common operations such as retries, backoff, circuit breaking, and routing.
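
To make that “heavy lifting” concrete, here is a minimal Go sketch of what a “Dumb Pipe” client looks like from the application’s side. The `billing` service name is hypothetical, and the sketch assumes the host’s resolver forwards `.consul` lookups to Consul’s DNS interface; every service ends up carrying some variation of this discovery and retry boilerplate.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/http"
	"time"
)

// lookupFirstInstance resolves a service through Consul DNS (RFC 2782 SRV
// records, e.g. _billing._tcp.service.consul) and, like our old "dumb pipe"
// clients, simply takes the first instance it finds.
func lookupFirstInstance(service string) (string, error) {
	_, addrs, err := net.LookupSRV(service, "tcp", "service.consul")
	if err != nil {
		return "", err
	}
	if len(addrs) == 0 {
		return "", fmt.Errorf("no instances of %s registered", service)
	}
	// No load balancing: every caller piles onto addrs[0].
	return fmt.Sprintf("%s:%d", addrs[0].Target, addrs[0].Port), nil
}

// callWithRetry is the kind of retry/backoff boilerplate every service had
// to carry on its own before a shared proxy layer existed.
func callWithRetry(url string, attempts int) (*http.Response, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := http.Get(url)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil
		}
		if err == nil {
			resp.Body.Close()
			lastErr = fmt.Errorf("upstream returned %d", resp.StatusCode)
		} else {
			lastErr = err
		}
		time.Sleep(time.Duration(i+1) * 200 * time.Millisecond) // crude backoff
	}
	return nil, lastErr
}

func main() {
	addr, err := lookupFirstInstance("billing") // hypothetical service name
	if err != nil {
		log.Fatal(err)
	}
	resp, err := callWithRetry("http://"+addr+"/healthz", 3)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```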

While this worked at a basic level, it left much to be desired:

  • There was redundant code in every application, inconsistent logging and metrics, and no load balancing.
  • More often than not a service would pick the first instance of a remote service that it could find and send all traffic to that instance while the other instances remained idle.
  • This model also made it difficult to pinpoint intermittent communication errors due to the variability in communication metrics and logs between each service.

Moving forward

As the number of services we relied on continued to grow, we realized we needed something more than the “Dumb Pipe” design. We decided to explore managing all of this heavy lifting in a shared communication layer, such as a network proxy.

At this point, we needed to evaluate what was available and best suited to CB Insights’ infrastructure. We evaluated many service mesh and proxy offerings, focusing on:

  • Performance: Any performance losses at the proxy level needed to be offset by considerable feature gains. We planned to run a proxy instance alongside each service instance so we wanted the solution to be one that we were comfortable with in terms of both latency and resource utilization.
  • Features: gRPC support was the main differentiator among the options we had. gRPC is our main inter-service remote procedure call protocol, and without exceptional support for it, switching wouldn’t make sense, since we’d need to retool all of our services. Easy integration with our current service registry was another key factor; the last thing we wanted was to be editing config files by hand every time a new service was created.

In addition to Envoy, we investigated two other network proxies, NGINX and HAProxy, both established and well-known projects. The open source version of NGINX lacked the advanced load balancing (zone-aware balancing, ring hash, weighted least request) that Envoy provided, and it turned out to be a little too heavy to run alongside each service. Another popular option, HAProxy, had no gRPC support at the time of our evaluation, so it was dropped right out of the gate.

Towards the start of the evaluation we had sort of pushed Envoy to the side due to its complexity, but each time we evaluated a different proxy we were drawn more to what Envoy could do: WebSocket support, advanced load balancing, gRPC support, and the ability to run in a container using less than 20MB of memory at peak.

Removing the “Dumb Pipe”

The DevOps team began laying the foundation for the first generation of our Envoy rollout on our production servers. This iteration would utilize a Consul watcher embedded in each Envoy container: each time there is a change in Consul, the watcher re-renders a complicated, predefined Envoy configuration template and then hot restarts Envoy.
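
As a rough illustration (not our actual watcher), here is what that loop looks like in Go using the hashicorp/consul/api client: block on Consul’s service catalog, re-render the template when something changes, then kick off Envoy’s hot restart by launching a new process with an incremented --restart-epoch. The template, file path, and restart details are stand-ins for the real, much larger configuration.

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"strconv"
	"text/template"
	"time"

	"github.com/hashicorp/consul/api"
)

// envoyTemplate stands in for the (much larger) predefined Envoy config
// template; it receives the service catalog and renders clusters/routes.
var envoyTemplate = template.Must(template.New("envoy").Parse(
	"# rendered for {{len .}} services\n"))

func renderConfig(services map[string][]string) error {
	f, err := os.Create("/etc/envoy/envoy.yaml")
	if err != nil {
		return err
	}
	defer f.Close()
	return envoyTemplate.Execute(f, services)
}

// hotRestart launches a fresh Envoy with an incremented --restart-epoch,
// letting Envoy's built-in hot restart take over from the previous process.
func hotRestart(epoch int) error {
	cmd := exec.Command("envoy",
		"-c", "/etc/envoy/envoy.yaml",
		"--restart-epoch", strconv.Itoa(epoch))
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Start()
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	var lastIndex uint64
	epoch := 0
	for {
		// Blocking query: returns when the service catalog changes
		// (or the wait time elapses).
		services, meta, err := client.Catalog().Services(&api.QueryOptions{
			WaitIndex: lastIndex,
			WaitTime:  5 * time.Minute,
		})
		if err != nil {
			log.Println("consul query failed:", err)
			time.Sleep(5 * time.Second)
			continue
		}
		if meta.LastIndex == lastIndex {
			continue // wait timed out, nothing changed
		}
		lastIndex = meta.LastIndex

		if err := renderConfig(services); err != nil {
			log.Println("render failed:", err)
			continue
		}
		if err := hotRestart(epoch); err != nil {
			log.Println("hot restart failed:", err)
			continue
		}
		epoch++
	}
}
```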

This is a method we had previously used with NGINX, and while it was a bit clunky, it would work well enough in the short term to get enough buy-in to roll out a more polished approach.

Once the infrastructure was in place, a team serendipitously began reporting that a service using the “Dumb Pipe” model was experiencing flaky communication with another service.

This was the chance to show off our new integration, and we quickly made the switch from the “Dumb Pipe” to Envoy and immediately saw an improvement. All available instances of the upstream service were balanced, and dropped gRPC calls were automatically retried. We went from having a few calls failing per minute to a few calls failing per day.

Performance-wise, Envoy showed no measurable impact in the latencies between those services compared to direct communication, with a measurable benefit in terms of load balancing and retries. The holy grail, basically.

Following the undeniable success of the first service switch, work began on upgrading more services away from the “Dumb Pipe” to Envoy. We have much more visibility into the communication on our network thanks to logs and metrics Envoy provides. This has allowed us to adjust deployment and service registration strategies to minimize traffic loss between services.

Setbacks & future outlook

The rollout wasn’t entirely without setbacks, however, especially when it came to some of our legacy services.

The way we configured routing for our gRPC services, Envoy expected the protocol buffer package name to match the name of the service registered with Consul, since nearly all of our services followed this convention. Unfortunately, a few legacy services were created before this standard was put in place.
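
The convention is easiest to see from the shape of a gRPC request path, which is always “/<proto package>.<Service>/<Method>”. The Go sketch below (using a hypothetical `billing` package) shows how the upstream Consul service, and therefore the Envoy cluster, falls straight out of a prefix match on that path when the package name matches the registered service name.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterForPath derives the upstream Consul service (and hence the Envoy
// cluster) from a gRPC request path. gRPC paths have the shape
// "/<proto package>.<Service>/<Method>", so when the proto package matches
// the name the service registered in Consul, routing needs nothing more
// than a prefix match on the path.
func clusterForPath(path string) (string, error) {
	trimmed := strings.TrimPrefix(path, "/")
	slash := strings.Index(trimmed, "/")
	if slash < 0 {
		return "", fmt.Errorf("not a gRPC path: %q", path)
	}
	fullService := trimmed[:slash] // e.g. "billing.BillingService"
	dot := strings.LastIndex(fullService, ".")
	if dot < 0 {
		return "", fmt.Errorf("no proto package in %q", fullService)
	}
	return fullService[:dot], nil // proto package == Consul service name
}

func main() {
	cluster, err := clusterForPath("/billing.BillingService/GetInvoice")
	if err != nil {
		panic(err)
	}
	fmt.Println(cluster) // prints "billing" — the hypothetical Consul service
}
```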

Rather than bend the rules to accommodate these services, we left them out of the Envoy rollout until their protocol buffer package names conformed to the standard. While this required more work, it was an overall benefit, as it forced all of our services, even the legacy ones, to follow the same standards.

Another minor issue we had during the rollout was the integration with the local development environment.

Our local development process involves spinning up a mini environment on an engineer’s computer and merging that into the hosted development environment; if a service isn’t running locally, traffic should be sent to the instance of that service running in the hosted development environment. This turned out to be surprisingly easy to fix by configuring load balancer subsets with fallback policies in Envoy, as sketched below.
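
Conceptually, the subset-with-fallback behavior boils down to: prefer instances tagged as running on the engineer’s machine, and fall back to the hosted development environment when there are none. The Go sketch below illustrates that selection logic only; it is not Envoy’s configuration API, and the env metadata key is hypothetical.

```go
package main

import "fmt"

// Endpoint is a simplified view of an upstream instance plus the metadata
// a subset-aware load balancer would match on.
type Endpoint struct {
	Addr string
	Env  string // e.g. "local" or "dev" — hypothetical metadata key
}

// pickSubset mirrors, in spirit, what Envoy's load-balancer subsets with a
// fallback policy gave us: route to instances running on the engineer's
// machine when they exist, otherwise fall back to the hosted dev
// environment. (Conceptual sketch only, not Envoy's actual API.)
func pickSubset(endpoints []Endpoint) []Endpoint {
	var local, fallback []Endpoint
	for _, e := range endpoints {
		if e.Env == "local" {
			local = append(local, e)
		} else {
			fallback = append(fallback, e)
		}
	}
	if len(local) > 0 {
		return local
	}
	return fallback
}

func main() {
	eps := []Endpoint{
		{Addr: "10.0.1.17:8080", Env: "dev"},
		{Addr: "127.0.0.1:8080", Env: "local"},
	}
	fmt.Println(pickSubset(eps)) // prefers the locally running instance
}
```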

While the first generation of our Envoy rollout utilizing the Consul watcher worked well, we could tell it just wouldn’t scale in its current form. As the number of our services grows, so does the number of Envoy sidecars querying Consul for configurations. The complexity of the configuration template also grows with each service, making it difficult to verify and test.

Luckily, Envoy has a control plane API that can dynamically stream configurations to Envoy instances over gRPC. This allows us to query Consul for services from one location, create configurations for different types of Envoy instances (edge proxy, service-to-service, etc.), and test more easily, which matters for such a critical piece of infrastructure. We’ll cover how we built a fully custom Envoy control plane in a future post.

Interested in joining a team tackling tough infrastructure challenges? We are hiring!

Originally published at https://www.cbinsights.com.
