Service Mesh @Box

Tejo Chenchu · Published in Box Tech Blog · Jan 19, 2022

Illustrated by Navied Mahdavian / Art Directed by Christie Folsom

What is Service Mesh?

This topic is explained in detail on many popular blogs, such as those from Tetrate.io and Red Hat. Modern systems are built from many microservices, each of which depends on other microservices to accomplish its tasks. That's where a Service Mesh comes in: it is a framework and infrastructure layer that lets microservices securely connect and communicate with each other. Individual services don't have to worry about where their dependencies are located; the Service Mesh takes care of that and lets individual services route their requests through Service Mesh components. It also provides clear observability of the entire traffic pattern within the infrastructure, along with capabilities to control it. Many solutions are available in the market to address Service Mesh needs, including Istio, Linkerd, Consul, AWS App Mesh, Traefik Mesh, Kuma, and Open Service Mesh.

Why Service Mesh at Box?

When Box started in 2005, all application services were built as a monolithic application in a single source-control repository. This application was deployed on multiple instances for better scalability and availability. Services were deployed as web services and could talk to each other on the same instance. Pretty cool! As Box grew, so did the complexity of maintaining a monolithic application, so Box began migrating to a microservices architecture. All new features and services are now built as microservices, and today hundreds of microservices run in Box infrastructure.

All Box services are deployed on a complex, heterogeneous infrastructure. Some services run on Virtual Machines, some on Windows hosts, and some on Kubernetes. These systems are hosted in data centers located in different regions of the USA and in the public cloud. Services need to connect to other services, irrespective of where they are hosted, to accomplish their tasks. If there were only a handful of services with a few instances each, we might have been able to use a central load balancer for each service. However, our services grew numerous enough that managing central load balancers became difficult; some services run on 1,000+ instances for better availability and scalability. This is where a Service Mesh comes in.

A Service Mesh solution can help:

  • Service clients discover instances (endpoints) of services — Service Discovery
  • Services make themselves discoverable by their clients — Service Registration
  • Services connect to each other securely — Service Proxy

The Service Mesh at Box is implemented using open source components — Nerve, Synapse, and Envoy.

SmartStack-based Service Mesh Architecture at Box

When Box started adopting a Service Mesh, very few mature solutions were available in the market. We adopted Nerve and Synapse from Airbnb and modified them to better integrate with Box infrastructure.

The three core functionalities of a Service Mesh solution are Service Discovery, Service Registration, and Service Proxy. Box uses Synapse, Nerve, and Envoy for Service Discovery, Service Registration, and Service Proxy, respectively. Synapse and Nerve together are called SmartStack. All these components are deployed next to services on the same instances: on Virtual Machines they run as individual processes, and in Kubernetes Pods they run as sidecar containers.
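
To make the sidecar layout concrete, here is a minimal sketch of what a Kubernetes Pod with these components could look like. The names, images, ports, and mount paths are illustrative assumptions, not Box's actual manifests.

```yaml
# Illustrative only: an application container running alongside Synapse and
# Envoy sidecars. Image names and paths are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: upload-service
spec:
  containers:
  - name: upload-service               # the application itself
    image: registry.example.com/upload-service:1.0
  - name: synapse                      # discovers endpoints, writes Envoy config
    image: registry.example.com/synapse:latest
    volumeMounts:
    - {name: envoy-config, mountPath: /etc/envoy}
  - name: envoy                        # proxies all ingress/egress traffic
    image: envoyproxy/envoy:v1.20.0
    volumeMounts:
    - {name: envoy-config, mountPath: /etc/envoy}
  volumes:
  - name: envoy-config                 # shared so Envoy picks up Synapse's output
    emptyDir: {}
```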

Synapse: Synapse is an open source project from Airbnb. It discovers service endpoints and generates the proxy configuration, running next to a service as a separate process or sidecar container. The open source version of Synapse can discover endpoints registered with Zookeeper, but services at Box are deployed on heterogeneous infrastructure, with most of them on Kubernetes. We forked Synapse locally and enhanced it to discover service endpoints from Kubernetes, and also to generate Envoy proxy configuration; the open source version generates configuration for HAProxy, another service proxy. Box migrated from HAProxy to Envoy, but that is a story for another blog! Synapse can simultaneously discover endpoints of multiple services.
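
As a rough sketch, an upstream-style Synapse config watching one Zookeeper-registered service looks something like this. The hosts, paths, and ports are illustrative, and Box's fork emits Envoy configuration instead of the HAProxy sections shown here.

```yaml
# Sketch of an upstream-style Synapse config (synapse.conf.yaml).
services:
  encryption-service:
    discovery:
      method: zookeeper                                 # watch ZNodes written by Nerve
      path: /smartstack/services/encryption-service     # illustrative registration path
      hosts:
        - zk1.example.com:2181
        - zk2.example.com:2181
    haproxy:
      port: 3219                                        # local port clients connect to
      server_options: check inter 2000 rise 3 fall 2
haproxy:
  config_file_path: /etc/haproxy/haproxy.cfg
  reload_command: service haproxy reload
  do_writes: true
  do_reloads: true
```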

Nerve: Nerve is another open source project from Airbnb. It registers the endpoints of a service in a central location (Zookeeper) so that Synapse can discover them there. It runs next to a service as a separate process. Nerve is deployed only on Virtual Machines, as Kubernetes provides its own service registration mechanism. Nerve constantly checks the health of a service and updates Zookeeper accordingly. We also forked and enhanced Nerve to meet the needs of Box infrastructure.
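
A minimal sketch of a Nerve config for this setup follows. Nerve's native config is JSON; it is shown as YAML here so it can carry comments, and the host, paths, and health-check URI are illustrative assumptions.

```yaml
# Sketch of a Nerve config registering one service into Zookeeper.
instance_id: encryption-vm-042
services:
  encryption-service:
    host: 10.20.30.40            # this VM's IP, registered as the endpoint
    port: 8443
    reporter_type: zookeeper     # where registrations are written
    zk_hosts:
      - zk1.example.com:2181
    zk_path: /smartstack/services/encryption-service
    check_interval: 2            # seconds between health checks
    checks:
      - type: http
        uri: /health             # hypothetical health endpoint
        timeout: 0.5
        rise: 3                  # consecutive successes before registering
        fall: 2                  # consecutive failures before deregistering
```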

Envoy: Envoy is a popular service proxy used by most Service Mesh solutions. Services connect to other services through Envoy; all traffic goes through the Envoy proxy, which provides better observability of the entire Service Mesh traffic flow. Envoy also provides features like advanced load balancing algorithms, outlier detection, and circuit breaking. It is deployed next to a service as a separate process or a sidecar container.
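
To ground those features, here is a fragment of an Envoy v3 cluster definition using them. The cluster name, EDS file path, and thresholds are illustrative assumptions rather than Box's actual settings.

```yaml
# Fragment of an Envoy v3 cluster showing load balancing, circuit breaking,
# and outlier detection. Values are illustrative.
- name: encryption-service
  type: EDS                          # endpoints come from a file Synapse writes
  connect_timeout: 1s
  lb_policy: LEAST_REQUEST           # one of Envoy's load balancing algorithms
  eds_cluster_config:
    eds_config:
      resource_api_version: V3
      path: /etc/envoy/eds/encryption-service.yaml   # hypothetical path
  circuit_breakers:
    thresholds:
    - max_connections: 1024          # cap concurrent connections to the cluster
      max_pending_requests: 1024     # shed load instead of queueing forever
  outlier_detection:
    consecutive_5xx: 5               # eject an endpoint after 5 straight 5xx
    base_ejection_time: 30s
```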

Putting It All Together

Synapse, Nerve, and Envoy proxies work together to form the Service Mesh architecture at Box. These components are deployed next to services on the same machine or in the same Kubernetes pod. Let's look at some examples to understand how these components work together.

Example 1: Requests from Upload Service to Encryption Service

When a user uploads a file to Box, the upload request passes through Edge, API Gateway, and authentication layers before it reaches Upload Service. All files are encrypted before they are stored in block storage. Let's look at how Upload Service discovers and connects to Encryption Service. Upload Service connects to a few other services as well, but for this example we'll focus on the connection to Encryption Service.

This example assumes Upload Service is deployed on Kubernetes and Encryption Service on Virtual Machines. The image below shows all the steps of the Service Registration, Service Discovery, and Service Proxy functionalities of Service Mesh.

  1. Nerve is running as a separate process next to Encryption Service on the same Virtual Machine. Every couple of seconds, Nerve checks the health of Encryption Service.
  2. Depending on the health of Encryption Service, Nerve updates its status in Zookeeper. The service endpoint (the IP address of the Virtual Machine) is stored as a ZNode in Zookeeper under the Encryption Service Pool. The service is deployed on multiple instances for better availability and scalability, and Nerve, running on every instance, registers that instance's endpoint as a ZNode under the Service Pool. This is the Service Registration functionality of Service Mesh.
  3. Synapse is running as a sidecar container next to Upload Service. Synapse constantly watches for any changes to ZNodes of Encryption Service in Zookeeper. This is the Service Discovery Functionality of Service Mesh.
  4. Whenever there is any change to the ZNodes (endpoints are added or removed), Synapse generates new Envoy configuration and writes it to a file. Here, Synapse updates the endpoints configuration (EDS) file for Encryption Service (a sketch of such a file follows this list).
  5. When the EDS file for Encryption Service is updated, Envoy proxy reloads its configuration in-memory and uses the updated endpoints for Service Communication.
  6. When a user makes an upload request, the Upload Service connects to Encryption Service running on Virtual Machines through Envoy Proxy. This is the Service Proxy functionality of Service Mesh. Envoy load balances requests from Upload Service across all the Encryption Service endpoints, and provides features like circuit breaking and outlier detection. The connection is established between the Envoy Proxies of Upload Service and Encryption Service, and secured using mTLS.
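
Here is a minimal sketch of what the EDS file from steps 4–5 could contain, in Envoy v3 ClusterLoadAssignment form; the addresses and port are illustrative.

```yaml
# Hypothetical contents of /etc/envoy/eds/encryption-service.yaml,
# regenerated by Synapse whenever endpoints change.
resources:
- "@type": type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment
  cluster_name: encryption-service
  endpoints:
  - lb_endpoints:
    - endpoint:
        address:
          socket_address: {address: 10.20.30.40, port_value: 8443}
    - endpoint:
        address:
          socket_address: {address: 10.20.30.41, port_value: 8443}
```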

Example 2: Requests from Download Service to Metadata Service

When a user makes a download request for a file, the Download Service has to connect to the file Metadata Service to get more details about the file. As in the earlier example, the user request reaches Download Service after going through the Edge, API Gateway, and Authentication layers.

For this example let us assume Download Service is deployed on Virtual Machines and Metadata Service on Kubernetes.

  1. Metadata Service is deployed as multiple replicas in a Kubernetes namespace for Metadata. A Kubernetes Service is created for Metadata Service, and Kubernetes automatically creates an "Endpoints" resource with all pod endpoints (the IP addresses of the pods). When any pod is bounced, Kubernetes creates a new Pod with a different IP address and updates the "Endpoints" resource automatically. Service Registration here is performed by Kubernetes itself; no Nerve is deployed in Kubernetes Pods (see the sketch after this list).
  2. Synapse is running as a separate process next to Download Service on the same Virtual Machine. As Metadata Service registers its endpoints to “Endpoints” in Kubernetes, Synapse watches them through Kubernetes API Server. Synapse does Service Discovery from Kubernetes through its API Server.
  3. The Envoy configuration is updated when there is any change to endpoints of Metadata Service. EDS files of Metadata Service are updated accordingly.
  4. Envoy Proxy is running as a separate process next to Download Service on the same Virtual Machine. When EDS files are modified, Envoy loads those changes in-memory and uses updated configuration for service communication.
  5. Similar to the previous example, when a user makes a download request, the Download Service connects to Metadata Service running on Kubernetes through Envoy Proxy. This is the Service Proxy functionality of Service Mesh. Envoy load balances requests from Download Service to Metadata Service instances. The connection is established between Envoy Proxies of Download Service and Metadata Service. This connection is secured with mTLS by Envoy.
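
For a concrete picture of step 1, here is a minimal sketch of a Kubernetes Service and the Endpoints resource Kubernetes maintains for it. The names, namespace, and IP addresses are illustrative assumptions, not our actual manifests.

```yaml
# A Service selecting the Metadata Service pods...
apiVersion: v1
kind: Service
metadata:
  name: metadata-service
  namespace: metadata
spec:
  selector:
    app: metadata-service          # pods matching this label become endpoints
  ports:
  - port: 8443
    targetPort: 8443
---
# ...and the Endpoints object Kubernetes maintains automatically.
# Synapse watches this via the API server rather than creating it.
apiVersion: v1
kind: Endpoints
metadata:
  name: metadata-service
  namespace: metadata
subsets:
- addresses:
  - ip: 10.244.1.17                # pod IPs; refreshed when pods are replaced
  - ip: 10.244.2.23
  ports:
  - port: 8443
```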

In the two examples above, Envoy is used for ingress and egress traffic by a service. This ensures that all traffic goes through Envoy.

The Big Picture

Services and their clients may be deployed on Virtual Machines or Kubernetes, and they connect to each other using SmartStack components and Envoy Proxy. Only one instance of each SmartStack component and Envoy Proxy runs next to a service; the same processes are used to discover and connect to all of that service's dependencies.

For example, Upload Service depends on Encryption Service and Metadata Service. The Synapse instance running next to Upload Service discovers endpoints for Encryption Service from Zookeeper and for Metadata Service from the Kubernetes API Server, and the same Envoy Proxy is used by Upload Service to connect to both.
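
A hedged sketch of how a single Synapse instance might watch both dependencies is below. The zookeeper watcher mirrors upstream Synapse; the kubernetes watcher and its keys are hypothetical stand-ins for the Kubernetes discovery that Box's internal fork added.

```yaml
# Sketch: one Synapse instance watching two registries at once.
services:
  encryption-service:
    discovery:
      method: zookeeper
      path: /smartstack/services/encryption-service
      hosts:
        - zk1.example.com:2181
  metadata-service:
    discovery:
      method: kubernetes           # hypothetical watcher added in Box's fork
      namespace: metadata          # hypothetical key: namespace to watch
      service: metadata-service    # hypothetical key: Service whose Endpoints to follow
```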

Nerve is not deployed next to services running in Kubernetes, as Kubernetes takes care of Service Registration. Synapse and Envoy are deployed next to all services, irrespective of where they run. The connection between services is secured with mTLS between the Envoy Proxies on either side, which eliminates the need for individual services to encrypt or decrypt data in transit. Envoy also provides extensive observability into the traffic flowing through the entire infrastructure.
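
As a rough illustration, client-side mTLS can be expressed on an Envoy v3 upstream cluster like this; the certificate file paths are assumptions (in our case the certificates come from our internal PKI service, discussed later).

```yaml
# Fragment attached to an upstream cluster: present a client certificate and
# verify the server against a trusted CA.
transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
    common_tls_context:
      tls_certificates:
      - certificate_chain: {filename: /etc/envoy/certs/client-cert.pem}
        private_key: {filename: /etc/envoy/certs/client-key.pem}
      validation_context:
        trusted_ca: {filename: /etc/envoy/certs/ca.pem}
```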

We can track information like the following (a sample of the underlying Envoy statistics appears after this list):

  • Bytes in and bytes out of a service — this helps us in scaling services
  • Errors from individual services — this helps with debugging issues faster
    – 5xx errors
    – 4xx errors
    – Timeout errors
  • Number of requests between services
  • Request latencies between services
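
For instance, these standard per-cluster Envoy counters and histograms, as exposed on Envoy's /stats admin endpoint, back the signals above; the cluster name and values are made up for illustration.

```yaml
# Illustrative /stats output for one upstream cluster.
cluster.encryption-service.upstream_cx_rx_bytes_total: 48213344   # bytes in
cluster.encryption-service.upstream_cx_tx_bytes_total: 9127435    # bytes out
cluster.encryption-service.upstream_rq_total: 52841               # request count
cluster.encryption-service.upstream_rq_5xx: 12                    # server errors
cluster.encryption-service.upstream_rq_4xx: 85                    # client errors
cluster.encryption-service.upstream_rq_timeout: 3                 # timeouts
cluster.encryption-service.upstream_rq_time: "P50(2.1) P99(14.0)" # latency histogram
```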

Static Control Plane

Unlike other Service Mesh solutions, the SmartStack-based Service Mesh at Box doesn't have a dedicated control plane. The responsibility of a control plane is to manage the Service Mesh components and propagate the appropriate mesh configuration to them. We achieve similar functionality with source control systems at Box.

The mesh configuration on Virtual Machines is maintained using Puppet, while the mesh configuration for Kubernetes is maintained in our GitHub Enterprise repository. Kube Applier, an open source component from Box, is deployed on each Kubernetes cluster and makes sure the Kubernetes manifests in the repository are applied to the cluster. Sometimes developers change the mesh configuration of a service on individual Virtual Machines or on a Kubernetes cluster; drift can also come from faulty nodes or processes. Puppet agents and Kube Applier periodically check the configuration on local instances, and if there is any mismatch with the master repos, the configuration is synchronized from the master.
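
As a purely hypothetical illustration of this pattern, mesh configuration for a service could live in the repo as a manifest that Kube Applier applies on its next run, overwriting any out-of-band edit. The name, namespace, and embedded config are invented for this sketch.

```yaml
# Hypothetical: mesh configuration checked into the GitHub Enterprise repo
# as a ConfigMap; Kube Applier periodically applies the repo to the cluster,
# so in-cluster edits to this object are reverted on the next run.
apiVersion: v1
kind: ConfigMap
metadata:
  name: upload-service-mesh-config   # illustrative name
  namespace: upload                  # illustrative namespace
data:
  synapse.conf.yaml: |
    services:
      encryption-service:
        discovery:
          method: zookeeper
          path: /smartstack/services/encryption-service
          hosts:
            - zk1.example.com:2181
```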

Pain Points with SmartStack-based Service Mesh

As Box continues to grow and add more features, our internal infrastructure is also growing. The load on our infrastructure has increased due to more services and more traffic with growth in business. With this, we’ve started seeing some issues with our Service Mesh Solution.

Here are a few of them:

  1. Synapse, Nerve, and Envoy run next to services, so as services scaled, the load on those components also increased and we had to give them more resources. Scaling microservices horizontally means adding service instances, which put more load on Synapse because it had to discover all of those instances. At some point, Synapse became difficult to scale.
  2. Synapse and Nerve were forked locally to add more features, and it has become hard to rebase onto the upstream code. There is also no longer any active development on those open source projects.
  3. Onboarding a new service to the SmartStack-based Service Mesh requires deploying the Service Mesh components alongside it, which is cumbersome. To enable mTLS between services, the required certificates are created using our internal PKI service; as services and their dependencies grew, it became hard to create and maintain those certificates.
  4. The Service Mesh configuration is maintained in two repos (Puppet and GitHub), depending on where services are deployed, and it has become difficult to keep the configuration consistent between them.
  5. The mesh configuration has become complex over time. If we want to roll out a common configuration change to all services, it takes more than 3 weeks. The slow rollout ensures that live traffic is minimally affected while the changes are applied, but the canary rollout process has become complex.
  6. When we need to upgrade any of the Service Mesh components, it takes more than 3 weeks to upgrade them across the entire mesh. If a new version of a component includes a complex change that requires restarting the process, it may take more than 4 weeks to roll out. For example, when we upgraded the Envoy configuration from V2 to V3, it took more than 8 weeks to roll out across the mesh.
  7. In Kubernetes, any small change to the Service Mesh configuration of a service requires its pods to be bounced. Some services are sensitive to restarts, which forces us to exclude them from those changes and makes it difficult to modify their Service Mesh configuration. As a result, we have multiple versions of the mesh configuration in the repo.
  8. Services are deployed on multiple data centers for better availability. We need to enable Service Mesh so that services can be discovered and connected across data centers. This configuration varies from service to service and adds more complexity to the Service Mesh configuration.

Considering the above pain points, we started exploring alternate Service Mesh solutions.

NextGen Service Mesh at Box

Box runs on heterogeneous infrastructure, with services deployed on Virtual Machines and Kubernetes across multiple data centers and the public cloud. When we started exploring a potential Service Mesh solution, we needed something that supported our current infrastructure and also let us migrate seamlessly from the SmartStack-based Service Mesh.

These factors were considered while deciding our NextGen Service Mesh solution:

  • Where application services are hosted
    – Kubernetes or any containerized orchestration platform
    – Virtual Machines
    – On-prem or public cloud
  • Extensibility
    – Public Cloud integration
    – Multi-cluster mesh
    – Mesh expansion to VMs and containers
  • Resiliency features
    – Circuit breaking
    – Retries and timeouts
    – Fault injection
  • Security Features

The comparison site https://servicemesh.es/ helped us in deciding on the right Service Mesh solution. After multiple iterations of research, we selected Istio to build our NextGen Service Mesh at Box. We're already seeing a few interesting patterns, benefits, and complexities with the migration. Stay tuned for more details in the next series of blogs.

Interested in learning more about Box? We are hiring. Check out our careers page!
