A Tiny Slice of Heaven
Building Better Microservices Development Tools
Over the past few years, microservices architectures have moved out of the shadows to become an increasingly mainstream way of developing complex web applications. But while developers are building a consensus around microservices, several peripheral questions remain under debate. For example, once you build your microservices architecture, how should individual developers build and run all the microservices while they are developing features?
Currently, there are three widely used approaches:
- Run all of the containers locally on the developer's laptop, using a container orchestration platform installed on the system, such as minikube or docker-compose.
- Set up prefabricated environments hosted in the cloud that developers can reserve when they want to run the entire system.
- Set up a hybrid solution, where some of the microservices run locally (usually the service you are actively developing) and the rest run in a cloud environment.
Bread has been using microservices for over 6 years, and during that time we have experimented with a number of different tools and approaches to enabling local development. Recently, our development team designed, built and deployed an entirely new microservices architecture, and we took the opportunity to rebuild our development environment and tooling as well. In this post, I will briefly summarize the previous systems we used to solve this problem, and then present our latest environment, code-named Slice. After showing our latest work, I will discuss what we learned from the Slice project.
A Brief History of Prior Tools
Our original development environment, code-named Breadbox, ran all the containers locally on the developer's machine using docker-compose. This worked well initially, but the Docker environment relied on locally defined configuration, leading to finicky behavior. Any change in the configuration could result in cascading container errors that were difficult to debug. The setup was also very CPU-intensive: running the microservices reliably spun up the fans on our MacBook Pros and caused the laptops to heat up.
Our second iteration, code-named Bakery, embraced a hybrid local-and-remote approach using Kubernetes. A Slack bot application deployed the services on a remote development Kubernetes cluster. After the environment was created, core services were port-forwarded to the local machine. With Bakery, we no longer needed to run everything locally; we ran only the services we wanted to develop while connecting to the port-forwarded core services like the message bus and the database. This solved our performance issues, but the developer experience was far from optimized. The approach required moving between a variety of different tools and manually typing lots of commands to set up the environments and port forwards. The process would start in Slack, then switch to the terminal to set up port forwards, and then we would need to use the Kubernetes Dashboard to tweak the system further.
Introducing Slice
Slice is our latest iteration of a microservices development environment, and it has been in use for about 6 months now. It uses the hybrid approach described above, where some containers run locally and some run remotely. Slice is an internally built CLI tool that allows a developer to create and delete a remote environment, as well as run a local version of a service and expose it to the remote cluster. From a developer's perspective, using Slice is very straightforward: you only need to run two commands to create a remote environment. To replace a container on the remote cluster, you simply navigate to the service directory and run a few commands to build a Docker container and swap it in for the remote version.
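To make that workflow concrete, here is a minimal sketch of the developer-facing steps, written as a Python script that shells out to the CLI. The verbs, flags, and service names shown (`up`, `build`, `swap`, `--services`, `loans`) are illustrative assumptions, not Slice's actual interface.

```python
import subprocess

def sh(*cmd, cwd=None):
    """Echo and run a command, raising if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True, cwd=cwd)

# Create a remote environment (the "two commands"); verbs are hypothetical.
sh("slice", "login")
sh("slice", "up", "--services", "all")

# From the service's directory, build the local code into a Docker image
# and swap it in for the remote copy of that service.
sh("slice", "build", cwd="services/loans")
sh("slice", "swap", "loans", cwd="services/loans")
```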
Slice Architecture
The diagram below shows the architecture of our Slice system and, by extension, our microservices setup. Our microservices run on a Kubernetes cluster that uses Istio and Envoy as a service mesh. The Slice tool combines kubectl, Telepresence and our CI/CD tooling, and goes through the following steps under the hood to create a development environment.
- When a user requests a new development environment, Slice uses kubectl to provision one on our remote systems.
- When the developer specifies which group of services they want to run on the cluster, the tool deploys all of the selected services to the remote Kubernetes cluster.
- When the developer asks to add a locally built service to the cluster, the Slice tool performs a number of steps (sketched in code after this list):
- It compiles the local code and creates a Docker image running the local application.
- It replaces the remote version of the selected service with a Telepresence proxy.
- It creates a Telepresence proxy client Docker image on the developer’s computer.
- It proxies the locally built Docker image through the Telepresence system into the remote cluster.
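Here is a rough sketch of what those steps could look like if scripted by hand with the same underlying tools. The namespace layout, manifest path, image tag, and Telepresence flags (v1-style `--swap-deployment`/`--docker-run`) are assumptions for illustration; Bread runs a forked Telepresence, so the real invocations differ.

```python
import subprocess

def sh(*cmd, cwd=None):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True, cwd=cwd)

def create_environment(env: str) -> None:
    """Provision a namespaced environment and deploy the selected services."""
    sh("kubectl", "create", "namespace", env)
    sh("kubectl", "apply", "-f", "manifests/", "--namespace", env)

def swap_local_service(env: str, service: str, service_dir: str) -> None:
    """Replace a remote service with a locally built container."""
    image = f"{service}:slice-dev"
    # Build the local code into a Docker image.
    sh("docker", "build", "-t", image, ".", cwd=service_dir)
    # Swap the remote Deployment for a Telepresence proxy, then run the
    # local image as a proxy client routed into the cluster.
    sh("telepresence", "--swap-deployment", service,
       "--namespace", env, "--docker-run", "--rm", image)

create_environment("slice-alice")  # hypothetical environment name
swap_local_service("slice-alice", "loans", "services/loans")
```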
Insights and Design Choices
One of the core design choices we made when building Slice was to build a hybrid development environment where some services run locally and some run remotely. Our positive experience with Bakery, our old hybrid system, combined with the past industry experience of our engineers, led us to select this approach over the alternatives. Our principal architect evaluated several tools to enable this workflow and selected Telepresence to provide the proxy connection.
However, the speed and convenience for developers come with a tradeoff: increased server costs. Our engineering organization usually runs about 50 different Slice environments at any one time, along with 3 or 4 other dedicated non-production environments, which translates to several thousand dollars in AWS costs per month.
Our second core design choice was to work hard to ensure that our development environments mimicked our production environments as closely as possible. In our past systems, we used Kubernetes in our development environment, but not in production. Now that we are using Kubernetes from development all the way to production, we have the opportunity to mirror the environments very closely.
This goal is part of the reason why we use Docker so extensively. Many hybrid systems, including Bakery, do not require that the local service run inside a Docker container. The default mode for Telepresence is to run on the user's machine, communicate with a local process there, and connect it to the remote Kubernetes cluster. However, we chose to put in the extra work required to build the local code inside Docker, and to run Telepresence inside Docker as well, in order to simulate the conditions of every other environment as closely as possible.
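The difference between the two modes looks roughly like this. The flags are Telepresence v1-style and the `loans` service name is made up; our forked setup wires up additional Docker networking on top of this.

```python
import subprocess

# Default mode: swap the remote Deployment and run the service as a plain
# local process on the developer's machine.
subprocess.run([
    "telepresence", "--swap-deployment", "loans",
    "--run", "python", "app.py",
], check=True)

# Slice's mode: run the locally built image in Docker, so the service sees
# the same containerized conditions as every other environment.
subprocess.run([
    "telepresence", "--swap-deployment", "loans",
    "--docker-run", "--rm", "loans:slice-dev",
], check=True)
```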
For example, our system uses authentication and logging services provided by our Istio service mesh. These would be difficult to provide, and would require a nonstandard implementation, if the local code ran on the user's machine rather than in Docker. If we make an HTTP request to the locally running service via localhost, the call will not pass through the service mesh, and the authentication code will not run. Instead, we need to make the call to the remote service and have it proxied through the Kubernetes cluster, which is much easier when the code is dockerized.
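As a hypothetical illustration (the service, namespace, port, and path are all made up), the difference is simply which hostname the request targets:

```python
import requests

# Bypasses the mesh entirely: no Istio authentication or logging runs.
requests.get("http://localhost:8080/loans/health")

# Standard Kubernetes in-cluster DNS: the request is proxied through the
# cluster and traverses the Envoy sidecars, just as it would in production.
requests.get("http://loans.slice-alice.svc.cluster.local/loans/health")
```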
Adding this Docker support proved to be the most technically challenging part of building out the new Slice system. While in theory Telepresence works inside of Docker out of the box, we found that we had to fork and modify the Telepresence code to get it working with our Docker setup. It was also challenging to set up all the proxying and connection infrastructure between the Docker containers running locally and the Telepresence proxy running on the cluster. Docker networking was the most time-consuming part of the project. The main engineer behind the initial launch of Slice said he completed the project in about 4 weeks, but “They were very long weeks, about 14 hours a day.”
The final design choice was to make the process as user-friendly as possible for developers. One way we facilitated that was to run the Slice tool itself inside a Docker container. This makes the upgrade process much easier for developers: instead of having to manage several dependencies manually, they run one upgrade command to pull the latest Docker image.
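A minimal sketch of such a wrapper, assuming a hypothetical internal registry and image name:

```python
import os
import subprocess

IMAGE = "registry.example.com/dev-tools/slice:latest"  # hypothetical image

def upgrade() -> None:
    # The single upgrade command: pull the newest image, and every
    # subsequent invocation of the tool picks it up automatically.
    subprocess.run(["docker", "pull", IMAGE], check=True)

def slice(*args: str) -> None:
    # Run the CLI from its container, mounting the current directory so
    # the tool can see the service source and Dockerfile.
    subprocess.run([
        "docker", "run", "--rm", "-it",
        "-v", f"{os.getcwd()}:/workspace", "-w", "/workspace",
        IMAGE, *args,
    ], check=True)
```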
Additionally, Slice has gained a number of functional upgrades to support more advanced developer workflows. One is a debug mode, in which local code added to the cluster can be run through your IDE, allowing you to use breakpoints while the service handles traffic. We have also made use of Helm charts that let us run any subset of the services. Currently, the most common use case is to run all the microservices at once, but we can support running just a subset, for example, only the services related to the loan servicing part of the business.
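With an umbrella Helm chart, enabling a subset can be as simple as toggling values. The chart path, release name, and value keys below are assumptions, not our actual chart layout:

```python
import subprocess

# Deploy only the loan-servicing group into a developer environment.
subprocess.run([
    "helm", "upgrade", "--install", "slice-alice", "./charts/platform",
    "--namespace", "slice-alice",
    "--set", "groups.loans.enabled=true",
    "--set", "groups.checkout.enabled=false",  # leave other groups off
], check=True)
```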
The Slice tool is a great example of our engineering team's dedication to experimenting and innovating to improve our systems and our developers' experience. Active development continues on Slice to add new features and functionality, and we are always open to new approaches and perspectives. Having personally used all three systems, I find it encouraging to see our solutions evolve and improve over time. Hopefully our approach, and the issues we encountered along the way, will spark ideas about features you can add to your own microservices development environment.