Monolith to Microservice Without Downtime — A Production Story
We at SalesLoft have a Ruby on Rails monolithic codebase that we drove development on until about two years ago. Secretly I love this monolith because it is fast and easy to add new features to. However, there are also a lot of disadvantages to a large central codebase, and they became apparent as the team and product area grew.
This post looks at the migration of a core product feature out of the monolith and into a self-contained microservice. I’m going to focus on the techniques that we used to ensure zero seconds of downtime for the feature as we rolled it out, and I’ll also discuss the timeline, advantages, and disadvantages of the approach.
I’ll first set the stage for the feature and then get into the good bits in the plan.
The feature that this post will focus on is our live feed section. This feature is front-and-center on our dashboard, as well as in a menu widget on every page. The feature was originally built as a collection of three RESTful API calls to our monolithic back end to assemble email clicks, views, replies, send failures, and a few more things into a single feed. The feed was kept up to date through Pusher.
The reason that we looked to refactor this feature in the first place is that we wanted to add a new live feed notification for our Live Website Tracking feature that had nothing to do with email events. Rather than adding yet another endpoint call and update, it seemed to make sense to pull this into a unified API. In addition to this feature, we have a goal of allowing our partners (who play a very important role in bringing our customers’ sales tools to the SalesLoft platform) to create live feed entries.
In one of our past posts, I discussed the new SalesLoft API methodology and why it’s beneficial for us and our customers. Since that post, we have created Elixir library bindings that allow us to create API endpoints that follow the same methodology that powers our public API. We have become big Elixir fans here, and so we knew that we wanted to write the new API endpoints in Elixir using this framework. In addition to the RESTful nature of the API, we wanted to build a WebSocket interface to replace the Pusher calls.
Planning for no downtime is critical when doing a refactor like this. A single feature is almost never worth downtime for users, and we felt this feature was no exception. In planning for no downtime, it is vital that the new and old code paths coexist in harmony. To verify this, we developed QA test plans that included toggling the new feature on and off and testing that both the new and old implementations worked. We use feature flags at SalesLoft to roll out new features without any interruption to our users.
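To make the toggle concrete, here is a minimal sketch of a per-user feature flag that chooses between the old and new live feed paths. The `FeatureFlags` class, the `:new_live_feed` flag name, and the `live_feed_source` helper are all hypothetical names invented for illustration; this is not our actual flag system.

```ruby
# Hypothetical in-memory flag store. A real system would persist flags
# and would likely use an existing library instead of this class.
class FeatureFlags
  def initialize
    @enabled = Hash.new { |hash, flag| hash[flag] = {} }
  end

  # Turns the flag on for a single user.
  def enable(flag, user_id)
    @enabled[flag][user_id] = true
  end

  # Turns the flag back off, returning the user to the old path.
  def disable(flag, user_id)
    @enabled[flag].delete(user_id)
  end

  def enabled?(flag, user_id)
    @enabled[flag].key?(user_id)
  end
end

# Chooses which live feed implementation serves a given user.
# Both symbols are illustrative labels for the two code paths.
def live_feed_source(flags, user_id)
  if flags.enabled?(:new_live_feed, user_id)
    :notifications_service # new microservice path
  else
    :monolith_endpoints    # original three-endpoint path
  end
end
```

Because the old path is the default, turning the flag off for a user is an instant rollback, which is the property the QA plan of toggling on and off is meant to exercise.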
Building the RESTful API portion of the new notifications service was pretty uneventful. It’s a standard RESTful endpoint that accepts live feed items that fit a certain format. We made a decision early on that the endpoint would be given IDs of associated records and load them via our public API (the same one you can use!). This keeps the microservice from serving our users stale copies of data that it does not own.
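The ID-only design can be sketched roughly as follows. The `FeedItem` fields, the `enrich` helper, and the `fetch_person` callable (a stand-in for a public API lookup) are assumptions made for this example, not the service's real schema or client.

```ruby
# Hypothetical feed item: it carries only the ID of a record owned
# elsewhere, never a copy of that record's fields.
FeedItem = Struct.new(:id, :event_type, :person_id, keyword_init: true)

# Enriches items by looking up each referenced record by ID.
# `fetch_person` stands in for a public API call: it takes an ID and
# returns a hash (or nil when the record is not found).
def enrich(items, fetch_person)
  cache = {} # avoids fetching the same record twice within one batch
  items.map do |item|
    person = cache.fetch(item.person_id) do
      cache[item.person_id] = fetch_person.call(item.person_id)
    end
    { id: item.id, event_type: item.event_type, person: person }
  end
end
```

Since the record is fetched rather than stored, a change made in the owning system shows up the next time the item is enriched.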
We built the WebSocket layer using Phoenix Channels and had great success. We ran into a few bugs along the way, of course, but were able to keep from affecting paying users because we only had it turned on for our internal teams; feature flags are critical for doing a slow rollout like this. Another important part of the WebSocket layer was to know who was connected so that we wouldn’t enrich our items unnecessarily. We achieved this by using the Phoenix.Tracker module, which allowed us to track who is connected to our WebSocket layer and make corresponding decisions.
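The idea of skipping enrichment for users who are not connected can be illustrated with a small language-neutral sketch, written here in Ruby. To be clear, this is not the Phoenix.Tracker API; `ConnectionTracker`, `deliver`, and the `enricher` and `pusher` callables are hypothetical names that only show the decision being made.

```ruby
# Hypothetical stand-in for a presence tracker: it counts open
# connections per user. This is NOT the Phoenix.Tracker API.
class ConnectionTracker
  def initialize
    @connections = Hash.new(0)
  end

  # Records one more open connection (for example, a new browser tab).
  def join(user_id)
    @connections[user_id] += 1
  end

  # Records a closed connection and forgets users with none left.
  def leave(user_id)
    return unless @connections.key?(user_id)

    @connections[user_id] -= 1
    @connections.delete(user_id) if @connections[user_id] <= 0
  end

  def connected?(user_id)
    @connections.key?(user_id)
  end
end

# Enriches and pushes an item only when the recipient is connected,
# so no lookup work is spent on users who would never see the result.
def deliver(item, user_id, tracker, enricher, pusher)
  return :skipped unless tracker.connected?(user_id)

  pusher.call(user_id, enricher.call(item))
  :pushed
end
```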
On the front end, we connect directly to the new microservice to power our live feed. This meant that the three requests that used to fire on every refresh were gone! It lowered the total number of requests to our monolith, which greatly increased its stability. Enriching the items through the API does still hit the monolith, but our new API endpoints are much more optimized, so we considered this a big win.
Once the microservice was stood up and accepting traffic, we had to find the seam for how our monolith would send data over to it. The monolith still owns our email event data, so it’s important that it can communicate with the microservice API. There are a few seams that we could have used:
- Diverge at the highest point in the system to have two independent flows
- Diverge before Pusher is invoked to share a majority of the flow
- Diverge after Pusher is invoked to have a sequential flow
The particular seam that we picked was very important because we could have introduced bugs if we placed the new code too high in the stack. We chose to trigger the new API call (via a background job) after the Pusher event was dispatched to our front end. This means that the user would only see the new events, once turned on for them, if the old Pusher event would have fired. The old and new systems were able to live in harmony with each other due to this, with no chance that the new code would impact the old.
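A rough sketch of that seam, under stated assumptions: `LiveFeedPublisher` is a hypothetical class, `pusher` is any callable standing in for the existing Pusher dispatch, and `job_queue` is anything that responds to `<<` standing in for the background job system. None of these are the monolith's real names.

```ruby
# Hypothetical publisher inside the monolith. `pusher` and `job_queue`
# are injected stand-ins, not real library objects.
class LiveFeedPublisher
  def initialize(pusher:, job_queue:)
    @pusher = pusher
    @job_queue = job_queue
  end

  def publish(user_id, event)
    # Old path: runs first and is left exactly as it was.
    @pusher.call(user_id, event)

    # New path: only reached once the old event has been dispatched.
    enqueue_for_service(user_id, event)
  end

  private

  # Hands the event to a background job that would call the new API.
  def enqueue_for_service(user_id, event)
    @job_queue << { user_id: user_id, event: event }
  rescue StandardError => e
    # A failure in the new path must never surface in the old one.
    warn "live feed enqueue failed: #{e.message}"
  end
end
```

Placing the new call last means that if the old dispatch raises, nothing is enqueued, and if the enqueue fails, the old dispatch has already happened. Rescuing the error in the new path is an extra safeguard added for this sketch rather than something described above.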
The timeline was honestly a bit difficult for me. While the initial work was something that I was able to prioritize during our engineering innovation days, some new work came in that was much more important. In total, we started development in January and finished the rollout by the summer, about six months later. I do feel like there were only about three weeks of actual work, but that work was spread out over a much longer period.
One advantage of this timeline, however, is that we were very confident in our notification system by the time we launched it. We had let it run in production, consuming production levels of traffic for over two months without an issue. We also shared this knowledge back to the community, so that others could learn about the techniques we used to ensure that our WebSocket system was rock-solid.
And now for the best part: the tl;dr of the post. These are some of the lessons that I learned from this process:
- Migrating from monolith to microservice is not easy. In fact, it involved a lot of nuance in the coding and planning that had to be a forethought and not an afterthought.
- Stability for customers is the most important part of a refactor like this; our product exists to serve our customers and not to be a microservice-powered entity.
- Finding the seam in the monolith is important when it comes to stability. If your new code, which shouldn’t be running until it’s ready, causes old requests to fail…that is a problem.
- The timeline was significantly longer than I anticipated. This is important because we have to prioritize refactoring old code against developing new features and fixing bugs. On the flip side…
- The slower timeline helped us ensure rock-solid stability. We had zero stability concerns on release day, which turned out to be like any other normal day.
Steve is a Staff Software Architect at SalesLoft and thoroughly enjoys bringing a more authentic sales experience to SalesLoft customers. He will be speaking at Lonestar Elixir on “Bringing Elixir to Production.”