Decoupling Microservices: Things We Have Learned Along the Way

Published in Xendit Engineering, Aug 9, 2021

In this post, I’ll share how we decoupled a piece of core logic from our invoice service into a separate service, and the lessons we learned along the way, so you don’t have to face the same challenges in your own journey.

Intro

In line with microservice principles, we want every service to be highly maintainable and testable, loosely coupled, independently deployable, and organized around business capabilities. That enables faster development and reduces dependencies across teams.

Our invoice service dealt with the creation and payment of hosted, recurring, and on-demand invoices. It also had the logic to handle merchant-based settings such as notification preferences, invoice customization, invoice duration, callback settings, etc. So we wanted to decouple the settings logic into a new microservice, the invoice settings service, to achieve the benefits mentioned above.

The reasons why we wanted to decouple this logic into a separate service were:

No clear separation of concerns

The existing invoice service also handled invoice settings, which rarely changed. They were general settings for merchants, so it was clear that they belonged in a different service: the invoice service should focus only on invoice creation and payment, and nothing more.

Mixed approach of the repository

The second reason was more about the approach taken in the existing service. The repository had no defined software design strategy. The folder structure and file organization were messy and scattered all over the place. So we thought it was the right time to create a proper repository with a mature design. In this case, we chose a domain-driven approach.

Low visibility of functions

Figure: the negative effect of not having visibility into our code’s usefulness.

We wanted to avoid the situation in the figure above. The problem was that we didn’t know which endpoints and functions were actively used, either by our own service or by other (downstream) services. So we had to audit them and deprecate the unused ones.
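One way to approach such an audit is to compare the routes a service registers against what its access logs show. The sketch below assumes a hypothetical access log record shape (`method`, `route`, `caller`); the function name and record fields are made up for illustration and do not describe the tooling actually used for the audit.

```typescript
// Hypothetical shape of one access log record; real logs will differ.
interface AccessLogEntry {
  method: string;
  route: string; // route template, e.g. "/invoice-settings/:id"
  caller: string; // name of the calling service
}

interface UsageReport {
  callersByRoute: Map<string, string[]>; // "METHOD route" -> sorted caller names
  unusedRoutes: string[]; // registered routes that never appear in the log
}

function auditRouteUsage(
  registeredRoutes: string[],
  log: AccessLogEntry[],
): UsageReport {
  // Collect the distinct callers seen for every "METHOD route" pair.
  const callers = new Map<string, Set<string>>();
  for (const entry of log) {
    const key = `${entry.method} ${entry.route}`;
    const seen = callers.get(key) ?? new Set<string>();
    seen.add(entry.caller);
    callers.set(key, seen);
  }

  const callersByRoute = new Map<string, string[]>();
  callers.forEach((seen, key) => {
    callersByRoute.set(key, Array.from(seen).sort());
  });

  // Anything registered but never called is a deprecation candidate.
  const unusedRoutes = registeredRoutes.filter((route) => !callers.has(route));
  return { callersByRoute, unusedRoutes };
}
```

A route that shows up in `unusedRoutes` is only a candidate for deprecation: the log window has to be long enough to cover rarely triggered flows before anything is removed.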

This effort aimed to benefit both developers (in terms of productivity) and the business (customers).

With that in mind, here’s what we were trying to accomplish with this decoupling.

Fewer issues & better reliability

The most important goals were fewer reported bugs and higher reliability. That would translate to fewer issues, higher uptime, and far fewer incidents caused by the invoice settings logic. I’ll talk more about other teams’ dependency on this logic in an upcoming section.

Better documentation using OAS & increased visibility

Better API documentation means less time spent explaining the code to those who use it. We use the OpenAPI Specification (OAS), defined in a YAML file, to generate a client library for other services to use. That way, they can read the documentation on a platform like Stoplight and move forward. It also increases the visibility of the endpoints used internally and externally.

Easier maintenance & increased developer velocity

We used a domain-driven approach that’s already battle-tested in various conditions. We’ve seen other teams in our company use it successfully, so we wanted to adopt it too. We also adopted our new infrastructure, which provides a better and faster deployment pipeline, so we could focus more on the application logic itself.

The preparation

Gathering the requirements

First, we gathered the details of the new service: the database, the runtime dependencies, the connections, the environment, etc. To validate the collected details, we decided to run a full regression test on the logic that would be moved to the new service.

At the same time, in the existing service, we isolated the logic by moving the related files into one directory so they would be easier to find and understand.

Creating a centralized document

Since many services across teams depend on this logic, we created a centralized document with a detailed project plan. The plan included a high-level flow of the invoice settings service, service architecture, and dependencies. The intention was to make the decoupling as smooth as possible for all.

Set up infrastructure

We also had to coordinate closely with the infrastructure team on provisioning the required instances, testing network connectivity, the deployment strategy, etc.

The process

Development

Developing the new service was tricky because the existing service was still in active development. Developers frequently touched the service, even when their changes didn’t directly affect the invoice settings logic. So we moved the files related to that logic to a dedicated folder to isolate them.

In parallel, we also started replicating all the features into the new service. The changes we planned to make were:

no interface / signature changes (request/response body stays the same) to avoid breaking changes
if there’s an endpoint change, it’s only an improvement to the URL. For example, we previously had PATCH invoice-settings/:id/update, which was quite verbose, so we changed it to PATCH invoice-settings/:id for simplicity
improve the current logic to re-queue the message in our queue runner if one of our upstream services is down
switch from JavaScript to TypeScript for stricter type-checking and fewer runtime errors
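To make the re-queue improvement more concrete, here is a minimal sketch of how a queue runner might decide what to do with a message after handling it. Everything in it is an assumption for illustration: the type names, `MAX_ATTEMPTS`, the exponential backoff, and the dead-letter fallback are not described in this post, which only states that messages are re-queued when an upstream service is down.

```typescript
// All names and limits below are hypothetical, chosen for illustration.
interface QueueMessage {
  id: string;
  attempts: number; // deliveries so far, including the current one
  payload: unknown;
}

// Outcome reported by whatever code processed the message.
type HandlerResult =
  | { kind: "ok" }
  | { kind: "upstream_down"; service: string }
  | { kind: "invalid"; reason: string };

// What the queue runner should do with the message next.
type Disposition =
  | { action: "ack" }
  | { action: "requeue"; delayMs: number }
  | { action: "dead_letter"; reason: string };

const MAX_ATTEMPTS = 5; // assumed retry budget
const BASE_DELAY_MS = 1000; // assumed delay before the first retry

function decideDisposition(msg: QueueMessage, result: HandlerResult): Disposition {
  switch (result.kind) {
    case "ok":
      return { action: "ack" };
    case "invalid":
      // Retrying a malformed message cannot succeed, so park it instead.
      return { action: "dead_letter", reason: result.reason };
    case "upstream_down":
      if (msg.attempts >= MAX_ATTEMPTS) {
        return {
          action: "dead_letter",
          reason: `${result.service} still unavailable after ${msg.attempts} attempts`,
        };
      }
      // Exponential backoff: 1s after the 1st attempt, 2s after the 2nd, 4s after the 3rd.
      return { action: "requeue", delayMs: BASE_DELAY_MS * 2 ** (msg.attempts - 1) };
  }
}
```

Keeping the decision in a pure function like this makes the retry policy easy to unit test without a real queue. The discriminated unions are also a small example of the stricter type-checking that the move to TypeScript brings: the compiler flags any `HandlerResult` case the switch forgets to handle.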

Migration

The most challenging part of the decoupling was the migration. The logic we moved already had many downstream services (meaning many services already called it in its previous state). Those downstream services were also spread across teams, with different implementations and development processes.

Hence, we divided the migration into three phases:

Phase 1: Proxy request from current service to the new service

In this phase, we forwarded requests that came to the old service on to the new service. We took this approach to keep full control over the risk: if something went wrong on the new service, we could swap the environment on the old service to revert the change immediately, without making any changes to the downstream services. We did this gradually, one module at a time, and ran a regression test after each change.
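As a rough illustration of this kind of per-module switch, the sketch below routes a request to the old or new service based on a comma-separated list of module names. The `PROXIED_MODULES`-style value, the module names, and the function names are all hypothetical; the post does not show how the environment toggle was actually implemented.

```typescript
// Hypothetical per-module toggle; names and values are illustrative only.
type Target = "old-service" | "new-service";

// Parse a comma-separated list such as
// "notification-preferences,callback-settings" (e.g. read from an env var).
function parseProxiedModules(raw: string | undefined): Set<string> {
  return new Set(
    (raw ?? "")
      .split(",")
      .map((name) => name.trim())
      .filter((name) => name.length > 0),
  );
}

// Modules on the list are forwarded to the new service; everything else
// keeps running on the old code path.
function resolveTarget(moduleName: string, proxied: Set<string>): Target {
  return proxied.has(moduleName) ? "new-service" : "old-service";
}
```

With a switch like this, moving one more module is a configuration change, and rolling back means removing that module from the list, which matches the goal of not touching downstream services during phase 1.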

Phase 2: Changes on downstream service to use new service

This phase was the most time-consuming because we had over 11 downstream services across 6 different teams. We had to coordinate closely on PR review, testing, and deployment. The other challenge was that their implementations were all different (no standardization). So in this effort, we standardized the implementation for those services across languages to improve maintainability.

Phase 3: Clean up unused logic on old service

The last phase was to make the old service a downstream service of the new service, clean up the unused logic, and remove the request proxy. This assumed that all downstream services were already calling the new service and that there had been no disruption in the process.

We also decided to use a shared DB for both the existing and the new service to minimize the risk of migrating data. In addition, the new invoice settings service was only responsible for a relatively small scope, so we wanted to keep the infra cost low while maintaining clear data ownership between the two services. As such, each service only has access to the collections/tables it owns. In the future, we plan to completely separate the DBs so the services are truly independent of each other.

Lessons learned

This decoupling was far from easy. It was pretty overwhelming, as this was our first initiative of this kind. We learned a few lessons that are worth sharing.

Know your dependencies & maintain backward compatibility

When we were in phase 1 of the migration (proxying requests from the old service to the new service), one of our downstream dependencies hit an unexpected HTTP 400 (Bad Request). The error was caused by an accidental change in the new service, which implemented stricter request body checking on our PATCH endpoint. The downstream service used the GET endpoint to fetch all the data for each record and then performed a PATCH with the whole body, which the old service allowed. It was an unintended change that we missed during development and testing. We should have known that one of our dependencies used the PATCH endpoint this way, and we should have kept that behavior.
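The difference between the two behaviors can be sketched as follows. The field names and the allow-list are hypothetical (the real invoice settings schema is not shown in this post), and the lenient variant assumes the old service simply ignored fields it did not update, which is one plausible reading of "allowed".

```typescript
// Hypothetical updatable fields; the real schema is not part of this post.
const UPDATABLE_FIELDS = [
  "invoiceDurationSeconds",
  "callbackUrl",
  "notificationChannels",
] as const;

type UpdatableField = (typeof UPDATABLE_FIELDS)[number];
type SettingsPatch = Partial<Record<UpdatableField, unknown>>;

type ValidationResult =
  | { status: 200; patch: SettingsPatch }
  | { status: 400; unknownFields: string[] };

function isUpdatable(key: string): key is UpdatableField {
  return (UPDATABLE_FIELDS as readonly string[]).indexOf(key) !== -1;
}

// Copy only the allow-listed fields that are present in the request body.
function pickUpdatable(body: Record<string, unknown>): SettingsPatch {
  const patch: SettingsPatch = {};
  for (const field of UPDATABLE_FIELDS) {
    if (field in body) patch[field] = body[field];
  }
  return patch;
}

// Strict: any field outside the allow-list results in 400 Bad Request.
function validateStrict(body: Record<string, unknown>): ValidationResult {
  const unknownFields = Object.keys(body).filter((key) => !isUpdatable(key));
  if (unknownFields.length > 0) return { status: 400, unknownFields };
  return { status: 200, patch: pickUpdatable(body) };
}

// Lenient (assumed old behavior): extra fields such as `id` or timestamps
// are silently ignored, so a GET-then-PATCH round trip keeps working.
function validateLenient(body: Record<string, unknown>): ValidationResult {
  return { status: 200, patch: pickUpdatable(body) };
}
```

A client that sends back the full GET response, including read-only fields like an identifier or a creation timestamp, passes `validateLenient` but gets a 400 from `validateStrict`. A contract test that replays real downstream payloads against both services would surface this kind of difference before rollout.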

Plan down to the details

The second lesson was to plan down to the exact details when making changes to downstream services. When we changed the implementation of one of the dependencies, we realized that we could only test it after deployment. We had unnecessarily added a function wrapping the method that calls the new service, because that was how the older implementation did it, and this caused an issue. It turned out the new service client didn’t have such an additional wrapper function (because we also used a different method to generate service clients in the new service).

Therefore, the lesson here was that we need to be entirely sure about the changes we’re about to make. A review from our peers and from the downstream service team would also have been very helpful in preventing the issue, as they had more context.

Buffer time for unexpected things

The last lesson was to improve team coordination. Cross-team collaboration played a massive part in this effort. We didn’t put enough thought into preparation and planned resources based only on the high-level project details. As a result, the project was delayed by days beyond the initial timeline.

To prevent this from happening again, we should always reserve a certain percentage of our capacity as buffer time for unexpected things. That way, we can deliver on target, or as we like to say:

Under-promise, over-deliver.

Conclusion

Decoupling microservices is complex. We should first think about why we need to do it at all and what it offers compared with doing other things. Once we have figured that out, we still need time to prepare, coordinate, and eventually get the job done without sacrificing user experience.