Conquering the Microservices Dependency Hell at Postman, with Postman (Part 1 — Introduction)
At Postman, we have been scaling our teams to meet the requirements of building and managing services for more than 5 million developers worldwide. Postman has grown rapidly as a product and as an organisation since the company was founded in 2014. Today Postman is a complete API Development Environment with tools for every stage of the API Lifecycle. Our efforts have been focused on helping teams build better software together.
In this series of articles, I’ll talk about how we chose a microservices architecture when we started building a backend system to complement our rather popular app. This introductory piece sets the stage by describing the technical and organisational challenges we faced.
Early 2018: Workspaces release
We shipped Workspaces earlier this year with Postman 6.0. Workspaces form the base of collaboration in Postman: they are the logical blocks that hold collections, environments, monitors, mock servers and integrations that can be shared with others. Shipping Workspaces was a big achievement from an organisational perspective. We managed to release it in a record time of three months. It was a feature that touched all of our products, as well as the roughly 20 microservices we ran internally at the time. It was a mammoth effort.
During the development of Workspaces, we started making heavy use of consumer-driven contracts, which I will talk about in detail in the next installment of this series. Along with that, we used Mock Servers to parallelize development, which helped our teams make progress without blocking each other.
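To give a flavour of the idea before the deep dive, here is a minimal sketch of a consumer-driven contract check in plain Node.js. The contract and field names (`workspaceContract`, `members`, and so on) are hypothetical, not our actual schema, and real setups typically use a dedicated tool such as Pact rather than hand-rolled checks:

```javascript
// Consumer-driven contract, sketched: the CONSUMER declares the response
// fields it relies on, and the PROVIDER verifies its responses still
// satisfy that declaration before deploying.
// All names here are illustrative, not Postman's actual schema.

const workspaceContract = {
  id: 'string',
  name: 'string',
  members: 'object', // e.g. an array of member ids
};

// Returns true if every field the consumer declared is present
// with the expected type. Extra provider fields are ignored.
function satisfiesContract(response, contract) {
  return Object.entries(contract).every(
    ([field, type]) => typeof response[field] === type
  );
}

// Provider-side check: the response may add fields freely,
// but must keep every field the consumer depends on.
const providerResponse = {
  id: 'ws-123',
  name: 'Team Workspace',
  members: ['u1', 'u2'],
  createdAt: '2018-02-01', // extra field, allowed
};

console.log(satisfiesContract(providerResponse, workspaceContract)); // true
```

The key property is directionality: the consumer, not the provider, owns the contract, so the provider can evolve its responses freely as long as the declared fields survive.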
It was amazing to see the whole team come together, so passionate about and invested in the product. The Postman community responded warmly to Workspaces, and the constructive feedback we received on our community forum and GitHub issue queue helped us continuously improve the feature.
Though we were all thrilled with the adrenaline rush of delivering Workspaces to production, the push took its toll. The early team responsible for driving the product direction, which includes me, spent a good amount of time planning and executing every small detail of this feature. We wanted to make sure that we did not leave any part of the workflow undecided; it had to be as perfect as possible. Our engineers and designers worked through the holidays to push for the release. While things looked good on the product front, the resulting overwork left a reasonable amount of frustration behind.
In retrospect, this affected us both technically and organisationally.
There was strong coupling between the services that we built. We had some core services on which other services depended heavily; any downtime in one of these core services could bring down the entire infrastructure.
We had to take extra precautions to ensure these services had as much uptime as possible. This meant 3 AM fixes to the infrastructure and weekend struggles to revive broken services. We could not deploy services independently: all the services with dependencies had to be tested together on beta, and if the tests passed, deployed together to production. Addressing issues often involved applying fixes across more than one service.
The coupling also affected the roadmaps of these services. Roadmap items had intertwined dependencies across services, tying the evolution of one service to the progress of the ones it depended upon. We had database schemas that were shared across services, making it difficult to build something in one service without impacting the services that shared its data. Abstractions were not always clearly defined and leaked across service boundaries.
Though we ensured that our internal APIs had similar structures, inconsistencies still crept in. Without a proper guideline in place, developers followed their own styles to define APIs, leading to minor but nagging differences in request and response data structures.
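As a hypothetical illustration of the kind of drift this causes (the endpoints and shapes below are made up, not our actual APIs): one service might return a bare snake_case object while another wraps a camelCase object in a `data` envelope, forcing every consumer to carry normalization glue like this:

```javascript
// Two hypothetical internal services returning the same entity in
// different "house styles" — the drift that creeps in without a guideline.

const fromServiceA = { user_id: 42, display_name: 'Ada' };         // snake_case, bare object
const fromServiceB = { data: { userId: 42, displayName: 'Ada' } }; // camelCase, `data` envelope

// snake_case -> camelCase for a single key.
function toCamel(key) {
  return key.replace(/_([a-z])/g, (_, c) => c.toUpperCase());
}

// Consumer-side glue: unwrap the envelope if present, then
// normalize all keys to one shape.
function normalize(response) {
  const body = response.data ?? response;
  return Object.fromEntries(
    Object.entries(body).map(([k, v]) => [toCamel(k), v])
  );
}

console.log(normalize(fromServiceA)); // { userId: 42, displayName: 'Ada' }
console.log(normalize(fromServiceB)); // { userId: 42, displayName: 'Ada' }
```

The glue works, but it has to be written, tested and maintained in every consumer — which is exactly the nagging cost a shared API guideline removes.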
One of the main reasons for this was the lack of clearly defined service boundaries. Our dev team would spin up new services arbitrarily whenever business requirements were added or changed. On one hand, this led to more new services than we probably needed. On the other, a developer familiar with a specific code repository would add core functionality directly to an existing service when it should have been a separate service entirely.
The organisational challenges mirrored the technical ones. Our early hires and the founding team, roughly 20% of the total team size, would end up doing about 80% of the work.
This was simply by virtue of them having the context of how the team worked and how the systems were built; the knowledge base was in their heads. Newer hires did not have full context on how all the systems fit together, and would either focus on the non-core elements or take a long ramp-up time before contributing effectively.
This led us to set misleading expectations for candidates when it came to hiring. Cost aside, we struggled to find candidates who could join such a team and push microservices to production. We expected them to create new services, manage existing ones and improve the velocity at which those services were deployed, without realising that anyone coming in was unlikely to have that breadth of experience along with a deep understanding of how Postman works. As a result, we were not able to add a single new member to our backend team in the second half of 2017, even though we were actively hiring.
Development cycles kept getting slower around that time, owing to these gaps in requirements, team capability and organisational structure. The reason we managed to pull off the Workspaces release was that it was a concerted, top-down, organisation-wide initiative, helped along by a committed team passionate about delivering a high-quality product. Yet developers had to put in extra effort in subsequent release cycles to catch bugs introduced in previous ones. Our teams continued to work in the same ad-hoc manner as before, which slowed down releases and reduced code quality.
This situation may be familiar to you if you have built software in lean environments. We hear a lot about the challenges startups in the tech industry face when making architectural decisions about their products. We found that we had to change our internal technical practices as well as rethink our organisation structure to mitigate these issues.
Over the next articles in this series, I’ll describe the microservices “dependency hell” we found ourselves in at the beginning of the year, and the steps we took to claw our way out of it.