Conquering the Microservices Dependency Hell at Postman, with Postman — (Part 3)
In this series (Part 1, Part 2), I have so far talked about the challenges we faced while building a microservices stack for Postman and how we put processes in place to overcome them. In this third and final part of the series, I will share how we evolved our practices into a growth framework for developers and a maturity model for APIs, which now act as guidelines for how we work as a team.
Adopting principles for growth and maturity
Building towards a maturity model and adopting a growth framework required us to put some key principles of microservices architecture into practice.
Pursuit of automation
Automation is one of the cornerstones of engineering at Postman. Postman itself was built to automate the tasks around API development. As we take it further, we have kept a constant emphasis on building an engineering culture centered around automation. We encourage new hires to build a collection and demo it in our demo days. A large part of our security operations and devops use collections, often mounted on Monitors. Postman largely runs on Postman.
Manual provisioning and manual deployments work at a small scale. If you are starting up, or have a small team managing a reasonably small number of software applications, you will not face many problems. But this does not scale, especially when you are serving over 5 million developers globally, spread across more than 100K organizations.
Growing from there requires some investment and planning. It can be difficult to predict the places where a new organization needs to automate. These points became clearer as we grew. With all of the background that I have shared in the last two articles in this series, we worked towards building tooling that will help us practice our automation goals. We divided the tooling into three groups:
- Infrastructure automation
- Testing automation
- Continuous delivery
Microservice as an architectural pattern highlights autonomy. Autonomy in this context stands for two major ideas:
- Freedom: Giving people as much flexibility as possible to do the job at hand.
- Self-service: Empowering teams and individuals to do their job on their own without waiting for someone else to do it for them.
Consider this: if I have to provision a machine to deploy my service, do I need to raise a ticket and wait for someone else to do it for me, or do I have the right access and tools to do it myself? Having the right infrastructure in place ensures teams can operate without friction and manage the lifecycle of a service on their own.
Building a low-friction infrastructure requires keeping the plumbing simple, but the endpoints smart. Services require a messaging platform to orchestrate and exchange data. Such a messaging platform should focus on plumbing data and not start making smart decisions. All the smarts should be present in the services, not outside them.
In this way, developers interact with other services in the infrastructure through their APIs, with each service making its own decisions on how best to do its job. The transport/plumbing layers simply focus on making sure the connectivity exists as required.
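As a rough illustration of "simple plumbing, smart endpoints", here is a minimal in-memory sketch. The `MessageBus` and `BillingService` classes, the `team.created` topic and the message fields are all hypothetical names made up for this example; they do not describe Postman's actual messaging platform.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class MessageBus:
    """Plumbing only: delivers each message to the subscribers of its topic."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        # No filtering, enrichment or business rules happen in the pipe.
        for handler in self._subscribers[topic]:
            handler(message)


class BillingService:
    """Hypothetical service: the decision logic lives in the endpoint."""

    def __init__(self) -> None:
        self.invoiced: List[str] = []

    def on_team_created(self, message: dict) -> None:
        # The service, not the bus, decides whether the event matters to it.
        if message.get("plan") != "free":
            self.invoiced.append(message["team_id"])


bus = MessageBus()
billing = BillingService()
bus.subscribe("team.created", billing.on_team_created)
bus.publish("team.created", {"team_id": "t-1", "plan": "free"})
bus.publish("team.created", {"team_id": "t-2", "plan": "pro"})
print(billing.invoiced)  # ['t-2']
```

The bus could be swapped for any transport without touching the billing rules, which is the point of keeping the smarts in the services.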
With more services comes more complexity, and with more complexity come more chances of failure. As the list of services grows, so do the number of machines and the number of network connections. The surface area for failures grows significantly. Machines can fail, and so can the network. If a system is not designed with these contingencies in mind, it will invariably see more failures as the infrastructure grows.
Isolating failures across microservices becomes increasingly important. To catch these failures in time, we decided to set up a distributed tracing system (DTS). The key points we focused on for this were:
- There should be debug calls through the entire stack trace, so that developers can figure out where in the chain of services an error occurred.
- IDs should flow downstream to all microservices. This helps keep track of resources and entities without adding ambiguity as a resource flows downstream and eventually out to the world.
- Keep all databases separate. No database should be shared across services. The data persistence boundaries should not spill across services. Instead, all access to data should be exposed through APIs.
These points are non-trivial to add into an architecture later on. They have to be planned and implemented upfront. Well-documented architectures with a live view into the state of the system can be game changers.
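To make the idea of IDs flowing downstream concrete, here is a minimal sketch of trace-ID propagation between two services. The header name `X-Trace-Id`, the `with_trace` helper and the two service functions are assumptions made for illustration; the article does not describe the actual tracing implementation.

```python
import uuid
from typing import Dict, List, Optional, Tuple

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name

# Stand-in for a shared logging backend: (service name, trace ID) pairs.
log: List[Tuple[str, str]] = []


def with_trace(headers: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the headers, reusing an inbound trace ID or minting one."""
    out = dict(headers or {})
    out.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    return out


def inventory_service(headers: Dict[str, str]) -> dict:
    headers = with_trace(headers)
    log.append(("inventory", headers[TRACE_HEADER]))
    return {"in_stock": True}


def order_service(headers: Dict[str, str]) -> dict:
    headers = with_trace(headers)
    log.append(("order", headers[TRACE_HEADER]))
    # Forward the same headers so the downstream call shares the trace ID.
    return inventory_service(headers)


order_service({})
for service, trace_id in log:
    print(service, trace_id)
```

Because both log entries carry the same ID, a developer can follow one request across the chain of services and see where it failed.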
How Postman functions today
With all the context set so far, let us look at how we work together. I have previously mentioned that we created high-level teams to focus on different functions of the Postman ecosystem. Here is how they look at the end of 2018.
Under the new work-in-progress organization structure, we decided to bucket responsibilities for implementing the Principles of Microservices Design under the Services and Platforms teams as follows:
This led us to set the service expectations from these two broad teams.
Service expectations from Platform
The Platform team is responsible for recommending and enabling systems on which the rest of Postman is built and run. The systems that Platform builds need to allow others to deploy application code across redundant instances. These need to be rolling deployments with support for application restarts. There can be multiple versions of an API live at a given time. Platform achieves these goals today with a mix of automated and semi-automated processes.
Platform then needs to monitor the uptime, response time, performance metrics and error rates of the services that run on the infrastructure they have built. In case of any anomaly, the systems built by Platform alert the service owners. The anomaly can be downtime, slow response times or error rates beyond acceptable thresholds.
All of these actions are logged. Any event is logged and sent to a common logging infrastructure. Any error is sent to a common error-reporting infrastructure. Analytics events are sent to a common analytics infrastructure. These logs provide live and historical views of service health and maturity and are priceless when it comes to debugging issues and identifying bottlenecks that can be improved.
Platform harnesses this information and publishes the build status and live health status of the whole system. As an additional dimension, we get code quality metrics for each service through lints and unit tests.
Beyond these, Platform handles provisioning and manages AWS and other resources in a self-serve way. Platform is also responsible for monitoring resource metrics and knowing how to react to any abnormalities. They publish cost metrics that result from our usage of AWS and other resources.
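The alerting step described earlier in this section, where service owners are notified about downtime, slow response times or high error rates beyond acceptable thresholds, could be sketched roughly as below. The `HealthSample` and `Thresholds` types, the p95 metric and the threshold values are illustrative assumptions, not Postman's real monitoring configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HealthSample:
    up: bool
    p95_response_ms: float
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


@dataclass
class Thresholds:
    max_p95_response_ms: float
    max_error_rate: float


def find_anomalies(sample: HealthSample, limits: Thresholds) -> List[str]:
    """Return the reasons, if any, for alerting the service owner."""
    reasons: List[str] = []
    if not sample.up:
        reasons.append("downtime")
    if sample.p95_response_ms > limits.max_p95_response_ms:
        reasons.append("slow response times")
    if sample.error_rate > limits.max_error_rate:
        reasons.append("high error rate")
    return reasons


# Hypothetical limits for one service.
limits = Thresholds(max_p95_response_ms=800, max_error_rate=0.01)
sample = HealthSample(up=True, p95_response_ms=1200, error_rate=0.002)
print(find_anomalies(sample, limits))  # ['slow response times']
```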
Service expectations from consumers
All consumers in the Postman infrastructure are expected to conform to a few set guidelines to ensure standard practices:
- All services should conform to a cross-service tracing standard while making API calls to other services.
- They all make themselves discoverable through a common mechanism to other services.
- They build contract tests for the services they are consuming.
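As an example of the last guideline, a consumer-side contract test can be as small as a check that a provider's response still contains the fields the consumer relies on. The sketch below is a generic illustration in Python; the `USER_CONTRACT` fields and the sample responses are hypothetical, and in practice such checks would typically live in Postman collection tests rather than a standalone script.

```python
from typing import Any, Dict, List

# Hypothetical contract: the fields, and their types, that this consumer depends on.
USER_CONTRACT: Dict[str, type] = {"id": str, "email": str, "active": bool}


def contract_violations(response: Dict[str, Any], contract: Dict[str, type]) -> List[str]:
    """List every way the response breaks the consumer's expectations."""
    problems: List[str] = []
    for field, expected in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected):
            problems.append(f"wrong type for {field}: expected {expected.__name__}")
    return problems


# Extra fields are tolerated, so the provider can evolve without breaking consumers.
ok = {"id": "u-1", "email": "a@example.com", "active": True, "plan": "pro"}
broken = {"id": 42, "email": "a@example.com"}

print(contract_violations(ok, USER_CONTRACT))      # []
print(contract_violations(broken, USER_CONTRACT))
# ['wrong type for id: expected str', 'missing field: active']
```

Running such a check against every new version of a provider lets the consuming team learn about a breaking change before it reaches production.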
Growth framework — Leveling up
In Postman’s culture, “Leveling Up” is the most important thing that everyone has to do. We use a modified version of Medium’s Engineering Growth Framework to measure this.
The Products-Services-Platforms organization design allowed us to define the Growth Framework for everyone within Engineering. Everyone within the team has a clearly communicated ownership, accountability and a growth path. This helped us identify the roles and skillsets we will need for the members of these teams.
We created two roles within the Services team: Service Owner and Service Contributor.
The responsibilities of a Service Owner include:
- Confirm and file JIRA issues for tasks
- Code reviews
- Architectural diagram for the service (including dependencies)
- Manage Mocks, Docs, Monitors, Tests using Postman
- Pushing patch fixes
- Deploy and monitor versions, retire older versions
- Administrate the database
- Interface with Engineering Managers + Product Manager + Platform
- On-board, mentor and off-board developers within the service
- Maintain SLAs for the service
The list further includes:
- Interface with Support for user tickets
- Interface with the Quality team
- Interface with Security team
- Interface with Platform team
- Metrics: bugs, regressions, uptime, unit test coverage.
The responsibilities of a Service Contributor include:
- Pick up/be allocated JIRA tasks as defined in the Product Specs
- Peer code reviews
- Write code, build mocks and monitors, prepare docs, write unit and integration tests
- Address support tickets and bugs reported by Support and Quality teams.
Service Maturity Model
The roles and responsibilities described above helped us establish a measurable evolution pattern for the services. In the context of our microservices, we define an SLA to be a commitment between the team providing the (micro)service and the products and services teams consuming that service. Internally, we quantify the SLAs as a maturity model for those services.
There are six pillars of the maturity model:
- Quality — derives from the end-user experience and measures the quality impact of the service towards it
- Performance — an index that measures the actual performance of the service against satisfactory and tolerable user experience
- Availability — uptime of the service globally and with respect to its dependent services and products
- Resilience — how resilient is the service to the (non) availability of each of its dependencies
- Security — how secure is the service
- Agility — a yardstick for the service owners, evaluating on planning, completion rates and eventual impact
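The article does not say which index is used for the Performance pillar. One well-known index with the same "satisfactory versus tolerable" shape is Apdex, so the sketch below uses it purely as an illustration. The 500 ms target and the sample response times are made-up values; the "tolerable" limit of four times the target follows the usual Apdex convention.

```python
from typing import Sequence


def apdex(response_times_ms: Sequence[float], satisfied_ms: float) -> float:
    """Apdex score: (satisfied + tolerating / 2) / total, between 0.0 and 1.0."""
    if not response_times_ms:
        raise ValueError("need at least one sample")
    tolerable_ms = 4 * satisfied_ms  # conventional Apdex tolerating limit
    satisfied = sum(1 for t in response_times_ms if t <= satisfied_ms)
    tolerating = sum(1 for t in response_times_ms if satisfied_ms < t <= tolerable_ms)
    return (satisfied + tolerating / 2) / len(response_times_ms)


# Two satisfied, two tolerating and one frustrated request for a 500 ms target.
samples = [100, 200, 600, 1500, 3000]
print(apdex(samples, satisfied_ms=500))  # 0.6
```

A single number like this is easy to track over time and to compare across services, which suits a maturity model.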
Back in the present, I see all of our efforts since 2014 bearing fruit in how much we have evolved as an organization. Today, all of our teams operate autonomously with their own, independent roadmaps. We have managed to distribute decision making. Everyone gets as much flexibility as they need for their work, but everyone is accountable for their actions and deliverables.
The original 10% of the team, including me, is now free to focus on newer, more impactful ideas. We have streamlined our hiring process and onboarding for the engineering team. Students joining us straight out of college ship their code to production in 2 months!
We have clearly defined SLA metrics between our three main functions: Products, Services and Platforms. Platforms have grown to have a clear abstraction and a plan for longer-term stability, scalability and security.
Best of all, with all these efforts, we now ship quality software faster!