Published in Better Practices

How Postman Engineering handles a million concurrent connections

The Server Foundation team at Postman shares the origin story of the Bifrost websocket gateway

Photo by Toni Reed on Unsplash

Development teams at Postman

The Server Foundation team is an example of a functional team at Postman: it creates and manages resources used across the entire engineering organization.

The monolithic Sync service

  1. You add a parameter to the Postman collection.
  2. Postman keeps a record of the update in version control stored with your profile.
  3. Postman displays the latest information to viewers of the collection in real time.

Sync under pressure

“Stuff that goes unnoticed in smaller systems becomes inescapable in more complex systems.”
-Kunal Nagpal, engineering manager at Postman

  • Cascading failure due to backpressure: Every deployment to Sync disconnects the Postman clients connected over websockets. When a million sockets reconnect, server resources degrade, which can cause more disconnections. The resulting surge is predictable but unavoidable, and recovery can take 6 to 8 hours.
  • Impact on user experience: Even though it didn’t happen often, a dropped connection meant an occasional delay in seeing the latest updates and activity in a Team Workspace.
  • Higher cost of maintenance: Since every squad relied on Sync, virtually every engineer at Postman had to learn how to handle dropped connections, initiate new ones, and then reconcile any conflicts in the data.
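
To make the reconnect surge described above concrete, here is a minimal sketch of one common way to spread out reconnect attempts: exponential backoff with full jitter. This is a general technique, not something the article attributes to Postman. The function names, the one-second base delay, and the 60-second cap are assumptions chosen only for illustration.

```python
import random
from collections import Counter
from typing import Iterable, Optional


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                    rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before reconnect attempt `attempt` (0-based), with full jitter."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    if rng is None:
        rng = random.Random()
    # Exponential ceiling, capped; the exponent is clamped to avoid float overflow.
    ceiling = min(cap, base * (2 ** min(attempt, 32)))
    # Full jitter: pick uniformly in [0, ceiling] so clients do not retry in lockstep.
    return rng.uniform(0.0, ceiling)


def peak_per_second(delays: Iterable[float]) -> int:
    """Largest number of reconnects that land in the same one-second window."""
    return max(Counter(int(d) for d in delays).values())


if __name__ == "__main__":
    rng = random.Random(7)
    clients = 10_000  # hypothetical fleet size, far smaller than a real deployment
    lockstep = [1.0] * clients  # every client retries after exactly one second
    jittered = [reconnect_delay(5, rng=rng) for _ in range(clients)]
    print("peak without jitter:", peak_per_second(lockstep))
    print("peak with jitter:", peak_per_second(jittered))
```

The comparison at the bottom shows the point of the technique: the same number of clients reconnect in both cases, but jitter lowers the peak number of reconnects the servers see in any single second.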

“This is the natural evolution of software design. Microservices start nimble, but they build up, and need to be broken back down. We wanted to separate socket handling from Sync because we were about to introduce a lot more functionality.”
-Yashish Dua, software engineer at Postman

Some internal services grow too big, requiring careful coordination across teams.

Here is what happened

Step 1: We got organizational buy-in

Step 2: We identified the unknown unknowns

Step 3: We built the Bifrost websocket gateway

Bifrost is composed of two parts: a public gateway and a private API.

  • Public gateway: The gateway uses the Fastify web framework and Amazon ElastiCache for Redis as a central message broker to manage all websocket connections.
  • Private API: The API also uses Fastify, as a low-overhead web framework, to proxy traffic to other internal Postman services.
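
The two-part split can be sketched in a few lines. The code below is an in-memory illustration only, not Bifrost's implementation: `InMemoryBroker` stands in for the Redis message broker, a plain list stands in for each client socket, and every class name, method name, and channel name is hypothetical. It also assumes that messages flow to clients through the broker and that the private API forwards requests to named internal services; the article does not spell out these details.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class InMemoryBroker:
    """Stand-in for the central message broker (Redis in the article)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self._handlers[channel].append(handler)

    def publish(self, channel: str, message: str) -> int:
        """Deliver `message` to every subscriber; return how many received it."""
        handlers = list(self._handlers.get(channel, []))
        for handler in handlers:
            handler(message)
        return len(handlers)


class GatewayNode:
    """One public gateway instance. A list stands in for each client socket."""

    def __init__(self, broker: InMemoryBroker) -> None:
        self.broker = broker
        self.sockets: Dict[str, List[str]] = {}

    def connect(self, client_id: str, channel: str) -> List[str]:
        inbox: List[str] = []
        self.sockets[client_id] = inbox
        # Messages published on the channel are appended to the client's inbox.
        self.broker.subscribe(channel, inbox.append)
        return inbox


class PrivateApi:
    """Forwards a request to a registered internal service handler."""

    def __init__(self) -> None:
        self._services: Dict[str, Callable[[dict], dict]] = {}

    def register(self, name: str, handler: Callable[[dict], dict]) -> None:
        self._services[name] = handler

    def forward(self, name: str, payload: dict) -> dict:
        if name not in self._services:
            raise LookupError(f"unknown internal service: {name}")
        return self._services[name](payload)


if __name__ == "__main__":
    broker = InMemoryBroker()
    node_a, node_b = GatewayNode(broker), GatewayNode(broker)
    alice = node_a.connect("alice", "workspace:42")
    bob = node_b.connect("bob", "workspace:42")
    broker.publish("workspace:42", "collection updated")
    print(alice, bob)
```

Because both gateway nodes subscribe to the same broker, a message published once reaches clients regardless of which node holds their connection. That is the property a central broker provides, and it is also why the broker is a shared dependency for every node.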

Step 4: We tested the new gateway

Step 5: We migrated to the new gateway

“It’s like changing an engine on an aircraft in mid-flight.”
-Numaan Ashraf, director of engineering at Postman

Step 6: We scaled the service

  • Horizontal scaling: Postman services usually handle increased usage either by moving to higher-capacity instances or by adding more compute instances to the fleet. In practice, engineers at Postman most often scale a service up by increasing the size and computing power of its Amazon EC2 instances, for example through AWS Elastic Beanstalk. Bifrost’s websocket handling instead scales out across more machines, and it runs most efficiently on a large number of small instances. This type of hyper-horizontal scaling works well for Bifrost because clients don’t require high network throughput, and limiting each machine to fewer connections limits the blast radius of failures.
  • New load factor of CPU and memory: Most Postman services scale effectively on a single metric, such as CPU, memory, or latency. Bifrost is more nuanced, because memory and CPU usage affect operations differently at different levels of throughput. To account for that, Bifrost uses a custom scaling metric based on a load factor: a multidimensional calculation that gives the service a custom non-linear scaling profile.
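
The article does not publish the actual load factor formula, so the sketch below is purely illustrative. It assumes a weighted power mean of CPU and memory utilization as the non-linear combination, a proportional rule for choosing the fleet size, and a target value of 0.6. All function names, weights, and thresholds are hypothetical. It also works through the blast-radius arithmetic behind using many small instances.

```python
import math


def connections_per_instance(total_connections: int, instances: int) -> int:
    """Connections dropped if one instance fails, assuming an even spread."""
    if instances <= 0:
        raise ValueError("instances must be positive")
    return -(-total_connections // instances)  # ceiling division on integers


def load_factor(cpu: float, memory: float,
                cpu_weight: float = 0.5, exponent: float = 2.0) -> float:
    """Combine CPU and memory utilization (each 0..1) into one non-linear score.

    Assumed formula: a weighted power mean. With exponent > 1 the busier
    dimension dominates, so one saturated resource is not averaged away.
    """
    for value in (cpu, memory, cpu_weight):
        if not 0.0 <= value <= 1.0:
            raise ValueError("cpu, memory and cpu_weight must be within 0..1")
    weighted = cpu_weight * cpu ** exponent + (1.0 - cpu_weight) * memory ** exponent
    return weighted ** (1.0 / exponent)


def desired_instances(current: int, factor: float, target: float = 0.6) -> int:
    """Scale the fleet in proportion to how far the load factor is from target."""
    if current <= 0 or target <= 0:
        raise ValueError("current and target must be positive")
    # Rounding guards against float noise pushing an exact result up by one.
    return max(1, math.ceil(round(current * factor / target, 9)))


if __name__ == "__main__":
    for fleet in (100, 1000):
        print(fleet, "instances ->", connections_per_instance(1_000_000, fleet),
              "connections lost per failed instance")
    factor = load_factor(cpu=0.9, memory=0.1)
    print("load factor:", round(factor, 3), "vs plain average:", 0.5)
    print("desired instances:", desired_instances(current=10, factor=factor))
```

Under these assumptions, a service at 90% CPU and 10% memory scores well above the plain average of 0.5, which is the kind of behavior a single linear metric would miss.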

The Bifrost architecture and tech stack

“Bifrost is our gateway for all websocket connections. It’s a proxy for all Postman clients, and responsible for handling low-level socket operations for internal Postman services.”
-Mudit Mehta, software engineer at Postman

The Bifrost gateway uses AWS, Redis, and Fastify to handle websockets.

“Simple components. Complex logic.”
-Kunal Nagpal, engineering manager at Postman

The journey is far from over

  • Build additional redundancy: The Redis cache is a central message broker. Websocket handling still relies on a single point of failure, so what happens if the cache ever goes down?
  • Increase bandwidth and throughput: The gateway is currently capable of handling 10x concurrency, but the Postman community is growing fast and engineering is building out more collaboration features. The need to handle more websocket traffic is coming up quickly.
  • Continue breaking down the monolith: The Sync service contains a jumble of other services entwined within its codebase. Decoupling socket handling from Sync loosens its grip on other services, so other services can now be more easily peeled off.
