Microservice Software Architecture at Airtime


by Nathan Peck

Microservice architecture is a well-known development approach at this point, but for every success story in which microservices helped a company build a fantastic backend, there is another story where things didn’t work out. The purpose of this article is to share how the Airtime engineering team has built an efficient, scalable, and extensible microservice backend.

From the beginning we made a conscious choice to leverage appropriate AWS services to solve most scaling issues so we could focus on implementing product features. One such choice was to use AWS Elastic Container Service (ECS) to remove much of the overhead of deploying microservices (as covered in a previous article on our use of ECS). Once deployments were figured out, we had a powerful platform on which to build our microservices; the next task was figuring out where to draw service boundaries and what each service should do.

One API, Multiple HTTP Services

Airtime has a basic REST-style API in which URL paths map to backend resource types, and the HTTP verb determines what action should be performed on that resource. For example:

POST /api/v1/rooms/:roomId/messages - Post a message in a room
GET /api/v1/rooms/:roomId/messages - Fetch a list of messages in a room
GET /api/v1/rooms/:roomId/messages/:messageId - Fetch a message
PUT /api/v1/rooms/:roomId/messages/:messageId - Edit a message
DELETE /api/v1/rooms/:roomId/messages/:messageId - Remove a message
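
To make the mapping concrete, here is a minimal sketch of how routes like these might be wired up in an Express-style Node service, written in TypeScript; the handler bodies and status codes are illustrative stand-ins, not Airtime’s actual code.

    import express, { Request, Response } from 'express';

    const app = express();
    app.use(express.json());

    // Each HTTP verb on the same resource path maps to a different action.
    app.post('/api/v1/rooms/:roomId/messages', (req: Request, res: Response) => {
      // Persist the new message, then return it (stubbed for illustration).
      res.status(201).json({ roomId: req.params.roomId, text: req.body.text });
    });

    app.get('/api/v1/rooms/:roomId/messages/:messageId', (req: Request, res: Response) => {
      // Fetch a single message by id (stubbed for illustration).
      res.json({ roomId: req.params.roomId, messageId: req.params.messageId });
    });

    app.listen(3000);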

One of the major advantages of REST-style URL paths is the ability to easily reroute requests based on URL patterns. For example, we can route any request for a path starting with /api/v1/rooms to a service which handles room details, while routing any request for a path starting with /api/v1/users to a service which handles users.

On the AWS platform there is a perfect tool for accomplishing this: AWS Cloudfront. Cloudfront is a CDN layer provided by AWS, which recently gained the ability to handle requests with the PUT, POST, and DELETE verbs in addition to GET requests. Cloudfront can also route requests to different origin servers based on the URL pattern of the request. Airtime has multiple web services, each with its own AWS Elastic Load Balancer (ELB). We have one Cloudfront distribution for the main API, which sits in front of our microservice load balancers and directs each incoming request to the appropriate place.
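
As a rough illustration of the routing idea, the mapping from URL patterns to origins looks conceptually like the snippet below. The field names mirror CloudFront’s cache-behavior and origin API, but this is not a complete distribution config (a real one requires TTLs, viewer protocol policy, and more), and the origin IDs and domain names are hypothetical.

    // Conceptual sketch of path-based routing, not a full DistributionConfig.
    const origins = [
      { Id: 'rooms-service-elb', DomainName: 'rooms-elb.example.com' },
      { Id: 'users-service-elb', DomainName: 'users-elb.example.com' },
    ];

    const cacheBehaviors = [
      { PathPattern: '/api/v1/rooms*', TargetOriginId: 'rooms-service-elb' },
      { PathPattern: '/api/v1/users*', TargetOriginId: 'users-service-elb' },
    ];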

The global average API network timings dropped considerably when we added AWS Cloudfront

Cloudfront also speeds up the experience for end users because it has edge servers all over the world. When a client makes a request to our API, the request goes to a nearby Cloudfront edge instead of directly to our backend servers. This means that the initial connection and SSL negotiation (red and green on the graph above) are much faster. We even see faster delivery of content from our origin servers to the end user via the Cloudfront edges, since our data travels over internet routes that AWS has optimized (blue on the graph above). In our case, this makes our API respond 50–500ms faster depending on the client’s geographical location.

Finally, Cloudfront allows us to cache frequently requested static resources so that they are served directly from the Cloudfront edges without hitting our backend at all. This makes the API feel faster and lessens the load on our backend.

Producers and Consumers

Airtime has two types of microservices: producers and consumers.

Producer services are the starting point for any logic flow through the backend. Most of these are web services that power the REST API. When a REST resource is modified, the service produces an event to notify other services that the resource changed. For example, when a user creates an account, the user service will create a new user, store it in the database, and publish an event to notify other services about the new user.
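
A minimal sketch of what that publish step could look like using the AWS SDK for JavaScript; the topic ARN and the event payload shape are assumptions for illustration, not Airtime’s actual schema.

    import { SNS } from 'aws-sdk';

    const sns = new SNS({ region: 'us-east-1' });

    // Hypothetical topic ARN; each event type gets its own SNS topic.
    const NEW_USER_TOPIC = 'arn:aws:sns:us-east-1:123456789012:new_user';

    // Called by the user service right after the new user is persisted.
    async function publishNewUser(userId: string): Promise<void> {
      await sns.publish({
        TopicArn: NEW_USER_TOPIC,
        // Payload shape is an assumption for illustration.
        Message: JSON.stringify({ type: 'new_user', userId, createdAt: new Date().toISOString() }),
      }).promise();
    }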

Consumer services listen for events from the producer services and perform asynchronous logic in response to those events. For example, a service might listen for new user signups: when it sees the new user signup event, it checks whether the user has any friends who should be notified about them signing up for our app. Consumers run asynchronously from producers, which means producers stay fast and lean even when heavy business logic is running behind the scenes.

Airtime uses AWS Simple Notification Service (SNS) and AWS Simple Queue Service (SQS) to power the pub/sub event backbone that connects producer services to consumer services. Each event type published by a producer service has its own SNS topic, which the producer publishes to when the event occurs. Each consumer service has its own SQS queue, and the consumer polls the queue for events. AWS makes it easy to subscribe an SQS queue to one or more SNS topics. The events published on a topic feed into each subscribed queue and fill it up until the consumer pulls the events out to work on them.
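
A sketch of such a polling consumer using the AWS SDK’s long-polling receive; the queue URL is hypothetical. Note that SNS wraps the producer’s payload in a delivery envelope (unless raw message delivery is enabled), so the consumer parses two layers of JSON.

    import { SQS } from 'aws-sdk';

    const sqs = new SQS({ region: 'us-east-1' });
    // Hypothetical queue URL for this consumer service.
    const QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-consumer';

    async function pollLoop(): Promise<void> {
      while (true) {
        // Long polling: wait up to 20s, take up to 10 events per request.
        const { Messages = [] } = await sqs.receiveMessage({
          QueueUrl: QUEUE_URL,
          MaxNumberOfMessages: 10,
          WaitTimeSeconds: 20,
        }).promise();

        for (const msg of Messages) {
          // The original event sits in the SNS envelope's Message field.
          const event = JSON.parse(JSON.parse(msg.Body!).Message);
          await handleEvent(event);
          // Delete only after successful processing so failures are retried.
          await sqs.deleteMessage({
            QueueUrl: QUEUE_URL,
            ReceiptHandle: msg.ReceiptHandle!,
          }).promise();
        }
      }
    }

    // Stand-in for the consumer's real business logic.
    async function handleEvent(event: unknown): Promise<void> {
      console.log('processing', event);
    }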

Using SNS and SQS in this manner is very fast, very cheap, and very scalable. At the time of writing this article AWS does not charge for SNS requests that deliver data to SQS queues, and polling the SQS queue costs a mere 50 cents per 1 million requests (after the first 1 million free requests). Since consumer services can receive up to 10 items from the queue per SQS request, our consumer services can process up to 10 million events for only 50 cents (1 million requests × 10 events per request). Latency between publishing to SNS and a consumer service receiving the event is extremely low, and whether we are publishing 100 events per minute or 1,000 events per second, the performance stays constant as AWS scales things on their end, with no intervention from us.

SQS also makes scaling on our application side very easy. If one of our producers receives a large traffic spike, the queue absorbs the events and stores them until a consumer is able to pull them out and process them. If there aren’t enough consumers to keep up with the volume of incoming events, the queue starts to back up. To handle this, we have an AWS Cloudwatch alarm based on the queue length which triggers a scale-up of the number of consumer service instances until they can keep up with the queue throughput.

We autoscale our consumer microservices in AWS ECS if the queue for that service gets backed up.
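
For reference, here is a sketch of how such an alarm might be created with the AWS SDK. The queue name, threshold, and scaling-policy ARN are placeholders, and in practice the alarm is often defined in the console or in infrastructure templates rather than application code.

    import { CloudWatch } from 'aws-sdk';

    const cloudwatch = new CloudWatch({ region: 'us-east-1' });

    // Alarm when the queue backlog stays high; values are placeholders.
    async function createBacklogAlarm(scaleUpPolicyArn: string): Promise<void> {
      await cloudwatch.putMetricAlarm({
        AlarmName: 'consumer-queue-backlog',
        Namespace: 'AWS/SQS',
        MetricName: 'ApproximateNumberOfMessagesVisible',
        Dimensions: [{ Name: 'QueueName', Value: 'my-consumer' }],
        Statistic: 'Average',
        Period: 60,           // evaluate over 60-second windows
        EvaluationPeriods: 2, // must breach for 2 consecutive periods
        Threshold: 1000,      // alarm once the backlog exceeds 1,000 messages
        ComparisonOperator: 'GreaterThanThreshold',
        // Fires a scaling policy that adds consumer tasks in ECS.
        AlarmActions: [scaleUpPolicyArn],
      }).promise();
    }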

A Practical Example

To see how this all works together, let’s walk through a basic product feature:

  1. A user should be able to invite another user to join an Airtime room.
  2. The invite should be tracked in Redshift for analytics purposes, and the invitee should receive a push notification notifying them about the invite.
  3. The invitee should be able to open the app and see the invite appearing in an activity feed from which they can accept the invite.

The following diagram shows how the Airtime backend implements this:

In this flow, the first step is that the inviter makes an HTTP request to our REST API to invite another user to a room. In our architecture this means they establish a connection to a geographically close AWS Cloudfront edge, which routes the request to the appropriate backend service based on the URL path. In this case, the request goes to a core REST service named “Deathstar”.

Once the request is received, the Deathstar service a) creates a new entry in the room membership database table used for persistence, b) marks the invite recipient as invited to the room, c) publishes an event to the “room_invite” SNS topic, and d) returns a 200 OK response to the inviter. This initial request to the producer service completes very quickly (10–20ms on average) and is not blocked by any of the complex business logic that this request is about to trigger.

Meanwhile, SNS has delivered the room invite event to three SQS queues that are subscribed to that topic. Each queue has its own consumer service that pulls the event from the queue, and they can begin running their respective business logic in parallel. Stormtrooper is the push notification engine, which generates push notification copy and sends the push notification to the recipient user using an AWS SNS mobile application endpoint. Darthvader is the Airtime analytics service, which uses Segment to insert a new event into an AWS Redshift cluster that is used for business intelligence. Rancor is a feed service, which consumes raw events, turns them into time series data feeds, and stores them for later retrieval.
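
As one concrete example, the Stormtrooper push step could look roughly like this, assuming the recipient’s device was previously registered as an SNS mobile platform endpoint and its ARN stored with the user record; the function name and notification copy are illustrative.

    import { SNS } from 'aws-sdk';

    const sns = new SNS({ region: 'us-east-1' });

    // Publishing to a device endpoint ARN (TargetArn) rather than a topic
    // delivers a push notification to that single device.
    async function sendInvitePush(endpointArn: string, inviter: string, room: string): Promise<void> {
      await sns.publish({
        TargetArn: endpointArn,
        Message: `${inviter} invited you to join ${room}`,
      }).promise();
    }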

The invited user receives the push notification on the mobile device and opens the app. Their client makes a request to the REST API via Cloudfront, which serves this request using Deathstar. Deathstar grabs the user’s activity feed from Rancor and fills in the room data, then returns it via Cloudfront to the client. The client is then able to display the activity feed showing an invite item in the feed.

This practical example is just one of many product features in Airtime. The Airtime backend has several dozen core event topics, and over a dozen microservices which run various business logic in response to events from those topics.

Network graph of the core microservices that power most of the Airtime API

Power from Simplicity

This architectural design pattern allows for simple code while making it easy to add powerful new features with minimal engineering complexity. The basic principle is to keep events fairly simple and generic, yet detailed enough that future services can be added without changing the event.
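
As one hedged illustration of that principle, an event like the room invite might carry a small, stable set of fields; the schema below is an assumption for the sake of example, not Airtime’s actual event format.

    // Generic enough that new consumers can subscribe without producer
    // changes, detailed enough to be useful on its own.
    interface RoomInviteEvent {
      type: 'room_invite'; // matches the SNS topic the event is published to
      roomId: string;      // the resource the event concerns
      inviterId: string;   // who triggered the event
      inviteeId: string;   // who the event affects
      createdAt: string;   // ISO-8601 timestamp
    }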

In practice, the example product feature used both the Rancor feed service and the Darthvader analytics service, neither of which existed in the initial prototype of the Airtime backend. We started with only the Deathstar HTTP service and the Stormtrooper push notification service. Once the basics had been built, we were able to go back and add business analytics and a feed very easily by adding two new services and subscribing their queues to the events that were already being published for Stormtrooper to use.

This architectural approach also minimizes the risk of adding new features. Engineers are empowered to develop complex new backend features, confident that their new consumer service can’t break existing producer services. Each significant new feature is contained in its own service. This feature separation, and the 1:1 relationship between significant functionality (like analytics or push notifications) and services, helps keep developer velocity high and makes it easier to onboard a new developer and get them up to speed on the code.

Future Proof

The Airtime backend architecture, in addition to being very efficient and extensible, is also well positioned to adopt new tools. For example, we could connect one of our SNS event topics to a Lambda function instead of an SQS queue and create a “serverless” feature that is autoscaled magically by AWS.

As AWS Lambda becomes more powerful and the tooling around it matures, it may become feasible to replace some of the existing SQS-polling consumer microservices with Lambda functions.
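
A minimal sketch of what such a Lambda consumer could look like in TypeScript, using the standard SNS event shape; the handler body is a stand-in for real business logic.

    import { SNSEvent } from 'aws-lambda';

    // A Lambda subscribed directly to an SNS topic receives batched
    // records, so no queue or polling loop is needed.
    export const handler = async (event: SNSEvent): Promise<void> => {
      for (const record of event.Records) {
        const payload = JSON.parse(record.Sns.Message);
        console.log('processing event', payload);
      }
    };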

Conclusion

Airtime’s architecture is by no means the only way to create an efficient backend, but it has worked well for us as a way for a very small team of engineers to create a nontrivial backend with many powerful features, while avoiding technical debt that would slow us down or hurt our backend’s performance.

If you have a passion for creating fast, scalable backends and this architecture sounds like something you’d like to work with, check out our careers page. We might have a place for you on the Airtime team. :)

Originally published at https://techblog.airtime.com on July 7, 2016.
