Eventing infrastructure for microservice-based ecosystems

Pallavi Santosh
Published in VMware 360
5 min read · Jun 8, 2020

Workspace ONE Intelligence is an AWS-based cloud service that provides reporting, analytics, data visualization, and workflow orchestration to VMware customers. It receives data from a variety of sources, including our UEM (Unified Endpoint Management) platform, mobile SDKs, Windows agents, and Trust Network partners. The Intelligence footprint comprises over 30 microservices and approximately 20 Lambda functions, in addition to a number of AWS managed services. The figure below illustrates the basic composition of each microservice: Docker containers, deployed on AWS ECS, and clustered using ElastiCache for Redis. While each application is individually clustered, certain events often need to be shared across the services that we deploy: for example, when a customer is provisioned or de-provisioned, when a customer’s entitlements change, or when certain features are enabled or disabled by configuration. This article describes how we created a “global event bus” that allows such events to be shared across all of our applications.

We considered three options to facilitate inter-service communication in the Intelligence system:

  1. HTTP calls between services for notification events. This would be the easiest and quickest solution to implement: the producer service posts notifications to each consumer service through HTTP calls. This works fine when the number of consumers is small (1 or 2) and unlikely to change. However, as that number grows, the producer has to keep track of every service that depends on its notifications, which does not scale. We could have the consumers poll the producer periodically instead, but the extra load on the producer would increase linearly with the number of consumers we add.
  2. Data streams used as event queues (AWS Kinesis, for example). This solution requires some infrastructure changes, but it lets notifications scale without introducing undue load or complexity on the producer. The producer writes notifications to a data stream, and the stream can have multiple consumers that read and process notifications as required. We briefly considered AWS Kinesis for this implementation but decided against it: Kinesis is meant to support heavy, sustained streams of data. In addition, it requires DynamoDB to store checkpointing information for the various consumers and has a limit on the number of consumers per stream (~20).
  3. Use an actual notification system.

AWS SNS (Simple Notification Service) is a fully managed messaging service, and it is the option we chose.

An SNS topic is an access point that handles notifications and is identified by an Amazon Resource Name (ARN). Subscribers are entities that are interested in the notifications. They could be applications, end users, or devices. Publishers are entities that generate notifications and post them to the topic. A topic encapsulates all of the information about who should receive the notification and how that notification should be delivered.

SNS offers four subscription options out-of-the-box for message delivery:

  1. Write to a specified SQS queue
  2. Post to an AWS Lambda
  3. Post to HTTP/HTTPS endpoint
  4. SMS/Email

Both AWS-managed delivery options, SQS and Lambda, have robust retry configurations: a total of 100,015 retries over 23 days, spread across 4 phases of retry policies. Retries on Email/SMS are not as persistent: 50 attempts over 6 hours, also spread across 4 phases of retry policies.

The retry policy on HTTP/HTTPS endpoints is configurable with a default of 3 retries (maximum 100) and a default delay of 20 seconds between retries (maximum 3600).
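The retry settings above can be expressed as an SNS delivery policy document. The following is a minimal sketch, assuming a fixed delay between retries; the function name and its parameters are illustrative and not part of our library, while the field names under `healthyRetryPolicy` follow the SNS delivery policy format.

```python
import json

# Limits quoted in the text above for HTTP/HTTPS endpoints.
MAX_RETRIES = 100
MAX_DELAY_SECONDS = 3600


def http_delivery_policy(num_retries: int = 3, delay_seconds: int = 20) -> str:
    """Return a delivery policy JSON string with a fixed delay between retries.

    Hypothetical helper: the name and defaults are assumptions for illustration.
    """
    if not 0 <= num_retries <= MAX_RETRIES:
        raise ValueError(f"num_retries must be between 0 and {MAX_RETRIES}")
    if not 1 <= delay_seconds <= MAX_DELAY_SECONDS:
        raise ValueError(f"delay_seconds must be between 1 and {MAX_DELAY_SECONDS}")
    policy = {
        "healthyRetryPolicy": {
            "numRetries": num_retries,
            # Equal min and max targets give a constant delay for every retry.
            "minDelayTarget": delay_seconds,
            "maxDelayTarget": delay_seconds,
            "numNoDelayRetries": 0,
            "numMinDelayRetries": 0,
            "numMaxDelayRetries": 0,
            "backoffFunction": "linear",
        }
    }
    return json.dumps(policy)
```

The resulting string would typically be set as the `DeliveryPolicy` attribute of an HTTP/HTTPS subscription.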

Some other key points to remember here are:

  1. SNS does not guarantee ordering on notifications — it tries to deliver messages in the order they were received but there is no guarantee.
  2. The maximum payload size for a notification is relatively large — 256 kilobytes.
  3. SNS allows configuring a dead-letter queue to hold notifications that could not be delivered to a subscriber.
  4. Message encryption — SNS provides the option to use encrypted topics. Notifications are encrypted using KMS before storage and then decrypted before fan out to each subscriber.
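Two of the points above lend themselves to a short illustration: the payload limit and the dead-letter queue. This is a rough sketch with hypothetical helper names. It checks only the message body, and the queue ARN passed to `redrive_policy` is whatever SQS queue you have set aside for undeliverable notifications.

```python
import json

# 256 kilobytes, measured in bytes rather than characters.
MAX_SNS_PAYLOAD_BYTES = 256 * 1024


def fits_in_sns(message: str) -> bool:
    """Return True if the UTF-8 encoded message body is within the 256 KB limit."""
    return len(message.encode("utf-8")) <= MAX_SNS_PAYLOAD_BYTES


def redrive_policy(dead_letter_queue_arn: str) -> str:
    """Return the JSON value for a subscription's RedrivePolicy attribute."""
    return json.dumps({"deadLetterTargetArn": dead_letter_queue_arn})
```

Because the limit is in bytes, a message made of multi-byte characters reaches it with fewer characters than an ASCII-only message does.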

The Event Bus on the Workspace ONE Intelligence platform leverages a single SNS topic with each of the aforementioned 30 microservices exposing HTTP endpoints to receive notifications. Customer creation, customer de-provisioning, license changes, and a variety of other simple, infrequent control messages are communicated to the platform through this single SNS topic.
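As an illustration of what publishing a control message to the single topic might look like, the sketch below builds the arguments for an SNS publish call. The `event_type` attribute name, the event name in the usage note, and the payload shape are assumptions made for this example, not the actual schema used on the platform.

```python
import json


def build_publish_request(topic_arn: str, event_type: str, payload: dict) -> dict:
    """Build keyword arguments for an SNS publish call.

    The event type travels as a message attribute (hypothetical name
    "event_type") so that subscribers can filter on it without parsing the body.
    """
    return {
        "TopicArn": topic_arn,
        "Message": json.dumps(payload),
        "MessageAttributes": {
            "event_type": {"DataType": "String", "StringValue": event_type},
        },
    }
```

With a boto3 SNS client named `sns`, the call would be along the lines of `sns.publish(**build_publish_request(topic_arn, "customer_provisioned", {"customer_id": "c-1"}))`.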

All services in an environment are allowed to publish to a common SNS topic called event_bus. All services are configured to reach out to SNS and subscribe to the event bus on startup if a subscription does not exist already. This means that there is no manual effort involved in adding a new service to the event bus.
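The subscribe-on-startup behavior could be sketched roughly as follows. This is not our library code: the function names are hypothetical, and `sns` is assumed to be a boto3 SNS client (or any object exposing the same `list_subscriptions_by_topic` and `subscribe` methods).

```python
def find_subscription(sns, topic_arn: str, endpoint: str):
    """Return the ARN of an existing subscription for this endpoint, or None."""
    kwargs = {"TopicArn": topic_arn}
    while True:
        page = sns.list_subscriptions_by_topic(**kwargs)
        for sub in page.get("Subscriptions", []):
            if sub.get("Endpoint") == endpoint:
                return sub["SubscriptionArn"]
        token = page.get("NextToken")
        if not token:
            return None
        # More pages remain; continue from where the previous page ended.
        kwargs["NextToken"] = token


def ensure_subscribed(sns, topic_arn: str, endpoint: str) -> str:
    """Subscribe the HTTPS endpoint to the topic unless it is already subscribed."""
    existing = find_subscription(sns, topic_arn, endpoint)
    if existing is not None:
        return existing
    response = sns.subscribe(
        TopicArn=topic_arn,
        Protocol="https",
        Endpoint=endpoint,
        ReturnSubscriptionArn=True,
    )
    return response["SubscriptionArn"]
```

Note that an HTTP/HTTPS subscription stays pending until the endpoint confirms it, so the notification handler also has to deal with the confirmation message that SNS sends first.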

Each service exposes an HTTP endpoint that is unreachable by any entity external to the Virtual Private Cloud (VPC) deployment. SNS exists outside of the VPC, so our API gateway handles the routing of each notification to the right service.

SNS allows notification filtering. On startup, each service specifies what types of notifications it wants to receive. SNS filters out all other notifications, ensuring that a service receives and processes only the notifications that it requires.
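To make the filtering concrete, here is a small sketch of a filter policy keyed on a message attribute. The attribute name `event_type` and the event names are assumptions for illustration. The `matches` function is only a local approximation of exact string matching, useful for reasoning about a policy, and is not how SNS itself is implemented.

```python
import json


def filter_policy(event_types) -> str:
    """Return a filter policy JSON string that accepts only the given event types."""
    return json.dumps({"event_type": sorted(set(event_types))})


def matches(policy_json: str, message_attributes: dict) -> bool:
    """Approximate exact-match filtering: every policy key must match an attribute."""
    policy = json.loads(policy_json)
    for name, allowed_values in policy.items():
        attribute = message_attributes.get(name)
        if attribute is None or attribute.get("StringValue") not in allowed_values:
            return False
    return True
```

A policy built this way would be supplied as the `FilterPolicy` attribute of the service's subscription.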

Incoming notifications are parsed and verified for authenticity before processing.
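The parsing and verification step might look something like the sketch below, which uses only the standard library. It parses the JSON body that SNS posts, checks that the signing certificate URL points at an SNS host over HTTPS, and builds the canonical string that the signature covers. The actual cryptographic check of `Signature` against the certificate is omitted because it needs a crypto library. The function names are hypothetical, and the host pattern assumes standard commercial AWS regions.

```python
import json
import re
from urllib.parse import urlparse

# Fields covered by the signature, in the order SNS signs them.
_CONFIRMATION_FIELDS = ["Message", "MessageId", "SubscribeURL", "Timestamp", "Token", "TopicArn", "Type"]
_SIGNED_FIELDS = {
    "Notification": ["Message", "MessageId", "Subject", "Timestamp", "TopicArn", "Type"],
    "SubscriptionConfirmation": _CONFIRMATION_FIELDS,
    "UnsubscribeConfirmation": _CONFIRMATION_FIELDS,
}

# Assumption: certificates are served from sns.<region>.amazonaws.com.
_CERT_HOST = re.compile(r"sns\.[a-z0-9-]+\.amazonaws\.com")


def cert_url_is_trusted(url: str) -> bool:
    """Accept only HTTPS certificate URLs hosted on an SNS endpoint."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and bool(_CERT_HOST.fullmatch(parsed.hostname or ""))


def string_to_sign(message: dict) -> str:
    """Build the newline-delimited name/value string that the signature covers."""
    parts = []
    for name in _SIGNED_FIELDS[message["Type"]]:
        if name in message:  # Subject is optional on notifications.
            parts.append(f"{name}\n{message[name]}\n")
    return "".join(parts)


def parse_notification(body: str) -> dict:
    """Parse an incoming SNS HTTP body and reject obviously untrustworthy messages."""
    message = json.loads(body)
    if message.get("Type") not in _SIGNED_FIELDS:
        raise ValueError("unexpected SNS message type")
    if not cert_url_is_trusted(message.get("SigningCertURL", "")):
        raise ValueError("untrusted signing certificate URL")
    return message
```

A complete handler would then download the certificate and verify the signature over `string_to_sign(message)` before acting on the notification.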

All the required functionality (like the ability to subscribe to the event bus on startup, the notification handler, and the asynchronous publisher) is packaged into a neat library and imported into the base configuration for all services. This library also enumerates all valid event types and defines the structure of each event type. In order for a service to begin receiving a new event, it only needs to implement a handler class. Based on the presence of that handler, the library updates SNS with the correct event filter for that microservice.
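The idea of deriving the SNS filter from the handlers a service implements can be sketched as follows. This is a simplified stand-in for the library, not its real API: the class names, the `event_type` attribute, and the event name are all hypothetical.

```python
import json


class EventHandler:
    """Base class: a service implements one subclass per event it cares about."""

    event_type = None  # Subclasses set this to the event they handle.

    def handle(self, payload: dict) -> None:
        raise NotImplementedError


class CustomerProvisionedHandler(EventHandler):
    """Example handler for a hypothetical 'customer_provisioned' event."""

    event_type = "customer_provisioned"

    def __init__(self):
        self.provisioned = []

    def handle(self, payload: dict) -> None:
        self.provisioned.append(payload["customer_id"])


class EventBus:
    """Keeps the registered handlers and derives the filter policy from them."""

    def __init__(self):
        self._handlers = {}

    def register(self, handler: EventHandler) -> None:
        self._handlers[handler.event_type] = handler

    def filter_policy(self) -> str:
        # Only events with a registered handler are requested from SNS.
        return json.dumps({"event_type": sorted(self._handlers)})

    def dispatch(self, event_type: str, payload: dict) -> bool:
        """Route a notification to its handler; return False if none is registered."""
        handler = self._handlers.get(event_type)
        if handler is None:
            return False
        handler.handle(payload)
        return True
```

With this shape, adding a handler class and registering it is enough to change what `filter_policy()` returns, which mirrors how the presence of a handler drives the event filter for a microservice.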

Going forward, there are some areas that could be altered based on requirements:

  1. While the retry policy for HTTP/HTTPS endpoints is configurable and somewhat robust, highly critical, time-sensitive communications might warrant an SQS queue in conjunction with SNS for more exhaustive delivery attempts.
  2. Notifications should never contain plain-text information that might be considered sensitive, even though message encryption is an option.
  3. A single filter policy applied to a subscription can contain at most 5 notification types, and filter policies are capped at a total of 200 per account. As such, we need to be conscious of every new notification that is introduced into the system.

This framework has been a much-needed addition to the Intelligence system. A brand-new service for managing user preferences began leveraging these events last week to synchronize state across multiple microservices. We will continue to evolve our eventing framework as we encounter new constraints or requirements.
