Sharing AWS SNS Topics Across Stages

Recently at FloSports we made a significant architectural change to how we handle SNS topics. There were times when we needed to deploy to a backup stack (we will call this env02) so that we could point our API to that stack while deploying the primary stack (which we will call env01). This strategy helps minimize downtime, but presents a challenge for resources such as SNS topics, because each stack has its own unique SNS topics for that environment's lambdas to respond to. The front-end app, however, only sends SNS messages to env01 topic names, so any messages sent while the API was pointed at env02 would go unprocessed. To solve this, we wanted to share the SNS topics across stages.

This solution of sharing topics across stages required the following items to work properly:

  • Managing our topics directly rather than letting the Serverless framework manage them
  • Adding a DeletionPolicy to our topics
  • Knowing if a topic already exists or not
  • Working with a “runtime stage”
  • Refinement of our “Stack-swap” process

The first step in this process was adding our SNS topics to the resources section of our Serverless config. By creating the resources ourselves, rather than having Serverless generate them automatically, we are able to manage them exactly how we need. This is important because we needed to add a DeletionPolicy to each topic and set it to Retain. With the DeletionPolicy in place, a topic remains in AWS even when the containing stack is deleted. This is great because it allows us to point lambdas at an existing topic: when we deploy to env02, those functions can subscribe to the already-existing topic rather than deleting and recreating it! We quickly ran into an issue here, though: how do we know whether or not a topic already exists? The Serverless config doesn't allow for logic to check if a given resource exists, and if an item in the resources section of your configuration file already exists in AWS, a ResourceAlreadyExistsException will be thrown.
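A retained topic in the resources section might look like the following sketch (the logical ID, topic name, and stage variable are illustrative, not our actual config):

```yaml
resources:
  Resources:
    OrdersTopic:
      Type: AWS::SNS::Topic
      DeletionPolicy: Retain
      Properties:
        TopicName: ${self:provider.stage}-orders-topic
```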

To solve the ResourceAlreadyExistsException, I created a small plugin which gets a list of all of our SNS topic names, compares them to those in the CloudFormation template generated by Serverless, and removes any from the template which exist in the obtained list.
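The core of that plugin's logic can be sketched as a pure function over the compiled template (names here are illustrative, not our actual plugin; in practice the list of existing topic names would come from the AWS SDK's listTopics call in a pre-deploy hook):

```javascript
// Given topic names that already exist in AWS, drop any matching
// AWS::SNS::Topic resources from the compiled CloudFormation template so
// the deploy does not try to recreate them.
function removeExistingTopics(template, existingTopicNames) {
  const resources = template.Resources || {};
  for (const [logicalId, resource] of Object.entries(resources)) {
    if (
      resource.Type === 'AWS::SNS::Topic' &&
      existingTopicNames.includes(resource.Properties.TopicName)
    ) {
      // Topic already lives in AWS (retained from a previous stack), so
      // remove it from this template; lambdas will reference it by ARN.
      delete resources[logicalId];
    }
  }
  return template;
}
```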

So, now we're checking on the fly, immediately before deploy, whether or not a topic already exists in AWS. This required a change in how we were setting up our SNS lambda functions, from:

taken from Serverless docs:
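In this style, Serverless creates the topic itself from the event definition; a minimal example in the spirit of the Serverless docs (function and topic names follow the docs' example):

```yaml
functions:
  dispatcher:
    handler: dispatcher.dispatch
    events:
      - sns: dispatch
```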

To the following style:

taken from Serverless docs:
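In this second style, the function subscribes to a pre-existing topic by its ARN; a minimal example in the spirit of the Serverless docs (the region, account ID, and topic name are placeholders):

```yaml
functions:
  dispatcher:
    handler: dispatcher.dispatch
    events:
      - sns: arn:aws:sns:us-east-1:XXXXXXXXXXXX:mytopic
```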

The second option allows you to subscribe to an existing topic by providing the topic's ARN. This presented two additional items we needed to solve: the first was anticipated, the second was not.

The first item was implementing an environment variable we call the “runtime stage”. Since each environment can have two stacks, env01 and env02, and all topics are named using env01, the env02 lambdas need to know to respond to the env01 topic's messages. So, instead of setting the topic name to ${stage_name}-topic-name, where the stage name would be either env01 or env02, we set it to ${runtime_stage}-topic-name, where the runtime stage would be env01 in both cases. For example, we have stag and stag02 environments. The runtime stage for both of these is stag, meaning the topic names match between the two environments and the frontend does not need to know which is currently active.
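One way to express the runtime stage in serverless.yml is a lookup table keyed by the deploy stage (a sketch with illustrative names; our actual mechanism may differ):

```yaml
custom:
  # Map each deploy stage to the stage used for topic names.
  runtimeStages:
    stag: stag
    stag02: stag
    prod: prod
    prod02: prod
  runtimeStage: ${self:custom.runtimeStages.${self:provider.stage}}

functions:
  handleDispatch:
    handler: handler.dispatch
    events:
      # Both stag and stag02 deployments subscribe to the stag topic.
      - sns: arn:aws:sns:us-east-1:XXXXXXXXXXXX:${self:custom.runtimeStage}-topic-name
```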

The unforeseen issue came from a bug within Serverless itself, issue 4637: SNS subscriptions and permissions do not (as of this writing) get generated automatically when you configure a lambda to respond to an existing SNS topic. Since this is a bug within Serverless and will eventually be fixed, we decided to solve the problem with two scripts rather than another plugin. These scripts run through the CloudFormation template, check whether or not the needed subscription and permission exist within the template, and create them as needed.
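The shape of that workaround can be sketched as follows (logical IDs and the helper name are illustrative, not our actual scripts): for each function wired to an existing topic ARN, make sure the template contains both the SNS subscription and the permission that lets SNS invoke the lambda.

```javascript
// Ensure the CloudFormation template wires a lambda to an existing SNS
// topic: an AWS::SNS::Subscription pointing the topic at the function,
// and an AWS::Lambda::Permission letting SNS invoke it.
function ensureSnsWiring(template, functionLogicalId, topicArn) {
  const resources = template.Resources;
  const subId = `${functionLogicalId}SnsSubscription`;
  const permId = `${functionLogicalId}SnsPermission`;

  if (!resources[subId]) {
    resources[subId] = {
      Type: 'AWS::SNS::Subscription',
      Properties: {
        Protocol: 'lambda',
        TopicArn: topicArn,
        Endpoint: { 'Fn::GetAtt': [functionLogicalId, 'Arn'] },
      },
    };
  }

  if (!resources[permId]) {
    resources[permId] = {
      Type: 'AWS::Lambda::Permission',
      Properties: {
        Action: 'lambda:InvokeFunction',
        Principal: 'sns.amazonaws.com',
        FunctionName: { 'Fn::GetAtt': [functionLogicalId, 'Arn'] },
        SourceArn: topicArn,
      },
    };
  }
  return template;
}
```

Because both additions are guarded by existence checks, the scripts are safe to run on every deploy, before and after the Serverless bug is fixed.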

The final piece was refining what we call our “Stack-swap” process. This allows us to stand up a stack with lambdas disabled by default (so lambdas responding to SNS, SQS, and scheduled events would not fire twice) and choose which environment the API should point to, env01 or env02. You can read more about that process and how it limits the downtime of our API here.