Notification sending pipeline at Flo

Ivan Zhamoidzin
Published in Flo Health UK
Aug 2, 2022

In this article, I’ll tell you how we built a service to send tens of millions of emails and push notifications to Flo app users daily.

Chart 1: Daily notification delivery

This is a story about decisions we made years ago that turned into architecture, how that architecture evolved, and where it led us.

Prologue

At Flo, we began considering email and push notifications as channels to nurture value, increase engagement, and retain churned users in early 2019. At the time, we had just moved all our services to AWS, including our backend for frontend server and database, a couple of fancy new microservices, and their operational storages, primarily designed for serving content via REST API. We were also in the process of building our own analytical pipelines and data warehouse.

We’d already had some experience launching a number of subsystems for recommending and delivering articles, chatbots, and internal social network digests. Content was an important and elaborate part of our system. We had abstractions like content items associated with predicates (i.e., custom domain-specific language expressions defining who the content is for).

Example 1: Content item predicate

Here, country, days_since_install, platform, and new_feature_ab_test are attributes of a user, and we can target content based on the following:

  • country and platform are regular attributes from the user’s device.
  • days_since_install is a dynamically evaluated attribute based on the date of installation.
  • new_feature_ab_test indicates the user’s participation in a product experiment, typically an A/B test of a new feature.
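For illustration, the sketch below renders such a predicate as a plain Python function over a user profile; the actual DSL syntax isn’t shown here, and the concrete countries, thresholds, and test group names are assumptions:

```python
# Hypothetical content item predicate expressed as a plain Python function.
# The real system uses a custom DSL; the attribute names mirror the ones above,
# while the concrete values are made up for illustration.
def new_feature_article_predicate(profile: dict) -> bool:
    return (
        profile.get("country") in {"US", "GB"}
        and profile.get("platform") == "ios"
        and profile.get("days_since_install", 0) >= 7
        and profile.get("new_feature_ab_test") == "test"
    )


# A user profile is just a bag of attributes kept in sync by the user profile service.
profile = {
    "country": "US",
    "platform": "ios",
    "days_since_install": 12,
    "new_feature_ab_test": "test",
}
assert new_feature_article_predicate(profile)
```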

Together, all the user attributes make up a user profile, one of the core abstractions in the system managed by a service of the same name. We’ll refer to it later.

We moved into sending push notifications with great ambition and vague requirements, aiming to build something promptly, experiment, and evolve the solution later.

Research

We discovered three major use cases:

  1. When the user performs an action, the system sends an instant (transactional) notification.
    Examples: welcome email, subscription win-back push notification
  2. When the user has done something in the past, and a certain amount of time has passed since then, the system sends a reengagement notification.
    Examples: new daily or weekly content reminder
  3. When the user does nothing, but the marketer launches a new campaign, the system notifies all appropriate users.
    Examples: push notification to support a new feature launch, churned user reactivation email, Black Friday sales campaign

Design

To support these use cases, we decided to extend our system with two new components: a notification scheduler and notification transport service. As an integration point, we picked the user profile change stream. The user profile service (as mentioned above) is responsible for keeping the attributes of all users in sync.

Drawing 1. System overview

The notification scheduler was designed to capture user profile updates caused by user interactions with the Flo app and decide which particular notification to send and when. That decision was to be made based on the user profile and the current notification setup, where every notification has a specific predicate targeting a particular audience.

The notification transport service was designed to translate notifications into each delivery medium’s specific format and deliver them.

Drawing 2. Notification scheduler and transport service

We solved the problem of scheduling and deferred notification delivery using the handy time-to-live (TTL) policy built into AWS DynamoDB. All we had to do was save a record similar to the one listed in example 2 and let DynamoDB delete it at the moment specified in delete_at. Record deletions were captured and propagated via the change stream, another built-in feature of DynamoDB. As depicted in drawing 2, this change stream became the second input source for the notification scheduler.

Example 2: Deferred push notification database record
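Assuming a DynamoDB table with TTL enabled on a delete_at attribute (which DynamoDB expects as Unix epoch seconds), writing such a deferred record with boto3 might look roughly like this; the table name and the fields other than delete_at are assumptions:

```python
import time

import boto3

# Hypothetical schedule table; TTL is assumed to be enabled on the
# "delete_at" attribute, which DynamoDB expects as Unix epoch seconds.
table = boto3.resource("dynamodb").Table("notification_schedule")

table.put_item(
    Item={
        "user_id": "user-123",                           # partition key (assumed)
        "notification_id": "weekly_digest",              # which notification to send
        "delete_at": int(time.time()) + 3 * 24 * 3600,   # deliver roughly 3 days from now
    }
)
# When DynamoDB eventually deletes the item, the deletion shows up in the
# table's change stream, which the notification scheduler consumes.
```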

Unfortunately, we later faced an unexpected issue: the accuracy of the AWS DynamoDB TTL mechanism. According to AWS documentation, TTL typically deletes expired items within 48 hours of expiration. With TTL rooted at the core of our system, this caused problems with the precision of scheduled notifications. We observed that the 99th-percentile delay of event actuation was often close to 24 hours. This led to notifications reaching a user too late or not at all. At scale, it culminated in reduced coverage of the target audience segments.

To overcome this issue, we decided to evolve our system by introducing an additional component: a notification trigger.

Drawing 3. Notification trigger

The notification trigger basically does the same thing as the built-in TTL mechanism:

  • Proactively queries the schedule database for notifications that are due
  • Deletes those notifications

Because the notification actuation is implemented as a deletion either by the trigger or the TTL mechanism, the rest of the pipeline remained the same.
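A rough sketch of such a trigger, reusing the hypothetical schedule table from the previous snippet; a production version would query a time-keyed index instead of scanning, but the idea is the same:

```python
import time

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("notification_schedule")  # assumed table name


def actuate_due_notifications() -> None:
    """Run on a cron schedule: find records whose TTL has passed and delete them."""
    now = int(time.time())
    # Look for items DynamoDB's own TTL sweep hasn't removed yet.
    # (A scan is used here only for brevity.)
    response = table.scan(FilterExpression=Attr("delete_at").lte(now))
    for item in response.get("Items", []):
        # The explicit deletion lands in the same change stream as a
        # TTL-driven deletion, so downstream processing is unchanged.
        table.delete_item(Key={"user_id": item["user_id"]})
```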

After we implemented the notification trigger, the median delay of an event actuation dropped to less than an hour (the moment of the rollout is depicted in chart 2); reach and coverage issues were fixed.

Chart 2. Actuation time discrepancy before and after notification trigger rollout

The major differences between the notification trigger and built-in TTL mechanism are actuation precision and costs. DynamoDB TTL utilizes spare cluster capacity when it’s available. That’s unpredictable, but free. The notification trigger relies on a determined cron schedule and consumes dedicated provisioned resources. That’s precise, but comes with additional costs.

Notification delivery timeline

Now let’s take a look at how the designed solution supports major use cases:

  1. Transactional notifications
  2. Reengagement notifications
  3. Mass campaigns

Drawing 4 depicts how the notification scheduler, given a set of notifications in advance, responds to user profile updates and immediately sends the appropriate notification. These are transactional notifications — the user does something and we respond instantly (e.g., send a welcome email).

Drawing 4. Transactional notification

Another scenario is depicted in drawing 5. When there is no notification to send ASAP, the most recent notification with a matching predicate is picked and scheduled to be delivered in the future. These are reengagement notifications — the user did something in the past, and after a certain amount of time has passed, they are notified.

Drawing 5. Reengagement notification

But what about when there is no appropriate transactional notification and nothing to schedule? What if a marketer releases a new notification targeting inactive users? Here comes the rule:

The processing of every incoming event, be it a user profile update or a scheduled notification actuated by TTL, should end by scheduling something next.

And here comes a new special type of event: ping. This is what we schedule for a user if there is no real notification in the foreseeable future. It emulates a time tick or a dummy user profile update.

Drawing 6. Ping
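Continuing with the same hypothetical schedule table, a ping can be represented as just another deferred record whose payload carries no real notification:

```python
import time

import boto3

table = boto3.resource("dynamodb").Table("notification_schedule")  # same assumed table as above

# A ping is a dummy scheduled event: when it is actuated (deleted), the user
# re-enters the event loop and any newly published notifications get a chance.
table.put_item(
    Item={
        "user_id": "user-123",
        "notification_id": "ping",                        # nothing to actually deliver
        "delete_at": int(time.time()) + 30 * 24 * 3600,   # assumed ping interval of ~30 days
    }
)
```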

Connecting all these scenarios together, we get an infinite event loop for every user.

Drawing 7. Infinite event loop

The loop has two major settings: scheduling horizon and ping interval.

  • The scheduling horizon describes how many days ahead we pick and schedule reengagement notifications.
  • The ping interval controls how often we emit a dummy user profile update for stale users, and therefore how quickly newly published notifications reach them.

These two preferences control the number of transactions in the system, the load on external components, and the fees we pay to AWS. They also influence elasticity and product characteristics, including how much time is required to apply new notification configurations to all users, especially churned ones. Setting reasonable parameter values and adjusting them to current needs are the keys to success.
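To make the loop concrete, here’s a rough, self-contained sketch of per-event processing using the two settings above; every name, type, and value here is an assumption for illustration, not Flo’s actual code:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

SCHEDULING_HORIZON = timedelta(days=7)   # how far ahead reengagement notifications are scheduled
PING_INTERVAL = timedelta(days=30)       # how often stale users receive a dummy "ping" event


@dataclass
class Notification:
    notification_id: str
    transactional: bool
    delay: timedelta  # zero for transactional, positive for reengagement

    def matches(self, profile: dict) -> bool:
        # Stand-in for evaluating the notification's predicate DSL.
        return profile.get("country") in {"US", "GB"}


def schedule(profile: dict, payload: str, at: datetime) -> None:
    # Stand-in for writing a record with delete_at=at into the schedule table.
    print(f"schedule {payload} for {profile['user_id']} at {at.isoformat()}")


def process_event(profile: dict, notifications: list[Notification], now: datetime) -> None:
    """Handle one event: a user profile update, an actuated notification, or a ping."""
    matching = [n for n in notifications if n.matches(profile)]

    # 1. Transactional: a matching notification is due right now, so send it immediately.
    for n in (m for m in matching if m.transactional):
        print(f"send {n.notification_id} to {profile['user_id']} now")

    # 2. Reengagement: schedule the nearest matching notification within the horizon.
    upcoming = [n for n in matching if not n.transactional and n.delay <= SCHEDULING_HORIZON]
    if upcoming:
        nearest = min(upcoming, key=lambda n: n.delay)
        schedule(profile, nearest.notification_id, now + nearest.delay)
        return

    # 3. Nothing in the foreseeable future: schedule a ping so the user stays in the loop
    #    and picks up notifications published later.
    schedule(profile, "ping", now + PING_INTERVAL)


process_event(
    profile={"user_id": "user-123", "country": "US"},
    notifications=[
        Notification("welcome_email", transactional=True, delay=timedelta(0)),
        Notification("weekly_digest", transactional=False, delay=timedelta(days=3)),
    ],
    now=datetime.now(timezone.utc),
)
```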

Conclusion

We managed to build a notification delivery pipeline based on the existing transactional database and its change stream. The solution avoids user segment materialization, so there’s no need to launch background workers that build user segments by querying data from analytics storage (which we didn’t have at the time of implementation).

The solution is streaming first; it scales horizontally up to a threshold set by the provisioned number of stream shards. It’s also feasible to control peak throughput, accumulate delayed events in Kinesis streams with several days of retention, and process them later without losing data or dropping users out of the loop.

Utilizing a lot of AWS-provided services lets us focus on implementing business logic instead of building additional email and push notification connectors. The amount of transformation logic required to translate messages into each delivery medium’s format was reasonably small.

One of the key learnings is related to the practical application of DynamoDB TTL combined with DynamoDB streams. It can be a good solution for low-cost deferred work coordination, although it has limits when it comes to ensuring demanding product requirements at scale. In such cases, you can either extend its capabilities, like we did with the notification trigger, or go further and build your own scheduler.

Product pros & cons

Pro: User centric

Each time we decide which notification to send to a user, we deal with the user and their context and conduct an auction for the user’s attention. It is a good place to implement notification rate limits, apply heuristic rules of prioritization, or experiment with machine learning.

Con: One-by-one loop nature

With the implemented system design, it’s not possible to immediately select all users who match a predicate. No one can publish a new notification, hit a “Send now” button, and trigger delivery in real time. Instead, we must publish the notification and wait for the system to eventually schedule and deliver the message. It may take days to be sure that all users have been reached. That’s an architectural limitation of the system and sometimes a cognitive challenge for marketers.
