Powering real-time magic moments through notifications

Rebuilding a notifications platform for the future

Published in Whatnot Engineering, Apr 16, 2024

Ronnie Gilkey and Brian Tombul | Growth Engineering

As a livestream shopping platform that fuses real-time shopping experiences and live streaming, Whatnot has been redefining the way users shop online socially. At the heart of this innovation lies a critical component: the notifications system. Whatnot’s real-time live shopping experience demands instantaneous alerts to inform users of livestream show starts, transaction updates, and social interactions. Notifications are a major way our users know when to return to the app to engage with their favorite products and communities.

Whatnot’s notification system had begun to reach the limits of its capacity to expand and meet the needs of the organization and our users. It started as a small class hierarchy in our main Python codebase, then moved to background tasks powered by RabbitMQ. As time passed, the logic and complexity of the operations grew along with the volume of notifications being sent. In this post, we share our journey building the third iteration of the platform, one we believe will scale for the long term, along with our considerations, hurdles, and lessons learned. Even in a short period, the investment has paid off substantially in developer velocity and business impact.

Image Credit to Holly Keable via Unsplash

Considerations

Our teams needed the ability to do things beyond just “sending a message to a user”. We needed analytics, customizations, experiments, advanced decision-making, feedback loops, and personalization, all packaged in a low-friction developer experience.

As we set out on the journey, we talked to different teams to understand their needs and came up with a list of important considerations in User Experience, Stability, Latency, and Throughput.

User Experience

First things first, our app users. We needed to make sure we could provide the right experience and level of engagement for our users from notifications. This meant things like:

What’s the right medium for a notification? Push, email, SMS, or in-app? This is an especially important consideration when planning for transactional or marketing notifications.
Quantity and quality of messages. The number of notifications, the quality of the content, and the relevance of the content to that user rose to the top of our list.
Protection from mishaps. We’re engineers and things will happen. We wanted to make sure there were additional protections to shield users from the unwanted side effects of our mistakes.

Stability, Latency, Throughput

System performance heavily impacts the user experience. As we started iterating over the various impacts and outcomes, there were several that bubbled up to the top:

How long does it take to get a notification through the pipeline? Can that be kept at or near constant time for multiple messages?
How do we want to defend against misbehaving clients — both internal processes and user interactions?
How can we create traffic lanes for messages of varying types/priority?
How can we ensure our failure domains are minimal when one component of the platform is faulty?
The system will need to be prepared to respond gracefully to the sporadic volume of our notification workloads as well as with the organic growth of the business.
What safety nets can we put in place to protect the user?

Many of these are actually low effort to implement, and we were able to incorporate all of them into the platform’s architecture.

Making a platform

We didn’t want to just make another module that did something; we wanted to create a platform.

Communication Experience

Our ultimate goal was to provide the best communication experience for our users. Practically this meant:

The users can receive messages through one or more channels (push, email, SMS, and eventually in-app)
They have control over which messages they receive and from which channels
They receive high priority, time-sensitive messages fast, and all other messages promptly
They don’t receive duplicate, outdated, or incorrect messages
The messages are relevant and useful to them

These goals translate into platform features such as:

Support for all channels
Multiple channel support for each message
Support for notification subscription settings
Support for different message and delivery priorities
Rate limiting and idempotency checks
A/B test experimentation support
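
As a rough illustration of the rate limiting and idempotency checks in the list above, here is a minimal in-memory sketch. The class name `SendGuard`, its parameters, and the in-memory storage are assumptions made for this example and are not taken from the platform itself; a production system would more likely keep this state in a shared store.

```python
from __future__ import annotations

import time
from collections import defaultdict, deque


class SendGuard:
    """Hypothetical guard: drops duplicates and caps sends per user per window."""

    def __init__(self, max_per_window: int, window_seconds: float) -> None:
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self._seen_keys: set[str] = set()
        self._sent_at: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, user_id: str, idempotency_key: str, now: float | None = None) -> bool:
        now = time.monotonic() if now is None else now
        # Idempotency: the same logical message is never sent twice.
        if idempotency_key in self._seen_keys:
            return False
        # Rate limit: forget sends that fell out of the sliding window.
        recent = self._sent_at[user_id]
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_per_window:
            return False
        recent.append(now)
        self._seen_keys.add(idempotency_key)
        return True
```

A caller would ask `guard.allow(user_id, key)` before sending and skip the send when it returns `False`. Passing `now` explicitly makes the window easy to exercise in tests.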

Developer Experience

We wanted to improve the developer experience in three areas: creating new messages, sending messages, and end-to-end testing. Our backend teams largely work in Python, which is also the language we utilize in our notification system. To keep the experience as seamless as possible, we planned to drop in the new framework next to the existing one and do a controlled migration.

In the existing system we used inheritance, mixins, and method overrides to define how a message behaves. This made it complicated to implement new messages, hard to understand how a message would behave, and hard to debug when it didn’t behave as developers expected. Testing was mostly done by mocking different parts of the system itself and verifying that they were called. We weren’t checking whether the message was rendered correctly, or whether another part of the system might prevent the message from being sent.

In the new platform, we decided to take a declarative approach. Explicitly defining the different aspects of each message in a single class would bring developers up to speed quickly and make it easy to understand how a message is going to behave. This allowed us to provide a much simpler interface for sending messages, and to hide the implementation of certain behaviors inside the platform so we can change it easily if needed. For testing, we decided to provide fixtures that make it possible to check whether a message would be sent and what its content would be.

class NewFollowerTemplateInput(TemplateInputType):
    follower_id: UserID
    followee_id: UserID


class NewFollowerTemplateData(TemplateDataType):
    follower_name: str
    followee_name: str


class NewFollower(MessageType[NewFollowerTemplateInput, NewFollowerTemplateData]):
    push_template = PushTemplate(
        body="Hi {followee_name}! {follower_name} started following you.",
    )

    def load_data(
        self, template_inputs: list[NewFollowerTemplateInput]
    ) -> list[NewFollowerTemplateData]:
        ...
    ...

def follow_user(follower_id: UserID, followee_id: UserID) -> None:
    ...  # User following logic

    # Notify the followee; template_inputs is a list, matching send_messages.
    NotificationsPlatform().send_messages(
        message_type=NewFollower,
        recipient_user_ids=[followee_id],
        template_inputs=[
            NewFollowerTemplateInput(
                follower_id=follower_id,
                followee_id=followee_id,
            )
        ],
    )

def test_follow_user(notifications_fixture):
    follower = create_user("Foo")
    followee = create_user("Bar")

    follow_user(follower.id, followee.id)

    ... # Verify user following logic

    assert notifications_fixture.push_notification_sent(
        body="Hi Bar! Foo started following you.",
        recipient_id=followee.id,
    )
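
The fixture itself is not shown in this post. One way such a fixture could work, sketched here under our own assumptions, is to replace the provider client with a recorder that captures rendered messages. The names `RecordedPush`, `NotificationsRecorder`, and `record_push` are hypothetical; only `push_notification_sent` mirrors the call used in the test above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RecordedPush:
    recipient_id: int
    body: str


@dataclass
class NotificationsRecorder:
    """Hypothetical test double that captures rendered push messages."""

    pushes: list[RecordedPush] = field(default_factory=list)

    def record_push(self, recipient_id: int, body: str) -> None:
        # Called in place of the real provider client during tests.
        self.pushes.append(RecordedPush(recipient_id, body))

    def push_notification_sent(self, body: str, recipient_id: int) -> bool:
        # True only if the exact rendered body reached this recipient.
        return RecordedPush(recipient_id, body) in self.pushes
```

In a pytest suite, an object like this could be returned from a fixture function so that each test starts with an empty recorder.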

Observability & Analytics

Observability and analytics were a top goal for us. We had many pain points in the existing system and we didn’t want to repeat history. We set our sights on the following objectives:

Consistency!
– We needed a centralized methodology, naming conventions, and strategy to follow, to prevent ad-hoc growth that deviated too far from the norms
Operationally
– Rates, errors, and durations (RED metrics) in all the places
– Measuring various outcomes for activities
– Logging, but only when it was required to get the information we couldn’t from metrics
Analytically
– We needed to be producers of analytics events. We’d been consuming some inference data for decision-making, but understanding what was happening inside the system and correlating it with user engagement had previously been a gap.
– We needed a strategy for marrying frontend and backend log events together
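
To make the RED metrics objective above concrete, here is a minimal in-memory sketch of recording rate, errors, and duration around an operation. The `RedMetrics` class and the operation name used in the usage note are assumptions for illustration; a real setup would emit these values to a metrics backend instead of keeping them in dictionaries.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class RedMetrics:
    """Hypothetical in-memory recorder for rate, errors, and duration."""

    def __init__(self) -> None:
        self.requests: dict[str, int] = defaultdict(int)
        self.errors: dict[str, int] = defaultdict(int)
        self.durations: dict[str, list[float]] = defaultdict(list)

    @contextmanager
    def measure(self, operation: str):
        start = time.perf_counter()
        self.requests[operation] += 1
        try:
            yield
        except Exception:
            self.errors[operation] += 1
            raise  # Re-raise so callers still see the failure.
        finally:
            # Duration is recorded for successes and failures alike.
            self.durations[operation].append(time.perf_counter() - start)
```

Wrapping a step as `with metrics.measure("pipeline.render"): ...` keeps the naming in one place, which is one way to support the consistency goal.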

Mashing the keyboard

Plumbing

The new platform consists of three major parts: the API, the pipeline, and the dispatchers. Three major data types are used throughout the platform: message types, messages, and channel messages.

Data types

We wanted to take a declarative approach to implementing messages in the new platform. Message types are where this happens. Each message type consists of one or more channel templates, channel-specific parameters, behavior configurations such as delivery priority, TTL, and rate limiting, and a method to load the data necessary to render the templates.

class MessageType(Generic[_TTemplateInputType, _TTemplateDataType]):
    push_template: PushTemplate | None = None
    email_template: EmailTemplate | None = None
    sms_template: SMSTemplate | None = None

    ttl: timedelta | None = None
    rate_limit_config: RateLimitConfig | None = None

    def load_data(
        self, template_inputs: list[_TTemplateInputType]
    ) -> list[_TTemplateDataType | None]:
        ...

Messages are created based on the message types for each recipient, and they contain everything that’s needed to render a message for each recipient: the recipient details, template input, message variant name if there is an experiment, etc.

class Message:
    message_type: MessageType
    recipient: MessageRecipient
    selected_channels: list[MessageChannel]
    template_input: TemplateInputType
    experiment: str | None
    selected_variant_name: str
    ...
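
The post does not describe how a variant is chosen for a message. One common approach, shown here purely as an assumption and not as the platform's method, is to hash the experiment name together with the user ID so the same user always lands in the same variant. The function `select_variant` is hypothetical.

```python
import hashlib


def select_variant(experiment: str, user_id: int, variants: list[str]) -> str:
    """Deterministically map a user to one of the variants of an experiment."""
    if not variants:
        raise ValueError("at least one variant is required")
    # Hash experiment and user together so buckets differ across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```

Because the assignment depends only on its inputs, repeated sends to the same user within one experiment would render the same template variant.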

Channel messages are rendered messages for each recipient for each channel. They contain all the information needed to call the provider APIs to send messages.

class PushMessage(ChannelMessage):
    title: str | None
    body: str
    link: str | None
    image: str | None
    badge: int | None

class EmailMessage(ChannelMessage):
    subject: str
    body: str

class SMSMessage(ChannelMessage):
    body: str

API

This is what the users of the platform interact with. It provides a single type-safe method that takes in the message type, the list of recipients, and the template inputs for each of those recipients. It validates the input, splits the recipients into batches, and kicks off separate pipeline Celery tasks for each batch.

class NotificationsPlatform:
    def send_messages(
        self,
        message_type: MessageType,
        recipient_user_ids: list[UserID],
        template_inputs: list[TemplateInputType],
    ) -> SendMessagesResponse:
        ...
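
The validate, split, and enqueue steps described above could look roughly like the sketch below. It assumes one template input per recipient, an arbitrary default batch size of 500, and a caller-supplied `enqueue` callable standing in for the Celery task; the helper names `split_into_batches` and `send_in_batches` are hypothetical.

```python
from collections.abc import Callable, Iterator, Sequence


def split_into_batches(
    user_ids: Sequence[int], inputs: Sequence[dict], batch_size: int
) -> Iterator[tuple[Sequence[int], Sequence[dict]]]:
    """Yield aligned (recipients, inputs) slices of at most batch_size items."""
    if len(user_ids) != len(inputs):
        raise ValueError("each recipient needs exactly one template input")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(user_ids), batch_size):
        yield user_ids[start : start + batch_size], inputs[start : start + batch_size]


def send_in_batches(
    user_ids: Sequence[int],
    inputs: Sequence[dict],
    enqueue: Callable[[Sequence[int], Sequence[dict]], None],
    batch_size: int = 500,
) -> int:
    """Validate, split, and hand each batch to `enqueue`; return the batch count."""
    count = 0
    for batch_ids, batch_inputs in split_into_batches(user_ids, inputs, batch_size):
        # In a Celery setup, `enqueue` would wrap something like task.delay(...).
        enqueue(batch_ids, batch_inputs)
        count += 1
    return count
```

Keeping recipients and inputs aligned inside each batch means every pipeline task can be processed independently of the others.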

Pipeline

This is where most of the processing happens. It is implemented as layers that have different responsibilities. Each layer takes a batch of messages and modifies them based on its responsibility. The platform provides certain layers that are executed by default:

ExpirationFilter: Checks if the message expired based on the TTL defined in the message type
RecipientLoader: Loads the recipient details, i.e. email addresses, phone numbers, push notification tokens
ChannelSelector: Selects channels for each message based on the recipient detail availability, e.g. select only email if there is only an email address
NotificationSettingsFilter: Filters out the recipients if they unsubscribed from the message type
ExperimentLayer: Selects template variants based on the experiment configuration defined in the message type
TemplateDataLoader: Loads the data necessary to render each message
Renderer: Renders the templates; this is where the channel messages are created
RateLimiter: Rate limits the messages based on the rate limit configuration defined in the message type

Data flows as follows: [Figure: data flow through the pipeline layers]

It is also possible to implement custom layers if warranted. For example, certain message types implement custom RecipientFilters that use machine learning models to determine whether a recipient should receive a given message.

Once the message goes through all the layers, the rendered channel messages are passed to the dispatcher tasks.
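
The layered design can be sketched as a list of objects that each take a batch and return a possibly smaller or modified batch. The `Msg` dataclass below is a simplified stand-in for the platform's `Message` type, and the constructor arguments, the `process` method name, and `run_pipeline` are assumptions for illustration; only the layer names come from the list above.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Protocol


@dataclass
class Msg:
    """Simplified stand-in for the platform's Message type."""

    recipient_id: int
    message_type: str
    created_at: datetime
    ttl: timedelta | None = None


class Layer(Protocol):
    def process(self, batch: list[Msg]) -> list[Msg]: ...


class ExpirationFilter:
    def __init__(self, now: datetime) -> None:
        self.now = now

    def process(self, batch: list[Msg]) -> list[Msg]:
        # Keep messages with no TTL or whose TTL has not elapsed yet.
        return [m for m in batch if m.ttl is None or self.now - m.created_at <= m.ttl]


class NotificationSettingsFilter:
    def __init__(self, unsubscribed: set[tuple[int, str]]) -> None:
        self.unsubscribed = unsubscribed

    def process(self, batch: list[Msg]) -> list[Msg]:
        # Drop messages whose recipient opted out of this message type.
        return [m for m in batch if (m.recipient_id, m.message_type) not in self.unsubscribed]


def run_pipeline(batch: list[Msg], layers: list[Layer]) -> list[Msg]:
    for layer in layers:
        batch = layer.process(batch)
        if not batch:  # Nothing left to send, so stop early.
            break
    return batch
```

A custom layer only needs a compatible `process` method, which is what makes it possible to slot in message-specific filters without touching the default layers.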

Dispatchers

The dispatchers are responsible for converting the channel messages to the formats required by providers, such as Firebase Cloud Messaging, Sendgrid, and Twilio, and calling their APIs. They are also responsible for processing the response and handling retries if needed.

We use separate Celery queues for each message and delivery priority. Not all messages are equal. For example, a push notification for an auction that has just started should be delivered as soon as possible, while a large email campaign can take longer and shouldn’t block other higher-priority messages. This also isolates the providers from each other: if an SMS provider is having issues, that doesn’t impact email or push notification deliveries.

class EmailDispatcher(Dispatcher):
    ...
    def send(
        self, email_message: EmailMessage, recipients: list[MessageRecipient],
    ) -> list[DispatchResult]:
        results = []
        for recipient in recipients:
            message = self.build_message(email_message, recipient)
            response = self.email_client.send(message)
            result = self.process_response(response)
            results.append(result)
        self.retry_failures(results)
        return results
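
The queue separation described above could be expressed as a small naming helper. The priority levels, the `notifications.` prefix, and the function `dispatch_queue_name` are assumptions for this sketch; the post does not list the actual queue names.

```python
from enum import Enum


class Channel(str, Enum):
    PUSH = "push"
    EMAIL = "email"
    SMS = "sms"


class Priority(str, Enum):
    HIGH = "high"
    DEFAULT = "default"
    LOW = "low"


def dispatch_queue_name(channel: Channel, priority: Priority) -> str:
    """Build a hypothetical queue name such as 'notifications.push.high'."""
    return f"notifications.{channel.value}.{priority.value}"


# With Celery, the name could be passed when enqueueing a task, for example
# dispatch_task.apply_async(args=[...], queue=dispatch_queue_name(...)),
# where dispatch_task is a placeholder for the real dispatcher task.
```

Giving every channel and priority pair its own queue means a backlog of low-priority email cannot delay a high-priority push, and a failing provider only backs up its own queues.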

Analytics

Throughout the platform’s evolution, a diligent approach to data reporting and analysis was pivotal. Early adoption of RED metrics and logs enabled proactive performance adjustments, while subsequent focus on comprehensive analytics addressed previous blind spots, providing insights into user behaviors like notification preferences and engagement patterns.

With a refined data schema in place, efforts shifted to thorough instrumentation of core decision points and satellite events, ensuring the availability of crucial data for enhancing machine learning processes. Leveraging this data, the team embarked on multiple fronts, including the development of dashboards for monitoring messaging outcomes and exploring possibilities for integrating data into experimentation systems. Operational enhancements were solidified with the incorporation of Service Level Objectives (SLOs), enabling the observation of key user journeys and ensuring a top-tier customer experience by aligning metrics with operational objectives.

Launching, darkly

We use dynamic configurations to do staged rollouts of our new features. We also added a “dark launch” feature where messages go through the platform normally, but get dropped just before calling the provider APIs. Using these two features, we were able to verify that the new platform could handle the load, and we could safely roll out new messages on the new platform.
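
A dark launch switch of this kind can be sketched as a thin wrapper around the final send. The class `DarkLaunchDispatcher`, its fields, and the `is_dark_launch` callable (standing in for a dynamic configuration lookup) are hypothetical names for illustration.

```python
from collections.abc import Callable
from dataclasses import dataclass, field


@dataclass
class DarkLaunchDispatcher:
    """Hypothetical wrapper that skips the provider call when dark launch is on."""

    send_to_provider: Callable[[str, str], None]
    is_dark_launch: Callable[[str], bool]
    dropped: list[tuple[str, str]] = field(default_factory=list)

    def dispatch(self, message_type: str, recipient: str, body: str) -> bool:
        if self.is_dark_launch(message_type):
            # The message was fully processed; only the final send is skipped.
            self.dropped.append((recipient, body))
            return False
        self.send_to_provider(recipient, body)
        return True
```

Because everything upstream still runs, the pipeline sees realistic load and the dropped messages can be inspected before the flag is turned off for a given message type.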

Outcomes

With the implementation of the new notification platform, we’ve seen several gains within the organization. We ran experiments on notifications and were able to launch several features that improved DAU by 1.2% and first-time buyers by 7.9%. We increased engineering velocity for teams who migrated their notifications to the new platform or built new ones on it. We reduced unexpected outcomes by providing a testing framework that allowed frictionless verification. System reliability and deliverability increased. We’ve enabled meaningful data analysis for consumers of the platform and our product teams. And, most importantly, our internal and external users have come to rely on the system with greater confidence.

If you enjoyed reading about building our notifications platform and want to work on exciting projects at a fast-growing startup, check out Whatnot’s career page today!