Building a Distributed Webhook System

Mohit
5 min read · May 23, 2022

Background

Phyllo is a data gateway to access creator data from hundreds of source platforms like YouTube, Twitter, TikTok, Substack, Discord, Twitch, OpenSea, Shopify, and others. Phyllo builds the underlying infrastructure that connects to every creator platform, maintains a live data feed to the systems used by these platforms to manage creator’s data, and provides a normalized data set so that businesses can make use of creator data in a way that is simple yet impactful.

The creators connect their platform accounts using Phyllo’s Connect SDK. They provide consent to fetch the data from these platforms and then Phyllo streams the data for each creator from these platforms.

Developers need the latest creator data to make critical business decisions, so we notify them on events such as:

  • when the creator’s account gets connected & synced for the first time
  • when any data updates are available for the account
  • when there is any change in account connection status.

Phyllo has several hundred thousand connected creator accounts whose data is updated frequently. The data change needs to be propagated back to developers in near real-time. Developers can also poll our APIs directly, but this results in high network loads and is an inefficient use of resources.

To solve this problem, Phyllo implemented webhooks, which deliver messages efficiently and reliably whenever a data change is available. Webhooks protect our infra by reducing the number of API calls, and they make client integration simpler because developers don’t have to implement periodic polling.

Problem Statement

Since the launch of the product, the number of connected accounts has been increasing exponentially. As a result, we have to sync a large number of connected accounts daily, and the number of webhook requests has grown exponentially as well.

Key challenges we encountered during the process:

  • Failures: Developers’ webhook URLs do not always respond correctly and sometimes return errors.
  • Reliability: Sending millions of webhooks in real-time is complex and error-prone.
  • Security: Developers want security measures built in to trust the incoming webhook notifications.

In this blog post, we’ll be walking you through some of the learnings and design choices that helped in solving these challenges at Phyllo.

High-Level System Architecture

Figure 1: HLD for webhook systems

How We Made It Reliable and Fault-Tolerant

We aim to build the system with 100% reliability. It should be able to support all kinds of businesses and developers irrespective of their size and expertise in building the systems.

The failure scenarios that the design has to handle:

  • The developer’s webhook endpoint returns a 5xx error. Their system might be down or encountering issues for reasons beyond our control. We consider this a delivery failure.
  • The endpoint takes longer than the predefined timeout to process the request.
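The two failure conditions above can be sketched as a small classification function. This is only an illustration: the function name, the 10-second timeout, and the handling of a missing response are assumptions, not values from Phyllo's system.

```python
from typing import Optional

# Hypothetical timeout; the article does not state the actual value.
TIMEOUT_SECONDS = 10.0


def is_delivery_failure(
    status_code: Optional[int],
    elapsed_seconds: float,
    timeout_seconds: float = TIMEOUT_SECONDS,
) -> bool:
    """Return True when a delivery attempt should be treated as failed.

    status_code is None when no HTTP response was received at all.
    """
    if status_code is None:
        # Assumption: no response at all is handled like a failure.
        return True
    if elapsed_seconds > timeout_seconds:
        # The endpoint took longer than the allowed processing time.
        return True
    # Any 5xx response counts as a delivery failure.
    return 500 <= status_code <= 599
```

A message for which this returns True would be handed to the retry mechanism instead of being acknowledged as delivered.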

In this section, we describe the architectural design decisions that helped us solve these challenges.

Figure 2: Handle Failures In Webhook

Selecting a message broker

We rely heavily on AWS for our infra, so we wanted a managed service that is highly available and supports message timeout (TTL) functionality. We evaluated ActiveMQ, and it fulfilled these conditions.

Handling Errors

Webhooks must be delivered reliably. To achieve this, we designed a retry mechanism that takes care of the failure scenarios. It uses an exponential backoff strategy and a retry policy intended to cover 99% of our developers’ failures.

RETRY POLICY:

We have created 5 queues.

  1. MAIN QUEUE: The core message queue that receives the signal to send the webhooks. Each message carries metadata in its header, including a retry count (x-retry) that is incremented whenever the webhook delivery fails.
  2. RETRY QUEUE 1: Messages that fail for the first time are published here. The queue is configured with a TTL of 5 minutes per message. Once the TTL expires, the message is sent back to MAIN QUEUE, which is configured as this queue’s Dead Letter Queue, and the delivery is retried.
  3. RETRY QUEUE 2: Messages that fail for the second time are published here. The queue is configured with a TTL of 60 minutes per message. Once the TTL expires, the message is sent back to MAIN QUEUE, which is configured as this queue’s Dead Letter Queue.
  4. RETRY QUEUE 3: Messages that fail for the third time are published here. The queue is configured with a TTL of 360 minutes per message. Once the TTL expires, the message is sent back to MAIN QUEUE, which is configured as this queue’s Dead Letter Queue.
  5. ALERT QUEUE: Messages that fail for the fourth time are published here. We send an automated email alert to notify the developers about the fault in their system. We also notify our customer success team so that they can help educate the developers.
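The routing rules of the retry policy can be sketched as a pure function over the message headers. The queue names, TTLs, and the x-retry header come from the list above; the function name and the dictionary representation of headers are illustrative assumptions, and no real broker is involved.

```python
from typing import Dict, Optional, Tuple

# Failure count -> (queue name, TTL in minutes), taken from the retry policy.
RETRY_QUEUES: Dict[int, Tuple[str, int]] = {
    1: ("RETRY QUEUE 1", 5),
    2: ("RETRY QUEUE 2", 60),
    3: ("RETRY QUEUE 3", 360),
}
ALERT_QUEUE = "ALERT QUEUE"


def route_failed_message(
    headers: Dict[str, int],
) -> Tuple[str, Optional[int], Dict[str, int]]:
    """Increment x-retry and pick the destination for a failed message.

    Returns (queue name, TTL in minutes or None, updated headers).
    """
    failures = headers.get("x-retry", 0) + 1
    new_headers = {**headers, "x-retry": failures}
    if failures in RETRY_QUEUES:
        queue, ttl_minutes = RETRY_QUEUES[failures]
        return queue, ttl_minutes, new_headers
    # Fourth (or later) failure: stop retrying and raise an alert instead.
    return ALERT_QUEUE, None, new_headers
```

With these TTLs, a message that keeps failing waits 5 + 60 + 360 = 425 minutes in retry queues, a little over seven hours, before it reaches the alert queue.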

How We Secured it

Supporting the latest TLS security protocols

Older versions of TLS have known security concerns. The Phyllo production system does not support TLS 1.0 or 1.1, so integrations should use a newer version of the TLS protocol.
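As a rough sketch of what this means for an integration, a webhook receiver that terminates TLS in Python itself could refuse TLS 1.0 and 1.1 as shown below. The function name is hypothetical, and many deployments would configure this on a load balancer or reverse proxy instead.

```python
import ssl


def make_server_context() -> ssl.SSLContext:
    """Build a server-side TLS context that rejects TLS 1.0 and 1.1."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Only negotiate TLS 1.2 or newer with connecting clients.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # A real server would also load its certificate and key here with
    # context.load_cert_chain(...); the paths depend on the deployment.
    return context
```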

Providing a set of webhook IPs

Developers need to trust the Phyllo system before accepting messages. We provide a list of all the IPs we use to deliver webhooks, and developers can whitelist these IPs.
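On the receiving side, such a check might look like the sketch below. The addresses are placeholders from the reserved documentation range, not Phyllo's real webhook IPs, and the function name is an assumption.

```python
import ipaddress

# Placeholder addresses from a reserved documentation range;
# these are NOT Phyllo's real webhook IPs.
ALLOWED_WEBHOOK_IPS = frozenset(
    ipaddress.ip_address(ip) for ip in ("203.0.113.10", "203.0.113.11")
)


def is_allowed_sender(remote_ip: str) -> bool:
    """Return True if the request came from a whitelisted webhook IP."""
    try:
        return ipaddress.ip_address(remote_ip) in ALLOWED_WEBHOOK_IPS
    except ValueError:
        # Malformed address strings are never trusted.
        return False
```

If the receiver sits behind a proxy, the address to check is the original client IP as reported by that proxy, not the proxy's own address.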

Generating Hashed Signature

We generate a hash signature with each payload.

**X-Phyllo-Signature**: The hash signature is calculated using HMAC with the SHA256 algorithm, with your webhook secret set as the key and the webhook request body as the message.

key                = webhook_secret
message            = webhook_body        // raw webhook request body
received_signature = webhook_signature
expected_signature = hmac('sha256', message, key)

if expected_signature != received_signature
    throw SecurityError
end
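The same check can be written in Python with the standard library. This is a sketch under one notable assumption: the signature is taken to be the hex-encoded digest, which the article does not specify, so the encoding should be confirmed against Phyllo's documentation. The function names are illustrative.

```python
import hashlib
import hmac


class SecurityError(Exception):
    """Raised when a webhook signature does not match."""


def compute_signature(webhook_secret: str, webhook_body: bytes) -> str:
    """HMAC-SHA256 of the raw body, keyed with the webhook secret."""
    # Assumption: the signature is the hex-encoded digest.
    return hmac.new(
        webhook_secret.encode("utf-8"), webhook_body, hashlib.sha256
    ).hexdigest()


def verify_signature(
    webhook_secret: str, webhook_body: bytes, received_signature: str
) -> None:
    """Raise SecurityError unless the received signature is valid."""
    expected_signature = compute_signature(webhook_secret, webhook_body)
    # compare_digest runs in constant time, which avoids leaking
    # information about the expected value through timing.
    if not hmac.compare_digest(
        expected_signature.encode("utf-8"), received_signature.encode("utf-8")
    ):
        raise SecurityError("webhook signature mismatch")
```

The body must be the raw bytes as received. Re-serializing parsed JSON can change whitespace or key order and produce a different signature.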

Conclusion

We launched this system into production in December last year. We have been optimizing it regularly and are seeing much better results now.

As of now, Phyllo is serving 3 million webhook requests daily with 100% reliability.

If this article is helpful, follow me to see my next articles.

Mohit

Founder @Stealth | Prev @Co-Founder @getphyllo.com