Sending Millions of Webhooks in a Smart Way

Furkan Kahveci
Insider Engineering
6 min read · Jul 24, 2023
Operation of our Hook Service

In today’s software landscape, webhooks play a crucial role in enabling real-time data integration and event-driven workflows. However, effectively managing a webhook service comes with its own set of challenges.

In this article, we explore the complexities of webhook management and aim to build a robust and efficient webhook infrastructure.

Our Hook service is an essential communication layer that enables applications to receive real-time notifications, via their endpoints, about data entering our system. In practice, it transmits HTTP requests to any application that wants to be informed about that data.

In the beginning, our hook system consisted of a single service that processed hooks and sent them to the corresponding endpoint if they were ready to be sent. However, having both tasks handled by a single service resulted in a slower delivery of hooks, bottlenecks, and increased complexity in management.

Therefore, we decided to split the functionality of our hook service into two separate services: the Hook Processor and the Hook Consumer.

The Hook Processor is responsible for checking whether the data entering the system meets specific conditions and is suitable for sending a hook. This ensures that only relevant and critical information is sent to the applications. The Hook Processor acts as the gateway: it analyzes each data input and decides whether to forward it.

The Hook Consumer, on the other hand, is responsible only for sending requests with the desired payload to the endpoints. It acts as the final step of the pipeline, ensuring that each application receives only the data meant for it.

Splitting the Hook service in two gives us a more streamlined and efficient system: each service performs one specific function, which makes scaling and maintenance easier. This way, the Hook service can absorb a large number of data inputs while efficiently forwarding only the relevant ones to applications.
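To make the split concrete, here is a rough sketch of the two services in Go, assuming NSQ as the transport (which we describe later). The HookEvent shape, the topic name, and the shouldSendHook check are illustrative placeholders, not our production code:

package hooks

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"

	nsq "github.com/nsqio/go-nsq"
)

// HookEvent is an illustrative message format; the field names are assumptions.
type HookEvent struct {
	Endpoint string          `json:"endpoint"`
	Payload  json.RawMessage `json:"payload"`
}

// shouldSendHook stands in for the Processor's condition checks.
func shouldSendHook(e HookEvent) bool { return len(e.Payload) > 0 }

// Hook Processor: filter incoming data and enqueue the qualifying hooks.
func processEvent(p *nsq.Producer, e HookEvent) error {
	if !shouldSendHook(e) {
		return nil // not relevant, no hook is sent
	}
	body, err := json.Marshal(e)
	if err != nil {
		return err
	}
	return p.Publish("hooks", body) // "hooks" is an illustrative topic name
}

// Hook Consumer: deliver one queued hook to its endpoint. Returning an
// error tells NSQ to requeue the message for a retry. In a real consumer
// this would be registered via consumer.AddHandler(nsq.HandlerFunc(handleHook)).
func handleHook(m *nsq.Message) error {
	var e HookEvent
	if err := json.Unmarshal(m.Body, &e); err != nil {
		return err
	}
	resp, err := http.Post(e.Endpoint, "application/json", bytes.NewReader(e.Payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 400 {
		return fmt.Errorf("endpoint returned %d", resp.StatusCode)
	}
	return nil
}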

The Challenges

We don’t grow when things are easy; we grow when we face challenges.

There are two main challenges that we face with the Hook service, which we need to address to ensure its effectiveness and reliability.

The first challenge is related to the unpredictable data traffic generated by the insertion of data into the system. With thousands of endpoints in use, we may need to handle millions of requests every day. It is critical to ensure that these requests are handled efficiently and intelligently to avoid system overload.

The second challenge is related to the fact that every endpoint that the Hook system communicates with is a potential single point of failure. Any problems with an endpoint, such as being down or having a slow response time, can disrupt the system’s overall flow and cause it to malfunction.

The Solution

The problems mentioned above have a few existing solutions, one of which is to use Amazon SQS and AWS Lambda: store the requests we want to send to endpoints in SQS and process them with Lambda functions. However, given the high volume of data entering our system, there are times when we send too many hooks at once, which makes this solution very costly.

The other, and most important, reason this approach is costly is that if a destination endpoint cannot be reached, the Lambda function keeps running until it times out. This leads to unusually long execution times, and because Lambda operates on a pay-per-use model, charging for each invocation as well as for its duration, the cost rises sharply.

Therefore, we need a smarter prioritization mechanism that categorizes endpoints based on their performance and the number of requests made to them. This way, the priority of poorly performing endpoints can be reduced, while the priority of well-performing endpoints can be increased. The number of requests made to endpoints will also be one of the factors affecting prioritization.

How do we prioritize hook queues and endpoints?

We use NSQ as our message queue system, a distributed messaging platform written in Go and created at Bitly.

In the beginning, our webhook system consisted of a single queue where all the hooks that needed to be sent were placed and later consumed. However, since we did not prioritize the hooks, poorly performing or unreachable endpoints caused the queue to become clogged. To address this, we decided to create five separate queues and prioritize them based on their importance.

When all hooks shared a single queue, slow or unresponsive endpoints dragged down the whole webhook system and delayed the sending of hooks. This was not ideal, as some hooks are time-sensitive and require immediate delivery. By prioritizing our queues, we were able to ensure that the highest-priority hooks were sent first, keeping the delivery of important hooks timely.

To manage the order in which these five queues are consumed, we use the priority library (https://github.com/c3mb0/priority).

Thanks to the author of the priority library, we can ensure that requests are processed in the correct order and that critical requests take precedence. This keeps our Hook service efficient and effective, able to handle large volumes of requests without becoming overloaded or overwhelmed.

We have organized our queues in ascending order of priority, from Queue 0 (lowest priority) at the beginning to Queue 4 (highest priority) at the end.

Queue Consumption in Order of Priority
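To make the consumption order concrete, here is a minimal sketch of a strict-priority drain. It does not reproduce the priority library's actual API; in-memory Go channels stand in for the five NSQ queues purely to illustrate the idea:

// Hook is a simplified stand-in for a queued request.
type Hook struct {
	Endpoint string
	Payload  []byte
}

// nextHook drains five queues in strict priority order: queues[4] is the
// highest priority and is always checked first, queues[0] the lowest.
func nextHook(queues [5]chan Hook) (Hook, bool) {
	for i := len(queues) - 1; i >= 0; i-- {
		select {
		case h := <-queues[i]:
			return h, true
		default: // this queue is empty; fall through to the next one
		}
	}
	return Hook{}, false // nothing to send right now
}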

Endpoint Scoring

To prioritize endpoints, the system scores each one every minute based on its performance and its request volume. The performance score carries 1.5 times the weight of the volume score. Each endpoint is then sent to the appropriate queue for consumption based on its score.
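Operationally, this is a simple periodic job. A minimal sketch, assuming a rescore callback that recomputes the scores described below and moves each endpoint to its new queue:

// Run the scoring job once a minute (uses the standard time package).
func runScoring(rescore func()) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		rescore() // recompute scores and re-route endpoints, as sketched below
	}
}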

Performance Categorizing

When categorizing endpoints based on performance, response times are taken into consideration. If an endpoint has a high response time, it receives the lowest score in the performance category. Conversely, endpoints with low response times receive the highest scores. P99 latency data is used to score response times, with a scoring system based on the following ranges:

0 - 10 ms      --> score 6
10 - 30 ms     --> score 4.5
30 - 50 ms     --> score 3
50 - 100 ms    --> score 1.5
over 100 ms    --> score 0
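Expressed as code, the mapping is a simple bucketing function (a sketch; how values exactly on a cutoff are bucketed is an implementation detail, chosen arbitrarily here):

// Map an endpoint's p99 latency (in milliseconds) to its performance score.
func performanceScore(p99Ms float64) float64 {
	switch {
	case p99Ms < 10:
		return 6
	case p99Ms < 30:
		return 4.5
	case p99Ms < 50:
		return 3
	case p99Ms < 100:
		return 1.5
	default: // over 100 ms
		return 0
	}
}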

Volume Categorizing

To categorize endpoints based on volume, the system considers the number of requests sent to each endpoint. Endpoints that exceed the threshold for the number of requests receive the lowest score to distribute resources fairly. Endpoints that receive fewer requests receive the highest scores. The scoring system for volume is based on the following ranges:

0 - 500 requests       --> score 4
500 - 1000 requests    --> score 3
1000 - 2000 requests   --> score 2
2000 - 5000 requests   --> score 1
over 5000 requests     --> score 0
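The same pattern applies for volume (again a sketch, with the same caveat about exact cutoffs):

// Map the number of requests recently sent to an endpoint to its volume score.
func volumeScore(requests int) float64 {
	switch {
	case requests < 500:
		return 4
	case requests < 1000:
		return 3
	case requests < 2000:
		return 2
	case requests < 5000:
		return 1
	default: // over 5000 requests
		return 0
	}
}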

We combine the performance and volume scores to give each endpoint a total score. Based on the total score, we assign the endpoint to the appropriate queue.

total score 0 - 2    --> Queue 0 (low performance, high volume)
total score 2 - 4    --> Queue 1
total score 4 - 6    --> Queue 2
total score 6 - 8    --> Queue 3
total score 8 - 10   --> Queue 4 (high performance, low volume)
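Putting the pieces together with the performanceScore and volumeScore sketches above:

// Combine the two scores and pick a queue. The 1.5x performance weighting
// is already baked into the score ranges (performance tops out at 6,
// volume at 4), so the total is a plain sum between 0 and 10.
func assignQueue(p99Ms float64, requests int) int {
	total := performanceScore(p99Ms) + volumeScore(requests)
	switch {
	case total < 2:
		return 0 // low performance, high volume
	case total < 4:
		return 1
	case total < 6:
		return 2
	case total < 8:
		return 3
	default:
		return 4 // high performance, low volume
	}
}

For example, an endpoint with a 12 ms p99 latency and 300 recent requests scores 4.5 + 4 = 8.5 and lands in Queue 4, while one with a 150 ms p99 and 6000 requests scores 0 and stays in Queue 0.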

Conclusion

This scoring gives us a smart prioritization mechanism: high-performing endpoints are served first, while poorly performing ones are deprioritized. It also keeps the queuing system from getting clogged, since requests to unavailable endpoints are redirected to the lowest-priority queue.

Overall, this approach gives us a scalable and efficient way to manage a high volume of requests across thousands of endpoints: resources are utilized effectively, endpoints are prioritized intelligently, and the queuing system stays stable even when individual endpoints go down.

If you are curious about how we reduce our costs using AWS Fargate, please take a look at How we saved 90% of costs by moving from AWS Lambda to AWS Fargate posted by Emre Kaya.
