Create Monitoring & Alerting for Webhook Errors using Datadog

How we monitor our webhook failure status and automate the alerts using Datadog at Xendit

Iman Tumorang
Xendit Engineering
8 min read · Apr 19, 2022


Monitoring and Alerting

Hi!!! It’s been a while since I last posted an article. I’ve been so busy lately (again!). Even with the hectic schedule, I always think about writing a new blog post as soon as I can.

So in this blog post, I want to share a light topic. A very light topic. Just grab your coffee and eat your breakfast while reading this. The topic is “how we monitor our webhook failure status and automate the alerts using Datadog at Xendit”.

To give you some context, Xendit handles a lot of things, mostly revolving around payments. We serve a lot of companies and help them accept money through any payment method available in the region. “In the region” also means that each country has its own types of payment methods.

We also build other services that support the customer experience when using Xendit. And one of the services is the Notification Platform, which I will explain in this blog post.

What Is This Notification Platform About?

When our merchants are onboarded to Xendit, they receive an email. When they log in to our dashboard, they get an OTP for secure 2FA authentication. When they use the Invoice feature, which lets them receive payments through a single, UI-friendly link/page (with no need to integrate using the API and notify their users about the payment status), the payment link is automatically sent to their users through our notification platform.

All of those activities are related to notifications, from sending an email or an OTP to just a WhatsApp message that contains a payment link. The notification service is responsible for sending the notification to any available channel that our merchants use.

At the time I write this article, Xendit supports 4 notification channels:

  • SMS
  • WhatsApp
  • Email
  • Viber

We use multiple vendors that are available in the market to help us to send notifications to all channels.

To give a bird’s-eye view of what our system looks like in general, take a look at the following diagram.

Xendit’s Notification Architecture
  • The product teams call the Main API as needed, e.g. when sending an OTP through SMS. Our Main API dispatches an event that the connectors listen to, and the matching connector sends the message through the targeted channel, e.g. SMS with Twilio.
  • Next, the vendor receives our request and sends the message to the targeted recipient.
  • By nature, most vendors provide webhooks for every event that happens to our request.
  • The events posted to our Webhook receiver may carry positive results (e.g. message sent successfully) or negative results (e.g. message failed to send).
  • Our Webhook receiver then passes every incoming webhook back to our Main API, which records the events.

Done! Our topic finished here 😂. Kidding haha. And that’s the normal flow of how the notification platform runs.

Problems Happen

So there’s a saying,

Even an honest person also can be blamed if he’s not smart enough to handle conditions.

It’s the same with systems. Even a well-designed system can fail. Even when we have thought of every possible failure while designing it, there is always an unhandled problem that arises later.

Our system is already well designed, with everything that we could think of: circuit breaker pattern, auto-retry, auto-scaling, and many more. But one day, something bad happened. All of our WhatsApp notifications were stuck in a pending/failed state. Even though we had already handled every failure possibility we could think of, in our case it was still not enough (yet).

The problem that we faced will be explained later, but first, let me show you what our WhatsApp notification architecture looks like.

Our WhatsApp architecture is quite unique. You can take a look at the diagram below.

WhatsApp Notification Architecture at Xendit

As you can see from the diagram, there are 2 different ways we send notifications.

  • Using Proxy (On-Premise)
    In this version, we host an on-premise version of the WhatsApp API that connects directly to the Meta API. Internally we call it the proxy, because we host it ourselves and it acts like a proxy to the Meta API. The proxy app itself is a self-contained application prepared by Meta, you can read more about it here
  • Using Vendor
    In this version, instead of managing the on-premise one, we use a Vendor (WhatsApp Business Partner) for our WA notifications.

The reasons why we have this kind of architectural layout are a different matter to talk about. But that’s our current setup for the WhatsApp architecture.

So, back to the problem that caused us a headache and even made our WhatsApp API inaccessible. This issue happened on our Proxy (On-Premise) flow.

There were 2 memorable events that I can recall:

  • One is about a payment issue. It was a silly issue, but a critical one. We forgot to pay our overdue bill to Facebook (now Meta) in our Facebook Business Manager, and that stopped our WhatsApp notifications from being processed by the Proxy.
  • The other one: even a giant company like Facebook can face issues with its services. I remember WhatsApp in particular having an outage, which also stopped our WhatsApp notifications from being processed by the Proxy.

Maybe you’re confused, so let me try to sketch this. This is our architecture for WhatsApp using the Proxy only.

WhatsApp Proxy Architecture

And we have 2 problems, so the flow of each problem looks like this:

Left: Payment Issue flow. Right: WhatsApp down.

It looks normal, right?

But if you take a look again, the “Proxy” is a bundled application from Meta to allow us to connect with WhatsApp API.

The Proxy architecture looks like this (ref: Meta)

WhatsApp On-Premise Client Architecture

Our internal system calls the WebApp in the bundled Proxy app. Since it’s a bundled application, we cannot customize it, for example by adding a Circuit Breaker on the CoreApp. So if something happens with the WhatsApp Servers, our internal application won’t notice the problem. The only way to know is by listening to the webhook.

Back to our original problem: when we faced the issue on the WhatsApp Servers, we didn’t notice it at all, until we received a lot of tickets complaining that WhatsApp notifications were not being sent successfully.

If you’re a veteran in the B2B SaaS ecosystem, you know that when customers complain, even if it’s only one merchant, it means disaster. It gives us a bad image as a company. I remember that when the payment issue with Meta (re: WhatsApp) happened, we didn’t know at all that our payment was overdue. There was no alert and no error, because everything was running normally. The webhook receiver was still receiving webhooks.

Long story short, we managed to identify the root cause after we looked at the error log on the Proxy server. But we learned one thing: the proxy was sending a webhook about the payment issue, and we didn’t do anything with that webhook message. We just saved it to our DB with no extra logic.

That’s why we decided to figure out how to monitor these incoming webhooks and raise an alert whenever a critical webhook message arrives.
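To make the gap concrete: storing the payload untouched means nobody ever looks at the error entries inside it. Below is a minimal sketch of pulling those entries out of a webhook body. The payload shape (a `statuses` list with per-message `errors`, plus a top-level `errors` list), the error codes, and the function name `extractWebhookErrors` are all assumptions for illustration, not the real schema, so check the vendor’s webhook reference before relying on any of it.

```javascript
// Hypothetical, simplified webhook payload handling. Field names and codes
// are assumptions for illustration, not a vendor's real schema.
function extractWebhookErrors(payload) {
  const found = [];

  // Errors reported at the top level, not tied to a specific message.
  for (const err of payload.errors || []) {
    found.push({ code: err.code, title: err.title, messageId: null });
  }

  // Errors attached to the status update of one specific message.
  for (const status of payload.statuses || []) {
    for (const err of status.errors || []) {
      found.push({ code: err.code, title: err.title, messageId: status.id });
    }
  }

  return found;
}

// Made-up sample payload: one delivered message, one failed message,
// and one account-level error.
const sample = {
  statuses: [
    { id: 'msg-1', status: 'delivered' },
    { id: 'msg-2', status: 'failed', errors: [{ code: 1001, title: 'Example failure' }] },
  ],
  errors: [{ code: 2001, title: 'Example account problem' }],
};

console.log(extractWebhookErrors(sample));
```

With something like this in the receiver, a payload without errors yields an empty list and can be stored as before, while a non-empty list is the signal worth reporting.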

Recording Error-Type Webhooks to Datadog

One simple idea that came to mind was to use Datadog. Datadog (like other monitoring tools, both premium and open-source) supports sending custom metrics.

So what we do is build a custom metric and record every incoming webhook that contains error messages. The structured message will look like below.
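As an illustration only, a structured metric message could be shaped like the object built below: one count per error webhook, with tags that describe where the error came from. The metric name `notification.webhook.error`, the tag keys, and the helper `buildErrorMetric` are assumptions made for this sketch, not the actual names used at Xendit.

```javascript
// Assumed metric name; the real one is not stated in this article.
const METRIC_NAME = 'notification.webhook.error';

// Build a metric record for one error webhook.
// Tags use the "key:value" form that Datadog uses for tagging.
function buildErrorMetric({ channel, provider, code }) {
  return {
    name: METRIC_NAME,
    value: 1, // one count per error webhook
    tags: [
      `channel:${channel}`,
      `provider:${provider}`,
      `error_code:${code}`,
    ],
  };
}

// Example values are made up.
console.log(buildErrorMetric({ channel: 'whatsapp', provider: 'proxy', code: 1001 }));
```

Keeping the error code as a tag rather than as part of the metric name means a single metric can be sliced per channel, provider, or code on the Datadog side.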

We then push all of these incoming webhooks to Datadog. The steps for pushing them to Datadog using DogStatsD can be seen on the official page of Datadog here.

In our case, we created a simple function that forwards all the metrics to the DogStatsD agent. A full example can be seen below (using JavaScript with the hot-shot library as the driver for the DogStatsD agent).
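A minimal sketch of such a forwarding function is shown here. It is not our production code: the function name `recordWebhookError`, the metric name, and the tag keys are assumptions. The client is passed in as a parameter, so the sketch runs with a small in-memory stand-in; in a real service it would be a hot-shot client, whose `increment` accepts a stat name, a value, and tags.

```javascript
// In a real service the client would be a hot-shot instance, for example:
//   const StatsD = require('hot-shot');
//   const client = new StatsD({ host: 'localhost', port: 8125 });
// Here the client is injected, so the sketch runs without an agent.

// Forward one error webhook to the metrics client as a counter increment.
// Metric name and tag keys are assumptions for illustration.
function recordWebhookError(client, { channel, provider, code }) {
  const tags = [
    `channel:${channel}`,
    `provider:${provider}`,
    `error_code:${code}`,
  ];
  client.increment('notification.webhook.error', 1, tags);
}

// Tiny in-memory stand-in exposing only the method used above.
function createFakeClient() {
  const calls = [];
  return {
    calls,
    increment(stat, value, tags) {
      calls.push({ stat, value, tags });
    },
  };
}

const fake = createFakeClient();
recordWebhookError(fake, { channel: 'whatsapp', provider: 'proxy', code: 1001 });
console.log(fake.calls);
```

Injecting the client also keeps the function easy to unit test, since no DogStatsD agent has to be running.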

What we do here is only push the metric to Datadog. Once it’s deployed, any incoming webhook that contains an error produces one event in Datadog. In Datadog, we can now set up a dashboard, a monitor, or even an alert for triggered conditions. Below is one of our example dashboards for webhook error monitoring. We grouped the error codes into 2 types: Retryable Error and Immediate Alert Error.
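One possible way to do this grouping is to attach a severity tag when the metric is emitted, so dashboards and monitors can filter on it. This is a sketch under assumptions: the article does not say whether the grouping happens in code or inside Datadog, and the code lists, tag values, and function names below are hypothetical.

```javascript
// Hypothetical code lists; the real grouping is not given in this article.
const RETRYABLE_CODES = new Set([1001, 1002]);
const IMMEDIATE_ALERT_CODES = new Set([2001, 2002]);

// Map an error code to one of the two groups, or mark it as unclassified
// so unknown codes stay visible instead of being silently dropped.
function classifyErrorCode(code) {
  if (IMMEDIATE_ALERT_CODES.has(code)) return 'immediate_alert';
  if (RETRYABLE_CODES.has(code)) return 'retryable';
  return 'unclassified';
}

// Build the tag that would be sent along with the metric.
function severityTag(code) {
  return `severity:${classifyErrorCode(code)}`;
}

console.log(severityTag(2001));
console.log(severityTag(1001));
console.log(severityTag(9999));
```

An alert can then watch only the immediate-alert group, while the retryable group stays on the dashboard for trend watching.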

With this, we hope to be alerted immediately whenever a critical webhook comes in, so we can prepare the countermeasures.

What’s Next?

After this simple experiment, we will expand this to other notification channels including Email, SMS, and Viber.

After the alerting part, there are a lot of options we can try, e.g. re-queueing all the messages until the problem is solved on the vendor’s side (e.g. the WhatsApp Servers).

There are also more experiments to try and many exciting things yet to come. Obviously, we want to deliver the best experience to our customers. We also want to add more channels, as our vision is to be more regional.

Conclusions

  • We use Datadog to monitor webhook errors for WhatsApp and to alert us when they come in
  • We’re extending the same monitoring pattern to other channels
  • This kind of custom metric and monitoring also can be applied to all types of webhooks (including Xendit payment webhooks), and hopefully, this article is useful to anyone who reads this.

Hey all, I’m hiring more people to join our teams to handle the notification platform and build more exciting experiments that are yet to come. If you’re interested, please contact me. If you know someone who is looking for a job, please share this so they can read it.


Iman Tumorang
Xendit Engineering

Software Architect @ Xendit | Independent Software Architect @ SoftwareArchitect.ID | Reach me at https://imantumorang.com for fast response :)