Leveraging Webhooks for Real-time Data Warehousing

Introduction to Webhooks, an event-driven alternative to Polling

Patrick Pichler
Creative Data
5 min read · Jul 31, 2020

Photo by Hello I'm Nik on Unsplash

Introduction

For years, we have seen a trend towards service-oriented architectures, in which adaptive applications are built from self-contained services, nowadays often implemented as microservices. These independent services communicate through a wide range of API technologies built on lightweight communication protocols. This architectural approach is a fundamental part of today’s cloud and serverless computing, where servers, including databases, seem to disappear completely. Of course, they are still used, but the idea is that developers do not need to be aware of them. In data warehousing, too, APIs have started to replace databases as the access point for retrieving and integrating data.

Photo by Author

Polling

Most cloud-based applications expose REST APIs these days, allowing other systems to retrieve or manipulate data. However, as you might know from experience, repeatedly requesting data over REST APIs has several drawbacks, above all if you are dealing with huge amounts of data and/or want to retrieve only changed data, preferably in (near) real-time. The most common way to accomplish this is Polling: obtaining data updates by constantly sending requests to an API without knowing the server’s state or whether anything has changed in the first place. The API provider Zapier did a very interesting study across 30 million poll requests made through their services and found that 98.5% of polls are wasted. Apart from this inefficiency, Polling may also degrade the overall performance of the polled application. On top of that, you need a mechanism for comparing states between requests to find changes, you have to respect any limits on the number of requests, and you need a way to detect deleted records. The good news is that it doesn’t have to be like this.
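To make the state comparison mentioned above more concrete, here is a minimal sketch in Python. It assumes each poll returns a full snapshot as a dictionary keyed by record ID; the names diff_snapshots and count_wasted_polls are made up for illustration and do not come from any particular API.

```python
from typing import Dict, List, Tuple

# Assumption: one poll yields a full snapshot, mapping record ID -> record fields.
Snapshot = Dict[str, dict]


def diff_snapshots(previous: Snapshot, current: Snapshot) -> Tuple[List[str], List[str], List[str]]:
    """Compare two polled snapshots and return (created, updated, deleted) IDs."""
    created = sorted(key for key in current if key not in previous)
    deleted = sorted(key for key in previous if key not in current)
    updated = sorted(
        key for key in current if key in previous and current[key] != previous[key]
    )
    return created, updated, deleted


def count_wasted_polls(snapshots: List[Snapshot]) -> int:
    """Count polls that returned exactly the same state as the poll before."""
    wasted = 0
    for previous, current in zip(snapshots, snapshots[1:]):
        if diff_snapshots(previous, current) == ([], [], []):
            wasted += 1
    return wasted
```

Note that deleted records only show up because the whole previous snapshot is kept around, which is exactly the kind of extra bookkeeping Polling forces on the consumer.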

Stop the polling madness by Zapier

Webhooks

Instead of polling, you can subscribe and listen for event-triggered changes in real-time, just like push notifications. How does that sound?

There are many real-time web technologies around, such as Webhooks, WebSockets, Server-sent Events, Long polling, Comet, etc. They are the backbone of almost all modern web applications. For event notifications in particular, however, Webhooks have become increasingly popular. For instance, GitHub moved all their services over to Webhooks, which enable their APIs to push streams of events via HTTP POST requests to a configured callback URL (the webhook). There is no need to constantly poll anymore. That’s why they are also often referred to as “reverse APIs”. Most modern Webhooks essentially boil down to listening for changes to data and then automatically sending them to another HTTP endpoint. Such event-driven APIs are therefore a perfect fit for data warehousing.
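From the sender’s point of view, a Webhook delivery is little more than an HTTP POST to the configured callback URL. The following is a rough sketch using only Python’s standard library; the event name, the payload layout and the example URL are assumptions for illustration, since every provider defines its own format.

```python
import json
import urllib.request


def build_delivery(callback_url: str, event: str, data: dict) -> urllib.request.Request:
    """Build the HTTP POST request that pushes one event to the callback URL."""
    # Assumption: a simple {"event": ..., "data": ...} envelope; real providers differ.
    body = json.dumps({"event": event, "data": data}).encode("utf-8")
    return urllib.request.Request(
        callback_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def deliver(request: urllib.request.Request, timeout: float = 5.0) -> bool:
    """Send the request; only a 2xx answer counts as a successful delivery."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # covers URLError, HTTPError and timeouts
        return False
```

A call such as deliver(build_delivery("https://warehouse.example.com/hooks/orders", "order.updated", {"id": 7})) would push a single change to a (hypothetical) data warehouse endpoint.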

In a nutshell, Webhooks are superior to Polling in terms of freshness of data, efficiency of communication and infrastructure costs.

Go beyond webhooks by Zapier

REST Hooks

There is more good news: you don’t even need to implement or learn something completely new. Webhooks can extend the power of traditional REST APIs, turning them into REST Hooks. The idea is to treat Webhooks like subscriptions that are managed via one and the same REST API, just like any other resource. This is an improvement over manually managed Webhooks, as it allows them to be created and updated dynamically. Traditional REST APIs can thus communicate with other apps in real-time, even without the complicated setup required for normal Webhooks. Making REST APIs event-driven typically also brings much better efficiency in terms of CPU and network usage.
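Treating subscriptions as just another resource could look roughly like the in-memory sketch below. The class and field names (SubscriptionStore, event, target_url) and the endpoints mentioned in the comments are assumptions for illustration; a real API would expose the same operations over HTTP and persist them.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Subscription:
    id: int
    event: str       # e.g. "order.updated" (hypothetical event name)
    target_url: str  # callback URL that receives the notifications


class SubscriptionStore:
    """In-memory stand-in for a /subscriptions resource of a REST API."""

    def __init__(self) -> None:
        self._subs: Dict[int, Subscription] = {}
        self._ids = itertools.count(1)

    def create(self, event: str, target_url: str) -> Subscription:
        # Would back something like POST /subscriptions
        sub = Subscription(next(self._ids), event, target_url)
        self._subs[sub.id] = sub
        return sub

    def update(self, sub_id: int, target_url: str) -> Subscription:
        # Would back something like PUT /subscriptions/<id>
        sub = self._subs[sub_id]
        sub.target_url = target_url
        return sub

    def delete(self, sub_id: int) -> None:
        # Would back something like DELETE /subscriptions/<id>
        self._subs.pop(sub_id, None)

    def targets_for(self, event: str) -> List[str]:
        """Callback URLs to notify when the given event occurs."""
        return [s.target_url for s in self._subs.values() if s.event == event]
```

Whenever a resource changes, the API looks up targets_for the matching event and pushes a notification to each URL, so subscribers can come and go without any manual configuration.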

Sending, Receiving and Processing Webhooks

The most important thing about building efficient and scalable Webhook receivers is to make sure they do as little as possible. That means any further processing should be offloaded to a separate job. Ideally, all a receiver does is receive a payload, enqueue it and return a 200 success response to the sender API. This acknowledgement is important, since otherwise the request is considered to have failed and will be retried. After a certain number of failed attempts, this usually leads to the removal of the active subscription. For proper enqueuing, you can use a message queuing system such as RabbitMQ, AWS SQS or Azure Queue storage in conjunction with any subsequent data processing service. The queue makes sure that no message gets lost, even at high traffic rates.
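The receive, enqueue and acknowledge pattern can be sketched with Python’s standard library alone. Here an in-process queue.Queue stands in for a real message broker such as RabbitMQ or SQS, and the function names (accept_delivery, process_pending, serve) are made up for this example. Returning 400 for a body that is not valid JSON is also just an assumption about how one might handle bad input.

```python
import json
import queue
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: an in-process queue stands in for RabbitMQ, AWS SQS, etc.
jobs: queue.Queue = queue.Queue()


def accept_delivery(raw_body: bytes, job_queue: queue.Queue) -> int:
    """Do the bare minimum: parse, enqueue, and return the HTTP status to send."""
    try:
        payload = json.loads(raw_body)
    except ValueError:  # not valid JSON (or not valid UTF-8)
        return 400
    job_queue.put(payload)
    return 200


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        status = accept_delivery(self.rfile.read(length), jobs)
        self.send_response(status)  # acknowledge right away, no processing here
        self.end_headers()


def process_pending(job_queue: queue.Queue, handle) -> int:
    """Worker side: drain the queued payloads outside the request cycle."""
    processed = 0
    while True:
        try:
            payload = job_queue.get_nowait()
        except queue.Empty:
            return processed
        handle(payload)
        processed += 1


def serve(port: int = 8000) -> None:
    """Start the receiver; blocks until interrupted."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

The handler never touches the warehouse itself. A separate worker calls process_pending (or consumes from the real broker), so a slow load job cannot cause the sender to time out and retry.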

Skinny Payload

The notification’s payload can actually be any kind of JSON data, but since its primary goal is to act as a notifier, it is good practice to limit it to exactly that. This approach also complies with one of the REST Hooks security measures, named “Skinny Payloads”. Instead of directly sending the actual representations of resources, you replace the payload with parameterized URLs or IDs. The receiver then makes the necessary REST API calls to complete the transaction. The benefit of this approach is that the receiver must make an authenticated API call to finally obtain the actual data. Another good practice is not to send out notifications immediately, but to wait for several changes to accumulate and send them out as a collection, similar to the idea of micro-batches.
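Both practices can be combined in a few lines. In the sketch below, the base URL, the payload keys and the MicroBatcher class are assumptions for illustration, and the batch is flushed by size only; a real sender would most likely also flush on a timer.

```python
from typing import Callable, Dict, List

API_BASE = "https://api.example.com"  # hypothetical API root


def skinny_payload(event: str, resource: str, resource_id: int) -> Dict[str, str]:
    """Notify only: send a URL to fetch the data from, not the data itself."""
    return {"event": event, "resource_url": f"{API_BASE}/{resource}/{resource_id}"}


class MicroBatcher:
    """Collect notifications and send them out as one batch."""

    def __init__(self, send: Callable[[List[dict]], None], max_size: int = 3) -> None:
        self._send = send
        self._max_size = max_size
        self._pending: List[dict] = []

    def add(self, notification: dict) -> None:
        self._pending.append(notification)
        if len(self._pending) >= self._max_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever has been collected so far; does nothing when empty."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._send(batch)
```

The receiver then calls each resource_url with its own credentials to obtain the actual records, so the notification itself never carries sensitive data.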

Conclusion

The fact is, Polling is inefficient, and making familiar REST APIs able to notify about data changes can solve a lot of problems, not only for data warehousing. Even though you may have got your data integration processes working successfully using Polling techniques, it’s definitely worth considering adopting Webhooks instead. The majority of today’s cloud-based applications should support them in some form; otherwise, you may be able to implement them on your own.
