Webhook Dispatch the Elixir Way

Michael Guarino
Frame.io Engineering
4 min read · Dec 11, 2018

As part of Frame.io’s launch of its API, we’ve recently added webhook support. Since we already use GenStage as an event bus broadcasting all mutations in the system to many heterogeneous event consumers, webhooks were in the simplest sense just the addition of another consumer to the bus plus a signed HTTP request. That said, we wanted to raise the bar, and we identified a few possible gotchas with the implementation of webhooks:

  • Unpredictable latency from clients. It’s possible a webhook consumer takes multiple seconds to accept a request. That is somewhat manageable with timeouts.
  • Intermittent client failures. One potential solution here might be retries.
  • Managing open HTTP connections. Again, this is somewhat mitigable, this time with GenStage’s ConsumerSupervisor (which limits the number of webhooks being processed simultaneously) plus timeouts.

Obviously the second is the easiest to solve, and there are workable fixes for the first and third. The big issue is that if you just use GenStage to manage the backpressure, you run the risk of eventually shedding webhooks, since at some point event dispatch to the GenStage producer will simply time out. A pernicious side effect is that it would also slow down completely unrelated consumers, because the webhook system keeps applying backpressure on the bus and prevents broadcasts from happening. Neither of those trade-offs was acceptable in our estimation.

A Better Way

Our solution was to build a dispatcher from scratch on top of a simple GenServer, utilizing hackney’s non-blocking HTTP functionality as much as possible. Simplistically, hackney provides a nice streaming API with events all of the form {:hackney_response, stream_ref, event}. You can fold those into handle_info callbacks in a GenServer, with an internal queue, to provide a completely async webhook dispatcher that won’t apply backpressure on the rest of the system. Here’s code to demonstrate:
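The post originally embeds a gist here; to keep things readable inline, here’s a minimal sketch of the idea. The module name, event shape (%{url: ..., payload: ..., signature: ...}), max-concurrency constant, and signature header are all assumptions for illustration, not the exact production implementation:

```elixir
defmodule Webhooks.Dispatcher do
  @moduledoc """
  Async webhook dispatcher built on a plain GenServer and hackney's
  non-blocking (async: :once) request mode.
  """
  use GenServer

  # Hypothetical cap on simultaneous hackney streams.
  @max_streams 20

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, Keyword.put_new(opts, :name, __MODULE__))
  end

  # Fire-and-forget entry point: enqueue a webhook event for delivery.
  def dispatch(event), do: GenServer.cast(__MODULE__, {:dispatch, event})

  @impl true
  def init(:ok) do
    # `queue` holds pending events; `streams` maps hackney refs to in-flight events.
    {:ok, %{queue: :queue.new(), streams: %{}}}
  end

  @impl true
  def handle_cast({:dispatch, event}, state) do
    {:noreply, do_dispatch(%{state | queue: :queue.in(event, state.queue)})}
  end

  @impl true
  def handle_info({:hackney_response, ref, {:status, code, _reason}}, state) do
    # The status code is all we care about; close the stream and free a slot.
    :hackney.close(ref)
    {event, streams} = Map.pop(state.streams, ref)
    mark_event(event, code)
    {:noreply, do_dispatch(%{state | streams: streams})}
  end

  def handle_info({:hackney_response, ref, {:error, reason}}, state) do
    {event, streams} = Map.pop(state.streams, ref)
    mark_event(event, {:error, reason})
    {:noreply, do_dispatch(%{state | streams: streams})}
  end

  # Ignore any other stream messages (headers, body chunks, :done).
  def handle_info({:hackney_response, _ref, _other}, state), do: {:noreply, state}

  # Start new requests while there's spare capacity and queued events.
  defp do_dispatch(%{queue: queue, streams: streams} = state)
       when map_size(streams) < @max_streams do
    case :queue.out(queue) do
      {{:value, event}, queue} ->
        case start_request(event) do
          {:ok, ref} ->
            do_dispatch(%{state | queue: queue, streams: Map.put(streams, ref, event)})

          {:error, reason} ->
            mark_event(event, {:error, reason})
            do_dispatch(%{state | queue: queue})
        end

      {:empty, _} ->
        state
    end
  end

  defp do_dispatch(state), do: state

  defp start_request(%{url: url, payload: payload, signature: signature}) do
    headers = [{"content-type", "application/json"}, {"x-signature", signature}]
    # async: :once delivers response messages to this process one at a time.
    :hackney.request(:post, url, headers, payload, async: :once)
  end

  # Placeholder; the retry section below fleshes this out.
  defp mark_event(_event, _result), do: :ok
end
```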

The GenServer exposes a cast to dispatch a webhook. Upon each dispatch, the webhook is enqueued, and do_dispatch/1 checks whether there are any available streams (based on the configured max concurrency); if so, it makes HTTP POST requests using hackney’s async: :once option for each. This enforces our open-HTTP-connection constraint (condition 3). The stream reference is stored in an internal map for future use, and whenever hackney gets a message over the wire, it sends it back to the GenServer, where it’s handled in a handle_info callback. That ensures no processing in the dispatcher itself is blocked on HTTP (condition 1). We can even kill the stream prematurely once we’ve gotten what we’re looking for: the response code. (I’ll leave implementing response-body downloads to the side for now, but it’s also possible by keeping the hackney stream going.)
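For context, the call site on the event-bus side can be as simple as a cast per event from the GenStage consumer that handles webhooks; this snippet is purely illustrative, as that consumer isn’t shown in this post:

```elixir
# Inside the webhook GenStage consumer (illustrative): hand each event off to
# the dispatcher and return immediately, so the bus is never blocked on HTTP.
def handle_events(events, _from, state) do
  Enum.each(events, &Webhooks.Dispatcher.dispatch/1)
  {:noreply, [], state}
end
```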

Adding Retries

This gets us two-thirds of the way to our goal, but notice we’ve sacrificed the ability to retry. Adding it back isn’t a trivial task if we want to keep all the async I/O benefits we’ve just won: if you block the GenServer in a loop constantly retrying the POST, for instance, it’s no longer async. The key insight is that you can use Process.send_after/3 to handle retries without blocking anything in the GenServer. All we need to do is modify mark_event and add another handle_info function head, like so (previous code omitted for brevity):
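Again, the original embeds a gist here; a sketch of those additions against the dispatcher above might look like the following. Events.mark/2 is a hypothetical name for the decision function (the post only specifies the shape of its return tuples), and backoff/1 is sketched after the next paragraph:

```elixir
# mark_event/2 now delegates the retry decision to the Events module.
defp mark_event(event, result) do
  case Events.mark(event, result) do
    {:retry, event} ->
      # Schedule another attempt without blocking the dispatcher loop.
      # backoff/1 (sketched below) computes the delay from the event's attempt count.
      Process.send_after(self(), {:retry, event}, backoff(event))

    {:success, _event} ->
      :ok

    {:halt, _event} ->
      :ok

    {:error, _reason} ->
      :ok
  end
end

# A retry is just another dispatch: re-enqueue and let do_dispatch/1 pick it up.
def handle_info({:retry, event}, state) do
  {:noreply, do_dispatch(%{state | queue: :queue.in(event, state.queue)})}
end
```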

The handle_info for the {:retry, event} message can simply imitate a normal dispatch of a webhook event. We encapsulate all the retry decision-making in the Events module to keep the dispatcher dumb, but assume it returns tuples like {:retry, _} | {:success, _} | {:halt, _} | {:error, _} (these become even more important if you’re doing body downloads, since you want to kill the hackney stream if you’re retrying, but keep it around for the download if the request succeeded or retries were exhausted). Finally, send_after proves to be a really useful retry mechanism: you just need to compute a trivial exponential backoff, then shape the message properly.
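The backoff/1 helper referenced in the sketch above can be as trivial as doubling the delay per attempt; this version assumes the event carries an attempts counter, which is an illustrative detail:

```elixir
# Illustrative exponential backoff: 1s, 2s, 4s, ... capped at one minute.
defp backoff(%{attempts: attempts}) do
  min(:timer.seconds(1) * trunc(:math.pow(2, attempts)), :timer.minutes(1))
end
```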

Final Caveats

By adding back retries, we’ve accomplished the three goals of our webhook dispatch system: bounded HTTP connection usage, the ability to handle diverse response times without significant performance degradation elsewhere in the system, and retries on specific errors. There was also an implicit goal of 100% deliverability for our webhooks. Notice that the current implementation uses an in-memory queue. If the dispatcher always drains in a very short time this is of little concern, but if the system starts accepting more load, utilizing a system like RabbitMQ or Kafka as the queueing mechanism would be wise, eliminating the possibility of a dying VM dropping events. The nice thing is there’s nothing preventing us from using them in this code: replace the :queue data structure with a Rabbit or Kafka client and the calls to :queue.in/2 and :queue.out/1 with the analogous calls against those systems, and everything should behave as before. With Rabbit, you could also ack the message only once it has completed all its retries, guaranteeing the event isn’t lost even if things blow up during the retry process.
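As a rough illustration of that swap, using the community amqp library (the module name, queue name, and payload encoding here are assumptions, and a Kafka client would look different):

```elixir
# Publishing replaces :queue.in/2; acking only after retries complete replaces
# simply dropping the event from the in-memory queue.
defmodule Webhooks.RabbitQueue do
  def enqueue(channel, event) do
    AMQP.Basic.publish(channel, "", "webhooks", :erlang.term_to_binary(event), persistent: true)
  end

  # Called once the event has succeeded or exhausted its retries, so a crash
  # mid-retry leaves the message on the queue to be redelivered.
  def ack(channel, delivery_tag) do
    AMQP.Basic.ack(channel, delivery_tag)
  end
end
```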

Like what you’ve read? We’re hiring!

At Frame.io, we’re powering the future of creative collaboration. Over 500,000 video professionals use Frame.io to seamlessly share media and gather timestamped feedback from team members and clients. Simply put, we help companies create better video, together.

Across the stack we’re big users of AWS Lambda, Elixir, Swift, Go, and React. We’re a small, polyglot team that thinks big and works collaboratively to solve the biggest challenges for our customers that include Vice, BuzzFeed, Turner and NASA.
