Event-Driven Integration With Klaviyo (MVP)

Jonathan Chiocchio
Ordergroove Engineering
Nov 20, 2023

This article explores the design aspects of our MVP implementation for integrating Klaviyo. It delves into how we efficiently transmit events from Ordergroove to Klaviyo.

Architectural Goals

As usual when architecting a solution, besides addressing the problem from a functional point of view, we also want the architecture of our new event-delivery system to fulfill several non-functional quality attributes that good software should exhibit:

  1. Scalability
  2. Extendability
  3. Idempotency
  4. Data Integrity
  5. Retriability
  6. Auditability

Later on, after introducing the architecture we have devised, we will explain how it fulfills the aforementioned quality attributes.

High-level Architecture

From this section on, we assume that readers are familiar with Google Pub/Sub and Cloud Functions.

Consider the following diagram, which illustrates our proof-of-concept, high-level architecture:

MVP Architecture for sending events to Klaviyo

As can be seen in the diagram above:

  • Ordergroove publishes events into a RabbitMQ Event Bus queue. These events were already being used in various parts of the platform.
  • We leveraged the existing Pub/Sub topics we already had in place for other purposes, leaving existing Pub/Sub subscribers untouched.
  • We created new Klaviyo-specific subscribers (one per topic).
  • These new subscribers also have a 1-to-1 relationship with a new cloud function (one per event type) that is invoked whenever a message/event arrives at the topic and needs processing.
  • The cloud functions leverage existing internal APIs to enrich the event data and deliver the events to Klaviyo.

Event Message Processing

The following diagram zooms in on the interaction between Google Pub/Sub and Google Cloud Functions to process and deliver events:

Google Cloud Function Overall Design

In the diagram above, we see that:

  • Upon the arrival of an event message at the topic, the (Klaviyo-specific) Pub/Sub subscription invokes the cloud function over HTTPS, passing the event payload as a parameter.
  • An internal Ordergroove API is called to enrich the data that came in the event payload.
  • For future iterations, we devised a plan to implement a custom mechanism to audit the processing of events by persisting operations, payloads, results, number of retries, etc. in dedicated MongoDB collections (more details about this in upcoming sections).
  • All cloud functions share a similar internal structure, with some parts open for extension to allow customized building and enrichment of the payload sent to the partner, based on the event type being processed and the partner the event is destined for (see the sketch below).
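
To make that shared structure more concrete, here is a minimal Python sketch of what one of these cloud functions could look like, assuming a Pub/Sub push subscription and hypothetical enrich_event / deliver_to_klaviyo helpers (all names are illustrative, not our actual code):

```python
import base64
import json

import functions_framework  # Google Cloud Functions Python runtime


@functions_framework.http
def handle_order_created(request):
    # Pub/Sub push subscriptions wrap the event in an envelope:
    # {"message": {"data": "<base64-encoded payload>", ...}, "subscription": "..."}
    envelope = request.get_json()
    event = json.loads(base64.b64decode(envelope["message"]["data"]))

    # Enrich the raw event with data from internal Ordergroove APIs
    # (enrich_event / deliver_to_klaviyo are hypothetical helpers).
    enriched = enrich_event(event)

    # Build the Klaviyo-specific payload and deliver it via their SDK.
    deliver_to_klaviyo(enriched)

    # A 2xx response acknowledges the message; raising instead lets Pub/Sub retry.
    return "OK", 200
```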

How did we fulfill our architectural goals?

In this section, we will explain in detail how we fulfilled the architectural goals we have outlined above.

Scalability

When integrating events from one platform to another, scalability becomes a critical consideration. By adopting Google Cloud Functions combined with Google Pub/Sub, we have solved several scalability challenges inherent not only to this problem but to the processing and delivery of events in general:

  1. Events Volume: both Cloud Functions and Pub/Sub scale horizontally with the number of events.
  2. Throughput and Latency: Cloud Functions and Pub/Sub offer low-latency, high-throughput event streaming and processing.
  3. Concurrent and Parallel Event Processing: multiple function instances can execute concurrently and in parallel.
  4. Fault Tolerance: durable event storage and reliable event delivery with retries, including dead-letter queues.
  5. Resource Utilization: automatic scaling based on demand maximizes cost-effectiveness.

Extendability

  • It was an explicit requirement for this architecture that it seamlessly support future integrations with other partners.
  • The internal design of the cloud functions anticipates the future addition of partners beyond Klaviyo with minimal code changes, as all the different components were conceived to be easily extended.
  • Roughly speaking, in order to add a new partner to the platform, the following tasks must be carried out:

1. Implement support in the code for all the existing event types for the new partner. This might not be as bad as it sounds, since we reified common behavior across partners into common classes that can provide generic solutions to more than one event at a time (see the code sketch after this list):

Simplified Class Diagram For Handling multi-partner events

2. Expose one new cloud function per event type supported by the partner.

3. Create a new set of Pub/Sub subscriptions for the new partner, making them all invoke the corresponding new cloud function.
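
The actual hierarchy is richer than the diagram above, but a rough Python sketch of the idea (class and field names are illustrative) might look like this:

```python
from abc import ABC, abstractmethod


class PartnerEventHandler(ABC):
    """Common behavior shared by all partners: payload building and delivery."""

    def handle(self, event: dict) -> None:
        payload = self.build_payload(event)
        self.deliver(payload)

    @abstractmethod
    def build_payload(self, event: dict) -> dict:
        """Partner-specific payload building / enrichment."""

    @abstractmethod
    def deliver(self, payload: dict) -> None:
        """Partner-specific delivery (SDK call, HTTP request, etc.)."""


class KlaviyoEventHandler(PartnerEventHandler):
    def build_payload(self, event: dict) -> dict:
        return {
            "event_id": event["id"],  # reused source UUID (see Idempotency below)
            "metric": event["type"],
            "properties": event["payload"],
        }

    def deliver(self, payload: dict) -> None:
        ...  # call the Klaviyo SDK


# Adding a new partner mostly means adding another PartnerEventHandler subclass.
```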

Idempotency

  • We identify the event message sent to partners using the exact same UUID as the source event. In the particular case of Klaviyo, which uses this event_id to deduplicate messages, this guarantees that Klaviyo will not store an Ordergroove event more than once.
  • Note, however, that this does not guarantee full idempotency: in the event of a network failure or similar scenario where we don't know whether the partner actually received an event, we cannot guarantee that we will not re-send the same event in a future retry, potentially with contextual (i.e. time-sensitive) data that differs from the one originally attempted.

Data Integrity

By utilizing Google Pub/Sub, we prevent data loss by leveraging its durable message storage. Its "at least once" delivery semantics, backed by built-in acknowledgment and retry mechanisms, also guarantee that all events are processed even in the event of failures.

Retriability

For all API requests the cloud functions issue, we handle retries by leveraging Google Pub/Sub's built-in mechanism, which retries the function if and only if it throws an error/exception.

The error codes we determined to be retryable are:

  • 429 (Too Many Requests)
  • 503 (Service Unavailable)
  • 504 (Gateway Timeout)

All other codes will be considered non-retryable.
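
In code, that decision boils down to a small classification helper along these lines (a sketch):

```python
# HTTP status codes we treat as transient and therefore worth retrying.
RETRYABLE_STATUS_CODES = {429, 503, 504}


def is_retryable(status_code: int) -> bool:
    return status_code in RETRYABLE_STATUS_CODES
```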

For the specific case of Klaviyo, and considering that errors are more likely while interacting with their APIs than among our own sub-systems (mostly for connectivity or throttling reasons), we leveraged two levels of retries: we configured retries at the Klaviyo SDK level and also at the Pub/Sub level (the latter catches all kinds of errors).

Klaviyo Retries

We configured Klaviyo retries, leveraging their SDK, to retry the aforementioned retryable error codes, configuring:

  • the amount of time to wait before attempting the first retry.
  • the maximum number of retries.
  • how much time passes between retries.
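
The exact configuration lives in the Klaviyo SDK; purely to illustrate the three knobs above, a hand-rolled equivalent could look like this (RetryableKlaviyoError is a hypothetical exception type, and the default values are placeholders):

```python
import time


class RetryableKlaviyoError(Exception):
    """Hypothetical error raised for 429/503/504 responses from Klaviyo."""


def call_with_retries(func, *, initial_wait=1.0, max_retries=3, wait_between=2.0):
    """Retry func() on transient Klaviyo errors, illustrating the knobs listed above."""
    attempt = 0
    while True:
        try:
            return func()
        except RetryableKlaviyoError:
            if attempt >= max_retries:
                raise  # give up after the maximum number of retries
            time.sleep(initial_wait if attempt == 0 else wait_between)
            attempt += 1
```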

Pub/Sub Retries

At the same time, we configured Pub/Sub retries to make sure that errors caused by anything other than the specific interaction with Klaviyo get handled properly. The configuration includes the following aspects:

  • Maximum number of attempts.
  • Retry Policy: Retry after exponential backoff delay.
  • Minimum backoff duration.
  • Maximum backoff duration.
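
For illustration, a push subscription with such a retry policy (plus the dead-letter topic discussed below) can be created with the Pub/Sub client library roughly as follows; project, topic, and endpoint names, as well as the concrete values, are placeholders:

```python
from google.cloud import pubsub_v1
from google.protobuf import duration_pb2

subscriber = pubsub_v1.SubscriberClient()

subscriber.create_subscription(
    request={
        # Placeholder project / topic / endpoint names.
        "name": "projects/my-project/subscriptions/klaviyo-order-created",
        "topic": "projects/my-project/topics/order-created",
        "push_config": {"push_endpoint": "https://example.cloudfunctions.net/klaviyo-order-created"},
        "retry_policy": {
            "minimum_backoff": duration_pb2.Duration(seconds=10),   # minimum backoff duration
            "maximum_backoff": duration_pb2.Duration(seconds=600),  # maximum backoff duration
        },
        "dead_letter_policy": {
            "dead_letter_topic": "projects/my-project/topics/klaviyo-dead-letter",
            "max_delivery_attempts": 10,                            # maximum number of attempts
        },
    }
)
```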

Therefore, if the processing function does not end with a successful response after exhausting all the Klaviyo retries, our generic error-handling logic takes over and signals Google Pub/Sub to either retry (by throwing an exception) or not retry (by returning normally), depending on the error code from the last retry performed by the SDK.
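
A sketch of that decision point, reusing the is_retryable helper from above; deliver_to_klaviyo and KlaviyoApiError are hypothetical names (the former wraps the SDK call with its own retries, the latter carries the last HTTP status code):

```python
import logging


def process_event(event: dict) -> None:
    try:
        deliver_to_klaviyo(event)  # SDK-level retries happen inside this call
    except KlaviyoApiError as exc:
        if is_retryable(exc.status_code):
            raise  # the function fails -> Pub/Sub will redeliver / retry
        # Non-retryable: log and return normally so Pub/Sub acks the message.
        logging.error("Dropping non-retryable event %s: %s", event.get("id"), exc)
```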

If, after retrying N times (at the Pub/Sub level), we still cannot make the processing of the message succeed, Pub/Sub moves the event message to a dead-letter queue shared by all cloud functions / event types. We have a subscription to that topic that in turn invokes a dedicated cloud function which, just for the MVP, logs the failed message to provide some kind of paper trail (sketched below).

Pub/Sub retries flow
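
The dead-letter handler itself is intentionally minimal for the MVP; a sketch (function name is illustrative):

```python
import base64
import logging

import functions_framework


@functions_framework.http
def handle_dead_letter(request):
    message = request.get_json()["message"]
    # For the MVP we only log the failed event, so there is at least some paper trail.
    logging.error(
        "Dead-lettered event %s: %s",
        message.get("messageId"),
        base64.b64decode(message["data"]).decode("utf-8"),
    )
    return "OK", 200
```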

Auditing

  • In order to track and monitor the flow of events for troubleshooting, compliance, and performance analysis, we devised a plan to audit previously-sent events. The solution relies on a MongoDB collection that saves information about event delivery: operations, payloads, results, number of retries, status of the delivery attempt, failure reasons, etc. (a sketch of such a record is shown after this list).
  • Cloud functions would directly access MongoDB.
  • This data can be used in the future to power dashboards for tracking, monitoring, and delivery of those events. It will also be used to debug the delivery of individual events.
  • Due to time constraints, this was not implemented during the first iteration.
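
As a sketch of what one such audit record might look like (database, collection, and field names are hypothetical, and the values are placeholders):

```python
from datetime import datetime, timezone

from pymongo import MongoClient

# Hypothetical database / collection for the planned audit trail.
audit = MongoClient("mongodb://localhost:27017")["integrations"]["event_deliveries"]

audit.insert_one(
    {
        "event_id": "e7c1a2d4-5b6f-4a3c-9d8e-0f1a2b3c4d5e",  # source event UUID
        "partner": "klaviyo",
        "operation": "create_event",
        "payload": {"metric": "Order Created"},              # payload sent to the partner
        "status": "failed",                                   # delivered / failed / retrying
        "retries": 2,
        "failure_reason": "429 Too Many Requests",
        "attempted_at": datetime.now(timezone.utc),
    }
)
```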

Future Improvements

  • As discussed earlier, our system currently cannot guarantee full idempotency.
  • Given that, for every single delivery, contextual information is fetched from internal REST APIs over HTTPS, we have seen slow performance when retrying the same event; this also contributes to the aforementioned lack of idempotency. A cache might be in order here.
  • We are currently not recovering dead-letter items, just logging them.
  • This implementation lacks proper auditing and tracking capabilities, resulting in a poor ability to correlate data in the event of failures/retries.
  • System configurers currently have little to no control over the rate at which events are delivered to partners. This results in degraded performance upon sudden spikes in traffic, when the partner (i.e. Klaviyo) asks us to throttle by returning the 429 HTTP status code.
  • The system lacks easily available, granular, end-to-end visibility.

Conclusion

We have seen how we, at Ordergroove, architected a solution that, mostly thanks to the infrastructure we chose, reasonably fulfilled most of the quality attributes we had defined for the architecture, considering the very short period of time we had to design and implement this MVP.

As discussed earlier, this was just the first iteration, so, as outlined in the previous section, there are several aspects that can and most likely will be improved in future iterations, which might turn this article into just the first instalment of more to come.

Hope you enjoyed the read!
