For the past year, I’ve been part of the data-streams team that is in charge of Wix’s event-driven messaging infrastructure (on top of Kafka). This infrastructure is used by more than 1400 microservices.
During this period I’ve implemented or have witnessed implementations of several key patterns of event-driven messaging designs that have facilitated creating a robust distributed system that can easily handle increasing traffic and storage needs. 3 of these are in this post and another 3 are in part 2.
1. Consume and Project
…for those very popular services that become a bottleneck
This pattern can help when you have a bottleneck legacy service that stores “popular” data of large domain objects.
At Wix, this was the case with our MetaSite service that holds a lot of metadata for each site created by Wix users, like the site version, the site owner, and which apps are installed on the site — The Installed Apps Context.
This information is valuable for many other microservices (teams) at Wix, like Wix Stores, Wix Bookings, Wix Restaurants, and many more. This single service was bombarded with more than 1 million RPM of requests to fetch various parts of the site metadata.
It was obvious from looking at the service’s various APIs, that it was dealing with too many different concerns of its client services.
The question we wanted to answer was how do we divert read requests from this service in an eventually consistent manner?
Create “materialized views” using Kafka
The team in charge of this service decided to create an additional service that would handle only one of MetaSite’s concerns — “Installed Apps Context” requests from their client services.
- First, they streamed all the DB’s Site Metadata objects to a Kafka topic, including new site creations and site updates. Consistency can be achieved by doing DB inserts inside a Kafka Consumer, or by using CDC products like Debezium.
- Second, they created a “write-only” service (Reverse lookup writer) with its own DB, that consumed the Site Metadata objects but took only the Installed Apps Context and wrote it to the DB. I.e. it projected a certain “view” (installed apps) of the site-metadata into the DB.
- Third, they created a “read-only” service that only accepted requests related to the Installed Apps context which they could fulfill by querying the DB that stored the projected “Installed Apps” view.
- By streaming the data to Kafka, the MetaSite service became completely decoupled from the consumers of the data, which reduced the load on the service and DB dramatically.
- By consuming the data from Kafka and creating a “Materialized View” for a specific context, The Reverse lookup writer service was able to create an eventually consistent projection of the data that was highly optimized for the query needs of its client services.
- Splitting the read service from the write service, made it possible to easily scale the amount of read-only DB replications and service instances that can handle ever-growing query loads from across the globe in multiple data centers.
2. Event-driven from end to end
…for easy business flow status updates
The request-reply model is especially common in browser-server interactions. By using Kafka together with websockets we can have the entire flow event driven, including the browser-server interactions.
This makes the interactions more fault tolerant, as messages are persisted in Kafka and can be re-processed on service restarts. This architecture is also more scalable and decoupled because state management is completely removed from the services and there is no need for data aggregation and upkeep for queries.
Consider the following use case — importing all of the Wix User’s contacts into the Wix platform.
This process involves a couple of services — the Contacts Jobs service processes the import request and creates import batch jobs, and the Contacts Importer does the actual formatting and storing of the contacts (sometimes with the help of 3rd party services).
Register, and we will let you know
A traditional request-reply approach will require the browser to keep polling for the import status and for the front-end service to keep the state of status updates in some DB tables, as well as to poll the downstream services for status updates.
Instead, by using Kafka and a websockets manager service, we can achieve a completely distributed event driven process where each service works completely independently.
First, the browser, upon request to start importing, will subscribe to the web-sockets service.
It needs to provide a channel-Id, in order for the websockets service to be able to route notifications correctly back to the correct browser:
Second, The browser needs to send an HTTP request to the jobs service with the contacts in CSV format, and also attach the channel-Id, so the jobs service (and downstream services) will be able to send notifications to the websockets service. Note that the HTTP response is going to return immediately without any content.
Third, the jobs service, after processing the request, produces job requests to a kafka topic.
Fourth, The Contacts importer service consumes the job requests from Kafka and performs the actual import tasks. While It finishes it can notify the websockets service that the job is done, which in turn can notify the browser.
- Using this design, it becomes trivial to notify the browser on various stages of the importing process without the need to keep any state and without needing any polling.
- Using Kafka makes the import process more resilient and scalable, as multiple services can process jobs from the same original import http request.
- Using Kafka replication, It’s easy to have each stage in the most appropriate datacenter and geographical location. Maybe the importer service needs to be on a google dc for faster importing of google contacts.
- The incoming notification requests to the web sockets can also be produced to kafka and be replicated to the data center where the websockets service actually resides.
3. In memory KV store
…for 0-latency data access
Sometimes we need to have dynamic yet persistent configuration for our applications, but we don’t want to create a full blown relational DB table for it.
One option is to create one big Wide Column Store table for all your applications with HBase/Cassandra/DynamoDB, where the primary key contains a prefix that identifies the app domain (e.g. “stores_taxes_”).
This solution works quite well, but there is built-in latency involved with fetching values over the network. It’s more suitable for larger data-sets than just configuration data.
Kafka offers a similar solution for key/value stores in the form of compacted topics (where the retention model makes sure that latest values of keys are not deleted).
At Wix we use these compacted topics for in-memory kv-stores where we load (consume) the data from the topic on application startup. A nice benefit here that Redis doesn’t offer is that the topic can still be consumed by other consumers that want to get updates.
Subscribe and Query
Consider the following use case — Two microservices use a compacted topic for data they maintain: Wix Business Manager (helps Wix site owners with managing their business) uses a compacted topic for a list of supported countries, and Wix Bookings (allows to schedule appointments and classes) maintains a “time zones” compacted topic. Retrieving values from these in-memory kv stores has 0 latency.
Wix Bookings listens for updates from the “supported countries” topic:
When Wix Business Manager adds another country to the “countries” topic, Wix Bookings consumes this update and automatically adds a new Time Zone for the “time zones” topic. Now the “time zones” in-memory kv-store is also updated with the new time zone:
We don’t need to stop here. Wix Events (that allows Wix Users to manage event tickets and RSVPs) can also consume Bookings’ time zones topic and automatically get updates to its in-memory kv-store whenever a country changes its time zone for daylight savings.
In the second part of this article I will describe 3 more event-driven patterns, including time-based events, events as part of a transaction and more.
Thank you for reading!
If you’d like to get updates on my experiences with Kafka and event driven architecture, follow me on Twitter and Medium.
You can also visit my website, where you will find my previous blog posts, talks I gave in conferences and open-source projects I’m involved with.
If anything is unclear or you want to point out something, please comment down below.