ACP — Data exchange framework
ACP at a glance
Capillary’s Anywhere Commerce+ (ACP) provides a secure, scalable & highly customisable enterprise ecommerce platform to retailers across the world.
ACP enables brands to quickly launch and operate a D2C site (brand web store) with a true omnichannel experience, where web, mobile, social channels, marketplaces and physical stores share a single view of inventory, pricing and fulfilment flows. It allows end consumers to discover, explore, interact, and transact with the brand and its offerings.
It lets a brand manage all aspects of running an online business, including product catalog and inventory, search and discovery, pricing and promotions, user journeys, cart and checkout, payments and logistics, order management and fulfilment operations, integrations, analytics and more.
ACP is designed as a multi-tenant SaaS offering with a microservice-based architecture. It powers hundreds of brands across the globe, with 100K+ orders placed every day and GMV processed in excess of a billion USD per year. To enable all of this, ACP needs to reliably exchange multiple terabytes of data with external and internal systems on an average day.
This blog is a two-part series, which will cover the following:
- Challenges in ACP’s data/event exchange framework, and how we solved them.
- How well the new framework scaled, new challenges and improvement areas discovered, and the road ahead.
Data exchange Framework
ACP’s data exchange framework was designed to facilitate data exchange between different internal and external systems/microservices. For example, if the status of a shipment changes, multiple consumers have to be informed: one consumer, such as the communication service, can message the user about the shipment update, while another consumer can update the My Account page of the user’s profile so that it reflects the latest shipment status.
Challenges with the legacy data exchange framework
With the unprecedented growth of e-commerce, brands need the capability to experiment quickly, try innovative products/solutions, and enable brand-specific flows. This generally means integrating with external partners and exchanging events and data. Similarly, events/data have to be exchanged between various internal microservices as well, e.g. to fulfil an order, events have to be published/consumed by multiple services like orderservice and taxservice within ACP’s platform.
Some examples:
- Allocate an order placed online to a specific physical store and inform the store’s POS; a case in point: you order a pizza online and it gets delivered from the store nearest to your location.
- Fulfilment to be powered by the brand’s existing WMS (as opposed to ACP’s WMS)
- Pushing order/return/inventory movement details to the brand’s central ERP
- Enable brands to re-compute personalised recommendations using third-party tools on every product click in the storefront.
- Communication service now allows an additional communication trigger when an order is out for delivery
- On any pricing upload, the cart service now needs to notify the end user of the price change
Architecture of the existing system
In our existing design, each source service updated its data in the database; a database trigger was then invoked, which wrote a record into a log table. From there, the DCN service picked up the messages sequentially and pushed updates to the different services responsible for keeping data in sync for the consumer journey (a simplified sketch of this flow follows the list below). This approach had the following limitations:
- Because the process was sequential, there were long delays in updating data across multiple services and clearing caches, which resulted in inconsistencies in and around the system.
- As brands started building more critical functions (e.g. 30-minute pizza delivery by the store after order allocation), any delays or failures resulted in high-priority production issues and unhappy clients.
- The database triggers slowed down MySQL and increased its load, adding operational overhead.
- There was no rollback process when processing of the log table failed.
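To make the sequential bottleneck concrete, here is a minimal sketch of that kind of trigger-fed log table being drained by a single worker. The table, column and service names are hypothetical and not ACP’s actual implementation:

```java
import java.sql.*;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the legacy flow described above: a database trigger
// writes change rows into a log table, and a single DCN worker drains that
// table sequentially, calling each downstream service one after another.
public class LegacyDcnPoller {

    record LogEntry(long id, String tableName, long rowId) {}

    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/acp", "user", "password")) {
            while (true) {
                List<LogEntry> pending = new ArrayList<>();
                try (Statement st = db.createStatement();
                     ResultSet rs = st.executeQuery(
                             "SELECT id, table_name, row_id FROM dcn_log WHERE processed = 0 ORDER BY id")) {
                    while (rs.next()) {
                        pending.add(new LogEntry(rs.getLong("id"),
                                rs.getString("table_name"), rs.getLong("row_id")));
                    }
                }
                for (LogEntry entry : pending) {
                    // Each downstream call happens strictly in sequence; one slow
                    // consumer delays every later update and cache refresh.
                    for (String service : List.of("cache-service", "search-service", "pos-sync")) {
                        notifyDownstream(service, entry.tableName(), entry.rowId());
                    }
                    try (PreparedStatement ps = db.prepareStatement(
                            "UPDATE dcn_log SET processed = 1 WHERE id = ?")) {
                        ps.setLong(1, entry.id());
                        ps.executeUpdate();
                    }
                }
                Thread.sleep(1000); // poll interval
            }
        }
    }

    static void notifyDownstream(String service, String tableName, long rowId) {
        // Placeholder for a synchronous HTTP/RPC call to a downstream service.
    }
}
```

Every downstream notification waits behind the previous one, so a single slow consumer stalls the entire chain.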
It was obvious that our existing system needed an overhaul to support the next level of growth.
Solution Implemented
The problem statement was now clear: a source service may insert/modify the data it owns, and information about this data change would be consumed by multiple different downstream services (internal or external; in part or in full), asynchronously but in near real time.
Hence we had to design a system considering the below key points:
- Data should flow between multiple consumers in near real time
- New downstream use cases can be added without any code changes in the source service
- Slowness in downstream services should not impact the source service
- The system should move away from the sequential process
- Alerts can be configured on specific queues to notify us of any abnormal behaviour
We also wanted to achieve this without too many changes on multiple downstream services.
Since data for the vast majority of ACP services is maintained in multiple MySQL databases, and all data changes (insert, update, delete, etc.) are always available in the MySQL binlog, the solution in sight was to develop an independent event producer using a CDC (change data capture) mechanism.
The new architecture needed to divide the entire process into four loosely coupled components that can be scaled, deployed and updated independently:
- Source (Main) Application — Should continue to perform its core tasks without being bothered by what part of these data modifications are needed by which downstream service.
- Producer (New application) — Should read all data changes independently, filter as needed and publish through a Router.
- Router (Pub-Sub) — Segregate data into different queues (by function, brand, module, events, etc.) and enable downstream services to subscribe to these Data change notifications (DCN).
- Consumer (Downstream) Application — Subscribes to the DCN events published by the router and performs desired actions.
We decided to achieve this by implementing the Maxwell daemon to read the MySQL binlog within a new producer service, named LEGO. A publisher-subscriber routing model was also set up using RabbitMQ.
To select a CDC library, a brief comparative analysis was performed between Maxwell and Debezium, and we decided to use Maxwell due to its low operational overhead, familiarity and ease of implementation.
How does Lego work?
The MySQL server’s binary log consists of files containing “events” that describe modifications to database contents; the server writes these files in binary format. With row-based logging, each event contains the entire row of data. Lego reads these log files, compares the current events with the previously read log state, and creates a data bus argument.
The data bus arguments contain the table name, column name, row identifier, and the old and new values for a given column. The events are generated in a standard, configurable JSON structure. Lego publishes the events to a destination, i.e. a queueing system with a pub-sub mechanism.
An exchange named DCN (data change notification) is set up with configurable routing keys for Lego to publish these events into various queues. As new routing keys are configured, new durable queues are created in RabbitMQ and Lego starts publishing events to them.
Routing key format: dcn.{cluster}.{pod}.{database}.{table-name}.{merchantId}
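As an illustration of the publish side, here is a minimal sketch using the RabbitMQ Java client; the host, routing-key values and payload are placeholders, and this is not Lego’s actual code:

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;
import java.nio.charset.StandardCharsets;

// Illustrative publisher: declares a durable "dcn" topic exchange and publishes
// an event with a routing key built in the documented format.
public class DcnPublisherSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // placeholder broker host
        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {

            // Durable topic exchange; survives broker restarts.
            channel.exchangeDeclare("dcn", "topic", true);

            // dcn.{cluster}.{pod}.{database}.{table-name}.{merchantId}
            String routingKey = String.join(".",
                    "dcn", "cluster1", "pod1", "orders_db", "shipment", "42");

            String payload = "{\"table\":\"shipment\",\"type\":\"update\"}"; // simplified body
            channel.basicPublish("dcn", routingKey,
                    MessageProperties.PERSISTENT_TEXT_PLAIN, // persistent delivery
                    payload.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```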
Now, by binding queue(s) to the dcn exchange with the corresponding routing key, we are able to listen to DCN events on a given queue, and the consumers attached to these queues can consume the messages. If a consumer needs events from multiple tables, it can bind multiple routing keys to the same queue.
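A sketch of the consumer side, binding one durable queue to the dcn exchange with routing keys for two hypothetical tables (queue, key and table names are illustrative only):

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;
import java.nio.charset.StandardCharsets;

// Illustrative consumer: one durable queue bound to the dcn exchange with
// multiple routing keys, so events from several tables land in the same queue.
public class DcnConsumerSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // placeholder broker host
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        channel.exchangeDeclare("dcn", "topic", true);
        // Durable, non-exclusive, non-auto-delete queue.
        channel.queueDeclare("order-sync-queue", true, false, false, null);

        // Bind the same queue with routing keys for two different tables.
        // Topic exchanges also allow wildcards, e.g. "dcn.cluster1.pod1.orders_db.*.42".
        channel.queueBind("order-sync-queue", "dcn", "dcn.cluster1.pod1.orders_db.shipment.42");
        channel.queueBind("order-sync-queue", "dcn", "dcn.cluster1.pod1.orders_db.order_header.42");

        DeliverCallback onMessage = (consumerTag, delivery) -> {
            String body = new String(delivery.getBody(), StandardCharsets.UTF_8);
            System.out.println(delivery.getEnvelope().getRoutingKey() + " -> " + body);
            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
        };
        // Manual acknowledgements so messages stay in the queue until processed;
        // the connection is intentionally left open so the consumer keeps receiving.
        channel.basicConsume("order-sync-queue", false, onMessage, consumerTag -> {});
    }
}
```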
The JSON structure of each event message is illustrated below.
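As a rough illustration only (this is not Lego’s exact schema, just a hypothetical payload built from the fields described above), an update event could carry fields along these lines and be read with a JSON library such as Jackson:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical DCN payload with the fields described above (table name,
// row identifier, changed column with old and new values); field names and
// values are illustrative, not Lego's exact schema.
public class DcnEventParseSketch {
    public static void main(String[] args) throws Exception {
        String payload = """
            {
              "database": "orders_db",
              "table": "shipment",
              "type": "update",
              "row_id": "SHP-1001",
              "merchantId": 42,
              "data": { "status": "OUT_FOR_DELIVERY" },
              "old":  { "status": "PACKED" }
            }
            """;

        ObjectMapper mapper = new ObjectMapper();
        JsonNode event = mapper.readTree(payload);
        String table = event.get("table").asText();
        String newStatus = event.get("data").get("status").asText();
        String oldStatus = event.get("old").get("status").asText();
        System.out.printf("%s: %s -> %s%n", table, oldStatus, newStatus);
    }
}
```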
Multiple consumers can now consume the required events from a single queue or from multiple queues. After consuming the messages, each individual application or service can generate an event for an external system or for another service. In this way we can synchronise data across downstream systems and applications reliably and with minimal latency.
With all of it in place our new architecture looks as below:
Additional considerations for the new framework:
- Read Lag — In rare situations of very heavy database activity, Lego may not be able to keep up with reading binlogs at the same pace, which could result in a read lag. Reads can also pile up if Lego itself goes down for some reason.
- To ensure recovery from such a situation, Lego maintains its own position in a database table, equal to the timestamp of the last read log. This ensures Lego can always start reading the binlog from a particular point and recover (a simplified sketch follows this list).
- Router (RabbitMQ) down — All queues are created as durable queues, so that if RabbitMQ restarts, all the queues along with any pending messages are recovered.
- Consumer (Destination) down — Since all messages are kept in RabbitMQ until consumed, the consumer can process the pending messages when it comes back up.
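The read-lag recovery mentioned above can be pictured with a small sketch; the table and column names here are hypothetical and not Lego’s actual schema:

```java
import java.sql.*;

// Hypothetical checkpointing sketch: persist the last processed binlog
// position/timestamp so the producer can resume from it after a restart.
public class BinlogPositionCheckpoint {
    private final Connection db;

    public BinlogPositionCheckpoint(Connection db) {
        this.db = db;
    }

    // Load the last saved position; a missing row means "start from now".
    public long loadLastPosition() throws SQLException {
        try (PreparedStatement ps = db.prepareStatement(
                "SELECT last_read_ts FROM lego_position WHERE id = 1");
             ResultSet rs = ps.executeQuery()) {
            return rs.next() ? rs.getLong("last_read_ts") : System.currentTimeMillis();
        }
    }

    // Save the position after a batch of events has been published.
    public void savePosition(long timestamp) throws SQLException {
        try (PreparedStatement ps = db.prepareStatement(
                "UPDATE lego_position SET last_read_ts = ? WHERE id = 1")) {
            ps.setLong(1, timestamp);
            ps.executeUpdate();
        }
    }
}
```

On startup the producer loads the last saved position and resumes reading the binlog from there; after each published batch it saves the new position.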
All the components of this framework are built with standard observability and are part of our regular monitoring & alerting processes, to ensure we can proactively handle any issues in this pipeline.
Benefits of the new architecture:
- Since microservices no longer need to fan out specific events and are not impacted by the old MQ series issues, the overall scalability, efficiency & stability of the microservices has increased significantly.
- The no-code setup capability for new events has resulted in faster go-lives and experimentation for brands.
- Technology teams can focus on roadmap work instead of continually building new events for every requirement.
This concludes Part 1 of this two-part series. We shall discuss the challenges and opportunities in this new framework, and future plans, in Part 2.