How did we reduce stock availability changes from 120 seconds to less than 30 ms with Debezium?

Intro

Hüseyin Karaalp
Trendyol Tech
4 min read · Jan 31, 2023


As the Fulfillment team, we are responsible for stock management in our warehouses: counting, transfers, defective products, purchase orders, returns, picking, allocation, and managing available stock.

In this article, I will describe the system design that allowed us to reduce the propagation time of stock availability changes from our warehouses from 120 seconds to less than 30 ms with Debezium, and how we prevent overstock sales. This article will not cover the installation of Debezium, enabling CDC in SQL Server, and similar setup details.

Terms

Before moving on, I need to explain some terms.

Overstock sale: the sale of a product that we cannot actually supply. We never want this to happen.

Available stock: the quantity of a product that we can sell. This number is critical for us.

StockAPI: the API that collects stock information from our warehouses (supplied by us) and from marketplace (MP) sellers and provides it to the UI. It is owned by a team outside our tribe.

Why is available stock important for us?

We must feed available stock information to StockAPI quickly and correctly.

If we report a number that is too high, we risk overstock sales. If we report a number that is too low, products may not appear in the UI even though stock is available. Both cases lead to customer dissatisfaction.

How did the system work before Debezium?

A job triggered every 2 minutes ran a Stored Procedure (SP) that calculated the available quantities of all products. The calculated stock numbers were then sent to StockAPI one by one, as sketched below. (If you are curious about the details, you can watch the YouTube video we mentioned.)
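To make the old flow concrete, here is a minimal sketch of such a polling job. The connection string, stored procedure name, and StockAPI endpoint are hypothetical; the point is only the 2-minute batch-and-push pattern.

```python
import time
import pyodbc
import requests

# Hypothetical connection string, SP name and StockAPI endpoint,
# shown only to illustrate the 2-minute batch-and-push pattern.
CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=warehouse-db;DATABASE=Wms;UID=app;PWD=***"
STOCK_API_URL = "https://stockapi.example.com/stocks"

def run_once():
    with pyodbc.connect(CONN_STR) as conn:
        cursor = conn.cursor()
        # Recalculate the available quantity of every product (costly on the DB).
        cursor.execute("EXEC dbo.CalculateAvailableStocks")
        rows = cursor.fetchall()

    # Push the calculated numbers to StockAPI one by one.
    for product_id, available_quantity in rows:
        requests.post(
            STOCK_API_URL,
            json={"productId": product_id, "availableQuantity": available_quantity},
            timeout=5,
        )

while True:
    run_once()
    time.sleep(120)  # the 2-minute interval that causes stale availability
```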

One of the biggest disadvantages here is the 2-minute interval. Within those 2 minutes, stock can be closed for sale by different processes, such as locking stock locations for sales outside of B2C, reserving stock for B2B sales, and so on. For example, a situation like the one below may occur when a location is locked for sale.

Consider the timeline below and the accompanying descriptions.

Other disadvantages:

  • Calculating all stock counts with the SP each time was costly. CPU spikes occurred in the database while the SP was running, and they started to affect other applications running on the same database server.
  • The system was sufficient at first, but with newly opened warehouses and growing inventory variety, the calculation started to take more than 2 minutes. Whenever it did, the result was overstock sales.

Why did we decide to use Debezium?

Legacy applications were running on the system, and they could access the stock context directly and execute CRUD operations. If all CRUD operations had gone through a single place that published an event, we could have consumed that event and sent stock changes immediately. Since we did not yet have an event-based approach, we chose Debezium, which captures the changes at the database level.

System Architecture with Debezium

System Architecture

The system design is shown above.

Change Data Capture (CDC) was enabled on SQL Server and on the Stock table, configured to capture changes only in the columns we care about. This way, CRUD operations coming from different applications are captured and written to the CDC table. (Fig 1)
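For reference, enabling CDC for the database and for a single table with a restricted column list looks roughly like this. The connection string and the table/column names are hypothetical; the system procedures themselves (sys.sp_cdc_enable_db, sys.sp_cdc_enable_table) are SQL Server's standard CDC procedures.

```python
import pyodbc

# Hypothetical connection string and table/column names; the system
# procedures are SQL Server's standard CDC procedures.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=warehouse-db;DATABASE=Wms;UID=admin;PWD=***",
    autocommit=True,
)
cursor = conn.cursor()

# Enable CDC for the database.
cursor.execute("EXEC sys.sp_cdc_enable_db")

# Enable CDC for the Stock table, capturing only the columns we need.
cursor.execute("""
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'Stock',
        @role_name     = NULL,
        @captured_column_list = N'Id, ProductId, LocationId, Quantity, IsLockedForSale'
""")
```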

We set up Debezium as a Kafka Connect connector. The connector monitors and records the database changes and publishes them to a Kafka topic. (Fig 2)
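A connector like this is registered through the Kafka Connect REST API with a JSON configuration. Below is a rough sketch of such a registration; the host names, credentials, topic prefix, and table/column lists are placeholders, and exact property names can differ between Debezium versions.

```python
import requests

# Hypothetical hosts, credentials and names; property names may differ
# slightly between Debezium versions.
connector = {
    "name": "stock-connector",
    "config": {
        "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
        "database.hostname": "warehouse-db",
        "database.port": "1433",
        "database.user": "debezium",
        "database.password": "***",
        "database.names": "Wms",
        "topic.prefix": "warehouse",
        # Capture only the Stock table and only the columns we need.
        "table.include.list": "dbo.Stock",
        "column.include.list": "dbo.Stock.Id,dbo.Stock.ProductId,dbo.Stock.Quantity,dbo.Stock.IsLockedForSale",
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-changes.stock",
    },
}

# Register the connector through the Kafka Connect REST API.
resp = requests.post("http://connect:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()
```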

We consume the Kafka topic and process the data. The processed data is sent to a new Kafka topic for StockAPI. (Fig 3)
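The worker in Fig 3 is essentially a consume-transform-produce loop. Here is a minimal sketch using kafka-python; the topic names, message shape, and transformation are simplified assumptions, since Debezium's real envelope also carries schema and source metadata.

```python
import json
from kafka import KafkaConsumer, KafkaProducer

# Hypothetical topic names and bootstrap servers.
consumer = KafkaConsumer(
    "warehouse.dbo.Stock",                     # topic written by the Debezium connector
    bootstrap_servers="kafka:9092",
    group_id="stock-worker",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    if message.value is None:                  # skip tombstone records
        continue
    payload = message.value["payload"]         # Debezium envelope: before/after images + op
    after = payload.get("after")
    if after is None:                          # e.g. a delete event
        continue

    # Simplified transformation into the shape StockAPI expects (assumed fields).
    stock_change = {
        "productId": after["ProductId"],
        "availableQuantity": 0 if after["IsLockedForSale"] else after["Quantity"],
    }
    producer.send("stock-availability-changes", value=stock_change)
```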

Since each change is captured and processed immediately, the time it takes the new system to send a change has decreased from 120 seconds to less than 30 milliseconds.

How do we monitor the system?

We need to be notified immediately when any problem occurs in the system. Instead of hearing about a system issue from someone else, we prefer to get a notification from our monitoring tools. We believe that's really important, and we've done a few things to make it happen.

We have default monitors for all of our systems, which lets us watch them via dashboards and send alarms when necessary. For example, we monitor CPU, memory, and restart counts for our pods via Grafana; when a threshold is exceeded, an alert is sent to our Slack channel. We also use anomaly detection in New Relic. Suppose the system normally has a throughput of about 2,000 and it suddenly drops to 20 or climbs to 5,000 and stays there; we get an alert directly in our Slack channel.

The connector could go down. When the connector is down, changes cannot be published to the Kafka topic. For this situation, we've set up PRTC alerts. (For Fig 2) The connector exposes an HTTP endpoint that returns its status, such as RUNNING or STOPPED. If the status changes from RUNNING to anything else, we get an alert directly in our Slack channel.
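The status check itself relies on Kafka Connect's REST API (GET /connectors/{name}/status). A sketch of a simple probe is below; the Connect host, connector name, and the alert hook are placeholders for whatever the monitoring tool actually calls.

```python
import requests

# Hypothetical Connect host and connector name.
STATUS_URL = "http://connect:8083/connectors/stock-connector/status"

def notify_slack(text):
    # Placeholder for the real alerting integration.
    print(text)

def check_connector():
    status = requests.get(STATUS_URL, timeout=5).json()
    state = status["connector"]["state"]       # e.g. RUNNING, PAUSED, FAILED
    if state != "RUNNING":
        notify_slack(f"Debezium connector state changed: {state}")
```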

The worker service may stop processing messages. The worker service logs every processing step, both started and finished. For this situation, we've defined a threshold on the count of "finished" logs. (For Fig 3) If the system produces fewer logs than this threshold within 30 seconds, we get an alert directly in our Slack channel.

The connector and the worker service may both be running and everything may look as expected, yet we may still be processing messages with lag. Imagine messages being processed slowly and accumulating in the Kafka topic; for example, the message we are currently processing is from 10 minutes ago and we are 20,000 messages behind. For this situation, we've set up a Grafana alert: when the consumer lag exceeds a certain threshold, we get an alert directly in our Slack channel.
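Consumer lag is simply the gap between the last offset written to each partition and the offset the consumer group has committed; the Grafana alert fires on this number. A sketch of computing it directly with kafka-python is below, with hypothetical topic and group names (in practice the value usually comes from exported broker metrics rather than an ad hoc consumer).

```python
from kafka import KafkaConsumer, TopicPartition

# Hypothetical topic and consumer group names.
TOPIC, GROUP = "warehouse.dbo.Stock", "stock-worker"

consumer = KafkaConsumer(bootstrap_servers="kafka:9092", group_id=GROUP)
partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]

end_offsets = consumer.end_offsets(partitions)           # last offset per partition
lag = 0
for tp in partitions:
    committed = consumer.committed(tp) or 0               # last committed offset of the group
    lag += end_offsets[tp] - committed

print(f"total consumer lag: {lag} messages")
```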

Thank you for taking the time to read my article; I always welcome your feedback.
