Building a Deduper: a systematic approach in our system context

If you are a software engineer, you have probably heard about deduplication hundreds of times. Every system that wants to scale should be interested in filtering out duplicate processing as early as possible in the event processing pipeline. At Walmart Global Tech, we are no different: our enterprise rule engine platform was built to optimize the processing of a specific event (we call it an override on rules) that can be created and updated in many different ways by the user (e.g., UI/API/spreadsheets via REST or Kafka).

Assumptions:

  • If a duplicate is found, the processing will be considered a success.
  • If the duplicate check fails, we will continue with processing.

Approach #1: (Sprouting)

My first system design approach relied on two main components:

  1. A Bloom filter (which tells us that a key is either MAYBE present or NOT present)
  2. A key/value store (Memcached or Cassandra (C*)), as shown in figure 1 below

We would first check whether an override’s key (a hash of it) may be present in the Bloom filter, and if so, query the key/value store (cache) to confirm the duplicate. If the key was not present in the Bloom filter, or no duplicate was found in the cache, processing would follow the normal flow. If a duplicate was found, the system would mark the event as “duplicate” and respond to the caller with a successful processing status. Here is a high-level system diagram of our override processing system (shown in figure 1).

Figure 1 : System Diagram for ‘Override Deduper’
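
To make the flow concrete, here is a minimal sketch of that two-step check in Java, using Guava’s BloomFilter and an in-memory map standing in for the Memcached/Cassandra cache (class and method names are assumptions, not our production code):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class OverrideDeduper {
    // MAYBE/NOT-present filter; the sizing numbers here are placeholders.
    private final BloomFilter<CharSequence> seenKeys =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.01);
    // Stand-in for the Memcached/Cassandra key/value store.
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    /** Returns true if the event is a duplicate and can be acked to the caller as a success. */
    boolean isDuplicate(String overrideKeyHash, String payload) {
        if (!seenKeys.mightContain(overrideKeyHash)) {   // NOT present: definitely new
            seenKeys.put(overrideKeyHash);
            cache.put(overrideKeyHash, payload);
            return false;                                // continue with normal processing
        }
        // MAYBE present: confirm against the cache to rule out a Bloom-filter false positive.
        boolean duplicate = payload.equals(cache.get(overrideKeyHash));
        if (!duplicate) {
            cache.put(overrideKeyHash, payload);         // same key, new payload: not a duplicate
        }
        return duplicate;
    }
}
```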

Advantages:

  1. It’s an easy design and simple to understand.
  2. As the data volume is not so large yet (although it’s growing), we did not focus much on a cache invalidation strategy.
  3. We preferred a write-through cache (latency is ~1ms for cache writes within the same data center (DC) and <10ms cross-DC, and there is no additional cost to write to the DB since we were already writing to C* anyway) over write-around or write-back caches; a sketch follows this list.
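
For readers unfamiliar with the pattern, a write-through store simply writes to the backing store and the cache in the same operation; a minimal sketch, with both client interfaces assumed rather than real library APIs:

```java
// Both interfaces below are assumptions standing in for real Memcached/Cassandra clients.
interface CacheClient { void put(String key, String value); }
interface CassandraClient { void upsert(String key, String value); }

final class WriteThroughStore {
    private final CacheClient cache;
    private final CassandraClient db;

    WriteThroughStore(CacheClient cache, CassandraClient db) {
        this.cache = cache;
        this.db = db;
    }

    // Write-through: the DB write we were already doing, plus a cheap cache write (~1ms same-DC).
    void write(String key, String value) {
        db.upsert(key, value);
        cache.put(key, value);
    }
}
```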

Challenges:

  1. We need to keep the Bloom filter entries up to date for it to be effective.
  2. Handling partial failures was really tricky, along with timeouts, etc. This prompted us to consider approach #2 below.
  3. We planned to do a duplicate check on both the key (which includes the override key + profile + consumer domain details, etc.) and the payload (status + action, etc.), but in our case, checking whether the key is present and the payload matches is still not enough, because there is an additional dependency on the final status computed from different time frames (we call them timeboxes) for every incoming event. An example of an override payload is sketched below.
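
All field names in this sketch are assumptions reconstructed from the attributes mentioned above:

```java
// Hypothetical shape of an override payload; every field name here is an assumption.
record Timebox(String startTime, String endTime, String status) {}
record OverrideKey(String overrideKey, String profile, String consumerDomain) {}
record OverridePayload(OverrideKey key, String status, String action, Timebox timebox) {}
```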

Approach #2: (Budding)

We planned to overcome partial-failure handling with the introduction of STATE and TTL (simply a time duration since the event was created). There are three possible types of states, plus the transitions between them: 1. Start (e.g., Start/Null), 2. Intermediate (e.g., IN_PROGRESS/Processing, etc.), and 3. Terminal (SUCCESS/FAIL).

Along with states, there are two possible actions, DROP or PROCESS (the event), determined from the state transitions (as shown in figure 2). For example, if a new event comes into the system with the same key and payload, the current state is Processing, and the predefined TTL has not elapsed, it makes sense to drop the event; otherwise, if the TTL has elapsed, we process the event, since the system could not reach a terminal state in that time.

Figure 2 : State Transition Diagram (state machine approach)
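
A minimal sketch of that DROP/PROCESS decision (enum and method names are assumptions; the real state machine has more transitions than shown here):

```java
import java.time.Duration;
import java.time.Instant;

enum State { START, PROCESSING, SUCCESS, FAIL }
enum Action { DROP, PROCESS }

final class StateTtlDeduper {
    static Action decide(State current, Instant eventCreatedAt, Duration ttl) {
        boolean ttlElapsed = Instant.now().isAfter(eventCreatedAt.plus(ttl));
        if (current == State.PROCESSING && !ttlElapsed) {
            return Action.DROP;      // an identical event is already in flight within its TTL
        }
        if (current == State.SUCCESS) {
            return Action.DROP;      // already processed successfully: this is a duplicate
        }
        return Action.PROCESS;       // new, failed, or stale PROCESSING past its TTL
    }
}
```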

Advantage:

  1. This approach handles partial failures, unlike the previous one.

Disadvantages:

  1. It introduces an additional table for state transitions and adds extra lookup and maintenance cost.
  2. Computing the overall override status is still needed to check the existing value/payload for a key in the cache.

Approach #3: (Blooming)

“The simplest solution is almost always the best.” — Occam’s Razor

We also tried to simplify the solution to our deduplication problem. We knew by now that our system did a lot of computation, such as merging the incoming timebox (time frame) and status with the existing timeboxes and statuses in the system to determine the final merged state (overall override status). So we concluded that if the merged state for an incoming request fully or partially matched the saved data, we should mark it as a duplicate/semi-duplicate. Hence, we placed our deduplication logic after the computation is done. The deduplication status is a composite status in this case, carrying the following decisions (a sketch follows the list):

  1. isDBUpdate = TRUE/FALSE → based on this, our system database and scheduler (we use our own distributed scheduler to trigger these events) get updated.
  2. isDownstreamProcessing = TRUE/FALSE → based on this value, we determine whether we need to process these events with our downstream client applications.
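
A hypothetical sketch of that composite status (names are assumed, and the semi-duplicate policy is simplified here):

```java
record DedupStatus(boolean isDBUpdate, boolean isDownstreamProcessing) {}

final class PostComputeDeduper {
    /** Compare the freshly merged override state with what is already saved. */
    static DedupStatus decide(String savedMergedState, String computedMergedState) {
        if (computedMergedState.equals(savedMergedState)) {
            return new DedupStatus(false, false);  // full duplicate: skip DB, scheduler, downstream
        }
        // A semi-duplicate (partial match) policy would set these two flags independently.
        return new DedupStatus(true, true);        // genuinely new: update DB/scheduler and fan out
    }
}
```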

Irrespective of the deduplication status, we update our audit logs/history details for internal audits. From a time perspective, after timebox (time frame) merging, an override’s status is one of the following (a code sketch follows the list):

  1. from now to a future date: Active
  2. from a past time till now or the past: Expired
  3. from now or a future time till a future time: Due
  4. all timeboxes deleted from the system for that override key: Deleted
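
One possible reading of those rules as code (boundary handling and all names are assumptions):

```java
import java.time.Instant;
import java.util.List;

record Timebox(Instant start, Instant end) {}

final class OverrideStatusClassifier {
    static String classify(List<Timebox> mergedTimeboxes, Instant now) {
        if (mergedTimeboxes.isEmpty()) return "DELETED";          // all timeboxes removed
        Instant start = mergedTimeboxes.get(0).start();
        Instant end = mergedTimeboxes.get(mergedTimeboxes.size() - 1).end();
        if (!end.isAfter(now)) return "EXPIRED";                  // ended now or in the past
        if (start.isAfter(now)) return "DUE";                     // starts in the future
        return "ACTIVE";                                          // covers now through a future date
    }
}
```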

The system should NOT honor the deduplication status if a previous event hit a partial failure, so we decided to proceed with the deduplication flow for every event and simultaneously check, asynchronously, from the history/log table what the last status was (other than a duplicate status). If the previous status is an intermediate state (e.g., Processing/Timeout) and a configurable time window (e.g., 4 hours in our case) has elapsed, we don’t honor the deduplication logic’s decision; instead, we proceed with normal processing.
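
A sketch of that asynchronous guard (the history lookup interface, entry shape, and pool size are all assumptions):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

interface HistoryLookup { HistoryEntry lastNonDuplicateStatus(String overrideKey); }

record HistoryEntry(String status, Instant updatedAt) {
    boolean isIntermediate() { return "PROCESSING".equals(status) || "TIMEOUT".equals(status); }
}

final class DedupGuard {
    private static final Duration WINDOW = Duration.ofHours(4);           // configurable window
    private final ExecutorService pool = Executors.newFixedThreadPool(4); // the separate thread pool
    private final HistoryLookup history;

    DedupGuard(HistoryLookup history) { this.history = history; }

    /** Returns false when a previous attempt looks stuck, i.e., fall back to normal processing. */
    boolean honorDedup(String overrideKey) throws Exception {
        CompletableFuture<HistoryEntry> last =
                CompletableFuture.supplyAsync(() -> history.lastNonDuplicateStatus(overrideKey), pool);
        HistoryEntry entry = last.get();
        boolean stuck = entry != null
                && entry.isIntermediate()
                && Duration.between(entry.updatedAt(), Instant.now()).compareTo(WINDOW) > 0;
        return !stuck;
    }
}
```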

Advantages:

  1. No hashing of the key is needed, hence no hashing logic, cost, or collisions.
  2. No additional key/value store is needed for caching, and no cache to maintain.
  3. No need to maintain any separate state.
  4. Fewer integration points, fewer failures.
  5. We could avoid downstream processing, database updates, and scheduling updates when different types of duplicates occur, ultimately lowering the cost.

Disadvantages:

  1. Although this approach improves the efficiency of override processing by skipping DB updates and downstream processing, deduplication happens after computation, which costs a few milliseconds (e.g., 10ms) for every event.
  2. It requires a separate thread pool for the asynchronous previous-state check, and that cost is incurred for every event.

Metrics

Everyone loves data, so here is a snapshot of our production metrics. The data is heavily dependent on user behavior, so the plot pattern varies a lot.

Figure 3: Statistics for deduplication at database, downstream systems, and scheduler

Closing:

Trade-offs are a harsh reality of distributed systems design. We have achieved a good percentage of our end goal with this design approach. Last but not least, we also built developer tooling for force-triggering events for any processing status and time frame, for worst-case recovery from any unforeseeable issues.

References:

  1. Cache writing schemes: https://shahriar.svbtle.com/Understanding-writethrough-writearound-and-writeback-caching-with-python
  2. Caching strategies: https://codeahoy.com/2017/08/11/caching-strategies-and-how-to-choose-the-right-one/
  3. Bloom filter: https://yourbasic.org/algorithms/bloom-filter/
  4. State diagram: https://en.wikipedia.org/wiki/State_diagram

I would like to thank my friend and amazing lead, Amit Sharma, for brainstorming with me to come up with this solution in a systematic manner.
