Data Replication: Must-Have Architecture Capabilities

Nathan Murray
5 min read · Mar 29, 2020


In every organization I’ve been a part of, there is a common struggle in moving data from one place to another. It’s usually for good reason: providing executive decision support through a data warehouse, enabling full-text and faceted searching, or publishing data as a product. The objective sounds simple: just send all the data from the source (or a portion of it) to the target. Unfortunately, a myriad of things can go wrong when trying to do this. To replicate data well, you must ensure four things: the data must be valid, complete, chronologically accurate, and timely.

Most systems I’ve encountered are designed to make sure that data is valid. It gets more difficult when not all of the data makes it to the target, when it arrives out of order, or when it arrives late. More and more businesses require these systems to be near real-time, which increases the likelihood of these errors if they are not addressed during system design.

Validity

Data in the target must be semantically equivalent to the source. Note that I said equivalent, not equal. You cannot expect the data to be identical, especially if you are moving from one type of system to another, e.g., RDBMS to NoSQL. But you can, and should, expect the data to have the same underlying meaning in the target system. For example, you may have a value in the source that is a date stored in a text field. In the target, you may want to represent that using an actual date data type. Doing so comes with certain restrictions: the input must be in the expected format, must be within a certain range, may or may not accept nulls, and so on. If these restrictions are not paid attention to, they can make the data semantically different.

As in the above example, data types are a common way for RDBMS and other systems to enforce data quality rules. In the same way, data tests can be applied to ensure semantic equivalence, either while the data is in flight or periodically after it has been persisted. In-flight tests allow for immediate action on data that may be inaccurate, but they can only check accuracy at the row level.
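To make this concrete, here is a minimal sketch of an in-flight, row-level test in Python. The field name (signup_date), the accepted format, and the allowed range are assumptions for illustration, not a prescription.

```python
# A minimal sketch of an in-flight, row-level validity test.
# The field name, format, and range below are illustrative assumptions.
from datetime import date, datetime

def validate_signup_date(row: dict) -> list[str]:
    """Return a list of validation errors for a single row (empty if valid)."""
    errors = []
    raw = row.get("signup_date")

    if raw is None:
        errors.append("signup_date is null")
        return errors

    # The source stores the date as text; the target uses a real date type,
    # so the text must parse in the expected format...
    try:
        parsed = datetime.strptime(raw, "%Y-%m-%d").date()
    except ValueError:
        errors.append(f"signup_date '{raw}' is not in YYYY-MM-DD format")
        return errors

    # ...and fall within the range the target will accept.
    if not (date(1900, 1, 1) <= parsed <= date.today()):
        errors.append(f"signup_date {parsed} is outside the allowed range")

    return errors
```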

In either case, data that is detected as inaccurate must be acted upon. If the problem is severe enough that the data should not be placed into the target, the data must be routed to a quarantine and someone notified to resolve the issue. If the problem is less severe, it may be prudent to persist the data anyway, but flag it so that consumers of reports are aware that it is inaccurate.
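A rough sketch of that routing decision might look like the following. The severity classification, the sinks, and the notify callback are placeholders; in practice these would be a quarantine table, a flagged column in the target, and an alerting hook.

```python
# A sketch of severity-based routing: severe failures are quarantined and
# someone is notified; minor issues are persisted but flagged for consumers.
# The sinks and notify callback are stand-ins for real tables and alerting.
def route_row(row: dict, errors: list[str], severe: bool,
              target: list, quarantine: list, notify=print) -> None:
    if not errors:
        target.append(row)  # clean row: persist as-is
    elif severe:
        quarantine.append({**row, "errors": errors})
        notify(f"row {row.get('id')} quarantined: {errors}")
    else:
        # Less severe: keep the row, but flag it so report consumers
        # know it failed a data test.
        target.append({**row, "quality_flag": "; ".join(errors)})
```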

Completeness

Data in a target should be a complete representation of what is in the source. I’ve seen this go wrong again and again across industries. Rows get dropped. At best they are recovered and reporting is delayed; at worst they are forgotten and trust in the target data set is eroded. Hours and hours of painstaking reconciliation can be spent resolving these issues.

Making sure a target data set is complete comes with two architectural objectives: first, design the system so that you can detect when the target has incomplete data; second, design it with enough resiliency that data is not dropped between the source and the target.

Detecting incompleteness requires logging of activity. As batches are processed, there should be a record of the batch, how many rows were processed, and details on any errors. If the systems store data in similar ways, it may be reasonable to apply a checksum to source and target batches. This has a double benefit: it verifies both the completeness of the data and its validity. The difficulties are two-fold as well. Checksums are hard to apply in practice because they require exactly the same textual input; if one value is off or formatted differently, the checksum will report a mismatch even though nothing is actually missing. And when a checksum fails, it cannot tell you what is wrong with the data, so you still have to reconcile manually.
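Here is a minimal sketch of both checks, assuming the rows can be pulled into memory and serialized consistently on both sides (which, as noted above, is the hard part):

```python
# A sketch of per-batch completeness checks: row counts plus an
# order-insensitive checksum over canonically serialized rows.
# It only works if source and target serialize values identically.
import hashlib
import json

def batch_checksum(rows: list[dict]) -> str:
    """Order-insensitive checksum over canonically serialized rows."""
    digests = sorted(
        hashlib.sha256(
            json.dumps(row, sort_keys=True, default=str).encode("utf-8")
        ).hexdigest()
        for row in rows
    )
    return hashlib.sha256("".join(digests).encode("utf-8")).hexdigest()

def verify_batch(batch_id: str, source_rows: list[dict],
                 target_rows: list[dict]) -> dict:
    """Produce a batch-log entry with counts and a checksum comparison."""
    return {
        "batch_id": batch_id,
        "source_count": len(source_rows),
        "target_count": len(target_rows),
        "counts_match": len(source_rows) == len(target_rows),
        "checksums_match":
            batch_checksum(source_rows) == batch_checksum(target_rows),
    }
```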

Completeness gets particularly difficult when data is updated or deleted. In this case, it’s almost required that a journal of changes be kept to ensure updates and deletes are applied properly downstream.
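As a rough illustration of the idea, a journal of changes can be replayed in commit order to reconstruct the target state; the event shape here is an assumption, not a standard.

```python
# A sketch of replaying a change journal into a keyed target state.
# Each event records the operation, the row key, and (for upserts) the data.
def apply_journal(journal: list[dict]) -> dict:
    target: dict = {}
    for event in journal:          # events are assumed to be in commit order
        key = event["key"]
        if event["op"] in ("insert", "update"):
            target[key] = event["data"]
        elif event["op"] == "delete":
            target.pop(key, None)  # tolerate deletes for keys never seen
    return target
```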

Thankfully, many ETL systems already have this functionality and do not require developers to roll their own. When possible, use what’s already been built so that you don’t have to.

Chronology

This edge case is particularly difficult to fix if it is not thought about and designed for from the beginning. It’s best to explain with an example. Say I recently moved and am updating my address with my bank. I accidentally misspell the street name, and after submitting I have to go back and correct my error. Unbeknownst to me, both of those requests flow to the bank’s data warehouse along two different paths, and the mistaken update gets delayed and arrives after the correction. If this is not accounted for, reports would show incorrect data for my account.

While this is not a common scenario for legacy data processing systems that would batch process a day’s activity overnight, in today’s real time systems this can be a daunting and persistent issue.

The resolution is simple to state, yet complicated to implement. Processing systems must use the source modified timestamp to determine chronology and must apply changes in that order. The complication is that this logic must be pervasive throughout the data processing systems. ETL teams can build it into their code reviews and “definition of done” to ensure its use. QA teams can integrate it into their test plans as well. Systems with automated data tests can have a suite of tests to integrate into the development process.
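A minimal sketch of the rule, assuming each record carries a source modified timestamp and the pipeline keeps a per-key watermark of the last applied change:

```python
# A sketch of timestamp-ordered application: a change is applied only if its
# source-modified timestamp is newer than the last change applied for that
# key, so a delayed earlier write cannot overwrite a later correction.
from datetime import datetime

def apply_if_newer(target: dict, key: str, data: dict,
                   source_modified_at: datetime,
                   watermarks: dict) -> bool:
    last_seen = watermarks.get(key)
    if last_seen is not None and source_modified_at <= last_seen:
        return False                 # stale, out-of-order change: skip it
    target[key] = data               # apply the change
    watermarks[key] = source_modified_at
    return True
```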

Timeliness

In a near real-time data pipeline, standard cloud architectural principles are important to follow. Design the system to auto-scale so that intermittent spikes in volume do not break time-based SLAs. Within the pipelines themselves, it’s also important to be able to detect and diagnose SLA violations so that the underlying issues can be resolved. All batches should have a processing time recorded in a log, along with the earliest source modified timestamp in each batch. With these data points recorded, SLAs become reportable, and teams can trace violations to their root causes.
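Here is a small sketch of that per-batch instrumentation; the 15-minute SLA is an arbitrary example, and the timestamps are assumed to be timezone-aware:

```python
# A sketch of per-batch SLA instrumentation: record when the batch was
# processed and the earliest source-modified timestamp it contained, then
# report the lag against an assumed 15-minute SLA.
from datetime import datetime, timedelta, timezone

SLA = timedelta(minutes=15)  # illustrative SLA, not a recommendation

def log_batch_timeliness(batch_id: str,
                         source_modified_times: list[datetime]) -> dict:
    processed_at = datetime.now(timezone.utc)
    oldest = min(source_modified_times)   # inputs assumed timezone-aware
    lag = processed_at - oldest
    return {
        "batch_id": batch_id,
        "processed_at": processed_at.isoformat(),
        "oldest_source_modified_at": oldest.isoformat(),
        "lag_seconds": lag.total_seconds(),
        "sla_violated": lag > SLA,
    }
```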

Summary

When designing a replication pipeline, it’s never as easy as just pulling the data out of a source and delivering it to a target. At the core, users want to be able to trust the data they are using. When the data are used for business decisions, that trust is paramount. As you design replication systems, be sure to keep that in mind and ensure the validity, completeness, chronology, and timeliness of your pipeline.


Nathan Murray

Senior Manager at Pariveda Solutions. Passionate about helping businesses harness the power of their data through modern platforms.