Data Architecture in Distributed Applications

Greg Wilmer
Data Weekly by Jumpmind
3 min read · Feb 14, 2018


JumpMind has helped a lot of companies synchronize data for their distributed applications over the last ten years. As computing devices get smaller and are embedded into more and more everyday connected devices (the Internet of Things), distributed applications will only become more prevalent. Yes, soon our refrigerators will be reminding us of things just like our phones and watches. In the distributed applications we work with, there is always a base set of data on each device that needs to be synchronized, usually centrally, to or from the cloud or other locations. Our SymmetricDS toolset was built for exactly these scenarios and has been synchronizing data in distributed applications for over eight years. Chances are good that when you drove to work this morning, you passed more than a few locations running SymmetricDS.

A key element in synchronizing distributed applications is data architecture design. Here are a few design considerations we run into on a regular basis when helping companies work through synchronizing distributed data in relational databases.

First and foremost, don’t underestimate the need for thorough attention to the synchronization aspects of the application. When designing an application, it’s easy to focus all attention on the functional aspects of the system rather than the underlying architecture. We’ve walked into many scenarios where the synchronization design was an afterthought, and by the time the team began focusing on it, it was very late in the game. Any changes at that point are more difficult, time-consuming, and expensive.

Uniqueness — When an application runs in a single location, uniqueness is pretty straightforward, but multiply one location by 50,000 and it’s definitely something to think through. We often see clients use auto-incrementing integers or sequences as primary keys for their application tables, which is fine when the application runs in a single location. However, if you have auto-incrementing sequences in multiple locations, guaranteeing uniqueness gets more complex. With two locations, you can set the initial sequence values to 1 and 2 and increment by two, so one location uses odd key values and the other uses even ones. Keep adding nodes, though, and this strategy quickly becomes unmanageable. Composite primary keys that include a unique location identifier as part of each key are another option that works well. Last but not least, globally unique identifiers (GUIDs) make excellent primary keys. All of these options have pros and cons that should be worked through as part of the synchronization design.
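To make the last two options concrete, here is a minimal Java sketch of a composite, location-prefixed key and a GUID key. The class and identifier names are illustrative only; they are not part of SymmetricDS or any particular client schema.

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of two key-generation strategies; all names here
// are hypothetical and not tied to SymmetricDS.
public class KeyStrategies {

    // Strategy 1: composite key -- a unique location identifier plus a
    // locally generated sequence. Uniqueness holds as long as every
    // location has its own identifier.
    private final String locationId;              // e.g. "STORE-00417"
    private final AtomicLong localSequence = new AtomicLong();

    public KeyStrategies(String locationId) {
        this.locationId = locationId;
    }

    public String nextCompositeKey() {
        return locationId + "-" + localSequence.incrementAndGet();
    }

    // Strategy 2: a GUID/UUID primary key -- globally unique with no
    // coordination between locations, at the cost of a larger key.
    public String nextGuidKey() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        KeyStrategies keys = new KeyStrategies("STORE-00417");
        System.out.println(keys.nextCompositeKey()); // STORE-00417-1
        System.out.println(keys.nextGuidKey());      // random UUID string
    }
}
```

The trade-offs follow from the shapes of the keys: composite keys stay readable and keep a local sequence, but every child table carries the extra column, while GUIDs need no coordination at all but are larger to store and index.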

Data Criticality — This probably sounds obvious, but it’s remarkable how often it isn’t well thought through. Understanding the application’s data needs, and how critical each kind of data is, is paramount to the design of the synchronization solution. For example, log and usage information are important, but transaction data is critical. Categorizing and classifying the data will go a long way toward designing a real-time, prioritized synchronization scenario.
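As a rough illustration of what such a classification might look like, the sketch below tags tables with a criticality tier and orders them so the most critical data is handled first. The table and tier names are made up for the example and do not reflect any particular SymmetricDS configuration.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical classification of tables by criticality so the most
// important data can be routed to a higher-priority sync path.
public class DataClassification {

    enum Criticality { CRITICAL, IMPORTANT, LOW }

    record TableClass(String tableName, Criticality criticality) {}

    public static void main(String[] args) {
        List<TableClass> tables = List.of(
            new TableClass("sale_transaction", Criticality.CRITICAL),
            new TableClass("usage_metrics",    Criticality.IMPORTANT),
            new TableClass("app_log",          Criticality.LOW));

        // Synchronize critical tables first; lower tiers can trail behind.
        tables.stream()
              .sorted(Comparator.comparing(TableClass::criticality))
              .forEach(t -> System.out.println(
                  t.criticality() + " -> " + t.tableName()));
    }
}
```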

Relational Theory — No surprise here: normalization versus denormalization is critical to any good database design. Bad choices in relational design can also have knock-on effects on the synchronization scenario. We recently had a client that denormalized batch processing status information into core transactional tables. The transactional tables were set to synchronize centrally. When transactions were processed without error, the synchronization worked well. But when transaction processing failed, the denormalized batch processing columns in the core transaction tables were updated repeatedly, once for each processing attempt. The result was the entire transaction set being synchronized again and again, unnecessarily, purely because of the denormalized table design.
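For contrast, here is a hedged sketch of the normalized alternative: the processing status lives in its own narrow table keyed by the transaction id, so each retry updates only that small row instead of the core transaction record. The class and field names are hypothetical, not the client’s actual schema.

```java
// Hypothetical normalized layout: processing status is kept in its own
// table rather than denormalized into the core transaction row.
public class NormalizedDesign {

    // Core transaction row -- written once, synchronized centrally.
    record SaleTransaction(String transactionId, String locationId,
                           double amount) {}

    // Batch processing status -- updated on every retry. Because retries
    // touch only this narrow table, change capture re-sends a small
    // status row rather than the entire transaction set.
    record BatchProcessingStatus(String transactionId, int attemptCount,
                                 String lastStatus) {}

    public static void main(String[] args) {
        SaleTransaction txn =
            new SaleTransaction("STORE-00417-1", "STORE-00417", 42.50);
        BatchProcessingStatus status =
            new BatchProcessingStatus(txn.transactionId(), 3, "RETRYING");
        System.out.println(txn + " / " + status);
    }
}
```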

Like anything else, solid design begins with a good foundation. A good data architecture will ensure you are building your synchronization design on rock rather than sand.
