Are your database tables ready for change data capture replication?
With so many database platforms available today that leverage different technologies such as relational, key value, graph, and document store the need for data replication is becoming increasingly valuable. Whether you are looking to setup a reporting platform, a disaster recovery environment, or simply distributing across locations, the ability to keep these databases in sync is critical.
However most users experience similar pain points in dealing with a legacy database that was originally architected without replication in mind. These structures were originally designed for the sole purpose of persisting data for their application. This was sufficient for the original purpose but struggle when trying to apply a database replication strategy to them.
The main problem, primary keys. Depending on the platform, primary keys are usually designed around an identity or sequence based implementation. This can cause some issues though with a master to master or a client server replication pattern because these key values would be generated on multiple sources and conflict with each other during replication. Again this was a sufficient pattern when replication was not involved but with the next generation of high availability and distributed computing replication of your data is critical.
Here are some common techniques that can be used to ensure your database primary keys are replication ready.
- Use GUID values for primary keys.
A GUID is short for a “Globally Unique ID” which usually consists of a generated string value that contains information about the location of the data and/or time of creation. Different languages have libraries for generating these values already in place. For example Java provides the java.util.UUID.randomUUID() method to generate unique identifiers.
Pros
- Ensure that data created at different sources will remain unique when replicated to the same database.
- For a new database structure this is a simple solution to implement and has been used for many years to create uniqueness in databases
Cons
- In an existing database this might be a difficult solution to change existing data and application code to utilize GUIDs.
2. Use seeds and ranges to ensure each source generates its own unique values
Most database platforms allow the user to control the range and seeding of their identity or sequences. The seed is the value a sequence or identity begins on and the range controls the min and max values it will distribute. Some systems also include and increment which defaults to 1 but can be used in setting up your system for replication.
Example: 3 databases to sync.
Database 1 : Seed 1, Increment 3
Database 2 : Seed 2, Increment 3
Database 3 : Seed 3, Increment 3
Pros
- Can be controlled through the database without affecting application coding
- Simple to setup
Cons
- Using the increment approach will not allow a new database to be added to replication. The increments are based on a fixed number of databases up front.
- Using the range approach you have to be careful to set the ranges large enough for the data or you could run out of unique numbers
3. Use composite primary keys
Composite primary keys include multiple columns in the primary key. So adding something as simple as a location name or store id to an existing primary key would be sufficient. Some replication tools (like SymmetricDS) can even add the additional column value for you as part of replication.
Pros
- Ensures uniqueness without and constraints on number of rows or number of databases in the sync scenario
- One of the easier solutions for an existing database as the existing primary key values do not need to change they are simply adjusted to include an additional column.
Cons
- Changing the database structure might incur some overhead to the applications that utilize it.