The Trouble with Foreign Keys

Eric Long
Data Weekly by Jumpmind
5 min readMar 8, 2018
Foreign keys can be a challenge when loading and syncing data.

Integrity constraints improve the quality and solidity of your data. Would you want a sales order in your database that references a customer who doesn’t exist? Of course not! Foreign keys are an important integrity constraint that prevents misbehaviour from applications and users. However, constraints can pose a problem for data replication. With some careful planning and intelligent features, we can avoid problems and keep our database honest.

Initial Load

The first step in data replication is to populate the target database. This is also the first place we can encounter a problem. I’m using SymmetricDS, which calls this process the “initial load.” It creates the tables, populates data, and it can even translate across different database platforms.

The SymmetricDS Pro wizard for sending an initial load to a target database.

Many solutions will send the tables in alphabetical order, but if I’m using foreign keys, the tables need to be in order by dependency. A smart solution like SymmetricDS saves me some time here by ordering the tables automatically, based on their foreign key constraints!

Self-Referencing Tables

But what if I have a self-referencing table — that is, a table with a foreign key to the same table? Now the rows of a table need to be sent in a specific order. While an automated solution seems possible, I haven’t seen one before. It’s currently a manual configuration in SymmetricDS. Fortunately, there’s not usually many self-referencing tables that need configuration. The “Table Routing” screen lets me adjust the query used in the initial load. For a simple parent-child relationship, I can just order by parents first with “order by parent_id”. (If nulls are ordered first on your database, otherwise use “order by parent_id desc”.)

Simple ordering of rows to satisfy constraints for a self-referencing table.

Hierarchical Self-Referencing Tables

But what if there are multiple levels of children? I was afraid you might ask that. It’s possible to do this using “dead triggers,” which are triggers configured for the same table with its sync flags turned off.

“Dead triggers” are used only for the initial load, not change capture.

Definitely more configuration on this one. You need to create a dead trigger and table router for each level of the hierarchy. The SQL also gets a bit complicated, either joining the table on itself, which isn’t bad for a few levels, or using a so-called “common table expression” (CTE) that performs the recursion for you. The first level is just “parent_id is null”. The second level is:

exists (select 1 from mytable x where x.id = t.parent_id and x.parent_id is null);

And levels after that need to successively join the table in again, like this:

exists (select 1 from mytable x inner join mytable y on y.id = x.parent_id where x.id = t.parent_id and y.parent_id is null);

Here is the CTE version, which uses a recursive join, and you can just plug in the level number you want:

id in (
with children(id, parent_id, level) as (
select id, parent_id, 0 from mytable where parent_id is null
union all
select c.id, c.parent_id, c.level + 1
from children c inner join mytable p where c.id = p.parent_id
)
select id from children where level = 1
)

Not for the faint of heart, right? But at least it’s possible to configure the ordering of rows for a self-referencing table. Detecting and automating this without configuration would be a great feature for a data replication solution.

Cyclical Dependencies

Another sticky issue is cyclical dependencies, either within the schema or within the same table. An extreme example is two rows that depend on each other. How do you even insert these? I’ve seen two approaches:

1 — Insert the rows with the foreign key values set to null, then update the rows to set the values.

2 — Disable the foreign key constraints on the tables, load the data, then re-enable the constraints.

Implementing the first one involves setting up transforms to filter the foreign key values on the first pass and using a dead trigger to send the table rows a second time. I’ll spare you the incantations and leave it as an exercise. The second approach is easier to configure — in SymmetricDS, you can provide a script to run before and after the initial load using parameters “initial.load.before.sql” and “initial.load.after.sql”. Here again, we find it’s possible to configure, but an automatic feature would be great.

Syncing Changes

When syncing changes, some solutions will merge and send changes in no particular order, which of course leads to more foreign key violations. While it can be more efficient to merge changes, solutions like SymmetricDS that preserve the order of rows solves this problem. Another nice trick is to automatically resolve foreign key violations when they happen by sending the missing rows. Why were rows missing in the first place? Good question. For one thing, certain sync configurations make it possible. For example, with multi-master asynchronous replication, it’s possible for a node to go offline after sending a row to some nodes, which then insert referencing rows and try to sync those. With SymmetricDS, it detects the error code for a foreign key violation and notifies the source, which then uses the foreign key relationships to query and send the referenced rows. Self-healing behaviour is a great feature to have.

Ready for the Challenge

Foreign keys help maintain the integrity of your data, but they can make it a challenge to load all data, handle cyclical references, and keep changes in sync. Now you know what some of the issues are and what approaches are available for solving them.

--

--

Eric Long
Data Weekly by Jumpmind

I’m a software developer for JumpMind, a dedicated Linux user, and a technology enthusiast who appreciates the benefits of open source.