Using Mapping Tables to Merge Data with Auto-Number Keys Referenced by Other Tables
Introduction
The problem and solution can be demonstrated with the following picture:
We have two instances of databases with similar structures that were not designed to be combined. The tables use auto-number primary keys, which are also used as foreign keys — a common model that works fine until the data needs to be merged.
Now, we want to combine all this data into a target database. The goal is to retain both the original target parents with their children, plus the source parents with their children. Since we use an auto-number primary key, each source row in the target DB will get new IDs (e.g., 3 and 4), and these new values must be used for corresponding child rows.
We’ll discuss a solution based on source-ID-to-target-ID mapping tables. This approach was recently implemented in the ID-Remap Merge feature of KS DB Merge Tools. We’ll also briefly cover alternative solutions. The implementation details will focus on several popular relational DBMSs — Oracle, MySQL, SQL Server, PostgreSQL, and SQLite — but the approach can generally be applied to any relational DBMS.
Other Solutions
It’s clear that auto-number primary and foreign keys are not suitable for such operations. A more natural solution would be to use GUIDs, which are unique across all databases and can be copied between databases as-is, preserving referential integrity and avoiding conflicts with existing data:
However, this solution is not suitable for our task because we can’t change the database design. It may be a legacy database, or there may be other reasons why we can’t modify the structure of the target database.
Another approach requiring a redesign would be to change the key structure by adding a DatabaseID
value and changing keys to be composite keys consisting of that DatabaseID
and an auto-number:
We also don’t consider some tricky ways, like to import source records with negative IDs:
Or shifting ID values to some number space not used by target IDs:
These methods can work under certain circumstances, but they are generally less applicable. Negative IDs can be used only once, and shifting IDs creates gaps in ID values, reducing the range of available IDs.
Our Solution
When we insert source parent rows into the target database, we need to keep a mapping of the original parent ID from the source DB to its new value in the target DB. Then, we use this mapping to assign the correct parent ID values for child rows.
The examples below are provided for SQL Server, followed by clarifications for other DBMSs.
First, let’s create a mapping table in the target DB:
CREATE TABLE dbo.MapParent (
SourceID int primary key,
TargetID int not null);
Next, we loop through source Parent rows and generate INSERT
statements for the target DB, keeping new parent ID values in the mapping:
INSERT INTO dbo.Parent (Name)
OUTPUT
1, -- source ParentID
INSERTED.ParentID
INTO dbo.MapParent
VALUES ('Source parent #1');
Result of this statement in the target DB will be a new Parent table row with new auto-number ID and new row in the dbo.MapParent table:
The OUTPUT
statement to extract the inserted ID value is specific to SQL Server. For other DBMSs, we can use the following methods:
- Oracle:
DECLARE target_id NUMBER(19);
BEGIN
INSERT INTO "Parent" ("Name")
VALUES ('Source parent #1')
RETURNING "ParentID" INTO target_id;
INSERT INTO "MapParent" VALUES (1, target_id);
COMMIT;
END;
- MySQL / MariaDB:
INSERT INTO `Parent` (`Name`)
VALUES ('Source parent #1');
INSERT INTO `MapParent`
VALUES (2, LAST_INSERT_ID());
- PostgreSQL:
INSERT INTO "Parent" ("Name")
VALUES ('Source parent #1');
INSERT INTO "MapParent"
VALUES (2, lastval());
- SQLite:
INSERT INTO "Parent" ("Name")
VALUES ('Source parent #1');
INSERT INTO "MapParent"
VALUES (2, last_insert_rowid());
Going Further
Since we use auto-numbers for primary keys in our database, our Child
primary key is most likely also an auto-number. But what if our Child
primary key is used as a foreign key for another table, say GrandChild
? We need to use the same approach: create a MapChild
table with SourceID
and TargetID
columns to keep the mapping for child rows and populate it during INSERT INTO Child
.
In real-world applications, we may have complex objects at the application level that are represented in the database as many rows in many related tables linked by foreign keys. In our Parent-Child-GrandChild example, we must ensure that we merge a parent row with all its children, and all sub-children of each child, and so on. We can obtain this information from the database metadata:
- SQL Server:
sys.foreign_keys
andsys.foreign_key_columns
- Oracle:
SYS.ALL_CONSTRAINTS
andSYS.ALL_CONS_COLUMNS
- MySQL/MariaDB:
INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS
andINFORMATION_SCHEMA.KEY_COLUMN_USAGE
- PostgreSQL:
INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS
andINFORMATION_SCHEMA.KEY_COLUMN_USAGE
views or underlyingpg_constraint
and other system catalogs - SQLite:
PRAGMA foreign_key_list(table-name)
can be used to get parent relations; child relations can be retrieved based on parent relations for all tables, and all tables can be taken from thesqlite_master
table.
General Algorithm for Merging a Parent with All Levels of Children
- Set the “current” table to the source parent that we want to merge with all children. Later, the current table will be replaced with child tables, but the steps for the current table will not change; this is a recursive algorithm.
- Create a mapping table for the current table.
- Read the required rows from the source DB for the current table. If the current table is a parent table, “required” means the rows needed for the business goal (e.g., importing a specific parent or parents matching certain criteria). Otherwise, if the current table is not a parent table, we deal with its children and need to process only those children related to our parent. This requires temporary storage of a list of IDs for each processed table. For example, if
ParentID
2 is referenced byChildIDs
3 and 4, then when processing theGrandChild
table, we only need its rows withChildID
3 and 4. - For each row, generate
INSERT
statements, populating the target table and mapping table from step 2. If the current table is one of the child tables, then foreign key values referencing previously processed parents must be replaced with aSELECT TargetID FROM Mapping
subquery. - Find all direct child tables (those with foreign keys to the current table).
- For each table from step 5, change the current table to that child and repeat all steps starting from step 2.
As a result, we will generate a SQL script that creates mapping tables and inserts all required source data from the parent and all its children. Executing such a script will give us the expected result shown in Fig. 1. If necessary, we can add DROP TABLE
statements for the mapping tables because they are no longer needed for INSERTs
. However, these tables can be useful for further troubleshooting, they can be used to identify rows merged from the source database. They can also be used to divide a complex merge into parts, for example, merging reference data first, verifying the merge result, and then proceeding with the rest of the data using the mapping tables created for the reference data.
Caveats
Circular References: Unfortunately, foreign key dependencies may form cycles. For example, our schema could evolve as follows (pseudo-SQL without data types, to demonstrate relations between tables and the historical order of schema changes):
CREATE TABLE Parent (ParentID PRIMARY KEY, Name);
CREATE TABLE Child (ChildID PRIMARY KEY, ParentID NOT NULL REFERENCES Parent(ParentID), Name);
ALTER TABLE Parent ADD COLUMN FavoriteNephewChildID NULL REFERENCES Child(ChildID);
The word “Nephew” in the column name is intentional to emphasize that this ID should belong to the child of another parent. Our algorithm can’t be used here because when we process the Parent
table, the Child
mapping does not exist yet. And that's not just a mistake in the algorithm. Logically, it may happen that we need to merge a parent referencing a child that is not yet merged, and vice versa. The solution is to split the Parent
table merge into two steps:
- First, merge
Parent
rows withoutFavoriteNephewChildID
(set it toNULL
) and generate a mapping forParent
. - Next, merge
Child
rows using theParent
mapping and generate a mapping forChild
. - Finally, update
Parent.FavoriteNephewChildID
using theParent
andChild
mappings.
In this example, we have two tables referencing each other, but in general, there could be longer cycles, such as T2
referencing T1
, T3
referencing T2
, T4
referencing T3
, and finally T1
referencing T4
. The solution will be the same: merge one of the nullable foreign key columns as NULL
first and then update it in an additional step.
Other Considerations:
- Triggers: Processed tables in the target database may have
INSERT
triggers that change related data in other tables. We need to decide whether this is desired. If not (e.g., we want to import these related tables separately), additional actions will be required, depending on the DBMS, which may or may not support trigger disabling. - Calculated Columns: Calculated columns need to be excluded from
INSERT
statements generated in step 4 of our algorithm. - Changed Columns: It may happen that the source and target databases use slightly different schemas (e.g., different versions of applications), and the processed target tables are missing columns present in the source. In this case, we need to generate
INSERTs
only for common columns. Of course, if the source table is missing aNOT NULL
column present in the target, the entire merge operation may not be possible.
Conclusion
Combining databases with auto-number primary keys presents a number of challenges. By using the recursive algorithm discussed in this article, you can confidently merge parent-child relationships and handle more complex data hierarchies without losing integrity or causing conflicts.