Merging tables is hard

Dean Allemang
7 min readJan 10, 2023

Way back in 1978, William Kent, writing in his book, “Data and Reality”, which is now considered a classic, commented on what he referred to as “record technology” — that is, the technology of representing data in stable record formats. He was writing before the introduction of spreadsheets into the general marketplace; so he didn’t refer to this sort of representation in the way we usually do nowadays, as “tables”. But here is what he had to say:

Record technology is such an ingrained habit of thought that most of us fail to see the limitations it forces on us. ¹

— William Kent

Even though graph databases existed in those days, they didn’t have the sort of hegemony that tables did (and still do). This means that if you want to talk about merging data from multiple sources, you might as well expect both data sources to be tables. So just how difficult is it to merge tables?

A collection of different tables arranged haphazardly.

Let’s take a couple of simple tables as an example. These are intentionally very simple, so that they can fit into a blog like this, but they show off what is going on.

Here’s a table with some people’s names and addresses in it, as well as a company loyalty level — maybe this table’s job is to list the loyalty level (silver, gold, platinum) of regular customers:

Names, addresses and loyalty levels

Here’s another table with just addresses. Maybe this is a contact list.

Names and addresses

These two tables share a lot of coverage for the data they represent, but they don’t actually share any column headers except “city” and “state” (assuming we ignore case). How might we combine this data together, as a table? The most obvious (and naïve) way to do this would be to just slam both tables into one, with all the columns and rows of the original two:

Both tables slammed together

Not only is this a naïve way to bring two tables together, it also isn’t very useful. You really don’t have any benefit over having the original two tables as they were.

How might we do any better? Just looking at the data, we might expect that Bill Kent and William Kent are actually the same person, and that Addr1 is actually the same feature as Address1, and that zip is the same as Postcode, etc. Figuring out that two rows from different tables actually refer to the same thing, or that two columns actually refer to the same thing, is a difficult and subjective task. I want to borrow a concept from Computing Theory called an oracle machine (not to be confused with the famous database company of the same name). The wikipedia article for oracle machine is a bit technical; you can think of an oracle machine as a subroutine that solves some very particular problem. Its use in computing theory is to compare the difficulty of two different problems. It helps us investigate questions of the form, “if I had a solution to this problem, how much more work would I have to do to solve this other problem?” That’s how I’ll use it here.

Suppose we had a solution to the identity problem. That is, suppose we had a subroutine that could tell us when two things are the same. To be specific, it can tell us when two rows (from same or different tables) refer to the same thing; and whether two columns (from same or different tables) refer to the same thing. Usually, in computational theory, oracle machines perform tasks that we don’t expect any real machine to be able to do. In this case, there are services that do a pretty good job at figuring out identities of this sort, so our oracle machine isn’t as farfetched as the theoretical ones.

Let’s imagine that the oracle has told us that indeed the first row of each table refers to the same person (i.e., Bill Kent and William Kent are the same person). Suppose further that it tells us that the fields Addr1, Addr2, City, State, and Zip are the same as Address1, Address2, City, State and Postcode respectively. What can we do to combine the tables now?

We’d like for there to be a single row for Bill, so a quick improvement we could make is to combine those rows into one:

Combined tables, taking advantage of the oracle’s knowledge that Bill is the same as William

But we still haven’t used the oracle’s knowledge that Address1 is the same as Addr1 etc. That gets a bit trickier in a tabular form, since now Bill has two street addresses etc. I guess we’ll have to cram two values into a single cell. There are lots of ways to do that, we’ll use a slash (“/”):

This table captures all the knowledge the oracle machine has for us, but at quite a steep cost; we had to do something fully non-standard in the merged cells; in some sense, this merged object isn’t really a table anymore. Our table management software won’t help us use this, and we can’t expect any programs we’ve written for either original table to work at all (for example, imagine trying to print a mailing label using this table). And heaven help us, if the content of some cell originally contained a slash (“/”)!

This example shows off another ubiquitous problem with merging data; it’s called the reification problem. We can see the problem in this example with Bill’s address. We know from the original data, that his Stanley St. address is in Catawba, FL, and his Alliance Way address is in Oronoco, OH. But when we look at the merged table, the only way to know that is to count how many slashes are between the front of the cell until the value in question; Catawba is the second value, as is FL. But if we aren’t sure of the process by which these things are filled in and updated (and using a slash in this way isn’t standard, so we’re on our own here), we could mistakenly believe that Bill lives in Catawba, OH or that apt. 603 is in the building at 16b Alliance Way. This might seem like a trivial problem, but that’s just because the example is so small. Imagine you had hundreds of records about Bill (that’s pretty normal in a large enterprise like a bank), and that you have been updating them for a while (you know, like since before the banking merger spree back in 2008); can you be sure that everything is always lined up correctly?

Webster defines “reify” as “give definite content and form to (a concept or idea)”. In data terms, we use the word “reify” to refer to naming something that was only implicit in the original data structure. In this example, both tables included a concept of an “address”, which has several components; the tables show the components (e.g., Addr1, Addr2, City, State, Zip), but not the concept itself (“address”). If we were to reify this concept, we would explicitly (concretely, in Webster’s terms) identify these things as making up an “address”. If we were to do that (somehow; spreadsheets don’t give you a good way to do this, short of making a new row for each one, which is where we started out), then we wouldn’t have this problem; Bill would have two of these “addresses”; one of them in Catawba, FL and the other in Oronoco, OH. There wouldn’t be any way to be confused.

In a relational database schema, there are ways to solve this; make a table for addresses, give each address a primary key, then make a many-many join table to connect those addresses to the people in this table; Bill would get the two addresses shown here. Unlike the solution presented in the images in this blog, that method is pretty standard, so it has a fighting chance of being correctly maintained.

To review what we’ve done so far; even armed with an oracle that can tell us when two things refer to the same thing, managing the implications of those equivalences in tables is a pretty tall order. You have to solve the problem of multiple values, you have to work out a solution to the reification problem, and you have to represent that back in tables. There are solutions in relational database workflow for all of this, but they go far beyond simple table structures.

A collection of different tables arranged haphazardly.

In a subsequent blog entry, I’ll show how this same problem works out with graph structures, and how many of these issues are simplified, or even avoided altogether.

¹William Kent, Data and Reality 3rd Edition. Technics Publications, 2012. 1st edition published, 1978

--

--

Dean Allemang

Mathematician/computer scientist, my passion is sharing data on a massive scale. Author of Semantic Web for the Working Ontologist.