About Record Linkage and the “Golden Record”

Any differences in the information about the same objects or persons in different systems lead to problems in automated processes. This makes digitalization more difficult and causes unwanted effects. Methods of “record linkage” are therefore necessary to find the one “golden record” for an object.

The use of different databases and the import of data from different sources inevitably lead to differences in data management. Even with utilities and interfaces between the systems, typical discrepancies are unavoidable. It is therefore necessary to find and merge unique data records of customers, contracts, products or diagnoses, because only the merging or linking of similar information from different sources into a unique data object makes it possible to identify all connections and handle them. To arrive at this “golden record”, the methods of “record linkage” are required whenever no reference catalogue is available, i.e. when no check against a defined data inventory can be made.

“The solution is always simple, you just have to find it.”
 Aleksandr Solzhenitsyn

The use of different databases in various subject areas, different applications and the enrichment of data from other (external) sources lead to different states of the same data sets in the same context. Typical differences include:

  • missing information
  • incorrect information
  • outdated information
  • different data schemas
    (e.g. different units for the same specification)

The large number of applications and specialist areas makes it almost impossible to find a universal solution for eliminating these differences: a solution that works for one company can lead to unwanted results for another. The “record linkage” approach therefore describes only one possible solution.

Because the systems differ and there is no leading reference database, supervised learning is not an option either, i.e. machine learning that derives probabilities from comparisons against a reference database. Instead, all data records have to be compared with each other in order to find similarities and to rate their agreement.

Before taking this path, however, some preparatory steps are necessary in order to prepare the existing data in the various sources as well as possible so that machines can process it automatically.

Before you start

Each database depends on the work and data processes of the respective department. Depending on the requirements, only selected information is recorded, stored and processed, because not every process needs all information about an object or a person. Furthermore, not every user has the right to view or change all data. There are thus both technical and legal reasons that lead to differences between the data stocks of the systems involved.

To compare and merge different data stocks, the information they contain must, as far as possible, be brought to the “lowest common denominator”. Despite the partly different data stocks, the contained information can and should fulfil five basic requirements in each individual stock. The information should therefore at least be

  • atomic,
  • complete,
  • correct,
  • up to date and
  • redundancy-free

and, if necessary, be corrected accordingly. Each of the five required states can also be considered separately. This, however, is an ideal that can hardly ever be fully achieved in reality.

Atomic

The term atomic (Greek atomos, “indivisible”) refers to data in its smallest (indivisible) unit of information. For example, an address can be broken down into individual pieces of information.

Example: `Pariser Platz 1, 10117 Berlin`.

In this example there are four atomic pieces of information that should be stored separately. The individual fields are:

Street (`Pariser Platz`); House number (`1`); Postcode (`10117`) and City (`Berlin`)

The same is possible with information about other objects. Information on persons, articles, diagnoses, documents, etc. can also be broken down into single pieces of information. Only this enables an unambiguous comparison with data from another data store, the checking of the available values against an “expectation” (“value set”) and the precise selection by certain criteria.

In the course of record linkage, individual values can thus be compared with each other exactly (deterministically).
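As an illustration, the following minimal sketch splits the example address into its atomic fields; the regular expression and field names are assumptions for this one address format, not a general address parser.

```python
import re

# Naive pattern for addresses of the form "<street> <number>, <postcode> <city>".
# This is only a sketch for the example above, not a general address parser.
ADDRESS_PATTERN = re.compile(
    r"^(?P<street>.+?)\s+(?P<house_number>\d+\w?),\s*(?P<postcode>\d{5})\s+(?P<city>.+)$"
)

def split_address(address: str) -> dict:
    """Break an address string into its atomic fields."""
    match = ADDRESS_PATTERN.match(address.strip())
    if match is None:
        raise ValueError(f"Unexpected address format: {address!r}")
    return match.groupdict()

print(split_address("Pariser Platz 1, 10117 Berlin"))
# {'street': 'Pariser Platz', 'house_number': '1', 'postcode': '10117', 'city': 'Berlin'}
```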

Completely

The term “completeness” refers not only to the specialist information necessary for a data set, i.e. the description of an object, a person, a product or the like, but also to each individual piece of information. For example, the following values are correct postal abbreviations, but from a data point of view they are incomplete:

- Adm.-Nelson-Str.
- Adm.Nelson-Street
- Admiral-Nelson-Str.

Only as Admiral-Nelson-Street is the value fully written out and comparable with other values.
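A minimal sketch of how such abbreviations could be expanded with a simple replacement table; the table entries and the function name are assumptions for this example, a real system would maintain a much larger, domain-specific list.

```python
# Hypothetical expansion table for the example above.
ABBREVIATIONS = {
    "Adm.": "Admiral",
    "Str.": "Street",
}

def expand_abbreviations(value: str) -> str:
    """Replace known abbreviations so that values become comparable."""
    for short, full in ABBREVIATIONS.items():
        value = value.replace(short, full)
    return value

for raw in ("Adm.-Nelson-Str.", "Adm.Nelson-Street", "Admiral-Nelson-Str."):
    print(expand_abbreviations(raw))
# Admiral-Nelson-Street
# AdmiralNelson-Street   (the missing separator would still have to be fixed)
# Admiral-Nelson-Street
```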

Correct

Different data stocks can mean that not all of the information they contain is correct. For example, it must be clarified whether all address details of a person are correct. The same applies to information about objects, machines, states of health, etc.: it has to be checked whether the stored information contains errors. This can be done by comparison against a reference database, e.g. against official address directories or against allowed and expected value ranges (“constraints” and “value sets”).

In the example above, the correction results in a different address:

- Admiral Nelson Pub, 49 Nelson Street

In this case, the correct spelling not only corrects the address, but also places the title “Admiral” in the right context.

Other data can serve as a further example, for instance the following value:

- Body temperature: 101°C

On the one hand, the question of context arises again (does “body” refer to a medical or a mechanical context?); on the other hand, the question of whether the value lies within an allowed range. In the medical context, the value was probably measured in Fahrenheit, although the unit was recorded as Celsius.

An invalid combination of value and unit results in a faulty context. Even NASA made this mistake with the Mars Climate Orbiter in 1999, when navigation data was delivered in imperial instead of metric units (http://edition.cnn.com/TECH/space/9909/30/mars.metric.02/).

It is therefore important that compliance with expected values is checked at every interface (machine as well as human).
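A minimal sketch of such a check against an expected value range, with a unit conversion as a possible correction; the range limits and the function interface are assumptions for this example, not medical reference values.

```python
# Plausible range for a human body temperature in degrees Celsius;
# the exact limits are illustrative assumptions.
BODY_TEMP_CELSIUS_RANGE = (30.0, 45.0)

def check_body_temperature(value: float, unit: str = "C") -> float:
    """Return the temperature in Celsius, converting and validating the value."""
    if unit == "F":
        value = (value - 32.0) * 5.0 / 9.0
    low, high = BODY_TEMP_CELSIUS_RANGE
    if not low <= value <= high:
        raise ValueError(f"{value:.1f} degrees C is outside the expected range")
    return value

# 101 interpreted as Fahrenheit is a plausible fever; as Celsius it is rejected.
print(round(check_body_temperature(101, unit="F"), 1))  # 38.3
try:
    check_body_temperature(101, unit="C")
except ValueError as error:
    print(error)  # 101.0 degrees C is outside the expected range
```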

Up to date

The condition of objects and persons changes over time. Even if the above points have been fulfilled, it is necessary to check to what extent stored information may change within a given period of time.

For example, a person moves on average once every seven years, and moves are more frequent between the ages of 20 and 40 than between 50 and 70. A permanent check of the current address is therefore not necessary, but it can be carried out more often for younger people than for older ones. Sensor data from machines, by contrast, can change very frequently and is therefore automatically subject to constant monitoring. Up-to-dateness is thus just as necessary for comparing data correctly as the requirements mentioned above.

Ideally, therefore, every change to the data is recorded together with a time stamp, so that the change history of each value can be traced.
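A minimal sketch of such a change history, assuming a simple in-memory data structure: every new value is stored together with a time stamp instead of overwriting the previous one.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AttributeHistory:
    """Stores every value of an attribute together with the time of the change."""
    changes: list[tuple[datetime, str]] = field(default_factory=list)

    def update(self, value: str) -> None:
        self.changes.append((datetime.now(timezone.utc), value))

    def current(self) -> str:
        return self.changes[-1][1]

address = AttributeHistory()
address.update("Pariser Platz 1, 10117 Berlin")
address.update("49 Nelson Street")
print(address.current())     # latest value
print(len(address.changes))  # full history remains traceable
```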

Redundancy-free

Freedom from redundancy means reducing the data without losing information. This requirement comes from database modelling and is part of the so-called “normalisation”. It means that of two entries with the same statement, one can be eliminated without loss of information. For example, the city can be omitted in several data records that store the combination of postal code and city, because the postcode makes it possible to determine the location uniquely. If, conversely, the postal code were omitted, several postal codes could be assigned to the same city.
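A minimal sketch of how the redundant city column could be derived from the postcode via a lookup table; the directory entries are assumptions, in practice an official reference source would be used.

```python
# Hypothetical postcode directory for illustration only.
POSTCODE_TO_CITY = {
    "10117": "Berlin",
    "80331": "München",
}

def city_for_postcode(postcode: str) -> str:
    """Derive the city from the postcode instead of storing it redundantly."""
    return POSTCODE_TO_CITY[postcode]

record = {"street": "Pariser Platz", "house_number": "1", "postcode": "10117"}
print(city_for_postcode(record["postcode"]))  # Berlin
```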

For record linkage, such a leaner view of each individual data record leads to clearer and simpler comparisons.

Record Linkage

Record linkage solves the problem of finding records that refer to the same facts (object, person, contract, …) and linking them or combining them into a common record (“Golden Record”). This makes it possible to identify unique objects, avoid redundancies and update different data sets correctly. In addition to meeting the legal requirements (DSGVO/GDPR), this results in a common, uniform data set for the company that can be used by different departments.

In addition, data analyses in an overall context and a holistic view become possible for a company. This can be extended with machine learning in order to evaluate or predict developments (“rating”).

Once the previous steps have been completed, two approaches are available for record linkage: “deterministic” and “probabilistic”.

The deterministic approach

Simply put, determinism describes a predictable result based on related, predetermined factors (causality). The deterministic approach is used to determine whether or not a pair of records matches in a defined set of identifiers (field contents) by means of exactly matching contents. The match is evaluated as an “all or nothing” result.

Example: Given are four records with first and last name:

  • Record 1: Kerstin Meyer
  • Record 2: Kerstin Mayer
  • Record 3: Kirsten Meyer
  • Record 4: Kirsten Mayer

From a deterministic point of view, there are only pairwise similarities in first names and in surnames: records 1 and 2, as well as 3 and 4, match in the first name; records 1 and 3, as well as 2 and 4, match in the last name. No pair is a 100% match, so with the deterministic approach no link is formed.

A check for agreement can therefore only be carried out over several steps. Comparing the records field by field gives each record two candidate links that can be rated as a 50% match (with two of the three other records exactly one of the two fields matches while the other does not; with the third record neither field matches), as the sketch below illustrates.
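A minimal sketch of this field-by-field comparison for the four example records; the record structure and field names are assumptions, the all-or-nothing rule follows the description above.

```python
from itertools import combinations

# The four example records, reduced to the two compared fields.
records = [
    {"id": 1, "first_name": "Kerstin", "last_name": "Meyer"},
    {"id": 2, "first_name": "Kerstin", "last_name": "Mayer"},
    {"id": 3, "first_name": "Kirsten", "last_name": "Meyer"},
    {"id": 4, "first_name": "Kirsten", "last_name": "Mayer"},
]

FIELDS = ("first_name", "last_name")

def deterministic_match(a: dict, b: dict) -> float:
    """Fraction of exactly matching fields; a link requires 1.0 (all or nothing)."""
    hits = sum(a[f] == b[f] for f in FIELDS)
    return hits / len(FIELDS)

for a, b in combinations(records, 2):
    score = deterministic_match(a, b)
    print(a["id"], b["id"], score, "link" if score == 1.0 else "no link")
# No pair reaches 1.0, so the deterministic approach forms no links.
```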

The deterministic approach thus ignores the fact that certain values are more prone to error than others. To take this into account, a probabilistic approach (a probability statement) must be used.

The probabilistic approach

As early as 1969, the two mathematicians Ivan P. Fellegi and Alan B. Sunter distinguished in their mathematical model “A Theory for Record Linkage” (https://courses.cs.washington.edu/courses/cse590q/04au/papers/Felligi69.pdf) between matches, possible matches and non-matches on the basis of probabilities. The model is based on the calculation of linkage scores and the application of decision rules. It is also taken up and explained by the Federal Statistical Office in Germany (DeStatis) in the summary “Automatisierte Zusammenführung von Daten — Das Modell von Fellegi und Sunter” (“Automated merging of data: the model of Fellegi and Sunter”) by Dr. Josef Schürle (https://www.destatis.de/DE/Publikationen/WirtschaftStatistik/Gastbeitraege/ZusammenfuehrungDaten42005.pdf?__blob=publicationFile).
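A minimal sketch of the core idea under assumed m- and u-probabilities: for each field, agreement contributes log2(m/u) to the linkage weight and disagreement contributes log2((1-m)/(1-u)); the total weight is then compared against two decision thresholds. The probabilities and thresholds below are illustrative assumptions, not values from the paper.

```python
from math import log2

# Illustrative, assumed probabilities per field:
#   m = probability that the field agrees for a true match
#   u = probability that the field agrees for a non-match (by chance)
M_U = {
    "first_name": (0.95, 0.02),
    "last_name":  (0.95, 0.05),
}
UPPER, LOWER = 4.0, 0.0  # assumed decision thresholds

def linkage_weight(a: dict, b: dict) -> float:
    weight = 0.0
    for field, (m, u) in M_U.items():
        if a[field] == b[field]:
            weight += log2(m / u)
        else:
            weight += log2((1 - m) / (1 - u))
    return weight

def decide(weight: float) -> str:
    if weight >= UPPER:
        return "match"
    if weight <= LOWER:
        return "non-match"
    return "possible match"

pair = ({"first_name": "Kerstin", "last_name": "Meyer"},
        {"first_name": "Kerstin", "last_name": "Mayer"})
w = linkage_weight(*pair)
print(round(w, 2), decide(w))  # agreement on one field only -> possible match
```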

The probabilistic approach thus evaluates the agreement as probabilities when comparing the field contents. The above example can therefore be taken up again.

  • Record 1: Kerstin Meyer
  • Record 2: Kerstin Mayer
  • Record 3: Kirsten Meyer
  • Record 4: Kirsten Mayer

As can be seen here, swapping the i and the e turns the first name Kerstin into a different first name. This can happen, for example, through transposed letters during manual input. Such cases must therefore be found so that the deviation can be rated. The same applies to the surname “Meyer” in all its spellings with the same phonetic pronunciation.

In order to find values that deviate yet are almost identical, new comparison values can be formed from the existing information, for example on the basis of phonetic and other algorithms.

Example: The four records are given with first and last names plus additional fields for phonetic codes and “normalized” values, in order to find transposed letters:

| Record   | First name | Last name | Phonetic (first) | Phonetic (last) | Normalized (first) | Normalized (last) |
|----------|------------|-----------|------------------|-----------------|--------------------|-------------------|
| Record 1 | Kerstin    | Meyer     | CARSTA           | MAYAR           | kneirst            | mreey             |
| Record 2 | Kerstin    | Mayer     | CARSTA           | MAYAR           | kneirst            | mraey             |
| Record 3 | Kirsten    | Meyer     | CARSTA           | MAYAR           | kneirst            | mreey             |
| Record 4 | Kirsten    | Mayer     | CARSTA           | MAYAR           | kneirst            | mraey             |

The additional columns supplement the real values with phonetic codes according to the NYSIIS standard. In addition, normalized values are added, formed with an algorithm inspired by Graham Rawlinson's work “The Significance of Letter Position in Word Recognition” from 1976.

The thesis of Rawlinson’s work is that words are recognized as “patterns” when read, not letter by letter. A text can therefore still be read even if the letters between the first and the last letter of a word are swapped (“If tihs txet is radebale, tehn the theiss is coercrt.”).

The algorithm for the normalized values works on the lower-case value, keeps the first and the last letter at the beginning and sorts the remaining letters alphabetically. The first names Kerstin and Kirsten therefore both map to the same value kneirst, as shown in the sketch below.
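A minimal sketch of this normalization; the function name is an assumption. The phonetic column could be produced by an NYSIIS implementation, for example the nysiis function of the jellyfish library, which is not shown here.

```python
def normalize(value: str) -> str:
    """Lower-case the value, keep the first and last letter in front,
    and sort the remaining letters alphabetically."""
    value = value.lower()
    if len(value) <= 2:
        return value
    first, last, middle = value[0], value[-1], value[1:-1]
    return first + last + "".join(sorted(middle))

for name in ("Kerstin", "Kirsten", "Meyer", "Mayer"):
    print(name, "->", normalize(name))
# Kerstin -> kneirst, Kirsten -> kneirst, Meyer -> mreey, Mayer -> mraey
```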

After the additional values have been stored, the records can be compared again. Across six fields, agreement is now also found for the phonetic codes and the normalized first name.
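A minimal sketch of the extended comparison over all six fields; the equal weighting of the fields is a simplifying assumption, in practice each field would receive its own weight or probability.

```python
FIELDS = ("first_name", "last_name",
          "phonetic_first", "phonetic_last",
          "norm_first", "norm_last")

def agreement(a: dict, b: dict) -> float:
    """Share of agreeing fields; here all fields are weighted equally."""
    return sum(a[f] == b[f] for f in FIELDS) / len(FIELDS)

record_1 = {"first_name": "Kerstin", "last_name": "Meyer",
            "phonetic_first": "CARSTA", "phonetic_last": "MAYAR",
            "norm_first": "kneirst", "norm_last": "mreey"}
record_3 = {"first_name": "Kirsten", "last_name": "Meyer",
            "phonetic_first": "CARSTA", "phonetic_last": "MAYAR",
            "norm_first": "kneirst", "norm_last": "mreey"}

print(agreement(record_1, record_3))  # 5 of 6 fields agree -> 0.8333...
```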

Because similar values are now found as well, the result of the comparison must be rated with a probability in order to decide whether a link should be set between possibly matching records. This is the manual effort and the real art of record linkage.

It becomes clear that record linkage does not offer a universal solution; it describes one possible solution that has to be adapted and optimized for the respective area of application.