How MyHeritage Turned 25K Printed Directories into Fully Searchable Digital Records
Harnessing machine learning and algorithms to make genealogical research accessible to all
Our team at MyHeritage built a powerful genealogy platform and maintains a huge database of 12.5 billion historical records that allows people to learn about their ancestors. Users can search this database to discover new information about their families and find photographs featuring their relatives. MyHeritage’s Record Matching technology automatically notifies users when historical records match information in their family trees, saving them the need to actively search the archives.
One of the biggest collections in this historical record database is a structured index of the data extracted from U.S. City Directories. In this article I’m going to explain why this project was highly complex, why city directories are important, and how we solved technical problems faced by our engineers.
City directories are books that contain listings of residents, streets, and businesses, and indicate their location in a certain city. They may be arranged alphabetically, geographically, or in other ways. The data from city directories is a goldmine for genealogy research since it allows users to reconstruct the personal history of individuals in their family.
These books were issued by many different publishers and across many different cities throughout the years. What if we could extract the names of all the individuals listed in tens of thousands of books, and then sort or match them into groups representing a certain family? We could find year ranges when a person lived at a particular address, their occupational history, and even infer marriage and death events.
A data index of city directories would provide massive added value to MyHeritage’s Record Matching system, which automatically finds matches between our users’ family trees and historical records. However, converting thousands of city directories into structured data was a complex task.
To structure the data we needed to identify each logical record in the book, infer the surname (not trivial, since when a group of people share the same surname, it is commonly represented by a special ditto character instead of the surname, to save paper), and then break each record into components.
In addition, we had to handle numerous abbreviations, combine line breaks, extract family members, get rid of advertisements and other unwanted data, and make the process resistant to OCR mistakes (OCR processing is described below). Adding extra complexity to this project was the fact that there was no uniform standard for data published in the city directories — each publisher had their own vision of how to represent the data, and sometimes one publisher chose to do it in different ways throughout the years.
The whole project was broken into five major parts:
- Book digitization
- Record extraction
- Data labeling
- Post processing
- Consolidation
Book digitization
The source data for the books arrived on microfilm. An individual microfilm roll is about 25–30 meters long. In total, there were 12,966 rolls of 35 mm microfilm, which produced 12,027,174 digital images. If we stretched out the microfilms and taped them all end to end, the result would be around 325–390 kilometers long.
An additional part of the data arrived on 6,290 microfiche films (sheets of flat films). The microfilm and microfiche were digitized using special equipment which gave us more than 12 million JPEG 2000 images, consuming 160 TB. Almost all of our images are “2-up” as that is how they were initially microfilmed. Images were optimized for maximum OCR quality: 300–400 DPI, 8-bit grayscale, high contrast.
Our next step was to recognize the text in the images using OCR. This was done by our team in Utah on a cluster of OCR servers, which processed all the images and produced data in the ALTO (Analyzed Layout and Text Object) XML schema. As a result, we received a set of XML files with recognized words, grouped into rows and blocks by their coordinates on each page.
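As a rough illustration, here is a minimal sketch of how words and their coordinates can be pulled out of an ALTO file with Python's standard library; the namespace URI and the exact attribute set depend on the ALTO version actually produced by the OCR servers, so treat them as assumptions.

```python
# A minimal sketch of reading recognized words, with their line and block
# coordinates, from an ALTO XML file. The namespace URI below assumes the
# ALTO v2 schema; the production files may use a different version.
import xml.etree.ElementTree as ET

NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}  # assumed schema version

def read_alto_lines(path):
    """Yield (block_id, line_coords, words) for every text line on a page."""
    root = ET.parse(path).getroot()
    for block in root.iterfind(".//alto:TextBlock", NS):
        for line in block.iterfind("alto:TextLine", NS):
            coords = {k: float(line.get(k, 0)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT")}
            words = [s.get("CONTENT", "") for s in line.iterfind("alto:String", NS)]
            yield block.get("ID"), coords, words

for block_id, coords, words in read_alto_lines("page_0001.xml"):
    print(block_id, coords["HPOS"], " ".join(words))
```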
This OCR’d data was stored alongside the images, and we proceeded to the next important step: keying the metadata. One of the challenges we faced was that the books were microfilmed with no clear division; books could start or finish in the middle of a film. We worked together with a third-party vendor to find and tag the front cover and back cover of each book in order to segment the data. Furthermore, the vendor tagged the positions of residential directories in the books, abbreviation tables, and book metadata. For this subproject, we prepared a set of vendor-specific images with reduced dimensions and more aggressive JPEG compression.
Once that work was complete, we had everything split and were ready for the next step, which was triggered automatically once a book was uploaded to a specified S3 bucket.
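The article does not spell out the exact plumbing, but the combination of S3 uploads and the SQS messages mentioned later suggests an event-driven worker. The sketch below, with a hypothetical queue URL and a placeholder handler, shows one common way to wire this up with boto3.

```python
# A hedged sketch of an S3-triggered pipeline step: S3 event notifications land
# in an SQS queue, and a worker polls the queue and starts processing the book.
# The queue URL and the handler are illustrative, not the production values.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/book-uploads"  # hypothetical

def process_book(bucket, key):
    print(f"Starting record extraction for s3://{bucket}/{key}")  # placeholder handler

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        for record in json.loads(msg["Body"]).get("Records", []):
            process_book(record["s3"]["bucket"]["name"], record["s3"]["object"]["key"])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```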
Record extraction
We developed a record extraction algorithm that processed the books sequentially. In general it works as follows: when a book is first processed, the algorithm chooses a specific number of pages from the residential directory to form a learning set, which is then analyzed to compute weights for the subsequent steps. Specifically, it analyzes the OCR output on the learning-set pages: font sizes, block sizes, and the average number of lines in the blocks and columns. The statistical data collected allowed us to reject advertisement blocks and outline the valuable regions.
After computing statistics on the learning set, we processed the whole book. As a result, some of the text lines and blocks were rejected, while the rest were assembled into effective columns.
A single book could contain anywhere from dozens to hundreds of pages. It’s important to find the correct learning set, since the statistical data collected on this set affects all further processing. If the learning set consists of advertisement pages only, processing of the whole book will fail. Our algorithm tries to pick similar pages for the learning set, which is usually not the case for advertisement-rich pages, where font size varies widely.
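To make the idea concrete, here is a simplified sketch of how learning-set statistics could be used to reject advertisement blocks. The data shapes and the tolerance threshold are illustrative assumptions, not the production values.

```python
# A simplified sketch of the learning-set idea: collect line-height statistics
# on a sample of pages, then reject blocks (such as advertisements) whose
# typography deviates too much from the book-wide reference.
from statistics import median

def learn_line_height(learning_pages):
    """learning_pages: list of pages; each page is a list of blocks,
    each block a list of line dicts with a HEIGHT key."""
    heights = [line["HEIGHT"] for page in learning_pages
               for block in page for line in block]
    return median(heights)

def is_directory_block(block, ref_height, tolerance=0.25):
    """Keep blocks whose median line height is close to the reference value."""
    if not block:
        return False
    block_height = median(line["HEIGHT"] for line in block)
    return abs(block_height - ref_height) / ref_height <= tolerance
```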
Once we had rejected the garbage and defined effective columns, we needed to learn the row structure, since it differs from book to book. For further computation it was important to align rotated pages (rotation sometimes happens during digitization). Generally, we didn’t need to rotate whole images; we only wanted to align the coordinates of text rows mathematically. We did this by calculating a linear regression on the row coordinates and rotating the fitted line, together with all its points, back to horizontal.
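Below is a minimal sketch of that de-skewing step under the stated approach: fit a line through a row's baseline points and rotate the coordinates back to horizontal. The function name and data layout are my own illustration.

```python
# A minimal sketch of the de-skewing idea: fit a straight line through the
# baseline coordinates of a text row and rotate the points so the row becomes
# horizontal. Only the coordinates are adjusted; no image is rotated.
import math
import numpy as np

def deskew_row(xs, ys):
    """xs, ys: baseline coordinates of the words in one text row."""
    slope, _intercept = np.polyfit(xs, ys, deg=1)   # linear regression over the row
    angle = math.atan(slope)                        # skew angle in radians
    cos_a, sin_a = math.cos(-angle), math.sin(-angle)
    x0, y0 = xs[0], ys[0]                           # rotate around the first point
    new_xs = [x0 + (x - x0) * cos_a - (y - y0) * sin_a for x, y in zip(xs, ys)]
    new_ys = [y0 + (x - x0) * sin_a + (y - y0) * cos_a for x, y in zip(xs, ys)]
    return new_xs, new_ys
```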
The next step was to clean up ditto indicators (if they were used in a book). Ditto indicators are special markers, used by publishers to save costs, that signal that a record inherits the surname of the previous record. There were several reasons to eliminate them, chiefly the numerous OCR errors on these symbols and the need to make rows from different books more consistent.
We then needed to classify resulting rows by their horizontal indents. There are a few possible indentation configurations, but the standard configuration is shown in the example above.
We classified the rows into primary lines, secondary lines, and line breaks. In some books or columns, secondary lines and/or line breaks may be absent. A primary line is a record that starts with a surname; a secondary line is a record that inherits the previous surname but represents another person; a line break is a continuation of the previous record.
We ran clustering algorithms on rows from the learning set by relative x-coordinate to achieve this.
The calculations performed allowed us to discover the indentation values used in the book and process all the lines while taking into account these values.
Then, knowing the indentation, we could classify the lines throughout the book. After classification, all line breaks were concatenated with the previous lines, and all secondary lines inherited the surnames from the primary lines.
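A hedged sketch of that clustering step is shown below, using scikit-learn's KMeans on the rows' relative x-offsets; the actual implementation and the number of clusters per book may differ (some books have no secondary lines or line breaks at all).

```python
# A hedged sketch of the indentation clustering: group the relative x-offsets of
# learning-set rows into up to three clusters, then map the sorted cluster
# centres to line classes (primary line, secondary line, line break).
import numpy as np
from sklearn.cluster import KMeans

LINE_CLASSES = ["primary", "secondary", "line_break"]

def learn_indents(relative_x, n_classes=3):
    """relative_x: x-offset of each learning-set row from its column's left edge."""
    X = np.asarray(relative_x, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=n_classes, n_init=10, random_state=0).fit(X)
    order = np.argsort(km.cluster_centers_.ravel())   # smallest indent first
    cluster_to_label = {int(c): LINE_CLASSES[rank] for rank, c in enumerate(order)}
    return km, cluster_to_label

def classify_row(x_offset, km, cluster_to_label):
    cluster = int(km.predict(np.asarray([[float(x_offset)]]))[0])
    return cluster_to_label[cluster]
```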
At the final stage of record extraction, we performed OCR cleanup: we searched for and replaced specific known OCR errors, and assigned a unique identifier to each row. The identifiers were used in later steps. The results were stored on S3 and passed to the next module.
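For illustration only, a cleanup pass of this kind might look like the sketch below; the substitution patterns shown are generic OCR confusions, not the actual table used in production.

```python
# A simplified sketch of the OCR cleanup step: apply a table of known
# misrecognition patterns and attach a unique identifier to every row.
# The patterns here are illustrative, not the production substitution table.
import re
import uuid

KNOWN_OCR_FIXES = [
    (re.compile(r"\bW1dow\b"), "Widow"),   # digit 1 misread for the letter i
    (re.compile(r"0(?=[a-z])"), "o"),      # zero misread for a lowercase o
]

def clean_row(text):
    for pattern, replacement in KNOWN_OCR_FIXES:
        text = pattern.sub(replacement, text)
    return {"id": str(uuid.uuid4()), "text": text}
```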
Data labeling
Once we had raw records, we needed to break them into logical components that reflect people’s names, spouses, occupations, workplaces, addresses, and so on.
The first thing we tried was to parse strings with regular expressions. Regular expressions are very helpful for identifying patterns that follow a certain structure. The problem was that we had no clear structure here: thousands of books had different formats, and there was no chance of covering everything with regexes. Perhaps we could process it with a sequence of conditions? We could tokenize strings, apply conditional statements, analyze letter cases, special characters, and so on. In practice, this approach leads to conditional hell in the code. The situation was compounded by the fact that we could never expect clean records; because there are always some OCR errors, real records were often garbled.
It became very inefficient to parse and process such records with a conditional approach.
Was there a better way to extract the data? Yes. One of the best ways to reach this goal is a specific class of machine learning algorithms called Conditional Random Fields (CRFs). Conditional Random Fields can be used to predict any sequence in which multiple variables depend on each other. CRFs are commonly used for labeling or parsing sequential data in natural language processing and biological sequence analysis: POS tagging, shallow parsing, named entity recognition, and gene finding, as well as object recognition and image segmentation in computer vision.
A CRF takes the preceding context in a sequence into account when making a prediction for a data point. This makes it easy to find patterns in sentence structure and even classify corrupted words correctly.
CRF is a supervised machine learning algorithm, requiring a training set to produce its predictions. We decided to build a small training set from each of the books. During this process, human operators trained the models by assigning a label to each word of the training set, such as name, occupation, address, or phone number. We built a system with a convenient interface to label a large portion of data.
The quality of the labeling was fundamental, since it served as the input data for the machine learning algorithm. An error in the labeling could dramatically affect the results of this process. The goal was to build a model for each book that splits records into components with the CRF algorithm as accurately as possible.
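As a sketch of what per-book training can look like, here is a hedged example built on the sklearn-crfsuite package; the feature set, hyperparameters, and data layout are assumptions for illustration, since the article does not describe the production implementation in detail.

```python
# A hedged sketch of per-book CRF training with sklearn-crfsuite. Each record is
# a token sequence, and every token gets a feature dict that includes its
# neighbours, so the model can use surrounding context when assigning labels.
import sklearn_crfsuite

def token_features(tokens, i):
    word = tokens[i]
    feats = {
        "lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(c.isdigit() for c in word),
        "suffix3": word[-3:],
        "position": i,
    }
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    return feats

def record_to_features(tokens):
    return [token_features(tokens, i) for i in range(len(tokens))]

def train_book_model(labeled_records):
    """labeled_records: list of (tokens, labels) pairs produced by the operators."""
    X = [record_to_features(tokens) for tokens, _ in labeled_records]
    y = [labels for _, labels in labeled_records]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X, y)
    return crf
```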
Operators manually trained models for each of the books. For model quality control we built a self-validation system. The idea was simple: take the N records labeled by an operator and split them into T records for a training dataset and V = N - T records for a validation dataset.
A model trained on the training dataset was used to predict the labels of the observations in the validation dataset. Based on this, we calculated an approximate quality score for each model. Combined with human QA, this gave us a good quality score for the trained models.
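A minimal sketch of that self-validation loop, reusing the hypothetical helpers from the training sketch above, might look like this (the split ratio and the per-token accuracy score are illustrative choices):

```python
# A minimal sketch of self-validation: split the N operator-labeled records into
# T training and V = N - T validation records, train on the first part, predict
# on the second, and report per-token accuracy as the approximate model score.
def self_validate(labeled_records, train_fraction=0.8):
    split = int(len(labeled_records) * train_fraction)
    train, validation = labeled_records[:split], labeled_records[split:]

    crf = train_book_model(train)                        # helper from the sketch above
    X_val = [record_to_features(tokens) for tokens, _ in validation]
    y_true = [labels for _, labels in validation]
    y_pred = crf.predict(X_val)

    total = correct = 0
    for true_seq, pred_seq in zip(y_true, y_pred):
        for t, p in zip(true_seq, pred_seq):
            total += 1
            correct += int(t == p)
    return correct / total if total else 0.0
```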
Once the model was trained and validated, the whole book was processed by a CRF module implemented in Python and deployed on an EC2 cluster. After this step, each record was broken into labeled components. The data was stored on S3 and the next pipeline step was triggered via an SQS message, while the operator started work on the next book.
Post processing
Once records were broken into components by the CRF, that was still not enough to build a valuable searchable index. We still needed a sequence of computations performed by a post-processing module, a Java application deployed on an auto-scaling EC2 cluster.
In post-processing we expanded abbreviations (common and book-specific), extracted spouses, inferred the surnames of spouses, determined death records, and more.
Each book has its own abbreviation table that was originally keyed manually.
It was important to expand abbreviations at this step, since abbreviations are context-dependent: the same abbreviation can mean different things in an address and in a name. At this stage we had already deconstructed the records into logical parts, so we knew the context of each part. Many other genealogy-specific transformations and error corrections also occurred during this step, and as a result we had structured records: a list of individuals from a given book with their first names and surnames, the relationships between them, addresses, occupations, and other details.
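To illustrate the context dependence, here is a simplified sketch in which the expansion table is keyed by the component label produced by the CRF; the abbreviations shown are common examples, not the keyed table of any particular book.

```python
# A simplified sketch of context-dependent abbreviation expansion: the same
# token is looked up in a table keyed by the component's label, so it can expand
# differently inside an occupation than inside an address. Tables are examples.
EXPANSIONS = {
    "occupation": {"carp": "carpenter", "lab": "laborer", "slsmn": "salesman"},
    "address":    {"av": "avenue", "bldg": "building", "n": "north"},
}

def expand_component(label, text):
    table = EXPANSIONS.get(label, {})
    words = [table.get(word.rstrip(".").lower(), word) for word in text.split()]
    return " ".join(words)

# expand_component("address", "614 Grant av") -> "614 Grant avenue"
# expand_component("occupation", "carp")      -> "carpenter"
```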
Consolidation
An interesting fact about city directories is that the books were published periodically over many years, and the same person could appear repeatedly in different editions. In other words, if John Smith lived in Atlanta for 20 years, he would probably appear in every Atlanta residential directory issued during that period. Books were often published every two years, which gives us good overlap and redundancy in the data.
This idea led us to implement an algorithm that runs over the range of books from a certain city and combines matched individuals into consolidated records.
The consolidation algorithm is a complex system that produces matches between records from different books and discovers people’s life events. The system needed to process a large amount of data and was designed to scale horizontally using the MapReduce model. It ran as the final step of the pipeline on an Amazon EC2 cluster.
During this process, the system automatically discovered an individual’s genealogical events: hidden at first glance, but revealed by analyzing changes in the record data. For example, it can detect the range of years when a certain person lived at a given address. It can detect approximate marriage events, when a spouse appears for the first time; furthermore, it can infer death events, when a wife becomes a widow in a sequence of consolidated records.
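The sketch below shows the flavor of this inference on a toy record structure of my own devising; the real system matches and scores far richer records across books.

```python
# A hedged sketch of inferring life events from one person's consolidated
# directory entries, sorted by publication year. The record structure here is a
# toy illustration of the idea, not the production data model.
def infer_events(records):
    events = {"first_seen": records[0]["year"], "last_seen": records[-1]["year"]}
    for rec in records:
        if rec.get("spouse") and "marriage_circa" not in events:
            events["marriage_circa"] = rec["year"]     # spouse appears for the first time
        if rec.get("spouse_is_widow"):
            events["death_circa"] = rec["year"]        # spouse listed as a widow
            break
    return events

records = [
    {"year": 1912},
    {"year": 1914, "spouse": "Mary"},
    {"year": 1959, "spouse": "Mary", "spouse_is_widow": True},
]
print(infer_events(records))
# {'first_seen': 1912, 'last_seen': 1959, 'marriage_circa': 1914, 'death_circa': 1959}
```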
In the example above, the algorithm consolidated 31 records from different books published between 1912 and 1959 into a single genealogical record. The system detected that Alfred and Mary Albert married circa 1914. Albert worked as a conductor, carpenter, and motorman during his life and died circa 1959.
Putting It All Together
All the components described above were deployed on AWS infrastructure and formed a pipeline that allowed us to modify algorithms and instantly re-run processing on any part of the data.
Consolidated records, produced in the last step of the pipeline, represent structured searchable genealogy data that is loaded and indexed into MyHeritage SuperSearch™ — a genealogy search engine that consists of 12.5 billion historical records.
Within the described pipeline we were able to process 25,000 public U.S. city directories published between 1860 and 1960. The collection comprises 545 million aggregated records consolidated from 1.3 billion raw records. Its coverage is so broad that almost anyone researching their family history in the U.S. during those years is likely to find an ancestor listed.
If you liked this article, check out another major project we accomplished: extracting 290 million individuals from U.S. Yearbooks.
Related links
U.S. City Directories collection https://www.myheritage.com/research/collection-10705/us-city-directories
City Directories: Much More than Ye Olde Phonebooks https://familytreewebinars.com/download.php?webinar_id=786
Overview of Conditional Random Fields https://medium.com/ml2vec/overview-of-conditional-random-fields-68a2a20fa541