Graphing Through My Ancestors

A few years ago I read a great book called Seven Databases in Seven Weeks, and then I read it again as part of a book club at my current job. One of the databases in the book was particularly neat: a graph database called neo4j. The basic difference between a “graph” database and a “relational” database is that the graph database puts the relationships between the data front and center, rather than the data itself. Both database types have records and relationships, but the graph database has much more efficient ways of traversing relationships, finding common paths, and so on. Under the hood, the real difference comes down to Graph Theory vs. Set Theory, but I’m not going to delve into that because I’m not an expert, and if I were it would take more than this blog post to explain. Graph databases map naturally to things like social networks, network topologies, and, as it turns out, genealogical records. It also turns out that I wrote a website to host my mom’s genealogy way back when, using Perl and MySQL, so I already know a little about the problem area, and I’d been itching to sharpen the saw with some of the latest and greatest in web development land (since I miss it sooooo much). So, solution, meet problem. I will rewrite my mom’s genealogy website using more modern web development tools, and use neo4j as the backing database so I can learn more of the ins and outs of working with a graph database.

First things first

The most common file format for genealogical data is the gedcom format. It’s a plain-text format with nested (recursive) records that link to one another through shared identifiers. This is the format my mom currently exports her database in before she uploads it to be processed by the existing Perl script.

Here’s an example of an INDI (individual) record (my grandpa). I’ve removed some of the more obscure tags to focus on the basics:

0 @I1@ INDI
1 NAME Elisha Kane /Brown/
1 SEX M
1 BIRT
2 DATE 22 APR 1921
2 PLAC Vidette, Fulton, Ar
1 DEAT
2 DATE 14 JAN 1974
2 PLAC Summit, Union, Nj
1 BURI
2 DATE 18 JAN 1974
2 PLAC Somerset, Somerset, Nj
1 BAPL
2 DATE 16 MAY 1957
2 PLAC Omaha, Douglas, Ne
1 ENDL
2 DATE 23 JUN 1958
2 TEMP LOGAN
1 _UID A16BF3F5826FD5118170AA149BD0F8704D8C
1 FAMS @F1@
1 FAMC @F2@

So the basic format of each line is:

2 TEMP LOGAN
$Depth $Tag $Data

Except the first tag you’ll notice is different, perhaps even special:

0 @I1@ INDI
$Depth $Id $Tag

A zero depth means a top-level element, i.e. the beginning of a new record. In this case, we’re starting an INDI record with id @I1@ (I for INDI, 1 for first). The record finishes either when the next top-level element shows up or when we reach the end of the file; until then, all of the data is about this person. You can probably parse a few of the more obvious ones, like SEX = gender, BIRT = birth, PLAC = place, etc. The ones that might throw you off are the relationships.
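
To make that structure concrete, here’s a rough sketch of how you might pull apart a single gedcom line. This is just my illustration of the format; the actual parsing is done by the gedcom parsing library described below:

// Rough illustration only: split one gedcom line into its pieces.
// Top-level lines ("0 @I1@ INDI") carry an @id@ where the tag normally sits.
function parseGedcomLine(line) {
  var parts = line.trim().split(' ');
  var depth = parseInt(parts[0], 10);
  var tag = parts[1];
  var data = parts.slice(2).join(' ');
  if (tag && tag.charAt(0) === '@') {
    return { depth: depth, id: tag.replace(/@/g, ''), name: data };
  }
  return { depth: depth, name: tag, value: data };
}

parseGedcomLine('0 @I1@ INDI');                // { depth: 0, id: 'I1', name: 'INDI' }
parseGedcomLine('2 PLAC Vidette, Fulton, Ar'); // { depth: 2, name: 'PLAC', value: 'Vidette, Fulton, Ar' }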

FAMS is a tag indicating a relationship to a marriage (Family Spouse), and the value is the id of the FAM record @F1@. FAMC is Family Child, as in he is a child in FAM @F2@. The other notable bit in there is _UID. Anything beginning with an underscore is an unofficial extension to the gedcom format. Different programs add different ones, and that was a challenge, but I finally found a pretty comprehensive list and added support for all of them, even though my mom’s database is always generated from the same program and only utilizes a few. You can see a list of all the tags and a more readable name in the source code.
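
For a flavor of what that tag-to-name mapping looks like, here’s a heavily abbreviated sketch. The readable names below are reconstructed from the examples later in this post; the real tag_names table in the source covers every tag:

// Heavily abbreviated sketch of the tag -> readable-name map.
var tag_names = {
  'INDI': 'Individual',
  'FAM':  'Family',
  'BIRT': 'Birth',
  'DATE': 'Date',
  'PLAC': 'Place',
  'EVEN': 'Event',
  'FAMS': 'Spouse in Family',
  'FAMC': 'Child in Family'
  // ...plus many more, including underscore-prefixed extensions like _UID
};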

Generating the database

The site was going to be using node.js anyway (I wanted a single language, plus the chance to try out some ideas I’ve had percolating that node would be a great fit for), so I decided to write the database generation bits in node as well. It turns out node has pretty great tools for doing ETL (Extract/Transform/Load) work from different file formats using the built-in Streams API, and someone already wrote a useful gedcom parser and put it up on npm. I ended up finding and fixing a bug, because the sample data from my mom’s 400,000+ person database exercised the library a bit more than the original author’s test data had. Basically, if a chunk ended right at the end of a line, the parser dropped the newline and, when the next chunk arrived, concatenated the next tag onto the end of the current one’s value. So you’d lose one tag and have one jumbled tag, and depending on which tag, this could screw up whole records. The author graciously accepted my pull-request and helped me figure out some testing issues. So, this library allows me to read in the gedcom file and generate a ReadableStream of gedcom records. Each record has information about what type of record it is (INDI, FAM, etc), and all the child elements (tags and their values) associated with the record.
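
I won’t reproduce the actual fix here, but the general shape of that kind of chunk-boundary fix in a line-based stream parser is to hold back any trailing partial line until the next chunk arrives. Something along these lines (my own sketch, not the gedcom-stream code):

// Sketch of the usual fix for chunk-boundary bugs in line-based stream
// parsers: buffer the trailing partial line between chunks.
var stream = require('stream');
var util = require('util');

function LineParser() {
  stream.Transform.call(this, { objectMode: true });
  this._remainder = '';
}
util.inherits(LineParser, stream.Transform);

LineParser.prototype._transform = function(chunk, encoding, done) {
  var lines = (this._remainder + chunk.toString()).split('\n');
  // The last entry may be a partial line; hold it back until the next chunk.
  this._remainder = lines.pop();
  for (var i = 0; i < lines.length; i++) {
    if (lines[i]) this.push(lines[i]);
  }
  done();
};

LineParser.prototype._flush = function(done) {
  if (this._remainder) this.push(this._remainder);
  done();
};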

Warp speed is too slow?!

My initial attempts to do this had me hitting the neo4j API for each of those records to insert them into the graph database. The problem with this naive approach is that it’s ridiculously slow, at least using the node.js neo4j client. After some minutes, the program would crash because the buffer of incoming records would overflow while waiting for neo4j to finish inserting the records ahead of them. I toyed around with managing multiple connections and doing more parallel inserts, but I doubt that would have solved the problem. Since one-at-a-time inserts are slow in most databases, many of them ship bulk import tools that are much, much faster at ingesting large numbers of records in one go. Sure enough, neo4j has one. neo4j-import will import from any number of CSV files and generate the records much more quickly than otherwise possible (seriously, it went from crashing after 30 minutes to completing in 6 seconds). So, now I needed to use an intermediate CSV format for the data. Again I turned to npm to see what was out there before writing my own. I found this neat tool called fast-csv. You just stream objects to it and it generates a header line from the sum of all keys it has seen, plus a line per record, with the values in the same order as the headers. Because it can alter the header row as it receives new data, the performance isn’t quite what I had hoped for, especially from a tool that boasts about its speed in its very name. This part of the process actually accounts for about 75% of the total time (3 minutes of the approximately 4 minutes total), and it’s the one area I might spend some time trying to optimize when I have some free time (ha!).
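
The basic usage is about as simple as it sounds: create a formatter stream, pipe it at a file, and write plain objects to it. A minimal sketch (this uses the format() API from recent fast-csv releases, which older versions spell a bit differently; the file path and column set here are just illustrative):

// Minimal sketch of streaming objects out through fast-csv.
var fs = require('fs');
var csv = require('fast-csv');

var out = csv.format({ headers: true });
out.pipe(fs.createWriteStream('/tmp/INDI.csv'));

out.write({ 'Gedcom Id:ID': 'I1', 'Name': 'Elisha Kane /Brown/', ':LABEL': 'Individual' });
out.write({ 'Gedcom Id:ID': 'I2', 'Name': 'Jane /Doe/', ':LABEL': 'Individual' });
out.end();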

Show me the codes!

So to handle the conversion from gedcom to CSV, I created a class called CsvWriter that takes a single gedcom-stream as input and outputs multiple fast-csv streams, one for each record type. So, INDI records go into INDI.csv, FAM into FAM.csv, etc. Relationships between records are a special case we’ll go into later. Here’s the basic process for piping the two streams together:

var CsvWriter = require('./lib/csv_writer');
var csvs = new CsvWriter(opts.tmpdir);
var GedcomStream = require('gedcom-stream');
var gedcom = new GedcomStream();
gedcom.pipe(csvs);
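
One detail the snippet above doesn’t show is how the raw .ged file gets into the parser in the first place. With a stream-based parser like this, that’s typically just a file stream piped in, along these lines (opts.file is an illustrative stand-in for the uploaded file’s path):

// Illustrative: feed the raw gedcom file into the parser stream.
var fs = require('fs');
fs.createReadStream(opts.file).pipe(gedcom);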

The ‘tmpdir’ in question is a configurable option giving a temporary storage location for the intermediate CSV files we are going to generate. Since there will be many, we need a folder. Internally the CsvWriter class processes the incoming stream as records and converts them to objects:

CsvWriter.prototype.write = function(record) {
  this.emit('read', record);
  var label = tag_names[record['name']];
  if (! label) {
    this.emit('skip', record);
    return;
  }
  var node_data = this._record_to_node(record, record['id']);
  if (Object.keys(node_data).length > 0) {
    node_data[':LABEL'] = label;
    this._save_node(record['name'], node_data);
  }
};

Since we’re piping the gedcom-stream ReadableStream into this CsvWriter WritableStream, the ‘write’ method is called every time we receive data from the gedcom. Because of how gedcom-stream is written, each call represents one record from the gedcom file. The ‘name’ in the record is what I’ll refer to from here on as the ‘record type’, i.e. FAM, INDI, etc. We skip any unknown types and emit an event so we can react to those if we so choose (we currently just log them; there’s only one in the source database, and it looks like an EOF-marker record or something).

The :LABEL bit is some neo4j magic. Each record in a neo4j database can have a label associated with it, and a label is about the closest thing to a relational database table that you’ll find in neo4j. So, for INDI records the :LABEL is “Individual”, and for FAM it is “Family”. This will aid us in looking up the records later.
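
neo4j-import picks all of this up from special suffixes in the CSV header row: a column ending in :ID becomes the node’s identifier, and the :LABEL column becomes its label. So a generated Individual file ends up looking roughly like this (column set abbreviated, values taken from the sample record above):

Gedcom Id:ID,Birth Date,Birth Place,:LABEL
I1,22 APR 1921,"Vidette, Fulton, Ar",Individual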

The bulk of the conversion from gedcom-stream object to regular JS object that we can pipe to fast-csv is in the ‘_record_to_node’ method:

CsvWriter.prototype._record_to_node = function(record, node_id) {
  var node = {};
  if (record['id']) {
    node['Gedcom Id:ID'] = record['id'];
  }
  if (record['children']) {
    var l = record['children'].length;
    for (var i = 0; i < l; i++) {
      var child = record['children'][i];
      var key = tag_names[child['name']];
      if (! key) {
        this._unused_tags[child['name']] = true;
        continue;
      }
      if (child['value'] !== '' || child['children'].length == 0) {
        if (child['value'].indexOf('@') == 0) {
          // Values like @F1@ are links to other records, so they become
          // relationships rather than attributes on this node.
          if (node_id) {
            var rel_data = {
              ':START_ID': node_id,
              ':END_ID': child['value'].replace(/@/g, ''),
              ':TYPE': key,
            };
            this._save_relationship(child['name'], rel_data);
          }
        }
        else {
          if (transformations[child['name']]) {
            transformations[child['name']].bind(this)(node, key, child['value']);
          }
          else {
            node[key] = child['value'];
          }
        }
      }
      if (child['children'] && child['children'].length > 0) {
        // Recurse into nested tags and prefix their keys with the parent's name.
        var child_obj = this._record_to_node.bind(this)(child, node_id);
        if (key == 'Event') {
          // EVENT is not unique, use its TYPE as the real key name
          key = child_obj['Type'];
          delete child_obj['Type'];
        }
        for (var child_key in child_obj) {
          var composite_key = [key, child_key].join(' ');
          node[composite_key] = child_obj[child_key];
        }
      }
    }
  }
  return node;
};

There’s a lot there and I don’t want to go into all of it, but the basic gist is we descend through all the child elements of the record and convert them to attributes on the main object. So:

1 BIRT
2 DATE 22 APR 1921
2 PLAC Vidette, Fulton, Ar

Becomes record[‘Birth Date’] and record[‘Birth Place’]. The ‘Event’ stuff is another one-off thing in gedcom files that we need to account for with a special case. An EVEN record looks like:

1 EVEN
2 TYPE Initiatory (LDS)
2 DATE 23 JUN 1958
2 PLAC LOGAN

But we don’t want record[‘Event Type’] and record[‘Event Date’] especially because many records have multiple EVEN tags, and the last one would overwrite the ones before because the generated key name would be the same. Instead, in that case, we want the value of the TYPE field to be the key name, so it becomes record[‘Initiatory (LDS) Date’].
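
So, roughly, the EVEN record above ends up contributing keys like these to the node object (my own illustration of the end result):

// Rough illustration of how the EVEN record above lands on the node object.
{
  'Initiatory (LDS) Date': '23 JUN 1958',
  'Initiatory (LDS) Place': 'LOGAN'
}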

There are a few extra bits for recognizing that a value is a link to another record, namely when the value contains an @ symbol. That causes us to write the entry out to a relationship file instead of a node data file. neo4j-import has to know what type of records it’s parsing, and relationship files are basically just:

I1,F1,Spouse in Family
$source, $destination, $type
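
With the neo4j-import header row included (the column names come straight from the rel_data object above), a relationship file like FAMS.csv ends up looking like:

:START_ID,:END_ID,:TYPE
I1,F1,Spouse in Family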

Then we listen for the end of the incoming stream to tell the generated fast-csv streams that they are finished so they can write to disk.

CsvWriter.prototype.end = function() {
  for (var type in this._output_streams) {
    for (var filename in this._output_streams[type]) {
      this._output_streams[type][filename].end();
    }
  }
};

A few other tidbits: neo4j can only import if a) the database isn’t running and b) the data directory is empty. So I had to do some work there to stop the database if it’s running, move the existing data elsewhere (so I can recover on failure), and restart the database after the import has finished. node has decent built-in libraries for managing processes, so this was pretty simple. I’m going to skip going over that code, but a rough sketch of its shape is below, and you can read the real thing if you like.
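
Roughly speaking, the whole dance looks like this. This is a simplified sketch, not the actual code: the paths are illustrative, the neo4j binaries are assumed to be on the PATH, and the flags are the ones the 2.x/3.x-era neo4j-import tool accepts, so check your version’s docs:

// Simplified sketch of the stop -> stash old data -> import -> start dance.
var execSync = require('child_process').execSync;

execSync('neo4j stop');
// Keep the old store around so we can recover if the import fails.
execSync('mv data/graph.db data/graph.db.bak');
execSync([
  'neo4j-import',
  '--into', 'data/graph.db',
  '--nodes', '/tmp/INDI.csv',
  '--nodes', '/tmp/FAM.csv',
  '--relationships', '/tmp/FAMS.csv',
  '--relationships', '/tmp/FAMC.csv'
].join(' '));
execSync('neo4j start');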

Oh no, another bug?

Once I had the whole process in place, I started to test it with my mom’s gedcom file. I fired it up and it looked like it was doing well. It generated the intermediate CSV files and started the neo4j-import process just fine. But it was taking forever! After some time (like an hour), it crashed because it had filled my hard disk. I was a little dumbfounded. How could a 145MB gedcom file, which isn’t an efficient format to begin with, generate a graph database of more than 200GB? So I started debugging, and it turned out I had discovered a (since-fixed) bug in neo4j-import. If one of the CSV files is empty, it causes the import tool to go into an infinite loop, starting over and re-processing the same files over and over and over. Eventually it had processed the INDI and FAM records so many times that the on-disk database took up all the remaining space on the hard disk. I honestly just discovered this was the problem by guessing. The only odd thing I found in the generated files was that some were empty (because they came from a top-level tag with no child tags), so on a hunch I removed those from the file list, and it processed everything in about 6 seconds and generated the database I wanted. I filed a bug report and they fixed it pretty soon thereafter, but in the meantime I added some code to skip records that had no children and no value, which is what was producing the empty files.

So now what?

Now that I have a database, I’m going to need to put a website in front of it. I plan to do a basic pedigree chart view using the three most popular modern frameworks I know about: React, Angular, and Ember. Then I’ll build out the remaining pieces in whichever framework suits me best. I think it would be neat to add a “How are these two people related?” tool this time, and with neo4j at my side, I’m hoping that’s not too onerous to figure out.
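
For instance, the “how are these two people related?” question should be more or less a built-in shortest-path query in Cypher, something along these lines (a sketch; the Individual label and the Gedcom Id property come from the import above, while the ids and the 20-hop cap are illustrative):

// Sketch: find the shortest chain of family relationships between two people.
MATCH (a:Individual { `Gedcom Id`: 'I1' }),
      (b:Individual { `Gedcom Id`: 'I42' }),
      p = shortestPath((a)-[*..20]-(b))
RETURN p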