Cultural heritage metadata is all about structure. MARC, MODS, EAD and the like are highly structured data formats with specialized vocabularies. But the structure often breaks down, especially at sites of aggregation. When multiple institutions contribute data to a single system, it is difficult to unify their practices into something that uses the same structure, let alone consistent vocabularies.
Could newer machine learning techniques like Word2Vec be used not only on the content of resources but the metadata records themselves? If you used millions of metadata records to train a model, could it sort out disparate records into something useful? I wanted to try using DPLA’s 18 million records. The bad news is that it wasn’t magic, but the good news is some interesting things happened.
The process I used, Doc2Vec, is basically Word2Vec but at the document level (read more here). Simply put, it converts each document into a vector that can then be compared to the other documents in the model. The more similar the documents are, based on their embedded words, the closer together their vectors will be. You can then measure the distance between vectors and hopefully find related documents. I’ll get into the details of the process below, but first the results.
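To make "measuring the distance between vectors" concrete, here is a minimal sketch using cosine similarity, a common choice for comparing embedding vectors. The three-dimensional vectors are made-up stand-ins for real document vectors, and nothing here is taken from the scripts used for the experiment.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for document vectors (invented for illustration).
doc_a = [0.9, 0.1, 0.3]
doc_b = [0.8, 0.2, 0.4]    # points in nearly the same direction as doc_a
doc_c = [-0.5, 0.9, -0.2]  # points in a very different direction

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)

# The more similar pair scores closer to 1.
print(sim_ab > sim_ac)
```

A score near 1 means two documents point the same way in the vector space; a score near 0 or below means they have little in common as far as the model is concerned.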
I plotted all 18 million records based on their vectors; this is basically an X/Y scatter plot. In theory, the more similar records should be together. You can click around the visualization to see where the different records come from. I colored each record based on its source, that is, which institution contributed it to DPLA. The distinctive features that jump out are the blue region, which is NARA records, the pinkish center, which is NYPL, and the green offshoot, which is University of Southern California Libraries. The whole thing looks like a blurry hummingbird, which is nice, I guess.
The grouping of these institutions is interesting because I did not include the source/institution name in the training data. That means these records grouped together because the content of their metadata was similar or written in a similar style. For example, if you look at the metadata for the records in the green beak of the hummingbird, they are all very minimal records with titles like “PAGE 243”. The opposite is also true: HathiTrust, which has the most records in the model, has a deep blue color but did not cluster together. Its records were sufficiently different that they dispersed more or less evenly across the graph. This makes sense, since it is also an aggregator and might not have a dominant “style” of metadata.
At a smaller scale, the model did not work as well as hoped. A common way to test a model is to feed the training set back into it and see how it performs. Given a training document, you would expect the model to return that exact same document as its most similar match. But for this model that only happened for 60% of the records. That points to some major problems that we will look at later. But if you would still like to play around with the model you can add this bookmarklet to your browser:
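The self-match test described above can be sketched roughly as follows. This is a toy version with hand-written two-dimensional vectors and a brute-force search; the names `trained` and `reinferred` are hypothetical, and with Gensim the second set would typically come from re-inferring a vector for each training document.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, vectors):
    """Return the ID of the stored vector most similar to the query."""
    return max(vectors, key=lambda doc_id: cosine_similarity(query, vectors[doc_id]))

def self_match_rate(trained, reinferred):
    """Fraction of documents whose re-inferred vector lands on themselves."""
    hits = sum(
        1 for doc_id, vec in reinferred.items()
        if nearest(vec, trained) == doc_id
    )
    return hits / len(reinferred)

# Vectors stored in the model (invented for illustration).
trained = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
# Vectors produced by feeding the same documents back through the model.
reinferred = {"a": [0.9, 0.1], "b": [0.1, 0.9], "c": [0.95, 0.2]}

rate = self_match_rate(trained, reinferred)
print(f"{rate:.0%} of documents matched themselves")
```

In this toy data, document "c" drifts toward "a" when re-inferred, so only two of the three documents find themselves. A low rate on real data suggests the vectors are not stable enough to tell documents apart.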
When you are on a DPLA item page you click it and it shows the nearest neighbors in the model:
While there are problems, it does sometimes surprise you with relevant matches. This approach seems like it could yield better results with some more work.
The rest of this post will be looking at the technical process of this experiment.
The first step is to train the model, for which you need documents. I first tried to use the original source metadata from the DPLA record. This is raw data from the data provider that has not been mapped. I thought that I could possibly get more metadata by using unrefined data. In practice it meant each document in the model had random data in it, whatever the provider happened to include. After some initial tests I switched to just using the mapped metadata for each document, which resulted in 18M documents like this. I used a Node script to accomplish this extraction; you can see at the start of the script which fields are included in each document. What is liberating about this is that there are no fields. It is all just free text, not structured at all.
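The extraction itself was done with a Node script, but the idea of collapsing a structured record into one line of free text can be sketched in Python. The field list and the sample record below are hypothetical and are not the fields the real script used.

```python
def collect_strings(value):
    """Yield every string found in a nested structure of dicts and lists."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for child in value.values():
            yield from collect_strings(child)
    elif isinstance(value, list):
        for child in value:
            yield from collect_strings(child)

def record_to_text(record, fields):
    """Join the text of the selected fields into one whitespace-normalized line."""
    parts = []
    for field in fields:
        parts.extend(collect_strings(record.get(field)))
    return " ".join(" ".join(parts).split())

# Hypothetical field selection; the provider name is deliberately left out.
FIELDS = ["title", "description", "subject", "creator"]

# Invented sample record, loosely shaped like mapped metadata.
record = {
    "title": "Main Street, looking north",
    "description": ["Black and white photograph.", "Street scene"],
    "subject": [{"name": "Streets"}, {"name": "Photographs"}],
    "provider": "Example Library",
}

text = record_to_text(record, FIELDS)
print(text)
```

Leaving the provider out of the field list matters for reading the plot: any clustering by institution then has to come from the metadata text itself.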
The next step is to train the model. I used the Python library Gensim to do this, using this script. One thing to point out is the number of iterations, 20, which I increased from the default of 5, hoping for better results. I also needed to do this on a larger machine. I used an AWS Memory Optimized class box with 255GB of RAM and 32 virtual cores, which took around 12–14 hours to train on the 8.8GB of documents.
The next step would be to put the documents back through the model and find the nearest neighbors for each one. Unfortunately the Gensim method to do that is not very fast at this scale. Comparing 1000 docs against the model took around 5 minutes. At that rate 18M docs would need 18,000 batches of 5 minutes each, roughly 62 days, which was not going to work. So instead I exported the vector for each document (100 dimensions each) using this script.
Now that I had the raw vectors I could use another tool to compare them. I loaded the vectors into Annoy and built the index with 50 trees, using this script. This resulted in an index file that I could then run each document through to get its nearest neighbors relatively quickly. You have to do some juggling because Annoy only allows you to use integers as your index IDs, so you need to keep track of which int maps to which DPLA UUID.
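The integer juggling can be handled with a small two-way lookup. This is only a sketch of one way to do it, not the bookkeeping from the original script; the `IdMap` class and the sample IDs are made up. Annoy's `add_item` takes a non-negative integer and allocates memory up to the largest integer used, so handing out consecutive numbers from zero is the safe choice.

```python
class IdMap:
    """Two-way mapping between string IDs and consecutive integers from 0."""

    def __init__(self):
        self._to_int = {}  # string ID -> integer
        self._to_id = []   # integer (list position) -> string ID

    def add(self, doc_id):
        """Register an ID if it is new and return its integer."""
        if doc_id not in self._to_int:
            self._to_int[doc_id] = len(self._to_id)
            self._to_id.append(doc_id)
        return self._to_int[doc_id]

    def to_int(self, doc_id):
        return self._to_int[doc_id]

    def to_id(self, index):
        return self._to_id[index]

    def __len__(self):
        return len(self._to_id)

# Invented IDs standing in for DPLA UUIDs.
ids = IdMap()
first = ids.add("aaaa1111")
second = ids.add("bbbb2222")
again = ids.add("aaaa1111")  # re-adding returns the existing integer

# An index lookup returns integers; translate them back for display.
neighbor_ints = [1, 0]
neighbor_ids = [ids.to_id(i) for i in neighbor_ints]
print(neighbor_ids)
```

The list inside the map can be written out one ID per line, which gives a file where the line number is the Annoy integer.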
That’s great if you want to make something like the DPLA bookmarklet lookup tool. But if you want to graph your results you need to reduce your model. Right now each document is represented by a 100-dimensional vector. If you want to plot that, you need to reduce it to two dimensions. I did this with PCA using the scikit-learn library. Here is the script that does that.
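For readers who want to see what the reduction does, here is a small sketch of PCA written directly with NumPy's SVD rather than scikit-learn. The data is random and the function name `pca_2d` is hypothetical; it only illustrates the projection and is not the script used here.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their two directions of greatest variance."""
    # PCA works on data centered at the origin.
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Random stand-in for 500 document vectors of 100 dimensions each.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 100))
vectors[:, 0] *= 10  # stretch one dimension so it dominates the variance

points = pca_2d(vectors)
print(points.shape)
```

A single SVD like this holds everything in memory at once, so at 18M rows a batched approach such as scikit-learn's `IncrementalPCA` may be easier to fit on a given machine.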
The last step is taking those X/Y points and plotting them. You could plot a sample, but I wanted to graph all 18M, so I drew them on a big PNG file that I cut into tiles. I drew it with this script.
You can then cut it into tiles with VIPS:
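Before anything can be drawn, the PCA coordinates have to be scaled onto the pixel grid of the image. A minimal sketch of that scaling, with made-up points and image size, might look like this; the actual drawing script may do it differently.

```python
def to_pixels(points, width, height):
    """Scale (x, y) points to integer (column, row) pixel positions."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    # Avoid dividing by zero when all points share a coordinate.
    span_x = (max_x - min_x) or 1.0
    span_y = (max_y - min_y) or 1.0

    pixels = []
    for x, y in points:
        col = round((x - min_x) / span_x * (width - 1))
        # Image rows grow downward, so the y axis is flipped.
        row = round((max_y - y) / span_y * (height - 1))
        pixels.append((col, row))
    return pixels

# Invented points and a small canvas for illustration.
points = [(-2.0, -1.0), (0.0, 0.0), (2.0, 1.0)]
pixels = to_pixels(points, width=101, height=51)
print(pixels)
```

With millions of points many will land on the same pixel, so whichever record is drawn last decides that pixel's color.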
vips dzsave /data/providers.png dpla_vectors_providers
The whole process is fairly expensive. There are probably ways to reduce the necessary RAM for some steps, but you are still looking at needing a machine with 60–120GB of RAM for most of these operations, and more for the model training. Fortunately I’ve been having good luck with Amazon Spot Instances, so it was not too costly to use these machines for a few days.
In general the documents might be too small. A few lines of metadata might not be enough to train on, leading to bad results. But I think it can be improved on. I should have selected a smaller dataset to get started with Word/Doc2Vec, and I would like to return to it once I learn more to see if I’m able to build a better model.