Wouldn’t it be great if there were a tool that could index the PDF files saved on your hard drive, and then allow you to search for files containing specific phrases? Even better if it could identify all the other files topically related to whichever document you happened to be viewing.
Such were my thoughts in 2018, when I returned to the topics of phrase indexing and document relationships. (Perhaps I could develop my own personal PDF indexing and navigation solution?) I’d first thought about such things some two decades earlier, when I was a publishing technologist working on a project to develop a prototype encyclopedia of materials science and technology. Since then I’d downloaded the PDFs of several thousand research articles about subjects that interested me, along with book reviews, interesting news items, obituaries and the like. Increasingly I was finding it difficult to locate specific articles I knew to be somewhere in the collection, or all the articles I had on a particular subject.
As for the problem of finding related articles, it was now essentially solved. Today journal publishers routinely provide ‘see also’ functionality on their websites to connect the readers of an article with related (‘similar’) articles. Sometimes, though, you want to find out whether you can do something yourself. There was another motivation too: my employer, a university press, was still investigating the possibility of implementing ‘see also’ functionality on its own content platform. It seemed like a potentially fun personal project to see whether, or how far, it was possible to replicate what the commercial players in that space (such as UNSILO) were able to achieve, whilst at the same time solving my PDF management problem.
Around 2011–12 I’d learnt Python, greatly assisted by Allen Downey’s concise primer, and had proceeded to write a phrase extraction script to help with a book indexing project. That script had its origins in the Delphi code I’d written in 1998 while working on the encyclopedia prototype. Then I’d found that when the term extraction program was applied to a clutch of sample articles, the frequent terms — the ones that weren’t stop words — generally bore an obvious relationship to an article’s topic. Words and phrases in an article’s title were often extracted, for example, and the extracted terms frequently matched the human-assigned keywords displayed with the article abstract. Armed with Python and my earlier phrase extraction ideas, I was ready to tackle my PDF management problem.
The methodology I developed for identifying pairs of related articles is relatively straightforward, but has several noteworthy features. (For a more detailed description see here.)
The first feature to note is that the method involves extracting not just individual word frequencies but also phrase frequencies. The maximum phrase length can be adjusted, but an upper limit of four words tends to work well.
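The idea can be sketched in a few lines of Python. This is a minimal illustration, not the author’s actual extractor (which would also need to handle punctuation, case and stop words); the function name and the four-word default are mine, following the limit mentioned above.

```python
from collections import Counter

def term_frequencies(text, max_len=4):
    """Count every word and every phrase of up to max_len
    consecutive words in a piece of text."""
    words = text.lower().split()
    counts = Counter()
    for n in range(1, max_len + 1):          # phrase lengths 1..max_len
        for i in range(len(words) - n + 1):  # every starting position
            counts[" ".join(words[i:i + n])] += 1
    return counts

freqs = term_frequencies("materials science and materials science technology")
print(freqs["materials science"])  # → 2
```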
The second important feature is that terms — words and phrases — are assigned individual document-specific significance values computed on the basis of frequency in the document relative to frequency in the corpus. This approach to computing term significance could be called the relative representation level (RRL) method. If we naively assume that terms are distributed uniformly across a corpus of documents, and uniformly within each document, then from the word count of document D (C[D]) and the word count of the corpus (C[corpus]), and knowing the number of times a term occurs in the corpus (N[corpus]), one can calculate the number of times a term T is expected to occur in document D as
N[exp] = N[corpus] ✕ (C[D]/C[corpus])
If term T actually occurs N[act] times in D, then intuitively it makes sense to think of the ratio N[act]/N[exp] as a measure of the degree to which T is over- or under-represented in D relative to its overall representation in the corpus. So we say that the significance (sig) of term T in document D is given by the expression
sig = N[act] / N[exp]
= N[act] / (N[corpus] ✕ (C[D]/C[corpus]))
= (N[act] ✕ C[corpus]) / (N[corpus] ✕ C[D])
An embellishment here is to boost the significance scores of multi-word phrases in proportion to the number of words in the phrase, on the basis that there’s something less probable — ‘more significant’ — about particular words coming together in specific combinations to form phrases than there is about the occurrence of individual words.
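The significance calculation above is simple enough to express directly. The sketch below uses the final form of the equation; multiplying by the phrase length is one straightforward reading of the ‘in proportion to the number of words’ boost (an assumption on my part — the exact boost isn’t specified).

```python
def rrl_significance(n_act, n_corpus, c_doc, c_corpus, phrase_len=1):
    """Relative representation level (RRL) of a term in a document:
    actual occurrences over expected occurrences,
    sig = (N[act] * C[corpus]) / (N[corpus] * C[D]),
    boosted in proportion to the phrase length."""
    sig = (n_act * c_corpus) / (n_corpus * c_doc)
    return sig * phrase_len  # length boost is an assumed form

# A term occurring 5 times in a 2,000-word document, but only
# 50 times across a 1,000,000-word corpus, is over-represented:
print(rrl_significance(n_act=5, n_corpus=50, c_doc=2000, c_corpus=1_000_000))  # → 50.0
```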
An alternative to this expectation-based measure of significance would be to use the well-known TF-IDF (term frequency — inverse document frequency) method, which penalizes terms that occur in many documents and rewards those that occur in few. It was my own research baggage that led me to adopt the RRL method, for in another context altogether, in the late 1980s, I’d been interested in analysing the relative representation levels of the different amino acid triplets in protein molecules (link). A few years later I applied the same way of thinking — but using character trigrams — to the problem of identifying OCR (optical character recognition) errors in scanned text.
Back to term significance though, where a project for the future is to compare the effectiveness of TF-IDF with RRL in a variety of situations. TF-IDF has a peculiar magic to it (a ‘sparck’ of magic, one is tempted to say), and I expect that in many situations it gives results as good as or better than those delivered by any other method. But equally, RRL has strong intuitive appeal, and RRL is what I used.
Once we know what words and phrases our documents contain, and can assign significance scores to them, we can think about the similarity of a pair of documents. One intuition here is that similar documents share many of their terms, while dissimilar documents share few. Another is that the proportion matters too: for each document in the pair, similar documents share a high fraction of their terms, dissimilar documents a small one. One way of satisfying these intuitions — the third feature of the method — is to treat the product of the ratios of shared to unshared terms of the two documents A and B as indicative of their similarity (sim(A,B)):
sim(A,B) = func(SA/UA ✕ SB/UB)
Now SA equals SB, since these are the terms the documents share, so we can say that
sim(A,B) = func(S² / (UA ✕ UB))
where S is the number of terms shared by A and B, UA is the number of terms unique to A and UB is the number of terms unique to B. (‘func’ here means ‘a function of’.)
The final conceptual piece of the puzzle involves taking account of the term significance scores we’ve calculated, assuming that similar articles share not just any old terms but especially their significant terms. The way document pairs are scored for similarity achieves this.
Having extracted the terms from the two documents, we determine their intersection, i.e. identify the terms they share (as alluded to above). Then we compute the average significance of all the intersection terms with respect to document A, and then their average significance with respect to document B. The ‘total significance’ (tot_sig below) is the product of these two average intersection term significance scores. Finally we can compute the ‘significance adjusted similarity’ (SAS) of the documents A and B:
SAS = func(tot_sig) ✕ S² / ((1+UA) ✕ (1+UB))
Taking the square root of tot_sig, multiplying it by 100, and then multiplying the resulting quantity by the intersect-related term, seems to work well. (Possibly if the method were applied to larger corpora — larger than some thousands of documents — it would be necessary to take the log of tot_sig, so as to ‘sit on’ excessively large values, but for my intended application the need has not arisen.) Note the addition of one to the terms in the denominator, to avoid a division-by-zero blow-up if two documents were to share all their terms.
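Putting the pieces together, the whole scoring step can be sketched as follows. Each document is represented here as a dict mapping terms to their RRL significance scores; the function name and the dict representation are my own choices, while the square root, the factor of 100 and the add-one denominators follow the text above.

```python
from math import sqrt

def sas(sig_a, sig_b):
    """Significance-adjusted similarity of two documents, each
    given as a dict mapping terms to RRL significance scores."""
    shared = sig_a.keys() & sig_b.keys()      # the intersection terms
    if not shared:
        return 0.0
    s = len(shared)
    u_a = len(sig_a) - s                      # terms unique to A
    u_b = len(sig_b) - s                      # terms unique to B
    avg_a = sum(sig_a[t] for t in shared) / s # mean significance of shared terms in A
    avg_b = sum(sig_b[t] for t in shared) / s # ... and in B
    tot_sig = avg_a * avg_b
    return 100 * sqrt(tot_sig) * s**2 / ((1 + u_a) * (1 + u_b))

doc_a = {"materials": 2.0, "science": 3.0, "pdf": 1.0}
doc_b = {"materials": 4.0, "science": 1.0, "index": 2.0}
print(sas(doc_a, doc_b))  # → 250.0
```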
The methodology is generally very successful in highlighting meaningfully related pairs of articles. Sometimes it’s wrong-footed by some of the artefacts that are liable to crop up every now and again in text extracted from PDF files. In addition, if you save web pages as PDF files using your browser’s ‘Save as PDF’ print option, the PDFs may include text relating to the identity of the website, and perhaps browser-generated headers and footers. All of these will be indexed, and that may affect the file relationships inferred. Overall, however, I find the results to be eminently usable.
So there we are: a method for computing document similarity. Next step: to integrate it into the PDF navigator, and continue developing that to become the tool I’d first envisaged.
(See the full paper for more information and discussion.)