How I built a (patent) text concept search & analytics app
When ElasticSearch won’t do.
Very recently, I launched my patent search & analytics web app. It does a bunch of things, but at its core, it is about discovering similar patent documents based on one or more source documents.
The app is built using many amazing tools that others have developed at places like Google and Spotify, and I learned so much reading about others’ work on their blogs. So, I wanted to write a little about how I built it and how it does what it does.
I have tried to limit the background (patent) information, but if you’re not interested in it, I suggest scrolling down to skip the “Background” section.
There are lots of patent documents out there. In fact, there are tens of millions of patent applications that have been filed, just counting the ones filed in English. For manual search and review, this is an absolute nightmare.
Reading, understanding and organising patent data is not trivial at all. Patent documents are written with a goal of seeking legal protection for new inventions, the bargain being that the inventor must disclose the invention to the public. As a result, what gets produced are long, legal documents about inventions, combining all the esoteric language of a legal document with the subject matter difficulty of a technical paper.
To compound this, some tasks involve dealing with large patent data sets. Two very common such tasks are patent searching and patent landscaping.
Patent searching is an exercise in concept extraction. In a typical manual search, a searcher receives a brief and forms an idea of what the search should be “about”. Then, they would form a search query using their expertise, refine the query usually through a process of trial and error, as well as consultation with the client, before producing the results set. They also often manually filter the results based on perceived relevance.
This can get good results, but requires a number of elements for success. The result quality is particularly dependent on the ability of the searcher to a) understand and distill the key technical concepts from the brief, and b) formulate the right query, using the right combination of words into a usually complicated query syntax. Both of these elements are extremely subjective.
In patent landscaping, search results are analysed to convey to the reader the current “lay of the land” of patents in a field. In some patent landscapes, the search results might be put into smaller sub-groups either as one-to-many or many-to-many relationships. The grouping, or tagging, exercise requires further subjective grouping, and must be performed on every patent document.
Both of these tasks can be extremely time-consuming and subjective manual tasks. They can also be expensive as a result.
NLP to the rescue?
Having seen these challenges for years as a patent attorney, I envisaged that these problems could be overcome by applying existing NLP tools, or at least by building on these tools.
The field of patents is actually well suited to NLP applications. For one, the body of patent documents provides a fantastic corpus from which language processing models can be built and on which analyses can be carried out. The above problems are in fact opportunities. Also, because of the bureaucratic nature of patents, the data is extremely well structured. And having domain expertise, I was able to judge the quality of any outputs myself.
Roughly speaking, I wanted to develop a reliable metric with which two documents could be compared against one another. This would enable a search by allowing a ranking of similarity outputs (or the reverse of a ranking by ‘distances’, whichever you’re measuring), and clustering could be carried out from similarity matrices if necessary.
Determining patent document similarity
The last few years have seen amazing advances in machine learning, including in areas related to language processing.
Virtual assistants like Siri and Alexa are ubiquitous, chatbots follow us around on seemingly every website as Clippy did in 1996, and translator apps are able to translate text and speech in real time, even from images. On the cutting edge, an AI model made headlines this year for being able to generate “complete made-up sentences and paragraphs”, and occasionally a “really competent, really well-reasoned essay”, from just a simple prompt. AI and machine learning in the field of language processing have come a long way.
Another of the most significant earlier developments in this period was the Word2Vec model, developed by Mikolov et al. at Google. It has been covered extensively, many times by people far better placed to talk about it than I am. So, I will only discuss it at a high level and as it relates to my project.
The main, significant concept to take away is that Word2Vec introduced meaningful word embeddings. Here’s what I mean by that, and why it’s significant for what I’m trying to achieve.
One of the difficulties in computer science, and in natural language processing in particular, has been representing words, and relationships between words, in a quantifiable manner. Without such representations, it becomes very difficult to manipulate the data, or to carry out any kind of nuanced analysis such as comparisons and assessing degrees of difference. Compounding the problem, previous attempts to convert words into representations have often run into problems of dimensionality.
Word2Vec introduced meaningful word embeddings.
Word2Vec solves both problems, and does so quite elegantly. In Word2Vec, each word is represented by a multi-dimensional vector of floating point numbers. They are thus “distributed” representations in a vector space of fixed size.
In this vector space, each number in each dimension can be (roughly) thought of as representing some aspect of a word — imagine representing a word’s holistic meaning by moving 100 dials, each dial representing some property of a word. That’s (sort of) what a 100-dimensional Word2Vec does.
These numbers are determined by the magic of machine learning. In Word2Vec, the algorithm learns the embedding for each word based on its neighbouring words that appear in the training text corpus.
As a result, each word embedding’s numerical value, and its relationship with other word embeddings’ numerical values can now be meaningful.
Because they are mathematical representations, these words can be represented in a way that can be processed, compared and manipulated, all the while avoiding problems of dimensionality as the number of words represented increases.
I will repeat a few often-cited examples to demonstrate this concept. Using Word2Vec embeddings, the vector difference between “man” and “king” is about the same as that between “woman” and “queen”, implying some sort of “royalty” transformation of the vectors. Similarly, the difference between “France” and “Paris” is similar to that between “Germany” and “Berlin”, a sort of “capital city” transformation.
Note: If you’d like to read more about Word2Vec, here are some great resources: a very readable blog post, an introductory post by Radim Řehůřek who wrote the gensim python library that implements word2vec, Mikolov et al’s original paper, Stanford course on NLP (the first few lectures do a great job of explaining Word2Vec/word embedding concepts).
This might seem obvious to intuit and not that big a deal. But the significance is that these relationships are not programmed; they have been “learned”. They are indicative of the utility and efficacy of word embeddings. In other words, spatial relationships between word embeddings carry meaning. One consequence is that words whose embeddings are close together in vector space are “similar” in meaning.
Distances between word vectors are an indicator of word “similarity”.
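Concretely, “distance” here is usually measured as cosine similarity: the cosine of the angle between two vectors. Here is a minimal sketch, using tiny made-up vectors purely for illustration:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings", purely for illustration.
king = np.array([0.9, 0.8, 0.1])
queen = np.array([0.85, 0.75, 0.2])
car = np.array([0.1, 0.2, 0.95])

print(cosine_similarity(king, queen))  # close to 1.0: similar meaning
print(cosine_similarity(king, car))    # much smaller: dissimilar
```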
At this point, I should say that Word2Vec isn’t perfect, and that it suffers from a number of limitations. There are many newer, arguably better (and far more complex) embedding algorithms like GloVe, fastText or ELMo. I simply use Word2Vec here as one example of word embeddings, as one that’s almost synonymous with the concept.
For me, I wanted to enable a quick, consistent comparison of patent documents, group similar patents together by subject matter, see trends, and compare different companies’ patent portfolios. Word embeddings helped me get to a starting point, and I developed my own way of converting entire documents into vector representations.
Along the way, you will have to deal with numerous questions. At a word level, there are questions like dealing with different word senses, what you’re going to do with lemmatisation (will you assign different vectors to “do” and “doing”?), and n-grams (does “New York” get a unique vector? what about “machine learning”, or “CT scan”?). If you’re embedding concepts at a document level, the problem gets more complex by orders of magnitude.
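To make the word-to-document step concrete, here is the simplest common baseline: averaging a document’s word vectors. This is illustrative only, and emphatically not the method I settled on:

```python
import numpy as np

def mean_document_vector(tokens, word_vectors, dim=100):
    """Naive document embedding: average the vectors of in-vocabulary words.

    `word_vectors` is any mapping from token -> numpy array (a plain dict
    here, though a trained gensim KeyedVectors object works the same way).
    Out-of-vocabulary tokens are simply skipped.
    """
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# Illustrative toy vocabulary of 2-d vectors.
vocab = {"neural": np.array([1.0, 0.0]), "network": np.array([0.0, 1.0])}
doc_vec = mean_document_vector(["neural", "network", "xyzzy"], vocab, dim=2)
print(doc_vec)  # [0.5 0.5]
```

Averaging loses word order and tends to wash out rare-but-important terms, which is part of why the document-level problem is so much harder.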
That’s why it was important to have clear goals and evaluation metrics.
Setting goals & metrics
There are a ton of different ways to go about turning a body of text into a single vector representation. So, I needed a way to evaluate different methods of doing so regardless of whether I was going to use an existing algorithm, or attempt to develop my own.
I wanted to turn each patent into a vector representation to find similar patent documents, see trends, and compare different companies’ patent portfolios.
If you’re looking for my source code or algorithm on how my patent vectors are derived, sorry, I will not be going into that much detail. But, it’s probably not relevant to most of you to be fair.
Instead, I will go through my methodology in settling on an algorithm and evaluation metrics, because I believe that to be the most important part of the process.
I began by establishing my goals, which were a) to have the vector represent the core concepts of a patent document, and for any subsequent similarity outputs to b) emulate what patent & subject matter experts would consider similar.
To this end I had to really establish for myself what I meant by “similarity”, how I wanted to assess it, and what my ground truth data set would be to evaluate performance. After some thought, I settled on a mix of quantitative and qualitative metrics. Here’s why.
In many machine learning problems, you have good ground truth datasets to fall back on. Kaggle is in fact built on this premise, that there will always be ground truth datasets to assess your model on.
Well, that’s just not the case a lot of the time in real life. It’s especially true when you’re dealing with language.
Early on, I was able to obtain some categorised data sets which I could use to assess, to some extent, the average performance of candidate patent vectorisation algorithms. I used that data set to develop a scoring metric based on how many patents from the same “category” are ranked as high-similarity items.
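A hypothetical sketch of that kind of metric (the function name, identifiers and data below are all illustrative, not my actual scoring code):

```python
def category_precision_at_k(source_id, ranked_ids, categories, k=10):
    """Fraction of the top-k most-similar documents that share the source
    document's category. A toy stand-in for a category-based score."""
    source_cat = categories[source_id]
    hits = sum(1 for doc_id in ranked_ids[:k] if categories[doc_id] == source_cat)
    return hits / k

# Toy data: patent id -> category label.
categories = {"p1": "engines", "p2": "engines", "p3": "batteries",
              "p4": "engines", "p5": "batteries"}
# Documents ranked by similarity to p1, most similar first.
ranked = ["p2", "p4", "p3", "p5"]

print(category_precision_at_k("p1", ranked, categories, k=2))  # 1.0
```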
However, it still doesn’t tell you much about degrees of similarity. Yes, categorisation in ML often works by calculating a continuous variable and then transforming it (logistically, or via softmax) to a binary or categorical output. But using only categorical ground truth to train or develop a model with a continuous output might look okay on paper yet perform poorly in reality, because I would be optimising for the “wrong” metric. In fact, while I was relying on this metric alone, I found the algorithm’s performance in my random reviews to be quite unsatisfactory.
So, in lieu of better data, I ended up doing two things. Firstly, I labelled data sets myself: for each source patent in one “set”, I manually ranked the remaining patents by similarity. It was my version of a Mechanical Turk task that some researchers had done at the word level to develop an intrinsic evaluation criterion.
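With manual rankings in hand, one natural way to score agreement between an expert’s ordering and the model’s ordering is rank correlation. This is a hedged sketch of the idea, not my actual evaluation code:

```python
# Comparing a human-expert ranking with a model-implied ranking using
# Spearman's rank correlation (toy numbers, purely illustrative).
from scipy.stats import spearmanr

# For one source patent: the rank each candidate received from a human
# expert, and the rank implied by the model's similarity scores.
human_rank = [1, 2, 3, 4, 5]
model_rank = [1, 3, 2, 4, 5]

rho, p_value = spearmanr(human_rank, model_rank)
print(round(rho, 3))  # 0.9; values near 1.0 mean the model agrees with the expert
```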
Additionally, I also ended up adding a manual assessment component where I assessed relative (extrinsic) performance of the patent vectors by reviewing and scoring “best match” patents that it would find for the source patent documents.
Luckily, I found quite good alignment between the categorisation data, and my own quantitative/qualitative outputs.
Using these metrics, I evaluated the word embedding algorithms mentioned above, various corpus sizes, and various methods of converting word vectors to paragraph or document vectors (as well as embedding documents directly, including my own formulations). On top of that, there was hyperparameter testing.
It was a lot of variables.
My suggestion here for anyone going through a similar process of algorithm evaluation would be this:
Before developing models, downloading larger and larger training corpus, or even really writing any lines of code — read & understand the principles of each algorithm first.
Then consider your particular goals, any peculiarities of your field, and which models might suit your particular problem best.
There’s simply not enough time in the world to test every model at every parameter at every model size.
You are far better off starting off with an idea of what you need, forming some hypotheses as to what models, methods and hyperparameters might suit you best, and then running them. You’ll get a lot more out of each run and experiment, iterate faster even if your results don’t match your expectations, and ultimately save a ton of time.
From my perspective, the fact that I practiced as a patent attorney helped a lot, from both an algorithm design perspective as well as a validation perspective.
To give you an idea of the search algorithm in action, here is a search result set for patents most similar to the original pat2vec patent (US9037464B1):
It works pretty well. It has found a bunch of patents in the machine learning and language modelling domains, just from the input of a patent number. The results are sorted by proximity to the original patent document, which I’ve labelled as degrees of “relevance”.
I can also cluster the results into subsets using the similarity metrics derived.
The below figure shows outputs from clustering approximately 30,000 car-related patent documents. Once I was able to determine similarity satisfactorily, clustering became an easy task where the only major choices were choosing a clustering algorithm and how to choose the number of clusters to use.
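As a rough illustration of that step, here is a minimal hierarchical clustering sketch with SciPy (the toy vectors, linkage method and cluster count are assumptions for the example, not my production settings):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Two well-separated blobs of toy "document vectors", purely for illustration.
blob_a = rng.normal(0.0, 0.05, (20, 8))
blob_a[:, 0] += 1.0  # points mostly along the first axis
blob_b = rng.normal(0.0, 0.05, (20, 8))
blob_b[:, 1] += 1.0  # points mostly along the second axis
vectors = np.vstack([blob_a, blob_b])

# Agglomerative (hierarchical) clustering: cosine distance to match the
# similarity measure, average linkage as one common, deterministic choice.
tree = linkage(vectors, method="average", metric="cosine")
labels = fcluster(tree, t=2, criterion="maxclust")  # cut the tree into 2 clusters

print(sorted(set(labels.tolist())))  # [1, 2]
```

The `tree` produced by `linkage` is the full dendrogram, which is what makes drilling into sub-clusters straightforward: you can re-cut it at different levels without re-clustering.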
Personally, for my application, I found deterministic, hierarchical clustering methods to work better, but your mileage may vary. Hierarchical methods also work well for me in that they allow easier and more consistent ways to drill into sub-clusters.
Searching performance was initially a non-issue, especially for use in a demo desktop application, or as a consulting tool.
Even when I was dealing with millions of patent vectors and wanted to look up the vectors best matching a source patent, speed really wasn’t an issue. In a desktop app that I would use to generate outputs for consulting work, I could load the dataset and simply determine a similarity value for each patent pair. If I was dealing with a subset, I could even generate a matrix and just look up values as I needed.
I used a cosine similarity because of the way my vectors were derived, but your mileage may vary.
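The matrix approach is cheap to sketch: normalise every vector to unit length, and one matrix product gives all pairwise cosine similarities at once (toy 2-d vectors below, purely for illustration):

```python
import numpy as np

def similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """All pairwise cosine similarities: normalise each row to unit length,
    then a single matrix product yields the full similarity matrix."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sims = similarity_matrix(docs)
print(np.round(sims, 3))
# The diagonal is 1.0 (each document vs itself); sims[0, 2] is about 0.707.
```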
But then, for web deployment, where compute is at a premium and speed is king, that wasn’t going to work. I could pre-calculate similar patents and save them as a new column on each patent document’s database row, but that would have taken an enormous amount of time every time the database was updated. It also wouldn’t work for query vectors that were not in the database to begin with.
So what to do?
Approximate nearest neighbour (ANN) algorithms, of course. ANN algorithms are designed to perform almost exactly this task. There are a number of different libraries out there, with NGT, hnsw and Faiss being some of the better known (check out this GitHub repo for benchmarks). I went with ANNOY.
ANNOY was developed by the folks at Spotify for music recommendations, and it had a few key selling points for me: it was fast, had a small memory footprint, and its index could be shared between processes.
The index file may be a few gigabytes for my dataset of a few million vectors, but it uses very little memory, as it memory-maps the index from the hard disk. It is also ideal that the index can be shared; otherwise we might have needed separate ANN indexes for different processes, which would have been highly inefficient.
Using ANNOY, I’m able to look up thousands of nearest neighbour vectors and query them from the SQL server in fractions of a second.
Remember this search result?
This takes less than a second. Patent searches that used to take either hours of your own time, or days for someone to get back to you, are now done in under a second. Pretty neat if you ask me.
Phew. That turned out to be way longer than I expected. I hope that’s useful to some of you. I enjoyed writing it. I haven’t discussed these, but the codebase is almost entirely Python, and some of the other significant tools/libraries used are:
- NLTK, SpaCy, Django, gensim and Plotly.
I expect to be writing more here as we go — some of it about how the site works, and some about the outputs. I will be occasionally sending out site updates and articles — so if that was interesting, please sign up below (or even better, on my site).
Thanks & until next time!