Building Similar Patent Recommendations for Chemistry Patents

Smiral Rashinkar
iReadRx
4 min read · Jun 30, 2021

Search is one of the most important, and probably the most used, features of a great product. While solutions like Elasticsearch, Solr, and PostgreSQL (with its full-text search capabilities) provide full-text search and string matching on keywords and substrings, they are sometimes not enough to match the user with very closely related results. We can use vector embeddings to get semantically close results for user searches.

In patent research, this can be extremely valuable: better results can lead to new discoveries and reduced research effort. We can use NLP to power our search intelligence and surface similar patents alongside the results that already appear.

For example, suppose a user searches for hydrocarbons. They will certainly get results where the text contains the string “hydrocarbons,” but with NLP powering the search, we can also surface similar patents that discuss actual hydrocarbons like methane, ethane, and so on.

How can we provide semantically similar and related results based on the data available?

Across the many experiments we conducted, we found vector search to work really well for finding similar patents with the data available.

Here is the approach that worked best for us, along with the challenges we faced along the way:

We used Universal Sentence Encoder (USE) to convert text into vector embeddings, then performed a similarity search by computing a similarity distance between pairs of vectors.
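As a rough illustration, here is a minimal sketch of that step, assuming the public TF Hub release of USE v4 (512-dimensional output); the sample titles and the cosine-based distance are illustrative, not our exact production pipeline:

```python
import numpy as np
import tensorflow_hub as hub  # requires tensorflow installed

# Load the public Universal Sentence Encoder v4 module from TF Hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

texts = [
    "Process for catalytic cracking of hydrocarbons",
    "Method for the preparation of methane from biomass",
]
vectors = embed(texts).numpy()  # shape: (2, 512)

def similarity_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance: 0 means identical direction; larger means less similar."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cos

print(similarity_distance(vectors[0], vectors[1]))
```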

Here are some issues with this approach:
• Now that we have decided to use Universal Sentence Encoder to create vector embeddings, how do we preprocess patent data before encoding it? Although you could technically feed the whole text to USE, this dilutes the meaning and variation that can be captured in the embedding and yields less relevant results.
• Computing these similarity distances across the whole dataset is costly and time-consuming, especially when the dataset contains around 3M patents.
• Choosing which data to preprocess is a significant factor as well; we had the title, abstract, and claims text as options.

Solution?

First, we need to choose which text to embed. While claims text can be really useful for similarity search, it is long and might require extractive or abstractive summarization, which is a different problem statement altogether; still, using claims text is definitely an option for v2.0 of similarity search.

We decided to use the title and abstract for embeddings, with some preprocessing to reduce their length and capture important tokens from the text using biomedical and custom chemistry Named Entity Recognition (NER) models.
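As a hedged sketch of what NER-based token capture can look like, here is the idea with scispaCy's off-the-shelf en_ner_bc5cdr_md biomedical model standing in for the biomedical and custom chemistry models we actually use (which are not public):

```python
import spacy  # pip install scispacy plus the en_ner_bc5cdr_md model wheel

# en_ner_bc5cdr_md tags CHEMICAL and DISEASE entities
ner = spacy.load("en_ner_bc5cdr_md")

doc = ner("A process for converting methane and ethane into aromatic hydrocarbons")
chemical_tokens = [ent.text for ent in doc.ents if ent.label_ == "CHEMICAL"]
print(chemical_tokens)  # chemistry mentions worth keeping in the embedding text
```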

Second, we preprocess the title and abstract by removing stop words and excluding words longer than a threshold of 15 characters or shorter than 4. The surviving words are then lemmatized and tokenized for the USE embedding step, as sketched below.
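A minimal version of that preprocessing, assuming spaCy for stop words and lemmatization (any equivalent tooling would do):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

MIN_LEN, MAX_LEN = 4, 15  # length thresholds from above

def preprocess(text: str) -> str:
    doc = nlp(text.lower())
    kept = [
        tok.lemma_                           # lemmatize surviving tokens
        for tok in doc
        if not tok.is_stop                   # drop stop words
        and tok.is_alpha                     # drop punctuation and numbers
        and MIN_LEN <= len(tok) <= MAX_LEN   # enforce the length thresholds
    ]
    return " ".join(kept)

title_for_use = preprocess("Catalytic processes for the conversion of hydrocarbons")
```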

Creating the embeddings takes considerable time even on a GPU with Universal Sentence Encoder. Worse, computing the similarity distances and exact nearest neighbors (the similar patents) between a single patent's text and the whole dataset takes around 2 minutes, which is prohibitive.

We can solve this problem by using Approximate Nearest Neighbours (ANN) instead of exact nearest neighbors: for some loss of accuracy, we get a considerable speed boost. While there are some great libraries like faiss and nmslib, we decided to use Annoy as our ANN library. It uses a tree-based implementation to create the ANN index, and it has other awesome features like on-disk builds, which let you build indexes for datasets larger than your RAM.
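A sketch of index creation with Annoy, continuing from the encoding step above; the dimension matches USE v4, while the filename and tree count are illustrative assumptions rather than our production settings:

```python
from annoy import AnnoyIndex

DIM = 512  # USE v4 embedding size
index = AnnoyIndex(DIM, "angular")  # angular distance approximates cosine

# Build the index directly on disk so datasets larger than RAM still fit;
# on_disk_build() must be called before adding items
index.on_disk_build("patents.ann")

for i, vec in enumerate(vectors):  # vectors from the encoding sketch above
    index.add_item(i, vec)

index.build(50)  # number of trees: more trees, better recall, larger index
```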

Using the Annoy index, we can get the top nearest semantic matches for a patent's text. It also returns the similarity distance, which gives us a measure of similarity between patents: the lower the distance, the higher the similarity.
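Continuing the sketch above, retrieving the top five similar patents for an indexed patent might look like this (the queried item itself comes back first at distance 0, so we request six neighbours and skip it):

```python
# Top-5 similar patents for the patent stored at index 0, with distances
ids, dists = index.get_nns_by_item(0, 6, include_distances=True)
for patent_id, dist in zip(ids[1:], dists[1:]):
    print(patent_id, dist)  # lower distance means more similar
```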

Scope for Improvement and Final Results:

With external SME validation, we found title-text embeddings to yield better results than abstract text. As mentioned, with appropriate preprocessing, claims text is also worth evaluating for this feature.

Universal Sentence Encoder makes a good baseline. Still, models like BioSentVec, which was trained on PubMed and MIMIC-III, can provide better semantic embeddings for biomedical text and should yield better results.
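If we swapped in BioSentVec, the encoding step might look like the sketch below, using the sent2vec package its authors build on; the model filename refers to the published 700-dimensional binary, and the Annoy index would need to be rebuilt with DIM = 700 to match:

```python
import sent2vec  # Python bindings from the sent2vec project

model = sent2vec.Sent2vecModel()
model.load_model("BioSentVec_PubMed_MIMICIII-bigram_d700.bin")  # ~21 GB download

# embed_sentence returns a (1, 700) numpy array for the given sentence
vector = model.embed_sentence(preprocess("methane conversion catalyst"))
```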

While the Annoy index was primarily used to precalculate the top similar patents within the existing dataset, it can also serve as an intelligent search for unseen queries, matching them against existing patents (at least until new patent data is ingested, since an Annoy index cannot be updated in place and must be rebuilt).
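Reusing the names from the earlier sketches, matching an unseen search query could look like this: preprocess and encode the query with the same model, then query the index by vector.

```python
# Encode the unseen query exactly as the indexed patents were encoded
query_vec = embed([preprocess("catalytic cracking of hydrocarbons")]).numpy()[0]

# Top-5 existing patents closest to the query, with distances
ids, dists = index.get_nns_by_vector(query_vec, 5, include_distances=True)
```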

This is live at http://ichemist.ireadrx.ai/, so head over to our app, try searching for patent keywords, and check out the top 5 similar patents related to the original patent!

(If the above link is broken, check out our demo videos at https://www.youtube.com/playlist?list=PLZ0CqxpJ8nQsPeIg5QowE1s5esIONoFQt)

Click the “Show Similar Patents” button to retrieve closely related patents
Top 5 patents related to the original patent

Smiral Rashinkar is a Machine Learning Engineer and Tech Evangelist at iReadRx, and a generalist who relishes designing and building ML solutions.