Patent Language Processing

Applying Natural Language Processing in Patent Spaces

Lee Mackey
Apr 29, 2019

Are you predicting stock prices, product launches, or job market niches? Patents contain text that may help signal what, how, and where new ideas will move markets. This post explores applied research on natural language processing (NLP) in the patent domain. For more context on patents as data, the World Intellectual Property Organization (WIPO), the global body that administers the international patent system, is a good place to discover resources on patent analytics, operational AI initiatives, and datasets. If you’d prefer examples of market analysis of tech companies using patent data, I like to read reports by CB Insights.

Why care about patent language processing?

Why am I digging into patent records using NLP? Recently, I’ve begun to learn and experiment with NLP tools, and patent applications represent the type of dense technical language that machines are good at processing. I have some previous experience analyzing patent data — on inventors and locations — as part of Ph.D. research on climate tech innovation that I completed in 2016. As I transitioned from academia into tech, I also had the opportunity to acquire patent application data to forecast the quality of startup firms. Now, I’m learning new computational tools to leverage unstructured patent data at scale in machine learning projects, and reading applied research is a way to organize my mental model of use cases for NLP in the patent space. Here’s what I’ve found on a first dive, organized around four dimensions.

Data Pre-processing

Raw patent records often require processing before they can be used in machine learning workflows. Differences in data-generation conventions across national patent agencies, and over time, introduce noise, so cleaning is required to treat patent texts as comparable units of analysis. A common challenge is identifying names and mapping them to the correct entities — inventors, companies, topics, locations — contained in patent metadata. While named entity recognition, an NLP technique, could conceivably be the end goal of a workflow, significant cleaning to resolve unique names and locations is often required to support higher-level analytical goals. Disambiguation algorithms and clustering techniques such as k-means can help organize named entities in patent data — people, places and businesses — before patent records are used as input to NLP techniques.¹
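
To make the disambiguation step concrete, here is a minimal Python sketch that clusters inventor name variants using TF-IDF over character n-grams and k-means; the names, the fixed cluster count, and the scikit-learn setup are illustrative assumptions of mine, not the pipeline used in the cited work.

```python
# Minimal sketch: cluster inventor name variants so that spelling
# differences map toward one canonical entity. Names and the cluster
# count are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

names = [
    "Smith, John A.", "Smith, J. A.", "Smith, John",
    "Garcia, Maria", "Garcia, M.",
    "Tanaka, Hiroshi", "Tanaka, H.",
]

# Character n-grams tolerate initials and minor spelling variation.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
X = vectorizer.fit_transform(names)

# k is chosen by eye here; in practice it must be estimated, or
# replaced by a method that does not require a fixed cluster count.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

for name, label in zip(names, kmeans.labels_):
    print(label, name)
```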

Data Representation

Introducing patent texts as inputs to higher-level NLP analysis entails choices about how to represent the long sentences, domain-specific vocabulary, and complex syntax typical of this domain.² Off-the-shelf NLP language models trained on large volumes of news and web text should be evaluated critically before being applied to the complex language of patent corpora, the NLP term for collections of texts. As a pragmatic tactic, some have used crowdsourcing — “human-in-the-loop” support — to adapt off-the-shelf language models for use with patent corpora.³ Nonetheless, research efforts often begin with low-level processing of large patent corpora.

One set of researchers ran a brute-force comparison of all words across a corpus of 5.3 million patent descriptions to develop a domain-specific vector representation, weighting words by term frequency-inverse document frequency (TF-IDF) scores, which express a word’s importance in a patent relative to the other patents in a collection.⁴ Others suggest that including n-grams, contiguous sequences of n adjacent tokens from a text, improves accuracy on patent classification tasks.⁵ One group of researchers compares three families of patent vectorization methods: vector space models using TF-IDF, topic models using Latent Semantic Indexing (LSI), and neural document embeddings using doc2vec (D2V), which extends the word2vec word embedding model.⁶ Their results suggest that the more advanced methods offered only a limited increase in performance over the TF-IDF approach, as measured by a cosine similarity metric.
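
As a concrete illustration of the TF-IDF baseline these studies benchmark against, here is a minimal sketch that vectorizes a few invented patent abstracts with unigrams and bigrams and scores pairwise cosine similarity; the texts and parameters are toy assumptions, not drawn from the cited corpora.

```python
# Minimal sketch of the TF-IDF baseline: vectorize patent texts with
# unigrams and bigrams, then score pairwise cosine similarity.
# The toy abstracts are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "A lithium-ion battery electrode with a silicon anode coating.",
    "An anode coating for lithium batteries using silicon particles.",
    "A method for routing packets in a wireless mesh network.",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(abstracts)

# Rows and columns index patents; entry (i, j) is their similarity.
similarity = cosine_similarity(X)
print(similarity.round(2))
```

The first two abstracts score high against each other and both score near zero against the third, which is the behavior a pairwise similarity matrix over a patent corpus scales up.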

Modelling Semantic Similarity

Textual measures of semantic similarity built on the data representations discussed in the previous section can also form the basis for features that proxy market dynamics, company strategy, and location-based technology specialization in machine learning workflows.⁷ Topic modeling approaches such as Latent Dirichlet Allocation (LDA), along with the LSI approach discussed above, can help model the semantics, the meaning, of the ideas contained in a patent application. Semantic similarity metrics between patent texts can also serve as alternative “distance measures” between technological spaces, relevant to competitive dynamics within or across market categories. To measure technological similarity, some authors compute cosine similarity between every pair of patents in a corpus.⁴ Other text-based measures use Jaccard similarity, providing a lens separate from the classification taxonomies of national patent agencies.⁸ Beyond technological similarity, researchers are also working to use patents to measure and predict other market-relevant concepts, such as the quality of patented ideas.
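
For a sense of how a Jaccard-style measure differs from the cosine example above, here is a deliberately naive sketch over word sets; real pipelines along the lines of the cited work⁸ stem, filter, and weight terms first, so treat this only as the bare intuition.

```python
# Minimal sketch of a Jaccard-style similarity over keyword sets:
# intersection size over union size. Tokenization here is naive;
# production pipelines stem and filter terms first.
def jaccard(text_a, text_b):
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

p1 = "silicon anode coating for lithium battery electrodes"
p2 = "lithium battery anode with silicon coating"
p3 = "packet routing in wireless mesh networks"

print(jaccard(p1, p2))  # relatively high overlap
print(jaccard(p1, p3))  # near zero
```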

Predicting Quality

A higher-level goal of a machine learning workflow using NLP may be to develop measures of patent quality, novelty, or valuation that predict transformations in a technological space or conversion into commercial products. Textual signals from patents may express multiple dimensions of the quality of the underlying ideas. For example, some researchers develop a quality measure they term “firstpatword”, which captures the novelty of a patent’s idea based on the first appearance of one of its key words relative to a peer patent corpus.¹ Such predictive measures can complement counts of patents and citations, which are common quality measures in branches of the academic literature. Future research on these semantic approaches may continue to improve, and problematize, existing measures of patent quality in the literature.⁹
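
To illustrate the intuition behind a first-appearance measure like “firstpatword”, here is a toy sketch that walks invented patents in filing order and flags words the corpus has not yet seen; the construction in the cited paper¹ is considerably more careful, so this is only the core idea, with all data made up.

```python
# Toy illustration of a first-appearance novelty signal: walking
# patents in filing-date order, flag words the corpus has not seen
# before. A simplification of the idea behind "firstpatword", not
# the authors' exact construction.
patents = [  # (patent id, text), already sorted by filing date
    ("P1", "battery electrode with graphite anode"),
    ("P2", "battery electrode with silicon nanowire anode"),
    ("P3", "silicon nanowire anode with graphene coating"),
]

seen = set()
for pid, text in patents:
    words = set(text.split())
    new_words = words - seen  # words appearing for the first time
    print(pid, "new terms:", sorted(new_words))
    seen |= words
```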

This quick dive into the intersection of NLP and market-focused patent analysis begins to scope an emerging space of applied research. The articles examined underscore the extent to which the initial processing and representation of patent corpora command researchers’ time and attention. As these challenges are overcome, there appears to be potential for continued development of tools and techniques that apply NLP to patent spaces, as suggested in AI and deep learning roadmaps developed by stakeholders in these sectors.⁷ So, if you’re forecasting emerging technologies, shaping strategic technology plans, or mapping technological hotspots, perhaps there is a place for semantic features built on patent data in your project.

Articles Referenced in this Post:

¹ Balsmeier, Benjamin; Assaf, Mohamad; Chesebro, Tyler; et al. 2018. Machine learning and natural language processing on the patent corpus: Data, tools, and new measures. Journal of Economics & Management Strategy, 27(3): 535–553. https://doi.org/10.1111/jems.12259

² Verberne, Suzan; D’hondt, Eva; and Oostdijk, Nelleke. 2010. Quantifying the Challenges in Parsing Patent Claims. In 1st International Workshop on Advances in Patent Information Retrieval (AsPIRe 2010). https://repository.ubn.ru.nl/bitstream/handle/2066/84168/84168.pdf

³ Hu, Mengke; Cinciruk, David; Walsh, John MacLaren. 2016. Improving Automated Patent Claim Parsing: Dataset, System, and Experiments. arXiv: https://arxiv.org/abs/1605.01744

⁴ Younge, Kenneth and Kuhn, Jeffrey. 2016. Patent-to-Patent Similarity: A Vector Space Model. SSRN: http://dx.doi.org/10.2139/ssrn.2709238

⁵ D’hondt, Eva; Verberne, Suzan; Koster, Cornelis; Boves, Lou. 2013. Text Representations for Patent Classification. Computational Linguistics. 39, 3: 755–775. https://doi.org/10.1162/COLI_a_00149

⁶ Shahmirzadi, Omid; Lugowski, Adam; and Younge, Kenneth. 2018. Text Similarity in Vector Space Models: A Comparative Study. arXiv: https://arxiv.org/abs/1810.00664

⁷ Aristodemou, Leonidas; Tietze, Frank. 2018. The state-of-the-art on Intellectual Property Analytics (IPA): A literature review on artificial intelligence, machine learning and deep learning methods for analysing intellectual property (IP) data. World Patent Information. 55: 37–51. https://doi.org/10.1016/j.wpi.2018.07.002

⁸ Arts, Sam; Cassiman, Bruno; Gomez, Juan Carlos. 2018. Text Matching to Measure Patent Similarity. Strategic Management Journal. 39, 1: 62–84. https://doi.org/10.1002/smj.2699

⁹ Kuhn, Jeffrey; Younge, Kenneth; Marco, Alan. 2019. Patent Citations Reexamined. SSRN: http://dx.doi.org/10.2139/ssrn.2714954


Lee Mackey

I’m a PhD research strategist and data scientist who solves user and client challenges with data. lee-mackey.com