Technology Due Diligence Co-pilot


Article by: Noah A, Benjamin Zeisberg, Karan Daryanani

Project GitHub link: https://github.com/noahmadi/AC215_seeing-green/tree/main

Video link: https://drive.google.com/file/d/1FoKorszTdyGmSZbhLtsT-7-5O9ycK0Ad/view?usp=drive_link

This article was produced as part of the final project for Harvard’s AC215 Fall 2023 course.

The evolution of a Technology Due Diligence Co-Pilot

“Due to the rapidly increasing complexity of deep technology there is a large ‘knowledge gap’ between innovators and investors. Frequently, financial intermediaries do not have sufficient information and/or technical expertise to evaluate the technical and economic viability of deep technology projects and innovative fast growing [deep-tech] companies.”

- European Investment Bank

The nature of hardware or ‘tough-tech’ investing is such that it can be very difficult for investors, who commonly lack a technical background, to properly assess the potential of a technology in relation to the existing stock of technology innovation and patents.

The idea for the project began with finding a dataset that accurately captures the stock of technology innovation presently available (we landed on publicly available USPTO data) and comparing a user input (such as a product sentence from a pitch deck, or a company website) against that stock of patents. We considered several potential approaches for this kind of system: a) a long tradition in Natural Language Processing of assessing the ‘novelty’ of text; b) newer advances in high-dimensional representations of text and sentences; and c) the broader area of semantic retrieval, or ‘Retrieval-Augmented Generation’ (RAG), which takes an efficient approach to context-driven answers from a chat interface.

This post expands on the process of building an application that allows an investor to input a sentence relating to a product (e.g. “a pick-up truck”); our retrieval engine then uses embedded patent claims (discussed below) to assemble a context window of patent abstracts from which GPT provides an overview of technologies in the space.

Data Collection and Preprocessing: Laying the Foundation

We have collated a comprehensive dataset of patents from the USPTO, covering the period from 1976 to the present, with a focus on patents filed from 2005 to 2023. Patent data is available through other sources, including Google Patents. That data, however, is stored in PDF format, which would have complicated the text extraction process. The USPTO website allowed us to extract data directly from the source in textual format.

Other tools, such as The Alexandria Index, provide already-embedded data for arXiv papers. However, converting their InstructorXL embeddings into context-relevant BERT embeddings was not as straightforward as first thought, and their website indicates an intent to embed patent data in the future. We were excited to be able to work with a large dataset of embeddings that had yet to be made open source.

The USPTO provides patent data stored as weekly XML files, each containing approximately 3,000 nested XMLs (one per patent). The USPTO transitioned to a new, consistent XML structure for patents beginning in 2005, which allowed for a more seamless parsing tool. We used BeautifulSoup from bs4, along with the pandas and html packages in Python, to parse and write out the raw, nested XML data. The data is stored in a GCP storage bucket and is processed through a series of containers.

First, data was downloaded directly into our GCP bucket from the USPTO website in ‘.zip’ file format. We created and ran a container on our GCP VM that downloaded and unzipped this data and stored the nested XML files in a separate folder.
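For concreteness, a minimal sketch of this first container’s job is below. The bucket name and archive URL are hypothetical placeholders, not the exact ones we used.

```python
# Sketch of the download-and-unzip step; bucket and archive names are
# hypothetical placeholders, not the exact ones used in the project.
import io
import zipfile

import requests
from google.cloud import storage

BUCKET_NAME = "patent-pipeline-data"                 # hypothetical bucket name
WEEKLY_ZIP_URL = "https://example.com/ipg230103.zip" # placeholder weekly archive URL


def download_and_unzip(url: str, bucket_name: str) -> None:
    """Download one weekly USPTO archive and store its XML files in GCS."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)

    resp = requests.get(url, timeout=120)
    resp.raise_for_status()

    with zipfile.ZipFile(io.BytesIO(resp.content)) as archive:
        for name in archive.namelist():
            if name.endswith(".xml"):
                blob = bucket.blob(f"raw_xml/{name}")
                blob.upload_from_string(archive.read(name))


if __name__ == "__main__":
    download_and_unzip(WEEKLY_ZIP_URL, BUCKET_NAME)
```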

We then created another container that read in each nested XML file and parsed out the relevant fields. These included publication number (our ID field), the abstract, and all patent claims. We then extracted the first claim from the patent claims section for each patent as claims are written in a ‘waterfall’ structure: the first claim is the most general and all-encompassing, and each subsequent claim elaborates upon it.
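A stripped-down version of that parsing step might look like the following; the tag names reflect our reading of the post-2005 grant XML schema and would need checking against the actual files.

```python
# Sketch of the per-patent extraction; tag names follow the post-2005
# USPTO grant XML schema as we understood it and may need adjusting.
from bs4 import BeautifulSoup


def parse_patent(xml_text: str) -> dict:
    """Pull the publication number, abstract, and first claim from one patent XML."""
    soup = BeautifulSoup(xml_text, "xml")

    pub_ref = soup.find("publication-reference")
    doc_number = pub_ref.find("doc-number").get_text(strip=True) if pub_ref else ""

    abstract_tag = soup.find("abstract")
    abstract = abstract_tag.get_text(" ", strip=True) if abstract_tag else ""

    # Claims follow a 'waterfall' structure, so the first claim is the most
    # general one and is all we keep for embedding.
    first_claim_tag = soup.find("claim")
    first_claim = first_claim_tag.get_text(" ", strip=True) if first_claim_tag else ""

    return {
        "publication_number": doc_number,
        "abstract": abstract,
        "first_claim": first_claim,
    }
```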

Below is the anatomy of a US patent, highlighting the areas of interest to us:

Some exploratory analysis confirmed that using the first claim on each patent was sufficient for our clustering and semantic similarity exercise (see: https://arxiv.org/abs/2103.11933).

For each weekly file, we wrote out a corresponding parquet file with only the relevant fields mentioned above. The snappy compression of parquet files, coupled with the discarding of irrelevant data, allowed us to shrink our data from almost 500GB to only 12GB. The weekly partitioning of the data allowed us to easily read in chronological subsets of patents during the development and testing phases.
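The write/read pattern is roughly as follows; the file names and example rows are purely illustrative.

```python
# Illustrative write/read pattern for the weekly parquet partitions.
import pandas as pd

# A tiny stand-in for one parsed weekly batch (in the real pipeline this
# comes from the XML parsing step above).
week_df = pd.DataFrame(
    {
        "publication_number": ["US11111111B2", "US22222222B2"],
        "abstract": ["An example abstract.", "Another example abstract."],
        "first_claim": ["1. A device comprising ...", "1. A method comprising ..."],
    }
)

# One compact parquet file per weekly archive; snappy is the default
# compression for pandas/pyarrow, shown explicitly here.
week_df.to_parquet("patents_2023-01-03.parquet", compression="snappy", index=False)

# The weekly partitioning makes it easy to read a chronological subset
# during development (swap in gs:// paths plus gcsfs for the GCP bucket).
dev_df = pd.concat(
    pd.read_parquet(f"patents_{week}.parquet") for week in ["2023-01-03"]
)
```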

Our final data set included 975 weeks of patent data, neatly partitioned in parquet format in our GCP bucket.

Embeddings and Semantic Retrieval

The goal of a semantic retrieval algorithm is to perform matching based on a symmetric distance measure (e.g. cosine similarity, Manhattan distance, or Jaccard similarity) and then return the top ’n’ most similar results to the input.

Let us then consider the different parts of this system: first, we need embeddings that capture contextual similarity to build comparisons with; second, we need an efficient way to compute similarity scores; and lastly, we need to use the distance measure to compute pairwise distances between our user input and the repository of embeddings.
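As a bare-bones illustration of the matching step (before any indexing optimizations), the pairwise cosine computation looks like this, with random vectors standing in for real embeddings:

```python
# Naive version of the matching step: cosine similarity between one query
# embedding and a matrix of claim embeddings, returning the top n indices.
import numpy as np


def top_n_matches(query: np.ndarray, claims: np.ndarray, n: int = 10) -> np.ndarray:
    """Return the indices of the n claim embeddings closest to the query."""
    query = query / np.linalg.norm(query)
    claims = claims / np.linalg.norm(claims, axis=1, keepdims=True)
    scores = claims @ query            # cosine similarity after normalization
    return np.argsort(-scores)[:n]     # indices of the n highest scores


# Toy usage: random vectors stand in for the real claim embeddings.
rng = np.random.default_rng(0)
idx = top_n_matches(rng.standard_normal(768), rng.standard_normal((1000, 768)))
```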

The model we use for our embeddings is based on a paper (https://arxiv.org/abs/2103.11933) that leverages a self-supervised model training system. The benchmark STS similarity dataset (https://huggingface.co/datasets/metaeval/sts-companion) is used to fine-tune a RoBERTa-based classifier that creates a set of labeled sentence pairs to serve as input to the S-BERT model. The labeled pairs (‘machine-augmented’ data) are then used to fine-tune S-BERT (a transformer architecture based on Siamese networks for sentence embeddings).

(Figure from the PatentSBERTa paper.)

We leverage the pre-trained model from Hugging Face (https://huggingface.co/AI-Growth-Lab/PatentSBERTa), whose sentence embeddings do better with the specialized, technical content of patent claims.
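Loading the checkpoint is straightforward; the snippet below is a sketch assuming the sentence-transformers interface, with toy claim text.

```python
# Sketch of loading the pre-trained PatentSBERTa checkpoint via
# sentence-transformers and embedding a couple of toy claims.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("AI-Growth-Lab/PatentSBERTa")

claims = [
    "1. A vehicle comprising a cargo bed and a passenger cabin ...",
    "1. A method for storing electrical energy in a solid-state cell ...",
]
embeddings = model.encode(claims, batch_size=32, show_progress_bar=False)
print(embeddings.shape)  # (2, embedding dimension of the model)
```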

The paper presents the model as having comparable performance to models trained on much larger corpora of annotated data, such as PatentBERT. The primary benefit of this setup is that a team with no contextual understanding of patents could leverage an entirely self-supervised system with publicly available data. Using the S-BERT architecture also gives us faster embedding speeds, with an embedding output that is amenable to semantic-similarity tasks.

We take in the parquet files containing the patent information and use the model above to create context-specific embeddings for the patent claims, running the job in parallel on a high-performance computing cluster. Claims much longer than the 512-token limit accepted by BERT models are split into multiple overlapping segments using a ‘rolling window’ and embedded separately. For every segment, the embedding of the leading [CLS] token is taken as the representative sentence embedding.
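A rough sketch of that chunk-and-embed step is below; the window and stride sizes are illustrative rather than the exact values from our pipeline.

```python
# Rough sketch of the long-claim handling: overlapping token windows are
# embedded separately and the [CLS] vector is kept for each window.
# Window and stride sizes are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "AI-Growth-Lab/PatentSBERTa"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)


def embed_claim(text: str, max_len: int = 512, stride: int = 128) -> torch.Tensor:
    """Return one [CLS] embedding per rolling window over the claim text."""
    enc = tokenizer(
        text,
        max_length=max_len,
        stride=stride,
        truncation=True,
        return_overflowing_tokens=True,  # produces the rolling windows
        padding="max_length",
        return_tensors="pt",
    )
    with torch.no_grad():
        out = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])
    return out.last_hidden_state[:, 0, :]  # [CLS] token for each window


windows = embed_claim("1. A device comprising ... " * 200)
print(windows.shape)  # (number of windows, hidden size)
```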

We then take the set of embeddings that were created and look for an efficient storage solution that allows for quick comparison of vectors. Facebook’s FAISS (https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) provided a perfectly usable vector database for this purpose, enabling us to create a searchable index of sentence embeddings. There are several other tools, however, that could be used for the same purpose. The front-end (discussed below) allows for real-time embedding of a user input.

In our case, this input is a technology-related or product sentence. The resulting embedding is used to compute a set of pairwise distances (using cosine similarity) in order to return the top 10 most relevant patent claims (and their IDs) from the larger database.
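A minimal version of the index build and top-10 lookup is sketched below, with random vectors standing in for the real claim embeddings; cosine similarity is obtained by L2-normalizing the vectors and using an inner-product index.

```python
# Sketch of the FAISS index over the claim embeddings. L2-normalizing the
# vectors makes inner-product search equivalent to cosine similarity.
import faiss
import numpy as np

# Random vectors stand in for the real claim embeddings.
embeddings = np.random.default_rng(0).standard_normal((10_000, 768)).astype("float32")
faiss.normalize_L2(embeddings)                  # in-place normalization

index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product search
index.add(embeddings)

# A user query embedded in real time by the front end is handled the same
# way: normalize, then pull the top 10 claims.
query = np.random.default_rng(1).standard_normal((1, 768)).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)           # ids map back to publication numbers
```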

Generating Summarized Outputs using GPT

The relevant patent claims returned by the semantic search are then matched back to the original abstracts that came with the USPTO data. We then use the top 10 most relevant patent abstracts (which use less technical language than the claims) to provide an extended context window for our generative text model.

We faced several constraints in creating this summarized output, and the expense of the GCP cloud architecture prevented us from fully utilizing open-source generative models such as LLaMA or the newer Mistral release. We therefore decided to leverage the (paid) OpenAI API to summarize the context window we provide. The tool uses the gpt-3.5-turbo-1106 model release.

From the semantic search results and a simple prompt — “Based on these abstracts, summarize key technological advancements and their potential applications.” — the model generates informed, accurate responses for technology due diligence. The system dynamically retrieves relevant patent information for every request input by the user, forming a context window for the GPT prompt, thus providing a comprehensive technology landscape assessment.
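A sketch of that call is below; the prompt wording follows the article, while the system message and the formatting of the abstracts into a context window are illustrative choices.

```python
# Sketch of the summarization call; the user prompt wording follows the
# article, the rest of the message structure is an illustrative choice.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def summarize_abstracts(abstracts: list[str]) -> str:
    """Summarize the retrieved abstracts with gpt-3.5-turbo-1106."""
    context = "\n\n".join(f"Abstract {i + 1}: {a}" for i, a in enumerate(abstracts))
    prompt = (
        f"{context}\n\n"
        "Based on these abstracts, summarize key technological advancements "
        "and their potential applications."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=[
            {"role": "system", "content": "You assist with technology due diligence."},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content
```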

Front-End Development and Deployment

The user interface is developed using React and Node.js, offering a simple and intuitive experience. The front end includes a form where users can submit technology-related sentences. Once a sentence is entered, it’s sent to the back-end through an API call. The results, which include the top 10 relevant patent claims and their GPT-generated summaries, are displayed on the front end. This interface is styled with CSS to enhance visual appeal and usability.

For backend integration, FastAPI is employed. It handles requests from the front end, processes user inputs, and returns relevant information. This API is designed to handle real-time embedding computations and manage data flow efficiently. Deployment and scalability are managed with Kubernetes, ensuring the system’s responsiveness and efficiency under varying loads. We used Ansible to spin up instances and create Docker containers for the deployment.

The core of the system lies in its data processing and model integration. User inputs are first embedded using a pre-trained model, followed by a semantic search to identify the top 10 relevant patent claims. These claims are then matched to their corresponding abstracts. For summarization, these abstracts form a context window for the GPT prompt, and a summary is generated using the GPT model through the OpenAI API.
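Tying the pieces together, a minimal FastAPI endpoint might look like the following; `pipeline` and its helper functions are hypothetical names standing in for the embedding, FAISS search, abstract lookup, and GPT summarization steps sketched earlier.

```python
# Minimal sketch of the FastAPI layer; `pipeline` is a hypothetical module
# standing in for the steps sketched earlier in this post.
from fastapi import FastAPI
from pydantic import BaseModel

from pipeline import (  # hypothetical module and helper names
    embed_input,
    search_index,
    fetch_abstracts,
    summarize_abstracts,
)

app = FastAPI()


class Query(BaseModel):
    sentence: str  # e.g. "a pick-up truck"


@app.post("/search")
def search(query: Query) -> dict:
    embedding = embed_input(query.sentence)            # real-time embedding of the input
    claim_ids, claims = search_index(embedding, k=10)  # top 10 claims from the FAISS index
    abstracts = fetch_abstracts(claim_ids)             # match claims back to their abstracts
    summary = summarize_abstracts(abstracts)           # GPT summary of the context window
    return {"claims": claims, "summary": summary}
```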

Continuous Improvement and Future Work

Early iterations of this project looked to create a novelty score based on a few non-symmetric indices (e.g. comparing patent claim matches across time to look at newly introduced concepts, topics, or words). However, the patent claims context proved a difficult setting for this kind of problem, as some claims differed from the user input only in non-meaningful terms, such as the specific shape of a corner shelf. There are many other interesting ways to think about novelty, such as ‘entailment’ classifier models, or leveraging statistical concepts of entropy or ‘information gain’. Looking ahead, this would be an active area of development.

The second aspect we could improve would be to create better generative outputs that are genuinely useful for investors. Part of this is more user research, but another part could be better, more fine-tuned language models for generation. We could use the original dataset (and the abundance of text found in patents, such as embodiment-related text) to fine-tune an open-source language model (such as LLaMA).
