Chemical Patent Analysis Beyond Simple OCR

Shashank · Published in iReadRx
Jul 9, 2021

Making sure you are not infringing on somebody else’s intellectual property (IP) is an essential step before filing a patent. Inventors and patent attorneys spend considerably large amounts of time analyzing similar patents for this reason — time which could instead be spent on research.

Chemical patents, like other patents, contain a great deal of information, both textual and non-textual. To analyze this data quickly, it is becoming popular to extract it and process it programmatically, for instance by using OCR to extract text and then searching for keywords. But OCR limits the extracted information to textual data. Other important details, such as chemical structures and pictures of reactions, can't be extracted this way.

Deep learning-based approaches like object detection solve this problem. Many state-of-the-art (SOTA) object detection models help quickly detect non-textual pieces of information like chemical structures and reactions. The detections can further be cropped and used for other tasks.

YOLOv5 is one such model.

Overview of YOLO

YOLO (You Only Look Once) is a family of object detection models that run inference by "looking" at the image only once, i.e., in a single forward pass. This saves a lot of time, making YOLO one of the fastest object detection algorithms available. In use cases that require analyzing hundreds or even thousands of pages at once, YOLO's fast inference makes it a top choice.

Further, YOLO learns generalized representations of objects, so even if the training data contains only patent documents, it can still detect compounds in other kinds of documents.

YOLOv5 is one of the fastest and most accurate open-source models available. Unlike other YOLO models, YOLOv5 is fully implemented in PyTorch, making it easy to work with.
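That single forward pass still yields many overlapping candidate boxes, which are then filtered with non-maximum suppression (NMS): keep the highest-scoring box, drop any box that overlaps it too much, and repeat. A minimal pure-Python sketch of that filtering step is below; YOLOv5 itself does this on the GPU via torchvision, so this is illustrative only.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS: keep boxes in score order, drop heavy overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep

# Two near-duplicate boxes and one distant box:
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the lower-scoring duplicate is suppressed
```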

YOLOv5 performance comparison:

Source: YOLOv5 Repo

Transfer Learning

Transfer learning, in a nutshell, is the process of taking a model pre-trained on one task and transferring its knowledge to another task, improving performance and reducing training time at the same time.

The knowledge is transferred by "freezing" the weights of the first few layers and updating only the weights of the remaining layers. In our case, YOLOv5 had 24 layers, so we froze the first 40% of them, roughly the first 10 layers, which form the model's backbone, and began training.
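In YOLOv5, parameters carry names like `model.3.conv.weight`, so freezing the backbone amounts to disabling gradients for parameters whose names start with the first ten layer prefixes. A hedged sketch of that selection logic (the helper name is ours, not YOLOv5's):

```python
def is_frozen(param_name: str, n_freeze: int = 10) -> bool:
    """True if the parameter belongs to one of the first n_freeze layers.

    Assumes YOLOv5-style parameter names such as 'model.3.conv.weight'.
    The trailing dot in each prefix keeps 'model.1.' from matching 'model.17.'.
    """
    prefixes = tuple(f"model.{i}." for i in range(n_freeze))
    return param_name.startswith(prefixes)

# In the training loop this would be applied as:
#   for name, param in model.named_parameters():
#       if is_frozen(name):
#           param.requires_grad = False
```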

Hyperparameter Tuning

To make the most of the available training data, choosing good hyperparameters is crucial. Optimal hyperparameters not only give the most accurate results, they also converge in the least time, using the least compute.

We ran a genetic algorithm for 300 generations, training the model for 30 epochs per candidate, to find the best hyperparameters. This significantly improved model performance and gave us very accurate results despite our having only 71 images to work with. The results are summarized below.
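The evolution loop itself is conceptually simple: mutate the best hyperparameters found so far, retrain, and keep the candidate if it scores better. A toy pure-Python sketch of that loop (YOLOv5's built-in evolve mode is more elaborate, and the fitness function here is a stand-in for a real training run):

```python
import random

def evolve(fitness, base_hyp, bounds, generations=300, sigma=0.2, seed=0):
    """Greedy (1+1) evolution: mutate the incumbent, keep it if fitness improves."""
    rng = random.Random(seed)
    best_hyp, best_fit = dict(base_hyp), fitness(base_hyp)
    for _ in range(generations):
        candidate = {}
        for key, value in best_hyp.items():
            lo, hi = bounds[key]
            mutated = value * (1.0 + rng.gauss(0.0, sigma))  # multiplicative noise
            candidate[key] = min(max(mutated, lo), hi)       # clip to the search range
        fit = fitness(candidate)  # in practice: train ~30 epochs, return mAP
        if fit > best_fit:
            best_hyp, best_fit = candidate, fit
    return best_hyp, best_fit
```

Each fitness evaluation corresponds to a full (short) training run, which is why capping each candidate at 30 epochs keeps 300 generations affordable.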

Results

We developed a web app that lets you upload a PDF and returns the detected chemical compounds. The app crops each detection and stores it in a separate directory per class. The results can then be downloaded by clicking on a generated link. All this with surprisingly low inference times.
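Organizing the crops is straightforward bookkeeping: each detection carries a class label, and its crop is written into one directory per class. A simplified sketch (function and field names are ours, and the actual image cropping is stubbed out):

```python
from pathlib import Path

def save_crops(detections, out_dir):
    """Write one placeholder crop file per detection, grouped by class.

    detections: list of dicts like
        {"class": "structure", "page": 3, "box": (x1, y1, x2, y2)}
    In the real app, det["box"] would be cropped from the rendered page
    image and saved as a PNG instead of an empty placeholder file.
    """
    saved = []
    for i, det in enumerate(detections):
        class_dir = Path(out_dir) / det["class"]
        class_dir.mkdir(parents=True, exist_ok=True)  # one directory per class
        crop_path = class_dir / f"page{det['page']}_det{i}.png"
        crop_path.touch()
        saved.append(crop_path)
    return saved
```

The per-class layout makes it trivial to zip the output directory and serve it behind a single download link.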

Comparison with Google Patents

We looked up a couple of chemical patent documents on Google Patents and noted the number of chemical structures it extracted. We then downloaded the same documents and uploaded them to our web app to compare the results.

Example 1: File

Example 2: File

Example 3: File

Because Google Patents relies on extracting the image elements embedded in the PDF to detect chemical compounds, it could not extract all of the compound structures present in the documents.

Future Work

Continue improving model accuracy.

Distinguish between Compounds, Markush Structures, and Intermediates.
