Unveiling PDF Parsing: How to extract formulas from scientific PDF papers
Feb 15, 2024
This article is a supplement to Advanced RAG 02: Unveiling PDF Parsing.
Extracting formulas from scientific papers has always been a challenging task.
Several tools can recognize formulas in scientific papers, including:
- Nougat: Neural Optical Understanding for Academic Documents, an end-to-end trainable, transformer-based encoder-decoder model that converts document pages to markup.
- GROBID: Figure 2 shows that it performs worse than Nougat.
- LaTeX-OCR: Figure 2 shows that it also performs worse than Nougat.
- Donut: the model whose architecture Nougat is based on.
- Mathpix Snip: A paid tool.
In this article, we use the open-source Nougat framework. Its architecture is shown in Figure 1:
For scientific papers, Nougat's formula recognition accuracy is high, as shown in Figure 2:
As a demonstration, we use some formulas from page 5 of the paper “Attention Is All You Need” as shown in Figure 3.
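To make the demonstration concrete, here is a minimal sketch of how Nougat could be run on that page and how the formulas could be pulled out of its output. It rests on a few assumptions that are not from the original article: that the `nougat` command-line tool is installed and supports the `--pages` option, that it writes a `.mmd` file named after the PDF into the output directory, and that math in that file is delimited Mathpix-style with `\[ ... \]` and `\( ... \)`. The helper names (`build_nougat_command`, `extract_formulas`, `parse_pdf_formulas`) are hypothetical.

```python
import re
import subprocess
from pathlib import Path

# Assumption: Nougat's .mmd output marks display math as \[ ... \]
# and inline math as \( ... \) (Mathpix Markdown style).
DISPLAY_MATH = re.compile(r"\\\[(.+?)\\\]", re.DOTALL)
INLINE_MATH = re.compile(r"\\\((.+?)\\\)", re.DOTALL)


def build_nougat_command(pdf_path, out_dir, pages=None):
    """Build the nougat CLI call as an argument list (hypothetical helper).

    `pages` is a string such as "5" or "1-4,7"; it is only added when given.
    Assumption: the installed nougat version supports --pages.
    """
    cmd = ["nougat", str(pdf_path), "-o", str(out_dir)]
    if pages:
        cmd += ["--pages", pages]
    return cmd


def extract_formulas(mmd_text):
    """Return (display, inline) lists of LaTeX strings found in .mmd text."""
    display = [m.strip() for m in DISPLAY_MATH.findall(mmd_text)]
    inline = [m.strip() for m in INLINE_MATH.findall(mmd_text)]
    return display, inline


def parse_pdf_formulas(pdf_path, out_dir, pages=None):
    """Run nougat on the PDF, then extract formulas from the resulting file.

    Assumption: the output is written to <out_dir>/<pdf stem>.mmd.
    """
    subprocess.run(build_nougat_command(pdf_path, out_dir, pages), check=True)
    mmd_file = Path(out_dir) / (Path(pdf_path).stem + ".mmd")
    return extract_formulas(mmd_file.read_text(encoding="utf-8"))
```

With a local copy of the paper, a call such as `parse_pdf_formulas("attention.pdf", "output", pages="5")` (the file and directory names are placeholders) would return the display and inline formulas recognized on page 5. The regex step is deliberately simple; it does not handle nested or escaped delimiters.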