Unveiling PDF Parsing: How to extract formulas from scientific pdf papers

Feb 15, 2024

This article is a supplement to Advanced RAG 02: Unveiling PDF Parsing.

Extracting formulas from scientific papers has always been a challenging task.

There are some tools that can recognize formulas in scientific papers, such as:

Nougat: Neural Optical Understanding for Academic Documents, an end-to-end trainable encoder-decoder transformer-based model for converting document pages to markup.
grobid: Figure 2 shows that it performs worse than Nougat.
LaTeX-OCR: Figure 2 shows that it performs worse than Nougat.
Donut: The model architecture that Nougat is based on.
Mathpix Snip: A paid tool.

In this article, we use the open-source Nougat framework, whose architecture is shown in Figure 1:

For scientific papers, Nougat's formula recognition accuracy is high, as shown in Figure 2:

As a demonstration, we use some formulas from page 5 of the paper “Attention Is All You Need”, shown in Figure 3.
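Once Nougat has converted a page to markup, the formulas still have to be separated from the surrounding text. The snippet below is a minimal sketch of that step, not part of Nougat itself. It assumes the output is Mathpix-Markdown-style text in which display formulas are wrapped in `\[ ... \]` and inline formulas in `\( ... \)`. The function name `extract_formulas` and the `sample` string are hypothetical, written by hand for illustration rather than copied from a real Nougat run.

```python
import re

# Assumption: display math is delimited by \[ ... \] and inline math by \( ... \),
# as in Mathpix-Markdown-style output. Adjust the patterns if your output differs.
DISPLAY_MATH = re.compile(r"\\\[(.+?)\\\]", re.DOTALL)
INLINE_MATH = re.compile(r"\\\((.+?)\\\)", re.DOTALL)


def extract_formulas(mmd_text):
    """Return the LaTeX source of display and inline formulas found in the text."""
    display = [m.strip() for m in DISPLAY_MATH.findall(mmd_text)]
    inline = [m.strip() for m in INLINE_MATH.findall(mmd_text)]
    return {"display": display, "inline": inline}


# Hand-written sample text (hypothetical, not real Nougat output).
sample = r"""
We compute the matrix of outputs as:

\[\mathrm{Attention}(Q,K,V)=\mathrm{softmax}(\frac{QK^{T}}{\sqrt{d_{k}}})V \tag{1}\]

where \(d_{k}\) is the dimension of the keys.
"""

result = extract_formulas(sample)
for kind, formulas in result.items():
    for formula in formulas:
        print(f"{kind}: {formula}")
```

A regex like this is only a rough filter. For example, a LaTeX line break followed by an optional argument (`\\[2pt]`) inside a formula would confuse it, so real output may need a more careful parser.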

Written by Florian June