Unveiling PDF Parsing: How to Extract Formulas from Scientific PDF Papers

Florian June
4 min read · Feb 15, 2024

This article is a supplement to Advanced RAG 02: Unveiling PDF Parsing.

Extracting formulas from scientific papers has always been a challenging task.

There are some tools that can recognize formulas in scientific papers, such as:

  • Nougat: Neural Optical Understanding for Academic Documents, an end-to-end trainable, encoder-decoder, transformer-based model for converting document pages to markup.
  • grobid: According to Figure 2, it performs worse than Nougat.
  • LaTeX-OCR: According to Figure 2, it also performs worse than Nougat.
  • Donut: The model whose architecture Nougat is based on.
  • Mathpix Snip: A paid tool.

In this article, we use the open-source Nougat framework, whose architecture is shown in Figure 1:

Figure 1: Simple end-to-end architecture following Donut. The Swin Transformer encoder takes a document image and converts it into latent embeddings, which are subsequently converted to a sequence of tokens in an autoregressive manner. Source: Nougat.

On scientific papers, Nougat recognizes formulas with high accuracy, as shown in Figure 2:

Figure 2: Results on the arXiv test set. Source: Nougat.

As a demonstration, we use some formulas from page 5 of the paper “Attention Is All You Need”, as shown in Figure 3.
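
Once a page has been converted to markup, the formulas still have to be separated from the surrounding text. The following is a minimal sketch of that step, not Nougat's own API. It assumes the markup wraps inline math in \( ... \) and display math in \[ ... \]; the function name extract_formulas and the sample string are hypothetical, and the sample is a hand-written stand-in for the feed-forward equation on page 5, not actual Nougat output.

```python
import re

# Assumption: display math is delimited by \[ ... \] and inline math by \( ... \).
# DOTALL lets a formula span several lines; the lazy quantifier stops at the
# first closing delimiter.
DISPLAY_MATH = re.compile(r"\\\[(.+?)\\\]", re.DOTALL)
INLINE_MATH = re.compile(r"\\\((.+?)\\\)", re.DOTALL)


def extract_formulas(markup: str) -> dict[str, list[str]]:
    """Return the LaTeX bodies of all display and inline formulas in `markup`."""
    return {
        "display": [body.strip() for body in DISPLAY_MATH.findall(markup)],
        "inline": [body.strip() for body in INLINE_MATH.findall(markup)],
    }


# Hypothetical sample text, written by hand for illustration only.
sample = (
    "Each layer applies a feed-forward network to \\(x\\):\n"
    "\\[\\mathrm{FFN}(x)=\\max(0,xW_{1}+b_{1})W_{2}+b_{2} \\tag{2}\\]\n"
)

formulas = extract_formulas(sample)
for kind, bodies in formulas.items():
    for body in bodies:
        print(f"{kind}: {body}")
```

Regular expressions are enough for a quick check like this one. Markup that uses other math delimiters, or a LaTeX line break written as \\[...], would need a more careful parser.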
