De-rendering LaTeX Tables

Gideon Mann
4 min read · Sep 23, 2019

At Bloomberg, one key challenge our Global Data team faces is the normalization of incoming data. Their goal is to transform raw incoming data into clean, normalized data that can be used to power financial analytics for our users. For example, publicly-traded companies’ quarterly and annual reports, which are filed with the SEC, include data like quarterly net income and loss. This information is extracted and placed into a database where customers can graph, screen and test against this ground truth.

One of the most complicated settings for this normalization occurs when the data is textual. Given textual data, there has been significant work in NLP, and in information extraction in particular, on normalizing the semantic content of ambiguously-worded text. One example is extracting CEO succession relationships from free text. However, in some cases, the incoming data isn’t in a clean format like XML, where the structure can be read transparently. Often, it arrives as PDFs or images, where even extracting the exact words and the page structure is not straightforward.

PDFs, in particular, are of tremendous business interest, as they are the medium for many regulatory filings, especially outside of the United States. Unlike other problems in machine learning, where the goal is to recover the original human intent behind some observations, the problem here is subtly different. Instead of a human process, there is a computational process that takes markup and renders it as an image. For example, scientific documents are written in the LaTeX typesetting system; the source is interpreted by a LaTeX compiler and rendered as a PDF. Typically, the original LaTeX commands are lost during this transformation. Yet this LaTeX is crucial for understanding the document, as it encodes the document’s structural makeup.
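
As a toy illustration (my own, not an example from any filing or paper), the snippet below is the kind of markup an author writes. After compilation, downstream consumers see only the typeset pixels, so de-rendering means recovering source like this from the image alone.

```latex
% What the author writes, and what de-rendering must recover:
x = \frac{-b \pm \sqrt{b^{2} - 4ac}}{2a}
% What downstream systems receive: only the rendered PDF/image of this formula.
```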

Over the past few years, there has been a line of work dedicated to the problem of “de-rendering.” Interestingly, LaTeX is Turing complete, so, in a sense, de-rendering is almost a form of program synthesis: deriving a program that reproduces a given set of behaviors.

One of the seminal pieces of work on this problem was “Image-to-Markup Generation with Coarse-to-Fine Attention” [Deng et al. 2017]. In that early work, the authors demonstrated how to de-render LaTeX equations from their images. The proposed model uses a convolutional neural network (CNN) to integrate information across the image, runs Bi-LSTMs across the rows of the resulting compressed feature representation, and then emits LaTeX tokens with an attention-based decoder. This model has been surprisingly effective and has been widely replicated — for example, a team of Weights & Biases students reimplemented it (there is an online demo at www.wandb.com/articles/image-to-latex). There is also a video that shows the attention during processing (lstm.seas.harvard.edu/latex/).
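
To make that pipeline concrete, here is a minimal PyTorch sketch of the kind of encoder-decoder described above: a CNN produces a feature grid, a bidirectional LSTM scans each row of that grid, and an attention-based LSTM decoder emits LaTeX tokens. This is not the authors' implementation — their model adds a coarse-to-fine attention stage — and the class name, layer sizes, and the simple dot-product attention here are my own illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImageToMarkup(nn.Module):
    """Sketch of a CNN + row-BiLSTM encoder with an attention-based LSTM decoder."""

    def __init__(self, vocab_size, feat_dim=256, hidden_dim=256, emb_dim=64):
        super().__init__()
        self.hidden_dim = hidden_dim
        # CNN encoder: grayscale image -> (B, feat_dim, H', W') feature grid
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, feat_dim, 3, padding=1), nn.ReLU(),
        )
        # Row encoder: a BiLSTM scans each row of the feature grid
        self.row_encoder = nn.LSTM(feat_dim, hidden_dim // 2,
                                   batch_first=True, bidirectional=True)
        # Decoder: LSTM cell with dot-product attention over the encoded grid
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.decoder = nn.LSTMCell(emb_dim + hidden_dim, hidden_dim)
        self.attn = nn.Linear(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def encode(self, images):
        feats = self.cnn(images)                       # (B, C, H', W')
        B, C, H, W = feats.shape
        rows = feats.permute(0, 2, 3, 1).reshape(B * H, W, C)
        encoded, _ = self.row_encoder(rows)            # (B*H', W', hidden)
        return encoded.reshape(B, H * W, -1)           # flatten grid into a memory

    def forward(self, images, token_ids):
        memory = self.encode(images)                   # (B, H'*W', hidden)
        B = memory.size(0)
        h = memory.new_zeros(B, self.hidden_dim)
        c = torch.zeros_like(h)
        context = torch.zeros_like(h)
        logits = []
        for t in range(token_ids.size(1)):             # teacher-forced decoding
            emb = self.embedding(token_ids[:, t])
            h, c = self.decoder(torch.cat([emb, context], dim=-1), (h, c))
            # Attend over the encoded feature grid
            scores = torch.bmm(memory, self.attn(h).unsqueeze(-1)).squeeze(-1)
            weights = F.softmax(scores, dim=-1)
            context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)
            logits.append(self.out(h + context))
        return torch.stack(logits, dim=1)              # (B, T, vocab_size)
```

Training such a sketch would simply minimize cross-entropy between these logits and the reference LaTeX token sequence.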

At ICDAR 2019 in Sydney this week, I’ll be presenting recent work led by Yuntian Deng of Harvard NLP, whom David Rosenberg and I supported during his summer internship with the Data Science team in the Office of the CTO at Bloomberg. It scales this work up to the harder problem of LaTeX table de-rendering. LaTeX tables are typically more complex than equations. First, they’re larger and frequently contain multiple sub-equations within them. Second, the rendering is non-local — one command (e.g., a row formatted as centered instead of right-aligned) can alter multiple parts of the image. As such, tables represent an interesting increase in complexity over equations. In our paper, “Challenges in end-to-end neural scientific table recognition,” we provide a corpus of around half a million LaTeX tables extracted from arXiv, and demonstrate experimental results from applying the equation model to this table corpus.
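
To make that non-locality concrete, here is a small, purely illustrative tabular snippet (my own toy example, not drawn from the corpus). Changing the single column specification {l r r} to {c c c} would re-align every cell in the rendered image, and a \multicolumn command reshapes an entire row, so the model has to reason about commands whose effects are spread across the picture.

```latex
% Illustrative only: one command can reformat many parts of the rendered image.
\begin{tabular}{l r r}   % this single column spec right-aligns every numeric cell below
  \hline
  Segment   & 2018  & 2019  \\
  \hline
  Terminal  & 1,204 & 1,310 \\
  Analytics &   987 & 1,052 \\
  \hline
  \multicolumn{3}{c}{All figures are illustrative} \\  % spans and re-centers a whole row
  \hline
\end{tabular}
```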

It is heartening to see that the equation model performs adequately, suggesting that it is a reasonable baseline for future work. But there is still a significant gap between the de-rendered LaTeX tables and the ground truth, leaving ample room for improvement. We hope this corpus will provide a mechanism for others to develop more sophisticated algorithms.

Moreover, what is most surprising is simply how difficult this problem is. Certainly, there is ample training data, as we have access to many hundreds of thousands of tables, but approaches that have been very successful for, say, image recognition do not work as well on these documents.

Again, we’re merely talking about recovering the structure of tables. The larger problem of recovering the structure of whole PDF documents that have been rendered from LaTeX still looms unsolved. We hope that, with more attention to this problem, the community will develop methods that scale to whole documents with high precision, enabling increased transparency and liquidity across the capital markets.


Gideon Mann

Head of Data Science / CTO Office, Bloomberg LP. All opinions my own.