Performance Engineering in Document Understanding

Part 1 of 2: Towards Better Data Formats

Scott Arnott
Thomson Reuters Labs
12 min read · Jul 24, 2024

--

Update: Part 2 is now available.

Since I joined the Document Understanding Core (DU Core) team back in 2019 (when it was “the Data Science team” at a start-up called ThoughtTrace), we’ve spent a fair number of brain-hours trying to reconcile the various types of “scale” you can encounter in natural language processing (NLP) workloads:

  • we wanted to process bigger documents
  • we wanted to process more documents at the same time
  • we wanted to process all documents faster

And, naturally, we wanted to do all the above without spending more than necessary — ideally, spend less!

In 2021, for context, we’d processed several million documents (approximately 5 million by January 2022), and the majority of the services doing that processing:

  • Were written in Python — e.g. standard Python “data-processing” services, DataFrames and all
  • Would pull work from one or more queues and continue working a given item until it was complete (whether several seconds or several hours later)
  • Would handle a single request (typically, one “document”) at a time

In comparison, in the past thirty days we’ve processed:

  • More than a half-million documents (with a peak of forty-five thousand in one day)
  • In excess of seven million pages (with a daily peak of more than nine hundred thousand)
  • More than forty million paragraphs (with a daily peak of more than 4.3 million)

And our fiftieth percentile aggregate processing time is approximately thirty seconds. This processing includes optical character recognition (OCR), domain-specific pre- and post-processing, and interaction with up to three separate NLP models!

Note: the above figures are all intended to represent the volumes that different phases of DU Core processing handled independently, i.e., the above is not intended to signify the same work expressed in different units. Some portion likely corresponds to “a document” being submitted by a user and then transiting every processing workflow we support. However, it is likely that just as much of the workload, and potentially more, represents documents OCRed but not classified, bulk re-classified, reprocessed with a different model, etc.

Over the course of the next two posts, we’ll be discussing a few of the technologies & approaches that made a difference: the context in which they were selected, advantages and disadvantages noted during their adoption, and any alternatives that may exist.

Disclaimers, Caveats, and that One Donald Knuth Quote About Premature Optimization

Firstly, this isn’t intended to cover all optimization scenarios. There’s a fair amount of low-hanging fruit for Python-based ML and/or standard backend engineering that won’t be discussed in any great detail (e.g., appropriate database indices). This isn’t because they’re not worth adopting (they are!) but because the intent here is to cover a specific problem area: optimization of real-time, memory- and CPU-bound data-transformation processes. Again, selecting appropriate database indices is a great idea, but DU Core processing doesn’t execute a single database call, ever, so index tuning doesn’t help all that much.

Secondly, none of the above (or below) is intended to signify that performance is a “solved problem” in DU Core. It’s not! It likely never will be: like the vast majority of software ever written, it can always be at least a bit more performant. Moreover, ML engineering has an irritating tendency to require balancing more than just speed or memory usage — our systems would probably be even faster if we just stopped doing machine learning, but I’ve been told repeatedly that that’s not an option.

Lastly, there’s a quote that rattles around the Internet every once in a while that “premature optimization is the root of all evil”, and for what I’m sure are perfectly innocuous reasons, it’s never the full quote:

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

Yet we should not pass up our opportunities in that critical 3%.

Donald E. Knuth. 1974. Structured Programming with go to Statements. ACM Comput. Surv. 6, 4 (Dec. 1974), 261–301. https://doi.org/10.1145/356635.356640

Essentially, the below were our 3%, but our 3% in 2021 may not be your 3% in 2024. No major performance-oriented work should be undertaken without a backstop of data and test cases that can be used both to quantify the improvement and to ensure that the system still functions as intended. A good test and profiling suite is essentially a meta-optimization — keeping the feedback loop as short as possible allows you to optimize for optimization.

Better Data Formats: Arrow & Parquet

One of the biggest challenges the DU Core team has encountered, time and time again, is the sheer size of the content which our services are required to wrangle.

Since our platform handles documents, and since documents can (irritatingly) be any size, and since (more irritatingly) memory isn’t free, we would routinely run face-first into a wall: our application supports uploads of x megabytes (MB), but a customer wants to upload a 2x MB file. Or a 3x MB file, or a 4x MB file, or...and so it goes. The application bumps the maximum supported file size, the customer uploads their file (or several hundred), and DU Core services start to fall over because they're out of memory ("OOM", which in 2020 was well on its way to being listed as my cause of death, too). As it turns out, you typically want services to keep working! To that end, the first round of optimizations we introduced was intended to just keep the memory footprint of our more ornery services stable.

Here lies Scott Arnott. The OOM killer got him. Sucks to suck, write better software. (“Gravestones” by Steve Parker is licensed under CC BY 2.0 DEED, Attribution 2.0 Generic)

At the time, most end-user content (i.e., “uploads”) would ultimately be represented as XML and/or JSON and passed between DU Core services and their consumers. In particular, the OCR vendor which we’d used for several years generated XML; for larger customer uploads (potentially several hundred to several thousand pages), these files would routinely reach (or exceed) a gigabyte of space on disk.

This enormous XML file would then be read completely into memory, each node would be scanned and transformed as appropriate, and the end result would be a sequence of records. Each record would correspond to a single line in the customer upload and capture font information, index values, etc. Eventually, these records would be read into a DataFrame, and further downstream, DataFrame-based transformations and computations would occur.

Time after time, during either the initial file read, the generation of data based on the file, or the follow-on computations, customer documents above a certain page count would thrash our services. While the existing implementation had been fit for purpose for some time — several years, at least — the assumptions that had been valid when it was initially rolled out bore revisiting.

Data Format Strategy

The ideal outcome in this scenario was for our memory footprint to remain stable — if the footprint was ~500 MB for a 50-page document, a perfect solution would keep peak memory usage for a 5,000-page document at ~500 MB as well. What this required was ultimately a reframing of the work to be done: previously, the “unit of work” was a document, but if a “document” incurs too much variance, how about a page? And if our unit of work was a page, the “transformed” representation of each page would need to be stored temporarily, outside of memory, and stored efficiently for follow-on usage downstream. While hard disk capacity is finite and was of some concern, “efficiency” in this instance was a function of serialization/deserialization rate and, more importantly, memory usage. We needed something that serialized fast, deserialized fast, and incurred low overhead once loaded into memory. And we found Parquet.

Rather than load the entire document’s XML tree, transformed values, and DataFrame into memory simultaneously, we’d instead lazily iterate over XML events. For each page, we’d generate a single Parquet artifact. Because Parquet deserializes so quickly, the “cost” of the data not already being in memory is relatively low. The end result is that we can perform the same overall processing with greatly reduced memory utilization, and in tandem with Polars (discussed further below) can do so faster than the original implementation (where everything was kept fully instantiated in memory from start to finish). So much so, in fact, that it gets more efficient as the number of pages rises!
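To make the shape of that change concrete, here’s a minimal sketch of the page-at-a-time approach in Python. It is not our production code: the element and attribute names (“page”, “line”, “font”, etc.) are placeholders rather than our OCR vendor’s actual schema, and it uses Polars for the Parquet writes.

```python
import xml.etree.ElementTree as ET

import polars as pl


def pages_to_parquet(xml_path: str, out_dir: str) -> list[str]:
    """Stream OCR XML and write one Parquet artifact per page.

    Element/attribute names are illustrative; substitute your vendor's schema.
    """
    artifacts: list[str] = []
    rows: list[dict] = []
    page_number = 0
    # iterparse yields elements as their closing tags are seen,
    # so the whole XML tree never has to be resident in memory.
    for _, element in ET.iterparse(xml_path, events=("end",)):
        if element.tag == "line":
            rows.append(
                {
                    "page": page_number,
                    "text": element.text or "",
                    "font": element.get("font", ""),
                    "top": float(element.get("top", 0.0)),
                    "left": float(element.get("left", 0.0)),
                }
            )
        elif element.tag == "page":
            out_path = f"{out_dir}/page_{page_number:05d}.parquet"
            pl.DataFrame(rows).write_parquet(out_path)
            artifacts.append(out_path)
            rows = []
            page_number += 1
            # Drop the subtree we just finished with.
            element.clear()
    return artifacts
```

Each downstream step can then open only the per-page artifacts it needs, rather than a document-sized structure.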

Memory utilization, in MB, of the operations required in this phase of processing. “~700” and “3k” refer to the length of a document in pages. As the number of pages increases, the “per-page” footprint actually improves using Parquet.

Looking at the above, it’s clear that our Platonic ideal of memory efficiency was just a bit out of reach: a bigger document used more memory. However, we’d made enormous progress, and certainly enough to ship it; progress is progress, and we’d be returning to this effort again in the near future.

Columnar Data Formats

I buried the lede a little bit above (and probably will again) with regard to what makes Parquet so effective. One crucial aspect will have to wait for a follow-on post, but the immediate point is this: Parquet is a columnar data format.

The difference between column- and row-based formats could be (and often is) a standalone discussion in and of itself, but to drastically over-simplify things, while a row-based representation may look something like:

Odette Jones,Northern Mindanao,8
Nehru Hewitt,Northern Mindanao,8
Colby Burns,Northern Mindanao,6
Mollie Robinson,Delta,6
Steven Keller,Western Australia,5

a column-based representation will look more along the lines of:

Odette Jones,Nehru Hewitt,Colby Burns,Mollie Robinson,Steven Keller
Northern Mindanao,Northern Mindanao,Northern Mindanao,Delta,Western Australia
8,8,6,6,5

which allows for a number of advantages for analytic workloads.

Firstly, columns can be disregarded completely up-front if they’re superfluous. In comparison, all data for each row has to be retrieved when using a row-based format. If a given field is unnecessary, it can only be “excluded” after the fact. When using a columnar format, we don’t load the unnecessary data in the first place: this improves both memory usage and deserialization performance. There are even ways to only retrieve particular columns from a remote Parquet file — i.e., it’s possible to fetch specific columns from an S3 bucket, disregarding the rest of the file completely!
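As a small illustration (the file path and column names here are invented), column projection with PyArrow looks something like this:

```python
import pyarrow.parquet as pq

# Only the requested columns are deserialized; with a remote filesystem
# (e.g., pyarrow.fs.S3FileSystem) the other columns aren't even fetched.
table = pq.read_table(
    "page_00042.parquet",       # illustrative path
    columns=["page", "text"],   # column projection: everything else is skipped
)
print(table.num_rows, table.column_names)
```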

Secondly, columnar formats are far more effective at utilizing contemporary CPU architectures, and will generally be much more cache- or vectorization-friendly. For example, if we wanted the average of the values in the last “field” (8, 8, 6, 6, 5), the execution using row-based data may iterate through each row and load all values in each row (which on average will involve some indirection). It may then drop the first two values and move the last value into some temporary structure. Lastly, it might feed that structure into the actual “get the average” routine. Columnar data, on the other hand, is far more likely to “just” feed into cache lines; at the very least, it will require far less indirection to get the relevant data. Less indirection will likely result in far fewer cache misses, and lends itself much better to how modern CPUs “optimize” for data-crunching.
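Reusing the toy records from above, the difference in access pattern is easy to see in code. This is a contrived illustration rather than a benchmark, and it assumes Polars as the columnar library:

```python
import polars as pl

rows = [
    {"name": "Odette Jones", "region": "Northern Mindanao", "score": 8},
    {"name": "Nehru Hewitt", "region": "Northern Mindanao", "score": 8},
    {"name": "Colby Burns", "region": "Northern Mindanao", "score": 6},
    {"name": "Mollie Robinson", "region": "Delta", "score": 6},
    {"name": "Steven Keller", "region": "Western Australia", "score": 5},
]

# Row-oriented: every record is visited (and every field loaded) just to average one field.
row_mean = sum(record["score"] for record in rows) / len(rows)

# Column-oriented: the "score" column is a single contiguous buffer, so the
# mean is one vectorized pass with no per-record indirection.
col_mean = pl.DataFrame(rows)["score"].mean()

print(row_mean, col_mean)  # 6.6 6.6
```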

Lastly, columnar data can typically be compressed much more efficiently. Like values within the same column can be grouped together and only expanded when necessary. For example, the ‘region’ column above can be represented as 3"Northern Mindanao",1"Delta",1"Western Australia" - the same can't trivially be accomplished with row-based formats.
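The notation above is just shorthand, but Arrow exposes the same idea directly via dictionary encoding, and Parquet applies dictionary and run-length encodings along similar lines when writing a column. A small sketch:

```python
import pyarrow as pa

regions = pa.array([
    "Northern Mindanao", "Northern Mindanao", "Northern Mindanao",
    "Delta", "Western Australia",
])

# Each distinct string is stored once, plus a small integer index per row.
encoded = regions.dictionary_encode()
print(encoded.dictionary)  # the three distinct strings, stored once
print(encoded.indices)     # per-row indices: 0, 0, 0, 1, 2
```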

While the above are pretty contrived examples, extrapolating those advantages out to “large” data sets, i.e., a 30,000x15 matrix or larger, hopefully makes it easier to conceptualize the strengths of the format for ML workloads.

Arrow?

You may have noticed that the heading above was “Arrow & Parquet” and that while there’s a slew of words above, not a single one of them is “Arrow”. In the first round of optimizations we made to address the business case above, we did in fact use Parquet, and the general principle of restricting the “unit size” for the sake of more manageable memory usage is valid.

However, removing “the biggest” bottleneck in a system typically just leaves you with a new biggest bottleneck (this is the secret and most aggravating fourth law of thermodynamics, the Theory of Constraints). As we continued to optimize the more pathological aspects of our system, we eventually landed on a general preference for Arrow. Moreover, we eventually moved from a page-oriented approach implemented in Python to a document-oriented approach in Rust. That said, that migration was ultimately the result of the improved efficiencies in Rust and not because of any issue inherent to using Parquet in Python. I can’t overstate it: both Parquet and Arrow are vastly more efficient, in terms of both space and time complexity, than persisting “bulky” structured data in more “standard” human-readable formats (i.e., JSON or XML).

So what is Arrow? Arrow is a “columnar memory format (…) organized for efficient analytic operations on modern hardware”. Having used both fairly extensively at this point, my general rule of thumb is that Parquet — which to be clear can be used with analytic libraries like Pandas or Polars — is not optimized for analytic use cases to the same extent that Arrow is. However, it is likely the better choice for “archival” or Hadoop-level use cases (as is discussed here, where Parquet blows CSV out of the water). In my experience, serializing and deserializing Arrow-formatted data may as well be (not to be dramatic) borderline free, making it a great “quick” interprocess communication (IPC) format for use cases like the following:

  1. Service A has “a lot” of data and wants Service B to do something with it
  2. Service A uploads an Arrow-formatted representation of the data to cloud-based blob storage
  3. Service A submits a request to Service B informing it of the data
  4. Service B retrieves the data, deserializes it, and performs its workflow(s)

If your data is “big enough”, because Arrow compresses reasonably well, deserializes so quickly, and can be used with high-performance analytics libraries, steps 1–4 above can often start and finish (and with far better average resource utilization!) while more ‘standard’ approaches are still getting started (i.e., deserializing JSON or CSV into a Pandas DataFrame).
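The serialization half of that hand-off (steps 1 and 4) is only a few lines with PyArrow’s IPC stream format. The blob-storage upload and the request to Service B (steps 2 and 3) are elided here, and the column names are invented:

```python
import pyarrow as pa
import pyarrow.ipc as ipc


def to_arrow_bytes(table: pa.Table) -> bytes:
    """Service A: serialize a table into the Arrow IPC stream format."""
    sink = pa.BufferOutputStream()
    with ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return sink.getvalue().to_pybytes()


def from_arrow_bytes(payload: bytes) -> pa.Table:
    """Service B: deserialize the bytes it pulled back out of blob storage."""
    return ipc.open_stream(pa.BufferReader(payload)).read_all()


# Invented columns, purely for illustration.
table = pa.table({"paragraph_id": [1, 2, 3],
                  "text": ["lorem", "ipsum", "dolor"]})

payload = to_arrow_bytes(table)       # step 2 would upload `payload` to blob storage
restored = from_arrow_bytes(payload)  # step 4, after Service B downloads it
assert restored.equals(table)
```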

Advantages and Disadvantages of Arrow & Parquet

I’ll start with the disadvantages (well, disadvantage):

  • Not human-readable

That’s it, really. Arrow and Parquet require something external to actually read them, whether that means converting to a human-readable format or using any of a number of analytics libraries — the sky is the limit here, as support for both formats is pretty commonplace in 2024.
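That “something external” can be as small as a couple of lines, though; for example, with Polars (the file name below is illustrative):

```python
import polars as pl

# Neither format is greppable, but converting to something human-readable is trivial.
df = pl.read_parquet("page_00042.parquet")  # illustrative path
print(df)                                   # or df.write_csv("page_00042.csv"), etc.
```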

As to the advantages:

  • Serialization and deserialization performance (vs. JSON, XML, CSV, pickle, etc.)
  • Storage efficiency (vs. JSON, XML, CSV, pickle, etc.), particularly Parquet
  • Integration with Pandas, Polars, ML.NET, DuckDB, etc.
  • Language-agnostic (vs. pickle)
  • Analytic efficiency (vs. JSON, CSV, etc.), particularly Arrow
  • Memory usage in tandem with higher-performance analytic libraries (Polars, DuckDB, etc.)
  • Capable of handling nested data

Alternatives to Arrow & Parquet

There aren’t many alternatives in the “columnar format” space, and outside columnar formats there are likely too many to mention in any intellectually honest detail (so I won’t), but two potential alternatives would include:

  • Apache ORC
  • Lance

We haven’t used either in production contexts, though ORC appears skewed towards transactional workloads (vs. analytic workloads). Lance, in comparison, just didn’t exist at the time, but is certainly worth evaluating — it aims to be the format for machine learning, and integrates well with many of the aforementioned formats and libraries (e.g., Arrow, Parquet, Polars, Pandas, DuckDB, and so forth).

Conclusion

As the customer base of Document Understanding Core grew, its scalability and stability requirements did as well. Long-standing, load-bearing aspects of the DU Core ecosystem had to be reevaluated, and primary among these was our data. By leveraging contemporary, performance-oriented data formats — Parquet and Arrow — the DU Core team was able to dramatically improve the memory utilization of its document processing services. Better memory utilization in turn increased the number of successfully-processed documents without requiring additional spend.

There’s a lot to recap from the above, but to summarize:

  • JSON and XML are a terrible fit for “bigger” data (and an even worse fit for “big” data)
  • Both Parquet and Arrow are more time- and space-efficient than JSON, XML, CSV, etc.
  • Parquet is an outstanding storage format, and Arrow enables enormous speed improvements

In part 2, I’ll detail two further optimization efforts DU Core undertook: leveraging Polars for faster data transformation, and adopting Rust for performance-critical services.
