Performance Engineering in Document Understanding

Part 2 of 2: Towards Better Data Engineering & Applications

Scott Arnott
Thomson Reuters Labs
19 min read · Aug 3, 2024


Document Understanding Core (DU Core) is a high-throughput, high-efficiency system for document and natural language processing, often processing millions of paragraphs a day.

However, scaling a system like DU Core can be tricky. In the preceding post (Part 1/2), I detailed the first of three optimization efforts DU Core undertook as we aimed to process more (and larger) documents without “spending” our way to scale: by adopting Arrow and Parquet, DU Core was able to process the same volume of data, or more, with a fraction of the memory footprint. That said, memory utilization and service stability were only the first hurdles.

In this second part, we’ll be looking at two additional technologies that made an impact: the context in which they were selected, strengths and weaknesses we encountered, and a survey of other technologies that could be viable in similar circumstances.

Better Data Transformation: Polars

Aside from ever-increasing memory requirements, the other challenge that DU Core has routinely grappled with in the context of scale is speed. Especially with more documents, larger documents, etc., code that was fast “enough” when initially introduced (particularly code that scales pathologically with document size) tends to become a problem. A slow loop that only executes ten times may be fine; that same loop having to execute three thousand times will cause issues in a distributed system eventually, even if the issue is “just” one of customer dissatisfaction. It’s not said enough, but speed is a feature.

Realistically, though, aside from customer complaints, “slow” tends to cascade — in a distributed system, a faster service is a more predictable service. Slow code fails health-checks. Slow code causes timeouts, timeouts can cause a stampede, and a stampede can kill a service that would otherwise function as intended. Slow code keeps memory saturated longer, which can prevent other, perfectly adequate services from starting in the first place.

Most importantly, (CPU) time is money. While CPU cores may not be as sparse on the ground as GPUs are these days, CPU time is less fungible than, say, memory usage. If there’s a way to get more work out of the same set of CPU cores, all things being equal, that work is cheaper to perform.

Traditionally, space and time complexity have to be balanced, which was our largest concern in attempting to stabilize memory utilization. As bullish as we were on the increased memory efficiency noted with Parquet, it’d probably (rightly) get discarded if it was a 1:1 trade — i.e., if we cut memory usage by 50% but increased the time taken for processing by 100%. Interestingly, the decision to evaluate Parquet required the use of a novel DataFrame library: in the only time in history that outdated dependencies have been beneficial (eventually), our version of Pandas was too old to support Parquet.

Since we wanted to keep the scope of the optimization as narrow as possible, it meant that we’d use a new library specifically for those aspects of processing we were aiming to overhaul. Everywhere else, we’d continue to use our version of Pandas from the mid-to-late Mesozoic era.

If I had to pick between dealing with Python dependencies and extinction by asteroid I’d probably pick the asteroid (“Complete skeleton of a Triceratops dinosaur in a museum” by Ivan Radic is licensed under CC BY 2.0 DEED, Attribution 2.0 Generic)

In particular, we’d be evaluating Polars, a (then) relatively new DataFrame library intended as a mostly drop-in replacement for Pandas, implemented in Rust but with bindings for Python. Importantly, Polars treated Parquet as a first-class citizen, and (just as importantly) emphasized speed — especially over and above the performance that we’d traditionally observed with Pandas.
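As a rough illustration of that first-class Parquet support (the file name here is hypothetical), reading a Parquet file eagerly, or scanning it lazily, is a one-liner:

import polars

# Eagerly read a Parquet file into an in-memory DataFrame
df = polars.read_parquet("pages.parquet")

# Or lazily scan it, so that only the columns and row groups a query
# actually touches ever need to be read
lazy_df = polars.scan_parquet("pages.parquet")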

Data Transformation Strategy

The ideal outcome, from a perspective of performance, would be to cut memory usage dramatically while keeping the time taken to process at or below 105–110% of our baseline. If we cut processing time by 20%, great! If it stayed the same, I’d live. If it doubled, we’d have to go back to the drawing board.

To that end, we’d be leveraging Polars’ “lazy” API, as recommended by Polars’ author. This would allow Polars to optimize query planning and concurrent execution, and while this would constitute a more dramatic refactor, it’d give us a more grounded sense of its real-world performance. Moreover, it’d highlight the total cost of ownership (TCO) a bit better: did it perform sufficiently better to warrant its usage over Pandas, Python’s de facto DataFrame library?

An example of what this migration might look like:

import re

import numpy as np

# `page_level_df` and the `TEXT` column constant come from the surrounding service code
page_break_tag = "Page Break"
page_break_terms = [
    ...
]
page_break_regex = re.compile(rf"({' *)|('.join(page_break_terms)} *)")

# Only the first 50 rows of the page are candidates for a page-break marker
page_break_candidates = page_level_df[:50]
page_break_lines = page_break_candidates[
    page_break_candidates[TEXT].str.match(page_break_regex)
]
if np.any(page_break_lines):
    return page_break_tag

With Pandas…

num_lines = 50
txt_slice_col = "txt_slice"
# A trie pattern that corresponds to `page_break_terms` in the Pandas example above
contains_page_break_pattern = polars.col(txt_slice_col).str.contains(PAGE_BREAK_RE_TRIE_PATTERN)

return (
    # More on what lazy() is doing below
    dataframe.lazy()
    .groupby(PAGEINDEX)
    # Over each group of rows per page index, take the text column, sort those rows by the
    # line index, and then take the first 50 rows
    .agg([polars.col(TEXT).sort_by(LINEINDEX).slice(0, num_lines).alias(txt_slice_col)])
    # Explode those "first 50 rows" values into distinct rows
    .explode(txt_slice_col)
    .filter(contains_page_break_pattern)
    .select(PAGEINDEX)
    .with_columns([polars.lit("Page Break").alias("page_tag")])
    .drop_duplicates()
)

With Polars...

The two examples above are doing functionally the same thing in terms of “data transformation” operations, but one key difference is that nothing of substance in the Polars example is actually executing. Instead, it may be easier to think of it as describing the set of operations we want to execute eventually. The operations themselves are deferred until collect() is called. And, for example, if we had one or two dozen such types of transformations we wanted to execute, we could theoretically define all of them as in the above, concatenate the results, and then call collect, executing the whole thing in one shot. The advantage of this approach is that it allows Polars to manage what data can be disregarded, what operations can run in parallel threads, and what the optimal order of execution of each operation might be.
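As a loose sketch of that “define everything, collect once” pattern (the individual query-building functions here are hypothetical stand-ins for transformations like the one above):

import polars

# Each of these returns a polars.LazyFrame; nothing has executed yet
lazy_queries = [
    page_break_query(dataframe),
    header_footer_query(dataframe),
    blank_page_query(dataframe),
]

# Concatenate the lazy plans and execute the whole thing in one shot;
# Polars decides what can be pruned, reordered, or run on parallel threads
page_tags = polars.concat(lazy_queries).collect()

# Alternatively, collect_all() executes a list of lazy frames concurrently
# and returns one DataFrame per query
results = polars.collect_all(lazy_queries)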

Runtime, in seconds, of the operations noted above. As before, “~700” and “~3k” refer to size of document as a function of number of pages. Once again, as the number of pages increases, the optimized approach becomes more and more efficient.

As it just so happens, that’s exactly what we do, and the performance of the overhauled approach (above) speaks for itself.

To recap: we drastically improved the resilience of this service by adopting Parquet, mitigating memory utilization and stability issues to a large degree. Further, by leveraging Polars, we side-stepped any resulting latency issues completely: documents could now be processed faster than ever before! It wouldn’t necessarily scale forever, but nothing ever does, and we did have a plan for the long-term that’ll be described in the last section, concerning Rust.

Advantages and Disadvantages of Polars & Parquet

I’ll start with the disadvantages:

  • May not be quite a drop-in replacement for Pandas
  • More powerful ‘lazy’ API is definitely not a drop-in replacement for Pandas

Other than that, though, the disadvantages are probably more discretionary than anything else. I don’t personally know the degree to which Polars can be a drop-in replacement these days; when we first evaluated it, there were one or two Pandas functions for which I couldn’t find an analogue, but this may no longer be the case. However, to get the most out of Polars, it’s strongly recommended to use its lazy API, which does have a bit of a learning curve if you’ve been writing imperative Pandas code for the majority of your ML-adjacent career.

Advantages:

  • Speed and memory utilization are consistently top-of-class (vs. Pandas, etc.)
  • Polars can interoperate with Pandas as necessary (see the short sketch after this list)
  • Can be used in both Python and Rust code
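For the interoperability point, a minimal sketch (the column names are made up, and the conversions assume pyarrow is installed):

import pandas as pd
import polars

pandas_df = pd.DataFrame({"page_index": [0, 0, 1], "text": ["a", "b", "c"]})

# Conversion in both directions goes through Arrow under the hood
polars_df = polars.from_pandas(pandas_df)
back_to_pandas = polars_df.to_pandas()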

Alternatives to Polars & Parquet

There’s a number of alternatives to Polars, including the incumbent Pandas. It’s likely helpful to aggregate the alternatives along (at least) two separate dimensions:

  • Whether the alternative is intended to be used in a single- or multi-machine context
  • Whether the alternative is primarily DataFrame-y or SQL-y (or even uses some other facade)


E.g., Dask is best used in multi-machine contexts — a use case Polars doesn’t support. Additionally, DuckDB “looks like” something more akin to SQL than it does to traditional Pandas, as does ClickHouse. Ultimately, which alternative is most suitable for your use case is likely a function of how much data you have and whether your engineers and/or researchers are more comfortable with a DataFrame-like interface or not.
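To make the “SQL-y” distinction concrete, here’s roughly what the page-break candidate filter might look like through DuckDB’s Python API (the table and column names are illustrative, and this assumes a reasonably recent duckdb package):

import duckdb
import pandas as pd

pages = pd.DataFrame(
    {"page_index": [0, 0, 1], "line_index": [0, 1, 0], "text": ["Page Break", "...", "..."]}
)

# DuckDB can query an in-scope DataFrame by name and hand the result back as one
page_break_pages = duckdb.sql(
    "SELECT DISTINCT page_index FROM pages WHERE line_index < 50 AND text LIKE 'Page Break%'"
).df()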

All that said, for our circumstances (“single-node” and “DataFrame-like”) Polars was the clear winner, especially given its interoperability with Rust. Polars does support out of core processing now, but there’s likely an inflection point beyond which you’d be best served by a library more specifically tailored to “big data” processing. That said, with Polars, the goalposts for what constitutes “big data” are likely not the same as they used to be.

Any particular alternative aside, I’d also recommend perusing the various benchmarks that are available. You may or may not find any one benchmark that gives you a conclusive go/no-go as regards any one library, but they can be used to highlight options you may not have been aware of and to demonstrate their general performance tendencies.

Better Data-intensive Applications: Rust

In the preceding discussions, I detailed how, via adoption of Parquet and Polars, we’d managed to make critical-path processing drastically more efficient, both in terms of memory utilization and processing latency. We’d managed to stave off the OOM boogeyman without ratcheting-up the resources allocated to the service, and the operations we’d optimized for memory were faster than they had been to begin with.

However, removing your biggest bottleneck usually just creates a new biggest bottleneck, and this sequence of optimizations was no different. By positioning ourselves to be able to successfully handle larger and larger documents, new runtime inefficiencies popped up. Large documents (e.g., 2,000+ pages) had previously just killed the service; now, those documents were transiting what had previously been the bottleneck and were instead taking hours to finish processing.

The solution in this case consisted of two separate but related efforts:

  1. Optimization, very similar to all the above in execution, of one of our inference workflows
  2. Consolidation of yet more optimized extract, transform, and load (ETL) logic

Essentially: the first iteration of a more optimized approach had succeeded! However, to really cross the finish line, we needed to go a bit further. The ideal next version of the inference mentioned above would require a slightly different structure of data: where the preceding sections highlighted a migration from “document-oriented” to “page-oriented”, we’d instead be doing the opposite. Moreover, we wanted to centralize that logic such that multiple calling services (and multiple instances of each of those calling services, potentially dozens) could benefit from one service handling all of these transformations.

Our goal would be a single service that could replace similar functionality executing in anywhere from one to a hundred (or more!) services, with a smaller footprint. Based on the performance we’d observed with Polars (which is written in Rust) and armed with experience gained during the recent ETL refactor, we planned for a full-blown reimplementation of the logic in what would be DU Core’s first Rust service.

Rust?

If you haven’t heard of Rust, it’s been Stack Overflow’s “most loved” language since the fall of the Berlin Wall (well, 2016). If you don’t personally use Rust, the language you do use probably has tooling written in it (npm, ruff, uv), and if you’re in ML, it’s a near-guarantee that some part of your stack is using it (tokenizers, cryptography). Even if you’re not in ML, a growing number of companies are using it in performance-critical contexts (AWS, Microsoft, Figma, Cloudflare, Dropbox, etc.).

But why did we decide to evaluate it? It largely came down to a handful of considerations.

Meeting the performance goal detailed above would require something drastically more performant than Python. However, this was also a “data-crunchy” use case, and not necessarily a standard backend server. We’d be churning through an unpredictable number of indefinitely large files and performing CPU-intensive operations all the while, not performing standard CRUD operations. Lastly, we’d already observed that Polars was dramatically more performant than Pandas, and Polars was written in Rust. As a result, we knew that the libraries we’d likely be using were already supported in Rust, and that moreover, we could expect a focus on performance and efficiency.

Essentially, while our goal was to (hopefully overwhelmingly) improve our ability to ETL the content in question, we also wanted to preposition ourselves for other performance-critical workflows which may arise.

We’d certainly be well-served adopting any of the standard enterprise backend languages (C#, Java, Go, etc.) for the API layer, but that would still leave us with a bit of a gap insofar as the more CPU-bound workloads were concerned. While all of the above are likely more than capable of predictably handling large chunks of content, our team had functionally zero experience tuning their various garbage collectors. Moreover, in those use cases where the requirement would be best served by interoperability, the interop story between Python and all of the above is hit or miss, at best.

Aside from those, there is (was?) an industry-standard answer to Python’s suboptimal performance: C or C++. While both are valid options for improving the performance of Python code, they can also be difficult to get right, and ideally we’d manage to do all this without inventing new and unforgivably deranged Common Vulnerabilities and Exposures (CVEs) just to shave a second or two off of our request times. Lastly, it was 2021 — we’d ideally land on a language with dependency management tooling from this century.

Believe it or not, some archaeologists posit that the Rosetta Stone may be one early example of a C makefile. (“Rosetta Stone, British Museum” by Maureen is licensed under CC BY 2.0 DEED, Attribution 2.0 Generic)

Rust, however, allegedly bridged every gap mentioned above, checking all the boxes for our use case:

  • Speed and efficiency competitive with C and C++
  • Contemporary, high-level syntax
  • Strong (and growing) reputation in backend engineering for predictable performance
  • Modern, painless dependency management
  • Established libraries for Parquet and Arrow
  • Proven track record of Rust-Python interop (e.g. Polars, Huggingface’s ‘tokenizers’ library, etc.)

Strategy

At a slightly higher level, since this was a POC, we would strangler-fig one component discussed previously (the ETL logic) and we’d keep the scope as narrow as possible. We’d implement transformation to both Arrow and Parquet, but otherwise it would have one job and one job only: retrieve XML content, run that content through the transformation logic, and store the transformed output where the calling service can retrieve it.

Per the section above, we wanted to performantly generate an Arrow (or Parquet) file per document with a stable memory footprint. The service would ideally be handling submissions from dozens of separate service instances at the same time as a matter of course, so we needed to ensure that even “big” documents didn’t constitute an outsized load on the system. To this end, we’d use a strategy similar to the one employed in Python: we’d leverage event-based parsing and represent our logic as a state machine. We’d read each event into a fixed buffer and collate a “page” of content at a time, flushing to an Arrow file between pages. This would allow us to keep the load for any document (regardless of file size) to a minimal, manageable average.

It may look more complex than loading the full XML tree into memory, but it ultimately allows us to represent the parsing lazily, where the action we take for any XML event is only ever contingent on our current state and what the newest node is. This allows us to keep our memory utilization relatively stable on a document-per-document basis, regardless of whether a document is 5MB or 5GB.
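For reference, this is the same event-based idea we’d already leaned on in Python; a minimal sketch using the standard library (the “page” tag name is illustrative). The Rust state machine below applies the same pattern, one event at a time:

from xml.etree.ElementTree import iterparse

def pages(xml_path):
    # Stream start/end events instead of loading the whole tree into memory
    for event, element in iterparse(xml_path, events=("start", "end")):
        if event == "end" and element.tag == "page":
            yield element
            # Discard the subtree we just handled so memory stays roughly flat
            element.clear()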

// Snippet of ETL state machine logic (also, coincidentally, a
// world tour of most of Rust's syntax, even if we're going to
// skip that for the sake of brevity)
pub fn process(&mut self, xml_event: Event<'_>) -> Result<(), Box<dyn Error + Send + Sync>> {
    self.state = match self.state {
        ParserState::OutsidePage => match xml_event {
            // If we're currently 'OutsidePage' the only XML event we actually
            // do anything with is a start event for a page; for anything else
            // we just return 'OutsidePage' and move on to the next event
            Event::Start(node) if node.local_name().as_ref() == XML_PAGE_TAG => self.enter_page(),
            _ => ParserState::OutsidePage,
        },
        ParserState::InPage => {
            // If we're currently 'InPage', we can: start a paragraph, end a page,
            // scan a source node, or scan a rulerline node; otherwise, we return
            // 'InPage' and continue
            match xml_event {
                Event::End(node) if node.local_name().as_ref() == XML_PAGE_TAG => self.exit_page()?,
                Event::Empty(node) if node.local_name().as_ref() == XML_SOURCE_TAG => self.process_source(&node)?,
                Event::Empty(node) if node.local_name().as_ref() == XML_RULERLINE_TAG => self.process_rulerline(&node)?,
                Event::Start(node) if node.local_name().as_ref() == XML_PARA_TAG => self.enter_paragraph(&node)?,
                _ => ParserState::InPage,
            }
        }
        // Remainder elided because this is already too long
    }
Using Rust’s benchmarking tooling to profile transforming a nearly 4,000 “page” XML file

The end result is that, per the above, the standalone service can parse a “page” of XML data every ~250 microseconds (roughly one second, end to end, for the nearly 4,000-page file in the benchmark). To put this in perspective, in comparison to both the legacy and optimized Python implementations diagrammed above:

Per-page processing times, for the sake of an apples-to-apples comparison

Moreover, the service can handle multiple submissions at the same time, with (often) dozens of client services submitting content simultaneously, in comparison to a Python service handling a single input.

The memory footprint of the service’s two instances in our production environment

All told, in the last thirty days we’ve processed in excess of a half-million documents; for each and every one of those, there are likely two or three separate calls to this service. Over approximately 1.5 million requests, neither instance has crept above roughly 300MB, neither has restarted, and neither has dropped a request.

But if you recall: all of the above was only in service of a larger, over-arching effort. The details of the inference optimization project aside, how’d it go? All the rigmarole above is wonderful, but if we couldn’t leverage it for the sake of resolving this one “last” bottleneck, it’d be kind of a waste, right?

The hard telemetry data for the above has unfortunately, over the intervening three years, evaporated into the ether (or still exists somewhere in Azure; it could go either way), but the before-and-after comparison still demonstrates the difference a concerted optimization effort can make: on the two dates in question, the same document (approximately 2,000 pages) was run through the DU Core systems.

The result: on June 16, 2021, a “large” document took more than three hours to process from start to finish. On July 12, a little less than a month later, the same document took approximately half a minute: a difference of well over two orders of magnitude.

Advantages, Disadvantages, and the stuff betwixt

As in the above, I’ll start with the disadvantages:

  • Language learning curve
  • Third-party availability of machine learning libraries falls short of Python, especially for more “legacy” frameworks (e.g., scikit-learn)

To be clear, Rust has a learning curve. At the risk of editorializing: I think the steepness of its learning curve is grossly overstated. That said, I don’t deny that it exists. There are aspects that you’re unlikely to come across in most other languages. By and large, these can be mitigated, and even suboptimal Rust code is likely to be more efficient and less error-prone than the equivalent written in more approachable languages.

Secondly, the maturity of Rust’s third-party ecosystem is pretty great (bafflingly so for a descendant of OCaml competing with C++), but: machine learning libraries are a bit sparse on the ground. This is especially the case if you’re looking for analogues for older, all-in-one frameworks like scikit-learn. This is improving, but it’ll be some time before it’s competitive with Python in this regard (if it ever is). That said, in more deep learning-oriented contexts, things are slightly different; no less than Huggingface have started developing a Rust-based ML framework.

When it comes to advantages, it’s difficult to enumerate them without slightly rephrasing the reasons we decided to evaluate the language in the first place. However, in fairness, every reason was borne out more or less immediately, and there were secret, cooler advantages that needed to percolate for a bit:

  • Strong, predictable performance, especially in contexts where C or C++ might be used otherwise
  • Expressive, high-level syntax and sophisticated type system
  • Modern dependency management and other tooling (documentation, testing, benchmarking, etc.)
  • Mature third-party ecosystem for backend and data engineering contexts
  • Accessible interoperability with Python

All that said: Rust was the right solution for us. Adopting a new language isn’t something to be done lightly, and there are plenty of considerations even outside “the language” itself. It’s far easier to unilaterally suggest, for instance, using Parquet or Arrow and swapping out a file extension or function call. In contrast, suggesting that language XYZ is a perfect fit for all teams and all use cases would be naive (and possibly negligent).

For what it’s worth, though, we evaluated Rust for the sake of its performance, but since 2021 the real win has been that the services we write in Rust tend to need little to no maintenance, and far more often the latter than the former. The one described above was deployed in mid-2021 and is only rebuilt every so often if an in-house library is updated. Otherwise, the service just…runs. It doesn’t fall over; it doesn’t start randomly throwing 500s. I don’t recall ever scaling it up, even though its calling services may scale up to dozens of instances. We’re closing in on it running for three years, nonstop, having successfully handled tens of millions of requests.

In retrospect, the two or three weeks it took to build don’t seem like much of an investment in comparison.

Alternatives to Rust

As far as alternatives for this specific niche (i.e., “crunchy” backend services) are concerned, there’s likely not much more to say than what was discussed above:

  • C or C++
  • Go, C#, and Java, etc.
  • Elixir (with Rust NIFs)

C or C++ are certainly an alternative from a performance standpoint. However, from a tooling and “rigor” standpoint, I’d argue that the learning curve is far higher (just much fuzzier). It’s overstating things to say that with Rust, “if it compiles it’ll run”, but not by a whole lot, and certainly not in comparison with C or C++.

Any of the standard enterprise backend languages (Go, C#, Java), are tough to fault, and could certainly be preferable depending on workload and cloud provider. In Azure, for example, C# is a first-class citizen, and I have to choke back tears every time I see how well-supported it is, even compared to other mainstream languages like Python. Real low-level number-crunching may not quite compare, but depending on the team, this may or may not be an acceptable trade-off.

Lastly, Elixir could be used for the server and API layer. Elixir, and more specifically, the Erlang virtual machine (BEAM), are good enough for telecommunications backbones, so they’re certainly good enough for the workloads that DU handles. Moreover, Elixir supports “native implemented functions”, which allows the usage of code implemented in C, Rust, and other languages that are potentially better-suited for performance-critical use cases. That said, it’s tough to see two niche languages as a viable alternative to one niche language.

Summary

Given the number of moving parts above, it may be beneficial to lay out the aggregated set of changes introduced to the workflow in question.

We first wanted to ensure that the service in question didn’t immediately explode due to inefficient memory utilization. As such, we introduced lazy, event-based XML parsing, and opted for Parquet as our intermediate storage format (over, for example, serializing Pandas DataFrames).

In tandem with the above, we wanted to ensure our more memory-optimized implementation of the necessary operations didn’t cost a commensurate increase in processing time. We opted for Polars given our usage of Parquet and our ancient version of Pandas, and the operations in question (XML transformation, page classification, and header/footer identification) all saw dramatic improvements in both memory utilization and latency. Page classification, in particular, benefited hugely. As the most “intensive” set of DataFrame operations out of the bunch, both its memory and latency ultimately ended up at (roughly) 10% of what they had been prior to its adoption.

However, we could now process documents that had previously just “failed” as a result of their size, and one last operation that hadn’t yet received similar attention would end up comprising the bulk of the runtime for longer documents — i.e., 90 to 95%, or more. Given the success we’d had with Parquet and Polars, our intuition was that we could apply the same patterns. If we swapped to Parquet or Arrow as a storage format, we’d likely improve memory efficiency, and further, we’d be able to leverage Polars’ far more optimized lazy transformations instead of Pandas.

That said, our page-level handling was best-served in a “local” context. The inference service needed the same information, but (more efficient I/O or not) sending or uploading thousands of artifacts for a pathologically-large document would likely be a half-measure. That approach would be required if we had resource-constrained Python services generating the artifacts, but ideally we’d be able to send one artifact with all the required data. We just needed a way to reliably generate those “aggregate” artifacts (one artifact per customer-uploaded file) regardless of upload size.

With that in mind, the last step was to introduce a standalone “ETL” service — a more optimized Rust implementation of the “XML Transformation” logic we’d been using previously. It’d accept XML from the calling service and return an Arrow file; the calling service would still perform page classification, etc. The inference service would accept the already-transformed artifacts and perform all of the necessary inference logic in a single call; for longer documents, that call was now hundreds of times faster.

Conclusion

It took a concerted effort, and several iterations, but using only a handful of technologies, DU Core has been able to apply a number of targeted, high-leverage optimizations that have had an outsized impact on our ability to cost-effectively scale our NLP workflows. We adopted more efficient structured formats (Parquet and Arrow) to offset the enormous memory overhead of more traditional formats. We adopted a vastly more performant data transformation library (Polars) to improve both the speed and memory efficiency of operations we’d traditionally performed in Pandas. Lastly, we adopted Rust for performance-critical components of our system, and its DU Core market share has really only grown ever since.

All that said, the main takeaways with which I’d want to leave you are:

  • Speed is a feature! Reliability is a feature! Software can be better!
  • Polars may not be as widespread as Pandas but can be dramatically more performant
  • If adopting Polars, it’s well worth the time to get familiar with its lazy API
  • Rust has a learning curve, but if you don’t have a ton of C or C++ experience, it’s really tough to beat from a performance standpoint
  • Adoption of Rust is growing in a lot of ecosystems, but may not be mature enough for others

Further Reading/References

General

Polars

Rust

  • The Rust Programming Language: Free version of the reference material for Rust
  • Rust by Example: More hands-on introduction to Rust than the language reference
  • rustlings: Small exercises that can be used to learn the language via practical application
  • Learn Rust: A number of resources, including the above, for learning Rust in both general and domain-specific contexts
