I have noticed a deficit of documentation or examples outside of the official Beam docs, as data pipelines are often intimately linked with business logic. While working with streaming pipelines, I developed a simple error handling technique, to reduce the disruption that errors cause to streaming or long-running jobs. Here I have an explanation of that technique, and a simple demo pipeline.
Apache Beam is a high level model for programming data processing pipelines. It provides language interfaces in both Java and Python, though Java support is more feature-complete.
Beam supports running in two modes: batch, and streaming. In batch mode, a finite data set is read in, processed, then output in one huge chunk. Streaming mode allows for data to be continuously read in from a streaming source (such as a message queue), processed in small chunks, and output as processing occurs. Streaming allows for analytics to be performed in “real time” as events occurs. This is extremely valuable for telemetry and logging, where engineers or other systems need feedback as events happen.
Beam pipelines are composed of a series of typed data sets (PCollections), and transforms. Transforms take a PCollection, perform a programmer-defined operation on the collection elements, then output zero or more new PCollections as a result.
The problem with these transforms is that they need to eventually operate on data. As anyone familiar with handling user input or data from large systems can attest, that data can be malformed, or just unexpected. If a bad piece of data enters the system, it may cause the entire pipeline to crash. This is a waste of time and compute resources at best, but can also result in losing in-memory streaming data, or disrupting downstream systems relying on the Beam output.
In order to stop a catastrophic failure, you need graceful error handling in your pipeline. The easiest way to do this is to add try-catch blocks within each transform, which prevents shutdown and allows all other elements to be processed.
This is a start, but it’s not enough on its own. You’ll want to record failures — what data failed what transform, and why. To do this, you’ll want to create a data structure to store these errors, and an output channel for them.
The data structure for a failure should contain:
- Source data in some form (data ID, the raw data fed into the transform, or the raw data precursor that was fed into the pipeline).
- The reason for the failure.
- The transform that failed.
We can instantiate a Failure if an exception or error is thrown during a transform.
Next, we need to be able to record the failure for developers to reference.
Beam transforms by default only have one output PCollection, but they can output multiple PCollections. A transform can return a PCollectionTuple, which uses TupleTag objects to reference which PCollection to put an element into, and which PCollection to fetch from the TupleTag. This has many uses, and we can use it here to separately output a PCollection of successful results, and a PCollection of Failure objects.
In the demo repo, successes and failures are simply written to files. In a real pipeline, they would likely be sent to a database, or a message queue for additional processing or reporting.
You may also want to extend coverage beyond just handling thrown exceptions. For example, we could validate that all data falls within expected parameters (EG all user ids are ≥ 0) and is present, to prevent logical errors, missing records, or DB insertion failures further along. That validation could be extended into the Failure class, or it could be a new Invalid class and PCollection.
This covers the handling of elements themselves, but there are many design decisions beyond that, such as: what next? Data scientists or developers must review the errors, and discard data that is outright bad. If data is merely in an unexpected format, or exposed a now-fixed bug in the pipeline, then that data should be re-processed. It’s common (moreso in batch pipelines) to retry a whole dataset after any bugs in the pipeline are addressed. This is time consuming to process, but easy to support, and allows for grouped data (sums, aggregates, etc) to be corrected by adding the missing data. Some pipelines may only retry individual elements, if the pipeline is a 1-in-1-out process.
There is a GitHub repo at https://github.com/vllry/beam-errorhandle-example which shows the full proof of concept using auditd log files.