#11: Say Cheese!

When to snapshot batches

Published in

Aljabr

9 min readNov 7, 2018

Data processing is often presented like a relay race of tangled swim lanes, in which the main point is to reach the end. It’s is a simplistic view of a distributed system, but it is a narrative of primary intent that we need to focus on, without neglecting other details. However, there are some subtle technical questions that play a key role in how to move data through a pipeline that we need to think about. A data relay race starts when data arrive from one or more sources, but the preparations begin before that — -when a user asks a question, and designs a process to compute it. Once the process is underway, some flows proceed in series, others are fan out (map) to parallel lanes, and are later integrated (reduced) into a final answer. So — who fires the gun? Who judges the finish line? These turn out to be crucial issues that a platform needs to understand. In this blog, we take a deep dive into these issues.

Packetizing flows: when to start and stop?

You put your data in, you get your data out, you do the hokey-algorithm and shake it all about. Easy right? If only it were so. In previous posts we’ve looked at complexity and causation, but there is another aspect to the logistics of scheduling: when can we start, when do we need to wait for data, when should we hand over processed data, and when are we clear to proceed? In other words, how do we slice and reassemble the processing cake’s many vertical and horizontal layers?

The history of breaking long streams of data into short batches, for efficient processing, goes back to Leonard Kleinrock and his work on queues that led to packet networks and TCP/IP. The idea is simple: if you break up work into chunks, smart infrastructure can compute and route the optimal delivery of a result for you. It’s exactly analogous to packaging manufactured items in boxes, component by component, say on a visit to IKEA.

Sometimes you might need several boxes from different shelves of the warehouse to complete your flat-packed furniture. You need tuples of input data to arrive and be declared simultaneous in order to proceed, as in the figure below. This is a lateral aggregation (from parallel lanes).

There may also be a longitudinal aggregation (you need to wait for 4x feet, from the same source, to complete the chair). Batching of work and typed packetization are the same problem, and we are going to see many of the same solutions appear in data processing.

Traffic, with stop signs and lights

Like the Internet, batching of dataflow is a traffic problem (see figure below). There are stages (tasks) and there are links that guide the flow in a forward direction (links). Each stage needs to wait for enough data to arrive to make it worth while to process.

Batch sizes are part of a divide and conquer approach to processing. Suppose you need 10 data records to be able to compute a value to pass on to the next stage, the data may arrive in groups of 10, or as a random arrival process 2,3,5,1,6… It’s up to the stages to decide when they have enough to proceed.

Longitudinal batching is related to scaling, and may lead to new lateral parallelization. When data come in bursts, it might be necessary to spin up and down resources to expedite data more quickly. Cluster scaling and process flow are thus tied together.

A longitudinal aggregation process is a form of efficient congestion control. You want to fill up a unit of processing — to “ferry” data from one side of a transformation to the other side of the processing divide — in an efficient manner. How many data vehicles do you wait to arrive before sailing the ferry? The answers might be easy, but they aren’t unique:

You sail as soon as a single car arrives.
You stick to a published policy schedule (wall clock time)
You wait for a minimum viable number (threshold time)
You could wait until the ferry was full (bucket fill time)

There are other constraints too. Imagine you are feeding data into a printer, which effectively takes a static snapshot of the input. When should you stop changing the source data? Is it okay to keep writing new content while the data are being printed? How long do you wait to say “cheese!” and take the photo? This all depends on how we sample and print the data.

There is a difference between a print-out and a ferry though. A print-out is a derivative copy, but a transport is a transaction. You can print data many times, but once a ferry has moved a vehicle, it can’t do it again. Different processes have different semantics.

Timeseries, snapshots, and policy

There is a reason why we don’t submit a video clip as a passport photo. Facial recognition is based on a particular visual matching process of a single static image, not on episodic patterns. A constant (immutable) representation, sampled from a moment in time is what we need, not an episode or story.

The photo in a passport plays the role of a parameter to a function — like in programming. The value you want to hand over is a one-time snapshot (a sample) of a possibly larger changing process. If we follow a single variable over its history, resampling it many times, we could turn it into a whole time series — versioning every sample with a timestamp, and forming a complete database. Alternatively, we might overwrite the sampled value each time and interpret the data as a random variable. Choosing between these interpretations is a policy decision that’s informed by how fast data change(s) relative to its processing. How we interpret the semantics affects how we process.

Remember there are different kinds of process.

Sometimes data come in continuously like a stream that needs to be processed immediately in realtime. Imagine recording a concert, or capturing flight avionics data. When we try to match timescales, pipelines run continuously, scaling up and down as supply and demand fluctuate.
Sometimes data are bursty, and pooled in reservoirs to be processed later, e.g. data from satellites and space probes, where the input process is much slower or much faster than the need to consume it. To decouple timescales, we use long term buffering, and processings resources are not needed all the time.

A passport photo is essentially a random variable, when treated as an entity. Its inner composition as an ordered stream of bits is not interesting, except to say that we assume it has version consistency. We bet on each photo sample being relatively slow to change and hopefully not too random. Nyquist’s law of sampling tells us that we have to sample at least twice as fast as the shortest change we want to capture, but sampling much more frequently than that is a waste of resources. These issues guide us about what sampling policy we ought to choose — -in practice, however, the details lead to endless problems of coordination in the semantics of distributed processes, as users make naive assumptions about how time and causation work from a purely local perspective.

So, the issue for a data platform is: how does a data pipeline know when it’s safe to rely on a data value — -e.g. when can we know that inputs are at rest, and when it’s safe to compute a function f(x) of x, without being afraid that x will change?

Deterministic computer processing languages use scoping and data immutability to address a part of this issue, but another part is application specific and requires a user policy decision to be made, perhaps even changes to algorithms. What should a pipeline do?

We can make some inferences based on timescales and proper instrumentation of data, but there are some decisions that can only come from user intent.

Specified or inferred?

A pipeline can’t begin until a user has made some basic policy decisions about the acceptance of inputs.

What to sample (and in what combinations)?
How much to sample (as a batch)?
What is an acceptable range for the input value?
What to do with the sample next?
How long do you need buffer or cache the samples to revisit later?

In the final point, we acknowledge that as workflows unfold, we might find that we need rework products again because of mistakes, changes of intent, and even new opportunities. The choices above are the users’ responsibilities to decide. But after they have been made, smart infrastructure’s behaviour can offer powerful assistance and flexibility to the user by helping to ask the questions, and help with the forensics of post delivery diagnosis.

How we parse data timeline depends on our changing view of intent.

Suppose you automate data collection from some instrumentation (perhaps you are recording a concert). You may want to test the sound system, throughout the whole process, without keeping the results. You only need to retain the recording of the final performance, not the sound checks, or noise after the concert. Also, you don’t want to keep recordings that contain errors or distortions . Later you may want to go back over those same data to remaster the recording for mp3, CD, and 24 bit hi res sound.

Two cases: transactions vs copies

Aggregation into batches (or tuples) can sometimes be a barrier to scaling and to data retention. At what level of processing should we queue and retain process data for future revisitation? For transactions, we should probably retain nothing in the queue (a transaction is a movement of data, not a copy), but for derivative data transformations, which are only copies of source data, there can be a value to keeping intermediate stages for UNDO operations and for changing our minds, during an experimental development phase. File system journals and blockchains are a hybrid approach to keeping transactions over extended time, thus allowing them to be replayed in case of disaster recovery.

Different sources may have similar or different semantics:

When data come from different locations, they have no specific order, but they have a unique context to preserve — -yet you might want to eliminate context by averaging.
When data come from different times, they have a potentially important order — yet, you might want to eliminate that order by declaring each arrival equivalent because they contribute to the same result (e.g. an average).
etc.

Intended semantics are central in making these decisions. For example, there is no sense in aggregating security camera footage from different countries, except perhaps for disaster recovery. The sources belong to different contexts. Aircraft black box recorders are another example where data refer to a single “flight episodes”, each with unique legal significance that should not be mixed up. On the other hand, in a different interpretation, we can give an average meaning to certain aspects of multiple episodes of the same flight, since flight routes are repeated many times. Fuel consumption, weather variations, incident reports, etc, could all lead to meaningful average statistics.

Every time we aggregate, we make decisions that are scale dependent and semantics dependent — -a small change in the amount of data or the source characteristics of an aggregation can change the outcome discontinuously.

Smart means context aware

To be an effective tool for data science, a pipeline execution system can try to offer users assistance in dealing with these issues. It can track, measure, and assess metadata as part of its framework of the smart wrapping. The unspoken truth is that statistical methods are unstable as decision making bases — -though machine learning practitioners are starting to realize this — -you can undertrain and you can overtrain your pattern recognition service, leading to very different results.

If we can’t fully immunize against problems of scale sensitivity, we can at least provide a forensic chain of custody detailing how data were used to derive outcomes so that the steersmen and women of data science can make their own value judgements and policy decisions. This is a topic we’ll have to return to.

Batching of data is an inevitable part of data processing. Deciding when or whether it is the right moment to recompute an answer has arrived is a non-trivial matter — it boils to a policy decision — and we’ve only scratched the surface in this post. How we parse the timeline of data depends on our changing view of intent. Only the process instigator knows whether data have long term significance or only temporary value — -but even those judgements may change, as new ideas come about. In either case, there is an argument to be made for persisting data values for a policy-specified amount of time. In the next posts, we’ll try to unpack how smarter platform semantics can help users to make sense of these issues, in the related larger context of error handling.