# One step forwards, two steps backwards

Published in Aljabr, Nov 14, 2018

Faults, flaws, errors, and policy

In the previous post, we looked at how policy decisions about slicing the data cake can affect a pipeline’s ability to process data effectively, determine scale decisions, and decide the semantics of a correct outcome. There are also plenty of subtle issues in data sharing that can lead to indeterminism. If the goal of a pipeline is to enable reproducible, version-controlled computation in a way that supports a proper chain of custody, then we need to look at the influence of scope, errors, and caching, along with the versioning of data and software. This quickly becomes a combinatoric issue of challenging proportions. Thankfully, proper tooling can make this easier, but it can’t take away the problem completely. In this post, we look at the issues, from smart caching policy for the intermediate stages to the scope and mutability of data. How can we recover from broken pipes and undo bad changes? What should be the Time To Live (TTL) for data in a pipeline cache? How shall we handle faults, flaws, and errors?

The three F’s

The three F’s (faults, errors, and flaws) are unavoidable parts of any real-world system. But what do they have to do with data policy?

With clear definitions (see figure above and Treatise on Systems, volume 2), the answer becomes quite apparent:

Faults are failures to keep an intended promise;
Errors are execution anomalies or mistakes; and,
Flaws are inappropriate intentions/promises for the intended context at design stage.

In other words, you can’t define a fault, error, or flaw, without making promises about what your intent is, so decision-making, policy, and faults are joined at the hip. But programming of a system (whether by code or edict) can be dangerous when promises don’t deal with all the cases. Logic presumes complete enumerability, but our real world leads to the unexpected.

If a promise is broken, it is not simply “not kept”: the remediation may be to do it again, or to have someone else do it, but it might be too late, especially in a “realtime” setting. This sounds easy enough, but nothing about distributed systems is ever easy.

So what promises does a data pipeline make?

Even simple data pipelines need to keep a lot of promises in order to hang together. When pipelines become large in number of stages or amount of data, span wide areas, or play mission-critical or time-critical roles in a process, like data processing onboard a vehicle, priorities may change significantly compared to the more leisurely processing of web crawler data or error logs in web applications.

Scale / parallelism: a promise to suck up all the data presented at the input and process it within a certain amount of time. A user might be able to speed up processing of data by utilizing parallelism, as long as a job does not rely on the serial order of data. Data processing might have a Service Level Objective (SLO) to promise, as part of a time-critical business process. This can also be a flawed idea. There will be limits to any system — and therefore eventually faults. No pipeline can ingest arbitrary volumes of data at an arbitrarily fast rate without managing flow control, and ultimately having to drop something (like TCP/IP).

If a parallel branch fails because of incorrect splitting, improper coordination, or resource exhaustion, the final result may be damaged, reordered, or incapable of completing. A damaged result might be more harmful than no result at all, and vice versa. That value judgement may require a forensic determination only after the incident — all very unsatisfactory for programmers who want “truth”.

Accuracy: This promise depends on the amount of data and the algorithm used by the data transformations. Software bugs and encoding mismatches could lead to errors and flaws in the software plugged into your pipeline. But when you are riding on top of a whole stack of infrastructure turtles from hardware to virtualized software, the possibility of being affected by faults in infrastructure has to be taken seriously. Networking is a common cause of errors, because it relies on so many collaborating parts.

Locality and latency: A more hidden promise relates to the effective location of data artifacts implicit in a computation. Where should these artifacts be located virtually and physically? How does that affect their timely arrival? Volumes in Kubernetes might be local or remotely mounted. There is often something to be said for keeping data on local disk. Only the largest datacentres may have shadow networks with fibre channel for storage. Latency in data transportation is a key issue and, while Ethernet can be many times faster than the computer’s internal bus, this doesn’t necessarily translate into better performance for reading and writing data. Finite size limitations on memory and disk storage are equally common. When memory is short, computers start to fall apart: the CPU flies off the handle trying to page memory to disk, and this has a cascading impact on other services.
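
To make the scale promise above a little more concrete, here is a minimal sketch of an ingest buffer that refuses new items once it is full, rather than pretending it can absorb any input rate. The class `BoundedIngest` and its method names are hypothetical, invented for illustration; a real pipeline would more likely push back on the producer than silently drop, and the right choice is a policy decision.

```python
from collections import deque


class BoundedIngest:
    """Hypothetical ingest buffer that sheds load instead of growing without limit."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()
        self.dropped = 0  # items we failed to accept: a visible, countable fault

    def offer(self, item):
        """Try to accept an item; return False (and count a drop) if full."""
        if len(self.buffer) >= self.capacity:
            self.dropped += 1
            return False
        self.buffer.append(item)
        return True

    def drain(self, n):
        """Hand up to n buffered items to the next stage, oldest first."""
        batch = []
        while self.buffer and len(batch) < n:
            batch.append(self.buffer.popleft())
        return batch


ingest = BoundedIngest(capacity=3)
accepted = [ingest.offer(i) for i in range(5)]  # producer outpaces the consumer
first_batch = ingest.drain(2)
```

Counting the drops matters as much as dropping: a promise that is broken quietly cannot be assessed afterwards.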

Underneath these issues, there are many dependent promises made by platform infrastructure too (see box below).

Data permanence — smart caching

In the last post we talked about how to decide when enough data are ready for processing, by buffering inputs into batches. A related issue is what happens when previously accepted data need to be recomputed, because of errors, faults, and flaws in previous runs. Recomputations may or may not need to preserve a sense of the original time and conditions, but they need to be reproducible in order to trust the process.

Data pipelines typically need to buffer, cache, and aggregate data, and even maintain state, even though their transformation rules could be stateless. Data transformations, which convolute data further, may need access to multiple constant data values that might be too large to load into a single container. They need attached storage that can be addressed using a method appropriate to the process.

Some computations are one-time events, with a clean expiry date; but data scientists may also want to revisit the data timelines many times in order to shape their thinking and perfect their code. They may not initially know what they need from data, but may need to be able to revisit sources until some “statute of limitations” has expired, e.g. for legal reasons, such as tax audits. The time in between might be decades (as for criminal investigations, movies, and music performance master tapes) — far in excess of the lifetime of a computing cluster in the public cloud. On the other hand, it might be mere hours for routine flight data telemetry. All this adds up to a need to archive sourced inputs for future reference, in a secure immutable location with relevant contextual metadata for reference. This is part of a larger story about knowledge management in IT.

A smart platform can assist in these matters:

By caching original upstream sources and tracking downstream changes.
By ensuring reproducibility of downstream results, with smart cache buffers and preservation of metadata context. (Caching may be “permanent” and from the initial ingress)

Then results can be recreated from cached source material if need be. Keeping a copy of any external source is essential for reproducibility, while retention of later stages is non-critical but convenient for speedup.
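
As a rough illustration of caching upstream sources with their context, the sketch below stores each ingested payload under its SHA-256 digest, records where and when it came from, and recomputes a downstream result from the archived copy. `SourceArchive`, `transform`, and the example origin URL are all assumptions made for this sketch, and an in-memory dictionary stands in for whatever immutable storage a real platform would use.

```python
import hashlib
import json
import time


class SourceArchive:
    """Hypothetical in-memory stand-in for an immutable store of ingested inputs."""

    def __init__(self):
        self._blobs = {}     # content hash -> raw bytes
        self._metadata = {}  # content hash -> context recorded at ingress

    def ingest(self, payload, origin):
        """Store payload under its SHA-256 digest and return the digest."""
        digest = hashlib.sha256(payload).hexdigest()
        if digest not in self._blobs:  # never overwrite: entries are write-once
            self._blobs[digest] = payload
            self._metadata[digest] = {"origin": origin, "ingested_at": time.time()}
        return digest

    def replay(self, digest):
        """Return the archived bytes, verifying they still match their digest."""
        payload = self._blobs[digest]
        if hashlib.sha256(payload).hexdigest() != digest:
            raise ValueError("archived source has been altered")
        return payload


def transform(payload):
    """Example downstream stage: a deterministic function of its input only."""
    return sorted(json.loads(payload))


archive = SourceArchive()
key = archive.ingest(b"[3, 1, 2]", origin="https://example.invalid/feed")
first_run = transform(archive.replay(key))
second_run = transform(archive.replay(key))  # recomputed later from the cache
```

Because the downstream stage reads only from the archive, the two runs agree regardless of what the original source has done in the meantime.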

Hidden flaws: scope and reproducibility

The downside of caching and snapshotting of data is that it freezes time even when time keeps on going. We work hard in IT to maintain the illusion of determinism, but the most pernicious source of uncertainty is reliance on data that are exterior to a process. The analogous case is the use of global variables in programming.

Global variables (i.e. shared memory) are much maligned in computing these days, but they are a necessary evil for cooperation (and therefore for scaling): all databases are global shared memory, for instance. You just need to understand the implications. If you want reproducibility, and the ability to roll back and forth in time while debugging your process semantics, through different versions of code and computation, then you should stick cautiously to local pipeline data channels. For example, compiling a software version is “safe”, but crawling the web is not “safe”.

Of course, many processes don’t aspire to reproducibility; they are quite ad hoc, e.g. machine learning, daily stats, etc. But some variations may have greater consequences than others! The hard truth about hidden flaws: anything you read from or write to that is mutable over the duration of your process, i.e. anything beyond the controlled scope of local links, will lead to non-reproducible results. If you read from an external source, you could cache the response to ensure a form of reproducibility, but if you write to anywhere outside the pipeline (except, say, a passive log) then you’ve changed the world. Caching can be a helpful tool in allowing you to replay the past, which is why it can be crucial to those inevitable calls for “roll back”.

Intended and unintended outcomes

Some pipelines are based on intentional processes (like a deliberate query of a source), some on “realtime capture” from sensors or streams, which is unintentional (take what you get, as “best” you can). Should the system protect itself or crash on error? In order to survive, a process might have to drop data batches (as routers and switches do). Caching can help to buffer processing, but clearly we can’t cache everything forever. There has to be garbage collection. You have to forget data sometime. It will be interesting to see how many photos and videos on the Internet survive the decade.
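
One simple way to picture “forgetting data sometime” is a cache with a Time To Live. The sketch below is only an illustration: `TTLCache` is a hypothetical class, the clock is passed in explicitly as `now` so that the behaviour is repeatable, and the 60 second TTL is an arbitrary value chosen for the example.

```python
class TTLCache:
    """Hypothetical cache whose entries expire ttl seconds after being stored."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._entries = {}  # key -> (value, time stored)

    def put(self, key, value, now):
        self._entries[key] = (value, now)

    def get(self, key, now):
        """Return the value if it is still fresh, otherwise None."""
        entry = self._entries.get(key)
        if entry is None or now - entry[1] >= self.ttl:
            return None
        return entry[0]

    def collect_garbage(self, now):
        """Forget every expired entry and return how many were removed."""
        expired = [k for k, (_, stored) in self._entries.items()
                   if now - stored >= self.ttl]
        for k in expired:
            del self._entries[k]
        return len(expired)


cache = TTLCache(ttl=60)
cache.put("batch-1", "sensor readings", now=0)
cache.put("batch-2", "more readings", now=50)
fresh = cache.get("batch-1", now=30)
removed = cache.collect_garbage(now=70)  # batch-1 is 70s old, batch-2 only 20s
stale = cache.get("batch-1", now=70)
```

Choosing the TTL is the policy question: too short and a replay is impossible, too long and the cache becomes an archive nobody budgeted for.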

When users pull data from mutable sources, the result can be different on each trial. “Tell me the latest” is not a reproducible query, so mutable outcomes will lead to flawed reproducibility. This is made worse when pipelines are recursive, i.e. when they feed back on themselves, because then we have to deal with the possibility of mutable state (global variables). A single immutable timeline, based on its own internal clock, is what Kafka tries to provide. This is an expensive solution though, and perhaps not all users would be willing to pay the infrastructure cost for this level of “truth”. It depends on the stakes.

Putting data into a database is a classic solution that brings with it a classic problem: databases, caches and storage eliminate causal ordering. You may try to replace process time ordering with timestamps according to someone’s clock, but that might not be a consistent set with respect to a query process using its own clock. Whenever you put data into a cache, database, or aggregate store (including a neural or Bayesian learning network) it becomes a “big grey mess”.
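
A toy example may help show why timestamps are a poor substitute for process ordering. Here three events happen in a known causal sequence on two hosts, but one host’s clock is assumed to run five seconds slow. The `Event` type, the host names, and the skew value are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Event:
    name: str
    causal_index: int  # true order in which the events happened
    timestamp: float   # reading of the local clock on the host that wrote it


# Assumption for this sketch: host B's clock runs 5 seconds behind host A's.
SKEW_B = -5.0

events = [
    Event("A produces record", causal_index=0, timestamp=100.0),
    Event("B consumes record", causal_index=1, timestamp=101.0 + SKEW_B),
    Event("A receives ack", causal_index=2, timestamp=102.0),
]

# What a query sees if it trusts the stored timestamps...
by_timestamp = [e.name for e in sorted(events, key=lambda e: e.timestamp)]
# ...versus what actually happened.
by_cause = [e.name for e in sorted(events, key=lambda e: e.causal_index)]
```

Sorted by timestamp, the record appears to be consumed before it was produced. Once the events sit side by side in a store, only the timestamps remain, and the causal order has to be reconstructed from evidence that may not agree with it.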

The upshot of all of this is that we cannot definitively promise reproducibility of process outcomes, only reproducibility of pipeline behaviours. Repeating the same actions will not always lead to the same result. It will always be incumbent on users to understand the process they are initiating.

Policy for handling simple interruptions

Caching can help trade latency for backlog, and avoid wasting energy-costly CPU cycles, at the expense of memory, but there are limits to how much caching can be tolerated.

All of us have some experience of intermediate caching for data with trivial semantics. Think of download managers for large files — such as video or music — these usually have a progress bar cursor. In case of interruptions, data transfers can be picked up where they left off, by keeping a cursor in the input stream. A cursor remembers its place within immutable source data, but becomes useless if the source data can change in “real time”.
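
The download-manager idea can be sketched in a few lines. The function `copy_with_cursor` and its `fail_after` parameter (which simulates an interruption) are hypothetical, and in-memory byte streams stand in for a remote file and a local disk. The sketch assumes the source does not change between attempts, which is exactly the condition described above.

```python
import io


def copy_with_cursor(source, sink, cursor, chunk_size, fail_after=None):
    """Copy from source to sink starting at cursor; return the new cursor.

    fail_after simulates an interruption after that many chunks.
    """
    source.seek(cursor)
    chunks = 0
    while True:
        if fail_after is not None and chunks >= fail_after:
            return cursor  # interrupted: report how far we got
        chunk = source.read(chunk_size)
        if not chunk:
            return cursor  # finished: nothing left to read
        sink.write(chunk)
        cursor += len(chunk)
        chunks += 1


source = io.BytesIO(b"0123456789")
sink = io.BytesIO()

# First attempt is interrupted after one chunk of four bytes.
cursor = copy_with_cursor(source, sink, cursor=0, chunk_size=4, fail_after=1)
interrupted_at = cursor

# Second attempt resumes from the saved cursor instead of starting over.
cursor = copy_with_cursor(source, sink, cursor=cursor, chunk_size=4)
```

If the source were rewritten between the two attempts, the cursor would still point at byte 4, but the bytes before it in the sink would no longer correspond to anything that exists upstream.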

During a recovery, we would always like to avoid doing work more than once, unless it was performed incorrectly (or the definition of correct changes). This might not only be convenient but necessary to deliver some level of reproducibility. However, a data process cannot know whether data were processed correctly or not, unless the process is aware of the data payload, at some level. A tantalizing question is therefore: can we make this data awareness a clear reality at a platform level? There are some hard policy issues lurking in this question.

It often comes as a shock to many that data science is itself an unstable process when scaling data. Small changes in data and processing policy can reverse the conclusions reached, because of the naive use of coarse-grained resolution, or a Boolean view of decision making.
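
A deliberately tiny illustration of this instability, with made-up numbers: a Boolean verdict on whether a mean exceeds a threshold flips when a single extra sample arrives, even though the underlying quantity has barely moved. The function `verdict`, the sample values, and the threshold of 100 are all assumptions chosen for the example.

```python
from statistics import mean


def verdict(samples, threshold):
    """Boolean decision: is the mean strictly above the threshold?"""
    return mean(samples) > threshold


latencies = [98, 101, 99, 102, 100]  # mean is exactly 100
with_one_more = latencies + [107]    # one extra sample arrives

before = verdict(latencies, threshold=100)
after = verdict(with_one_more, threshold=100)
margin = mean(with_one_more) - 100   # how far past the threshold we really are
```

Reporting the margin alongside the verdict is one way to keep a yes/no answer from overstating how much the data support it.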

In this post, we’ve discussed some of the important issues around caching and copying of data for continuity, and the distortions that come from the three F’s. We deliberately avoided the issue of data consistency between parallel process lanes, but that is an additional complication. From a purely dependency perspective, there is a great deal of room to improve the user burden and quality of experience by supporting traceable caching of artifacts. Ultimately, we cannot definitively promise reproducibility of process outcomes, only reproducibility of pipeline behaviours. It will always be up to the users to understand the process they are initiating. In the next post, we look at how batching could be used to make policy based trade-offs between accuracy and speed, during testing.