Storytelling in the Dataverse

Mark Burgess
Aljabr
May 23, 2019

When Quantum Gravity governs your pipeline! (sort of :-))

In developing data platforms for the cloud era, we are effectively building (cognitive) learning systems that span a wide range of scales — agents that receive inputs and compute results, sometimes with feedback. Virtualization and parallelism throw a layer of obfuscation over the tracing of dataflows, as well as performance measurement and causality. Data pipelines from the edge of the cloud to the centre and back are especially vulnerable to distortion and loss of traceability, as they muddle the expectation of deterministic processing with the randomness of arrival times arising from the uncertainties of distributed processing. Sometimes dependency tracing takes care of this, but if we pool results into a “data lake” we then wipe out meaningful order, rendering the timing moot. The result is that many users don’t have any control over what their data are telling them.

It’s not often one could say that Quantum Gravity has much use in the everyday world, but (as I explain in my book Smart Spacetime) basic spacetime physics holds the keys to understanding a lot of what goes on in the cloud. There are simple principles relevant to every aspect of change that explain the meaning of connectivity and the ordering of events. When data sources are distributed, we have to deal with partial ordering — sets of data whose causal order (as defined by a Causal Sets approach to Semantic Spacetime) is not absolute or even well defined.
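To make the idea of partial order concrete, here is a minimal sketch in Go (my illustration, not anything from Koalja) that compares vector clocks from two agents. When neither clock dominates the other, the events are concurrent: no causal order between them exists at all.

```go
package main

import "fmt"

// VectorClock maps an agent name to its local event counter.
type VectorClock map[string]int

// Compare reports the causal relation between two clocks. Two events
// are causally ordered only if one clock dominates the other;
// otherwise the order is partial, not total.
func Compare(a, b VectorClock) string {
	aLess, bLess := false, false
	for k := range merge(a, b) {
		switch {
		case a[k] < b[k]:
			aLess = true
		case a[k] > b[k]:
			bLess = true
		}
	}
	switch {
	case aLess && !bLess:
		return "before"
	case bLess && !aLess:
		return "after"
	case !aLess && !bLess:
		return "equal"
	default:
		return "concurrent" // no causal order exists
	}
}

// merge collects the union of agent names from both clocks.
func merge(a, b VectorClock) map[string]bool {
	keys := map[string]bool{}
	for k := range a {
		keys[k] = true
	}
	for k := range b {
		keys[k] = true
	}
	return keys
}

func main() {
	e1 := VectorClock{"edge": 2, "centre": 1}
	e2 := VectorClock{"edge": 1, "centre": 3}
	fmt.Println(Compare(e1, e2)) // "concurrent": neither happened first
}
```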

No one teaches us this stuff in college!

The problem in the cloud is that parallelism and distribution lead to a classic problem of data relativity. No observer has complete information about everything locally — we rely on messages passed between agents inside the system. These travel at finite speed, and may not be available to all parties. Data collection (which is central to monitoring) is, in fact, closely related to the more widely discussed data consensus problem. In consensus systems, we are trying to ensure that a broad audience sees the same version of local history as we do; in data collection, we are trying to aggregate local histories into a meaningful timeline that a single observer can make sense of.

Mapping interior and exterior context

Most information gathering (usually logging) systems simply push out data without calibrating their meanings, or aligning their causal order. They leave the entire burden of interpretation to human postmortem pathologists. This is a wasted opportunity.

The key to understanding complex systems is to break them down by separating behaviours by timescale — not just by function. We can sort and classify garbage using the machinery of the cloud, if we organize by scale. We don’t have to dump problems onto human operators. The machinery of our data processing universe is there to help.

This starts with the classic separation of decision processes from implementation, using platform tools like Kubernetes, but it continues by using embedded instrumentation that acquires knowledge from within each local process, so that context is not lost. When collecting data, we need to be clear about where the boundary between observer and observed lies, to avoid losing the ability to interpret the data later. What we did versus what happened to us should be distinguishable. Not all data processes are simple input-output graphs — they take considerably more work to comprehend. When we reach the level of an ecosystem, things get hard to understand — but not impossible.
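As a toy illustration of that boundary (my own sketch, not Koalja’s actual schema), an instrumentation record might tag each event with its provenance, so that what we did and what happened to us remain distinguishable downstream:

```go
package instrumentation

// Provenance marks which side of the observer/observed boundary
// an event originated on.
type Provenance int

const (
	// Intent: what we did, i.e. a decision or action taken by the process.
	Intent Provenance = iota
	// Observation: what happened to us, i.e. an input or environmental change.
	Observation
)

// Event is a hypothetical instrumentation record that keeps the
// boundary explicit, so that interpretation is still possible later.
type Event struct {
	Source string     // which agent emitted this
	Kind   Provenance // intent versus observation
	Detail string
}
```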

In Koalja, we can define a data circuit (pipeline or cluster) using a simple declaration like the one below:

A pair of DAG pipelines conjoined by a service (declaration)
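Schematically, and purely as an illustration (these Go types and names are hypothetical, not Koalja’s actual declaration syntax), such a declaration amounts to something like this:

```go
package pipelines

// Task is one processing stage; Inputs lists the upstream tasks
// feeding it through ordinary DAG dataflow edges.
type Task struct {
	Name   string
	Inputs []string
}

// Pipeline is an independent causal chain of tasks.
type Pipeline struct {
	Name  string
	Tasks []Task
}

// Circuit couples pipelines. Services link otherwise independent
// pipelines by query-response, rather than by dataflow edges.
type Circuit struct {
	Name      string
	Pipelines []Pipeline
	Services  map[string]string // consumer task -> providing task
}

// A hypothetical rendering of the "tfmodel" example from the text.
var tfmodel = Circuit{
	Name: "tfmodel",
	Pipelines: []Pipeline{
		{Name: "train", Tasks: []Task{
			{Name: "collect"},
			{Name: "build-model", Inputs: []string{"collect"}},
			{Name: "serve-model", Inputs: []string{"build-model"}},
		}},
		{Name: "recognize", Tasks: []Task{
			{Name: "ingest"},
			{Name: "convert", Inputs: []string{"ingest"}},
			{Name: "predict", Inputs: []string{"convert"}},
			{Name: "output", Inputs: []string{"predict"}},
		}},
	},
	Services: map[string]string{"predict": "serve-model"},
}
```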

This defines a model “tfmodel” that consists of two pipelines that are correlated by a service relationship (see figure).

A pair of DAG pipelines conjoined by a service (visual)

Two separate (parallel) processes are defined, with independent causation. Above, there is a training process that accumulates data and builds a machine learning model, then uploads this model into a service for use by a recognition process. Below, the recognition process independently collects data for classification. It converts the input into a form compatible with the model, and then a prediction phase performs a lookup to the service maintained by the first pipeline. This is not a classic single DAG relationship, but a service dependency (query-response). Once the service returns, the predictor can complete its computation of a result and send it for output through the lower pipe.

The second pipeline has a formal output; the first doesn’t. All such combinations of input and output pattern are possible in data circuitry. But how can we see this as a causal trace? This challenges us with configuration, with comprehension of operation, and with comprehension of outcome.

Which version of the model was being served when a particular output was processed? Are we even aware of the hidden dependency between the two pipelines? If we consider only DAG models, we cannot possibly know.
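One simple defence (a sketch of my own, not a description of a Koalja feature) is to stamp every answer from the service with the model version that produced it, so that each output record carries its hidden dependency explicitly:

```go
package provenance

// Prediction is a hypothetical output record that carries its causal
// context with it: which model version answered the lookup, and where
// in the predictor's own event sequence the answer arrived.
type Prediction struct {
	Input        string
	Result       string
	ModelVersion string // stamped by the serving pipeline at query time
	Landmark     int64  // position in the predictor's own event sequence
}
```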

We need to be able to debug these processes in two ways: by the invariable relationships between their parts, and by the timelines of variable events that pass through them.

Time and space to the rescue

History plays a big role in our understanding of systems. We are storytellers by nature, so we look for truth in logs and journals. The problem is that logs and journals are written by single authors (or single processes).

In the cloud, we want to aggregate logs into a library of journals. However, because IT has had no strong theoretical underpinning, the approach has simply been to aggregate pages from everyone’s journals indiscriminately and join them together into a single journal, without pausing to keep separate what was causally independent. We assume we can throw the job of searching the aggregate logs at a process of unlimited power, by brute force.

Aggregation is entropy. It creates garbage landfill. Technology is supposed to help us recycle meaning as well as help with the work of being an efficient collector. Dumping everything in a giant landfill to be sifted through for body parts in times of trouble (the opening scene of many a murder mystery) is not a good strategy. It’s very expensive and perhaps not even possible to piece together forensic information that has been thrown away. So let’s keep it.

We need to use information technology more intelligently, and spacetime principles are free and open for the picking.

Most logs use Unix clock time as the measure of causal order. But clocks are independent and potentially different on every machine. This no longer makes sense in a parallel world, or a world of turbulent coroutines that interleave timelines and threads. Logs have meaning on a different scale than what the kernel uses for timesharing, and this is again different from the scale on which data processing has meaning. If we muddle these, we just create muddle.

Computer programs are causal process threads. Order lies in their execution on the process interior, not in the arrival of messages between them. There is a proper time (NOW) related to the moment at which a process passes key landmarks in its code. Other messages may interpolate details with deltas relative to those. When we seek a so-called “root cause” of an event, a Unix time log is a garbage heap, but a causally ordered proper time log can point back to the root(s) from whence the event came — i.e. those earlier landmarks that fed into the current moment.
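As a minimal sketch of what such proper-time logging might look like (illustrative only; this is not the actual Koalja or Cellibrium API), each process numbers its own landmarks, while wall-clock time is demoted to rough exterior context:

```go
package propertime

import (
	"fmt"
	"sync/atomic"
	"time"
)

// Logger numbers landmark events in a process's own proper time.
// The counter, not the wall clock, defines causal order here.
type Logger struct {
	process string
	now     int64 // proper time: count of landmarks passed so far
}

// Landmark records a significant signpost in the code's execution.
// The Unix timestamp is kept only as rough exterior context.
func (l *Logger) Landmark(msg string) int64 {
	t := atomic.AddInt64(&l.now, 1)
	fmt.Printf("[%s t=%d] %s (wall: %s)\n",
		l.process, t, msg, time.Now().Format(time.RFC3339))
	return t
}

// Delta records a subordinate interior event, expressed relative to
// the landmark it elaborates: "subtime" within the larger step.
func (l *Logger) Delta(landmark int64, sub int, msg string) {
	fmt.Printf("[%s t=%d.%d]   %s\n", l.process, landmark, sub, msg)
}
```

In use, a task would call Landmark at each significant signpost in its code, and Delta for subordinate interior events, so that the order of the log reflects the process’s own causal flow rather than the kernel’s timesharing clock.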

Even resource anomaly detection can be processed inline, to capture environmental changes with proper causal order. By carefully separating variants (event instances) from invariants (persistent patterns), raw observational data can be compressed far more efficiently than text-based logs, without being too slow.
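A rough sketch of that separation (again my own illustration): learn each invariant message pattern once, then record each event as a template ID plus only its variant values.

```go
package compress

// templates holds the invariant part of each message exactly once.
// Repeated events then cost only a template ID plus their variants,
// which compresses far better than repeating full text lines.
var templates = map[int]string{}
var nextID = 1

// Learn registers an invariant pattern and returns its ID.
func Learn(pattern string) int {
	id := nextID
	nextID++
	templates[id] = pattern
	return id
}

// Record is one observed event: a pointer to the invariant pattern,
// plus only the values that actually varied this time.
type Record struct {
	TemplateID int
	Variants   []string // e.g. the measured CPU value, the file name
	ProperTime int64    // landmark count, not wall-clock time
}
```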

The meaning, however, is far richer than that of current logging formats. The trace below was generated from an illustrative test process to show how logs can express proper causation, even in programs with co-routines and parallel execution.

An anomalous CPU spike, during this containerized execution, becomes a signpost in the timeline of the process — measured from within, as Einstein taught us. We can also attempt to detect and point out interior non-simple causality, such as that produced by co-routines.

The traditional timestamps in the example are there only for real-world context: they can’t be considered accurate, but we still like to know roughly what was going on in the outside world. The sequence numbers from the code flow indicate the occurrence of major “signpost events”. This is how we told the time (before the invention of clocks) — we attached meaning to significant events: e.g. the year of the flood, the Christmas outage, etc. Such process anomalies are meaningful to us, but clock times are just repeating patterns that need further explanation.

Indentation, added in post hoc presentation, helps to show the subordination of interior “subtime” deltas within the larger intentional scheme of a process. All this can be dropped into code using normal logging-style messages.

The history of this thinking goes back to CFEngine and its machine learning anomaly detection, modified for the modern world. The Cellibrium project further developed an experimental fork of those ideas. The layers of wrapping and containment in the cloud detach causation from observable metrics, but this approach provides a more targeted, “intentional” way of promising the state of containerized workloads.

Sum over histories

Histories are only one part of understanding. Meaningful concepts emerge by invariant pattern and repetition, not as single events. The path well-trodden by data events leads to an invariant average behaviour that we can call the stable path of action, intended by the code. The emergence of such invariants from repeated trials is what we mean by learning.

[Remarkably, this has a lot in common with the way we describe quantum physics: there is a general distribution of assets associated with conservation of resources, but the precise locations or values of observables are hard to pin down with certainty. The world is probabilistic when you see it second hand. Over time, it wears a path of “stationary action”, which is what we call the principle of least action, which is actually a principle of superposition of alternatives.]

We can do more with a spacetime model than trace simple path histories. By labelling the channels of influence using their spacetime semantics, we can separate and relate them according to the information that transmits their relevant influence — going beyond guesswork. By accessing a properly labelled aggregate overview, generated by iterative learning from cluster data, we can reconstruct a god’s eye view of the system over the past, for future prediction. The significance of invariants persists, and can be used to infer behaviour in the future.
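In Semantic Spacetime, those channels fall into a small set of association types. The sketch below paraphrases the four types (the Go names are illustrative, not a canonical API) as labels on the edges of a causal graph:

```go
package sst

// LinkType labels the spacetime semantics of a channel of influence,
// following the four association classes of Semantic Spacetime.
type LinkType int

const (
	Near      LinkType = iota // similarity or proximity: A resembles B
	Follows                   // causal order: B depends on, or comes after, A
	Contains                  // aggregation: A is a part of B
	Expresses                 // property: A has attribute B
)

// Link is a labelled edge in the causal/semantic graph of the system.
type Link struct {
	From, To string
	Kind     LinkType
	Weight   float64 // reinforced by repetition: the well-trodden path
}
```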

This is the scientific method in action: in cases of sufficient stability, we are able to use the past as a guide to the future.

Advanced boundary conditions are what I have earlier called convergent processes — other names like “CALM” logical monotonicity have been used, but the spacetime structure is the simple common denominator that unifies these ideas.

The output from Koalja, from declared knowledge combined with process invariants, leads to a self-organization of persistent concepts, which is partially ordered by causal ordering. This tells a diagnostician far more than a log of ephemeral events — and can be used together with the proper timeline to elevate meaning above quantity — recycling meaning before it hits the landfill.

Automated reasoning based on the principles of Semantic Spacetime

By following only causally ordered pathways that generalize as “volumes of generalization” along different “spacetime” channels, we can quickly make predictions about complex outcomes based on past behaviours, without having to trawl logs in a lengthy and speculative postmortem. It’s the very same principle used in Quantum Gravity to attach directed evolution to networks. The “classical paths” are the paths most travelled, worn by the common elements of repeated learning.
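A toy version of following the worn path, continuing the Link sketch above (illustrative only): prediction walks the graph by always taking the most reinforced outgoing edge.

```go
// MostWorn returns the outgoing link from node that repetition has
// reinforced most strongly: the "classical path" of the sketch above.
func MostWorn(links []Link, node string) (best Link, ok bool) {
	for _, l := range links {
		if l.From == node && (!ok || l.Weight > best.Weight) {
			best, ok = l, true
		}
	}
	return best, ok
}
```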

Turning Search into Certainty

I am not a fan of current logging systems. Koalja is not a logging library — but it embeds a semantically enhanced recording device that enables meaningful diagnosis of processes running in the cloud. Experience has shown that the data measured by most such libraries say almost nothing about a system that isn’t already obvious today.

One could say that going through someone’s garbage gives you a complete picture of their distributed life. Certainly this is the logging and tracing approach to distributed systems (including microservice architectures), but surely there is a better way when the parties are collaborating. Not everything should be a murder mystery.

Meaning decays quickly with time, as human users stare helplessly at wiggly lines like paranoid security guards. As I wrote in my book In Search of Certainty, it all seems a lot like trying to diagnose the cause of a bad mood by taking someone’s pulse. The best part of reinventing system instrumentation, from the bottom up, is that we can choose more practical and efficient formats for it, formats that preserve simple signalling semantics and detailed diagnostic causation. For the data circuitry in the cloud, this is more important than ever.

You can read more about these issues in my new book, Smart Spacetime.
