13# The 4 R’s
Reading, ‘Riting, Recursion, and Reusability
Data pipelines connect graphs of tasks together, reading and writing over data links in between. In previous posts, we’ve hinted at how a platform with smarter links and smarter tasks could help users to develop and test processing circuitry. It could simplify — perhaps even eliminate — certain issues that plague data pipelines, letting the platform do the work of processing, instead of forcing technical acrobatics onto end users. In this post, we point out how to enable simple breadboarding for a generalized circuitry model, based on a smart pipeline paradigm.
Data pipelines obviously read and write data, but while folks normally mention Directed Acyclic Graph (DAG) flowcharts, that model is limiting, and in general we need something more expressive — as the Ko language showed internally at Google. There is also a powerful case to be made for extending DAGs to include ordinary programming constructs like loop iteration and recursion in the evaluation of results.
When data get shared over wide areas, there’s usually a need to distribute the processing too. That turns the network into a giant computer. But we shouldn’t oversell this (the call for Turing Completeness in all things always feels more like a protest than a carefully considered position) — but let’s rather say that there are cases where reasoning needs to be distributed, because it’s impractical to localize computation when dealing with “Big Data”.
Data processing can benefit from explicit and implicit loops, with privisos. Everyone understands the idea of iterating over queues and lists, re-using the same code on each item. Recursion means that we first promise an outcome then calculate it `just in time’. The imperative approach would be to build all prerequisites for a result one by one, and then combine them at the end. The result is the same, but putting detail before understanding results in a clumsier form of expression.
In the last post we emphasized the relationship between process safety and data privacy, mentioning the importance of handling the 3 F’s (faults, error, and flaws). In this piece, we are jumping from the 3 F’s to the 4 R’s: reading, writing, recursion, and reusability. As mentioned, reproducibility is a common business imperative, but any public (global and mutable) data involved in a processing pipeline will render it potentially non-reproducible, and push the burden of managing process safety onto the user. That might well be an unjust burden if the user is primarily interested in data science. In the cloud, the stakes are higher, as the potential ramifications of scaling up a disaster could have potentially unbounded cost.
Recursion and privacy at scale
To do more general processing at scale, you might use a mixture of iteration and recursion, across distributed shared data, for:
- Reusability of code in the same process
- Reusability of data in the same process
The classical way to achieve reuse of code is to use local variables and parameter passing, with private workspace, just like passing parameters to functions. The problem when doing this at scale is that, for large data sets this can involve copying large amounts of data to the “stack frame” of every call. It’s expensive and unnecessary as long as input data are immutable.
The temptation to just let the “infinite cloud” handle it is also compelling, especially for inexperienced users who don’t know what goes on behind the scenes. Moreover, even some old school system developers still think in terms of simplistic patterns like the “good old load balancing middlebox” and may feel that they’ve handled scale just by throwing a naive load balancer process into the flow.
Moving data to and from functions directly, instead of pointing to it by reference, will lead to data bottlenecks that plague classical middlebox load balancers. Instead of “eating” data, processes can sit alongside the pipeline and dip into that data on a “need to know” basis (Don’t Eat, Take a Seat). This can lead to a side benefit of very useful helper functionality, for partial data analyses, such as in testing (see the next post #14).
For a platform to take smart decisions, on behalf of users, it needs to scale data processing in a smart way, taking the computation to the data rather than vice versa. It needs to make that computation safe to recursion and iteration, with guard rails against unintended mistakes and inexperienced choices.
Looping out of control
Users love databases as a means of permanent and temporary storage, but databases are global variables, and may lead to covert back channels. If a persistent non-ordered datastore, say a vanilla SQL database, is used instead of an explicit FIFO, a pipeline may end up revisiting values that it has already worked on before — because they don’t automatically disappear from the queue once done. At best this might be wasteful, at worst it may lead to inconsistency, unbounded loops, and deadlocks. Fortunately, the issues can be handled with adaptive locking policies, and the discipline of making outcomes idempotent. These disciplines can also be embedded as a service within smart platform as an optional convenience. Letting the platform provide meaningful assistance is a powerful way of keeping control over the pipeline on a local level, as well as simplifying major details for users.
Let’s look at such a hypothetical example of a pipeline, which is not a true DAG, and try to formulate the problem as `bread boarded’ pipeline, as simply as possible. The figure below shows a simplistic web crawler, seeded by a few URLs, which reaches out to those URLs and scans them for further links, returning any new URLs to the queue to visit in turn. The process is a process DAG, wrapped in an implicit data loop, making it an implicitly cyclic graph — with all the potential dangers of infinite looping, etc. Let’s further assume that the data are dropped into a plain old database that is not explicitly a part of the pipeline. Inexperienced users naturally find this convenient, and may not appreciate the issues. It could easily get stuck in an infinite loop by crawling the same pages again and again. Each thread is superficially a DAG, but the entire process is an endless non-deterministic loop. What the example needs is a proper FIFO queue, with repeated values expunged for a policy-determined interval of time.
The way we are taught to think about programming is through imperative constructions, from left to right. So, without special assistance, we might first create a process to spawn off a number of worker threads, then each thread can pass along its stages in separate channels. To share the data and keep track we write to a database (see the figure below).
As shown, this pipeline does nothing until some data are fed into the serial queue. To get it started we have to pre-seed the data queue with some URLs, by dropping them into the database out of band — then off it goes crawling from those locations and accumulating more. But crucially, it does nothing to avoid getting stuck in loops, e.g. if pages refer to one another, the entire pipeline can oscillate back and forth between the same URLs without end. The scaled process needs to avoid retracing recent paths.
Locks for short term memory
We can improve on this first draft in a few ways — first, by adding locking protection from revisiting recently visited URLs somewhere in the loop, with an expiry date on blocking data for repeat crawling (this is basically why neurons have a dead time after firing — to preventing tight loop seizures). We can also remove the explicit parallel logic, which is of no interest to a data scientist, and make the parallelism automatic according to the number of pending URLs in the stream, and a resource limit. In this way, when the pipeline is quiet, or when we are happy to “nice” it to a lower priority, we can reduce the number of threads without halting the process or losing its state. The diagram below shows this behaviour.
The obvious point, however, is why should the pipeline user need to think about these issues, and define the parallel processes at all. They are not part of the logic. In its simplest form, we can forget about all the scaling issues, and draw the process entirely in terms of its dependencies (see below).
We may then leave implicit or as a configuration parameter the amount of parallelism (say maximum of 3 threads) to be handled automatically by smart wrappers for the tasks and links. For the user, the whole thing could look as simple as this:
# Pipeline notebook
ingress -> getURLsgetURLs ->3 “crawl neighbours” -> “store graph” -> getURLs
# Notes[crawl neighbours]Container = crawler_*.docker.imgVersion = 1.2
The labels represent functions, some of which may need further configuration, annotated at the bottom for instance, in a notebook approach. Very few details need be given explicitly, and the need for locking can be detected implicitly from the self-reference implied by the loop.
To handle looping, the getURLs function only has to aggregate its inputs from two sources unambiguously: i) the ingress source where hints get dropped off, and ii) the data queue returned from crawling. A few policy choices are needed to disambiguate the method fully, but is otherwise well defined and automatable. The figure below sketches schematically how inputs may be aggregated. The details are for another time.
Smart and SmartRRRR
In this post, we considered how reading and writing of data, as well as recursion, and re-usability of code, can be handled safely and efficiently across multiple data channels, without unnecessary copying to repetition. Alongside those considerations, we can remove awkward issues, like scaling, into the abstractions about connectivity and task management. The role of a smart platform starts to look helpful in dealing with these issues. In the next post, we’ll return to dig deeper into the sampling of data across links — and see what smarter links could do to help users test pipelines.