Combating Software System Complexity: Entities Should Not Be Multiplied Unnecessarily


Written by Yuan Jinhui; Translated by Wang Kaiyan, Dong Wenwen

We are often faced with the problem of how to evaluate the quality of a large software system. The primary metric is, of course, functionality: does the software meet the main requirements (does it do the right things)? But if there are multiple technical paths to the same functionality, people tend to choose the simpler one.

Occam’s Razor, “Entities should not be multiplied unnecessarily”, sums up this preference for simplicity very well; it is precisely a way of countering the challenge of complexity. The underlying logic of the preference is that simplicity is what lets us “do things right”.

The Age-Old Problem of Software Development: Combating Complexity

In the 1960s, people began to speak of a “software crisis” (see Software crisis — Wikipedia): software development could not keep up with advances in hardware or with the growing complexity of real-world problems, and projects could not be delivered on schedule.

Fred Brooks, a Turing Award winner who led the development of System/360 and OS/360 at IBM, opens the bible of software engineering, “The Mythical Man-Month”, with the image of great beasts dying in a tar pit, an analogy for software developers who are mired in software complexity and cannot get out. He also introduced the famous Brooks’ Law: “Adding manpower to a late software project makes it later.”

In his paper “No Silver Bullet — Essence and Accidents of Software Engineering,” he further divides the difficulties of software development into essential and accidental ones, and identifies several major causes of the essential difficulties: complexity, invisibility, conformity, and changeability, with complexity leading the way.

In 2006, a paper entitled “Out of the Tar Pit” echoed Brooks. This paper argues that complexity is the only major difficulty preventing successful large-scale software development, and that several of the other causes Brooks suggests are secondary disasters resulting from unmanageable complexity, with complexity being the root cause. This paper, too, cites several Turing Award winners for their excellent discussions of complexity.

“…we have to keep it crisp, disentangled, and simple if we refuse to be crushed by the complexities of our own making…”

-- Dijkstra

“The general problem with ambitious systems is complexity.”, “…it is important to emphasize the value of simplicity and elegance, for complexity has a way of compounding difficulties”

-- Corbato

“there is a desperate need for a powerful methodology to help us think about programs. … conventional languages create unnecessary confusion in the way we think about programs”

-- Backus

“…there is one quality that cannot be purchased… — and that is reliability. The price of reliability is the pursuit of the utmost simplicity”

“I conclude that there are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.”

-- Hoare

Making things simple is the hardest thing to do, and the only way to combat complexity: on the one hand, simplicity implies comprehensibility, and comprehensibility is critical to the maintenance and iteration of large software; on the other hand, while simple abstractions do not guarantee simple implementations, overly complex abstractions almost certainly imply complex implementations.

This article will discuss how the design of the “dependency engine” in TensorFlow violates the Occam’s Razor guideline.

Background: Dependency Engines for Deep Learning Frameworks

DL frameworks and other distributed computing engines prefer the Dataflow model, which is simply a directed acyclic graph (DAG) whose nodes are divided into two types: op nodes and data nodes.

In the Dataflow model, whether an operation can start executing depends on the readiness of the input data for this operation, so all operations are executed in the topological order of a directed acyclic graph.

In the above figure, for example, the rectangles represent ops (there are 4 in total) and the circles represent data (from the producer’s viewpoint there is actually only one copy of A, but from the consumers’ viewpoint there are two, so it is drawn twice). As shown in the figure, there is no dependency between {B=A+1} and {C=A+2}: the order of execution between them is unspecified, and they can be executed in parallel as long as the computational resources are sufficient. Translated into an imperative program, both of the following code fragments are legal:

A = 2      |  A = 2
B = A + 1  |  C = A + 2
C = A + 2  |  B = A + 1
D = B * C  |  D = B * C

DL frameworks based on static graphs rely on a dependency engine module to manage the dependencies between operations and the trigger timing of each operation.

The usual implementation of the dependency engine is to attach a counter to each circle representing data, indicating whether the data is ready or not. The initial value of the counter is 0. When the producer of the data finishes execution, it changes the counter to 1. When a consumer of the data sees that the counter has become 1, it can then read the data. In the example above, {D=B*C} is triggered when the counters for both B and C become 1.
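To make this concrete, here is a minimal single-threaded sketch of such a counter-driven engine, using the example graph above. The names (Op, run_graph) are illustrative and do not come from any particular framework.

    class Op:
        def __init__(self, name, inputs, outputs, fn):
            self.name = name        # op identifier
            self.inputs = inputs    # names of the data nodes this op reads
            self.outputs = outputs  # names of the data nodes this op produces
            self.fn = fn            # the actual computation

    def run_graph(ops):
        # One readiness counter per data node: 0 = not ready, 1 = ready.
        counters = {}
        for op in ops:
            for d in op.inputs + op.outputs:
                counters.setdefault(d, 0)
        env = {}                    # data name -> value
        pending = list(ops)
        while pending:
            # An op is triggered once all of its input counters have become 1.
            ready = [op for op in pending
                     if all(counters[d] == 1 for d in op.inputs)]
            for op in ready:
                op.fn(env)                  # execute the op
                for d in op.outputs:
                    counters[d] = 1         # the producer marks its output ready
                pending.remove(op)
        return env

    # The example graph: A=2, B=A+1, C=A+2, D=B*C
    ops = [
        Op("A=2",   [],         ["A"], lambda e: e.update(A=2)),
        Op("B=A+1", ["A"],      ["B"], lambda e: e.update(B=e["A"] + 1)),
        Op("C=A+2", ["A"],      ["C"], lambda e: e.update(C=e["A"] + 2)),
        Op("D=B*C", ["B", "C"], ["D"], lambda e: e.update(D=e["B"] * e["C"])),
    ]
    print(run_graph(ops)["D"])      # prints 12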

In DL frameworks based on the static graph mechanism, including TensorFlow and MXNet, the execution engine (executor) uses exactly such a mechanism to manage the dependencies between ops and the trigger timing of each op.

If different ops are executed on different threads within the same process, multiple threads can access the counters concurrently through shared memory; the only extra requirement is that, to resolve contention between concurrent reads and writes, each counter must be protected by a lock or implemented as an atomic variable.
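As a small illustration, here is a sketch of a thread-safe variant in which each consumer keeps a countdown of how many of its inputs are still missing, protected by a lock. The countdown form is an assumption made for brevity; it is equivalent in spirit to the per-data 0/1 counters described above.

    import threading

    class ReadyCounter:
        def __init__(self, num_inputs):
            self.remaining = num_inputs     # inputs still missing
            self.lock = threading.Lock()

        def one_input_ready(self):
            # Called by each producer thread; returns True for exactly one
            # caller, the one whose update makes the consumer runnable.
            with self.lock:
                self.remaining -= 1
                return self.remaining == 0

    # D = B * C depends on two inputs, produced on two different threads.
    counter_for_D = ReadyCounter(num_inputs=2)

    def produce(name):
        # ... compute B or C here ...
        if counter_for_D.one_input_ready():
            print(name, "was the last input; trigger D = B * C")

    t1 = threading.Thread(target=produce, args=("B=A+1",))
    t2 = threading.Thread(target=produce, args=("C=A+2",))
    t1.start(); t2.start(); t1.join(); t2.join()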

If different ops are distributed on different processes or even different machines, how should the cross-machine dependencies be expressed and managed?

Problem: Dependency Engine for Distributed Versions

For distributed scenarios (data parallelism, model parallelism, etc.) that can be handled by collective communication primitives (all-reduce, all-gather, reduce-scatter, etc.) such as those in NCCL, the subgraphs on each machine do not “appear” to have dependencies at the computation-graph level. Each node can be managed as a completely independent computation graph (the cross-machine dependency is implemented implicitly by the underlying collective communication), and the complexity at the execution-engine level is rather small.

However, an “irregular communication pattern”, in which a cross-machine dependency is realized through peer-to-peer communication between a producer and a consumer (the producer on one machine, the consumer on another), introduces some subtle problems that are more complex than the scenarios covered by collective communication primitives. Only this case is discussed below.

When the ops in the computation graph are partitioned onto different machines in a peer-to-peer manner, several questions need to be answered: (1) how is the data moved from the producer’s machine to the consumer’s machine? (2) should the state counter be placed in the producer’s process or the consumer’s process? (3) if it is placed in the consumer’s process, by what means does the producer modify this counter?

TensorFlow’s Abstraction of Cross-Machine Data Movement

Let’s see how TensorFlow solves the producer-consumer cross-network problem.

The dashed line in the figure indicates the boundary between the machines. {A=2} is executed on the machine in the left half and {B=A+1} is executed on the machine in the right half. After the machine on the left has produced A, A needs to be moved to the machine on the right before it can be consumed.

A natural implementation of cross-machine state synchronization is “message passing”. TensorFlow inserts a pair of send and recv operations into the computation graph, decoupling the normal user-written ops from cross-machine data movement so that normal ops do not need to worry about the hassle of cross-machine communication. For example, on the producer side, after {A=2} is executed, the producer knows that A will be consumed by send and just needs to hand the data to send. On the consumer side, {B=A+1} knows that the A it depends on will be fetched by recv, and just needs to wait for recv to finish executing.
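The graph rewrite this implies can be sketched as follows: every cross-machine edge is split into producer -> send on the producer’s machine and recv -> consumer on the consumer’s machine. The function and names below are illustrative, not TensorFlow’s actual graph-partitioning code.

    def insert_send_recv(edges, placement):
        """edges: list of (producer, consumer); placement: op name -> machine id."""
        new_edges = []
        for prod, cons in edges:
            if placement[prod] == placement[cons]:
                new_edges.append((prod, cons))            # same machine: keep the edge
            else:
                send = "send(" + prod + "->" + cons + ")"  # placed with the producer
                recv = "recv(" + prod + "->" + cons + ")"  # placed with the consumer
                placement[send] = placement[prod]
                placement[recv] = placement[cons]
                # The wire between send and recv is handled by the communication layer.
                new_edges += [(prod, send), (recv, cons)]
        return new_edges

    placement = {"A=2": 0, "B=A+1": 1}
    print(insert_send_recv([("A=2", "B=A+1")], placement))
    # [('A=2', 'send(A=2->B=A+1)'), ('recv(A=2->B=A+1)', 'B=A+1')]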

To handle the timing mismatch between send and recv, TensorFlow also introduces the concept of rendezvous on the producer side. The approximate logic is as follows (a minimal sketch is given after the list):

  1. send puts the data into rendezvous’ key-value (KV) dictionary. If the recv side has already issued its request to fetch the data, so that the two sides meet, the underlying data transfer starts immediately; if the recv side has not yet issued the request, send only needs to leave the data in the KV dictionary and wait for recv’s request.
  2. The recv side requests the data from rendezvous. If it finds that the data is already in the KV dictionary, the underlying data transfer starts immediately; if the data is not yet there, recv wraps the data transfer into a callback function and leaves it in the KV dictionary, and this callback is triggered when send later puts the data in.
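Here is a minimal sketch of that rendezvous logic: a locked key-value table in which whichever side arrives first parks either the data or a callback. The class and method names are illustrative, not TensorFlow’s actual API.

    import threading

    class Rendezvous:
        def __init__(self):
            self.lock = threading.Lock()
            self.table = {}  # key -> ("data", value) or ("waiter", callback)

        def send(self, key, value):
            callback = None
            with self.lock:
                entry = self.table.pop(key, None)
                if entry is not None and entry[0] == "waiter":
                    callback = entry[1]                 # recv asked first
                else:
                    self.table[key] = ("data", value)   # park the data for recv
            if callback is not None:
                callback(value)                         # start the transfer immediately

        def recv_async(self, key, done):
            value, found = None, False
            with self.lock:
                entry = self.table.pop(key, None)
                if entry is not None and entry[0] == "data":
                    value, found = entry[1], True
                else:
                    self.table[key] = ("waiter", done)  # park the callback for send
            if found:
                done(value)                             # data is already there

    rdv = Rendezvous()
    rdv.recv_async("A", lambda v: print("recv got", v))
    rdv.send("A", 2)   # prints: recv got 2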

Let’s think about this: is the TensorFlow abstraction minimal for letting the dependency engine handle producer-consumer relationships across machines? Does it live up to the Occam’s Razor guideline, “Entities should not be multiplied unnecessarily”? Are any of the concepts redundant: is rendezvous necessary? Are both send and recv necessary?

Is “Rendezvous” a Must?

Rendezvous is primarily designed to address the need for a “handshake” between producer and consumer, and it is also feasible to place it on the consumer side, for example:

Once the producer generates data A, it can simply send the data to the consumer’s machine without caring whether execution has started on the consumer’s end. If the data arrives at the consumer’s machine before the consumer subgraph has started, it simply waits there for the consumer subgraph to start and use it; if the consumer subgraph is already started and waiting when the data arrives, the callback function containing the consumer subgraph is triggered.

Considering that we are dealing with static graphs, where the subgraphs to be executed on each machine and the producer-consumer relationships are all known in advance, rendezvous is in fact not necessary. Why?

One guarantee is easy to achieve under the static graph mechanism: the subgraphs on every machine are already initialized before any data starts to flow into the system (OneFlow does this by only letting data flow from the source once all the actors corresponding to all the subgraphs on all the nodes have been initialized, which is enforced with a barrier). As long as this is guaranteed, rendezvous can be removed, because the producer knows for certain that the consumer’s subgraph is already waiting for data at the other end.

In fact, even PyTorch, which uses the dynamic graph mechanism, introduces only the send and recv abstractions for peer-to-peer transfers; it does not introduce the concept of rendezvous.

After removing the redundant rendezvous, the producer-consumer relationship across the network can be simplified as shown in the following figure.

send or recv: Either One Suffices

At the abstract level, if we define the duty of send as Pushing data to another machine and the duty of recv as Pulling data from another machine to the local one, we find that whether the transfer is a producer-initiated push or a consumer-initiated pull, only one of send and recv is required. If both are introduced, one of them always acts merely as a placeholder and does no real work.

Further, introducing both send and recv is not just a matter of redundant concepts; it breaks the system’s consistency (recall how “The Mythical Man-Month” emphasizes the importance of conceptual integrity): a system built on the dataflow abstraction can use either Push semantics or Pull semantics, but using both at once makes the system difficult to understand.

If the producer side actively Pushes data to the consumer side, then recv is redundant. PyTorch’s send and recv can be thought of as adopting Push semantics while introducing a redundant recv that is merely a placeholder.

If the consumer side actively Pulls the data over, then send is redundant. The send side of TensorFlow always waits for the recv side to issue a request to pull the data, so TensorFlow can be thought of as adopting Pull semantics. In this case, send is redundant and is merely a placeholder.

Push or Pull?

Purely in terms of the functional requirement of cross-machine data movement, neither is superior to the other (by analogy, RDMA provides both READ and WRITE one-sided operations). To make a choice, we need to examine which semantics is better in a larger context: we want semantics that can uniformly cover both intra-machine interactions and cross-machine operations.

There are two types of interactions between ops within a single machine. One is the passing of state, which follows Push semantics at this level: for example, after execution the producer actively modifies the counter that the consumer depends on, thus pushing the state to the consumer. The other is the passing of data, which follows Pull semantics at this level: the producer does not push the data to the consumer; instead, the consumer reads the data produced by the producer as it needs it. The motivation for this Pull semantics is zero-copy, which is efficient.
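A tiny sketch of this convention, under the assumption that producer and consumer live in the same address space: the producer pushes only the readiness state, while the data stays in place and is read (pulled) by the consumer without copying. The class names are illustrative.

    class Producer:
        def __init__(self):
            self.out_buffer = None          # the data lives here and is never copied out

        def run(self, consumer):
            self.out_buffer = [1, 2, 3]     # produce the data in place
            consumer.notify_ready(self)     # Push the state: "my output is ready"

    class Consumer:
        def notify_ready(self, producer):
            # Pull the data: read the producer's buffer directly (zero-copy)
            # instead of having the producer push a copy over.
            view = producer.out_buffer
            print("consumer computes on", view)

    Producer().run(Consumer())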

Clearly, when we choose between the send op and the recv op, the functional requirement is data transfer, so we should choose Pull semantics. From this perspective, PyTorch’s Push semantics is inefficient, while TensorFlow’s Pull semantics is efficient, though its send is redundant.

OneFlow uses Pull semantics, both within and across machine relationships, where the consumer op always pulls data from the producer op, and therefore only recv is introduced, not send.

By adopting Pull semantics, uniformity is achieved conceptually at the computation-graph level. The data-flow mechanism of the system can be explained in one sentence: a consumer reading the data its producer has written within one machine and a consumer pulling data across machines are semantically the same; the only difference is that when the producer and the consumer are on the same machine, the consumer reads from and writes to that machine, while when they are on different machines, the consumer (recv) reads from the remote machine and writes to the machine it is on.

This conceptual consistency reduces the cognitive burden: the simplest Pull rule suffices to explain both the single-machine and the multi-machine cases.

The discussion above is at the abstraction level; the implementation level is a separate issue. If the underlying communication uses RDMA, a one-sided READ operation supports Pull directly; if the underlying transport is a socket, then Pull semantics has to be implemented through the cooperation of send and recv underneath, which is exactly what OneFlow’s CommNet module does.
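As an illustration of the socket-style case, here is a minimal sketch of Pull semantics built on top of plain two-way message passing: the consumer sends a small request, and the producer answers with the data. The queues stand in for the two directions of a network connection; this is an illustrative sketch, not OneFlow’s CommNet.

    import queue
    import threading

    request_channel = queue.Queue()   # consumer machine -> producer machine
    data_channel = queue.Queue()      # producer machine -> consumer machine

    def producer_machine():
        A = 2                         # {A = 2} has finished on this machine
        key = request_channel.get()   # wait for the consumer's pull request
        assert key == "A"
        data_channel.put(A)           # answer the pull with the data

    def consumer_machine():
        request_channel.put("A")      # recv issues the pull for A
        A = data_channel.get()        # the data arrives locally
        print("B =", A + 1)           # {B = A + 1} can now run

    t = threading.Thread(target=producer_machine)
    t.start()
    consumer_machine()                # prints: B = 3
    t.join()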

Conclusion

The dependency engine is the core module of the DL framework, and the shared memory-based counter mechanism is a common way to implement a standalone dependency engine.

When distributed computing is introduced, cross-machine dependencies are usually implemented with the help of “message passing” (of course, it is possible to continue using the shared memory counter mechanism if it is supported by a distributed shared memory mechanism).

TensorFlow’s implementation of distributed dependency engine using “message passing” introduces several redundant concepts that are not in line with “Entities should not be multiplied unnecessarily”.

When the redundant abstractions are removed, the TensorFlow implementation can be simplified.

Through this “fault-finding” exercise, I would like to share one belief: only a paranoid pursuit of minimalism leads to the essence of things.

You may have noticed that relying on shared memory for the programming model within a single machine while using message passing across machines introduces yet another inconsistency. Is this a problem? We will continue the discussion in the next article, so stay tuned.

Related articles:

  1. OneFlow v0.5.0RC came out!
  2. The Limitations of Existing Deep Learning Frameworks: Dynamic Scheduling

Welcome to visit OneFlow on GitHub and follow us on Twitter and LinkedIn.

Also, welcome to join our Discord group to discuss and ask OneFlow related questions, and connect with OneFlow contributors and users all around the world.
