How To Write Large Programs


Modular Programming

A typical dependency graph
  • A module defines a set of imports: other modules that this module depends on. It also defines a set of exports: definitions of variables, constants, data types, functions, and classes that form the public interface of the module. A module can also have private definitions which it doesn’t export. In this way, modules achieve information hiding (see the sketch after this list).
  • A program is a collection of modules which (ideally) form a directed acyclic graph (DAG) of dependencies. In this DAG, the nodes are modules and the edges are “uses” relationships.
  • Modules lie on a spectrum from high-level (specific) to low-level (generic). The highest-level module contains the entry point of the program, whereas the lowest-level modules are usually generic libraries. In a case of super confusing terminology, modules that a given module depends on are called upstream modules, while modules that depend on a given module are called downstream modules. Therefore, high-level = downstream and low-level = upstream.
  • The module dependency graph is horizontal if, on average, a given module uses many other modules, and vertical if, on average, a given module uses few other modules. There is even a software metric, the Normalized Cumulative Component Dependency (NCCD), which measures how horizontal or vertical a graph is. Horizontal graphs have lower coupling than vertical graphs in the graph-theoretic sense.
  • There are tools available for nearly every language to help you visualize your module dependency graph. For example, JavaScript/TypeScript has dependency-cruiser, C/C++ has cinclude2dot, Dart has lakos (shameless plug 😊). I encourage you to leverage these tools on large programs. For example, I have dependency-cruiser set up to run every time I make a build. It really helps me understand the macrostructure of my program!
  • Because of the DAG structure, module branches can be “cut off” and tested in isolation. To test a given module you only need all of its upstream dependencies. In my opinion, this ability to develop and test modules independently is the primary benefit of modular programming. If the module interfaces are clearly defined, then multiple developers can work in parallel.
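As a minimal sketch of imports, exports, and information hiding, here is what a small module might look like in Python. The module name and definitions are hypothetical, invented purely for illustration:

```python
# geometry.py -- a hypothetical low-level module
import math  # imports: the modules this module depends on

# __all__ declares the public interface (the exports).
__all__ = ["Point", "distance"]


class Point:
    """A 2D point (part of the public interface)."""

    def __init__(self, x: float, y: float) -> None:
        self.x = x
        self.y = y


def distance(a: Point, b: Point) -> float:
    """Euclidean distance between two points (exported)."""
    return math.sqrt(_squared(a.x - b.x) + _squared(a.y - b.y))


def _squared(value: float) -> float:
    """Private helper -- the leading underscore keeps it out of the public interface."""
    return value * value
```

A downstream module would only `from geometry import Point, distance`; the private helper stays hidden, so it can change without breaking any dependents.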
Three design rules follow from this structure:
  1. No dependency cycles between modules. This design rule will prevent your code from turning into a big ball of mud. When people say their code is “modular”, they usually mean that their module dependencies form a DAG. Dependency cycles are bad because they increase coupling in the graph-theoretic sense. Modules that form a dependency cycle might as well be one big module, because they can’t be tested in isolation. This rule can be enforced programmatically (see the sketch after these rules), and some languages, like Golang, forbid circular dependencies altogether.
  2. Stability increases at lower levels. A module that many other modules depend on had better have a stable interface (or at least a version number), because if that interface changes, some of the downstream modules also have to change. Because of this, low-level modules are more important to get right than high-level modules. Rushing decisions about low-level modules often leads to technical debt. Low-level modules should be developed (or bought off the shelf) first. In terms of risk management, the system should be built bottom-up.
  3. Reusability increases at lower levels. Low-level modules should be generic libraries so that they can be reused in other projects.
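To make rule 1 and the NCCD metric concrete, here is a small Python sketch over a made-up module graph. It detects dependency cycles and computes CCD/NCCD using the usual balanced-binary-tree normalization; it is an illustration of the idea, not a replacement for tools like dependency-cruiser or lakos:

```python
from math import log2

# A hypothetical module dependency graph: module -> modules it uses (upstream).
DEPS = {
    "main": ["app", "ui"],
    "app": ["model", "utils"],
    "ui": ["model"],
    "model": ["utils"],
    "utils": [],
}


def upstream_closure(module, deps, seen=None):
    """All modules needed to build/test `module`, including itself."""
    if seen is None:
        seen = set()
    if module in seen:
        return seen
    seen.add(module)
    for dep in deps[module]:
        upstream_closure(dep, deps, seen)
    return seen


def has_cycle(deps):
    """Rule 1: the graph must be a DAG (no dependency cycles)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {m: WHITE for m in deps}

    def visit(m):
        color[m] = GRAY
        for dep in deps[m]:
            if color[dep] == GRAY or (color[dep] == WHITE and visit(dep)):
                return True  # back edge found: a dependency cycle
        color[m] = BLACK
        return False

    return any(visit(m) for m in deps if color[m] == WHITE)


def nccd(deps):
    """CCD = sum of upstream closure sizes; NCCD normalizes by a balanced binary tree."""
    n = len(deps)
    ccd = sum(len(upstream_closure(m, deps)) for m in deps)
    ccd_tree = (n + 1) * (log2(n + 1) - 1) + 1  # CCD of a balanced binary tree of n modules
    return ccd / ccd_tree  # higher values indicate a more vertical, more coupled graph


if __name__ == "__main__":
    print("cycle:", has_cycle(DEPS))      # False -- this graph is a DAG
    print("NCCD:", round(nccd(DEPS), 2))
```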

Dataflow Programming

A typical dataflow pipeline
  • In dataflow programming, the program is decomposed into black box processes called nodes. These nodes have a set of inputs and a set of outputs, and transform inputs into outputs in some way. For example, in Nuke, you could load an image using a Read node, then resize that image to quarter resolution using a Reformat node, then save out the smaller image using a Write node. The original input image is never overwritten, which is why this style of work is called non-destructive editing.
A simple Nuke pipeline to resize an image
  • Nodes are arranged into a “pipes and filters” pipeline, similar to a manufacturing assembly line, where the pipes carry data and the filters are process nodes. A dataflow pipeline always forms a directed acyclic graph (DAG).
  • Nodes are executed in topological sort order from upstream to downstream. Changing any of the inputs of an upstream node automatically recomputes all downstream nodes. In this way, we say that the data is flowing through the nodes (a toy sketch of this recompute model follows this list).
  • While dataflow programming and functional programming are similar, there are a few important differences. First, in dataflow programming, the structure of the DAG is specified externally in some runtime environment; nodes aren’t aware of each other, whereas in functional programming functions can call other functions. Second, dataflow programming doesn’t allow recursion, because the graph must remain acyclic. Third, dataflow programming is typically set up for parallel execution, whereas in functional programming parallel execution is not a given.
  • Nodes usually have their own parameters. These parameters are often stored externally together with the DAG. Sometimes the parameters are computed using other nodes or expressions.
  • Nodes are coupled only by data, which makes them endlessly reconfigurable. At any moment, the artist can inspect the outputs of a specific node. This leads to a deep understanding of every step of the process. Skilled artists can build node networks of incredible complexity without ever writing a single line of code. This ability for artists to do “visual programming” is probably why dataflow programming is so attractive to them.
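Here is a rough Python sketch of the pull-based, recompute-downstream idea. This is a toy model, not how Nuke (or any real dataflow engine) is actually implemented; the node names mirror the Read/Reformat/Write example above, but the “image” is just a width/height pair so the example stays self-contained:

```python
class Node:
    """A black box process: a compute function plus upstream input nodes."""

    def __init__(self, name, compute, inputs=()):
        self.name = name
        self.compute = compute       # function taking the upstream output values
        self.inputs = list(inputs)   # upstream nodes (the edges of the DAG)
        self.output = None
        self.dirty = True            # needs recomputation

    def set_dirty(self):
        """Mark this node and everything downstream of it for recomputation."""
        self.dirty = True
        for node in _ALL_NODES:
            if self in node.inputs and not node.dirty:
                node.set_dirty()

    def evaluate(self):
        """Pull-based evaluation: upstream nodes are computed first (topological order)."""
        values = [node.evaluate() for node in self.inputs]
        if self.dirty:
            self.output = self.compute(*values)
            self.dirty = False
            print(f"recomputed {self.name}")
        return self.output


_ALL_NODES = []


def node(name, compute, *inputs):
    n = Node(name, compute, inputs)
    _ALL_NODES.append(n)
    return n


# A toy "resize an image" pipeline, by analogy with Read -> Reformat -> Write.
read = node("Read", lambda: {"width": 4096, "height": 2160})
reformat = node("Reformat", lambda img: {k: v // 4 for k, v in img.items()}, read)
write = node("Write", lambda img: f"wrote {img['width']}x{img['height']}", reformat)

print(write.evaluate())   # computes Read, then Reformat, then Write

# Change an upstream node: everything downstream is recomputed automatically.
read.compute = lambda: {"width": 1920, "height": 1080}
read.set_dirty()
print(write.evaluate())   # recomputes Read, Reformat, and Write again
```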
In a render farm pipeline, each node in the DAG is typically a command line tool run by the Render Farm Management Software. Here are my rules for writing such tools:
  1. The tool must not have an interactive prompt. The Render Farm Management Software, which will run the tool, is an automated process, so the tool can only accept command line arguments.
  2. The tool must be like a pure function. The only difference is that the data exists on disk instead of in memory. For example, if you were to write a command line tool to composite image A over image B, it could have the following specification: over <pathToImageA> <pathToImageB> <unpremultiply> <pathToOutputImage>. (A rough skeleton of such a tool follows these rules.)
  3. The tool must fail gracefully. Error handling can be done in two ways: exit codes and logging. The exit code is programmatic, while the logging is meant for humans. Exit code 0 always means success. The meaning of other exit codes should be documented. In the image compositing example above, the exit codes might be: 0: OK, 1: IMAGE_A_HAS_NO_ALPHA_CHANNEL, 2: INCOMPATIBLE_IMAGE_DIMENSIONS, 3: INVALID_IMAGE_FORMAT, 4: IMAGE_DOES_NOT_EXIST, 5: CANNOT_WRITE_OUTPUT, 6: CANNOT_OVERWRITE_INPUTS, etc.
  4. The tool must be idempotent. Running the tool more than once with the same inputs should always produce the same outputs. You should assume the tool will be run more than once due to retries. The tool can never overwrite the inputs! And it should always overwrite the outputs! None of this --overwrite flag BS.
  5. The tool should be as dumb as possible. It should never try to massage invalid inputs to make them work.
  6. The tool must always exit. I once worked with a third-party tool which said “Press any key to exit” at the end. Please don’t do that.
  7. The tool should not be aware of any other running processes or frameworks. It’s up to the Render Farm Management Software to manage dependencies between processes. Wrapping a third-party command line tool (or several) into a single process is okay. Spawning threads is also okay.
  8. Avoid data collisions. The tools are arranged into DAG pipelines in the Render Farm Management Software. The DAGs themselves are usually parameterized so that different instances of the same DAG can run in parallel. It’s important that data in one DAG instance is completely isolated from data in another DAG instance. All you have to do is put the data for each DAG instance into a separate folder. Also, you should prevent a DAG instance with the same parameters from being created twice.
  9. Avoid data corruption. What if you have a batch process which somehow touches data in all of the DAG instance folders? In that case, you must stop all running DAG instances, run the batch process, and then restart the DAG instances again. Think of it as a crosswalk. The DAG instances are the cars and the batch process is the pedestrian wishing to cross the street. Bad things will happen if the pedestrian doesn’t wait for the cars to stop.
  10. Updating an upstream node must automatically update all downstream nodes. Never rerun a single node upstream (with different parameters) without also rerunning all downstream nodes in topological sort order. If your Render Farm Management Software doesn’t come with a “Requeue Downstream Jobs” feature, make sure to write this feature yourself.
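To tie rules 1 through 4 together, here is a rough Python skeleton for the hypothetical over tool described above. The argument names and exit codes are illustrative only (they mirror the spec in rule 2 and the codes in rule 3), and the actual image compositing is left as a placeholder:

```python
#!/usr/bin/env python3
"""over: composite image A over image B and write the result.

Usage: over <pathToImageA> <pathToImageB> <unpremultiply> <pathToOutputImage>
"""
import argparse
import logging
import sys
from pathlib import Path

# Exit codes are the programmatic error contract; logging is for humans.
OK = 0
IMAGE_DOES_NOT_EXIST = 4
CANNOT_OVERWRITE_INPUTS = 6


def main() -> int:
    logging.basicConfig(level=logging.INFO)

    # Rule 1: no interactive prompt -- everything arrives as command line arguments.
    parser = argparse.ArgumentParser(description="Composite image A over image B.")
    parser.add_argument("path_to_image_a", type=Path)
    parser.add_argument("path_to_image_b", type=Path)
    parser.add_argument("unpremultiply", choices=["0", "1"])
    parser.add_argument("path_to_output_image", type=Path)
    args = parser.parse_args()

    # Rule 3: fail gracefully with a documented exit code and a human-readable log line.
    for path in (args.path_to_image_a, args.path_to_image_b):
        if not path.exists():
            logging.error("Input image does not exist: %s", path)
            return IMAGE_DOES_NOT_EXIST

    # Rule 4: never overwrite the inputs...
    if args.path_to_output_image in (args.path_to_image_a, args.path_to_image_b):
        logging.error("Output path must not be one of the inputs.")
        return CANNOT_OVERWRITE_INPUTS

    # ...and always overwrite the output, so retries stay idempotent.
    # Rule 2: behave like a pure function -- read only the inputs on disk,
    # write only the output path. The real compositing would go here.
    args.path_to_output_image.write_bytes(b"")  # placeholder for the real image write
    logging.info("Wrote %s", args.path_to_output_image)
    return OK


if __name__ == "__main__":
    sys.exit(main())
```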

Static Typing Is Your Friend

Paradigms Are Micro, Not Macro

  • Anything that can be expressed in one paradigm can also be expressed in another.
  • Some programmers find one of the paradigms most “natural” because it’s closest to the way they think. But not all programmers think the same way. Some programmers like to keep nouns (data) and verbs (functions) separate, while others prefer to group verbs under nouns. Some are comfortable with recursion, while others prefer loops. Some try to push all state to the outskirts, while others prefer evenly distributed pockets of state.
  • There are no bad paradigms, only bad programmers.
  • Some paradigms are arguably a better fit in certain situations than others. Functional programming for computation. Object-oriented programming for simulation. Procedural programming for automation. Therefore, using a healthy mix of paradigms is the best approach.
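As a small illustration that paradigm choices are micro-level, here is the same (arbitrary) computation written procedurally, functionally, and with an object, all coexisting in one Python program:

```python
from dataclasses import dataclass
from functools import reduce

prices = [3.50, 12.00, 7.25]

# Procedural: a loop mutating local state.
total = 0.0
for price in prices:
    total += price

# Functional: an expression, no mutation.
total_fp = reduce(lambda acc, price: acc + price, prices, 0.0)

# Object-oriented: the verb grouped under a noun.
@dataclass
class Cart:
    prices: list

    def total(self) -> float:
        return sum(self.prices)

# All three styles compute the same result.
assert total == total_fp == Cart(prices).total()
```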

Parting Words
