Techniques for programming in the large.
I’m a Research and Development Engineer and a former Technical Artist. I’ve been programming professionally since 2005. In this post, I’ll tell you about my approach to writing large programs. This set of techniques is something I’ve converged on after many years of experience building large Visual Effects production pipelines.
My own arbitrary definition of “programming in the large” is any program over 1000 lines. 1000 lines isn’t very large, you might say, but it’s usually large enough to start breaking up the program into multiple source files. It’s also usually large enough that complexity starts to slow the rate of development.
We’re going to start with a few basic observations about programming languages and paradigms. Then I will show you two specific decomposition techniques suitable for large scale software development. The first is modular programming and the second is dataflow programming.
Static Typing Is Your Friend
Today I prefer statically typed languages for large programs. This wasn’t always the case. For a long time, my primary language was Python. As a former technical artist, Python was all I needed to get my job done. In fact, I loved Python so much, that I shied away from learning other languages. Because, you know, I could already do everything with Python!
Everything, that is, except write a large program.
Here’s the story. In 2016 I started working on a hobby video game project at home. Naturally, I started developing it in Python. For a while, things were going great. Then, at around 2500 lines of code, something strange happened. I hit some kind of a wall. Development slowed to a crawl. Refactoring became very painful. Why? Things were going so well and now I felt like I couldn’t move. Couldn’t breathe. Constricted even.
After trying to program in many different languages for several months, I started to form an opinion:
Static typing is better for large programs than dynamic typing.
“Better” in the sense of self-documenting code, ease of refactoring, reduced cognitive load, IDE integration, and performance.
I’m not suggesting it’s impossible to write a large program in a dynamic language, just that it may be more difficult. Also, I’m not suggesting that it’s impossible to write battle-tested, production-grade code in a dynamic language. Obviously, people do it all the time. I’m only talking about very large programs, thousands of lines long, where I believe static typing (in addition to tests) will help you immensely. Conversely, a statically typed language may be overkill for small programs and prototypes. Basically, I’m suggesting that the choice of language should be directly related to the size of the program.
I’m currently rewriting my game in TypeScript, and it’s going well. For me, the real difference between working in a dynamic language like Python and a statically typed language like TypeScript comes down to fear of refactoring. In a large Python program, I’m terrified of refactoring. If I make a change, the only way to find out if I introduced a bug is to run the program. So I make a small change, run the program, fix the errors, make another small change, run the program, fix the errors. Did I get them all? Are my tests thorough enough? In TypeScript, I can refactor with a lot more confidence. If I make a change, even a big change, my IDE will light up like a Christmas tree. I have to run the program less often and I don’t have to write run-time type checks. A good compiler is a powerful tool. I put the compiler to work managing some of the complexity of my program, so I don’t have to. If programmers are supposed to be lazy, what can be lazier than that?
To wrap up the story, TypeScript in `--strict` mode has become one of my favorite languages. The other language I wish I had more time to play with is Rust. Instead of shying away from learning new languages, now I welcome the opportunity. I still use Python at work and it’s still my favorite dynamic language, but I wouldn’t start another large program in Python. I’m aware of type hinting in Python 3.6, but I haven’t had a chance to use it in production yet.
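To make this concrete, here is a minimal sketch of what type hints look like in Python. The `Player` class and `apply_damage` function are hypothetical game code, not from my actual project; the point is that a checker like mypy can flag every call site if a field is renamed or its type changes, without running the program:

```python
from dataclasses import dataclass

@dataclass
class Player:
    name: str
    health: int

def apply_damage(player: Player, amount: int) -> Player:
    # Pure function: returns a new Player instead of mutating the old one.
    return Player(player.name, max(0, player.health - amount))

# If `health` were later renamed or changed to a float, a type checker
# would report every affected line — no need to run the game to find out.
hero = apply_damage(Player("Hero", 100), 30)
```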
- I don’t need types by Dimitri Merejkowsky. A story similar to mine.
- Ideology. Gary Bernhardt explains why you need types and tests.
Paradigms Are Micro, Not Macro
There are 3 popular programming paradigms today: object-oriented, functional, and procedural. You will hear a lot of hype about how one paradigm is better than another for writing large programs. I encourage you to remain skeptical when you hear such claims on the Internet, or worse, from your teachers!
Here’s what I believe:
- Anything that can be expressed in one paradigm can also be expressed in another.
- Some programmers find one of the paradigms most “natural” because it’s closest to the way they think. But not all programmers think the same way. Some programmers prefer to keep nouns (data) and verbs (functions) separate, while others prefer to group verbs under nouns. Some are comfortable with recursion, while others prefer loops.
- It’s possible to program badly in any paradigm.
- Some paradigms are arguably a better fit in certain situations than others. Functional programming for math. Object-oriented programming for scene graphs. Procedural programming for automation. Therefore, using a healthy mix of paradigms is probably the best approach.
Honestly, the word “paradigm” sounds more important than it really is in the context of programming. I would demote it to something like “style”. In reality, the 3 major paradigms are just 3 different styles of programming in the small — 3 different ways to organize a single source file. Programming paradigms are micro, not macro. Therefore, as far as programming in the large is concerned, it doesn’t matter which paradigm you lean towards.
- Exercises in Programming Style by Cristina Videira Lopes. One program written in 33 different styles in Python.
- Thirteen ways of looking at a turtle. Scott Wlaschin describes 13 ways to implement a turtle graphics API in F#.
Modular Programming
I’ll admit that for many years I didn’t really understand how to write large programs. I’ve written tons of code over the course of my career, but most of that code was either 3D artist tools or entire pipelines of self-contained scripts coupled only by data. My video game project was the first time I encountered a single large program, and for a while, I had no idea how to deal with that kind of complexity. I knew I had to somehow decompose my program into parts, but my usual go-to decomposition method, dataflow programming, just didn’t fit a game. Programming paradigms didn’t help either, because, as I said, paradigms are micro, not macro. Finally, after much research, I somehow stumbled onto modular programming. If you only learn one technique for programming in the large, this should be the one.
Now let’s define modular programming:
- A module defines a set of imports — other modules that this module depends on. And a set of exports — definitions of constants, data types, functions, and classes that are the public interface of this module. A module can also have private definitions which it doesn’t export. In this way, modules achieve information hiding.
- A program is a collection of modules which (ideally) form a directed acyclic graph (DAG) of dependencies. In this DAG, the nodes are modules and the edges are “uses” relationships.
- Modules lie on a spectrum from high-level (specific) to low-level (generic). The highest level module contains the entry point of the program, whereas the lowest level modules are usually generic libraries. In a case of super confusing terminology, modules that a given module depends on are called upstream modules. Whereas modules that depend on a given module are called downstream modules. Therefore, high-level = downstream and low-level = upstream.
- The module dependency graph is horizontal if, on average, a given module uses many other modules, and vertical if, on average, a given module uses few other modules. There is even a software metric, the Normalized Cumulative Component Dependency (NCCD), that measures how horizontal or vertical a graph is. Horizontal graphs have lower coupling than vertical graphs in the graph-theoretic sense.
- Because of the DAG structure, module branches can be “cut off” and tested in isolation. To test a given module you only need all of its upstream dependencies. In my opinion, this ability to develop and test modules independently is the primary benefit of modular programming. If the module interfaces are clearly defined, then multiple developers can work in parallel.
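In Python terms, these definitions map onto familiar machinery: `import` pulls in upstream modules, `__all__` declares the exports, and a leading underscore marks private definitions. A minimal sketch of one low-level, generic module near the bottom of the DAG (the module and its names are hypothetical):

```python
# vec2.py — a low-level, generic module (hypothetical example).
import math                      # upstream dependency
from dataclasses import dataclass

__all__ = ["Vec2", "distance"]   # the public interface (exports)

@dataclass(frozen=True)
class Vec2:
    x: float
    y: float

def distance(a: Vec2, b: Vec2) -> float:
    """Euclidean distance between two points."""
    _check(a)
    _check(b)
    return math.hypot(a.x - b.x, a.y - b.y)

def _check(v: Vec2) -> None:     # private helper: not exported
    assert math.isfinite(v.x) and math.isfinite(v.y)
```

Higher-level modules would import `Vec2` and `distance` without ever seeing `_check` — that’s information hiding at the module level.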
So now that we’ve defined modular programming, what are the rules for doing it well?
- No dependency cycles between modules. This design rule will prevent your code from turning into a big ball of mud. When people say their code is “modular”, they mean that their module dependencies form a DAG. Dependency cycles are bad because they increase coupling in the graph-theoretic sense. Modules that form a dependency cycle might as well be one big module because they can’t be tested in isolation. This rule can be enforced programmatically and some languages, like Golang, forbid circular dependencies altogether.
- Stability increases at lower levels. A module that many other modules depend on better have a stable interface (or have a version number), because if that interface changes some of the downstream modules also have to change. Because of this, low-level modules are literally more important to get right than high-level modules. Rushing decisions about low-level modules often leads to technical debt. Low-level modules should be developed (or bought off the shelf) first. In terms of risk management, the system should be built bottom up.
- Reusability increases at lower levels. Low-level modules should be generic libraries so that they can be reused in other projects.
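The “no dependency cycles” rule really can be enforced programmatically. A sketch of a cycle check over a module dependency graph, using a standard depth-first search (the module names in the usage below are hypothetical):

```python
def has_cycle(graph: dict[str, list[str]]) -> bool:
    """Return True if the module dependency graph contains a cycle.

    `graph` maps each module to the modules it imports (its upstream deps).
    """
    visiting: set[str] = set()   # modules on the current DFS path
    done: set[str] = set()       # modules fully explored, known cycle-free

    def visit(module: str) -> bool:
        if module in done:
            return False
        if module in visiting:
            return True          # back edge found: dependency cycle
        visiting.add(module)
        if any(visit(dep) for dep in graph.get(module, [])):
            return True
        visiting.remove(module)
        done.add(module)
        return False

    return any(visit(m) for m in graph)
```

Running this as a pre-commit or CI check keeps the graph a DAG before a cycle ever sneaks in.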
I’ve been using modular programming successfully both at home and at work for some time. It’s a powerful technique for structuring large programs and even large hardware systems. I can develop and test each module in a bottom-up fashion, assembling my program piece by piece. If a module interface needs to change, I can tell at a glance, by looking at the dependency graph, which downstream modules may be affected. If a module assumes too much responsibility, I can push some of the responsibility to a lower level module. If I need to rewire my dependency graph, I can see visually how I’m going to do it.
What’s really weird is how long it took me to discover modular programming. I knew about Python modules, of course. But I never really used Python modules beyond the simple case of grouping utility functions together. It never occurred to me that I could repeat this grouping process recursively. Only when I went back to study the history of programming did I discover “deep” modular programming in languages like Modula-2, Oberon, and C. From what I can tell, modular programming was popular during the late 70s, 80s, and early 90s, and that’s where you’ll find most of the literature. Then object-oriented programming came to dominate the programming world, and it seems modular programming concepts were largely forgotten for about 25 years. This is a shame, because object-oriented programming and modular programming are not mutually exclusive! (Information hiding is the only overlap.) In my view, modular programming subsumes all the paradigms.
Today modular programming is being rediscovered, with nearly every language supporting it or adding official support for it (ES6 modules, Java 9 modules). In Golang, which was inspired by C and Oberon, modules (packages) are pretty much the only way to structure your code. And of course, functional languages like OCaml and F# have always had modules.
Unfortunately, some of the deeper graph-theoretic concepts I listed above are rarely discussed anymore. Modern modular programming literature seems sparse as if everyone is expected to understand these concepts from birth. Does anyone else think modular programming deserves more attention?
- Large-Scale C++ Software Design by John Lakos. The modular programming bible. Highly recommended.
- On the Criteria To Be Used in Decomposing Systems into Modules by David Parnas. The concept of information hiding is introduced in this 1971 paper.
- Project Oberon: The Design of an Operating System and Compiler by Niklaus Wirth and Jürg Gutknecht. A rare book which lists the full source code of an entire operating system structured using modular programming.
- Agile Principles, Patterns, and Practices in C# by Robert C. Martin and Micah Martin. See Chapter 28: Principles of Package and Component Design.
- How Yelp Modularized the Android App by Sanae Rosen. A modular programming success story.
- Java 9 Modularity: Patterns and Practices for Developing Maintainable Applications by Sander Mak and Paul Bakker. One of the only recent books about modular programming.
Dataflow Programming
While modular programming can help you build a single large program, dataflow programming can help you build a large pipeline of many interconnected programs. As a former technical artist, dataflow programming is very near and dear to my heart. Over 50% of all the code I’ve ever written could fit under this category. Dataflow programming is very common in the VFX, 3D Animation, and Video Game industries. In these industries, you can’t throw a rock (and I mean literally) without hitting a 3D artist working in some kind of dataflow program. Popular dataflow programs are Maya, Nuke, Substance Designer, and Houdini. These programs are often called “node-based”, “non-destructive”, or “procedural”.
Let’s define exactly what dataflow programming is:
- In dataflow programming, the program is decomposed into black box processes called nodes. These nodes have a set of inputs and a set of outputs. Nodes transform inputs into outputs in some way. For example, in Nuke, you could load an image using a Read node, then resize that image to quarter resolution using a Reformat node, then save out the smaller image using a Write node. The original input image is never overwritten, and this is why dataflow programming is called non-destructive editing.
- Nodes are arranged into a “pipes and filters” pipeline, similar to a manufacturing assembly line, where the pipes carry data and the filters are process nodes. A dataflow pipeline always forms a directed acyclic graph (DAG).
- Nodes are executed in topological sort order from upstream to downstream. Changing any of the inputs in an upstream node automatically recomputes all downstream nodes. In this way, we say that the data is flowing through the nodes.
- While dataflow programming and functional programming are similar, there are a few important differences. First, in dataflow programming, the structure of the DAG is specified externally in some runtime environment. Nodes aren’t aware of each other, whereas in functional programming functions can call other functions. Second, dataflow programming doesn’t allow recursion. Third, dataflow programming is typically set up for parallel execution, whereas in functional programming parallel execution is not a given.
- Nodes usually have their own parameters. These parameters are often stored externally together with the DAG. Sometimes the parameters are computed using other nodes or expressions.
- Nodes are coupled only by data, which makes them endlessly reconfigurable. At any moment, the artist can inspect the outputs of a specific node. This leads to a deep understanding of every step of the process. Skilled artists can build node networks of incredible complexity without ever writing a single line of code. This ability for artists to do “visual programming” is probably why dataflow programming is so attractive to them.
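The execution model above can be sketched in a few lines using Python’s standard `graphlib`. The node names follow the Nuke example from earlier (read → reformat → write), and a list of numbers stands in for real pixel data:

```python
from graphlib import TopologicalSorter

# Each node is a black box: a function from upstream outputs to one output.
nodes = {
    "read":     lambda up: [10, 20, 30, 40],   # stand-in for loading an image
    "reformat": lambda up: up["read"][::4],    # quarter res: keep every 4th pixel
    "write":    lambda up: up["reformat"],     # stand-in for saving the result
}
# Map each node to its upstream dependencies (the edges of the DAG).
deps = {"read": set(), "reformat": {"read"}, "write": {"reformat"}}

def evaluate(nodes, deps):
    """Run every node once, in topological sort order, upstream first."""
    results = {}
    for name in TopologicalSorter(deps).static_order():
        upstream = {d: results[d] for d in deps[name]}
        results[name] = nodes[name](upstream)
    return results
```

Changing the `read` node’s output and calling `evaluate` again recomputes everything downstream — the non-destructive “data flowing through nodes” behavior, in miniature.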
Dataflow programming extends beyond a desktop application runtime like Maya. Bigger runtimes, called Render Farm Management Software, exist to orchestrate massive distributed pipelines of renderers and other command line tools. (If you’re more comfortable with Web tech than VFX tech, check out Apache Airflow.) And these are the kinds of pipelines that I’m often tasked with designing and writing.
But how do you write a command line tool which can be plugged into a dataflow pipeline? What are the rules for writing such a tool well? Here’s a list of rules I follow:
- The tool should not have an interactive prompt. The tool cannot be interactive because the Render Farm Management Software, which will run the tool, is an automated process. Therefore, the tool can only accept command line arguments.
- The tool must be like a pure function. The only difference is that the data exists on disk instead of in memory. For example, if you were to write a command line tool to do a weighted average of two images, it could have the following specification:
`weightedAverage.exe <pathToImageA> <pathToImageB> <weight0To1> <pathToOutputImage>`
- The tool must be idempotent. Running the tool more than once with the same inputs should always produce the same outputs. You should assume the tool will be run more than once due to retries. The tool can never overwrite the inputs! And it should always overwrite the outputs! None of this “skip the work if the output already exists” business.
- The tool must reject invalid inputs. Error handling can be done in two ways: exit codes and logging. The exit code is programmatic, while the logging is meant for humans. Exit code 0 always means success. The meaning of other exit codes should be documented. I usually put all my `assert` statements at the beginning to validate the inputs before continuing.
- The tool should be as dumb as possible. It should never try to massage invalid inputs to make them work.
- The tool must always exit. It should not be a daemon. I once worked with a third-party tool which said “Press any key to exit” at the end. Please don’t do that.
- The tool should not be aware of any other running processes or frameworks. It’s up to the Render Farm Management Software to manage dependencies between processes. Wrapping a third-party command line tool (or several) into a single process is okay. Spawning threads is also okay.
- Avoid data collisions. The tools are arranged into DAG pipelines in the Render Farm Management Software. The DAGs themselves are usually parameterized so that different instances of the same DAG can run in parallel. It’s important that data in one DAG instance is completely isolated from data in another DAG instance. All you have to do is put the data for each DAG instance into a separate folder. Also, you should prevent a DAG instance with the same parameters from being created twice.
- Avoid data corruption. What if you have a batch process which somehow touches data in all of the DAG instance folders? In that case, you must stop all running DAG instances, run the batch process, and then restart the DAG instances again. Think of it as a crosswalk. The DAG instances are the cars and the batch process is the pedestrian wishing to cross the street. Bad things will happen if the pedestrian doesn’t wait for the cars to stop.
- Updating an upstream node must automatically update all downstream nodes. Never rerun a single node upstream (with different parameters) without also rerunning all downstream nodes in topological sort order. If your Render Farm Management Software doesn’t come with a “Requeue Downstream Jobs” feature, make sure to write this feature yourself.
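Putting most of these rules together, here is a sketch of the `weightedAverage` tool from earlier as a Python script. This is an illustration, not production code: real image I/O is replaced with JSON lists of numbers. It is non-interactive, behaves like a pure function over files, rejects invalid inputs up front, exits with a documented code, and always overwrites its output:

```python
import json
import os
import sys

USAGE = ("usage: weighted_average.py <pathToImageA> <pathToImageB> "
         "<weight0To1> <pathToOutputImage>")

def main(argv: list[str]) -> int:
    # Exit codes: 0 = success, 2 = invalid inputs.
    if len(argv) != 4:
        print(USAGE, file=sys.stderr)
        return 2
    path_a, path_b, weight_str, path_out = argv
    # Reject invalid inputs up front; never try to massage them into working.
    try:
        weight = float(weight_str)
    except ValueError:
        print(f"weight is not a number: {weight_str}", file=sys.stderr)
        return 2
    if not 0.0 <= weight <= 1.0:
        print(f"weight out of range [0, 1]: {weight}", file=sys.stderr)
        return 2
    if not (os.path.isfile(path_a) and os.path.isfile(path_b)):
        print("input image not found", file=sys.stderr)
        return 2
    with open(path_a) as f:
        a = json.load(f)            # stand-in for real image loading
    with open(path_b) as f:
        b = json.load(f)
    out = [weight * x + (1.0 - weight) * y for x, y in zip(a, b)]
    with open(path_out, "w") as f:  # always overwrite the output, never the inputs
        json.dump(out, f)
    return 0

# Entry point when run on the farm: sys.exit(main(sys.argv[1:]))
```

Because the tool touches nothing outside its declared inputs and output, retries are harmless and the Render Farm Management Software can schedule it freely.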
That’s all there is to it! I can tell you from hard-earned experience that breaking any of these rules will lead to data corruption. But with practice, these rules will become second nature to you.
Dataflow programming is my go-to decomposition technique whenever there is a stream of data flowing from one process to the next. The first thing I do when designing a pipeline is draw a Data Flow Diagram. Once I’m confident that I understand both the data and the processes involved, test data can be gathered and implementation of the processes can begin. If the data is clearly defined, then multiple developers (using potentially different languages) can work in parallel.
- Complete Maya Programming: An Extensive Guide to MEL and C++ API by David Gould. See Chapter 2: Fundamental Maya Concepts for a masterful explanation of how the Maya DAG works.
I hope these techniques will help you write large programs!