Techniques for programming in the large.
I’m a Senior Research and Development Engineer and a former Technical Artist. I’ve been programming professionally since 2005. In this post, I’ll tell you about my approach to writing large programs. This set of techniques is something I’ve converged on after many years of experience building large Visual Effects production pipelines.
My own arbitrary definition of a “large program” is any program over 1000 lines of code. 1000 lines isn’t very large, you might say, but it’s usually large enough to start breaking up the program into multiple source files. It’s also usually large enough that you start thinking about the design of your program.
How do you organize thousands of lines of code? How do you decompose a large program into reusable components? To answer these questions, I will show you two specific techniques suitable for large scale software development: modular programming and dataflow programming. Then I’ll try to make the case that static typing is a good thing and that programming paradigms are irrelevant to structuring large programs.
In 2016 I started working on a hobby video game project at home. For a while, things were going great. But as the project grew bigger, the complexity became unmanageable. Finally, it became apparent that I didn’t really understand how to organize large programs at all!
I’ve written tons of code over the course of my career, but most of that code was either 3D artist tools or entire pipelines of self-contained scripts coupled only by data. My video game project was the first time I encountered a single large program. I knew I had to somehow decompose my program into reusable parts, but my usual go-to decomposition method, dataflow programming, just didn’t fit a game. Programming paradigms didn’t help either, because, as you will see, paradigms are micro, not macro. Finally, after much research, I stumbled onto modular programming. If you only learn one technique for programming in the large, this should be the one.
Now let’s define modular programming properly:
- A module defines a set of imports — other modules that this module depends on. And a set of exports — definitions of variables, constants, data types, functions, and classes that are the public interface of this module. A module can also have private definitions which it doesn’t export. In this way, modules achieve information hiding.
- A program is a collection of modules which (ideally) form a directed acyclic graph (DAG) of dependencies. In this DAG, the nodes are modules and the edges are “uses” relationships.
- Modules lie on a spectrum from high-level (specific) to low-level (generic). The highest level module contains the entry point of the program, whereas the lowest level modules are usually generic libraries. In a case of super confusing terminology, modules that a given module depends on are called upstream modules. Whereas modules that depend on a given module are called downstream modules. Therefore, high-level = downstream and low-level = upstream.
- The module dependency graph is horizontal if, on average, a given module uses many other modules. And it is vertical if, on average, a given module uses few other modules. There is even a software metric to compute the horizontalness vs verticalness of a graph called the Normalized Cumulative Component Dependency (NCCD). Horizontal graphs have lower coupling than vertical graphs in the graph-theoretic sense.
- Because of the DAG structure, module branches can be “cut off” and tested in isolation. To test a given module you only need all of its upstream dependencies. In my opinion, this ability to develop and test modules independently is the primary benefit of modular programming. If the module interfaces are clearly defined, then multiple developers can work in parallel.
So now that we’ve defined modular programming, what are the rules for doing it well?
- No dependency cycles between modules. This design rule will prevent your code from turning into a big ball of mud. When people say their code is “modular”, they usually mean that their module dependencies form a DAG. Dependency cycles are bad because they increase coupling in the graph-theoretic sense. Modules that form a dependency cycle might as well be one big module because they can’t be tested in isolation. This rule can be enforced programmatically and some languages, like Golang, forbid circular dependencies altogether.
- Stability increases at lower levels. A module that many other modules depend on better have a stable interface (or have a version number), because if that interface changes some of the downstream modules also have to change. Because of this, low-level modules are literally more important to get right than high-level modules. Rushing decisions about low-level modules often leads to technical debt. Low-level modules should be developed (or bought off the shelf) first. In terms of risk management, the system should be built bottom-up.
- Reusability increases at lower levels. Low-level modules should be generic libraries so that they can be reused in other projects.
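The no-cycles rule really can be enforced programmatically. Here’s a small sketch (the module names are made up) that checks a dependency graph for cycles with a depth-first search:

```python
def find_cycle(deps):
    """Return a list of modules forming a cycle, or None if the graph is a DAG.

    `deps` maps each module name to the modules it uses (its upstream deps).
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / done
    color, stack = {}, []

    def visit(m):
        color[m] = GRAY
        stack.append(m)
        for u in deps.get(m, ()):
            c = color.get(u, WHITE)
            if c == GRAY:  # back edge: the path from u to m closes a cycle
                return stack[stack.index(u):] + [u]
            if c == WHITE:
                found = visit(u)
                if found:
                    return found
        stack.pop()
        color[m] = BLACK
        return None

    for m in list(deps):
        if color.get(m, WHITE) == WHITE:
            found = visit(m)
            if found:
                return found
    return None
```

Run this over your project’s import graph in CI and a big ball of mud can never sneak in.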
I’ve been using modular programming successfully both at home and at work for some time. It’s a powerful technique for structuring large programs and even large hardware systems. I can develop and test each module in a bottom-up fashion, assembling my program piece by piece. If a module interface needs to change, I can tell at a glance, by looking at the dependency graph, which downstream modules may be affected. If a module assumes too much responsibility, I can push some of the responsibility to a lower level module. If I need to rewire my dependency graph, I can see visually how I’m going to do it.
What’s really weird is how long it took me to discover modular programming. I knew about Python modules, of course. But I never really used Python modules beyond the simple case of grouping utility functions together. It never occurred to me that I could repeat this grouping process recursively. It was only when I went back in time to understand the history of programming that I discovered “deep” modular programming in languages like Modula-2, Oberon, and C. From what I can tell, modular programming was popular during the late 70s, 80s, and early 90s, and that’s where you’ll find most of the literature. Then object-oriented programming came to dominate the programming world, and modular programming concepts were largely forgotten for some 25 years. This is a shame because object-oriented programming and modular programming are not mutually exclusive! (Information hiding is the only overlap.) In my view, modular programming is macro and subsumes all the paradigms.
Today modular programming is being rediscovered, with nearly every language supporting it or adding official support for it (ES6 modules). In Golang, which was inspired by C and Oberon, modules (packages) are pretty much the only way to structure your code. And of course, functional languages like Haskell have always had modules.
Unfortunately, some of the deeper graph-theoretic concepts I listed above are rarely discussed anymore. Modern modular programming literature seems sparse as if everyone is expected to understand these concepts from birth. Does anyone else think modular programming deserves more attention?
- Large-Scale C++ Software Design by John Lakos. The modular programming bible. Highly recommended.
- On the Criteria To Be Used in Decomposing Systems into Modules by David Parnas. The concept of information hiding is introduced in this 1971 paper.
- Project Oberon: The Design of an Operating System and Compiler by Niklaus Wirth and Jürg Gutknecht. A rare book that lists the full source code of an entire operating system structured using modular programming.
- Agile Principles, Patterns, and Practices in C# by Robert C. Martin and Micah Martin. See Chapter 28: Principles of Package and Component Design.
- How Yelp Modularized the Android App by Sanae Rosen. A modular programming success story.
While modular programming can help you build a single large program, dataflow programming can help you build a large pipeline of many interconnected programs. As a former technical artist, dataflow programming is very near and dear to my heart. Over half of all the code I’ve ever written fits under this category. Dataflow programming is very common in the VFX, 3D Animation, and Video Game industries. In these industries, you can’t throw a rock (and I mean literally) without hitting a 3D artist working in some kind of dataflow program. Popular dataflow programs are Maya, Nuke, Substance Designer, and Houdini. These programs are often called “node-based”, “non-destructive”, or “procedural”.
Let’s define exactly what dataflow programming is:
- In dataflow programming, the program is decomposed into black box processes called nodes. These nodes have a set of inputs and a set of outputs. Nodes transform inputs into outputs in some way. For example, in Nuke, you could load an image using a Read node, then resize that image to quarter resolution using a Reformat node, then save out the smaller image using a Write node. The original input image is never overwritten, and this is why dataflow programming is called non-destructive editing.
- Nodes are arranged into a “pipes and filters” pipeline, similar to a manufacturing assembly line, where the pipes carry data and the filters are process nodes. A dataflow pipeline always forms a directed acyclic graph (DAG).
- Nodes are executed in topological sort order from upstream to downstream. Changing any of the inputs in an upstream node automatically recomputes all downstream nodes. In this way, we say that the data is flowing through the nodes.
- While dataflow programming and functional programming are similar, there are a few important differences. First, in dataflow programming, the structure of the DAG is specified externally in some runtime environment. Nodes aren’t aware of each other, whereas in functional programming functions can call other functions. Second, dataflow programming doesn’t allow recursion. Third, dataflow programming is typically set up for parallel execution, whereas in functional programming parallel execution is not a given.
- Nodes usually have their own parameters. These parameters are often stored externally together with the DAG. Sometimes the parameters are computed using other nodes or expressions.
- Nodes are coupled only by data, which makes them endlessly reconfigurable. At any moment, the artist can inspect the outputs of a specific node. This leads to a deep understanding of every step of the process. Skilled artists can build node networks of incredible complexity without ever writing a single line of code. This ability for artists to do “visual programming” is probably why dataflow programming is so attractive to them.
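A toy pull-based evaluator captures the core idea. This is a sketch of my own, not any real package’s API — the `Node` class and the lambdas standing in for Read and Reformat are invented for illustration:

```python
# A toy dataflow runtime, in the spirit of a Nuke-style node graph: nodes are
# black-box processes, edges carry data, and evaluating a downstream node
# pulls fresh results from everything upstream of it.
class Node:
    def __init__(self, fn, *inputs):
        self.fn, self.inputs = fn, inputs

    def evaluate(self):
        # Pull model: recursively evaluate upstream nodes first, then apply
        # this node's transform to their outputs.
        return self.fn(*(i.evaluate() for i in self.inputs))


# Read -> Reformat, mirroring the image example above (with a list of numbers
# standing in for pixel data, and "halve it" standing in for a resize).
read = Node(lambda: [10, 20, 30, 40])
reformat = Node(lambda img: img[: len(img) // 2], read)

result = reformat.evaluate()
```

Note that `read` knows nothing about `reformat`: the wiring lives entirely in the graph, which is exactly what makes nodes endlessly reconfigurable. The original “image” is never modified, only read — non-destructive editing in miniature.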
Dataflow programming extends beyond a desktop application runtime like Maya. Bigger runtimes, called Render Farm Management Software, exist to orchestrate massive distributed pipelines of renderers and other command line tools. (If you’re more comfortable with Web tech than VFX tech, check out Apache Airflow.) And these are the kinds of pipelines that I’m often tasked with designing and writing.
But how do you write a command line tool which can be plugged into a dataflow pipeline? What are the rules for writing such a tool well? Here’s a list of rules I follow:
- The tool should not have an interactive prompt. The tool cannot be interactive because the Render Farm Management Software, which will run the tool, is an automated process. Therefore, the tool can only accept command line arguments.
- The tool must be like a pure function. The only difference is that the data exists on disk instead of in memory. For example, if you were to write a command line tool to composite image A over image B, it could have the following specification:
over <pathToImageA> <pathToImageB> <unpremultiply> <pathToOutputImage>
- The tool must fail gracefully. Error handling can be done in two ways: exit codes and logging. The exit code is programmatic, while the logging is meant for humans. Exit code 0 always means success. The meaning of other exit codes should be documented. In the image compositing example above, the exit codes might be:
0: OK, 1: IMAGE_A_HAS_NO_ALPHA_CHANNEL, 2: INCOMPATIBLE_IMAGE_DIMENSIONS, 3: INVALID_IMAGE_FORMAT, 4: IMAGE_DOES_NOT_EXIST, 5: CANNOT_WRITE_OUTPUT, 6: CANNOT_OVERWRITE_INPUTS, etc.
- The tool must be idempotent. Running the tool more than once with the same inputs should always produce the same outputs. You should assume the tool will be run more than once due to retries. The tool can never overwrite the inputs! And it should always overwrite the outputs — no failing on a rerun just because the output file already exists.
- The tool should be as dumb as possible. It should never try to massage invalid inputs to make them work.
- The tool must always exit. I once worked with a third-party tool which said “Press any key to exit” at the end. Please don’t do that.
- The tool should not be aware of any other running processes or frameworks. It’s up to the Render Farm Management Software to manage dependencies between processes. Wrapping a third-party command line tool (or several) into a single process is okay. Spawning threads is also okay.
- Avoid data collisions. The tools are arranged into DAG pipelines in the Render Farm Management Software. The DAGs themselves are usually parameterized so that different instances of the same DAG can run in parallel. It’s important that data in one DAG instance is completely isolated from data in another DAG instance. All you have to do is put the data for each DAG instance into a separate folder. Also, you should prevent a DAG instance with the same parameters from being created twice.
- Avoid data corruption. What if you have a batch process which somehow touches data in all of the DAG instance folders? In that case, you must stop all running DAG instances, run the batch process, and then restart the DAG instances again. Think of it as a crosswalk. The DAG instances are the cars and the batch process is the pedestrian wishing to cross the street. Bad things will happen if the pedestrian doesn’t wait for the cars to stop.
- Updating an upstream node must automatically update all downstream nodes. Never rerun a single node upstream (with different parameters) without also rerunning all downstream nodes in topological sort order. If your Render Farm Management Software doesn’t come with a “Requeue Downstream Jobs” feature, make sure to write this feature yourself.
That’s all there is to it! I can tell you from hard-earned experience that breaking any of these rules will lead to data corruption. But with practice, these rules will become second nature to you.
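Putting several of these rules together, here’s a skeleton of the hypothetical `over` tool from earlier. It’s a sketch only — I’ve simplified it to three arguments and stubbed out the actual compositing, since the shape of the tool (arguments in, exit code out, no prompts, idempotent) is what matters:

```python
#!/usr/bin/env python3
"""Skeleton of a hypothetical `over` tool: composite image A over image B."""
import argparse
import os
import sys

# Documented exit codes, as in the list above. 0 always means success.
OK, IMAGE_DOES_NOT_EXIST, CANNOT_OVERWRITE_INPUTS = 0, 4, 6


def run(path_a: str, path_b: str, out_path: str) -> int:
    if not (os.path.exists(path_a) and os.path.exists(path_b)):
        print("error: missing input image", file=sys.stderr)  # logging for humans
        return IMAGE_DOES_NOT_EXIST
    if out_path in (path_a, path_b):
        print("error: refusing to overwrite an input", file=sys.stderr)
        return CANNOT_OVERWRITE_INPUTS  # the tool can never overwrite its inputs
    # ... composite A over B here; always overwrite out_path (idempotent) ...
    with open(out_path, "wb") as f:
        f.write(b"")  # placeholder for the real image write
    return OK


if __name__ == "__main__":
    p = argparse.ArgumentParser(description="Composite image A over image B.")
    p.add_argument("pathToImageA")
    p.add_argument("pathToImageB")
    p.add_argument("pathToOutputImage")
    args = p.parse_args()  # command line arguments only -- never a prompt
    sys.exit(run(args.pathToImageA, args.pathToImageB, args.pathToOutputImage))
```

The `run` function is a pure function over paths: same inputs on disk, same outputs on disk, same exit code, every time. That’s what lets the Render Farm Management Software retry it blindly.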
Dataflow programming is my go-to decomposition technique whenever there is a stream of data flowing from one process to the next. The first thing I do when designing a pipeline is draw a Data Flow Diagram. Once I’m confident that I understand both the data and the processes involved, test data can be gathered and implementation of the processes can begin. If the data is clearly defined, then multiple developers (using potentially different languages) can work in parallel.
- Complete Maya Programming: An Extensive Guide to MEL and C++ API by David Gould. See Chapter 2: Fundamental Maya Concepts for a masterful explanation of how the Maya DAG works.
Static Typing Is Your Friend
Today I prefer statically typed languages for large programs. This wasn’t always the case. For a long time, my primary language was Python. As a former technical artist, Python was all I needed to get my job done. In fact, I loved Python so much that I shied away from learning other languages. Because, you know, I could already do everything with Python!
Everything, that is, except write a large program.
Here’s my story. In 2016 I started developing a video game in Python. For a while, I was making good progress. Then, at around 2500 lines of code, something strange happened. I hit some kind of wall. Development slowed to a crawl. Refactoring became very painful. Why? Well, as it turned out, I was supposed to be writing unit tests! Oops. So I started writing tests, and that definitely helped, but I still felt that I could be even more productive somehow. Was Python itself part of the problem?
After trying to program in many different languages for several months, I started to form an opinion:
Static typing is better for large programs than dynamic typing.
“Better” in the sense of self-documenting code, ease of refactoring, reduced cognitive load, IDE support, and performance.
I’m not suggesting it’s impossible to write a large program in a dynamic language, just that it may be more difficult. Also, I’m not suggesting that it’s impossible to write battle-tested, production-grade code in a dynamic language. Obviously, people do it all the time. I’m only talking about very large programs, thousands of lines long, where I believe static typing will help you immensely. Conversely, a statically typed language may be overkill for small programs and prototypes. Basically, the choice of language should be directly related to the size of the program.
I’m currently rewriting my game in TypeScript, and it’s going well. For me, the real difference between working in a dynamic language like Python and a statically typed language like TypeScript comes down to refactoring without fear. In a large Python program, I’m terrified of refactoring and the only way to overcome this fear is to write (and maintain) lots of tests. In TypeScript, however, I can refactor with a lot more confidence. If I make a breaking change, my IDE will light up like a Christmas tree. Tests that I write in TypeScript provide correctness guarantees on top of the static analysis done by the compiler. And this gives me the confidence to try out new ideas quickly.
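Interestingly, even Python now offers a middle path: gradual typing. Here’s a sketch of what the type hints look like — the `Player` type and `apply_damage` function are invented for illustration:

```python
# A taste of "refactor without fear" in Python itself, via type hints plus an
# external checker such as mypy.
from dataclasses import dataclass


@dataclass
class Player:
    name: str
    health: int


def apply_damage(p: Player, amount: int) -> Player:
    # Returning a new Player (rather than mutating) keeps the signature an
    # honest, checkable description of what the function does.
    return Player(p.name, max(0, p.health - amount))


hero = apply_damage(Player("hero", 100), 30)
# If this signature later changes -- say, amount becomes a float percentage --
# a type checker flags every stale call site before the program ever runs.
```

The annotations cost almost nothing to write, and they give you a fraction of the refactoring confidence I described above without leaving Python.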
- I don’t need types by Dimitri Merejkowsky. A story similar to mine.
- Ideology. Gary Bernhardt explains why you need types and tests.
Paradigms Are Micro, Not Macro
There are 3 popular programming paradigms today: object-oriented, functional, and procedural. You will hear a lot of hype about how one paradigm is better than another for writing large programs. I encourage you to remain skeptical when you hear such claims on the Internet, or worse, from your teachers!
Here’s what I believe:
- Anything that can be expressed in one paradigm can also be expressed in another.
- Some programmers find one of the paradigms most “natural” because it’s closest to the way they think. But not all programmers think the same way. Some programmers like to keep nouns (data) and verbs (functions) separate, while others prefer to group verbs under nouns. Some are comfortable with recursion, while others prefer loops. Some try to push all state to the outskirts, while others prefer evenly distributed pockets of state.
- There are no bad paradigms, only bad programmers.
- Some paradigms are arguably a better fit in certain situations than others. Functional programming for computation. Object-oriented programming for simulation. Procedural programming for automation. Therefore, using a healthy mix of paradigms is the best approach.
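The first claim is easy to demonstrate in miniature. Here’s the same toy computation — sum of the squares of the even numbers — expressed in all three styles:

```python
nums = [1, 2, 3, 4, 5, 6]

# Procedural: a loop and mutable state.
total = 0
for n in nums:
    if n % 2 == 0:
        total += n * n

# Functional: a single expression, no mutation.
functional_total = sum(n * n for n in nums if n % 2 == 0)


# Object-oriented: the verb grouped under a noun.
class SquareSummer:
    def __init__(self, numbers):
        self.numbers = numbers

    def sum_even_squares(self):
        return sum(n * n for n in self.numbers if n % 2 == 0)


oo_total = SquareSummer(nums).sum_even_squares()
```

All three produce the same answer; the choice between them is a matter of taste and fit, not capability — and none of them says anything about how to organize the thousand other files in your program.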
Honestly, even the word “paradigm” sounds inflated in the context of programming. I would demote it to something like “style”. In reality, the 3 major paradigms are just 3 different styles of programming in the small — 3 different ways to organize a single source file. Programming paradigms are micro, not macro. Therefore, as far as programming in the large is concerned, it doesn’t matter which paradigm you lean towards.
- Exercises in Programming Style by Cristina Videira Lopes. One program written in 33 different styles in Python.
- Thirteen ways of looking at a turtle. Scott Wlaschin describes 13 ways to implement a turtle graphics API in F#.
I hope these techniques will help you write large programs! If you also use these techniques or others, please let me know in the comments. I want to learn as much as I can about this subject.
If you enjoyed this post, please consider leaving some claps 👏 so that other people can find it. You may leave up to 50 claps. Thanks! 🙏