How To Write Large Programs

Techniques for programming in the large.


I’m a Senior Research and Development Engineer and a former Technical Artist. I’ve been programming professionally since 2005. In this post, I’ll tell you about my approach to writing large programs. This set of techniques is something I’ve converged on after many years of experience building large Visual Effects production pipelines.

My own arbitrary definition of a “large program” is any program over 1000 lines of code. 1000 lines isn’t very large, you might say, but it’s usually large enough to start breaking up the program into multiple source files. It’s also usually large enough that you start thinking about the design of your program.

How do you organize thousands of lines of code? How do you decompose a large program into reusable components? To answer these questions, I will show you two specific techniques suitable for large scale software development: modular programming and dataflow programming. Then I’ll try to make the case that static typing is a good thing and that programming paradigms are irrelevant to structuring large programs.

In 2016 I started working on a hobby video game project at home. For a while, things were going great. But as the project grew bigger, the complexity became unmanageable. Finally, it became apparent that I didn’t really understand how to organize large programs at all!

I’ve written tons of code over the course of my career, but most of that code was either 3D artist tools or entire pipelines of self-contained scripts coupled only by data. My video game project was the first time I encountered a single large program. I knew I had to somehow decompose my program into reusable parts, but my usual go-to decomposition method, dataflow programming, just didn’t fit a game. Programming paradigms didn’t help either, because, as you will see, paradigms are micro, not macro. Finally, after much research, I stumbled onto modular programming. If you only learn one technique for programming in the large, this should be the one.

A typical dependency graph

Modular programming decomposes a large program into modules. A module is usually a source file, or a small set of source files, that logically groups related code. Modules go by many names. In Python and JavaScript they are called modules. In Golang and Java they are called packages. In Dart they are called libraries. Some languages, like C and C++, don’t have a built-in module system, but modular programming is routinely done in them through convention. In fact, some of the best literature I found about modular programming comes from books about C and C++.

Now let’s define modular programming further:

  • A module defines a set of imports (other modules that it depends on) and a set of exports (definitions of variables, constants, data types, functions, and classes that form its public interface). A module can also have private definitions that it doesn’t export. In this way, modules achieve information hiding.
  • A program is a collection of modules which (ideally) form a directed acyclic graph (DAG) of dependencies. In this DAG, the nodes are modules and the edges are “uses” relationships.
  • Modules lie on a spectrum from high-level (specific) to low-level (generic). The highest level module contains the entry point of the program, whereas the lowest level modules are usually generic libraries. In a case of super confusing terminology, modules that a given module depends on are called upstream modules. Whereas modules that depend on a given module are called downstream modules. Therefore, high-level = downstream and low-level = upstream.
  • The module dependency graph is horizontal if, on average, a given module uses many other modules, and vertical if, on average, a given module uses few other modules. There is even a software metric, called the Normalized Cumulative Component Dependency (NCCD), that computes how horizontal or vertical a graph is. Horizontal graphs have lower coupling, in the graph-theoretic sense, than vertical graphs.
  • There are tools available for nearly every language to help you visualize your module dependency graph. For example, JavaScript/TypeScript has dependency-cruiser, C/C++ has cinclude2dot, Dart has lakos (shameless plug 😊). I encourage you to leverage these tools on large programs. For example, I have dependency-cruiser set up to run every time I make a build. It really helps me understand the macrostructure of my program!
  • Because of the DAG structure, module branches can be “cut off” and tested in isolation. To test a given module you only need all of its upstream dependencies. In my opinion, this ability to develop and test modules independently is the primary benefit of modular programming. If the module interfaces are clearly defined, then multiple developers can work in parallel.
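To make the “test a branch in isolation” idea concrete, here is a minimal sketch in TypeScript (the module names are hypothetical): the dependency graph is a plain map from each module to its imports, and the transitive closure of a module’s upstream dependencies is exactly what you need in order to build and test it.

```typescript
// Hypothetical module names; the graph maps each module to its imports
// (its direct upstream dependencies).
type DepGraph = Record<string, string[]>;

const graph: DepGraph = {
  app: ["game", "ui"],
  game: ["math"],
  ui: ["math"],
  math: [],
};

// Everything needed to test `module` in isolation: the module itself plus
// the transitive closure of its upstream dependencies.
function upstreamClosure(deps: DepGraph, module: string): Set<string> {
  const seen = new Set<string>();
  const visit = (m: string): void => {
    if (seen.has(m)) return;
    seen.add(m);
    for (const dep of deps[m] ?? []) visit(dep);
  };
  visit(module);
  return seen;
}

console.log([...upstreamClosure(graph, "game")].sort()); // [ 'game', 'math' ]
```

Testing the `game` branch only requires `math`; the `ui` and `app` modules never enter the picture, which is what lets multiple developers work on separate branches in parallel.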

So now that we’ve defined modular programming, what are the rules for doing it well?

  1. No dependency cycles between modules. This design rule will prevent your code from turning into a big ball of mud. When people say their code is “modular”, they usually mean that their module dependencies form a DAG. Dependency cycles are bad because they increase coupling in the graph-theoretic sense. Modules that form a dependency cycle might as well be one big module because they can’t be tested in isolation. This rule can be enforced programmatically and some languages, like Golang, forbid circular dependencies altogether.
  2. Stability increases at lower levels. A module that many other modules depend on had better have a stable interface (or at least a version number), because if that interface changes, some of the downstream modules also have to change. Because of this, low-level modules are literally more important to get right than high-level modules. Rushing decisions about low-level modules often leads to technical debt. Low-level modules should be developed (or bought off the shelf) first. In terms of risk management, the system should be built bottom-up.
  3. Reusability increases at lower levels. Low-level modules should be generic libraries so that they can be reused in other projects.
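Rule 1 can be enforced programmatically even when your language doesn’t forbid cycles outright. Here is a minimal sketch (module names are made up) that detects a dependency cycle with a depth-first search over the graph:

```typescript
// A dependency graph: each module maps to the modules it imports.
type DepGraph = Record<string, string[]>;

// Classic three-color DFS: a "gray" module is still on the current path,
// so reaching it again means we followed a cycle.
function hasCycle(graph: DepGraph): boolean {
  const GRAY = 1, BLACK = 2;
  const color: Record<string, number> = {};
  const visit = (m: string): boolean => {
    if (color[m] === GRAY) return true;   // back edge: cycle found
    if (color[m] === BLACK) return false; // already fully explored
    color[m] = GRAY;
    for (const dep of graph[m] ?? []) if (visit(dep)) return true;
    color[m] = BLACK;
    return false;
  };
  return Object.keys(graph).some((m) => visit(m));
}

console.log(hasCycle({ a: ["b"], b: ["c"], c: [] }));    // false: a DAG
console.log(hasCycle({ a: ["b"], b: ["c"], c: ["a"] })); // true: a -> b -> c -> a
```

A check like this can run in CI, which is essentially what tools like dependency-cruiser do when you configure a no-circular rule.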

I’ve been using modular programming successfully both at home and at work for some time. It’s a powerful technique for structuring large programs and even large hardware systems. I can develop and test each module in a bottom-up fashion, assembling my program piece by piece. If a module interface needs to change, I can tell at a glance, by looking at the dependency graph, which downstream modules may be affected. If a module assumes too much responsibility, I can push some of the responsibility to a lower level module. If I need to rewire my dependency graph, I can see visually how I’m going to do it.

What’s really weird is how long it took me to discover modular programming. I knew about Python modules, of course. But I never really used Python modules beyond the simple case of grouping utility functions together. It never occurred to me that I could repeat this grouping process recursively. Only when I went back in time to understand the history of programming did I discover “deep” modular programming in languages like Modula-2, Oberon, and C. From what I can tell, modular programming was popular during the late 70s, 80s, and early 90s, and that’s where you’ll find most of the literature. Then object-oriented programming came to dominate the programming world, and it seems modular programming concepts were largely forgotten for a period of about 25 years. This is a shame because object-oriented programming and modular programming are not mutually exclusive! (Information hiding is the only overlap.) In my view, modular programming is macro and subsumes all the paradigms.

Today modular programming is being rediscovered, with nearly every language supporting it or adding official support for it (for example, ES6 modules in JavaScript). In Golang, which was inspired by C and Oberon, modules (packages) are pretty much the only way to structure your code. And of course, functional languages like Haskell have always had modules.

Unfortunately, some of the deeper graph-theoretic concepts I listed above are rarely discussed anymore. Modern modular programming literature seems sparse as if everyone is expected to understand these concepts from birth. Does anyone else think modular programming deserves more attention?


While modular programming can help you build a single large program, dataflow programming can help you build a large pipeline of many interconnected programs. As a former technical artist, I hold dataflow programming very near and dear to my heart. Over half of all the code I’ve ever written fits under this category. Dataflow programming is very common in the VFX, 3D Animation, and Video Game industries. In these industries, you can’t throw a rock (and I mean literally) without hitting a 3D artist working in some kind of dataflow program. Popular dataflow programs include Maya, Nuke, Substance Designer, and Houdini. These programs are often called “node-based”, “non-destructive”, or “procedural”.

A typical dataflow pipeline

Let’s define exactly what dataflow programming is:

  • In dataflow programming, the program is decomposed into black box processes called nodes. These nodes have a set of inputs and a set of outputs. Nodes transform inputs into outputs in some way. For example, in Nuke, you could load an image using a Read node, then resize that image to quarter resolution using a Reformat node, then save out the smaller image using a Write node. The original input image is never overwritten, and this is why dataflow programming is called non-destructive editing.
A simple Nuke pipeline to resize an image
  • Nodes are arranged into a “pipes and filters” pipeline, similar to a manufacturing assembly line, where the pipes carry data and the filters are process nodes. A dataflow pipeline always forms a directed acyclic graph (DAG).
  • Nodes are executed in topological sort order from upstream to downstream. Changing any of the inputs in an upstream node automatically recomputes all downstream nodes. In this way, we say that the data is flowing through the nodes.
  • While dataflow programming and functional programming are similar, there are a few important differences. First, in dataflow programming, the structure of the DAG is specified externally in some runtime environment. Nodes aren’t aware of each other, whereas in functional programming functions can call other functions. Second, dataflow programming doesn’t allow recursion. Third, dataflow programming is typically set up for parallel execution, whereas in functional programming parallel execution is not a given.
  • Nodes usually have their own parameters. These parameters are often stored externally together with the DAG. Sometimes the parameters are computed using other nodes or expressions.
  • Nodes are coupled only by data, which makes them endlessly reconfigurable. At any moment, the artist can inspect the outputs of a specific node. This leads to a deep understanding of every step of the process. Skilled artists can build node networks of incredible complexity without ever writing a single line of code. This ability for artists to do “visual programming” is probably why dataflow programming is so attractive to them.
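The bullets above can be sketched as a toy dataflow runtime. This is an illustration under invented names, not any real engine’s API: each node is a black box that only sees its input values, the DAG wiring lives outside the nodes, and evaluation proceeds in topological sort order.

```typescript
// A toy dataflow runtime (names invented for illustration).
// A node only sees its input values, never the other nodes.
type FlowNode = { inputs: string[]; compute: (args: number[]) => number };
type Graph = Record<string, FlowNode>;

// Execute nodes in topological sort order, upstream to downstream,
// feeding each node the already-computed outputs of its inputs.
function evaluate(graph: Graph, order: string[]): Record<string, number> {
  const results: Record<string, number> = {};
  for (const name of order) {
    const node = graph[name];
    results[name] = node.compute(node.inputs.map((i) => results[i]));
  }
  return results;
}

// A pipeline shaped like the Nuke example: read an image width,
// quarter it, then "write" the result.
const graph: Graph = {
  read: { inputs: [], compute: () => 4000 },
  reformat: { inputs: ["read"], compute: ([w]) => w / 4 },
  write: { inputs: ["reformat"], compute: ([w]) => w },
};

console.log(evaluate(graph, ["read", "reformat", "write"]).write); // 1000
```

Changing the `read` node’s output and calling `evaluate` again recomputes everything downstream, which is the “data flowing through the nodes” behavior described above, and the `read` node’s original input is never touched, which is the non-destructive part.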

Dataflow programming extends beyond a desktop application runtime like Maya. Bigger runtimes, called Render Farm Management Software, exist to orchestrate massive distributed pipelines of renderers and other command line tools. (If you’re more comfortable with Web tech than VFX tech, check out Apache Airflow.) And these are the kinds of pipelines that I’m often tasked with designing and writing.

But how do you write a command line tool which can be plugged into a dataflow pipeline? What are the rules for writing such a tool well? Here’s a list of rules I follow:

  1. The tool should not have an interactive prompt. The tool cannot be interactive because the Render Farm Management Software, which will run the tool, is an automated process. Therefore, the tool can only accept command line arguments.
  2. The tool must be like a pure function. The only difference is that the data exists on disk instead of in memory. For example, if you were to write a command line tool to composite image A over image B, it could have the following specification: over <pathToImageA> <pathToImageB> <unpremultiply> <pathToOutputImage>.
  3. The tool must fail gracefully. Error handling can be done in two ways: exit codes and logging. The exit code is programmatic, while the logging is meant for humans. Exit code 0 always means success. The meaning of other exit codes should be documented. In the image compositing example above, the exit codes might be: 0: OK, 1: IMAGE_A_HAS_NO_ALPHA_CHANNEL, 2: INCOMPATIBLE_IMAGE_DIMENSIONS, 3: INVALID_IMAGE_FORMAT, 4: IMAGE_DOES_NOT_EXIST, 5: CANNOT_WRITE_OUTPUT, 6: CANNOT_OVERWRITE_INPUTS, etc.
  4. The tool must be idempotent. Running the tool more than once with the same inputs should always produce the same outputs. You should assume the tool will be run more than once due to retries. The tool can never overwrite the inputs! And it should always overwrite the outputs! None of this --overwrite flag BS.
  5. The tool should be as dumb as possible. It should never try to massage invalid inputs to make them work.
  6. The tool must always exit. I once worked with a third-party tool which said “Press any key to exit” at the end. Please don’t do that.
  7. The tool should not be aware of any other running processes or frameworks. It’s up to the Render Farm Management Software to manage dependencies between processes. Wrapping a third-party command line tool (or several) into a single process is okay. Spawning threads is also okay.
  8. Avoid data collisions. The tools are arranged into DAG pipelines in the Render Farm Management Software. The DAGs themselves are usually parameterized so that different instances of the same DAG can run in parallel. It’s important that data in one DAG instance is completely isolated from data in another DAG instance. All you have to do is put the data for each DAG instance into a separate folder. Also, you should prevent a DAG instance with the same parameters from being created twice.
  9. Avoid data corruption. What if you have a batch process which somehow touches data in all of the DAG instance folders? In that case, you must stop all running DAG instances, run the batch process, and then restart the DAG instances again. Think of it as a crosswalk. The DAG instances are the cars and the batch process is the pedestrian wishing to cross the street. Bad things will happen if the pedestrian doesn’t wait for the cars to stop.
  10. Updating an upstream node must automatically update all downstream nodes. Never rerun a single node upstream (with different parameters) without also rerunning all downstream nodes in topological sort order. If your Render Farm Management Software doesn’t come with a “Requeue Downstream Jobs” feature, make sure to write this feature yourself.
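As an illustration of rules 1 through 4, here is how the argument handling and exit codes of the hypothetical over tool from earlier might look, with the actual compositing stubbed out. The file-existence check is injected so the sketch runs without touching a real disk.

```typescript
// Hypothetical `over` tool: composite image A over image B.
// Exit codes are programmatic; 0 always means success.
const EXIT = { OK: 0, IMAGE_DOES_NOT_EXIST: 4, CANNOT_OVERWRITE_INPUTS: 6 } as const;

// Pure-function style: all inputs arrive as command line arguments
// (no interactive prompt), and the existence check is injected.
function run(args: string[], exists: (path: string) => boolean): number {
  const [imageA, imageB, unpremultiply, output] = args;
  if (!exists(imageA) || !exists(imageB)) return EXIT.IMAGE_DOES_NOT_EXIST;
  // Idempotent: never overwrite the inputs, always overwrite the output.
  if (output === imageA || output === imageB) return EXIT.CANNOT_OVERWRITE_INPUTS;
  // ... composite A over B here (honoring `unpremultiply`), write `output` ...
  return EXIT.OK;
}

const fakeExists = (path: string) => path.startsWith("/data/"); // disk stand-in
console.log(run(["/data/a.png", "/data/b.png", "1", "/data/out.png"], fakeExists)); // 0
console.log(run(["/data/a.png", "/data/b.png", "1", "/data/a.png"], fakeExists));   // 6
```

In the real tool, `main` would call `run` with `process.argv` and pass the result to `process.exit`, so the Render Farm Management Software can retry or fail the job based on the code alone.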

That’s all there is to it! I can tell you from hard-earned experience that breaking any of these rules will lead to data corruption. But with practice, these rules will become second nature to you.

Dataflow programming is my go-to decomposition technique whenever there is a stream of data flowing from one process to the next. The first thing I do when designing a pipeline is draw a Data Flow Diagram. Once I’m confident that I understand both the data and the processes involved, test data can be gathered and implementation of the processes can begin. If the data is clearly defined, then multiple developers (using potentially different languages) can work in parallel.


Today I prefer statically typed languages for large programs. This wasn’t always the case. For a long time, my primary language was Python. As a former technical artist, Python was all I needed to get my job done. In fact, I loved Python so much that I shied away from learning other languages. Because, you know, I could already do everything with Python!

Everything, that is, except write a large program.

Here’s my story. In 2016 I started developing a video game in Python. For a while, I was making good progress. Then, at around 2500 lines of code, something strange happened. I hit some kind of wall. Development slowed to a crawl. Refactoring became very painful. Why? Well, as it turned out, I was supposed to be writing unit tests! Oops. So I started writing tests, and that definitely helped, but I still felt that I could be even more productive somehow. Was Python itself part of the problem?

I decided to take a break to find out. This led me on a journey (more like an obsession) to learn about as many different programming languages as I could. Here are some of the languages I learned about: Java, Kotlin, Dart, C#, F#, OCaml, Haskell, JavaScript, TypeScript, Elm, Clojure, Golang, C, C++, and Rust. I also learned about older languages like Pascal, Modula-2, and Oberon. The idea was to learn just enough about each language to understand its origins, some of its idioms, and some of its use cases. Those of you familiar with these languages, and how different they are, can probably imagine how far my mind was stretched beyond Python-land!

After trying to program in many different languages for several months, I started to form an opinion:

Static typing is better for large programs than dynamic typing.

“Better” in the sense of self-documenting code, ease of refactoring, reduced cognitive load, IDE support, and performance.

I started thinking about languages as being analogous to materials of different tensile strengths. JavaScript and Python are like plastic. Java and C# are like aluminum. Haskell and Rust are like steel.

I’m not suggesting it’s impossible to write a large program in a dynamic language, just that it may be more difficult. Also, I’m not suggesting that it’s impossible to write battle-tested, production-grade code in a dynamic language. Obviously, people do it all the time. I’m only talking about very large programs, thousands of lines long, where I believe static typing will help you immensely. Conversely, a statically typed language may be overkill for small programs and prototypes. Basically, the choice of language should be directly related to the size of the program.

I’m currently rewriting my game in TypeScript, and it’s going well. For me, the real difference between working in a dynamic language like Python and a statically typed language like TypeScript comes down to refactoring without fear. In a large Python program, I’m terrified of refactoring and the only way to overcome this fear is to write (and maintain) lots of tests. In TypeScript, however, I can refactor with a lot more confidence. If I make a breaking change, my IDE will light up like a Christmas tree. Tests that I write in TypeScript provide correctness guarantees on top of the static analysis done by the compiler. And this gives me the confidence to try out new ideas quickly.
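Here is a small sketch of what refactoring without fear looks like in TypeScript (the game types are invented for illustration). With a discriminated union and a `never`-based exhaustiveness check, adding a new event kind makes the compiler point at every switch that doesn’t yet handle it:

```typescript
// A discriminated union of game events (hypothetical types).
type GameEvent =
  | { kind: "move"; dx: number; dy: number }
  | { kind: "attack"; damage: number };

function describe(event: GameEvent): string {
  switch (event.kind) {
    case "move":
      return `move by (${event.dx}, ${event.dy})`;
    case "attack":
      return `attack for ${event.damage}`;
    default: {
      // If a new kind is ever added to GameEvent, this assignment stops
      // compiling, flagging every switch that needs updating.
      const unreachable: never = event;
      return unreachable;
    }
  }
}

console.log(describe({ kind: "attack", damage: 7 })); // "attack for 7"
```

In a dynamic language, the equivalent refactoring mistake would surface at runtime, usually in whichever code path your tests happened to miss.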


There are 3 popular programming paradigms today: object-oriented, functional, and procedural. You will hear a lot of hype about how one paradigm is better than another for writing large programs. I encourage you to remain skeptical when you hear such claims on the Internet, or worse, from your teachers!

Here’s what I believe:

  • Anything that can be expressed in one paradigm can also be expressed in another.
  • Some programmers find one of the paradigms most “natural” because it’s closest to the way they think. But not all programmers think the same way. Some programmers like to keep nouns (data) and verbs (functions) separate, while others prefer to group verbs under nouns. Some are comfortable with recursion, while others prefer loops. Some try to push all state to the outskirts, while others prefer evenly distributed pockets of state.
  • There are no bad paradigms, only bad programmers.
  • Some paradigms are arguably a better fit in certain situations than others. Functional programming for computation. Object-oriented programming for simulation. Procedural programming for automation. Therefore, using a healthy mix of paradigms is the best approach.

Honestly, even the word “paradigm” sounds inflated in the context of programming. I would demote it to something like “style”. In reality, the 3 major paradigms are just 3 different styles of programming in the small — 3 different ways to organize a single source file. Programming paradigms are micro, not macro. Therefore, as far as programming in the large is concerned, it doesn’t matter which paradigm you lean towards.


I hope these techniques will help you write large programs! If you also use these techniques or others, please let me know in the comments. I want to learn as much as I can about this subject.


Sr. R&D Engineer at Lowe’s Innovation Labs, Seattle.
