How To Write Large Programs

Oleg Alexander
Jan 12 · 15 min read

Techniques for programming in the large.


I’m a Research and Development Engineer and a former Technical Artist. I’ve been programming professionally since 2005. In this post, I’ll tell you about my approach to writing large programs. This set of techniques is something I’ve converged on after many years of experience building large Visual Effects production pipelines.

My own arbitrary definition of “programming in the large” is any program over 1000 lines. 1000 lines isn’t very large, you might say, but it’s usually large enough to start breaking up the program into multiple source files. It’s also usually large enough that complexity starts to slow the rate of development.

We’re going to start with a few basic observations about programming languages and paradigms. Then I will show you two specific decomposition techniques suitable for large scale software development. The first is modular programming and the second is dataflow programming.

Static Typing Is Your Friend

Today I prefer statically typed languages for large programs. This wasn’t always the case. For a long time, my primary language was Python. As a technical artist, I needed nothing but Python to get my job done. In fact, I loved Python so much that I shied away from learning other languages. Because, you know, I could already do everything with Python!

Everything, that is, except write a large program.

Here’s the story. In 2016 I started working on a hobby video game project at home. Naturally, I started developing it in Python. For a while, things were going great. Then, at around 2500 lines of code, something strange happened. I hit some kind of a wall. Development slowed to a crawl. Refactoring became very painful. Why? Things were going so well and now I felt like I couldn’t move. Couldn’t breathe. Constricted even.

I decided to take a break to find out what I was doing wrong. This led me on a journey (more like an obsession) to learn about as many different programming languages as I could. Here are some of the languages I learned about: C#, F#, OCaml, Haskell, Golang, JavaScript, TypeScript, Elm, Clojure, C, C++, and Rust. I also learned about older languages like Pascal, Modula-2, and Oberon. The idea was to learn just enough about each language to understand its origins, some of its idioms, and some of its use cases. Those of you familiar with these languages, and how different they are, can probably imagine how far my mind was expanded beyond Python-land!

After trying to program in many different languages for several months, I started to form an opinion:

Static typing is better for large programs than dynamic typing.

“Better” in the sense of self-documenting code, ease of refactoring, reduced cognitive load, IDE integration, and performance.

I started thinking about languages as being analogous to building materials. Dynamic languages like JavaScript and Python are like plastic: you can do fast and cheap prototyping by 3D printing plastic parts, but you can only build relatively small structures. Statically typed languages with a garbage collector like Java and C# are like aluminum: working with aluminum is a good compromise because you can work at a reasonable pace and cost and you can build medium-sized structures. Finally, statically typed languages without a garbage collector like C++ and Rust are like steel: arguably the slowest and most expensive to work with, but the structures you build can be massive.

I’m not suggesting it’s impossible to write a large program in a dynamic language, just that it may be more difficult. Also, I’m not suggesting that it’s impossible to write battle-tested, production-grade code in a dynamic language. Obviously, people do it all the time. I’m only talking about very large programs, thousands of lines long, where I believe static typing (in addition to tests) will help you immensely. Conversely, a statically typed language may be overkill for small programs and prototypes. Basically, I’m suggesting that the choice of language should be directly related to the size of the program.

I’m currently rewriting my game in TypeScript, and it’s going well. For me, the real difference between working in a dynamic language like Python and a statically typed language like TypeScript comes down to fear of refactoring. In a large Python program, I’m terrified of refactoring. If I make a change, the only way to find out if I introduced a bug is to run the program. So I make a small change, run the program, fix the errors, make another small change, run the program, fix the errors. Did I get them all? Are my tests thorough enough? In TypeScript, I can refactor with a lot more confidence. If I make a change, even a big change, my IDE will light up like a Christmas tree. I have to run the program less often and I don’t have to write run-time type checks. A good compiler is a powerful tool. I put the compiler to work managing some of the complexity of my program, so I don’t have to. If programmers are supposed to be lazy, what can be lazier than that?

To wrap up the story, TypeScript in --strict mode has become one of my favorite languages. The other language I wish I had more time to play with is Rust. Instead of shying away from learning new languages, now I welcome the opportunity. I still use Python at work and it’s still my favorite dynamic language, but I wouldn’t start another large program in Python. I’m aware of type hinting in Python (introduced in 3.5 and extended with variable annotations in 3.6), but I haven’t had a chance to use it in production yet.
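For readers who haven't seen Python type hints, here is a minimal sketch of what they look like. The `Player` class and `apply_damage` function are hypothetical names made up for illustration, not code from my game.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Player:
    """Hypothetical game entity, used only for illustration."""
    name: str
    health: int


def apply_damage(player: Player, amount: int) -> Player:
    """Return a new Player with health reduced by amount, never below zero."""
    return Player(player.name, max(0, player.health - amount))


hero = Player("hero", 30)
print(apply_damage(hero, 10).health)

# The hints are not enforced when the program runs. A separate static
# checker (mypy is one example) is what would flag a call such as
# apply_damage(hero, "10") before the program is ever started.
# Without a checker, that call fails only when the line executes.
```

The annotations cost little to write, and with a checker in the loop they give some of the refactoring feedback described above while staying in Python.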

Paradigms Are Micro, Not Macro

There are 3 popular programming paradigms today: object-oriented, functional, and procedural. You will hear a lot of hype about how one paradigm is better than another for writing large programs. I encourage you to remain skeptical when you hear such claims on the Internet, or worse, from your teachers!

Here’s what I believe:

  • Anything that can be expressed in one paradigm can also be expressed in another.

Honestly, the word “paradigm” sounds more important than it really is in the context of programming. I would demote it to something like “style”. In reality, the 3 major paradigms are just 3 different styles of programming in the small — 3 different ways to organize a single source file. Programming paradigms are micro, not macro. Therefore, as far as programming in the large is concerned, it doesn’t matter which paradigm you lean towards.

Modular Programming

I’ll admit that for many years I didn’t really understand how to write large programs. I’ve written tons of code over the course of my career, but most of that code was either 3D artist tools or entire pipelines of self-contained scripts coupled only by data. My video game project was the first time I encountered a single large program, and for a while, I had no idea how to deal with that kind of complexity. I knew I had to somehow decompose my program into parts, but my usual go-to decomposition method, dataflow programming, just didn’t fit a game. Programming paradigms didn’t help either, because, as I said, paradigms are micro, not macro. Finally, after much research, I somehow stumbled onto modular programming. If you only learn one technique for programming in the large, this should be the one.

A typical dependency graph

Modular programming decomposes a large program into modules. A module is usually a source file which logically groups some related functionality. Modules go by many names. In Python and JavaScript, they are called modules. In Golang, they are called packages. Some languages, like C and C++ (before C++20), don’t have a module system, but modular programming is routinely done through convention. In fact, some of the best literature I found about modular programming comes from books about C and C++.

Now let’s define modular programming further:

  • A module defines a set of imports (other modules that this module depends on) and a set of exports (definitions of constants, data types, functions, and classes that are the public interface of this module). A module can also have private definitions which it doesn’t export. In this way, modules achieve information hiding.

So now that we’ve defined modular programming, what are the rules for doing it well?

  1. No dependency cycles between modules. This design rule will prevent your code from turning into a big ball of mud. When people say their code is “modular”, they mean that their module dependencies form a directed acyclic graph (DAG). Dependency cycles are bad because they increase coupling in the graph-theoretic sense. Modules that form a dependency cycle might as well be one big module because they can’t be tested in isolation. This rule can be enforced programmatically and some languages, like Golang, forbid circular dependencies altogether.
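One way to enforce the no-cycles rule programmatically is a depth-first search over the dependency graph. The sketch below assumes the graph has already been extracted into a dictionary mapping each module name to the modules it imports. The function name `find_cycle` and the module names are made up for illustration.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of module names, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {module: WHITE for module in deps}
    path: List[str] = []

    def visit(module: str) -> Optional[List[str]]:
        color[module] = GRAY
        path.append(module)
        for dep in deps.get(module, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so we closed a loop.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[module] = BLACK
        return None

    for module in list(deps):
        if color[module] == WHITE:
            found = visit(module)
            if found:
                return found
    return None


# Hypothetical dependency graphs: module -> modules it imports.
layered = {"game": ["physics", "render"], "physics": ["math"],
           "render": ["math"], "math": []}
tangled = {"a": ["b"], "b": ["c"], "c": ["a"]}

print(find_cycle(layered))  # None
print(find_cycle(tangled))  # ['a', 'b', 'c', 'a']
```

A check like this could run in a test suite or a pre-commit hook, so a cycle is reported as soon as it is introduced.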

I’ve been using modular programming successfully both at home and at work for some time. It’s a powerful technique for structuring large programs and even large hardware systems. I can develop and test each module in a bottom-up fashion, assembling my program piece by piece. If a module interface needs to change, I can tell at a glance, by looking at the dependency graph, which downstream modules may be affected. If a module assumes too much responsibility, I can push some of the responsibility to a lower level module. If I need to rewire my dependency graph, I can see visually how I’m going to do it.
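The "which downstream modules may be affected" question can also be answered in code once the graph gets too big to eyeball. This is a minimal sketch, again assuming a dictionary that maps each module to the modules it imports. `affected_by` and the module names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def affected_by(changed: str, deps: Dict[str, List[str]]) -> Set[str]:
    """Return every module that depends, directly or indirectly, on changed."""
    # Invert the edges: for each module, who imports it?
    dependents: Dict[str, List[str]] = {}
    for module, imports in deps.items():
        for imported in imports:
            dependents.setdefault(imported, []).append(module)

    # Breadth-first walk over the inverted edges.
    seen: Set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


deps = {"game": ["physics", "render"], "physics": ["math"],
        "render": ["math"], "math": []}

print(sorted(affected_by("math", deps)))    # ['game', 'physics', 'render']
print(sorted(affected_by("render", deps)))  # ['game']
```

Low-level modules such as `math` here affect almost everything, which is one reason to keep their interfaces small and stable.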

What’s really weird is how long it took me to discover modular programming. I knew about Python modules, of course. But I never really used Python modules beyond the simple case of grouping utility functions together. It never occurred to me that I could repeat this grouping process recursively. Only when I went back in time to understand the history of programming did I discover “deep” modular programming in languages like Modula-2, Oberon, and C. From what I can tell, modular programming was popular during the late 70s, 80s, and early 90s and that’s where you’ll find most of the literature. Then object-oriented programming came to dominate the programming world and it seems modular programming concepts were largely forgotten about for a period of 25 years. This is a shame because object-oriented programming and modular programming are not mutually exclusive! (Information hiding is the only overlap.) In my view, modular programming subsumes all the paradigms.

Today modular programming is being rediscovered, with nearly every language supporting it or adding official support for it (ES6 modules, Java 9 modules). In Golang, which was inspired by C and Oberon, modules (packages) are pretty much the only way to structure your code. And of course, functional languages like OCaml and F# have always had modules.

Unfortunately, some of the deeper graph-theoretic concepts I listed above are rarely discussed anymore. Modern modular programming literature seems sparse, as if everyone is expected to understand these concepts from birth. Does anyone else think modular programming deserves more attention?

Dataflow Programming

While modular programming can help you build a single large program, dataflow programming can help you build a large pipeline of many interconnected programs. As a former technical artist, I hold dataflow programming very near and dear to my heart. Over 50% of all the code I’ve ever written could fit under this category. Dataflow programming is very common in the VFX, 3D Animation, and Video Game industries. In these industries, you can’t throw a rock (and I mean literally) without hitting a 3D artist working in some kind of dataflow program. Popular dataflow programs are Maya, Nuke, Substance Designer, and Houdini. These programs are often called “node-based”, “non-destructive”, or “procedural”.

A typical dataflow pipeline

Let’s define exactly what dataflow programming is:

  • In dataflow programming, the program is decomposed into black box processes called nodes. These nodes have a set of inputs and a set of outputs. Nodes transform inputs into outputs in some way. For example, in Nuke, you could load an image using a Read node, then resize that image to quarter resolution using a Reformat node, then save out the smaller image using a Write node. The original input image is never overwritten, and this is why dataflow programming is called non-destructive editing.

A simple Nuke pipeline to resize an image

  • Nodes are arranged into a “pipes and filters” pipeline, similar to a manufacturing assembly line, where the pipes carry data and the filters are process nodes. A dataflow pipeline always forms a directed acyclic graph (DAG).
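To make the node idea concrete, here is a toy sketch of the Read, Reformat, Write example as a linear pipeline of functions. This is not Nuke's API. `Image`, `reformat`, `write`, and `run_pipeline` are hypothetical names, the "image" is only metadata, and nothing touches the disk.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Image:
    """Stand-in for image data; frozen so a node cannot modify its input."""
    path: str
    width: int
    height: int


# A node is a black box: one Image in, a new Image out.
Node = Callable[[Image], Image]


def reformat(scale: float) -> Node:
    """Build a node that scales the resolution by the given factor."""
    def node(image: Image) -> Image:
        return replace(image,
                       width=int(image.width * scale),
                       height=int(image.height * scale))
    return node


def write(path: str) -> Node:
    """Build a node that records a new output path (no real file I/O here)."""
    def node(image: Image) -> Image:
        return replace(image, path=path)
    return node


def run_pipeline(source: Image, nodes: List[Node]) -> Image:
    """Pass the data through each node in order, like an assembly line."""
    result = source
    for node in nodes:
        result = node(result)
    return result


original = Image("plate.exr", 1920, 1080)
small = run_pipeline(original, [reformat(0.25), write("plate_quarter.exr")])

print(small)     # Image(path='plate_quarter.exr', width=480, height=270)
print(original)  # Image(path='plate.exr', width=1920, height=1080)
```

A real pipeline is a DAG with branches and merges rather than a straight line, but the contract for each node is the same: read the inputs, produce new outputs, and leave the inputs alone.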

Dataflow programming extends beyond a desktop application runtime like Maya. Bigger runtimes, called Render Farm Management Software, exist to orchestrate massive distributed pipelines of renderers and other command line tools. (If you’re more comfortable with Web tech than VFX tech, check out Apache Airflow.) And these are the kinds of pipelines that I’m often tasked with designing and writing.

But how do you write a command line tool which can be plugged into a dataflow pipeline? What are the rules for writing such a tool well? Here’s a list of rules I follow:

  1. The tool should not have an interactive prompt. The tool cannot be interactive because the Render Farm Management Software, which will run the tool, is an automated process. Therefore, the tool can only accept command line arguments.

That’s all there is to it! I can tell you from hard-earned experience that breaking any of these rules will lead to data corruption. But with practice, these rules will become second nature to you.
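The non-interactive rule can be sketched with Python's standard `argparse` module. The tool itself (an image resizer) and its flags `--input`, `--output`, and `--scale` are assumptions made up for this example, and the actual resizing is left out.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Everything the tool needs arrives as command line arguments."""
    parser = argparse.ArgumentParser(
        description="Hypothetical resize tool for a dataflow pipeline.")
    parser.add_argument("--input", required=True, help="image to read")
    parser.add_argument("--output", required=True, help="image to write")
    parser.add_argument("--scale", type=float, default=0.25,
                        help="resolution multiplier (default: 0.25)")
    return parser.parse_args(argv)


def main(argv: Optional[List[str]] = None) -> int:
    """Run once and return an exit code; never call input() or wait on a user."""
    args = parse_args(argv)
    # Placeholder for the real work.
    print(f"resize {args.input} -> {args.output} at scale {args.scale}")
    return 0


# A real tool would end with: sys.exit(main())
main(["--input", "plate.exr", "--output", "plate_quarter.exr"])
```

If a required argument is missing, `argparse` prints a usage message and exits with a non-zero status instead of asking for the value, which is the behavior an automated scheduler needs.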

Dataflow programming is my go-to decomposition technique whenever there is a stream of data flowing from one process to the next. The first thing I do when designing a pipeline is draw a Data Flow Diagram. Once I’m confident that I understand both the data and the processes involved, test data can be gathered and implementation of the processes can begin. If the data is clearly defined, then multiple developers (using potentially different languages) can work in parallel.

Parting Words

I hope these techniques will help you write large programs!


Written by Oleg Alexander, R&D Engineer at Lowe’s Innovation Labs, Seattle. http://olegalexander.com/