These are practices that I’ve applauded myself for doing at the beginning of some projects, and kicked myself for not doing early enough in other projects.
TL;DR: Build the whole thing badly, test as you build, log everything, plan the error handling, and automate stop and start.
Build the whole thing badly.
Instead of getting a particular piece perfect and then moving on to the next piece, build the minimum necessary for each step to have a system that goes all the way through a process from the beginning to the end. The first few iterations will be ugly, but you’ll learn what features need to exist in order for the whole system to work.
Think of changing a tire: you screw in the lug nuts loosely first, in a star pattern. Once you have all the lug nuts in loosely, you tighten them a little bit at a time, going around in that star pattern, until they’re all tight. This makes sure that no single lug nut goes in much farther than the others, which would cause the tire to become misaligned and get stuck.
Do the same thing with the modules of your software: build simple versions of all of them first, then do small improvements to them one at a time, keeping them in sync with each other. This will increase the likelihood that you discover architectural problems as soon as possible and fix them before they do too much damage.
When we first built the software for running our microscopes at 3Scan (what we call the “acquisition stack”), we built features for saving the images and all the pieces of metadata we could think of, knowing that we would need to have pieces of information such as the slice thickness and image width in order to perform the data analysis that we knew would be the next stage of the business.
However! When we started building the analysis software stack a year or two later, we quickly discovered that in order to stitch together the images we’d been taking, we would also need to know the knife edge location in the images, which was not calculated anywhere, and in fact could only be entered by an operator. We ended up retrofitting the user interface to make sure we had the necessary data for future imagery.
The moral of the story is that we should have written some basic analysis code as soon as we got the images, just to make sure that we could. The more general architectural lesson is to build as much of the skeleton of the whole application as possible before gettin’ granular in any single part of the application.
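Here is a minimal sketch of what such a skeleton could look like in Python. Everything in it is a hypothetical stand-in rather than 3Scan's actual code: the stage names (`acquire`, `save`, `analyze`), the file layout, and the metadata field names are all assumptions made for illustration. The point is that even a crude `analyze` stage, written on day one, fails loudly when the upstream stages forget a field it needs.

```python
"""End-to-end skeleton: every stage exists, none is polished.

All names and metadata fields are hypothetical illustrations.
"""
import json
import tempfile
from pathlib import Path


def acquire():
    """Stand-in for the microscope: return a fake image plus metadata."""
    image = [[0, 1], [1, 0]]  # a 2x2 placeholder "image"
    metadata = {"slice_thickness_um": 5.0, "image_width_px": 2}
    return image, metadata


def save(image, metadata, out_dir):
    """Write the image and its metadata to disk as one JSON file."""
    path = Path(out_dir) / "slice_0000.json"
    path.write_text(json.dumps({"image": image, "metadata": metadata}))
    return path


def analyze(path):
    """Crude analysis stage: fail loudly if required metadata is missing."""
    record = json.loads(Path(path).read_text())
    required = ["slice_thickness_um", "image_width_px", "knife_edge_px"]
    missing = [key for key in required if key not in record["metadata"]]
    if missing:
        raise KeyError(f"analysis needs metadata fields: {missing}")
    return {"pixels": sum(len(row) for row in record["image"])}


def run_pipeline(out_dir):
    """Run every stage once, from acquisition through analysis."""
    image, metadata = acquire()
    return analyze(save(image, metadata, out_dir))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        try:
            run_pipeline(tmp)
        except KeyError as err:
            print(f"Skeleton exposed a gap: {err}")
```

Running the skeleton surfaces the missing knife edge field immediately, instead of a year or two later.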
Test as you build.
Don’t wait until the whole thing is done to write tests. If you’re going to have tests, write them as you write the module or system that will undergo the testing. In my experience, the most important tests are integration tests and tests that prove a particular bug has been fixed.
It takes some time to build a test framework that allows you to write tests for a complex system. If you build the test framework as you go, instead of waiting until the system is mature, there are several benefits:
When you discover a bug, if you already have a test framework you can quickly whip up a test that reproduces the bug. You then have an automated bug reproduction, which you can use to rapidly iterate on solutions to the bug and verify that a solution works. And once the test passes, you can keep it in the test suite and use it as a regression test.
As proponents of Test-Driven Development are fond of pointing out, designing code to be testable can lead to less coupled designs.
Finally, building the tests and the application in lock-step takes less development time than building the application first and the tests later, since the design will stay closer to testable as it changes, and the tests will stay closer to the code.
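As an illustration of the first benefit, here is a sketch of a bug reproduction that becomes a regression test. The function `parse_slice_index` and the bug it describes are invented for this example; the tests are plain functions with `assert` statements, in the style that a runner such as pytest would discover.

```python
def parse_slice_index(filename):
    """Extract the slice number from a name like 'slice_0042.tif'.

    Hypothetical function. Imagine an earlier version that used
    int(filename[6:10]), which broke on indexes longer than four digits.
    """
    stem = filename.rsplit(".", 1)[0]   # drop the file extension
    return int(stem.split("_")[-1])     # take the digits after the last '_'


def test_four_digit_index():
    assert parse_slice_index("slice_0042.tif") == 42


def test_five_digit_index_regression():
    # Reproduces the (hypothetical) bug report: the old slicing code
    # returned 1234 for this input. Keep this test in the suite so the
    # bug cannot quietly come back.
    assert parse_slice_index("slice_12345.tif") == 12345
```

You write the second test first, watch it fail against the buggy code, then iterate on the fix until it passes.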
Log everything.
Replace those soggy old print statements with a call to a logging system. That way, not only will you know that something errored out, you’ll know when it happened and which module threw the error. You’ll also be able to search backward through time to discover patterns in the system’s behavior. Many languages have built-in logging, or a well-maintained third-party logging library, so it doesn’t cost much to add a logger.
For extra points, from the beginning, create a logging module that encapsulates logging and hides how it happens. That way, when you switch how you perform logging (which we have done four times in the past two years at 3Scan), you won’t have to replace all the imports and logger calls in your application.
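A minimal sketch of such a wrapper, built on Python's standard `logging` module. The module name `applog`, the `app` namespace, and the function `get_logger` are assumptions for illustration. The idea is that this is the only file that knows which backend is in use, so switching backends means editing `_configure` and nothing else.

```python
"""applog.py: the one place in the application that configures logging.

Hypothetical module; other modules call get_logger() and never set up
handlers or formatters themselves.
"""
import logging
import sys

_configured = False


def _configure():
    """Set up the backend once. Swap this body to change backends."""
    global _configured
    if _configured:
        return
    handler = logging.StreamHandler(sys.stderr)
    # Timestamp, logger name, and level answer "when" and "which module".
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(name)s %(levelname)s %(message)s"))
    root = logging.getLogger("app")
    root.addHandler(handler)
    root.setLevel(logging.INFO)
    _configured = True


def get_logger(module_name):
    """Return a logger namespaced under 'app' for the calling module."""
    _configure()
    return logging.getLogger(f"app.{module_name}")


# Example use from some other module:
log = get_logger("acquisition")
log.info("camera connected")
log.error("failed to save image %s", "slice_0042.tif")
```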
Don’t just log message strings; track every number you can think of that could aid in characterizing the system. The cost in CPU and bandwidth is negligible for most applications, and it can convert bug-hunting from a witch-hunt guided by some combination of gut feeling and divine providence into a rigorous empirical analysis. Recently we started using DataDog (no affiliation, just a satisfied customer) to monitor our machines and log metrics and events. Within the first day of operation, we discovered that three of our servers were operating in semi-broken condition. Over the next couple of weeks, our feel for the level of operation of our systems improved dramatically as we started adding more and more metrics.
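One low-tech way to start tracking numbers, before adopting a monitoring service, is to log each measurement as a structured record. The sketch below does not use any particular vendor's API; the function names, metric names, and tags are made up for illustration.

```python
import json
import logging
import time

metrics_log = logging.getLogger("app.metrics")


def format_metric(name, value, **tags):
    """Serialize one measurement as a JSON line that is easy to parse later."""
    return json.dumps(
        {"ts": time.time(), "metric": name, "value": value, "tags": tags},
        sort_keys=True,
    )


def record_metric(name, value, **tags):
    """Emit one measurement through the normal logging pipeline."""
    metrics_log.info(format_metric(name, value, **tags))


def timed_save(image_bytes):
    """Hypothetical operation instrumented with two metrics."""
    start = time.perf_counter()
    # ... the real work of saving the image would happen here ...
    elapsed = time.perf_counter() - start
    record_metric("save.duration_s", elapsed, stage="acquisition")
    record_metric("save.size_bytes", len(image_bytes), stage="acquisition")


logging.basicConfig(level=logging.INFO)
timed_save(b"\x00" * 1024)
```

Because every record carries a timestamp, a name, and a value, the log can later be loaded into whatever analysis tool you prefer.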
Plan the error handling.
For everything that you know can go wrong, plan how that error will be handled. When you first get started, you don’t always have time to build complicated error recovery routines. So at first, error out and log the error.
What you don’t want is for your application to error out in an unexpected way when an expected error condition arises. If you have to connect to a remote server, for example, losing the connection to that server should be an expected error condition. You know it will happen (or, if you don’t know, now you know).
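For the remote server case, "error out and log the error" might look like the sketch below. The exception class `ServerUnavailableError`, the function `fetch_status`, and the one-line `STATUS` protocol are all hypothetical. The idea is that every way of losing the server surfaces as one named, logged error rather than as whatever low-level exception happened to occur.

```python
import logging
import socket

log = logging.getLogger("app.client")


class ServerUnavailableError(Exception):
    """The server could not be reached, or it dropped the connection."""


def fetch_status(host, port, timeout=2.0):
    """Ask a (hypothetical) server for one line of status text."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"STATUS\n")
            reply = conn.recv(1024)
    except OSError as err:
        # OSError covers refused, reset, and timed-out connections.
        log.error("server %s:%s unavailable: %s", host, port, err)
        raise ServerUnavailableError(f"{host}:{port}") from err
    if not reply:
        # An empty read means the peer closed the connection without replying.
        log.error("server %s:%s closed the connection early", host, port)
        raise ServerUnavailableError(f"{host}:{port}")
    return reply.decode().strip()
```

Callers can then catch `ServerUnavailableError` in one place, and later replace the bare failure with retries or other recovery behavior.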
Test that your system handles every expected error in the expected way. For example, if you know that your system will encounter an error when the internet cuts out, then write a test that emulates a dead internet. It’s not as hard to do this as it might seem: mock the server (or create a real one if it’s part of the system), then guillotine it mercilessly while it’s in the middle of something. Test that the client throws the right error. This accomplishes two goals: 1) ensuring that the system behaves as desired under those conditions, and 2) providing a solid foundation for more complex error-handling, such as various recovery behaviors.
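The guillotine test can be sketched in a few dozen lines. This example is self-contained, so it repeats a small hypothetical client (`fetch_status`, raising a made-up `ServerUnavailableError`); the server is a real local socket that reads the request and then drops the connection without answering.

```python
import socket
import threading


class ServerUnavailableError(Exception):
    """The expected error when the server goes away."""


def fetch_status(host, port, timeout=2.0):
    """Hypothetical client under test."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"STATUS\n")
            reply = conn.recv(1024)
    except OSError as err:
        raise ServerUnavailableError(f"{host}:{port}") from err
    if not reply:  # an empty read means the peer closed the connection
        raise ServerUnavailableError(f"{host}:{port}")
    return reply.decode().strip()


def start_doomed_server():
    """Start a local server that dies mid-request; return its port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    def serve_once():
        conn, _ = listener.accept()
        conn.recv(1024)   # read the request...
        conn.close()      # ...then drop the connection without replying
        listener.close()

    threading.Thread(target=serve_once, daemon=True).start()
    return port


def test_client_reports_dead_server():
    port = start_doomed_server()
    try:
        fetch_status("127.0.0.1", port)
    except ServerUnavailableError:
        return  # the expected error was raised
    raise AssertionError("client did not raise ServerUnavailableError")
```

Because the server binds to port 0, the test does not depend on any particular port being free, which keeps it independent of other tests.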
Automate stop and start.
You need to be able to start the system from scratch, and stop the system gracefully. There are common difficulties with initialization and shutdown, and the earlier you handle those difficulties, the better off you’ll be. In particular, you should be able to start and stop the system and each of its components in a fully automated fashion. This allows a test framework to create and destroy full applications and modules while keeping the tests independent of each other. It will also significantly lubricate deployment.
Initialization should be automated. For system startup, you shouldn’t have to run any manual commands to initialize the system from scratch. You shouldn’t have to remember to create three directories, create two database collections, create a mock dataset to prevent that null error, and set the system’s ulimit to 1024 in order for the software to run. Have the system create all the data and fixtures it needs, without needing a human expert to look over your shoulder and say “Oh, that’s not working because first you have to copy over this JSON file into your home directory so the package manager knows where to find the modules.”
Many systems have haphazard initialization routines. Don’t let that happen; plan out which module initializes what. Each module should know what resources it requires, and be able to initialize those resources as necessary. Otherwise you’ll end up with deployment nightmares where there are missing database entries that prevent one process from starting, and no one knows what the value of the entry should be. I can’t tell you how much more reliable our deployments have been since we adopted this attitude.
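One way to express "each module initializes its own resources" is an `initialize` method that is safe to call any number of times. The class `ImageStore`, its directory layout, and its index file are invented for this sketch.

```python
import json
import tempfile
from pathlib import Path


class ImageStore:
    """Hypothetical module that owns and creates its on-disk resources."""

    def __init__(self, root):
        self.root = Path(root)
        self.images_dir = self.root / "images"
        self.index_path = self.root / "index.json"

    def initialize(self):
        """Create everything this module needs. Safe to call repeatedly."""
        self.images_dir.mkdir(parents=True, exist_ok=True)
        if not self.index_path.exists():
            # Seed an empty index so readers never hit a missing file.
            self.index_path.write_text(json.dumps({"slices": []}))

    def slice_count(self):
        """Return the number of slices recorded in the index."""
        return len(json.loads(self.index_path.read_text())["slices"])


# Starting from an empty directory requires no manual steps.
with tempfile.TemporaryDirectory() as tmp:
    store = ImageStore(Path(tmp) / "fresh-install")
    store.initialize()
    store.initialize()  # second call finds everything in place
    print(store.slice_count())
```

Since `initialize` never overwrites existing data, the same call works for a fresh install, a restart, and a test fixture.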
Shutdown should be courteous. Close your sockets! Make sure your system cleans up after itself in all possible ways. Temporary files should be temporary; child processes should be terminated; modules should emit log messages that they are shutting down. When terminating child processes, first send them a signal that gives them a chance to close their own sockets, and force them to exit only if they ignore it. Good shutdown practice can be the difference between a module that’s infeasible to test and one whose test suite runs fast with great coverage.
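A sketch of courteous child termination using Python's `subprocess` module. The helper `stop_child` and the toy worker in `CHILD_SOURCE` are hypothetical, and the signal behavior described in the comments assumes a POSIX system (on Windows, `terminate()` ends the process immediately).

```python
import subprocess
import sys
import tempfile

# A toy worker that cleans up when asked politely (illustration only).
CHILD_SOURCE = """
import signal, sys, time
def on_sigterm(signum, frame):
    # A real worker would close sockets and flush files here.
    sys.exit(0)
signal.signal(signal.SIGTERM, on_sigterm)
print("ready", flush=True)
while True:
    time.sleep(0.1)
"""


def stop_child(proc, grace_seconds=5.0):
    """Ask a child to exit; force it only if it ignores the request."""
    if proc.poll() is not None:
        return proc.returncode      # already exited
    proc.terminate()                # SIGTERM on POSIX: the polite request
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()                 # SIGKILL on POSIX: the last resort
        return proc.wait()


def main():
    # The scratch directory and everything in it is removed on exit.
    with tempfile.TemporaryDirectory() as scratch:
        child = subprocess.Popen(
            [sys.executable, "-c", CHILD_SOURCE],
            cwd=scratch,
            stdout=subprocess.PIPE,
        )
        child.stdout.readline()     # wait until the handler is installed
        code = stop_child(child)
        child.stdout.close()
        print(f"child exited with code {code}")


if __name__ == "__main__":
    main()
```

The grace period is the design choice to tune: long enough for a healthy child to clean up, short enough that a hung child does not stall a test run.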
Many of these pieces of advice harmonize well with each other: if you can stop and start an application automatically, then it’s easier to build a test suite as you go. I also recommend thinking of the tests as another lug nut, to remind you to keep the tests up to date with the application.
The lug nut concept is the unifying concept behind all these pieces of advice: each one draws attention to a different aspect of development, so that no part of the system gets tightened too far ahead of the others.