Hobgoblins, Data Science code structure, and Mental Load

Robert Nicholls
Feb 23, 2024

Many will have heard the phrase “A Foolish Consistency is the Hobgoblin of Little Minds” (a section title in PEP 8, which borrows the line from Ralph Waldo Emerson). Recently, I have happily noticed my team referring to “hobgoblins” in our code base. It occurred to me that to identify a hobgoblin, or “a foolish consistency”, is to identify a situation where it makes sense to break with convention. In order to break the rules you have to know them extremely well, hence my happiness.

Learn the rules

I have always emphasised to my teams the importance of knowing these rules back to front. They must have an obsessive relationship with the neatness and standards of their code. One of the most concerning things I sometimes hear is “we are Data Scientists, not Software Engineers, we don’t need to have standards this high.” Honestly, apart from just displaying a worrying lack of professional pride, this reveals a fundamental misunderstanding of the purposes of code standards. As a Data Scientist / MLE, does your code not:

  1. Need to be reviewed and maintained?
  2. Run in business critical production systems?
  3. Have significant financial impact when it fails or misbehaves?

If the answer to any of these questions is “yes” then you have at least as much reason to want to obsess over the standards of your code as a Software Engineer does.

The other thing I sometimes hear is some variant on “it is slowing me down”. This objection is perhaps an even more serious misunderstanding of code and the role coding plays in businesses. No large-scale project has ever been slowed down by writing code more carefully, more neatly, and in a more standardised way. The reason lies in another of the great concepts in coding, to borrow another phrase, this time from Brian Kernighan:

“…debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?”

There is also the fairly well-known observation that people spend substantially more time reading code, including their own, than they do writing it. When you gain time in the process of writing code, you lose time in the process of reading or maintaining it, and this is an extremely imbalanced trade: I would be willing to bet that every minute gained while writing costs at least 5 minutes for the eventual reader / maintainer / debugger of that code. So it is a false economy in the extreme to save yourself time by writing code quickly.

Mental Load

All of this really comes down to the concept of “Mental Load” (sometimes “Cognitive Load”). There are far too many coding standards to list them all here, and it isn’t the point of this article just to go through them all one by one, but every one of them exists primarily to reduce Mental Load for the reader of the code (and to some extent also the writer). If you do not think about Mental Load pretty much constantly when writing code then you will not write good code. It’s also a useful concept to understand if you want to live a more productive life, but that’s a conversation for another day.

To demonstrate the point, let’s look at what people sometimes call the “Single Responsibility Principle”, which is a stupid name, and should more appropriately be called the “Does one thing rule”. If you can’t work out why the second name is better than the first, you really need to understand Mental Load.

We had a function that looked something like this (I have left a few things out just for simplicity):

def save_file_to_blob(
    file: Any,
    path: str,
    blob_connection: BlobConnectionObject,
    archive: bool = True
):
    """Insert properly formatted docstring (very important)"""
    container = blob_connection.get_container(path)
    if archive:
        latest_file = get_latest_file_from_blob(blob_connection)
        container.move(latest_file, "/archive/")
    container.upload_file(file)
    # insert appropriate logging and exception handling

As a result the runner which invoked this function looked like:

def runner(credentials):
    blob_connection = BlobConnectionObject(credentials)
    outputs = train_model()
    save_file_to_blob(outputs, "our/path", blob_connection)

This function violates our “Does one thing rule” and buries a secret I/O operation one layer deep without making it obvious. Debugging this requires unnecessary Mental Load: I cannot tell from the surface-level runner function that an archiving operation is taking place, because it is hidden behind a boolean parameter whose default value is never passed explicitly when the function is invoked.

By simply removing the archiving functionality from save_file_to_blob() and creating a function which performs that operation explicitly, our runner becomes:

def runner(credentials):
    blob_connection = BlobConnectionObject(credentials)
    outputs = train_model()
    archive_previous_outputs("our/path", blob_connection)
    save_file_to_blob(outputs, "our/path", blob_connection)

And our save_file_to_blob() function gets much simpler:

def save_file_to_blob(
    file: Any,
    path: str,
    blob_connection: BlobConnectionObject,
):
    """Insert properly formatted docstring (very important)"""
    container = blob_connection.get_container(path)
    container.upload_file(file)
    # insert appropriate logging and exception handling

We must add a new function for archiving, but now each individual unit is more testable, and when something goes wrong it is obvious where we need to go to fix it. Our runner now tells us everything it does, rather than hiding some information from us, and overall our code exerts less Mental Load to maintain and understand.
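
The archiving function itself is not shown here, so the following is only a sketch of what it and its tests might look like, not our production code. InMemoryContainer and this BlobConnectionObject are hypothetical in-memory stand-ins for the real blob client, and the guard for an empty container is an assumption about how a first run should behave.

```python
from typing import Any


class InMemoryContainer:
    """Hypothetical stand-in for a blob container (not a real SDK class)."""

    def __init__(self) -> None:
        self.live: list = []   # files currently at the live location
        self.moved: dict = {}  # destination -> files moved there

    def upload_file(self, file: Any) -> None:
        self.live.append(file)

    def move(self, file: Any, destination: str) -> None:
        self.live.remove(file)
        self.moved.setdefault(destination, []).append(file)


class BlobConnectionObject:
    """Hypothetical stand-in exposing only get_container(), as used above."""

    def __init__(self, credentials: Any = None) -> None:
        self._containers: dict = {}

    def get_container(self, path: str) -> InMemoryContainer:
        return self._containers.setdefault(path, InMemoryContainer())


def archive_previous_outputs(path: str, blob_connection: BlobConnectionObject) -> None:
    """Move the most recent output at `path`, if any, to the archive."""
    container = blob_connection.get_container(path)
    if container.live:  # assumption: nothing to archive on the first run
        container.move(container.live[-1], "/archive/")


def save_file_to_blob(file: Any, path: str, blob_connection: BlobConnectionObject) -> None:
    """Upload `file` to the container at `path`, and do nothing else."""
    container = blob_connection.get_container(path)
    container.upload_file(file)


def test_save_only_uploads() -> None:
    connection = BlobConnectionObject()
    save_file_to_blob("model_v1", "our/path", connection)
    container = connection.get_container("our/path")
    assert container.live == ["model_v1"]
    assert container.moved == {}  # saving never archives anything


def test_archive_moves_latest_output() -> None:
    connection = BlobConnectionObject()
    save_file_to_blob("model_v1", "our/path", connection)
    archive_previous_outputs("our/path", connection)
    container = connection.get_container("our/path")
    assert container.live == []
    assert container.moved == {"/archive/": ["model_v1"]}


test_save_only_uploads()
test_archive_moves_latest_output()
```

Each test exercises exactly one function and checks exactly one effect, which is the practical payoff of the split: a failing test points straight at the unit that is broken.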

This is a real-world example that has been simplified a bit for the purposes of demonstration, but this sort of thing happens constantly in Data Science code bases and, I am sure, in the code bases of other business areas.

All of this can be applied to any other standard, such as consistent use of whitespace, formatting of docstrings, variable naming, and every other useful standard that exists.
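
As a small illustration of the same idea applied to naming and docstrings, compare two versions of a helper. Both functions and the Prediction type are invented for this example; they behave identically and differ only in how much the reader has to work out.

```python
from typing import List, NamedTuple


class Prediction(NamedTuple):
    """Hypothetical record type, invented for this illustration."""
    model_name: str
    score: float


# Terse version: the reader must reverse-engineer what d, t and x[1] mean.
def f(d, t):
    return [x for x in d if x[1] > t]


# Same behaviour, but the names, type hints and docstring carry the meaning.
def select_predictions_above(
    predictions: List[Prediction], min_score: float
) -> List[Prediction]:
    """Return the predictions whose score is strictly greater than min_score."""
    return [p for p in predictions if p.score > min_score]


predictions = [Prediction("baseline", 0.2), Prediction("candidate", 0.9)]
assert f(predictions, 0.5) == select_predictions_above(predictions, 0.5)
```

The second version takes longer to type and nothing extra to run, which is exactly the trade described earlier: a little more time writing in exchange for much less time reading.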

To understand mental load fully, and why you should seek to minimise it, is to really understand coding, and until you really understand coding you will not write good, maintainable, testable, and impactful code. Nor will you be able to identify and destroy hobgoblins. Which is ultimately the end goal.

Robert Nicholls

Head of Data Science, MLOps enjoyer, automotive industry pro, promoter of good structure in code and teams