Your software needs to be hopeless

Darko Klincharski · Published in tarmac · Aug 14, 2020

Over my years working in IT, first as a developer and now in DevOps, I’ve seen a recurring pattern across all aspects of software development that I believe should be avoided as much as possible: the pattern of introducing “hope” into your software. I see the ideal world of IT as strictly deterministic, meaning that every future state of the system is preceded by an exact change in the current state.

To “adapt” Newton’s third law in the context of software, I would rephrase it as such:

Every non-exit action should trigger an appropriate reaction.

Put in more words: for every action (event) in your system that is not a final action, the next step in the workflow should be triggered by the completing action, rather than left to hope that it will start. “Hoping” here refers to implementations where a function does short or long polling for a specific event/message/log, or simply runs on the assumption that the previous steps finished successfully.
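The difference can be sketched in a few lines of Python (all names here are illustrative, not from any real codebase): a polling consumer hopes the producer is done, while a producer that invokes the next step itself makes the reaction deterministic.

```python
import time

# "Hope" version: the consumer polls and assumes the event will show up.
def wait_for_dump_polling(is_ready, timeout_s=10, interval_s=1):
    """Poll until is_ready() returns True, or give up at the deadline."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(interval_s)
    return False  # we hoped, and it never happened

# Deterministic version: the completing action triggers the next step itself.
def produce_dump(write_dump, on_complete):
    """Run the producing step, then fire the consumer as its direct reaction."""
    dump = write_dump()
    return on_complete(dump)  # the next step is caused, not awaited
```

In the second version there is nothing to hope for: if `write_dump` fails, `on_complete` never runs, and if it succeeds, the next step runs exactly once.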

The different “hope” patterns

The “It’s my time to run” pattern.

We all have done this at some point in our IT careers. Let me start off with an example:

  • You have a B2B software that requires data pulls.
  • Your integration with the other business requires that you do a daily load of a data dump from their side into your software.
  • They have scheduled their dump to start generating at 00:00 UTC in order to capture exactly one whole day of data.
  • They have measured that creating the dump takes ~25 minutes on average.
  • You schedule a cron job to pull the dump into your environment at 01:00 UTC, which should leave plenty of time for the dump to finish.

And just like that, you have introduced hope into your system. By decoupling the two functions of creating and pulling the dump, which are inherently sequential, you are running the pull on the hope that the dump finished creating on time.
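Reduced to code, the scheduled pull looks something like this (paths and filenames are invented for illustration). Note that nothing in it verifies the dump was actually created, let alone finished:

```python
from pathlib import Path

# Hypothetical body of the 01:00 UTC cron job.
def pull_daily_dump(remote_dir):
    """Fetch today's dump, hoping it exists and is fully written."""
    dump = Path(remote_dir) / "daily_dump.csv"
    # Hope #1: the file is there. Hope #2: the producer finished writing it.
    return dump.read_bytes()  # raises FileNotFoundError when hope #1 fails
```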

Problems with this approach:

  • If the dump creation fails, the pull will still run and most likely crash because the files are missing.
  • As the business grows, the data dump will probably grow with it. If left unattended, the dump creation time will also increase, and at some point 01:00 UTC will become 01:30, then 02:00. How long are you going to keep adjusting the times?
  • If you schedule too far ahead, say 10:00 UTC, you introduce an unnecessary delay in data availability.

How to avoid it?

  • Change the integration from pull to push logic. Instead of pulling the dump, have the other side push it to your system. That way you can listen for events when a new file is added and run any processing jobs right there and then.
  • Have an API endpoint the other side can call to signal that a file is ready to be pulled.
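A minimal sketch of the push-based fix, assuming the other side uploads into a bucket that emits object-created events. The event shape below loosely mimics an S3 notification and is not an exact AWS contract:

```python
def process_dump(key):
    """Placeholder for the real ingestion job."""
    print(f"ingesting {key}")

# Handler invoked per upload event: processing is a reaction to the upload,
# not a guess about when the upload might have happened.
def handle_new_dump(event):
    """Return the keys that were processed in reaction to the event."""
    processed = []
    for record in event.get("Records", []):
        key = record["s3"]["object"]["key"]
        if key.endswith(".csv"):  # react only to finished dump files
            process_dump(key)
            processed.append(key)
    return processed
```

If no dump is ever uploaded, nothing runs and nothing crashes; if it arrives late, it is still processed the moment it lands.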

The “clock out” pattern

A big part of the day-to-day DevOps work revolves around automation and optimization. Optimization not just of performance, but of costs too. One of the ways DevOps engineers reduce their cloud bills is to turn off non-production environments/resources during certain hours of the day. An example:

  • The team works 09:00 to 17:00.
  • You schedule a Lambda to stop all EC2 instances in your DEV and TST environments at 20:00, and another one to start them again at 06:00.

This way you are hoping that everybody has finished working by 20:00, or whatever time you selected. Fixed working hours are slowly being phased out of most professions, especially IT, so you can’t really rely on them. Additionally, somebody might have started a long job at the end of the day precisely so it wouldn’t impact everyone; that job is then at risk of being cut off.

How to avoid it?

Since you can never be sure whether someone is working in the environment, consider reducing the desired capacity instead of turning it off completely. The cost savings might not be as big, but no one will be abruptly cut off from their work.
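One way to sketch this is a pure decision function you could run on a schedule, scaling from observed activity rather than the clock. The activity signal and thresholds here are invented; in practice it might be SSH sessions, recent deploys, or load balancer traffic:

```python
# Decide desired capacity from observed activity rather than the time of day.
def desired_capacity(active_sessions, baseline=4, floor=1):
    """Scale toward the floor when idle, but never down to zero."""
    if active_sessions == 0:
        return floor  # quiet environment: keep a minimal footprint
    # Busy environment: follow demand, capped at the normal baseline.
    return min(baseline, max(floor, active_sessions))
```

The result could then be fed to your autoscaling group instead of a blanket stop-all Lambda.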

The “bypass” pattern

Another way products inject hope into their systems is by adding bypasses of standard procedures. An example of this would be:

  • Your product supports API calls and charges the client per API call.
  • For testing purposes on a new integration, you add a hardcoded header value: if an API call contains it, that call is excluded from the billing calculation.

Another example is:

  • Your business grows and you acquire some other product company to integrate with yours.
  • You want to assimilate the users of the acquired company into yours, so you just migrate them without running the full registration process.

In the examples above, you have introduced hope that your “bypass” will not be discovered and/or abused by anyone.

How to avoid it?

In the first example you could make use of trial accounts: charge less, or nothing at all, for requests coming from a “trial” integration. With that approach the state of the system changes with the state of the account: once the trial expires, billing starts.
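A sketch of that state-driven billing, where the account’s own state decides the charge and there is no secret header to leak. The rate and field names are invented for illustration:

```python
from datetime import date

def charge_for_call(account, call_date, rate_cents=5):
    """Return the charge for one API call based on the account's state."""
    trial_ends = account.get("trial_ends")
    if trial_ends is not None and call_date <= trial_ends:
        return 0  # trial calls are simply not billed
    return rate_cents  # once the trial lapses, billing starts by itself
```

There is nothing to discover or abuse here: the exemption expires automatically with the account state.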

For the second scenario, the resolution is to avoid mass user migration. Re-register each user via the standard registration process already in place. Automate what you can, but follow all steps.

Conclusion

There are many more examples of how software products today include hope in their systems. The world of computers is governed by 1s and 0s (at least until quantum computers become a regular thing), so there is not really a gray area. “Hope-free” might be a better term here than “hopeless”, but either way the point is this: in real life, “hope dies last”, but in software it should be “hope dies first”.
