Errors, Composition & Not Going Crazy (PART 1)

Published in Floyd Programming Language, Jun 11, 2019

Self promotion: I’m researching error handling for the programming language Floyd. https://github.com/marcusz/Floyd (check it out!)

These two posts try to dissect the important parts of error handling and attempt to catalog and name the bits.

This is part 1. Here is part 2.

I realise discussions about errors often lead to heated debate. This post does not attempt to argue in any direction, just to get the facts on the table.

It’s important to compare apples and apples. Ignoring errors or recording only very little data about an error will always be more convenient than making robust code.

I’ve tried to find the unique problems that bigger, real-world applications have: applications composed of many modules from many vendors, not toy programs or theoretical examples. Then you can choose what’s important to your project.

Part 1: the mechanics of the call stack, thunking between modules and the responsibilities of different bits of code to propagate errors.

Part 2: the strategies to use for thinking about errors, the error contracts of functions, interfaces and modules, how to collect error context information and how to think about programming errors.

ROBUSTNESS AND COMPOSABILITY

In this text, a module is a chunk of code that exposes an interface: functions, their signatures and their error handling. The module can have lots of implementation code and call other modules. It can be custom or from a 3rd party. It can use an error mechanism compatible with your code or not. Think DLLs, source code packages, POSIX, the macOS API and so on.

There are two important goals: robustness and composability:

Robustness = the code gives meaningful output, never corrupts data, never gets into invalid states, never crashes.

Composability = you can assemble pre-existing and custom code together and the result has the same robustness as if you had written it all as one big module.

Being able to compose modules into an application is critical to making big programs.

Many code bases have error policies that are less robust than this. It’s not uncommon to simplify by ignoring out-of-memory conditions. I don’t make those kinds of assumptions in this post. Code that simplifies its error handling based on its current use is less composable and less reusable because of those assumptions.

There is also a difference between making an application and a module. Applications are closed solutions to a specific problem. When you design an application you know everything about its context and you know the absolute facts about which parts are used for what. You can take shortcuts.

Modules aren’t like that. Modules are like a software integrated circuit and should be their own thing. Modules can be used by many clients and cannot make any assumptions on behalf of the application.

Error handling is like +5 volt and ground on a circuitboard — it’s basic infrastructure that needs to be in place for all the components to work together.

This text is about making precise error handling where the finished applications are expected to be very robust.

AN EXAMPLE ERROR CALL STACK

Errors are often detected deep down in the call stack and need to be propagated upwards. The errors propagate up through library code from different vendors, via different types of callbacks and interfaces, via OS calls and via your own modules. Along the way the errors may be transported as native errors, via OS error codes, via exceptions and via other error handling mechanisms.

EXAMPLE: SNAPSHOT OF CALL STACK WHEN AN ERROR IS DETECTED

The error discovered at the bottom of the call stack (picture above) needs to be passed upwards. You want the information about the error to be kept (or improved) while it travels upwards.

Green = your code, yellow = code that uses the same runtime / language as your code, red = code using some other language / runtime.

MORE DETAILED CALL STACK

The picture below has more details: for each stack frame you can see which sort of error value is used and where conversions (thunking) are needed. The green upward arrows show where some error information needs to travel in a side channel to get past a spot.

EXAMPLE: SNAPSHOT OF CALL STACK WHEN AN ERROR IS DETECTED — DETAILS

ERROR VALUES: NORMALISED VS ALIEN

Error values = how you encode an error into bytes or data types.

Normalised error values = the ones you use in the module you are in control of.

Alien error values = an incompatible way to represent errors, used by some other module. Examples: error codes from macOS, jpeglib or POSIX, or something like C++ exceptions.

We need to be able to convert (=thunk) back and forth between alien error values and normalised error values. This is often a weak point in many code bases: the thunking code is often written at each call site and often loses most error information, like this:

int error = read_file();
if(error == no_err){
    …
}
// Only success vs. failure is checked; the specific error is dropped.

The ideal is to have non-lossy round-tripping of error values between normalised and alien errors.

The error values must make sense for each module too, not just be cast to the correct data type. Translating them often introduces small translation errors.

We need a strategy for robustly thunking normalised errors to/from alien errors.
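One possible shape for such a strategy is sketched below. It assumes POSIX errno values as the alien format and a hypothetical normalised type norm_error; none of these names come from Floyd or any real library. The conversion is written once instead of at each call site, and the normalised value keeps the original alien code so that a round trip does not lose it.

```cpp
#include <cerrno>
#include <string>

// Hypothetical normalised error kinds, for illustration only.
enum class error_kind { none, not_found, access_denied, out_of_memory, unknown };

// Hypothetical normalised error value. It remembers the alien code it came
// from, so thunking back to errno does not lose information.
struct norm_error {
    error_kind kind = error_kind::none;
    int alien_errno = 0;     // original POSIX code, 0 if there was none
    std::string context;     // free-form description of what was attempted
};

// Importer: alien (errno) -> normalised.
norm_error import_errno(int code, const std::string& context) {
    error_kind kind = error_kind::unknown;
    switch (code) {
        case 0:      kind = error_kind::none; break;
        case ENOENT: kind = error_kind::not_found; break;
        case EACCES: kind = error_kind::access_denied; break;
        case ENOMEM: kind = error_kind::out_of_memory; break;
        default:     break;  // unmapped codes stay "unknown" but keep alien_errno
    }
    return norm_error{kind, code, context};
}

// Exporter: normalised -> alien (errno).
int export_errno(const norm_error& e) {
    if (e.alien_errno != 0) return e.alien_errno;   // lossless path
    switch (e.kind) {
        case error_kind::none:          return 0;
        case error_kind::not_found:     return ENOENT;
        case error_kind::access_denied: return EACCES;
        case error_kind::out_of_memory: return ENOMEM;
        case error_kind::unknown:       return EIO;  // assumption: generic fallback
    }
    return EIO;
}
```

With this sketch, import_errno(EPIPE, "write socket") yields kind unknown but still exports as EPIPE, which is the round-tripping property described above. A real importer would need a much larger mapping table.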

INSIDE AN ERROR VALUE

There is a bunch of information that might be relevant to store in an error value:

What was the action that failed? Read, allocate? read-jpeg, read-png, read-directory-record, read-file, read-socket, unsupported-format, unsupported-filetype, corrupt-format, corrupt-jpeg-format, unsupported-jpeg-format.
What was the subject / object we failed to operate on? This could be “file” or maybe even the full path of the file we tried to operate on.
Which module discovered the error?
Which function failed?
Default error string.
Has the error been refined on its way up the call stack?
The source file & line where the error was detected.
A call stack for where the error was detected.
Extra data: maybe include a copy of the JSON you failed to parse.

Some of this data is useful for the code handling the error. Some of the data may be useful for logging or debugging purposes.

There needs to be a clean way to describe normalised errors that makes sense for all the code.
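To make the list above concrete, here is a rough sketch of how those fields could be laid out as a data type. All field and type names are hypothetical and chosen only to mirror the list; a real design would likely use structured types rather than plain strings for the action and subject.

```cpp
#include <string>
#include <vector>

// Hypothetical source position of the detection site.
struct source_location {
    std::string file;
    int line = 0;
};

// Hypothetical layout of a normalised error value.
struct error_value {
    std::string action;      // what failed, e.g. "read-jpeg"
    std::string subject;     // what we operated on, e.g. a file path
    std::string module;      // which module discovered the error
    std::string function;    // which function failed
    std::string message;     // default error string
    bool refined = false;    // has the error been refined on its way up?
    source_location where;   // source file and line of detection
    std::vector<std::string> call_stack;  // frames at the point of detection
    std::string extra;       // e.g. the JSON that failed to parse
};

// Builds a one-line description for logging. Code that handles the error
// would branch on `action` and `subject` rather than parse this string.
std::string describe(const error_value& e) {
    return e.module + "::" + e.function + ": " + e.action + " failed on '" +
           e.subject + "' (" + e.where.file + ":" +
           std::to_string(e.where.line) + ")";
}
```

The split between fields for handling code (action, subject) and fields for logging or debugging (location, call stack, extra data) follows the distinction made in the text.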

NORMALISED ERROR NAMESPACE

Normalised errors need to be organised and deduplicated neatly across all modules, including alien modules.

We don’t want several ways/tactics to define error types — that makes it hard to select and check the errors. Example of this problem:

jpeg-format-error
format-error
file-format-error
jpeg-error
jpeg-read-error
iOS-jpeg-read-error

If modules freely define their own errors in the normalised namespace they will likely create overlaps with other modules. That requires thunking and merging / splitting errors even between the normalised modules.

More about this in part 2.

WORKING TOGETHER TO PROPAGATE ERRORS

There are several types of code involved in propagating errors. Some functions contain several of these.

TYPE A - Origin: The original code where an error is discovered, an error value is created and propagation up the call stack starts.
Examples: an OS file system function can no longer find a file, a malloc() implementation can’t find a heap block to give the caller, a REST handler detects a badly formatted request.
This code is not that common; maybe 1% of your code detects errors from scratch.
TYPE B - Vanilla code: Code that gets an error from a function it calls. It needs to roll back what it has done so far, then propagate the error up the call chain.
This is how the bulk of functions in a bigger program work. This code doesn’t care about the details of the error. It’s critical that it passes all information about the error upwards!
TYPE C - Refine: Code that gets an error and improves it by adding more context or details to it before passing it up the call stack.
Example: a module that deals with preference files might turn disk-io-read-error into preference-file-bad-format.
TYPE D - Tactical recovery: Code that gets an error and does tactical error recovery, like attempting a REST request up to 3 times before finally giving up and propagating an error up its call stack.
Tactical error recovery is a high-level concept and very application-dependent. If you build this kind of code into a lower-level module you limit its composability / reuse.
TYPE E - Application endpoint: This is the border between your application and its clients. This code converts all errors into something the application’s client cares about.
Example: convert errors to HTTP error responses. Example: convert errors to alert boxes for the user.
This is top-level code that defines your end product. This does not compose. This is OK.
TYPE F - Error exporter: This code thunks all errors from normalised to a specific alien format.
Example: You implement an OS callback function that returns Windows error codes (aka alien errors), so you need to convert all errors in your function to Windows errors.
TYPE G - Error importer: This code thunks all alien errors from a module to normalised errors. This is very common when you wrap an external module, like calling OSes or C libraries. More common than type F.
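The first four roles above can be sketched in a few lines. This is only an illustration under assumptions: app_error, status and every function name are hypothetical, the bool parameter stands in for a real disk, and an empty optional is taken to mean success.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical minimal error value. An empty optional means success.
struct app_error { std::string kind; std::string detail; };
using status = std::optional<app_error>;

// TYPE A, origin: detects the problem and creates the error value.
status read_block(bool disk_ok) {
    if (!disk_ok) return app_error{"disk-io-read-error", "sector unreadable"};
    return std::nullopt;
}

// TYPE B, vanilla: passes the error up unchanged (after any rollback).
status load_file(bool disk_ok) {
    if (status s = read_block(disk_ok)) return s;
    return std::nullopt;
}

// TYPE C, refine: adds the context this layer knows and keeps the original.
status load_preferences(bool disk_ok) {
    if (status s = load_file(disk_ok)) {
        return app_error{"preference-file-read-error", s->kind + ": " + s->detail};
    }
    return std::nullopt;
}

// TYPE D, tactical recovery: retry, then propagate the last error.
// Assumes max_attempts >= 1.
status with_retry(int max_attempts, const std::function<status()>& op) {
    status last;
    for (int i = 0; i < max_attempts; ++i) {
        last = op();
        if (!last) return std::nullopt;   // succeeded, stop retrying
    }
    return last;
}
```

Note that with_retry takes the attempt count from its caller instead of hard-coding it, which keeps the policy decision with the application rather than inside a lower-level module.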

ERROR VALUE SYNTAX AND TRANSPORT

These are the normal language mechanics to create errors and propagate them around the application.

1. Error codes. An integer where each number defines a specific error. Carries no additional info beyond the integer. Normally uses up the function’s return value for the error code. The client needs to manually check each call for errors.
2. Exceptions. A separate mechanism in parallel to the function’s return value, with throw / try / catch keywords or similar. Can transport types and data. The function can return values as normal. No need to manually check errors in type B code, unless you need to do manual rollback. C++ combines this with RAII to avoid manual error handling in type B code.
3. Monads. Can transport an error type with extra data. Uses the function’s return value but still lets the function return a normal value on the happy path.
4. Error type. Similar to 1 (error codes) but the value contains more information than just an integer.
5. Global variable or thread-local variable. Like C’s errno. Similar to 1 but makes it less clear which function’s error you get.
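As a rough comparison, the sketch below writes the same tiny function three ways: with an error code, with an exception, and with a value-or-error result type built on std::variant. The function and type names are made up for this example. The variant version is only in the spirit of the monad approach; a full monad would also provide a way to chain calls.

```cpp
#include <stdexcept>
#include <string>
#include <variant>

// Error code: the return value is used up by the error, so the result has
// to travel through an out-parameter. Hypothetical code: 1 = not a digit.
int parse_digit_code(char c, int& out) {
    if (c < '0' || c > '9') return 1;
    out = c - '0';
    return 0;
}

// Exception: the return value stays free for the result.
int parse_digit_throw(char c) {
    if (c < '0' || c > '9') {
        throw std::invalid_argument(std::string("not a digit: ") + c);
    }
    return c - '0';
}

// Result type: returns either the value or an error value with extra data.
struct parse_error { std::string message; char input; };
using parse_result = std::variant<int, parse_error>;

parse_result parse_digit_result(char c) {
    if (c < '0' || c > '9') return parse_error{"not a digit", c};
    return c - '0';
}
```

The caller of parse_digit_code must remember to check the integer, the caller of parse_digit_throw can ignore the error and let it propagate, and the caller of parse_digit_result has to inspect the variant before it can reach the value.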

ERROR SIDE-CHANNEL

In the diagram the small green vertical arrows show error side channels. Only a part of the error value can be passed through the error mechanism. You can’t pass a big error value through a POSIX function that uses an integer error code. Error data needs to travel outside the error mechanism so we don’t lose parts of it.

Example: on_win32_message() cannot pack a complete normalised error value into the Windows error integer. If we want to keep the data we need a side channel for that information.

It’s sometimes possible to solve this by storing the current error value separately and passing a simplified error to the client. On the other side you recreate the error value by checking the side channel.

Plan B is to just simplify the error and lose all other error information.
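A minimal sketch of such a side channel, assuming the boundary only lets an int through and that the code on both sides runs on the same thread. The names rich_error, g_last_error and the two functions are hypothetical.

```cpp
#include <optional>
#include <string>

// Hypothetical rich error value that does not fit in an integer code.
struct rich_error { int code; std::string subject; std::string detail; };

// Side channel: a per-thread slot holding the last full error value.
thread_local std::optional<rich_error> g_last_error;

// Below the integer-only boundary: stash the full error, return only the code.
int export_through_int_boundary(const rich_error& e) {
    g_last_error = e;
    return e.code;
}

// Above the boundary: rebuild the full error if the stash matches the code.
rich_error import_from_int_boundary(int code) {
    if (g_last_error && g_last_error->code == code) {
        rich_error full = *g_last_error;
        g_last_error.reset();          // consume so a stale value is not reused
        return full;
    }
    return rich_error{code, "", ""};   // plan B: only the code survived
}
```

Matching on the code before trusting the stash is a cheap guard against picking up a leftover value from an unrelated earlier failure. It is not a complete one: two different failures with the same code would still be confused.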