Stories by Yair Lenga on Medium

A Streaming JSON Formatter That Works With Existing Serializers

Yair Lenga — Wed, 20 May 2026 20:46:00 GMT

A Python streaming post-filter for compact, human-readable JSON with configurable formatting behavior.

Built-in JSON serializers give us two choices:

The default output is built for machines and optimized for efficiency. It is compact, without any extra whitespace. While technically “text”, it feels “binary” — a dense wall of brackets, quotes, commas, and braces that is painful to inspect.

To solve this problem, many serializers provide a pretty-print mode, which adds indentation, spacing around tokens, and line breaks to make the output readable for humans. The problem is that for large documents it often goes too far: a small array of numbers becomes ten lines, a tiny metadata object becomes a block, and deep structures become readable only by making the file much longer.

That extra formatting is not free. It makes logs larger, diffs noisier, terminal output harder to scan, and requires “speed-scrolling” to review the data sets.

What I wanted was a middle ground: JSON that keeps the shape of pretty-printed output, but folds small, simple structures back onto one line when they fit.

This article describes jsonfold, a tool for compacting pretty-printed JSON to make it more readable for humans. The level of compactness is controlled by parameters, and a few preset configurations produce good output with minimal effort.

jsonfold in 2 minutes

Get fine control over the pretty-print JSON output. Keep it machine-readable and human-friendly.

Minimal Usage

Download jsonfold.py from the GitHub project, then:

import jsonfold
import sys
data = {
    "meta": {"version": 1, "ok": True},
    "ids": [1, 2, 3, 4, 5],
    "items": [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}],
}
# compact can be: default, low, med, high, max
jsonfold.dump(data, sys.stdout, compact="default")

GitHub Project

Repository: https://github.com/yairlenga/jsonfold

The Python implementation is under the python directory.

Different levels of compaction

// compact=low
{
  "a": {
    "b": { "c": "abc" }
  },
  "x": {
    "y": { "z": "xyz" }
  }
}

// compact=default
{
  "a": { "b": { "c": "abc" } },
  "x": { "y": { "z": "xyz" } }
}

// compact=max
{ "a": { "b": { "c": "abc" } }, "x": { "y": { "z": "xyz" } } }

jsonfold on real data

The measurements below use the GeoJSON file “admin 1 states provinces shp” from geojson.xyz. You can view the actual output for each setting:

Baseline — no formatting: 130K, 1 line, 130,429 columns, 0% overhead
Pretty-printed, indent=2: 285K, 11,731 lines, 79 columns, 120% overhead
jsonfold compact=low: 167K, 2,344 lines, 120 columns, 28% overhead
jsonfold compact=default: 167K, 2,344 lines, 120 columns, 28% overhead
jsonfold compact=high: 166K, 2,239 lines, 120 columns, 27% overhead
jsonfold compact=max: 156K, 1,321 lines, 255 columns, 20% overhead
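
The columns above can be reproduced for any document with a few lines of Python. The helper below is only a sketch: `measure` is a hypothetical name (it is not part of jsonfold), sizes are counted in characters rather than bytes, and overhead is assumed to mean growth relative to the unformatted baseline.

```python
import json

def measure(text: str, baseline_size: int) -> dict:
    """Return size, line count, widest line and overhead versus the baseline."""
    lines = text.splitlines() or [""]
    return {
        "size": len(text),
        "lines": len(lines),
        "columns": max(len(line) for line in lines),
        "overhead_pct": round(100 * (len(text) - baseline_size) / baseline_size),
    }

data = {"ids": [1, 2, 3], "meta": {"ok": True}}
baseline = json.dumps(data, separators=(",", ":"))  # no extra whitespace
pretty = json.dumps(data, indent=2)                 # standard pretty-print

print(measure(baseline, len(baseline)))
print(measure(pretty, len(baseline)))
```

Running the same function on the output of each compaction level gives a quick way to compare settings on your own data.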

Key ideas

Do not replace the serializer

For my implementation, I chose NOT to build another serializer. There are already many good serializers available in multiple languages. Many of them support custom transformations, special handling for application classes, non-standard numeric values, date/time objects, and other data types. For example:

The Python json module (JSON encoder and decoder) provides options for custom data types, NaN handling, custom encoders, and more.
JavaScript’s JSON.stringify() supports a replacer function that can alter the stringification process.
Java's Jackson ObjectMapper can perform complex transformations of POJOs based on annotations, introspection, and templates.

Wrap the output stream

Instead of replacing or extending the serializer, jsonfold acts as a filter between the serializer and the final output stream.

The serializer still does what it already knows how to do: convert application data into valid JSON text. jsonfold only looks at the generated pretty-printed output and decides which parts can be safely folded back onto one line.

This keeps all the existing serializer functionality and customization in place.
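
To make the filter idea concrete, here is a minimal sketch of a file-like wrapper that sits between json.dump and the real output stream. It is not the jsonfold implementation: `LineFilterWriter` and its `transform` hook are hypothetical names, and the default transform passes every line through unchanged. It only shows the shape of the approach: the serializer writes chunks, the wrapper reassembles complete lines, and each line can be inspected before it is forwarded.

```python
import io
import json

class LineFilterWriter:
    """Hypothetical pass-through filter: collects chunks, emits whole lines."""

    def __init__(self, fp, transform=lambda line: line):
        self._fp = fp
        self._transform = transform
        self._pending = ""

    def write(self, chunk: str) -> int:
        self._pending += chunk
        # Forward every complete line; keep the unfinished tail buffered.
        *complete, self._pending = self._pending.split("\n")
        for line in complete:
            self._fp.write(self._transform(line) + "\n")
        return len(chunk)

    def close(self) -> None:
        # Flush the last line, which has no trailing newline.
        if self._pending:
            self._fp.write(self._transform(self._pending))
            self._pending = ""

out = io.StringIO()
writer = LineFilterWriter(out)
json.dump({"a": [1, 2], "b": {"c": None}}, writer, indent=2)
writer.close()
print(out.getvalue())
```

Because the serializer only needs an object with a write() method, all of its options (custom encoders, NaN handling, key sorting) keep working without change.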

Operate on pretty-printed token stream

jsonfold does not reparse the JSON or reconstruct the data objects. Instead, it operates directly on the pretty-printed token stream generated by the serializer. It relies on a few assumptions:

The input is valid JSON.
Each input line represents an atomic JSON fragment that can be safely moved or merged as a unit, and should NOT be split or reformatted.
Indentation provides structural clues about the relationship between elements.

If any of these assumptions is violated, jsonfold falls back to "raw" mode, where the data is passed through unchanged, without attempting any unsafe transformation.

The three phases

The compaction is done in three logical phases: pack, fold, and join. Each one performs a specific transformation that makes the JSON easier to read (by removing whitespace) without changing the data itself. Separating the process into phases makes the implementation simpler and more predictable. Each phase operates on progressively more compact structures while preserving the original JSON semantics. All three phases are incremental and process the stream as data becomes available.

Pack

The pack phase merges scalar items inside containers (arrays and objects). It packs array items and object properties that belong to the same container onto a single line, subject to a width limit and item-count caps. Basically:

// From            To:
[                  [
    "1",
    "2",      ->      "1", "2", "3"
    "3"
]                  ]

{                  {
    "a": 1,
    "b": 2,   ->      "a": 1, "b": 2, "c": 3
    "c": 3
}                  }

Example:

{
    "summary": {
        "source": "wikipedia"
    },
    "meta": {
        "generated": "2026-03-13"
    },
    "by_land": [
        "RUS",
        // 4 more entries
        "AUS"
    ],
    "by_population": [
        "IND",
        "CHN",
        // 16 Additional entries
        "DEU",
        "TZA"
    ],
    "name": {
        "RUS": "Russia",
        "CAN": "Canada",
        "CHN": "China",
        "USA": "United States",
        "BRA": "Brazil",
        "AUS": "Australia"
    }
}

to:

{
  "summary": {
    "source": "wikipedia"
  },
  "meta": {
    "generated": "2026-03-13"
  },
  "by_land": [
    "RUS", "CAN", "CHN", "USA", "BRA", "AUS"
  ],
  "by_population": [
    "IND", "CHN", "USA", "IDN", "PAK", "NGA", "BRA", "BGD", "RUS",
    "ETH", "MEX", "JPN", "EGY", "PHL", "COD", "VNM", "IRN", "TUR",
    "DEU", "TZA"
  ],
  "name": {
    "RUS": "Russia", "CAN": "Canada", "CHN": "China",
    "USA": "United States", "BRA": "Brazil", "AUS": "Australia"
  }
}

More technically: The first phase looks for lines with the same indentation level, and will merge consecutive lines in such a way that it will respect the user provided line width. In addition, it is possible to cap the count of lines that will be packed for arrays and for objects.
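
The pack rule can be sketched on a list of pretty-printed lines. This is a simplified illustration, not jsonfold's code: `pack`, `is_item`, and the `width` default are assumptions made for the example, the per-container item caps are left out, and the whole input is held in a list instead of being streamed.

```python
import json

def is_item(line: str) -> bool:
    """True for a line that is neither a container opener nor a closer."""
    body = line.strip()
    return bool(body) and body[0] not in "}]" and body[-1] not in "{["

def indent_of(line: str) -> int:
    return len(line) - len(line.lstrip())

def pack(lines: list[str], width: int = 40) -> list[str]:
    """Merge consecutive item lines with equal indentation up to `width`."""
    out: list[str] = []
    for line in lines:
        prev = out[-1] if out else ""
        if (is_item(prev) and is_item(line)
                and indent_of(prev) == indent_of(line)
                and len(prev) + 1 + len(line.strip()) <= width):
            out[-1] = f"{prev} {line.strip()}"   # append to the current row
        else:
            out.append(line)                     # start a new row
    return out

data = {"by_land": ["RUS", "CAN", "CHN", "USA", "BRA", "AUS"],
        "source": "wikipedia"}
packed = pack(json.dumps(data, indent=2).splitlines(), width=30)
print("\n".join(packed))
```

With the narrow width of 30 used here, the six codes wrap onto two rows, and the opener and closer lines of the array are left untouched for the next phase.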

Fold

The fold phase merges a container that has only one line of items with its opener and closer ([ and ] for arrays, { and } for objects), subject to limits on width, nesting level, and item count. Basically:

// List Folding: From 3 lines 
[
    "1", "2", "3"
]
//     To: single line
[ "1", "2", "3" ]

// Object Folding: From 3 lines:
{
    "a": 1, "b": 2, "c": 3
}
//     To: single line
{ "a": 1, "b": 2, "c": 3 }

Continuing with the above example, the attributes ‘summary’, ‘meta’, and ‘by_land’ are now each shown on a single line.

{
  "summary": { "source": "wikipedia" },
  "meta": { "generated": "2026-03-13" },
  "by_land": [ "RUS", "CAN", "CHN", "USA", "BRA", "AUS" ],
  "by_population": [
    "IND", "CHN", "USA", "IDN", "PAK", "NGA", "BRA", "BGD",
    "RUS", "ETH", "MEX", "JPN", "EGY", "PHL", "COD", "VNM",
    "IRN", "TUR", "DEU", "TZA"
  ],
  "name": {
    "RUS": "Russia", "CAN": "Canada", "CHN": "China", "USA": "United States",
    "BRA": "Brazil", "AUS": "Australia"
  }
}
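
The fold rule can be sketched the same way. The code below is an illustration under assumptions, not the jsonfold source: `fold` and its `width` default are hypothetical, the nesting-level and item-count limits are omitted, and the input is a complete list of lines rather than a stream. It looks at the last three emitted lines and collapses them when they form an opener, a single body line, and a closer.

```python
import json

OPENERS = ("{", "[")
CLOSERS = ("}", "]")

def fold(lines: list[str], width: int = 40) -> list[str]:
    """Collapse opener / one body line / closer triples onto one line."""
    out: list[str] = []
    for line in lines:
        out.append(line)
        if len(out) < 3:
            continue
        opener, body, closer = out[-3:]
        if (opener.rstrip().endswith(OPENERS)
                and closer.lstrip().startswith(CLOSERS)
                and not body.rstrip().endswith(OPENERS)
                and not body.lstrip().startswith(CLOSERS)):
            merged = f"{opener} {body.strip()} {closer.strip()}"
            if len(merged) <= width:
                out[-3:] = [merged]   # replace the triple with one line
    return out

data = {"a": {"b": {"c": "abc"}}, "x": {"y": {"z": "xyz"}}}
folded = fold(json.dumps(data, indent=2).splitlines())
print("\n".join(folded))
```

A folded line is itself a valid body line, so nested containers collapse from the inside out as each closer arrives.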

Join

The join phase is similar to the pack phase: it attempts to merge folded lines together, potentially placing several folded objects on the same line, subject to limits on width, nesting level, and item count.

At this stage, previously folded containers are treated as atomic units that can be merged together while preserving their internal structure. This allows nested structures such as coordinate pairs or small embedded objects to behave similarly to scalar values during compaction.

Continuing with the above example, the attributes ‘summary’ and ‘meta’ are now merged into a single line.

{
  // summary and meta merge into a single line.  
  "summary": { "source": "wikipedia" }, "meta": { "generated": "2026-03-13" },
  "by_land": [ "RUS", "CAN", "CHN", "USA", "BRA", "AUS" ],
  "by_population": [
    "IND", "CHN", "USA", "IDN", "PAK", "NGA", "BRA", "BGD",
    "RUS", "ETH", "MEX", "JPN", "EGY", "PHL", "COD", "VNM",
    "IRN", "TUR", "DEU", "TZA"
  ],
  "name": {
    "RUS": "Russia", "CAN": "Canada", "CHN": "China", "USA": "United States",
    "BRA": "Brazil", "AUS": "Australia"
  }
}
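
A sketch of the join rule, under the same caveats as before: `join`, `is_folded`, and the `width` default are hypothetical names chosen for this example, the nesting and item-count limits are not modeled, and the input is a plain list of lines. Only lines that already hold a complete folded container are merged.

```python
import json

def is_folded(line: str) -> bool:
    """True for a container that already sits on one line."""
    body = line.strip().rstrip(",")
    return body.endswith(("}", "]")) and not body.startswith(("}", "]"))

def join(lines: list[str], width: int = 80) -> list[str]:
    """Merge consecutive folded containers that share an indentation level."""
    out: list[str] = []
    for line in lines:
        prev = out[-1] if out else ""
        same_indent = (len(prev) - len(prev.lstrip())
                       == len(line) - len(line.lstrip()))
        if (is_folded(prev) and is_folded(line) and same_indent
                and len(prev) + 1 + len(line.strip()) <= width):
            out[-1] = f"{prev} {line.strip()}"
        else:
            out.append(line)
    return out

lines = [
    '{',
    '  "summary": { "source": "wikipedia" },',
    '  "meta": { "generated": "2026-03-13" },',
    '  "by_land": [ "RUS", "CAN", "CHN", "USA", "BRA", "AUS" ]',
    '}',
]
joined = join(lines, width=80)
print("\n".join(joined))
```

At a width of 80, ‘summary’ and ‘meta’ share a line, while ‘by_land’ stays on its own line because adding it would exceed the limit.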

Why streaming matters

JSON documents can be very large and deeply nested. It’s easier to implement the compaction by operating on a complete pretty-printed JSON document — but this has a price:

Additional memory: holding both the original document and the compacted document can increase temporary memory usage to 2–4 times the size of the original document.
Operations on large strings: concatenation and iteration over large strings are more costly than operations on smaller chunks.
Time to first byte (TTFB): delaying processing until the full document is generated increases TTFB significantly. This can have a noticeable negative impact on service responsiveness for end users.

jsonfold processes the document in small pieces, leveraging the incremental generation provided by Python's json.dump() call: arrays are processed one item at a time, and objects one key/value pair at a time. The extra memory needed for processing is approximately 4x the maximum line width (actual or configured).
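
The incremental behavior of json.dump() is easy to observe with the standard library alone. The sketch below uses a hypothetical `CountingWriter` class (not part of jsonfold) that records every write() call. In CPython, json.dump() hands the document to the stream as many small chunks rather than as one large string.

```python
import json

class CountingWriter:
    """Hypothetical file-like object that records every write() call."""

    def __init__(self):
        self.chunks = []

    def write(self, chunk: str) -> int:
        self.chunks.append(chunk)
        return len(chunk)

data = {"ids": [1, 2, 3], "meta": {"ok": True}}
writer = CountingWriter()
json.dump(data, writer, indent=2)

print(len(writer.chunks), "write() calls")
print(max(len(c) for c in writer.chunks), "characters in the largest chunk")
```

A filter that consumes these chunks as they arrive never needs to hold the whole document, which is what keeps its working memory proportional to the line width instead of the document size.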

If the string-returning call json.dumps() is used, there is no choice but to build and return the (potentially huge) final string. In this case, incremental processing still caps the extra working memory as described above, and using io.StringIO() to build the final string reduces the cost.

One important advantage of the streaming approach is that it should work with other encoders (the cls parameter of json.dump) and with any pretty-printer that can send output directly to a file-like object: wrap the existing output stream with the jsonfold JSONFoldWriter class. (Disclaimer: I do not use any third-party libraries and did not test any specific package.)

Example: using custom encoder

import json
import sys
import jsonfold

# Custom object
class Foo:
    def __init__(self, name):
        self.name = name

# Custom encoder for Foo objects
class CustomEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Foo):
            return {"ID": obj.name}
        return super().default(obj)

# Sample custom object
foo_obj = Foo("Bar")
# Encode the custom object with the custom encoder, writing to stdout
jsonfold.dump(foo_obj, sys.stdout, cls=CustomEncoder)

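Or using a different serializer, such as rapidjson:

import rapidjson
import jsonfold
import sys

def pp(obj, fp, *, compact: str = "", indent: int = 2, **kwargs) -> None:
    # Wrap the target stream; rapidjson writes its pretty-printed output into the filter
    with jsonfold.JSONFoldWriter(fp, compact=compact) as out:
        rapidjson.dump(obj, out, indent=indent, **kwargs)

pp({"ids": [1, 2, 3]}, sys.stdout, compact="default")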

Cross-language portability

This article covers the jsonfold implementation in Python. The same approach can be used in other languages to format JSON according to the same rules, leveraging existing JSON serializers and the stream-filtering facilities each language provides:

In JavaScript (Node), a Writable stream can be wrapped to apply the jsonfold logic to the output of any JSON serializer.
In Java, java.io.FilterWriter can be used to add jsonfold formatting to any character stream.
In C, a FILE * object can be customized using the glibc extension fopencookie or the BSD funopen.

Future articles will describe implementations in JavaScript, Java, C and other languages/platforms. Each implementation will be:

Single file that can be dropped into the code base (Note: certain languages need separate header file).
Filter that will attach the jsonfold behavior to existing output stream.
Efficient implementation that minimizes memory usage, and overhead.
Self-contained, and does not introduce additional dependencies.

Limitations

To reiterate the limitations of this approach: jsonfold depends on the structure produced by a normal pretty-printer. It is not a general JSON parser, and it does not try to understand arbitrary JSON text.

In particular:

The input must already be valid JSON.
The input should use a regular pretty-printed layout.
Indentation must reflect the nesting structure.
Each input line is expected to represent an atomic JSON fragment.
The formatter is not designed to recover from malformed JSON.
Highly customized pretty-printers may produce layouts that cannot be safely compacted.

When these assumptions are violated, jsonfold falls back to pass-through mode rather than risking an unsafe transformation.

This also means that jsonfold is best used as a post-filter for trusted serializer output, not as a cleanup tool for arbitrary JSON pasted from unknown sources.

Disclaimer

The examples and benchmarks in this article, including linked code snippets, are simplified and reconstructed for illustration purposes. They are not taken from any production system, and do not reflect the design or implementation of any specific codebase.

This is a personal approach based on general experience working with codebases. It does not represent any official guideline or the opinion of my employer.

As always, evaluate and test the code carefully before adopting it in production.

Usage and License

The supporting files (jsonfold.py and the JSON examples) are provided under the MIT license and are intended to be copied and used as-is in your own projects.

You can simply copy them into your project, modify them as needed, and integrate them into your build process. No special packaging or setup is required.

Stop Waiting for defer: A Practical Cleanup Layer for C

Yair Lenga — Thu, 07 May 2026 14:18:32 GMT

Not another macro trick. Just a small set of cleanup helpers that cover memory, files, descriptors, sockets, and custom objects.

Do Not Wait for DEFER

Introduction

I recently came across a post on Reddit: “I’m tired of waiting for the C language to finish specifying the defer function. What can I do?”

If you do a quick search, there is no shortage of answers. Over the years, many developers have built defer emulation in C: some clever, some portable, some tricky, and some that break on newer compilers. The problem is not a lack of ideas, but that many of them are not something you would standardize across a real codebase. (For a survey of implementations, see “(Un)portable defer in C”.)

This article is not another attempt to implement defer. It does not describe what you can do with defer, compare it to other languages, or debate design choices. My goal here is much simpler: define a small, consistent set of cleanup macros that we can safely use in day-to-day C code. In practice, this approach eliminates most repetitive cleanup code in typical C functions, reduces error-handling boilerplate, and makes resource management predictable across the codebase.

The initial implementation will use compiler extensions. When a standard defer becomes available, these macros can be adapted to use it.

This approach is especially relevant for projects that run across multiple environments. In many cases older environments must be supported for years, which makes adopting new language features like defer a long-term process. Many legacy projects are also 5-10 years behind the current technology stack, which means they will not be able to leverage a new defer anytime soon.

What this gives you

Uniform cleanup patterns across your codebase
Fewer leaks and double-free bugs
Cleaner error handling paths (especially with early returns)
No dependency on future language features

Example

The following patterns already cover most real-world resource management in typical C codebases:

// calloc -> free
char *p = calloc(n, sizeof(*p)) ;
if ( !p ) ... ;
DEFER_FREE(p) ;

// fopen -> fclose
FILE *fp = fopen(filename, "r") ;
if ( ! fp ) ... ;
DEFER_FCLOSE(fp) ;

// Work file -> remove file
FILE *new_file = fopen(workfile, "w") ;
DEFER_REMOVE(workfile) ;
if ( ! new_file ) ... ;
DEFER_FCLOSE(new_file) ;

These macros follow a simple pattern: free, close, destroy, or restore. Once those are standardized, most cleanup becomes predictable.

If your codebase works with lower level objects (file descriptors, sockets, mutexes, …), a few additional macros will be useful:

// open -> close
int fd = open(filename, O_RDONLY) ;
if ( fd < 0 ) ... ;
DEFER_FD_CLOSE(fd) ;

// Socket -> shutdown/close
int sock = socket(AF_INET, SOCK_STREAM, 0) ;
if ( sock < 0 ) ... ;
DEFER_FD_CLOSE(sock) ;
if ( connect(...) ) ... ;
DEFER_SOCK_SHUTDOWN(sock, SHUT_RDWR) ;

// Mutex lock
pthread_mutex_lock(&lock) ;
DEFER_MUTEX_UNLOCK(&lock) ;

Usage and License

The supporting files (defer_call.c, defer_call.h) are provided under the MIT license and are intended to be copied and used as-is in your own projects.

You can simply copy and/or modify them into your project and integrate them into your build process — no special packaging or setup is required

Header file: defer_call.h
Helper code: defer_call.c
GitHub repo (including test cases): https://github.com/yairlenga/articles/blob/main/2026-defer-now/

Inventory of provided macros

This project is intentionally small.

The goal is not to introduce a new abstraction layer, but to provide a set of useful macros that cover the common resource-cleanup patterns in C using DEFER-like macros.

The cleanup function is invoked with the value of the resource identifier at the end of the block.

Rule: If the resource is released in the middle of the block, set its identifier to NULL (or another invalid value such as -1) to prevent double cleanup.

The core list covers the most common resource types.

DEFER_FREE(void *p) for heap allocated memory
DEFER_FCLOSE(FILE *fp) for FILE * streams
DEFER_FD_CLOSE(int fd) for file descriptors
DEFER_DESTROY(void (*fn)(void *p), void *p) for user defined objects.

The full list (categories) includes

Memory:

DEFER_FREE(void *p) — Core
DEFER_FREE_PTR_ARRAY(void **a, int sz) — free list of pointers.

Files:

DEFER_FCLOSE(FILE *fp) — core
DEFER_REMOVE(const char *p)
DEFER_PCLOSE(FILE *fp)
DEFER_CLOSEDIR(DIR *d)

File Descriptors, Sockets:

DEFER_FD_CLOSE(int fd) — core
DEFER_SOCK_SHUTDOWN(int fd, int how)

Synchronization:

DEFER_MUTEX_UNLOCK(pthread_mutex_t *m)
DEFER_RWLOCK_UNLOCK(pthread_rwlock_t *lock)

Using user defined destroy functions:

DEFER_DESTROY(void (*fn)(void *p), void *p) for user defined objects — core.
DEFER_DESTROY_M(void (*fn)(void *p, int mode), void *p, int mode) to pass a mode parameter.
DEFER_DESTROY_X(void (*fn)(void *p, void *cxt), void *p, void *cxt) for user defined objects.

CORE: DEFER_FREE(void *ptr)

To prevent memory leaks, add DEFER_FREE after allocating the block with malloc, calloc, realloc(NULL, ...) or similar. The pattern also covers resizing with realloc, reallocarray or similar functions, as long as the resized address is stored in the same variable.

{
    char *cp = malloc(n) ;
    if ( !cp ) return ERROR ;
    DEFER_FREE(cp) ;
    ...
    // Keep cp valid if realloc fails; DEFER_FREE then frees the old block
    char *tmp = realloc(cp, n + 100) ;
    if ( !tmp ) return ERROR ;
    cp = tmp ;
    ...
}

If the allocated memory is freed before the end of the block, the resource identifier must be set to NULL.

{
    char *cp = calloc(n, sizeof(*cp)) ;
    if ( !cp ) return ERROR ;
    DEFER_FREE(cp) ;
    ...
    free(cp) ; 
    cp = NULL ;
}

CORE: DEFER_FCLOSE(FILE *fp)

To prevent leakage of FILE * object, DEFER_FCLOSE can be called after functions that create FILE * - fopen, fdopen, freopen.

{
    FILE *fp = fopen(filename, "r") ;
    if ( !fp ) return ERROR ;
    DEFER_FCLOSE(fp) ;
}

If the file is closed before the end of the block, the resource identifier must be set to NULL. At that point, it can even be reused.

{
    FILE *fp = fopen(filename, "r") ;
    if ( !fp ) return ERROR ;
    DEFER_FCLOSE(fp) ;
    ...
    fclose(fp) ;
    fp = NULL ;
    ...
    fp = fopen(filename2, "r") ;
    if ( !fp ) return ERROR ;
    ...
}

CORE: DEFER_FD_CLOSE(int fd)

To prevent leakage of file descriptors, DEFER_FD_CLOSE can be called after any function that creates a file descriptor (open, creat, socket, dup, ...).

{
    int fd = open(filename, O_RDONLY) ;
    DEFER_FD_CLOSE(fd) ;
    if ( fd < 0 ) return ERROR ;
    ...
}

If the file descriptor is explicitly closed in the block, it is important to set the resource identifier to -1. It is also possible to register the cleanup before the descriptor is first opened, as long as the identifier starts at -1.

{
    int fd = -1 ;
    DEFER_FD_CLOSE(fd) ;
    ...
    fd = open(filename, O_RDONLY) ;
    if ( fd < 0 ) return ;
    ...
    close(fd) ;
    fd = -1 ;
    ...
    fd = open(file2, O_RDONLY) ;
    if ( fd < 0 ) return ;
}

MEMORY: DEFER_FREE_PTR_ARRAY(void **a, int sz)

A common way to manage a list of large objects is to track pointers to the created objects in a fixed-size or dynamic array of pointers. DEFER_FREE_PTR_ARRAY can be used to call free on each element. It does not free the array itself, which should be registered with DEFER_FREE(a) when it is allocated dynamically.

struct foo { ... } ;

{
    struct foo **list = NULL ;
    DEFER_FREE(list) ;
    int pos = 0 ;
    DEFER_FREE_PTR_ARRAY(list, pos) ;
    ...
    for (...) {
        // Use a temporary so the old array is not lost if realloc fails
        struct foo **tmp = realloc(list, (pos+1)*sizeof(*list)) ;
        if ( !tmp ) return ERROR ;
        list = tmp ;
        list[pos] = calloc(1, sizeof(*list[pos])) ;
        if ( !list[pos] ) return ERROR ;
        pos++ ;
        ...
    }
}

FILES: DEFER_REMOVE(const char *pathname)

When creating work files, it might be useful to remove the work file in addition to closing the FILE * object. This will ensure work files are removed when no longer needed.

{
    FILE *fp = fopen(workfile, "w") ;
    DEFER_REMOVE(workfile) ;
    if ( !fp ) return ERROR ;
    DEFER_FCLOSE(fp) ;    
}

FILES: DEFER_PCLOSE(FILE *fp)

{
    FILE *fp = popen("ls", "r") ;
    DEFER_PCLOSE(fp) ;
    ...    
}

FILES: DEFER_CLOSEDIR(DIR *dirp)

{
    DIR *dir = opendir(dir_path) ;
    if ( !dir ) return ERROR ;
    DEFER_CLOSEDIR(dir) ;
}

SOCKET: DEFER_SOCK_SHUTDOWN(int sock, int how)

Managing sockets requires executing shutdown once the socket has been connected (or after accept).

{
    int sock = socket(...) ;
    if ( sock < 0 ) return ERROR ;
    DEFER_FD_CLOSE(sock) ;
    ...
    // Shutdown required only after connect
    if ( connect(sock, ...) < 0 ) return ERROR ;
    DEFER_SOCK_SHUTDOWN(sock, SHUT_RDWR) ;
    ...
}

Synchronization: DEFER_MUTEX_UNLOCK(pthread_mutex_t *m)

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER ;

{
    pthread_mutex_lock(&m) ;
    DEFER_MUTEX_UNLOCK(&m) ;
}

If the lock is released before the end of the block, the resource identifier should be set to NULL to avoid double cleanup.

pthread_mutex_t my_mutex = PTHREAD_MUTEX_INITIALIZER ;

{
    pthread_mutex_t *mp = &my_mutex ;
    pthread_mutex_lock(mp) ;
    DEFER_MUTEX_UNLOCK(mp) ;
    ...
    // early release of the lock:
    pthread_mutex_unlock(mp) ;
    mp = NULL ;
}

Synchronization: DEFER_RWLOCK_UNLOCK(pthread_rwlock_t *lock)

pthread_rwlock_t  my_lock = PTHREAD_RWLOCK_INITIALIZER ;

{
    pthread_rwlock_t *lp = &my_lock ;
    // Read some data
    pthread_rwlock_rdlock(lp) ;
    DEFER_RWLOCK_UNLOCK(lp) ;
    ...
    // Release
    pthread_rwlock_unlock(lp) ;
    lp = NULL ;
    ...
    // Write some data, using the same lock
    // cleanup registered above will handle this.
    lp = &my_lock ;
    pthread_rwlock_wrlock(lp) ;
    ...
}

Cleanup model

Looking at the common patterns of cleanup operations, we can classify them based on the parameters that they take:

Resource Identifier: pointer/integer

P (pointer) Most resources are identified by their memory address
I (integer) Some resources are identified by their integer handle

In addition, some cleanup operations need extra context:

X — Extra information is needed for the cleanup operation
M — Extra information of integer “mode” to distinguish between small set of modes.

For a total of six possible combinations:

P — cleanup(void *p)
I — cleanup(int handle)
PX — cleanup(void *p, void *cxt)
IX — cleanup(int handle, void *cxt)
PM — cleanup(void *p, int mode)
IM — cleanup(int handle, int mode)

If a cleanup operation needs more than one extra parameter, it can be modeled by placing all additional parameters inside a single structure that is passed as the context parameter.

In practice:

Most code uses: P
Some cases need: I — usually for system objects
Rare cases need: X or M (extra arguments)

Using values at block exit

Cleanup is performed using the current value of the variable at scope exit.

This allows resources to be resized, reassigned, or disabled by setting them to their invalid value (NULL, -1, ...).

In simple cases, we create a resource, and we perform the cleanup by calling the “destructor” function with the same object that was created

// Old Code
{
    char *const x = calloc(n, sizeof(*x)) ;
    ...
    if ( ... ) { free(x) ; return ; }
    ...
    free(x) ;
}
// With DEFER_FREE
{
    char *const x = calloc(n, sizeof(*x)) ;
    DEFER_FREE(x) ;
    ...
}

Since the cleanup uses the resource address at block exit, it works even when the resource address changes. One example is realloc, where the free function should be invoked with the final value of x:

{
    char *x = calloc(n, sizeof(*x)) ;
    DEFER_FREE(x) ;
    ...
    x = realloc(x, m*sizeof(*x)) ;
    ...
}

Another case is when the resource is released earlier in the function, and there is no need to perform the cleanup at the end. In those cases, reset the resource identifier to protect against double cleanup.

{
    char *x = calloc(n, sizeof(*x)) ;
    DEFER_FREE(x) ;
    ...
    free(x) ;
    x = NULL ;
    ...
    return ;
}

Releasing the memory when it is no longer needed is good practice. In this case, the value of the resource has to be reset, so that the automatic invocation will not attempt to free the block again (which will likely crash the program).

Most resources already have a natural “NA” value (e.g., NULL, -1), which can be used to mark them as already cleaned up.

Resource Identifier: Pointer vs Integer

In many cases, resources are identified by their memory address, and the cleanup function only needs this address. This covers almost anything derived from malloc/calloc (or other allocators), including:

File Object (FILE *) created by fopen, fdopen or popen
Dynamically created strings created by strdup, strndup
Network structures created by getaddrinfo and similar
User defined objects created on the heap.

The second category of identifiers is handles: resources that are identified by an integer handle, in many cases referencing system resources outside our process space:

File descriptors (open, creat, socket, ...)
Process identifiers (fork)
IPC resources like shared memory (shmget)

Extra Context

Certain cleanup functions require extra information. For example:

The shutdown system call takes a parameter (int how).
The munmap system call takes a length parameter.

We generalize the support for extra parameters by adding support for an extra pointer, which can be used to pass additional required parameters.

Naming rules

To support all variations, we use consistent naming rules:

DEFER_CALL_(P|I)(|X|M)

For example:

DEFER_CALL_P(cleanup, var) — Call cleanup(void *).
DEFER_CALL_I(cleanup, fd) — Call cleanup(int).
DEFER_CALL_IM(cleanup, sock, mode) — Call cleanup(int, int), note that mode is set at registration time.

Defining Cleanup function for user objects.

The DEFER_CALL_* macros can be used to define cleanup helper for objects that have create and destroy functions.

struct foo { char *name ; ... } ;

struct foo *fooCreate(char *name, ...)
{
    struct foo *v = calloc(1, sizeof(*v)) ;
    if ( !v ) return NULL ;
    v->name = strdup(name) ;
    ...
    return v ;
}
void fooDestroy(struct foo *p)
{
    if ( !p ) return ;   // nothing to clean up if creation failed
    free(p->name) ;
    ...
    free(p) ;
}
void doSomething(void)
{
    struct foo *foo1 = fooCreate("name1") ;
    DEFER_DESTROY(fooDestroy, foo1) ;
    ...
    // foo1 will be automatically destroyed with fooDestroy at exit
}

If additional parameters are needed, the other helper macros can be used to specify cleanup functions that take an arbitrary context parameter (DEFER_DESTROY_X) or an integer modifier (DEFER_DESTROY_M). The extra parameter (context or mode) is captured at the time of the cleanup registration.

For example, the fooDestroy may take an integer indicator for logging. Note that the modifier is captured at the time of registration.

void fooDestroy(struct foo *p, int verbose)
{
    if ( verbose ) {
        printf("Destroying: %s\n", p->name) ;        
    }
    free(p->name) ;
    ...
    free(p) ;
}

void doSomething(void)
{
    struct foo *foo1 = fooCreate("name1") ;
    DEFER_DESTROY_M(fooDestroy, foo1, 1) ;
    ...
    // foo1 will be automatically destroyed with fooDestroy at exit
}

If the cleanup function needs several additional parameters, it is possible to pass all the information in one structure, possibly using a compound literal.

struct fooArgs { int verbose ; int timeout ; } ;

void fooDestroy(struct foo *p, struct fooArgs *args)
{
    if ( args->verbose ) {
        printf("Destroying: %s\n", p->name) ;        
    }
    free(p->name) ;
    ...
    free(p) ;
}
void doSomething(void)
{
    struct foo *foo1 = fooCreate("name1") ;
    DEFER_DESTROY_X(fooDestroy, foo1,
        &(struct fooArgs) { .verbose=1, .timeout = 5 }) ;
    ...
    // foo1 will be automatically destroyed with fooDestroy at exit
}

Summary

The goal of this project is not to replace a future standard defer, but to provide a small and consistent cleanup layer that works in current (and past) GCC/Clang compilers and can evolve with the language over time.

In practice, a handful of cleanup macros already eliminate a large amount of repetitive cleanup code in typical C projects, while keeping the behavior explicit and predictable.

Disclaimer

The cleanup attribute is available with GCC (starting with 3.4.x) and Clang (3.x era and later) as a C compiler extension. Many other compilers support this extension - but I did not test them.

The examples in this article, including linked code snippets, are simplified and reconstructed for illustration purposes. They are not taken from any production system, and do not reflect the design or implementation of any specific codebase.

This is a personal approach based on general experience working with C codebases. It does not represent any official guideline or the opinion of my employer.

As with any low-level technique, evaluate carefully before adopting it in production.

Stop Waiting for defer: A Practical Cleanup Layer for C was originally published in Stackademic on Medium.

Automatic Enum Handling in C — Parsing, Validating and Iterating.

Yair Lenga — Wed, 29 Apr 2026 07:20:22 GMT

Using debug (DWARF) information to eliminate lookup tables and keep enum logic in sync at build time

Quick recap — enum Stringification

In the previous article, “Automatic Enum Stringification in C via Build-Time Code Generation”, I described a process to extract information about enum types (the list of enum identifiers and values) from debug (DWARF) information. This allows us to display symbolic enum labels instead of numeric values in logs, debug output, and more.

enum color { C_NONE, C_RED, C_YELLOW, C_GREEN } ;

// Request enum descriptor for e_color
ENUM_DESCRIBE(e_color, enum color)
void foo(enum color c) {
    printf("Color=%s(%d)\n", ENUM_LABEL_OF(e_color, c), c) ; 
}

The process is fully automated, relies on common tools that are already used in build pipelines, and has (practically) zero runtime cost.

Parsing Strings into Enums

The next logical step is the reverse conversion — from symbolic name to value. This is useful when reading external input keyed by enum values: command-line arguments, configuration files, or user input.

Languages that support reflection (Java, Python, C#, ...) will usually provide lookup functions (EnumClass.valueOf(String), Enum.Parse(...)), but in C, there is no built-in, standard capability.

This article discusses a lightweight solution to parse strings into enum values, so that you can write parsing functions:

ENUM_DESCRIBE(e_color, enum color)

bool parse_color(const char *label, enum color *var)
{
    return ENUM_PARSE_LABEL(e_color, label, var) ;
}

No hard-coded lookup tables. Always kept in sync with the enum definition at build time, using tools you already have.

In practice, this means enum labels can be part of the input interface, not just internal constants.

Download & Quick Start

Download the latest minimal package (~20KB):

https://github.com/yairlenga/c-enum-reflect/releases/latest

See the Releases page for other versions and packages.

Example: Reading input

Let’s assume we hold information about each color in memory (e.g., RGB values), and we want to let the user select the RGB value by color name, where the colors are an enum. Without enum metadata, we would write a lookup table, or inline a list of strcmp calls, to map the input label to the correct enum value.

enum color { RED, YELLOW, GREEN } ;

bool parse_color(const char *label, enum color *var)
{
    if ( strcmp(label, "RED") == 0) *var = RED ;
    else if ( strcmp(label, "YELLOW") == 0) *var = YELLOW ;
    else if ( strcmp(label, "GREEN") == 0) *var = GREEN ;
    else return false ;
    return true ;
}
int rgb[] = {
    [RED] = 0xff0000,
    [YELLOW] = 0xffff00,
    [GREEN] = 0x00ff00,
} ;
void show_rgb(const char *label)
{
    enum color c ;
    if ( parse_color(label, &c)) {
        printf("RGB(%s)=%06x\n", label, rgb[c]) ;
    } else {
        printf("Unknown Color: '%s'\n", label) ;
    }
}

In most cases, we will use a lookup table to make the code easier to maintain.

enum color { RED, YELLOW, GREEN } ;

bool parse_color(const char *label, enum color *var)
{
    static struct { enum color c; const char *label ; } color_lookup[] = {
        { RED, "RED" },
        { YELLOW, "YELLOW" },
        { GREEN, "GREEN" },
    } ;
    for (int i=0 ; i < (int) (sizeof(color_lookup)/sizeof(color_lookup[0])) ; i++) {
        if ( strcmp(label, color_lookup[i].label) == 0 ) {
                *var = color_lookup[i].c ;
                return true ;
        }
    }
    return false ;
}

Either way, we run into the same problems we had with enum stringification:

Requires repetitive work.
Easy to miss updates, or introduce incorrect fixes.
Hard to maintain if enum is defined in external packages.

The good news is that we can build directly on the existing enum metadata to support parsing.

API for Parsing Enum

The enum_desc module provides a basic API to translate a string into the enum value. It builds on the ENUM_DESCRIBE macro which we use to create the enum descriptions.

// Define a tag that references the enum
ENUM_DESCRIBE(tag, enum_type)

// Function-like macro.
// If the label matches one of the enum labels:
//    store the matching enum value into the variable that p_enum points to.
//    return true
// On error:
//    does NOT modify the variable pointed to by p_enum
//    return false
bool ENUM_PARSE_LABEL(enum_tag, const char *label, enum_type *p_enum) ;

Our parsing function is now simple and short:

ENUM_DESCRIBE(e_color, enum color)

bool parse_color(const char *label, enum color *var)
{
    return ENUM_PARSE_LABEL(e_color, label, var) ;
}

No strcmp, no lookup tables to create, and no updates needed when enum values change.

Example: parsing values from a config file

Using the new API, we can now read a configuration file that describes RGB values for colors, leveraging the symbolic names.

// colors.txt
FOREGROUND = 0xff0000
BACKGROUND = 0xffff00
HIGHLIGHT = 0x00ff00
BOLD  = 0xff0000

And the code is relatively simple.

enum color { FOREGROUND, BACKGROUND, HIGHLIGHT, BOLD, LAST } ;

int rgb[LAST] ;
void read_rgb(FILE *fp)
{
    char line[256] ;
    while ( fgets(line, sizeof(line), fp)) {
        enum color c ;
        char color_name[30] ;
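        unsigned int rgb_value ;   // %x expects a pointer to unsigned int
        if ( sscanf(line, "%29s = %x", color_name, &rgb_value) == 2 &&
            parse_color(color_name, &c) && c < LAST ) {   // LAST is a sentinel, not a color
                rgb[c] = (int) rgb_value ;
        }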
    }
}

Later, we will discuss other options to make the code more flexible — e.g., resizing to the maximum values of the enum (or the actual number of entries).

Validation and Error Reporting

When the ENUM_PARSE_LABEL() call fails to match the label, it returns false and does not modify the enum variable. This can be used to introduce defaults and error logging as needed. For example:

enum color { FOREGROUND, BACKGROUND, HIGHLIGHT, BOLD, LAST } ;

int rgb[LAST] = {
    [FOREGROUND] = 0x000000,  // Black
    [BACKGROUND] = 0xffffff,  // White
    [HIGHLIGHT] = 0xffff00,   // Yellow
    [BOLD] = 0xff0000,        // Red
} ;
ENUM_DESCRIBE(e_color, enum color)
bool parse_color(const char *label, enum color *var)
{
    return ENUM_PARSE_LABEL(e_color, label, var) ;
}
void read_rgb(FILE *fp)
{
    char line[256] ;
    while ( fgets(line, sizeof(line), fp)) {
        char color_name[30] ;
        unsigned int rgb_value ;   // %x expects a pointer to unsigned int
        if ( sscanf(line, "%29s = %x", color_name, &rgb_value) != 2 ) {
            fprintf(stderr, "Bad Config line: '%s'\n", line) ;
            continue ;
        }
        enum color c ;
        // LAST is a sentinel that sizes rgb[]; reject it as a color name
        if ( !parse_color(color_name, &c) || c >= LAST ) {
            fprintf(stderr, "Unknown color: '%s'\n", color_name) ;
            continue ;
        }
        rgb[c] = (int) rgb_value ;
    }
}
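
The "introduce defaults" idea can be sketched as a thin wrapper around any parser that follows the ENUM_PARSE_LABEL contract (return false and leave the output untouched on a miss). In the sketch below, demo_parse_color is a hand-written stand-in so the block compiles without the generated descriptors, and parse_color_or_default is a hypothetical helper name; neither is part of enum_desc.

```c
#include <stdbool.h>
#include <string.h>

enum color { FOREGROUND, BACKGROUND, HIGHLIGHT, BOLD, LAST } ;

// Stand-in for parse_color(): same contract as ENUM_PARSE_LABEL
// (returns false and leaves *var untouched when the label is unknown).
static bool demo_parse_color(const char *label, enum color *var)
{
    static const char *labels[LAST] = {
        [FOREGROUND] = "FOREGROUND", [BACKGROUND] = "BACKGROUND",
        [HIGHLIGHT] = "HIGHLIGHT", [BOLD] = "BOLD",
    } ;
    for (int i = 0 ; i < LAST ; i++) {
        if ( strcmp(label, labels[i]) == 0 ) {
            *var = (enum color) i ;
            return true ;
        }
    }
    return false ;
}

// Hypothetical helper: fall back to a default when the label is unknown.
enum color parse_color_or_default(const char *label, enum color dflt)
{
    enum color c = dflt ;                  // pre-load the default
    (void) demo_parse_color(label, &c) ;   // on failure, c is left untouched
    return c ;
}
```

Because a failed parse never writes to the output variable, pre-loading the default is enough; no extra branch is needed at the call site.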

Once enum metadata is available, parsing is not the only operation we can support. We can also iterate over all enum values, enabling generic processing and introspection.

Iterating Over Enum Values

Enumeration API

Given that the enum metadata is stored in a simple-to-iterate format, it’s possible to iterate over all the values of a single enum. The API provides a few functions to query the enum list:

enum_desc_item_count() returns the number of enumerators.
enum_desc_label_at() returns the label (const char *) of the enumerator at a given position. Returns NULL on a bad position.
enum_desc_value_at() returns the integer value (int) of the enumerator at a given position. Returns 0 on a bad position.

All three functions take an enum_desc_t object. The function-like macro ENUM_DESC(tag) can be used to get the descriptor. The following code lists all the enumerators of an enum type:

enum color { NONE, BACKGROUND, FOREGROUND, HIGHLIGHT, BOLD, LAST } ;

ENUM_DESCRIBE(e_color, enum color)
// Showing all colors, using count
void show_color_enum(void)
{
    enum_desc_t ed = ENUM_DESC(e_color) ;
    for (int i=0, count=enum_desc_item_count(ed) ; i < count ; i++) {
        printf("Color #%d: %s = %d\n", i, enum_desc_label_at(ed, i), enum_desc_value_at(ed, i)) ;
    }
}
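
The same count/value-at pattern can also be used to size a lookup table from the enum itself, instead of relying on a LAST sentinel. The sketch below does not use enum_desc: demo_desc, demo_item_count, demo_value_at and demo_table_size are hypothetical stand-ins that only mimic the shape of enum_desc_item_count() and enum_desc_value_at(), so the block compiles on its own.

```c
// Hypothetical stand-in for an enum descriptor: parallel label/value arrays.
struct demo_desc {
    int count ;
    const char *const *labels ;
    const int *values ;
} ;

int demo_item_count(const struct demo_desc *ed) { return ed->count ; }

// Mirrors the documented behavior of enum_desc_value_at(): 0 on bad position.
int demo_value_at(const struct demo_desc *ed, int pos)
{
    return (pos >= 0 && pos < ed->count) ? ed->values[pos] : 0 ;
}

// Number of slots needed to index an array by enum value: max value + 1.
// Assumes all values are non-negative; returns 0 for an empty enum.
int demo_table_size(const struct demo_desc *ed)
{
    int max = -1 ;
    for (int i = 0, count = demo_item_count(ed) ; i < count ; i++) {
        int v = demo_value_at(ed, i) ;
        if ( v > max ) max = v ;
    }
    return max + 1 ;
}

// Sample data for: enum color { C_NONE, C_RED=2, C_YELLOW=6, C_GREEN }
static const char *const demo_color_labels[] = { "C_NONE", "C_RED", "C_YELLOW", "C_GREEN" } ;
static const int demo_color_values[] = { 0, 2, 6, 7 } ;
const struct demo_desc demo_color = { 4, demo_color_labels, demo_color_values } ;
```

For the sample enum, demo_table_size(&demo_color) yields 8 slots, even though only 4 enumerators exist. For very sparse enums, a value-indexed array may not be the right structure.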

Dumping all enum values

The reflection API works on any enum; the above code is generic. enum_desc_print dumps all the data about an enum in a human-readable format:

// ISO country codes.
enum country3 {
    ...
    ISO3_BGR = 100, ISO3_MMR = 104, ISO3_BDI = 108,
    ISO3_BLR = 112, ISO3_KHM = 116, ISO3_CMR = 120,
    ISO3_CAN = 124, ISO3_CPV = 132, ISO3_CYM = 136,
    ...
} ;

Will print:

Enum 'country3' 191 items, dynamic=TRUE, custom=FALSE, offset_sz=2, value_sz=4
...
#12: 44 (ISO3_BHS) meta=-
#13: 48 (ISO3_BHR) meta=-
#14: 50 (ISO3_BGD) meta=-
#15: 51 (ISO3_ARM) meta=-
#16: 52 (ISO3_BRB) meta=-
#17: 56 (ISO3_BEL) meta=-
#18: 60 (ISO3_BMU) meta=-
...
Range: [ 4 - 894 ] , unused=700

Using iteration for generic tools

The iteration API can also be used to provide custom behavior, for example, looking up an enumerator using a case-insensitive match. This allows referencing the enum using uppercase, lowercase, or mixed-case labels.

bool enum_parse_case_cmp(enum_desc_t ed, const char *label, int *var)
{
    int count = enum_desc_item_count(ed) ;
    for (int i=0 ; i < count ; i++) {
        if ( strcasecmp(label, enum_desc_label_at(ed, i)) == 0) {
            *var = enum_desc_value_at(ed, i) ;
            return true ;
        }
    }
    return false ;
}

How it works

This is a summary. See the previous article for more details, including a Makefile for CI, prerequisites, and more.

The process happens entirely at build time.

Compile the source file with debug information.
Scan the object file and extract the enum definition (via DWARF)
Generate a C source file containing enum descriptors.
Compile and link the generated code into the final binary.

Example:

gcc -c -g test2.c
enum_dwarf_query --format=c test2.o > gen_test2.c
gcc -c -g gen_test2.c
gcc -g -o prog.exe test2.o enum_desc.o gen_test2.o

The final binary contains only plain C data structures, and a small runtime support module (from enum_desc.c). There is no runtime dependency on DWARF tools or libraries, or on a debugger.

Summary

Enum metadata is generated automatically from debug (DWARF) information.
The same data supports parsing, validation, and iteration.
No manual lookup tables or duplicated definitions.
Always in sync with the compiled enum — including external libraries.
A small step toward practical reflection in C.

Notes

In addition to the generic API, typed helper functions such as enum_desc_label_of_color() and enum_desc_parse_color() can be generated for convenience. These wrappers improve type safety and readability, but are not required — the generic API is sufficient in most cases.

Disclaimer

This is a personal approach based on general experience working with C codebases. It does not represent any official guideline or the opinion of my employer.

As with any low-level technique, evaluate carefully before adopting it in production.

Usage and License

The supporting files (enum_desc.c, enum_desc.h, enum_dwarf_query.py) are provided under the MIT license and are intended to be copied and used as-is in your own projects.

You can simply copy and/or modify them into your project and integrate the extractor into your build process — no special packaging or setup is required

Automatic Enum Stringification in C via Build-Time Code Generation

Yair Lenga — Thu, 23 Apr 2026 15:29:05 GMT

Leveraging compiler debug metadata (DWARF) to generate enum mappings with zero runtime overhead

The follow-up article, Automatic Enum Handling in C — Parsing, Validating and Iterating, covers additional topics related to using enum metadata.

If you maintain C code, you’ve probably written enum-to-string conversion functions by hand. They work — until someone adds a new enum value and forgets to update them.

When the enum values are assigned sequential values, it is possible to perform fast lookup with arrays, using designated initializers:

enum ConnectionState {
    STATE_NONE,
    STATE_DISCONNECTED,
    STATE_CONNECTING,
    STATE_CONNECTED,
    STATE_ERROR,
    STATE_LAST
};

const char *connectionStateStr(enum ConnectionState state) {
    static const char *labels[] = {
        [STATE_NONE] = "STATE_NONE",
        [STATE_DISCONNECTED] = "STATE_DISCONNECTED",
        [STATE_CONNECTING] = "STATE_CONNECTING",
        [STATE_CONNECTED] = "STATE_CONNECTED",
        [STATE_ERROR] = "STATE_ERROR",
    } ;
    return state >= 0 && state < STATE_LAST ? labels[state] : NULL ;
}

In other cases (e.g., when the enum values span a sparse range), you might have implemented it with a switch statement (or some form of lightweight hash table):

enum errorCode {
    E_NOT_FOUND = -1,
    E_PERMISSION = -2,
    E_OUT_OF_MEMORY = -3,
    ...
} ;

const char *errorCodeStr(enum errorCode code) {
    switch (code) {
        case E_NOT_FOUND: return "E_NOT_FOUND" ;
        case E_PERMISSION: return "E_PERMISSION" ;
        case E_OUT_OF_MEMORY: return "E_OUT_OF_MEMORY" ;
        ...
    } ;
    return NULL ;
}

Those lookups are commonly used to create log records, parse configuration options, and print debug output. These implementations have a few limitations:

Requires repetitive work.
Easy to miss updates, or introduce incorrect fixes.
Hard to maintain if enum is defined in external packages.

Languages that support reflection (Java, Python, C#, ...) will usually provide a stringification function, but in C, there is no built-in, standard capability.

This article discusses a lightweight solution to create stringification functions, so that you can write:

printf("Connection State=%s\n", ENUM_LABEL_OF(ConnectionState, state)) ;

No hard-coded lookup tables. Always kept in sync with the enum definition at build time, using tools you already have.

Quick Start

Download the latest minimal package (~20KB):

https://github.com/yairlenga/c-enum-reflect/releases/latest

See the Releases page for other versions and packages.

Solution — automatic stringification of enum values

If C had reflection, we would have the option to write something like:

enum color { C_NONE, C_RED=2, C_YELLOW=6, C_GREEN } ;
void foo(enum color c)
{
    printf("Color=%s(%d)\n", color.to_string(c), c) ;
}

The bad news is that C does not provide this capability directly. The good news is that the compiler already has all the information needed to implement it. When code is compiled with debug (-g), the full definition of each (referenced) enum is captured in the object file. We can reuse it!

We can verify this using a debugger such as gdb. gdb will show the symbolic value of each enum variable (with print), and the full enum description with ptype.

(gdb) print c
$1 = C_RED
(gdb) ptype c
type = enum color {C_NONE, C_RED = 2, C_YELLOW = 6, C_GREEN}

See appendix Background: Full example for enum metadata with gdb at the end.

Instead of using this metadata only for debugging, we can extract it at build time and generate lookup tables automatically. This effectively provides reflection for enums in C - without any runtime cost.

Minimal example — using generated enum descriptions

// test2.c
#include <stdio.h>
#include "enum_desc.h"

enum color { C_NONE, C_RED, C_YELLOW, C_GREEN } ;
// Request enum descriptor e_color that will describe enum color
ENUM_DESCRIBE(e_color, enum color)
void foo(enum color c) {
    printf("Color=%d\n", c) ;
    // print stringified label for c
    printf("Color=%s\n", ENUM_LABEL_OF(e_color, c)) ; 
}
int main(void) {
    foo(C_RED) ;
    return 0 ;
}

How it works

The process happens entirely at build time.

Compile the source file with debug information.
Scan the object file and extract the enum definition (via DWARF)
Generate a C source file containing enum descriptors.
Compile and link the generated code into the final binary.

Example:

gcc -c -g test2.c
enum_dwarf_query --format=c test2.o > gen_test2.c
gcc -c -g gen_test2.c
gcc -g -o prog.exe test2.o enum_desc.o gen_test2.o

The final binary contains only plain C data structures, and a small runtime support module (from enum_desc.c). There is no runtime dependency on DWARF tools or libraries, or on a debugger.
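
To make "plain C data structures" concrete, here is a hand-written sketch of the kind of table a generator could emit for a sparse enum. The layout and the names demo_item, demo_color_items, demo_item_cmp and demo_label_of are assumptions for illustration only; the actual output of enum_dwarf_query may be organized differently.

```c
#include <stdlib.h>

enum color { C_NONE, C_RED = 2, C_YELLOW = 6, C_GREEN } ;

// Hypothetical descriptor row: one enumerator value and its label.
struct demo_item { int value ; const char *label ; } ;

// Rows sorted by value, so sparse enums can be searched with bsearch().
static const struct demo_item demo_color_items[] = {
    { C_NONE, "C_NONE" }, { C_RED, "C_RED" },
    { C_YELLOW, "C_YELLOW" }, { C_GREEN, "C_GREEN" },
} ;

// bsearch() comparator: key is an int, element is a demo_item row.
static int demo_item_cmp(const void *key, const void *elem)
{
    int k = *(const int *) key ;
    int v = ((const struct demo_item *) elem)->value ;
    return (k > v) - (k < v) ;
}

// Returns the label for a value, or NULL when no enumerator has that value.
const char *demo_label_of(int value)
{
    const struct demo_item *hit = bsearch(&value, demo_color_items,
        sizeof(demo_color_items) / sizeof(demo_color_items[0]),
        sizeof(demo_color_items[0]), demo_item_cmp) ;
    return hit ? hit->label : NULL ;
}
```

Nothing in this sketch depends on DWARF at run time: once generated, the table is ordinary read-only data that survives stripping.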

Notes on ENUM_DESCRIBE, ENUM_LABEL_OF

The ENUM_DESCRIBE macro marks enum types that we want to generate metadata for.

The first argument (e_color) is a unique identifier for the descriptor.
The second argument must resolve to a valid enum type, in the current translation unit.
The enum can be defined in the same file or in an included header file.
Important: the descriptors are generated as global symbols. Using the same identifier (for the same or for different enums) will result in a link-time error.

The ENUM_LABEL_OF macro is expanded to a call to retrieve the generated metadata.

The first argument is the unique identifier.
The second argument is an enum value to be described.
Returns NULL, if enum value does not have a label.

You can view the definition of those macros in GitHub Gist: enum_desc.h. You will see that the implementation is defining multiple identifiers - all follow the pattern enum_desc_*. Some identifiers have static scope, some are global objects (functions, variables). If you inspect the objects/binaries, you will see those symbols.

Integration into CI pipeline

In practice — most projects use a build system (Make or CMake). So the generated files are rebuilt automatically when the source object file is rebuilt. This will ensure that the descriptor is up-to-date, even for enums that are defined in header files (current project, dependent objects, or system header files).

Adding the following to your Makefile will automate the build:

# ENUMDESC_DIR - Source location where enum_desc source files are (.c, .h and python parser)
# ENUMDESC_SRCS - list of source files that call ENUM_DESCRIBE
# OBJDIR - directory where generated files (objects and sources) will be stored.
# PROG - path to binary, which should link generated enum_* objects.

CFLAGS += -I $(ENUMDESC_DIR)
ENUMDESC_SRCS = file1.c file2.c ...

$(OBJDIR)/gen_%.o: $(OBJDIR)/%.o
    $(ENUMDESC_DIR)/enum_dwarf_query --format=c $< > $(OBJDIR)/gen_$*.c
    $(COMPILE.c) -o $@ $(OBJDIR)/gen_$*.c

$(OBJDIR)/enum_desc.o: $(ENUMDESC_DIR)/enum_desc.c
    $(COMPILE.c) -o $@ $^

ENUMDESC_OBJS += $(OBJDIR)/enum_desc.o $(ENUMDESC_SRCS:%.c=$(OBJDIR)/gen_%.o)
$(PROG): ... $(ENUMDESC_OBJS)
    $(LINK.c) -o $@ ... $(ENUMDESC_OBJS)

Note that this pipeline requires a reasonably modern python3 runtime, including pyelftools, which can be installed with either of the following:

sudo apt install python3-pyelftools
sudo python3 -m pip install pyelftools

There is no runtime dependency on DWARF, debug information, or external tools. The binary can be fully stripped — as if nothing unusual ever happened.

Why this approach

Before settling on this approach, I experimented with a few alternatives. The main challenges were:

Keeping definitions in sync as enum values evolve.
Minimizing effort when adding new enums.
Maintaining single source of truth.
Avoiding unnecessary complexity.

The options that I considered were:

X-Macros: Require rewriting enums into a custom format, which many codebases cannot or will not adopt.
DSL-style (IDL files, proto): Do not work for enums defined outside your control (external libraries, system headers).
Manual Lookup Tables: sooner or later, the mapping falls out of sync with the enums.
Parsing Source files (AST tools, Clang toolkit, regex): Parsing C correctly is hard: anything less than a full parser is fragile, and may fail in the future.
Compiler Plugins: Powerful, but tie the solution to a single compiler/toolchain, and require significant effort to develop and maintain.

While using the DWARF metadata is not perfect for all situations, it avoids the challenges of the other alternatives:

It stays in sync with the source of truth — the way the compiler understands the enum.
It requires no changes to existing source code and introduces no runtime dependency.
It is built on tools already common in C development (gcc, clang, gdb), established and widely-used standard format (DWARF), and simple open source components (python, pyelftools)

As an extra bonus — the generated “C” code can be easily inspected/reviewed — no magic, no complex runtime, no black boxes.

Summary

Enum stringification in C is a common problem, typically solved with manual lookup tables, or custom definitions — both have a maintenance cost and tend to drift out of sync over time. This approach addresses this problem using a different path: reusing the debug metadata already produced by the compiler to generate enum descriptors at build time.

The result is straightforward:

No changes to existing enum definitions in the source code.
No duplicate definitions.
No run-time dependency on external tools or libraries.
Always in sync with the enum as compiled.

In practice, the compiler does the heavy lifting — we just reuse it.

This is just the first step — once the metadata is available, it can also be used for parsing configuration files, validation and more. A follow-up article will explore those use cases.

Appendices

Background: Full example for enum metadata with gdb

Here is a small program that shows enum metadata:

// color.c
#include <stdio.h>

enum color { C_NONE, C_RED=2, C_YELLOW=6, C_GREEN } ;
void foo(enum color c) {
    printf("Color=%d\n", (int) c) ;                          // print as integer
}
int main(void) {
    foo(C_RED) ;
    return 0 ;
}

We can compile with debug information (gcc -g), and inspect the enum with gdb:

$ gcc -g color.c
$ gdb a.out
Reading symbols from a.out...
(gdb) b foo
Breakpoint 1 at 0x1158: file color.c, line 6.
(gdb) run
Starting program: /home/user/github/articles/a.out 
Breakpoint 1, foo (c=C_RED) at color.c:6
6           printf("Color=%d\n", (int) c) ;
(gdb) print c
$1 = C_RED
(gdb) ptype c
type = enum color {C_NONE, C_RED = 2, C_YELLOW = 6, C_GREEN}

The information comes from the DWARF debug metadata that is embedded in the object file, and is available to the debugger (usually, it’s embedded in the executable).

Disclaimer

This is a personal approach based on general experience working with C codebases. It does not represent any official guideline or the opinion of my employer.

As with any low-level technique, evaluate carefully before adopting it in production.

Usage and License

The supporting files (enum_desc.h, enum_desc.c, enum_dwarf_query.py ) are provided under the MIT license and are intended to be copied and used as-is in your own projects.

You can simply copy and/or modify them into your project and integrate the extractor into your build process — no special packaging or setup is required

Optimizing Chained strcmp Calls for Speed and Clarity — Without Refactoring

Yair Lenga — Mon, 13 Apr 2026 19:35:11 GMT

From memcmp and bloom filters to 4CC encoding for small fixed-length string comparisons

While working on a financial modeling system, we started noticing a gradual degradation in performance. In the beginning — nothing dramatic, just a steady increase in runtime, as the code evolved over years of use.

After some profiling, the issue was traced to a core module where a critical part of business logic was encoded. The structure of the code was not an accident — it started small, and grew in complexity over the years — conditions, edge cases, and date-based complexities.

The Code We Started With

At the center were functions that took action based on currency codes (ISO 3-letter codes), and an as-of date. One specific function was used to load the correct modeling parameters into a configuration structure.

bool model_ccy_lookup(const char *s, int asof, struct model_param *param)
{
    // Major Currencies
    if ( strcmp(s, "USD") == 0 || strcmp(s, "EUR") == 0 || ...) {
        ...
    // Asia-Core
    } else if ( strcmp(s, "CNY") == 0 || strcmp(s, "HKD") == 0 || ... ) {
        ...
    } else if ( ... ) {
        ...
    } else {
        ...
    }
}

The above is a simplified version. The actual logic was more involved, but the structure was the same — long chains of strcmp, grouped by common business rules. See full code — Github Gist.

When It Became a Bottleneck

The slowdown wasn’t caused by a single change. It was gradual, building up over the years. Each added currency introduced a little more work. The chain was organized by “popularity”: major currencies were resolved within 2–3 strcmp calls, and that never changed. But the average cost went up, and users working with 'non-major' currencies saw a significant and growing performance penalty.

Profiling quickly revealed the issue. The block was executed frequently, and most of the work was repeated strcmp. In the worst case scenario, none of the conditions match, which meant repeated sequences of failed strcmp.

Refactoring the logic into a lookup structure was not a possibility. The conditions were tied to business rules that had to stay explicit, visible and auditable. Specifically, each block had nested if statements, based on the asof date parameter. So many sub-conditions looked like the example below: clean business logic, but expensive to execute.

if ( strcmp(s, "USD") == 0 || strcmp(s, "EUR") == 0 || ...) {
        *param = major_ccy_param ;
        if ( strcmp(s, "USD") == 0 ) {
            if ( asof < USD_CUTOFF ) {
                param->p1 = ... ;
            }
        } else if ( strcmp(s, "EUR") == 0 ) {
            if ( asof < EUR_CUTOFF ) {
                param->p2 = ... ;
            }
        }
    }

Making strcmp Less Expensive

Re-coding the conditions

The first step was not about performance. It was about making the code easier to change.

Instead of writing:

if (strcmp(s, "USD") == 0 || strcmp(s, "EUR") == 0 || ...)

we introduced a small helper:

#define CCY_EQ(x, y) (strcmp((x), (y)) == 0)

This didn’t make things faster, but it gave us a single place to experiment with different implementations.

Reducing function calls

The first observation was simple: most of these checks fail.

A typical path would evaluate several CCY_EQ calls — sometimes 3–5, sometimes more than 10 — before finding a match or falling through. In those cases, we were paying the cost of a full strcmp call each time. (Note: the code was optimized for the "USD" case, which requires only 2 calls, both returning true)

Since most comparisons fail early, and a significant part of the cost is the function call itself, we tried a small change (tagged strcmp1 in the benchmarks):

static inline bool CCY_EQ(const char *x, const char *ccy) 
{
    return x[0] == ccy[0] && strcmp(x+1, ccy+1) == 0 ;
}

This turns each comparison into:

a cheap first-character check
followed by a strcmp only if needed

In practice, this brought the cost of a failed comparison much closer to a single character test.

Faster compare with memcmp

The next observation was also easy: all currency codes are short and have a fixed size of 4 characters (including the terminating NUL). There is no need to use strcmp to find the end of the string. Instead, we compare a fixed number of bytes.

static inline bool CCY_EQ(const char *s, const char *ccy)
{
    return memcmp(s, ccy, 4) == 0;
}

As a bonus, memcmp with fixed size is typically inlined by the compiler into a fast sequence of load/compare, avoiding function call overhead entirely.

This alone gave noticeable improvement, on top of the char compare + strcmp approach.
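
One caveat worth spelling out: memcmp(s, ccy, 4) always reads four bytes from both arguments, so it assumes every input holds at least 4 readable bytes (for example, a NUL-terminated 3-letter code). If inputs can be shorter, one hedged option is to normalize them once, at the boundary, into a zero-padded 4-byte buffer. The helper names ccy_normalize and ccy_eq4 below are hypothetical and not taken from the original code.

```c
#include <stdbool.h>
#include <string.h>

// Hypothetical helper: copy a currency code into a zero-padded 4-byte buffer.
// Returns false if the input is longer than 3 characters.
bool ccy_normalize(const char *s, char out[4])
{
    size_t n = strlen(s) ;
    if ( n > 3 ) return false ;
    memset(out, 0, 4) ;       // pad with NUL so all 4 bytes are defined
    memcpy(out, s, n) ;
    return true ;
}

// Same comparison as CCY_EQ above; safe once both sides hold 4 bytes.
// ccy is expected to be a 3-letter string literal such as "USD".
bool ccy_eq4(const char *s, const char *ccy)
{
    return memcmp(s, ccy, 4) == 0 ;
}
```

Normalizing once per input keeps the hot comparison path unchanged, so the fixed-size memcmp can still be inlined by the compiler.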

Full inlining

As an experiment, we pushed the idea of single byte compare, and expanded the comparison into explicit character checks:

static inline bool CCY_EQ(const char *s, const char *ccy)
{
   return s[0] == ccy[0] && s[1] == ccy[1] && s[2] == ccy[2] && s[3] == ccy[3] ;
}

This had an interesting side effect. Both gcc and clang were able to optimize these expressions quite aggressively — reordering comparisons, and even combining conditions across different branches.

For example, currencies with a common prefix (like “INR” and “IDR”) would share part of the decision path, reducing redundant checks.

At this point, we had significantly reduced the cost of each comparison. But the structure of the code was still the same — long chains of conditions — and the total cost still grew with the number of entries.

Summary — improving on strcmp

The performance gains are impressive — small, local changes to the implementation resulted in up to 5.7X speedup over strcmp. Leveraging compiler optimization takes those improvements further - up to 8X in the best case scenario.

The benchmark approximates a realistic distribution: most calls target a small set of major currencies, and fewer calls to the other currencies. A smaller number of calls (<1%) were made with currencies that were not handled in the lookup logic.

The test was executed with various compilation flags — both for gcc and for clang: -Og, -O, -O2 and the aggressive -O3.

The table below summarizes relative performance. The baseline case is strcmp, compiled with -O2, normalized to 1.0. Higher scores are faster.

GCC             -Og   -O    -O2   -O3
--------------- ----  ----  ----  ----
ccy-01-strcmp   0.95  1.00  1.00  0.93
ccy-02-strcmp1  1.11  1.97  2.34  6.27
ccy-03-memcmp   0.81  0.87  5.71  5.47
ccy-04-charcmp  1.09  2.57  2.62  8.01

CLANG           -Og   -O    -O2   -O3
--------------- ----  ----  ----  ----
ccy-01-strcmp   1.00  1.00  0.99  0.95
ccy-02-strcmp1  3.22  3.22  4.43  4.40
ccy-03-memcmp   5.58  5.34  5.62  5.19
ccy-04-charcmp  4.03  4.68  8.18  7.76

Key takeaway:

Even without changing the structure of the code, the cost of each failed comparison can be reduced dramatically.

Cleaning It Up — and Breaking Performance

We managed to address the performance problem. Next, we wanted to address the code quality problem: make the code easier to maintain, audit, and read. The preferred solution was to replace each chained-if with a single call to CCY_IN (see full code as GitHub Gist):

// BEFORE - chained IF
if (CCY_EQ(s, "USD") || CCY_EQ(s, "EUR") || CCY_EQ(s, "JPY") ||
        CCY_EQ(s, "GBP") || CCY_EQ(s, "CHF") || CCY_EQ(s, "CAD") ||
        CCY_EQ(s, "AUD")) {
            ...
        }

// AFTER - similar to SQL "IN" clause.
if (CCY_IN(s, "USD", "EUR", "JPY", "GBP", "CHF", "CAD", "AUD")) {
    ...
}

The initial implementation was simple:

static inline bool ccy_in(const char *s, const char **ccy_list)
{
    while (*ccy_list) {      
        if (CCY_EQ(s, *ccy_list))
            return true;
        ccy_list++;
    }
    return false;
}

#define CCY_IN(ccy, ...) ({ \
    static const char *ccy_list[] = { __VA_ARGS__, NULL } ; \
    ccy_in(ccy, ccy_list) ; \
    })

But the results were poor — worse than the equivalent chained-if implementation.

Exploring Alternatives

We tried different directions — and they all hit the same wall.

Applying the strcmp speed-up 'tricks' (the strcmp1 and memcmp variants) improved the results, but they remained significantly below the chained-if approach. Getting the best outcome was challenging: it required fine-tuning various compiler options, which is not ideal for maintainability.
We tried to “flatten” the data structure, moving from const char ** to const char [][4], expecting the reduced indirection to improve performance. Result: no impact (test ccy-22-strin-4).
As an experiment, the code was modified to use a Bloom filter to reduce the number of comparisons, at the price of more complex code. The net effect of the higher “fixed cost” on every call was lower performance.
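
For readers curious about the Bloom-filter experiment, a minimal sketch of the general idea follows. It is an assumption about the technique, not the benchmarked code: demo_ccy_set, demo_ccy_bit, demo_ccy_set_init and demo_ccy_in are hypothetical names, and the filter is a single 32-bit mask keyed on the first two characters.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// Hypothetical set: NULL-terminated list of codes plus a 32-bit filter mask.
struct demo_ccy_set {
    const char *const *list ;
    uint32_t mask ;
} ;

// Map a code to one of 32 bits using its first two characters.
static uint32_t demo_ccy_bit(const char *s)
{
    unsigned c0 = (unsigned char) s[0] ;
    unsigned c1 = c0 ? (unsigned char) s[1] : 0u ;   // do not read past an empty string
    return UINT32_C(1) << ((c0 * 7u + c1) & 31u) ;
}

// Build the mask once, by OR-ing the bit of every code in the list.
void demo_ccy_set_init(struct demo_ccy_set *set, const char *const *list)
{
    set->list = list ;
    set->mask = 0 ;
    for (const char *const *p = list ; *p ; p++)
        set->mask |= demo_ccy_bit(*p) ;
}

bool demo_ccy_in(const struct demo_ccy_set *set, const char *s)
{
    // Fast reject: a clear bit means s cannot be in the list.
    if ( (set->mask & demo_ccy_bit(s)) == 0 ) return false ;
    // A set bit may be a false positive, so confirm with a real compare.
    for (const char *const *p = set->list ; *p ; p++)
        if ( strcmp(s, *p) == 0 ) return true ;
    return false ;
}
```

Every call pays for computing the bit and testing the mask, even when the first compare would have matched. That is the kind of per-call “fixed cost” described above.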

The table below summarizes the relative performance of the various CCY_IN implementations. The baseline case is strcmp, compiled with -O2, normalized to 1.0. Higher scores are faster. The best score (4.8X, for the 'memcmp' variant) is almost 2X slower than the best score of the chained-if approach (8X).

GCC                   -Og   -O    -O2    -O3
--------------------- ----  ----  ----   ----
strcmp (BASELINE)     0.95  1.00  1.00   0.93
ccy-21-strin          0.75  0.91  0.91   0.88
ccy-22-strin-4        0.72  0.90  0.89   0.88
ccy-23-strin-cmp1     0.87  2.03  2.08   2.66
ccy-24-strin-memcmp   0.63  0.83  3.41   3.38
ccy-25-strin-filter   1.30  2.83  3.75   3.60
ccy-26-strin-filter4  1.29  2.70  3.74   3.58

CLANG                 -Og   -O    -O2    -O3
--------------------- ----  ----  ----   ----
ccy-21-strin          0.81  0.87  1.01   0.87
ccy-22-strin-4        0.85  0.81  1.01   0.98
ccy-23-strin-cmp1     1.99  1.67  3.00   3.00
ccy-24-strin-memcmp   3.27  2.75  4.80   5.05
ccy-25-strin-filter   2.96  2.40  3.69   3.44
ccy-26-strin-filter4  2.81  2.85  3.35   3.47

Notes:

Tests ‘strin-filter’ and ‘strin-filter4’ used a Bloom filter to skip strcmp calls.
Tests ‘strin-4’ and ‘strin-filter4’ used the flat data structure.
Tests ‘strin-cmp1’ and ‘strin-memcmp’ used the strcmp speedup 'tricks' described before.

Key Takeaway:

All of those approaches were still comparing strings, one character at a time.

Stop Comparing Strings

Realizing that strcmp is the bottleneck, we looked at alternatives. We already observed that all strings are short, have the same length, and fit into a 32-bit integer. So we decided to try to use the FourCC (4cc) Encoding. The basic idea is to pack 4 bytes into an integer, and replace repeated char compares with a single integer comparison. For example: "USD" becomes 0x00534455 (or 0x41524400 on big-endian architectures).

// Before
strcmp(s, "USD")

// After
*(int *) s == 0x00445355

The CCY_EQ is now reinterpreting the 4-byte strings as an integer:

#define CCY_EQ(x, ccy) (*(int *)(x) == *(int *)(ccy))

This turns each comparison into a single integer load and compare.

Notes:

On modern x86/x86_64/ARM, it is OK to fetch an integer via an unaligned pointer (a key enabler for this approach), sometimes with a minor performance cost.
It is possible to write a standard-compliant implementation; it is slightly noisier.
No explicit conversion is needed.
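
The second note can be made concrete. Below is a minimal sketch of a standard-compliant variant, assuming every code lives in a buffer of at least 4 bytes (3 letters plus the NUL terminator). The names ccy_load, ccy_eq and ccy_pack_le are illustrative and do not come from the original code:

```c
#include <stdint.h>
#include <string.h>

/* Copy the first 4 bytes of the code into an integer. memcpy sidesteps the
   alignment and aliasing questions raised by *(int *)s; compilers typically
   reduce it to a single 32-bit load. Assumes s points to at least 4 bytes. */
static inline uint32_t ccy_load(const char *s)
{
    uint32_t v;
    memcpy(&v, s, sizeof v);
    return v;
}

/* Equality of two 4-byte codes as one integer compare. */
static inline int ccy_eq(const char *a, const char *b)
{
    return ccy_load(a) == ccy_load(b);
}

/* Endian-independent packing, handy for checking constants by hand:
   byte 0 lands in the low 8 bits, as a little-endian load would place it. */
static inline uint32_t ccy_pack_le(const char *s)
{
    const unsigned char *u = (const unsigned char *)s;
    return (uint32_t)u[0] | (uint32_t)u[1] << 8 |
           (uint32_t)u[2] << 16 | (uint32_t)u[3] << 24;
}
```

With these helpers, ccy_eq("USD", "USD") is true, ccy_eq("USD", "EUR") is false, and ccy_pack_le("USD") yields 0x00445355 on any host.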

Even in this basic form, the macro outperforms all previous strcmp-based implementations—without requiring compiler optimization. The raw speed gain from comparing 4cc codes as integers is higher than the gains from optimizing the number of calls to strcmp.

At this point, we combined the 4cc encoding with the various CCY_IN implementations. This provides better performance vs. the previous CCY_IN that were based on strcmp.

Hint from charcmp

The latest version was still failing to match the performance of the charcmp approach, where the strcmp was unrolled into a series of single-character comparisons. This gave us a hint: unrolling. Our code was using loops for membership tests. Can we combine all the findings from the various tests into a clean, performant implementation?

We tried:

#define CCY_EQ(x, ccy) (*(int *)(x) == *(int *)(ccy))

#define CCY_EQ0(ccy, x) ((x) && CCY_EQ(ccy, x))

#define CCY_IN_8(ccy, x1, x2, x3, x4, x5, x6, x7, x8, ...) \
    (CCY_EQ0(ccy, x1) || CCY_EQ0(ccy, x2) || \
     CCY_EQ0(ccy, x3) || CCY_EQ0(ccy, x4) || \
     CCY_EQ0(ccy, x5) || CCY_EQ0(ccy, x6) || \
     CCY_EQ0(ccy, x7) || CCY_EQ0(ccy, x8))

#define CCY_IN(ccy, ...) CCY_IN_8(ccy, __VA_ARGS__, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)

Combining all the findings from the previous benchmarks:

CCY_IN unrolls the comparison.
CCY_EQ performs a fast integer comparison.
The compiler short-circuits the chained-if when fewer than 8 values are provided.

The result: we have clean source conditions: if (CCY_IN(s, "USD", "EUR", "JPY", "GBP", ...)), that get expanded into efficient and highly optimizable code:

int sv = *(int *) s;
if ( sv == 0x00445355 || sv == 0x00525545 || sv == 0x0059504A || sv == 0x00504247 || ...)
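
To see the NULL padding and short-circuiting at work, here is a compact, compilable sketch of the same macro family. It swaps the pointer cast for a memcpy-based load, an assumption made here to keep the example strictly conforming. ccy_word and is_g4_ccy are illustrative names:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Load a 4-byte currency code as one integer (assumes at least 4 readable bytes). */
static inline uint32_t ccy_word(const char *s)
{
    uint32_t v;
    memcpy(&v, s, sizeof v);
    return v;
}

#define CCY_EQ(x, ccy)  (ccy_word(x) == ccy_word(ccy))
/* The NULL test is a constant for literal arguments, so an optimizer can drop padded slots. */
#define CCY_EQ0(ccy, x) ((x) != NULL && CCY_EQ(ccy, x))

#define CCY_IN_8(ccy, x1, x2, x3, x4, x5, x6, x7, x8, ...) \
    (CCY_EQ0(ccy, x1) || CCY_EQ0(ccy, x2) || \
     CCY_EQ0(ccy, x3) || CCY_EQ0(ccy, x4) || \
     CCY_EQ0(ccy, x5) || CCY_EQ0(ccy, x6) || \
     CCY_EQ0(ccy, x7) || CCY_EQ0(ccy, x8))

/* Pad with 8 NULLs so that 1 to 8 real arguments always fill x1..x8. */
#define CCY_IN(ccy, ...) \
    CCY_IN_8(ccy, __VA_ARGS__, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)

/* Membership test with only four values; slots 5..8 are NULL and short-circuit. */
int is_g4_ccy(const char *s)
{
    return CCY_IN(s, "USD", "EUR", "JPY", "GBP");
}
```

Calling is_g4_ccy("GBP") returns non-zero, while is_g4_ccy("CHF") returns 0.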

STR_IN with 4cc encoding:

The table below summarizes the results using 4cc encoding. In all cases, comparing 4cc-encoded strings outperforms character-by-character comparison. Combining it with the CCY_IN construct did NOT hurt performance; it actually improves it. The final code runs 10X faster than the initial implementation, and 20% faster than the chained-if approach.

GCC                   -Og   -O     -O2    -O3
--------------------- ----  ----  -----  -----
strcmp (BASELINE)     0.95  1.00   1.00   0.93
ccy-21-strin          0.75  0.91   0.91   0.88
ccy-31-4cc            4.86  6.12  11.26  10.79
ccy-32-4cc-in         2.31  3.90   3.38   3.50
ccy-33-4cc-in4        2.36  4.11   3.55   3.53
ccy-34-4cc-filter     2.41  4.03   5.08   4.86
ccy-35-4cc-filter4    2.43  3.98   5.32   5.63
ccy-36-4cc-opt        4.83  6.42  10.50  10.79

CLANG
ccy-24-strin-memcmp   3.27  2.75   4.80   5.05
ccy-31-4cc            8.56  8.46   8.66   8.76
ccy-32-4cc-in         3.20  3.50   7.68   8.37
ccy-33-4cc-in4        3.59  3.42   9.53  10.18
ccy-34-4cc-filter     5.02  4.48   6.66   6.66
ccy-35-4cc-filter4    4.03  4.51   6.07   6.49
ccy-36-4cc-opt        8.46  8.37   8.66   8.56

What This Taught Me

The project provided an opportunity to explore the topic of strcmp performance, which we sometimes treat as a black box. It was also a good experience of trying to balance performance requirements and non-functional requirements (readability, maintainability, auditability):

On the learning side:

Start with local optimization, change structure when you hit a wall.
Clean code and high performance are not always in conflict — sometimes it is possible to achieve both.

Recap of ideas:

The strcmp may be expensive, but it does not have to be. When strcmp sits at the core of hot code, it's worth asking: what are the alternatives? A few options we visited in the article:
Reduce the number of strcmp calls by performing some comparison at call site. This is extremely effective if most calls are likely to fail.

static inline bool CCY_EQ(const char *s1, const char *s2)
{
    return *s1 == *s2 && strcmp(s1+1, s2+1) == 0 ;
}

Fixed-size strings are opportunities for performance improvements. Treating those strings as raw data (as opposed to NUL-terminated strings) unlocks strategies for performance improvements:

Consider memcmp to replace strcmp for fixed size strings - the difference is big.

FourCC (and its big brother EightCC) enable efficient processing of strings (and other data items) — without low-level bit tricks, SIMD wizardry, or hard-to-maintain code. They are simple to implement, and do not require dependency on 3rd party libraries (Reminder: See note about portability).

Disclaimer

This is a personal approach based on general experience working with C codebases. It does not represent any official guideline or the opinion of my employer.

As with any low-level technique, evaluate carefully before adopting it in production.

Optimizing Chained strcmp Calls for Speed and Clarity — Without Refactoring was originally published in Stackademic on Medium, where people are continuing the conversation by highlighting and responding to this story.

Safer Casting in C — With Zero Runtime Cost Making casts visible, auditable, and harder to misuse

Yair Lenga — Sun, 05 Apr 2026 14:18:21 GMT

C gives us two types of casting:

Implicit casting — happens automatically in expressions and function calls
Explicit casting — with the cast operator (T) v (e.g. (int) x)

Both are powerful, and both make it easy to introduce subtle, hard-to-detect bugs.

The problem is not that casting exists — it’s that it’s too easy to misuse, and too hard to audit.

Aggressive Explicit Casting

The explicit casting will happily convert almost anything into anything else:

T* to/from T**
Pointers and integers
Qualifiers stripped silently.

Another problem: cast precedence

C casts have very high precedence — higher than most operators. This means they bind tightly to the expression that follows.

Example:

char *p = ... ; 
long x = *(long *) p + 1;

is parsed as

long x = (*(long *) p) + 1;                // Option A

A developer unfamiliar with exact precedence rules might expect something closer to one of the following:

long x = *((long *) (p+1)) ;               // Option B (1 byte forward)
long x = *((long *) (p+sizeof(long))) ;    // Option C (next long)

These expressions look similar, but behave very differently:

The first (Option A) reads a long from p, then adds 1
The second (Option B) reads a long from p+1
The third (Option C) reads a long from p+sizeof(long)

Because casts bind tightly, small changes in parentheses can silently change behavior — making such bugs hard to spot in review.
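
The same binding rule affects arithmetic casts, which makes for a safer demonstration than reading through misaligned pointers. A small sketch (the function names are illustrative):

```c
/* The cast applies to a only, so the division is done in floating point. */
double ratio_cast_first(int a, int b)
{
    return (double) a / b;
}

/* Parentheses force integer division first; the cast sees the truncated result. */
double ratio_divide_first(int a, int b)
{
    return (double) (a / b);
}
```

ratio_cast_first(7, 2) returns 3.5, while ratio_divide_first(7, 2) returns 3.0. The two bodies differ only by a pair of parentheses.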

Safer Solution?

Compare this to SQL’s CONVERT(type, value) which makes conversions obvious and searchable. In this case one would write:

long x = *CAST(long *, p) + 1 ;            // Add one to the long from p
long x = *CAST(long *, p+1) ;              // Pick the long starting one byte after p.
long x = *CAST(long *, p+sizeof(long)) ;   // Pick the second long from p

The goal is not to prevent all invalid casts — but to make incorrect ones fail early, and valid ones easy to audit.

A Simple Idea:

Replace (T) v with function-like macros that:

Make casts visible
Enforce basic correctness at compile time
Add zero-runtime cost

This is not “Type-Safe C” — just “harder-to-misuse C”.

This is essentially introducing semantic casts — each cast encodes intent, not just type conversion.

Example — Implicit conversion bug:

The following small program should print the absolute value of the first argument. Calling the wrong absolute value function results in implicit truncation and bad calculation.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        double v = atof(argv[1]) ;
        double abs_v = abs(v) ; 
        printf("ABS(X)=%f\n", abs_v) ;
}

It compiles, but is wrong: abs() expects and returns int, so v is implicitly converted double -> int -> double. This truncates the fractional part.

Output — expecting 3.14, getting 3.0

./a.out -3.14
ABS(X)=3.000000
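
For comparison, here is a sketch of the two behaviours side by side, assuming the intended function was fabs from <math.h>. The helper names are illustrative:

```c
#include <math.h>
#include <stdlib.h>

/* What the buggy program effectively computes: double -> int -> abs -> double. */
double abs_via_int(double v)
{
    return abs((int) v);
}

/* The intended computation: fabs keeps the fractional part. */
double abs_via_fabs(double v)
{
    return fabs(v);
}
```

For an input of -3.14, abs_via_int returns 3.0 and abs_via_fabs returns 3.14.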

Structured CAST macros

Design Goals

Make casts visible and searchable.
Catch common mistakes
Zero runtime cost
Simple syntax and drop-in usage

Proposed API

T CAST(T, v)           // generic entry point
T CAST_VAL(T, val)     // scalar / arithmetic values
T CAST_PTR(T, ptr)     // any pointer
T CAST_PTR1(T, ptr)    // Same as CAST_PTR, limit to single-level pointers
UNCONST_PTR(ptr)       // remove qualifiers from the pointee type; ptr must point to a const object
UNCONST_PTR1(ptr)      // same as UNCONST_PTR, limit to single-level pointers
T CAST_CPTR1(T, ptr)   // Same as CAST_PTR1, ptr MUST be const

The full header file (~100 lines of code) can be downloaded from GitHub Gist

Before / After

/* before */
int n = (int) x;
char *buf = ... ;
long *lp = (long *) buf + 1;
free((void *) s);

/* after */
#include "safe-cast.h"
int n = CAST_VAL(int, x);
long *lp = CAST_PTR1(long *, buf) + 1;
free(UNCONST_PTR1(s));

In practice, the biggest benefit is not just safety — it’s visibility.

All conversions are now:

Visible
Structured
Easy to grep and audit
Will fail on “trivial” mistakes

API Description

CAST(T, v)

Generic entry point. This is the macro to use when the code should read like a normal cast, but still go through the structured API. The main value of CAST is readability: all explicit conversions now go through one visible, searchable API.

Usually, this is the first step — replace the hidden casts (T) v, with CAST(T, v).

Example:

int x = CAST(int, y);
void *p = CAST(void *, buf);

CAST_VAL(T, val)

Used for value conversions: integers, floating-point values, enums, booleans, and similar arithmetic cases.

Example:

int n = CAST_VAL(int, d);
double x = CAST_VAL(double, count);

This macro should reject pointer expressions, so pointer-to-integer or pointer-to-float mistakes do not silently slip through. The macro is similar to C++ static_cast<T>(v), so if your team maintains both C++ and C code, you may even name it STATIC_CAST.

CAST_PTR(T, ptr)

Used for pointer conversions where the only requirement is that ptr is a pointer expression.

Example:

void *p = CAST_PTR(void *, src);
char *s = CAST_PTR(char *, p);

This macro is intentionally lightweight. Its main job is to separate pointer casts from value casts, making them easy to audit. It rejects any argument that is NOT a pointer. The macro is somewhat similar to C++ reinterpret_cast<T>(p), which allows reinterpretation of a raw pointer, so if your team maintains both C++ and C code, you may name it REINTERPRET_CAST.

CAST_PTR1(T, ptr)

Used for single-level pointer casts only. In other words, ptr must be a T* -style pointer, not T **, and not void *. Example usage would be converting long * to int *, char * to struct foo *, etc.

Example:

struct foo *buf = ... ;
char *s = CAST_PTR1(char *, buf);

char *work = ... ;
struct node *n = CAST_PTR1(struct node *, work);

Typical invalid cases:

char **pp = ... ;
char *p = CAST_PTR1(char *, pp);   /* should fail */

This is useful because a common hard-to-detect cast mistake in real code is accidentally mixing T * and T ** (or other levels of nesting or array-ness).

UNCONST_PTR(ptr)

Removes the ‘const’ qualifier from the pointee type of a pointer expression.

Example:

const char *p = get_text();
    // We know we can modify the value, get UNCONST_PTR pointer
char *q = UNCONST_PTR(p) ;

This is intentionally narrow: it is for pointer-to-data cases such as const T * -> T *. It does not try to be a general-purpose “remove const from anything” macro.

That restriction is intentional. Applying “unconst” to plain values or even structs by value is not useful; the macro handles the common case where we want to change the mutability of referenced data. A common use case is a function that returns a value that must be freed, while we want to treat the value itself as immutable.

const char *x = get_value() ;
do_something(x) ;
free(UNCONST_PTR(x)) ;     // free rejects const *

UNCONST_PTR1(ptr)

Explicit form of UNCONST_PTR for the common T * case. This makes the intent even clearer: remove qualifiers from a first-level pointer, not from nested pointers.

Example:

const struct header *h;
struct header *mh = UNCONST_PTR1(h);

In general, for most cases, we want to use UNCONST_PTR1, which indicates that we expect a pointer to a non-mutable object. The macro rejects nested pointers.

CAST_UNCONST1

The UNCONST_PTR1 and UNCONST_PTR macros depend on C23 typeof_unqual (or the gcc/clang typeof_unqual). If those are not available (gcc <=13, clang <=18), it is possible to use CAST_UNCONST1, which is similar to the CAST_PTR1 macro, with the additional check that ptr points to a const object.

const char *cp = read_token(...) ;
     // We want to update cp
char *mut_cp = CAST_UNCONST1(char *, cp) ;
mut_cp[0] = '@' ;

Combining with static checkers

Besides making conversions more readable/searchable, it is possible to combine those macros with strict checking by enabling the conversion warnings in gcc/clang. Basic idea:

Compile with gcc/clang -Wconversion, which warns on implicit narrowing (long -> int, int -> short), signed/unsigned conversions and float/int conversions.
Review each conversion. Fix broken conversions, and add explicit conversions with CAST, CAST_PTR1, UNCONST_PTR1 as needed.
Promote conversion warnings to errors with -Werror=conversion

At this point, any code changes that result in implicit or illegal conversions (as per Macro restrictions) will be flagged as an error — preventing unexpected surprises at run time.
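
A minimal sketch of what the workflow looks like on a single function. The CAST definition is restated so the fragment compiles on its own, and days_explicit is an illustrative name:

```c
#define CAST(T, v) ((T)(v))

/* Before review, the body was:   int d = total_days;
   On platforms where long is wider than int, -Wconversion flags that line
   because the value is implicitly narrowed.

   After review: the narrowing is intentional, visible and easy to grep. */
int days_explicit(long total_days)
{
    int d = CAST(int, total_days);
    return d;
}
```

Once the remaining warnings are cleared this way, switching to -Werror=conversion keeps new implicit conversions from creeping in.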

For even stronger validation, consider enabling the gcc/clang cast-related warnings (-Wcast-qual, -Wcast-align, ...). You can also get additional mileage from the gcc/clang analyzers, and from additional tools such as clang-tidy or commercial tools such as Coverity.

Implementation

Each macro is implemented with an expression that forces compile-time checks, with zero run-time checks. Because C has no standard way to express some of the checks (e.g. is_pointer), we rely on compiler extensions that exist in gcc/clang. If you use other compilers that do not support the gcc/clang extensions, it is possible to implement SOME of the restrictions with C23.

It’s important to highlight one aspect of those macros: they capture the intended usage of the cast. While current implementations/compilers might not be able to fully enforce the restriction, future implementations and compiler versions might include new extensions, and future updates to the C standard might support stronger enforcement.

You can retrieve the implementation of those macros for gcc 13 and clang 18, which I validated on Ubuntu 2024 (under WSL), from the GitHub Gist, and drop it into your code base. It is a single header file.

Below is a short description of the implementation of CAST and CAST_VAL. The other macros follow the same lines.

Implementing the basic CAST

#define CAST(T, v) ((T)(v))

Implementing the CAST_VAL(T, val)

All macros that enforce a restriction on the converted value are implemented with 2 expressions:

The first expression enforces the restriction at compile time, but has no run-time effect.
The second expression performs the conversion, using the basic CAST.

#define CAST_VAL(T, v) (CAST_REQUIRE_VALUE(v), CAST(T, v))

static inline void cast_require_value(double v) { (void) v; }

#define CAST_REQUIRE_VALUE(v) ((void)sizeof(cast_require_value(v)))

The cast_require_value function does nothing, but it only accepts a double value. In C, scalar numeric values (integer, floating-point, enum, ...) are implicitly converted to double as needed. Therefore, the call to cast_require_value compiles if a value is passed, and fails with a non-value: specifically pointers, unions, structures, ...

We mentioned that we want ZERO run-time effect. This is achieved by wrapping the restriction in sizeof. Per C language rules, sizeof does NOT evaluate its operand; it calculates the size at compile time (both gcc and clang evaluate sizeof(void) to 1).
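
Putting the pieces together, here is a compilable sketch. It differs from the header described above in one respect, which is an assumption made here for portability: the checker returns int, so the sizeof operand is never void and no extension is needed. truncate_rate is an illustrative name:

```c
#define CAST(T, v) ((T)(v))

/* Accepts anything convertible to double; pointers and structs are rejected
   by the compiler because they cannot be passed as a double argument. */
static inline int cast_require_value(double v) { (void) v; return 0; }

/* sizeof does not evaluate its operand, so the call is type-checked but never executed. */
#define CAST_REQUIRE_VALUE(v) ((void) sizeof(cast_require_value(v)))

/* Comma expression: compile-time check first, then the actual conversion. */
#define CAST_VAL(T, v) (CAST_REQUIRE_VALUE(v), CAST(T, v))

int truncate_rate(double rate)
{
    return CAST_VAL(int, rate);      /* OK: arithmetic value */
}

/* int bad(const char *s) { return CAST_VAL(int, s); }
   is rejected at compile time: s is a pointer, not a value. */
```

truncate_rate(3.99) returns 3, exactly as a plain (int) cast would, and the generated code contains no trace of the check.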

Summary

C casting is extremely powerful — and that’s exactly the problem. Both implicit and explicit casts can silently introduce bugs that are hard to detect and even harder to trace.

This article proposes a simple approach:

Replace (T) v with function-like macros
Make conversions visible and searchable
Catch common mistakes at compile time
Keep zero runtime cost

The goal is not to make C “type-safe”, but to make casting:

Explicit
Auditable
Harder to misuse

Some of the biggest improvements in C code don’t come from new language features — but from better discipline, encoded in small reusable patterns.

Caveats and Limitations

This approach is intentionally lightweight and pragmatic, but it comes with a few important caveats:

Compiler support

The implementation relies on GCC/Clang extensions (e.g. type checks via expressions that trigger compile-time errors).
It has been tested with GCC 13, 14 and Clang 18, 19, but may not work — or may require adjustments — on other compilers.

Not a complete type system

These macros do not make C type-safe.
They only enforce a limited set of structural constraints (e.g. pointer vs value, pointer depth).

Incorrect casts are still possible — but many common mistakes become compile-time errors.

Error messages

Some invalid uses will produce compiler errors that are not always intuitive, especially when triggered through macro expansion.

This is a trade-off for zero runtime cost and portability.

Requires discipline

The benefits come from consistent usage:

Replace (T)v with CAST(...)
Enable strict warnings (-Wconversion, -Werror)

Partial adoption reduces effectiveness.

Test before adoption

This approach should be validated in your codebase:

Check compatibility with your compiler/toolchain
Evaluate error messages and developer experience
Ensure it integrates well with existing coding guidelines

Disclaimer

This is a personal approach based on practical experience working with C codebases.
It does not represent any official guideline or the opinion of my employer.

As with any low-level technique, evaluate carefully before adopting it in production.

Temporary Memory Isn’t Free: Allocation Strategies and Their Hidden Costs

Yair Lenga — Mon, 30 Mar 2026 12:08:24 GMT

This article completes a short series on temporary memory in C:

First, we introduced stack-based allocation (using VLA) as a practical alternative to malloc for short-lived data
Then, we showed how to safely estimate and manage available stack space
Here, we measure the payoff: how allocation strategy impacts performance in a realistic workload

Introduction

In many discussions, memory allocation is treated as an O(1) operation — a constant-time primitive that can be safely ignored in performance-critical code. In practice, that constant can be surprisingly large. Allocating memory may involve managing free lists, splitting or merging blocks, or occasionally requesting additional pages from the operating system. These costs are usually hidden behind a fast path, but when allocation happens repeatedly inside tight loops, the “constant time” assumption starts to leak.

This becomes particularly relevant in financial analytics, where temporary memory is not just a convenience but often a requirement. Models are typically structured as independent components, each producing intermediate results that must be preserved for auditability, explainability, or reuse in downstream calculations. A typical valuation may generate multiple time series — such as scheduled principal, prepayments, interest, and losses — which are then aggregated or fed into other models. This separation improves modularity and traceability, but it also means that even relatively simple calculations rely on temporary arrays that are allocated and discarded repeatedly.

While this article uses VLA as a concrete mechanism, the underlying point is broader: stack-based allocation has fundamentally different performance characteristics than heap allocation.

As a result, allocation patterns that might seem avoidable in theory become common in practice — especially when applied across large portfolios, where these temporary structures are created thousands or millions of times.

A Simple Use Case: Loan Portfolio PV Calculation

A typical loan valuation computes cashflows over a time grid (often daily or monthly) from origination to maturity. For each time step, the model derives scheduled principal, interest, and expected losses based on the current balance and assumptions such as prepayment and default rates. These values are usually stored in arrays, both to support multi-pass calculations (e.g., aggregation, stress adjustments) and to provide a full audit trail of intermediate results. The discounted present value is then obtained by applying a corresponding discount factor curve to each time step.

Per-loan processing

Each loan is evaluated independently over a time grid (daily, using a simplified 30/360 convention) from origination to maturity. At each step, the model updates the outstanding balance and computes the corresponding cashflows based on contractual terms and behavioral assumptions (e.g., prepayments, defaults).

Arrays over time: S, I, U, L, DF

The calculation typically materializes several time series, coming from different models.

S[t] — Scheduled principal payments - Cash Flow model
I[t] — interest payments - Cash Flow model
U[t] — Unscheduled payments (pre-pays) - Prepay model
L[t] — losses (defaults / write-offs) - Loss model.
DF[t] — discount factors - Interest rate model

These arrays are indexed by time and span the full horizon of the loan. In our example, we will cap the horizon to 50 years.

Temporary workspace per loan

Even when only the final present value is required, intermediate results are usually stored in arrays. This supports:

multi-pass calculations (generation → aggregation → adjustments)
scenario or stress overlays
auditability and explainability of results

In practice, this means allocating a working set proportional to the number of time steps for each loan.

A Reasonable Implementation

This implementation is representative of real-world code:

Each loan is processed independently
Temporary arrays (S, I, U, L, DF) are allocated per loan
Discount factors are computed once and reused

Nothing here looks unusual or inefficient — and that’s exactly the point. The expectation is that allocation overhead is small compared to the numerical work.

This assumption is what we test next.

static struct portfolio_result
port_pv_heap_per_loan(int loans, int sim_days)
{
    struct portfolio_result res = {0};
    double *DF = xmalloc(STATIC_MAX_DAYS * sizeof(*DF));
    calc_DF(sim_days, DF, 5.0);
    for (int loan = 0; loan < loans; ++loan) {
        struct loan_info info = get_loan_info(loan, sim_days);
        int loan_days = info.days;
        double *U = xmalloc(loan_days * sizeof(*U));
        double *S = xmalloc(loan_days * sizeof(*S));
        double *I = xmalloc(loan_days * sizeof(*I));
        double *L = xmalloc(loan_days * sizeof(*L));
        model_loan(&info, S, U, I, L);
        res.pv += loan_pv(loan_days, S, U, I, L, DF);
        free(L);
        free(I);
        free(U);
        free(S);
    }
    free(DF);
    return res;
}

The loan modeling code

The modeling logic itself is straightforward and operates over the provided workspace:

static void model_loan(
    const struct loan_info *loan,
    double *S,
    double *U,
    double *I,
    double *L
) {
    // Setup
    double bal = ...
    // Cash flows
    for (int t = 0; t < loan->days; ++t) {
        // Calculate scheduled_principal, prepay, interest and losses
        S[t] = scheduled_principal(...) ;
        U[t] = prepay(...);
        I[t] = interest(...);
        L[t] = loss(...);
    }
}

The actual code (single file, build instructions at the top) is available as a GitHub Gist.

Allocation Strategies

To understand the impact of allocation, we compare several strategies that differ only in how temporary memory is managed. The computation itself is identical in all cases.

Static reusable buffers (reference)

A single set of static buffers is allocated once and reused across all loans.

This approach has effectively zero allocation overhead during the benchmark and serves as the reference point. It represents the best-case scenario where memory management is fully amortized and removed from the hot path.

Stack allocation (VLA and fixed-size)

Temporary arrays are allocated on the stack per loan, either using fixed-size arrays or Variable Length Arrays (VLA).

This avoids heap allocation entirely and keeps allocation cost very low and predictable. However, it is constrained by stack size and may require safeguards for large problem sizes.

Heap allocation per loan (malloc / free)

Each loan allocates its own working arrays using malloc and releases them after processing.

This is the most straightforward and modular approach, but it introduces allocation overhead directly into the hot loop and stresses the allocator under repeated use.

Heap reuse (per portfolio)

Buffers are allocated once per portfolio (or per thread) and reused across loans. Per-loan data is allocated to the maximum possible size.

This removes most allocation overhead while preserving flexibility. It is a common compromise in performance-sensitive systems, but makes the code less modular — callers must anticipate workspace requirements.

Bulk allocation

A single large block is allocated and partitioned into the required arrays (S, U, I, L, DF).

This reduces the number of allocation calls and improves locality, but still relies on the heap allocator and may incur setup cost.
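
A minimal sketch of the partitioning idea, using a toy stand-in for the model. toy_model, toy_pv and loan_pv_bulk are hypothetical names, and the constant cashflows are placeholders rather than the benchmark's real calculation:

```c
#include <stdlib.h>

/* Hypothetical stand-in for the per-loan model: fills the four work arrays. */
static void toy_model(int days, double *S, double *U, double *I, double *L)
{
    for (int t = 0; t < days; ++t) {
        S[t] = 1.0; U[t] = 0.5; I[t] = 0.25; L[t] = 0.125;
    }
}

/* Hypothetical aggregation: cash in minus losses, no discounting. */
static double toy_pv(int days, const double *S, const double *U,
                     const double *I, const double *L)
{
    double pv = 0.0;
    for (int t = 0; t < days; ++t)
        pv += S[t] + U[t] + I[t] - L[t];
    return pv;
}

/* One malloc per loan, carved into four equal slices. Returns -1.0 on failure. */
double loan_pv_bulk(int days)
{
    if (days <= 0)
        return 0.0;
    size_t n = (size_t) days;
    double *block = malloc(4 * n * sizeof *block);
    if (block == NULL)
        return -1.0;
    double *S = block, *U = S + n, *I = U + n, *L = I + n;
    toy_model(days, S, U, I, L);
    double pv = toy_pv(days, S, U, I, L);
    free(block);                      /* a single free releases all four arrays */
    return pv;
}
```

The four arrays end up adjacent in memory, and the allocator is called once per loan instead of four times.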

In all cases, the only difference is how memory is obtained and released.
This allows us to isolate the cost of allocation itself.

Benchmark Design

To capture allocator behavior across realistic environments, we run the same benchmark under several common compiler and allocator combinations:

GCC + glibc (default Linux)
Baseline configuration used in most production Linux systems.
Clang + glibc
Same allocator, different compiler — highlights code generation effects independent of allocation.
GCC + musl
Lightweight allocator commonly used in containers (e.g., Alpine); known for different performance trade-offs.
GCC + mimalloc
Modern allocator optimized for fast paths and low fragmentation, widely used in performance-sensitive systems.
GCC + jemalloc
Mature allocator with strong scalability and fragmentation control, used in databases and large-scale services.
GCC + tcmalloc
Google’s allocator, optimized for high-throughput multi-threaded workloads.

All tests are run in release mode with optimizations enabled, with minimal run-time checks. No debugging, tracing, or instrumentation features are active in any allocator — using default “out-of-the-box” setting.

All measurements are performed in a single-threaded application to keep the analysis focused and comparable. This isolates allocation costs without introducing contention or synchronization effects. In multi-threaded workloads, allocator behavior becomes more complex, and additional overheads — such as thread-local cache management, cross-thread frees, and synchronization — can introduce further runtime penalties. As a result, the single-threaded results presented here should be viewed as a lower bound on allocation cost.

Results

The choice of allocation strategy has a measurable impact on performance — even for relatively simple computations.

In our benchmark, per-loan allocation using malloc is up to 2.5x slower than reusing memory, while stack-based approaches (VLA) remain close to the optimal baseline. With a "simple" allocator (musl), per-loan malloc can be as much as 6x slower.

Throughput

The following table summarizes relative throughput when running the simulation on a portfolio of 1000 loans with durations between 3000 and 12000 days (approximately 8.3 and 33.3 years under the 30/360 convention). Each figure is the average reported by the program over 10 runs.

To replicate: ./alloc-bench 0 1000 12000

Normalized to static = 100% (higher is better)

Compiler →  |                  gcc                      |  clang
Allocator → | glibc  jemalloc  mimalloc  musl  tcmalloc |  glibc
----------------------------------------------------------------
static      | 100%     100%      100%    100%     100%  |   100%
heap/bulk   | 101%      96%       97%     19%      99%  |    99%
heap/loan   |  28%      98%       83%     16%      82%  |    44%
heap/reuse  |  97%     102%       85%     89%      84%  |   100%
vla         |  96%     101%       98%     99%      98%  |   101%

The clang/glibc result for per-loan allocation stands out and may reflect differences in code generation or allocator interaction.

Additional runs (including shorter durations) show similar patterns/trends and are available in this GitHub Gist.

Key Observations

Several patterns stand out from the results:

Stack allocation remains consistently close to the theoretical limit
Across all configurations, stack allocation (fixed-size arrays and VLA) performs very close to the static baseline — effectively the lower bound where allocation cost is eliminated.
Allocators are tuned, not universal
Each allocator performs well under certain allocation patterns, but it is easy to fall outside its “comfort zone”, even in otherwise reasonable implementations. Repeated allocation/free in tight loops can expose slow paths and lead to significant performance penalties.
Simple allocators can struggle under pressure
The musl allocator, while lightweight and predictable, shows the weakest performance in allocation-heavy scenarios. Its simplicity becomes a disadvantage when faced with frequent, repeated allocations. This is particularly relevant in cloud environments (e.g., Alpine-based containers), where lightweight allocators like musl are common.
Optimized allocators still have weak spots
More sophisticated allocators (mimalloc, jemalloc, tcmalloc) generally perform better, but not uniformly. Some allocation patterns are handled almost for free, while others incur noticeable overhead. Performance is highly pattern-dependent.
The compiler matters too
Even with the same allocator (glibc), compiler choice has an impact. Clang shows slightly better and more stable performance compared to GCC, suggesting that code generation and optimization influence how allocation costs are expressed.
Variability increases with allocation size
As allocation sizes grow — especially near thresholds where allocators switch to mmap — runtime becomes more variable. This reflects transitions between fast-path allocation and slower OS-backed paths, introducing both latency and unpredictability.

Overall, allocation cost is not just about the allocator — it is the interaction between allocator, allocation pattern, and compiler.

Caveats

Results are from a single-threaded benchmark; multi-threaded workloads may introduce additional allocator overhead
Stack-based approaches (VLA) are limited by available stack size and may require safeguards for large inputs
The model materializes intermediate arrays for clarity and auditability; other designs may reduce allocation at the cost of complexity

Practical Takeaways

The performance upside is not academic.

In simulation-heavy workloads — financial models, risk scenarios, Monte Carlo, or any system that repeatedly builds temporary state — allocation sits directly in the hot path. Small per-call overheads accumulate quickly, and differences of 2–3× at the allocation level can translate into meaningful end-to-end impact.

Stack-based allocation (fixed-size and VLA) offers a simple way to approach the lower bound: allocation cost that is effectively constant and close to zero. VLA provides a practical mechanism to apply this pattern to size-dependent data.

Used responsibly, VLA does not have to be an all-or-nothing choice. A common pattern is to use a conditional approach:

allocate on the stack for small, bounded sizes
fall back to heap allocation for larger inputs

This provides predictable performance while avoiding stack overflow risks.
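
As a minimal sketch of the conditional pattern above (the names sum_squares and STACK_TMP_MAX, and the 4 KB threshold, are illustrative assumptions and not taken from the benchmark code), a function can decide per call where its temporary array lives:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical threshold: largest temporary buffer (in bytes) kept on the stack. */
#define STACK_TMP_MAX 4096

/* Sum of squares computed through a temporary array.
   Returns -1.0 if the heap fallback cannot allocate. */
double sum_squares(const double *x, size_t n)
{
    size_t need = n * sizeof(double);
    bool use_vla = n > 0 && need <= STACK_TMP_MAX;
    double tmp_vla[use_vla ? n : 1];     /* size 1 when unused: a VLA must not be empty */
    double *tmp = use_vla ? tmp_vla : malloc(need ? need : 1);
    if (tmp == NULL) return -1.0;

    for (size_t i = 0; i < n; i++) tmp[i] = x[i] * x[i];   /* materialize the intermediate */
    double total = 0.0;
    for (size_t i = 0; i < n; i++) total += tmp[i];

    if (!use_vla) free(tmp);             /* only the heap path needs cleanup */
    return total;
}
```

Small inputs never touch the allocator, while large inputs take the heap path and cannot overflow the stack.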

It is certainly possible to tune allocator behavior, adjust parameters, or carefully shape allocation patterns. However, these approaches add complexity and are often allocator-specific. In many cases, a simple conditional VLA/heap strategy achieves comparable or better results with significantly less effort.

To Summarize:

If temporary memory sits in your hot path, how you allocate it matters — and simple strategies can go a long way.

Avoiding allocation is often the best optimization.
When that’s not possible, controlling it explicitly is the next best thing.

A follow-up article will present a small library that encapsulates these patterns, making stack-based and conditional allocation easier to apply in real code.

Temporary Memory Isn’t Free: Allocation Strategies and Their Hidden Costs was originally published in Stackademic on Medium, where people are continuing the conversation by highlighting and responding to this story.

A Missing Metric on Medium: Article-to-Article Readership Overlap

Yair Lenga — Sat, 21 Mar 2026 21:23:50 GMT

Medium gives good per-article stats: views, reads, read ratio. But once you publish more than one article, an important question becomes impossible to answer:

Are the same people reading my articles, or am I reaching entirely new audiences each time?

The gap

Right now, stats are per article only. I’m not aware of a way to understand how two articles relate in terms of readership.

For example, if I publish:

Article A (stack allocation)
Article B (allocation strategies)

I want to know:

Did readers of A also read B?
Did B reach a different audience?
Am I building depth, or just breadth?

Proposal: a 3×3 readership matrix

For any pair of articles, show a matrix like this:

                 Article A:
Article B:       Read   View   None
   Read:         ??     ??     ??
   View:         ??     ??     ??
   None:         ??     ??     ??

Where

Read = member read (as defined by Medium)
View = opened but not read
None = did not see the article

Why this matters

This single matrix answers several high-value questions:

1. Audience overlap

High (Read A, Read B) → strong continuity
Low overlap → different audiences

2. Funnel insight

(View A, Read B) → second article performs better
(Read A, View B) → drop-off or weaker hook

3. Content strategy

Are follow-up articles actually reaching the same readers?
Are readers exploring your profile, or just reading one piece?

4. Series validation

For multi-part topics:

Is the audience progressing through the series?
Or are parts disconnected?

Why this is more useful than claps

Claps and followers are coarse signals.

This matrix shows reader behavior across content, which is far more actionable:

Should I write a continuation?
Should I reframe the topic?
Are my articles connected or isolated?

Implementation notes (high-level)

This can be computed from existing Medium data
Only aggregate counts are needed (no privacy concerns)

Could be limited to:

author’s own articles
last N articles
optional time window
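
To illustrate that only aggregate counts are needed, here is a minimal sketch of how such a matrix could be tallied. The input layout (one record per reader with a state for each article) and all names are hypothetical; I have no knowledge of how Medium stores this data.

```c
#include <stddef.h>

/* Reader state for one article, in the same order as the matrix rows/columns. */
enum reader_state { ST_READ = 0, ST_VIEW = 1, ST_NONE = 2 };

/* Hypothetical input: one record per reader, with the state for article A and B. */
struct reader_pair { enum reader_state a, b; };

/* Fill counts[b][a] with the number of readers in each (B, A) combination. */
void overlap_matrix(const struct reader_pair *r, size_t n, unsigned counts[3][3])
{
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            counts[i][j] = 0;
    for (size_t k = 0; k < n; k++)
        counts[r[k].b][r[k].a]++;
}
```

The result is nine integers per article pair, with no per-reader information exposed to the author.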

Example insight

If I see:

High (Read A, None B)
Low (Read A, Read B)

Then:

My second article is not reaching my existing readers

That’s something I cannot detect today.

Closing

Medium already helps writers understand individual articles. This would help us understand relationships between articles — which is where real content strategy lives.

Curious if others would find this useful.

How Much Stack Space Do You Have? Estimating Remaining Stack in C on Linux

Yair Lenga — Mon, 16 Mar 2026 14:06:00 GMT

Practical techniques for estimating remaining stack space at runtime on Linux systems.

In a previous article (Avoiding malloc for Small Strings in C With Variable Length Arrays (VLAs)) I suggested using stack allocation (VLAs) for small temporary buffers in C as an alternative to malloc().

One of the most common concerns in the comments was:

“Stack allocations are dangerous because you cannot know how much stack space is available.”

This concern is understandable. If a program accidentally exceeds the stack limit, the result is usually a segmentation fault.

While the C language and standard library do not expose stack information, modern operating systems — including Linux — expose enough information to estimate available stack space.

This article explores a few practical techniques to answer the question:

How much stack space does my program have left?

The goal is not perfect precision, but good enough estimates to guide decisions such as whether to allocate memory on the stack or the heap.

The Stack on Modern Linux

On most modern Linux systems, the default stack size for a process is around 8 MB. You can confirm this with ulimit, which reports the limit in 1024-byte units.

$ ulimit -s
8192

On modern Linux platforms (x86-64, ARM, RISC-V, PowerPC) the stack grows downward in memory, meaning that as functions are called and local variables are allocated, the stack pointer moves toward lower addresses.

Higher addresses
┌───────────────────────────┐
│  stack start / top        │
│  main() frame             │
│  caller frames            │
│  local variables          │
│  current stack pointer    │
│                           │
│  unused stack space       │
│                           │
│  stack limit (guard page) │
└───────────────────────────┘
Lower addresses

When the stack grows beyond the guard page, the operating system raises a segmentation fault.

To estimate remaining stack space we need two pieces of information:

Stack boundaries (base and size)
Current stack pointer

Once we know those, the remaining stack can be approximated by measuring the distance between them.

stack_base = lowest stack address
stack_top  = highest stack address
stack_remaining = current_stack_pointer - stack_base
stack_inuse = stack_base + stack_size - current_stack_pointer

Getting the Current Stack Pointer

C does not provide an official API to query the stack pointer.

In practice, the address of a local variable provides a very good approximation of the current stack position, since local variables are typically stored in the current stack frame.

Example:

static StackAddr stack_marker_addr(void)
{
    char marker;
    return (StackAddr) &marker;
}

This address is usually close to the current position of the stack pointer.

When compiling with higher optimization levels, the compiler may rearrange stack layout or inline helper functions in ways that make the measurement less predictable. To reduce this effect, it helps to:

take the address of a volatile local variable
place the logic in a noinline helper function

A small helper like the following works well in practice:

[[gnu::noinline]]
static StackAddr stack_marker_addr(void)
{
    volatile char marker;
    return (StackAddr) &marker ;
}
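
The following sketch uses the helper to observe the stack growing downward across nested calls. It assumes StackAddr is an integer type (uintptr_t); the article does not show the typedef, and an integer type also sidesteps compilers that warn about, or rewrite, a returned pointer to a local. The probe function is a hypothetical addition for illustration.

```c
#include <stdint.h>

typedef uintptr_t StackAddr;   /* assumption: typedef not shown in the article */

__attribute__((noinline))
static StackAddr stack_marker_addr(void)
{
    volatile char marker;
    return (StackAddr) &marker;
}

/* Record the marker address at each nesting level; levels[0] is the outermost. */
__attribute__((noinline))
void probe(int depth, int max_depth, StackAddr *levels)
{
    volatile char pad[64];               /* give every frame some size */
    pad[0] = (char) depth;
    levels[depth] = stack_marker_addr();
    if (depth + 1 < max_depth)
        probe(depth + 1, max_depth, levels);
    pad[1] = pad[0];                     /* work after the call keeps it from becoming a tail call */
}
```

On a platform where the stack grows downward, each deeper level should report a lower address than the level above it.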

The next step is to get the stack boundaries so that we can estimate the remaining (and in-use) stack space. For the remaining stack space we need the lowest address of the stack (stack_base). There are several ways to obtain this address. We will cover:

Method 1: Query the Stack Limit with getrlimit
Method 2: Using pthread_getattr_np
Method 3: Capturing the Stack Position at Program Startup

Method 1: Query the Stack Limit with getrlimit

Linux exposes the maximum stack size through the getrlimit() system call.

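#include <sys/resource.h>

// stack_top must be captured early in the program with stack_marker_addr()
static const char *stack_top ;
static size_t stack_size ;

const char *get_stack_base(void)
{
    struct rlimit stack_limit ;
    if ( getrlimit(RLIMIT_STACK, &stack_limit) != 0 ) return NULL ;
    stack_size = stack_limit.rlim_cur ;
    // the stack grows downward, so the base is stack_size below the top
    return stack_top - stack_size ;
}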

Here getrlimit() reports the maximum stack size configured for the process (rlim_cur, the soft limit).

By capturing the stack pointer early in the program and combining it with maximum stack size, we can estimate the stack base, and the remaining stack space:

Conceptually:

stack_top = stack_marker_addr()      // at program start.
stack_size = ... // from getrlimit
stack_base = stack_top - stack_size

This method is simple and portable across many Linux systems, but it has a few limitations. In particular, it requires capturing the stack position early in the program to establish a reference point.

If the first opportunity to capture the stack address comes after significant stack allocations have already occurred, we might over-estimate the remaining stack space, because there is no easy way to measure the space already used. In those cases, an alternative method exists.

Complete implementation (build instructions in comments) as a GitHub Gist

Note that RLIMIT_STACK gives the maximum allowed stack, not necessarily the mapped stack. The actual stack memory is usually grown lazily by the kernel as needed.

See working example (gist-2603-stack-getrlimit.c)
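
Independently of the gist, the pieces of Method 1 can be combined as in the sketch below. The function names (stack_init, stack_remaining, stack_inuse, stack_limit_size), the uintptr_t typedef, and the 8 MB fallback for an unlimited stack are my assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/resource.h>

typedef uintptr_t StackAddr;     /* assumption: typedef not shown in the article */

static StackAddr stack_top;      /* address captured at startup */
static StackAddr stack_base;     /* estimated lowest usable address */
static size_t    stack_size;

__attribute__((noinline))
static StackAddr stack_marker_addr(void)
{
    volatile char marker;
    return (StackAddr) &marker;
}

/* Call once, as early as possible in main(). Returns 0 on success. */
int stack_init(void)
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_STACK, &lim) != 0) return -1;
    /* Assumption: treat an unlimited stack as 8 MB, the common Linux default. */
    stack_size = (lim.rlim_cur == RLIM_INFINITY) ? ((size_t) 8 << 20)
                                                 : (size_t) lim.rlim_cur;
    stack_top  = stack_marker_addr();
    stack_base = stack_top - stack_size;
    return 0;
}

/* Estimated bytes between the current position and the stack base. */
size_t stack_remaining(void)
{
    StackAddr sp = stack_marker_addr();
    return sp > stack_base ? (size_t) (sp - stack_base) : 0;
}

/* Estimated bytes used since stack_init() was called. */
size_t stack_inuse(void)
{
    StackAddr sp = stack_marker_addr();
    return sp < stack_top ? (size_t) (stack_top - sp) : 0;
}

size_t stack_limit_size(void) { return stack_size; }
```

Because the reference point is taken inside stack_init(), anything already on the stack above that point is invisible to the estimate, which is the limitation described above.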

Method 2: Using pthread_getattr_np

Linux systems using glibc provide a convenient non-standard extension, pthread_getattr_np(), which allows a thread to query its own stack attributes, including the stack base address and stack size.

Example Usage:

pthread_attr_t attr ;
void *stack_base ;
size_t stack_size ;
pthread_getattr_np(pthread_self(), &attr) ;
pthread_attr_getstack(&attr, &stack_base, &stack_size) ;
pthread_attr_destroy(&attr) ;    // release the attribute object when done

From this we obtain stack_base, which we can now use to estimate the remaining and in-use stack, as discussed above.

This method has several advantages:

Works in multi-threaded programs (different threads may have different stack sizes!)
Does not require changes to program startup
Provides direct access to stack boundaries

For Linux programs that already use pthread, this is often the cleanest approach. Using this technique in a single-threaded program requires linking with the pthread library, but it does not launch extra threads or introduce thread-safety issues into code that does not otherwise launch additional threads.

See working example (gist-2603-stack-pthread.c)
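
Separately from the gist, a minimal sketch of Method 2 with error checks might look as follows. The wrapper names thread_stack_bounds and thread_stack_remaining are hypothetical. The explicit prototype is a defensive assumption on my part: glibc declares pthread_getattr_np only when _GNU_SOURCE is defined before the first system header.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* enables pthread_getattr_np in <pthread.h> */
#endif
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Repeated here so the sketch still compiles if _GNU_SOURCE took effect too late. */
extern int pthread_getattr_np(pthread_t thread, pthread_attr_t *attr);

/* Query the calling thread's stack region. Returns 0 on success. */
int thread_stack_bounds(uintptr_t *base, size_t *size)
{
    pthread_attr_t attr;
    void *addr = NULL;
    if (pthread_getattr_np(pthread_self(), &attr) != 0) return -1;
    int rc = pthread_attr_getstack(&attr, &addr, size);
    pthread_attr_destroy(&attr);         /* release resources held by attr */
    if (rc != 0) return -1;
    *base = (uintptr_t) addr;            /* lowest address of the stack */
    return 0;
}

/* Estimated bytes between the current stack position and the stack base. */
size_t thread_stack_remaining(void)
{
    uintptr_t base;
    size_t size;
    volatile char marker;
    uintptr_t sp = (uintptr_t) &marker;
    if (thread_stack_bounds(&base, &size) != 0) return 0;
    return sp > base ? (size_t) (sp - base) : 0;
}
```

Depending on the glibc version, the program may need to be built with -pthread.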

Method 3: Capturing the Stack Position at Program Startup

The previous method uses pthread_getattr_np() to query stack boundaries directly. While convenient, it requires linking with the pthread library and relies on a non-standard GNU extension.

In many programs/libraries, especially single-threaded utilities, it may be desirable to estimate stack usage without introducing a dependency on pthread.

One simple technique is to capture the stack position very early in the program’s lifetime, before additional call frames are created. On systems using GCC or Clang this can be done using a constructor function.

Functions marked with the constructor attribute are executed automatically before main(). The attribute is commonly used by runtime libraries (including C++ runtimes) to perform initialization before main. It can also be used in C programs/functions.

Example:

static StackAddr stack_base;
static size_t stack_size ;

__attribute__((constructor))
static void capture_stack_region(void)
{
    StackAddr stack_top = stack_marker_addr() ;
    struct rlimit stack_limit ;
    getrlimit(RLIMIT_STACK, &stack_limit) ;
    stack_size = stack_limit.rlim_cur ;
    stack_base = (StackAddr) stack_top - stack_limit.rlim_cur ;
}

Because this function runs during program startup, the recorded address is typically very close to the top of the initial stack. Combining this address with the configured stack size provides a good approximation of the stack base.

Later in the program we can compare this value with the current stack position to estimate stack usage and remaining stack:

size_t stack_space = stack_marker_addr() - stack_base ;

This approach avoids the need for pthread, and the measurement can be implemented entirely inside a helper module without requiring any changes to main().

See working example (gist-2603-stack-constructor.c)
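
Put together, and independent of the gist, the constructor approach could be sketched like this. The uintptr_t typedef and the 8 MB fallback for an unlimited stack are assumptions of mine, as is returning 0 when no estimate is available.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/resource.h>

typedef uintptr_t StackAddr;     /* assumption: typedef not shown in the article */

static StackAddr stack_base;
static size_t    stack_size;     /* stays 0 if the limit could not be read */

__attribute__((noinline))
static StackAddr stack_marker_addr(void)
{
    volatile char marker;
    return (StackAddr) &marker;
}

/* Runs automatically before main() on GCC and Clang. */
__attribute__((constructor))
static void capture_stack_region(void)
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_STACK, &lim) != 0) return;
    /* Assumption: treat an unlimited stack as 8 MB. */
    stack_size = (lim.rlim_cur == RLIM_INFINITY) ? ((size_t) 8 << 20)
                                                 : (size_t) lim.rlim_cur;
    stack_base = stack_marker_addr() - stack_size;
}

/* Estimated bytes left below the current position; 0 if no estimate exists. */
size_t stack_remaining(void)
{
    StackAddr sp = stack_marker_addr();
    if (stack_size == 0) return 0;
    return sp > stack_base ? (size_t) (sp - stack_base) : 0;
}

/* Estimated bytes used relative to the address captured at startup. */
size_t stack_inuse(void)
{
    StackAddr sp  = stack_marker_addr();
    StackAddr top = stack_base + stack_size;
    return sp < top ? (size_t) (top - sp) : 0;
}
```

Callers only see stack_remaining() and stack_inuse(); the capture happens without any change to main().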

Like the other techniques presented here, this method provides an estimate rather than an exact measurement, but it is often sufficient to guide decisions such as whether a temporary buffer should be placed on the stack or the heap.

Turning This into a Small Utility

Once the stack boundaries are known, it is easy to wrap the calculation into small helper functions. The stack_remaining helper also tracks the lowest observed stack address to estimate the maximum stack usage.

Conceptually:

static size_t stack_remaining(void)
{
    StackAddr sp = stack_marker_addr() ;
    if ( sp < stack_low_mark ) stack_low_mark = sp ;
    size_t avail = sp - stack_base ;
    // clamp at zero so the unsigned result cannot wrap around
    return avail > safety_margin ? avail - safety_margin : 0 ;
}

static size_t stack_inuse(void)
{
    return stack_base + stack_size - stack_marker_addr() ;
}

A Small Stack Inspection Utility

Note that stack_remaining also tracks the lowest stack marker. This allows us to expose a "stack_info" structure, similar in spirit to "mallinfo", with the following attributes:

struct stack_info {
    StackAddr base ;
    size_t size ;
    size_t max_inuse ;
    size_t margin ;
    StackAddr low_mark ;
    ...
} ;
struct stack_info get_stack_info(void) ;

Using Stack Estimates to Guide Allocation Decisions

The practical motivation for estimating stack space is simple:

Some allocations are small enough that placing them on the stack is faster and simpler than using the heap.

However, we want to avoid risking stack overflow.

A simple strategy is to allocate on the stack only when sufficient space remains.

Example logic for a function that needs double[n] temporary storage.

void foo(int n, double x)
{
    // assumes n > 0
    size_t need_mem = (size_t) n * sizeof(double) ;
    bool use_vla = need_mem < stack_remaining() ;
    double y_vla[use_vla ? n : 1] ;
    double *y = use_vla ? y_vla : malloc(need_mem) ;
    if ( y == NULL ) return ;    // heap allocation failed
    // Use y as needed
    // Cleanup
    if ( !use_vla ) free(y) ;
}

This allows the program to use the stack when it is safe, avoiding malloc calls, and to fall back to the heap otherwise.
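
A self-contained variant of this decision is sketched below. To stay independent of how the estimate is obtained, it takes the remaining-stack estimate as a parameter; the function name scaled_sum, the used_stack flag, and the 16 KB STACK_MARGIN are hypothetical choices for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed safety margin, in bytes, kept free beyond the VLA. */
#define STACK_MARGIN ((size_t) 16 * 1024)

/* Fills a temporary double[n] with x*i and returns the sum (-1.0 on allocation failure).
   stack_avail is the caller's estimate of remaining stack, for example stack_remaining().
   If used_stack is non-NULL, it reports which path was taken. */
double scaled_sum(int n, double x, size_t stack_avail, bool *used_stack)
{
    if (n <= 0) {
        if (used_stack) *used_stack = false;
        return 0.0;
    }
    size_t need_mem = (size_t) n * sizeof(double);
    bool use_vla = stack_avail > STACK_MARGIN
                && need_mem < stack_avail - STACK_MARGIN;
    double y_vla[use_vla ? n : 1];       /* size 1 when the heap is used */
    double *y = use_vla ? y_vla : malloc(need_mem);
    if (used_stack) *used_stack = use_vla;
    if (y == NULL) return -1.0;

    double total = 0.0;
    for (int i = 0; i < n; i++) {
        y[i] = x * i;
        total += y[i];
    }
    if (!use_vla) free(y);
    return total;
}
```

Passing the estimate in makes the function easy to test with artificially small values, without actually exhausting the stack.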

How Accurate Are These Estimates?

These methods provide estimates, not guarantees.

A few factors can influence stack usage:

deep call stacks
recursion
large local variables
compiler optimizations
thread stack sizes

Because of this, it is wise to leave a safety margin when making decisions based on remaining stack space.

In practice, leaving a buffer of a few kilobytes (8-32 KB) is usually sufficient. The sample code reserves 2 pages.

Decision Chart

Conclusion

Although the C language itself does not expose stack information, modern Linux systems provide enough primitives to estimate stack usage.

Using APIs and features such as:

getrlimit()
pthread_getattr_np()
GCC/Clang constructor attribute

a program can determine stack limits and approximate the remaining stack space at runtime.

This does not eliminate the need for careful programming, but it does make stack allocation decisions far more informed than commonly assumed.

In a follow-up article we will explore a more experimental approach: actively probing the stack itself to discover its limits.

Disclaimer

The views expressed in this article are my own and do not necessarily reflect those of my employer.

Some of the code examples in this article were generated with the assistance of AI tools and have not been tested in production environments. They are provided for illustration and experimentation only.

This article focuses on practical techniques for Linux systems and does not attempt to provide a portable or fully general solution.

The code and techniques described in this article are provided for educational purposes only and are not guaranteed to be correct or suitable for production use. The author makes no warranties regarding accuracy or fitness for any particular purpose.

If this article was useful, please clap so other C developers can find it.

Avoiding malloc for Small Strings in C With Variable Length Arrays (VLAs)

Yair Lenga — Tue, 10 Mar 2026 13:01:02 GMT

A simple stack-first buffer technique that reduces heap allocations

Temporary string buffers are everywhere in C code.

We allocate them to build log messages, join paths, format JSON, construct SQL fragments, or prepare protocol messages. Most of the time, the code looks harmless:

char *buf = malloc(strlen(s1) + strlen(s2) + 1);
strcpy(buf, s1);
strcat(buf, s2);
/* use buf */
free(buf);

It is simple, but not free.

Even when the strings are small, this pattern still pays for:

heap allocation
heap free
allocator bookkeeping
potential fragmentation over time

In hot code paths, these costs add up.

C has an underutilized feature that can be very useful: Variable Length Arrays (VLAs). It is possible to use a VLA for small temporary strings and fall back to malloc only when the buffer becomes larger than some threshold.

VLAs were introduced in C99, and are supported by all major Linux compilers (GCC, Clang, …). They are often overlooked today, but in carefully bounded situations they can remove unnecessary heap allocations and improve performance.

A complete runnable example is available on GitHub Gist:

The idea

The strategy is simple:

small temporary string → allocate on the stack
large temporary string → allocate on the heap

This gives a useful hybrid behavior:

Size    Allocation   Mode   VLA
small   stack        fast   YES
large   heap         safe   NO

Stack allocation is extremely cheap — usually just adjusting the stack pointer — while malloc requires allocator bookkeeping.

A few helper macros

I wrapped the pattern in a small set of macros.

  // Set default threshold to choose between stack and heap
  // Override with -D, or by doing #define before using

#ifndef FLEX_STR_MAX
#define FLEX_STR_MAX 64
#endif

  // Create string variable named 'var_' pointing to VLA OR malloced
  // memory based on the size. If using malloced, VLA size is 0.
  // See below for explanation of the hidden variables (var_##_sz
  // and var_##_vla).

#define FLEX_STR_INIT(var_, sz_) \
  int var_##_sz = (sz_) ; \
  char var_##_vla[var_##_sz > FLEX_STR_MAX ? 0 : var_##_sz] ; \
  char *var_ = sizeof(var_##_vla) ? var_##_vla : malloc(var_##_sz)

  // Free (if needed) the memory associated with var_. NO-OP if the VLA
  // was used. Wrapped in do/while so it behaves as a single statement.

#define FLEX_STR_FREE(var_) \
  do { if ( !sizeof(var_##_vla) ) { free(var_) ; var_ = NULL ; } } while (0)

  // Access the size of the allocated buffer
#define FLEX_STR_SIZE(var_) ((int) (var_##_sz))

  // Construct two parameters buf, sizeof(buf) for function calls
  // e.g., snprintf, etc.
#define FLEX_STR_BUF(var_) (var_),FLEX_STR_SIZE(var_)

The macros create two helper variables for each temporary storage variable var_. The first, var_##_sz, stores the actual size of the buffer so that other functions (malloc, snprintf) can retrieve the available capacity. The second, var_##_vla, is the VLA buffer (size 0 if not needed).

Usage is straightforward:

// Create/Allocate
// result will point to stack or heap buffer, based on size_limit
FLEX_STR_INIT(result, size_limit);

...
// use result, guaranteed to have enough space as per INIT
// use FLEX_STR_SIZE(result) to retrieve actual size from init.
printf("Result(sz=%d)=%s\n", FLEX_STR_SIZE(result), result) ;

// When done ...
FLEX_STR_FREE(result);

If the required size is small enough, the buffer lives on the stack. Otherwise it comes from malloc.

The calling code does not need to care which one was used.
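
As one more illustration, the sketch below passes FLEX_STR_BUF straight to snprintf. The macros are repeated so the block stands alone; format_pair and its on_heap flag are hypothetical, and the code relies on the same zero-length array extension of GCC and Clang as the macros themselves.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#ifndef FLEX_STR_MAX
#define FLEX_STR_MAX 64
#endif

#define FLEX_STR_INIT(var_, sz_) \
  int var_##_sz = (sz_) ; \
  char var_##_vla[var_##_sz > FLEX_STR_MAX ? 0 : var_##_sz] ; \
  char *var_ = sizeof(var_##_vla) ? var_##_vla : malloc(var_##_sz)

#define FLEX_STR_FREE(var_) \
  do { if ( !sizeof(var_##_vla) ) { free(var_) ; var_ = NULL ; } } while (0)

#define FLEX_STR_SIZE(var_) ((int) (var_##_sz))
#define FLEX_STR_BUF(var_) (var_),FLEX_STR_SIZE(var_)

/* Hypothetical helper: formats "key=value" into a temporary buffer and returns
   its length, or -1 if the heap fallback fails. on_heap reports the path taken. */
int format_pair(const char *key, const char *value, bool *on_heap)
{
    FLEX_STR_INIT(buf, strlen(key) + strlen(value) + 2);   /* room for '=' and '\0' */
    if (on_heap) *on_heap = (sizeof(buf_vla) == 0);
    if (buf == NULL) return -1;

    /* FLEX_STR_BUF expands to the two arguments: buffer, size */
    int len = snprintf(FLEX_STR_BUF(buf), "%s=%s", key, value);

    FLEX_STR_FREE(buf);
    return len;
}
```

With the default 64-byte threshold, a short pair such as id=42 stays on the stack, while a value of a few hundred bytes is routed to malloc.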

Safe and efficient concatenation

For this kind of operation I avoid strcat or strlcat, because those functions scan the destination buffer to find its end. Since the result buffer is fresh, we already know exactly what we want to copy.

A simple helper like this works well:

static inline void concat(char *result, int sz, const char *s1, const char *s2)
{
  int l1 = strlen(s1) ; if ( l1 >= sz ) l1=sz-1 ;
  memcpy(result, s1, l1) ; result += l1 ; sz -= l1 ;
  int l2 = strlen(s2) ; if ( l2 >= sz ) l2=sz-1 ;
  memcpy(result, s2, l2) ;
  result[l2] = 0 ;
}

Example usage:

static void test1(bool show, const char *s1, const char *s2)
{
  FLEX_STR_INIT(result, strlen(s1) + strlen(s2) + 1);
  concat(FLEX_STR_BUF(result), s1, s2) ;
  if (show) printf("result(%d)=%s\n", FLEX_STR_SIZE(result), result);
  FLEX_STR_FREE(result);
}

A quick benchmark

To get a rough idea of the impact, I ran a simple microbenchmark. The test concatenates two strings, with stack/heap threshold set at 64 bytes:

The first string is about 60 bytes
The second string is either 3 or 5 bytes (so the total lands just under or just over the threshold)
Each case is repeated 1 million times

Results (CPU time) on my system (AMD Ryzen 5 7640H), Ubuntu/WSL.

Method          -O (optimized)   -g (debug)   -Ofast
VLA/Stack       0.013 sec        0.015 sec    0.009 sec
malloc/free     0.021 sec        0.023 sec    0.014 sec
This represents roughly a 35% reduction in total runtime (about a 1.5× speedup). Put another way, every 1 million malloc/free pairs add roughly 0.005 to 0.008 seconds to the execution time.

This is only a microbenchmark — real workloads will vary — but it shows that allocator overhead is not negligible even for fairly small operations.
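
For readers who want to reproduce a comparison of this kind, here is a minimal timing sketch. It is not the benchmark code behind the numbers above; the function names and the stack_max parameter are assumptions for illustration, and an optimizing compiler may simplify parts of the loop.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Concatenate s1 and s2 into a temporary buffer: a VLA when the result fits in
   stack_max bytes, malloc otherwise. Returns the length of the result. */
static size_t concat_len(const char *s1, const char *s2, size_t stack_max)
{
    size_t l1 = strlen(s1), l2 = strlen(s2), need = l1 + l2 + 1;
    int use_vla = need <= stack_max;
    char vla[use_vla ? need : 1];
    char *buf = use_vla ? vla : malloc(need);
    if (buf == NULL) return 0;
    memcpy(buf, s1, l1);
    memcpy(buf + l1, s2, l2 + 1);        /* copies the terminating '\0' too */
    size_t len = strlen(buf);
    if (!use_vla) free(buf);
    return len;
}

/* CPU seconds spent on `iters` concatenations. *sink receives the summed
   lengths so the work has an observable result. */
double time_concat(const char *s1, const char *s2, size_t stack_max,
                   long iters, size_t *sink)
{
    size_t total = 0;
    clock_t start = clock();
    for (long i = 0; i < iters; i++)
        total += concat_len(s1, s2, stack_max);
    clock_t end = clock();
    *sink = total;
    return (double) (end - start) / CLOCKS_PER_SEC;
}
```

Calling time_concat with stack_max set to 64 exercises the stack path for short inputs, and calling it with stack_max set to 0 forces every iteration through malloc/free.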

Why this helps

Stack allocation is extremely cheap. In many cases it compiles down to something like:

sub rsp, N

while malloc/free involves:

function calls — malloc and free
metadata updates
possible locking
heap bookkeeping

Avoiding heap traffic for small temporary buffers reduces both CPU overhead and heap churn.

When this pattern is useful

This technique is most helpful in code that creates many short-lived strings:

logging
JSON generation
file path and URL manipulation
command construction
protocol formatting

In these situations, most buffers are small and temporary, so the stack path becomes the common case.

When VLAs are NOT appropriate

VLAs are useful for small temporary buffers, but they should not be used in every situation.

Avoid VLAs when:

The size may exceed the available stack space.
The function is (deeply) recursive.
The buffer must outlive the function.

For these cases, heap allocation remains the safer choice.

Caveats

A few caveats are worth mentioning.

Stack space is limited

In a single-threaded Linux program, a VLA of a few hundred kilobytes may be acceptable. The default stack size on many modern Linux systems (for example Ubuntu) is typically around 8 MB, and it can be configured based on the available memory. A VLA of 100 KB, or even 512 KB, can therefore be acceptable.

In multi-threaded programs, however, each thread has its own stack, and the default stack size may be smaller. It can be configured per process, or even per thread. On x86-64, 8 MB is the typical default, but practical upper limits are usually lower, depending on the number of threads.

On embedded systems, stack sizes can be dramatically smaller, sometimes only a few kilobytes.

Because of this, VLAs should generally be limited to reasonable, bounded buffers, which is why the helper macro in this article falls back to malloc once the requested size exceeds a configurable threshold.

Not all compilers/environments support VLAs

They were introduced in C99 and are supported by GCC and Clang, but Microsoft Visual C does not implement them (to the best of my knowledge).

Note also that the examples above use GCC 13/Clang extensions. The code is not meant to be ISO compliant.

Stack buffers cannot escape their scope

If the buffer came from the stack, it becomes invalid when the function returns. If some strings need a lifetime beyond that of the function, use (conditional) strdup to move them to heap, static, or global storage.

Final thoughts

This is not a framework, a library, or a new language feature. It is simply a small pattern:

stack for small buffers
heap for large buffers
minimal code overhead

In the right places, that can reduce allocator traffic, avoid fragmentation pressure, and make temporary string handling more efficient.

Sometimes the fastest memory allocator is simply the stack.

Discussion

Do you use VLAs in production code, or are they avoided in your codebase?

I’m curious how other C developers handle temporary string buffers in performance-sensitive code.

Complete Example

A complete runnable example is available on GitHub Gist, with build/run instructions in the comments at the top of the file.

Follow up article

Medium (no paywall): How Much Stack Space Do You Have? Estimating Remaining Stack in C on Linux

Disclaimer

The views expressed in this article are my own and do not necessarily reflect those of my employer.

Unless otherwise noted, the code snippets may be used freely for any purpose without warranty of any kind.

If this article was useful, please clap so other C developers can find it.

Avoiding malloc for Small Strings in C With Variable Length Arrays (VLAs) was originally published in Stackademic on Medium, where people are continuing the conversation by highlighting and responding to this story.