Unions in C

Paul J. Lucas
7 min read · Jan 14, 2024

Introduction

A union is syntactically just like a struct and is used to store data for any one of its members at any one time. For example:

union value {
  long    i;
  double  f;
  char    c;
  char   *s;
};

union value v;
v.i = 42; // value is now 42
v.c = 'a'; // value is now 'a' (no more 42)

union value *pv = &v;
pv->s = malloc(6); // -> works too
strcpy( pv->s, "hello" );

The size of a union is at least the size of its largest member; the compiler may add trailing padding to satisfy alignment requirements.
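As a minimal sketch of the layout just described, the following checks at compile-time that every member starts at offset 0 and that the union is big enough for each member. The helper value_members_overlap() is a hypothetical name introduced here only for illustration:

```c
#include <assert.h>   // static_assert (a keyword as of C23)
#include <stddef.h>   // offsetof

union value {
  long    i;
  double  f;
  char    c;
  char   *s;
};

// Every member of a union starts at offset 0.
static_assert( offsetof( union value, i ) == 0, "i must be at offset 0" );
static_assert( offsetof( union value, f ) == 0, "f must be at offset 0" );
static_assert( offsetof( union value, c ) == 0, "c must be at offset 0" );
static_assert( offsetof( union value, s ) == 0, "s must be at offset 0" );

// The union is at least as big as each of its members.
static_assert( sizeof(union value) >= sizeof(long),   "too small for long" );
static_assert( sizeof(union value) >= sizeof(double), "too small for double" );
static_assert( sizeof(union value) >= sizeof(char*),  "too small for char*" );

// Hypothetical helper (not from the article): returns 1 only if every
// member has the same address as the union itself.
int value_members_overlap( union value const *v ) {
  void const *const base = v;
  return (void const*)&v->i == base
      && (void const*)&v->f == base
      && (void const*)&v->c == base
      && (void const*)&v->s == base;
}
```

Because the checks are static_assert declarations, a layout that violated them would fail to compile rather than fail at run-time.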

A common use-case for a union would be in a compiler or interpreter where a token is any one of a character literal, integer literal, floating-point literal, string literal, identifier, operator, etc. It would be wasteful to use a struct since only one member would ever have a value.

Initialization

Since all members have the same offset, their order mostly doesn’t matter. The exception is that a plain initializer list initializes the first member, so the value given must be convertible to that member’s type:

union value v = { 42 };    // as if: v.i = 42

That said, { 0 } works no matter which built-in type the first member has, since 0 converts to any of them.

Alternatively, you can use a designated initializer to specify a member:

union value v = { .c = 'a' };

Which Member?

One obvious problem with a union is, after you store a value in a particular member, how do you later remember which member that was? With a union by itself, you generally can’t. You need some other variable to “remember” the member you last stored a value in. Often, this is done using an enumeration and a struct:

enum token_kind {
  TOKEN_NONE,
  TOKEN_INT,
  TOKEN_FLOAT,
  TOKEN_CHAR,
  TOKEN_STR
};

struct token {
  enum token_kind kind;
  union {               // "anonymous" union
    long    i;
    double  f;
    char    c;
    char   *s;
  };
};

struct token t = { .kind = TOKEN_CHAR, .c = 'a' };

When a union is used inside a struct, it’s often made an anonymous union, that is, a union without a name. In this case, the union members behave as if they’re direct members of their enclosing struct except they all have the same offset.

Anonymous unions (and structs) are only supported starting in C11.
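To show how the kind member is then used, here’s a minimal sketch that switches on it to decide which union member is safe to read. The function token_format() is a hypothetical helper, not something from a real tokenizer:

```c
#include <assert.h>
#include <stdio.h>    // snprintf, size_t, NULL

enum token_kind {
  TOKEN_NONE,
  TOKEN_INT,
  TOKEN_FLOAT,
  TOKEN_CHAR,
  TOKEN_STR
};

struct token {
  enum token_kind kind;
  union {               // anonymous union (C11)
    long    i;
    double  f;
    char    c;
    char   *s;
  };
};

// Hypothetical helper: writes a printable form of the token into buf and
// returns what snprintf() returns, or -1 for an unknown kind.  Only the
// member that matches kind is ever read.
int token_format( struct token const *t, char *buf, size_t buf_size ) {
  assert( t != NULL );
  switch ( t->kind ) {
    case TOKEN_NONE : return snprintf( buf, buf_size, "(none)" );
    case TOKEN_INT  : return snprintf( buf, buf_size, "%ld", t->i );
    case TOKEN_FLOAT: return snprintf( buf, buf_size, "%g", t->f );
    case TOKEN_CHAR : return snprintf( buf, buf_size, "'%c'", t->c );
    case TOKEN_STR  : return snprintf( buf, buf_size, "\"%s\"", t->s );
  }
  return -1;
}
```

The discipline is entirely manual: nothing stops code from setting kind to TOKEN_INT and then reading s, so keeping kind and the stored member in sync is the programmer’s job.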

Type Punning

Type punning is a technique to read or write an object as if it were of a type other than what it was declared as. Since this circumvents the type system, you really have to know what you’re doing. In C (but not C++), a union can be used for type punning. For example, here’s a way to get the value of a 32-bit integer with the high and low order 16-bit halves swapped:

uint32_t swap16of32( uint32_t n ) {
  union {
    uint32_t u32;
    uint16_t u16[2];
  } u = { n };
  uint16_t const t16 = u.u16[0];
  u.u16[0] = u.u16[1];
  u.u16[1] = t16;
  return u.u32;
}

The union members u32 and u16[2] “overlay” each other allowing you to read and write a uint32_t as if it were a 2-element array of uint16_t. (You could alternatively write a version that used uint8_t[4] and reversed the entire byte order depending on your particular need.)
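The uint8_t[4] variant mentioned above might look something like the following sketch. The name swap_bytes32() is made up for this example:

```c
#include <assert.h>   // static_assert
#include <stdint.h>

// The overlay only makes sense if the array covers the integer exactly.
static_assert( sizeof(uint32_t) == 4 * sizeof(uint8_t),
               "uint32_t must be exactly four bytes" );

// Hypothetical variant of swap16of32(): reverses all four bytes.
uint32_t swap_bytes32( uint32_t n ) {
  union {
    uint32_t u32;
    uint8_t  u8[4];
  } u = { n };

  uint8_t t8 = u.u8[0];         // swap the outer pair of bytes
  u.u8[0] = u.u8[3];
  u.u8[3] = t8;

  t8 = u.u8[1];                 // swap the inner pair of bytes
  u.u8[1] = u.u8[2];
  u.u8[2] = t8;

  return u.u32;
}
```

Since the entire byte sequence is reversed, the result is the same on both little-endian and big-endian CPUs.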

You can also use unions to do type punning of unrelated types, for example int32_t and float allowing you to access the sign, exponent, and mantissa individually. (However, this is CPU-dependent.)
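As a sketch of that idea, the following splits a float into its three fields. It assumes that float is an IEEE 754 32-bit value and that float and uint32_t have the same byte order, which is true on most CPUs but not guaranteed by C. The names float_parts and float_split() are made up for this example:

```c
#include <assert.h>   // static_assert
#include <stdint.h>

// ASSUMPTION: float is IEEE 754 binary32; this checks only the size.
static_assert( sizeof(float) == sizeof(uint32_t), "float must be 32 bits" );

struct float_parts {
  unsigned sign;        // 1 bit: 1 means negative
  unsigned exponent;    // 8 bits, biased by 127
  uint32_t mantissa;    // 23 bits (the fraction)
};

// Hypothetical helper: puns a float as a uint32_t to extract its fields.
struct float_parts float_split( float f ) {
  union {
    float    f;
    uint32_t u;
  } pun = { .f = f };

  return (struct float_parts){
    .sign     = pun.u >> 31,
    .exponent = (pun.u >> 23) & 0xFFu,
    .mantissa = pun.u & 0x7FFFFFu
  };
}
```

For example, 1.0f has a sign of 0, a biased exponent of 127 (that is, 2 to the power 0), and a mantissa of 0.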

Restricted Class Hierarchies in C

Another use for unions is to implement class hierarchies in C, but only “restricted” class hierarchies. A “restricted” class hierarchy is one used only to implement a solution to a problem where all the classes are known. Users are not permitted to extend the hierarchy via derivation.

This can be partially achieved via final in C++ or fully achieved via sealed in Java or Kotlin.

Of course C doesn’t have either classes or inheritance, but restricted class hierarchies can be implemented via structs and a union.

The token example shown previously is a simple example of this: all the kinds of tokens are known and there’s one member in the union to hold the data for each kind. But what if there’s more than one member per kind?

For a larger example, consider cdecl, a program that parses a C or C++ declaration (aka “gibberish”) and explains it in English:

cdecl> explain int *const (*p)[4]
declare p as pointer to array 4 of constant pointer to integer

During parsing, cdecl creates an abstract syntax tree (AST) of nodes where each node contains information for a particular kind of declaration. For example, the previous declaration could be represented as an AST like (expressed in JSON):

{
  "name": "p",
  "kind": "pointer",
  "pointer": {
    "to": {
      "kind": "array",
      "array": {
        "size": 4,
        "of": {
          "kind": "pointer",
          "type": "const",
          "pointer": {
            "to": {
              "kind": "built-in type",
              "type": "int"
            }
          }
        }
      }
    }
  }
}

For this example, let’s consider a subset of the kinds of nodes in a C++ declaration (to keep the example shorter):

enum c_ast_kind {
  K_BUILTIN,            // e.g., int
  K_CLASS_STRUCT_UNION,
  K_TYPEDEF,
  K_ARRAY,
  K_ENUM,
  K_POINTER,
  K_REFERENCE,          // C++ reference
  K_CONSTRUCTOR,        // C++ constructor
  K_DESTRUCTOR,         // C++ destructor
  K_FUNCTION,
  K_OPERATOR,           // C++ overloaded operator
  // ...
};
typedef enum c_ast_kind c_ast_kind_t;

And declare some structs to contain the information needed for each kind:

struct c_array_ast {
  c_ast_t            *of_ast;          // array of ...
  unsigned            size;
};

struct c_enum_ast {
  c_ast_t            *of_ast;          // fixed type, if any
  unsigned            bit_width;       // width when > 0
  char const         *enum_name;       // enumeration name
};

struct c_function_ast {
  c_ast_t            *ret_ast;         // return type
  c_ast_list_t        param_ast_list;  // parameters
};

struct c_operator_ast {
  c_ast_t            *ret_ast;         // return type
  c_ast_list_t        param_ast_list;  // parameters
  c_operator_t const *operator;        // operator info
};

struct c_ptr_ref_ast {
  c_ast_t            *to_ast;          // pointer/ref to ...
};

struct c_typedef_ast {
  c_ast_t const      *for_ast;         // typedef for ...
  unsigned            bit_width;       // width when > 0
};

Notice that, of the AST information declared thus far, there are similarities, specifically:

  1. The nodes point to one other node and the pointer is declared first.
  2. Functions and operators both have return types and parameter lists and the parameter lists are declared second.
  3. For nodes that have bit-field widths, the width is declared second instead (in the slot where functions and operators have their parameter lists).

The fact that the same members in different structs are at the same offset is convenient because it means that code that, say, iterates over the parameters of a function will also work for the parameters of an operator. Having noticed this, we can make an effort to keep the same members in any remaining structs at the same offsets. For example, the information for K_BUILTIN could be declared as:

struct c_builtin_ast {
  unsigned  bit_width;  // width when > 0
};

because that’s all the information that’s needed for a built-in type. However, the bit_width member wouldn’t be at the same offset as the same member in either c_enum_ast or c_typedef_ast. To fix that so code that accesses bit_width can do so for any type that has it, we need to insert an unused pointer (a void pointer will do):

struct c_builtin_ast {
  void     *reserved;   // instead of for/to
  unsigned  bit_width;  // width when > 0
};

If you think inserting unused members might waste space, remember that, once all these structs are put into the same union, the union will be the size of the largest member anyway; hence inserting unused members doesn’t waste space.

While using a named member like reserved is fine, if you want to help guarantee that the member can never be accessed directly, you can employ a macro:

#define DECL_UNUSED(T) \
  _Alignas(T) char UNIQUE_NAME(unused)[ sizeof(T) ]

struct c_builtin_ast {
  DECL_UNUSED(c_ast_t*);  // instead of for/to
  unsigned  bit_width;    // width when > 0
};

See here for details on UNIQUE_NAME.
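To make DECL_UNUSED compilable on its own, the sketch below supplies a stand-in for UNIQUE_NAME that pastes __LINE__ onto a prefix. That stand-in is purely an assumption for illustration and may well differ from the real definition. The structs with_members and with_unused are likewise hypothetical:

```c
#include <assert.h>   // static_assert
#include <stddef.h>   // offsetof

// ASSUMPTION: a stand-in for UNIQUE_NAME, not necessarily the real one.
// It yields names like unused_42 where 42 is the current line number, so
// each use must be on its own line.
#define CONCAT_HELPER(A,B)    A##B
#define CONCAT(A,B)           CONCAT_HELPER(A,B)
#define UNIQUE_NAME(PREFIX)   CONCAT(PREFIX##_, __LINE__)

#define DECL_UNUSED(T) \
  _Alignas(T) char UNIQUE_NAME(unused)[ sizeof(T) ]

// Hypothetical struct with real members ...
struct with_members {
  void       *ptr;
  unsigned    bit_width;
  char const *name;
};

// ... and one where the first two are replaced by unused placeholders.
struct with_unused {
  DECL_UNUSED(void*);         // instead of ptr
  DECL_UNUSED(unsigned);      // instead of bit_width
  char const *name;
};

// Each placeholder has the size and alignment of the type it stands in
// for, so name ends up at the same offset in both structs.
static_assert(
  offsetof( struct with_unused, name ) ==
  offsetof( struct with_members, name ),
  "offsetof name in with_unused & with_members must be equal"
);
```

The _Alignas is what makes this work: a plain char array has an alignment of 1, so without it the members that follow could land at different offsets.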

We can apply the same fix for the information for K_CONSTRUCTOR so param_ast_list is at the same offset as in c_function_ast and c_operator_ast (constructors don’t have return types):

struct c_ctor_ast {
  DECL_UNUSED(c_ast_t*);          // instead of ret_ast
  c_ast_list_t  param_ast_list;   // parameter(s)
};

And again apply the same fix for the information for K_CLASS_STRUCT_UNION so csu_name is at the same offset as enum_name in c_enum_ast:

struct c_csu_ast {
  DECL_UNUSED(c_ast_t*);  // instead of for/to
  DECL_UNUSED(unsigned);  // instead of bit_width
  char const *csu_name;
};

Given all those declarations (assume that for any struct c_X_ast, there’s a typedef struct c_X_ast c_X_ast_t), we can now put them all inside an anonymous union inside a struct for an AST node:

struct c_ast {
  c_ast_kind_t  kind;
  char const   *name;
  c_type_t      type;
  // ...

  union {
    c_array_ast_t     array;
    c_builtin_ast_t   builtin;
    c_csu_ast_t       csu;
    c_ctor_ast_t      ctor;
    c_enum_ast_t      enum_;
    c_function_ast_t  func;
    c_operator_ast_t  oper;
    c_ptr_ref_ast_t   ptr_ref;
    c_typedef_ast_t   tdef;
    // ...
  };
};
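To see why the matching offsets are convenient, here’s a cut-down, self-contained sketch with only four kinds of node. The function c_ast_bit_width() is a hypothetical helper written for this example; it reads bit_width through one union member regardless of which of the three kinds that have a width is actually stored:

```c
#include <assert.h>   // static_assert
#include <stddef.h>   // offsetof

typedef struct c_ast c_ast_t;

// Cut-down subset of the kinds, for illustration only.
enum c_ast_kind { K_BUILTIN, K_ENUM, K_TYPEDEF, K_POINTER };

struct c_builtin_ast {
  void          *reserved;    // instead of for/to
  unsigned       bit_width;   // width when > 0
};
struct c_enum_ast {
  c_ast_t       *of_ast;      // fixed type, if any
  unsigned       bit_width;   // width when > 0
  char const    *enum_name;   // enumeration name
};
struct c_typedef_ast {
  c_ast_t const *for_ast;     // typedef for ...
  unsigned       bit_width;   // width when > 0
};
struct c_ptr_ref_ast {
  c_ast_t       *to_ast;      // pointer/ref to ...
};

struct c_ast {
  enum c_ast_kind kind;
  union {
    struct c_builtin_ast  builtin;
    struct c_enum_ast     enum_;
    struct c_typedef_ast  tdef;
    struct c_ptr_ref_ast  ptr_ref;
  };
};

static_assert(
  offsetof( struct c_enum_ast, bit_width ) ==
  offsetof( struct c_builtin_ast, bit_width ),
  "offsetof bit_width in c_enum_ast & c_builtin_ast must be equal"
);
static_assert(
  offsetof( struct c_typedef_ast, bit_width ) ==
  offsetof( struct c_builtin_ast, bit_width ),
  "offsetof bit_width in c_typedef_ast & c_builtin_ast must be equal"
);

// Hypothetical helper: returns the bit-field width of any node that can
// have one, or 0 otherwise.  Since bit_width has the same type and offset
// in all three structs, reading it via builtin works for each of them.
unsigned c_ast_bit_width( c_ast_t const *ast ) {
  switch ( ast->kind ) {
    case K_BUILTIN:
    case K_ENUM:
    case K_TYPEDEF:
      return ast->builtin.bit_width;
    default:
      return 0;
  }
}
```

Without the matching offsets, the function would need a separate case for each kind that reads that kind’s own member.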

Safeguards

One problem with this approach is that, if you modify any of the structs, you might inadvertently change the offset of some member so that it no longer is at the same offset as the same member in another struct. One way to guard against this is via offsetof (from <stddef.h>) and static_assert (a macro for _Static_assert from <assert.h> in C11 and C17, and a keyword as of C23):

static_assert(
  offsetof( c_operator_ast_t, param_ast_list ) ==
  offsetof( c_function_ast_t, param_ast_list ),
  "offsetof param_ast_list in c_operator_ast_t & c_function_ast_t must equal"
);

static_assert(
  offsetof( c_csu_ast_t, csu_name ) ==
  offsetof( c_enum_ast_t, enum_name ),
  "offsetof csu_name != offsetof enum_name"
);

// More for other members ....

Now you’ll get a compile-time error if any of the offsets change inadvertently.

Conclusion

Take-aways for unions in C:

  • They can be used either for storing data for any one member at any one time or for type punning.
  • For type punning very different types, the order of bytes is CPU-dependent.
  • They can be used to implement restricted class hierarchies.
