Adding peephole optimization to GCC

I have been working on GCC for some time now, and would like to share some of my excursions in GCC land ;-) Adding a peephole optimization is a good starting point: it's one of the easiest ways to modify GCC, and it also gives me an opportunity to talk a bit about my first non-trivial GCC contribution, which was also my GSoC 2014 project.

A view of GCC from 10k feet

The architecture of GCC and its build process merit a post of their own; this is a very simplified overview of the process. Roughly, GCC is divided into three high-level components: the front end, the so-called "middle end", and the back end.

The front-end deals with language issues, like lexical analysis and parsing, and ends up constructing a common abstract syntax tree, which is called “GENERIC” in GCC’s parlance. GENERIC is a tree-based high-level intermediate representation that is common to all the GCC front-ends.

The GENERIC IR is then lowered to another, lower-level IR called GIMPLE, which is GCC's version of three-address code. A statement in three-address code can have at most three operands. This makes it similar in structure to assembly while remaining target-agnostic, which makes it suitable for machine-independent optimizations. To illustrate the point, consider the following C expression:

x = (a * b) + (c * d) - e;

In three-address code, it would look like:

_1 = a * b
_2 = c * d
_3 = _1 + _2
_4 = _3 - e
x = _4

Where the variables _1, _2, _3, _4 are compiler-generated temporaries.

The next step in the compiler is to build a control-flow graph (CFG for short) of the function. A node in the CFG represents a basic block (a sequence of straight-line GIMPLE statements), and control flow is represented as edges between the basic blocks. The compiler automatically adds dummy entry and exit blocks for the purposes of data-flow analysis.

The representation is then rewritten into SSA form, which is the IR of choice for modern compiler optimizations. The notable distinction of SSA form compared to the traditional three-address-code format is that each variable may be assigned only once. This simplifies the implementation of several optimizations by making use-def chains explicit in the IR, and also singletons (since there is only one def per use). For more details on SSA, check these CMU slides.
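To make the single-assignment property concrete, here is a hand-written sketch (not actual compiler output) of how a reassignment is versioned in SSA form:

```
/* three-address code */      /* SSA form */
x = a + b;                    x_1 = a_0 + b_0;
x = x * 2;                    x_2 = x_1 * 2;
```

Each assignment creates a fresh version of the variable, so every use refers to exactly one definition. Where control-flow paths merge, SSA inserts a phi node such as x_3 = PHI <x_1, x_2> to select whichever definition reaches the join point.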

The compiler then does most of the machine-independent optimizations on this SSA form, and then lowers it to the third and final intermediate representation in GCC, called "register transfer language", or RTL for short. RTL is a low-level IR used for target-dependent optimizations and code generation.

Peephole Optimization

Peephole optimization, or folding, is one of the simplest optimizations done in a compiler. One example is converting a * 2 to a << 1, assuming that a left shift is faster than a multiplication, which is true on most processors. Folding of expressions is done on all of the IRs within the compiler. In this post, I will mainly focus on folding at the GENERIC and GIMPLE level.

GCC has a robust GENERIC folder in fold-const.c that works on a single expression at a time. However, there was no real GIMPLE folder that would fold statements at the basic-block level. To illustrate the point, consider this expression:

x = (a / 2) * 2;  // assume 'a' is unsigned

It would be represented as a single expression in GENERIC and would get folded to a & ~1.

However if it’s written in the following form explicitly in source:

t1 = a / 2;
t2 = t1 * 2;
x = t2;

The GENERIC folder would no longer be able to fold it, because in GENERIC these would be represented as three different expressions. Hence the need for a GIMPLE-based folder that can do foldings at the basic-block level.

GIMPLE shares some internal data structures with GENERIC, most notably 'tree' for representing operands. For the sake of convenience, when there was no real GIMPLE folder, the GIMPLE statements in a basic block were converted back to the GENERIC representation for folding and then lowered down to GIMPLE again! This was both a design kludge and a computationally expensive process. But writing an entire GIMPLE folder from scratch was a huge task! And implementing two different versions of the same pattern wasn't pretty from a maintenance perspective either.

To address these issues, Richard Biener, a prolific GCC developer, posted a proposal on the mailing list to create a new domain-specific language for writing folding patterns, called "match.pd", and a tool (genmatch.c) that would read these patterns and emit C code that does the folding on both GENERIC (generic-match.c) and GIMPLE (gimple-match.c). Thus implementing a pattern once makes it available for both IRs, and in a friendlier format too. Targeting RTL at the same time would have been much harder. My GSoC project under Richard's guidance was to create an initial version of genmatch.c.

Aside — Building GCC

Before we proceed further with the code, here's a quick primer on building a modern version of GCC on a recent Ubuntu distro. I expect it to be similar for other distros, but I haven't tried so far.

sudo apt install build-essential flex bison
sudo apt-get build-dep gcc
mkdir gnu-toolchain
cd gnu-toolchain
git clone
cd gcc
# Make a *separate* build dir, which is sibling to gcc/
cd ..
mkdir stage1-build
cd stage1-build
../gcc/configure --enable-languages=c,c++ --disable-bootstrap --disable-multilib
nohup make -j$(nproc) >err 2>&1 &

If all goes well, there should be an executable called "cc1" in the stage1-build/gcc/ directory. cc1 is the actual C compiler binary (the compiler proper, as opposed to the gcc driver). We will be using it directly for testing purposes.

Writing a match.pd pattern

Let’s add a simple transform to gcc: sin²(x) + cos²(x) -> 1.

double f(double x)
{
  return (__builtin_sin(x) * __builtin_sin(x))
       + (__builtin_cos(x) * __builtin_cos(x));
}

double g(double x)
{
  double t1 = __builtin_sin (x);
  double t2 = t1 * t1;
  double t3 = __builtin_cos (x);
  double t4 = t3 * t3;
  return t2 + t4;
}

Note: I am just using the above contrived pattern as an example. In practice it’s highly unlikely that a user would write it this way, so it wouldn’t make sense to have it implemented within the compiler.

The rationale for writing two different versions of the same test case, f and g, is that the folding in 'f' should be done by the GENERIC folder, since it would be represented as a single expression in GENERIC, while the folding in 'g' would need to be done by the GIMPLE folder.

Let’s verify that the pattern isn’t already implemented within the compiler. We can do that by inspecting the .optimized dump generated by GCC, which is the final output of the middle-end.

Compiling with cc1 -O2 -fdump-tree-optimized-details shows (in the dump file ending with .optimized):

;; Function g (g, funcdef_no=1, decl_uid=1909, cgraph_uid=2, symbol_order=1)

g (double x)
{
  double t4;
  double t3;
  double t2;
  double t1;
  double _6;
  complex double sincostmp_8;

  <bb 2> [local count: 1073741825]:
  sincostmp_8 = __builtin_cexpi (x_1(D));
  t1_2 = IMAGPART_EXPR <sincostmp_8>;
  t2_3 = t1_2 * t1_2;
  t3_4 = REALPART_EXPR <sincostmp_8>;
  t4_5 = t3_4 * t3_4;
  _6 = t2_3 + t4_5;
  return _6;
}

The dump for 'f' is the same. The curious occurrence of __builtin_cexpi above is the sincos transform, which replaces a combined occurrence of sin(x) and cos(x) with a call to cexpi(x), taking advantage of Euler's identity cexpi(x) = cos(x) + i*sin(x) to compute sin and cos with a single function call. But that's irrelevant to our pattern.

To implement the transform, add the following pattern to match.pd:

(simplify
 (plus (mult (SIN @0) (SIN @0))
       (mult (COS @0) (COS @0)))
 (if (flag_unsafe_math_optimizations)
  { build_one_cst (TREE_TYPE (@0)); }))

Several things to note about the above pattern:

  1. simplify denotes that the whole form is a pattern, which consists of three parts: match, condition, and replace.
  2. match: The match expression is the one starting with (plus …). The compiler matches the incoming TREE expression, or the statements in a basic block, against this pattern, and if the match succeeds, replaces it with the sequence from the replacement part.
  3. mult denotes the multiplication operator, written in lower case. Built-in functions like sin and cos are denoted in upper case.
  4. @0 is a so-called "capture", which is used to represent an operand, or "leaf", of the expression. In our case, @0 would be x.
  5. The condition (if …) is an actual C condition embedded inside the pattern. This is a kludge, since ideally we want patterns to be defined purely in the DSL, but dropping down to C becomes necessary when the pattern language isn't expressive enough, or we need to access something else, like flag_unsafe_math_optimizations in the above case. The transform is gated on that flag to be conservative: the identity does not hold for NaN or infinite arguments.
  6. Finally, coming to the replacement part: this can either be a pattern-language construct enclosed within (…) or a C expression enclosed within braces { … }. In the above example, there's no construct in the pattern language to build a tree node, so we "fall back" to C code and construct a tree node for the constant 1 with the function build_one_cst. The TREE_TYPE macro yields the type of 'x', which in our case is double, but could also be another float variant.

At GCC build time, the genmatch tool reads match.pd and generates equivalent C code for GENERIC folding in generic-match.c and for GIMPLE folding in gimple-match.c. For instance, the following function, gimple_simplify_273, is the auto-generated C function implementing the "condition and replace" part of the above pattern in gimple-match.c:

static bool
gimple_simplify_273 (gimple_match_op *res_op, gimple_seq *seq,
                     tree (*valueize)(tree) ATTRIBUTE_UNUSED,
                     const tree ARG_UNUSED (type), tree *ARG_UNUSED (captures),
                     const combined_fn ARG_UNUSED (SIN), const combined_fn ARG_UNUSED (COS))
{
  /* #line 5004 "../../gcc/gcc/match.pd" */
  if (flag_unsafe_math_optimizations)
    {
      gimple_seq *lseq = seq;
      if (dump_file && (dump_flags & TDF_FOLDING))
        fprintf (dump_file, "Applying pattern match.pd:5005, %s:%d\n", __FILE__, __LINE__);
      tree tem;
      tem = build_one_cst (TREE_TYPE (captures[0]));
      res_op->set_value (tem);
      return true;
    }
  return false;
}

The “match” part is combined with “match” parts of other patterns in the form of a decision tree, to avoid duplicate checks.

Let's check whether the pattern works. To verify that the GENERIC folder did its job, we need to inspect the "gimple" dump, i.e., the GIMPLE output produced by the "gimplifier", which lowers GENERIC to the GIMPLE IR.

cc1 -O2 -funsafe-math-optimizations -fdump-tree-gimple-details shows:

f (double x)
{
  double D.1917;

  D.1917 = 1.0e+0;
  return D.1917;
}

From the above dump, we can clearly see that the GENERIC folder did its work by replacing sin²(x) + cos²(x) with 1.

For function g, we can see that the conditional constant propagation pass (CCP for short, GCC's implementation of sparse conditional constant propagation) did the folding as part of its operation.

cc1 -O2 -funsafe-math-optimizations -fdump-tree-ccp-details:

Visiting statement:
t5_6 = t2_3 + t4_5;
which is likely CONSTANT
Match-and-simplified t2_3 + t4_5 to 1.0e+0
Lattice value changed to CONSTANT 1.0e+0. Adding SSA edges to worklist.
marking stmt to be not simulated again

g (double x)
{
  double t5;
  double t4;
  double t3;
  double t2;
  double t1;

  <bb 2> :
  t1_2 = __builtin_sin (x_1(D));
  t2_3 = t1_2 * t1_2;
  t3_4 = __builtin_cos (x_1(D));
  t4_5 = t3_4 * t3_4;
  return 1.0e+0;
}

From the above dump we can see that the return value has been simplified to 1.0. Note that the CCP pass only does its own thing, namely propagating constants; other passes, like dead-code elimination, do the job of cleaning up the now-dead calls to sin and cos. Eventually we see the following optimized dump:

g (double x)
{
  <bb 2> [local count: 1073741825]:
  return 1.0e+0;
}

For more details on the pattern-language syntax, please refer to the GCC internals documentation, and to these excellent slides by Prof. Uday Khedker and his team. They are a bit dated but provide a good conceptual overview of GCC.

So that was a short intro to GCC and match.pd. I hope you enjoyed it ;-)


Thanks to reddit user IshKebab for pointing out a mistake in the (a / 2) * 2 folding pattern in the reddit thread. It is now corrected to (a / 2) * 2 -> a & ~1 if a is unsigned.