Programming Poetry: Duff’s Device

Published in

Hard Mode

5 min readOct 11, 2017

Duff’s Device is a really clever implementation of dynamic loop unrolling. Tom Duff wrote the code in 1983 while working at Lucasfilm. Duff invented his particular implementation of loop unrolling in order to optimize a routine that copied an array of shorts into the Programmed IO data register of an Evans & Sutherland Picture System II. Let’s check out the code first, and then take a look at some background information which will help to understand it.

send(to, from, count)
	register short *to, *from;
	register count;
	{
		register n=(count+7)/8;
		switch(count%8){
		case 0:	do{	*to = *from++;
		case 7:		*to = *from++;
		case 6:		*to = *from++;
		case 5:		*to = *from++;
		case 4:		*to = *from++;
		case 3:		*to = *from++;
		case 2:		*to = *from++;
		case 1:		*to = *from++;
			}while(--n>0);
		}
	}

Magic. Now let’s back up and try to understand it.

The Evans & Sutherland Picture System

The code was written for The Evans & Sutherland Picture System II. Check out this front page of a pamphlet on the Picture System I from 1974. It’s a blast from the past!

a pamphlet on the Picture System 1 from 1974

Duff wrote his code for the Picture System II, but one can safely assume that it was quite similar to version I. The machine was used to visualize and manipulate 3D models in real time. It included a tablet and pen that together served as the standard graphic input device. The images below should give you an idea of how The Picture System was used.

The Impetus

Here is an example of a routine from a program that Duff was working on for the Picture System.

send(to, from, count)
	register short *to, *from;
	register count;
	{
		do
			*to = *from++;
		while(--count>0);
	}

In an email from 1983, where Duff formally introduced his device, he also described the impetus for writing it. He included the routine above and noted the following:

As it turns out, this loop was the bottleneck in a real-time animation playback program which ran too slowly by about 50%.

The typical solution, as he states in the email, is loop unrolling. In order to understand Duff’s Device, we must first understand the concept of loop unrolling.

Loop Unrolling

Loop unrolling is an optimization technique that trades program size for speed. The binary size of the program is increased, but execution speed is gained. The technique works by transforming a loop in order to eliminate assembly instructions that control the loop. In the case of the VAX C compiler that The Picture System II used, the do-while loop in question would be compiled into two instructions: movw and subleq. The movw instruction makes a copy of one register pair into another register pair, and the subleq instruction specifies “SUbtract and Branch if Less than or EQual to zero”. Let’s consider the do-while loop of interest and examine the assembly instructions.

{
       	do
		*to = *from++;        // compiles to a movw
	while(--count>0);             // compiles to a subleq
}

If count were say, 20, then the loop above would compile down to 20 movw instructions and 20 subleq instructions. Now, let us consider unrolling the loop in order to decrease the number of subleq instructions run by the computer.

int n = count / 5;                // if count is 20, n is 4
{
       	do
	       *to = *from++;     // 5 copies will be executed
               *to = *from++;     // each iteration of the 
               *to = *from++;     // loop
               *to = *from++;
               *to = *from++; 
	while(--n>0);             // the loop will run 4 times 
}                                 // for the same total of 20 copies

By duplicating the body of the loop, it has been unrolled 5 times. The resulting compiled C code includes the same 20 movw instructions as the original loop, but the unrolled loop includes only 4 subleq instructions. You might be asking yourself: what happens if count is not evenly divisible by 5? In this case, programmers often implement a post-processing loop or a switch statement to handle the remainder. A switch implementation might look something like this:

int remainder = count % 5;
switch (remainder)                
{
        case 4 : *to = *from++;      
        case 3 : *to = *from++;
        case 2 : *to = *from++;
        case 1 : *to = *from++;
        case 0 : ;
}

This code will execute the remaining number of register copies not completed in the unrolled loop. This code compiles down to a number of movw instructions equal to remainder (plus the instructions required for the switch calculation and branching to the correct case label). It is important to understand that in C, switch statements do not break automatically before each case label. This means that once code execution begins at the calculated case label, it falls through the remaining case labels executing each line of code in succession. This property of C is instrumental in Duff’s implementation of loop unrolling.

Since movw instructions are executed faster than subleq instructions, the unrolled loop coupled with a switch statement accomplishes the same task as the original loop and runs significantly faster. Now that we understand loop unrolling, let’s take a look at Duff’s clever implementation.

Duff’s Device

Duff essentially wove together a loop unroll and a switch statement to poetically express the same idea that the two code blocks express separately. It is surprising to many, that it is in fact legal C code.

Here it is, Duff’s Device, in all its beauty:

send(to, from, count)
	register short *to, *from;
	register count;
	{
		register n=(count+7)/8;
		switch(count%8){
		case 0:	do{	*to = *from++;
		case 7:		*to = *from++;
		case 6:		*to = *from++;
		case 5:		*to = *from++;
		case 4:		*to = *from++;
		case 3:		*to = *from++;
		case 2:		*to = *from++;
		case 1:		*to = *from++;
			}while(--n>0);
		}
	}

note: There is no post increment on the to pointer, because the device was meant to copy values serially into an IO register. The device could be used for general memory copy if the postfix ++ were to be included.

How it works

Duff’s Device includes a loop unwound 8 times. The loop is embedded in a switch statement such that the switch calculates the remainder of register copies to carry out before the loop is entered. Execution jumps into the loop at the calculated case label, falls through the remaining case labels, and then hits the while keyword which will jump to the beginning of the do-while loop and execute cycles of 8 register copies until n reaches 0.

Disadvantages

This speed increase is not without consequences, however. The static code size is increased, which may be undesirable or excessive and can in turn cause an increase in instruction cache misses. In order to determine whether it is worth manually unrolling a loop, the cost of increased static code size and cache misses must be weighed against the gain in program execution speed. It is worth noting that nowadays it is increasingly rare that a programmer needs to manually unroll a loop since most modern compilers are very good at optimization and will already unroll loops where they deem appropriate.

Sources

https://stackoverflow.com/questions/514118/how-does-duffs-device-work
http://www.lysator.liu.se/c/duffs-device.html
http://www.drdobbs.com/a-reusable-duff-device/184406208
https://en.wikipedia.org/wiki/Duff%27s_device#Performance