# Reversing Bits in C

## A small performance investigation into innocent-looking code.

Heads up, we’ve moved! If you’d like to continue keeping up with the latest technical content from Square please visit us at our new home https://developer.squareup.com/blog

Thanks to readers rjs, meteorfox in the comments, and reader lnxturtle over on the Reddit thread (!) for correcting my poor choice of words. I incorrectly used the phrase Cache Coherence when the correct phrase was (Spatial / Temporal) Cache Locality. Your attention to detail is appreciated!

# The Setting

Interesting lessons can come from unexpected places! I was pleasantly surprised at how something as “simple” as reversing bits in a byte could lead me on an unexpectedly deep exploration: operation vs instruction count, memory access patterns and cache behavior, and low-level CPU instructions. It’s often very easy to make assumptions about the performance of code that we write, and I hope that this article serves as a reminder that the map is never the territory, and that the only way to understand what’s happening inside your code is by observing and measuring it.

`unsigned char ReverseBitsInByte(unsigned char v){    return (v * 0x0202020202ULL & 0x010884422010ULL) % 1023;}`

# The Landscape

Here are all of the algorithms listed on the Bit Hacks page, with their advertised “operation counts”:

# The “obvious” way (O(N) where N is the index of the highest set bit):

`unsigned char ReverseBitsObvious(unsigned char v){    unsigned char r = v;    int s = sizeof(v) * CHAR_BIT - 1;    for (v >>= 1; v; v >>= 1) {        r <<= 1; r |= v & 1; s--;    }    return r << s;}`

# Lookup table (O(1)):

`#define R2(n)     n,     n + 2*64,     n + 1*64,     n + 3*64#define R4(n) R2(n), R2(n + 2*16), R2(n + 1*16), R2(n + 3*16)#define R6(n) R4(n), R4(n + 2*4 ), R4(n + 1*4 ), R4(n + 3*4 )static const unsigned char BitReverseTable256 ={    R6(0), R6(2), R6(1), R6(3)};unsigned char ReverseBitsLookupTable(unsigned char v){    return BitReverseTable256[v];}`

# 3 operations (O(1)) (64-bit multiply and modulus division):

`unsigned char ReverseBits3ops64bit(unsigned char v){    return (v * 0x0202020202ULL & 0x010884422010ULL) % 1023;}`

# 4 operations (O(1)) (64-bit multiply, no division):

`unsigned char ReverseBits4ops64bit(unsigned char v){    return ((v * 0x80200802ULL) & 0x0884422110ULL) * 0x0101010101ULL >> 32;}`

# 7 operations (O(1)) (32-bit):

`unsigned char ReverseBits7ops32bit(unsigned char v){    return ((v * 0x0802LU & 0x22110LU) |            (v * 0x8020LU & 0x88440LU)) * 0x10101LU >> 16;}`

# Parallel bitwise approach (O(5log(N)) where N is the number of bits):

`unsigned char ReverseBits5logNOps(unsigned char v){    v = ((v >> 1) & 0x55) | ((v & 0x55) << 1);    v = ((v >> 2) & 0x33) | ((v & 0x33) << 2);    v = ((v >> 4) & 0x0F) | ((v & 0x0F) << 4);    return v;}`

# “Cheating”

Let’s take a step back here. We’re running this code on a specific family of CPUs, the ARMv6 and greater (available in all iPhones). ARMv6 and greater CPUs have an instruction called RBIT. This single instruction does exactly what we need: it reverses the bits in a 32-bit register. Unfortunately, while a lot of CMSIS implementations provide an RBIT intrinsic, it doesn’t look like one is provided with Xcode. It’s easy enough to drop down into inline assembly, though, and call the instruction ourselves:

`unsigned char ReverseBitsRBIT(unsigned char v){    uint32_t input = v;    uint32_t output;    __asm__("rbit %0, %1\n" : "=r"(output) : "r"(input));    return output >> 24;}`

# The Shoot-out

Here are the timings for reversing the bits in a byte, performed 50 million times each.

`Starting tests...    ReverseBitsLookupTable... 10.058261ns per function call    ReverseBitsRBIT...        10.123462ns per function call    ReverseBits7ops32bit...   17.453080ns per function call    ReverseBits5logNOps...    20.054218ns per function call    ReverseBits4ops64bit...   21.203815ns per function call    ReverseBitsObvious...     65.809257ns per function call    ReverseBits3ops64bit...  509.621657ns per function call`

# The Lesson

So what’s the takeaway here? What have we learned from this experiment?

Written by