Fast Colour Analysis

A look at achieving iTunes 11 colour matching in realtime

iTunes 11 has a nice feature where it will match the colours used in the album details section to the colours used in the album cover art. For better or worse, I wanted to add this feature to TunesBar+. Googling for some ideas led me to the Panic Blog where Wade Cosgrove had written some example code that counted the colours in the image and picked some colours that went well together.

The code works well at picking colour combinations, but it is quite resource-hungry and not the fastest. This isn't a criticism of the code; Wade makes it clear in the blog post that it's a proof of concept, not an optimised algorithm. The example code passes the analysis work off to a background thread so as not to block the main UI thread, but I didn't want to take that approach in TunesBar+. For my use, I was hoping for close to realtime performance, so that I could call it on the main UI thread without any noticeable performance loss.

I created an application to stress test the SLColorArt code by analysing 100 cover art images of various sizes (between 30x30 and 1500x1500). These images were taken from my music folder, so they could be considered a real world test of the sorts of images that I might encounter. What I wanted to test was the speed of the image analysis, and as SLColorArt scales images even when it doesn't need to, I modified it slightly to remove the initial image scaling.

With this I was able to set a baseline to work with: 64 seconds for the 100 images.

My first thought was to somehow reduce the number of pixels to analyse. Other people had had the same idea: a number of comments on the blog talked about speeding things up by scaling the images to 50x50. But simply scaling didn't seem right: colours can change depending on the scaling algorithm, and scaling itself takes time.

Instead I decided to use Core Image's CIPixellate filter, meaning I could let the GPU quickly pixellate the image, replacing each area with a single representative colour, so that I would only need to look at the first colour in each large pixel. Unfortunately, this only decreased the time by about 3 seconds, not as much of a gain as I had hoped for. I also noticed a few other things happening.

Looking at the individual timings for each image I saw that the initial image always took 2 or 3 seconds longer than the others. It didn’t matter which image was first, or if I analysed the same image twice — the first one was always slower. I wondered if this was the time needed to initialise the GPU for doing the work. Whatever it was, it was complicating my desire for realtime analysis.

The second thing I noticed was the Energy Impact bar in Xcode was always reading ‘Low’ when I used the GPU. Previous runs that didn’t use the GPU would read ‘High/Low’ when actually doing work, but ‘Zero’ when the app was sitting idle. Now the app never got to ‘Zero’. Anecdotally it also seemed like my laptop battery was running out faster while I was testing it. Using CoreImage wasn’t seeming like such a good idea, as TunesBar+ is supposed to be a fairly lightweight app.

But I wondered: if my hypotheses were correct, and the extra time for the first image was due to the time needed to initialise the hardware GPU context, and the extra power being used was because my app had started using the hardware, what would happen if I used the software renderer for my CIContext instead of the default hardware renderer?

Using the software renderer is as simple as passing the kCIContextUseSoftwareRenderer option when creating the CIContext:

CIContext* ciContext = [CIContext contextWithCGContext:context options:@{kCIContextUseSoftwareRenderer: @YES}];

And I was right: the extra 2 seconds of startup time were gone, and the Energy Impact meter read 'Zero' again when the app was idle. It still wasn't what anyone would call realtime, though.

Looking through the code I could see another potential bottleneck: a lot of NSColor objects were being used, especially in the NSColor DarkAdditions category code. Objects are great, but they're not completely free of cost; they have allocation, initialisation and destruction costs, and creating a lot of them in a tight loop, as the SLColorArt code was doing, is not great for performance.

The obvious replacement for the DarkAdditions functions would be to operate on the raw RGB values rather than accepting any NSColor object and then needing to do a colourspace conversion every time. Because I had used NSCalibratedRGBColorSpace to create the NSBitmapImageRep from the CIImage, I knew that all the NSColors I had would be RGB.

Porting the DarkAdditions code to use pure RGB values was mostly trivial, as most of the methods used the RGB values internally anyway, so it was just a case of removing the NSColor conversions from them. For the methods that required conversion between RGB and HSB, I consulted my aging copy of Computer Graphics: Principles and Practice, which had the necessary algorithms.

At this point, I also made the methods into pure C functions so that they wouldn't need to depend on NSColor for anything. I also knew that calling a lot of Objective-C methods in a tight loop harms performance.

Instruments confirmed this: the largest portion of the application running time (14.7%) was spent in objc_msgSend, and the heaviest caller of objc_msgSend in the code I had control over was findEdgeColor:imageColors. With this in mind, I set about looking for ways to reduce the number of objects needed to perform the analysis.

All the code really needed from NSColor was the RGB values, and some way to pass them around together. I could have made arrays for each of the component values, but trying to sort and count multiple arrays would have been a nightmare. I could have used a struct as a lighter-weight container than an object, but there would still be costs associated with memory allocation and deallocation, and I would have to keep track of when memory needed to be freed. Neither of these solutions was acceptable, so I looked for another technique.

The components come out of NSColor as floats between 0.0 and 1.0, but they can be represented as integers by multiplying them by 255. This means each component takes 8 bits, and so RGB and the associated alpha value (even though we weren't using it) can fit into a 32-bit integer with some simple bit shifting.

If we shift the red component left by 24 bits, the green component left by 16 bits and the blue component left by 8 bits, and bitwise OR everything together with the alpha, then we get one number that represents RGBA. The reverse is similar: take our RGBA number, bitwise AND out the component we want with an 0xFF mask shifted into position, and shift it right so that it is an 8-bit integer again.

Some trivial C macros are all that I needed to do this. I created a SquashedColor type as an alias to UInt32 so that I would know when the value I was passing around was an RGBA colour.

typedef UInt32 SquashedColor;
#define SQUASH_RGBA(v) ((SquashedColor)((v)[0] << 24 | (v)[1] << 16 | (v)[2] << 8 | (v)[3]))
#define UNSQUASH_R(v) ((v) >> 24)
#define UNSQUASH_G(v) (((v) & (0xFF << 16)) >> 16)
#define UNSQUASH_B(v) (((v) & (0xFF << 8)) >> 8)
#define UNSQUASH_A(v) ((v) & 0xFF)
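As a sanity check, the squash/unsquash round trip can be exercised in plain C. This sketch mirrors the macros above but substitutes stdint.h's uint32_t for Cocoa's UInt32, and casts each component before shifting; it's illustrative rather than the shipped code:

```c
#include <assert.h>
#include <stdint.h>

/* Plain-C stand-in for the Cocoa-typed SquashedColor: a 32-bit
 * integer holding R, G, B and A in its four bytes. */
typedef uint32_t SquashedColor;

#define SQUASH_RGBA(v) ((SquashedColor)(v)[0] << 24 | \
                        (SquashedColor)(v)[1] << 16 | \
                        (SquashedColor)(v)[2] << 8  | \
                        (SquashedColor)(v)[3])
#define UNSQUASH_R(v) ((v) >> 24)
#define UNSQUASH_G(v) (((v) & (0xFF << 16)) >> 16)
#define UNSQUASH_B(v) (((v) & (0xFF << 8)) >> 8)
#define UNSQUASH_A(v) ((v) & 0xFF)
```

Squashing the 4-byte pixel { 0xC8, 0x40, 0x10, 0xFF } gives 0xC84010FF, and each unsquash macro recovers the original byte.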

The colour analysis algorithm is split into two parts. The first part looks at every pixel and creates a list of the colours used, sorted by their frequency in the image; the second part uses that list to select the 4 colours we need: one background colour and 3 text colours.

To carry out the first part, SLColorArt uses an NSCountedSet, a derivative of NSSet. A set can only store one instance of each member, but NSCountedSet adds the ability to get the number of times an object was added to the set, even though it is only contained once. After that, the colours need to be sorted according to their count, but a set is an unordered collection, so SLColorArt iterates over the set members, creating for each a PCCountedColor object, which connects the NSColor to its count. These PCCountedColors are placed in an NSArray and sorted.

The problem here is that NSCountedSet and NSArray only store objects. I could box the SquashedColor in an NSNumber, and while NSNumber is probably slightly lighter than NSColor, it's still an object, so it doesn't really help to reduce the object count. I needed a different collection type: something with fast insertion and lookup that would also let me store integers.

Foundation has a number of lesser-known collection types. Everyone knows NSArray and NSDictionary; NSSet and NSCountedSet are used less often. At the more obscure end lies NSMapTable, which is a very versatile object. It acts in the same way as an NSDictionary, mapping keys to values, but it can be configured to operate on raw C pointers instead of just objects. I could use this feature to store my SquashedColor integers by casting them to pointers (after all, what is a pointer but a very large integer?), and indeed, NSMapTable has NSIntegerMapKeyCallBacks and NSIntegerMapValueCallBacks to help do just that.

One other interesting feature of NSMapTable is that even though it is an Objective-C object, it has both an Objective-C interface and a pure C interface. As I'm trying to reduce the number of objc_msgSend calls the application makes, I'll use the C interface, although it does mean I need to manage the NSMapTable's memory myself rather than letting ARC do it. This isn't such a chore: I'm only creating two NSMapTables, and they have a very specific, short lifespan.

NSMapTable *backgroundMap = NSCreateMapTable(NSIntegerMapKeyCallBacks, NSIntegerMapValueCallBacks, maxNumberOfColors);

NSMapTable is still missing NSCountedSet's counting feature, however, so we'll have to implement it ourselves: check whether a colour is already in the map table and, if it is, increment its count; otherwise insert it with a count of 1.

The next part of the algorithm, which creates NSArrays of PCCountedColors, could still be used if we modified PCCountedColor to use the SquashedColor type rather than NSColor, but that's still creating objects to associate one number (our SquashedColor) with another number (the number of times it appears in the image).

Again I didn’t want the hassle of trying to maintain multiple arrays, or allocating structs, but I could use the same trick I used earlier to create the SquashedColor: shift the SquashedColor left by 32bits, which would then leave me with the lower 32bits of a UInt64 to store the count which can be manipulated with the standard C operators.

Again, some macros simplified the code, and I created another type, CountedColor.

typedef UInt64 CountedColor;
#define TAG(v) (((CountedColor)(v)) << 32)
#define UNTAG_RGBA(v) ((SquashedColor)((v) >> 32))
#define UNTAG_COUNT(v) ((UInt32)(v))
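The tagging trick can likewise be sketched in plain C, with uint64_t and uint32_t standing in for Cocoa's UInt64 and UInt32 (again, an illustration rather than the shipped code). The useful property is that ordinary + 1 arithmetic on the tagged value bumps the count in the low word without disturbing the colour in the top 32 bits:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t SquashedColor;
typedef uint64_t CountedColor;

/* Colour in the top 32 bits, count in the bottom 32 bits. */
#define TAG(v) (((CountedColor)(v)) << 32)
#define UNTAG_RGBA(v) ((SquashedColor)((v) >> 32))
#define UNTAG_COUNT(v) ((uint32_t)(v))
```

Starting from TAG(colour) + 1 for the first occurrence, each subsequent occurrence is just tagged + 1, and UNTAG_RGBA/UNTAG_COUNT recover the two halves independently.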

This meant I could carry out the counting with a short C function that checks to see if we’ve already got the SquashedColor in the map table, and increments it if we have, otherwise it adds it with a count of 1.

static void insertIntoMapOrIncrement(NSMapTable *mapTable, SquashedColor key)
{
    UInt64 key64 = (UInt64)key;
    void *value = NSMapGet(mapTable, (const void *)key64);

    if (value == NULL) {
        // Create a CountedColor with an initial count of 1
        CountedColor taggedValue = TAG(key) + 1;
        NSMapInsert(mapTable, (const void *)key64, (const void *)taggedValue);
    } else {
        // The count lives in the lower 32 bits, so a simple
        // increment updates it without disturbing the colour
        UInt64 taggedValue = (UInt64)value + 1;
        NSMapInsert(mapTable, (const void *)key64, (const void *)taggedValue);
    }
}

Like NSCountedSet, NSMapTable is unordered, so I still needed to gather the CountedColors into an array and sort them manually. I just needed to allocate a chunk of memory for an array of CountedColors and then sort them with the standard libc qsort function.

static int
compare_tagged_ints_reversed(const void *a_pointer, const void *b_pointer)
{
    CountedColor tagged_a = *(CountedColor *)a_pointer;
    CountedColor tagged_b = *(CountedColor *)b_pointer;
    UInt32 a = UNTAG_COUNT(tagged_a);
    UInt32 b = UNTAG_COUNT(tagged_b);

    // Inverted comparison, as we want to sort biggest first.
    // Comparing rather than subtracting avoids unsigned wrap-around.
    return (a < b) - (a > b);
}

static void
fillArrayFromMapTable(NSMapTable *mapTable, CountedColor **array)
{
    int count = 0;
    void *key, *value;
    NSMapEnumerator mapEnumerator = NSEnumerateMapTable(mapTable);

    while (NSNextMapEnumeratorPair(&mapEnumerator, &key, &value)) {
        (*array)[count++] = (CountedColor)value;
    }

    qsort(*array, count, sizeof(CountedColor), compare_tagged_ints_reversed);
}

With that, all the NSColor and PCCountedColor objects were gone, except in one place: the colour was initially obtained from the image using the -colorAtX:y: method on NSBitmapImageRep. But there is a similar method, -getPixel:atX:y:, which returns the RGBA values as a 4-member array.

The rest of the SLColorArt code was changed to use the various macros to access the RGBA data to feed it into the colour analysis functions to select the best colours.

Looking at Instruments, though, it seemed that -getPixel:atX:y: did a lot more work than I was expecting: it was now the highest entry in the profile, and unfolding the stack trace from the brilliantly obscurely named _platform_bzero$VARIANT$Ivybridge showed that the method wasn't as simple internally as I'd thought it would be.

It looks like NSBitmapImageRep's -colorAtX:y: and -getPixel:atX:y: methods are good if you need to get a single colour out occasionally, but if you want to get all the colours out then getting your hands dirty with the raw bitmap data is necessary.

Bitmap data is stored as a large block of unsigned chars, with each group of 4 bytes representing the R, G, B and A components. NSBitmapImageRep's -bitmapData method returns it, and to actually work with it we need to know the width, the height and the number of bytes per row (sometimes called the row span). The number of bytes per row might be different from the width of the image * 4, because sometimes padding is added to the end of each row for memory alignment. Again, a macro can get the memory location for each pixel:

#define GET_PIXEL_AT_XY(data, x, y, rowSpan) ((data) + ((x) * 4) + ((y) * (rowSpan)))

And the value returned from this macro can be converted to component values using pointer arithmetic:

unsigned char *pixel = GET_PIXEL_AT_XY(bitmapData, 0, y, rowSpan);
red = *(pixel);
green = *(pixel + 1);
blue = *(pixel + 2);
alpha = *(pixel + 3);
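Putting the macro and the row span together, a full traversal of a raw RGBA buffer might look like the following plain-C sketch. count_opaque_pixels is a hypothetical stand-in for the real per-pixel counting work; the point is that stepping by rowSpan, not width * 4, skips any padding bytes correctly:

```c
#include <assert.h>
#include <stddef.h>

#define GET_PIXEL_AT_XY(data, x, y, rowSpan) ((data) + ((x) * 4) + ((y) * (rowSpan)))

/* Walk every pixel of a raw 4-bytes-per-pixel RGBA buffer, honouring
 * the row span so padding at the end of each row is skipped.
 * Counting opaque pixels stands in for the real per-pixel analysis. */
static size_t count_opaque_pixels(const unsigned char *data,
                                  size_t width, size_t height, size_t rowSpan)
{
    size_t opaque = 0;
    for (size_t y = 0; y < height; y++) {
        for (size_t x = 0; x < width; x++) {
            const unsigned char *pixel = GET_PIXEL_AT_XY(data, x, y, rowSpan);
            if (pixel[3] == 0xFF) {  /* alpha is the fourth byte */
                opaque++;
            }
        }
    }
    return opaque;
}
```

A 2x2 image stored with a 12-byte row span (4 padding bytes per row) traverses correctly with this loop, which is exactly the case the -getPixel:atX:y: replacement has to get right.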

I made a few other tweaks to the algorithm, for example, choosing the background colour first before even making the array for the text colours, so that I could exclude any colours that I knew wouldn’t be chosen based on the background colour. This means smaller arrays, faster sorting times and less time spent choosing the colours. Minor differences maybe, but every little helps.

I had thought about splitting the workload over multiple threads, with each thread counting the pixels in a smaller portion of the image, but it seemed that the length of time required to set up and then perform a much more complicated merge and sort of the result sets would be greater than the time actually saved.

The difference in stress test results after all these changes is huge: the time to analyse 100 images is around 3 seconds, down from the initial 64 seconds. If I turn on the code that uses Core Image to pixellate the image before analysing it, the time drops even further, to about 0.8 seconds, with no significant difference in the colours chosen. At an average of 0.008s per image, I think I can count that as realtime. Peak memory usage is down as well, because fewer objects and allocations lead to less memory fragmentation.

In summary:

  • Try not to use objects inside a tight loop if you don’t need to. Look for other types that can be used to represent the data.
  • C functions are more lightweight than ObjC messages.
  • NSBitmapImageRep’s accessor methods are fine for occasional access to the data, but if you need to get every pixel then you’re better off stepping through the raw bitmap data.

The FVColorArt code can be found in my GitHub repo:
