On the Implementation of Portable Convolution Layers

Ash Hashemi
Published in The Startup
Sep 20, 2019 · 25 min read

In this post I’ll describe an implementation of neural network convolution layers that we found to give the best average performance across a wide variety of GPU architectures from different vendors (including Intel, Nvidia, Qualcomm, ARM, and Apple). Such best-average-performance implementations are by definition not the absolute best for any individual device, with most of the performance loss coming from not using highly proprietary systolic/neural engines. However, when the attained performance is sufficient for the target application, such generic implementations can enable more rapid deployment of novel neural networks in the highly fragmented mobile/edge offline application market.

[If such generic implementations for neural networks were to proliferate in common applications and code bases, it could steer the industry away from the current state of extreme fragmentation and towards optimizing hardware and software to better accommodate generic solutions. That would ultimately streamline the deployment of novel neural network architectures and techniques, leading to more innovation.]

The specific application of interest in this effort was mobile fine-grain classification, where accuracy, adaptability and portability are more valuable than raw inferences-per-second (as long as a minimum performance is met). Hence our pursuit of a single uniform convolution implementation that performs sufficiently well across a variety of graphics platforms. Among the platforms used for evaluating inference were ARM (Mali), Qualcomm (Adreno), and Apple iOS devices. For training, the focus was on Intel (Gen), AMD and Nvidia graphics.

I will next walk through the kernel optimizations step by step, from the concise baseline to the more advanced versions. The code presented here is in OpenCL, but it was easily ported line-by-line to iOS Metal (for app deployment). A later post may describe the exact OpenCL-to-Metal conversion process used. In what follows it is assumed that the reader is familiar with the basic terminology of graphics compute, OpenCL and convolutional neural networks.

The Concise Implementation

This version is often referred to in the literature as the “naive” solution. I’ll avoid using that term as there’s no naivete in it. It’s concise and readable, which is how API-level code should be. A good compiler should be able to understand what it is intended to do and generate a machine binary that maps well to the architecture at hand. But alas we live in a time where that is not really possible, and so someone needs to hand-modify the kernel code at the API level to get better performance.

The first figure below illustrates the ranges of input and weight values needed to produce outputs in such a 3D convolution. The second figure flattens the output range to illustrate both the size of the input and weight tensors needed for each output value, and how their ranges are dependent on the output position. It shows for instance that the input tensor is independent of output depth index. That means that at a given output width and height position, all outputs along depth take the same exact input tensor.

visualization of data accesses for full-depth convolution
(different output depths of same spatial position)
Illustration of the dimensions of input and weights for each output, and how the ranges are dependent on output index (with a flattened output range)
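
Before getting into the kernels, the computation the figures describe can be written as a plain scalar loop nest. This is a reference sketch of my own (assuming stride 1, no padding, and with bias and activation omitted), using the same NCHW input/output layout and the IOHW weight layout used throughout the rest of this post:

// Reference-only scalar full-depth convolution (stride 1, no padding; bias/activation omitted).
// input/output are NCHW, weights are IOHW, matching the kernels that follow.
void conv_reference(const float* input, const float* weight, float* output,
                    int batch_size, int in_depth, int in_h, int in_w,
                    int out_depth, int out_h, int out_w, int k)
{
    for (int n = 0; n < batch_size; n++)
    for (int o = 0; o < out_depth; o++)
    for (int y = 0; y < out_h; y++)
    for (int x = 0; x < out_w; x++)
    {
        float sum = 0.0f;
        // every output depth o at spatial position (y, x) consumes the same input patch
        for (int i = 0; i < in_depth; i++)
        for (int ky = 0; ky < k; ky++)
        for (int kx = 0; kx < k; kx++)
            sum += input[((n * in_depth + i) * in_h + (y + ky)) * in_w + (x + kx)]
                 * weight[((i * out_depth + o) * k + ky) * k + kx];
        output[((n * out_depth + o) * out_h + y) * out_w + x] = sum;
    }
}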

The concise code for full-depth convolution is shown in the code snippet below, and the figure above illustrates what it is doing. To reduce scope we’ll focus on full-depth convolutions, which are used in VGG/Resnet types of networks — leaving out grouped and depth-wise convolutions for another time. Also while the examples below are for full float tensor values, the general concepts apply to other data formats.

In the kernel header, input_buf, weight_buf and output_buf are, as their names suggest, the input, weight and output buffers respectively. The b_buf holds the bias values (input-independent learned values added to the output). We consider three bias modes determined by the value of the _bias_mode macro: 0) no bias, 1) per-output-depth bias, and 2) per-output-position bias. Using the per-output-position bias is more costly, in that it adds to the size of the saved network parameters (and the amount of data needed to perform a convolution). The per-output-depth bias notably reduces the size of the parameters. However, implementing its adjustments during backprop is more difficult without sacrificing parallelism (this will be addressed in a future convolution backprop walk-through).

__kernel void Conv_Forward(
__global float* input_buf,
__global float* weight_buf,
__global float* b_buf,
__global float* output_buf) {
uint out = 0; uint batch = 0;
uint w_index = get_global_id(0)*out_stride;
uint h_index = get_group_id(1)*out_stride;
if(-)
… // other conditions here
else
{
if (get_group_id(2) >= out_depth * batch_size) return;
out = get_group_id(2) % out_depth;
batch = get_group_id(2) / out_depth;
}
float sum = 0;
uint pad = out_pad/2;
const uint size = $KERN_SIZE*$KERN_SIZE; // number of weights per (input depth, output depth) pair
if(-)
… // other conditions here
else
{
for (uint in = 0; in < in_depth; in++) {
float input_buf_sub[$KERN_SIZE*$KERN_SIZE];
float weight_buf_sub[$KERN_SIZE*$KERN_SIZE];

// load input and weight arrays
for (uint y = 0; y < $KERN_SIZE; y++)
for (uint x = 0; x < $KERN_SIZE; x++)
{
input_buf_sub[y*$KERN_SIZE+x] = input_buf[(batch*in_depth+in)*(in_width*in_height)+(h_index+y)*in_width+x+w_index];
weight_buf_sub[y*$KERN_SIZE+x] = weight_buf[in*out_depth*size+out*size+y*$KERN_SIZE+x]; }
for (uint i = 0; i<$KERN_SIZE*$KERN_SIZE; i++) {
sum = mad(input_buf_sub[i], weight_buf_sub[i], sum);
}
}
}
uint o_index = out * out_width*out_height + (h_index/out_stride + pad) * out_width + (w_index/out_stride) + pad;
uint p_index = out;
uint out_index = batch * out_depth *out_width*out_height + o_index;
const float bias = (_bias_mode == 2) ? b_buf[o_index] : (_bias_mode == 1) ? b_buf[p_index] : 0;
output_buf[out_index] = ACTVTN(sum + bias);
}

Here each work item calculates one output neuron: global id 0 corresponds to the width position, id 1 to the height, and id 2 to output depth and batch combined. A dropout condition on width is added at the top of the kernel, in case the global width is bumped up to a multiple of the architecture’s SIMD width. The input and output buffers are assumed to be in NCHW memory format (Width then Height forming the lowest dimensions, followed by Channel depth and batch Number). While there may be alternative data layouts that can attain better performance in some cases (e.g. NHWC), they tend to be more sensitive to the cache architecture, which we consider to be an unknown here. Therefore, we will stick with NCHW for now.
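
For concreteness, the dispatch geometry implied by this mapping might look roughly like the following on the host side. This is only a sketch with names of my own choosing (round_up, simd_width, enqueue_conv), assuming stride 1 and a local size of one along height and depth so that the group ids read by the kernel line up with the description above:

#include <CL/cl.h>

// Host-side dispatch sketch. Dimension 0 covers output width (rounded up to a SIMD
// multiple, which is why the kernel needs the dropout condition on width), dimension 1
// covers output height, and dimension 2 covers output depth and batch together.
static size_t round_up(size_t v, size_t m) { return ((v + m - 1) / m) * m; }

cl_int enqueue_conv(cl_command_queue queue, cl_kernel conv_kernel,
                    size_t out_width, size_t out_height,
                    size_t out_depth, size_t batch_size, size_t simd_width)
{
    size_t global[3] = { round_up(out_width, simd_width), out_height, out_depth * batch_size };
    size_t local[3]  = { simd_width, 1, 1 };  // lanes across width; local size of 1 in height/depth
    return clEnqueueNDRangeKernel(queue, conv_kernel, 3, NULL, global, local, 0, NULL, NULL);
}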

The meat of the code is the loop over input depth. For each input depth position it loads two private arrays: one holding the portion of the input at that depth corresponding to the output spatial position (i.e. along height and width), and another holding the corresponding weights at that input depth. Both arrays are equal in size to the square of the convolution kernel width. It then calculates the dot product of the two private arrays. The results of these dot products are accumulated across input depth positions (the outermost loop) to arrive at the final sum. Before writing this sum to the output buffer, it optionally adds bias and applies an activation function, depending on the JITed-in layer parameters.
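
As an aside on what “JITed-in layer parameters” means in practice: each layer’s dimensions and modes are baked into the kernel at build time, so the if/else variations shown throughout collapse to a single code path per layer. A rough host-side sketch follows; the exact mechanism here is my own illustration, the $KERN_SIZE token from the listings is assumed to be substituted directly into the source text before compilation (since $ is not a legal macro name), and the ReLU body given for ACTVTN is just an example:

#include <stdio.h>
#include <CL/cl.h>

// Build one specialized program per layer configuration; the kernel's if/else blocks
// on these macros are then resolved at compile time with no runtime cost.
cl_int build_conv_program(cl_program program, cl_device_id device,
                          unsigned in_depth, unsigned in_width, unsigned in_height,
                          unsigned out_depth, unsigned out_width, unsigned out_height,
                          unsigned out_stride, unsigned out_pad, unsigned batch_size,
                          unsigned kernel_size, unsigned bias_mode)
{
    char options[512];
    snprintf(options, sizeof(options),
             "-D in_depth=%u -D in_width=%u -D in_height=%u "
             "-D out_depth=%u -D out_width=%u -D out_height=%u "
             "-D out_stride=%u -D out_pad=%u -D batch_size=%u "
             "-D kernel_size=%u -D _bias_mode=%u "
             "-D ACTVTN(x)=fmax(x,0.0f)",   /* the ReLU body here is just an example */
             in_depth, in_width, in_height, out_depth, out_width, out_height,
             out_stride, out_pad, batch_size, kernel_size, bias_mode);
    return clBuildProgram(program, 1, &device, options, NULL, NULL);
}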

Weight Vectorization

The first source of inefficiency with this implementation (or I should rather say the first problem that most compilers have with handling this implementation) is that the input and weight values are accessed as individual scalars. This wastes data fetch bandwidth in most graphics architectures, as they tend to be designed with wide data paths to accommodate vector loads (common in graphics rendering applications). Despite the fact that these accesses are in a loop that may appear to clearly be accessing consecutive addresses that could have automatically been combined into vector loads, most compilers are not capable of performing this optimization by themselves.

The suboptimality is particularly bad for the loading of the weight values, as these are uniform across subgroups (we are assuming here that subgroups are laid across the tensor’s spatial width). Thus there is no implicit vectorization across the SIMD lanes, as there is with the input loads, and they end up wasting considerable bandwidth by fetching small amounts of data over much wider data buses. So the first real optimization step is to spoon-feed the compiler the fact that these loads can be performed as vector loads, which fetch chunks of consecutive data in one go and make better use of the wide data buses. To accommodate this we’ll need to break out the specific kernel width being handled. Going with a 3×3 convolution (common in VGG and Resnet), the main loop ends up looking like below (note: leaving in variations of the kernels in the if-then-else block enhances readability and has no runtime overhead when the conditions are based on JITed parameters).

if(-)
… // other conditions here
else if (kernel_size == 3)
{
const int size = 3 * 3;
for (uint in = 0; in < in_depth; in++) {
float input_buf_sub[9];
// load input and weight for this sub area
__global float* nextf = &(weight_buf[in * out_depth * size + out * size]);
union {
const float s[8];
const float8 v;
} uweights0 = { .v = vload8(0, nextf) };
nextf = &(weight_buf[in * out_depth * size + out * size + 8]);
float uweights1 = *nextf;
for (uint y = 0; y < 3; y++)
for (uint x = 0; x < 3; x++)
input_buf_sub[y * 3 + x] = input_buf[(batch * in_depth + in) * (in_width * in_height) + (h_index + y) * in_width + x + w_index];
// compute the convolution
sum = mad(input_buf_sub[8], uweights1, sum);
for (uint i = 0; i<8; i++) {
sum = mad(input_buf_sub[i], uweights0.s[i], sum);
}
}
}

Here we’ve hoisted the loading of the weights out of the loop. Its 3*3=9 scalar loads are now performed via one vload8 and one scalar load. This notably enhances performance with pretty much all common graphics architectures and compilers, compared to the scalar version, as it significantly reduces the amount of data bus wastage. Note that here the vload8 return value is stored in a union to enable later iterator access to its elements when doing the MADs (there may be other ways to achieve this). Also the one remaining scalar access means that we don’t entirely get rid of the bus wastage.
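
As an aside on the “other ways” mentioned above: one alternative to the union is to vstore the loaded vector into a private array and index that array in the MAD loop. Whether this generates better code than the union is compiler dependent; a sketch, reusing the first nextf assignment from the listing above:

// Alternative to the union trick: spill the vload8 result into a private array that the
// MAD loop can index. Same memory traffic as the union version; codegen quality varies.
float weights_sub[9];
vstore8(vload8(0, nextf), 0, weights_sub);  // weights 0..7
weights_sub[8] = nextf[8];                  // the remaining ninth weight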

The image below shows how the weight tensor is mapped into memory in IOHW format (filter Width then Height forming the lowest dimensions, followed by Output then Input depth). The dimension enumerations represent their order in each, with the length of dimension 1 of the memory layout representing the data bus width (or cache line size). The point is that when the weight values are fetched individually as scalars (as the assembly generated by most compilers also tends to do), only 1/8th of that data bus width is effectively used in each access, wasting the rest of the bus.

Mapping weight tensor to memory space in IOHW format (consecutive dots represent vload ranges)

One thing to pay extra attention to here is that the input and output depths have been flattened into dimension 3 of the weight tensor, and we had a choice for their order. By choosing IOHW format we have gone with the option of Output depth forming the lower of the two dimensions. This means increments along dimension 3 increment the corresponding output depth first, and the input depth next (when crossing multiples of the output depth size). The alternative would have been the reverse of that. There is a somewhat complicated architecture-dependent tradeoff here that we’ll come back to in the future. But for now a simplified explanation for going with IOHW is that (as we’ll see shortly) there is a more inherent impediment to efficient vectorized fetching of input values than there is for weight values (without changing the input layout in ways that are themselves architecture dependent). So it is better to just target reuse of the fetched input values as much as possible, and fetching chunks of weight values that span multiple output depths (but the same input depth) allows us to do exactly that.
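
As a practical aside, trained weights often come out of frameworks in OIHW order (that is an assumption about the source of the weights, not something required by this post), in which case repacking them into the IOHW layout used here is a one-time host-side transform along these lines:

// One-time host-side repack from OIHW (assumed source order) into the IOHW order
// expected by weight_buf: dst index = ((i * out_depth + o) * K + y) * K + x.
void repack_oihw_to_iohw(const float* src, float* dst,
                         int out_depth, int in_depth, int k)
{
    for (int o = 0; o < out_depth; o++)
        for (int i = 0; i < in_depth; i++)
            for (int y = 0; y < k; y++)
                for (int x = 0; x < k; x++)
                    dst[((i * out_depth + o) * k + y) * k + x] =
                        src[((o * in_depth + i) * k + y) * k + x];
}

It also pays to allocate the repacked buffer aligned to the cache line size, which the larger aligned vector loads discussed later rely on.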

Another thing to note in the vectorization above is that the one remaining scalar access shifts the starting address of consecutive vector loads over by an element. And as shown with the dotted locations in the memory layout, that makes the start of later vector loads misaligned across cache lines (for the cache configurations in common architectures). We will shortly see that larger vectorization, spanning multiple output depths in one go, can also alleviate this problem. But first let’s see how the input values can also be fetched more efficiently.

Input Vectorization

One difference with access to the input values, compared to weights, is that fetching beyond the kernel width does not wrap around into the next height position of the input tensor (except in the rare case where the input feature width matches the convolution kernel width). Therefore, there is less benefit in explicit vector loading here. This is illustrated in the figure below, when compared to the weight tensor mapping above.

Mapping input tensor to memory space in NCHW format (consecutive dots represent explicit vload ranges)

Another difference is that the fetches of input values are not SIMD-uniform, and since we placed SIMD lanes across the lowest dimension of the workgroup (which maps to the input width), there is implicit vectorization across SIMD lanes. In other words there is inherently less data bus wastage in input accesses (at least in the sense that what is fetched over the bus gets used). The figure below shows how the two neighboring work items along SIMD width (with the data they access marked with dark-blue and purple diagonal lines) add to the vectorized data fetch width.

How neighboring work items extend vectorization (consecutive dots represent SIMD-implicit vload ranges)

Nevertheless vectorization of input loads across width is more efficient than fetching them as individual scalars — and gives increasing benefit as the kernel width increases. The code snippet below shows how the input loading loop can be modified to benefit from that.

for (unsigned int y = 0; y < 3; y++)
{
__global float* nexti = &(input_buf[(batch * in_depth + in) * (in_width * in_height) + (h_index + y) * in_width + w_index]);
const float3 nextv = vload3(0, nexti);
input_buf_sub[y * 3] = nextv.x;
input_buf_sub[y * 3 + 1] = nextv.y;
input_buf_sub[y * 3 + 2] = nextv.z;
}

So unlike the loading of weights, which we mostly converted to one big vector load across kernel width and height, here we can only do vector loads across width, leaving the fetch fragmented across height. More important, however, is the fact that here the efficiency improvement comes from reusing data across SIMD lanes, rather than from avoiding data bus wastage. We next look at how to take this concept of data reuse further.

To recap, going back to the flattened output tensor range illustration from earlier: so far, through simple vectorization, we’ve made our access to the weight tensor 4.5x more efficient (nine scalar transactions per output replaced by one vload8 plus one scalar) and access to the input tensor 3x more efficient (nine scalars replaced by three vload3s) in 3×3 convolutions.

How vectorization optimizes input and weight tensor access over the concise implementation.

Combining Across Output Depth

Until now each work item has been calculating one output neuron, with work items corresponding to same spatial position fetching the same input values (input locality), and those corresponding to the same output depth fetching the same weight values (weight locality). The orange annotations added to the figures below illustrate what data can be reused, across what dimension, depending on what dimension is fixed. So there is a lot of re-access to data involved, and it is generally desirable to minimize re-access to data, or at least the cost of re-access.

Input locality: for a fixed spatial position the input is reused across output depth
Weight locality: for a fixed output depth the weights are reused across different spatial positions

One approach to minimizing such data re-access is to combine work items. This allows data to be pulled into the register file once, with one thread performing the work that multiple threads originally did. But it does come at the cost of reduced parallelism, so it needs to be employed judiciously. The question becomes which work items to combine and to what degree. The option we primarily go with is to combine across output depth, thus targeting input locality. The reason for this is that it enables us to additionally improve the vectorization of the weight fetches (as explained earlier fetching larger chunks of weights that span multiple output depths is more efficient). The degree to which this combining should be done is slightly architecture dependent, but it is a parameter adjustable with mostly formulaic code changes. And we have found a 16x combining along output depth to work well for VGG/Resnet layers on most architectures. The code below illustrates how it can be implemented.

if(-)
… // other conditions here
else if ((kernel_size == 3) && (out_depth % 16 == 0) && (out_stride==1))
{
out *= 16;
const int size = 3 * 3;
float sum8[16] = { 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 };
for (unsigned int in = 0; in < in_depth; in++) {
float input_buf_sub[9];
// load input and weight for this sub area
union {
float s[8];
float8 v;
} uweights0[16];
float uweights1[16];
for (int j = 0; j<16; j++)
{
__global float* nextf = &(weight_buf[in * out_depth * size + (out + j) * size]);
uweights0[j].v = vload8(0, nextf);
nextf = &(weight_buf[in * out_depth * size + (out + j) * size + 8]);
uweights1[j] = *nextf;
}
for (unsigned int y = 0; y < 3; y++)
{
__global float* nexti = &(input_buf[(batch * in_depth + in) * (in_width * in_height) + (h_index + y) * in_width + w_index]);
const float3 nextv = vload3(0, nexti);
input_buf_sub[y * 3] = nextv.x;
input_buf_sub[y * 3 + 1] = nextv.y;
input_buf_sub[y * 3 + 2] = nextv.z;
}
// compute the convolution
for (int j = 0; j<16; j++)
{
sum8[j] = mad(input_buf_sub[8], uweights1[j], sum8[j]);
for (unsigned int i = 0; i<8; i++) {
sum8[j] = mad(input_buf_sub[i], (uweights0[j]).s[i], sum8[j]);
}
}
}
for (int j = 0; j<16; j++)
{
unsigned int o_index = (out + j)*out_width*out_height + (h_index+pad) * out_width + w_index + pad;
unsigned int p_index = (out + j);
const float bias = (_bias_mode == 2) ? b_buf[o_index] : (_bias_mode == 1) ? b_buf[p_index] : 0;
unsigned int out_index = batch * out_depth *out_width*out_height + o_index;
float actatn = sum8[j] + bias;
if (storePreAct) pre_act_output_buf[out_index] = actatn; // presumably a pre-activation store for training; the flag and buffer are not part of the simplified kernel header shown earlier
actatn = ACTVTN(actatn);
output_buf[out_index] = actatn;
}
return;
}

Note that the condition on the code block now limits it to cases where the output depth is a multiple of 16. We consider output depth to be a JITed parameter, making it a build-time condition with no run-time overhead. The work looks very similar to how it was before, except now each work item calculates 16 sums for 16 different output depths. Therefore the number of threads can be reduced by a factor of 16. Note that even though now 16 times more weight values are fetched per work item, the input fetching has remained unchanged. This is because we are reusing the input values, and it means the input is in total fetched into register files 1/16th as many times as before. That in itself can give a significant performance improvement on most architectures (depending on how much of the locality was being captured by low level caches).
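
On the host side, the only change this variant calls for is shrinking dimension 2 of the dispatch by the combining factor. A sketch, continuing the illustrative dispatch code from earlier; how out and batch get decoded from the smaller dimension 2 is handled in the elided conditions at the top of the kernel:

// Dispatch sketch for the 16x depth-combined path: 1/16th the work items along dimension 2,
// everything else unchanged (out_depth % 16 == 0 is guaranteed by the build-time condition).
size_t global[3] = { round_up(out_width, simd_width), out_height, (out_depth / 16) * batch_size };
size_t local[3]  = { simd_width, 1, 1 };
cl_int err = clEnqueueNDRangeKernel(queue, conv_kernel, 3, NULL, global, local, 0, NULL, NULL);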

Additionally, since the 16x weight fetches are from consecutive addresses, they can now be combined into large vector loads, making it possible to avoid scalar accesses that waste data bus bandwidth, further improving performance. However, most compilers are again not able to identify this as a possibility, and the programmer needs to spoon-feed it at the API level. The snippet below shows how the weight loading loop above (the loop over j filling uweights0 and uweights1) can be replaced to achieve this.

union {
float s[16];
float16 v;
} tempw0, tempw1;
__global float* nextf0 = &(weight_buf[in * out_depth * size + (out * size) + (0 * 16)]);
tempw0.v = vload16(0, nextf0);
uweights0[0].v = (float8)(tempw0.s[0], tempw0.s[1], tempw0.s[2], tempw0.s[3], tempw0.s[4], tempw0.s[5], tempw0.s[6], tempw0.s[7]);
uweights1[0] = tempw0.s[8];
tempw1.v = vload16(1, nextf0);
uweights0[1].v = (float8)(tempw0.s[9], tempw0.s[10], tempw0.s[11], tempw0.s[12], tempw0.s[13], tempw0.s[14], tempw0.s[15], tempw1.s[0]);
uweights1[1] = tempw1.s[1];
uweights0[2].v = (float8)(tempw1.s[2], tempw1.s[3], tempw1.s[4], tempw1.s[5], tempw1.s[6], tempw1.s[7], tempw1.s[8], tempw1.s[9]);
uweights1[2] = tempw1.s[10];
tempw0.v = vload16(2, nextf0);
uweights0[3].v = (float8)(tempw1.s[11], tempw1.s[12], tempw1.s[13], tempw1.s[14], tempw1.s[15], tempw0.s[0], tempw0.s[1], tempw0.s[2]);
uweights1[3] = tempw0.s[3];
uweights0[4].v = (float8)(tempw0.s[4], tempw0.s[5], tempw0.s[6], tempw0.s[7], tempw0.s[8], tempw0.s[9], tempw0.s[10], tempw0.s[11]);
uweights1[4] = tempw0.s[12];
tempw1.v = vload16(3, nextf0);
uweights0[5].v = (float8)(tempw0.s[13], tempw0.s[14], tempw0.s[15], tempw1.s[0], tempw1.s[1], tempw1.s[2], tempw1.s[3], tempw1.s[4]);
uweights1[5] = tempw1.s[5];
uweights0[6].v = (float8)(tempw1.s[6], tempw1.s[7], tempw1.s[8], tempw1.s[9], tempw1.s[10], tempw1.s[11], tempw1.s[12], tempw1.s[13]);
uweights1[6] = tempw1.s[14];
tempw0.v = vload16(4, nextf0);
uweights0[7].v = (float8)(tempw1.s[15], tempw0.s[0], tempw0.s[1], tempw0.s[2], tempw0.s[3], tempw0.s[4], tempw0.s[5], tempw0.s[6]);
uweights1[7] = tempw0.s[7];
uweights0[8].v = (float8)(tempw0.s[8], tempw0.s[9], tempw0.s[10], tempw0.s[11], tempw0.s[12], tempw0.s[13], tempw0.s[14], tempw0.s[15]);
tempw1.v = vload16(5, nextf0);
uweights1[8] = tempw1.s[0];
uweights0[9].v = (float8)(tempw1.s[1], tempw1.s[2], tempw1.s[3], tempw1.s[4], tempw1.s[5], tempw1.s[6], tempw1.s[7], tempw1.s[8]);
uweights1[9] = tempw1.s[9];
tempw0.v = vload16(6, nextf0);
uweights0[10].v = (float8)(tempw1.s[10], tempw1.s[11], tempw1.s[12], tempw1.s[13], tempw1.s[14], tempw1.s[15], tempw0.s[0], tempw0.s[1]);
uweights1[10] = tempw0.s[2];
uweights0[11].v = (float8)(tempw0.s[3], tempw0.s[4], tempw0.s[5], tempw0.s[6], tempw0.s[7], tempw0.s[8], tempw0.s[9], tempw0.s[10]);
uweights1[11] = tempw0.s[11];
tempw1.v = vload16(7, nextf0);
uweights0[12].v = (float8)(tempw0.s[12], tempw0.s[13], tempw0.s[14], tempw0.s[15], tempw1.s[0], tempw1.s[1], tempw1.s[2], tempw1.s[3]);
uweights1[12] = tempw1.s[4];
uweights0[13].v = (float8)(tempw1.s[5], tempw1.s[6], tempw1.s[7], tempw1.s[8], tempw1.s[9], tempw1.s[10], tempw1.s[11], tempw1.s[12]);
uweights1[13] = tempw1.s[13];
tempw0.v = vload16(8, nextf0);
uweights0[14].v = (float8)(tempw1.s[14], tempw1.s[15], tempw0.s[0], tempw0.s[1], tempw0.s[2], tempw0.s[3], tempw0.s[4], tempw0.s[5]);
uweights1[14] = tempw0.s[6];
uweights0[15].v = (float8)(tempw0.s[7], tempw0.s[8], tempw0.s[9], tempw0.s[10], tempw0.s[11], tempw0.s[12], tempw0.s[13], tempw0.s[14]);
uweights1[15] = tempw0.s[15];

In this manner, with a total of 9 vload16s (9 × 16 = 144 floats, exactly the 16 × 9 weights needed per input depth), we manage to read in the weight values for 16 output depths without any bus wastage. This removes the need for any scalar loads, by employing vector loads that cross output depths. Moreover, it removes the cache line misalignment of the accesses (assuming the weight tensor is allocated aligned to begin with). The consecutive dots in the illustration below, when compared to the same from earlier, show how this is achieved. This should notably improve performance on most architectures.

Mapping weight tensor to memory space
with vector accesses that cross depth boundaries
(consecutive dots represent vload ranges )

One caveat to mention here is with regards to the interleaving of weight fetching and the MAD operations. First loading the weights for 16 output depths and then spinning over a 16x loop for the accumulations at each input depth results in code that exceeds the code motion optimization limits of most compilers, and can thus result in register spilling (writing out values to memory and later reading them back). It is therefore often necessary to spoon-feed the compiler code where the MAD loop is manually unrolled and intermingled with the fetching of weights. How this intermingling is done can be slightly architecture-dependent, but it is a localized optimization. We will demonstrate such an intermingling below, along with another optimization that targets weight locality.

To recap, going back to the flattened output tensor range illustration: through combining work along output depth we managed to cut re-access to the input tensor to 1/16th (since it is independent of output index). At the same time it further doubles the efficiency of access to the weight tensor (taking it from 4.5x to 9x over the concise implementation) for 3×3 convolutions.

How combining 16x across output depth further optimizes input and weight tensor access over the concise implementation.

Combining Across Width

We’ve seen how simply laying SIMD lanes along feature width (with the same output depth) can help capture some of the weight locality, with weights fetched once into the register file and shared across lanes (and how vectorization helps avoid inefficient bus utilization while at it). Before we look at how this concept can be extended, there is a caveat worth mentioning with regards to register usage. While it may seem trivial, some compilers are not able to identify that SIMD-uniform values can indeed be retained in shared registers. So while the data may be fetched efficiently, a separate per-lane copy of it ends up getting stored in the register file. This increases register pressure and may limit how much further work items can be combined before register spills occur. There may be ways to spoon-feed the compiler how to split up the storage of SIMD-uniform data across lanes (e.g. sub_group_block_read functions) and later explicitly access that data from the lanes (e.g. sub_group_broadcast functions), but these solutions are messy and highly vendor specific. Needless to say we will not approach them here, and will leave it to compilers to improve in the future.

One way to extend the concept of weight reuse is to somehow cache the data close to the execution units. The illustration below shows the different forms of weight sharing between threads corresponding to the same output depth. Local sharing provides faster access to the weight data, to at least minimize the cost of re-access. While such sharing can often be accommodated with software-managed shared local memory (an approach that vendor-specific guides often promote), the exact implementations tend to be messy and highly influenced by architecture details. Therefore, we won’t pursue them here, and instead rely on the more ubiquitous tagged cache hierarchy to capture cross-thread weight locality (even if imperfectly).

There tend to be architecture-specific parameters that can be worth tuning to improve the benefit of local sharing, without requiring messy vendor-specific code changes. For instance, it is a good idea to make sure the weight buffers are marked as constant, because many architectures have faster local constant caches which, when data is promoted into them, can give performance benefits. Also, some devices have setup options for controlling how threads are dispatched, which can affect whether threads corresponding to the same output depth land on execution units with the same local caches or not. This can impact whether weight locality is captured.

Forms of weight sharing achievable across threads corresponding to same output depth.
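
For example, promoting the weight and bias buffers to the constant address space is usually just a signature change, sketched below, subject to the device’s constant buffer size limit (queryable via clGetDeviceInfo with CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE):

// Declaring the SIMD-uniform, read-only parameter buffers as __constant lets devices with a
// dedicated constant cache serve the repeated weight fetches from it.
__kernel void Conv_Forward(
__global float* input_buf,
__constant float* weight_buf,
__constant float* b_buf,
__global float* output_buf)
{
// body as before, with the pointer temporaries into weight_buf (e.g. nextf, nextf0)
// declared as __constant float* instead of __global float*
}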

There is another way to extend weight reuse, and that is to combine threads corresponding to the same output depth along the spatial domain. This approach reduces the total amount of weight traffic pulled into the execution units, by reducing parallelism and doing more work per thread. It also increases register pressure, as it increases the number of accumulations each work item needs to maintain. Therefore, its effectiveness depends on feature size (i.e. how much same-depth parallelism there is), and on the thread count and register file size of the device (i.e. how much concurrent parallelism it can support and at what point register spilling occurs). But since it is an easy parameter to change without invasive code changes, it can be viewed as an effective fine-tuning parameter. For instance, we have found the 3×3 convolutions in Resnet/VGG to benefit from a 2x combining of work items across width on most architectures (while extending the depth combining by another 2x does not).

The primary difficulty introduced with width combining is that feature widths tend to vary considerably in size, and assuming a width to be a multiple of any number can restrict usage. The code below shows the final optimized code for 3×3 convolution with 16x combining in depth, 2x combining in width, and intermingling of weight fetching with MAD operations. For generality here we add bound checks to make sure we don’t exceed the feature width when it is odd. But an alternative solution is to have separate code paths for even and odd feature widths (as it is a JITed parameter). This illustrates a difficulty with extending this method to combine more work items in width.

else if ((kernel_size == 3)&&(out_depth % 16 == 0)&&(out_stride == 1))
{
out *= 16;
w_index *= 2;
if (w_index >= out_width-out_pad) return;
const uint size = 3 * 3;
float sum8[16] = { 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 };
float sum8_[16] = { 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 };
for (unsigned int in = 0; in < in_depth; in++) {
float input_buf_sub[9];
float input_buf_sub2[9];
for (unsigned int y = 0; y < 3; y++)
{
__global float* nexti = &(input_buf[(batch * in_depth + in) * (in_width * in_height) + (h_index + y) * in_width + w_index]);
const float4 nextv = vload4(0, nexti);
input_buf_sub[y * 3] = nextv.x;
input_buf_sub[y * 3 + 1] = nextv.y;
input_buf_sub[y * 3 + 2] = nextv.z;
input_buf_sub2[y * 3] = nextv.y;
input_buf_sub2[y * 3 + 1] = nextv.z;
input_buf_sub2[y * 3 + 2] = nextv.w;
}

float8 uweights0[16];
union {
float s[16];
float16 v;
} tempw0, tempw1;
float uweights1[16];
__global float* nextf0 = &(weight_buf[in * out_depth * size + (out * size) + (0 * 16)]);
tempw0.v = vload16(0, nextf0);
uweights0[0] = (float8)(tempw0.s[0], tempw0.s[1], tempw0.s[2], tempw0.s[3], tempw0.s[4], tempw0.s[5], tempw0.s[6], tempw0.s[7]);
uweights1[0] = tempw0.s[8];
tempw1.v = vload16(1, nextf0);
uweights0[1] = (float8)(tempw0.s[9], tempw0.s[10], tempw0.s[11], tempw0.s[12], tempw0.s[13], tempw0.s[14], tempw0.s[15], tempw1.s[0]);
uweights1[1] = tempw1.s[1];
uweights0[2] = (float8)(tempw1.s[2], tempw1.s[3], tempw1.s[4], tempw1.s[5], tempw1.s[6], tempw1.s[7], tempw1.s[8], tempw1.s[9]);
uweights1[2] = tempw1.s[10];
tempw0.v = vload16(2, nextf0);
uweights0[3] = (float8)(tempw1.s[11], tempw1.s[12], tempw1.s[13], tempw1.s[14], tempw1.s[15], tempw0.s[0], tempw0.s[1], tempw0.s[2]);
uweights1[3] = tempw0.s[3];
uweights0[4] = (float8)(tempw0.s[4], tempw0.s[5], tempw0.s[6], tempw0.s[7], tempw0.s[8], tempw0.s[9], tempw0.s[10], tempw0.s[11]);
uweights1[4] = tempw0.s[12];
tempw1.v = vload16(3, nextf0);
uweights0[5] = (float8)(tempw0.s[13], tempw0.s[14], tempw0.s[15], tempw1.s[0], tempw1.s[1], tempw1.s[2], tempw1.s[3], tempw1.s[4]);
uweights1[5] = tempw1.s[5];
uweights0[6] = (float8)(tempw1.s[6], tempw1.s[7], tempw1.s[8], tempw1.s[9], tempw1.s[10], tempw1.s[11], tempw1.s[12], tempw1.s[13]);
uweights1[6] = tempw1.s[14];
tempw0.v = vload16(4, nextf0);
uweights0[7] = (float8)(tempw1.s[15], tempw0.s[0], tempw0.s[1], tempw0.s[2], tempw0.s[3], tempw0.s[4], tempw0.s[5], tempw0.s[6]);
uweights1[7] = tempw0.s[7];
// compute the convolution
for (uint j = 0; j<8; j++) {
sum8[j] = mad(input_buf_sub[8], uweights1[j], sum8[j]);
}
for (uint j = 0; j<8; j++) {
sum8[j] = mad(input_buf_sub[0], (uweights0[j]).s0, sum8[j]);
}
for (uint j = 0; j<8; j++) {
sum8[j] = mad(input_buf_sub[1], (uweights0[j]).s1, sum8[j]);
}
for (uint j = 0; j<8; j++) {
sum8[j] = mad(input_buf_sub[2], (uweights0[j]).s2, sum8[j]);
}
for (uint j = 0; j<8; j++) {
sum8[j] = mad(input_buf_sub[3], (uweights0[j]).s3, sum8[j]);
}
for (uint j = 0; j<8; j++) {
sum8[j] = mad(input_buf_sub[4], (uweights0[j]).s4, sum8[j]);
}
for (uint j = 0; j<8; j++) {
sum8[j] = mad(input_buf_sub[5], (uweights0[j]).s5, sum8[j]);
}
for (uint j = 0; j<8; j++) {
sum8[j] = mad(input_buf_sub[6], (uweights0[j]).s6, sum8[j]);
}
for (uint j = 0; j<8; j++) {
sum8[j] = mad(input_buf_sub[7], (uweights0[j]).s7, sum8[j]);
}
if (!(w_index + 1 >= out_width-out_pad)) {
for (uint j = 0; j < 8; j++) {
sum8_[j] = mad(input_buf_sub2[8], uweights1[j], sum8_[j]);
}
for (uint j = 0; j < 8; j++) {
sum8_[j] = mad(input_buf_sub2[0], (uweights0[j]).s0, sum8_[j]);
}
for (uint j = 0; j < 8; j++) {
sum8_[j] = mad(input_buf_sub2[1], (uweights0[j]).s1, sum8_[j]);
}
for (uint j = 0; j < 8; j++) {
sum8_[j] = mad(input_buf_sub2[2], (uweights0[j]).s2, sum8_[j]);
}
for (uint j = 0; j < 8; j++) {
sum8_[j] = mad(input_buf_sub2[3], (uweights0[j]).s3, sum8_[j]);
}
for (uint j = 0; j < 8; j++) {
sum8_[j] = mad(input_buf_sub2[4], (uweights0[j]).s4, sum8_[j]);
}
for (uint j = 0; j < 8; j++) {
sum8_[j] = mad(input_buf_sub2[5], (uweights0[j]).s5, sum8_[j]);
}
for (uint j = 0; j < 8; j++) {
sum8_[j] = mad(input_buf_sub2[6], (uweights0[j]).s6, sum8_[j]);
}
for (uint j = 0; j < 8; j++) {
sum8_[j] = mad(input_buf_sub2[7], (uweights0[j]).s7, sum8_[j]);
}
}
uweights0[8] = (float8)(tempw0.s[8], tempw0.s[9], tempw0.s[10], tempw0.s[11], tempw0.s[12], tempw0.s[13], tempw0.s[14], tempw0.s[15]);
tempw1.v = vload16(5, nextf0);
uweights1[8] = tempw1.s[0];
uweights0[9] = (float8)(tempw1.s[1], tempw1.s[2], tempw1.s[3], tempw1.s[4], tempw1.s[5], tempw1.s[6], tempw1.s[7], tempw1.s[8]);
uweights1[9] = tempw1.s[9];
tempw0.v = vload16(6, nextf0);
uweights0[10] = (float8)(tempw1.s[10], tempw1.s[11], tempw1.s[12], tempw1.s[13], tempw1.s[14], tempw1.s[15], tempw0.s[0], tempw0.s[1]);
uweights1[10] = tempw0.s[2];
uweights0[11] = (float8)(tempw0.s[3], tempw0.s[4], tempw0.s[5], tempw0.s[6], tempw0.s[7], tempw0.s[8], tempw0.s[9], tempw0.s[10]);
uweights1[11] = tempw0.s[11];
tempw1.v = vload16(7, nextf0);
uweights0[12] = (float8)(tempw0.s[12], tempw0.s[13], tempw0.s[14], tempw0.s[15], tempw1.s[0], tempw1.s[1], tempw1.s[2], tempw1.s[3]);
uweights1[12] = tempw1.s[4];
uweights0[13] = (float8)(tempw1.s[5], tempw1.s[6], tempw1.s[7], tempw1.s[8], tempw1.s[9], tempw1.s[10], tempw1.s[11], tempw1.s[12]);
uweights1[13] = tempw1.s[13];
tempw0.v = vload16(8, nextf0);
uweights0[14] = (float8)(tempw1.s[14], tempw1.s[15], tempw0.s[0], tempw0.s[1], tempw0.s[2], tempw0.s[3], tempw0.s[4], tempw0.s[5]);
uweights1[14] = tempw0.s[6];
uweights0[15] = (float8)(tempw0.s[7], tempw0.s[8], tempw0.s[9], tempw0.s[10], tempw0.s[11], tempw0.s[12], tempw0.s[13], tempw0.s[14]);
uweights1[15] = tempw0.s[15];
// compute the convolution
for (uint j = 8; j<16; j++) {
sum8[j] = mad(input_buf_sub[8], uweights1[j], sum8[j]);
}
for (uint j = 8; j<16; j++) {
sum8[j] = mad(input_buf_sub[0], (uweights0[j]).s0, sum8[j]);
}
for (uint j = 8; j<16; j++) {
sum8[j] = mad(input_buf_sub[1], (uweights0[j]).s1, sum8[j]);
}
for (uint j = 8; j<16; j++) {
sum8[j] = mad(input_buf_sub[2], (uweights0[j]).s2, sum8[j]);
}
for (uint j = 8; j<16; j++) {
sum8[j] = mad(input_buf_sub[3], (uweights0[j]).s3, sum8[j]);
}
for (uint j = 8; j<16; j++) {
sum8[j] = mad(input_buf_sub[4], (uweights0[j]).s4, sum8[j]);
}
for (uint j = 8; j<16; j++) {
sum8[j] = mad(input_buf_sub[5], (uweights0[j]).s5, sum8[j]);
}
for (uint j = 8; j<16; j++) {
sum8[j] = mad(input_buf_sub[6], (uweights0[j]).s6, sum8[j]);
}
for (uint j = 8; j<16; j++) {
sum8[j] = mad(input_buf_sub[7], (uweights0[j]).s7, sum8[j]);
}
if (!(w_index + 1 >= out_width-out_pad))
{
for (uint j = 8; j < 16; j++) {
sum8_[j] = mad(input_buf_sub2[8], uweights1[j], sum8_[j]);
}
for (uint j = 8; j < 16; j++) {
sum8_[j] = mad(input_buf_sub2[0], (uweights0[j]).s0, sum8_[j]);
}
for (uint j = 8; j < 16; j++) {
sum8_[j] = mad(input_buf_sub2[1], (uweights0[j]).s1, sum8_[j]);
}
for (uint j = 8; j < 16; j++) {
sum8_[j] = mad(input_buf_sub2[2], (uweights0[j]).s2, sum8_[j]);
}
for (uint j = 8; j < 16; j++) {
sum8_[j] = mad(input_buf_sub2[3], (uweights0[j]).s3, sum8_[j]);
}
for (uint j = 8; j < 16; j++) {
sum8_[j] = mad(input_buf_sub2[4], (uweights0[j]).s4, sum8_[j]);
}
for (uint j = 8; j < 16; j++) {
sum8_[j] = mad(input_buf_sub2[5], (uweights0[j]).s5, sum8_[j]);
}
for (uint j = 8; j < 16; j++) {
sum8_[j] = mad(input_buf_sub2[6], (uweights0[j]).s6, sum8_[j]);
}
for (uint j = 8; j < 16; j++) {
sum8_[j] = mad(input_buf_sub2[7], (uweights0[j]).s7, sum8_[j]);
}
}
}
for (uint j = 0; j<16; j++)
{
if (w_index + 1 >= out_width-out_pad)
{
unsigned int o_index = (out + j)*out_width*out_height + (h_index+pad) * out_width + w_index + pad;
unsigned int p_index = (out + j);
float bias = (_bias_mode == 2) ? b_buf[o_index] : (_bias_mode == 1) ? b_buf[p_index] : 0;
unsigned int out_index = batch * out_depth * out_width * out_height + o_index;


float actatn = (sum8[j] + bias);
if (storePreAct) pre_act_output_buf[out_index] = actatn;
actatn = ACTVTN(actatn);// (actatn > 0) ? actatn : 0;
output_buf[out_index] = actatn;
}
else
{
unsigned int o_index = (out + j)*out_width*out_height + (h_index+pad) * out_width + w_index + pad;
unsigned int p_index = (out + j);
float2 biases = (_bias_mode == 2)?vload2(0, &b_buf[o_index]):(_bias_mode == 1)?(float2)(b_buf[p_index], b_buf[p_index]):0;
unsigned int out_index = batch * out_depth * out_width * out_height + o_index;

const float2 pre_actatn = (float2)(sum8[j] + biases.x, sum8_[j] + biases.y);
float2 actatn = (float2)(ACTVTN(pre_actatn.x), ACTVTN(pre_actatn.y));
vstore2(actatn, 0, &output_buf[out_index]);
}
}
return;
}

The code looks very similar to how it was before, only now we have an additional 16-entry sum8_ array, which keeps track of the odd width positions. There is also a new input_buf_sub2 array added to store the input values corresponding to the odd width positions. The main benefit of this version is that it shares the same fetched weight values for both width positions. But there is another benefit as well.

If you compare the input loading loop (where the input values are vector loaded across width) to the equivalent code from the prior version, the fetched input data has not really changed. This is because of the input overlap between neighboring output width positions, and an over-fetching of data in the prior version. Although the over-fetch was due to an OpenCL limitation in supporting a 3-wide vload, even without that limitation (as is the case with iOS Metal), the fetched data per work item would only have increased by 1/3rd compared to before. Therefore, we are now performing 2x the MAD operations for a very small increase in the aggregate data pulled into the execution units.

It’s important to note that, while the input data pulled in per work item may not have increased (the same amount of weight and input data is fetched per work item), the input width distance between work items in the same SIMD has increased. Therefore, as illustrated below, the width of input data loaded at the SIMD level has increased (assuming the SIMD width remains unchanged).

How increasing width extends implicit vectorization
(consecutive dots represent SIMD-implicit vload ranges)

Doubling the input width covered by each work item means that for the same SIMD width double the data is now fetched. This makes sense, since the total number of work items has halved, and the total amount of input fetched shouldn’t change. This increase in vectorized fetch can help reduce any remaining data bus wastage, depending on the SIMD and data bus width of the architecture (and the data size), while having minimal impact in cases where there was no wastage.

Therefore, all in all, this last 2x combining in width enables us to achieve 2x reuse of the fetched weights, with minimal bounds check overhead, and at the same time improve input bus utilization across more architectures. So, going back to the flattened output tensor range illustration: through combining work along input width by 2x we managed to cut re-access to the weight tensor in half, and at the same time improve the efficiency of access to the input tensor from 3x to 4x, over the concise 3×3 implementation.

How combining 2x across input width further optimizes input and weight tensor access over the concise implementation.

Thanks for reading. And let me know what to expand on in the comments.
