GPU Speedup

Mohammad Jeragh
6 min read · Oct 26, 2019


In our last blog, Concurrency and Parallelism, we demonstrated the speedup available for large operations through an example that initialized a two-dimensional array with concurrency and limited parallelism across the multiple cores of a CPU.

In this blog, we will move the entire operation to the GPU and utilize its tremendous power of parallelism.

GPU

A form of parallelism known as single instruction multiple data (SIMD) refers to the ability of most modern microprocessors to perform a mathematical operation on multiple data items in parallel using a single machine instruction. The combination of SIMD and multithreading forms a kind of parallelism known as single instruction multiple thread (SIMT), the basis of all modern GPUs.
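For intuition, here is what CPU-side SIMD looks like using Swift's simd module (a small illustrative sketch, not tied to the rest of this post): one multiply expression operates on four floats at once.

import simd

// One conceptual instruction multiplies four pairs of floats at once.
let a = SIMD4<Float>(1, 2, 3, 4)
let b = SIMD4<Float>(10, 20, 30, 40)
let c = a * b // (10.0, 40.0, 90.0, 160.0)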

GPUs are designed specifically to perform data-parallel computations on very large datasets. For a computational task to be well suited for execution on a GPU, the computation performed on any one element of the dataset must be independent of the results of the computations on other elements. The Metal kernel example below shows that each index of the result matrix r is computed independently of every other index in the same matrix.

kernel void multiply(device float* r [[buffer(0)]],
                     constant float* a [[buffer(1)]],
                     constant float* b [[buffer(2)]],
                     uint pid [[thread_position_in_grid]]) {
    // Each thread computes one element; no thread reads another thread's result.
    r[pid] = a[pid] * b[pid];
}

GPU vs CPU

One might ask why there is such a fundamental difference between the GPU and the CPU in parallel performance. The answer lies in the design philosophies behind the two types of processors.

A GPU must be capable of moving an extremely large amount of data in and out of its main DRAM because of graphics frame-buffer requirements; this capability is called throughput. A CPU, on the other hand, has to satisfy requirements from legacy operating systems, applications, and I/O operations, which makes its memory bandwidth more difficult to increase; CPUs are therefore designed to minimize the execution latency of a single thread.

An important observation is that reducing latency is much more expensive than increasing throughput in terms of power and chip area. Therefore, the prevailing solution is to optimize for the execution throughput of massive numbers of threads.

One can conclude that the ideal setup for an application combines low latency and high throughput: low latency allows the CPU to execute commands sequentially without the system becoming slow or unresponsive, while high throughput lets the GPU process data in parallel.

Coding

Finally, the fun part: coding. Many APIs exist for parallel programming, such as CUDA by NVIDIA, OpenCL by the Khronos Group, DirectX by Microsoft, and, most importantly for us, Metal by Apple. As usual on this site, we will focus on Apple's Metal API.

We will start by learning the essential tasks of Metal programming for the purpose of GPGPU; remember, graphics programming is out of scope for this post. As always, you can download the code from my GitHub repository (please don’t forget to follow me on GitHub 😁).

Metal Initialization

Communicating with the GPU requires the following line:

var device = MTLCreateSystemDefaultDevice()!

The device object's responsibility is to create, directly or indirectly, the other Metal objects, which are usable only with that device object. Apps that use multiple GPUs use multiple device objects and create a similar hierarchy of Metal objects for each.
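On macOS, for instance, one can enumerate every GPU the system exposes (a brief sketch; MTLCopyAllDevices is macOS-only):

import Metal

// List every GPU available to the system.
for gpu in MTLCopyAllDevices() {
    print(gpu.name)
}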

  1. CommandQueue: this object is responsible for creating and organizing the command buffers that send and schedule tasks to the GPU.
  2. Library: like the commandQueue, it is created by the device; it holds references to the Metal functions (vertex, fragment, and kernel).
  3. Pipeline: a pipeline specifies the steps the GPU performs to complete a specific task by converting a function into code that executes on the GPU. Because we are focusing on GPGPU, we use a kernel function.
  4. Buffer: holds the data (MTLBuffer) and commands (MTLCommandBuffer) for the GPU.
  5. Encoder: the commandEncoder is used to write commands into the buffer object.

Summing it up: you start with an MTLDevice for the device object, use it to create one MTLCommandQueue object and one MTLLibrary object in your app, and then create at least one MTLComputePipelineState object and at least one MTLBuffer object.
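Putting that hierarchy into code, a minimal setup might look like the following sketch (it assumes your app's default library contains a kernel function named kernel_main, as ours will below):

import MetalKit

let device = MTLCreateSystemDefaultDevice()!   // the GPU
let commandQueue = device.makeCommandQueue()!  // schedules work for the device
let library = device.makeDefaultLibrary()!     // the compiled .metal functions
let function = library.makeFunction(name: "kernel_main")!
let pipeline = try! device.makeComputePipelineState(function: function) // executable form of the kernel
let dataBuffer = device.makeBuffer(length: 4096, options: [])!          // raw GPU-visible memory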

Threads

Massive parallelism calls for massive numbers of threads! In Metal, threads are organized into 1D, 2D, or 3D grids. Metal subdivides each grid into threadgroups of up to three dimensions, and threads within the same threadgroup share a common memory space.

Thread counts are chosen when you encode a dispatch, guided by limits that the pipeline state exposes, as shown in the sketch below.
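For example, the pipeline state reports how many threads a threadgroup can hold; a common sizing pattern looks like this sketch (it assumes the computePipelineState and computeEncoder created in the full listing that follows; dispatchThreads also requires a GPU that supports non-uniform threadgroup sizes):

// Ask the pipeline how wide a threadgroup should be.
let width = computePipelineState.threadExecutionWidth
let height = computePipelineState.maxTotalThreadsPerThreadgroup / width
let threadsPerThreadgroup = MTLSizeMake(width, height, 1)
// Dispatch one thread per matrix element; Metal trims the edge threadgroups.
let threadsPerGrid = MTLSizeMake(Int(column), Int(row), 1)
computeEncoder?.dispatchThreads(threadsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)

The listing below deliberately keeps things simpler and puts a single thread in each threadgroup instead.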

Code

The code on the CPU side:

import Foundation
import MetalKit
let row: uint = 30000
var column: uint = 4000
// Nested Swift arrays are not contiguous in memory, so the row x column
// matrix is flattened into a single Float array before being handed to Metal.
var array = [Float](repeating: 0, count: Int(row * column))
let start = DispatchTime.now() // <<<<<<<<<< Start time
//1
var device = MTLCreateSystemDefaultDevice()!
var commandQueue = device.makeCommandQueue()!
var library = device.makeDefaultLibrary()
//2
let commandBuffer = commandQueue.makeCommandBuffer()
let computeEncoder = commandBuffer?.makeComputeCommandEncoder()
//3
var computeFunction = library?.makeFunction(name: "kernel_main")!
var computePipelineState = try! device.makeComputePipelineState(function: computeFunction!)
//4
var matrixBuffer = device.makeBuffer(bytes: &array, length: Int(row*column) * MemoryLayout<Float>.stride, options: [])
//5
computeEncoder?.pushDebugGroup("settingup")
computeEncoder?.setComputePipelineState(computePipelineState)
computeEncoder?.setBuffer(matrixBuffer, offset: 0, index: 0)
computeEncoder?.setBytes(&column, length: MemoryLayout<uint>.stride, index: 1)
let threadsPerThreadGrid = MTLSizeMake(Int(row * column), 1, 1)
computeEncoder?.dispatchThreadgroups(threadsPerThreadGrid, threadsPerThreadgroup: MTLSizeMake(1, 1, 1))
//6
computeEncoder?.endEncoding()
computeEncoder?.popDebugGroup()
commandBuffer?.commit()
commandBuffer?.waitUntilCompleted()
let end = DispatchTime.now() // <<<<<<<<<< End time
let nanoTime = end.uptimeNanoseconds - start.uptimeNanoseconds // <<<<< Difference in nanoseconds (UInt64)
let timeInterval = Double(nanoTime) / 1_000_000_000 // Technically could overflow for long running tests
print("Time to execute: \(timeInterval) seconds")
let contents = matrixBuffer?.contents()
let pointer = contents?.bindMemory(to: Float.self, capacity: Int(row * column))

Going through the code

  1. Create the device object, which directly creates the commandQueue and library objects. These objects remain the same throughout the lifetime of the application.
  2. Create a commandBuffer to hold the data and commands the CPU sets up for the commandQueue object.
  3. Create the function object that refers to kernel_main in the default library. Next, the pipeline object is created from it; the pipeline compiles the function into code the GPU can execute.
  4. The matrixBuffer object is GPU-accessible memory initialized with the contents of the array variable.
  5. The encoder object sets the pipeline, every object that needs to be sent to GPU memory, and the number of threads the GPU needs to create and execute.
  6. Finally, we end the encoding and commit the buffer to the GPU for execution. For the purpose of timing, we wait for the GPU to finish.

The following illustration demonstrates the code above:

The code on the GPU side:

#include <metal_stdlib>
using namespace metal;

kernel void kernel_main(device float* factors [[buffer(0)]],
                        constant uint& column [[buffer(1)]],
                        uint pid [[thread_position_in_grid]]) {
    // Row index (pid / column) times column index (pid % column).
    factors[pid] = (pid / column) * (pid % column);
}

Going through the parameters of our function:

  1. The device address space allows read and write operations; the factors array will store the result of the multiplication. The buffer attribute is at index 0, matching where we set it on the CPU side.
  2. The column variable lives in the constant address space since it is only read. Its buffer attribute is at index 1, where it was set on the CPU side.
  3. The last parameter identifies the thread. Here we put every thread in its own threadgroup because no cooperation between threads is needed; each thread independently computes the value for its own location in the array.

The body of the function is a single line. Our two-dimensional array on the CPU side has been passed as a one-dimensional array; that is why we need the number of columns, so each thread can recover its row (pid / column) and column (pid % column) indices.
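As a concrete check with column = 4000: thread pid = 8003 maps to row 8003 / 4000 = 2 and column 8003 % 4000 = 3, so it writes 2 * 3 = 6. The same arithmetic lets the CPU read element (i, j) back from the flat buffer, using the pointer bound at the end of the CPU listing:

// Element (i, j) lives at flat offset i * column + j.
let i = 2, j = 3
if let pointer = pointer {
    print(pointer[i * Int(column) + j]) // prints 6.0
}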

Metal is low level, very low: all memory behaves like C types. Note that all kernel functions in Metal return void.

Comparison

From our previous blog, using the CPU with GCD and setting 30000 rows with 4000 columns, we get the following output:

Time to execute: 31.385106194 seconds
Program ended with exit code: 0

Compared to utilizing the GPU, the output of the above code is:

Time to execute: 0.488734338 seconds
Program ended with exit code: 0

Conclusion

The performance and speed gained on the GPU make it a must-use in modern-day applications. In our next blog we will demonstrate more features and show how to utilize the GPU further.

Hopefully, you found this post useful; please clap and follow to encourage me to write more about Metal and Swift. I would greatly appreciate you visiting my blog here and following me. Recommending it to others is always a great idea, trust me 😀.

Thank you and till next time, Happy Coding!
