I've always wanted to learn GPU programming, so when I saw the honours project proposal I knew I had to apply for it. The title of my work is: "Evaluating Neural Networks on Low Powered GPUs". See the whole project here or read this page for the highlights. : )

ARM's line of MALI GPUs is designed to be used in battery-powered devices like smartphones or, if you are my university, even drones. Researchers at the University of Edinburgh have started placing the ODROID XU3 board (which contains a MALI T628 GPU) into drones, and they wanted OpenCL kernels specifically optimized for the MALI. Thus my honours project was born. (When I get back to the Uni in September I'll try to get a picture of one of these drones and put it here, can't believe I forgot to take one.)

So this is the ODROID board:

It contains one GPU which appears as two distinct OpenCL devices having 4 and 2 OpenCL compute units respectively. Each compute unit can handle a workgroup of 256 work items.

There are several key differences between the MALI GPU and a standard CUDA GPU:

- MALI GPUs do not have dedicated local memory. CUDA kernels often make use of local memory located closer to the compute units but any such optimization would be useless on MALI GPUs as both global and local memory simply map to global memory.
- MALI GPUs have dedicated 128-bit vector operation hardware, making them ideal for operating on float4 units (and other vector types). CUDA devices often support vector operations, but under the hood they simulate them by grouping multiple work items together to achieve the same SIMD effect, so vectorized code does not necessarily lead to performance gains there.
- Unlike many CUDA devices, MALI GPUs have no notion of a "warp" (in terms of concurrent memory access), and each work item executes with its own independent program counter. Hence many CUDA optimizations for avoiding branch divergence are virtually useless.
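To make the vectorization point concrete, here is a plain-C sketch (not the project's actual kernel code) of the access pattern that an OpenCL `float4` plus `dot()` expresses. On a MALI the four lanes map onto the 128-bit vector unit; on a CUDA-style device the same source would be scalarized, with the SIMD effect coming from the warp instead:

```c
/* Sketch only: a scalar stand-in for OpenCL's float4/dot() idiom.
   On MALI, the 4-lane step below corresponds to one 128-bit vector
   multiply-accumulate; function names here are illustrative. */
static float dot4(const float *a, const float *b) {
    /* one "vector" step over 4 lanes */
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
}

static float dot_n(const float *a, const float *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i += 4)  /* assumes n is a multiple of 4 */
        acc += dot4(a + i, b + i);
    return acc;
}
```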

You may have noticed I compare the MALI GPU to CUDA devices a lot. That is because many neural network frameworks and libraries (and research for that matter) are strongly CUDA-oriented. There are not many OpenCL neural network libraries designed for low-powered GPUs. The closest one I could find was CNNdroid, however, this library did not make use of all possible optimizations and focused primarily on vectorization (we can do better when focusing on one particular device).

I implemented the following kernels:

- Matrix multiplication
- Bias addition
- Sigmoid activation
- ReLU activation

This was a solid start for the first part of my project (convolution, pooling, softmax etc. will come next year).

The primary accomplishment of this work was demonstrating that using Hybrid Morton Order Memory Layouts can increase performance of matrix multiplication by 12% in comparison with the Blocked Vectorized algorithm presented by ARM. It may seem complicated with all the funky words but the concept is rather simple.

The power of GPUs lies in their concurrency. So if we want to compute the product of two matrices, we can have the GPU compute individual elements of the resulting matrix in parallel and 'construct' the whole result at the same time. The ARM matrix multiplication algorithm rearranged the matrices into row major and column major layout respectively and had individual work items compute a 2-by-2 'patch' of the resulting matrix. This visualization explains it better than words:

The algorithm used vector types to access the two input matrices and used the OpenCL `dot` function to compute their dot product. It kept track of the intermediate results in another vector register and, at the end, filled the blue patch in. The matrix memory layouts for this algorithm were Row Major for the first matrix and Column Major for the second matrix. They looked like this:
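The scheme can be sketched in plain C like this (a scalar stand-in for the vectorized kernel: the 2-by-2 blocking and the row-major/column-major inputs match the description above, but the names and signature are illustrative):

```c
/* Sketch of the blocked scheme: one "work item" computes a 2x2 patch
   of C = A*B, reading rows of A (row major) and columns of B (column
   major) four floats at a time -- the float4/dot() pattern from the
   ARM kernel written out in scalar C. Assumes k is a multiple of 4. */
static void mm_patch2x2(const float *A, const float *Bt, float *C,
                        int n, int k, int row, int col) {
    /* Bt holds B column major: column j of B starts at Bt + j*k */
    for (int di = 0; di < 2; ++di)
        for (int dj = 0; dj < 2; ++dj) {
            float acc = 0.0f;
            for (int p = 0; p < k; p += 4)      /* 4 lanes per step */
                for (int l = 0; l < 4; ++l)
                    acc += A[(row + di) * k + p + l]
                         * Bt[(col + dj) * k + p + l];
            C[(row + di) * n + col + dj] = acc; /* fill the patch */
        }
}
```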

With this layout, the ARM algorithm would hit 4 cache misses every 4 iterations of the kernel. Depending on the prefetcher's foresight and the speed at which it could pull data into the cache, it might not be capable of fetching all 4 cache lines in time, leading to stalling.

Thus the idea for a Hybrid Morton Order Layout was born. The intent was to redistribute the rate at which cache misses occur by implementing the following memory layouts:

Altering the kernel to run on the top two memory layouts would make the kernel hit a single cache miss every iteration. Using the bottom two layouts would make the kernel hit two cache misses every two iterations.
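For reference, the index math behind a Morton (Z-order) layout is just bit interleaving; here is a sketch of it (the hybrid variants above apply this ordering only at certain levels of the blocking, and the exact block sizes from the project are not reproduced here):

```c
#include <stdint.h>

/* Z-order (Morton) index for a 2D coordinate: interleave the bits of
   row and column so that the elements of each 2x2 (then 4x4, 8x8, ...)
   block land on consecutive addresses. */
static uint32_t part1by1(uint32_t x) {
    /* spread the low 16 bits of x into the even bit positions */
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

static uint32_t morton2d(uint32_t row, uint32_t col) {
    return (part1by1(row) << 1) | part1by1(col);
}
```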

As it turns out, hitting 2 cache misses every 2 iterations performed the best, outperforming the ARM kernel by 12%.

My project is far from done at this stage. The first part of the project was intended to get me into GPU programming. The second part will entail optimizing the full set of kernels needed to execute neural networks including convolution, pooling, normalization, softmax, other activation functions etc.