Writing High Performance Code on Modern Hardware

Today, with the use of modern hardware combined with optimized high performant code, it is an easy task to process more than 500 million images per day on a single machine. Small improvements in the underlying implementations can have extreme large impacts on the execution time and are therefore fundamentally important to handle the huge amount of data in modern large scale systems. Furthermore, we can dramatically reduce the costs of our infrastructure and stay competitive by optimizing our implementations. To show how this is possible and what optimizations we can use in our everyday programmer life, let us have a look at the following imaginary task:

The Task

Our example task is to calculate the average color of an image as fast as possible. The task is inspired by various applications, like Google Images, which contains many images and shows the average color of the underlying image, before all data is available on the device. This can be used to hide the latency and to give a more fluid feeling to the application. At first glance, there is nothing special about it but if we think about how many images are uploaded to modern large scale systems and how much computing power we would need to process them, things are getting pretty interesting. The calculation of the average color of an image gives a good example, where we can apply fundamental optimizations to our code. To compute the average color, we have to process every pixel of an image which can be a very expensive task, especially, if we have to deal with high resolution images. Unfortunately, for the developers who have to deal with these high resolution images, modern devices have high resolution cameras and displays which increase the computational costs. To get the correct average color, we would have to square the individual values, and calculate the average color based on the squared values. But in our case, it is enough to simply add the values from every color channel together without squaring them, and divide it by the number of pixels. This will give us a slightly darker average color which is no problem for our task and furthermore, we will get performance improvements based on this simplification, as we will find out in the following sections.

Choosing a Language

Today, many programming languages exist, each with its own advantages and disadvantages and it became a challenge to choose the perfect language for a given task. If we want to write high performant code, things are getting even more complicated because for every languages there are different optimization techniques and we can find different test cases, where they can beat other languages. To test which language is the right one for our purpose, I implemented the basic calculation of the average color with the languages I commonly use: C++, Java and Python with Numpy. Although, I already expected that the good old C++ will win this fight, I was surprised about the results:

Calculation time in % when calculating the average color value with different programming languages

One thing should be noted, these results can only be seen as a basic overview about how much improvements we can get, if we choose different languages for specific tasks. These differences should not be transferred to a general comparison between the performances of these languages! The real improvements always depend on the specific task and in this case, the basic C++ implementation is already twice as fast as the Java solution and more than ten times faster than Python without any optimizations!

Know your Data

For further improvements, we have to take a look at our data, because then we can find ways to optimize our memory accesses. In most cases, 8 bit per color channel (Red, Green, Blue) is used to store our data. Additionally, an alpha value is stored as a fourth color channel to represent the transparence of our picture. So all in all, we have 4 color channels per pixel, each containing a number between 0 and 255. Our pixels are stored like the plain-PPM file format, where the RGBA values of all pixels are listed one after another. If we calculate one channel after another, we can get a low performance, based on inefficient memory access. If we use libraries, from which we do not exactly know how they store the data, we can easily have an inefficient memory access without even noticing. The imaginary used API could have stored our images in a different way, whereby the color channels are stored in four separate arrays. This could be useful in some cases, but if we now have a function to access a single pixel, we have created a memory inefficient access. Due to a memory inefficient access, where we calculate one channel after another, the calculation time increases drastically:

Calculation time in % with bad or efficient memory access

We, as programmers, are more familiar with Integer or Float data types, that are in most cases represented by 32 bits. We use them basically everywhere when we have enough available memory, even if we could use smaller data types. We do not care about the small decrease of our memory footprint, but the reduced memory consumption is not everything we can get from more suitable data types. Due to suitable data types, we can get additional performance improvements:

Calculation time in % with different datatypes

Now our compiler has additional information about the data and can use more or even better optimizations. With this small change, we reduce the calculation time by more than 40% and this only by storing our data in a Char with 8 Bits instead of an Integer with 32 Bits!

Know your Hardware

If we know our hardware, we can use further optimization techniques to get the best out of our system. Modern CPUs come with many cores and therefore, with huge performance gains. Additionally, they can use a technique that is called vectorization, whereby our hardware can make multiple calculations in fewer steps and, if this is not enough, we can also utilize the raw computing power of modern GPUs.

Vectorization

Vectorization uses special hardware registers, by which it is possible to make calculations faster. These registers are limited in size, often 128-Bit or 256-Bit, and we can fill them with our data. In our case, we add 4D vectors together. Normally, we have to make four additions, one for each element of the vector but if we use vectorization, this could be done in a single step. First, I implemented a basic SIMD (Single Instruction Multiple Data) vector calculation where we can add two 4D vectors, each stored in 128-Bit, together in a single step. But this simple approach increased rather than reduced the calculation time, how can this be? Our compiler does a great job in optimizing our code, whereby he already tries to use vectorization automatically! This is especially visible in the performance improvements we got by using 8 Bits to store our data, now, the compiler could detect this and could add more values together in a single step with automatic vectorization. It was not an easy task to implement a faster vectorization solution, but we can still get some improvements by using AVX2 (Advanced Vector Extensions) instructions with 256-Bit registers. We could store 32 8-Bit values in these registers but because we need more bits to store our sum, this representation is not enough. The next bigger data type would be 16-Bits where we can add 16 values each with 16 bits together in a single step. With 16 bits we can sum 256 values together if we do not square the values, without losing data and with this knowledge we can get again performance improvements:

Calculation time in % with vectorization techniques

Multiprocessors

Modern CPUs are multiprocessors, to get performance gains by parallelization instead of the nearly impossible increase of clock rate. By using multiprocessors, we can distribute the work over multiple cores and can fully utilize the CPU. For our task, we use six threads corresponding to six hardware cores, where every thread calculates the average color of an individual image. Due to the fact that multiple threads do not access the same data, we are free of race conditions, which makes our life easier. With six hardware cores, we would expect that we also will be able to process six times more images, but starting and waiting for threads also consumes time, so that we end up with an 4.5 times faster implementation than the single threaded version.

GPUs

The next step to get more performance, is to use GPUs. GPUs are a great choice if it comes to raw calculation performance, based on their hardware architecture. To keep it simple, GPUs have way more cores than CPUs but GPU cores are more lightweight than CPU cores, which means we do not have thousand individual CPU cores running concurrently on a GPU. But if we are aware of our hardware architecture, they can be executed nearly concurrently and we can get huge performance improvements especially for calculation intensive tasks. Many programmers have not even touched GPU programming, but today it is quite easy to get good performance, even without heavy optimization or hardware knowledge. For our task, even a very simple and unoptimized OpenCL solution is better than our optimized multicore C++ implementation. We perform a simple parallel sum on our color vectors, by which we start as many GPU-Threads as we have pixels in our image. First, every GPU thread loads a single value into local memory, then we calculate the sum of 256 Elements and store the average color of these elements on our GPU. We can repeat this steps until we have the average color of our whole image, and that’s basically all we have to do on the GPU side to get a 25% faster solution! Another advantage is, that GPUs often scale better with larger data as CPUs. This is very helpful for our high resolution images:

Relative execution time on different resolutions

Conclusion

The used programming language and suitable data types can heavily improve our performance without complicated optimizations. This is a simple way to write more performant code. Furthermore, we can integrate this easy changes in our everyday work to improve our implementations. If we want further optimizations, we get stuck in the endless space of possible techniques and hardware dependent optimizations, but even with common techniques, we can get great performance improvements. We can use vectorization, multicore CPUs and GPU programming to get the best out of our systems. Especially with GPU programming we can get great results, even without heavy optimization and furthermore, GPU programming became more easy in the past years and it is easy to adopt it in our systems. With this techniques, it was possible to reduce the calculation time of our example task to less than 1% compared to the simple Python implementation. It is always hard to talk about general optimization techniques, but I hope that the results of our imaginary task give some motivation and suggestions what can be achieved with modern hardware and optimized code:

Calculation time in % for all optimizations