About using Machine Learning to improve performance of Go programs

This Blogpost contains some thoughts on learning the sizes arrays, slices or maps are going to reach using Machine Learning (ML) to increase  programs’ performances by allocating the necessary memory in advance instead of reallocating every time new elements are appended.

What made me write this blogpost?

Well first of all I had to because it is part of the lecture Ultra Largescale Systems (ULS) I attended past winter term. But as an introduction I’ll tell you what made me choose this topic: I started to learn Golang and coming from mainly Java, Python and JavaScript the concept of Arrays with fixed sizes and Slices wrapped around them for convenience was new to me. When I understood that initializing them with the correct capacity is good for performance and memory usage I always tried to so. Until I came to some use case where I could not know the capacity in advance. At almost the same time we talked about “ML for Systems” in the ULS-lecture. There the power of ML is used to speed up Databases, loadbalance Elastic Search Queries and other things. So I came up with the idea of ML for programming languages in this case for learning capacities in Golang. By the way I wanted to try out ML in Go, which is said to bring some performance advantages compared to python and is easier to deliver. But neither ML in Go (go for ML) nor ML on Go is topic of this post, though both appear at some parts.

The goal in more detail

As explained in various blogposts like here and there, arrays have fixed sizes in Go. For convenient manipulation anyway they can be wrapped by slices. Thus appending to a slice that reached its capacity needs to create a new slice with a larger underling array, copy the contents of the old slice to the new one and then replace the old one by the new one. This is what the append method does. That this process is more time consuming than appending to a slice that has a sufficient capacity can be shown with some very simple code that just appends 100 times to a test slice in a loop. Once the slice is initialized with a capacity of zero and once with 100. For both cases we calculate the durations it takes and compare them. Since those durations can vary for the same kind of initialization we run this 1000 times each and calculate the average duration to get more meaningful results. The averages are calculated by the method printSummary which is left out here in order to keep track of things. However the whole code can be found on GitHub.

As expected the correct initialized version runs with an average of 1714ns faster than the other one with an average of 2409ns. Of course those durations are still just samples and vary if the code runs multiple times. But in over 20 runs each there is only one average value of the bad initialized slice lower than some of the good ones.

If we also take a look at the capacity the slower version ends up with, we see that this is 128 instead of the required 100. This is because append always doubles the capacity if it reaches the limit.

So we can see that it is worth setting the capacity correct in advance for performance and resource consumption reasons. But this is not always as easy as in the example we just saw and sometimes it is not even possible to know the length a slice will grow up to in advance. In those cases it might make sense to let the program learn the required capacities. It could be helpful at initialization with make as well as for growing with append.

A basic example

Setup

To check out feasibility I created a basic example that is a bit more complex than the first one but still possible to calculate as well. It iterates over index j and value s of a slice of random integer samples and for each of them the test slice is created. Then we append s times three values and one value j times. So the final length (and required capacity) of test can be calculated as s*3+j.

Also in this loop training data gets gathered. One sample consists of s and j as input and len(test) as label. Since the main goal of this scenario is to check if it’s worth using a trained ML model to predict the required capacity, this data is collected always to create equal conditions for every test case. Ways to avoid the time expensive training and data collection at runtime are discussed later.

As implementation for the ML part I chose go-deep. I picked it from this list because it looked well documented, easy to use and sufficient for my needs, though not perfect.

I used the collected training data to train a MLP (Multi Layer Perceptron) with two hidden layers containing two and five neurons. Of course I configured RegressionMode to use Identity as activation function in the output layer and MSE (Mean Square Error) as loss function. I also played around with some other hyperparameters but kept a lot from the examples provided as well, because the MSE already decreased very fast and became 0.0000 after three training-iterations. This is not surprising since the function to learn is very simple. Also there is no need to avoid overfitting in this basic example. I kept some of the belonging hyperparameters with low values anyway. In a real world use case one would probably try to keep the model as small as possible to get quickest responses.

Results

The following table shows the test cases I compared along with the average durations in nanoseconds calculated over 1000 tries each. Since those averages vary again from run to run the table contains three of them.

Test caseAvg ns run1Avg ns run2Avg ns run3
Initialize capacity with
zero
12.790.50114.267.92514.321.735
Use s*3+j directly in make5.679.5956.067.9685.943.731
Use a function to
calculate s*3+j
5.242.1826.012.9205.515.661
Use the prediction of the
learned model
10.898.4376.361.9119.056.003
The model’s prediction +16.069.7765.714.3486.144.386
The model’s prediction
on new random data
10.165.7646.096.9299.296.384

Even though the durations vary the results show that not initializing the capacity is worst. Also usually it is best to calculate the capacity, if possible. It does not really matter if the calculation happens in a function or directly. When I took a closer look at the model’s predictions I saw that they are quite often exactly one less than the actual capacity. This is why I also added the prediction+1 test case, which is almost as good as the direct calculations. So investigating a bit deeper in what the model predicts is worth it. Maybe some finetuning on the hyperparameters could also fix the problem instead of adding 1 manually. The results also show that the learned model works on completely new random data as well as on partly known data from the training.

Conclusion

Of course creating such a model for a small performance optimization is heavy overengineered and thus not worth it. It could be worth in cases where you know you have a bottleneck at this place (because your profiler told you) and you cannot calculate the required capacity in any other way in advance. In the introduction I already mentioned that I had a use case where it is not possible to do so. In this case the length of the slice depends on a sql.rows object which doesn’t tell you how many rows it contains in advance. Other examples might be conditional appends where you cannot know how many elements fulfill the condition to be appended to a slice or something else. But also in those cases the required capacity might depend on something else. For example the current time, the size of an HTTP request that caused this action or the length this slice reached the last time. In those cases using a ML model might be helpful to avoid a performance bottleneck. With dependencies to previous lengths especially RNNs (Recurrent Neural Networks) might be helpful. At least they probably could give a better guess than a developer himself.

Looking ahead

As stated above in examples like this the engineering effort is too high. So ways for automating would be desirable. First I thought about a one-size-fits-all solution meaning one pretrained model that predicts for various makes the required capacity. But it would be difficult to find good features because they could change from make to make and just using all sorts of possible features would create very sparse matrices and require larger models if they could work at all.

So we should stick to use case specific models that can be smaller and use meaningful features depending on their environment like lengths of arrays, slices, maps or strings “close” to them or values of specific bools or integers. The drawback is that individual models need individual training maybe with production like data. Training during runtime would cause an overhead that might destroy the benefit the model could bring and slow the program down at least for a while until training can be stopped or paused because the ML model’s performance is good enough. So if possible pure online learning should be avoided and training on test stages or at times with low traffic should be preferred. If the length of a slice depends on the current traffic this is of course not possible. Then one should at least make use of dumping a model’s weights from time to time to the logs to be able to reuse them when starting a new node.

Still we need to solve the overengineering issue and try to build a model automatically at compile time, if the developer demands to do so for example using an additional argument in the call to make. I think that this might be another use case for ML on code: Finding good features and parameters to build a ML model by inspecting the code. Unfortunately I’m not sure what such an ML model on code could look like and what it would require to train it. Using a GAN (Generative adversarial network) to generate the models would probably require already existing ones to train the discriminator. If the automation could be realized the use case also could get broader because then calculating the capacity would be more effort than just saying “learn it”.

Some final thoughts

Using ML would not magically boost performance. It would require developers to remeasure and double check if it’s worth using it. For example it is not only important how often the program needs to allocate memory but also where. So stack allocation is cheap and heap allocation is expensive as explained in this blog post. If using ML to predict the capacity requires the program to allocate on the heap it might be slower even when the predictions are correct. In the test scenario above all the cases instead of initializing with zero escaped to the heap. There it was worth it but it needs to be measured. So the performance should be compared with and without learning for short and for longer running applications. As another example sometimes the required capacities might not be learnable because they are almost random or depend on things that cannot be used as features in an efficient way.

Another drawback of using ML is that your code behaves less predictable. You won’t know what capacity will be estimated for a slice in advance anymore and it will be much harder to figure out why the program estimated exactly what it did afterwards.

I also thought about to train the model to reduce a mix of performance and required memory instead of using the final length as labels. But then it is not that easy anymore to get the training data. In some cases however it might also be difficult to get the “final” length of a slice as well.

The last thing to remember is that it is always helpful to set a learned model some borders. In this case a minimum and a maximum. My test model for example predicted a negative capacity before I got the hyperparameters right, what made my program crash. So if the model for some reason thinks this could be a great idea a fixed minimum of zero should prevent the worst. Also such borders make a program a bit more predictable again.