Autoscaling of Docker Containers in Google Kubernetes Engine

The name Kubernetes comes from the Greek word for helmsman, the person who steers a ship or boat. The Kubernetes logo, which shows a ship's steering wheel, was most likely inspired by this meaning. [1]

The choice of name can also be read as follows: Kubernetes (the helmsman) steers a ship that carries several containers (e.g. Docker containers). It is therefore responsible for orchestrating these containers and for bringing them safely to their destination.

Apparently, Kubernetes was called Seven of Nine within Google, a reference that the Star Trek fans among us should recognize. Since the logo has seven spokes, there might be a connection between the logo and this name. [2]

This blog post was created during the master's lecture System Engineering and Management. In this lecture, we work on topics that interest us and that we would like to experiment with. We have worked with Docker containers very often and appreciate their advantages. We have also run several containers within a closed system orchestrated by Docker-Compose configurations. Nevertheless, especially in connection with scaling and big companies like Netflix or Amazon, you hear the buzzword Kubernetes and quickly find out that distributing a system across several nodes requires a platform such as Kubernetes.

Who is this blog post for? 

If you know what Docker Containers are and how they are built and used, the next step may be to create multiple instances of them and distribute the load. Of course this is possible with a Docker-Compose configuration, but the containers will only run on one computer at a time. So as soon as the computer reaches its physical limits, it is absolutely necessary to distribute the services and the instances thereof to several nodes. This is exactly what Kubernetes is suitable for, as it is used to orchestrate containers in a distributed system. In this blog post we will start exactly at this point and assume that you have basic experience with Docker and Kubernetes. The focus is on scaling containers within a Kubernetes cluster.

Of course all others who found their way to this blog post are also invited to read the blog post in order to learn more about Kubernetes, the autoscaling of services in docker containers plus the visualization of their utilization.

We will share and discuss our experiences about what we have built, what we have learned and difficulties we have encountered during our experiments. 

What is our goal?

First of all we just wanted to start with Kubernetes, experiment and build something with as little effort as possible and see results. But even the first question can be a challenge. Where to start? 

  • Hosted Kubernetes
    Google GKE, Microsoft Azure, Amazon EKS, IBM Cloud Kubernetes Service, Apache Cloudstack and there are many more. 
  • Test playground
    Magic Sandbox, Play with Kubernetes or Docker Desktop.

One can easily be overwhelmed by all these possibilities just to get started. Anyway, we wanted to expand our knowledge of Kubernetes and experiment a little. So we thought about a small use case. This should be kept as simple as possible so that it can be explained in a comprehensible way in this post.

Use Case

Imagine you are a backend developer or work in a DevOps position and want to solve the following use case:

Kubernetes Autoscaling

First you want to create a service (e.g. a simple hello world website) and containerize this service with Docker. Then you select a Hosted Kubernetes service of your choice and use it to provision and manage your cluster. Once it is set up and running, it should be able to scale based on CPU/memory usage or any custom metrics. Users requesting your service should then be redirected to the new services created by Kubernetes to balance the load.

For testing purposes and visual feedback it would be great to trigger/emulate high CPU/memory-usage to visually see (e.g. monitoring dashboard) the app scaling in both horizontal & vertical directions.

What are we not doing?

It won’t be a fully functioning production-ready step-by-step setup. We are just experimenting and the goal is to learn as much as possible.

Choice of Cloud Provider

As already mentioned, it can be difficult to choose the right provider. The choice of providers is huge and continues to increase. Among them are “playgrounds” like Magic Sandbox, Play with Kubernetes or Docker Desktop. However, these are very limited in their functionality and were therefore not considered for our experiment. To use Kubernetes without restrictions, we have to use a “Hosted Kubernetes” provider, which does not make the choice easier. There are many platforms available, e.g. Google GKE, Microsoft Azure, Amazon EKS and IBM Cloud Kubernetes Service.

Finally, we chose Google Kubernetes Engine because Kubernetes was originally developed by Google, so we assumed that if there were updates or new features, they would be directly supported by the GKE. Furthermore, we have worked a lot with other Google Cloud services like Firebase, Cloud Functions and Cloud Storage in the past. We hope that this will allow us to easily integrate and interact with Google products in the future. GKE delivers the Standard Metric API by default and does not need to be explicitly implemented. Google’s Kubernetes Engine can be tested over 12 months with a budget of $300, which we also took into consideration in our decision.


In order to understand the scaling of Docker Containers within a Kubernetes cluster, different scaling options are presented first, which can be applied to various systems independently of Kubernetes. The Scale Cube is suitable for illustrating the different possibilities. [3, p. 8ff.]

Figure 1: Scale Cube [3, p. 10, fig. 1.4]


Scaling along the x-axis is called horizontal scaling. This involves duplicating an existing application and distributing the requests to the available instances with the help of a load balancer. Figure 2 shows three instances of an application whose requests are distributed by a load balancer.

Figure 2: x-axis scaling [3, p. 10, fig. 1.4]
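
The idea of x-axis scaling can be illustrated with a minimal TypeScript sketch. It assumes a simple round-robin strategy, which the text above does not prescribe; real load balancers may distribute requests differently. The class name and the instance names are hypothetical.

```typescript
// Hypothetical round-robin balancer: every instance is an identical copy
// of the application, so any of them can serve any request.
class RoundRobinBalancer {
  private readonly instances: string[];
  private next = 0;

  constructor(instances: string[]) {
    if (instances.length === 0) {
      throw new Error("at least one instance is required");
    }
    this.instances = instances;
  }

  // Returns the instance that should handle the next request.
  pick(): string {
    const instance = this.instances[this.next];
    this.next = (this.next + 1) % this.instances.length;
    return instance;
  }
}

// Four requests against three instances wrap around to the first one.
const balancer = new RoundRobinBalancer(["app-1", "app-2", "app-3"]);
const targets = [1, 2, 3, 4].map(() => balancer.pick());
// targets is ["app-1", "app-2", "app-3", "app-1"]
```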


Extending a system in the z-direction works similarly to scaling along the x-axis, because the application is also duplicated. However, the load balancer does not forward the requests arbitrarily, but based on an attribute of the request such as a userId. Figure 3 also shows three instances of an application. In contrast to x-axis scaling, the requests to these instances are forwarded by a load balancer based on the userId, which is known to the system through prior authentication of the user.

Figure 3: z-axis scaling [3, p. 10, fig. 1.5]
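
As a rough sketch of z-axis routing in TypeScript: the request attribute (here the userId) is hashed, and the hash selects the instance. The hash function, the function names and the instance names are assumptions made for illustration; the source does not specify how the attribute is mapped to an instance.

```typescript
// Simple polynomial string hash, kept in the unsigned 32-bit range.
function hashUserId(userId: string): number {
  let hash = 0;
  for (let i = 0; i < userId.length; i++) {
    hash = (hash * 31 + userId.charCodeAt(i)) >>> 0;
  }
  return hash;
}

// Hypothetical z-axis router: the same userId always maps to the same instance.
function routeByUserId(userId: string, instances: string[]): string {
  if (instances.length === 0) {
    throw new Error("at least one instance is required");
  }
  return instances[hashUserId(userId) % instances.length];
}

const zInstances = ["app-1", "app-2", "app-3"];
const first = routeByUserId("user-42", zInstances);
const second = routeByUserId("user-42", zInstances);
// first and second are identical because routing depends only on the userId.
```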


If a system is functionally broken down into individual services, this is referred to as scaling in the y-direction. These services often have several instances, which means they are additionally scaled along the x-axis or the z-axis. Figure 4 shows a system scaled in the x-, y- and z-direction.

Figure 4: y-axis scaling [3, p. 11, fig. 1.6]

Scale vertically

Besides the previous options of the Scale Cube, there is often talk of vertical scaling. This refers to adding resources to a server, giving it more capacity and performance. However, since the hardware sets the limits, there are limited possibilities. [4]

Classification of the possibilities in relation to the given system

In a Kubernetes Cluster, all of the mentioned scaling possibilities can be used. Since scaling in z-axis is only a modification of the scaling of the x-axis, this concept is not considered. In addition, the application consists of only one service that fulfils one function. Therefore, a differentiation into several microservices currently makes no sense, but can be implemented at a later time to achieve scaling of the y-axis. In the context of this work, only optimizations for scaling in the x-axis and for vertical scaling are made and worked out. Subsequently, the methods will be compared and evaluated.

Creating a Cluster in GKE

When creating a cluster in Google Kubernetes Engine, we had to decide on some settings. The following is a list of the settings we changed. All other settings were taken over unchanged.

When selecting the Kubernetes version there are two options: Master Version or Release Version. If you select Release Version, Kubernetes is automatically updated by GKE. However, we selected Master Version to make sure that automatic updates do not change the Kubernetes APIs. By default, the setup selects version 1.14.10-gke.17. Unfortunately, the API autoscaling/v2beta2, which we need for scaling with custom metrics, is not available there. Therefore, we chose the current stable version 1.15.9-gke.9.

As machine type we use n1-standard-1 with 1 vCPU and 3.75 GB memory. We first tried smaller machine types like f1-micro with 614 MB memory and g1-small with 1.78 GB memory, but we could not start our application on them.

When creating a cluster, we can decide whether we want to activate Stackdriver or not. Stackdriver is a monitoring solution that is integrated into the Google Cloud but located outside the cluster. We therefore chose to monitor with Grafana and Prometheus, which we operate alongside our other containers within the cluster. This makes us independent of a specific cloud platform and eases migration between platforms.

In order for the automatic vertical scaling of Pods to work, it must be activated via the “Enable Vertical Pod Autoscaling” checkbox. If this is forgotten, it can be activated later with the following command:

gcloud container clusters update [CLUSTER-NAME] --enable-vertical-pod-autoscaling

When using the web interface of Google Kubernetes Engine we noticed that it can change from one day to the next. Especially the dialog for creating a cluster was affected. After a short search, however, all items could be found again.


Our cluster contains several applications. The first is a NestJS application which provides a REST API to generate CPU and memory loads. To track the resulting automatic scaling, the cluster also contains Prometheus to collect the metrics and Grafana to display them visually.

NestJS Application

Since we need at least one container, we decided to develop a small REST API with NestJS. We use NestJS as we have already gained experience with it. Basically any framework like ExpressJS (JavaScript), Flask (Python), Spring Boot (Java) etc. is suitable to create an easy (REST-)API.

To test the horizontal scaling by memory and CPU, we created two endpoints. The first endpoint tests the memory usage by appending a string to an array a few thousand times. The CPU test calculates roots in a loop to drive up the CPU load.
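
The two load tests described above could look roughly like the following TypeScript sketch. The function names and parameters are hypothetical, and this is not the original implementation; in a NestJS controller, such functions would presumably be called from two request handlers.

```typescript
// Memory test: append a string to an array many times so that the heap grows.
function allocateMemory(iterations: number, chunk: string): string[] {
  const store: string[] = [];
  for (let i = 0; i < iterations; i++) {
    store.push(chunk + i); // creates a new string per iteration
  }
  return store;
}

// CPU test: calculate square roots in a loop to keep the CPU busy.
function burnCpu(iterations: number): number {
  let sum = 0;
  for (let i = 1; i <= iterations; i++) {
    sum += Math.sqrt(i);
  }
  return sum; // returning the sum keeps the loop from being optimized away
}

const allocated = allocateMemory(5000, "load-test-");
const rootSum = burnCpu(1_000_000);
```

Larger iteration counts increase the load; the values above are arbitrary.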

For scaling using custom metrics, we have created an endpoint that returns the static metric sem_metric with the value 5.
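
Prometheus scrapes metrics in a plain-text exposition format. The following TypeScript sketch renders the static metric in that format; treating sem_metric as a gauge, the help text and the function name are assumptions, since the source only gives the metric name and its value.

```typescript
// Renders a single gauge in the Prometheus text exposition format.
function renderGauge(name: string, help: string, value: number): string {
  return [
    `# HELP ${name} ${help}`,
    `# TYPE ${name} gauge`,
    `${name} ${value}`,
    "", // the exposition format ends with a trailing newline
  ].join("\n");
}

// Response body that a metrics endpoint could return for the static metric.
const metricsBody = renderGauge("sem_metric", "Static demo metric", 5);
```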


To visualize our monitoring data we used Prometheus and Grafana as mentioned in the previous chapter. Basically Prometheus is there to retrieve/collect and store the metrics. Grafana on the other hand retrieves the data from Prometheus and visualizes the metrics in a dashboard. Thereby the workloads and scales are displayed in a clear and simple way. [5, 6]

Figure 5: Grafana Dashboard

Prometheus is a graduated project of the CNCF (Cloud Native Computing Foundation), and Grafana is a CNCF silver member [7]. This was one reason for choosing these services.

Horizontal Scaler

Before we scale horizontally, we need to know how the Horizontal Pod Autoscaler (HPA) works. As the name suggests, the HPA provides the basis for any horizontal scaling within a Kubernetes cluster. This covers not only scaling based on CPU and memory usage, but also adjusting the number of pods using custom metrics. These variants can also be combined as desired using a combined scaler.

Horizontal Pod Autoscaler

Before we can talk about scaling based on CPU/Memory usage, we need to clarify how Kubernetes actually collects these metrics. This will also be important later on for the chapter Custom Scaler.

In order to use autoscaling with these metrics, the HPA configuration files must use the API version autoscaling/v2beta2, because the stable version (autoscaling/v1) only supports the CPU metric. Scaling based on memory and custom metrics is only available in the beta versions of the API. In addition, several metrics can be specified simultaneously (the item metrics is now an array).

The HorizontalPodAutoscaler is implemented as a control loop that periodically retrieves the resource load specified in the configuration files (the default period is 15 seconds and can be set with the flag --horizontal-pod-autoscaler-sync-period).
The data can be retrieved from various APIs (Metrics API, Custom Metrics API and External Metrics API). We will not go into detail about the External Metrics API, as we did not use it in our example. The standard Metrics API can be used to scale by CPU and memory usage. For custom metrics, you must use the Custom Metrics API. This will be discussed later in the chapter Custom Scaler. [8]
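
In each iteration of this loop, the autoscaler derives a desired number of replicas from the ratio between the current and the target metric value. The following TypeScript sketch shows the basic formula from the Kubernetes documentation, clamped to the configured minimum and maximum. It deliberately omits details of the real controller such as tolerances, pod readiness and stabilization; the function name and the example values are hypothetical.

```typescript
// desiredReplicas = ceil(currentReplicas * currentValue / targetValue),
// limited to the range [minReplicas, maxReplicas].
function desiredReplicas(
  currentReplicas: number,
  currentValue: number,
  targetValue: number,
  minReplicas: number,
  maxReplicas: number,
): number {
  if (targetValue <= 0) {
    throw new Error("targetValue must be positive");
  }
  const raw = Math.ceil(currentReplicas * (currentValue / targetValue));
  return Math.min(maxReplicas, Math.max(minReplicas, raw));
}

// Two pods at 90 % average CPU utilization with a target of 50 %:
const scaledUp = desiredReplicas(2, 90, 50, 1, 10);
// scaledUp is 4, because ceil(2 * 1.8) = 4
```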

Scaling based on the CPU load

In order to scale based on CPU utilization, we need to create a HPA. The blueprint for this HPA is specified via a yaml file which includes different properties. The following gist shows an example for such a configuration file.

The most important setting is the metrics property (line 12). The name of the resource must be cpu (line 15). Furthermore, the target type must be Utilization (line 17). By defining the averageUtilization, the threshold on when to scale up or down is set. Example: when the calculated average utilization of all running pods exceeds the given value of the property averageUtilization, the HPA starts a new Pod. It is important that the target deployment is defined correctly so that the autoscaler knows where to scale what (line 6-9). In addition, it must be specified how many instances should run at least (line 10) and how many at most (line 11) simultaneously. This prevents the autoscaler from creating an infinite number of instances. Finally, this also protects the user from unnecessarily high bills.

Now we have to differentiate between the metric types Resource, External and Pods. External was not used in this example. The type Pods is used for automatic scaling with custom metrics, as described later in the chapter Custom Scaler. To access CPU and memory, Resource must be used as type. Under the item name, cpu is entered. At this point, the autoscaler knows which metric to look for.
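
The properties described in this section can be summarized in a sketch of such an HPA definition, written here as a TypeScript object. All names and values (nestjs-cpu-hpa, nestjs-app, the replica limits and the 50 % threshold) are hypothetical, and the layout does not match the line numbers mentioned above, which refer to the original YAML file. Only the structure follows the autoscaling/v2beta2 API.

```typescript
// Hypothetical HPA definition for CPU-based scaling.
const cpuHpa = {
  apiVersion: "autoscaling/v2beta2",
  kind: "HorizontalPodAutoscaler",
  metadata: { name: "nestjs-cpu-hpa" },
  spec: {
    // Target deployment that the autoscaler should scale.
    scaleTargetRef: { apiVersion: "apps/v1", kind: "Deployment", name: "nestjs-app" },
    // Lower and upper bound for the number of running instances.
    minReplicas: 1,
    maxReplicas: 5,
    metrics: [
      {
        type: "Resource",
        resource: {
          name: "cpu",
          // Scale when the average utilization of all pods exceeds 50 %.
          target: { type: "Utilization", averageUtilization: 50 },
        },
      },
    ],
  },
};

// kubectl accepts JSON manifests as well as YAML, so the object can be serialized directly.
const cpuHpaJson = JSON.stringify(cpuHpa, null, 2);
```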

Scaling based on the RAM usage

The configuration file for scaling based on memory has nearly the same structure. Only the name changes from CPU to memory (line 15 of the following gist). With AverageValue (line 17), the values of each pod returned by the metric API are summed up and divided by the number of Pods. Afterwards, this average is compared with the target average value (value that the user typed into the item AverageValue, see line 18) in order to increase or decrease the number of instances. 
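
The averaging step can be sketched in TypeScript as follows. The function names and the memory values (in MiB) are made up for illustration, and the real autoscaler additionally applies a tolerance before it reacts.

```typescript
// Sums the per-pod values and divides them by the number of pods.
function averageValue(perPodValues: number[]): number {
  if (perPodValues.length === 0) {
    throw new Error("at least one pod value is required");
  }
  const sum = perPodValues.reduce((acc, value) => acc + value, 0);
  return sum / perPodValues.length;
}

type ScalingHint = "scale up" | "scale down" | "keep";

// Compares the measured average with the configured target average.
function compareWithTarget(perPodValues: number[], targetAverage: number): ScalingHint {
  const average = averageValue(perPodValues);
  if (average > targetAverage) return "scale up";
  if (average < targetAverage) return "scale down";
  return "keep";
}

// Three pods using 300, 500 and 400 MiB average out at 400 MiB.
const hint = compareWithTarget([300, 500, 400], 350);
// hint is "scale up" because 400 MiB exceeds the target of 350 MiB
```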

Combined Scaler

Since the item Metrics in the yaml file is an array, several metrics can be specified, as shown in the example below. In this example, we simply combine the previously mentioned examples CPU and memory in one file. Now the autoscaler can use both values in order to decide how many instances should be started. If we now assume that three new instances are needed due to CPU utilization but only one new instance is needed due to memory load, the autoscaler will always choose the larger value and use it to adjust the number of pods.
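
The decision rule for several metrics can be sketched in TypeScript: each metric yields its own proposal for the number of replicas, and the largest proposal wins. The interface, the function name and the numbers are hypothetical; they are chosen so that CPU asks for three additional instances and memory for one, as in the example above.

```typescript
interface MetricReading {
  name: string;
  currentValue: number;
  targetValue: number;
}

// Computes one proposal per metric and returns the largest one.
function combinedDesiredReplicas(currentReplicas: number, metrics: MetricReading[]): number {
  if (metrics.length === 0) {
    throw new Error("at least one metric is required");
  }
  const proposals = metrics.map((metric) =>
    Math.ceil(currentReplicas * (metric.currentValue / metric.targetValue)),
  );
  return Math.max(...proposals);
}

// With 2 running pods, CPU proposes 5 replicas and memory proposes 3.
const combined = combinedDesiredReplicas(2, [
  { name: "cpu", currentValue: 125, targetValue: 50 },
  { name: "memory", currentValue: 450, targetValue: 300 },
]);
// combined is 5, the larger of the two proposals
```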

Custom Scaler

To scale services in a more controlled way, scaling the Pods based on CPU and memory is usually not sufficient. Kubernetes therefore makes it possible to use custom metrics with the HPA and to scale based on them.

In order for the HPA to access own metrics, these must be exported to the Kubernetes Custom Metrics API. To achieve this, the Kubernetes API must be extended to include the Custom Metrics API. The Aggregation Layer is responsible for extending the Kubernetes API.

In our example, the metrics from the pods are already collected by Prometheus. To export the custom metrics, we use the Prometheus adapter from “directxman”, which we installed via Helm. Helm is a package manager for Kubernetes. The installation via Helm has the advantage that the authorization of the adapter in the aggregation layer does not have to be set up manually.

Although the Prometheus adapter from Helm is already preconfigured, it must be adapted to our application. This is done by specifying the URL and port from which the metrics are to be obtained (see line 2 in gist below). We have also adapted the query to capture our metrics (line 8).

After the Prometheus adapter is installed, the custom metrics are collected. The following figure shows the process flow. Our applications (NestJS Pods) provide metrics that are collected by Prometheus. With the help of the Prometheus adapter, our metrics are exported from Prometheus to the Custom Metrics API of Kubernetes. The Horizontal Pod Autoscaler accesses our metrics via this API and adjusts the deployment. Afterwards, the required number of pods of this deployment is provided.

Figure 6: How Custom Scaler works

To check if the custom metric is used, the command kubectl get --raw /apis/ can be used. If the setup is done correctly, the custom metric will appear in the output:

  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "",
  "resources": [
      "name": "pods/sem_metric",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [

Finally, a new entry for your own metric must be made in the HPA YAML under the item metrics. As type we choose “Pods” (line 13 in the gist below), as we have assigned the metric to a Pod. As name we use sem_metric. Since the type is Pods, only AverageValue can be selected as target type (line 18). The average of all pods is used to calculate the scaling. The averageValue (line 19) is the value that should be reached. More details about the scaling algorithm can be found in the Kubernetes documentation.
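
A sketch of an HPA definition with such a metrics entry, written as a TypeScript object, could look like this. The resource names, the replica limits and the target value are hypothetical, and the layout does not match the line numbers of the original YAML file; only the structure of the Pods metric follows the autoscaling/v2beta2 API.

```typescript
// Hypothetical HPA definition that scales on the custom metric sem_metric.
const customMetricHpa = {
  apiVersion: "autoscaling/v2beta2",
  kind: "HorizontalPodAutoscaler",
  metadata: { name: "nestjs-custom-hpa" },
  spec: {
    scaleTargetRef: { apiVersion: "apps/v1", kind: "Deployment", name: "nestjs-app" },
    minReplicas: 1,
    maxReplicas: 5,
    metrics: [
      {
        // The metric is reported per pod, so the type is Pods.
        type: "Pods",
        pods: {
          metric: { name: "sem_metric" },
          // Pods metrics are averaged across all pods of the deployment.
          target: { type: "AverageValue", averageValue: "5" },
        },
      },
    ],
  },
};

// kubectl accepts JSON manifests as well as YAML.
const customMetricHpaJson = JSON.stringify(customMetricHpa, null, 2);
```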

After creating the HPA, the current status of the autoscaler can be retrieved:

Figure 7: Status of HPA

There are two values under the item Targets. The left value is retrieved from the Pods using the Custom Metrics API, so it is the actual value. The right value defines the state which is declared in the HPA YAML file and therefore expected.

Problems that were noticed

While experimenting with the CPU and memory metrics and our NestJS application, which is designed to provoke exactly these loads, we noticed during the memory tests that the Node.js server crashes after a certain amount of time. This made testing a bit more difficult, but it also shows clearly that Kubernetes simply restarts a pod once it is no longer available.

Experimenting sometimes felt like working with a black box, because there was a lot of trial and error and the monitoring was very slow. We were never able to see exactly at what point the system was scaling, because the visualization of the data only showed a change after an unpredictable delay, which was sometimes shorter and sometimes longer.

Even though we learned a lot, we did not reach a point within this semester where we felt confident enough to use Kubernetes in real-world projects.

Vertical Scaler

Vertical scaling within a Pod is controlled by the Vertical Pod Autoscaler (VPA). The CPU and memory of a Pod are automatically adjusted if the previous resources are no longer sufficient for the current task or the Pod requires less than it has available. The exact procedure in the cluster is as follows:

  1. The Vertical Pod Autoscaler regularly analyzes the Pod. If a Pod has too many or too few allocated resources, they are adjusted in the next step.
  2. The Vertical Pod Autoscaler starts a new pod that is assigned more or fewer resources than the previous one. This is similar to manually adjusting the limits in the deployment file. However, since these cannot be adjusted at runtime, the Vertical Pod Autoscaler creates a new pod in the Kubernetes Cluster.
  3. As soon as the new pod with the updated resources is available, the old pod is shut down.

The gist below is a configuration file for a VPA. The most important lines are 10 and 11, which set the updateMode within the updatePolicy to Auto. The other values of the configuration are very similar to the properties specified in the configuration files for horizontal scaling.
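
A sketch of such a VPA definition, written as a TypeScript object, could look as follows. The resource names are hypothetical, the layout does not match the line numbers of the original file, and the API version autoscaling.k8s.io/v1 is an assumption; which version is available depends on the cluster.

```typescript
// Hypothetical VPA definition with automatic updates enabled.
const vpa = {
  apiVersion: "autoscaling.k8s.io/v1",
  kind: "VerticalPodAutoscaler",
  metadata: { name: "nestjs-vpa" },
  spec: {
    // Deployment whose pods should be resized.
    targetRef: { apiVersion: "apps/v1", kind: "Deployment", name: "nestjs-app" },
    // Auto allows the autoscaler to replace pods with adjusted resources.
    updatePolicy: { updateMode: "Auto" },
  },
};

// kubectl accepts JSON manifests as well as YAML.
const vpaJson = JSON.stringify(vpa, null, 2);
```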

By default, vertical scaling is disabled in the Google Kubernetes Engine, and the documentation on the Kubernetes website is outdated. Furthermore, some commands did not work or had been renamed. We therefore assume that vertical scaling plays a minor role, and we recommend horizontal scaling, especially with custom metrics. These are very flexible and can be adapted efficiently to different use cases.


Kubernetes offers various possibilities to scale containers over several nodes. However, it requires a lot of know-how and a long training period. Since the development of Kubernetes is progressing very fast, you often find outdated or no longer working tutorials. Some things had to be found out via trial and error.

In some places Kubernetes seemed to be sluggish or like a black box. Sometimes we didn’t know whether the scaling had taken place or not because we had to wait for some time until we got updated values in the console or Prometheus.

With regard to REST APIs, one must ask oneself whether scaling is required at all. For our small test service, Kubernetes is certainly overkill, and it would be easier to run a monolithic system on a single server. However, as a backend gets bigger and receives more requests, pushing it to the limits of its resources, we can imagine using Kubernetes and splitting the backend into multiple services to get a scalable microservice architecture. A microservice architecture has other challenges that we have not considered so far, though. For example, you have to think about additional aspects like the communication between services.

Considering the huge effort, Kubernetes is unsuitable when time to market is important. To avoid this effort and still be able to develop automatically scalable backends, serverless platforms can be used. Google with Firebase or Amazon with AWS Lambda, for example, offer corresponding solutions, which have become more flexible and more straightforward to use.


[1] Nigel Poulton. The Kubernetes Book. 2017.
[3] Chris Richardson. Microservices Patterns. Manning Publications, 2018.