Queueing Theory and Practice – OR: Crash Course in Queueing

What this blog entry is about

This entry is based on the paper “The Essential Guide to Queueing Theory” by Baron Schwartz of VividCortex, a company that develops database monitoring tools.
The paper provides a somewhat opinionated but very approachable overview of queueing theory. It relates the topic to many everyday situations where queueing occurs and then introduces a couple of notation methods and formulas as a first approach to actual calculations.
This blog entry is a summary of that paper, but it focuses on queueing in the context of ultra large scale web services and adds examples and information from my own knowledge and some basic research.

The goal of this blog entry is to give fellow computer science and media programmers who are considering work in the field of web services an overview of queueing that might at some point help them understand, and maybe even make, infrastructure and design decisions for large scale systems.

Note that some of the notation does not exactly match Schwartz’s paper, because there is no real consensus between it and other sources either. Instead, the variants that fit this summary best have been chosen.

Queueing and intuition

The paper sums up the issue as “Queueing is based on probability. […] Nothing leads you astray faster than trusting your intuition about probability. If your intuition worked, casinos would all go out of business, and your insurance rates would seem reasonable.”
Due to the skills required for survival in prehistoric times, as well as the impressions we gather in everyday life, the human mind is best suited for linear thinking. A Neanderthal knew that gathering twice as many edible fruits would feed him twice as long (just imagine yourself today with snacks in the supermarket : ) ).

An exception to this is movement prediction, where we know intuitively that a thrown object will fly along a parabolic path. Even that reaches its limits when we think, for example, of the famous “curved free kick” (keyword: Magnus effect).
However, when we reason about things purely in our heads, we tend to fall back on linear proportions between two values.

As we will see and calculate later, the relations in queueing theory are not merely parabolic but sharply non-linear. A value can grow steadily for a long while and then, past a certain point, leap rapidly towards infinity.
Let’s look at an example: you have a small web server providing a website that allows the user to search and display images from a database.

Depending on how broad the search is, preparing and sending the search results to the user takes a random amount of time between one and three seconds – two seconds on average. You also expect 25 customers per minute on average.
Now the question is, how long will the user have to wait on average?

Intuitively (at least if you are not a software dev ; ) ) you might say: barely over two seconds. After all, handling 25 requests of 2 s each requires just 50 seconds and thus only 5/6 (83.3%) of the system’s time.
Unfortunately, the reality would be a website with 5 seconds of wait time on average, and far higher outliers.

The reason is that with random processing durations as well as random arrival times, requests will naturally overlap and idle time is inevitably wasted. Whenever the server has no request to handle, those seconds are entirely lost. When multiple requests occur simultaneously later, it is not possible to “use up” that spare idle time from earlier. Instead, time keeps flowing and the requests have to be enqueued and wait.

The following graph shows the relation between the “residence time”, which is the whole time between arriving with a request and leaving with the response (i.e. the page load time), and the “utilization” of the server.

“Hockey Stick Graph” visualizing the relation between residence time and server utilization

We see a non-linear curve that is sometimes called the “hockey stick graph” due to its distinctive shape. It shows well how the tipping point lies somewhere around 80% utilization. Past this point, the average wait time skyrockets.

Residence time and utilization are only two of several values that we need to define in order to talk about queueing theory on common ground.

The common ground

To establish a common set of terms and abbreviations, see the following table.

Metric | Unit | Symbol | Description
Arrival rate | Requests per time | A | The frequency of new requests to the system.
Queue length | Waiting requests | Q | How many requests are waiting in a queue on average.
Requests (in concurrency) | Requests | R | Total requests currently waiting or being serviced.
Wait time | Time | W | Time requests spend waiting in a queue.
Service time | Time | St | Time a request needs to be serviced on average, for example how long it takes to assemble the website data.
Residence time (latency) | Time | Rt | Total time from placing the request to receiving the output. Ignoring data transfer delays, this is naturally W + St.
Utilization | Fraction | U | Fraction of the total time the servers are busy. Its complement (1 – U) is the idle time.

Basic formulas

Many of the parameters listed above are related to each other through a handful of basic formulas that are mostly trivial once you understand them.
The first and most common one is referred to as “Little’s Law”, as it was formulated and proven in 1961 by John D. C. Little:

R = A * Rt

Where R is the total number of requests in the system (in the queue and currently being serviced), A the arrival rate and Rt the total time of a request from arrival to having been serviced.
The relation is fairly straightforward: the longer requests stay in the system and the more often they arrive, the more requests will accumulate in the system.
This relationship can be resolved for the queue length as well:

Q = A * W -> Queue length = arrival rate * wait-time in queue

Another important formula is the Utilization Law:

U = A * St -> Utilization = arrival rate * service time of a request

Logical: the more requests arrive and the longer they need to be serviced, the higher the utilization will be on average. To calculate this for multiple servers, simply divide by the number of servers. Of course this is only the theoretical approach, as real web servers will inevitably have some overhead through load balancing.

Last, we have another important formula, which says that the residence time Rt, i.e. the latency of a request, is equal to the service time divided by the idle fraction of the server (1 – utilization).

Rt = St / (1-U)
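Taken together, these basic laws are easy to put into code. The following is a minimal sketch in Go with made-up measurements (20 requests per minute, 2 s of service time, 6 s of measured residence time), purely to show the arithmetic:

```go
package main

import "fmt"

func main() {
	// Hypothetical measurements; only the formulas come from the text above.
	A := 20.0 / 60.0 // arrival rate in requests per second
	St := 2.0        // average service time in seconds
	Rt := 6.0        // measured average residence time in seconds

	U := A * St // Utilization Law: U = A * St
	R := A * Rt // Little's Law: R = A * Rt
	W := Rt - St
	Q := A * W // Little's Law applied to the queue alone: Q = A * W

	fmt.Printf("U = %.2f, R = %.2f, W = %.1fs, Q = %.2f\n", U, R, W, Q)
	// Prints: U = 0.67, R = 2.00, W = 4.0s, Q = 1.33
}
```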

Those formulas alone, however, do not let you predict much about how a real system will behave, because they require you to already know a crucial quantity like the average queue length, the wait time in the queue or the utilization.
To compute any of those, a more advanced set of formulas is needed. However, before we can calculate anything for a system, we first need to decide how to describe such a system.

Kendall’s notation

A way to do exactly that and describe approximately how a service system has been designed is “Kendall’s notation”. It allows systems to be distinguished by their major parameters.
Up to six parameters, each denoted by a letter, are separated by slashes:

A/T/S/P/R/D

Unfortunately, the exact letters for the individual parameters differ a lot between sources. This blog entry therefore describes a variant that seemed sensible to the author. Most sources use exactly the same order of parameters, though.
Of all parameters, the first three are the most significant, as the others can keep their default values for many systems.

A

The first parameter describes the behavior with which service requests arrive. Most commonly it is one of the following letters:

  • M: Memoryless (Markovian); requests occur randomly and independently of each other (there is no memory of the last request), which results in exponentially distributed inter-arrival times. G stands for a general, arbitrary distribution.
  • Mx: Random occurrence of batches of x requests at once.
  • D: Degenerate distribution; a deterministic or fixed time between requests.
  • Ek: Erlang distribution with shape parameter k.

T

T describes the service time distribution. The same letters as for (A) are common.

S

S describes the number of servers that process requests in parallel.

P

P denotes the number of places in the system, including requests currently in service and those waiting in queues. When this limit is reached, new requests are denied.
If omitted, this parameter is ‘infinite’ (INF).

R

R is the number of possible requesters. This is only relevant if the number of requesters is relatively low, because when a significant fraction of them is already waiting in queues, new requests naturally become scarcer.
If omitted, this parameter is ‘infinite’ (INF).

D

D describes the de-queueing behavior. The following abbreviations are common:

  • FIFO: First in first out
  • LIFO: Last in first out
  • SIRO: Service in random order
  • PNPN: Service order based on a priority value for every request

If omitted, this parameter is ‘FIFO’ by default.

Example

Following Kendall’s notation, a possible description of a web service system is:

M/M/(s)/(b)/INF/FIFO

Where (s) is the number of requests that can be processed in parallel and (b) is the total number of places for requests in memory.
A service system described like that expects requests at random intervals*, each requiring a random amount of time to process. It assumes an infinite number of potential requesters, because even for the largest system it is not feasible to handle a significant fraction of all people with internet access at the same time**. It uses FIFO queues.

* Of course a system can make predictions, such as higher demand at certain times of the day, but that is not immediate enough because it is not a correlation between two requests. However, for some types of sites (social media, for example) it can be assumed that after a user’s first access, subsequent requests will follow as he keeps using the site. This type of behavior cannot be modelled with Kendall’s notation.

** However for services available only to special, registered users the number may be limited.

Predicting the Behavior of single-server designs

A major benefit of these relatively theoretical models of queueing systems is the possibility to make certain predictions about their behavior in practice. The two numbers of interest are queue length and wait time.

In the context of a large scale service, the queue length is relevant to determine the required memory for the system (or for every particular machine).
The wait time, on the other hand, is significant for the user’s satisfaction with the whole service.
Indirectly, the results of those predictions determine the performance required to avoid a utilization that results in high waiting times (both tending towards infinity), as seen in the “hockey stick graph” earlier, and usually in long queues as well.

First it makes sense to look at the formula generating said graph for a queue with the Kendall notation M/M/1. That means random (memoryless) arrival and service times, infinitely many requesters, infinite queue memory and a FIFO principle – all running on a single server.
The formula is:

Rt = St / (1 – U)

Where Rt is the total residence time, St the service time and U the utilization of the server.
Following this, the residence time is proportional to 1/(1-U). That factor is often referred to as the stretch factor, because it describes how much the total time in the system is stretched compared to the bare time needed to process a request once it has left the queue.

Try out the formula with this interactive online tool (U has been used as x and Rt as y).
You will notice that the lower St is, the further the critical utilization point can be pushed out. As a rule of thumb: halving the idle capacity doubles the average response time!

Using the graph formula together with “Little’s Law” mentioned earlier, it is possible to compute further values related to the queue:

R = U / (1-U)

Where R is the average number of customers currently in the system (in queue and currently serviced).

Q = U² / (1 – U)

Where Q is the average queue length.

Eventually the time W requests wait in a queue before being serviced can be computed on average as:

W = U*St / (1-U)

Where St is the service time and U the utilization of the server.
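For the single-server M/M/1 case, all of these values follow from the arrival rate and the service time alone. A small Go sketch with made-up numbers (again 20 requests per minute and 2 s of service time) ties the formulas of this section together:

```go
package main

import "fmt"

func main() {
	A := 20.0 / 60.0 // arrival rate in requests per second (made up)
	St := 2.0        // average service time in seconds (made up)

	U := A * St           // utilization (Utilization Law)
	Rt := St / (1 - U)    // residence time: the "hockey stick" formula
	R := U / (1 - U)      // average number of requests in the system
	Q := U * U / (1 - U)  // average queue length
	W := U * St / (1 - U) // average time spent waiting in the queue

	fmt.Printf("U=%.2f Rt=%.1fs R=%.2f Q=%.2f W=%.1fs\n", U, Rt, R, Q, W)
	// Prints roughly: U=0.67 Rt=6.0s R=2.00 Q=1.33 W=4.0s
	// Raise the arrival rate to 29 requests per minute (U ≈ 0.97) and Rt
	// jumps to about 60 s – the hockey stick in action.
}
```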

Predicting the behavior of multi-server designs

The “correct” method: the Erlang formulas

Agner Erlang, who pioneered the field of telecommunications, formulated a series of equations for it – for example one to predict how many telephone lines would be needed to carry an expected volume of calls.
A modern form of the formula involves the unit known today as the “Erlang”, which describes the service demand. The number of Erlangs equals the amount of concurrency in the optimal case and therefore the minimum number of servers required to handle all requests.

A practical example are backbone telephone lines. Naturally, one telephone line can “service” 60 minutes of talk in one hour. That corresponds to exactly 1 Erlang.
Now if the real requests sum up to 600 one-minute calls in one hour, that results in 600 minutes of talk and therefore 10 Erlangs.
In practice, of course, a backbone with 10 lines and said demand would mean that calls often had to wait for a free line.

This is where Erlang’s formula ‘C’ comes into play:

P_wait = (A^M / M!) / ( (A^M / M!) + (1 – A/M) * Σ(k=0..M-1) A^k / k! )

Where A is the request load in Erlangs and M is the number of servers.

This formula, with its internal sum over all smaller server counts, calculates the probability that a new request finds no free line and therefore has to wait in a queue.
A is the request load (the current demand) in Erlangs and M is the total number of servers (or telephone lines in this example).

Wolfram Alpha allows to see this formula in action.

As expected, for the edge case of 10 Erlangs of demand and 10 available servers, the probability that a new request has to wait is practically 100%, because the probability that all calls coincidentally line up exactly one after another is negligibly low.
With 11 servers or telephone lines, which allow more overlap, the result of the Erlang C formula already drops to about 68%.
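These numbers are easy to reproduce. The Go sketch below uses the recursive Erlang B form (numerically friendlier than evaluating the factorials directly) and converts it to Erlang C; the function name is mine, not from the paper:

```go
package main

import "fmt"

// erlangC returns the probability that a new request has to wait when a load
// of a Erlangs is offered to m servers (a true probability only for a < m;
// at a = m it returns 1, matching the saturated case described above).
func erlangC(m int, a float64) float64 {
	b := 1.0 // Erlang B value for zero servers
	for k := 1; k <= m; k++ {
		b = a * b / (float64(k) + a*b) // standard Erlang B recursion
	}
	return float64(m) * b / (float64(m) - a*(1-b))
}

func main() {
	fmt.Printf("10 Erlangs, 10 servers: P(wait) = %.2f\n", erlangC(10, 10)) // ≈ 1.00
	fmt.Printf("10 Erlangs, 11 servers: P(wait) = %.2f\n", erlangC(11, 10)) // ≈ 0.68
}
```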

Applying “Little’s Law”, it is again possible to derive further formulas to compute the desired values. However, Erlang’s formulas are not easy to apply and rather unintuitive. For this reason, an approximation has been found.

The approximate method

For the case of a single waiting queue served by Xs servers, a modification of the earlier basic formula can be used:

Rt = St / (1-U^Xs)

Where St is still the service time and U the utilization.
By applying the number of servers as an exponent to the utilization, the formula reduces to the old formula for the case of one server. For other cases it only underestimates the total residence time, by up to about 10%. More information can be found in Chapter 2 of “Analyzing Computer System Performance with Perl::PDQ” (Springer, 2005) by Neil Gunther.
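A quick sketch of this approximation shows how adding servers at the same per-server utilization pushes the residence time back towards the bare service time (the numbers are made up):

```go
package main

import (
	"fmt"
	"math"
)

// residenceApprox implements Rt = St / (1 - U^Xs) from above.
func residenceApprox(st, u float64, xs int) float64 {
	return st / (1 - math.Pow(u, float64(xs)))
}

func main() {
	st, u := 2.0, 0.8 // 2 s service time, 80% utilization per server
	for _, servers := range []int{1, 2, 4, 8} {
		fmt.Printf("%d server(s): Rt ≈ %.1f s\n", servers, residenceApprox(st, u, servers))
	}
	// Prints roughly 10.0, 5.6, 3.4 and 2.4 seconds – the more servers share
	// one queue, the closer the residence time gets to the service time.
}
```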

One common queue or one queue per server?

Using the formulas, the answer to this question is always that a single queue is more efficient. The logical reason becomes clear if we keep in mind that processing a request also takes a random amount of time. This can result in a situation where one server is occupied with a large request while another server has already handled its whole queue and now has to idle.
This is why server infrastructures tend to use “load balancers” that maintain a queue of user requests and spread them across the servers.
However, because transmitting requests from the balancer to a server takes time too, the servers usually hold small queues themselves to ensure they can work continuously.

Nevertheless, sophisticated algorithms are required for load balancers to ensure a stable system especially under unusually high load or when servers drop out. This topic is handled in other blog entries.

You can experiment with the Erlang C formula modified to compute the residence time depending on the number of servers here.
This shows how a higher number of servers also allows for better utilization of the whole system before the wait time rises too much.

Problems and Limitations

Certain precautions have to be taken before the computations above can be used in a reliable way.

Exact measurements

The two key values of the system, the service time St and the current utilization U, need to be known accurately. This can be tricky in server environments where random network and infrastructure delays are added under certain circumstances, or where servicing can require waiting for other, secondary services (common nowadays when building serverless websites).
A quality measure of modern server management systems is their ability to determine these values, especially the utilization of the system.

Ensuring exponential distribution

While this is typically the case, it has to be ensured that service times actually follow a distribution sufficiently close to an exponential one.

Cross-dependency

Modern websites are often built “serverless” and require assembling data from different sources. It is possible to pick an abstraction and only look at the desired layer of services (for example only the outer layer that delivers to the users, or the inner system that delivers only the images to the other processes that assemble the website). However, things become less predictable when a sub-service uses a resource that is currently needed by a different request.

Conclusions

For those reasons, the author of the paper this blog entry is based on suggests that the queueing theory described above is best suited for everyday and physical problems, where it can be applied to relatively simple decisions. Additionally, it can be used to find possible causes of problems when an actual system does not behave as assumed. In the end it mainly helps to adjust people’s intuition, which is often wrong due to the non-linear relations in queueing problems.

Personally, I find that knowing about queueing theory and having experimented with the formulas can indeed open one’s eyes to what is actually predictable and what is not. Furthermore, together with modern observation tools for server systems, I would definitely suggest trying to apply the concepts and formulas to verify whether a certain system is working in an optimal way or not. Last but not least, they can form a starting point when beginning to design a server infrastructure.

Writing High Performance Code on Modern Hardware

Today, with modern hardware combined with optimized, high-performance code, it is an easy task to process more than 500 million images per day on a single machine. Small improvements in the underlying implementations can have an extremely large impact on the execution time and are therefore fundamentally important for handling the huge amounts of data in modern large scale systems. Furthermore, by optimizing our implementations we can dramatically reduce the costs of our infrastructure and stay competitive. To show how this is possible and which optimizations we can use in our everyday programmer life, let us have a look at the following imaginary task:

The Task

Our example task is to calculate the average color of an image as fast as possible. The task is inspired by various applications, like Google Images, which handle many images and show the average color of an image before all of its data is available on the device. This can be used to hide latency and to give the application a more fluid feeling. At first glance there is nothing special about it, but if we think about how many images are uploaded to modern large scale systems and how much computing power we would need to process them, things get pretty interesting.

The calculation of the average color is a good example to which we can apply fundamental optimizations. To compute the average color, we have to process every pixel of an image, which can be a very expensive task, especially if we have to deal with high resolution images – and, unfortunately for the developers, modern devices have high resolution cameras and displays which further increase the computational costs. To get the correct average color, we would have to square the individual channel values and calculate the average based on those squared values. In our case, however, it is enough to simply add up the values of every color channel without squaring them and divide by the number of pixels. This gives us a slightly darker average color, which is no problem for our task, and we will additionally gain performance from this simplification, as we will see in the following sections.
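To make the task concrete, here is a minimal sketch of this simplified computation. It is written in Go purely for illustration (the measurements in this post come from C++, Java and Python implementations), and the interleaved 8-bit RGBA layout and all names are assumptions of the sketch:

```go
package main

import "fmt"

// averageColor sums each channel over all pixels and divides by the pixel
// count. The squaring step mentioned above is deliberately skipped, so the
// result is the slightly darker average that is good enough for our task.
func averageColor(pixels []byte) (r, g, b byte) {
	var sumR, sumG, sumB uint64
	n := uint64(len(pixels) / 4)
	for i := 0; i+3 < len(pixels); i += 4 {
		sumR += uint64(pixels[i])
		sumG += uint64(pixels[i+1])
		sumB += uint64(pixels[i+2])
		// pixels[i+3] is the alpha channel and is ignored here.
	}
	return byte(sumR / n), byte(sumG / n), byte(sumB / n)
}

func main() {
	img := []byte{255, 0, 0, 255, 0, 0, 255, 255} // one red and one blue pixel
	fmt.Println(averageColor(img))                // 127 0 127
}
```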

Choosing a Language

Today, many programming languages exist, each with its own advantages and disadvantages, and it has become a challenge to choose the perfect language for a given task. If we want to write high-performance code, things get even more complicated, because every language has its own optimization techniques and we can find test cases where each of them beats the others. To find out which language is the right one for our purpose, I implemented the basic calculation of the average color in the languages I commonly use: C++, Java and Python with NumPy. Although I already expected that good old C++ would win this fight, I was surprised by the results:

Calculation time in % when calculating the average color value with different programming languages

One thing should be noted: these results should only be seen as a basic overview of how much improvement we can get by choosing different languages for specific tasks. These differences must not be transferred into a general comparison between the performance of these languages! The real improvements always depend on the specific task, and in this case the basic C++ implementation is already twice as fast as the Java solution and more than ten times faster than Python – without any further optimizations!

Know your Data

For further improvements we have to take a look at our data, because then we can find ways to optimize our memory accesses. In most cases, 8 bits per color channel (red, green, blue) are used to store the data. Additionally, an alpha value is stored as a fourth color channel to represent the transparency of the picture. So all in all we have four color channels per pixel, each containing a number between 0 and 255. Our pixels are stored as in the plain-PPM file format, where the RGBA values of all pixels are listed one after another. If we calculate one channel after another, we get low performance due to inefficient memory access. And if we use libraries of which we do not know exactly how they store the data, we can easily end up with inefficient memory access without even noticing: an imaginary API could store our images differently, for example with the color channels in four separate arrays. This can be useful in some cases, but combined with a function that accesses a single pixel, it creates memory-inefficient access patterns. Due to such an inefficient access, where we calculate one channel after another, the calculation time increases drastically:

Calculation time in % with bad or efficient memory access
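To illustrate what “bad” versus “efficient” access means here, the following Go sketch (illustrative only; the benchmark in this post used C++) contrasts a single sequential pass over an interleaved RGBA buffer with summing one channel after another, which walks the whole buffer three times with a stride of four bytes:

```go
package main

import "fmt"

// sumSequential touches each pixel once, front to back – cache friendly.
func sumSequential(pixels []byte) (r, g, b uint64) {
	for i := 0; i+3 < len(pixels); i += 4 {
		r += uint64(pixels[i])
		g += uint64(pixels[i+1])
		b += uint64(pixels[i+2])
	}
	return
}

// sumChannelByChannel walks the same buffer three times with stride 4,
// using only a quarter of every cache line it loads on each pass.
func sumChannelByChannel(pixels []byte) (r, g, b uint64) {
	for i := 0; i < len(pixels); i += 4 {
		r += uint64(pixels[i]) // pass 1: red
	}
	for i := 1; i < len(pixels); i += 4 {
		g += uint64(pixels[i]) // pass 2: green
	}
	for i := 2; i < len(pixels); i += 4 {
		b += uint64(pixels[i]) // pass 3: blue
	}
	return
}

func main() {
	img := make([]byte, 4*1024) // dummy image, 1024 RGBA pixels
	r1, g1, b1 := sumSequential(img)
	r2, g2, b2 := sumChannelByChannel(img)
	fmt.Println(r1, g1, b1, r2, g2, b2) // same sums, different memory behavior
}
```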

We as programmers are more familiar with integer or float data types, which are in most cases represented by 32 bits. We use them basically everywhere as long as enough memory is available, even if we could use smaller data types. We do not care about the small decrease in our memory footprint, but reduced memory consumption is not the only thing we can get from more suitable data types. They can also bring additional performance improvements:

Calculation time in % with different datatypes

Now our compiler has additional information about the data and can apply more, or even better, optimizations. With this small change we reduce the calculation time by more than 40% – just by storing our data in a char with 8 bits instead of an integer with 32 bits!

Know your Hardware

If we know our hardware, we can use further optimization techniques to get the best out of our system. Modern CPUs come with many cores and therefore with huge performance potential. Additionally, they support a technique called vectorization, with which the hardware can perform multiple calculations in fewer steps, and if that is not enough, we can also utilize the raw computing power of modern GPUs.

Vectorization

Vectorization uses special hardware registers with which it is possible to make calculations faster. These registers are limited in size, often 128 or 256 bits, and we can fill them with our data. In our case, we add 4D vectors together. Normally, we have to perform four additions, one for each element of the vector, but with vectorization this can be done in a single step. First, I implemented a basic SIMD (Single Instruction Multiple Data) vector calculation where two 4D vectors, each stored in 128 bits, are added together in a single step. But this simple approach increased rather than reduced the calculation time – how can this be? Our compiler does a great job of optimizing our code and already tries to use vectorization automatically! This is especially visible in the performance improvements we got by using 8 bits to store our data: the compiler could detect this and add even more values together in a single step with automatic vectorization.

It was not an easy task to implement a faster vectorization solution by hand, but we can still get some improvements by using AVX2 (Advanced Vector Extensions 2) instructions with 256-bit registers. We could store 32 8-bit values in these registers, but because we need more bits to store our sums, this representation is not enough. The next bigger data type has 16 bits, with which we can add 16 values of 16 bits each in a single step. With 16 bits we can sum 256 8-bit values without losing data (as long as we do not square the values), and with this knowledge we again get performance improvements:

Calculation time in % with vectorization techniques
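The overflow argument from the last paragraph can be illustrated without any intrinsics. The scalar Go sketch below uses the same widening idea: 8-bit values are accumulated into a 16-bit sum in blocks of at most 256 elements (256 * 255 = 65,280 still fits into 16 bits) and only then flushed into a wide total. The AVX2 version described above does the same thing, just 16 lanes at a time:

```go
package main

import "fmt"

// sumBlocked sums 8-bit values through a 16-bit accumulator that is flushed
// into a 64-bit total every 256 elements, so the 16-bit sum can never overflow.
func sumBlocked(values []byte) uint64 {
	var total uint64
	var block uint16
	for i, v := range values {
		block += uint16(v)
		if i%256 == 255 { // block full: flush the partial sum
			total += uint64(block)
			block = 0
		}
	}
	return total + uint64(block) // flush the final, possibly partial block
}

func main() {
	data := make([]byte, 1000)
	for i := range data {
		data[i] = 255 // worst case for overflow
	}
	fmt.Println(sumBlocked(data)) // 255000
}
```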

Multiprocessors

Modern CPUs are multiprocessors: performance gains come from parallelization instead of the nearly impossible further increase of the clock rate. By using multiple cores we can distribute the work and fully utilize the CPU. For our task we use six threads, corresponding to six hardware cores, where every thread calculates the average color of an individual image. Because the threads never access the same data, we are free of race conditions, which makes our life easier. With six hardware cores we would expect to process six times more images, but starting and waiting for threads also consumes time, so we end up with an implementation that is about 4.5 times faster than the single-threaded version.
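In Go, the same one-image-per-worker idea can be sketched with goroutines and a wait group standing in for the six C++ threads (averageBrightness is just a placeholder for the real per-image work):

```go
package main

import (
	"fmt"
	"sync"
)

// averageBrightness is a stand-in for the real per-image computation.
func averageBrightness(img []byte) byte {
	var sum uint64
	for _, v := range img {
		sum += uint64(v)
	}
	if len(img) == 0 {
		return 0
	}
	return byte(sum / uint64(len(img)))
}

func main() {
	images := make([][]byte, 6) // one image per worker, as in the text
	for i := range images {
		images[i] = make([]byte, 1024)
	}

	results := make([]byte, len(images))
	var wg sync.WaitGroup
	for i, img := range images {
		wg.Add(1)
		go func(i int, img []byte) {
			defer wg.Done()
			// Each goroutine works on its own image and its own result slot,
			// so the workers never share data and there are no races.
			results[i] = averageBrightness(img)
		}(i, img)
	}
	wg.Wait()
	fmt.Println(results)
}
```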

GPUs

The next step to get more performance is to use GPUs. GPUs are a great choice when it comes to raw computing performance, thanks to their hardware architecture. To keep it simple: GPUs have far more cores than CPUs, but GPU cores are more lightweight than CPU cores, so we do not have thousands of individual CPU-class cores running concurrently on a GPU. But if we are aware of the hardware architecture, they can be executed nearly concurrently and we can get huge performance improvements, especially for computation-intensive tasks. Many programmers have never touched GPU programming, but today it is quite easy to get good performance, even without heavy optimization or deep hardware knowledge. For our task, even a very simple and unoptimized OpenCL solution is better than our optimized multicore C++ implementation.

We perform a simple parallel sum on our color vectors, starting as many GPU threads as there are pixels in the image. First, every GPU thread loads a single value into local memory, then we calculate the sum of 256 elements and store the average color of these elements on the GPU. We repeat these steps until we have the average color of the whole image – and that is basically all we have to do on the GPU side to get a 25% faster solution! Another advantage is that GPUs often scale better with larger data than CPUs do, which is very helpful for our high resolution images:

Relative execution time on different resolutions

Conclusion

The programming language we choose and suitable data types can heavily improve performance without complicated optimizations. This is a simple way to write more performant code, and we can integrate these easy changes into our everyday work. If we want to optimize further, we quickly get lost in the endless space of possible techniques and hardware-dependent optimizations, but even with common techniques we can achieve great improvements: we can use vectorization, multicore CPUs and GPU programming to get the best out of our systems. Especially with GPU programming we can get great results even without heavy optimization; moreover, GPU programming has become much easier in the past years and is easy to adopt in our systems. With these techniques it was possible to reduce the calculation time of our example task to less than 1% of the simple Python implementation. It is always hard to talk about general optimization techniques, but I hope that the results of our imaginary task give some motivation and suggestions of what can be achieved with modern hardware and optimized code:

Calculation time in % for all optimizations

Apache Kafka – The one Stream Processor to rule them all?

If there is one statement that can be made about the current developments in the realm of distributed systems, it is probably that most developers are turning away from centralised, monolithic architectures and moving towards microservice architectures. This type of architecture has proven itself to be much more flexible and robust for the modern world, where more and more software is offered as a cloud-based solution. By splitting systems into smaller parts, they can be updated more easily and crashed services can be recovered faster. These services can be containerized with Docker, so quickly putting up and pulling down parts of the infrastructure has become very easy. On most occasions it is simply less work to trash a running software instance and recreate it than to log into the instance and try to fix what is broken.

Continue reading

A Dive into Serverless on the Basis of AWS Lambda

Hypes help to overlook the fact that tech is often reinventing the wheel, forcing developers to update applications and architecture accordingly in painful migrations.

Besides Kubernetes, one of those current hypes is serverless computing. While everyone agrees that serverless offers some advantages, it also introduces many problems. The current trend also shows certain parallels to CGI, PHP and co.

In most cases, however, investigations are limited to the problem of cold start time. This article will therefore explore serverless functions and their behavior, especially when scaling them out, and will provide information on the effects this behavior has on other components of the architecture stack. For example, it is shown how the scaling-out behavior can very quickly kill the database.

Continue reading

Improved Vulnerability Detection using Deep Representation Learning

Today’s software is more vulnerable to cyber attacks than ever before. The number of recorded vulnerabilities has increased almost constantly since the early 90s. The strong competition on the software market, along with the many innovative technologies released every year, forces modern software companies to spend more resources on development and fewer on software quality and testing. In 2017 alone, 14,500 new vulnerabilities were recorded in the CVE (Common Vulnerabilities and Exposures) database, compared to 6,000 in the previous year. This trend will continue in the years to come. [1]

Continue reading

Federated Learning

The world is enriched daily with the latest and most sophisticated achievements of Artificial Intelligence (AI). But one challenge that all new technologies need to take seriously is training time. With deep neural networks and the computing power available today, it is finally possible to perform the most complex analyses without the need for pre-processing and feature selection. This makes it possible to apply new models to numerous applications. The only things needed are tons of training data, a good model and a lot of patience during training.

Continue reading

About using Machine Learning to improve performance of Go programs

Gophers

This blog post contains some thoughts on learning the sizes that arrays, slices or maps are going to reach, using Machine Learning (ML), in order to increase a program’s performance by allocating the necessary memory in advance instead of reallocating every time new elements are appended.

What made me write this blogpost?

Well, first of all I had to, because it is part of the lecture Ultra Largescale Systems (ULS) I attended last winter term. But as an introduction I’ll tell you what made me choose this topic: I started to learn Golang, and coming from mainly Java, Python and JavaScript, the concept of arrays with fixed sizes and slices wrapped around them for convenience was new to me. When I understood that initializing them with the correct capacity is good for performance and memory usage, I always tried to do so – until I came to a use case where I could not know the capacity in advance. At almost the same time we talked about “ML for Systems” in the ULS lecture, where the power of ML is used to speed up databases, load-balance Elasticsearch queries and other things. So I came up with the idea of ML for programming languages, in this case for learning capacities in Golang. By the way, I wanted to try out ML in Go, which is said to bring some performance advantages compared to Python and to be easier to deploy. But neither ML in Go (Go for ML) nor ML on Go is the topic of this post, though both appear in some parts.

The goal in more detail

As explained in various blog posts like here and there, arrays in Go have fixed sizes. For convenient manipulation they can be wrapped by slices. Appending to a slice that has reached its capacity therefore requires creating a new slice with a larger underlying array, copying the contents of the old slice to the new one and then replacing the old one with the new one. This is what the append function does. That this process is more time consuming than appending to a slice with sufficient capacity can be shown with some very simple code that just appends 100 elements to a test slice in a loop – once with the slice initialized with a capacity of zero and once with a capacity of 100. For both cases we measure the duration and, since those durations vary even for the same kind of initialization, we run this 1000 times each and calculate the average duration to get more meaningful results. The averages are calculated by the method printSummary, which is left out here in order to keep track of things. However, the whole code can be found on GitHub.

https://gist.github.com/hslr4/4b069e609e6a6904685e8912a6b98c02
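For readers who do not want to open the gist, the benchmark described above looks roughly like this condensed sketch (my own identifiers, not the author’s exact code):

```go
package main

import (
	"fmt"
	"time"
)

// timeAppends appends n elements to a slice that starts with the given
// capacity and returns how long that took.
func timeAppends(n, capacity int) time.Duration {
	start := time.Now()
	test := make([]int, 0, capacity)
	for i := 0; i < n; i++ {
		test = append(test, i)
	}
	_ = test
	return time.Since(start)
}

func main() {
	const n, runs = 100, 1000
	var zeroCap, fullCap time.Duration
	for i := 0; i < runs; i++ {
		zeroCap += timeAppends(n, 0) // repeated reallocations
		fullCap += timeAppends(n, n) // one allocation up front
	}
	fmt.Println("avg with cap 0:  ", zeroCap/runs)
	fmt.Println("avg with cap 100:", fullCap/runs)
}
```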

As expected, the correctly initialized version is faster, with an average of 1714 ns, than the other one with an average of 2409 ns. Of course those durations are still just samples and vary if the code runs multiple times. But over 20 runs each, there was only one average value of the badly initialized slice that was lower than some of the well initialized ones.

If we also take a look at the capacity the slower version ends up with, we see that it is 128 instead of the required 100. This is because append doubles the capacity of a small slice whenever its limit is reached (larger slices are grown by a smaller factor).

So we can see that it is worth setting the capacity correctly in advance, for performance and resource consumption reasons. But this is not always as easy as in the example we just saw, and sometimes it is not even possible to know in advance the length a slice will grow to. In those cases it might make sense to let the program learn the required capacities. That could be helpful at initialization with make as well as when growing with append.

A basic example

Setup

To check feasibility, I created a basic example that is a bit more complex than the first one but whose result can still be calculated directly. It iterates over the index j and value s of a slice of random integer samples, and for each of them the test slice is created. Then we append three values s times and one value j times, so the final length (and required capacity) of test can be calculated as s*3+j.

Training data is also gathered in this loop. One sample consists of s and j as input and len(test) as label. Since the main goal of this scenario is to check whether it is worth using a trained ML model to predict the required capacity, this data is collected in every test case, to create equal conditions. Ways to avoid the time-expensive training and data collection at runtime are discussed later.

https://gist.github.com/hslr4/472eaf93cc7553b5328943782ad560ae
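Condensed, the loop and the data gathering look roughly like this (again my own identifiers; the author’s version is in the gist above):

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	samples := make([]int, 50) // random input sizes
	for i := range samples {
		samples[i] = rand.Intn(100)
	}

	var features [][]float64 // one row per sample: [s, j]
	var labels []float64     // the final length of test, i.e. s*3+j

	for j, s := range samples {
		var test []int
		for k := 0; k < s; k++ {
			test = append(test, 1, 2, 3) // append three values, s times
		}
		for k := 0; k < j; k++ {
			test = append(test, 4) // append one value, j times
		}
		features = append(features, []float64{float64(s), float64(j)})
		labels = append(labels, float64(len(test))) // len(test) == s*3+j
	}
	fmt.Println(len(features), len(labels))
}
```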

As implementation for the ML part I chose go-deep. I picked it from this list because it looked well documented, easy to use and sufficient for my needs, though not perfect.

I used the collected training data to train an MLP (Multi-Layer Perceptron) with two hidden layers containing two and five neurons. Of course I configured RegressionMode, using Identity as the activation function in the output layer and MSE (Mean Squared Error) as the loss function. I also played around with some other hyperparameters but kept a lot from the provided examples as well, because the MSE already decreased very fast and became 0.0000 after three training iterations. This is not surprising, since the function to learn is very simple. There is also no need to avoid overfitting in this basic example, but I kept some of the corresponding hyperparameters at low values anyway. In a real world use case one would probably try to keep the model as small as possible to get the quickest responses.

https://gist.github.com/hslr4/297e8608aae5116d2eadc157201dab69

Results

The following table shows the test cases I compared along with the average durations in nanoseconds calculated over 1000 tries each. Since those averages vary again from run to run the table contains three of them.

Test case | Avg ns run 1 | Avg ns run 2 | Avg ns run 3
Initialize capacity with zero | 12,790,501 | 14,267,925 | 14,321,735
Use s*3+j directly in make | 5,679,595 | 6,067,968 | 5,943,731
Use a function to calculate s*3+j | 5,242,182 | 6,012,920 | 5,515,661
Use the prediction of the learned model | 10,898,437 | 6,361,911 | 9,056,003
The model’s prediction +1 | 6,069,776 | 5,714,348 | 6,144,386
The model’s prediction on new random data | 10,165,764 | 6,096,929 | 9,296,384

Even though the durations vary, the results show that not initializing the capacity is worst. It is usually best to calculate the capacity if possible; it does not really matter whether the calculation happens in a function or directly. When I took a closer look at the model’s predictions, I saw that they are quite often exactly one less than the actual capacity. This is why I also added the prediction+1 test case, which is almost as good as the direct calculations. So investigating a bit deeper into what the model predicts is worth it – maybe some fine-tuning of the hyperparameters could also fix the problem instead of adding 1 manually. The results also show that the learned model works on completely new random data as well as on partly known data from the training.

Conclusion

Of course, creating such a model for a small performance optimization is heavily over-engineered and thus not worth it. It could be worth it in cases where you know you have a bottleneck at this place (because your profiler told you) and you cannot calculate the required capacity in any other way in advance. In the introduction I already mentioned that I had such a use case: there, the length of the slice depends on an sql.Rows object, which does not tell you how many rows it contains in advance. Other examples might be conditional appends, where you cannot know how many elements fulfill the condition to be appended to a slice, or something else. But also in those cases the required capacity might depend on something else – for example the current time, the size of the HTTP request that caused this action, or the length this slice reached the last time. In those cases, using an ML model might be helpful to avoid a performance bottleneck. Where there are dependencies on previous lengths, especially RNNs (Recurrent Neural Networks) might be helpful. At least they could probably give a better guess than the developer himself.

Looking ahead

As stated above, in examples like this the engineering effort is too high, so ways of automating it would be desirable. First I thought about a one-size-fits-all solution, meaning one pretrained model that predicts the required capacity for various makes. But it would be difficult to find good features, because they could change from make to make, and just using all sorts of possible features would create very sparse matrices and require larger models – if it could work at all.

So we should stick to use-case-specific models that can be smaller and use meaningful features depending on their environment, like the lengths of arrays, slices, maps or strings “close” to them, or the values of specific bools or integers. The drawback is that individual models need individual training, maybe with production-like data. Training during runtime would cause an overhead that might destroy the benefit the model could bring and slow the program down, at least for a while, until training can be stopped or paused because the ML model’s performance is good enough. So, if possible, pure online learning should be avoided and training on test stages or at times with low traffic should be preferred. If the length of a slice depends on the current traffic, this is of course not possible. Then one should at least dump the model’s weights to the logs from time to time, to be able to reuse them when starting a new node.

Still, we need to solve the over-engineering issue and try to build a model automatically at compile time if the developer asks for it, for example using an additional argument in the call to make. I think this might be another use case for ML on code: finding good features and parameters to build an ML model by inspecting the code. Unfortunately, I am not sure what such an ML model on code could look like and what it would require to train it. Using a GAN (Generative Adversarial Network) to generate the models would probably require already existing ones to train the discriminator. If this automation could be realized, the use case would also become broader, because then calculating the capacity would be more effort than just saying “learn it”.

Some final thoughts

Using ML would not magically boost performance. It would require developers to re-measure and double check whether it is worth using. For example, it is not only important how often the program needs to allocate memory but also where: stack allocation is cheap and heap allocation is expensive, as explained in this blog post. If using ML to predict the capacity requires the program to allocate on the heap, it might be slower even when the predictions are correct. In the test scenario above, all the cases except initializing with zero escaped to the heap. There it was worth it, but it needs to be measured. So the performance should be compared with and without learning, for short and for longer running applications. As another example, sometimes the required capacities might not be learnable at all, because they are almost random or depend on things that cannot be used as features in an efficient way.

Another drawback of using ML is that your code behaves less predictably. You will no longer know in advance what capacity will be estimated for a slice, and it will be much harder to figure out afterwards why the program estimated exactly what it did.

I also thought about training the model to minimize a mix of performance and required memory instead of using the final length as the label. But then it is not so easy anymore to get the training data. In some cases, however, it might also be difficult to get the “final” length of a slice at all.

The last thing to remember is that it is always helpful to give a learned model some boundaries, in this case a minimum and a maximum. My test model, for example, predicted a negative capacity before I got the hyperparameters right, which made my program crash. So if the model for some reason thinks that would be a great idea, a fixed minimum of zero prevents the worst. Such boundaries also make the program a bit more predictable again.

End-to-end Monitoring of Modern Cloud Applications


Introduction

During the last semester, as part of my Master’s thesis, I worked at an automotive company on the development of a vehicle connectivity platform. Within my team I was assigned the task of monitoring, which turned out to be a lot more interesting, but at the same time far more complex, than I expected. In this blog post I would like to give an introduction to system health monitoring and describe the challenges one faces when monitoring a cloud-based, highly distributed IoT platform. Following that, I would like to share the monitoring concept that resulted from my investigation.

Monitoring

When developing an application, monitoring is usually not the first topic that comes up among developers or managers. During my time at university, whenever we had a fixed date to present results as part of a group project, my colleagues and I usually shifted our focus towards feature development rather than testing and monitoring. The need for monitoring, however, becomes extremely important as soon as you have to operate a software system. Especially when maintaining distributed systems, which cannot be debugged like traditional monolithic programs, the need for enhanced monitoring grows. The biggest difficulty with monitoring is that it has to be designed specifically for the application that needs to be monitored; there is no one-size-fits-all solution. Within the scope of monitoring there are a lot of different use cases to focus on. When talking about operating a software platform, our focus lies on system health monitoring.

System health monitoring

Monitoring has a different meaning depending on the domain it is employed in. Efforts to monitor software systems can put special interest into topics concerning security, application performance, compliance, feature usage and others. In this post I would like to focus on system health monitoring, with the goal of watching software services’ availability and being able to detect unhealthy states that can lead to service outages.

Even when reduced to health monitoring, one implementation of monitoring can differ strongly from another, depending on how different the systems, the processes, the tools and the users and stakeholders of that monitoring information are defined. The basis for monitoring is constantly changing due to the introduction of new technologies and practices. Therefore, a state of full maturity for monitoring tools can hardly ever be reached.

Monitoring Tools

Monitoring tools were created and adapted to fit the domain of the systems that had to be monitored. Monitoring requirements for clusters of homogeneous hardware running the same operating system are substantially different from those for managing the IT infrastructure of a mid-sized to large company with heterogeneous hardware (e.g. databases, routers, application servers, etc.).

If you are building a new software application or have been assigned the task of maintaining one, here are a few example tools that focus on different aspects of monitoring that could be interesting to try out.

  • Collectd – Push based host metric collection: Collectd is able to collect data from dynamically started and stopped instances. It works by periodically sending host metrics from preconfigured hosts to a central repository. By being a push-based system, it works also for short-lived processes. [2]
  • StatsD – Application level metric collection: StatsD was designed as a simple tool that collects traces and sends them via UDP to a central collector, where they can be forwarded to a collection and visualization tool like Graphite to be used to render graphs that are displayed in dashboards. [3]
  • The Elastic-Stack – Log management and analysis: When applications are hosted inside containers running on separate hosts, it is not only difficult to extract a container’s low-level metrics, but also to extract and collect application-level metrics and logs. The standard open-source solution for this problem is the Elastic-Stack.
  • Riemann – Monitoring based on Complex Event Processing: Riemann differentiates itself from other monitoring tools in that it introduces event stream processing techniques to monitoring. Riemann works by processing events that are pushed by many distinct sources and, thanks to its push-based architecture, provides high scalability.
  • New Relic – Monitoring solution as SaaS: New Relic is one of the more recent SaaS solutions that have emerged in the past years. The idea of MaaS (Monitoring-as-a-Service) is that the infrastructure and services needed to implement monitoring tools are abstracted away from operations teams. MaaS systems like New Relic can be configured to watch the infrastructure running the services and receive monitoring data from the monitored objects, either via data pushed directly or by installing agents that run inside the hosts. Apart from collecting and processing logs, New Relic provides a browser-based interface to access data and create customized dashboards, as well as features like auto-detecting anomalies with the use of machine learning.
  • Zipkin – Distributed tracing: Zipkin, among other distributed tracing solutions, focuses on tracking requests on a distributed system in order to inspect the system for errors or performance bottlenecks and to enable troubleshooting in architectures where multiple components are being employed.
    Zipkin supports the OpenTracing API. OpenTracing is a project aimed to introduce a vendor-neutral tracing specification. Its objective is to standardize the tracing semantics to allow for better distributed tracing capabilities across frameworks and programming languages. Trace collectors can implement the OpenTracing API and can thereby be employed in large scale distributed systems, in which system subcomponents are written by different teams in different languages.

If your project is focused more on infrastructure components, it would be interesting to take a look at collectd. For a smaller web application, I would recommend building in StatsD and focusing on tracking relevant business KPIs from inside the relevant functions within the code. For any software project I would recommend employing a log management system like the Elastic-Stack or an alternative: since it is an open source solution, there are no licence costs attached, and it can notably reduce troubleshooting time when debugging your application, especially if your application relies on a microservice architecture. For start-ups with smaller cloud projects going to production, I would recommend trying out a Monitoring-as-a-Service solution to analyze your application usage. Especially for smaller businesses going to the cloud, finding out how your customers are using your application is crucial. Discovering which functions and which resources are being utilized, and to what extent, can enable operations teams to optimize cloud resource allocation while providing the business with important feedback about feature reception.
If you are building a complex distributed architecture involving many different components, it might make sense to evaluate more versatile tools like Riemann and take a look at OpenTracing.
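To give an impression of how lightweight application-level metric collection with StatsD is: a counter increment is just a small UDP datagram in StatsD’s plain-text line format. A minimal Go sketch (the address and metric name are made up):

```go
package main

import (
	"fmt"
	"net"
)

// countLogin sends a single counter increment to a StatsD daemon using the
// plain-text line protocol ("<metric>:<value>|c").
func countLogin(statsdAddr string) error {
	conn, err := net.Dial("udp", statsdAddr)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = fmt.Fprint(conn, "myapp.logins:1|c")
	return err
}

func main() {
	// 8125 is StatsD's conventional UDP port; adjust for your own setup.
	if err := countLogin("localhost:8125"); err != nil {
		fmt.Println("could not send metric:", err)
	}
}
```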

Monitoring Strategy

Unarguably, employing the right tools or solutions to monitor a distributed system is crucial for its successful operation. However, effective monitoring does not only focus on tooling; it also incorporates concepts and practices derived from the experience of system operations with regard to monitoring and, to some extent, incident management. If your task is to monitor a software system, the best way to start is to define a monitoring strategy. When doing so, you should consider the following concepts:

  • Symptoms and causes distinction – The distinction of symptoms and causes is especially important when generating notifications for issues and defining thresholds for alerting. In the context of monitoring, symptoms are misbehavior from a user perspective. Symptoms can be slow rendering web pages or corrupt user data. Causes in contrast are the roots of symptoms. Causes for slow rendering web pages can be unhealthy network devices or high CPU load on the application servers.
  • Black-box and white-box monitoring – Black-box monitoring is about testing a system functionality and checking if the results are valid. If results are not what was expected, an alert is triggered. White-box monitoring is about getting information about a component’s state and alerting on unhealthy values. While white-box monitoring can help the operations personnel find causes, black-box monitoring is used to detect symptoms.
  • Proactive and reactive monitoring – Related to black-box and white-box monitoring are the concepts of proactive and reactive monitoring. Whereas reactive monitoring is concerned with collecting, analyzing and alerting on data that reflects an outage of the system, proactive monitoring emphasizes detecting critical situations within the monitored object that can lead to a service performance degradation or complete unavailability. While the results from black-box monitoring can be useful to detect problems affecting users, white-box monitoring techniques provide more visibility into a system’s internal health state. With careful analysis of trends in internal metrics, imminent failures can be predicted, which can either trigger a system recovery mechanism, an auto-scaling of system resources or an alert to a system operator.
  • Alerting – Alerting is a crucial part of monitoring. Alerting, however, does not necessarily involve paging a human being. The latter should mostly remain the exception and only be employed when services are already unavailable or failure is imminent without human action. For events that require attention but do not involve failing systems, alerts can simply be stored as incidents in a ticketing system, to be reviewed by the operations team when no other situation requires their attention. Pages should be triggered as sparsely as possible, since constantly responding to pages can be frustrating and time consuming for operations employees. When defining thresholds to be notified on, newer approaches propose to aggregate metrics and alert on trends instead of watching simple low-level metrics, e.g. watching the rate at which the available space on a database is filling up instead of its current free space at a certain point in time. This concept relates to focusing on symptoms rather than causes when alerting.

When implementing the monitoring strategy for your project, it is best to start with black-box monitoring, covering the areas that affect your users the most (such as being able to log in to your system). When defining alerts, the focus should lie on symptoms rather than causes, since causes might or might not lead to symptoms. Also, when you define alerts, always think about the right choice of recipients. Different groups of project employees might be interested in different types of alerts or different granularity. Alert the business about decreasing page views, not about cluster certificates that will expire soon.

Real-life scenario: monitoring a cloud-based vehicle connectivity platform

The connectivity platform I worked on provides infrastructure and core services to develop connectivity services for commercial vehicles. These services enable users to open a vehicle’s doors remotely via smartphone or to display the position and current state of a vehicle on a map. The platform provided a scalable backend which handled the load of incoming vehicle signals and provided it as JSON data through APIs. It also included frontend components which fetched data from the provided APIs and displayed it on a smartphone in the custom app or in a browser as a web application. The platform was built on top of Microsoft Azure and was based on the IoT reference architecture proposed by Microsoft in the Azure documentation. It was built on a highly distributed microservice architecture that made extensive use of Azure’s products, both SaaS and PaaS. Its core was an Azure Service Fabric cluster running all the microservices that implemented the business logic of the connectivity services. The communication between microservices was established either synchronously via HTTP or asynchronously using Azure’s enterprise messaging component, Azure Service Bus. In order to implement most use cases, the platform’s components also had to communicate with various company-internal systems as well as third-party software services that implemented specific functionality, such as user management.

Monitoring challenges

In order to reach a basic level of monitoring in the platform, at least every component that was not fully managed by Microsoft or a third-party service provider (e.g. a SaaS component) had to be monitored by the operations team. As stated above, special care was put into the choice of components that demanded the lowest level of administration effort from that team. This led to an increased usage of PaaS components instead of virtual machines or other infrastructure alternatives.

Even though the infrastructure of PaaS components does not need to be managed by the cloud consumer, there are some aspects of using PaaS components that can generate errors at runtime and must, for this reason, be constantly monitored by the operations team. Some examples are:

A queue in the Service Bus, filling up rapidly because a microservice is not keeping up with the number of incoming messages.

An Azure Data Factory Pipeline that stopped running because a wrong configuration change removed or changed a database’s credentials and the pipeline is not able to fetch new data.

The former scenario can be easily detected, since Azure provides the resource owner with many metrics regarding the high-level usage of each component. In this case, a counter of unread messages is available in the Azure Portal or via Azure Metrics. Since the counter is available in the portal, widgets displaying its data can be pinned to a dashboard and alerts can be created for a given threshold on this metric.
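As a rough illustration of what such a check could look like outside the portal, here is a Python sketch based on the administration client of the azure-servicebus (v7) SDK; the connection string, queue name and threshold are placeholders, and in practice the same result is easier to achieve with a metric alert configured directly in Azure Monitor:

import time

from azure.servicebus.management import ServiceBusAdministrationClient

CONNECTION_STRING = "<service-bus-connection-string>"  # placeholder
QUEUE_NAME = "vehicle-signals"                         # placeholder queue name
THRESHOLD = 10_000                                     # max tolerated unread messages

admin = ServiceBusAdministrationClient.from_connection_string(CONNECTION_STRING)

while True:
    # Runtime properties include the number of active (unread) messages.
    props = admin.get_queue_runtime_properties(QUEUE_NAME)
    if props.active_message_count > THRESHOLD:
        # In a real setup this would raise an alert (page, ticket) instead of printing.
        print(f"ALERT: {QUEUE_NAME} holds {props.active_message_count} unread messages")
    time.sleep(60)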

In the data pipeline scenario, an outage can be detected by setting alerts on the activity log, which is the only place where errors occurring within Data Factories are reported. An aspect adding risk to this scenario is that a stopped pipeline is especially difficult to detect when testing the system from the user’s perspective, since it only shows up as outdated or incorrect data instead of directly throwing an error.

For this reason, the operations team of the platform should always be notified in the case of a stopped pipeline, because user complaints might be misleading. Also, the advantage of using PaaS components and having the underlying infrastructure abstracted away from the user comes with the disadvantage that all available logs and metrics, from which monitoring events can be derived, are defined by the component’s provider and cannot easily be expanded, customized or complemented with own information.

The monitoring strategy

When defining the monitoring strategy for the platform from a technological point of view, the first decision was to make extensive use of the SaaS monitoring solutions already provided by Microsoft to monitor applications and infrastructure hosted inside of Azure, before introducing external proprietary solutions or deploying open-source monitoring tools to watch the platform. The tools employed were Azure Monitor, Azure Log Analytics and Azure Application Insights. Implementing white-box monitoring was simple thanks to the seamless integration of the tools provided by Microsoft to monitor applications on Azure. Implementing black-box monitoring to be able to tell that the most important functions of the platform were available was way trickier. The next section explains the concept developed as part of my thesis.

End-to-end black-box monitoring

Why monitoring single components is not enough

The problem statement above only describes the challenges of monitoring each component on its own. The services offered by the platform, however, are provided only as a result of the correct functioning of a specific set of these subcomponents together. In many cases, the platform’s subcomponents receive messages, process them and then forward them via synchronous or asynchronous calls to the next subcomponents, forming complex event chains. These event chains can also include systems operated within the same organization or by third parties, which handle specific technical areas of the connected-vehicle logic. From an operations perspective, the most valuable information is whether a specific, user-facing service is running at the moment or whether it is facing technical disruptions. In this context it is not only important to know that a service is running in the sense of passing an integration test; we also need further insights into the system (including its complex event chains) and into which part of it is failing to process the requested operation.

This observability can be partially provided by monitoring the platform’s subcomponents on their own, but not entirely. Here is one example:

In case a Service Bus queue fills up, the cause of messages missing in the frontend can be detected by alerting on a threshold on the involved Service Bus queue’s length.

However, here is an example of a case in which finding out the root cause of a service disruption might become less obvious:

A configuration change from a deployment altered the expected payload of a microservice, and incoming asynchronous messages are being ignored by that microservice for being badly formatted.

The error would only be visible by examining the logs generated by the service. If the message drop is not being actively logged by that specific microservice, there is no way for the operations team to find out in which part of the system the messages are being lost. This observability problem is amplified by the fact that PaaS components do not generate logs for every message being processed. That means the last microservice which created logs including the message’s ID is the last point of known correct functioning of the event chain.

This case might not seem probable, but most of the disruptions happening in a production environment come from unintentional configuration changes introduced by an update. Also, as the architecture becomes more and more complex and further components are added to support the main services of the chain, the task of troubleshooting the platform grows in complexity. An end-to-end black-box monitoring system, which monitors the core services of the platform, would substantially reduce the effort of searching for misbehaving components, since it would alert the operations team as soon as one of the core services is not available. The alert would also include information about the exact subcomponent responsible for the outage.

What we need is Event Chain Monitoring

The core of the proposed solution is based on event correlation. In this case, a custom definition of events and their aggregation was used. The core ideas are inspired by the Complex Event Processing paradigm, but its full complexity would go beyond the scope of this blog post. The use of events creates an abstraction level between incoming signals and information relevant for monitoring. Sources of events can be HTTP requests triggering a component, logs being added to the log repository of the platform, or any kind of process that can be converted to an event. To map data to an event, it has to contain information from which at least the following event metadata can be derived:

  1. Event source
  2. Event timestamp
  3. Tracking ID
  4. Event type

The event source is used to map the event to the component which originally generated it. The timestamp contains the time in UTC at which the event was generated or processed. The tracking ID is used to correlate events that belong to the same action: using a tracking ID, all events originating from the same operation can be grouped together. The event type states the kind of event that was triggered, such as data from vehicle received at gateway or message processed by microservice A.
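As a minimal sketch, the event metadata could be represented like this (field names and example values are made up for illustration):

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class MonitoringEvent:
    """Minimal event carrying the four metadata fields described above."""
    source: str            # component that generated the event, e.g. "vehicle-gateway"
    timestamp: datetime    # UTC time at which the event was generated or processed
    tracking_id: str       # correlates all events belonging to the same operation
    event_type: str        # e.g. "data_received_at_gateway"

# Example: an event emitted when a simulated vehicle signal reaches the gateway.
event = MonitoringEvent(
    source="vehicle-gateway",
    timestamp=datetime.now(timezone.utc),
    tracking_id="4f1c2d9e-0000-0000-0000-000000000000",
    event_type="data_received_at_gateway",
)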

When triggered, all events have to be routed to a central repository for analysis and aggregation. In this central repository, single events can be analyzed for patterns that indicate system failures.
Defining the model for a fully functioning event chain is then simple: it is expressed as a query that aggregates a number of events and states the order and time frames in which these events need to reach the central repository. The exact count of events and their order depends on the level of granularity to which the event chain is going to be monitored.
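Before looking at the diagram, here is a minimal sketch of such an aggregation in Python, reusing the MonitoringEvent sketch from above; the expected event types and the five-minute window are invented for illustration:

from collections import defaultdict
from datetime import timedelta

# Hypothetical model of one event chain: the expected order of event types and
# the maximum time the whole chain may take before it is considered broken.
EXPECTED_CHAIN = [
    "data_received_at_gateway",
    "message_processed_by_service_a",
    "data_visible_in_api",
]
MAX_CHAIN_DURATION = timedelta(minutes=5)

def find_broken_chains(events):
    """Group MonitoringEvent objects by tracking ID and report chains that are
    incomplete, out of order or too slow."""
    by_tracking_id = defaultdict(list)
    for event in events:
        by_tracking_id[event.tracking_id].append(event)

    broken = {}
    for tracking_id, chain in by_tracking_id.items():
        chain.sort(key=lambda e: e.timestamp)
        observed = [e.event_type for e in chain]
        if observed != EXPECTED_CHAIN:
            # The last matching event type points to the last known-good component.
            broken[tracking_id] = f"expected {EXPECTED_CHAIN}, observed {observed}"
        elif chain[-1].timestamp - chain[0].timestamp > MAX_CHAIN_DURATION:
            broken[tracking_id] = "chain completed, but took too long"
    return broken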
A visualization of the proposed solution is presented in the following diagram using a sample event chain:

After the event chain is modelled in the monitoring web application, the flow of events can be tracked. First, signals have to be generated. Since generating real signals would involve driving around in a real car connected to the backend, a vehicle has to be simulated; the first event therefore comes from a vehicle simulator. After vehicle data events start being pushed to the platform, the different components of the system start processing the incoming data and interacting with other subcomponents. For each subcomponent, its telemetry capabilities have to be defined: whether it supports generating an event for every data signal and sending it to the central event repository, or whether the event has to be generated from the logs or other information sources filled by this subcomponent. Depending on the type of component and the level of administration needed, there is more or less information available. If crucial information is missing from a component, a helper or observer unit (here the “observer” Azure Functions) has to be added to the chain to provide the missing information.
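A vehicle simulator for such a synthetic probe could be as simple as the following sketch; the ingestion URL, the payload fields and the probe vehicle ID are hypothetical, and only the idea of attaching a tracking ID to every synthetic signal matters here:

import time
import uuid

import requests  # assuming the platform's ingestion endpoint accepts HTTPS POSTs

INGESTION_URL = "https://platform.example.invalid/api/vehicle-signals"  # hypothetical

def send_synthetic_signal() -> str:
    """Simulate one vehicle signal and return its tracking ID so that the
    monitoring system can later correlate all events it produced."""
    tracking_id = str(uuid.uuid4())
    payload = {
        "vehicleId": "probe-vehicle-001",       # dedicated test vehicle
        "trackingId": tracking_id,
        "timestamp": time.time(),
        "position": {"lat": 48.7758, "lon": 9.1829},
    }
    response = requests.post(INGESTION_URL, json=payload, timeout=10)
    response.raise_for_status()
    return tracking_id

# Called on a schedule (e.g. once per minute) by the vehicle simulator.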

After events are matched, their results can produce other, higher-level events. These high-level events can be consumed by a monitoring dashboard or a ticketing system. If consumed by a custom dashboard, monitoring information can be presented by visually showing a graph of the subcomponents of the system. Depending on the patterns matched by predefined queries, the parts of the event chain that are recognized as failing can be highlighted in the dashboard.
In addition to displaying a graph of the event chain, the operations team can subscribe to changes of the use cases’ health state. If a disruption of a core service is registered, a text message can be sent to the person on call. Less crucial incidents can be tracked in a ticketing system for further investigation by the operations team. A further advantage of having the event chains displayed visually is that even without deep knowledge of the platform, the operations team is able to reason about symptoms visible to users of the platform and trace back many incidents that might have the same origin. That is, the operations team gets direct visibility into the state of the subcomponents of the platform and, in the case of an outage, direct information about the outage’s repercussions on the operability of the system.

Conclusion

The implementation of modern software architecture patterns like microservices, combined with the use of cloud services beyond IaaS, facilitates faster development of software services. Despite the reduced effort of managing infrastructure when deploying applications to the cloud, monitoring the resulting highly distributed systems is becoming an increasingly complex task. On top of that, customers of modern online applications expect short response times and high availability, which raises the need for accurate monitoring systems. In this post I provided a short introduction to monitoring and an overview of current monitoring tools you can test in your projects. However, I also described how a well-thought-out monitoring strategy is more important than tools when it comes to increasing your system’s availability. In the second half of this post I presented a real-life scenario and depicted the challenges related to monitoring a distributed IoT platform. I then presented the concept that was developed in order to reach a better state of observability of the platform.

References

Some of the ideas summarised in this blog post were originally presented in the following articles and books. They are also excellent reads in case the topic of monitoring caught your interest.

  • B. Beyer, C. Jones, J. Petoff, and N. R. Murphy, Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media, Inc., 2016.
  • M. Julian, “Practical Monitoring: Effective Strategies for the Real World,” 2017.
  • J. Turnbull, The art of monitoring. James Turnbull, 2014.
  • I. Malpass, “Measure Anything, Measure Everything,” 2011. link.
  • J. S. Ward and A. Barker, “Observing the clouds: a survey and taxonomy of cloud monitoring,” J. Cloud Comput., vol. 3, no. 1, p. 24, 2014.
  • B. H. Sigelman et al., “Dapper, a Large-Scale Distributed Systems Tracing Infrastructure,” 2010.

Reproducibility in Machine Learning

The rise of Machine Learning has led to changes across all areas of computer science. From a very abstract point of view, heuristics are replaced by black-box machine-learning algorithms providing “better results”. But how do we actually quantify better results? ML-based solutions tend to focus on absolute performance improvements (measured by metrics) rather than on factors like resilience and reproducibility. At the same time, ML models have a significantly growing impact on humans. One can argue that the danger is negligible for applications like playing games, but with direct real-world impact such as self-driving cars in production comes responsibility. This responsibility has also been reinforced by laws such as the EU General Data Protection Regulation (GDPR).

Nevertheless, the objective of this post is not to philosophize about the dangers and dark sides of AI. Instead, this post aims to work out common reproducibility challenges in machine learning and to show how programming in ML differs from other areas of Computer Science. Secondly, we will look at practices and workflows that create a higher grade of reproducibility for machine learning algorithms.

Background

Coming from a software engineering background, my first personal experience of programming in machine learning felt like going back in time. Many frameworks are mature and widely used in practice (TensorFlow, Keras, PyTorch, …), but others are still in early stages and evolve quickly. This shouldn’t be surprising considering the short history of current ML implementations. However, the notion of a framework differs from other areas of Computer Science. TensorFlow and others create an abstraction layer for the underlying mathematical operations and indeed simplify processes like training, optimization and more. But to me, they are closer to a toolkit of operations than a cookbook of best practices.

Scientific results in particular are often implemented with the same toolkit, but as standalone projects. For this reason, the degree of reusability of such implementations is often low. Research scientists are interested in the most recent publications, but there is no baseline project which can be used for different approaches, models and datasets. It is more about copying and pasting workflows, downloading datasets and hacking it together. That said, research in ML is now establishing programming paradigms which have existed in other parts of computer science for decades, and I am thankful to anyone contributing state-of-the-art implementations in the first place. With that, we will move the scientific scope into the background from now on.

Taking a more practical perspective, Jupyter notebooks are often used as a starting point to explore data and different approaches. They are a great tool to evaluate a proof of concept and to showcase initial findings. However, notebooks tend to become chaotic with increasing complexity. In certain aspects, we can compare the workflow to creating an MVP in software engineering: you can reuse the MVP as a starting point for the production application, but you shouldn’t expect a clean and extensible architecture from it.

A machine learning workflow

For a better understanding, the following figure shows a typical workflow and the components of development in Data Science:

  1. Load and preprocess data, bring it into an interpretable form for our ML model.
  2. Code a model and implement the black-box magic that empowers AI.
  3. Train, evaluate and fine-tune the model over days, weeks or months (a toy sketch of these steps follows below the figure reference).
ML workflow (figure from https://cloud.google.com/ml-engine/docs/tensorflow/ml-solutions-overview)
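To make steps 1–3 concrete, here is a toy sketch using scikit-learn and one of its built-in datasets; any real project would replace the dataset, model and metric:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Load and preprocess data, bring it into an interpretable form for the model.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 2. Code a model (the "black-box magic").
model = LogisticRegression(max_iter=1000)

# 3. Train and evaluate; in practice this step is repeated and fine-tuned for a long time.
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))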

After an initial implementation and similar to software lifecycles, we have the following steps:

  1. Deploy the program (model) to our dedicated infrastructure (Cloud, local).
  2. Use the model in production.
  3. Monitor the application and its predictions.
  4. Maintain the source code, implement new features and deploy new versions.

Frameworks like TensorFlow provide tools to read data, train models and evaluate them with different metrics. Further, approaches like TensorFlow Serving address the second part of the workflow to deploy models on infrastructure for production. Nonetheless, these tools don’t explicitly address reproducibility issues in ML. For a better understanding, the following section goes one step back by pointing out these challenges.

Challenges in ML reproducibility

In contrast to other fields of computer science, results in machine learning are often non-deterministic. In other words, the same source code produces different results for the same dataset. The reasons mostly lie in implementation details such as random initialization of parameters or randomly shuffled datasets.

However, the baseline for collaboration in software engineering is a project environment in which changes can be reproduced. There is not enough room in this blog post to discuss concepts like versioning and Continuous Integration, but in general they lead to projects that are less error-prone thanks to automated testing and deployment. Furthermore, contributors are able to comprehend changes and reproduce them in their own environment (if guidelines and rules are followed).

With this non-determinism in mind, the objective for ML is a process in which the exact same results can be reproduced. The following aspects address this issue:

  1. Versioning of models: models should be versioned and any changes to them be transparent.
  2. “Results without context are meaningless.” (https://www.pachyderm.io/dsbor.html) For reproducibility, the collection of metadata is essential. Running a model on a dataset and versioning it does not answer questions such as: Where did the data come from? How can we rerun the model with an updated dataset?
  3. A reproducibility flag should enable a mode in which features causing non-deterministic results, such as random initialization, are disabled (see the sketch after this list).
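A minimal sketch of such a flag could look as follows, assuming a TensorFlow 2.x project; for full determinism, framework- and hardware-specific settings such as deterministic GPU kernels may additionally be required:

import os
import random

import numpy as np
import tensorflow as tf

def enable_reproducible_mode(seed: int = 42) -> None:
    """Pin the common sources of randomness to a fixed seed."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)         # Python's built-in RNG
    np.random.seed(seed)      # NumPy (also used for dataset shuffling)
    tf.random.set_seed(seed)  # TensorFlow weight initialization etc.

# The "reproducibility flag": only pin seeds when explicitly requested.
if os.environ.get("REPRODUCIBLE", "0") == "1":
    enable_reproducible_mode()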

Coming back to Jupyter notebooks as a baseline for ML projects, they are a nightmare from a versioning and reproducibility perspective. First, notebook cells can be run in different orders, which makes the results hard to understand. Secondly, the actual source of notebooks is illegible, which basically means that one can’t understand changes between versions by looking at the source code. As a software developer, just imagine checking out a version and searching for changes in the application while the source diffs are meaningless.

So, how does this work in practice?

We have gained an intuition for the challenges in ML development and learned that versioning and collecting metadata are crucial for reproducibility. We will now answer the question of how to address these issues in practice.

Data versioning tools

Data Version Control (DVC) and datmo are open-source production tools for model management. In other words, they are versioning tools similar to git, but they address additional data science needs. One fundamental need is the integration of large files from different data sources: Git is not made to handle large files, and machine learning data often lives somewhere else entirely (public cloud, the customer’s infrastructure).

Git Large File Storage (LFS), for instance, replaces large files such as data input files with text pointers inside Git and stores the actual data on a remote server. Going one step further, we want to version not only the data but the entire workflow. For a better understanding, the following figure illustrates a very basic scenario for DVC.

A basic DVC scenario (figure from https://dvc.org/doc/use-cases/data-and-model-files-versioning)

We publish and check out the code using a remote hub (Gitlab, Github, …) just as usual. On top of this, we use DVC to publish and retrieve data versions using a different hub. This separation of code and data makes it possible to test different program versions on various data versions. This is particularly useful to reproduce a model’s performance over time in production (after collecting new data).
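For illustration, newer DVC releases also expose a small Python API with which a specific data version can be read directly from a script; the repository URL, file path and tags below are made-up examples:

import dvc.api

# Read the dataset exactly as it was at git tag "v1.0" (hypothetical repo and path).
with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/example-org/example-ml-project",  # hypothetical repository
    rev="v1.0",                                                 # git tag, branch or commit
) as f:
    header = f.readline()

# Re-running the same script with rev="v2.0" reproduces the experiment against a
# newer data version without touching the training code itself.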

On top of this, the tools offer the following features (among others):

  • Language- and Framework-agnostic: Implement projects in different languages (Python, R, Julia, ..) using different frameworks (TensorFlow, PyTorch, …)
  • Infrastructure-agnostic: Deploy models to different environments and infrastructures (Google Cloud, AWS, local infrastructure)

However, being infrastructure-agnostic comes with a drawback: DVC and datmo lack pipeline execution features such as build pipelines, monitoring or error handling. The philosophy of these tools is to be very generic and to work without running servers. They are slim command line tools without user interfaces.

Pachyderm

In order to come closer to continuous integration, we need deployment pipelines and modular infrastructure. The goal is an automated process of releasing new versions, testing them in a staging environment and deploying them to customers. The borders between infrastructure and programming are blurring, and the same should apply to machine learning. The two keywords that pop up in every (modern) infrastructure are containers (Docker) and Kubernetes. Say hi to Pachyderm.

Pachyderm runs on top of Kubernetes, which makes it deployable to any service that supports Kubernetes (Google Cloud Platform, AWS, Azure, local infrastructure). Further, it integrates git-like features to version code as well as data, and it borrows much of its terminology (repository, pipeline, …). With Pachyderm, we configure continuous integration pipelines based on container images.

Pachyderm baseline workflow (figure from https://www.slideshare.net/joshlk100/reproducible-data-science-review-of-pachyderm-data-version-control-and-git-lfs-tools)

The above figure shows a baseline workflow in Pachyderm. Assuming that we already created a repository and coded our model, we put a file to our dedicated data storage. We then create pipelines whose configurations are written as JSON-files and can look as follows:

{
  "pipeline": {
    "name": "word-count"
  },
  "transform": {
    "image": "docker-image",
    "cmd": ["/bin", "/pfs/data", "/pfs/out"]
  },
  "input": {
      "pfs": {
        "repo": "data",
        "glob": "/*"
      }
  }
}

By executing the above configuration with the Pachyderm command line interface (pachctl create-pipeline -f above_pipeline.json), we run the command given in cmd inside a container created from the image defined in image. Further, we can use remote storage like S3 or store the data on the Pachyderm File System (PFS). PFS is a distributed filesystem which is, to a certain extent, comparable to the Hadoop file system (HDFS), with MapReduce jobs replaced by Pachyderm pipelines (see 8).

After creating the pipeline above, Pachyderm will launch worker pods on Kubernetes. These worker pods will remain up and running, such that they are ready to process any data committed to their input repositories.

KubeFlow: Distributed large-scale deployment

As described before, Pachyderm lets us create scalable and manageable ML pipelines on Kubernetes. Although Pachyderm can be parallelized in a map/reduce-style way, the pipelines mostly rely on single nodes and non-distributed training (multiple GPUs, but not multiple nodes). With a different approach in mind, KubeFlow mainly focuses on standards to deploy and manage distributed ML on Kubernetes. It integrates tools like Distributed TensorFlow and TensorFlow Serving, as well as JupyterHub, which improves the process of developing in teams on shared notebooks.

However, KubeFlow (as of now) lacks tools that orchestrate Data Science workflows as seen earlier (data preprocessing, modelling, training, deployment, monitoring, …); it leaves these responsibilities up to the developer. Since this blog post mainly focuses on reproducibility in Machine Learning, KubeFlow does not answer our questions satisfactorily. Consequently, further concepts are out of scope for this post, while being no less exciting for productive and large-scale ML engineering.

Nevertheless, reproducibility and productivity should go hand in hand. For this reason, KubeFlow and Pachyderm can be used jointly in practice. In such a scenario, Pachyderm would provide the reproducibility through its pipelines, and KubeFlow would bring the ease of deployment and the distributed framework integrations (see 7 for more details).

So what should I use?

After an introduction to tools such as DVC and Pachyderm, one last question remains: Which is the best tool in production? And as always the answer is – it depends. DVC can improve productivity in smaller teams to organize and version projects and link the source code to the data. However, for organizations willing to introduce a workflow, richly-featured tools such as Pachyderm are the way to go. Taking one step further, KubeFlow paves the way for large-scale and distributed applications.

From a different point of view, this discussion could be seen as a discussion about Kubernetes itself. That is a far more wide-reaching debate and raises fundamental questions such as: Can we pay someone to set up and maintain the Kubernetes cluster? Are our applications and workflows complex enough (multiple nodes, not just multiple GPUs) to justify the overhead of Kubernetes? Unfortunately, we can’t answer these questions in this blog post.

Wrapping it up

In Machine Learning, programs can have the same meaning and even speak the same language but output different results, because context matters and implementations are full of (intentional) randomness. Further, development in ML is very sensitive to changes, and even small differences can have a large impact on the result. For reproducibility, we have to record the full story and keep track of all changes.

Frameworks such as DVC or Pachyderm help to keep track of not only the code but also the data. Furthermore, they use pipelines to reproduce results and simplify collaborative projects. This increases reproducibility and helps to live up to the responsibility that comes with ML. On top of this, such tools are a first step towards compliance with laws like the GDPR, because results can at least be reproduced. However, these solutions are to some extent immature and evolve quickly (but so does everything else in ML). There is still a long way to go before practices in ML are comparable to the standards of software engineering.

Related Sources and further reading

  1. Collaboration Issues in Data Science (accessed: 25.02.19): https://github.com/iterative/dvc.org/blob/master/static/docs/philosophy/collaboration-issues.md
  2. Hold Your Machine Learning and AI Models Accountable (accessed 25.02.19): https://medium.com/pachyderm-data/hold-your-machine-learning-and-ai-models-accountable-de887177174c
  3. How to Manage Machine Learning Models (accessed: 25.02.19): https://www.inovex.de/blog/how-to-manage-machine-learning-models/
  4. Introducing Kubeflow – A Composable, Portable, Scalable ML Stack Built for Kubernetes (accessed: 26.02.19): https://kubernetes.io/blog/2017/12/introducing-kubeflow-composable/
  5. Machine-Learning im Kubernetes-Cluster (German, accessed: 25.02.19): https://www.heise.de/developer/artikel/Machine-Learning-im-Kubernetes-Cluster-4226233.html
  6. Machine Learning Workflow (accessed: 26.02.18): https://cloud.google.com/ml-engine/docs/tensorflow/ml-solutions-overview
  7. Pachyderm and Kubeflow integration (accessed: 26.02.18): https://github.com/kubeflow/kubeflow/issues/151
  8. Pachyderm File System (PFS, accessed: 26.02.18): https://docs.pachyderm.io/en/v1.3.7/pachyderm_file_system.html
  9. Provenance: the Missing Feature for Rigorous Data Science. Now in Pachyderm 1.1 (accessed 25.02.19): https://medium.com/pachyderm-data/provenance-the-missing-feature-for-good-data-science-now-in-pachyderm-1-1-2bd9d376a7eb
  10. Reproducibility in ML: Why It Matters and How to Achieve It (accessed: 25.02.19): https://determined.ai/blog/reproducibility-in-ml/
  11. Reproducible data science: review of Pachyderm, Data Version Control and GIT LFS tools (slides, accessed: 25.02.19): https://www.slideshare.net/joshlk100/reproducible-data-science-review-of-pachyderm-data-version-control-and-git-lfs-tools
  12. The Data Science – Bill of Rights (accessed: 25.02.19): https://www.pachyderm.io/dsbor.html

Experiences from breaking down a monolith (1)

Written by Verena Barth, Marcel Heisler, Florian Rupp, & Tim Tenckhoff

The idea

The search for a useful, simple application idea that could be realized within a semester project proved to be difficult. Our project was meant to be the base for several lectures, and its development should familiarize us with new technologies and topics. One day a team member was standing at a train stop, waiting for the delayed subway to finally arrive, and came up with the idea that it would be interesting to see statistics about the average and total delay of the Deutsche Bahn!

This is how the idea was born: we wanted to develop a device-independent web application to visualize both the average and the accumulated delay, as well as the current departures of public transport at selected stops. Our application, called Bahnalyse, receives its data via the VVS API. To calculate the statistics, the departure times of the trains are crawled at regular intervals and stored in a database. To search for station names as well as to display the current delays, the VVS API is called directly.

During the lecture “Web Application Architecture” we thought about a first simple application architecture, patterns and technologies we wanted to use. This was significantly influenced by the lecture “System Engineering and Management”, in the context of which this blog post is written.

General aims, technologies

Our aims for this lecture were initially to get in touch with CI/CD, Docker and deployment to the cloud – topics that are currently on everyone’s lips and with which we haven’t had much to do yet. Furthermore, we wanted to improve the architectural concept of our quite simple application, e.g. by splitting it up, decoupling the components from each other and experimenting with different types of databases.

For the frontend development we chose Angular and made use of some other useful npm packages, including ng2-charts (offering Angular directives for Chart.js, which provides HTML5-based, animated JavaScript charts) and Angular Material, which offers modern UI components following Google’s Material Design spec. The Java backend was developed with Spring Boot, making it easy to create stand-alone applications without much XML configuration. Initially we wanted to use a relational MySQL database in combination with Hibernate. But let’s see where the system engineering process took us. 😉

Which cloud provider fits best?

Since deploying to the cloud was one of our initial goals, we started looking for the “best” cloud provider early. We compared the most common ones in Europe and the US, namely AWS (Amazon Web Services), IBM Cloud, GCP (Google Cloud Platform) and Microsoft Azure, in four categories.

The first category is enterprise adoption, because it makes sense to learn and try out the cloud provider we will most likely use later on at work. We found a study from RightScale about this topic in which AWS is clearly the leader: 68% of the 997 participating companies from different industries were running applications on AWS in 2018. Azure came second with 58%, but this figure might be misleading because it includes Office 365 customers, which is not relevant for us.

Category two is documentation. We found a blog post from a fellow student from July 2018 that says “documentation is key”. He tried out AWS, GCP and IBM, and again AWS won with a “still bad documentation, but best of all”. Of course this is only his point of view, but at least he had tried them out before judging, so his opinion gives us at least a good clue.

The third category is the community. Here we compared the total number of questions asked on Stack Overflow. If a lot of questions have already been asked, the chance grows that questions we might have are already answered. Again the winner is AWS with 65,427 questions. We did not consider the percentage of unanswered questions, because it does not differ very much between the cloud providers. We looked up those numbers in October 2018.

The last but most important category is costs. Being students who just want to try out some technologies, we did not want to spend any money. So we only compared free tiers and student accounts; we did not check what it would cost to run our application for a longer time.

We started by trying out the AWS student account. Given only the Starter Account provided by Stuttgart Media University (HdM), we did not have access to any IAM tools. It was also not possible to create access keys for our user. This, together with the federated login of the student account itself, made logging in from a CI pipeline impossible, so the AWS student account became useless for us. The AWS free tier would probably provide the features we need for free for up to 12 months and up to a limited amount of resources. But it requires a credit card for registration that will be charged if any of the free tier limitations are exceeded. There is no possibility to set limits on the payments, so it is not possible to say “shut down all my instances before I would have to pay for them”.

We were not eligible for the GCP student accounts because our university would have had to apply for them first. The GCP free tier also requires a credit card, but at least it will only be charged after confirmation once the free tier is used up.

Still, we did not want to provide a credit card at all. Luckily, IBM Cloud provided exactly what we were looking for: a six-month free student account with no need to provide a credit card and only few limitations on the available features. To activate it, you need to get a promo code and enter it in your free account to upgrade to a student account.

The table below shows an overview of the categories explained above. The last row contains additional considerations: we were told that containers run in actual Kubernetes clusters on GCP, whereas AWS uses VMs, and we read that GCP provides the “best AI” services, although this does not matter for our use case.

Finally, we conclude that we would use IBM Cloud because of the free student trial. But actually we have not deployed to any cloud so far. While working on the project we found other interesting topics to concentrate on and deployed the old-fashioned way to a server we got from HdM, also for free.

  • Enterprise adoption – AWS: 68%, IBM: 15%, GCP: 19%, Azure: 58%
  • Documentation – AWS: “bad, but best of all”, IBM: “bad”, GCP: “a little bit better”, Azure: not evaluated
  • Community (Stack Overflow questions) – AWS: 65,427, IBM: 5,256, GCP: 10,352, Azure: 62,638
  • Free usage – AWS: student account useless; free tier requires a credit card, no spending limits possible. IBM: student account with few limitations, six months free. GCP: student account not available (HdM has not applied); free tier requires a credit card, charged only after confirmation, few limits possible. Azure: not evaluated
  • Miscellaneous – AWS: containers run on VMs. GCP: containers run on Kubernetes, “best AI” services. Azure: enterprise adoption figure mainly driven by Office 365 customers