As a distributed software system grows, you end up with a large number of components that all need to scale independently. To make these components work together smoothly, it is necessary to figure out when each component needs to be scaled, so that no single component becomes a bottleneck.
This blog post looks at how to test the behaviour of a large-scale system under extreme load in order to discover vulnerabilities. I will provide an overview of scalability testing and of a more specific variant that has already proven itself for such systems: Chaos Engineering.
Scalability testing is a non-functional testing method: it checks non-functional aspects of a system such as performance, reliability or usability.
The main idea behind testing scalability is to measure an application's performance based on its ability to scale up and down under varying amounts of load.
Scalability Testing vs Load Testing
The term "load testing" may be more familiar to some readers. Performing load tests on an application means testing the entire system under maximum load to find the point at which the application is no longer usable.
This point indicates how much load the application can just about handle before response times become too long or the system crashes altogether.
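To make the "breaking point" idea concrete, here is a minimal sketch in Python. It uses an entirely made-up latency model (the `simulated_service` function, its capacity of 500 users and the 1000 ms threshold are all illustrative assumptions, not real measurements) and ramps up the load until the response time exceeds the tolerated threshold:

```python
def simulated_service(concurrent_users: int) -> float:
    """Toy model: response time (ms) degrades as load approaches capacity."""
    capacity = 500        # made-up breaking point of the system
    base_latency = 50.0   # made-up response time under light load
    utilization = min(concurrent_users / capacity, 0.999)
    # Latency grows sharply as the user count approaches capacity.
    return base_latency / (1.0 - utilization)

def find_breaking_point(threshold_ms: float = 1000.0, step: int = 50) -> int:
    """Ramp up the load until the response time exceeds the threshold."""
    users = step
    while simulated_service(users) <= threshold_ms:
        users += step
    return users

print(find_breaking_point())  # first load level that breaks the threshold
```

A real load test would replace `simulated_service` with actual requests against the system under test; the ramp-up loop stays the same.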
So what’s the difference to Scalability Testing?
When testing for scale, unlike in load testing, the minimum and maximum load is measured for all levels of a system: the software, hardware and database levels.
When traffic increases, the system should scale up the affected layer to prevent long response times. The goal of scalability testing is to ensure that the application performs this scaling correctly. Measuring performance always means defining metrics or parameters, so that the results can be compared later on.
These parameters differ depending on the type of application. In a web frontend, for example, one would most likely measure the number of users or the network usage, while for a web server the number of processed requests matters more.
As you might have noticed, testing large-scale systems is closely linked to another topic: monitoring.
You can get a good introduction into this topic by reading this blog post: https://blog.mi.hdm-stuttgart.de/index.php/2019/02/26/end-to-end-monitoring-of-modern-cloud-applications/.
For the purpose of testing it is important to know that results can only be compared if a proper monitoring environment is set up, in which the chosen parameters can be viewed independently.
I have talked a lot about different parameters for different types of applications, so let's take a look at some examples:
- Response time
  - Response time is defined as the time between the user's request and the application's response.
  - The expected behaviour is that the response time stays the same regardless of the load level. Although it is quite normal for response times to be higher under heavy load, everyone has to define their own threshold for what they consider tolerable for their users.
- Throughput
  - Throughput is the number of requests the application processes.
  - It is important to differentiate between system levels: on the database level, the number of queries is relevant, whereas in a typical web application it is the number of user requests.
  - The expected behaviour is the same throughput at all levels of load.
- CPU/memory usage
  - Here, the CPU utilization and the memory consumed while executing a task in the application are measured.
  - The more the application scales, the more CPU and memory will be used. To keep this in check, one should stick to standard programming practices such as caching or optimizing database queries.
- Network usage
  - Network usage is monitored by measuring the bandwidth an application consumes.
  - Ultra-scaling systems should minimize their network usage as much as possible in order to avoid network congestion, for example with standard techniques such as compression.
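The first two parameters are straightforward to compute once per-request timings have been collected by the monitoring setup. A small sketch (the timing samples and the measurement window are invented values, and the percentile calculation is deliberately crude):

```python
import statistics

# Hypothetical per-request timings (seconds) from one measurement window.
request_timings = [0.12, 0.15, 0.11, 0.31, 0.14, 0.13, 0.45, 0.12, 0.16, 0.11]
test_duration_s = 2.0  # wall-clock length of the measurement window

# Response time: average and a crude 95th percentile of the samples.
mean_response = statistics.mean(request_timings)
p95_response = sorted(request_timings)[int(0.95 * len(request_timings)) - 1]

# Throughput: processed requests divided by the window length.
throughput = len(request_timings) / test_duration_s  # requests per second

print(f"mean={mean_response:.3f}s p95={p95_response:.3f}s "
      f"throughput={throughput:.1f} req/s")
```

Running the same computation for each load level gives the comparable numbers the previous list asks for.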
How to test
To perform scalability testing, a scenario needs to be defined for the parameter that is to be tested. This scenario should stay the same across test runs, so that the results can be compared afterwards.
After setting up a test environment, the tests are carried out by repeating the scenario under different load levels in order to exercise the application's scaling. By monitoring the chosen parameter, potential issues can be identified.
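The procedure can be sketched as a tiny harness: run the identical scenario at several load levels and flag a scaling problem when the measured parameter drifts too far from the baseline. Everything here is an assumption for illustration, including the toy degradation model and the 50 ms tolerance:

```python
def run_scenario(load_level: int) -> float:
    """Placeholder: execute the fixed test scenario at the given load level
    and return the measured parameter (a simulated response time in ms)."""
    return 100.0 + 0.05 * load_level  # toy model: mild degradation with load

def scales_acceptably(load_levels, tolerance_ms=50.0):
    """Repeat the same scenario per load level; report a problem when the
    metric drifts more than tolerance_ms from the lowest-load baseline."""
    baseline = run_scenario(load_levels[0])
    return all(run_scenario(l) - baseline <= tolerance_ms for l in load_levels)

print(scales_acceptably([100, 500, 1000]))
```

In a real setup, `run_scenario` would drive a load generator and read the result from the monitoring system; the comparison logic stays this simple.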
The Netflix way of testing
The engineers at Netflix created their own testing style. As their systems grew, they soon discovered two things: first, failures are unavoidable, and second, they could not set up a testing environment that reflects their production environment, because it would be far too expensive. To improve confidence in their systems, they introduced a concept called Chaos Engineering.
About Chaos Engineering
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
The principle behind Chaos Engineering is the awareness that large-scale distributed systems are "chaotic": various events will unavoidably lead to unpredictable outcomes. Chaos testing addresses these kinds of problems with the goal of reducing unpredicted behaviour in production.
Chaos testing is similar to scalability testing. First, a "steady state" of the system is defined, which means measuring the parameters while the system behaves normally. The hypothesis is that this steady state will persist even when the conditions around the system change.
Now let's bring in some chaos: by simulating real-world events such as a server crash or a network connection that suddenly drops, the system's behaviour will change as it adapts to the event, for example by launching a new server instance. After the tests, the measured results are compared with the original steady state, which can reveal weaknesses in the system.
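The experiment loop above can be sketched in a few lines. The cluster model, its error-rate formula and the auto-scaler are all made-up stand-ins; the point is the shape of the experiment: measure the steady state, inject a failure, let the system react, then compare:

```python
class Cluster:
    """Toy cluster whose error rate rises when instances are lost."""
    def __init__(self, instances: int):
        self.instances = instances

    def error_rate(self) -> float:
        # Made-up model: fewer instances -> more dropped requests.
        return max(0.0, (3 - self.instances) * 0.2)

    def kill_instance(self):
        self.instances -= 1

    def autoscale(self, target: int):
        # The resilience mechanism under test: relaunch lost instances.
        self.instances = max(self.instances, target)

cluster = Cluster(instances=3)
steady_state = cluster.error_rate()      # 1. measure normal behaviour

cluster.kill_instance()                  # 2. inject a real-world failure
cluster.autoscale(target=3)              # 3. the system adapts

after_experiment = cluster.error_rate()  # 4. compare against the steady state
print(after_experiment == steady_state)  # steady state held -> confidence
```

If the comparison fails, the experiment has surfaced a weakness before it could surface on its own in production.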
The harder it is to disrupt the steady state, the more confidence we have in the behaviour of the system.
Chaos in Practice
The experiments should ideally reflect real-world events, such as hardware or software failures or simply peaks in traffic. Netflix's recommendation is to run the tests in the production environment, as this is the only reliable way to build confidence in a system. Another point is that it is nearly impossible to build a separate environment just for testing, as this would be far too expensive for a large distributed system. When testing in production, there is always the risk of causing trouble for customers, as the system is put under extreme stress. It should therefore be considered to run these tests at a time when less traffic is expected. For example, Amazon would never run chaos tests shortly before Christmas: they know there will be a lot of traffic, and a failure of their system could damage the entire company, not to mention cause an enormous loss of money.
The Netflix Simian Army
Netflix first introduced the concept of Chaos Engineering and released an open-source tool called Chaos Monkey, which runs chaos tests on a system by "randomly terminating virtual machine instances and containers that run inside of your production environment".
The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables.
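The core behaviour is easy to picture as code. This is a toy sketch of the idea, not the real tool: instance names, the kill fraction and the seeded random generator are all illustrative assumptions, and the real Chaos Monkey terminates instances through the cloud provider's API instead of printing:

```python
import random

def chaos_monkey(instances, kill_fraction=0.25, rng=None):
    """Randomly terminate a fraction of instances, Chaos-Monkey style.
    Returns the list of surviving instances."""
    rng = rng or random.Random()
    kills = max(1, int(len(instances) * kill_fraction))
    victims = set(rng.sample(instances, kills))
    for victim in victims:
        print(f"terminating {victim}")  # real tool: call the cloud API here
    return [i for i in instances if i not in victims]

fleet = [f"instance-{n}" for n in range(8)]
survivors = chaos_monkey(fleet, rng=random.Random(42))  # seeded for demo
print(f"{len(survivors)} of {len(fleet)} instances survived")
```

The value of the exercise is not the termination itself but watching whether the rest of the system keeps serving traffic while the victims are gone.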
To improve stability and learn about the weaknesses of their system, Netflix started running Chaos Monkey during normal business days, with developers ready to react instantly to any problems. The outcome was so successful that it motivated them to develop more "monkeys" for different kinds of failures. They ended up with a "virtual Simian Army":
- Latency Monkey introduces very large artificial delays in order to simulate service degradation or even an entire service going down.
- Doctor Monkey runs health checks on the instances and shuts down unhealthy ones, giving the owner the chance to fix the issues and relaunch them.
- 10-18 Monkey is short for Localization-Internationalization and detects language problems for different geographic regions, like the usage of different character sets.
- Janitor Monkey checks the cloud environment for unused resources and disposes of them.
- Conformity Monkey detects instances that do not follow best practices and shuts them down, for example an instance that does not belong to an auto-scaling group.
By using the Simian Army, Netflix is much more confident in dealing with unpredictable problems arising in their production environment.
To improve the robustness of a large-scale system, scalability testing is indispensable. The principle of Chaos Engineering shows that there are many events that cannot be predicted, yet a robust system should be able to survive such impacts with as little damage as possible.