{"id":22581,"date":"2022-03-03T21:56:33","date_gmt":"2022-03-03T20:56:33","guid":{"rendered":"https:\/\/blog.mi.hdm-stuttgart.de\/?p=22581"},"modified":"2023-08-06T21:39:32","modified_gmt":"2023-08-06T19:39:32","slug":"cascading-failures-in-large-scale-distributed-systems","status":"publish","type":"post","link":"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2022\/03\/03\/cascading-failures-in-large-scale-distributed-systems\/","title":{"rendered":"Cascading failures in large-scale distributed systems"},"content":{"rendered":"\n<p>Internet service providers face the challenge of growing rapidly while managing increasing system distribution. Although the reliable operation of services is of great importance to companies such as <em>Google<\/em>, <em>Amazon<\/em> and Co., their systems fail time and again, resulting in extensive outages and a poor customer experience. This has already affected <em>Gmail<\/em> (2012) [1], <em>AWS DynamoDB<\/em> (2015) [2], and recently <em>Facebook<\/em> (2021) [3], to name just a few examples. In this context, one often encounters so-called <em>cascading<\/em> <em>failures <\/em>causing undesirable complications that go beyond ordinary system malfunctions<em>.<\/em> But how is it that even the big players in the online business cannot completely avoid such breakdowns, given their budgets and technical knowledge? And what are practical approaches to risk mitigation that you can use for your own system?<\/p>\n\n\n\n<p>With that said, the goal of this blog article is to learn how to increase the resilience of your large distributed system by preventing the propagation of failures.<\/p>\n\n\n\n<!--more-->\n\n\n\n<h2 class=\"wp-block-heading\">Cascading failures<\/h2>\n\n\n\n<p>A <em>cascading failure<\/em> is a failure that increases in size over time due to a <strong>positive feedback loop<\/strong>. The typical behavior is initially triggered by a single node or subsystem failing. 
This spreads the load across fewer nodes of the remaining system, which in turn increases the likelihood of further failures, resulting in a vicious circle or snowball effect [4]. Cascading failures are <strong>highly<\/strong> <strong>critical<\/strong> for three reasons: First, they can shut down an entire service in a short period of time. Second, the affected system does not return to normal as it does with more commonly encountered problems, but it gets progressively worse. This ultimately makes recovery dependent on human intervention. Finally, in the worst case, cascading failures can strike seemingly without warning because load distribution, and consequently failures, occur rapidly [4][5].<\/p>\n\n\n\n<p>Although this blog article will focus on cascading failures in the context of distributed computing, they can also occur in a variety of other domains: e.g., power transmission, finance, biology, and ecosystems. So, they are a fairly widespread phenomenon that is somewhat similar to patterns found in nature [5]. To get a better idea of what a cascading failure in computer science looks like, let&#8217;s look at a specific case study.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Case Study: The <em>AWS<\/em> <em>DynamoDB<\/em> Outage in 2015<\/h2>\n\n\n\n<p><em>AWS DynamoDB<\/em> is a highly scalable non-relational database service, distributed across multiple datacenters, that offers strongly consistent read operations and ACID transactions [6]. It is, and at the time of the event was, being used by popular internet services such as <em>Netflix<\/em>, <em>Airbnb<\/em>, and <em>IMDb<\/em> [7]. The incident we want to look at as an example of a cascading failure occurred on September 20, 2015, when <em>DynamoDB<\/em> was unavailable in the US-East region for over four hours. There were two subsystems involved: storage servers and a metadata service. Both are replicated across multiple datacenters. 
The storage servers request their so-called <em>membership<\/em> for their data partition allocations from the metadata service. This is shown in Figure&nbsp;1.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><a href=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_1.jpg\"><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"22606\" data-permalink=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2022\/03\/03\/cascading-failures-in-large-scale-distributed-systems\/figure_1\/\" data-orig-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_1.jpg\" data-orig-size=\"983,412\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"figure_1\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_1.jpg\" src=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_1.jpg\" alt=\"\" class=\"wp-image-22606\" width=\"609\" height=\"255\" srcset=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_1.jpg 983w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_1-300x126.jpg 300w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_1-768x322.jpg 768w\" sizes=\"auto, (max-width: 609px) 100vw, 609px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 1: Storage servers and metadata service <br>(Own illustration based on [8])<\/figcaption><\/figure>\n\n\n\n<p>For the membership (and thus also for the allocation of 
data partitions) there are timeouts during which the request must be successful. If these are exceeded, the corresponding storage server excludes itself from the service and retries.<\/p>\n\n\n\n<p>An unfortunate precondition for the incident was a newly introduced <em>DynamoDB<\/em> feature called <em>Global Secondary Index<\/em> (GSI). This gives customers better access to their data but has the downside of significantly increasing the size of metadata tables. Consequently, processing times became much longer. Unfortunately, neither the capacity of the metadata service nor the timeouts for membership requests were adjusted accordingly [9].<\/p>\n\n\n\n<p>The real problem began when a short network issue caused a few storage servers (dealing with very large metadata tables) to miss their membership requests. These servers became unavailable and kept retrying their requests. This overloaded the metadata service, which in turn slowed down responses and caused more servers to resubmit their membership requests because they had exceeded their timeouts as well. As a consequence, the state of the metadata service deteriorated even further. Despite several attempts to increase resources, the system remained caught in the failure loop for hours. Ultimately, the problem could only be solved by interrupting requests to the metadata service, i.e., the service was basically taken offline [9].<\/p>\n\n\n\n<p>The result was a widespread <em>DynamoDB<\/em> outage in the US-East region and an excellent example of a cascading failure. However, what are the underlying concepts and patterns of the systems that are getting caught in such an error loop?<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Reasons for cascading failures<\/h2>\n\n\n\n<p>First, it should be mentioned that the triggers for cascading breakdowns are diverse: e.g., new rollouts, maintenance, traffic drains, cron jobs, distributed denial-of-service (DDoS) attacks, throttling, and so on. 
What they all have in common is that they work in the context of a finite set of resources, potentially implying effects such as server overload, resource exhaustion, and unavailability of services [4][10]. Let&#8217;s look at those in detail:<\/p>\n\n\n\n<p><strong>Server overload<\/strong><\/p>\n\n\n\n<p>The most common cause is server overload or a consequence of it. When that happens, the drop in system performance often affects other areas of the system. As shown in Figure 2, in the initial scenario (left), load coming from two reverse proxies is distributed between clusters A and B, so that cluster A operates at an assumed maximum capacity of 1000 requests per second. In the second scenario (right), cluster B fails and the entire load hits cluster A, which can lead to an overload. Cluster A now has to process 1200 requests per second and starts to misbehave, causing the performance to drop well below the desired 1000 requests per second [4].<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><a href=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_2.jpg\"><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"22608\" data-permalink=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2022\/03\/03\/cascading-failures-in-large-scale-distributed-systems\/figure_2\/\" data-orig-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_2.jpg\" data-orig-size=\"1347,528\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"figure_2\" data-image-description=\"\" data-image-caption=\"\" 
data-large-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_2-1024x401.jpg\" src=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_2-1024x401.jpg\" alt=\"\" class=\"wp-image-22608\" width=\"756\" height=\"295\" srcset=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_2-1024x401.jpg 1024w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_2-300x118.jpg 300w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_2-768x301.jpg 768w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_2.jpg 1347w\" sizes=\"auto, (max-width: 756px) 100vw, 756px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 2: Clusters A and B receiving load according to capacity (left) and cluster A receiving overload if cluster B fails (right). (Own illustration based on [4])<\/figcaption><\/figure>\n\n\n\n<p><strong>Resource Exhaustion<\/strong><\/p>\n\n\n\n<p>Resources of a server are limited. If the load increases above a certain threshold, the server&#8217;s performance metrics, such as latency or error rates, deteriorate. This translates into a higher risk of a crash. The subsequent effects depend on the type of resource that is causing the bottleneck, for instance,<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>if <strong>CPU<\/strong> is not sufficient, a variety of issues can occur, including slower requests, excessive queuing effects, or thread starvation.<\/li>\n\n\n\n<li>If <strong>memory<\/strong>\/<strong>RAM<\/strong> is overused, tasks may crash, or cache hits can decrease.<\/li>\n\n\n\n<li>Also,<strong> thread<\/strong> starvation may directly cause errors or lead to health check failures [4].<\/li>\n<\/ul>\n\n\n\n<p>Troubleshooting for the main cause in this context is often painful. 
This is due to the fact that the components involved are <strong>interdependent<\/strong> and the root cause may be hidden behind a complex chain of events [4]. For example, assume that less memory is available for caching, resulting in fewer cache hits and thus a higher load for the backend, and so on [10].<\/p>\n\n\n\n<p><strong>Service Unavailability<\/strong><\/p>\n\n\n\n<p>When resource exhaustion causes a server to crash, traffic spreads to other servers, increasing the likelihood that those will crash as well. A cycle of crashing servers is established. Worse, these problems persist in your system: some machines are still down or restarting, while the increased traffic prevents them from fully recovering [4].<\/p>\n\n\n\n<p>In general, the risk of cascading failure is always present when we redistribute traffic from unhealthy nodes to healthy nodes. This may be the case with orchestration systems, load balancers, or task scheduling systems [5]. In order to solve cascading failures, we need to take a closer look at the relationships of the components involved.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Getting out of the loop &#8211; how to fix cascading failures<\/h2>\n\n\n\n<p>As seen in the case of <em>DynamoDB<\/em>, fixing cascading failures is tricky. Especially from the perspective of a large tech company, distribution adds a lot of complexity to your system, which makes it even more difficult to keep track of the diverse interconnections. One basic way to illustrate the cascading relationships is the so-called <em>Causal Loop Diagram<\/em> (CLD). The CLD is a modeling approach that helps to visualize feedback loops in complex systems. Figure 3 visualizes the CLD for the <em>AWS<\/em> <em>DynamoDB<\/em> outage. It can be explained as follows. An arrow represents the dynamic between the initial and subsequent variable. 
For instance, if the latency on the metadata service increases, the number of timeouts increases and so does the number of retries needed. If the effects in the system are highly unbalanced, i.e., the number of pluses and minuses is not equal by a large margin, there is a reinforcing cycle. This means that the system might be sensitive for cascading failures [5].<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><a href=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_3.jpg\"><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"22610\" data-permalink=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2022\/03\/03\/cascading-failures-in-large-scale-distributed-systems\/figure_3\/\" data-orig-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_3.jpg\" data-orig-size=\"1121,733\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"figure_3\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_3-1024x670.jpg\" src=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_3-1024x670.jpg\" alt=\"\" class=\"wp-image-22610\" width=\"626\" height=\"409\" srcset=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_3-1024x670.jpg 1024w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_3-300x196.jpg 300w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_3-768x502.jpg 768w, 
https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_3.jpg 1121w\" sizes=\"auto, (max-width: 626px) 100vw, 626px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 3: The Causal Loop Diagram for the AWS DynamoDB outage in 2015 <br>(Own illustration based on [5])<\/figcaption><\/figure>\n\n\n\n<p>Now, to address the cascading scenario, various measures can be taken. The first and most intuitive option is to <strong>increase<\/strong> <strong>resources. <\/strong>In the diagram above you can see the minus that is introduced to the circle by the <em>metadata service capacity<\/em>. If this is increased, it works against the reinforcing cycle. However, this might be useless, as we have seen in the case of <em>AWS<\/em>. In addition to increasing resources, you may need to employ other strategies:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Try to <strong>avoid health check failures\/deaths<\/strong> to prevent your system from dying due to excessive health checking.<\/li>\n\n\n\n<li><strong>Restart your servers<\/strong> in case of thread-blocking requests or deadlocks.<\/li>\n\n\n\n<li><strong>Drop traffic<\/strong> significantly and then slowly increase the load so that the servers can gradually recover.<\/li>\n\n\n\n<li><strong>Switch to a degraded mode<\/strong> by dropping certain types of traffic.<\/li>\n\n\n\n<li><strong>Eliminate batch\/bad<\/strong> <strong>traffic<\/strong> to reduce system load due to non-critical or faulty work [4].<\/li>\n<\/ul>\n\n\n\n<p>Since this ultimately means that parts of the system are not available and this becomes visible to the customer, it is better to avoid cascading failures in the first place.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Avoiding cascading failures<\/h2>\n\n\n\n<p>There are numerous approaches to render distributed systems robust against cascading failures.<\/p>\n\n\n\n<p>On the one hand, large internet companies have already thought about how to prevent a system from falling 
into a cascade of errors, e.g., by isolating errors. Tools and frameworks have been developed for this purpose. Two examples are <em>Hystrix<\/em> (from <em>Netflix<\/em>), a latency and fault tolerance library, and <em>Sentinel<\/em> [11][12]. Regarding the former, <em>Netflix<\/em> has already made further developments, namely <em>adaptive<\/em> <em>concurrency<\/em> <em>limits<\/em> (you can read more on that <a href=\"https:\/\/netflixtechblog.medium.com\/performance-under-load-3e6fa9a60581\">here<\/a>). But in general, these kinds of tools wrap external calls into some kind of data structure, trying to abstract the critical points.<\/p>\n\n\n\n<p>On the other hand, and this is where the hype is going, there are more complex solutions, such as so-called <em>sidecar<\/em> <em>proxies<\/em>, as used by service meshes like <em>Istio<\/em>. Example technologies are <em>Envoy<\/em> and <em>HAProxy<\/em> [10][13].<\/p>\n\n\n\n<p>In addition to these solutions, there are certain system design concepts you can keep in mind. For example, you can try to reduce the number of synchronous calls in your systems. This can be done by moving from an orchestration pattern to a choreography pattern by applying a <strong>publish<\/strong>&#8211;<strong>subscribe<\/strong> <strong>pattern<\/strong> design, e.g., by using Kafka. In the face of increasing traffic, this solution often turns out to be more robust. Other approaches, such as performing capacity planning (depending on the use case), can also be helpful. This often implies implementing solutions for automatic provisioning and deployment, automatic scaling, and automatic healing. 
In this context, close monitoring of SLAs and SLOs can be considered important [10][4].<\/p>\n\n\n\n<p>Now, in order to better understand the underlying solution approaches, we can take a look at typical <strong>antipatterns<\/strong> in distributed systems that should be avoided in the context of cascading failures. Laura Nolan proposes six of these, which are also discussed in terms of risk mitigation strategies in the following.<\/p>\n\n\n\n<p><strong>Antipattern 1: Acceptance of an unrestricted number of requests<\/strong><\/p>\n\n\n\n<p>The number of tasks in the queue\/thread pool should be limited. This allows you to control when and how the server slows down in case of excessive requests. The limit should be set so that the server can handle peak loads, but not so high that it blocks. In this case, it is better to <strong>fail<\/strong> <strong>fast<\/strong> than to hang for a long time, for both the system and the user [5]. On the proxy or load balancer side, this is frequently implemented by <strong>rate<\/strong> <strong>limiting<\/strong> strategies, e.g., to avoid DDoS and other forms of server overload [11]. But there is also more to consider, for example in the context of queue management, as most servers have a queue in front of a thread pool to handle requests. If the number of requests grows beyond the capacity of the queue, requests are rejected. A high number of requests waiting in the queue requires memory and increases latency. If the number of requests is close to constant, a small queue, or no queue at all, is sufficient. This means that requests will be rejected immediately if there is an increase in traffic. If stronger deviations are to be expected, a longer queue should be used [4].<\/p>\n\n\n\n<p>In addition, to protect servers from excessive load, the concepts of <em>load shedding<\/em> and <em>graceful degradation<\/em> are viable options. 
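<\/p>\n\n\n\n<p>To make the bounded-queue idea concrete, here is a minimal sketch in Python (the names and the capacity of 3 are purely illustrative, not taken from any real system): when the queue is full, new requests are rejected immediately instead of piling up.<\/p>\n\n\n\n

```python
import queue

# Hypothetical bounded request queue: when it is full, reject
# immediately (fail fast) instead of letting requests pile up,
# eating memory and adding latency for everyone.
requests = queue.Queue(maxsize=3)

def submit(request):
    try:
        requests.put_nowait(request)
        return 'accepted'
    except queue.Full:
        return 'rejected'  # e.g., answer the caller with HTTP 503

# The first three requests fit into the queue, the rest are shed.
results = [submit(i) for i in range(5)]
```

\n\n\n\n<p>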
<strong>Load<\/strong> <strong>shedding<\/strong> is used to maintain the server&#8217;s performance as well as possible in case of overload. This is achieved by dropping traffic using approaches ranging from simply returning an <em>HTTP 503<\/em> (Service Unavailable) status code to prioritizing requests individually. A more complex variant of this is called <strong>graceful<\/strong> <strong>degradation<\/strong>, in which the server incrementally switches to lower-quality responses for queries. These might run faster or more efficiently. However, this should be a well-considered decision, because it can add a lot of complexity to your system [4].<\/p>\n\n\n\n<p><strong>Antipattern 2: Dangerous (client) retry behavior<\/strong><\/p>\n\n\n\n<p>In order to reduce the workload of the system, it\u2019s important to make sure that excessive retry behavior is avoided. Exponential backoff is a suitable approach, in which the time intervals for retries are successively increased. You should also use so-called <em>jitter<\/em>, i.e., you add random noise to the retry intervals. 
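<\/p>\n\n\n\n<p>A minimal sketch of exponential backoff with full jitter in Python (function name and parameters are illustrative, not from any particular retry library): each retry waits a random time between zero and an exponentially growing, capped ceiling.<\/p>\n\n\n\n

```python
import random

def backoff_delays(base=0.1, cap=10.0, attempts=5, rng=random.random):
    # Full jitter: each retry sleeps a random time between 0 and
    # min(cap, base * 2**attempt) seconds, so retrying clients spread
    # out instead of hammering the server in synchronized waves.
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng() * ceiling)
    return delays

# Pinning rng to its maximum makes the exponential ceilings visible.
ceilings = backoff_delays(rng=lambda: 1.0)
```

\n\n\n\n<p>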
This prevents your system from being hit by accumulating &#8220;load waves&#8221;, which is also known as <em>retry amplification<\/em> (see Figure 4) [5][10].<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><a href=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_4.jpg\"><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"22611\" data-permalink=\"https:\/\/blog.mi.hdm-stuttgart.de\/index.php\/2022\/03\/03\/cascading-failures-in-large-scale-distributed-systems\/figure_4\/\" data-orig-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_4.jpg\" data-orig-size=\"1356,378\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"figure_4\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_4-1024x285.jpg\" src=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_4-1024x285.jpg\" alt=\"\" class=\"wp-image-22611\" width=\"660\" height=\"183\" srcset=\"https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_4-1024x285.jpg 1024w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_4-300x84.jpg 300w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_4-768x214.jpg 768w, https:\/\/blog.mi.hdm-stuttgart.de\/wp-content\/uploads\/2022\/03\/figure_4.jpg 1356w\" sizes=\"auto, (max-width: 660px) 100vw, 660px\" \/><\/a><figcaption class=\"wp-element-caption\">Figure 4: Typical pattern of retry amplification (Own 
illustration based on [10]).<\/figcaption><\/figure>\n\n\n\n<p>Also, there is a design pattern called a <em>circuit<\/em> <em>breaker<\/em>. Circuit breakers can be thought of as a type of switch. In the initial state, commands from an upstream service are allowed to pass through to a downstream service. If errors increase, the circuit breaker switches to an open state and the system fails fast. This means the upstream service gets an error immediately, allowing the downstream service to recover. After a certain time, the requests are gradually ramped up again. For instance, the library <em>Hystrix<\/em> (already mentioned above) implements a circuit breaker pattern [11].<\/p>\n\n\n\n<p>Another approach to mitigating dangerous retry behavior would be to set a server-side <em>retry<\/em> <em>budget<\/em>, meaning you only retry a certain number of requests per minute. Everything that exceeds the budget is dropped. However, in all cases a global view is important. Executing retries on multiple levels of the software architecture should be avoided at all costs, as the amplification can grow exponentially [4].<\/p>\n\n\n\n<p>Finally, it should be noted that retried requests should be idempotent and free from side effects. It can also be beneficial in terms of system complexity to make calls stateless [10].<\/p>\n\n\n\n<p><strong>Antipattern 3: Crashing on bad input<\/strong><\/p>\n\n\n\n<p>The system should ensure that servers do not crash due to bad input. Such crashes, combined with retry behavior, can lead to catastrophic consequences such as one server crashing after another. In particular, inputs from outside should be carefully checked in this regard. Using fuzz tests is a good way to detect these types of problems [5].<\/p>\n\n\n\n<p><strong>Antipattern 4:&nbsp; Proximity-based failover<\/strong><\/p>\n\n\n\n<p>Make sure that not all of your traffic is redirected to the nearest data center, as it can become overloaded as well. 
The same logic applies here as with the failures of individual servers in a cluster, where one machine can fail after the other. So, to increase the resilience of your system, load must be redirected in a controlled manner during failover, which means you have to consider the maximum capacity of each data center. DNS based on IP anycast, for example, ultimately forwards traffic to the closest data center, which can be problematic [5].<\/p>\n\n\n\n<p><strong>Antipattern 5:&nbsp; Work prompted by failure<\/strong><\/p>\n\n\n\n<p>Failures often cause additional work in the system. In particular, a failure in a system with only a few nodes can lead to a lot of additional work (e.g., replication) for the remaining nodes. This can lead to a harmful feedback loop. A common mitigation strategy would be to delay or limit the amount of replication [5].<\/p>\n\n\n\n<p><strong>Antipattern 6:&nbsp; Long startup times<\/strong><\/p>\n\n\n\n<p>In general, processes are often slower at the beginning. This is due, for instance, to initialization work and runtime optimizations [10]. After a failover, services and systems often collapse due to the heavy load. To prevent this, you should prefer systems with a fast startup time [5]. Also, caches are often empty at system startup. This makes queries more expensive, as they have to go to the origin. As a result, the risk of a crash is higher than when the system is running in a stable mode, so make sure to keep caches available [4].<\/p>\n\n\n\n<p>In addition to these six antipatterns, there are other system components or parameters that should be checked. For example, you can look at your <strong>deadlines<\/strong> for requests or RPC calls. In general, it is difficult to set good deadlines here. But one problem you frequently encounter in the context of cascading failures is that the client misses many deadlines, which means that a lot of resources are wasted [4]. 
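<\/p>\n\n\n\n<p>One way to avoid wasting work on requests whose deadline has already passed is a server-side budget check before processing. The following Python sketch uses illustrative names (it is not taken from any particular RPC framework): the server refuses work it cannot finish in time and hands the remaining budget to the next hop.<\/p>\n\n\n\n

```python
import time

def remaining(deadline):
    # Seconds left until the absolute deadline.
    return deadline - time.monotonic()

def handle(deadline, estimated_work):
    # Refuse work that cannot finish within the budget: the client
    # would give up anyway, so finishing late just wastes resources.
    if remaining(deadline) < estimated_work:
        raise TimeoutError('not enough budget left, failing fast')
    time.sleep(estimated_work)   # stands in for the actual work
    return remaining(deadline)   # budget to pass to the next hop
```

\n\n\n\n<p>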
This was also the case in the <em>AWS<\/em> <em>DynamoDB<\/em> example from the beginning. In general, the server should check whether there is still time left until the deadline before processing, to avoid working for nothing. A common strategy is so-called <em>deadline<\/em> <em>propagation<\/em>. Here, there is an absolute deadline at the top of the request tree. The servers further down only get the time that is left after the previous server has done its calculations. For example, if server A has a deadline of 20 seconds and needs 5 seconds for its own calculation, server B gets a deadline of 15 seconds, and so on [4].<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Cascading failures are a dreaded and peculiar phenomenon in distributed systems. That&#8217;s because sometimes counterintuitive paths must be taken to avoid them: mechanisms actually intended to reduce errors, such as seemingly intelligent load balancing, can increase the risk of total failure. And sometimes it\u2019s just better to simply show an error message to your customer, instead of implementing sophisticated retry logic and risking a DDoS against your own system. However, compromises often have to be made here. Testing, capacity planning, and applying certain patterns in system design can help to improve the resilience of your system.<\/p>\n\n\n\n<p>After all, the lessons learned and postmortems of large technology companies provide a good guide for further action to avoid cascading failures in the future. 
However, it can also be worth keeping an eye on the latest hypes and trends.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">List of Sources<\/h2>\n\n\n\n<p>[<strong>1<\/strong>] <a href=\"http:\/\/static.googleusercontent.com\/media\/www.google.com\/en\/us\/appsstatus\/dashboard\/ir\/plibxfjh8whr44h.pdf\">http:\/\/static.googleusercontent.com\/media\/www.google.com\/en\/us\/appsstatus\/dashboard\/ir\/plibxfjh8whr44h.pdf<\/a><\/p>\n\n\n\n<p>[<strong>2<\/strong>] S. (2015, September 27). SentinelOne | Irreversible Failures: Lessons from the DynamoDB Outage. SentinelOne. <a href=\"https:\/\/www.sentinelone.com\/blog\/irreversible-failures-lessons-from-the-dynamodb-outage-2\/\">https:\/\/www.sentinelone.com\/blog\/irreversible-failures-lessons-from-the-dynamodb-outage-2\/<\/a><\/p>\n\n\n\n<p>[<strong>3<\/strong>] Beckett, L. (2021, October 5). Facebook platforms back online &#8211; as it happened. The Guardian. <a href=\"https:\/\/www.theguardian.com\/technology\/live\/2021\/oct\/04\/facebook-down-instagram-whatsapp-not\">https:\/\/www.theguardian.com\/technology\/live\/2021\/oct\/04\/facebook-down-instagram-whatsapp-not-working-latest-news-error-servers<\/a><\/p>\n\n\n\n<p>[<strong>4<\/strong>] Murphy, N. R., Beyer, B., Jones, C., &amp; Petoff, J. (2016). Site Reliability Engineering: How Google Runs Production Systems (1st ed.). O\u2019Reilly Media.<\/p>\n\n\n\n<p>[<strong>5<\/strong>] Nolan, L. (2021, July 11). Managing the Risk of Cascading Failure. InfoQ. <a href=\"https:\/\/www.infoq.com\/presentations\/cascading-failure-risk\">https:\/\/www.infoq.com\/presentations\/cascading-failure-risk\/<\/a><\/p>\n\n\n\n<p>[<strong>6<\/strong>] Amazon DynamoDB \u2013 H\u00e4ufig gestellte Fragen| NoSQL-Schl\u00fcssel-Werte-Datenbank | Amazon Web Services. (2021). Amazon Web Services, Inc. <a href=\"https:\/\/aws.amazon.com\/de\/dynamodb\/faqs\/\">https:\/\/aws.amazon.com\/de\/dynamodb\/faqs\/<\/a><\/p>\n\n\n\n<p>[<strong>7<\/strong>] Patra, C. (2019, April 19). 
The DynamoDB-Caused AWS Outage: What We Have Learned. Cloud Academy. <a href=\"https:\/\/cloudacademy.com\/blog\/aws-outage-dynamodb\/\">https:\/\/cloudacademy.com\/blog\/aws-outage-dynamodb\/<\/a><\/p>\n\n\n\n<p>[<strong>8<\/strong>] Nolan, L. (2020, February 20). How to Avoid Cascading Failures in Distributed Systems. InfoQ. <a href=\"https:\/\/www.infoq.com\/articles\/anatomy-cascading-failure\/\">https:\/\/www.infoq.com\/articles\/anatomy-cascading-failure\/<\/a><\/p>\n\n\n\n<p>[<strong>9<\/strong>] Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region. (2015). Amazon Web Services, Inc. <a href=\"https:\/\/aws.amazon.com\/de\/message\/5467D2\/\">https:\/\/aws.amazon.com\/de\/message\/5467D2\/<\/a><\/p>\n\n\n\n<p>[<strong>10<\/strong>] The Anatomy of a Cascading Failure. (2019, August 5). YouTube. <a href=\"https:\/\/www.youtube.com\/watch?v=K3tgWsMxaAU\">https:\/\/www.youtube.com\/watch?v=K3tgWsMxaAU<\/a><\/p>\n\n\n\n<p>[<strong>11<\/strong>] Osman, P. (2018). Microservices Development Cookbook: Design and build independently deployable, modular services. Packt Publishing.<\/p>\n\n\n\n<p>[<strong>12<\/strong>] Arya, S. (2020, January 23). Hystrix: How To Handle Cascading Failures In Microservices. All About Buying &amp; Selling of Used Cars, New Car Launches. <a href=\"https:\/\/www.cars24.com\/blog\/hystrix-how-to-handle-cascading-failures-in-microservices\/\">https:\/\/www.cars24.com\/blog\/hystrix-how-to-handle-cascading-failures-in-microservices\/<\/a><\/p>\n\n\n\n<p>[<strong>13<\/strong>] Architecture. (2020). Istio. <a href=\"https:\/\/istio.io\/latest\/docs\/ops\/deployment\/architecture\/\">https:\/\/istio.io\/latest\/docs\/ops\/deployment\/architecture\/<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Internet service providers face the challenge of growing rapidly while managing increasing system distribution. 