Cascading failures in large-scale distributed systems

Providers of internet services face the challenge of growing rapidly while managing increasingly distributed systems. Although reliable operation is of great importance to companies such as Google and Amazon, their systems fail time and again, resulting in extensive outages and a poor customer experience. Gmail (2012) [1], AWS DynamoDB (2015) [2], and more recently Facebook (2021) [3] have all been affected, to name just a few examples. These incidents often involve so-called cascading failures, which cause complications that go beyond ordinary system malfunctions. But how is it that even the big players in the online business cannot completely avoid such breakdowns, given their budgets and technical knowledge? And what practical approaches to risk mitigation can you use for your own system?

The goal of this blog article is to show how you can increase the resilience of your large distributed system by preventing failures from propagating.
