Note: This post was composed for the module Enterprise IT (113601a)
Understanding High Availability
High Availability refers to a system's ability to function without failure over a significant period of time. It ensures that services remain reachable despite hardware malfunctions, software failures, and other unexpected issues.
Nowadays, with more and more businesses and critical infrastructure relying on uninterrupted access to key data, High Availability keeps rising in both relevance and complexity.
Definition and Importance
High Availability is usually measured in "nines of availability," where more nines indicate higher overall availability and therefore lower system downtime. At the current level of technology, 99.999% ("five nines") has become the industry-standard availability target, as it represents the optimal middle point between the exponentially rising cost and the diminishing returns of higher availability levels.[1] This translates to about 5.26 minutes of downtime a year. The following chart shows the differences and diminishing returns of subsequent availability levels.
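Since each additional nine simply cuts the allowed downtime by a factor of ten, these figures can also be derived directly. The following minimal Python sketch (assuming a 365-day year, so the numbers are approximate) prints the yearly downtime budget for each level:

```python
# Minimal sketch: annual downtime implied by a given availability level.
# Assumes a 365-day year; real SLA math may differ slightly (leap years, maintenance windows).

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the maximum yearly downtime allowed at the given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR

for nines in ["99", "99.9", "99.99", "99.999", "99.9999", "99.99999", "99.999999"]:
    minutes = annual_downtime_minutes(float(nines))
    if minutes >= 1:
        print(f"{nines}%  ->  {minutes:,.2f} minutes of downtime per year")
    else:
        print(f"{nines}%  ->  {minutes * 60:.2f} seconds of downtime per year")
```

Running this confirms the two figures used in this post: five nines allow roughly 5.26 minutes per year, while eight nines allow only about 0.3 seconds.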

The High Cost of Downtime
For industry branches that rely on real-time transactions, downtime can have severe financial consequences. A Gartner report from 2016 estimated the average cost of a minute of IT downtime at approximately $5,600, and depending on the severity and the Mean Time To Recovery (MTTR), losses can even reach into the millions. In order to both prevent immense losses and ensure smooth service even in the case of failures, companies choose to invest significantly in their availability. After all, according to Google statistics, 53% of users abandon a site if it takes more than 3 seconds to load.[3]
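As a rough back-of-the-envelope illustration using the per-minute figure quoted above (actual costs vary widely by industry and incident):

```python
# Back-of-the-envelope outage cost, using the average per-minute figure quoted above.
# Real costs depend heavily on the industry, the time of day, and which systems are affected.

COST_PER_MINUTE_USD = 5_600  # often-cited industry average

def outage_cost(downtime_minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Estimated cost of an outage of the given length."""
    return downtime_minutes * cost_per_minute

# A single one-hour outage at the average rate:
print(f"${outage_cost(60):,.0f}")  # $336,000
```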
Aside from financial losses, downtime in critical sectors like healthcare, communications, and transportation can have life-threatening consequences. Hospitals can be delayed when accessing patient data, and airlines need to manage flight schedules and air-traffic control. Many more examples can be found where constant uptime is essentially required to ensure safety and prevent disasters.
Strategies to Achieve 99.999999% Uptime
As seen above, an uptime of 99.999999% ("eight nines") corresponds to about 0.3 seconds of downtime a year. To achieve this level, a multitude of design principles and requirements need to be established. Reliability can be increased through the following measures: [2]
- By monitoring key performance indicators, a system can automatically recover from failure. Automated responses to detect, track, and potentially resolve failures can be set up to reduce the amount of manual work that needs to be done (see the sketch after this list).
- Automation can also be used to test recovery procedures. Simulating failures and testing recovery strategies can expose weaknesses before they become a problem and ensure that existing strategies will work when needed.
- In order to minimize the impact of individual failures, it is advisable to scale systems horizontally. With many connected resources instead of one central unit, workloads can be distributed so that they do not share a single point of failure. This is one of the key ideas behind redundancy.
- Managing workload capacity based on actual demand, instead of setting limits up front, prevents resource saturation. Scaling resources dynamically to meet demand both reduces costs and optimizes provisioning.
- If changes to the infrastructure are made through automation, not only is the risk of human error reduced, the changes can also easily be documented and reviewed, which increases reliability and even accountability.
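To make the first principle more tangible, here is a minimal Python sketch of automated detection and recovery. The health-check URL and restart command are hypothetical placeholders; real environments would typically rely on an orchestrator or managed health checks instead:

```python
# Minimal sketch of automated failure detection and recovery (hypothetical service and commands).
# Production systems would normally delegate this to an orchestrator or managed health checks.
import subprocess
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/health"            # hypothetical health-check endpoint
RESTART_CMD = ["systemctl", "restart", "my-service"]   # hypothetical recovery action

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Probe the health endpoint and report whether the service responds normally."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False

while True:
    if not is_healthy(HEALTH_URL):
        # Failure detected: trigger the automated recovery action and record it.
        print("Health check failed, restarting service...")
        subprocess.run(RESTART_CMD, check=False)
    time.sleep(10)  # poll interval
```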
Redundant Infrastructure
Redundancy means having multiple duplicates of relevant components and systems, to ensure that there is always a backup available in case of failure. This also means that repairs can be carried out while the system keeps running on the unaffected components. Redundancy can be implemented through many methods in almost any link of the process chain.
Geographical distribution is the most basic form of redundancy and implies the existence of multiple data centers. Distributing them over different physical locations protects against unpredictable natural phenomena as well as potential outages, be they electrical or otherwise: traffic can be routed to an intact instance of a center while repairs and rebuilding are underway.
Load balancing means distributing workloads across multiple servers. This way, failures can be bypassed and requests can be handled by an alternate server. With dynamic load balancing, the target for rerouting traffic is chosen based on the current capacity utilization of each server. The alternative is a static approach, where requests are distributed according to a fixed scheme, regardless of the servers' current load.
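A minimal sketch of such a dynamic decision, using a simple "least connections" rule (server names and connection counts are purely illustrative):

```python
# Minimal sketch of a dynamic load-balancing decision ("least connections"):
# route each incoming request to the backend that is currently least utilized.
# Server names and connection counts are illustrative only.

backends = {
    "server-a": 12,  # currently open connections
    "server-b": 3,
    "server-c": 7,
}

def pick_backend(load: dict[str, int]) -> str:
    """Return the backend with the lowest current utilization."""
    return min(load, key=load.get)

target = pick_backend(backends)
backends[target] += 1  # account for the newly routed request
print(f"Routing request to {target}")  # -> server-b
```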
Whenever backup infrastructure is involved, failover processes and their management need to be considered. These ensure that, in case of failure, the switch to standby systems is both seamless and fast.
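Conceptually, a failover decision can be reduced to "prefer the primary, fall back to the first healthy standby." The sketch below illustrates this with hypothetical endpoints; production setups usually delegate it to a load balancer, DNS failover, or a cluster manager:

```python
# Minimal failover sketch (hypothetical endpoints): prefer the primary,
# fall back to the first healthy standby if the primary stops responding.
import urllib.request

PRIMARY = "http://primary.example.internal/health"        # hypothetical
STANDBYS = ["http://standby-1.example.internal/health",   # hypothetical
            "http://standby-2.example.internal/health"]

def healthy(url: str) -> bool:
    """Report whether the endpoint answers its health check."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except OSError:
        return False

def active_endpoint() -> str:
    """Return the endpoint traffic should currently be sent to."""
    for candidate in [PRIMARY, *STANDBYS]:
        if healthy(candidate):
            return candidate
    raise RuntimeError("No healthy endpoint available")
```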
Proactive Monitoring and Maintenance
In order to recognize anomalies and issues before they develop into serious problems, systems and processes need to be checked regularly, even if they do not currently show problems. By implementing monitoring tools, it is possible to track all kinds of relevant metrics and identify possible issues well before they become visible. Preemptively finding solutions for such issues then reduces the likelihood of unexpected failures. The physical side, meaning hardware inspections in the actual data centers, should be checked just as regularly, since hardware repairs often depend on specific parts that might not be available immediately.
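A minimal sketch of such threshold-based monitoring (metric names and thresholds are illustrative; dedicated tools like Prometheus or Nagios would be used in practice):

```python
# Minimal sketch of threshold-based monitoring: compare current metric readings
# against alert thresholds before they turn into outages.
# Metric names and thresholds are illustrative placeholders.

THRESHOLDS = {
    "disk_used_percent": 85.0,
    "cpu_used_percent": 90.0,
    "replication_lag_seconds": 30.0,
}

def check_metrics(readings: dict[str, float]) -> list[str]:
    """Return a list of warnings for metrics that exceed their threshold."""
    warnings = []
    for metric, limit in THRESHOLDS.items():
        value = readings.get(metric)
        if value is not None and value > limit:
            warnings.append(f"{metric} at {value} exceeds threshold {limit}")
    return warnings

print(check_metrics({"disk_used_percent": 91.2, "cpu_used_percent": 40.0}))
```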
Disaster Recovery Planning
Even when all the above steps are sufficiently prepared and executed, it is often not possible to prevent every eventual disaster. This means that there should always be standardized rules and a recovery plan for emergencies.
Regular backups should be standard in any such plan. Having copies of relevant data is essential in cases of corruption or loss.
The previously established recovery protocols should be tested regularly, to prevent sudden and unexpected issues when dealing with actual failures.
Having communication channels with relevant authorities and support personnel prepared in advance helps guide the process and reduces stress-related mistakes when responding to issues.
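As a small illustration of the backup and recovery-testing points above, the following sketch writes a timestamped archive and performs a basic restore check (paths are hypothetical; real plans use dedicated backup tooling and off-site copies):

```python
# Minimal sketch of a timestamped backup plus a basic restore check.
# Paths are hypothetical; real plans use dedicated backup tooling and off-site copies.
import tarfile
from datetime import datetime
from pathlib import Path

DATA_DIR = Path("/var/lib/app-data")   # hypothetical data to protect
BACKUP_DIR = Path("/mnt/backups")      # hypothetical backup target

def create_backup() -> Path:
    """Write a timestamped archive of the data directory and return its path."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    archive = BACKUP_DIR / f"app-data-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(DATA_DIR, arcname="app-data")
    return archive

def verify_backup(archive: Path) -> bool:
    """A minimal restore test: the archive must open and contain at least one member."""
    with tarfile.open(archive, "r:gz") as tar:
        return len(tar.getmembers()) > 0
```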
By integrating these strategies, systems can approach the goal of 99.999999% uptime, ensuring that the provided services remain reliable and resilient when faced with inevitable challenges.
References
[1] https://www.datacenterknowledge.com/uptime/achieving-eight-9s-as-the-world-goes-digital
[2] AWS, 2023 – AWS Well-Architected Framework: Reliability Pillar
[3] https://www.thinkwithgoogle.com/consumer-insights/consumer-trends/mobile-site-load-time-statistics/
[4] https://www.ibm.com/think/topics/high-availability
[5] precisely.com
[6] docs.aws.amazon.com
[7] https://www.ni.com/en/shop/electronic-test-instrumentation/add-ons-for-electronic-test-and-instrumentation/what-is-systemlink-tdm-datafinder-module/what-is-rasm/what-is-availability-.html