The dashboard flashes red across nearly every service. The developers rush to try to locate the problem. They quickly realize that the main server, which hosts the core of the application, has had its power supply fail. There is supposed to be a backup, but the operations team hasn’t provisioned it yet. Customer complaints start rolling in as the team rushes to replace the power supply. How could this have been avoided?
When it comes to software systems, time really is money: unexpected downtime can result in lost transactions, eroded customer trust, and a host of other issues. Failure will always be a part of our systems; however, with the right preparation, we can build systems that are resilient to failure.
In this lesson, we will introduce resiliency, a system’s ability to continue to perform despite experiencing problems. Creating a resilient system allows our services to be highly available, which means our customers can access our functionality the vast majority of the time.
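To make "the vast majority of the time" concrete, availability is often expressed as a percentage of uptime. The short sketch below converts a percentage target into the downtime it permits per year. The function name and the example targets are illustrative assumptions, not figures from this lesson.

```python
# Illustrative sketch: the helper name and the targets below are hypothetical.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes per year a service may be down at this target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Compare a few example targets side by side.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% available: about {minutes:.0f} minutes of downtime per year")
```

Under these assumptions, a 99.9% target leaves under nine hours of downtime per year, which shows how little room a single failed power supply leaves.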
To ensure resiliency, we can apply concepts of DevOps culture such as systems-level thinking, feedback, and continuous experimentation. We need to continuously monitor the whole system to understand how the components work together, learn from our system’s failures, and create policies that respond to those failures.
Can you think of a time in which an unavailable system affected you?
In 2021, Facebook had a major outage causing most of its platforms to go offline. The main site, Instagram, WhatsApp, and more were affected. Many people across the world depend on WhatsApp for communication or Facebook Marketplace for their business. This outage showcased the downsides of relying on a particular organization to always be available.
The diagram pictured here showcases a system dashboard in the middle of quite an emergency. With multiple issues visible at the same time, it can be difficult to understand the root cause of the problem. Times like these can often be avoided through the use of resiliency strategies.
To respond properly to system problems, we need to understand the common types of problems our systems will face. Coming up, we will introduce various problems that our systems should ideally be resilient against.