Over the last few months, Bloop Co. has been trying to improve its systems’ resiliency. But how do they know that they are on the right track? Which of their efforts is most successful? Which areas still need attention? The next several exercises seek to answer these questions.
Without the ability to measure our systems’ resiliency, we can’t be sure how effective our current strategy is. Monitoring data lets us measure how our systems respond to problems. Some important metrics that indicate a system’s resiliency include:
- Uptime: what percentage of the time is our system available?
- Recovery speed: when an outage occurs, how long does it take for the system to become available again?
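Both metrics can be computed from a simple record of outages. The following is a minimal sketch, assuming a hypothetical outage log of start and end timestamps over a 30-day monitoring window; the names `OUTAGES`, `WINDOW`, `uptime_percentage`, and `mean_recovery_time` are illustrative and do not come from any particular monitoring tool.

```python
from datetime import datetime, timedelta

# Hypothetical data: one 30-day monitoring window with two recorded outages.
WINDOW = timedelta(days=30)
OUTAGES = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 45)),   # 45 minutes
    (datetime(2024, 1, 17, 2, 30), datetime(2024, 1, 17, 4, 0)),   # 90 minutes
]


def uptime_percentage(outages, window):
    """Uptime: share of the window during which the system was available."""
    downtime = sum((end - start for start, end in outages), timedelta())
    return 100 * (1 - downtime / window)


def mean_recovery_time(outages):
    """Recovery speed: average outage duration, or None with no outages."""
    if not outages:
        return None
    durations = [end - start for start, end in outages]
    return sum(durations, timedelta()) / len(durations)


print(f"Uptime: {uptime_percentage(OUTAGES, WINDOW):.4f}%")
print(f"Mean recovery time: {mean_recovery_time(OUTAGES)}")
```

A real system would pull these timestamps from its monitoring or incident records rather than a hard-coded list.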
These metrics tell us how our system handles issues, but what do we compare them against? Two common benchmarks are the recovery time objective (RTO) and the recovery point objective (RPO).
Recovery Time Objective (RTO)
The RTO is the maximum amount of time an application can be unavailable before it causes significant harm to the business. Imagine that a business has promised its users that it will never go down for more than an hour at a time. This business has set its RTO to one hour. If its service is down for longer than that, the business will have broken an important promise to its customers.
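Checking recovery times against an RTO is a simple comparison. Here is a small sketch using the one-hour RTO from the example above; the outage durations and the function name `rto_violations` are made up for illustration.

```python
from datetime import timedelta

# RTO from the example: never down for more than an hour at a time.
RTO = timedelta(hours=1)


def rto_violations(outage_durations, rto=RTO):
    """Return the outages that lasted longer than the RTO allows."""
    return [duration for duration in outage_durations if duration > rto]


# Hypothetical outage durations taken from an incident log.
durations = [timedelta(minutes=45), timedelta(minutes=90)]
violations = rto_violations(durations)

print(f"{len(violations)} of {len(durations)} outages exceeded the RTO")
```

Here the 45-minute outage stays within the objective, while the 90-minute outage is flagged as a violation.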
Recovery Point Objective (RPO)
The RPO is the acceptable amount of data loss after a system outage. Different applications have varying levels of data importance. For a popular bank, losing even minutes of transaction data might be a nightmare. Losing hours of progress in an online multiplayer system would be unfortunate, but not a disaster. The amount of data loss we can accept might affect how often data is backed up, or cause adjustments to the RTO.
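To see how the RPO relates to backup frequency, it helps to assume the RPO is expressed as a window of time, which is a common convention but an assumption here. Under that assumption, the data at risk is whatever was written since the last backup, so the backup interval should not exceed the RPO. The function names and values below are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup, outage_start):
    """Data written between the last backup and the outage is at risk."""
    return outage_start - last_backup


def meets_rpo(backup_interval, rpo):
    """Worst case, an outage hits just before the next backup is due,
    so the backup interval itself must fit within the RPO."""
    return backup_interval <= rpo


# Hypothetical values: a bank-like system that tolerates 5 minutes of loss.
rpo = timedelta(minutes=5)
lost = data_loss_window(
    last_backup=datetime(2024, 1, 17, 2, 0),
    outage_start=datetime(2024, 1, 17, 2, 30),
)

print(f"Data at risk: {lost}, within RPO: {lost <= rpo}")
print(f"Hourly backups meet RPO: {meets_rpo(timedelta(hours=1), rpo)}")
print(f"1-minute backups meet RPO: {meets_rpo(timedelta(minutes=1), rpo)}")
```

A stricter RPO pushes the backup interval down, which is the cost side of protecting more data.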
Benchmarks such as these can help us establish target levels of uptime and recovery time for our business. These targets alert us when our systems are approaching a critical “danger zone”.
How might too high an uptime target actually hurt an organization? Why is a balanced target important?
Even the world’s biggest and most well-funded organizations struggle to achieve 100% uptime of their systems. Setting too high a resiliency target could cause us to direct more resources than necessary toward that goal. Keeping a balanced target allows us to achieve a useful level of uptime while still prioritizing other important aspects of the system, such as new features.
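One way to make this trade-off concrete is to convert an uptime target into the downtime it allows. The sketch below assumes a 365-day year and uses a hypothetical helper, `allowed_downtime`; the targets shown are illustrative, not recommendations.

```python
from datetime import timedelta

YEAR = timedelta(days=365)  # assumption: ignore leap years for simplicity


def allowed_downtime(uptime_target_percent, period=YEAR):
    """Downtime a system may accumulate in a period and still hit its target."""
    return period * (1 - uptime_target_percent / 100)


for target in (99.0, 99.9, 99.99, 100.0):
    hours = allowed_downtime(target).total_seconds() / 3600
    print(f"{target}% uptime allows about {hours:.2f} hours of downtime per year")
```

Each step up in the target shrinks the allowance by roughly a factor of ten, and a 100% target leaves no room for any outage at all, which is why chasing it can consume resources that other parts of the system need.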
Real-world situations give us the most insight into how our system will perform under adverse conditions. However, we cannot simply wait for disasters to occur before testing any aspect of the resiliency of our system. Instead, we can reproduce these scenarios under much more controlled conditions. In the next exercises, we will discuss ways to simulate issues occurring within the system in order to test our resiliency.