In DevOps, resiliency is a system’s ability to continue to provide an acceptable level of performance even when problems are occurring within the system.
Common problems include internal failures, external failures, and malicious attacks. These problems are inevitable, but DevOps practices can reduce their frequency and severity.
Resiliency can be achieved through methods such as:
Resiliency can be measured by how quickly a failing critical service recovers and by how available critical services remain during a failure (for example, the number of failed requests and the request latency).
Measuring resiliency shows how the system has performed under adverse conditions.
These measurements can be compared to targets to identify areas for improvement.
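As a rough illustration of the measurements described above, the following Python sketch computes a failure rate and a recovery time for a single incident and compares them to targets. The `IncidentWindow` type, its field names, and the target values are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class IncidentWindow:
    """Hypothetical record of one incident; all field names are assumptions."""
    total_requests: int
    failed_requests: int
    outage_start_s: float   # time the failure began, in seconds
    recovered_at_s: float   # time the service recovered, in seconds


def failure_rate(window: IncidentWindow) -> float:
    """Fraction of requests that failed during the incident."""
    if window.total_requests == 0:
        return 0.0
    return window.failed_requests / window.total_requests


def recovery_time_s(window: IncidentWindow) -> float:
    """Seconds between the start of the failure and recovery."""
    return window.recovered_at_s - window.outage_start_s


def meets_targets(window: IncidentWindow,
                  max_failure_rate: float,
                  max_recovery_s: float) -> dict:
    """Report, per metric, whether the measurement is within its target."""
    return {
        "failure_rate": failure_rate(window) <= max_failure_rate,
        "recovery_time": recovery_time_s(window) <= max_recovery_s,
    }
```

A metric that misses its target (a `False` entry in the result) points to an area for improvement.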
Internal system problems include system changes and hardware failures.
These internal problems can be addressed through practices such as change-management processes and redundancy.
External system problems include a dependency becoming unavailable or losing support.
These can be addressed through practices such as fallback strategies and a proactive approach to dependency management.
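One possible shape for a fallback strategy is sketched below in Python: the primary dependency is tried a limited number of times, and a fallback (for example, a cached or default value) is used if every attempt fails. The function name `call_with_fallback` and the `attempts` parameter are assumptions made for this example, not part of any particular library.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T],
                       fallback: Callable[[], T],
                       attempts: int = 2) -> T:
    """Try the primary dependency up to `attempts` times, then fall back.

    Catching Exception is deliberately broad for this sketch; real code
    would usually catch only the errors the dependency is known to raise.
    """
    for _ in range(attempts):
        try:
            return primary()
        except Exception:
            continue  # primary failed; try again or fall through
    return fallback()
```

The fallback keeps the service answering, possibly with degraded or stale data, while the external dependency is unavailable.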
A system’s resiliency can be tested through practices such as:
Cyberattacks include distributed denial-of-service (DDoS) and SQL injection attacks. They can be mitigated through practices such as input validation, which addresses injection attacks in particular.
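As a minimal sketch of input validation against SQL injection, the Python example below checks a username against an allowlist pattern before querying a database. It also uses a parameterized query, a separate technique that is commonly paired with validation. The `users` table, the `find_user` function, and the username pattern are hypothetical; the database calls are from Python's standard `sqlite3` module.

```python
import re
import sqlite3

# Assumed rule for this example: 1 to 32 letters, digits, or underscores.
USERNAME_RE = re.compile(r"[A-Za-z0-9_]{1,32}")


def find_user(conn: sqlite3.Connection, username: str):
    """Return the (id, name) row for `username`, or None if not found."""
    # Input validation: reject anything outside the allowlist pattern.
    if not USERNAME_RE.fullmatch(username):
        raise ValueError("invalid username")
    # Parameterized query: the value is passed separately from the SQL
    # text, so it is never interpreted as SQL.
    cursor = conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    )
    return cursor.fetchone()
```

An input such as `alice' OR '1'='1` fails validation and raises `ValueError` before any query runs.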