A Service Level Objective (SLO) is a range of valid measurements for a metric. For example, an SLO might define that a webpage loads within 200ms of the user accessing it.
In DevOps, monitoring tools are used to provide a continuous process of identifying, tracking, analyzing, and alerting on specific components of the system.
A Service Level Indicator (SLI) is a quantitative measure of a metric. For example, an SLI might indicate that the current loading time of a webpage is 152 milliseconds.
A Service Level Agreement (SLA) is a contract with consumers about expected levels of service.
SLIs (Service Level Indicators), SLOs (Service Level Objectives), and SLAs (Service Level Agreements) are used to tie system monitoring metrics to business goals and objectives.
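The relationship between an SLI and an SLO can be sketched in a few lines of Python. This is only an illustration built on the page-load example above: the names LatencySLO, is_met, and compliance_ratio are hypothetical, and the sample measurements are made up for the sketch.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LatencySLO:
    """Hypothetical SLO: a page load must finish within max_ms milliseconds."""
    max_ms: float

    def is_met(self, sli_ms: float) -> bool:
        # An SLI is a single measurement; the SLO is the range it must fall in.
        return sli_ms <= self.max_ms


def compliance_ratio(samples_ms: Sequence[float], slo: LatencySLO) -> float:
    """Fraction of measured SLIs that fall inside the SLO's valid range."""
    if not samples_ms:
        raise ValueError("at least one measurement is required")
    met = sum(1 for sample in samples_ms if slo.is_met(sample))
    return met / len(samples_ms)


# Values from the text: a 200 ms objective and a current reading of 152 ms.
slo = LatencySLO(max_ms=200.0)
current_sli_ms = 152.0

# Illustrative (made-up) measurements gathered over some period.
samples_ms = [152.0, 180.0, 210.0, 95.0]

print(slo.is_met(current_sli_ms))
print(compliance_ratio(samples_ms, slo))
```

An SLA would typically sit on top of a ratio like this one, for example by committing to consumers that a stated share of page loads meets the objective; the exact terms depend on the contract.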
An alert helps teams stay informed about activities occurring in a monitored system.
Monitoring allows teams to watch and understand the state of their systems by gathering predefined metrics or logs.
Observability is the degree to which the metrics of a system can be acted upon to locate and fix a problem.
When implementing monitoring, choose metrics that reveal the health of the system as well as issues with the user experience.
Metrics should exist to measure the quality of the monitoring and observability of systems. These metrics might include the number of improper alerts, time for issue resolution, and the time taken for an issue to be identified.
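As a rough sketch, the three example metrics above could be computed from incident and alert records as follows. The Incident record and the function names are hypothetical, the sample data is made up, and measuring resolution time from detection (rather than from the start of the issue) is an assumption of this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record with three timestamps."""
    started: datetime    # when the issue actually began
    detected: datetime   # when monitoring identified it
    resolved: datetime   # when the issue was fixed


def mean_time_to_detect(incidents: Sequence[Incident]) -> timedelta:
    """Average time taken for an issue to be identified."""
    seconds = mean((i.detected - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mean_time_to_resolve(incidents: Sequence[Incident]) -> timedelta:
    """Average time from detection to resolution (an assumed definition)."""
    seconds = mean((i.resolved - i.detected).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def improper_alert_ratio(total_alerts: int, improper_alerts: int) -> float:
    """Share of alerts that were not useful or actionable."""
    if total_alerts == 0:
        return 0.0
    return improper_alerts / total_alerts


# Illustrative (made-up) data.
t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=34)),
    Incident(t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=26)),
]

print(mean_time_to_detect(incidents))
print(mean_time_to_resolve(incidents))
print(improper_alert_ratio(total_alerts=40, improper_alerts=10))
```

Tracking these values over time gives a team a way to tell whether changes to alerting rules or dashboards are actually improving monitoring.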
Improper monitoring can produce too many alerts that are neither useful nor actionable. These noisy alerts can lead staff to distrust alerting and to ignore the alerts that are actually useful.
In DevOps, resiliency is a system’s ability to continue to provide an acceptable level of performance even when problems are occurring within the system.
Common problems that affect a system include internal failures, external failures, and malicious attacks. These problems are inevitable, but DevOps practices can minimize their frequency and severity.
Resiliency can be achieved through methods such as:
Resiliency can be measured by determining how quickly a critical failing service can recover, and the degree to which critical services are available during a failure (number of failed requests, latency).
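The measurements above can be illustrated with a small Python sketch over a request log. The Request record, the function names, and the sample log are hypothetical; defining recovery time as the span from the first failed request to the first successful request after the last failure is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical log entry for one request to a critical service."""
    timestamp_s: float
    succeeded: bool
    latency_ms: float


def availability(requests: Sequence[Request]) -> float:
    """Fraction of requests that succeeded during the observed window."""
    if not requests:
        raise ValueError("at least one request is required")
    return sum(1 for r in requests if r.succeeded) / len(requests)


def failed_requests(requests: Sequence[Request]) -> int:
    """Number of failed requests in the observed window."""
    return sum(1 for r in requests if not r.succeeded)


def worst_latency_ms(requests: Sequence[Request]) -> float:
    """Highest latency seen in the observed window."""
    return max(r.latency_ms for r in requests)


def recovery_time_s(requests: Sequence[Request]) -> Optional[float]:
    """Seconds from the first failure to the first success after the last failure.

    Returns 0.0 if nothing failed and None if the service never recovered.
    """
    ordered = sorted(requests, key=lambda r: r.timestamp_s)
    failures = [r.timestamp_s for r in ordered if not r.succeeded]
    if not failures:
        return 0.0
    first_failure, last_failure = failures[0], failures[-1]
    for r in ordered:
        if r.succeeded and r.timestamp_s > last_failure:
            return r.timestamp_s - first_failure
    return None


# Illustrative (made-up) log covering a short outage.
log = [
    Request(0.0, True, 100.0),
    Request(1.0, False, 900.0),
    Request(2.0, False, 950.0),
    Request(3.0, True, 300.0),
    Request(4.0, True, 110.0),
]

print(availability(log))
print(failed_requests(log))
print(worst_latency_ms(log))
print(recovery_time_s(log))
```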