
System Health

What is a Service Level Objective

A Service Level Objective (SLO) is a range of valid measurements for a metric. For example, an SLO might define that a webpage loads within 200ms of the user accessing it.

What are Monitoring Tools

In DevOps, monitoring tools are used to provide a continuous process of identifying, tracking, analyzing, and alerting on specific components of the system.

What is an SLI

A Service Level Indicator (SLI) is a quantitative measure of a metric. For example, an SLI might indicate that the current loading time of a webpage is 152 milliseconds.
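
To make the relationship between an SLI and an SLO concrete, here is a minimal Python sketch. It assumes the SLI is the mean of a few sampled page load times and reuses the 200 ms objective from the SLO example; the function names and sample values are hypothetical, and real systems often use percentiles rather than a mean.

```python
from statistics import mean

# Objective taken from the SLO example: the page should load within 200 ms.
SLO_MAX_LOAD_MS = 200


def measure_sli(load_times_ms):
    """Return the SLI, assumed here to be the mean load time in milliseconds."""
    return mean(load_times_ms)


def meets_slo(sli_ms, slo_max_ms=SLO_MAX_LOAD_MS):
    """Return True when the measured SLI falls inside the range the SLO allows."""
    return sli_ms <= slo_max_ms


# Hypothetical sampled load times, in milliseconds.
samples = [140, 152, 164]
sli = measure_sli(samples)
print(f"SLI: {sli} ms, meets SLO: {meets_slo(sli)}")
```

The SLI is what was measured; the SLO is the limit it is compared against.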

What is a Service Level Agreement

A Service Level Agreement (SLA) is a contract with consumers about expected levels of service.

SLIs, SLOs, and SLAs

SLIs (Service Level Indicators), SLOs (Service Level Objectives), and SLAs (Service Level Agreements) are used to tie system monitoring metrics to business goals and objectives.

What is an alert

An alert helps teams stay informed about activities occurring in a monitored system.

Monitoring

Monitoring allows teams to watch and understand the state of their systems by gathering predefined metrics or logs.

Observability

Observability is the degree to which the metrics of a system can be acted upon to locate and fix a problem.

Monitoring Metrics

When implementing monitoring, metrics should be chosen that reveal the health of the system, as well as issues with user experience.

Monitoring and Observability Metrics

Metrics should exist to measure the quality of the monitoring and observability of systems. These metrics might include the number of improper alerts, time for issue resolution, and the time taken for an issue to be identified.
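
As a rough illustration, the three metrics named above could be computed from incident and alert records. This Python sketch assumes timestamps in seconds and measures resolution time from the start of the issue; the `Incident` record and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    started_at: float   # when the issue began (seconds)
    detected_at: float  # when monitoring identified it (seconds)
    resolved_at: float  # when the issue was fixed (seconds)


def mean_time_to_detect(incidents):
    """Average time taken for an issue to be identified."""
    return sum(i.detected_at - i.started_at for i in incidents) / len(incidents)


def mean_time_to_resolve(incidents):
    """Average time from the start of an issue to its resolution (an assumption)."""
    return sum(i.resolved_at - i.started_at for i in incidents) / len(incidents)


def improper_alert_rate(total_alerts, improper_alerts):
    """Fraction of alerts that were not useful or actionable."""
    return improper_alerts / total_alerts if total_alerts else 0.0


# Hypothetical incident history.
incidents = [Incident(0, 60, 600), Incident(1000, 1120, 1900)]
print(mean_time_to_detect(incidents))
print(mean_time_to_resolve(incidents))
print(improper_alert_rate(total_alerts=20, improper_alerts=5))
```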

Monitoring Pitfalls

Improper monitoring can produce too many alerts that are not useful or actionable. These noisy alerts can cause staff to distrust alerting in general and to ignore the alerts that are actually useful.
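
One possible way to reduce noise, shown here only as a sketch and not as a recommendation from this material, is to alert on a sustained breach rather than on a single bad measurement. The function name, threshold, and `min_consecutive` parameter are hypothetical.

```python
def should_alert(samples, threshold, min_consecutive=3):
    """Return True only if the most recent `min_consecutive` samples all exceed the threshold."""
    if len(samples) < min_consecutive:
        return False
    return all(s > threshold for s in samples[-min_consecutive:])


# A single spike does not trigger an alert; a sustained breach does.
print(should_alert([100, 250, 120], threshold=200))
print(should_alert([210, 250, 230], threshold=200))
```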

Resiliency

In DevOps, resiliency is a system’s ability to continue to provide an acceptable level of performance even when problems are occurring within the system.

System Threats

Common threats to a system include internal failures, external failures, and malicious attacks. These problems are inevitable, but DevOps practices can minimize their frequency and severity.

Resiliency Methods

Resiliency can be achieved through methods such as:

  • Caching: storing frequently or recently retrieved responses in a fast, accessible location.
  • Input validation: checking incoming requests for malicious content and rejecting any that fail the checks.
  • Load balancing: distributing requests evenly across different servers to avoid overworking any individual server.
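
The three methods above can each be sketched in a few lines of Python. This is a simplified illustration using only the standard library; the server names, the validation rules, and the function names are all hypothetical, and production systems typically rely on dedicated caches, web application firewalls, and load balancers.

```python
from functools import lru_cache
from itertools import cycle

# Counts how often the (simulated) slow backend is actually reached.
BACKEND_CALLS = {"count": 0}


# Caching: keep recently retrieved responses in memory.
@lru_cache(maxsize=128)
def fetch_page(path):
    BACKEND_CALLS["count"] += 1
    return f"<html>content for {path}</html>"


# Input validation: reject requests that look malformed or malicious.
# These rules are illustrative assumptions, not a complete defense.
def is_valid_request(path):
    return path.startswith("/") and len(path) <= 2048 and ".." not in path


# Load balancing: round-robin across servers so none is overworked.
SERVERS = ["server-a", "server-b", "server-c"]
_next_server = cycle(SERVERS)


def pick_server():
    return next(_next_server)
```

Calling `fetch_page("/home")` twice reaches the backend only once, `is_valid_request("/../etc/passwd")` returns `False`, and repeated calls to `pick_server()` rotate through the three servers in order.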

Measuring Resiliency

Resiliency can be measured by determining how quickly a critical failing service can recover, and the degree to which critical services are available during a failure (number of failed requests, latency).
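
As a minimal sketch of these measurements, the Python below computes recovery time and the share of requests that succeeded during a failure. The function names and the sample numbers are hypothetical, and timestamps are assumed to be in seconds.

```python
def recovery_time(failed_at, recovered_at):
    """Seconds a critical service took to recover after failing."""
    return recovered_at - failed_at


def availability(total_requests, failed_requests):
    """Fraction of requests served successfully during the failure window."""
    if total_requests == 0:
        return 1.0  # no traffic means no failed requests
    return (total_requests - failed_requests) / total_requests


# Hypothetical failure window: 1000 requests, 25 of which failed.
print(recovery_time(failed_at=120.0, recovered_at=300.0))
print(availability(total_requests=1000, failed_requests=25))
```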
