Here is where we will introduce Monitoring and how this DevOps practice can help us gain insights into the health and performance of a system. 

_Over time, Bloop Co's IT infrastructure has become very complex. Jeff's team used to be able to inspect the health of each component, but those days are long gone. When the system slows or an outage occurs, Jeff and his team don't know where to begin. How can they check system health as the company’s infrastructure expands?_

This system needs monitoring.

**Monitoring** helps teams understand the state of their systems based on gathered data. The monitoring stage of the DevOps lifecycle is crucial in gathering real-time insights on:

* Performance
* Health status
* Scalability

In the exercises ahead, we will explore the following key topics:

* Metrics utilized to assess the health of the system
* Comparing metrics to business objectives 
* Monitoring tools 
* Effective alerting
* Observability
* Monitoring quality

Learning about these topics will deepen our understanding of our complex systems. Let's jump right in!



This image shows a diagram of the DevOps lifecycle. 

What is Monitoring

_Clark added a new service to handle more traffic for his growing business. But he doesn’t know how his new application is doing! What information could help him gain insight into his system?_

Clark can measure how his application is operating with the help of metrics. **Metrics** express a value relevant to the system at a specific point of time. Here are some key metrics for monitoring system health and performance:

<details>
<summary>
Latency
</summary>
Latency is the time between the start of an event, such as serving a request, and its completion. This metric is a key indicator of performance. 
</details>

<br/>

<details>
<summary>
Traffic/Connections
</summary>
Traffic is the amount of system usage over time. An abnormal amount of traffic can require scaling to maintain performance.
</details>

<br/>

<details>
<summary>
Errors
</summary>
An error is an invalid state our system has reached. Examples include exceeding a memory limit or reading a corrupted data file. The rate of errors returned by a service can indicate deeper issues.
</details>

<br/>

<details>
<summary>
Saturation
</summary>
Saturation describes the load on our system’s resources. Our system reaching its limits can result in poor performance.
</details>
<br/>

Tracking these metrics can give teams a broad view of system health and help diagnose issues. 
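
To make these four metrics concrete, here is a minimal sketch that computes one value for each from a small batch of request records. The `Request` record, the `summarize` function, and the CPU figures are hypothetical names and numbers chosen for illustration; a real monitoring tool would gather this data for us.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    """One handled request (hypothetical record shape, for illustration only)."""
    duration_ms: float  # time from the start of the request to its completion
    succeeded: bool     # False if the request ended in an error


def summarize(requests, window_seconds, cpu_used, cpu_capacity):
    """Compute one value per metric for a single time window."""
    return {
        "latency_ms": mean(r.duration_ms for r in requests),     # average latency
        "traffic_rps": len(requests) / window_seconds,           # requests per second
        "error_rate": sum(not r.succeeded for r in requests) / len(requests),
        "saturation": cpu_used / cpu_capacity,                   # share of CPU in use
    }


# Four made-up requests observed over a two-second window.
sample = [Request(80, True), Request(120, True), Request(100, False), Request(100, True)]
snapshot = summarize(sample, window_seconds=2, cpu_used=3.0, cpu_capacity=4.0)
print(snapshot)
```

Each value describes the system at one point in time. Collecting a new snapshot every few seconds is what lets a team see trends and spot problems.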


This image shows a spread of graphs to represent the metrics garnered from monitoring. 

Metrics

_Klaus runs a popular retail website. He wonders if his website is meeting his customers’ expectations. If not, how does Klaus build features to further improve customer satisfaction?_

To relieve Klaus’s uncertainty, he can tie his business goals and objectives to the data he is receiving. This is where SLOs, SLIs, and SLAs can come in. 

<h6>Service Level Objective (SLO)</h6>

A **Service Level Objective** is a range of valid measurements for a metric. An SLO might define that the page response time should be less than 100 milliseconds. Measurements going outside of these ranges tend to require action. SLOs might be defined in terms of:

* Latency
* Availability
* Error rate
* And many more metrics!

A goal is only helpful if we know where we are in relation to it. While an SLO defines where we want to be, an SLI says where we are now.

<h6>Service Level Indicator (SLI)</h6>

A **Service Level Indicator** is the current measurement of a metric related to an SLO. Let's say that one of our SLOs is that response time should be less than 100 milliseconds. Our system measures that current response times are approximately 75 milliseconds. Our SLI for that SLO would be within the valid range.
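
The comparison between an SLI and its SLO can be sketched in a few lines. The function and constant names here are hypothetical, and the 100 millisecond limit is simply the example objective from the text.

```python
SLO_MAX_RESPONSE_MS = 100  # example objective: responses should take less than 100 ms


def meets_slo(sli_ms, slo_max_ms=SLO_MAX_RESPONSE_MS):
    """Return True when the measured value (the SLI) is inside the SLO's valid range."""
    return sli_ms < slo_max_ms


print(meets_slo(75))   # a 75 ms measurement is within the objective
print(meets_slo(130))  # a 130 ms measurement is outside it and may require action
```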

SLOs and SLIs might be useful for internal performance, but what of promises we make to consumers? Our next term, the SLA, is for this purpose.

<h6>Service Level Agreement (SLA)</h6>

A **Service Level Agreement** is a contract with consumers. Not every business will have an SLA, as SLAs carry a degree of legal responsibility. An SLA binds the business to the level of service it has promised its customers. An SLA might define that a business's services be available 99.99% of the time. While breaking an SLO is an issue, breaking an SLA is a major problem. A team would want to take corrective action well before an SLA is breached.

SLOs, SLIs, and SLAs allow businesses to tie promises and goals to the data coming from their systems. Using these terms makes it much clearer when a system metric is jeopardizing the health of a business.
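
It can help to translate an availability percentage into the downtime it actually permits. The sketch below does that arithmetic, assuming a 30-day measurement period (the period length is our assumption; a real SLA would define its own). The function name is hypothetical.

```python
def downtime_budget_minutes(availability_percent, period_days=30):
    """Minutes of downtime allowed in the period before the availability target is missed."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - availability_percent) / 100


budget = downtime_budget_minutes(99.99)
print(f"{budget:.2f} minutes of downtime allowed in 30 days")
```

Under these assumptions, a 99.99% target leaves only about 4.32 minutes of downtime in 30 days, which is why teams act well before an SLA is at risk.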



This image shows a table summarizing the SLX term and its definition. 

SLOs, SLAs, and SLIs

This image shows a laptop with a pop out window showing data, graphs, and charts. 

_Lenny runs an online business selling refurbished furniture. He's nervous about his website crashing during peak traffic periods. What tools can Lenny use to keep informed about his website's status?_

There are several monitoring tools that can aid in understanding Lenny’s infrastructure. Teams practicing DevOps use monitoring tools to:

* Oversee resources such as application services, databases, and other processes
* View system health
* Capture real-time health statuses and performance metrics. 

In many situations, monitoring is achieved using a combination of tools. Big picture trends, like performance, can be monitored by a tool like Prometheus. Meanwhile, a finer piece of detail, such as the time taken for a database query, may be monitored using Monyog. When Lenny uses these kinds of combinations he can identify both the problem and its cause.

There are several popular monitoring tools available to measure a variety of metrics. Let's take a brief look at monitoring tools on the market:

<details>
<summary>
Zabbix
</summary>
Zabbix offers many features such as a centralized, easy-to-use web interface. Zabbix can directly monitor Java applications and offers built-in graphing and visualization capabilities.
</details>

<br/>

<details>
<summary>
Prometheus
</summary>
Prometheus collects metric values from target systems. This monitoring tool uses PromQL, a query language that lets users select and aggregate data in real time.
</details>

<br/>

<details>
<summary>
Sensu
</summary>
Sensu is a highly extensible and scalable system that monitors cloud infrastructures. Using Sensu, the monitoring requirements can be implemented as code.
</details>

<br/>

<details>
<summary>
Monyog
</summary>
Monyog is used by database teams to detect issues affecting MySQL database performance.
</details>
<br/>

Monitoring tools offer a clearer picture of how applications and infrastructure are working. Teams can use these tools to assess system health and track down problems.


Monitoring Tools

This image shows a character flustered by the multitude of active alerts. 

_Callen received a text alert warning her that the company page couldn't connect to the database. Prometheus revealed that the database API call had failed. Callen quickly switched the system to the backup database. The company’s webpage was restored and the day was saved!_

When a system experiences an issue and isn’t able to fix itself, an alert is triggered. An **alert** is a notification that informs a team about a change of state, usually a problem. There are a variety of ways to deliver alerts:

* Tickets
* Email alerts
* Slack alerts
* SMS/phone calls

Imagine receiving alerts across all of these channels, all throughout the day. We may develop alert fatigue, where we start ignoring alerts or turning them off. To avoid alert fatigue:

* Only alert when immediate human intervention is required
* Alert based on customer-facing issues
* Set clear ways to indicate urgency
* Ensure an alert is not a copy of another

With proper restraint, alerting is a critical component of system monitoring. Alerts provide context to help teams solve an issue before it becomes a crisis.
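
The guidelines for avoiding alert fatigue can be sketched as a small filter. This is only an illustration: the `Alert` fields and the `alerts_to_send` function are hypothetical, and requiring an alert to be both customer-facing and in need of a human is one possible policy, not the only one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """A candidate alert (hypothetical shape, for illustration only)."""
    service: str
    problem: str
    customer_facing: bool
    needs_human: bool


def alerts_to_send(alerts):
    """Keep only alerts worth interrupting someone for, dropping copies."""
    seen = set()
    kept = []
    for alert in alerts:
        if not (alert.customer_facing and alert.needs_human):
            continue  # not urgent enough to notify a person
        key = (alert.service, alert.problem)
        if key in seen:
            continue  # a copy of an alert we already kept
        seen.add(key)
        kept.append(alert)
    return kept


incoming = [
    Alert("web", "cannot reach database", customer_facing=True, needs_human=True),
    Alert("web", "cannot reach database", customer_facing=True, needs_human=True),
    Alert("batch", "nightly report is slow", customer_facing=False, needs_human=False),
]
to_send = alerts_to_send(incoming)
print(len(to_send), "alert(s) to send")
```

Of the three incoming alerts, only one reaches a person: the duplicate and the non-urgent alert are filtered out.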


Alerting

_Caleb received a Slack alert about an issue with one of the company’s backend applications. How do Caleb and his team track down the error in a complex maze of backend services?_

Let's take a look at the general steps Caleb’s team might go through when an issue arises:

1) Evaluate usage and performance data
<br/>
2) Identify the cause of the issue. 
<br/>
3) Apply the appropriate solution, restoring system performance.

**Observability** is the degree to which a system's information can be used to locate and fix a problem. In a system with high observability, a team can more easily trace, diagnose, and fix the problem. With poor observability, the data does little to help.

To improve a system’s observability:

* Make sure the team is aligned with service level objectives 
* Create meaningful alerts
* Optimize application logging by ensuring messages are informational and descriptive
* Automate work processes

Maintaining an observable system enables teams to proactively monitor for and track down errors. 
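
As one illustration of informational and descriptive log messages, the sketch below uses Python's standard `logging` module. The service name, operation name, and request ID are made-up values, and `failure_message` is a hypothetical helper. The idea is that a message naming the operation, the request, and the error gives a team far more to work with than a bare "something failed".

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("checkout")  # hypothetical service name


def failure_message(operation, request_id, error):
    """Build a message that says what failed, for which request, and why."""
    return f"operation={operation} request_id={request_id} error={error!r}"


try:
    # Simulate a failure so that there is something to log.
    raise TimeoutError("database did not respond within 2s")
except TimeoutError as exc:
    message = failure_message("load_cart", "req-123", exc)
    logger.error(message)
```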


This image shows a laptop with an error. The error expands into a web of nodes. 

Observability

_A modern pilot relies heavily on their plane's instruments, which allow navigation in even the stormiest weather. For a complex software system, we need a high-quality monitoring system to make sense of it all. What are some ways we can make sure our instrumentation is able to navigate any crisis?_

For system monitoring to be effective, it takes a significant effort. Let’s take a look at some practices to improve monitoring quality: 

- Define custom, actionable alerts that are relevant to the needs of our organization.
- Collect application logs and make this data available and understandable.
- Incorporate logging into the build and deployment process.

Another key indicator of monitoring quality is the way our organization handles alerts. Be on the lookout for:

<details>
<summary>False negatives</summary>
Pay attention when a user-affecting issue has happened, and the system does not alert us. The lack of an alert indicates a hole in our monitoring.  We should hold a retrospective meeting to find out what metrics could have alerted us to the problem.
</details>

<br/>

<details>
<summary>
False positives
</summary>
This occurs when an alert is generated, but there is nothing wrong with the system. The threshold for an alert may need to be adjusted, or the alert might need to be deleted altogether.
</details>
 
<br/>

<details>
<summary>
Unactionable alerts
</summary>
This type of alert has little to do with a problem and doesn’t need anything done. As with false positives, we should reduce or delete unactionable alerts.
</details>

<br/>

Useless or incorrect alerts add to the chance that valuable alerts will be ignored or unseen. Keeping our alerts at a high quality ensures that each is given proper attention.
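
One way to review alert quality is to compare, for each past event, whether an alert fired with whether users were actually affected. The sketch below is a hypothetical illustration with made-up history; the function name and labels are our own.

```python
from collections import Counter


def classify(alert_fired, users_affected):
    """Label one past event by comparing the alert with what really happened."""
    if alert_fired and users_affected:
        return "true positive"
    if alert_fired and not users_affected:
        return "false positive"   # an alert with nothing wrong
    if not alert_fired and users_affected:
        return "false negative"   # a real problem with no alert
    return "true negative"


# Made-up review data: (alert_fired, users_affected) for five past events.
history = [(True, True), (True, False), (False, True), (True, False), (False, False)]
counts = Counter(classify(fired, affected) for fired, affected in history)
print(counts)
```

In this sample, two false positives and one false negative suggest that a threshold needs adjusting and that a gap in monitoring needs closing.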

By following these best practices, we are off to a great start with monitoring quality!


This image shows an airplane cockpit. 

Monitoring Quality

Congratulations! We have learned so much about monitoring systems. Let's review what we've covered: 

* Monitoring is the practice of collecting metrics to gain insights on our systems.
* Insights into our systems tell us about their health and problems.
* We can tie our business objectives to metrics with SLAs, SLOs, and SLIs.
* Alerts should be sent out for customer-facing issues requiring immediate attention.
* We can use monitoring tools to aggregate and visualize our system insights.
* Observability is the degree to which our monitoring helps us solve system problems.
* A high-quality monitoring system allows us to better diagnose and resolve system issues!

Monitoring is critical to applying DevOps in any organization. Start gaining system insights with monitoring today!


Image shows person staring at a display of charts, graphs, and other data. 

Review

Introduction to Monitoring

Learn how we can apply monitoring to gain insights into how our systems are doing!

Monitoring

### What will you learn?

Welcome to the Introduction to DevOps course! In this course, you will be introduced to DevOps, a culture of collaboration between Development and Operations teams. Through this collaboration, teams are able to more reliably and rapidly release high quality software to their users.

Through this course, you will learn:
- The stages that changes to software go through to get from a developer's computer to its users
- The role and responsibilities of a traditional Operations team
- An overview of modern infrastructure management
- The difference between a DevOps culture and that of a traditional organization
- The purposes of key practices such as CI/CD, monitoring, and containerization
- The role of scalability, observability, and resiliency in a modern software system

Learning is social. Whatever you're working on, be sure to connect with the Codecademy community in the [forums](https://discuss.codecademy.com/). Remember to check in with the community regularly, including for things like asking for code reviews on your project work and  providing code reviews to others in the [projects category](https://discuss.codecademy.com/c/project/1833), which can help to reinforce what you’ve learned. 

We are excited for you to start your journey into the field of DevOps!

Learn about what Intro to DevOps has in store!

Welcome to Introduction to DevOps

Development and Operations members working on the same team and sharing responsibilities. 

Development sending source code to an Operations team to deploy it on infrastructure.


Development and Operations keeping completely separate from each other.


Operations members managing the Development team.


Team members seek to communicate and share responsibilities across all aspects of production.

Team members seek to be highly specialized in their role, performing only one aspect of production very efficiently.

The team seeks to rely on outside teams as much as possible in order to focus on what they do best.


The team seeks to create a culture of secrecy, allowing them to gain a competitive advantage over the organization.

Test your knowledge of the core components of a DevOps culture.

DevOps Culture

Learn the steps involved in getting code from your computer out into the world and into the hands of users.

Code moves from a developer's laptop through the deployment process. Along the way there is version control, testing, and various environments.

_Gabriel is a junior software developer who has just finished working on a new feature. They are excited to share their code with their users. What happens next? Do they go door to door installing it manually from a floppy disk? Maybe they can send the files via email! No — these aren't the stone ages!_

**Deployment** is the general process of making a piece of software available to its users. Before internet access became widespread, deployment meant storing software on floppy disks or CD-ROMs, shipping them to users, and having those users manually install the software on their own devices. This process was slow and expensive, and any bugs that slipped through the cracks could be [catastrophic](https://en.wikipedia.org/wiki/Disney%27s_Animated_Storybook#Release_of_The_Lion_King_storybook_(November_1994)).

Today, software developers are able to deploy their software via the internet with greater ease and speed of delivery than ever before. The process for deployment is largely the same whether it is for the major release of a new application, or a minor patch to fix a bug.

In this lesson, we will discuss the various components that make up this deployment process. Though deployment is a general process that can include many different steps, we will cover the most commonly used tools and practices:

* Infrastructure management
* Version control systems
* Testing: unit, integration, acceptance, and end-to-end tests
* Deployment environments

Traditionally, members of the Development team and the Operations team take ownership of various aspects of the deployment process. We'll start by taking a look at the main responsibility of the Operations team, managing the infrastructure.


What is Deployment?

Various pieces of hardware are connected with lines in a network/on a map. (Laptops, servers, data centers, smartphones, cloud)

_Gabriel is excited to learn about how their company's software is accessed by their users. They head over to see Lars, a member of the Operations team, who is setting up a server in the server warehouse._

In the modern age of web applications, software is stored and executed on **servers**. Servers are computers that run software that can be accessed by another device (also known as the **client**), often via the internet. Servers respond to requests with website code, images, and other content which are rendered by the client (typically a web browser).
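
To see the request and response cycle in miniature, here is a sketch that starts a tiny server and then acts as its client, all on one machine, using only Python's standard library. The handler name and the page content are made up for illustration, and a production server would be set up very differently.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Responds to every GET request with a small piece of website code."""

    def do_GET(self):
        body = b"<h1>Hello from the server</h1>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example's output quiet


# Port 0 asks the operating system to pick any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client sends a request and reads the server's response.
connection = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
connection.request("GET", "/")
response = connection.getresponse()
status = response.status
content = response.read().decode()
connection.close()

server.shutdown()
server.server_close()
print(status, content)
```

A web browser plays the client's role in the same way: it sends a request, receives content like this HTML, and renders it.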

Servers require quite a bit of maintenance in order to be used by developers. They need to be purchased, set up with operating systems and various software, configured to handle network requests, and much more. Typically, developers don't manage the servers, so who does?

Traditionally, an Operations team is responsible for managing a company's servers. More broadly, the Operations team manages all of an application's **infrastructure**. Infrastructure is the full set of resources that support the development, testing, and deployment of applications. Infrastructure consists of:

* Hardware components such as servers, routers, switches and cables
* Software components such as operating systems, version control systems, and applications

There are dozens of tasks that fall under the responsibility of the Operations team, including:
* Installing and replacing (a.k.a. "provisioning") physical components such as servers, switches, and hard drives
* Performing software/firmware upgrades such as security patches
* Configuring infrastructure such as firewalls, user access, ports
* Monitoring network health and alerting personnel when issues arise


Infrastructure Management

Branches, copies of the "trunk", can be worked on independently. When it is time, new code can be merged into the trunk.

_Gabriel is excited to get their code onto their company's servers. They sneak into the server room and pull out their USB stick. Right as they are about to plug it into the server, Lars stops them and says, "Whoa there, Gabriel! That's not how we do things around here. You should check out our version control system."_

**Version control systems** (such as [Git](https://www.codecademy.com/learn/learn-git)) are tools designed to manage different versions of a file or project. They track every change that is made to a file while saving all previous versions of the file. Some of the data that is tracked by a version control system includes:

* changed files
* new or deleted files
* renamed or moved files
* the author and date of the change

With version control (a.k.a. "source control"), the risk and impact of bugs are reduced. When a new version of the software has issues, it can be compared to previous "stable" versions to identify the error. If necessary, the software can be "rolled back" to previous versions until a fix is implemented. With the author and date of each change stored, development teams can quickly identify who has the most information about a breaking change. 

Version control systems (VCS) change how teams work together. Common operations of version control, such as **branching** and **merging**, enable development teams to collaborate more effectively.
* Branching is the process of creating a copy of the source code (the "trunk"). Developers can work on their own branches without changing the source code that real users and other developers depend on.
* Merging is the process of combining the changes in one branch with another. This occurs when the differences between the two branches are ready to be reconciled. When conflicts between branches arise during a merge, version control systems can assist in resolving them.
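
To build intuition for branching and merging, here is a toy sketch that treats a project as a dictionary of file names and contents. It is a simplified model under our own assumptions, not how any particular version control system is implemented: real tools such as Git also compare the lines inside each file, while this `merge` function (a hypothetical name) only compares whole files.

```python
def merge(base, trunk, branch):
    """Three-way merge of {filename: content} snapshots.

    `base` is the snapshot the branch was created from. Returns the merged
    snapshot and a list of files that changed differently on both sides.
    """
    merged, conflicts = {}, []
    for name in sorted(set(base) | set(trunk) | set(branch)):
        original, ours, theirs = base.get(name), trunk.get(name), branch.get(name)
        if ours == theirs:
            result = ours        # both sides agree
        elif ours == original:
            result = theirs      # only the branch changed this file
        elif theirs == original:
            result = ours        # only the trunk changed this file
        else:
            conflicts.append(name)
            result = ours        # keep the trunk version until a person decides
        if result is not None:   # None means the file was deleted
            merged[name] = result
    return merged, conflicts


base = {"app.py": "v1", "README": "hello"}

# Branching: copy the trunk so that work can happen independently.
branch = dict(base)
branch["app.py"] = "v2"
branch["search.py"] = "new feature"

# Meanwhile, someone else updates the trunk.
trunk = dict(base)
trunk["README"] = "hello, world"

merged, conflicts = merge(base, trunk, branch)
print(merged, conflicts)
```

Here the two sets of changes touch different files, so they combine cleanly. If both the trunk and the branch had changed `app.py` in different ways, the file would be reported as a conflict for a person to resolve.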

Lastly, version control systems are able to synchronize with project management tools. For example, when new code is added, an engineering manager can be alerted to review the changes.

Version Control Systems

_Monica is Gabriel's engineering manager. Gabriel has checked their code into the version control system and wants users to access the new feature as soon as possible. Monica knows better. Changes need to be tested before being released to users._

**Testing** is an essential component of the deployment process. Testing ensures new features integrate with existing features, work smoothly within the existing infrastructure, and satisfy the product and design requirements. 

Different types of tests exist that are used in the various stages of deployment. Four types of tests that are often used are:
* **Unit test** — evaluates the smallest possible unit of testable code, such as a single function.
* **Integration test** — evaluates how the units of a particular program work with one another.
* **Acceptance test** — evaluates whether the user experience aligns with the business requirements of the software. 
* **End-to-end test** — evaluates the application's behavior using production-like infrastructure that includes networking, databases, and calls to external APIs. 

Failures during testing help developers know that they need to update their code or increase their infrastructure resources. Success during testing gives a developer confidence that their project is in a releasable state.
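
Here is a small sketch of the difference between a unit test and an integration test, written with Python's built-in `unittest` module. The two functions under test, `apply_discount` and `cart_total`, are hypothetical examples invented for this illustration.

```python
import unittest


def apply_discount(price, percent):
    """Return the price after taking a percentage off."""
    return round(price * (1 - percent / 100), 2)


def cart_total(prices, percent):
    """Return the total for a cart, applying the discount to every item."""
    return round(sum(apply_discount(price, percent) for price in prices), 2)


class DiscountUnitTest(unittest.TestCase):
    """Unit test: checks a single function on its own."""

    def test_apply_discount(self):
        self.assertEqual(apply_discount(50.0, 10), 45.0)


class CartIntegrationTest(unittest.TestCase):
    """Integration test: checks that the two functions work together."""

    def test_cart_total_uses_discount(self):
        self.assertEqual(cart_total([50.0, 30.0], 10), 72.0)


loader = unittest.defaultTestLoader
suite = unittest.TestSuite()
suite.addTests(loader.loadTestsFromTestCase(DiscountUnitTest))
suite.addTests(loader.loadTestsFromTestCase(CartIntegrationTest))
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

If either test fails, the developer knows the change is not yet in a releasable state.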

For example, a developer’s workflow might incorporate testing like so:
1. Develop a new feature locally.
2. Add the new code to the version control system.
3. Tests are run against the code change (unit, integration, and sometimes acceptance or end-to-end).
4. If there are any failures, the developer works on fixes.
5. Repeat steps 1-4 until all tests pass.
6. The change is allowed to be merged.

Testing

The development environment is where code is first written, typically on developers' own laptops. The staging environment is where new software is tested before being deployed to production where users can access the application.

_Gabriel's new feature went from their own computer, into the version control system, passed tests, and was merged into the source code. Now, it's time to "deploy to production"! But what exactly does that mean?_

The **production environment** refers to the infrastructure that supports the complete application used by real users. This infrastructure consists of hardware and software components scaled for real-world usage.

More broadly, an **environment** is the subset of infrastructure resources used to execute a program under specific constraints. 

Along the way to the production environment, software often moves through a series of intermediate environments. Each intermediate environment allows developers to rigorously test new software without impacting production infrastructure.

Though the names of environments may differ from company to company, a common set of environments includes:
* The **local development environment** — where software is first written and tested, typically on a developer's own computer.
* The **integration environment** — where software changes are merged using a version control system.
* The **quality assurance (QA) / testing environment** — where tests are executed to ensure the functionality and usability of each new feature.
* The **staging environment** — where the software can be performance tested in a production-like environment, but before real users are involved.
* The **production environment** — where software is accessible by real users!

These environments do not strictly represent a linear path from a developer's computer to production. Instead, each of these environments can be viewed as a space that developers can use throughout the entire deployment process. 
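
One common way applications tell environments apart is by loading different settings for each one. The sketch below is a hypothetical illustration: the setting names and values are invented, and real projects often keep this kind of configuration in files or environment variables instead.

```python
# Hypothetical settings for three of the environments described above.
ENVIRONMENTS = {
    "development": {"debug": True, "database": "local-dev-db", "serves_real_users": False},
    "staging": {"debug": False, "database": "staging-db", "serves_real_users": False},
    "production": {"debug": False, "database": "production-db", "serves_real_users": True},
}


def load_config(environment):
    """Return the settings for the named environment."""
    try:
        return ENVIRONMENTS[environment]
    except KeyError:
        raise ValueError(f"unknown environment: {environment!r}") from None


config = load_config("staging")
print(config)
```

In this sketch, staging mirrors production in every setting except the database and the presence of real users, which is what makes it useful for a final round of testing.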

Deployment Environments

_Gabriel's new feature has finally been deployed to production! Users around the world are accessing the new feature via servers, networks, and other infrastructure. Their user experience is as good as ever thanks to extensive testing and version control._

In this lesson, we learned about **deployment**, the process of making a piece of software available to its users. Deployment is a long journey that starts with a developer moving their code into a **version control system**, continues through a **staging environment**, and ends with the code being deployed to a **production environment**. Along the way, **unit**, **integration**, **end-to-end**, and **acceptance tests** are conducted. 

This deployment process ensures that new code is shipped quickly, reliably, and with high quality. Members of the Development team and Operations team own various pieces of this process. In the next lesson on DevOps culture, you will learn how the boundaries between these two teams begin to blur in order to create a more open, and efficient, team dynamic.

Learn the practices used by teams to break down barriers between Development and Operations and optimize the deployment pipeline.

A developer and an operations member arguing about who is to blame for the crashing application. They should be working together instead!

_Ariadne needs a new database server to support her new search feature. This should only take a few minutes but it’s stuck in the Operations team’s backlog! She might have to wait months for this simple task. What’s going on?! Ariadne's coworker Jenna suggests trying out a new team strategy called DevOps. Ariadne is not familiar with this term — what's DevOps?_

**DevOps is a culture** that shifts the way that Development and Operations team members work together. DevOps culture aims to foster trust, collaboration, problem resolution, and continuous improvement across the entire team. So how can Ariadne's team adopt a DevOps culture to increase collaboration and reduce barriers between these departments? 

Teams use a variety of practices and tools in order to foster a DevOps culture. You may have even heard of some of these practices, such as **automation** or **blameless retrospectives**. Your team may already use some of these tools, such as **version control systems**.  On their own, these practices and tools can certainly make improvements. When used within a DevOps culture, a team can be transformed.

This lesson will start by looking at traditional team culture before introducing the core pillars of a DevOps culture. We'll cover:
* Development versus Operations teams
* Dev+Ops: Shared Responsibilities and Roles
* Systems-Level Thinking
* Learning From Failure
* Feedback Loops

Introduction

In a traditional team culture, the Development team writes code and hands off completed features to Operations. Operations deploys the software on infrastructure.

_Ariadne is frustrated by how long it takes to get things done. She's just submitted a fix for a bug but can't deploy her code until the Operations team gets out of their hour-long meeting. Why can't she just deploy the fix herself?_

A traditional software company often has separate Development and Operations teams. 
* The **Development team** writes an application’s features. 
* The **Operations team** creates and maintains the infrastructure that the application runs on. 

In this arrangement, the Development team sends its code to the Operations team, which deploys it on the infrastructure. 

While each team can have clearly defined responsibilities, there is an inherent conflict between the two teams. Developers want to produce new functionality as fast as possible. Meanwhile, Operations members want the infrastructure to be stable and reliable. Unfortunately, new changes are the biggest threat to the stability of a system. 

This difference in goals gets in the way of software development. Specifically, a few issues arise when an Operations team is separate from a Development team:
* Development and Operations teams own different environments in the deployment process. Differences between environments can lead to bugs that are difficult to resolve.
* Handoffs between teams take time.
* Information is siloed, meaning decisions are often made without consideration of the other team.


Dev vs. Ops

In DevOps culture, Development and Operations teams share responsibilities, learning from each other and making decisions as a team with shared objectives.

_Tired of the lack of communication and collaboration, Ariadne approaches her engineering manager and the Operations manager with a proposal to share responsibilities._

Teams that foster DevOps culture seek to be highly communicative, sharing knowledge and experience across team members. By integrating Development and Operations teams, we can resolve many of the issues that arise due to conflicting goals. As a result, we have:
* Faster development and deployment cycles due to fewer handoffs & shared knowledge
* Environment consistency from Development to Staging to Production
* Improvement of operations activities by applying dev best-practices like version control

A typical engineering team within a DevOps culture may include various engineers, quality assurance (QA) testers, security operations, and information technology (IT) specialists. Rather than siloing information, these team members can share responsibilities, align on team objectives, and make decisions together. 

Through this collaboration, teams can produce better software in less time and more reliably than ever before.


Dev + Ops

Systems level thinking is illustrated by a globe. Continuous experimentation & learning shows a cycle of experiment, fail, and learn. Feedback loops shows a magnifying glass over a graph.

_Ariadne has really gotten behind her coworker Jenna's idea that combining Development and Operations is the way to go. Excited to adopt DevOps, Ariadne gets to work learning about the key components of a positive DevOps culture._

The culture of DevOps is the most critical factor to its success. Collaboration cannot occur from only applying a set of practices and tools. It requires a culture in which collaboration can thrive. 

The central pillars of a DevOps culture include (click on each item to learn more!):

<details>
<summary> Systems-level thinking</summary>

> Systems-level thinking means thinking about the whole production system, rather than a single department. Doing so allows teams to identify **bottlenecks**.

</details>

<details>
<summary>Continuous experimentation and learning</summary>

> Teams that embrace continuous experimentation and learning encourage rapid development of new features and accept failure as a learning opportunity.
</details>

<details>
<summary>Feedback loops</summary> 

> Feedback loops allow teams to draw information from each part of the system. As they gain insight into their systems, processes can be improved and optimized.
</details>

To have a successful DevOps culture within an organization, these components must be implemented at both a team and individual level. Only with a positive culture can the practices and tools be utilized to their fullest.


Cars driving in three lines that merge into one, clogging up the road.

Within a traditional team, team members often focus on their own tasks rather than the big picture. Developers might concentrate on the code without considering infrastructure or testing needs. The Operations team might only consider infrastructure needs without considering impacts on functionality. 

DevOps seeks to have each team member consider all aspects of the development process. This practice is known as **systems-level thinking**. Though this may sound daunting, the shared-responsibility team structure of DevOps makes this possible. When a team has developers, IT specialists, QA testers, and security experts, information is shared and decisions are made as a team. Meanwhile, individuals are given the opportunity to grow and gain knowledge in new domains.

One important outcome of systems-level thinking is the identification and resolution of **bottlenecks**. A bottleneck is a system’s slowest point, which slows down the entire process. A development team with a systems-level view can more easily see where these slowdowns occur, resolve systemic issues, and optimize processes.
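To make the idea of a bottleneck concrete, here is a minimal sketch in Python. The stage names and durations are hypothetical and invented purely for illustration; the only point is that the slowest stage dominates the time a change spends in the process.

```python
# Hypothetical average time (in hours) a change spends in each stage.
# These names and numbers are illustrative assumptions, not real data.
stage_hours = {
    "development": 6.0,
    "code review": 2.5,
    "testing": 14.0,
    "deployment": 1.5,
}

def find_bottleneck(stages):
    """Return the (name, hours) pair of the slowest stage."""
    return max(stages.items(), key=lambda item: item[1])

name, hours = find_bottleneck(stage_hours)
total = sum(stage_hours.values())
print(f"Bottleneck: {name} ({hours} of {total} hours)")
```

A team looking only at its own stage would miss this; the comparison only works when every stage is visible at once.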




Systems-Level Thinking

A developer tries to identify who to blame for a bug in the code. It was them.

_The team is called into their manager's office once again and Thomas is nervous. His team has been losing people and is struggling to keep up with demand. The user response to the latest feature he spearheaded has not been positive. Everyone in the room is silent, waiting to see who their manager will be blaming this time. A few minutes of yelling and finger-pointing later, Thomas is packing his things, the latest member to be fired._

Organizations that punish failure create a culture of fear that gets in the way of innovation and growth. DevOps seeks to change this by viewing failure as a natural part of everyday work. Rather than having mistakes be punished, DevOps culture stresses the importance of failure as a learning opportunity. 

Failure itself is never something that teams seek out. However, once failure is normalized, teams can be more open to experimentation. Teams can try out ambitious solutions, fail quickly, learn from those mistakes, and then experiment again. This cycle enables teams to regularly make product and process improvements, both big and small.

One method DevOps uses to normalize failure and learn from experimentation is through **blameless retrospectives** (or "post-mortems"). These retrospectives are meetings held at the end of a sprint, project, or issue resolution. Here, team members discuss what went well and the areas where they can improve.

Learning from Failure

A dashboard shows the health of several parts of a system. An alert is flashing indicating an issue.

_Ariadne's project manager Yuri regularly tracks the average time it takes for various tasks. She notices that since the start of the new year her developers are taking an unusually long time to respond to user-submitted bug reports. She digs deeper and discovers that the automation tool her team had been using for years to assign bug reports to developers has been shut down! She quickly pivots and finds a new tool for her team to use._

In this situation, Yuri created a **feedback loop**. A feedback loop is created when a team identifies and tracks a key piece of data, or **metric**. Teams can then use that information to drive process improvements. A typical feedback loop may look like this:

* The system is monitored and data is collected.
* Data is analyzed and bottlenecks are identified.
* Solutions are created and implemented.
* New solutions are monitored again. 
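The analysis step of such a loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the response times are made-up numbers loosely modeled on Yuri's situation, and the `tolerance` factor of 1.5 is an arbitrary threshold chosen for the example.

```python
from statistics import mean

def needs_attention(baseline, recent, tolerance=1.5):
    """Flag a metric whose recent average exceeds the baseline
    average by more than the given factor."""
    return mean(recent) > tolerance * mean(baseline)

# Hypothetical hours taken to respond to user-submitted bug reports.
last_year = [4, 5, 3, 4]
this_year = [9, 11, 10, 12]

if needs_attention(last_year, this_year):
    print("Response time has regressed; investigate the process.")
```

Once a fix is in place, the same check runs on the new data, which closes the loop.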

Choosing which metrics to track is perhaps the most important step in creating a feedback loop. Metrics that can't provide meaningful insight into a system just distract from the real bottlenecks. 

Focusing on metrics that affect the customer is a great place to start. Some of these include:
* Time to load a website page
* Time to resolve an issue/outage
* Time to release new features

When and where metrics are tracked is another important aspect to consider. A defect, or problem, becomes more expensive to fix as it moves along the development process. DevOps seeks to discover defects as early as possible, a strategy known as **shifting left**.

When we build and use feedback loops, we embrace the inefficiencies in our system rather than turning away from them. Feedback loops contribute directly to a culture of continuous learning.


Feedback Loops

DevOps culture centers around collaboration between team members across all domains of software development. Teams practicing DevOps often include developers, IT specialists, QA testers, and security experts. Rather than siloing information, teams can share knowledge and make informed decisions together, also known as **"building quality in"**.

In order to foster a DevOps culture, teams can make use of key DevOps practices:
* **Systems-level thinking** — thinking of the whole system to identify **bottlenecks**
* **Continuous experimentation and learning** — embracing failure through practices such as **blameless retrospectives**
* **Feedback loops** — using **metrics** and **shifting left** to drive process improvement

By fostering DevOps culture and following DevOps practices, teams can build and release software with greater speed, quality, and reliability than ever before. Throughout the rest of this course, you will learn about some of the specific tools that teams use to implement this culture and its practices.


The process of making a piece of software available to its users.

The process of removing old features from a codebase.

The process of merging new code changes into the main branch.

Provide suggestions for how to resolve bugs.


Track every change that is introduced into the code.


Retrieve all previous versions of the code.


Test your knowledge of the deployment process and DevOps culture.

They create a new virtual server with the correct dependency and then destroy the old one. 

They run a command to update the dependency.

They do nothing since the virtual server is immutable.

They create a new virtual server with the right dependency and they leave the old one in place. 

It can lead to inconsistencies across infrastructure configurations.

It limits the type of infrastructure components that are able to be used. 

A virtual machine requires its own operating system while a container shares the operating system of the underlying host. 

Virtual machines can each run an entire instance of an application, while containers can only run part of an application.

Virtual machines each require their own dependencies, such as libraries and packages. Containers are capable of sharing these dependencies.

Virtual machines run on company premises while containers run on the cloud.

Test your knowledge of infrastructure management practices!

A dive into traditional and DevOps practices surrounding infrastructure management. 

_Summer launches her web browser, enters the URL for her company’s business analytics application, and voilà — immediately, her browser is filled with beautiful charts, graphs, and statistics loaded from millions of data points._

_Meanwhile, Kai navigates to an e-commerce application and struggles to purchase a pair of shoes. Half of the product images in his search results fail to load, and when he finally manages to add an item and click the checkout button, nothing happens._

What could cause such drastically different experiences when using web applications? Often, the answer lies in how businesses manage their infrastructure. 

Recall that infrastructure is the set of hardware and software components used to develop, test, and deploy applications. Managing this infrastructure involves quite a number of tasks. 
* Hardware needs to be installed and maintained. 
* Thought must be given to power and cooling. 
* Networks and databases need to be configured. 
* Hardware failures and cyber-attacks are serious concerns with any infrastructure. 

Dealing with all of these things is a tall task — even for seasoned professionals! While there are various types of infrastructure and methodologies for managing it, with DevOps, certain practices have become the norm. 

In this lesson, we’ll explore the ways that infrastructure has traditionally been managed, as well as DevOps practices that address the common issues with traditional infrastructure management. The main topics we will cover are:

* Scaling Infrastructure
* In-house Infrastructure
* Virtualization
* Containerization
* Infrastructure as Code
* Orchestration
* Cloud Infrastructure

Before we dive in, let’s get a feel for the importance of properly managing infrastructure.

Introduction to Infrastructure Management

**Scalability** is a system's ability to add resources to keep up with growing demand. When more users begin using an application, infrastructure with great scalability will handle it without interrupting services. An infrastructure with poor scalability will likely cause slowdowns or disruptions. 

##### Vertical & Horizontal Scaling

In practice, scalability can be achieved either through vertical scaling or horizontal scaling. **Vertical scaling** means adding computing resources to an existing machine, such as a faster network connection, more storage, or more RAM. **Horizontal scaling** means adding more servers (or "nodes") that each run the application. A tool called a "load balancer" then distributes the work across the many servers. 

Vertical scaling is relatively simple and affordable, as it only involves upgrading a machine. However, there is some downtime required to perform the upgrade. 

Horizontal scaling has the benefit of not requiring any downtime for existing servers. This benefit is the main reason why this is the scaling option chosen by most DevOps teams. That being said, it is slightly more complex to manage the many servers, and it is more expensive than upgrading existing machines.
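As a rough illustration of how a load balancer spreads work across nodes, here is a minimal Python sketch of round-robin distribution, one simple strategy among several. The class name, node names, and request labels are all hypothetical; real load balancers also account for server health and current load.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Hand each incoming request to the next server in turn."""

    def __init__(self, servers):
        # cycle() repeats the server list endlessly, in order.
        self._servers = cycle(servers)

    def route(self, request):
        server = next(self._servers)
        return f"{request} -> {server}"

balancer = RoundRobinBalancer(["node-1", "node-2", "node-3"])
routed = [balancer.route(f"request-{i}") for i in range(4)]
for line in routed:
    print(line)
```

Adding a fourth node to the list is all it takes to scale this toy example horizontally; none of the existing nodes need to stop.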

##### The Price of Scaling

If scalability is so important, why not just run an application on a million powerful servers? Surely this would be enough to keep up with plenty of growth? Probably, but companies still need to consider the cost of their infrastructure. Scaling is about finding that sweet spot — enough to perform well but not so much that money is wasted. 

This leads to another important goal when it comes to infrastructure — **elasticity**. Whereas scalability only deals with increases in resources, elasticity is the ability to automatically add _or subtract_ resources to accommodate fluctuating demand. Elasticity is especially important when using pay-per-use infrastructure services since resources can be returned, and money can be saved, when demand shrinks. 
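The core decision behind elasticity can be sketched as a small calculation. The function below is an illustrative assumption, not the algorithm of any particular cloud provider: it assumes each server handles a fixed number of requests per second and that at least one server always stays running.

```python
import math

def desired_servers(requests_per_second, capacity_per_server, minimum=1):
    """Return how many servers are needed for the current demand."""
    needed = math.ceil(requests_per_second / capacity_per_server)
    return max(minimum, needed)

# Hypothetical demand over a day, in requests per second.
for demand in [50, 900, 2400, 300]:
    count = desired_servers(demand, capacity_per_server=500)
    print(f"{demand} req/s -> {count} server(s)")
```

The server count rises with the peak and falls again afterward, which is where the savings on pay-per-use infrastructure come from.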

Scaling Infrastructure

Let’s imagine we have a very simple blog application. It doesn’t require much of our server’s memory or processing power. However, it is getting a ton of traffic, which is great! We need to scale the application to keep up with growing demand. 

We want to avoid any downtime to our servers so we won’t scale vertically. Scaling horizontally will enable us to handle the increase in traffic without interruptions! 

Horizontal scaling should work well for performance, but it isn’t very efficient. Servers are expensive, and our application uses only a small portion of each server’s physical resources. This will result in a lot of unused physical capacity. 

To solve this problem, we could turn to **virtualization**. Virtualization technology allows many virtual machines (VMs) to run on one physical computer. Each virtual machine is a distinct environment with its own operating system, dependencies, and users. 

Virtualization would reduce the number of servers needed to run many instances of our blog application. Each server can be utilized closer to its full capacity by being split into several VMs. This allows us to horizontally scale even more efficiently based on demand. 
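A quick back-of-the-envelope calculation shows the effect. The figures here are assumptions chosen for illustration: eight instances of the blog, each using 10% of a server, with virtualization allowing four instances (one per VM) on each physical server.

```python
import math

def plan(instances, load_percent, instances_per_server):
    """Return (servers needed, average utilization per server in %)."""
    servers = math.ceil(instances / instances_per_server)
    utilization = instances * load_percent / servers
    return servers, utilization

# Hypothetical figures: 8 instances, each using 10% of one server.
without_vms = plan(8, 10, instances_per_server=1)
with_vms = plan(8, 10, instances_per_server=4)

print("Without virtualization:", without_vms)
print("With virtualization:   ", with_vms)
```

Under these assumptions the same workload fits on a quarter of the hardware, and each remaining server does four times as much useful work.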

In addition to more efficient resource utilization, virtualization brings convenience as well. Virtualization management tools simplify the task of creating virtual machines. Using these tools is more efficient than installing and managing pieces of hardware by hand.

Diagrams depicting server utilization with and without virtualization. With virtualization, a single server can run multiple processes and achieve a higher utilization.

Virtualization

_Dakota has tested all of the features of her application in her local development environment. She is proud to push the app to their team’s testing environment for QA to look at. Within hours, QA is reporting that none of the pages are loading. She looks at their server logs and sees cryptic error messages coming from a specific Linux package that is out of date. "Hmmm", she wonders. "Why didn’t I see this in my local environment?"_

This type of story is all too common in software development. An application or feature works in one environment but not another. Often, this is caused by differences in **dependencies** – external files and programs that are not a part of the application but are used by it. 

DevOps relies upon **containerization** to solve this issue. Containerization is a form of virtualization in which users create virtual environments called **containers**. Like virtual machines (VMs), containers include instances of applications as well as their dependencies. This makes them a convenient solution to help applications behave consistently when moving through the deployment pipeline. 

But couldn't we just use VMs to also solve this problem? 

Unlike VMs, containers do not include their own operating system. Instead, they share the operating system of the host machine. The lack of their own operating system makes containers smaller and faster to spin up than VMs. It would be inconvenient to re-create a virtual machine whenever changes to an application or its dependencies are made. Containers, on the other hand, can be started in only seconds. 

Additionally, since containers do not need their own operating system, they use fewer physical resources than virtual machines. This allows many more containers to be run than VMs, leading to even more efficient scaling. 

The technology for containerization has been around for decades. However, it did not become widespread until 2013 with the release of [Docker](https://www.docker.com/). Docker provided a simple interface for developers to create and run containers. Today, there are a handful of containerization tools that are used in addition to Docker.

GIF that shows an application moving between environments with and without containerization

Containerization

A config file is loaded into Kubernetes and controls several nodes

_Juan has been in the software business for a long time. He oversees application deployment for a prestigious banking institution. The team is about to roll out a major update to their application. It will require getting thousands of production servers to run the new version. He thinks back to the old days and laughs. He would have had to log in to each server, one by one, and redeploy the application with the changes. Today, there is a far better way…_

Juan will likely turn to orchestration software to assist in deploying the update to their application. According to [Red Hat](https://www.redhat.com/en/topics/automation/what-is-orchestration), "**Orchestration** is the automated configuration, management, and coordination of infrastructure." 

Much like a conductor to an orchestra, orchestration tools direct many individual components on how to play their part in order to achieve consistency across the entire infrastructure system. 

Tools like Docker give the ability to create and control individual containers. Orchestration software, such as [Kubernetes](https://kubernetes.io/), controls many containers working together in harmony. Once the desired infrastructure configuration for a system has been defined, Kubernetes makes sure that new containers are deployed based on that configuration. To do this, Kubernetes automatically performs tasks such as:

* Deploying containers across many servers
* Restarting failed containers
* Rolling out updates without any downtime
* Horizontal scaling of containerized applications
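The idea of comparing a desired configuration with what is actually running can be sketched in plain Python. This is a toy model and not how Kubernetes is implemented or configured; the function name, container names, and action strings are hypothetical.

```python
def reconcile(desired_replicas, containers):
    """Compare the desired replica count with the current containers
    and return the actions that would close the gap.

    `containers` maps a container name to True (healthy) or False (failed).
    """
    actions = [
        f"restart {name}"
        for name, is_healthy in containers.items()
        if not is_healthy
    ]
    missing = desired_replicas - len(containers)
    actions += ["start new container"] * max(0, missing)
    return actions

state = {"web-1": True, "web-2": False}
print(reconcile(3, state))
```

An orchestrator repeats a comparison like this continuously, so a failed container or a change to the desired count is corrected without anyone logging in to a server.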

Orchestration

Orchestration tools rely on a core principle of DevOps: **Infrastructure as Code**. Infrastructure as Code (IaC) is the act of defining infrastructure properties in configuration files. These configuration files are then stored and tracked in version control systems. 

Infrastructure configuration used to be performed manually. Individual team members would install operating systems, upgrade dependencies, and configure networks. This process was both time-consuming and error-prone. Over time, configurations across servers would become inconsistent due to human error. This build-up of errors is known as "configuration drift".

IaC solves this problem by relying on configuration files as the source of truth for infrastructure state. Since configuration files are machine readable, it enables easier automation. It is also simple to roll back changes if needed since these files are stored in version control systems. 
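Because the configuration is machine readable, checking a server against it is easy to automate. The sketch below uses a plain Python dictionary to stand in for a configuration file; the setting names and values are hypothetical, and real IaC tools use their own file formats.

```python
# The desired state, as it might be recorded in version control.
desired = {"os": "ubuntu-22.04", "python": "3.11", "open_ports": [80, 443]}

def find_drift(desired, actual):
    """Return {setting: (expected, found)} for every mismatch."""
    return {
        key: (value, actual.get(key))
        for key, value in desired.items()
        if actual.get(key) != value
    }

# A server that was patched by hand at some point.
server = {"os": "ubuntu-22.04", "python": "3.9", "open_ports": [80, 443]}
print(find_drift(desired, server))
```

An empty result means the server matches the source of truth; anything else is drift that can be reported or corrected automatically.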

IaC/orchestration tools also rely on the idea of _immutable infrastructure_. Immutable infrastructure means that servers are not changed once they are created. Instead, when infrastructure configurations change, new servers are created and old ones are destroyed. Immutable infrastructure and the use of configuration files guarantee that all servers are created in the same manner and eliminate the risk of configuration drift. 

Immutable infrastructure would not have been practical in the age of physical servers, since the cost and time to set up physical servers is high. Operations teams had to rely on updating servers once they were deployed. Virtual machines, however, can be destroyed and created in minutes with little cost. With containers, it might only take seconds. 
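The replace-rather-than-update pattern can be modeled with a frozen dataclass, which Python refuses to modify after creation. This is only an analogy for illustration; the `Server` fields and the `replace_fleet` helper are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Server:
    """A server whose attributes cannot be changed once created."""
    server_id: int
    dependency_version: str

def replace_fleet(old_fleet, new_version, next_id):
    """Create one new server per old server; the old ones are discarded."""
    return [Server(next_id + i, new_version) for i in range(len(old_fleet))]

fleet = [Server(1, "1.0"), Server(2, "1.0")]
fleet = replace_fleet(fleet, "1.1", next_id=3)
print(fleet)
```

Trying to assign a new `dependency_version` to an existing `Server` raises an error, which mirrors the rule that a running server is never patched in place.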

IaC and immutable infrastructure ultimately lead to lower business costs and a better user experience. These principles demonstrate how DevOps applies development principles to operations. 

GIF showing configuration file which enables automating provisioning and configuration of infrastructure.

Infrastructure as Code

Historically, businesses owned and managed infrastructure on company premises with their staff. This is known as **traditional-** or **in-house infrastructure**. With traditional infrastructure, the company acquires, configures, and maintains physical infrastructure components themselves. These components include servers, power supplies, and cooling.

Due to the complexity of infrastructure components and how they interact, this can be a tall order. Some of the problems that companies run into when managing their infrastructure include:

* Hardware components such as power supplies, hard drives, and RAM fail over time.
* Malicious users attempt to disrupt web services and steal sensitive data.
* Software becomes outdated, requiring consistent patches and upgrades.
* Infrastructure must be scaled up as demand grows.

Despite the many challenges that traditional infrastructure management faces, it offers unparalleled customization over its resources. While more modern solutions exist, many companies still use traditional infrastructure.

Infrastructure such as servers and databases are maintained in-house. Developers and users can access that infrastructure directly via networks.

In-House Infrastructure

The last couple of decades brought about an important shift in infrastructure management. This shift was **cloud-based infrastructure**. Cloud-based infrastructure refers to infrastructure and computing resources that are available to users over the internet. Usually, a third-party company owns, houses, and manages the physical infrastructure, allowing application developers to focus on defining their desired configurations.

This shift was largely brought about by virtualization technology. While cloud providers create physical pools of resources, virtualization allows many instances of an application to run on these resources. 

Cloud-based infrastructure has several benefits:

* It allows specific companies to specialize in physical infrastructure management while others focus on business logic. 
* It allows a company to scale quickly since cloud providers have physical resources readily available.
* It allows for efficient scaling by taking full advantage of virtualization. 

However, there are still valid reasons to use on-premises infrastructure. These include:

* It provides the ultimate in flexibility and control.
* It decreases reliance on third-party vendors.
* Sensitive data can be kept inside the company.

DevOps emphasizes cloud-based infrastructure because of its benefits in specialization and scalability. However, many companies still maintain on-premises infrastructure, especially for internal applications. Some companies use a hybrid approach, with servers both in the cloud and on-premises. 

Cloud providers supply data storage, servers, and virtualization to support applications.

Cloud Infrastructure

In this lesson, we covered many facets of infrastructure management, including:
* Scalability
* Virtualization and Containerization
* Orchestration
* Infrastructure as Code
* Cloud vs On-Premises infrastructure

DevOps brought about many changes to infrastructure management. DevOps practices include an emphasis on containerization, cloud-based infrastructure, and Infrastructure as Code. These practices have brought about cost-effective, scalable infrastructure and ultimately better user experience. 

image of various pieces of hardware connected with lines in a network/on a map. (Laptops, servers, data centers, smartphones, cloud)

Monitoring allows teams to watch and understand the state of their systems based on gathering predefined metrics and logs

Monitoring only validates the health of databases of an infrastructure

Monitoring a system prevents 100% of bugs before they are found in production

Monitoring involves applying a series of automated tests to a new piece of code, gradually ensuring that it is entirely correct

monitor metrics such as the number of false alerts and the time for defect resolution


pay for an expensive monitoring tool and that would be enough

monitor for latency and the rates of errors


monitor and resolve any alerts involving overuse of CPU

This quiz will assess the learner's knowledge on Monitoring in DevOps. 

Learn about the difficulties of manually performing the deployment pipeline in large projects and how the CI/CD pipeline can help.

_Anita has been an independent programmer for years, working solo on most of her projects. Recently, she has begun working on larger projects with a team and some of her development and deployment practices aren't working as well as before. When merging her code, she found that there were a lot of conflicts with the other developers’ code. Some bugs were slipping through the merges and causing more issues. Even some of the deployments failed! They did not have a prototype ready to show and faced many delays._ 

_Anita began to realize that her processes needed to change to scale up for larger projects._

Throughout this course, we have learned about various DevOps practices such as version control, environment configuration, and testing. These practices enable developers to deploy their code with great speed and reliability. However, when attempting to use these practices and tools at scale, issues like Anita's can arise. 

Teams practicing DevOps will turn to **automation** to scale their development and deployment practices. 

Automation is the process of using tools and scripts to perform tasks for us. Many aspects of the deployment process can be automated, from version control to testing and finally to deployment. Altogether, the automation of the deployment process creates the **Continuous Integration / Continuous Delivery (CI/CD) Pipeline**. This pipeline also encompasses **Continuous Testing** and **Continuous Deployment**.

In this lesson, we will take a look at each of the components of the CI/CD Pipeline and how it relates to the overall culture of DevOps. We will cover:

* Bottlenecks of Manual Deployment
* Automation in DevOps
* Continuous Testing
* Continuous Integration
* Continuous Delivery
* Continuous Deployment
* The Complete CI/CD Pipeline

Let's get started!


Long-lived feature branches, infrequent testing, and manual infrastructure management can lead to many merge conflicts, missed bugs, and failed deployments.

Although manual deployment isn’t always a bad thing, it can cause a lot of problems when trying to work on a large project with many developers. Let's take a look at some common bottlenecks found within the deployment process.

For each of these bottlenecks, think about what issues can occur when scaled up to a large project. Select the dropdowns to see the answers.
 
#### Version Control Management
 
When Anita was working by herself, she used to create a branch and code a new feature. Merging wasn't a concern since she was the only one working on the project. As a result, she typically waited until an entire feature was complete before merging into the main branch.

<details>
<summary>See the issue!</summary>

> This can cause some problems on a large scale if developers are working on long-lived branches containing new features. Once the developers try to merge their code, it can cause a lot of merge conflicts and introduce new bugs. 
</details>

#### Testing
 
Anita used to run tests when she felt it was necessary. She usually ran tests before merging a large feature branch or if she found a problem herself.
 
<details>
<summary>See the issue!</summary>

> It is usually not enough to only run a test before merging a branch. Testing infrequently can lead to bugs being overlooked. Additionally, testing issues can compound once merge conflicts begin to occur between developers. 
</details>

#### Infrastructure and Environments
 
Anita previously set up the servers and the environments herself. Before deploying her code on her staging or production server, she would configure the environment variables manually. A deployment would sometimes fail due to human error, but it wasn’t such a problem when she was working by herself. Now, the team relies on the servers working properly and can’t afford any simple mistakes.
 
<details>
<summary>See the issue!</summary>

> Differences between various deployment environments can be tricky to manage. Setting up environments manually makes room for human errors that can cause the server to go down and delay production. 
</details>

Bottlenecks of Manual Deployment

Many aspects of releasing software can be automated including testing, infrastructure configuration, deployment, and monitoring.

Automating processes leads to many benefits when compared to traditional manual processes. Automated systems are:
 
* Faster — automated processes can perform operations much faster than people.
* Less error-prone — automation is able to perform a task more consistently than a person.
* Cheaper — workers don’t have to be paid to do these repetitive workflows.

Automation can be incorporated into every part of the deployment pipeline. For example, automated tools can be set up to monitor repositories, detect new merges, and trigger the new version of the application to be built and tested. 
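The "detect new merges and trigger a build" step can be sketched as follows. The commit identifiers and the `build_and_test` callback are hypothetical stand-ins; a real automation tool would ask the version control system for the latest commit and start an actual build.

```python
def check_for_new_merge(last_built, latest_commit, build_and_test):
    """Trigger a build only when the main branch has moved.

    Returns the commit that is now considered built.
    """
    if latest_commit == last_built:
        return last_built  # nothing new, nothing to do
    build_and_test(latest_commit)
    return latest_commit

built = []  # records which commits were built, standing in for a real build

last = check_for_new_merge("a1b2c3", "a1b2c3", built.append)  # no new merge
last = check_for_new_merge(last, "d4e5f6", built.append)      # new merge found
print(built)
```

Running a check like this on a schedule, or in response to a notification from the repository, removes the need for anyone to start builds by hand.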

Some other ways to use automation in the deployment pipeline include:
 
* Configuring servers
* Moving the application through testing and staging servers
* Executing tests
* Deploying to production
* Monitoring the application
 
Over the next few exercises, we will see which parts of the CI/CD pipeline are responsible for automating each of these processes. First, we'll look at **Continuous Testing**.

Automation in DevOps

Continuous testing is featured in continuous integration, delivery, and deployment.

_It happened again. A bug was accidentally released to the public in one of Anita’s apps. This rarely happened when she wrote the code on her own. Now with so many developers merging in code and fixing conflicts all while reviewing different PRs, it seems to be inevitable._
 
Using automation to control the frequency and location of tests can minimize this issue. This automation of testing is called **Continuous Testing**. It involves automatically triggering tests to be executed once an application is built in a new environment. Using continuous testing casts a wide net, catching bugs early and ensuring that the requirements of the project are met. 
 
Tests can take a variety of forms with the most common being:
 
* Tests during development
  * Unit Tests
  * Integration Tests
* Tests before deployment to production
  * Acceptance Tests
  * End-to-end Tests
 
Since tests are triggered automatically each time the application is built in a new environment, developers don't need to constantly monitor the state of the tests. Developers are simply notified of any failed tests by the automation tool, allowing them to quickly address the problem. When tests pass in a given environment, the approved code can automatically be moved into the next environment where even more tests may be executed. 
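The promotion of code from one environment to the next can be modeled in a few lines. This is a simplified sketch: the environment names are assumptions, and each "test" is just a function returning `True` or `False` rather than a real test suite.

```python
def run_pipeline(environments, test_suites):
    """Run each environment's tests in order and stop at the first failure.

    Returns the list of environments the code passed through.
    """
    passed = []
    for env in environments:
        results = [test() for test in test_suites[env]]
        if not all(results):
            print(f"Tests failed in {env}; notifying developers.")
            break
        passed.append(env)
    return passed

# Hypothetical suites; the second staging test fails on purpose.
suites = {
    "integration": [lambda: 1 + 1 == 2],
    "staging": [lambda: "checkout".upper() == "CHECKOUT", lambda: False],
    "production": [lambda: True],
}
print(run_pipeline(["integration", "staging", "production"], suites))
```

Because the failing change stops at staging, it never reaches the production environment, and the developers find out without having to watch the tests themselves.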


Continuous Testing

Small changes trigger automated testing before merging into the main branch.

Continuous integration is a practice that consists of two main components:
 
* Merging source code changes on a frequent basis
* Building and testing the changes in an automated process
 
The main focus of continuous integration is frequency. By merging, testing, and building rapidly, we can prevent a lot of problems that could occur with many developers working on a single piece of software. Let's take a closer look at how.
 
#### No more long-lived branches
 
One of Anita’s problems was that she used large branches which she worked on for a long time. As other team members start contributing to the main branch, her branch will start to fall behind.

By switching to continuous integration practices, Anita now merges smaller changes more frequently. This follows the branching strategy of **trunk-based development** rather than **feature branch development**. Here’s the difference:
 
* Feature branch development — Merging in long-lived branches containing entire new features
 
* Trunk-based development — Merging in small changes frequently into the main branch (called the trunk)
 
Take a look at this diagram showing how each branching strategy performs differently.

> Hint: You can expand this "Learn" panel to get a better look

![Branching Strategies](https://static-assets.codecademy.com/Courses/DevOps/fundamentals/ci-trunk-vs-feature-branch.svg)

#### Fewer merge conflicts

Rapidly merging smaller changes means that there is less risk of merge conflicts. If any are found, they can be addressed quickly and require fewer changes due to the small size.
 
#### Frequent tests
 
Each new change automatically triggers building the application in an integration environment (sometimes referred to as a "CI server"). Through continuous testing, tests are executed immediately as changes are introduced. As a result, bugs are caught early on in the process.


Continuous Integration

Continuous delivery prepares code for production by running tests in intermediate environments.

Building and testing an application in testing and staging environments ensures that the code is compatible with production infrastructure.  When done manually, subtle differences between environments can make this a complicated process that is prone to human error. **Continuous delivery** addresses this issue through automation. 

Continuous delivery is the automated process of preparing new versions of an application to be deployed into the production environment. Picking up where continuous integration left off, continuous delivery automates:
* Environment configuration through containerization and infrastructure as code
* Deployment to intermediary environments (testing, staging, etc…)
* Further testing (acceptance and end-to-end tests)

Sometimes, an application that runs well in the development or testing environment will crash if deployed to production. DevOps practices, such as containerization and Infrastructure as Code (IaC) can be incorporated into continuous delivery to resolve these issues. 

Through continuous delivery, developers can be confident that the application has been thoroughly tested and is ready to be deployed at any time.


Continuous Delivery

Continuous deployment automates the release of new code to users.

The application is finally ready for production, but who is going to approve it and move it onto the production server? Traditionally, an approval process is controlled by the deployment team. The team would ensure that the production server is ready, all tests from continuous delivery have passed, and the feature meets the business requirements. Afterwards, the application would be manually deployed onto the production environment. 

This process typically requires entire features to be completed before deploying to production, resulting in slower release cycles. For many businesses, this is desirable. For other businesses, releasing new updates on a faster and more regular schedule may be preferred. To achieve this, **continuous deployment** is the solution.

**Continuous deployment** is the automatic process of deploying a project to the production server after it has been tested in testing and staging environments.

Automated deployments might look like this:
* If previous stages were successful, then the deployment is approved automatically. 
* An automated system deploys the application onto the production environment. 
* Final tests, feedback from users, and monitoring tools identify any bugs in production.
* Developers can quickly react and push bug fixes.
* The automated pipeline is triggered. This results in another deployment which updates the live application.
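
The flow above can be sketched in a few lines of Python. This is a minimal illustration rather than a real pipeline tool: the stage functions, `run_pipeline`, and `deploy_to_production` are all hypothetical names, and each stage simply returns `True` when it succeeds.

```python
from typing import Callable, List, Tuple

# Hypothetical stages: each returns True when it succeeds.
def build() -> bool:
    return True

def run_tests() -> bool:
    return True

def deploy_to_staging() -> bool:
    return True

def deploy_to_production(log: List[str]) -> None:
    # Stand-in for a real deployment step.
    log.append("deployed to production")

def run_pipeline(stages: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run stages in order; deploy automatically only if all succeed."""
    log: List[str] = []
    for name, stage in stages:
        if not stage():
            log.append(f"{name} failed: deployment blocked")
            return log
        log.append(f"{name} passed")
    # Every earlier stage succeeded, so approval is automatic.
    deploy_to_production(log)
    return log

stages = [("build", build), ("tests", run_tests), ("staging", deploy_to_staging)]
print(run_pipeline(stages))
```

A team that prefers manual deployments could keep the same stages and replace the final automatic step with a human approval.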

Once automated deployment is in place, users are always able to access the most up-to-date version of the application. Keep in mind that this may or may not be the desired effect depending on the business. Some businesses may automate continuous testing, integration, and delivery while choosing to maintain manual deployments.


Continuous Deployment

The deployment pipeline is automated, allowing developers to focus on building new features!

Overall, the CI/CD pipeline is the implementation of DevOps culture through automation. The relationships between them can be summarized in three points:
1. The CI/CD pipeline optimizes the production system as a whole, removing bottlenecks at each stage of the deployment process.
2. The CI/CD pipeline enables learning and experimentation due to the ability to rapidly release changes and update the application. 
3. Feedback loops are built into the CI/CD pipeline through continuous testing. As changes move through the pipeline, developers can track the testing progress of their code at each stage of the pipeline.
 
CI/CD not only improves each individual part of the deployment pipeline, it optimizes the entire system. The pipeline allows for developers to react quickly to changing customer demands, and allows customers to view the most recent version of the application. Additionally, by automating many of the processes, developers can focus on developing new features. No wonder many companies are incorporating the CI/CD pipeline into their projects!


DevOps Culture & The CI/CD Pipeline

Continuous testing, integration, delivery, and deployment make up the CI/CD pipeline.

In this lesson, we learned about the problems which can occur when using manual deployment processes with large team projects. Although manual processes may be fine for small projects, customers in the modern age require teams to move quickly and reliably.
 
The CI/CD pipeline provides this consistent and agile solution. It automates every part of the deployment process using:
* **Continuous Integration** --> Merging small changes and triggering builds and tests
* **Continuous Delivery** --> Preparing the application for deployment using intermediate environments and more tests
* **Continuous Deployment** --> Moving the application to production with final tests
 
Thanks to the practice of **Continuous Testing**, tests are automatically executed in every part of the CI/CD pipeline. Due to the high frequency of tests, we can "shift left" and catch bugs early on.
 
The entire CI/CD pipeline is the direct implementation of DevOps culture concepts. Each part of DevOps culture (systems-level thinking, continuous learning and experimentation, feedback loops) is put into practice by CI/CD. When we put it all together, implementing a CI/CD pipeline reduces human error, speeds up project development, and improves quality.
 
The next time manual processes are getting tiresome, or problems in your deployment process start occurring, consider using a CI/CD pipeline!


CI/CD

They are examples of bottlenecks in the deployment process.

They are examples of good practices in the deployment process.

They are ways to speed up the deployment process.

They do not occur in the deployment process.

Automatically running tests throughout the entire CI/CD pipeline.

Manually running tests throughout the entire CI/CD pipeline.

Running tests before merging code into the main branch.

Running tests after merging code into the main branch.

Practice your knowledge of CI/CD concepts!

Congratulations, you’ve successfully completed the Intro to DevOps course! In this course, you learned about DevOps, a culture of collaboration between Development and Operations teams that is supported by practices and implemented using tools. 

More specifically, you learned:
- The stages that changes to software go through to get from a developer's computer to its users
- The role and responsibilities of a traditional Operations team
- An overview of modern infrastructure management
- The difference between a DevOps culture and that of a traditional organization
- The purposes of key practices such as CI/CD, monitoring, and containerization
- The role of scalability, observability, and resiliency in a modern software system

Your learning journey into DevOps isn’t over yet! There are plenty of other topics that you can dive into to continue learning. Here are our recommendations for the next steps:

Go is the de facto language of DevOps and is used for tools such as Docker, Kubernetes, and Prometheus. Bash is a powerful way to automate infrastructure tasks through scripting. Git is a key component of most CI/CD pipelines.

### Git & GitHub

Version control is an essential piece of the deployment pipeline. In [Learn Git & GitHub](https://www.codecademy.com/learn/learn-git), you will learn how to add version control to your projects by using Git and GitHub.

### Bash Scripting

Scripts are used to automate many tasks that are integral to infrastructure management. In [Learn Bash Scripting](https://www.codecademy.com/learn/bash-scripting), you will learn Bash, one of the most popular scripting languages.

### Go

In [Learn Go](https://www.codecademy.com/learn/learn-go), you will learn Go (or Golang), an open source programming language designed to build fast, reliable, and efficient software at scale. It is also the de facto programming language of DevOps, used for tools such as Docker, Kubernetes, and Prometheus.

### Cybersecurity

Data is valuable, and oftentimes our infrastructure is not set up to protect it! In [Introduction to Cybersecurity](https://www.codecademy.com/learn/introduction-to-cybersecurity), you will learn the basic concepts needed to identify and protect against common cyber threats and attacks.

### Contribute to Codecademy Docs

Want to build your portfolio and help others? Contribute to Codecademy Docs! You can go to our [Contribution Guide](https://www.codecademy.com/resources/docs/contribution-guide) to get started.

<br>

Once again, congratulations on finishing your Intro to DevOps course! We are excited to see what you accomplish next. Happy coding!

You've completed the Intro to DevOps course! What's next?

Next Steps

In this lesson, we will introduce resiliency and practices related to having our systems able to function despite experiencing problems.

_The dashboard flashes red across nearly every service. The developers rush to try to locate the problem. They quickly realize that the main server, which hosts the core of the application, has had its power supply fail. There is supposed to be a backup, but the operations team hasn’t provisioned it yet. Customer complaints start rolling in as the team rushes to replace the power supply. How could this have been avoided?_

When it comes to software systems, time really is money: unexpected downtime can result in a loss of transactions, decay of customer trust, and a host of other issues. Failure will always be a part of our systems. However, with the right preparation, we can build systems that are resilient to failure. 

In this lesson, we will introduce **resiliency**, a system’s ability to continue to perform despite experiencing problems. Creating a resilient system allows our services to be **highly available**, which means our customers can access our functionality the vast majority of the time. 

To ensure resiliency, we can apply concepts of DevOps culture such as systems-level thinking, feedback, and continuous experimentation. We need to continuously monitor the whole system to understand how the components work together, learn from our system’s failures and create policies that respond to those failures.


A system dashboard in the middle of quite an emergency. With multiple issues visible at the same time, it can be difficult to understand the root cause of the problem. Times like these can often be avoided through the use of resiliency strategies.

What is Resiliency?




*It's been a hectic week for Karla. First, there was a power supply failure that took down the main server. Then, on Wednesday, Karla's company lost support for the payment processing service they have been using for years. Today, the home page is receiving a suspiciously large amount of traffic that isn't actually interacting with the site — it looks like it's probably a cyber-attack. What's going to be thrown at this company next?*

Encountering problems is an intrinsic part of dealing with software systems. However, we can make significant gains in resiliency by designing our system to handle some of the most common issues it will face.

The common types of system problems fit into the following categories:

* **Internal problems**: these problems come from within the components of the system that we control. Internal problems include in-house hardware issues and software bugs.
* **External problems**: these problems arise from dependencies we have on other parties outside of our control. External problems might include issues with an API, or a cloud service our application relies on.
* **Malicious actors**: these problems stem from other people (or sometimes bots) that seek to disrupt or exploit our services for a variety of reasons.

Through managing these threats, we can make our systems more resilient!

System Problems

This gif depicts a system losing connection to a primary database and switching to a backup.

_The Data team was excited to roll out the new database system they had been working on for months. However, when the database was deployed on production hardware, something was wrong. All of the database requests were getting immediately rejected. It turned out that a configuration property was still set to the testing environment, causing the connection to fail in production._

Within our organization, internal failures will occur that are our own responsibility — and are often our own doing. Let's explore some of the ways these problems occur and how to mitigate them.

### System Changes

Updates to our system's hardware, dependencies, or code all have the potential to make our system fail or behave unexpectedly. These issues are best mitigated through a comprehensive suite of automated tests performed prior to completing any change. Change-management processes should also exist to:

* Clearly identify a change
* Determine how the change will be made
* Define how the change can be reviewed
* Define how the change can be rolled back

However, not all internal issues are our own doing. Some issues occur from our systems existing over time.

### Hardware Failures

Time passing can sometimes be enough of a change for a system to break down. Over time, hardware components, like hard drives and power supplies, reach the end of their lifespan and fail. 

**Redundancy** is one of the most common methods for providing resiliency against hardware failures. Many computer users keep a backup hard drive in case their primary one fails. We don't need two hard drives until a problem happens and we suddenly wish we had a backup.

To combat this, organizations can duplicate their hardware components. This redundancy can allow for a seamless switchover to a backup component when a failure occurs. Despite an increase in the cost and complexity of managing backup components, redundancy can go a long way in ensuring high availability of our systems.
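
As a rough sketch of how a switchover might work, the Python below routes requests to a primary component and promotes a backup when the primary fails. The `Component` and `RedundantPair` classes are hypothetical and stand in for real hardware and failover tooling.

```python
class Component:
    """A hypothetical hardware-backed component, such as a database server."""
    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class RedundantPair:
    """Send requests to the active component and fail over to the standby."""
    def __init__(self, primary: Component, backup: Component) -> None:
        self.active = primary
        self.standby = backup

    def handle(self, request: str) -> str:
        try:
            return self.active.handle(request)
        except ConnectionError:
            # Failover: promote the standby and retry the request once.
            self.active, self.standby = self.standby, self.active
            return self.active.handle(request)


primary = Component("primary-db")
backup = Component("backup-db")
pair = RedundantPair(primary, backup)

print(pair.handle("query 1"))   # served by the primary
primary.healthy = False         # simulate a hardware failure
print(pair.handle("query 2"))   # served by the backup after failover
```

The caller never needs to know which component answered, which is what makes the switchover seamless. If both components are down, the error still reaches the caller, so redundancy reduces the risk of an outage rather than removing it.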


Internal Failures

_At an emergency call center, something wasn’t right. All of the calls coming in were apparently coming from the same spot in the middle of the Atlantic Ocean! It turned out the third-party geo-location service that the call center was using had stopped working. For each address query, the service was simply responding with latitude and longitude (0,0). This default error value made it look like the dispatch center was suddenly serving Atlantis!_

Some issues occur outside of our organization that we can only react to. Common problems involving external dependencies include:
* A dependency no longer being supported
* A dependency being taken down
* A dependency having an internal outage of its own

There are a variety of methods we can use in order to help resolve issues with external dependencies:

* We can be on the lookout for news of scheduled outages, vulnerabilities, or cancellations from our most important dependencies. 
* We can define fallback strategies for our external dependencies. If we detect that a dependency is not working, we might switch to a different dependency, or hide the functionality that depends on the failing service.
* We can automatically update dependencies as new versions are released. However, there is the possibility that a change in the new dependency version can result in our system breaking.
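
To make the fallback idea concrete, here is a minimal Python sketch based on the call center story. Both geocoding functions, their return values, and the `locate` helper are made up for illustration. The sketch treats the (0, 0) default as a failure and moves on to a second provider.

```python
from typing import Callable, List, Optional, Tuple

Coordinates = Tuple[float, float]

def primary_geocoder(address: str) -> Coordinates:
    # Hypothetical third-party service that is currently failing and
    # returning its default error value instead of raising an error.
    return (0.0, 0.0)

def backup_geocoder(address: str) -> Coordinates:
    # Hypothetical second provider with a made-up response.
    return (40.7128, -74.0060)

def is_plausible(coords: Coordinates) -> bool:
    # Treat the (0, 0) default as a failure rather than a real location.
    return coords != (0.0, 0.0)

def locate(
    address: str, providers: List[Callable[[str], Coordinates]]
) -> Optional[Coordinates]:
    """Try each provider in order and return the first plausible answer."""
    for provider in providers:
        try:
            coords = provider(address)
        except Exception:
            continue  # provider is unreachable; try the next one
        if is_plausible(coords):
            return coords
    # No provider worked: the caller can hide the feature or ask the user.
    return None

print(locate("123 Example Street", [primary_geocoder, backup_geocoder]))
```

Returning `None` when every provider fails lets the application hide the dependent functionality instead of showing a misleading location.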

When using external services, we open up our system to issues that are often outside of our control. On the other hand, external services enable the development of far more powerful applications than we could create on our own. By implementing resiliency strategies, we can have the best of both worlds.


An application becoming disconnected from the cloud.

External Failures

In this illustration, waves of requests eventually take down a server.

In 2020, Google reported dealing with a [DDOS](https://www.codecademy.com/resources/blog/what-is-a-ddos-attack/) attack that lasted for over six months. At its peak, the traffic created by the attackers reached 2.5 terabits per second (Tbps) and was sent to over 160,000 servers. This was, at the time, the largest DDOS attack by traffic volume ever recorded. Events such as these showcase that malicious actors have been creating ever larger and more powerful attacks.

Cyber-attacks are attempts to disrupt system services or steal an organization's data. They can happen to businesses of different sizes and types. Some common types of cyber-attack include:

* **Distributed Denial of Service (DDOS)** attacks try to crash a target by overwhelming it with requests.
* **SQL injection** attacks try to run malicious database queries to reveal internal information.

Some techniques to handle these kinds of attacks include:

* **Filtering**: Applications might have an infrastructure layer in front of them that detects incoming DDOS attacks and ignores similar traffic.
* **Validation**: Checking that requests are valid and do not contain malicious code can help prevent attacks such as SQL injection.
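
As a small illustration of validation, the Python sketch below checks a username before using it in a database query. It also uses a parameterized query, a related and widely used defense that the list above does not name. The table, the data, and the `find_email` function are invented for this example, which uses Python's built-in `sqlite3` module.

```python
import re
import sqlite3

# In-memory database with a made-up table, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'alice@example.com')")
conn.execute("INSERT INTO users VALUES ('bob', 'bob@example.com')")

# Assumed rule for this example: usernames are short and alphanumeric.
USERNAME_PATTERN = re.compile(r"[A-Za-z0-9_]{1,30}")

def find_email(name: str):
    # Validation: reject anything that is not a simple username.
    if not USERNAME_PATTERN.fullmatch(name):
        raise ValueError("invalid username")
    # Parameterized query: the driver treats `name` as data, never as SQL.
    row = conn.execute(
        "SELECT email FROM users WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None

print(find_email("alice"))

try:
    find_email("alice' OR '1'='1")
except ValueError as error:
    print("rejected:", error)
```

The second call is a classic injection attempt. It is rejected by the validation step before it ever reaches the database.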

When thinking about cyber-attacks, it is important to consider what attackers generally want. The prime target for most cyber-attacks is data. Cyber criminals often want to gain access to or destroy sensitive data. By employing security best practices we can be better protected against cyber-attacks.

> Learn more about cyber-attacks in our [Introduction to Cybersecurity course](https://www.codecademy.com/learn/introduction-to-cybersecurity)!


Cyber Attacks

In this chart, the downtime of a system is graphed over time. There are two peaks in the chart, caused by a cloud outage and a database failure.

_Over the last few months, Bloop Co. has been trying to improve its systems' resiliency. But how do they know that they are on the right track? Which of their efforts is most successful? What are further areas they need to address? The next several exercises seek to answer these questions._

Without the ability to measure our systems’ resiliency, we can't be sure how effective our current strategy is. We can use aspects of monitoring to measure our system's responses to problems. Some important metrics that indicate a system's resiliency include:

* **Uptime**: what percentage of the time is our system available?
* **Recovery speed**: when an outage occurs, how long does it take for the system to become available again?

These metrics tell us how our system handles issues, but what do we compare them against? A pair of benchmarks that we can compare our system's performance against are recovery time objective (RTO) and recovery point objective (RPO). 

### Recovery Time Objective (RTO)

The **RTO** is the amount of time an application can be unavailable before it causes significant harm to the business. Imagine that a business has promised its users that it will never go down for more than an hour at a time. This business has set its RTO to be one hour. If the business goes down for more than that time, it will have violated an important promise to the customer.

### Recovery Point Objective (RPO)

The **RPO** is the acceptable amount of data loss after a system outage. Different applications have varying levels of data importance. A popular bank losing minutes of transaction data might be a nightmare. Losing hours of progress in an online multiplayer system would be unfortunate, but not a disaster. This acceptable level of data that can be lost might affect how often data is backed up, or cause adjustments to the RTO.

Benchmarks such as these can help us establish target levels of uptime and recovery time for our business. These targets can help us be aware of when our systems are approaching a critical "danger zone".
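
To see how these metrics and benchmarks fit together, consider the short Python sketch below. The outage log, the one-hour RTO, the 15-minute RPO, and the backup schedule are all assumed values chosen for illustration. The sketch computes uptime and average recovery time, then compares them against the targets.

```python
from datetime import datetime, timedelta

# Hypothetical outage log for one 30-day period: (start, end) of each outage.
period = timedelta(days=30)
outages = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 45)),
    (datetime(2024, 1, 20, 2, 0), datetime(2024, 1, 20, 3, 30)),
]

# Assumed targets and backup schedule for this example.
rto = timedelta(hours=1)           # longest acceptable single outage
rpo = timedelta(minutes=15)        # most data we can afford to lose
backup_interval = timedelta(minutes=10)

durations = [end - start for start, end in outages]
downtime = sum(durations, timedelta())

uptime_percent = 100 * (1 - downtime / period)
average_recovery = downtime / len(durations)
rto_violations = [d for d in durations if d > rto]
# If backups run every `backup_interval`, that is the worst-case data loss.
meets_rpo = backup_interval <= rpo

print(f"Uptime: {uptime_percent:.3f}%")
print(f"Average recovery time: {average_recovery}")
print(f"Outages longer than the RTO: {len(rto_violations)}")
print(f"Backup schedule meets the RPO: {meets_rpo}")
```

In this made-up log, the second outage lasts 90 minutes, so it exceeds the one-hour RTO even though overall uptime stays above 99%. A single summary number can hide exactly the kind of event the RTO is meant to catch.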


Measuring Resiliency

_Bloop Co wants to make sure that their system can handle thousands of new users attempting to buy their product. One of the engineers has created a test. The company gathers around as simulated requests flood into the system. 1000 requests, 2000 requests, 5000 requests: the system is able to handle more and more requests as they come in. It looks like some of these resiliency methods have been working!_

Remember, we want to know how our system will perform under difficult circumstances. It makes sense then to create some problems on purpose, to see how our system responds. Let's take a look at some ways engineers test the resiliency of their systems.

**Penetration testing** involves simulating cyber-attacks to try to exploit security vulnerabilities. Penetration testing gives us a chance to see how our system might respond to a malicious user. Using penetration testing allows us to identify holes in our security that we need to fix.

**Load testing** seeks to replicate situations in which the system is under heavy use. Load testing might simulate millions of customers trying to access our site all at once. Load testing can help us identify areas in which the system will break under real-world conditions.
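
The idea behind a load test can be sketched with a simple simulation in Python. Real load tests send actual traffic to a running system, usually with dedicated tools. Here, the `SimulatedServer` class and its capacity of 4000 requests per second are assumptions made purely for illustration.

```python
class SimulatedServer:
    """A stand-in for a real service that can process a fixed number of
    requests per second (an assumed capacity, not a measured one)."""
    def __init__(self, capacity_per_second: int) -> None:
        self.capacity = capacity_per_second

    def handle_second(self, incoming: int) -> int:
        # Returns how many requests were dropped during this second.
        return max(0, incoming - self.capacity)


def ramp_load(server: SimulatedServer, levels):
    """Send increasing load and record the error rate at each level."""
    results = []
    for requests_per_second in levels:
        dropped = server.handle_second(requests_per_second)
        results.append((requests_per_second, dropped / requests_per_second))
    return results


server = SimulatedServer(capacity_per_second=4000)
results = ramp_load(server, [1000, 2000, 5000, 10000])

for load, error_rate in results:
    print(f"{load} requests/s -> {error_rate:.0%} errors")

# The breaking point is the first load level with any dropped requests.
breaking_point = next((load for load, rate in results if rate > 0), None)
print("Breaking point:", breaking_point)
```

Ramping the load in steps, rather than sending the maximum right away, shows roughly where the system starts to fail instead of only showing that it does.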

Each of these practices can help us identify and then remove weak spots within our system. Sometimes it takes breaking a system to make it better!


A system undergoing load testing, receiving requests until it starts to smoke.

Testing Resiliency

_The engineering team waits anxiously as their manager walks around in the server room. Suddenly, the manager pulls the plug on one of the server racks. The team watches the monitors. An alert goes off, but other servers come online. It looks like the system is recovering. They’ve passed a "Chaos Monkey" test for the first time._

**Chaos engineering** is the practice of performing experiments on a system in order to test its resiliency. Chaos engineering emerged as a practice in the early 2010s at Netflix, whose engineers introduced the idea of a “chaos monkey”. This chaos monkey would roam around the server room randomly disconnecting pieces of equipment. 

> There was no real monkey – these actions were performed at first by people and later through automation.

These kinds of chaotic experiments more closely replicate the random nature of real-world failure. Having a system be able to withstand chaos provides a much more powerful guarantee that it can handle “anything”.
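
A toy version of such an experiment might look like the Python sketch below. It is not how any real chaos tool works: the `Service` and `Replica` classes and the `chaos_experiment` function are hypothetical. The sketch disables random replicas and checks whether the service stays available after each failure.

```python
import random

class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True

class Service:
    """A hypothetical service backed by several interchangeable replicas."""
    def __init__(self, replica_count: int) -> None:
        self.replicas = [Replica(f"replica-{i}") for i in range(replica_count)]

    def is_available(self) -> bool:
        # The service works as long as any replica can answer.
        return any(replica.alive for replica in self.replicas)

def chaos_experiment(service: Service, failures: int, rng: random.Random):
    """Disable random replicas one at a time, checking availability after each."""
    observations = []
    for _ in range(failures):
        living = [r for r in service.replicas if r.alive]
        if not living:
            break
        victim = rng.choice(living)
        victim.alive = False
        observations.append((victim.name, service.is_available()))
    return observations

rng = random.Random(42)  # fixed seed so the experiment can be repeated
service = Service(replica_count=3)
for victim, available in chaos_experiment(service, failures=2, rng=rng):
    print(f"disabled {victim}; service available: {available}")
```

Seeding the random number generator keeps the failures unpredictable in appearance while still letting the team replay the same experiment after a fix.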

At a much larger scale than the chaos monkey are _disaster recovery exercises_. In these exercises, companies simulate a large-scale event happening to their infrastructure. Whole rooms of servers or external dependencies might suddenly go out all at once. Companies will go through the strategy for dealing with these scenarios and see how their responses might pan out in real-time. These kinds of strategies can help companies prepare for a rare but catastrophic event.


Simulated Failure

Over the course of this lesson, we've discussed resiliency, a system's ability to perform under problematic conditions. Common conditions we need our systems to be resilient against include:

* **Internal failures**, including hardware, software, and process breakdowns
* **External failures**, including APIs, external services, and third-party providers
* **Malicious attacks**, such as DDOS and SQL injection attacks

There are a variety of practices that allow us to make our systems more resilient. Some common practices include:

* Having backups for our infrastructure
* Limiting our crucial dependencies
* Validating client requests

In addition to using these methods, we also need to measure how resilient our systems are. We can use metrics such as the time needed to solve a problem or the amount of downtime our system experiences. Another option is to simulate issues using techniques such as:

* **Penetration testing and load testing**
* **Chaos Engineering**
* **Disaster recovery exercises**

Resiliency practices allow us to provide critical services even under adverse conditions. Building resiliency relies on DevOps cultural practices such as continuous experimentation and learning from failure. Let's start building systems capable of weathering any storm!



Introduction to Resiliency

Users will be able to access the application most of the time, increasing user-satisfaction

Developers will be able to deploy code automatically to production servers, increasing speed

Developers will be able to gain more insight into what's happening in their systems, increasing quality

Developers will find bugs faster because of the actionable alerts coming from their system, increasing speed


This quiz will test the learner's knowledge on resiliency in DevOps. 

### Why Intro to DevOps?

DevOps culture has taken over the software industry and permanently changed the way many organizations do their work. DevOps is a culture of collaboration between Development and Operations teams that is supported by a variety of practices and tools. Through practices such as monitoring, CI/CD, and blameless retrospectives, we can reliably deliver software with great speed and high quality. In this course, you will learn about this culture, these practices, and more!
 
### Take-Away Skills 

Here's what you'll be learning in this course:

- The steps that changes to software go through to get from a developer's computer to its users
- The role and responsibilities of a traditional Operations team
- An overview of modern infrastructure management
- The difference between a DevOps culture and that of a traditional organization
- The purposes of key practices such as CI/CD, monitoring, and containerization
- The role of scalability, observability, and resiliency in a modern software system


DevOps is a culture of collaboration between Development and Operations teams that is supported by a variety of practices and tools.