The Data team was excited to roll out the new database system they had been working on for months. However, when the database was deployed on production hardware, something was wrong. All of the database requests were getting immediately rejected. It turned out that a configuration property was still set to the testing environment, causing the connection to fail in production.

Within our organization, internal failures will occur that are our own responsibility and often our own doing. Let’s explore some of the ways these problems occur and how to mitigate them.

System Changes

Updates to our system’s hardware, dependencies, or code all have the potential to make our system fail or behave unexpectedly. These issues are best mitigated through a comprehensive suite of automated tests performed prior to completing any change. Change-management processes should also exist to:

  • Clearly identify a change
  • Determine how the change will be made
  • Determine how the change can be reviewed
  • Determine how the change can be rolled back
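
As an illustration of the kind of automated check that could have caught the configuration problem from the opening story, here is a minimal Python sketch. The `DatabaseConfig` class, its fields, and the host name are hypothetical and chosen only for this example; a real system would read these values from its own configuration source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseConfig:
    # Hypothetical settings; the field names are illustrative only.
    environment: str
    host: str


def validate_config(config: DatabaseConfig, target_environment: str) -> list[str]:
    """Return a list of problems found; an empty list means the check passed."""
    problems = []
    if config.environment != target_environment:
        problems.append(
            f"environment is '{config.environment}', expected '{target_environment}'"
        )
    if not config.host:
        problems.append("host is empty")
    return problems


# A configuration left over from testing is flagged before deployment.
stale = DatabaseConfig(environment="testing", host="db.test.internal")
problems = validate_config(stale, "production")
```

Running a check like this as part of the test suite means a deployment can be stopped whenever `problems` is not empty, before any production request is rejected.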

However, not all internal issues are our own doing. Some arise simply because our systems age.

Hardware Failures

Time passing can sometimes be enough of a change for a system to break down. Over time, hardware components, like hard drives and power supplies, reach the end of their lifespan and fail.

Redundancy is one of the most common methods for providing resiliency against hardware failures. Many computer users keep a backup hard drive in case their primary one fails. We don’t need two hard drives until a problem happens and we suddenly wish we had a backup.

To combat this, organizations can duplicate their hardware components. This redundancy can allow for a seamless switchover to a backup component when a failure occurs. Despite an increase in the cost and complexity of managing backup components, redundancy can go a long way in ensuring high availability of our systems.
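
The switchover idea can be sketched in a few lines of Python. This is a simplified illustration, not a production design: the `Database` and `FailoverRouter` classes are hypothetical stand-ins, and a failure is simulated by flipping a flag rather than by real hardware breaking.

```python
class Database:
    """Hypothetical stand-in for a database connection."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy

    def query(self, sql: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        return f"{self.name} handled: {sql}"


class FailoverRouter:
    """Send queries to the active component, switching to the standby on failure."""

    def __init__(self, primary: Database, backup: Database) -> None:
        self.active = primary
        self.standby = backup

    def query(self, sql: str) -> str:
        try:
            return self.active.query(sql)
        except ConnectionError:
            # Swap roles so later queries go straight to the healthy component.
            self.active, self.standby = self.standby, self.active
            return self.active.query(sql)


primary = Database("primary")
backup = Database("backup")
router = FailoverRouter(primary, backup)

first = router.query("SELECT 1")   # served by the primary
primary.healthy = False            # simulate a hardware failure
second = router.query("SELECT 1")  # served by the backup after the switchover
```

The caller only ever talks to `router`, so the switch is invisible to it. If the backup has failed too, the second attempt raises `ConnectionError`, which is one reason the backup itself needs monitoring.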

Instructions

Here we can see a database experiencing an issue and crashing. However, the resilient system depicted is able to detect this and route database traffic to a backup. Despite experiencing a problem, our users wouldn’t notice a thing.

In the art for this exercise, we can see a backup database being maintained along with a primary database.

While our backup is able to save the day, what might be some of the challenges associated with managing such a backup?

Answer

There are several things we might want to consider when dealing with a database backup. Some examples include:

  • Keeping the backup up to date with the primary database: How often do we replicate the primary database to the backup? Once a month? Once a day? All the time? This depends on how critical it is that our users have access to the most recent information.
  • Making sure our backup is healthy: We don’t want to have to switch to the backup, only to find that it’s also broken. We need to perform health checks on both the backup and the primary database.
  • Making sure our system properly switches to the backup: We need to ensure that our system is able to detect a failure of the primary database and correctly switch over to the backup when needed.
  • Moving back to the primary database when the issue is fixed: Once our primary database is up and running, does it resume operation? Or maybe it could become the new backup. We need to be clear on what the process is.
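
The considerations above can be combined into one small decision function. The Python sketch below is only one possible policy, under stated assumptions: the `ReplicaStatus` fields, the `max_lag_seconds` threshold, and the rule that the primary resumes as soon as it is reachable are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReplicaStatus:
    # Hypothetical health report for one database.
    name: str
    reachable: bool
    last_synced_at: float  # timestamp in seconds of the last successful sync


def backup_is_usable(backup: ReplicaStatus, now: float, max_lag_seconds: float) -> bool:
    """A backup is usable if it responds and its data is recent enough."""
    lag = now - backup.last_synced_at
    return backup.reachable and lag <= max_lag_seconds


def choose_active(
    primary: ReplicaStatus,
    backup: ReplicaStatus,
    now: float,
    max_lag_seconds: float,
) -> str:
    """Pick which database should serve traffic right now."""
    # Assumed failback policy: the primary resumes whenever it is reachable.
    if primary.reachable:
        return primary.name
    if backup_is_usable(backup, now, max_lag_seconds):
        return backup.name
    raise RuntimeError("no healthy, sufficiently fresh database available")


primary = ReplicaStatus("primary", reachable=False, last_synced_at=1000.0)
backup = ReplicaStatus("backup", reachable=True, last_synced_at=970.0)

# The backup is 30 seconds behind, which is acceptable with a 60-second limit.
active = choose_active(primary, backup, now=1000.0, max_lag_seconds=60.0)
```

Tightening `max_lag_seconds` to 10 would make the same call raise `RuntimeError`, which shows how the replication schedule and the tolerance for stale data have to be chosen together.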

While it might be impossible to prevent all failures within a system, practices such as testing, change management, and redundancy can help prevent and reduce the impact of internal failures. But what about problems coming from outside the system, from external parties that our system relies on? We will be discussing these external problems next.
