The Data team was excited to roll out the new database system they had been working on for months. However, when the database was deployed on production hardware, something was wrong. All of the database requests were getting immediately rejected. It turned out that a configuration property was still set to the testing environment, causing the connection to fail in production.
Within our organization, internal failures will occur that are our own responsibility and are often our own doing. Let’s explore some of the ways these problems occur and how to mitigate them.
Updates to our system’s hardware, dependencies, or code all have the potential to make our system fail or behave unexpectedly. These issues are best mitigated through a comprehensive suite of automated tests performed prior to completing any change. Change-management processes should also exist to:
- Clearly identify a change
- Determine how the change will be made
- Determine how the change can be reviewed
- Determine how the change can be rolled back
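One automated check that could run before a change is completed is a configuration check like the one below. It is only a minimal sketch: the function name `find_config_problems` and the keys `environment` and `db_host` are hypothetical, chosen to mirror the kind of leftover testing setting that broke the Data team's deployment.

```python
# Hypothetical pre-deployment check. The key names ("environment", "db_host")
# are illustrative assumptions, not taken from any real system.

def find_config_problems(config: dict, target_env: str) -> list[str]:
    """Return human-readable problems; an empty list means the check passed."""
    problems = []

    # The configuration should name the environment we are deploying to.
    if config.get("environment") != target_env:
        problems.append(
            f"environment is {config.get('environment')!r}, expected {target_env!r}"
        )

    # A production deployment should not point at a testing database host.
    db_host = config.get("db_host", "")
    if target_env == "production" and "test" in db_host:
        problems.append(f"db_host {db_host!r} looks like a testing host")

    return problems


if __name__ == "__main__":
    stale_config = {"environment": "testing", "db_host": "db.test.internal"}
    for problem in find_config_problems(stale_config, "production"):
        print("BLOCKED:", problem)
```

Running a check like this as part of the test suite means a stale setting blocks the change before deployment instead of surfacing as rejected requests in production.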
However, not all internal issues are our own doing. Some issues arise simply because our systems have been running for a long time.
Time passing can sometimes be enough of a change for a system to break down. Over time, hardware components, like hard drives and power supplies, reach the end of their lifespan and fail.
Redundancy is one of the most common methods for providing resiliency against hardware failures. Many computer users keep a backup hard drive in case their primary one fails. We don’t need two hard drives until a problem happens and we suddenly wish we had a backup.
To combat this, organizations can duplicate their hardware components. This redundancy can allow for a seamless switchover to a backup component when a failure occurs. Despite an increase in the cost and complexity of managing backup components, redundancy can go a long way in ensuring high availability of our systems.
Here we can see a database experiencing an issue and crashing. However, the resilient system depicted is able to detect this and route database traffic to a backup. Despite experiencing a problem, our users wouldn’t notice a thing.
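The detect-and-reroute behavior described here can be sketched in a few lines. This is an illustration under stated assumptions, not a production design: `Database` is a hypothetical stand-in for a real connection, and `FailoverRouter` is an invented name for the component that switches traffic to the backup.

```python
class Database:
    """Hypothetical stand-in for a real database connection."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def query(self, sql):
        # Simulate a crashed database by refusing every request.
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        return f"{self.name} handled: {sql}"


class FailoverRouter:
    """Send queries to the active database, switching to the standby on failure."""

    def __init__(self, primary, backup):
        self.active = primary
        self.standby = backup

    def query(self, sql):
        try:
            return self.active.query(sql)
        except ConnectionError:
            # Promote the standby and retry once; a second failure propagates.
            self.active, self.standby = self.standby, self.active
            return self.active.query(sql)


if __name__ == "__main__":
    primary = Database("primary")
    router = FailoverRouter(primary, Database("backup"))
    print(router.query("SELECT 1"))  # served by the primary
    primary.healthy = False          # simulate the primary crashing
    print(router.query("SELECT 1"))  # served by the backup
```

From the caller's side, both queries succeed, which is the sense in which users would not notice the failure. The sketch deliberately leaves out how the backup's data is kept current.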
In the art for this exercise, we can see a backup database being maintained along with a primary database.
While our backup is able to save the day, what might be some of the challenges associated with managing such a backup?
While it might be impossible to prevent all failures within a system, practices such as testing, change management, and redundancy can help prevent and reduce the impact of internal failures. But what about problems coming from outside the system, from external parties that our system relies on? We will be discussing these external problems next.