The engineering team waits anxiously as their manager walks around in the server room. Suddenly, the manager pulls the plug on one of the server racks. The team watches the monitors. An alert goes off, but other servers come online. It looks like the system is recovering. They’ve passed a “Chaos Monkey” test for the first time.
Chaos engineering is the practice of running experiments on a system to test its resiliency. It emerged in the early 2010s at Netflix, whose engineers introduced the idea of a “chaos monkey”. This chaos monkey would roam around the server room, randomly disconnecting pieces of equipment.
There was no real monkey – these actions were performed at first by people and later through automation.
These kinds of chaotic experiments replicate the random nature of real-world failure more closely than scripted tests do. A system that withstands this chaos gives much stronger evidence that it can handle unexpected failures in production.
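The idea can be illustrated with a toy simulation. The sketch below is not Netflix's tool; the `Cluster` class, its `heal` method, and the `chaos_monkey` function are hypothetical names invented for this example. It assumes a pool that should always hold a fixed number of servers, terminates one at random each round, and checks that the pool recovers.

```python
import random


class Cluster:
    """Toy model of a server pool with a desired number of healthy instances."""

    def __init__(self, desired_size):
        self.desired_size = desired_size
        self.next_id = 0
        self.instances = set()
        self.heal()

    def heal(self):
        """Launch replacement instances until the pool is back at full size."""
        while len(self.instances) < self.desired_size:
            self.instances.add(f"server-{self.next_id}")
            self.next_id += 1

    def is_healthy(self):
        return len(self.instances) == self.desired_size


def chaos_monkey(cluster, rng):
    """Terminate one randomly chosen instance and return its name."""
    victim = rng.choice(sorted(cluster.instances))
    cluster.instances.remove(victim)
    return victim


def run_experiment(rounds=5, seed=None):
    """Return True if the cluster recovered after every random termination."""
    rng = random.Random(seed)
    cluster = Cluster(desired_size=3)
    for _ in range(rounds):
        victim = chaos_monkey(cluster, rng)
        print(f"terminated {victim}; {len(cluster.instances)} instances left")
        cluster.heal()  # the recovery mechanism under test
        if not cluster.is_healthy():
            return False
    return True


if __name__ == "__main__":
    print("recovered every time:", run_experiment())
```

In a real system the `heal` step would be whatever recovery mechanism is being tested, such as an autoscaler or a failover process, and the experiment would watch real health metrics rather than a simple count.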
Disaster recovery exercises operate at a much larger scale than the chaos monkey. In these exercises, companies simulate a large-scale event happening to their infrastructure. Whole rooms of servers or external dependencies might go down all at once. Companies walk through their strategy for dealing with these scenarios and see how their responses would play out in real time. These exercises help companies prepare for rare but catastrophic events.
Can you come up with any additional ways to cause chaos in a system in order to test its resiliency?
The chaos monkey is just one part of Netflix’s much larger Simian Army. Here are a few of the tools within this army, each with a brief description:
- Latency Monkey: Introduces long delays into important service calls
- Simius Cogitarius: Causes high CPU usage
- Simius Plenus: Attempts to completely fill disk space
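As a rough illustration of the latency idea from the list above, the sketch below wraps a function so that some calls are artificially delayed. This is not how Netflix's Latency Monkey is implemented; `inject_latency` and `fetch_profile` are hypothetical names made up for this example, and the probability and delay values are arbitrary.

```python
import functools
import random
import time


def inject_latency(probability, delay_seconds, rng=random, sleep=time.sleep):
    """Decorator that delays a call with the given probability."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                sleep(delay_seconds)  # simulated slow dependency
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_latency(probability=0.5, delay_seconds=0.2)
def fetch_profile(user_id):
    # Hypothetical stand-in for a call to an important downstream service.
    return {"id": user_id, "name": "example"}


if __name__ == "__main__":
    for _ in range(5):
        start = time.perf_counter()
        fetch_profile(42)
        print(f"call took {time.perf_counter() - start:.3f}s")
```

Running callers against the delayed function shows whether their timeouts, retries, and fallbacks behave sensibly when a dependency slows down. The `rng` and `sleep` parameters exist only so the delay can be controlled in tests.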