Why is chaos engineering important? Let’s look at why Quality Assurance (QA) exists in the first place.
Quite simply, QA exists because no matter how hard we try to create perfect software that always does what it is designed to do, the real world doesn’t seem to allow it.
Mistakes sneak in. Machines interpret our code differently from how we intended. Stuff happens. Quality engineering exists to try to find those unintentional problems before our customers do.
What Is Chaos Engineering?
Chaos Engineering is a disciplined approach to identifying potential failures before they become outages. With Chaos Engineering you design failure injection experiments and compare what you think will happen to what actually happens in your systems. You literally “break things on purpose” to learn how to build more resilient systems.
Why Is Chaos Engineering Important?
Because systems are changing. Traditionally, QA runs a variety of tests and test types to proactively seek out these problems long before the code ends up in production. These tests run at the end of a build and before the code is deployed publicly, typically in a staging or test environment.
So far, so good, if we are operating in a traditional software development and deployment model. Monolithic designs and deployments to corporate-owned machines give a great amount of control, and there is stability inherent in this control. That stability keeps staging and test environments similar to production, which is what allows testing in them to be successful.
Distributed systems are different. The cloud is different. We don’t control the infrastructure, and it is constantly changing. It changes by design: services, microservices, and load balancers spin up additional compute nodes or remove them as needed, and failover systems adjust themselves to ensure risks are managed. This constant change causes unexpected, emergent behaviors. These are behaviors that we can’t always predict, but which we can deliberately cause and reproduce using a form of testing called Chaos Engineering.
Testing Must Adapt As Systems Change
We cannot effectively test everything that our production code and the environment will encounter by testing in any other environment. We can test many things, things that traditional QA does quite well and should continue to do, although perhaps some of it can now be done via automation during the build pipeline. But the way we have done QA can’t test how our distributed system will react when networking between a data store and multiple compute nodes is overwhelmed and latency rises, for example.
Our automated and human-driven tests do not account for this rapidly changing production environment where services spin up and shut down due to demand. The only way to test whether a distributed system will remain reliable when faced with emergent behaviors caused by changing production conditions is to do what all testing paradigms do: try it and find out.
How Chaos Engineering Works: Test and Learn Through Failure
We all have uptime objectives. We all want to improve our performance. To do this, we must use every resource available to learn as much as we can about how our systems handle failure, whether it is a user’s failure to enter suitable data in a form field or whether it is a system component in the cloud failing to function as expected.
“What happens when?” is a question we all love to ask. Then, we try it and find out.
Because our systems and system designs have evolved in such an unprecedented way, we must also evolve our testing methods to better understand how our distributed systems will handle failures and how component and dependency failures impact the entire system. Holistic testing of this sort is what Chaos Engineering exists to perform because it is able to test our entire system as it exists in production.
A Chaos Engineering program starts out small, testing things that we already know or believe we know:
Will our monitoring system actively catch networking latency above a specific threshold?
Will that initiate a page to the on-duty engineer or maybe automated mitigation?
Has our configuration drifted over time, or are we still spinning up compute nodes according to specifications?
How does each instance of a service hold up under light load? Medium? Heavy? They should all behave the same, and our load balancing should distribute load across them appropriately. What happens if one instance starts receiving a significantly heavier load than the others because our load balancing service is having issues?
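The first of these questions can be sketched as a tiny experiment. The example below is only an illustration under stated assumptions: `LatencyMonitor` is a hypothetical stand-in for a real monitoring system, the threshold and sample counts are made up, and latency is simulated as plain numbers rather than injected into a real network.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LatencyMonitor:
    """Hypothetical stand-in for a monitoring system with one latency alert rule."""
    threshold_ms: float
    consecutive_required: int = 3  # samples over the threshold needed before alerting
    alerts: List[str] = field(default_factory=list)
    _streak: int = field(default=0, init=False, repr=False)

    def observe(self, latency_ms: float) -> None:
        # Count consecutive samples above the threshold; reset on a healthy sample.
        if latency_ms > self.threshold_ms:
            self._streak += 1
        else:
            self._streak = 0
        # Fire exactly once, when the streak first reaches the required length.
        if self._streak == self.consecutive_required:
            self.alerts.append(f"latency above {self.threshold_ms} ms")


def run_latency_experiment(monitor: LatencyMonitor, baseline_ms: float,
                           injected_ms: float, samples: int) -> bool:
    """Feed the monitor simulated samples with extra latency; report whether it alerted."""
    for _ in range(samples):
        monitor.observe(baseline_ms + injected_ms)
    return len(monitor.alerts) > 0


if __name__ == "__main__":
    # Hypothesis: 300 ms of added latency on a 50 ms baseline trips a 200 ms alert rule.
    monitor = LatencyMonitor(threshold_ms=200)
    alerted = run_latency_experiment(monitor, baseline_ms=50, injected_ms=300, samples=5)
    print("hypothesis held" if alerted else "monitoring gap found")
```

In a real experiment the monitor would be the production alerting pipeline and the latency would be injected by a fault-injection tool. The comparison between what we expected and what we observed is the part that carries over.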
Systematic Testing of Chaotic Systems Brings Vital Benefits
We test using the scientific method, starting small and being intentional. We design early experiments to minimize both the blast radius (the set of services and components we believe have the potential to be impacted) and the magnitude of the experiment parameters.
Once we are successful here, we can build step by step, either growing our confidence in our system or growing our prioritized backlog of improvements to make. When we make those improvements and retest using the same chaos experiment and parameters, the system should pass, and we will know it is more reliable than it used to be.
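One way to picture this step-by-step growth is as an ordered list of stages, each with a slightly larger blast radius or magnitude, where the run halts as soon as the system stops behaving as expected. This is a minimal sketch, not a real tool: `Stage`, `run_staged_experiment`, and the `steady_state_holds` callback are hypothetical names, and the "system" in the usage lines is a made-up rule standing in for real observations.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Stage:
    targets: int      # blast radius: how many instances are exposed to the failure
    latency_ms: int   # magnitude: how much latency is injected


def run_staged_experiment(
    stages: List[Stage],
    steady_state_holds: Callable[[Stage], bool],
) -> Tuple[List[Stage], Optional[Stage]]:
    """Run stages in order and stop at the first one that breaks steady state.

    Returns the stages that passed and the stage that failed (None if all passed).
    """
    passed: List[Stage] = []
    for stage in stages:
        if not steady_state_holds(stage):
            # Halt here: this stage goes onto the backlog of improvements.
            return passed, stage
        passed.append(stage)
    return passed, None


if __name__ == "__main__":
    # Start small, then grow magnitude, then grow blast radius.
    plan = [Stage(targets=1, latency_ms=50),
            Stage(targets=1, latency_ms=200),
            Stage(targets=3, latency_ms=200)]

    # Made-up system behavior: tolerates up to 200 ms on at most 2 instances.
    def tolerates(stage: Stage) -> bool:
        return stage.targets <= 2 and stage.latency_ms <= 200

    passed, failed = run_staged_experiment(plan, tolerates)
    print(f"passed {len(passed)} stage(s); first failure: {failed}")
```

Rerunning the same plan after a fix gives a direct before-and-after comparison, since the stages and parameters stay fixed.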
This is the only way we can learn how our systems actually handle failure out in production, where our customers will ultimately experience the results. If we can find the small problems now, before they have a chance to cascade into big problems, we can make sure that fewer and fewer systemic failures occur.
This makes Chaos Engineering an amazing reliability tool. It is a discipline that helps us do in the cloud and at scale what we used to be able to accomplish in smaller, controlled environments using traditional QA.
This ultimately means fewer large-scale production failures and outages. In fact, when Chaos Engineering is implemented and used consistently, the service and component failures that should be expected in the chaos of the cloud will have no impact on our customers. They will never even know there was a failure, and that’s the real goal.