What is Chaos Engineering?
Chaos Engineering is a discipline of system design that helps systems become as fault-tolerant as possible. When running a distributed system, unpredictable things happen all the time: hard disk failures, network latency, and traffic surges, to name a few. These events can cause a massive loss in performance and trigger undesired behavior. We can never fully prevent the errors that lead to such conditions, but we can minimize their impact and so make systems more resilient.
In a nutshell, Chaos Engineering is a method of experimentation on infrastructure that brings systemic weaknesses to light. It can be thought of as a learning paradigm for exploring the nitty-gritty issues that could lower the throughput of a system.
With the advent of microservice architecture, different teams develop and operate different services independently of each other. For the architecture to succeed, the system as a whole must be resilient. To put that into context, let’s say a distributed system serves a client through many of its microservices. The microservices make API calls to one another to relay the information needed.
- The request and retrieval patterns are fairly simple in this small example, but their complexity can grow exponentially in large-scale systems.
- There are simply too many parts to monitor and control effectively, so even a slight imbalance can lead to disastrous consequences.
- The individual behavior of each microservice may be completely normal.
- Only when the services interact under very specific conditions do we end up with an anomaly.
- This interaction is too complex for any human to predict.
- Each of those microservices could have been tested properly, and no erroneous behavior would have been highlighted in any test suite or integration environment.
Deep learning, neural networks, and other machine learning algorithms are an interesting example beyond microservices and monoliths. Peeking under the hood of one of these systems, we find a series of numeric values that, for any nontrivial solution, is too complex for an individual to make sense of.
Only the system as a whole emits a response that a human can make sense of. This tradeoff between understandability on one side and velocity and flexibility of operating on data on the other has created an opportunity for Chaos Engineering.
Why is Chaos Engineering Important?
So how do we address the problem defined above? We needn’t build a state-of-the-art facility for the purpose. Instead, we can reform our engineering practices to account for the shortcomings of a chaotic system. A large-scale architectural fault cannot be stabilized without solving the smaller issues in the software, which is why Chaos Engineering stresses defining, integrating, and implementing the smaller aspects of the software development process.
Current Problem Scenario:
- Microservice architecture can be tricky to handle.
- Our systems are scaling faster than ever before.
- Anything that can fail will fail, including the fault-tolerance mechanisms that services rely on.
- Dependencies on third-party services will fail at some point.
The Chaos Engineering Answer
- Addresses issues that may stem from incomplete programming tasks.
- Identifies the most important issues first and prioritizes them.
- Combines the functionality, trust, and behavioural aspects of the system.
- Considers an issue resolved only when the system has been brought back to a point of stability.
How does Chaos Engineering Work?
It works by running experiments on production environments to find critical vulnerabilities before they make the entire system unusable for customers. Many tools are available for adopting Chaos Engineering practices. They let developers inject failures into their services in a controlled way, so that weaknesses are found and fixed before they turn into large outages that affect the business. Chaos Engineering is a kind of preventive medicine for infrastructure. It works in five steps:
Plan for the first Experiment
Keep asking questions about all services and environments to find areas of potential weakness, and work out fixes for them. Injecting a failure or a delay into each of a service’s dependencies is a good way to start.
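As a minimal sketch of injecting a failure or delay into a dependency, the Python decorator below wraps a call so that it is slowed down and fails for a configurable share of requests. The names `inject_faults`, `InjectedFailure`, and `fetch_profile`, and all parameters, are hypothetical and chosen only for illustration; a real setup would more likely inject faults at the network or infrastructure level.

```python
import functools
import random
import time


class InjectedFailure(Exception):
    """Raised in place of a real dependency error during an experiment."""


def inject_faults(failure_rate=0.0, delay_seconds=0.0, rng=None, sleep=time.sleep):
    """Wrap a dependency call so it is delayed and fails for a share of calls."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if delay_seconds > 0:
                sleep(delay_seconds)  # simulated network latency
            if rng.random() < failure_rate:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical dependency call, used only for illustration.
@inject_faults(failure_rate=0.5, rng=random.Random(42))
def fetch_profile(user_id):
    return {"user_id": user_id, "name": "example"}


def count_failures(calls=100):
    """Call the wrapped dependency repeatedly and count injected failures."""
    failures = 0
    for user_id in range(calls):
        try:
            fetch_profile(user_id)
        except InjectedFailure:
            failures += 1
    return failures
```

Passing the random generator and the sleep function as parameters keeps the experiment repeatable and lets it be tested without real delays.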
Creation of a Hypothesis
Always hypothesize about the expected outcome of an experiment before running it live in production. Consider how it affects customers, the service, and all its dependencies. Look into all possible scenarios.
Measure the Impact
Measure the impact on latency, requests per second, and system resource usage. This helps in understanding how the entire system behaves under stress. Also measure the system’s availability, durability, and reliability.
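To make these measurements concrete, here is a small sketch that summarizes a window of request samples into requests per second, latency percentiles, and availability. The sample format (a list of `(latency_ms, succeeded)` tuples) and the function names are assumptions made for this example; the percentile uses the simple nearest-rank method.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of `values`; `pct` is in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(samples, window_seconds):
    """Summarize (latency_ms, succeeded) samples collected over a time window."""
    latencies = [latency for latency, _ in samples]
    successes = sum(1 for _, succeeded in samples if succeeded)
    return {
        "requests_per_second": len(samples) / window_seconds,
        "p50_ms": percentile(latencies, 50),
        "p99_ms": percentile(latencies, 99),
        "availability": successes / len(samples),
    }
```

Comparing such a summary taken during an experiment with one taken beforehand shows how far the fault moved the system from its normal behaviour.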
Always have a rollback plan
Always have a backup plan, because things can go wrong. Plan how to revert the impact of the experiment. If doing things manually, be extra careful not to break SSH access to machines.
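One way to make the rollback hard to forget is to tie it to the injection itself. The sketch below uses a Python context manager so that the revert step runs even when the experiment raises an error. The `firewall` dictionary and the `block_database` and `unblock_database` helpers are hypothetical in-memory stand-ins for a real system change.

```python
from contextlib import contextmanager


@contextmanager
def chaos_experiment(inject, rollback):
    """Apply a fault, then always revert it, even if the experiment raises."""
    try:
        inject()
        yield
    finally:
        rollback()  # runs on success, on error, and after a partial injection


# Hypothetical in-memory stand-in for a real system setting.
firewall = {"block_database": False}


def block_database():
    firewall["block_database"] = True


def unblock_database():
    firewall["block_database"] = False


def run_experiment():
    """Run a failing experiment; report the fault state during and after it."""
    seen_during = None
    try:
        with chaos_experiment(block_database, unblock_database):
            seen_during = firewall["block_database"]
            raise RuntimeError("experiment went wrong")
    except RuntimeError:
        pass
    return seen_during, firewall["block_database"]
```

In this sketch the fault is active inside the `with` block and reverted afterwards, even though the experiment body failed.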
Troubleshoot and fix it
Once the experiments have run, there are two possible outcomes: either the system is verified to be resilient to the failure introduced, or an issue has been found that needs fixing.
Both are good outcomes. The first increases confidence in the entire system; the second uncovers a potential problem before it causes an outage in production.
Principles of Chaos Engineering
Sidney Dekker famously wrote: “The performance of complex systems is typically optimized at the edge of chaos, just before system behaviour will become unrecognizably turbulent.”
The principles of Chaos Engineering embody this quote to the fullest, and the following points summarize how they apply.
Hypothesizing About Steady State: The steady state of a program is the state in which, under specified conditions, it gives the expected output. But how do we know whether the program under development is in a steady state? One way to study its behaviour is to try out every functionality by hand and test every detail to the very end.
A better approach, however, is to collect data about the system, the testing environment, and the production environment, and then set the quality metrics.
Choosing Metrics Wisely: The foremost thing to bear in mind when selecting a metric is its feedback latency, which should be as low as possible. The metric needs to be evaluated frequently so that it relays the ongoing behaviour of the system accurately and helps avoid potential pitfalls.
Forming Hypothesis: Once we have the required metrics and an understanding of the steady-state behaviour, we can use them to define the hypothesis the program needs to fulfil. Whenever we run a chaos experiment, we then know whether the steady state still abides by the hypothesis; if it does not, the experiment should be marked as failed.
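A steady-state hypothesis can be written down as data: a metric, its baseline, and how far it may drift. The sketch below is one possible encoding and not a standard format; the class name, the metric names, and the relative tolerance are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    metric: str
    baseline: float
    tolerance: float  # allowed relative deviation, e.g. 0.25 for 25%

    def holds(self, observed: float) -> bool:
        """True if the observed value stays within tolerance of the baseline."""
        return abs(observed - self.baseline) <= self.tolerance * abs(self.baseline)


def violated(hypotheses, observed):
    """Return the names of metrics that left their steady state."""
    return [h.metric for h in hypotheses if not h.holds(observed[h.metric])]


# Hypothetical baselines, as they might be taken from production dashboards.
steady_state = [
    Hypothesis("p99_latency_ms", baseline=200.0, tolerance=0.25),
    Hypothesis("requests_per_second", baseline=1000.0, tolerance=0.10),
]
```

An experiment passes when `violated(steady_state, observed)` returns an empty list for the metrics measured while the fault was active.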
Resiliency Experiments: Carry out resiliency experiments to deliberately cause a noncritical part of the program to fail in order to verify that the program degrades gracefully.
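What graceful degradation can look like in code is sketched below: when a noncritical dependency fails, the caller falls back to a default instead of failing the whole request. The homepage and recommendations scenario, and every name in it, is hypothetical.

```python
def get_homepage(user_id, fetch_recommendations, default_items=("popular-1", "popular-2")):
    """Build a homepage; fall back to defaults if recommendations are down."""
    try:
        items = list(fetch_recommendations(user_id))
        degraded = False
    except Exception:  # broad on purpose in this sketch: any failure triggers the fallback
        items = list(default_items)
        degraded = True
    return {"user_id": user_id, "items": items, "degraded": degraded}


def broken_recommendations(user_id):
    """Stand-in for the noncritical service while a fault is injected."""
    raise ConnectionError("recommendation service unavailable")


def healthy_recommendations(user_id):
    """Stand-in for the noncritical service in normal operation."""
    return [f"for-user-{user_id}"]
```

A resiliency experiment would swap in the broken dependency and verify that the page is still served, only marked as degraded.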
Automate Experiments: Automate the execution of experiments as much as possible, along with the analysis of their results, and aim to automate the creation of new experiments too.
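A very small automated runner might look like the sketch below. It checks the steady state before injecting, records whether the steady state survived the fault, and always rolls back. The `Experiment` structure and the result labels are assumptions for this example and do not reflect the interface of any particular tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    name: str
    inject: Callable[[], None]
    rollback: Callable[[], None]
    steady_state: Callable[[], bool]


def run_all(experiments):
    """Run each experiment in turn and return a name -> result mapping."""
    results = {}
    for exp in experiments:
        if not exp.steady_state():
            results[exp.name] = "skipped"  # system unhealthy before injection
            continue
        try:
            exp.inject()
            results[exp.name] = "passed" if exp.steady_state() else "failed"
        finally:
            exp.rollback()  # revert the fault whatever the outcome
    return results
```

The returned mapping can be stored after every run, so that results are analysed automatically instead of being read by hand.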
Run Experiments close to Production: Run your experiments as close to the production environment as possible. The ideal implementation runs all experiments directly with the actual input received in the production environment.
Adopting Chaos Engineering Principles
When we are developing a new application, the most exciting part is launching the service to consumers. But there’s a catch: we can never be sure that the distributed system we designed will be resilient under severe conditions in production. If something can go wrong, it will go wrong! We strive to create quality products that are resilient to such failures. One way to do so is to identify problems that could arise in production and, rather than waiting for breakage, proactively inject failures in order to be prepared when lightning strikes. That’s the core idea behind adopting Chaos Engineering.
How can we adopt it?
- Start by planning experiments and compiling a list of potential failure modes and how to simulate them.
- Anticipate when trouble can arise for customers.
- Inject failures at various levels (application, API, database, hardware, and cloud infrastructure), for example by intentionally terminating cluster machines, killing worker processes, deleting database tables, or cutting off access to internal and external services.
- Monitor and observe the failures as close to the production environment as possible, and note how they affect your program.
- Minimize the blast radius: run small experiments first.
- After each experiment, note the actual measured impact.
- For each discovered flaw, make a list of countermeasures and implement them right away, using an issue tracker to track active issues.
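Minimizing the blast radius from the list above can be as simple as limiting how many hosts receive the fault in the first run. The sketch below picks a small random subset of a fleet; the function name, host names, and fractions are hypothetical.

```python
import random


def pick_targets(hosts, fraction, seed=None):
    """Choose a small random subset of hosts to receive the fault."""
    hosts = list(hosts)
    if not hosts:
        raise ValueError("no hosts to choose from")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    count = max(1, int(len(hosts) * fraction))  # always at least one host
    return sorted(random.Random(seed).sample(hosts, count))


# Hypothetical fleet: start with 5% and widen only after a clean run.
fleet = [f"host-{i:02d}" for i in range(20)]
first_wave = pick_targets(fleet, 0.05, seed=1)
second_wave = pick_targets(fleet, 0.25, seed=1)
```

Passing a seed makes the selection repeatable, which helps when an experiment has to be rerun against the same targets after a fix.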
Tools for Chaos Engineering
The need for Chaos Engineering has led to the rise of powerful tools that carefully orchestrate chaos experiments. Some of the notable tools in use today are:
- Chaos Monkey: A resiliency tool that helps applications tolerate random instance failures.
- Simian Army: Simian Army consists of services in the cloud for generating various kinds of failures, detecting abnormal conditions, and testing our ability to survive them. The goal is to keep the cloud safe, secure, and highly available. The army includes Chaos Monkey, Janitor Monkey, and Conformity Monkey.
- Pumba: A chaos testing and network emulation tool for Docker.
- PowerfulSeal: A powerful testing tool for Kubernetes clusters.
- Litmus: A chaos engineering tool for stateful workloads on Kubernetes.
Chaos Engineering Benefits
- Exposes improper fallback settings for when a service is unavailable.
- Reveals excessive retries caused by improperly tuned timeouts.
- Helps to understand what’s going on in production and make it work better for end users.
- Helps prevent outages when a downstream service receives too much traffic.
- Validates the monitoring of application performance.
- Helps prevent crashes caused by single points of failure.
- Reduces the need for debugging in a production environment.
- Helps in creating self-healing infrastructure.
Use Cases of Chaos Engineering
The importance of Chaos Engineering can be understood from the following scenario:
- There is no warning of when a power outage will trigger a widespread blackout.
- System operators may be unaware of the malfunction that caused the systems to slow down or, in the worst case, to shut down completely.
- This can lead to a condition where the failure deprives them of both audio and visual alerts for important changes in system state.
- Chaos Engineering helps to test monitoring tools, metrics, dashboards, alerts, and thresholds against event-specific triggers such as this.
- Injecting chaos in a controlled way leads to building resilient systems.
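Testing alerts and thresholds, as mentioned in the list above, can be illustrated with a toy alert rule that is fed a metric series containing an injected fault. The rule (fire after several consecutive samples above a threshold) and all the numbers are assumptions made for this sketch, not a description of any specific monitoring product.

```python
def alert_fires(samples, threshold, min_consecutive=3):
    """True if `min_consecutive` samples in a row exceed `threshold`."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= min_consecutive:
            return True
    return False


# Hypothetical disk usage (percent) before and during an injected fault.
normal_usage = [40, 45, 50, 48]
during_fault = normal_usage + [95, 97, 99]
single_spike = normal_usage + [95, 50, 49]
```

If the rule stays silent for `during_fault`, the experiment has found a gap in monitoring rather than in the service itself.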
Uber’s Database Outage
- Master log replication to S3 failed.
- Logs backed up on the primary.
- Alerts were fired but ignored.
- The disk on the database filled up, leading to the deletion of unarchived WAL files.
- Tackled by Argos, Uber’s Real-Time Monitoring and Root-Cause Exploration Tool, which is based on principles of chaos engineering.
Netflix’s Migration to the Cloud
- Netflix began migrating from its data centre to the cloud in 2008.
- Such a widespread migration could potentially disrupt their entire consumer base.
- Vertical scaling in datacenters led to many single points of failure, causing massive interruptions in the delivery. The cloud promised to create an opportunity to scale horizontally and move much of the heavy lifting of running infrastructure to a reliable third party.
- A new approach was required to build services in a way that preserved the benefits of horizontal scaling while staying resilient to instances occasionally disappearing.
- In 2010 they introduced Chaos Monkey to the world, and it has been extremely successful ever since in helping to build resilient services.
Concluding Chaos Engineering
Not only Netflix and Uber but also leading organizations such as Microsoft, LinkedIn, and Amazon have successfully implemented Chaos Engineering in their tech stacks. It has great potential to prepare our systems for unknown faults that might occur when we are running at full flow in production. The Chaos Automation Platform is fulfilling the potential of running experiments across the microservice architecture 24/7.
Any organization that builds and operates a distributed system and wishes to achieve high development velocity will want to add Chaos Engineering to its collection of approaches for improving resiliency. Chaos Engineering is still a relatively new domain, and the techniques and tools are still evolving.