Chaos Engineering Principles and Tools
What is Chaos Engineering?
Chaos Engineering is the discipline of experimenting on modern cloud-native distributed systems to build confidence in the infrastructure's ability to withstand failures in real production environments. It aims to improve the quality and reliability of software. Usually the concern is not the functioning of individual services but the communication between them, which often produces unpredictable results. Chaos Engineering is a rigorous practice for improving how software is designed, engineered, and deployed for modern cloud-native distributed applications.
Why Does Chaos Engineering Matter?
Chaos Engineering matters in modern distributed microservices for the following reasons:
In modern microservices, services fail for reasons that are not known in advance: network errors, resource exhaustion, or events in downstream services on which a service depends.
When managing a distributed system at scale, failures sometimes happen that are outside the team's control.
Implementing Chaos Engineering builds confidence that services are resilient to the random anomalies that occur in the environment.
It enhances the reliability and resilience of the whole system, including applications and the underlying infrastructure.
It helps teams handle live production failures properly, which is essential for modern, scalable distributed applications.
How Does Chaos Engineering Work?
It works by running experiments on production environments to find critical vulnerabilities before they make the entire system unusable for customers. Many tools are available for adopting Chaos Engineering practices. It lets developers inject failures into their services to expose weaknesses and prevent them from becoming large outages that affect the business. Chaos Engineering is a kind of preventive medicine for infrastructure. It works in 5 steps:
Plan the First Experiment
Keep questioning how each service and environment behaves, look for areas of potential weakness, and find fixes for them. Injecting a failure or a delay into each of a service's dependencies is a good way to start.
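As an illustration of injecting a failure or delay into a dependency, here is a minimal Python sketch. The names `with_chaos` and `InjectedFailure` and the `failure_rate` and `delay_seconds` parameters are hypothetical and are not taken from any particular chaos tool; a real setup would usually inject faults at the network or infrastructure layer rather than in application code.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFailure(Exception):
    """Raised in place of a real dependency error during an experiment."""


def with_chaos(
    call: Callable[[], T],
    failure_rate: float = 0.0,
    delay_seconds: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Invoke `call` after an optional delay, possibly failing instead.

    All parameters are illustrative assumptions:
    - failure_rate: probability (0.0 to 1.0) of raising InjectedFailure
    - delay_seconds: extra latency added before the dependency is called
    - rng / sleep: injectable so the behavior can be made deterministic
    """
    rng = rng or random.Random()
    if delay_seconds > 0:
        sleep(delay_seconds)  # simulate a slow dependency
    if rng.random() < failure_rate:
        raise InjectedFailure("injected dependency failure")
    return call()
```

A caller would wrap a dependency call, for example `with_chaos(fetch_profile, failure_rate=0.1, delay_seconds=0.5)`, where `fetch_profile` stands in for any function that talks to a downstream service.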
Create a Hypothesis
Always hypothesize about the expected outcome of an event before running it live in production. Consider how it affects customers, the service, and all of its dependencies. Look into all possible scenarios.
Measure the Impact
Measure the impact on latency, requests per second, and the system resources in use. This helps in understanding how the entire system behaves under stress. Also measure the system's availability, durability, and reliability.
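The hypothesis and the measurements can be tied together in code. The sketch below is one possible way to do it in Python, assuming the hypothesis is expressed as upper bounds on 95th-percentile latency and error rate; the `Hypothesis` class, the `evaluate` function, and the chosen thresholds are illustrative assumptions, not a prescribed format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Hypothesis:
    """Expected bounds during the experiment (hypothetical fields)."""
    max_p95_latency_ms: float
    max_error_rate: float


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    if not values:
        raise ValueError("no measurements collected")
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceiling division
    return ordered[rank - 1]


def evaluate(
    hypothesis: Hypothesis,
    latencies_ms: List[float],
    errors: int,
    requests: int,
) -> Dict[str, object]:
    """Compare measured latency and error rate against the hypothesis."""
    p95 = percentile(latencies_ms, 95)
    error_rate = errors / requests if requests else 0.0
    holds = (
        p95 <= hypothesis.max_p95_latency_ms
        and error_rate <= hypothesis.max_error_rate
    )
    return {"p95_latency_ms": p95, "error_rate": error_rate, "holds": holds}
```

Writing the thresholds down before the experiment runs keeps the result from being rationalized afterwards: the hypothesis either holds or it does not.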
Always Have a Rollback Plan
Always have a backup plan, because things can go wrong. Plan how to revert the impact of the experiment. If doing things manually, be extra careful not to break SSH access to the machines.
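One way to make the rollback hard to forget is to bind it to the experiment itself. The following Python sketch uses a context manager so the rollback runs even when the experiment raises; `chaos_experiment`, `inject`, and `rollback` are hypothetical names, and the sketch assumes the rollback is safe to run even if the injection only partly succeeded.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def chaos_experiment(
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> Iterator[None]:
    """Run `inject`, hand control to the caller, then always run `rollback`.

    `inject` sits inside the try block so that a partially applied
    injection is still reverted. This assumes `rollback` is idempotent.
    """
    try:
        inject()
        yield
    finally:
        rollback()  # runs on success, on failure, and on interruption
```

Usage would look like `with chaos_experiment(add_latency, remove_latency): observe()`, where `add_latency`, `remove_latency`, and `observe` are placeholders for whatever the experiment actually does.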
Troubleshoot and Fix
Once the experiments have run, there are two possible outcomes: either the system is verified to be resilient to the injected failure, or an issue is found that needs fixing.
Both are good outcomes. The first increases confidence in the entire system; the second uncovers a potential problem before it causes an outage in production.
Chaos Engineering Benefits
- Exposes improper fallback settings for when a service is unavailable.
- Reduces retry counts caused by improperly tuned timeouts.
- Helps teams understand what is going on in production and make it work better for end users.
- Prevents outages when a downstream service receives too much traffic.
- Helps monitor the performance of applications.
- Helps prevent crashes caused by single points of failure.
- Reduces the need for debugging in a production environment.
- Helps in creating self-healing infrastructure.
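To make the points about fallbacks and retries concrete, here is a small Python sketch of the kind of behavior a chaos experiment would exercise: a bounded number of attempts against a dependency, followed by a fallback. The function `call_with_fallback` and its `max_attempts` parameter are hypothetical and serve only as an illustration.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[Optional[Exception]], T],
    max_attempts: int = 3,
) -> T:
    """Try `primary` up to `max_attempts` times, then use `fallback`.

    The fallback receives the last error so it can log or inspect it.
    A real client would usually also wait between attempts (backoff).
    """
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return primary()
        except Exception as exc:  # broad catch, for illustration only
            last_error = exc
    return fallback(last_error)
```

Injecting failures into `primary` during an experiment shows whether the attempt limit is sensible and whether the fallback actually returns something usable.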
How to Adopt Chaos Engineering?
Begin by defining the steady state of the system as a measurable output that indicates normal behavior.
Then introduce variables that reflect real-world critical events, such as servers that crash, hard drives that malfunction, and network connections that introduce latency. If a weakness is detected, there is an area to improve before that failure spreads through the entire system. Reducing chaos should be the responsibility of both Dev and Ops teams. Developers should build services that handle failures and recover. This keeps outage time to a minimum while the infrastructure heals itself.
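A steady-state check can be as simple as comparing a measured output against its normal value. The Python sketch below assumes the steady state is captured by a single numeric metric (for example, successful requests per second) and that a relative tolerance of 10% counts as normal; both the function name `within_steady_state` and the tolerance are illustrative assumptions.

```python
def within_steady_state(
    baseline: float,
    observed: float,
    tolerance: float = 0.10,
) -> bool:
    """Return True if `observed` is within `tolerance` of `baseline`.

    `tolerance` is a relative deviation: 0.10 allows a 10% difference
    in either direction. The default is an arbitrary example value.
    """
    if baseline == 0:
        return observed == 0
    return abs(observed - baseline) / abs(baseline) <= tolerance
```

With a baseline of 1000 requests per second, an observed 950 stays within the steady state, while 800 indicates that the injected event exposed a weakness.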
Best Practices while adopting Chaos Engineering for Infrastructure
The following best practices apply when adopting Chaos Engineering for applications in the infrastructure:
Do Experimentation in Production
- Take down nodes that are running apps.
- See how alerts are generated in response to events.
- Analyze patterns and gain behavioral insights.
- Automate the process of experimentation to run continuously.
- Build automation into the infrastructure to enable both orchestration and analysis.
- Treat experimentation against the infrastructure as a continuous process.
- Try to generate events that closely resemble real-world scenarios.
- Prioritize events based on their impact or frequency of occurrence.
- Consider events that correspond to hardware and software failures as well as non-failure events such as a spike in traffic or a scaling event.
- Give higher priority in a chaos experiment to events capable of altering the steady state.
- Set up alerting only for critical events.
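The prioritization of events can be sketched in a few lines of Python. This example assumes each event is scored on a 1 to 5 scale for impact and for frequency, and that the two scores are multiplied to rank events; the scales, the product, and the `ChaosEvent` class are assumptions made for illustration, since any ordering based on impact or frequency would fit the practice described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChaosEvent:
    name: str
    impact: int     # 1 (minor) to 5 (severe), hypothetical scale
    frequency: int  # 1 (rare) to 5 (frequent), hypothetical scale


def prioritize(events: List[ChaosEvent]) -> List[ChaosEvent]:
    """Order events so the highest impact x frequency score comes first."""
    return sorted(events, key=lambda e: e.impact * e.frequency, reverse=True)


candidates = [
    ChaosEvent("disk failure", impact=4, frequency=1),
    ChaosEvent("traffic spike", impact=3, frequency=4),
    ChaosEvent("node crash", impact=5, frequency=2),
]
ordered_names = [e.name for e in prioritize(candidates)]
```

Here `ordered_names` lists the traffic spike first (score 12), then the node crash (10), then the disk failure (4), which gives the order in which to schedule experiments.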
Chaos Engineering Adopting Tools
Concluding Chaos Engineering
Chaos Engineering is beneficial for production infrastructure. At first, applying it means changing practices and mindset and putting new tools in place, but once the handling of events is established, it pays off both day to day and in the long term. By injecting realistic production failures and finding weaknesses before they become outages, Chaos Engineering makes the whole infrastructure more resilient.