Chaos Programming

What is Chaos Programming

Chaos programming is a discipline of system design aimed at making systems as fault-tolerant as possible. While running a distributed system, unpredictable things happen, and often: hard disk failures, network latency spikes, and traffic surges, to name a few. These events bring about a massive loss in performance and trigger undesired behavior. We can never fully prevent the faults that lead to such conditions, but on the flip side, we can minimize their impact and make systems more resilient.

In a nutshell, chaos engineering is a method of experimentation on infrastructure that brings systemic weaknesses to light. It can be thought of as a learning paradigm for exploring the nitty-gritty issues that could potentially lower the throughput of a system.

Chaotic Systems

With the advent of microservice architecture, different teams develop and operate different services independently of each other. To ensure successful implementation of the architecture, it is essential that the system is resilient. To put that in context, let's say a distributed system is serving a client through many of its microservices. These are facilitated by API calls the microservices make among themselves to relay information.

  • The complexity of request and retrieval patterns is typically low here, but it can grow exponentially in large-scale systems.
  • There are simply too many parts to monitor and control effectively, so even a slight imbalance can lead to disastrous consequences.
  • The individual behaviors of the microservices are completely normal.
  • Only when they interact in tandem under very specific conditions do we end up with an anomaly.
  • This interaction is too complex for any human to predict.
  • Each of those microservices could have been tested thoroughly, and no erroneous behavior would have surfaced in any test suite or integration environment.

Interesting examples apart from microservices and monoliths are deep learning, neural networks, and other machine learning algorithms. Peeking under the hood of one of these systems, we find a mass of numeric values in any nontrivial solution that is far too complex for an individual to make sense of.

Only the output of the system as a whole carries meaning that a human can make sense of. The tradeoff between understandability on one hand and the velocity and flexibility of operating on data on the other has created an opportunity for chaos engineering.

Why Chaos Programming is Important

So how do we address the problem defined above? We needn't build a state-of-the-art facility for the purpose. On the contrary, we can reform our programming practices to address the deficits of a chaotic system. The fact that a large-scale architectural fault cannot be stabilized without solving the smaller issues in the software is a testament to chaos programming, which lays stress on defining, integrating, and implementing the various smaller aspects of the software development process.

Current Problem Scenario:

  • Microservice architectures can be tricky to operate.
  • Our systems are scaling faster than ever before.
  • Services that can fail, will fail; fault tolerance cannot be taken for granted.
  • Dependencies on other companies will fail.

The Chaos Paradigm's Answer

  • Addresses issues that may stem from an incomplete programming task.
  • Identifies the most important issue first and assigns priority to it.
  • Combines the functionality, trust, and behavioral aspects of the system.
  • An issue is only said to be resolved when the system is brought to a point of stability.

Principles of Chaos Programming

Sidney Dekker famously wrote, "The performance of complex systems is typically optimized at the edge of chaos, just before system behavior will become unrecognizably turbulent."

The principles of chaos programming embody this quote to the fullest, and the following points summarize how the principles apply.

Hypothesizing About Steady State: A program's steady state can be said to be one in which, under the specified conditions, it gives the expected output. But how do we come to know whether the program under development is steady? One way to study the behavior of the program is to try out the various functionalities by hand and test every detail to the very end.

A better approach, however, is to collect data: data about the system, the testing environment, and the production environment, and then set quality metrics from it.
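
As a minimal sketch of this idea in Python, the snippet below condenses collected latency and error samples into a steady-state baseline we can later hypothesize against. The metric names and sample values are illustrative assumptions, not part of any particular tool.

    # A minimal sketch: summarize collected metrics into a steady-state
    # baseline. Metric names and sample values are illustrative assumptions.
    import statistics

    def steady_state_baseline(latencies_ms, error_flags):
        """Condense observed samples into a baseline description."""
        ordered = sorted(latencies_ms)
        return {
            "p50_latency_ms": statistics.median(ordered),
            "p99_latency_ms": ordered[int(0.99 * (len(ordered) - 1))],
            "error_rate": sum(error_flags) / len(error_flags),
        }

    # Samples gathered from monitoring during a known-good period.
    latencies = [12.0, 14.1, 13.5, 11.9, 15.2, 12.7, 13.0, 14.8, 12.2, 13.9]
    errors = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
    print(steady_state_baseline(latencies, errors))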

Choosing Metrics Wisely: The foremost thing to bear in mind when selecting a performance metric is its collection latency, which should be kept as low as possible. The program then needs to be evaluated frequently against the metric so that it accurately relays the ongoing behavior of the system, preventing potential pitfalls and deadly traps.

Forming a Hypothesis: Once we have the required metrics and an understanding of the steady-state behavior, we can use them to define the hypothesis the program needs to fulfill. Then, whenever we run a chaos experiment, we know the situations in which the steady state abides by the hypothesis; if it doesn't, the experiment should fail.
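
A hedged sketch of that idea: the hypothesis becomes an executable check against the baseline, and the experiment fails whenever the observed metrics fall outside tolerance. The tolerance values here are assumptions for illustration.

    # Encode the hypothesis as an executable check; tolerances are assumed.
    def hypothesis_holds(baseline, observed,
                         latency_tolerance=1.5, max_error_rate=0.01):
        """True while the program still matches its steady-state hypothesis."""
        within_latency = (observed["p99_latency_ms"]
                          <= baseline["p99_latency_ms"] * latency_tolerance)
        within_errors = observed["error_rate"] <= max_error_rate
        return within_latency and within_errors

    baseline = {"p99_latency_ms": 15.0, "error_rate": 0.0}
    observed = {"p99_latency_ms": 19.0, "error_rate": 0.005}  # during the experiment
    assert hypothesis_holds(baseline, observed)  # the experiment passes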

Resiliency Experiments: Carry out resiliency experiments to deliberately cause a noncritical part of the program to fail in order to verify that the program degrades gracefully.
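
For instance, here is a minimal sketch assuming a hypothetical noncritical recommendations service: the experiment forces the dependency to fail and checks that the page still renders.

    # Deliberately fail a noncritical dependency (a hypothetical
    # recommendations service) and verify graceful degradation.
    def fetch_recommendations(user_id, inject_failure=False):
        if inject_failure:  # chaos hook: simulate the dependency being down
            raise ConnectionError("recommendations service unavailable")
        return ["item-1", "item-2"]

    def render_home_page(user_id, inject_failure=False):
        try:
            recs = fetch_recommendations(user_id, inject_failure)
        except ConnectionError:
            recs = []  # degrade gracefully: render the page without recommendations
        return {"user": user_id, "recommendations": recs}

    # The experiment: the page must still render when the dependency fails.
    assert render_home_page("u42", inject_failure=True)["recommendations"] == []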

Automate Experiments: Automate the execution of experiments as much as possible, along with the analysis of experimental results, which in turn should inspire new automated experiments.
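
One way to sketch this, assuming each experiment is a callable that raises AssertionError when its hypothesis breaks: a small runner records outcomes and can be scheduled by cron or CI. The runner and the sample experiment are illustrative assumptions.

    # A small runner that executes registered experiments and records results.
    def run_experiments(experiments):
        results = {}
        for name, experiment in experiments.items():
            try:
                experiment()
                results[name] = "passed"
            except AssertionError as failure:
                results[name] = f"failed: {failure}"
        return results

    def error_rate_experiment():
        error_rate = 0.002  # stand-in for a measurement taken during fault injection
        assert error_rate <= 0.01, f"error rate {error_rate} above tolerance"

    print(run_experiments({"error-rate-under-load": error_rate_experiment}))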

Run Experiments close to Production: Run your experiments as close to the production environment as possible. The ideal implementation runs all experiments directly with the actual input received in the production environment.

Adopting Chaos Programming

When we are developing a new application, the most exciting part is launching the service to consumers. But there's a catch: we can never be sure that the distributed system we designed will stay resilient under severe conditions in production. Whatever can go wrong, will go wrong! We strive to create quality products that are resilient to such failures. One way to do so is to identify problems that could arise in production and, rather than waiting for breakage, proactively inject failures in order to be prepared when lightning strikes. That is the core idea behind adopting chaos programming.

How can we adopt it

  • Start by planning experiments and compiling a list of potential failure modes and how to simulate them.
  • Anticipate when trouble for customers can arise.
  • Inject failures at various levels: application, API, database, hardware, and cloud infrastructure; for example, intentionally terminate cluster machines, kill worker processes, delete database tables, or cut off access to internal and external services (a minimal application-level sketch follows this list).
  • Monitor and observe the failures close to the production environment and how they tend to affect your program.
  • Minimize the blast radius: run small experiments first.
  • After each experiment, note the actual measured impact.
  • For each discovered flaw, compile a list of countermeasures and implement them right away, while maintaining an issue tracker for active issues.
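
As the minimal application-level sketch referenced in the list above, the Python decorator below randomly injects latency or an error into a wrapped call. The probabilities, delays, and the wrapped function are illustrative assumptions, not a prescription.

    # Randomly inject latency or failures into application calls.
    import functools
    import random
    import time

    def inject_faults(latency_s=0.5, latency_prob=0.1, error_prob=0.05):
        """Decorator simulating slow or failing calls for chaos experiments."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                roll = random.random()
                if roll < error_prob:
                    raise RuntimeError(f"chaos: injected failure in {func.__name__}")
                if roll < error_prob + latency_prob:
                    time.sleep(latency_s)  # simulate network latency
                return func(*args, **kwargs)
            return wrapper
        return decorator

    @inject_faults(latency_s=0.2, latency_prob=0.2, error_prob=0.1)
    def get_user_profile(user_id):
        return {"id": user_id, "name": "example"}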

Tools for Chaos Programming

The need for chaos programming has led to the rise of very powerful tools that carefully orchestrate chaos experiments. Some of the notable tools prevalent today include Chaos Monkey, Gremlin, Chaos Toolkit, and Litmus.

Use Cases of Chaos Programming

The importance of chaos programming can be understood in the following instances:

Power outages

  • There is no telling when a power outage can trigger a widespread blackout.
  • System operators may be unaware of the malfunction that caused systems to slow down or, in the worst case, fail completely.
  • Such a failure can deprive them of both audio and visual alerts for important changes in system state.
  • Chaos engineering helps test monitoring tools, metrics, dashboards, alerts, and thresholds against event-specific triggers such as this.
  • Injecting chaos in a controlled way leads to more resilient systems.

Uber’s Database Outage

  • Master log replication to S3 failed.
  • Logs backed up on the primary, and alerts were fired.
  • The alerts were ignored; the disk filled up on the database, leading to the deletion of unarchived WAL files.
  • The problem was tackled by Argos, Uber's real-time monitoring and root-cause exploration tool, built on the principles of chaos engineering.

Netflix’s Transition

  • Netflix migrated from the data center to the cloud in 2008.
  • Such a widespread migration could potentially disrupt their entire consumer base.
  • Vertical scaling in datacenters led to many single points of failure, causing massive interruptions in the delivery. The cloud promised to create an opportunity to scale horizontally and move much of the heavy lifting of running infrastructure to a reliable third party.
  • A new approach was required to build services in a way that preserved the benefits of horizontal scaling while staying resilient to instances occasionally disappearing.
  • In 2010, they introduced Chaos Monkey to the world, and it has been extremely successful ever since at building resilient services.

Concluding Chaos Programming

Not only Netflix and Uber but also premier organizations like Microsoft, LinkedIn, and Amazon have successfully implemented chaos programming in their tech stacks. It has enormous potential to harden our systems against unknown faults that might occur while running at full flow in production. Netflix's Chaos Automation Platform, for instance, fulfills this potential by running experiments across a microservice architecture 24/7.

Any organization that builds and operates a distributed system and wishes to achieve a high rate of development velocity will want to add Chaos Engineering to its collection of approaches for improving resiliency. Chaos Engineering is still a relatively new domain, and the techniques and tools are still evolving.

