Chaos is usually associated with disorder. So where does the idea of building an engineering practice on it come from? And why combine it with cloud solutions?
In today’s post, we will outline the concept of “cloud armageddon”, which Netflix engineers use successfully in their daily work.
Chaos engineering is an empirical approach to testing how a system behaves in extreme situations. It resembles the experiments that scientists conduct when studying physical or social phenomena. It consists in deliberately inducing “disturbances” in a running system in order to expose its weaknesses. This makes it possible to detect problems and failures of various kinds early, fix the revealed errors before they cause harm, and design solutions that will withstand them in the future. The result is a more stable and fault-tolerant system. The approach also lets you explore events such as increased traffic on a site, which can become a great “experimental” case.
So, why is this technique better than others?
The fundamental difference between chaos testing and routine system or application tests lies in its empirical nature, which lets you acquire new knowledge about a given system. The result of a standard test is a binary value that states unambiguously whether the tested application behaves as expected or not. Chaos testing instead produces findings that feed into the development and improvement of the existing version of the system. By introducing unexpected problems and errors, or empirical variables such as the aforementioned increase in website traffic, we can observe how the system behaves, predict how it will operate, and remove the problems that would otherwise cause failures.
Drawing on numerous testing experiments, engineers at Netflix created a tool called Chaos Monkey in 2011. It works by deliberately shutting down servers in the streaming giant's production network. This makes it possible to check how other systems behave during such a controlled failure, i.e. how the absence or partial unavailability of a service affects the entire system. Netflix’s tool was initially part of a set of tools known as the Simian Army, which was used to check the reliability, security and resilience of the AWS-based infrastructure. The Simian Army project has since been concluded, and Chaos Monkey is now developed as an independent project, in line with the DevOps methodology. It meets the need for continuous testing, which ensures a sufficient level of trust in computer systems and of their security. It is also part of the Design for failure pattern, which assumes that any underlying system component (software or hardware) may fail. In 2015 the family of Chaos tools was joined by Chaos Kong, which simulates the unavailability of an entire AWS region, and in 2017 by the Chaos Automation Platform, abbreviated as ChAP, which helps uncover vulnerabilities by “injecting” failures into microservices.
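The core idea behind a tool of this kind can be illustrated with a toy simulation. The sketch below is not Netflix's implementation and does not use the real Chaos Monkey API; the `Cluster` class, the `unleash_monkey` function and the instance names are hypothetical and exist only for illustration. It removes one randomly chosen instance from an in-memory "cluster" and then checks whether enough instances remain to keep serving.

```python
import random


class Cluster:
    """A toy in-memory stand-in for a group of identical service instances."""

    def __init__(self, instance_ids):
        self.healthy = set(instance_ids)

    def terminate(self, instance_id):
        # Simulates an abrupt instance shutdown.
        self.healthy.discard(instance_id)

    def can_serve(self, min_healthy):
        # The service counts as available while enough instances remain.
        return len(self.healthy) >= min_healthy


def unleash_monkey(cluster, rng):
    """Terminate one randomly chosen healthy instance and return its id."""
    if not cluster.healthy:
        return None
    # Sort first so that the choice depends only on the seed, not on set order.
    victim = rng.choice(sorted(cluster.healthy))
    cluster.terminate(victim)
    return victim


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    cluster = Cluster(["i-1", "i-2", "i-3"])  # hypothetical instance ids
    victim = unleash_monkey(cluster, rng)
    print(f"terminated {victim}; still serving: {cluster.can_serve(min_healthy=2)}")
```

In a real environment the termination would hit an actual server and the check would be a real health or business metric, but the shape of the experiment stays the same: break one thing on purpose, then observe whether the rest of the system copes.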
With some knowledge of the assumptions and tools of chaos engineering, we can move on to the Chaos Rules. They are guidelines for putting the approach described in this post into practice:
- Build hypotheses around steady-state behaviour – focus on the measurable output of the system, not its internal properties. This rule lets you check whether the system actually works, not how it works.
- Vary real-world events – prioritize events by their potential impact on the operation of the system or by how often they occur. Events should be understood as “chaos variables”, for example sudden spikes in traffic or scaling events. Avoid mixing unrelated situations, such as a loss of servers, with other events in a single experiment.
- Run experiments in production – experimenting in production keeps the results authentic and relevant to the system as it actually runs.
- Automate experiments – automation lets experiments run continuously and covers both their orchestration and the analysis of results.
- Minimize the blast radius – experiments may cause temporary negative effects, but only within a tolerable limit. Keep them contained so that their potential impact does not spread on a large scale.
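The rules above can be sketched as a small, simulated experiment loop. This is a minimal illustration under stated assumptions, not a production tool: the toy service in `handle_request`, the success-rate metric and every name and threshold are hypothetical. The steady-state hypothesis is "the success rate stays at or above `min_success_rate`", the fault is injected into only a share of traffic (the blast radius), and the automated loop widens that share only while the hypothesis holds.

```python
import random
from dataclasses import dataclass


@dataclass
class Result:
    blast_radius: float    # share of traffic that received the fault
    success_rate: float    # steady-state metric observed during the run
    hypothesis_held: bool  # True if the metric stayed within tolerance


def handle_request(fault_injected, has_fallback):
    """Toy service: a faulted request succeeds only if a fallback exists."""
    return has_fallback or not fault_injected


def run_experiment(has_fallback, blast_radius, min_success_rate,
                   requests=1000, seed=0):
    """Inject a fault into `blast_radius` of the traffic and measure the outcome."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(requests):
        fault = rng.random() < blast_radius  # only a limited share is affected
        successes += int(handle_request(fault, has_fallback))
    rate = successes / requests
    return Result(blast_radius, rate, rate >= min_success_rate)


def run_with_growing_radius(has_fallback, radii, min_success_rate):
    """Automated loop: widen the blast radius only while the hypothesis holds."""
    results = []
    for radius in radii:
        result = run_experiment(has_fallback, radius, min_success_rate)
        results.append(result)
        if not result.hypothesis_held:
            break  # stop early so a weakness does not affect more traffic
    return results


if __name__ == "__main__":
    for res in run_with_growing_radius(False, [0.01, 0.05, 0.25], 0.98):
        print(res)
```

Note that the hypothesis is phrased in terms of what users see (requests succeeding), not in terms of how the service is built, and that the loop stops as soon as the hypothesis is refuted instead of continuing to widen the impact.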
Although chaos engineering is a relatively young approach, it is an extremely powerful practice. It changes the way software and entire systems are designed and built. While the rest of the world works on the speed and flexibility of systems, chaos engineering explores the sphere of systemic uncertainty. The Chaos Rules make it possible to introduce innovations quickly and on a large scale while ensuring a high-quality experience for the user.