Accelerating the Introduction of Resilience Testing using Shadow Chaos Engineering

Master Thesis Defense

Anton Mrosek

Otto von Guericke University Magdeburg, ID: 239036

October 15, 2024

Agenda

  • Motivation
  • Basics
  • Method
  • Literature
  • Research Questions
  • Shadow Chaos Engineering
  • Demonstration
  • Conclusion
  • Future Work

Motivation

Environment

  • 78% of larger companies in Germany use cloud computing [1]
  • Example: Netflix uses Amazon Web Services as cloud computing provider [2]
  • Service interruption can lead to high financial losses
    • Example: Streaming not possible

Motivation

Relevance

  • Partial Amazon Web Services outage in Northern Virginia [3]
  • Typo during debugging
  • 4 hours downtime
  • Estimated 150 million USD in damages

Amazon explained the prolonged restart by saying the two subsystems had not been completely restarted for many years.

Basics

Methods

  • Resilience testing
    • Increase stability
    • Reduce downtime
  • Chaos Engineering
    • Simulate technical issues in production
    • Automated
    • Regular

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. [2]

Methodology

DSRM

Design Science Research Method according to Peffers et al. [4]:

  1. Problem identification and motivation
  2. Objectives of a solution
  3. Design and development
  4. Demonstration
  5. Evaluation
  6. Communication

Literature

Literature Research Questions

Systematic literature review based on LRQs concerning

  1. Definition
  2. Deficits
  3. Derived methods

… of chaos engineering

Literature

Results

Definition:

  • Seminal work: “Chaos Engineering” by Netflix employees [2], [5], [6]

Deficits:

  • Risk of production outages [7]
  • Might deter further adoption [8]

Derived methods:

  • Testing on digital twin instead of production system [7]

Research Questions

RQ1

  • How can the method of chaos engineering be extended by mirroring a production system to minimize the risk of catastrophic failure when introducing chaos engineering?

RQ2

  • When minimizing the risk of catastrophic failure with the new method, which aspects of the method can be automated?

Shadow Chaos Engineering

Previous Work

  • Conventional Chaos Engineering
    • Risk of production outages
  • Previous work: Testing on digital twin [7]
    • Reduces risk of outages in production
    • Problem: Diverging environments
    • Limited transferability of results

Shadow Chaos Engineering

New Method

  • Copy of production environment
  • Created on demand, short-lived
    • Close to production
  • Chaos Experiment on Shadow Environment
  • Failure occurring:
    • No impact on production
    • Further evaluation possible
    • New experiments with patched system
  • No failure occurring:
    • Experiment in production
    • Confidence in production environment
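The decision flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis implementation: the three callables (`create_shadow`, `run_experiment`, `destroy_shadow`) and the `ExperimentResult` type are hypothetical placeholders for whatever tooling creates the shadow environment and injects the fault.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    environment: str  # name of the environment the experiment ran on
    failed: bool      # True if the experiment revealed a failure


def shadow_chaos_workflow(
    create_shadow: Callable[[], str],       # hypothetical: builds the on-demand copy
    run_experiment: Callable[[str], bool],  # hypothetical: returns True on failure
    destroy_shadow: Callable[[str], None],  # hypothetical: tears the copy down
    production: str = "production",
) -> List[ExperimentResult]:
    """Run the experiment on a shadow first; touch production only if it passes."""
    shadow = create_shadow()
    shadow_failed = run_experiment(shadow)
    results = [ExperimentResult(shadow, shadow_failed)]

    if shadow_failed:
        # Failure occurring: production is never touched, and the shadow is
        # kept so the failure can be evaluated and a patched system retested.
        return results

    # No failure occurring: the short-lived shadow is no longer needed,
    # and the same experiment is repeated in production.
    destroy_shadow(shadow)
    results.append(ExperimentResult(production, run_experiment(production)))
    return results
```

Keeping the shadow alive after a failure is a design choice of this sketch, made so that the "further evaluation" step remains possible.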

Shadow Chaos Engineering

Concept

Shadow Chaos Engineering

Demonstration

  • Proof of concept
  • Based on container orchestration system Kubernetes
  • Partial automation of steps
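One step that lends itself to automation on Kubernetes is deriving the shadow copy from an exported production manifest. The sketch below is an assumption about how such a step could look, not a description of the proof of concept: it presumes the shadow lives in a separate namespace, and the function name `to_shadow_manifest` is hypothetical. It uses only plain dictionaries, so no Kubernetes client library is required.

```python
import copy

# Metadata fields assigned by the Kubernetes API server. They describe the
# original object and must not be carried over into a newly created copy.
SERVER_ASSIGNED_METADATA = (
    "uid",
    "resourceVersion",
    "creationTimestamp",
    "generation",
    "managedFields",
)


def to_shadow_manifest(manifest: dict, shadow_namespace: str) -> dict:
    """Return a copy of a resource manifest retargeted at the shadow namespace."""
    shadow = copy.deepcopy(manifest)  # leave the production manifest untouched
    metadata = shadow.setdefault("metadata", {})
    for field in SERVER_ASSIGNED_METADATA:
        metadata.pop(field, None)
    metadata["namespace"] = shadow_namespace  # assumption: shadow = own namespace
    shadow.pop("status", None)  # observed state is recomputed by the cluster
    return shadow
```

The resulting dictionary could then be serialized and applied with the usual tooling. Cluster-scoped resources and references to external systems would need separate handling, which this sketch does not attempt.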

Conclusion

Findings

  • Shadow Chaos Engineering as a new method for resilience testing
  • Lowered risks with Shadow Chaos Engineering
  • Proof of concept shows feasibility
  • Partial automation possible

Conclusion

Future Work

  • Evaluation of similarity between Shadow Environment and production environment
  • Risk assessment of Shadow Chaos Engineering, compared to Chaos Engineering
  • Expand proof of concept to more resource types
  • Explore new possibilities for automation [9]
    • Experiment generation
    • Experiment evaluation

References

[1]
Statistisches Bundesamt, “Nutzung von Cloud Computing nach Beschäftigtengrößenklassen.” Accessed: Oct. 14, 2024. [Online]. Available: https://www.destatis.de/DE/Themen/Branchen-Unternehmen/Unternehmen/IKT-in-Unternehmen-IKT-Branche/Tabellen/iktu-06-cloud-computing.html
[2]
A. Basiri et al., “Chaos Engineering,” IEEE Software, vol. 33, no. 3, pp. 35–41, 2016, doi: 10.1109/ms.2016.60.
[3]
Y. Sverdlik, “AWS Outage that Broke the Internet Caused by Mistyped Command.” Accessed: Oct. 14, 2024. [Online]. Available: https://www.datacenterknowledge.com/outages/aws-outage-that-broke-the-internet-caused-by-mistyped-command
[4]
K. Peffers, T. Tuunanen, M. A. Rothenberger, and S. Chatterjee, “A Design Science Research Methodology for Information Systems Research,” Journal of Management Information Systems, vol. 24, no. 3, pp. 45–77, Dec. 2007, doi: 10.2753/MIS0742-1222240302.
[5]
C. Rosenthal, L. Hochstein, A. Blohowiak, N. Jones, and A. Basiri, “Chaos engineering.” Accessed: Apr. 02, 2024. [Online]. Available: https://www.oreilly.com/content/chaos-engineering/
[6]
A. Basiri, L. Hochstein, N. Jones, and H. Tucker, “Automating Chaos Experiments in Production,” in Proceedings of the 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 2019, pp. 31–40. doi: 10.1109/ICSE-SEIP.2019.00012.
[7]
F. Poltronieri, M. Tortonesi, and C. Stefanelli, “A Chaos Engineering Approach for Improving the Resiliency of IT Services Configurations,” 2022, pp. 1–6. doi: 10.1109/noms54207.2022.9789887.
[8]
H. Tucker, L. Hochstein, N. Jones, A. Basiri, and C. Rosenthal, “The Business Case for Chaos Engineering,” IEEE Cloud Computing, vol. 5, no. 3, pp. 45–54, 2018, doi: 10.1109/mcc.2018.032591616.
[9]
J. Hernandez-Serrato, A. Velasco, Y. Niño, and M. Linares-Vasquez, “Applying Machine Learning with Chaos Engineering,” 2020, pp. 151–152. doi: 10.1109/issrew51248.2020.00057.