Long reads

Chaos engineering: The art of failing gracefully

Hamish Monk

Reporter, Finextra

A foray into the world of DevOps would be incomplete without a pitstop at chaos engineering, or – as Bola Rotibi, research director, software development and delivery, CCS Insight, calls it – the art of failing gracefully.

So, what is it, how does it work, and why should financial institutions care about this apocalyptic-sounding engineering term?

What is chaos engineering?

Chaos engineering – or failure injection testing, as it was originally dubbed – was invented in 2011 by none other than Netflix. Its proprietary tool, Chaos Monkey (now part of a wider suite of tools called the Simian Army), was initially built to gauge the resilience of the company’s IT infrastructure following its move from DVDs to the Amazon Web Services cloud.

When the Chaos Monkey code was finally released in 2012, Netflix declared that the best defence against major unexpected failures is to fail often.

This statement strikes at the heart of chaos engineering, which acknowledges that unpredictable outcomes in distributed systems are inevitable. Whether they’re caused by network errors, bandwidth limits, security issues, or other general bugs, the real question to address is: how much confidence do we have in the complex systems we put into production?

In an interview with Finextra, CCS Insight’s Bola Rotibi said: “Chaos engineering is a computational manifestation of risk analysis. It ensures systems can cope with turbulence. Ultimately, the goal is to surface the potential disruption or impact of a bug – and deliver discipline, consistency, predictability and resilience to computer systems.”

Eduard Iacoboaia, senior site reliability engineer, cloud enablement team, Mollie, agreed: “Chaos engineering is the act of injecting controlled errors or failures into complex systems, in order to gain an understanding of how they’re going to react. Engineering teams can then ensure the system absorbs, and becomes more resilient to, similar outages in the future.”

So, despite its name, chaos engineering is actually quite sensible.  

How does it work?

There are four steps to the chaos engineering process:

  1. Define the system’s ‘steady state’. What is normal behaviour?
  2. Theorise that the steady state will continue in both the control group and the experimental group.
  3. Introduce variables such as server crashes, hard drive malfunctions, network connection breaks, and so on.
  4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.

If errors arise, that’s a good thing – the journey of working out how to avoid the bug has begun. If errors don’t arise, widen the blast radius.  
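To make those four steps concrete, here is a minimal sketch in Python. It is an illustration only – the ‘payment service’ is a toy in-process simulation and all the failure rates are hypothetical – but it follows the same structure: define a steady state, form a hypothesis, inject a variable, and compare the control group against the experimental group.

```python
import random

# Toy 'payment service': requests normally succeed, but an injected fault
# adds an extra 30% chance of failure. Purely illustrative - a real
# experiment targets live infrastructure, not an in-process simulation.
def call_service(fault_injected: bool) -> bool:
    if fault_injected and random.random() < 0.3:
        return False                               # injected failure
    return random.random() < 0.999                 # normal background noise

def success_rate(fault_injected: bool, requests: int = 10_000) -> float:
    ok = sum(call_service(fault_injected) for _ in range(requests))
    return ok / requests

STEADY_STATE_THRESHOLD = 0.99                      # step 1: define 'normal'

control = success_rate(fault_injected=False)       # step 2: control group
experiment = success_rate(fault_injected=True)     # step 3: inject the variable

# Step 4: try to disprove the hypothesis that steady state holds in both groups.
print(f"control: {control:.4f}, experiment: {experiment:.4f}")
if experiment < STEADY_STATE_THRESHOLD:
    print("Hypothesis disproved: the system does not absorb this failure.")
else:
    print("Steady state held: widen the blast radius and try again.")
```

In a real experiment, the steady-state metric would come from production monitoring rather than a simulated counter, and the blast radius would start deliberately small.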

“The process of chaos engineering is similar to that of the brain’s synapses,” explained Rotibi. “When an individual loses functionality in one area of the brain, synapses can work around the damage by creating new pathways that deliver the same signal to the desired destination. In this sense, the human brain is constantly running chaos engineering.” 

Crucially, the brain does not run these tests manually. They proceed on autopilot – without us knowing. This should be the goal for DevOps teams, too. Unless automated, chaos engineering experiments can be labour-intensive and costly.
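A fully automated setup might look something like the sketch below. Again, it is purely illustrative: the instance names are made up and terminate() is a placeholder for whatever API your orchestration platform actually exposes. The business-hours guard mirrors the common convention of injecting failures only while engineers are on hand to respond.

```python
import random
import time

# Hypothetical instance names - substitute identifiers from your own estate.
TARGETS = ["payments-1", "payments-2", "ledger-1"]

def terminate(instance: str) -> None:
    # Placeholder: a real implementation would call your orchestration
    # platform's API (Kubernetes, a cloud SDK, etc.) to stop the instance.
    print(f"[chaos] terminating {instance}")

def run_forever(interval_seconds: int = 3600, business_hours: tuple = (9, 17)) -> None:
    """Terminate one randomly chosen instance every hour, but only during
    business hours, when engineers are around to respond."""
    while True:
        hour = time.localtime().tm_hour
        if business_hours[0] <= hour < business_hours[1]:
            terminate(random.choice(TARGETS))
        time.sleep(interval_seconds)

if __name__ == "__main__":
    run_forever()
```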

What are the outcomes?

Chaos engineering has countless real-world benefits. Let’s take the case study closest to home right now – Covid-19.

“During the pandemic, when a large chunk of the world’s workforce went online, network providers and telco operators were faced with managing an unprecedented deluge of traffic,” said Rotibi. “Thankfully, they rose to the task pretty darn well. Yes, there were some performance road bumps, but the very fact that nothing completely fell over tells us they built strong resilience by practicing robust chaos engineering.”

Netflix, for instance, was able to slightly dial down its streaming quality during lockdowns, in order to make room for the surge in demand for streaming services.

What’s in it for financial players?

But such examples can be misleading: chaos engineering shouldn’t just be practiced by large enterprises with a sprawling web presence. Given the pandemic-induced shift to online, chaos engineering should be viewed as a necessary DevOps technique for any business running distributed computing deployments – including financial services players. The 2019 Bank of America outage, which left hundreds of customers unable to access their accounts, is proof enough.

There are three areas within which chaos engineering can provide value for financial services firms:

  1. Customers: Improved availability and durability of systems reduces the chance of outages and reputational damage.
  2. Business: Considerable maintenance costs and revenue losses are avoided.
  3. Technical: Insights reduce incidents and on-call burden, and deepen DevOps teams’ understanding of system failure modes.

“Under the new digital economy, where much of our lives are online, all financial services firms would be advised to look to advances being made in chaos engineering and to take more of these on board,” noted Rotibi. “It is an imperative that banks gain an understanding of ways to improve performance and reliability for customers conducting transactions via mobile banking – particularly when the network is in high demand.”

However, chaos engineering is not suited to every financial player. DevOps teams working for young firms must be brutal about what jobs they should work on in the present, and what they should aim for in the future, argued Iacoboaia.

Kareem Hewady, cloud enablement team, Mollie, agreed: “Chaos engineering can be useful for any financial services firm, but before you go there, I feel there are other big wins to be had which require smaller investment. For instance, disaster recovery testing – or DRT – allows you to introduce failure in a controlled manner, in small quantities, and in a situation where everybody is prepared.”

Generally speaking, most financial services firms have an error budget. If a firm does not, it means one of two things: either it is unaware of the risks of software development, or it is afraid of introducing errors and is stifling innovation.

“The error budget must be an equation that the business folks and engineering folks agree upon,” said Hewady.
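As a rough illustration of what that equation can look like (the 99.9% availability target below is a hypothetical figure, not a recommendation), an error budget simply converts an agreed service-level objective into an explicit allowance for failure:

```python
# Hypothetical figures: a 99.9% availability SLO agreed between the business
# and engineering sides, measured over a rolling 30-day window.
SLO = 0.999
WINDOW_MINUTES = 30 * 24 * 60             # 43,200 minutes in a 30-day window

error_budget = 1 - SLO                    # fraction of the window allowed to fail
budget_minutes = WINDOW_MINUTES * error_budget

print(f"Error budget: {error_budget:.1%} of the window")
print(f"Allowed downtime: {budget_minutes:.1f} minutes per 30 days")
# Output: Error budget: 0.1% of the window
#         Allowed downtime: 43.2 minutes per 30 days
```

While the budget has headroom, teams can spend it on deliberate experiments such as chaos engineering or DRT; once it is exhausted, the priority shifts back to reliability work.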

Engineering chaos: A magic bullet?

System failures have become increasingly tricky to predict thanks to the rise of distributed systems and microservices. To be proactive, firms’ only choice is to learn from failure.

In this sense, then, chaos engineering is born of necessity. Embracing it, however, may require a cultural leap.

“In the near-future,” speculated Rotibi, “chaos engineering will come to the fore once it has been properly socialised into the workplace, through education and experience around its use.” The vehicle for this democratisation may well be chaos-as-a-service (CaaS) – offering firms fast, simple, and affordable access to complex tools.

Unfortunately, the real world is more chaotic than we often expect, and as such, firms should get creative with how they inject chaos into their systems.

“Chaos engineering is no magic bullet; it’s not a solution forever,” said Iacoboaia. “The best thing you can do is improve the process of fixing issues. Even if your system seems reliable, and you don't encounter bugs in your manual DRT test, you shouldn’t get complacent, because you may be fine for a year, but when a bug does arise, your team is out of practice in systems recovery.”

You’re then in a worse situation than you were before chaos engineering, so don’t stop failing gracefully.
