Ensuring operational resilience in 2025 – why the status quo no longer works

Sponsored

This content has been created by the Finextra editorial team with inputs from subject matter experts at the funding sponsor.

Operational resilience is on all UK payments leaders’ minds. In 2024, 95% of business leaders stated that they’re aware of operational weaknesses which leave them vulnerable, yet 48% said their organisations aren’t doing enough to improve resilience.

The European Union (EU)’s Digital Operational Resilience Act (DORA) – having come into effect on 17 January 2025 – is the regulators’ push towards improved operational risk incident management across the industry, but how have financial organisations fared when it comes to readiness? What else needs to be done from an infrastructure perspective to achieve greater resilience?

Finextra spoke with Rob Reid, technical evangelist at Cockroach Labs, about the company’s research on the state of resilience across financial services; how to effectively achieve operational resilience; and what needs to be done to reap benefits that extend beyond mere compliance.

How outages turned into the new normal

While outages have become common in organisations – Cockroach Labs found organisations experience on average 86 a year – it’s major infrastructure blackouts that grabbed the headlines in 2024.

The Bank of England reported seven outages affecting the UK’s RTGS (Real-Time Gross Settlement) and Clearing House Automated Payment System (CHAPS) in 2024. Most notably, in July 2024, CHAPS failed and delayed large, time-sensitive payments, which had a substantial impact. With CHAPS usually enabling 200,000 payments a day – with an average daily value of £345 billion – an outage of this length (245 minutes) resulted in considerable losses.

That very same month, the CrowdStrike outage caused global chaos, affecting over 8 million Windows devices. The fallout was immense, with GPs unable to treat patients; hundreds of businesses reporting revenue losses; and planes being grounded globally, leaving travellers stranded in airports.

“Once you've hit rock bottom, the only way is up. And I think we're going to need to see changes,” said Reid, when speaking on the state of operational resilience in financial services. “The technologies and practices used across the industry aren't keeping up with the needs of modern resilience requirements. Fundamentally, if what we had was working, we wouldn't have DORA.”

The state of resilience in light of DORA

The EU’s DORA officially took effect on 17 January 2025, providing a universal framework designed to enhance information and communication technology (ICT) risk management.

“I am a software engineer with an almost comically low-risk appetite. So, as you can imagine, I've been bemoaning lacklustre operational resilience for many years,” commented Reid. “DORA is a much-needed wake up call for the industry. I wish we would have had it years ago, because as a software engineer at the coalface, I would have had something to wield.”

In order to understand the state of resilience going into 2025, Cockroach Labs surveyed 1,000 senior cloud and technology executives. Alarmingly, the data showed that while 94% of technical executives stated that the CrowdStrike outage encouraged their organisations to reassess their risk management, the operational resilience reality still looks bleak:

  • 93% of leaders are concerned about the financial and organisational impacts of outages;
  • 95% are aware of operational weaknesses that leave them vulnerable;
  • 53% of banking and financial services companies report experiencing service disruptions at least weekly;
  • Only 20% of respondents describe their organisation as fully prepared for outages;
  • Just 33% have an organised response approach, and fewer than a third conduct regular failover testing.

Speaking on the results, Reid emphasised: “Every single person we spoke to reported revenue loss as a result of downtime in the last 12 months. On average, businesses are seeing 86 outages per year, with the average downtime lasting more than three hours. In terms of approaches, this hints at an industry-wide tendency of being reactive to downtime, and I would question whether teams are being given the time, space, and resources required to make meaningful, positive changes in preventing it.”
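To put those averages in perspective, the short Python sketch below converts them into annual downtime and an availability percentage. It is a rough illustration only. It assumes that every outage lasts exactly three hours (the quote says "more than three", so this is a lower bound), that outages never overlap, and that they all affect the same service. The function names are hypothetical and are not drawn from the Cockroach Labs research.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def annual_downtime_hours(outages_per_year: int, hours_per_outage: float) -> float:
    """Total downtime per year, assuming outages never overlap."""
    return outages_per_year * hours_per_outage


def availability_percent(downtime_hours: float) -> float:
    """Share of the year the service was up, rounded to two decimal places."""
    return round((1 - downtime_hours / HOURS_PER_YEAR) * 100, 2)


# Figures quoted above: 86 outages a year, each lasting (at least) three hours.
downtime = annual_downtime_hours(86, 3.0)
print(downtime)                        # 258.0
print(availability_percent(downtime))  # 97.05
```

Under these assumptions, the quoted averages amount to more than ten full days of downtime a year.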

Considering the research was conducted at the end of last year, it is surprising to see how little progress organisations have made toward operational resilience – especially given the DORA deadline. However, considering how much information geared toward DORA readiness has been available, these results show that it might be an issue of agility rather than an issue of understanding.

“Consider DORA from the perspective of a company with aging technology and infrastructure,” commented Reid. “This all serves to reduce their ability to innovate. They're having to manage all of this potentially archaic infrastructure, let alone react with agility. And it’s not only DORA, there is GDPR [General Data Protection Regulation], there is CCPA [California Consumer Privacy Act], and a host of other regulations. Add to that a disaster recovery mindset, necessitated by the presence of primary/secondary architecture, and you've got a perpetuation.”

So how can organisations go beyond the minimum requirements of DORA to develop holistic operational resilience strategies?

Developing modern resilience strategies

For organisations running primary/secondary architecture, failovers and failbacks are key concepts of resilience and disaster recovery. A failover is the process of switching to a backup, secondary system or site when the primary architecture fails – ensuring business continuity – while failback refers to the process of returning to the primary system once the issue is resolved.
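The two concepts can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a description of any particular product. The `Site` and `PrimarySecondaryRouter` names are hypothetical, and the health flag stands in for whatever monitoring a real system would use.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    healthy: bool = True


class PrimarySecondaryRouter:
    """Sends traffic to the primary site and fails over to the secondary."""

    def __init__(self, primary: Site, secondary: Site) -> None:
        self.primary = primary
        self.secondary = secondary
        self.active = primary

    def route(self) -> str:
        """Return the name of the site serving traffic, failing over if needed."""
        if not self.active.healthy:
            if self.active is self.primary and self.secondary.healthy:
                self.active = self.secondary  # failover
            else:
                raise RuntimeError("no healthy site available")
        return self.active.name

    def fail_back(self) -> str:
        """Return traffic to the primary; a deliberate, separate step."""
        if not self.primary.healthy:
            raise RuntimeError("primary is still unhealthy")
        self.active = self.primary
        return self.active.name


router = PrimarySecondaryRouter(Site("primary"), Site("secondary"))
print(router.route())      # primary
router.primary.healthy = False
print(router.route())      # secondary (failover)
router.primary.healthy = True
print(router.route())      # secondary (still waiting for failback)
print(router.fail_back())  # primary
```

In this sketch, failback is a separate, explicit step. Traffic stays on the secondary site until someone decides to move it back, even after the primary has recovered.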

Reid explained that many organisations are running primary/secondary architectures “with the hope that things don't go wrong. Because if something goes wrong, they need to fail over, and that is risky. Some businesses never fail back because of the risk associated in failing back to the primary architecture. However, hope is not a strategy. Modern and capable technology must be considered if we are to move beyond the traditional primary/secondary failover mindset, and businesses should be considering technologies that minimise RTO and RPO.”

RTO (recovery time objective) is the maximum length of time that an organisation can tolerate being down following an outage which, according to Reid, should be measured in seconds, not minutes or hours. RPO (recovery point objective) is the maximum amount of data, usually expressed as a window of time, that an organisation can tolerate losing in an outage.

“And that should be zero,” he argued. “Let's assume you have a traditional database that you are backing up every hour. That's up to one hour of data that you're going to permanently lose in the event of an outage, simply because you didn't back up more regularly within that time window.”
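Reid's hourly backup example can be worked through in a short Python sketch. It assumes, for illustration only, that backups are the sole recovery mechanism and that each backup completes instantly. The function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Worst case: the failure strikes just before the next backup would run."""
    return backup_interval


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Writes made after the last backup and before the failure are lost."""
    if failure < last_backup:
        raise ValueError("failure cannot precede the last backup")
    return failure - last_backup


# Hypothetical timeline: backup at 09:00, failure at 09:47 the same day.
last_backup = datetime(2025, 1, 17, 9, 0)
failure = datetime(2025, 1, 17, 9, 47)

print(data_loss_window(last_backup, failure))    # 0:47:00
print(worst_case_data_loss(timedelta(hours=1)))  # 1:00:00
```

Shortening the backup interval narrows the window but never closes it, which is why an RPO of zero cannot be reached by periodic backups alone.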

Beyond the primary/secondary approach, self-healing technology is the more modern route to effective operational resilience. The term refers to applications that can detect, diagnose, and repair their own issues without human intervention. Made even more powerful by machine learning and artificial intelligence (AI), self-healing technology enables organisations to better manage their systems’ availability.

Crucially, self-healing technology can work both reactively and preventatively which, according to Reid, matters not just for systems but for employees as well. To achieve reliable availability, the mindset within organisations needs to start rewarding prevention more than fixing existing issues:

“Do employees get more recognition for putting out fires, or do they get more recognition for preventing fires in the first place? Preventing fires will inevitably be a lot less visible if the reward culture celebrates firefighting,” emphasised Reid. “Businesses can and should be adopting self-healing and distributed technologies. This places the burden of operational resilience on software instead of people, and that frees people up to innovate.”

Operational resilience in 2025 and beyond

In 2025, downtime is no longer tenable. Resilience, in its many forms, must be made a priority. A failure to comprehensively overhaul and modernise systems and processes will inevitably incur disruptions.

“DORA is the recognition that the status quo isn't doing enough to keep businesses online, and it should be seen as an opportunity,” concluded Reid. “DORA will shore up trust in the industry as a whole, and each of those businesses that work within it are going to contribute to that. I have watched organisations reap the benefits of self-healing applications. Modern technology has the potential to completely revolutionise the way we approach operational resilience.”

It is now imperative for financial institutions – both banks and regulated, non-bank financial institutions – to ensure business continuity meets organisational needs in an increasingly volatile global environment.

Comments: (1)

A Finextra member 

Security of electricity supply is now under threat from lack of 'inertia' from renewables... It does not matter what we do in our industry if our governments cannot guarantee security of electricity supplies in a net-zero world without abundant nuclear base-load.
