Operational resilience is on all UK payments leaders’ minds. In 2024, 95% of business leaders stated that they’re aware of operational weaknesses which leave them vulnerable, yet 48% said their organisations aren’t doing enough to improve resilience.
The European Union's (EU) Digital Operational Resilience Act (DORA), which came into effect on 17 January 2025, is the regulators’ push towards improved operational risk incident management across the industry. But how have financial organisations fared
when it comes to readiness? What else needs to be done from an infrastructure perspective to achieve greater resilience?
Finextra spoke with Rob Reid, technical evangelist at Cockroach Labs, about the company’s research on the state of resilience across financial services; how to effectively achieve operational resilience; and what needs to be done to reap benefits that extend
beyond mere compliance.
How outages turned into the new normal
While outages have become common in organisations – Cockroach Labs found organisations experience on average 86 a year – it’s major infrastructure blackouts that grabbed the headlines in 2024.
The Bank of England reported seven outages affecting the UK’s RTGS (Real-Time Gross Settlement) and Clearing House Automated Payment System (CHAPS) in 2024. Most notably, in July 2024, CHAPS failed and delayed large, time-sensitive payments, which had a substantial impact. CHAPS usually enables 200,000 payments a day, with an average daily value of £345 billion, so a system shutdown of this scale (245 minutes in fact) resulted in considerable losses.
That very same month, the CrowdStrike outage caused global chaos, affecting over 8 million Windows devices. The fallout was immense, with GPs unable to treat patients; hundreds of businesses reporting revenue losses; and planes being grounded globally, leaving travellers stranded in airports.
“Once you've hit rock bottom, the only way is up. And I think we're going to need to see changes,” said Reid, when speaking on the state of operational resilience in financial services. “The technologies and practices used across the industry aren't keeping
up with the needs of modern resilience requirements. Fundamentally, if what we had was working, we wouldn't have DORA.”
The state of resilience in light of DORA
The EU’s DORA officially took effect on 17 January 2025, providing a universal framework designed to enhance information and communication technology (ICT) risk management.
“I am a software engineer with an almost comically low-risk appetite. So, as you can imagine, I've been bemoaning lacklustre operational resilience for many years,” commented Reid. “DORA is a much-needed wake up call for the industry. I wish we would have
had it years ago, because as a software engineer at the coalface, I would have had something to wield.”
In order to understand the state of resilience going into 2025, Cockroach Labs surveyed 1,000 senior cloud and technology executives. Alarmingly, the data showed that while 94% of technical executives stated that the CrowdStrike outage encouraged their organisations to reassess their risk management, the operational resilience reality still looks bleak:
- 93% of leaders are concerned about the financial and organisational impacts of outages;
- 95% are aware of operational weaknesses that leave them vulnerable;
- 53% of banking and financial services companies report experiencing service disruptions at least weekly;
- Only 20% of respondents describe their organisation as fully prepared for outages;
- Only 33% have an organised response approach, and fewer than a third conduct regular failover testing.
Speaking on the results, Reid emphasised: “Every single person we spoke to reported revenue loss as a result of downtime in the last 12 months. On average, businesses are seeing 86 outages per year, with the average downtime lasting more than three hours.
In terms of approaches, this hints at an industry-wide tendency of being reactive to downtime, and I would question whether teams are being given the time, space, and resources required to make meaningful, positive changes in preventing it.”
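To put those survey figures in perspective, the short Python sketch below converts them into annual downtime and an availability percentage. It is a rough illustration only: it assumes each outage lasts exactly three hours (the survey says "more than three hours", so this is a floor), that outages never overlap, and that each one takes the whole service offline. The function names are hypothetical and not taken from the Cockroach Labs research.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def annual_downtime_hours(outages_per_year: float, avg_hours_per_outage: float) -> float:
    """Total downtime per year, assuming outages do not overlap."""
    return outages_per_year * avg_hours_per_outage


def availability(downtime_hours: float) -> float:
    """Fraction of the year the service is up."""
    return 1 - downtime_hours / HOURS_PER_YEAR


# Survey figures: 86 outages a year, at least three hours each (assumed floor).
downtime = annual_downtime_hours(86, 3.0)
print(f"{downtime:.0f} hours/year, availability {availability(downtime):.2%}")
# prints: 258 hours/year, availability 97.05%
```

Under these assumptions, the average organisation would be offline for more than ten full days a year.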
Given that the research was conducted at the end of 2024, it is surprising to see how little progress organisations had made toward operational resilience, especially with the DORA deadline approaching. However, given how much information geared toward
DORA readiness has been available, these results suggest the issue may be one of agility rather than one of understanding.
“Consider DORA from the perspective of a company with aging technology and infrastructure,” commented Reid. “This all serves to reduce their ability to innovate. They're having to manage all of this potentially archaic infrastructure, let alone react with
agility. And it’s not only DORA, there is GDPR [General Data Protection Regulation], there is CCPA [California Consumer Privacy Act], and a host of other regulations. Add to that a disaster recovery mindset, necessitated by the presence of primary/secondary
architecture, and you've got a perpetuation.”
So how can organisations go beyond the minimum requirements of DORA to develop holistic operational resilience strategies?
Developing modern resilience strategies
For organisations running primary/secondary architecture, failovers and failbacks are key concepts of resilience and disaster recovery. A failover is the process of switching to a backup, secondary system or site when the primary architecture fails – ensuring
business continuity – while failback refers to the process of returning to the primary system once the issue is resolved.
Reid explained that many organisations are running primary/secondary architectures “with the hope that things don't go wrong. Because if something goes wrong, they need to fail over, and that is risky. Some businesses never fail back because of the risk
associated in failing back to the primary architecture. However, hope is not a strategy. Modern and capable technology must be considered if we are to move beyond the traditional primary/secondary failover mindset, and businesses should be considering technologies
that minimise RTO and RPO.”
RTO (recovery time objective) is the maximum amount of time that an organisation can tolerate being down following an outage, which, according to Reid, should be measured in seconds, not minutes or hours. RPO (recovery point objective) is the maximum amount of data, usually expressed as a window of time, that an organisation
can tolerate losing in an outage.
“And that should be zero,” he argued. “Let's assume you have a traditional database that you are backing up every hour. That's up to one hour of data that you're going to permanently lose in the event of an outage, simply because you didn't back up more
regularly within that time window.”
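Reid's hourly-backup example can be worked through in a few lines of Python. This is a minimal sketch under simple assumptions: backups are the only recovery mechanism, and everything written after the last completed backup is lost. The function name and the timestamps are hypothetical, chosen purely for illustration.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, outage: datetime) -> timedelta:
    """Span of writes lost: everything between the last backup and the outage."""
    if outage < last_backup:
        raise ValueError("outage cannot precede the last completed backup")
    return outage - last_backup


backup_interval = timedelta(hours=1)         # hourly backups, as in Reid's example
last_backup = datetime(2025, 1, 17, 9, 0)    # hypothetical timestamps
outage = datetime(2025, 1, 17, 9, 47)

lost = data_loss_window(last_backup, outage)
print(lost)                      # prints: 0:47:00
print(lost <= backup_interval)   # prints: True (loss is bounded by the interval)
```

The loss can never exceed the backup interval, but it only reaches zero if no write ever goes unreplicated, which is the case Reid makes for looking beyond periodic backups.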
Beyond the primary/secondary architecture approach, self-healing technology is the more modern route to effective operational resilience. The term refers to applications that are capable of detecting, diagnosing, and repairing their own issues
without human intervention. Made even more powerful through machine learning and artificial intelligence (AI), self-healing technology enables organisations to better manage their systems’ availability.
Crucially, self-healing technology can work both reactively and preventatively which, according to Reid, is important not just for systems but for employees as well. To achieve reliable availability, the mindset within organisations needs
to start rewarding prevention more than finding solutions to existing issues:
“Do employees get more recognition for putting out fires, or do they get more recognition for preventing fires in the first place? Preventing fires will inevitably be a lot less visible if the reward culture celebrates firefighting,” emphasised Reid. “Businesses
can and should be adopting self-healing and distributed technologies. This places the burden of operational resilience on software instead of people, and that frees people up to innovate.”
Operational resilience in 2025 and beyond
In 2025, downtime is no longer tenable. Resilience, in its many forms, must be made a priority. A failure to comprehensively overhaul and modernise systems and processes will inevitably incur disruptions.
“DORA is the recognition that the status quo isn't doing enough to keep businesses online, and it should be seen as an opportunity,” concluded Reid. “DORA will shore up trust in the industry as a whole, and each of those businesses that work within it are
going to contribute to that. I have watched organisations reap the benefits of self-healing applications. Modern technology has the potential to completely revolutionise the way we approach operational resilience.”
It is now imperative for financial institutions – both banks and regulated, non-bank financial institutions – to ensure business continuity meets organisational needs in an increasingly volatile global environment.