Long reads

TARGET2 incident illustrates the need for High Availability and DevOps

Andrew Smith

Founding CTO, RTGS & ClearBank

The full story behind the recent TARGET2 outage is not yet in the public domain, but from what we do know, the incident serves as a strong case for investment in DevOps and a move away from traditional Disaster Recovery models.

In a recent communication, the root cause of the incident was “found to be a software defect of a third-party network device used in the internal network of the central banks operating the TARGET2 service.” I am not entirely sure what that means; to be frank, it is so broad as to be almost meaningless. All I can take from it is that there was an infrastructure failure. We also know that a back-up system failed to work and that failover to a secondary DR site took many hours. Yet we do not know the causes of these issues.

What can we learn?

We can see that the system operates what I would call a legacy set-up, in that it follows a very typical Disaster Recovery (DR) model. This model has been in place for many years and is pretty much the go-to model within the financial services industry.

You have one infrastructure, typically a data centre, which does all the work, and a second, identical infrastructure (data centre) as a 'hot standby', located a good 40+ km away. The thinking is: if your primary data centre has an issue that you cannot fix quickly, you simply fail services over to the hot standby and the job is done.

This thinking is deeply embedded within financial services; a DR model even forms an integral part of the journey to gain a banking licence. However, it is an outdated model and, as this TARGET2 incident shows, a broken one.

CIOs and COOs within the financial services sector have to stop thinking of IT services in terms of 'disaster' and then 'recovery'. That mindset drives a very basic approach, typically one of redundancy: we must have duplicates of everything.

This starts with telecoms, cabling, servers, power and switches, and then the thinking is taken up a notch to the 'data centre' level, meaning we simply have a redundant data centre. There are two massive issues with this thinking.

The first is that even small incidents become a matter of 'disaster'. The second is that your DR site can never truly be the same as your primary, based on the real-world understanding that data is always flowing and therefore 'state' is always changing. That is before we even consider software upgrades, testing cycles, hardware updates and so on.

Availability

I often explain availability as a basic rule of three. DR is essentially a rule of two: one infrastructure fails and we switch to the backup. However, in a DR model you then have zero redundancy until you have fixed the initial issue. With the rule of three, you maintain redundancy, and therefore resilience, while you fix issues.

This applies to things like data: your data is stored across three different zones, quorum is required between those zones, and all of a sudden you have protection against data corruption and loss. The rule of three gets expensive, so you must think commercially: “how can I utilise this redundancy, how do I ensure it's not just wasted capacity?” In thinking this way, you also address the issue of state, ensuring everything stays consistent everywhere. That data, for example, is available at all times in all three zones.
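
To make the rule of three concrete, here is a minimal sketch in Python; the zone names and the toy in-memory store are illustrative assumptions, not anything from TARGET2 or a specific product. It shows how a quorum of two out of three zones protects both writes and reads:

```python
# Minimal sketch (not production code): quorum writes and reads across three zones.

from dataclasses import dataclass, field

@dataclass
class Zone:
    name: str
    healthy: bool = True
    store: dict = field(default_factory=dict)

def quorum_write(zones, key, value, quorum=2):
    """Write succeeds only if at least `quorum` zones acknowledge it."""
    acks = 0
    for zone in zones:
        if zone.healthy:
            zone.store[key] = value
            acks += 1
    if acks < quorum:
        raise RuntimeError(f"write failed: only {acks} of {len(zones)} zones acknowledged")
    return acks

def quorum_read(zones, key, quorum=2):
    """Return the value agreed by at least `quorum` zones, protecting against a corrupt copy."""
    values = [zone.store.get(key) for zone in zones if zone.healthy]
    for value in set(values):
        if values.count(value) >= quorum:
            return value
    raise RuntimeError("no quorum: zones disagree or too few are available")

zones = [Zone("zone-a"), Zone("zone-b"), Zone("zone-c")]
quorum_write(zones, "payment-123", "settled")
zones[0].healthy = False                   # lose one zone...
print(quorum_read(zones, "payment-123"))   # ...and the data is still readable: "settled"
```

Lose any one zone and both the write path and the read path keep working, which is exactly the redundancy-plus-resilience point above.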

Availability means you are utilising that capacity. In the cloud, availability is something the big players have, and continue to invest billions in. Microsoft Azure, for example, provides availability zones: essentially three separate compounds (think of each as a typical data centre), geographically separated by something between 10 and 40 km, all running active:active:active with each other.

Essentially, your redundancy and resilience are load balanced. In terms of a failure within this availability model, there is no interruption to the availability of services. In the TARGET2 incident, transactions would have continued to flow and be processed as they should, as only one zone would have been impacted.
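
As an illustration of why transactions keep flowing, the sketch below (again with made-up zone names and a trivial health map) shows a load balancer rotating traffic across three active zones and simply skipping a zone that has failed its health check:

```python
import itertools

# Illustrative sketch: three active zones behind a load balancer.
ZONES = ["zone-a", "zone-b", "zone-c"]
HEALTHY = {"zone-a": True, "zone-b": False, "zone-c": True}  # zone-b has just failed
_rotation = itertools.cycle(ZONES)

def route(payment: str) -> str:
    """Round-robin across the active zones, skipping any that fail health checks."""
    for _ in range(len(ZONES)):
        zone = next(_rotation)
        if HEALTHY[zone]:
            return f"processing {payment} in {zone}"
    raise RuntimeError("no healthy zone available")

for p in ("pmt-1", "pmt-2", "pmt-3"):
    print(route(p))  # payments keep flowing through zone-a and zone-c
```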

There are many other aspects of high availability we should get familiar with; one such example is tiering your infrastructure and the services you provide. By this I mean that a tier 1 service is highly important: it must be running at all times, so you engineer it to run at all times. A tier 3 application or service, by contrast, can afford to go offline for hours on end without having a material impact on what your business does. This tiering allows you to keep your costs in check and ensures resilience is focussed on the areas that really matter.
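
To show what tiering might look like in practice, here is a hypothetical sketch; the tier definitions, example services and availability targets are assumptions made for the sake of the argument, not a prescribed standard:

```python
# Hypothetical service tiering: spend the redundancy budget where it matters.
SERVICE_TIERS = {
    "tier-1": {"example": "payment processing", "zones": 3, "target_availability": "99.99%"},
    "tier-2": {"example": "customer reporting", "zones": 2, "target_availability": "99.9%"},
    "tier-3": {"example": "internal dashboards", "zones": 1, "target_availability": "best effort"},
}

def zones_required(tier: str) -> int:
    """Only tier 1 runs active across all three zones; lower tiers cost less."""
    return SERVICE_TIERS[tier]["zones"]

print(zones_required("tier-1"))  # 3
print(zones_required("tier-3"))  # 1
```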

Upgrades and maintenance

The challenge with upgrades and maintenance, be they physical hardware or software related, is ensuring that upgrades do not impact availability and that, if an upgrade is faulty, you can contain the issue before it is applied to the rest of your infrastructure. Enter the importance of DevOps.

When we start to think of the very infrastructure that services run on as 'code', we can version control that infrastructure (which maps back to the physical kit). We can control how it is deployed and we can ensure it is totally consistent. There is also the added benefit that infrastructure becomes repeatable, and repeatable at speed.
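
A minimal illustration of the idea, assuming a hypothetical estate of three virtual machines: the desired infrastructure is declared as data, kept under version control, and compared against what is actually running so that only the differences need to be applied. The resource names and fields are invented for this sketch, not a real tool's schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VirtualMachine:
    name: str
    zone: str
    image_version: str

# This declaration lives in a git repository alongside the application code.
DESIRED = {
    VirtualMachine("pay-api-1", "zone-a", "v2.3.1"),
    VirtualMachine("pay-api-2", "zone-b", "v2.3.1"),
    VirtualMachine("pay-api-3", "zone-c", "v2.3.1"),
}

def plan(actual: set) -> tuple:
    """Compute what must be created and what must be removed to match the declaration."""
    return DESIRED - actual, actual - DESIRED

# Example: zone-b is running an older image, so the plan replaces just that machine.
actual = {
    VirtualMachine("pay-api-1", "zone-a", "v2.3.1"),
    VirtualMachine("pay-api-2", "zone-b", "v2.2.9"),
    VirtualMachine("pay-api-3", "zone-c", "v2.3.1"),
}
to_create, to_remove = plan(actual)
print("create:", to_create)
print("remove:", to_remove)
```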

DevOps is critical to ensuring that high availability models work, that your software and infrastructure are repeatable, and that you can upgrade parts of your infrastructure without negatively impacting your services. As a CIO/COO, I strongly recommend you get familiar with the term 'rolling upgrade'. Think of it as upgrading some, but not all, of your services, checking they are functioning before continuing the upgrade to the services or infrastructure that have yet to be upgraded. Rolling upgrades enable you to identify issues and contain them while continuing to provide your services (keeping them highly available).
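
Here is a rough sketch of the rolling-upgrade idea, with an invented instance list and a stand-in health check: upgrade one batch at a time, verify it, and halt (containing the fault) if the new version misbehaves:

```python
def health_check(instance: str) -> bool:
    # Stand-in for a real probe (HTTP health endpoint, synthetic transaction, etc.)
    return not instance.endswith("-bad")

def rolling_upgrade(instances: list, batch_size: int = 1) -> None:
    for i in range(0, len(instances), batch_size):
        batch = instances[i:i + batch_size]
        print(f"upgrading {batch}")
        if not all(health_check(inst) for inst in batch):
            # The defect is contained to this batch; the rest of the fleet keeps
            # serving traffic on the known-good version.
            print(f"health check failed in {batch}: halting upgrade and rolling back this batch")
            return
        print(f"{batch} healthy, continuing")
    print("upgrade complete across all instances")

rolling_upgrade(["pay-api-1", "pay-api-2", "pay-api-3"])
```

The point is not the specific code but the shape of the process: at no moment is the whole fleet running an unverified change, so a faulty upgrade never becomes a service-wide outage.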

Systemic importance

TARGET2 is systemically important. The cloud, with its high availability models, coupled with the growth of DevOps, has shown how we can ensure services remain pretty much always on. 100% is not achievable, but it is the goal.

Incidents can and always will happen; the key is being able to keep your services available while an incident is ongoing. Then, if you suffer a real disaster, such as a natural disaster that destroys an entire data centre, you are still able to keep services available. For systemically important services, we must all ditch the thinking of Disaster Recovery and move to the concept of High Availability.

There is a great deal of 'chat' regarding resilience within financial services and the use of the cloud. However, the financial services sector simply cannot afford to invest as much money in core underlying infrastructure as is invested in the public cloud, especially by providers such as Microsoft, Amazon and Google.

At some point we have to acknowledge that important infrastructure is destined to run in the cloud, be that within an individual financial institution or as a systemically important payment service, such as TARGET2.

Comments: (2)

Marcel Klimo
Marcel Klimo - Vacuumlabs - Bratislava 09 November, 2020, 09:54

Why is it that the financial services sector cannot afford to do this? Can you help me understand that please?

Bob Lyddon
Bob Lyddon - Lyddon Consulting Services - Thames Ditton 12 November, 2020, 12:23

Just to share my delvings into what the disaster recovery for the TARGET2 SSP involves.

The TARGET2 SSP has three “regions”. Regions 1 and 2 (payments and accounting) are run in data centres of the Bundesbank and Banca d’Italia; Region 3 (Customer related service system) is run in data centres of the Banque de France.

The Regions are 1,000 km apart, rather than 20 km, although the "sites" within the "Regions" may well be closer together.

The Regions are connected normally through the 4CB network, or through CoreNet if 4CB goes down (User Book 1 for v14.0 p366). “4CB” presumably refers to the above three central banks plus the ECB.

Region 1 is the live system; Region 2 is the disaster recovery and the testing system. They are replicas of one another, kept in synch via 4CB.

The role of being Region 1 is rotated between the Bundesbank and Banca d’Italia: when the Bundesbank has Region 1, Banca d’Italia has Region 2 and vice versa.

Each Region contains two sites (User Book 1 for v14.0 p367): a primary site and a recovery site. These are presumably in the separate data centres of each central bank that they use as disaster recovery for their own, non-TARGET2 systems (e.g. for the Bundesbank they might be somewhere like Eschborn and Kassel). It is implied that the two sites run by the same central bank are also connected through 4CB.

So the first failover for live operation is the recovery site within Region 1; the second failover is the primary site in Region 2; the third failover is the recovery site in Region 2. If all of this fails, as it seems to have done on 20th October, there is a “Restart after disaster” as per User Book 1 p380.

The inference from all of the above, and from what the ECB has said, is that 4CB went down, CoreNet could not be brought online, and none of the intra-Region or inter-Region communications could take place.

That in turn points to the failure of a switching component in Region 1, and one that could not be bypassed.