Network Outages Highlight Key Areas of Weakness

The proposed Regulation SCI (Systems Compliance and Integrity) from the SEC may require further strengthening, but principles such as mandatory testing of disaster recovery procedures would ensure that financial organisations recognise some of the key weaknesses in their system infrastructures.  

One of the examples of system failure cited by the SEC was the software glitch that resulted in BATS Global Markets cancelling its IPO on its own exchange. BATS, like many other operators in the securities market, is no stranger to unexpected failures. At the end of 2011, trading at BATS Chi-X was halted for an entire day after a hardware failure. Whilst the exact cause of the failure was not confirmed, similar instances have been attributed to network devices.

In 2012, overconfidence in the reliability of the Tokyo Stock Exchange’s systems eventually led to a stoppage in trading for a few hours when a switching procedure, triggered by a hardware failure, was unsuccessful.

Looking at the recent outages at RBS and NatWest, it’s easy to see that whilst networks should be designed with possible failures in mind, outages still affect even the most resilient networks. The problem for many organisations is that they’ve traditionally concentrated on protecting their data, implementing automated failsafes, but have ignored the nuts and bolts (the network) that hold it together.

As companies deliver more real-time services over the network, the outage stakes have risen and network configurations need to be better managed. With misconfiguration being a contributing factor in over 65% of network outages, business leaders need to understand the risks of even a small failure and have a strategy to recover.

Even when a hardware outage occurs, the trouble often begins just when the IT team thinks the panic is nearly over. With the failed hardware replaced, all that’s left is to restore the settings. It should be a simple matter of a few clicks, but this is normally the time that organisations discover that the backup of the previous working configuration is not up to date.

A typical infrastructure might have hundreds of network devices from dozens of different vendors, each requiring manual intervention by skilled engineers to create backups of the configuration settings that drive the network. Because it’s a time-consuming task, network configuration backups are often put to one side in favour of other business-as-usual activities. As network and security devices such as firewalls are changed fairly frequently, unforeseen risks and compliance failings start to build up as the distance between backup cycles grows.
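The growing gap between a device's last change and its last backup can be checked mechanically. The following is a minimal sketch in Python, not a description of any particular product: the `Device` record, the `stale_backups` function and the seven-day threshold are all assumptions made for illustration, and the timestamps would have to come from whatever change log and backup store an organisation already has.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Device:
    """Hypothetical inventory record for one network device."""
    name: str
    last_change: datetime            # last known configuration change
    last_backup: Optional[datetime]  # None if the device was never backed up


def stale_backups(devices: List[Device], now: datetime,
                  max_age: timedelta = timedelta(days=7)) -> List[str]:
    """Return the names of devices whose backup cannot be trusted.

    A backup is treated as stale if it is missing, if it predates the
    last configuration change, or if it is older than max_age.
    """
    stale = []
    for d in devices:
        if (d.last_backup is None
                or d.last_backup < d.last_change
                or now - d.last_backup > max_age):
            stale.append(d.name)
    return stale
```

Run on a schedule, a check like this would surface the problem before an outage does, rather than at the moment a replacement device needs its settings restored.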

When an outage strikes, whether through hardware failure or human error, the affected organisation’s network engineers are against the clock to resume normal operations. Without a current backup, they are often faced with making live changes to the network to rebuild configurations to their last known state.

Even when engineers have created scripts to automate the configuration backup process, recovery operations are seldom tested. The recovery process is also usually manual, requiring skilled engineers to be available. Without centralised automation of both the backup and the recovery of network device configurations, delays in restoring systems and the resulting downtime costs will be inevitable.
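One way to test recovery rather than just backup is a restore drill: push a stored configuration to a test device, read the configuration back, and confirm the two match. The sketch below assumes two hypothetical callables, `restore_fn` and `read_back_fn`, standing in for whatever vendor-specific mechanism actually writes and reads a device's configuration; neither is a real library API.

```python
import hashlib
from typing import Callable


def config_digest(text: str) -> str:
    """Hash a configuration, ignoring blank lines and trailing whitespace."""
    lines = [line.rstrip() for line in text.splitlines() if line.strip()]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def restore_drill(backup_text: str,
                  restore_fn: Callable[[str], None],
                  read_back_fn: Callable[[], str]) -> bool:
    """Restore a backup to a test device and verify the result.

    restore_fn and read_back_fn are placeholders (assumptions) for the
    device-specific write and read operations.
    """
    restore_fn(backup_text)
    return config_digest(read_back_fn()) == config_digest(backup_text)


# Usage with an in-memory stand-in for a test device.
fake_device = {"config": ""}
ok = restore_drill(
    "hostname edge-1\ninterface eth0\n",
    restore_fn=lambda text: fake_device.update(config=text),
    read_back_fn=lambda: fake_device["config"],
)
```

Comparing digests instead of raw text tolerates harmless whitespace differences; a real drill would likely need further normalisation for fields that devices rewrite on load, such as timestamps.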

Whilst organisations are generally well prepared with regard to their server infrastructure, network devices are often overlooked, and inadequacies in business continuity plans only come to light when a device needs to be restored. Events that trigger disaster recovery are rarely predictable. By mandating the testing of disaster recovery procedures, Reg SCI will ensure that whilst the cause of the disaster can’t be predicted, the recovery path can.

Comments: (1)

A Finextra member, 09 May, 2013, 14:08

Thanks for this good analysis, which leads towards identifying complexity as the root cause for most of these outages ...

In the good old days when systems were reliable, there used to be some kind of "big iron" as the underlying infrastructure, having its own integrated communication subsystem for network connectivity. Configurations used to be pretty simple and easy to manage.

In the brave new world of virtualization and sophisticated server farms, things are not that easy any more. Too many components and too many software products and layers from too many different vendors make it very difficult for humble human brains to keep up with the ever-rising complexity.

Hardware cost becomes lower, but this is more than offset by rising cost for services and software. So IT budgets are still rising, whereas IT reliability is deteriorating ...

Retired Member

Member since

19 Mar 2009

This post is from a series of posts in the group:

Banking Architecture

A community for discussing the latest happenings in banking IT. Credit Crunch impacting Risk Systems overall, revamp of mortgage backed securities, payment transformations, include business, technology, data and systems architecture capturing IT trends, 'what to dos?' concerning design of systems.

