As trading resumed in New York after a two-day Hurricane Sandy-inflicted pause, the effectiveness of Wall Street's back-up plans to deal with disasters was being called into question.
It's not unusual for banks to have their DR site located just across the nearest river (East or Hudson): office in lower Manhattan near the Hudson, DR in Jersey City, etc. Moronic, of course, but not uncommon. And DR testing is frequently
not a high priority, though that did improve somewhat after 9/11. But Sandy, far more than 9/11, should wake up some of the most complacent managers. The US operation of one foreign bank that I know of is probably mostly shut down now - its primary site is near
WTC, and its DR is just across the river.
Putting DR centers farther away and on higher ground would certainly be common sense. But such an approach would not be very compatible with high-speed gambling (aka HFT), due to longer signal transit times ...
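To put a rough number on the transit-time point: light in optical fibre travels at roughly two-thirds of its vacuum speed, so every extra kilometre of cable adds about 5 microseconds one way. The snippet below is only a back-of-the-envelope sketch. The refractive index of 1.47 is an assumed typical value, the distances are illustrative, and real cable routes are longer than the straight-line distance, so treat the results as lower bounds.

```python
# Back-of-the-envelope propagation delay over optical fibre.
# Assumption: refractive index of about 1.47 (typical for silica fibre).
C_VACUUM_KM_PER_S = 299_792.458
FIBRE_REFRACTIVE_INDEX = 1.47


def one_way_delay_ms(distance_km: float) -> float:
    """Propagation delay in milliseconds for one direction of travel."""
    speed_km_per_s = C_VACUUM_KM_PER_S / FIBRE_REFRACTIVE_INDEX
    return distance_km / speed_km_per_s * 1000.0


def round_trip_delay_ms(distance_km: float) -> float:
    """Propagation delay in milliseconds for a there-and-back exchange."""
    return 2.0 * one_way_delay_ms(distance_km)


if __name__ == "__main__":
    # Illustrative distances only, not real site locations.
    for km in (5, 50, 500):
        print(f"{km:>4} km: one way {one_way_delay_ms(km):.3f} ms, "
              f"round trip {round_trip_delay_ms(km):.3f} ms")
```

This counts propagation only; switching and serialisation delays come on top. It does suggest that a DR site a few hundred kilometres away costs milliseconds, which matters to HFT but hardly to most other banking systems.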
Not sure if this is an instance of regulators sitting in ivory towers and passing judgement on banks and FIs, who are the ones really bearing the brunt of the disaster. I partly blame technology vendors for conveying the impression that you can throw in an
extra RAID here, install an extra blade server there, run a redundant cable between here and there, and be completely assured of business continuity. Having been through a couple of DR tests, I can say it's virtually impossible to verify that the DR site can be activated
and will work fine when the catastrophe actually strikes. Besides, all this talk of technology ignores the people angle. Amidst all the travel disruption that usually accompanies natural disasters, it's not easy to get the right people - who normally work
out of the primary site - to the DR site in time, especially if it's located far away from the primary site.
Having said that, banks and FIs should do more than rely on a mop and bucket as their chief DR strategy - which is what one bank allegedly did - in the event of their data centers getting flooded!
Well - actually the organisations running such critical systems are to blame if they believe marketeers who paint low-cost, plain-vanilla PC technology (which today sits inside those complex server farms) as being as reliable as the big iron that
was deployed there before. Once upon a time, fault-tolerant (= failsafe) servers were typically used in such trading applications - and those were also designed to switch over easily and reliably to the remote DR system if needed. Of course, that
DR switchover was also tested regularly.
And yes, those systems were also built to be run in a lights-out environment, and the (rather small) operational staff could operate them remotely. No need to rush them to the DR site just to activate it ...
I agree with your point about falling for sales pitches for PC / Wintel systems claiming to be as failsafe as Tandem, Stratus and other big-iron that supported true redundancy.
As for remote operations, certain activities - e.g. changing switch encryption keys - require a personal visit by one of very few highly specialized engineers and can't be done remotely for security reasons. I've also come across more than one bank where
certain tasks can only be performed onsite. I don't know why such policies exist, but they do pose severe challenges to keeping the lights on in the event of a disaster.
I've been working in IT and related functions since 1979 (and studying it since about 1971). I was taught some basics about DR and reliability along the way:
1. DR sites should whenever possible be on separate electric grids, not in major danger zones (flood, tornado, earthquake, etc.), have independent power, have enough capacity to carry your business for months, and have systems hot and backed up either continuously
or daily from production.
2. DR scenarios should be tested to varying degrees multiple times per year. Everything should be tested including incoming data, outgoing trade or like systems, web sites, intranet, backup, etc.
3. It's not a bad idea, where practical, to split production load (think Market Data backbones, for instance) between prime and DR sites, as long as either one can handle the full load if need be. Licensing costs, even if carefully handled, might be somewhat higher,
but then you know that the DR site is fully operational at all times. It is critical, though, to have diverse routing from vendor feed sites, something frequently overlooked.
4. A financial firm should have its own network designed to work well even if one or more production locations are lost, and this connectivity should be frequently tested. Back around 1990, Bankers Trust hubbed its network at 130 Liberty St - across from
#1 WTC. I pointed out at a DR planning meeting that there really needed to be some way to connect if the basement (where the hub was located) got damaged or flooded (I won't repeat the crude words aimed at me for that comment). But we all know what happened
to that building on 9/11 - eleven years later.
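Points 1 and 3 above lend themselves to a simple automated sanity check: can either site carry the full load alone, and do the two sites share a power grid or a hazard zone? The sketch below is only an illustration. The `Site` fields, the `dr_findings` helper, the load units and the example values are all hypothetical, and a real review would cover far more (network routing, staffing, vendor feeds).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """Hypothetical description of one data-center site."""
    name: str
    capacity: float            # peak load the site can carry (arbitrary units)
    power_grid: str            # label of the electric grid feeding the site
    hazard_zones: frozenset    # e.g. frozenset({"harbor-flood"})


def dr_findings(primary: Site, dr: Site, peak_load: float) -> list:
    """Return a list of problems found; an empty list means none were found."""
    findings = []
    # Point 3: either site must be able to handle the full load by itself.
    for site in (primary, dr):
        if site.capacity < peak_load:
            findings.append(f"{site.name} cannot carry the full load alone")
    # Point 1: separate electric grids.
    if primary.power_grid == dr.power_grid:
        findings.append("both sites are on the same power grid")
    # Point 1: the two sites should not be exposed to the same hazard.
    shared = primary.hazard_zones & dr.hazard_zones
    if shared:
        findings.append("both sites share hazard zones: " + ", ".join(sorted(shared)))
    return findings


if __name__ == "__main__":
    # Illustrative values only: two sites on opposite banks of the same river.
    site_a = Site("Site A", 100.0, "grid-1", frozenset({"harbor-flood"}))
    site_b = Site("Site B", 60.0, "grid-1", frozenset({"harbor-flood"}))
    for finding in dr_findings(site_a, site_b, peak_load=80.0):
        print("WARNING:", finding)
```

A check like this proves nothing about whether a failover will actually work, which still takes the regular testing described in point 2, but it would flag the "DR just across the river" pattern before the water does.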
I read this morning that the New York Daily News had lost both its offices in lower Manhattan and its printing plant / DR site in Jersey City. I think we all understand that this sort of siting is, at best, stupid (I think NYDN has now figured that out
too). So banks that have primary in lower Manhattan and backup either at say Metro Tech Brooklyn or in or near Jersey City or some such combo may finally be learning this lesson.
DR has long been a joke at many firms; I'd hoped that 9/11 would teach the needed lesson, but clearly it has not. Maybe Sandy will - or not.
© Finextra Research 2014