
Dealing with Outages - Are we Ready?

Outages and glitches seem to be becoming more and more frequent, with a wide-reaching ripple effect ensuring that the impact of such outages is felt more widely than ever. This is dangerous for business: reputation is so fragile in this fickle economy that businesses simply can’t afford to let their IT let them down. So why does it keep happening? Amazon’s cloud going down, NatWest’s two-week outage, the BATS IPO failure – these outages come in many different forms, and affect many different types of business. United Airlines is the first airline widely reported to have fallen foul of this, suffering the consequences in the cost of reimbursing flights and the ensuing Twitter outrage. It’s beginning to look like outages are something we will have to get used to.

I understand that the nature of IT these days means that outages are almost certainly unavoidable, but the point I want to make is that in order to maintain business continuity, enterprises need to take responsibility for planning their outage contingencies.

The problem is that business processes, applications and computing infrastructure are too intertwined and dependent on each other.  If the infrastructure isn’t configured just right or is unavailable, the business process stops.  The industry has made great strides in abstracting the physical computing infrastructure from the applications it supports.  Amazon and VMware have created tremendous value and built businesses by abstracting (or insulating) applications and users from hardware diversity and failures.

However, the industry has only started to abstract the business process from the applications and infrastructure that support it.  To work around an outage on the scale of Amazon EC2, organizations really need to use more than one provider and so avoid a single point of failure.  Yet to succeed at this, the business needs the ability to re-route and re-run the process in its own data center or at an alternative service provider.  This is where higher-level process automation comes in.
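
To make the re-routing idea concrete, here is a minimal Python sketch of workflow failover across providers. It is an illustration under stated assumptions, not a real implementation: the run_workflow() call, the ProviderUnavailable exception and the provider objects are hypothetical stand-ins for whatever APIs the environments actually expose.

```python
# Hypothetical sketch: ProviderUnavailable, run_workflow() and the
# provider objects are stand-ins, not any real cloud API.

class ProviderUnavailable(Exception):
    """Raised when a provider cannot accept or complete the workflow."""

def run_with_failover(workflow, providers):
    """Try each provider in priority order, re-running the workflow on
    the next one when the current provider fails."""
    last_error = None
    for provider in providers:
        try:
            return provider.run_workflow(workflow)
        except ProviderUnavailable as err:
            last_error = err   # remember why this provider failed
            continue           # re-route to the next provider
    raise RuntimeError(f"all providers failed, last error: {last_error}")

# Priority order: primary cloud, then the company's own data center,
# then an alternative service provider, e.g.:
#   run_with_failover(settle_orders, [primary_cloud, own_dc, alt_provider])
```

The point of the sketch is simply that the failover decision lives above any single provider’s API, which is the abstraction the paragraph above argues for.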

The recent outages at RBS, BATS Global Markets and others demonstrate not only an inability to abstract the process from the infrastructure, but also an inability to see the inter-dependencies and failures that plague complex IT systems.  In those particular outages, it took minutes to fix the problem but days to find it.

Process automation that keeps track of the complex inter-dependencies between applications, infrastructure and business workflows can help identify, or even predict, problems.  Then, in the case of an unavoidable outage, the business workflows would be re-routed to an available data center.
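
To see why tracked inter-dependencies can turn days of searching into minutes, consider this hedged Python sketch that walks a dependency graph down to the failing component. The graph, the component names and the health probe are all invented for the example.

```python
# Invented component names and health probe, purely for illustration.

DEPENDS_ON = {
    "payments-workflow": ["payments-app"],
    "payments-app":      ["app-server", "core-db"],
    "core-db":           ["storage-array"],
}

def find_root_causes(component, is_healthy):
    """Return the deepest unhealthy components below `component`:
    the likely root causes, rather than the visible symptom."""
    causes = []
    for dep in DEPENDS_ON.get(component, []):
        causes.extend(find_root_causes(dep, is_healthy))
    if causes:
        return causes                # something deeper is failing
    if not is_healthy(component):
        return [component]           # fails despite healthy dependencies
    return []

# During an outage the failure propagates upward, so everything on the
# path reports unhealthy; the walk still pins down the true culprit:
failing = {"payments-workflow", "payments-app", "core-db", "storage-array"}
print(find_root_causes("payments-workflow", lambda c: c not in failing))
# -> ['storage-array']
```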

Most process automation done today consists of low-level IT administrative tasks: provisioning servers, handling backup or startup routines, and generally doing infrastructure work that involves little decision-making of consequence to the line of business.  This is necessary and important, but not sufficient to preserve the user experience or business-process integrity in the face of increasingly complex IT environments where, statistically, something is always failing.

Enterprises must step up their IT process automation to the point that they can manage business workflows, not just servers or IT tasks.

If the businesses dependent on Amazon had these capabilities, they would drastically reduce the impact of the outages they experienced.  Orchestrating business workflows and associated data across applications and infrastructure is easier said than done.  However, it can be, and is being, done by many enterprises to assure service levels.

Being able to ‘roll back’ failed system updates to previous working versions, to spot process failures before they create an unrecoverable backlog, and to run a workflow on newly provisioned environments: this is the type of higher-level process automation that abstracts inevitable outages from the user or business experience.
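
The backlog check in particular is easy to make concrete. Below is a rough Python sketch under assumed numbers: the drain rate, recovery window and queue figures are illustrative only, not recommendations.

```python
# Illustrative numbers and probe, assumed for the example.

DRAIN_RATE = 200        # items the workflow can process per minute
RECOVERY_WINDOW = 60    # minutes of catch-up we are willing to tolerate

def backlog_is_recoverable(queue_depth, arrival_rate):
    """True while spare drain capacity can clear the backlog inside the
    recovery window; once False, re-route or roll back immediately."""
    spare = DRAIN_RATE - arrival_rate    # per-minute catch-up capacity
    if spare <= 0:
        return False                     # backlog can only grow
    return queue_depth / spare <= RECOVERY_WINDOW

print(backlog_is_recoverable(3000, arrival_rate=150))  # True: 60 min to clear
print(backlog_is_recoverable(9000, arrival_rate=180))  # False: 450 min to clear
```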

As enterprises get more serious about higher-level process automation, they will spend less time bemoaning outages and more time abstracting their processes from specific infrastructures and application environments.

Ready or not, the business is doing whatever it can to gain a competitive edge by becoming more agile and responding to a quickly changing market and customer base.  As business and IT people work together to create new internal capabilities and customer-facing features to outmaneuver the competition, they are developing exponentially more software applications at a faster pace, and launching them quickly on highly virtualized infrastructure.  Speed often translates into a lack of organization and into infrastructure sprawl, while agile IT practices mean more fluidity in where applications actually run.  All of this causes IT complexity and application-to-infrastructure dependencies to skyrocket.

These inter-dependencies, which represent potential breakage points, are beyond human ability alone to manage.  IT organizations are now forced to deal with these new realities while Cloud, Big Data, DevOps and ITaaS pressures are added to the mix in the name of business agility.  With all these moving parts, something needs to be stable and act as the IT backbone, and it is increasingly obvious that this backbone is the process and process control.

The days of designing the process to accommodate the shortcomings of infrastructure are over.  Enterprises must abstract, insulate and protect their business processes from the applications and infrastructures that support them.  The need for improved IT process automation is rising as the service and brand impact of online outages grows.


Comments: (4)

A Finextra member, 13 September 2012, 08:51

Correct observation: We do see more and more outages and glitches in critical IT systems these days.

Wrong conclusion: these incidents are NOT unavoidable if we choose the right technology for our critical IT infrastructure!

Today's mainstream IT technology is in essence low-cost PC technology blown up to enormous scale. This results in server farms of incredible complexity: too many components, too many interconnections, too many layers of different software, too many interfaces, too much room for human error and technology failure.

But there is other technology available, coming with less complexity and hence more reliability. There is even fault-tolerant technology, designed to be fail-safe, which would be the ideal fit for the critical IT systems increasingly affected by outages and glitches these days.

That technology is well proven; for instance, it supported the trading systems of the world's major exchanges for many years, until it was replaced by faster PC technology for the sake of high-frequency trading ...

Ketharaman Swaminathan - GTM360 Marketing Solutions - Pune, 13 September 2012, 11:38

@GerhardS: By "that technology is well proven, ... many years", if you're referring to Tandem and NonStopSQL, I always thought it got replaced - where it did - by PC technology for reasons of cost. I'm shocked to learn that PC technology was deemed faster than that!

A Finextra member, 13 September 2012, 12:31

@KetharamanS: There is always a tradeoff between reliability and speed, and when the "need for speed" in high-frequency trading became so prevalent, the choice was clear. Cost is another factor: while a single server box in PC technology is pretty cheap, the price tags for the big, complex server farms built from it are anything but. Cost considerations also change when you look beyond price tags and take operational cost and downtime cost into account too ...

A Finextra member, 13 September 2012, 16:56

On high-level business process automation:

My take here would be to cope with outages and glitches very early on – at the infrastructure level, where they occur – rather than trying to fix or work around the resulting mess much later, when it all reaches the business process level and becomes widely visible and really expensive.
