Managing unplanned IT outages

Though unplanned outages are IT issues, they are, more importantly, business issues.

Computers fail, and when they do the resulting outage can be one of the most costly experiences a CIO will oversee. Yet outages are also among the most ignored risks, that is, until they occur. Incident managers are the IT professionals tasked with restoring service in the event of an unplanned outage. Unplanned IT outages are known events that occur at unknown times. Organisations often fail to examine what their incident management teams actually do to restore service, and so cannot optimise that work. They also fail to examine the inventory of unplanned outages those teams handle, and so cannot understand its value.

Research recently completed by a doctoral student at Australian Catholic University’s School of Business (Incident Management: Human Factors and Minimising Mean Time to Restore (MTTR)) has investigated the characteristics and problem-solving approaches used by incident managers to restore service. This research has given some insight into practices and approaches that can be used to reduce the time taken to restore services.

The cost of unplanned IT outages

Though unplanned outages are IT issues, they are, more importantly, business issues. Unplanned outages never produce revenue for the company experiencing them. In 2008, the average revenue cost of an unplanned application outage was estimated to be nearly US$2.8 million per hour, according to a report by IBM Global Services. IT disruptions can affect the entire delivery chain of the products and services a business provides. Twenty-four percent of organisations surveyed by KPMG state that an unplanned outage of more than two hours is unacceptable. Another 48 percent state that they cannot manage when unplanned outages exceed 24 hours, according to an article on managing business continuity at Data Based Advisor.

Unplanned IT outages always cost money. A gap exists, and is widening, between the cost of unplanned downtime and an organisation’s ability to be effective with traditional response activities, according to the aforementioned report. Traditional responses, including rebooting servers and retrying restoration steps that have already failed, together with problems such as incident managers not understanding the technical environment, slow restoration and cost the business money.

When an unplanned outage occurs, the associated costs are not necessarily insignificant, nor are they necessarily directly related to the technical impact of the outage.

Corporate-brand damage can be a component of the cost of an unplanned IT outage. The Australian reported that this was the case when Qantas announced flight delays on the first morning its ground staff used a new computer system in July 2008. The damage was echoed on the morning of January 3rd, 2010, when a server failure caused check-in failures, with problems continuing through the day, according to media reports. Both outages resulted in delayed and cancelled flights. Worse, the latter drew heavy Twitter traffic and front-page, above-the-fold news coverage. The costs of unplanned outages can include any one of these components or a combination of them.

Costs of unplanned outages can include a lowered stock price, as experienced by Amazon.com, a major US bookseller with only an online presence, which was offline for two hours on a Friday in 2008 because of an unspecified technology system failure. The company’s stock fell 4.6 percent during trading that day. That cost was coupled with the revenue lost during the same outage. With reported revenues of US$1.8 million per hour, Amazon.com lost US$3.6 million in sales alone, as reported by Investor’s Business Daily.
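The lost-sales figure quoted for Amazon.com is hourly revenue multiplied by the length of the outage. The short Python sketch below reproduces that arithmetic. The function name `lost_revenue` is our own illustration rather than anything taken from the cited reports, and the calculation covers direct lost sales only, not stock-price or brand effects.

```python
def lost_revenue(revenue_per_hour_usd, outage_hours):
    """Direct lost sales: hourly revenue multiplied by outage duration.

    Hypothetical helper for illustration; it ignores indirect costs such
    as share-price falls, compensation payments or brand damage.
    """
    return revenue_per_hour_usd * outage_hours


# Figures quoted in the article: US$1.8 million per hour, two hours offline.
amazon_loss = lost_revenue(1_800_000, 2)
print(f"US${amazon_loss:,.0f}")  # prints US$3,600,000
```

The same function could be applied to any outage once an hourly revenue estimate is available, which is often the hardest number to obtain.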

Costs of unplanned outages can include any combination of the costs already cited and others, including compensatory payments, corporate replies to media inquiries and media reports, costs incurred to restore lost services, delays in the shipment of customer-purchased products or services, lost customer loyalty, and lost customer satisfaction. This list omits one cost that is always incurred, because all companies accept that unplanned outages will occur. That cost is the acceptance itself: some number of hours of unplanned outage, and the unknown costs that will be incurred when those outages do happen.

Adding 7 + 2 + 7

Unplanned outage types can be put into seven categories: Acts of Nature, Hardware, Humans inside the Affected Company, Humans outside the Affected Company, Software, System Overload, and Vandalism. Although outages can be typed, each one may require a different way to restore service. Whether an unplanned outage is of one type or a combination of types, the goal of the incident manager responsible is to restore service as quickly as possible.

Focus on the incident manager

As mentioned previously, in large organisations the incident manager is the person responsible for restoring service. The research investigated two problem-solving approaches that incident managers used when restoring service, and the characteristics those incident managers displayed. Questionnaires were completed by 154 incident managers working at one Fortune 500 corporation and one ASX 100 corporation.

The two approaches to solving problems are problem-focused (PFA) and solution-focused (SFA). Well-known problem-focused approaches include the Kepner-Tregoe Problem Solving and Decision Making process, Ishikawa diagrams, and Total Quality Management. Each is used to determine the root cause of a problem and eliminate it. Alternatively, solution-focused problem solving acknowledges that a problem exists, but puts all effort into finding and implementing a solution that removes the problem’s impact, not necessarily the problem itself.

The seven characteristics of incident managers that were studied and assessed were Being Decisive, Being Entrepreneurial, Being Authoritative, Being Demanding, Being Pragmatic, Being Facilitative, and Being Communicative.

The sum is incident management

The study found that Being Authoritative and using a solution-focused approach to solve problems resulted in the lowest MTTR (Mean Time to Restore) attained, irrespective of the type of unplanned IT outage experienced. Being Communicative and using a problem-focused approach to solving problems resulted in the highest MTTR attained. Does it matter?

Only when your interest in restoring service from an unplanned IT outage includes its cost to the business experiencing it. The research assessed the effectiveness of each individual characteristic displayed and its effect on MTTR, the two problem-solving approaches and their effect on MTTR, and the combination of the characteristic and the problem-solving approach and their effect on MTTR.
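To make the idea of comparing MTTR across combinations concrete, the sketch below averages restore times for each pairing of characteristic and problem-solving approach. It is a minimal illustration only. The incident records are invented for the example and are not data from the study, the function name `mttr_by_combination` is hypothetical, and a plain average is not the statistical analysis the researchers used.

```python
from collections import defaultdict
from statistics import mean

# Invented records for illustration only, NOT data from the study:
# (characteristic displayed, problem-solving approach, minutes to restore)
incidents = [
    ("Authoritative", "SFA", 30),
    ("Authoritative", "SFA", 50),
    ("Communicative", "PFA", 120),
    ("Communicative", "PFA", 160),
    ("Pragmatic", "SFA", 45),
]


def mttr_by_combination(records):
    """Return mean time to restore for each (characteristic, approach) pair."""
    grouped = defaultdict(list)
    for characteristic, approach, minutes in records:
        grouped[(characteristic, approach)].append(minutes)
    return {combo: mean(times) for combo, times in grouped.items()}


mttr = mttr_by_combination(incidents)

# List combinations from lowest (best) to highest MTTR.
for combo, minutes in sorted(mttr.items(), key=lambda item: item[1]):
    print(combo, minutes)
```

An organisation that records how each incident was handled could run the same grouping on its own incident log to see which combinations restore service fastest.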

What was discovered, as shown in Figure 1 below, is that the most effective combination of characteristic and problem-solving approach was Being Authoritative and using a solution-focused approach. The top 10 effective combinations of characteristic and problem-solving approach are also shown.

Note: In Figure 1, the smaller the value on the y-axis (the significance), the more effective the approach used and the characteristic displayed are in restoring service after an unplanned outage.

Figure 1: Actual Effective Characteristics and Problem-Solving Approaches to Minimise Unplanned IT Outages

The incident managers who responded to the incident management questionnaire identified themselves as not being pragmatic, and stated that being pragmatic was the least frequently displayed of all the characteristics studied (see Figure 2 below). However, being pragmatic proved to be among the most effective characteristics to display when restoring service from an unplanned IT outage, but only when used in conjunction with a solution-focused approach to solving problems.

Figure 2: Preference of Characteristics Displayed by Incident Managers

If we consider an analogy between an unplanned IT outage and a person having a heart attack, it is obvious that restoring service is more critical than understanding the root cause. Incident managers need to be trained to take a solution-focused approach to restoring service and to be made aware of the beneficial results that can be attained when being authoritative and pragmatic.

And what does that mean to you? It means that your unplanned IT outages will continue to occur; however, their duration can be minimised and their costs lowered.

About the authors:

Katherine O’Callaghan is director, KOZADAR Consulting. Sugumar Mariappanadar is a senior lecturer in management/HRM at the School of Business at the Australian Catholic University in Melbourne. Theda Thomas is national associate dean of Arts and Sciences at Australian Catholic University.
