Master of disaster

By their nature, large and complex systems are prone to failure. But it's the response to a major crisis that sets leading CIOs apart from their peers.

It's the nightmare of every chief information officer – a systems meltdown that unplugs vital services, causing embarrassment in the burning glare of media attention. But while working through a crisis is an integral part of every organisation's disaster plan, how does a CIO pick up the pieces after the initial problems are temporarily fixed?

Dealing with turbulence is a topic well understood by Murray Harrison, former CIO of the Australian Customs Service (now the Australian Customs and Border Protection Service). In late October 2005, as billions of dollars' worth of shipments were being processed in time for the Christmas season, disaster struck – a new system he was bringing online failed in dramatic fashion.

Many of the systems Customs used to track and offload shipments depended on the new cargo management system, which refused to co-operate and failed day after day.

But ships and teams of dockers slowly grinding to a halt were only the start of the problem for Harrison's technology department. Six years on, he remembers the incident with clarity. "It was a very difficult period and it took a lot of Customs officials working very long hours to get through it all," he says.

News of the problem spread like wildfire across the media, but it was only when radio shock jocks and tabloids raised the possibility of Christmas gifts not arriving in time that the heat really started to burn.

"The media distracts your attention from the issue," Harrison says. "I took the view from the start that I needed to take as much of that role away from the technicians working on the problem as possible.

"The problem is that the media attracts the attention of a whole lot of different people, who then ask questions that you have to spend time answering."

Politicians, fearful of being labelled as the Grinches who stole Christmas, sent urgent inquiries and made wild assumptions as accusations rang out between the state and federal governments.

To make things worse, IT experts contacted by journalists speculated further and discussed worst-case scenarios, forcing an already pressured technology team to use valuable time denying allegations.

"A lot of experts pop up who may or may not know what they're talking about," Harrison says. "But because they're experts they need to be responded to in some way or another.

"The hardest part was that some of the media coverage was simply wrong."

Harrison says it is vital for CIOs to acquit themselves well during the incident – given it will inevitably be analysed in minute detail for months after the event. "A CIO has to think of how to deal with the whole issue and not tactically about the latest development, even though everyone else is thinking that way," he says. "The most pressing might be the latest media article but it's not necessarily the most important thing to deal with.

"It's no use playing the blame game and saying 'it wasn't my fault' because it's your job, your responsibility, and no one is going to give you tissues if it all falls over."

Former Qantas and Telstra CIO, Fiona Balfour, says the media is best handled by the corporate communications team, at arm's length from the technicians being paid to fix the actual problem.

"The CIO shouldn't be the pivotal point if a massive customer-facing system goes down," she says. "They should be appropriately briefing colleagues and a disaster room should be set up with all the communications lines, and everything should be co­-ordinated from that room."

While it might sound obvious, the core strategy must always be to solve the problem as quickly and smoothly as possible.

"The first step in preventing a disaster is to plan for it as you implement the ­system," says Balfour, who is now a private consultant and board member. "Many organisations have become a ­little lazy in how they plan their systems because the technology platforms are much more resilient than they were 30 years ago.

"Your business should have business processes in the event that there is no availability and rehearse them from time to time.

"This means when disaster occurs ­people respond by executing the plan rather than panicking, and so a lot of heat goes out of the situation."

But fixing the problem is only part of the battle. Once the initial problem has subsided, and journalists move on to cover a new crisis somewhere else, the uncomfortable but vital internal investigations begin.

Balfour says a post-incident review (PIR) should immediately become the number-one priority of any technology team to ensure an embarrassing repeat performance does not occur.

"The PIR must be conducted quite ­clinically," she says. "The interesting ­problem is that some staff don't speak the truth and [they] exaggerate the impact.

"As soon as you have systems restored you must work out what went wrong, why, and whether or not it can be prevented from happening again."

Ovum research director and former AusIndustry CIO, Kevin Noonan, says it's important for any executive to take control as well as responsibility where it's due, no matter how tough the reviews and audits become.

"There's a tendency to go to ground, hope it blows over and pretend it never happened," Noonan says. "People try to find someone else to blame or dive into technical details.

"You shouldn't go down dark holes of what potentially could have happened. If you don't know what happened, just admit it."

While high-profile private-sector failures can be a horrifying prospect for a company, Noonan says government agency CIOs have it much harder because inquiries come on three fronts: through the media, post-incident audits and painfully political committee hearings.

"The hearings inevitably mean another turn through the media wringer," he says. "You tell the truth and stay with the same story and facts through all of them.

"There are a dozen ways things can go wrong so it's important to get out and admit to mistakes when you know about them."

No matter how tempting it may be, when pressed by journalists and politicians, to promise an incident or problem will never happen again, Noonan warns that it is never a good idea because IT is all about minimising risk rather than eliminating it entirely.

"This is where auditors can become your new best friend," he says. "Internal performance auditors look at whether risks are correctly covered.

"This is a chance to either ask for more dollars or for better facilities on the basis of fact because it's hard to convince CEOs that IT is money well spent until it's been the basis of an actual problem."

He says the best way to work with auditors is to be honest and open while treating them as human beings, rather than the enemy.

"You have to make sure the auditors are engaged both when problems happen and when contracts are entered into in the first place," Noonan says. "That way these issues are thought through and CIOs can turn to clear documentation trails where these things have been considered and dealt with."

Harrison says every executive and manager must ultimately accept they may occasionally be at fault and take some portion of the blame where needed.

"We had a number of reviews, some public and some internal," he says. "They were very clinical about the decisions we took and, from a CIO's point of view, the one thing you can't do is try to cover up the issues.

"Just explain what you did [during the crisis], why you made those decisions and what the outcomes were before allowing others to pass judgment."

Once the recommendations come through, the CIO's job is to implement and not actively fight them out of pride.

"Just make sure you don't take it ­personally because you make decisions based on whatever information you have and some turn out to be sound, while ­others turn out to be questionable," he says. "That's just the way of the world."

Of course, technology management is only part of a chief information officer's role. Harrison says managing people is vital after the event because staff morale can take a hit during inquiries.

"Morale is rarely an issue during the crisis because you put the right people to their task and they know what they're doing," he says. "A positive point is that normally the IT people who know the ­situation are very sympathetic and understand why something has happened."

Balfour says that, if anything, major incidents tend to have the opposite effect as teams are brought together and work better than ever to fix the problem.

"If you do have an outage and you can get back-up by successfully following the plan, then it actually gives a huge boost to morale because it says 'our process has worked'," she says. "But if you have a disaster and your processes fail, that's when morale gets damaged."

For Harrison, it's important for all CIOs to remember that there is light at the end of the tunnel. In his case, the tech failures at Customs lasted a very painful 10 days, not the months that some media outlets had warned of. Christmas (and Santa) came, and most gifts were delivered with only slight delays.

"Despite the incident and the difficult implementation, we had our busiest October on record," Harrison recalls.

"And most of the reports we've heard say it is a very good system that has held up ever since."

Sidebar: How to cope when trouble hits

  • Don't take short cuts when planning a system implementation – the best way to cope with disaster is to avoid it in the first place.
  • If disaster does strike, consider the problem as a whole – avoid getting caught up in the latest developments being discussed on talkback radio.
  • For those with the task of fixing the problem, protect them from the media glare – answering questions while there are fires to put out will be an unwanted distraction.
  • Don't play the blame game – pointing fingers will cause divisions in the ranks and make matters worse rather than better.
  • Work through the recovery plan – this should help restore normal working order and provide a timely reminder that processes usually work.
  • Conduct a post-incident review – once disaster has struck, one of the biggest priorities must be to ensure that the same mistake doesn't happen again.
  • Never promise there will never be another failure – IT's job is to minimise risk, not to eradicate it.
  • Don't take things personally – decisions are taken on information available at the time and everybody makes mistakes.
  • Use adversity to your advantage – teams pull together when they have their backs against the wall and will emerge stronger with good leadership.

MIS Australia
