ITIL Major Incident Management - Article 4
1) when a single incident results in total service failure for more than one customer
2) when a major customer has a total loss of a service at either a critical location or a critical time in their business calendar
Case 1) speaks for itself, but case 2) requires a little discussion because it requires a level of intimacy with the major customer's business. This level of intimacy can only really come from using a knowledge base, or at least some form of system support, to provide this information to your service desk team. As an example, the critical location may be a distribution centre for a retail organisation: if their network connection or main warehouse system is down, they will not be able to receive JIT orders from their retail outlets. Another example might be a specific printer that is critical to producing board presentations for the two days prior to each board meeting. Finally, it may be the billing system of a telecommunications company during the first few days of each month, when the bills have to be produced and mailed out to customers. I have first-hand experience of each one of these examples, but you can appreciate the wide variety there might be.
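To make the two cases concrete, here is a minimal sketch in Python of how a service desk tool might apply them. It is only an illustration under my own assumptions: the names `CriticalEntry` and `is_major_incident`, the shape of the knowledge base, and the customer names are all hypothetical, and a real knowledge base would hold far richer information about each major customer.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class CriticalEntry:
    """Hypothetical knowledge-base record for a major customer.

    location=None means the service is critical at any location.
    The date window is only checked when both start and end are set;
    otherwise the service is treated as critical at all times.
    """
    customer: str
    service: str
    location: Optional[str] = None
    start: Optional[date] = None
    end: Optional[date] = None

    def applies(self, customer: str, service: str, location: str, on: date) -> bool:
        if (self.customer, self.service) != (customer, service):
            return False
        if self.location is not None and self.location != location:
            return False
        if self.start is not None and self.end is not None:
            return self.start <= on <= self.end
        return True


def is_major_incident(total_failure: bool, customers: Set[str], service: str,
                      location: str, on: date,
                      knowledge_base: Iterable[CriticalEntry]) -> bool:
    """Apply case 1) and case 2) to a single incident."""
    if not total_failure:
        return False
    # Case 1: total service failure for more than one customer.
    if len(customers) > 1:
        return True
    # Case 2: a major customer loses a service at a critical location or time.
    entries = list(knowledge_base)
    return any(entry.applies(customer, service, location, on)
               for customer in customers
               for entry in entries)


# Illustrative knowledge base, loosely based on the examples above.
KNOWLEDGE_BASE = [
    CriticalEntry("RetailCo", "warehouse system", location="distribution centre"),
    CriticalEntry("TelcoCo", "billing system",
                  start=date(2024, 3, 1), end=date(2024, 3, 5)),
]
```

The point of the sketch is that case 2) cannot be evaluated without the knowledge-base entries: the same outage of the billing system is a major incident on the 2nd of the month and an ordinary incident on the 20th.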
So now we know what a major incident might be, how do we go about managing it?
Here's where a little confusion reigns, because although this is a major incident it's normally managed by, in ITIL terms, the Problem Manager (more about the role in a separate article later). In previous assignments I have referred to this role as the Incident Manager rather than use the ITIL term, and this alleviates the confusion.
Due to the impact of a major incident, either on multiple customers or on a very important customer, the likelihood is that senior management within your organisation will find out about it via a customer if a structured approach to the management and communication of the incident is not followed. The Incident Manager is responsible for both the co-ordination of resources to restore service and the communication process to keep customers and senior management up to date with progress. I cannot stress the importance of this communication process enough.
An approach that I have used with a global investment bank and also with a global telco is as follows:
The incident manager sets up a conference call with all resolver groups (internal, and suppliers if necessary) and identifies a technical leader who will remain in touch with progress from the engineering staff. That technical leader will be the spokesperson on that conference call and will keep the incident manager updated with progress. He will also be required to put together a timeline for the diagnostics and work that the engineering teams will carry out, to share with the incident manager. This has the benefit of keeping the buzzards off the backs of the engineers while they are working to restore service. The incident manager will act as the catalyst for escalation if this is required in order to restore service.
The incident manager then moves to the communication process and sets up a conference call with affected customers or, in the case of a larger organisation like the global telco, the customer service managers. This is an update call to give progress and communicate the diagnostic plans to those customers. The incident manager needs to be confident, clear and concise in this conference call. Whereas on the technical call there might be some element of discussion about progress, negotiation on timescales and escalation, in the customer conference call he needs to give clear and confident information about progress towards service restoration. He may also take information from customers about additional services affected and business impact.
The incident manager is also responsible for taking the question of whether to escalate or not out of the hands of the engineering staff. It may be that a level of expertise is required that is only available from suppliers, whereas the internal engineering staff may be reluctant to call them because they want to resolve the incident themselves.
Once the service has been restored by either a workaround or a permanent solution, the incident moves into the problem and change management processes, which will be covered in later articles.