Incident Management Statistics: Lies, Damned Lies and Falsehoods

“Facts are stubborn, but statistics are more pliable.”
—  Mark Twain

Statistics can be interpreted to support a weak argument or manipulated to deflate a strong one, yet they are the undisputed yardstick by which we measure service performance.  How is it that something as simple as measuring the time involved in resolving an incident can be open to interpretation?  How can there be differing opinions on what “first contact” means?

When developing their component of a Balanced Scorecard, IT organizations often focus more on internal processes than the customer experience.  They also tend to develop processes and procedures that suit their needs or capabilities without asking the recipient of their services what they want, need, or can afford.

When measuring a service provider’s performance, it is essential that we measure and report on both the organization’s ability to meet customer expectations and the internal processes used to provide that service.

I have seen organizations manipulate their performance statistics by reinterpreting seemingly indisputable data such as business hours.  Internally they would measure their performance against their own working hours even though these were different from the client’s working hours.  A service recipient operating on a 24×7 basis clearly has a different view of a 36-hour impact than a service provider operating on a 5×9 basis.  Obviously, the appropriate resolution here is to have true Service Level Agreements in place defining and reconciling these differences.  Without this, each group may have its own definition of a day and naturally manipulate the statistics in its favor.
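To make the gap concrete, here is a minimal sketch of the two clocks. The calendars and hours below are assumptions for illustration, not taken from any particular SLA: the customer counts every elapsed hour, while the provider counts only weekday hours inside its working window.

```python
from datetime import datetime, timedelta

def elapsed_hours(start, end, open_hour=None, close_hour=None):
    """Count elapsed hours between start and end.

    With no open/close hours, every hour counts (the 24x7 customer view).
    With open_hour/close_hour set, only weekday hours inside that window
    count (e.g. a 5x9 provider calendar). Hour-granularity for brevity.
    """
    total = 0.0
    cursor = start
    while cursor < end:
        if open_hour is None:
            total += 1  # 24x7: every hour counts
        elif cursor.weekday() < 5 and open_hour <= cursor.hour < close_hour:
            total += 1  # inside the provider's working window
        cursor += timedelta(hours=1)
    return total

# A hypothetical 36-hour outage starting Friday 08:00
start = datetime(2010, 1, 8, 8, 0)   # a Friday
end = start + timedelta(hours=36)    # Saturday 20:00

client_view = elapsed_hours(start, end)                                # 36.0
provider_view = elapsed_hours(start, end, open_hour=8, close_hour=17)  # 9.0
```

On the same outage, the 24×7 customer sees 36 hours of impact while the 5×9 provider records only the 9 business hours it was "open" on Friday, which is exactly the discrepancy an SLA must reconcile.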

I have also witnessed organizations attempt to “stop the clock” on internal performance measurements while organizations or processes external to theirs are called into play.   They somehow mistakenly believe they have no accountability for this subordinate work, again failing to take into account the customer’s perspective.

Let’s say you and I were sitting in a restaurant and were going to share a pepperoni pizza.  Our server takes our order and places it with the kitchen; the chef promptly rolls out the pizza dough and puts the sauce on top.  Whoops… no more mozzarella!  He calls out to the server and explains the problem.  The server then informs the manager of the missing ingredient.  The dutiful manager grabs a $20 bill from the till, runs out to the local grocery store, and brings back the missing cheese.   The sous-chef then grates the cheese and hands it off to the chef, who places it on the sauce along with the pepperoni.  Next, the chef places the pizza in the oven; 12 minutes later it’s out, and he alerts our server.  Finally, the pizza is whisked to our table and we’re served.  Sounds like great customer service on the part of the manager, right?  Well, maybe.

How long did it take to make that pizza?  The chef spent less than two minutes actually making the pizza.  The actual baking time was only twelve minutes.  The problem is the 45 minutes we sat waiting for the manager to run to the grocery store.  If I were the chef, I could argue I only spent two minutes making the pizza.  The server could argue the pizza was delivered to our table less than a minute after it was done baking, and so on.  However, we, as the customers, would say it took an hour to be served and would never patronize that restaurant again.

Lastly, at the end of the week the restaurant owner reviews the receipts and reprimands the chef for taking an hour to make a single pizza.  In defending his performance the chef argues that this was an anomaly and shouldn’t be counted in his weekly performance as it skewed the data.

Had either the manager or our server reset our expectations when alerted to the missing ingredient, we would have been given the ability to either order a different item or enjoy an appetizer while we waited.  It’s all about communication and expectations.  Perhaps we didn’t receive great customer service after all, but I’ll save that for a later installment.

It is easy to see how performance metrics can be interpreted in different ways.   Is the chef correct in his assessment that the hour-long pizza preparation is anomalous and “shouldn’t be counted against” him, or do we need to include it, report it as an outlier, and, with it identified, work toward eliminating such outliers?

This authoritative report, entitled Incident Management Statistics, provides a clear, concise, rational method for gauging performance metrics and explains the importance of outliers and why they must be included when measuring performance.  Grab a copy and read through the real-world, thought-provoking examples.



~ by Marc Hedish on January 13, 2010.

2 Responses to “Incident Management Statistics: Lies, Damned Lies and Falsehoods”

  1. Very strong points; however, I am struggling with the “stop the clock” portion and have some questions.

    I work for an organization that has SLAs in place for our customer contracts; we are on the hook for a response time of within 8 business hours on a medium-level incident. This is what the client agreed to pay, although we are a 24-hour support shop and offered more aggressive response options to avoid downtime.

    I will use the example of a medium-level incident where the client was impacted but could still work. Let’s use an Internet Explorer issue where the client could not access a specific site in IE, but could in Firefox. Both apps are allowed by the company and are within the standard build of the workstation, so the user has access and rights to use either one based on preference.

    Sometime during the night a GPO update was performed and the allowed exe list for approved applications was corrupted. Due to the change induced from the AD side of the house, IE was no longer allowed to execute on the local workstation. Firefox was still in the approved list and functioning. Now, my group does not have access to modify the allowed exe list GPO, and we have to redirect to another group once we validate a workaround is in place. My team cannot do the work; we do not have the access or the rights to perform it. Once we redirect the ticket to the proper group, how do you suggest we measure the time? Do we continue to clock the incident as active? Do we turn the whole issue over to the next level? Are we still accountable for the issue with the customer even though we are no longer working the issue? It would seem that once we have done all we have access to do, then we are no longer responsible for the issue, since we validated a workaround and passed the issue on to the proper group to correct.

    What are your thoughts on this situation? We see it very often, especially during high-traffic issues where people want to pass off tickets quickly to look like they are doing a great job.


  2. In your example, the clock issue really doesn’t come into play. As you described, your service desk was able to provide a workaround. At that point, the user is no longer impacted. There is no longer a reduction of service; therefore the incident can be closed and the clock would immediately stop.

    If, as you describe, the service desk was able to correlate the service impact with the RFC for the GPO update we would now have a known error. A known error is a condition identified by successful diagnosis of the root cause of a service issue, and the subsequent development of a workaround.

    The known error would be tracked independently of the original incident. At no time does an incident become a known error just as an incident does not become a problem. The clock for the known error and its subsequent correction are independent of the original incident.

    What I was referring to was an instance where a known error was not determined.

    If, for example, IE was impacted by the GPO update and Firefox wasn’t an option, the incident would remain open until either the GPO update was reversed or Firefox was installed as a workaround. I have seen instances where the service desk was attempting to put the clock “on hold” until the GPO update was backed out. As long as service is impacted and no workaround is available, the clock continues to run.

    A better example would be if a user were impacted by a dead hard drive. A viable workaround may be a hot swap of either a replacement drive or an entire system. If a hot swap were unavailable and your technicians needed to order replacement parts, the incident clock would continue UNTIL SERVICE IS RESTORED. You could not put the incident “on hold awaiting parts.”

    Your technician may only spend five minutes diagnosing the issue and another 20 minutes replacing the drive/system but if they had to wait 3 days for the parts, the incident length must span all 3 days.

    My suggestion is to create subordinate tickets within your TTS to track the parts order. You would then have the ability to report all the levels of performance.

    1. The overall incident length
    2. The time involved in diagnosing and replacing the impacted system
    3. The time and information involved in obtaining the replacement parts

    Item three is especially important if the replacement parts were obtained from a vendor or even a separate department within your organization.
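    As a rough sketch of that three-level reporting (the timestamps and the three-day parts wait below are hypothetical, and a real implementation would pull them from the TTS rather than hard-code them), the arithmetic might look like:

    ```python
    from datetime import datetime, timedelta

    # Illustrative timeline for the dead-hard-drive example
    opened    = datetime(2010, 1, 11, 9, 0)         # incident opened
    diagnosed = opened + timedelta(minutes=5)        # 5 min to diagnose
    parts_in  = opened + timedelta(days=3)           # parts arrive 3 days later
    restored  = parts_in + timedelta(minutes=20)     # 20 min to swap the drive

    # Level 1: the overall incident length -- spans all 3 days
    overall_incident = restored - opened

    # Level 2: hands-on time diagnosing and replacing the system
    hands_on_time = (diagnosed - opened) + (restored - parts_in)

    # Level 3: the wait on the subordinate parts-order ticket
    parts_wait = parts_in - diagnosed
    ```

    The point of splitting the numbers this way is that the 25 minutes of hands-on work and the multi-day vendor wait are reported separately, yet the incident itself still honestly spans the full three days.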
