Control Charts: A ‘how to’ guide

A key component of Deming’s ‘Theory of Profound Knowledge’ concerns the measurement of performance (of a system) and the ‘Theory of Variation’.

I’ve noticed over the years that, whilst the foundational points around variation can be well understood, the use of control charts within operational practice can be ‘absolutely butchered’ (technical term 🙂 ).

 

This caused me to write a ‘how to’ guide a while back, for me and my colleagues.

I recently ‘dusted it down’ and tidied it up into a version 2.0 so that I can share it more widely with anyone who might find value in it.

I attach it as a pdf document for anyone interested:

Control charts – a how to guide V2.0

It doesn’t replace the excellent writings of Donald Wheeler…though it hopefully makes you curious to ‘pull’ his writings towards you.

It doesn’t tell you what to measure…because it couldn’t!

It doesn’t ‘do it for you’…but, hopefully, it does give you enough so that you can experiment with doing it for yourself.

…and it can’t beat working alongside someone who knows what they are doing, and can act as your coach.

 

Note: If you do end up using/ sharing this guide then I’d be grateful if you could add a simple comment at the bottom of this page so that I am aware of this. Not because I’m going to invoice you (I’m not!)…but because I would find this knowledge useful (#feedback).

You might tell me: what you thought of it (warts and all), where you might use it, whether you have shared it with others (and whether they appreciated this or not!)… and if it has improved your measurement practices.

Thanks, Steve

Targets on measures of targets on measures of things

In this post I’m going to differentiate between:

  1. Measures of things
  2. Targets on (measures of things)
  3. Measures of (targets on (measures of things)); and
  4. Targets on (measures of (targets on (measures of things)))

Wow, that last one is hard to write, let alone say out loud! You might think that it’s a nonsense (which it is) but, sadly, it’s very common.

Note: I added the brackets to (hopefully) make really clear how each one builds on the last.

I’ll attempt to explain…

1. Measures of things:

Seems straightforward enough: I’m interested in better understanding a thing, so I’d like to measure it¹.

Some examples…

A couple of personal ones:

  • What’s my (systolic) blood pressure level? or
  • How quickly do I ride my regular cycle route?

A couple of (deliberately) generic work ones:

  • How long does it take us to achieve a thing? or
  • How many things did we achieve over a given period?

Here’s a graph of a measure of a thing (in chronological order):

Nice, we can clearly see what’s going on. We achieved 13 things in week 1. Each thing took us anything between 2 and 36 days to achieve…and there’s lots of variation in between.

It doesn’t surprise me that it varies² – it would be weird if all 13 things took, say, exactly 19 days (unless this had been structurally designed into the system). There will likely be all sorts of reasons for the variation.

However, whilst I ‘get’ that there is (and always will be) variation, the graph allows us to think about the nature and degree of that variation: Does it vary more than we would expect/ can explain?³ Are there any repeating patterns? Unusual one-offs? (Statistically relevant) trends?

Such a review allows us to ask good questions, which we can then investigate and learn from.

“Every observation, numerical or otherwise, is subject to variation. Moreover, there is useful information in variation.” (Deming)
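One way to put numbers on the question ‘does it vary more than we would expect?’ is a process behaviour chart of the kind Wheeler describes, such as the XmR (individual values and moving range) chart. The sketch below is a minimal illustration, assuming the measurements are in chronological order; the function names are hypothetical, and it applies only the simplest detection rule (a point outside the natural process limits), not the run or trend rules.

```python
from statistics import mean

# Standard XmR scaling factors for moving ranges of two successive values
NPL_FACTOR = 2.66   # natural process limits = average ± 2.66 × average moving range
URL_FACTOR = 3.268  # upper range limit = 3.268 × average moving range


def xmr_limits(values):
    """Return (average, lower limit, upper limit, upper range limit) for values in time order."""
    if len(values) < 2:
        raise ValueError("need at least two values to form a moving range")
    moving_ranges = [abs(later - earlier) for earlier, later in zip(values, values[1:])]
    centre = mean(values)
    avg_mr = mean(moving_ranges)
    spread = NPL_FACTOR * avg_mr
    return centre, centre - spread, centre + spread, URL_FACTOR * avg_mr


def points_outside_limits(values):
    """1-based positions (matching 'instance' numbers) of values beyond the natural process limits."""
    _, lower, upper, _ = xmr_limits(values)
    return [i for i, value in enumerate(values, start=1) if value < lower or value > upper]
```

Values inside the limits are best treated as routine variation from the same system; a value outside them is a signal worth investigating. For durations in days, a negative lower limit simply means no value can fall below it.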

2. Targets on (measures of things):

Let’s say that we’ve been asked to achieve a certain (arbitrary4) target.

Here’s an arbitrary target of 30 days (the red line) set against our measure:

And here’s how we are doing against that target, with some visual ‘traffic lighting’ added:

Instance (X)                        1   2   3   4   5   6   7   8   9  10  11  12  13
Target of 30 days met? (Yes/No)     N   Y   Y   N   Y   Y   Y   Y   Y   Y   Y   N   Y

We’ve now turned a rich analogue signal into a dull digital ‘on/off’ switch.

If we only look at whether we met the target or not (red vs. green), then we can no longer see the detail that allowed us to ask the good questions.

  • We met ‘target’ for instances 2 and 3…but the measures for each were quite different
  • Conversely, we met ‘target’ for instances 5 all the way through to 11 and then ‘suddenly’ we didn’t…which would likely prompt us to question instance 12 intensely (and yet not see, let alone ponder, the variation between 5 and 11).

The target is causing us to ask the wrong questions⁵, and miss asking the right ones.
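The loss of information is easy to demonstrate. A minimal sketch, assuming ‘met’ means ‘achieved in 30 days or fewer’ (the post doesn’t pin down the boundary) and using made-up durations: very different instances collapse onto the same letter.

```python
def met_target(durations_in_days, target_days=30):
    """Reduce each duration to 'Y' (within target) or 'N' (missed): the traffic-light view."""
    return ["Y" if days <= target_days else "N" for days in durations_in_days]


# Hypothetical durations: 2 and 29 days look identical, as do 31 and 60 days
example = [2, 29, 31, 60]
print(list(zip(example, met_target(example))))
```

Once only the letters are kept, nothing distinguishes the 2-day instance from the 29-day one.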

3. Measures of (targets on (measures of things)):

But I’m a fan of measures! So, let’s show a measure over time of how we are doing against our target.

In week 1 we met our 30-day target for 10 out of our 13 instances, which is 77%. Sounds pretty good!

Here’s a table showing how many times we met target for each of the first five weeks:

Week                              1    2    3    4    5
Things achieved                  13   15   14   11   12
Number meeting 30-day target     10   14   12    7    8
% meeting 30-day target         77%  93%  86%  64%  67%
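The percentage row is simple arithmetic on the two count rows. A short sketch that recomputes it, assuming the figures were rounded to the nearest whole percent:

```python
things_achieved = [13, 15, 14, 11, 12]      # weeks 1 to 5
meeting_30_day_target = [10, 14, 12, 7, 8]


def percent_meeting(met, total):
    """Share of things achieved within target, as a whole-number percentage."""
    return round(100 * met / total)


weekly_percent = [percent_meeting(m, t) for m, t in zip(meeting_30_day_target, things_achieved)]
print(weekly_percent)
```

Nothing in these five numbers tells you how long any individual thing actually took.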

Let’s graph that:

It looks like we’ve created a useful graph, just like in point 1.

But we would be fooling ourselves – we are measuring the movement of the dumbed-down ‘yes/no’ digital switch, not the actual signal. The information has been stripped out.

For example: There might have been huge turbulence in our measure of things in, say, week 3 whilst there might have been very little variation in week 4 (with lots of things only just missing our arbitrary ‘target’)…we can’t see this but (if we want to understand) it would be important to know – we are blind but we think we can see.

4. Targets on (measures of (targets on (measures of things))):

And so, we get to the final iteration:

How about setting an arbitrary target on the proportion of things meeting our arbitrary target…such as achieving things in 30 days for 80% of the time (the red line)…

And here’s the table showing how we are doing against that target:

Week number                         1   2   3   4   5
80% Target on 30-day Target met?    N   Y   Y   N   N

Which is a double-dumbing down!

We’ve now got absolutely no clue as to what is actually going on!!!

But (and this is much worse) we ‘think’ we are looking at important measures and (are asked to) conclude things from this.

The table (seemingly) tells us that we didn’t do well in weeks 1, 4 and 5, but we did in weeks 2 and 3…
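Layering the 80% target on top is just one more comparison. A minimal sketch, assuming ‘80% of the time’ means a weekly proportion of at least 0.8, and working from the weekly counts rather than the rounded percentages so that rounding can’t tip a week across the line:

```python
from fractions import Fraction

things_achieved = [13, 15, 14, 11, 12]      # weeks 1 to 5
meeting_30_day_target = [10, 14, 12, 7, 8]
TARGET_SHARE = Fraction(80, 100)


def met_share_target(met, total, target=TARGET_SHARE):
    """'Y' if the exact proportion meeting the 30-day target reaches the target share."""
    return "Y" if Fraction(met, total) >= target else "N"


print("".join(met_share_target(m, t) for m, t in zip(meeting_30_day_target, things_achieved)))
```

Week 1 (10 of 13) and week 4 (7 of 11) both come out as ‘N’, even though they are quite different weeks.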

The base data series used for this example:

In order to write this post, I used the Microsoft Excel random number generator function. I asked it to generate a set of (65) random numbers between 1 and 40 and then I broke these down into imaginary weeks. All the analysis above was on pure randomness.
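The experiment is easy to repeat. A minimal sketch in Python, where `random.randint(1, 40)` stands in for the Excel random number function (an assumption: the post doesn’t say which Excel function was used) and the week sizes are those from the weekly table. Run it a few times and each run tells a different, equally meaningless, ‘performance story’.

```python
import random

WEEK_SIZES = [13, 15, 14, 11, 12]  # things achieved per week, totalling 65


def random_series(seed=None):
    """65 random 'days to achieve' values between 1 and 40 inclusive."""
    rng = random.Random(seed)
    return [rng.randint(1, 40) for _ in range(sum(WEEK_SIZES))]


def split_into_weeks(values, sizes=WEEK_SIZES):
    """Break the series into consecutive (imaginary) weeks."""
    weeks, start = [], 0
    for size in sizes:
        weeks.append(values[start:start + size])
        start += size
    return weeks


def weekly_report(weeks, target_days=30, target_share=0.8):
    """Per week: (week, things, number within target, % within target, share target met?)."""
    report = []
    for number, week in enumerate(weeks, start=1):
        met = sum(1 for days in week if days <= target_days)
        share = met / len(week)
        report.append((number, len(week), met, round(100 * share), share >= target_share))
    return report


if __name__ == "__main__":
    for row in weekly_report(split_into_weeks(random_series())):
        print(row)
```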

Here’s what the individual values look like when graphed over time:

(Noting that instances 1 – 13 are as per the graph at point 1, albeit squashed together)

Some key points:

  • There is nothing special about any of the individual data points
  • The 30-day target has got nothing to do with the data
  • There is nothing special about any of the five (made up) weeks within
  • The 80% target on the 30-day target has got nothing to do with anything!

The point: Whilst I would want to throw away all the ‘targets’, ‘measures of target’ and ‘targets on measures of target’…I would like to understand the system and why it varies.

This is where our chance of improving the system is, NOT in the traditional measures.

Our reality:

You might be laughing at the above, and thinking how silly the journey is that I’ve taken you on…

…but, the ‘targets on (measures of (targets on (measures of things)))’ thing is real and all around us.

  • 80% of calls answered within 20 seconds
  • 95% of patients discharged from the Emergency department within 4 hours
  • 70% of files closed within a month
  • [look for and add your own]

Starting from a position of targets and working backwards:

If you’ve got a target and I take it away from you…

…but I still ask you “so tell me, how is [the thing] performing?” then what do you need to do to answer?

Well, you would now need to ponder how the thing has been performing – you would then need to look at a valid measure of the thing over time and ponder what this shows.

In a nutshell: If you’ve got a target, take it away BUT still ask yourself “how are we doing?”

A likely challenge: “But it’s hard!”

Yes… if you peel back the layers of the ‘targets on targets’ onion so that you get back to the core of what’s actually going on, then you could be faced with lots of data.

I see the (incorrect) target approach as trying to simplify what is being looked at so that it looks easy to deal with. But, in making it look ‘easy to deal with’, we mustn’t destroy the value within the data.

“Everything should be made as simple as possible, but no simpler.” (attributed to Einstein)

The right approach, when faced with a great deal of data, would be to:

  • Look at it in ways that uncover the potential ‘secrets’ within (such as a histogram or a time-series plot); and
  • Understand how to disaggregate the data, so that we can split it up into meaningful sub-groups. We can then:
    • compare sub-groups to consider if and how they differ; and
    • look at what’s happening within each sub-group (i.e. comparing apples with apples)
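A minimal sketch of those two steps: a quick text histogram, and splitting the data by a sub-group before summarising each group separately. The record layout (a `type` field and a `days` field) and the records themselves are hypothetical; the point is to keep the spread visible rather than collapse it into one number.

```python
from collections import Counter, defaultdict
from statistics import median


def text_histogram(values, bin_width=5):
    """One line per bin, with a '#' for each value that falls in it (empty bins included)."""
    if not values:
        return []
    counts = Counter(value // bin_width for value in values)
    lines = []
    for b in range(min(counts), max(counts) + 1):
        low = b * bin_width
        lines.append(f"{low:>3}-{low + bin_width - 1:<3} {'#' * counts[b]}")
    return lines


def by_subgroup(records, key, measure="days"):
    """Group the measure by a sub-group field, e.g. by type of work."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[measure])
    return dict(groups)


def summarise(groups):
    """Count, minimum, median and maximum per sub-group: apples compared with apples."""
    return {name: (len(v), min(v), median(v), max(v)) for name, v in groups.items()}


# Hypothetical records, for illustration only
records = [
    {"type": "simple", "days": 3},
    {"type": "simple", "days": 5},
    {"type": "simple", "days": 4},
    {"type": "complex", "days": 30},
    {"type": "complex", "days": 36},
]
print("\n".join(text_histogram([r["days"] for r in records])))
print(summarise(by_subgroup(records, "type")))
```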

To close:

If you are involved in ‘data analysis’ for management, I don’t think your role should be about ‘providing the simple (often 1-page) picture that they’ve asked for’. I would expect that you would wish your profession to be along the lines of ‘how can I clearly show what’s happening and what this means?’

If you are a manager looking at measures: why would you want an (overly) simple picture so that you can review it quickly and then move on to making decisions? Wouldn’t you rather understand what is happening and why … so that good decisions can be made?

Footnotes

1. Measurement of things – a caution: We should be careful not to fall into the trap of thinking that everything is measurable or, if we aren’t measuring it, then it doesn’t matter.

There’s plenty of stuff that we know is really important even though we might not be measuring it.

2. Variation: If you’d like to understand this point, then please read some of my earlier posts, such as ‘The Spice of Life’ and ‘Falling into that trap’

As a simple example: If you took a regular reading of your resting heart rate, don’t you think it would be weird if you got, say, 67 beats per minute every single time? You’d think that you’d turned into some sort of android!

3. Expect/ can explain – clarification: this is NOT the same as ‘what we would like it to be’.

4. Arbitrary: When a numeric target is set, it is arbitrary as to which number was picked. Sure, it might have been picked with reference to something (such as 10% better than average, or the highest we’ve ever achieved, or….) but it’s arbitrary as to which ‘reference’ you choose.

5. Wrong questions: These wrong questions are then likely to cause us to jump to wrong conclusions and actions (also known as tampering). Such actions are likely to focus on individuals, rather than the system that they work within.

6. ‘Trigger’: The writing of this post was ‘triggered’ the other day when I reviewed a table of traffic-lighted (i.e. against a target) measures of targets on measures of things.

Counts, categories and computations

This post sits squarely within the ‘measurement’ section of this blog – a topic dear to me, given the vagaries of measurements that we are subjected to or are required to produce in our working lives¹.

The catalyst for writing it was revisiting a ‘Donald Wheeler’ chapter² and reminding myself of being around some ‘daft work assignments’ of years ago.

I’ll start with an ordinary-looking table that (let’s say) represents³ the feedback received by a presenter (Bob) after running a 1-hour session at a multi-day conference.

I’ve deliberately used a rather harmless-looking subject (i.e. feedback to a presenter) so that I can cover some general points…which can then be applied more widely.

[Table: Bob’s presentation feedback]

So, let’s walk through this table.

Conference attendees were asked to evaluate Bob’s session against five perfectly reasonable questions, using a five-point rating scale (from ‘Poor’ through to ‘Excellent’). The body of the table (in blue) tells us the percentage of evaluators that awarded each rating per question (and, as you would expect, the ratings given for each question sum to 100%).

Nice, obvious, easy….but that table is sure hard to read. It’s just a blur of boring numbers.

Mmm, we’d better add some statistics⁴ (numbers in red)…to make it more, ahem, useful.

Pseudo-Average

So the first ‘analysis’ usually added is the ‘average score per question’: we can see that there is variation in how people score…and we feel the need to boil this down into the score that a (mythical) ‘average respondent’ gave.

To do this, we assume a numerical weighting for each rating (e.g. a ‘poor’ scores a 0…all the way up to an ‘excellent’ scoring a 4) and then use our trusty spreadsheet to crunch out an average. Looking at the table, Bob scored an overall 1.35 on the quality of her pre-session material, which is somewhere between ‘average’ (a score of 1) and ‘good’ (a score of 2).
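For concreteness, the pseudo-average is just a weighted sum of the rating percentages. A minimal sketch, using the 0 to 4 weighting described above and a made-up distribution of responses (not Bob’s actual figures):

```python
RATINGS = ["Poor", "Average", "Good", "Very good", "Excellent"]
WEIGHTS = [0, 1, 2, 3, 4]  # assumes equal 'distances' between ratings


def pseudo_average(percentages, weights=WEIGHTS):
    """Weighted 'average rating' from the % of evaluators choosing each rating."""
    if len(percentages) != len(weights):
        raise ValueError("need one percentage per rating")
    if abs(sum(percentages) - 100) > 1e-9:
        raise ValueError("percentages should sum to 100")
    return sum(w * p for w, p in zip(weights, percentages)) / 100


# Hypothetical question: 10% Poor, 40% Average, 30% Good, 15% Very good, 5% Excellent
print(pseudo_average([10, 40, 30, 15, 5]))
```

The answer, 1.65, sits ‘between Average and Good’ only because of the weights we chose to assume.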

…and it is at this point that we should pause to reflect on the type of data that we are dealing with.

“While numbers may be used to denote an ordering among categories, such numbers do not possess the property of distance. The term for numbers used in this way is ordinal data.” (Wheeler)

There is a natural order between poor, average, good, very good and excellent…however there is no guarantee that the distance between ‘excellent’ and ‘very good’ is the same as the distance from ‘good’ to ‘average’ (and so on)…yet by assigning numbers to categories we make distances between categories appear the same⁵.

If you compute an average of ordinal data then you have a pseudo-average.

“Pseudo averages are very convenient, but they are essentially an arbitrary scoring system which is used with ordinal data. They have limited meaning, and should not be over interpreted.”
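The consequence is that any order-preserving set of weights is as defensible as 0 to 4, and a different choice can reverse a comparison. A minimal sketch with two hypothetical questions and an arbitrary alternative scoring (0, 1, 2, 3, 10), which keeps the same order but stretches the top step:

```python
EQUAL_STEPS = [0, 1, 2, 3, 4]
STRETCHED_TOP = [0, 1, 2, 3, 10]  # same order, different (equally arbitrary) distances


def pseudo_average(percentages, weights):
    """Weighted 'average rating' from the % choosing each rating (Poor ... Excellent)."""
    return sum(w * p for w, p in zip(weights, percentages)) / 100


question_a = [0, 0, 0, 100, 0]   # everyone said 'Very good'
question_b = [40, 0, 0, 0, 60]   # split between 'Poor' and 'Excellent'

for weights in (EQUAL_STEPS, STRETCHED_TOP):
    a = pseudo_average(question_a, weights)
    b = pseudo_average(question_b, weights)
    print(weights, "A" if a > b else "B", "scores higher:", a, "vs", b)
```

With equal steps question A wins (3.0 vs 2.4); stretch the top step and question B wins (6.0 vs 3.0). The responses never changed, only the arbitrary scoring did.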

Total Average

Okay, so going back to our table of Bob’s feedback: we’ve averaged each row (our pseudo-averages)…so our next nifty piece of analysis will be to average each column, to (supposedly) find out how Bob did in general…and we get our total average line. This shows that Bob mainly scored, on average, in the ‘good’ and ‘very good’ categories.

But what on earth does this mean? Combining scores for different variables (e.g. the five different evaluation questions in this case) is daft. They have no meaningful relationship with one another.

It’s like saying “I’ve got 3 bikes and 10 fingers….so that’s an average of 6.5”. Yes, that’s what the calculator will say…but so what?!

“The total average line (i.e. computing an average from different variables) is essentially a triumph of computation over common sense. It should be deleted from the summary.”

Global Pseudo-Average

And so to our last piece of clever analysis…that table of numbers is quite hard to deal with. Is there one number that tells us ‘the answer’?

Well, yes, we could create a global pseudo-average, which would be to compute a pseudo-average from the total average line. Excellent, we could calculate a one-number summary for each presenter at our conference…and then we could compare them…we could even create a (fun!) league table 🙂

Oh, bugger, our Bob only got a 2.4. That doesn’t seem very good.

To compute a global pseudo-average would be to cross-pollinate the misleading pseudo-average with the nonsensical total average line and arrive in computation purgatory.
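Spelled out, the two steps look like this. A minimal sketch with three hypothetical questions (not Bob’s data): average each rating column across the questions to get the total average line, then take a pseudo-average of that line. Because a pseudo-average is a weighted sum, the result is exactly the mean of the per-question pseudo-averages, so it inherits both problems at once.

```python
from statistics import mean

WEIGHTS = [0, 1, 2, 3, 4]  # Poor ... Excellent, as in the post


def pseudo_average(percentages, weights=WEIGHTS):
    """Weighted 'average rating' for one question."""
    return sum(w * p for w, p in zip(weights, percentages)) / 100


def total_average_line(rows):
    """Average each rating column across all questions (mixing unrelated variables)."""
    return [mean(column) for column in zip(*rows)]


def global_pseudo_average(rows):
    """Pseudo-average of the total average line: the 'one number'."""
    return pseudo_average(total_average_line(rows))


# Hypothetical % of evaluators per rating for three different questions
rows = [
    [10, 40, 30, 15, 5],
    [0, 10, 30, 40, 20],
    [5, 15, 40, 30, 10],
]
print(round(global_pseudo_average(rows), 2))
print(round(mean(pseudo_average(r) for r in rows), 2))
```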

The wider point

Let’s move away from Bob’s presentation skills.

Who’s seen pseudo-averages, total average lines and global pseudo-averages ‘used in anger’ (i.e. with material decisions being made) on ordinal data?

A classic example would be within software selection exercises, to (purportedly) compare competing vendors in a robust, objective and transparent manner.

  • In terms of pseudo averages, we get situations where 10 ‘nice to have’ features end up supposedly equaling 1 ‘essential’ function;
  • In terms of total average lines, we get variables like software functionality, support levels and vendor financial strength all combined together (which is akin to my bikes and fingers);
  • …and at the very end, the ‘decider’ between selecting Vendor A or B might go down to which one has been lucky enough to garner a slightly superior global pseudo-average. “Hey, Vendor B wins because they got 6.85”

The above example refers to software but could be imagined across all selection exercises (recruitment, suppliers,….).

Ordinal data is used and abused regularly. The aim of this short post is just to remind (or educate) people (including myself) of the pitfalls.

Side note: as a rule-of-thumb, my ‘bullshit-ohmmeter’ usually starts to crackle into life (much like a Geiger counter) whenever I see weightings applied to categories…

In summary

Before ‘playing with numbers’, the first thing we should do is think about what we are dealing with.

“In order to avoid a ‘triumph of computation over common sense’ it is important for you to think about the nature of your data…

…a spreadsheet programme doesn’t have any inhibitions about computing the average for a set of telephone numbers.”

Addendum: ‘Back to school’ on data types

This quick table gives a summary of the traditional (though not exhaustive) method of categorising numerical data:

Data types table

Footnotes

1. It’s not just our working lives: We are constantly fed ‘numbers’ by central and local government, the media, and the private sector through marketing and advertisement.

2. Wheeler’s excellent book called ‘Making Sense of Data: SPC for the service sector’. All quotes above (in blue) are from this book.

3. ‘Represents’: If you are wondering, these are not real numbers. I’ve mocked them up so that you can hopefully see the points within.

4. Statistic: “a fact in the form of a number that shows information about something” (Cambridge Dictionary).

We should note, however, that just because we’ve been able to perform a calculation on a set of numbers doesn’t make it useful.

5. Distance: A nice example to show the lack of the quality of distance within ordinal data is to think of a race: Let’s say that, after over 2 hours of grueling racing, two marathon runners A and B sprint over the line in a photo finish, whilst runner C crawls over the line some 15 minutes later…and yet they stand on the podium in order of 1st, 2nd and 3rd. However, you can’t comprehend what happened from viewing the podium.

6. Visualising the data: So how might we look at the evaluation of Bob’s session?

How about visually…so that we can easily see what is going on and take meaningful action. How about this set of bar graphs?

There’s no computational madness, just the raw data presented in such a way as to see the patterns within:

  • The pre-session material needs working on, as does the closure of the session;
  • However, all is not lost. People clearly found the content very useful;
  • …Bob just needs to make some obvious improvements. She could seek help from people with expertise in these areas.

Note: There is nothing to be learned within an overall score of ‘2.4’…and plenty of mischief.

 

 

 

 

How good is that one number?

This post is a promised follow-up to the recent ‘Not Particularly Surprising’ post on Net Promoter Score.

I’ll break it into two parts:

  • Relevance; and
  • Reliability

Part 1 – Relevance

A number of posts already written have explained that:

Donald Wheeler, in his superb book ‘Understanding Variation’, nicely sets out Dr Walter Shewhart’s¹ ‘Rule One for the Presentation of Data’:

“Data should always be presented in such a way that preserves the evidence in the data…”

Or, in Wheeler’s words “Data cannot be divorced from their context without the danger of distortion…[and if context is stripped out] are effectively rendered meaningless.”

And so to a key point: The Net Promoter Score (NPS) metric does a most excellent job of stripping out meaning from within. Here’s a reminder from my previous post that, when asking the ‘score us from 0 – 10’ question about “would you recommend us to a friend”:

  • A respondent scoring a 9 or 10 is labelled as a ‘Promoter’;
  • A scorer of 0 to 6 is labelled as a ‘Detractor’; and
  • A 7 or 8 is labelled as being ‘Passive’.

….so this means that:

  • A catastrophic response of 0 gets the same recognition as a casual 6. Wow, I bet two such polar-opposite ‘Detractors’ have got very different stories of what happened to them!

and yet

  • a concrete boundary is placed between responses of 6 and 7 (and between 8 and 9). Such an ‘on the boundary’ responder may have vaguely pondered which box to tick and metaphorically (or even literally) ‘tossed a coin’ to decide.

Now, you might say “yeah, but Reichheld’s broad-brush NPS metric will do” so I’ve mocked up three (deliberately) extreme comparison cases to illustrate the stripping out of meaning:

First, imagine that I’ve surveyed 100 subjects with my NPS question and that 50 ‘helpful’ people have provided responses. Further, instead of providing management with just a number, I’m furnishing them with a bar chart of the results.

Comparison pair 1: ‘Terrifying vs. Tardy’

Below are two quite different potential ‘NPS question’ response charts. I would describe the first set of results as terrifying, whilst the second is merely tardy.

Chart 1 Terrifying vs Tardy

Both sets of results have the same % of Detractors (below the red line) and Promoters (above the green line)…and so are assigned the same NPS score (which, in this case, would be -100). This comparison illustrates the significant dumbing down of data by lumping responses of 0 – 6 into the one category.
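The arithmetic behind this comparison is short. A minimal sketch, assuming for illustration the most extreme version of the pair (every respondent scoring 0 versus every respondent scoring 6; the actual charts aren’t reproduced here), which prints the full response distribution alongside the score:

```python
from collections import Counter


def nps(scores):
    """% Promoters (9-10) minus % Detractors (0-6), rounded to a whole number."""
    if not scores:
        raise ValueError("no responses")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return round(100 * (promoters - detractors) / len(scores))


terrifying = [0] * 50   # hypothetical: all 50 respondents score 0
tardy = [6] * 50        # hypothetical: all 50 respondents score 6

for name, scores in (("terrifying", terrifying), ("tardy", tardy)):
    print(name, "NPS:", nps(scores), "responses:", dict(sorted(Counter(scores).items())))
```

Both give an NPS of -100; only the distribution of responses shows how different the two situations are.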

I’d want to clearly see the variation within the responses i.e. such as the bar charts shown, rather than have it stripped out for the sake of a ‘simple number’.

You might respond with “but we do have that data….we just provide Senior Management with the single NPS figure”….and that would be the problem! I don’t want Senior Management making blinkered decisions², using a single number.

I’m reminded of a rather good Inspector Guilfoyle poster that fits perfectly with having the data but deliberately not using it.

Comparison pair 2: ‘Polarised vs. Contented’

Below are two more NPS response charts for comparison….and, again, they both derive the same NPS score (-12 in this case) …and yet they tell quite different stories:

Chart 2 Polarised vs Contented

The first set of data uncovers that the organisation is having a polarising effect on its customers – some absolutely love ‘em …whilst many others are really not impressed.

The second set shows quite a warm picture of contentedness.

Whilst the NPS scores may be the same, the diagnosis is unlikely to be. Another example where seeing the variation within the data is key.

Comparison pair 3: ‘No Contest vs. No Show’

And here’s my penultimate pair of comparison charts:

Chart 3 No contest vs No show

Yep, you’ve guessed it – the two sets of response data have the same NPS scores (+30).

The difference this time is that, whilst the first chart reflects 50 respondents (out of the 100 surveyed), only 10 people responded in the second chart.

You might think “what’s the problem, the NPS of +30 was retained – so we keep our KPI-inspired bonus!” …but do you think the surveys are comparable? Why might so many people not have responded? Is this likely to be a good sign? Can you honestly compare those NPS numbers? (perhaps see ‘What have the Romans ever done for us?!’)
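One way to see why the second chart deserves less weight: each respondent is worth 100/n percentage points, so a single person changing their mind moves a small survey’s NPS a long way. A minimal sketch with hypothetical counts chosen so that both surveys give +30:

```python
def nps_from_counts(promoters, passives, detractors):
    """NPS from category counts: % Promoters minus % Detractors."""
    total = promoters + passives + detractors
    return 100 * (promoters - detractors) / total


def swing_of_one(promoters, passives, detractors):
    """How far NPS falls if one Promoter had instead been a Detractor."""
    before = nps_from_counts(promoters, passives, detractors)
    after = nps_from_counts(promoters - 1, passives, detractors + 1)
    return before - after


# Hypothetical counts: 50 respondents (25/15/10) and 10 respondents (5/3/2), both NPS +30
for counts in ((25, 15, 10), (5, 3, 2)):
    print(sum(counts), "respondents: NPS", nps_from_counts(*counts),
          "swing of one person:", swing_of_one(*counts))
```

One person’s answer moves the score by 4 points in the larger survey and by 20 points in the smaller one, and neither figure says anything about the people who didn’t respond.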

….which leads me nicely onto the second part of this post:

Part 2 – Reliability

A 2012 article co-authored by Fred Reichheld (creator of NPS) identifies many issues that are highly relevant to compiling that one number:

  • Frequency: that NPS surveys should be frequently performed (e.g. weekly), rather than, say, a quarterly exercise.

The article doesn’t, however, refer to the essential need to always present the results over time, or whether/ how such ‘over time’ charts should (and should not) be interpreted.


  • Consistency: that the survey method should be kept constant because two different methods could produce wildly different scores.

The authors comment that “the consistency principle applies even to seemingly trivial variations in methodologies”, giving an example of the difference between a face-to-face method at the culmination of a restaurant meal (deriving an NPS of +40) and a follow-up email method (NPS of -39).


  • Response rate: that the higher the response rate, then the greater the accuracy – which I think we can all understand. Just reference comparison 3 above.

But the article goes on to say that “what counts most, of course, is high response rates from your core or target customers – those who are most profitable…” In choosing these words, the authors demonstrate the goal of profitability, rather than customer purpose. If you want to understand the significance of this then please read ‘Oxygen isn’t what life is about’.

I’d suggest that there will be huge value in studying those customers that aren’t your current status quo.


  • Freedom from bias: that many types of bias can affect survey data.

The authors are clearly right to worry about the non-trivial issue of bias. They go on to talk about some key issues such as ‘confidentiality bias’, ‘responder bias’ and the whopper of employees ‘gaming the system’ (which they unhelpfully label as unethical behaviour, rather than pondering the system-causing motivations – see ‘Worse than useless’)


  • Granularity: that of breaking results down to regions, plants/ departments, stores/branches…enabling “individuals and small teams…to be held responsible for results”.

Ouch…and we’d be back at that risk of bias again, with employees playing survival games. There is nothing within the article that recognises what a system is, why this is of fundamental importance, and hence why supreme care would be needed with using such granular NPS feedback. You could cause a great deal of harm.

Wow, that’s a few reliability issues to consider and, as a result, there’s a whole NPS industry being created within organisational customer/ marketing teams³…which is diverting valuable resources from people working together to properly study, measure and improve the customer value stream(s) ‘in operation’, towards each and every customer’s purpose.

Reichheld’s article ends with what it calls “The key”: the advice to “validate [your derived NPS number] with behaviours”, by which he explains that “you must regularly validate the link between individual customers’ scores and those customers’ behaviours over time.”

I find this closing advice amusing, because I see it being completely the wrong way around.

Rather than getting so obsessed with the ‘science’ of compiling frequent, consistent, high response, unbiased and granular Net Promoter Scores, we should be working really hard to:

“use Operational measures to manage, and [lagging⁴] measures to keep the score.” [John Seddon]

…and so to my last set of comparison charts:

Chart 4 Don’t just stand there do something

Let’s say that the first chart corresponds to last month’s NPS survey results and the second is this month. Oh sh1t, we’ve dropped by 14 whole points. Quick, don’t just stand there, do something!

But wait…before you run off with action plan in hand, has anything actually changed?

Who knows? It’s just a binary comparison – even if it is dressed up as a fancy bar chart.

To summarise:

  • Net Promoter Score (NPS) has been defined as a customer loyalty metric;
  • There may be interesting data within customer surveys, subject to a heavy caveat around how such data is collected, presented and interpreted;
  • NPS doesn’t explain ‘why’ and any accompanying qualitative survey data is limited, potentially distorting and easily put to bad use;
  • Far better data (for meaningful and sustainable improvement) is to be found from:
    • studying a system in operation (at the points of demand arriving into the system, and by following units of demand through to their customer satisfaction); and
    • using operational capability measures (see ‘Capability what?’) to understand and experiment;
  • If we properly study and redesign an organisational system, then we can expect a healthy leap in the NPS metric – this is the simple operation of cause and effect;

  • NPS is not a system of management.

Footnotes

1. Dr Walter Shewhart (1891 – 1967) was the ‘father’ of statistical quality control. Deming was heavily influenced by Shewhart’s work and they collaborated.

2. Blinkered decisions, like setting KPI targets and paying out incentives for ‘hitting it’.

3. I should add that, EVEN IF the (now rather large) NPS team succeeds in creating a ‘reliable’ NPS machine, we should still expect common cause variation within the results over time. Such variation is not a bad thing. Misunderstanding it and tampering would be.

4. Seddon’s original quote is “use operational measures to manage, and financial measures to keep the score” but his ‘keeping the score’ meaning (as demonstrated in other pieces that he has written) can be widened to cover lagging/ outcome/ results measures in general…which would include NPS.

Seddon’s quote mirrors Deming’s ‘Management by Results’ criticism (as explained in the previous post).

Not Particularly Surprising

Have you heard people telling you their NPS number? (perhaps with their chests puffed out…or maybe somewhat quietly – depending on the score). Further, have they been telling you that they must do all they can to retain or increase it?¹

NPS – what’s one of those?

‘Net Promoter Score’, or NPS, is a customer loyalty metric that has become much loved by the management of many (most?) large corporations. It was introduced to the management world by Fred Reichheld² in his 2003 HBR article titled ‘The One Number You Need to Grow’.

So far, so what.

But, as with most things in ‘modern management’ medicine, once introduced, NPS took on a life of its own.

Reichheld designed NPS to be rather simple. You just ask a sample of subjects (usually customers³) one question and give them an 11-point scale of 0 to 10 to answer it. And that question?

‘How likely is it that you would recommend our company/product/ service to a friend or a colleague?’

You then take all your responses (which, incidentally, may be rather low) and boil them down into one number. Marvellous…that will be easy to (ab)use!

But, before you grab your calculators, this number isn’t just an arithmetic average of the responses. Oh no, there’s some magic to take you from your survey results to your rather exciting score…and here’s how:

  • A respondent scoring a 9 or 10 is labelled as a ‘Promoter’;
  • A scorer of 0 to 6 is labelled as a ‘Detractor’; and
  • A 7 or 8 is labelled as being ‘Passive’⁴.

where the sum of all Promoters, Detractors and Passives = the total number of respondents.

You then work out the % of your total respondents that are Promoters and the % that are Detractors, and subtract the Detractor % from the Promoter %.

You’ll get a number between -100 (they are all Detractors) and +100 (all Promoters), with a zero meaning Detractors and Promoters exactly balance each other out.
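The calculation just described fits in a few lines. A minimal sketch (the function names are hypothetical), which also shows how much it throws away: the score only ever sees three buckets.

```python
from collections import Counter


def categorise(score):
    """Map a 0-10 answer onto the three NPS buckets."""
    if not 0 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    if score >= 9:
        return "Promoter"
    if score >= 7:
        return "Passive"
    return "Detractor"


def net_promoter_score(scores):
    """% of respondents who are Promoters minus % who are Detractors (-100 to +100)."""
    if not scores:
        raise ValueError("no responses")
    buckets = Counter(categorise(s) for s in scores)
    total = len(scores)
    return 100 * buckets["Promoter"] / total - 100 * buckets["Detractor"] / total


print(net_promoter_score([10, 9, 8, 7, 6, 0]))
```

Here the two Promoters and two Detractors cancel out to 0.0, and the score would be identical whether the Detractors had answered 6 or 0.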

And, guess what…a positive score is desirable…and, over the long term, a likely necessity if you want to stay in business.

Okay, so I’ve done the up-front explanatory bit and regular readers of this blog are probably now ready for me to go on and attempt to tear ‘NPS’ apart.

I’m not particularly bothered by the score – it might be of some interest…though exceedingly limited in its usefulness.

Rather, I’m bothered by:

  1. what use it is said to be; and
  2. what use it is put to.

I’ve split my thoughts into two posts. This post deals with the second ‘bother’, and my next one will go back to consider the first.

Qualitative from Quantitative – trying to ‘make a wrong thing righter’

The sane manager, when faced with an NPS score and a ‘strategic objective’ to improve it, wants to move on from the purely quantitative score and ‘get behind it’ – they want to know why a score of x was given.

Reichheld’s NPS method covers this obvious craving by encouraging a second open-ended question requesting the respondent’s reasoning behind the rating just given – a ‘please explain’ comments box of sorts. The logic being that this additional qualitative data can then be provided to operational management for analysis and follow up action(s).

Reichheld’s research might suggest that NPS provides an indicator of ‘customer loyalty’, but…and here’s the key bit…don’t believe it to be a particularly good tool to help you improve your system’s performance.

There are many limitations with attempting to study the reasons for your system’s performance through a delayed, incomplete and second-hand ‘the horse has bolted’ method such as NPS:

  • Which subjects (e.g. customers) were surveyed?
  • What caused you to survey them?
  • Which subjects chose to respond…and which didn’t?
  • What effort from the respondent is likely to go into explaining their scoring?
  • Does the respondent even know their ‘why’?
  • Can they put their (potentially hidden) feelings into words?…and do they even want to?

If you truly want to understand how your system works and why, so that you can meaningfully and sustainably improve it, wouldn’t it just be soooo much better (and simpler) to jump straight to (properly5) studying the system in operation?!

A lagging indicator vs. Operational measures

One of my very early posts on this blog covered the mad, yet conventional, idea of ‘management by results’ and subsequent posts have delved into ‘cause and effect’ in more detail (e.g. ‘Chain beats Triangle’).

My ‘cause and effect’ post ends with the key point that:

“Customer Purpose (which, by definition, means quality) comes first…which then delivers growth and profitability, and NOT the other way around!”

Now, if you read up on what Reichheld has to say about NPS, he will tell you that it is a leading measure, whereas I argue that it is a lagging one. The difference is because we are coming from opposite ends of the chain:

  • Reichheld appears to be concerned with growth and profitability, and argues that NPS predicts what is going to happen to these two financial measures (I would say in the short term);

  • I am concerned with customer purpose, and an organisation’s capability in delivering against its customers’ needs. This means that I want to know what IS happening, here and now, so that I can understand and improve it…which will deliver (for our customers, for the organisation, for its stakeholders) now, and over the long term.

You might read the above and think I am playing with semantics. I think not.

I want operational measures on the actual demands coming in the door, and how my processes are actually working. I want first hand operational knowledge, rather than attempting to reverse engineer this from partial and likely misleading secondary NPS survey evidence.

“Managers learn to examine results, outcomes. This is wrong. The manager’s concern should be with processes….the concentration of a manager should be to make his processes better and better. To do so, he needs information about the performance of the process – the ‘voice of the process’.” [‘Four Days with Dr Deming’]

Deming’s clear message was ‘focus on the process and the result will come’ and, conversely, you can look at results all you like but you’d be looking in the wrong place!

NPS thinking fits into the ‘remote control’ school of management. Don’t survey and interrogate. ‘Go to the gemba’ (the place where the work occurs).

“But what about the Lean Start-up, Steve?”

Some readers familiar with Eric Ries’ Lean Start-up movement might respond “but Eric advocates the use of customer data!” and yes, he does.

But he isn’t trying to get a score from them, he is trying to deeply engage with a small number of them, understand how they think and behave when experiencing a product or service, and learn from this…and repeat this loop again and again.

This fits with studying demand, where it comes in, and as it flows.

The Lean Startup movement is about observing and reflecting upon what is actually happening at the point of customer interaction, and not about surveying them afterwards.

To close – some wise words

After writing this post I remembered that John Seddon had written something about NPS…so I searched through my book collection to recover what he had to say…and he didn’t disappoint:

“Even though NPS is completely useless in helping service organisations improve, on our first assignment [e.g. as system improvement interventionists] we say nothing about it, because we know the result of redesigning the system will be an immediate jump in the NPS score…and because when this is reported to the board our work gets the directors’ attention.

It makes it easy to see why NPS is a waste of time and money. First, it is what we call a ‘lagging measure’ – as with all customer satisfaction measures, it assesses the result of something done in the past. Since it doesn’t help anyone understand or improve performance in the present, it fails the test of a good measure5 – it can’t help to understand or improve performance.” [Seddon, ‘The Whitehall Effect’]

Seddon goes on to illuminate a clear and pernicious ‘red herring’ triggered by the use of NPS:  the simple question of ‘would you recommend this service to a friend’ mutates to a hunt for the person who delivered the particular instance of service currently under the microscope. Management become “concerned with the behaviour of people delivering the service” as opposed to the system that makes such behaviour highly likely to occur!

I have experience of this exact management behaviour in full flow, with senior management contacting specified members of staff directly (i.e. those who handled the random transaction in question) to congratulate or interrogate/berate them, following the receipt of particularly outstanding6 NPS responses.

This is to focus on the 5% (the people) and ignore the 95% (the system that they are required to operate within). NPS “becomes an attractive device for controlling them”.

Indeed.

The title of this post follows from Seddon’s point that if you focus on studying, understanding and improving the system then, guess what, the NPS will improve – usually markedly. Not Particularly Surprising.

My next post called ‘How good is that one number’ contains the second part of my NPS critique.

Footnotes

1. This post, as usual, comes from having a most excellent conversation with a friend (and ex-colleague) …and she bought me lunch!

I should add that the title image (the pH scale) is a light-hearted satire of the various NPS images I found, i.e. smiley, neutral and angry faces arranged on a coloured and numbered scale.

2. Reichheld has written a number of books on customer loyalty, with one of his more recent ones trying to relabel ‘NPS’ from Net Promoter Score to Net Promoter System (of management) …which, to put it mildly, I am not a fan of.

It reminds me of the earlier ‘Balanced Scorecard’ attempting to morph into a system of management. See ‘Slaughtering the Sacred Cow’.

Yet another ‘management idea’ expanding beyond its initial semblance of relevance, in the hands of book sellers and consultants.

Sorry, but that’s how I feel about it.

NPS is linked to the ‘Balanced Scorecard’ in that it provides a metric for the customer ‘quadrant’ of the scorecard …but, as with financial measures, it is still an ‘outcome’ (lagging) measure of an organisation’s people and processes.

3. The original NPS focused on customers, but this has subsequently been expanded to consider other subjects, particularly employees.

4. Being British (i.e. somewhat subdued), I find the labelling of a 7 or 8 score as ‘Passive’ to be hilarious. A score of 7 from me would be positively gushing in praise! What a great example of the variety inherent within customers…and which NPS cannot reveal.

5. For the ‘tests of a good measure’, please see an earlier post titled ‘Capability what?’

6. Where ‘outstanding’ means particularly low, as well as high.

’80 in 20’…erm, can we change that?!

This is a bit of a ‘back to basics’ post, inspired by refreshing my memory from reading a superb book. It’s long…but hopefully interesting 🙂

Some years back I was working with a most excellent colleague, who managed a busy contact centre operation. Let’s call her Bob. She was absolutely committed to doing the best she could, for her staff and her customers.

Bob came to me one day for some help: Things weren’t going well, she had a meeting with senior management coming up and she was going to ask them to approve a radical thing – to change, by which I mean relax, their current call handling target.

I didn’t know too much about contact centres back then…so I started by asking some dumb questions. And it went something like this:

Me: “What’s this ‘80 in 20’ measure about?”

Bob: “It’s our main ‘Key Performance Indicator’ (KPI), called ‘Grade of Service’ (or GOS for short) and it means that we aim to pick up 80% of all incoming calls within 20 seconds of the customer calling.”

Me: “Oh…and where do these figures come from?”

Bob: “It’s an industry recognised KPI. All ‘up to date’ contact centres use it to measure how they are doing and ‘80 in 20’ is Best Practice.”

Me: “…what ‘industry body’ and where did they get these figures?”

Bob: “The [insert name of a] ‘Contact Centre Association’…and I’ve got no idea where the figures come from.”

Me: “So, we have a target of picking up a customer’s call within an arbitrary 20 seconds…and we have an arbitrary target on meeting this target 80% of the time? …so it’s a target on a target?”

Bob: “Yes…I suppose it is…but we are having a real tough time at the moment and we hardly ever achieve it.”

Me: “Okay…but why do you want to ask senior management to ‘relax’ this target-on-a-target? What will this achieve?”

Bob: “Because we publish our GOS results against target for all our contact centre team leaders to see…and frankly there’s not much they can do about it…and this is really demoralising. If I could just get senior management to relax it to, say, 70% in 30 seconds then my staff could see that they at least achieve it sometimes.”

…and that’s how my discussion with Bob started.
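To make the ‘target on a target’ concrete, here’s a rough Python sketch of how a ‘Grade of Service’ figure is typically worked out. The assumptions are mine, not Bob’s or any industry body’s: answer times are in seconds, ‘within 20 seconds’ means 20 or less, and abandoned calls are simply left out (real contact centres treat them in different ways). The call times are invented.

```python
def grade_of_service(answer_times_s, threshold_s=20):
    """Percentage of answered calls picked up within threshold_s seconds."""
    if not answer_times_s:
        raise ValueError("need at least one answered call")
    within = sum(1 for t in answer_times_s if t <= threshold_s)
    return 100 * within / len(answer_times_s)


def meets_target(answer_times_s, pct_target=80, threshold_s=20):
    """The 'target on a target': did the GOS reach pct_target %?"""
    return grade_of_service(answer_times_s, threshold_s) >= pct_target


# Hypothetical answer times (seconds) for ten calls
times = [3, 8, 12, 19, 20, 25, 40, 90, 15, 5]

print(grade_of_service(times))                             # 70.0
print(meets_target(times))                                 # False ('80 in 20')
print(meets_target(times, pct_target=70, threshold_s=30))  # True ('70 in 30')
```

Notice that ‘relaxing’ the target flips the verdict from fail to pass without a single customer waiting any less.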


I have just finished reading Donald Wheeler’s superb book ‘Understanding Variation – the key to managing chaos’ and my work with Bob1 all those years ago came flooding back to me…and so I thought I’d revisit it, and jot down the key points within. Here goes…

Confusing ‘Voice of the Customer’ and ‘Voice of the Process’

I’ll start with clarifying the difference between the customer and the process. In the words of Donald Wheeler:

“The ‘voice of the customer’ defines what you want from a system.

The ‘voice of the process’ defines what you will get from a system.”

The difference in words is subtle, but in meaning is profound.

In Bob’s case, she has determined that customers want the phone to be picked up within 20 seconds2. However, this wishful thinking (a target) is completely outside the system. Bob could set the customer specification (target) at anything, but this has got nothing to do with what the process can, and will predictably3, achieve.

What we really want to see is what the system (‘handling4 customer calls’) is achieving over time.

A target is digital (on/off) – either ‘a pat on the back’ or ‘not good enough!’

“A natural consequence of this specification [target] approach…is the suddenness with which you can change from a state of bliss to a state of torment. As long as you are ‘doing okay’ there is no reason to worry, so sit back, relax, and let things take care of themselves. However, when you are in trouble, ‘don’t just stand there – do something!’ …This ‘on-again, off again’ approach is completely antithetical to continual improvement.” (Wheeler)

Unfortunately, Bob is constantly the wrong side of the (current) specification and therefore has the unwavering torment of ‘don’t just stand there – do something!’

But do what? And how would Bob know if whatever they try is actually an improvement or not? Using a target is such a blunt (and inappropriate) tool. Future results:

  • might ‘beat target’ (gaining a ‘pat on the back’) and yet simply be noise5; or
  • might still be lower than target (receiving another ‘kick’) and yet contain an important signal.

Bob cannot see the true effects of any experimentation on her system whilst relying on her current industry best practice ‘Grade of Service’ KPI. She does not have a method to separate out potential signals from probable noise.

Thinking that a target can change things for the better

“When people are pressured to meet a target value, there are three ways they can proceed:

  1. They can work to improve the system;
  2. They can distort the system; or
  3. They can distort the data.               

(Wheeler, referencing Brian Joiner)

What can a call agent do to ‘hit’ that target? Well, not much really. They can’t influence the number of calls coming in or what those customers want or need. They CAN, however, try to ‘get off the phone’ so as to get to the next call. Mmm, that’s not going to help the (customer-defined) purpose…and is likely to create failure demand, complaints and re-work…and make things worse.

What can the contact centre management (from team leaders and upwards to Bob) do to ‘hit’ that target? They could try to improve the system* (which, whilst being the right thing to do, is also the hardest) OR they could simply ask for the target to be relaxed. If they aren’t allowed to do either, then they might begin to ‘play games’ with the data…and hide what is actually happening.

* To improve the system, Bob needs contextual data presented such that it uncovers what is happening in the system…which will enable her to listen to the process, see signals, ask relevant questions, understand root cause, experiment and improve. She, and her team, cannot do this at present using her hugely limiting KPI.

In short, the target is doing no good…and probably some (and perhaps a lot of) harm.

It’s perhaps worth reflecting that “Bad measures = bad behaviours = bad service” (Vanguard)

What’s dafter than a target? A target on a target!

Why? Well, because it removes us from the contextual data, stripping out the necessary understanding of variation within and thus further hiding the ‘voice of the process’.

It’s worth noting that, in Bob’s ‘20 seconds to answer’ target world:

  • A call answered in 3 seconds is worth the same as one answered in 19 seconds; and, worse
  • A call answered in 21 seconds is treated the same as one answered in, say, 480 seconds….and beyond…perhaps even an hour!
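A few lines of Python (with invented numbers, purely for illustration) show how much the pass/fail view throws away. Both days below ‘achieve’ exactly 80% in 20 seconds, yet on one of them a customer waited an hour.

```python
def summarise(answer_times_s, threshold_s=20):
    """Compare the pass/fail view with what actually happened to customers."""
    if not answer_times_s:
        raise ValueError("need at least one call")
    n = len(answer_times_s)
    return {
        "pct_within": 100 * sum(1 for t in answer_times_s if t <= threshold_s) / n,
        "longest_wait_s": max(answer_times_s),
        "total_wait_s": sum(answer_times_s),
    }


# Hypothetical answer times (seconds); the first eight calls are identical
day_a = [3, 5, 6, 8, 10, 12, 14, 19, 21, 25]
day_b = [3, 5, 6, 8, 10, 12, 14, 19, 480, 3600]

print(summarise(day_a))  # {'pct_within': 80.0, 'longest_wait_s': 25, 'total_wait_s': 123}
print(summarise(day_b))  # {'pct_within': 80.0, 'longest_wait_s': 3600, 'total_wait_s': 4157}
```

Even the extra columns here are only a crude peek underneath; the real answer is to look at the wait times themselves, over time.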

Note: I’ve added an addendum at the end of this post with a specific ‘target on a target’ example (hospital wait times). I hope that it is of use to demonstrate that using a ‘target on a target’ is to hide the important data underneath it.

“Setting goals [targets] on meeting goals is an act of desperation.” (Wheeler)

Worse still, a ‘target on a target’ can fool us into thinking that we are looking at something useful. After all, I can still graph it…so it must be good…mustn’t it?

Here’s a control chart of Bob’s ‘Grade of Service’ (GOS) KPI:

[Chart: I-MR (individuals and moving range) control chart]

You might look at it and think “Wow, that looks professional with all that I-MR control charty stuff! I thought you said that we’d be foolish to use this ’80 in 20’ target on a target?”

You can see that Bob’s contact centre never met the ’80 in 20’ target-on-a-target6 (and, with the current system, isn’t likely to)…and you can perhaps see why she wants to ‘relax’ it to ’70 in 30’….but we can’t see what really happens.

What’s the variation in wait times? (times of day, days of week etc.)

Do some people get answered in 5 seconds? Is it common for some people to wait for 200 seconds? (basically, what’s actually going on?!)

Is the variation predictable? Are there any patterns within?

Are those months really so comparable?…are any games being played?!

Okay, so I’ve shot at what Bob has before her…but what advice can I offer to help?

Does Bob need to change her ’80 in 20’ KPI?  Yes, she does….but not by relaxing the target.

‘The right data, measured right’ (‘what’, and ‘how’)

At its very simplest, Bob’s measures need to help her (and her people) understand and improve the system.

To do this, they need to see:

| WHAT matters to the customer? | …which could be uncovered by: |
|---|---|
| “Don’t make me queue” | Volume of calls, time taken to answer, abandonment rate |
| “I want you to deal with me at my first point of contact” | % of calls resolved at first point of contact (i.e. didn’t need to be passed on) |
| “Don’t put me on hold unnecessarily” | % of calls put on hold (including reason types and frequencies) |
| “I want to deal with the right person (i.e. with the necessary knowledge, expertise, and authority)” | % of calls passed on (including reason types and frequencies) |
| “I want you to action what you have promised, when I need it…and to do so first time.” | Failure demand, either chasing up or complaining (including reason types and frequencies) |
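As a rough illustration of turning measures like these into numbers, here’s a Python sketch that computes one period’s worth of them from a list of call records. The `Call` record and its fields are hypothetical (your telephony or CRM export will look different), ‘reason types and frequencies’ aren’t covered, and each result is meant to be plotted over time rather than judged on its own.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Hypothetical call record; real exports will have different fields."""
    wait_s: float
    abandoned: bool = False
    resolved_first_contact: bool = False
    put_on_hold: bool = False
    passed_on: bool = False
    failure_demand: bool = False  # customer chasing up or complaining


def pct(calls, flag):
    """Percentage of calls where the named flag is True (None if no calls)."""
    if not calls:
        return None
    return 100 * sum(getattr(c, flag) for c in calls) / len(calls)


def period_measures(calls):
    """One period's set of measures, to be plotted over time, not against a target."""
    answered = [c for c in calls if not c.abandoned]
    return {
        "volume": len(calls),
        "abandonment_rate": pct(calls, "abandoned"),
        "longest_wait_s": max((c.wait_s for c in calls), default=None),
        "first_contact_resolution": pct(answered, "resolved_first_contact"),
        "put_on_hold": pct(answered, "put_on_hold"),
        "passed_on": pct(answered, "passed_on"),
        "failure_demand": pct(calls, "failure_demand"),
    }


calls = [
    Call(5, resolved_first_contact=True),
    Call(30, put_on_hold=True, passed_on=True),
    Call(90, abandoned=True),
    Call(12, resolved_first_contact=True, failure_demand=True),
]
print(period_measures(calls))
```

The on-hold, passed-on and first-contact percentages are worked out over answered calls only; that is a design choice for the sketch, and you may prefer a different denominator.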

Now, Bob (and most contact centres) might reply “We already measure some of that stuff!”

Yes, I expect you do.

What also matters is HOW you measure it.  Measures should be:

  • shown over time, in chronological order (i.e. in control charts, to show variation), with control limits (to separate out signals from noise);

  • updated regularly (i.e. at meaningful intervals) and shown visually (on the floor, at the gemba), providing feedback to those working in the system;

  • presented/ displayed together, as a set of measures, to show the system and its interactions, rather than a ‘Grade of Service’ KPI on a dashboard;

  • monitored and analysed to identify signals, and consider the effect of each experimental change towards the customer purpose;

  • devoid of a target! The right measures, measured right will do just fine.

Why are control charts so important? Wheeler writes that:

“Instead of attempting to attach a meaning to each and every specific value of the time series, the process behaviour [i.e. control] chart concentrates on the behaviour of the underlying process.”
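For anyone wanting to try this themselves, here’s a minimal Python sketch of how the limits on an individuals and moving range (XmR, or I-MR) chart are usually calculated, using the scaling constants 2.66 and 3.268 that Wheeler uses for this type of chart. The data is made up; treat it as a starting point for experimenting, not a replacement for Wheeler’s guidance on choosing a baseline and when to recalculate limits.

```python
def xmr_limits(values):
    """Centre line and natural process limits for an XmR (I-MR) chart.

    Individuals chart: mean +/- 2.66 * average moving range.
    Moving range chart: upper limit = 3.268 * average moving range.
    """
    if len(values) < 2:
        raise ValueError("need at least two values")
    # Moving range: absolute difference between each value and the one before it
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_x = sum(values) / len(values)
    mean_mr = sum(moving_ranges) / len(moving_ranges)
    return {
        "centre": mean_x,
        "lower": mean_x - 2.66 * mean_mr,
        "upper": mean_x + 2.66 * mean_mr,
        "mr_centre": mean_mr,
        "mr_upper": 3.268 * mean_mr,
        "moving_ranges": moving_ranges,
    }


# Hypothetical weekly values, in time order
weekly = [10, 12, 11, 13, 12]
limits = xmr_limits(weekly)
print({k: round(v, 2) for k, v in limits.items() if k != "moving_ranges"})
```

Note that the limits come entirely from the data (the voice of the process); nothing in the calculation knows or cares about a target.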

Why do we need to see a set of measures together? Simon Guilfoyle uses the excellent analogy of an aeroplane cockpit – you need to see the full set of relevant system measures to understand what is happening (speed, altitude, direction, fuel level…). There isn’t ‘One metric that matters’ and it is madness to attempt to find one.

Looking at Bob’s proposed set of capability measures (the table above), you can probably imagine why you’d want to see them all together, so as to spot any unintended consequences to changes you are experimenting with.

I.e. if one measure appears to be improving, is another one apparently worsening? Remember – it’s a system with components!

To summarise:

If I am responsible for a process (a system) then I want to:

  • see the actual voice of the process;
  • get behind (and then drop) any numerical target;
  • split the noise from any signals within;
  • understand if the system is ‘in control’ (i.e. stable, predictable) or not; and
  • spot, and investigate any special causes7

and, perhaps more important, I want to:

  • understand what is causing the demand coming into the system (rather than simply treating all demand as work to be done);
  • involve all of the people in their process, through the use of visual management (done in the right way); and then
  • experiment towards improving it…safe in the knowledge that our measures will tell us whether we should adopt, adapt or abandon each proposed change.

Bob and I continued to have some great conversations 🙂


I said that I would add an addendum on the subject of ‘a target on a target’…and here it is:

Addendum: An example to illustrate the point

I’ll borrow two diagrams8 from a really interesting piece of analysis on NHS hospitals (i.e. in the UK) and their Accident and Emergency (A&E) wait times.

The first chart is of Alder Hey Children’s hospital. It shows a nice curve of the time it takes for patients to be discharged:

[Chart: Alder Hey Children’s Hospital, A&E time to discharge]

The second chart is of Croydon University Hospital. Same type of chart, but their data tells a vastly different story!

[Chart: Croydon University Hospital, A&E time to discharge]

Q1: Do you think that an activity target has been set on the A&E system and, if so, where do you think it has been set?

I’d bet (heavily) that there is an A&E ‘time to discharge’ target, set from management above, of 4 hours (i.e. 240 minutes). It’s sort of evident from the first graph…but ‘smacks you between the eyes’ in the second.

Two further questions for you to ponder9: Looking at the charts for these two hospitals…


Q2: Which one has a smooth, relatively under control A&E system, and which do you think might be engaged in ‘playing (survival) games’ to meet the target?

I’d say that Alder Hey is doing rather well, whilst Croydon is (likely) engaged in all sorts of tricks to ship patients somewhere (anywhere!) ‘before the 4 hour buzzer’ – with a likely knock-on effect to patient experiences and outcomes.


Q3: Which one looks better on a ‘% of patients that met the 4 hour target’ league table? (i.e. a target on a target)

It is typical for health services to set an A&E ‘target on a target’ of, say, ‘95% discharged from A&E within 4 hours’10. This is just like Bob’s ‘80% in 20 seconds’.

Sadly, Croydon will sit higher up this league table (i.e. appear better) than Alder Hey!

If you don’t understand why, have a closer look at the two charts. Look specifically at the volume of patients being discharged after the 240 min. mark. Alder Hey has some, but Croydon has virtually none.

Foot notes

1. Just in case you hadn’t worked it out, she (or he) wasn’t called Bob!

2. Customer Target: Setting aside that the customer target shouldn’t (and indeed can’t) be used to improve the ‘handling calls’ system, I have two problems with the 20 second ‘customer specification’.

a. An industry figure vs. reality: rather than assuming that a generic industry figure of 20 seconds is what Bob’s customers want, I asked Bob to provide me with her call abandonment data.

I then graphed a histogram of the time (in seconds) that each customer abandoned their call and the corresponding volume of such calls. This provided us with evidence as to what exactly was happening within Bob’s system…which leads me on to:

b. An average customer vs. variety: There’s no such thing as ‘an average customer’ and we should resist thinking in this way. Some people were abandoning after a couple of seconds, others did so after waiting for two minutes. We can see that there is plenty of customer variety within – we should be thinking about how we can absorb that variety rather than meet some non-existent average.
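If you want to do the same with your own abandonment data, here’s a rough Python sketch that counts abandoned calls into 10-second buckets (keeping empty buckets, so the shape isn’t distorted) and prints a simple text histogram. The times are invented for illustration, and the bucket width is an arbitrary choice you may want to change.

```python
from collections import Counter


def abandonment_histogram(abandon_times_s, bin_width_s=10):
    """Count abandoned calls per time bucket, including empty buckets.

    Keys are bucket start times in seconds (0, 10, 20, ...).
    """
    counts = Counter((int(t) // bin_width_s) * bin_width_s for t in abandon_times_s)
    if not counts:
        return {}
    last = max(counts)
    return {start: counts.get(start, 0) for start in range(0, last + 1, bin_width_s)}


def print_histogram(hist, bin_width_s=10):
    """Print one row per bucket: the bucket's range, then one '#' per call."""
    for start, count in hist.items():
        print(f"{start:>4}-{start + bin_width_s - 1:<4}s | {'#' * count}")


# Hypothetical times (seconds) at which callers hung up
abandoned_at = [2, 3, 5, 12, 18, 25, 47, 61, 95, 118, 120]
print_histogram(abandonment_histogram(abandoned_at))
```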

3. Predictably: assuming that the process is stable and no change is made to it.

4. Handling: I specifically wrote ‘handling’ and not ‘answering’. Customers don’t just want their call answered – they want their need to be met. To properly understand a system we must first set out its purpose from the customer’s perspective, and then use an appropriate set of measures that reveal the capability of the system against this customer purpose. ‘Answering calls’ may be necessary, but it’s not sufficient.

5. Noise vs. Signal: I’m assuming in this post that you understand the difference between noise and signals. If you don’t (or would like a refresh) then an earlier (foundational) post on variation might assist: The Spice of Life

6. A clarification in respect of the example ‘I’ control chart: The Upper Control Limit (UCL) red line (at 80.55%) does not represent/ is not the 80% target. It just happens to be the case that the calculated UCL for Bob’s data works out to be nearly the same as the arbitrary target – this is an (unfortunate) fluke. A target line does not belong on a control chart!

7. Special Cause tests: The most obvious signal on a control chart can be seen when a point appears outside the upper or lower control limits. There are, however, other types of signals indicating that something special has occurred. These include ‘trends’, ‘shifts’, and ‘hugging’. Here’s a useful diagram (sourced from here):

[Diagram: special cause tests]
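Here’s a hedged Python sketch of three commonly used tests: a point outside the limits, a ‘shift’ (a run of points on one side of the centre line) and a ‘trend’ (a run of points each rising or falling). The run lengths used (8 and 6) are common choices, but conventions vary between sources and the diagram referred to above may use different ones; ‘hugging’ isn’t covered. You supply the centre line and limits from your own chart.

```python
def special_cause_signals(values, centre, lower, upper, run_length=8, trend_length=6):
    """Return (test_name, start_index) tuples for a few common special cause tests."""
    signals = []

    # Test 1: any single point beyond a control limit
    for i, x in enumerate(values):
        if x > upper or x < lower:
            signals.append(("outside_limits", i))

    # Test 2 ('shift'): run_length consecutive points on the same side of the centre line
    run_side, run_start = 0, 0
    for i, x in enumerate(values):
        side = (x > centre) - (x < centre)  # +1 above, -1 below, 0 on the line
        if side != 0 and side == run_side:
            if i - run_start + 1 == run_length:
                signals.append(("shift", run_start))
        else:
            run_side, run_start = side, i  # a new run starts (a point on the line breaks it)

    # Test 3 ('trend'): trend_length consecutive points, each higher (or lower) than the last
    up = down = 1
    for i in range(1, len(values)):
        up = up + 1 if values[i] > values[i - 1] else 1
        down = down + 1 if values[i] < values[i - 1] else 1
        if up == trend_length:
            signals.append(("trend_up", i - trend_length + 1))
        if down == trend_length:
            signals.append(("trend_down", i - trend_length + 1))

    return signals


# Hypothetical data: one point (index 4) sits above the upper limit
print(special_cause_signals([10, 11, 9, 10, 30, 10], centre=10, lower=4, upper=16))
# [('outside_limits', 4)]
```

Each signal is a prompt to go and look (at the gemba) for what happened, not an answer in itself.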

8. Hospital charts: The full set of charts (covering 144 NHS hospitals for the period 2012-13) is here. I’ve obviously chosen hospitals at both extremes to best illustrate the point.

I can’t remember where I first came across these hospital charts – which annoys me!…so if it was via a post on your blog – I’m sorry for my crap referencing/ recognition of your efforts 🙂 

9. Here’s a 4th and final question to ponder: If, after pondering those two questions, you still think that a ‘target on a target’ makes sense then how do you cope with someone not always meeting it? Do you set them a target…to motivate them?

How about a target for the ‘target on a target’???

  • A 95% target of achieving an ‘80% of calls answered in 20 seconds’ target
  • A 90% target of achieving a ‘95% of patients discharged within 4 hours’ target
  • ….

…and, if you are okay with this…but they don’t always meet it then how about setting them a target…where does the madness end?!

We are simply ‘playing with numbers’, moving ever further from reality and usefulness.

10. Hospital ‘Emergency department’ League tables:

Here’s a New Zealand ‘Emergency departments’ league table, ranking district health boards against each other (Source).

Notice that it shows:

  • A ‘target on a target’ (95% within 6 hrs)
  • A single quarter’s outcome
  • A binary comparison ‘with last quarter’
  • A (competitive) ranking

All of which are, ahem, ‘problematic’ (that’s me being polite 🙂)

You can’t actually see how each district is performing (whether stable, getting better…or worse).

…and you certainly can’t see whether games are being played.