2013-06-23
'Human error' is not a root cause of problems
Whenever something bad happens, like people changing files that are controlled by an automated system and then having their modifications overwritten, it is tempting to blame the person, to say 'the root cause of this incident is human error'. This is both wrong and a mistake. What we call 'human error' is almost always really a failure of process or of the surrounding environment, in at least three different ways.
First (and famously) the person who committed the error may have been working in an environment and with an interface that magnified the chances of errors, in some cases making it almost certain that someone would make a mistake sooner or later. Confusing interfaces, incomplete information, overwhelming flows of information; there are lots of ways to fail here. Bad interfaces and environments don't make errors certain (if they do, they get fixed), but they make them more likely. Unless you're lucky this will not be obvious, because people very rarely build interfaces that are obviously bad; instead, they tend to build interfaces that look superficially okay but have hidden flaws.
(Often the people who build the interfaces are not well placed to see how they lead people astray. You need a skeptical outside eye.)
But even with good interfaces, people sometimes make errors because they are sleepy or under the pressure of some emergency or for any number of other reasons. This too is a failure of process, specifically a failure to understand that people inevitably make mistakes and to improve your overall environment to deal with this. A resilient environment needs to work even in the face of occasionally sleepy or forgetful or over-stressed people, because all of these are going to keep happening.
(You can try to do something about some of these causes with high-level process changes. For example you could decide to deal with the sleepy people problem by saying 'no more midnight downtimes for work, we'll do them during the day when people are fully awake even if it's a bit more disruptive'.)
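(As one illustration of building an environment that tolerates mistakes, take the earlier example of hand-edited files being overwritten by automation. The following is only a minimal sketch in Python, not how any particular configuration management system works. The function name deploy_managed_file, the '.handedit' suffix and the state directory layout are all hypothetical, and keying the record on the bare file name is a simplification that would collide for same-named files in different directories. The idea is that the automation remembers a digest of what it last wrote, and if the file no longer matches, it saves the hand-edited version and warns instead of silently destroying someone's work.)

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def deploy_managed_file(target: Path, new_content: bytes, state_dir: Path) -> bool:
    """Write new_content to target, preserving any hand edits first.

    Hypothetical helper for illustration, not a real tool's API.
    Returns True if a hand edit was detected and saved alongside
    the target with a '.handedit' suffix.
    """
    state_dir.mkdir(parents=True, exist_ok=True)
    # Where we remember the digest of what we last deployed.
    record = state_dir / (target.name + ".sha256")

    drifted = False
    if target.exists() and record.exists():
        if sha256_of(target) != record.read_text().strip():
            # Someone changed the file since our last deploy. Keep
            # their version so the edit is visible and recoverable.
            backup = target.with_name(target.name + ".handedit")
            shutil.copy2(target, backup)
            print(f"warning: {target} was edited by hand; saved as {backup}")
            drifted = True

    target.write_bytes(new_content)
    record.write_text(hashlib.sha256(new_content).hexdigest() + "\n")
    return drifted
```

(This does not stop anyone from editing the file directly. What it changes is the cost of the mistake: the edit survives, and the warning gives you a chance to ask why the person thought editing the file was the right thing to do.)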
Finally, maybe you can say that you tried all of this and someone just can't keep from making mistakes anyway. Perhaps you have a sysadmin who just keeps editing files directly despite lots and lots of attempts to educate them otherwise. In this case you still have a failure of process; to put it bluntly, how did you manage to miss the problem when you hired this person and how come they are still working for you? Hiring and retaining bad or incompetent people is itself a failure that you need to address.
Understanding that human error is not the root cause is important because your goal should be to stop problems from happening again, and to do that you must understand why people commit those errors. Very few people deliberately do things that they know are wrong. Either they do things that they think are right, in which case you need to figure out why they thought they were doing the right thing, or they make a mistake in execution, in which case you should figure out how that mistake was possible and why it was so damaging.
(Note that 'ignorance' is not really a good explanation for why someone thought they were doing the right thing and even if it's correct, it leads to the process failure questions of why this ignorance wasn't fixed before the incident and also why the ignorance wasn't detected.)
(None of this is original to me and if I had planned this entry ahead of time I would have all sorts of links for you. Much of this information ultimately comes from general system safety research and especially aviation safety research and has reached me through sources like John Allspaw and the general Twitter sphere I follow. See eg here and this excellent presentation, which are the best recent links I could find in my Twitter stream right now.)
Sidebar: on what is and isn't a mistake
Note that there are a whole bunch of situations where people are not making mistakes in an ordinary sense, in that they are doing things that get them the results that they want but in a way that you don't like. These are 'mistakes' from your perspective but not from theirs, and in this situation it is even more important to understand why these people are taking the actions that they are. Preemptively declaring these cases a 'mistake' that you then define as being made due to 'human error' is two mistakes in one decision and is basically guaranteed to not solve your problems and to give you increasingly toxic relations with those people to boot.