Traditional reliability texts identified three groups of causes of accidents: mechanical failure, operator error and natural catastrophe. However, over the last 50 years it has become increasingly apparent that mechanical failure (component failure) is, in the vast majority of cases, itself the result of human error: design oversight, improper maintenance, improper installation, and so on.
In the June edition of Nuclear Engineering International Ken Ellis presented an elegantly argued essay, ‘Putting people in the mix,’ on the vital importance of a proper understanding of the nature of human error and how the risk posed by human performance errors can be managed. We need to understand, he emphasised, why people operating in complex systems such as nuclear power plants did what they did, and he concluded that "real risk reduction requires getting to the bottom of human performance issues." This is absolutely unarguable; however, it could be argued that our investigations into the nature and significance of human performance should not be limited to those ‘operating in complex systems’, since those systems themselves operate in an environment populated by human beings whose decisions can have a palpable influence.
Considering Costa Concordia
The opening illustrative example Ellis used was the 2012 maritime casualty of the cruise ship Costa Concordia, which resulted in 32 fatalities. This was a well-found, modern (2004) vessel, fitted with all the latest navigational mod cons and carrying a full set of up-to-date charts. How, then, could such a vessel, in calm sea conditions, strike a charted reef at a speed sufficient to tear open a 160 ft gash 26 ft below the water line? The ‘how’ turns out to be quite simple: at about 20:10 the captain took his vessel off its planned course to carry out a ‘sail-past’ of Isola del Giglio. It was after dark, he was navigating visually and had switched off the alarms on the ship’s computer navigation system. Thirty-five minutes later the vessel had come within less than half a mile of the shore when it struck the reef. So why did a presumably experienced and qualified master place his ship in such a situation? Why did the other officers on the bridge allow this to happen? One needs to look outside the ‘complex system’ of the ship and examine the broader environment.
The first thing to note is that any skipper, whether commanding a 12-ton yacht or a 114,000-ton cruise ship, should know that close inshore sailing is inherently hazardous; for large vessels especially so, since their manoeuvrability becomes seriously impaired in shallow water. The actions of the captain of Costa Concordia seem incomprehensible. However, it became apparent that the ship’s deviation from its course and close approach to land was not a unique event. That the company took no action to end this hazardous practice suggests either that they were totally unaware that it was hazardous, or that they wilfully ignored the hazard. In either event, management factors external to the ‘complex system’ of the Costa Concordia exerted a significant influence. This in no way detracts from the ultimate authority and responsibility borne by a ship’s captain. But captains are answerable to their shipowners, and if shipowners are aware of hazardous manoeuvres at sea and take no action to end them, then such manoeuvres have their tacit approval. The decisions and actions of those in the ‘complex system’ are moderated by the decisions and actions of those outside that system.
Beyond complex systems in nuclear
For the purposes of this discussion, perhaps we should consider the ‘complex system’ Ken Ellis identifies as being that concerned with the operation of a nuclear power plant, including all the technical services needed to support routine operation. External to that system would be the management structure of the owner-operator, together with the design, safety analysis and other services that support a number of nuclear installations.
The nuclear field offers many examples of where ‘external’ errors or omissions have been significant contributory factors to accidents and incidents, the Windscale fire of 1957 being one of the earliest. The initiating event of this accident was the decision by the pile physicist to apply a second nuclear heating to trigger Wigner energy release in the graphite moderator. At first blush this could be identified as a clear case of operator error exacerbated by misleading in-pile instrumentation. However, as the report on the accident made clear, the ‘errors’ made by the operating staff were attributable to an almost complete absence of operating documentation and other "deficiencies and inadequacies of organisation." The only operational documentation the pile physicist had to guide him through a delicate reactor manoeuvre was a memo of fewer than one hundred words. In this case those inside the complex system of Windscale Pile No. 1 were essentially set up to fail by the omissions of those outside the system.
In the case of the Three Mile Island accident, it is unarguable that it demonstrated major blind spots in the US nuclear industry’s management of operational safety, including in the regulatory sphere, most notoriously in the response to the precursor event at Davis Besse. In fact the seriously misleading nature of pressuriser level indication was identified by two B&W technical staff following the Davis Besse incident. They recorded their concern in unambiguous memoranda to B&W technical management in February 1978, but the response was dilatory and irrelevant; the real concern was not effectively communicated through the organisation. Once again, those external to the specific ‘complex system’ of TMI Unit 2 helped to set the stage for the 1979 accident.
The Chernobyl power transient was an even starker illustration of how external influences can pave the way to disaster. The immediate response to the accident was to place the blame squarely on the shoulders of the operators. It eventually became clear that this was very far from the truth. The release in 1991 of the root cause report by the State Committee on the Supervision of Safety in Industry and Nuclear Power revealed that while it was certainly true the operators placed their reactor in a dangerously unstable condition (in fact in a condition which virtually guaranteed an accident), it was untrue that in doing so they violated a number of vital operating policies and principles: no such policies and principles had been articulated. Additionally, the operating organisations had not been made aware either of the specific vital safety significance of maintaining a minimum operating reactivity margin (ORM), or of the general reactivity characteristics of the RBMK which made low power operation extremely hazardous.
It is clear that not all safety challenges originate within the ‘complex system’; many of the most serious ones are initiated and nurtured within the external management structure that ultimately governs the operation of that system.
Learning from accidents
In Learning from Accidents in Industry Trevor Kletz identifies ‘managerial ignorance’ as a recurrent element in accidents, noting the "failure to learn from the experience of the past … Organisations have no memories. Only people have memories and they leave." Failing to learn from experience (or failing to properly record those lessons and ensure they remain part of the collective consciousness) is a major factor.
But there are others. Indeed, Ken Ellis identifies several important ones, including ‘unofficial’ messages from management and acceptance of abnormalities (or "the insidious acceptance of slowly-degrading standards").
The interesting question is how such problems may be approached outside the formalised structure of the ‘complex system’ itself. What is being discussed here is human failure in the management chain, remote in time, place and corporate hierarchy from the human-machine interface, and it is this remoteness that distinguishes it from the broader concept of safety culture. A strong, healthy safety culture will provide an environment in which the probability of this kind of failure is significantly reduced. But it will not be eliminated, because any organisational culture is inherently sensitive to senior management influences.
The following suggestions could help to reduce the probability and severity of human failure in the management chain.
- The decision ‘log’
- When things go wrong in the management chain, just as in the control room, it is essential to know who did what, when they did it, and why they did it. For this reason management decisions should be properly documented, complete with summaries of the rationales for those decisions. The objective here is not to pillory somebody for making an ‘incorrect’ decision, but to see how decision-making can be improved and the likelihood of bad decisions reduced.
- Integrity of information flow
- In the nuclear industry, as in any engineering or technologically-based enterprise, decisions must be founded on sound technical information. This flows up the management chain from the technical specialists, and as it does so its meaning and significance may become distorted through selective quotation or omission of specific qualifications. This should be countered, in part, by a policy that requires technical documents or their summaries to record what technical or editorial modifications have been made, and by whom. This would help to ensure that conclusions and recommendations are not taken out of context. Of course, it is impossible to ensure that distorted information will not be passed up the line, but it should be possible to increase the probability that distortions will be detected.
- Effective problem identification and corrective action
- A familiar kind of management failure is the ‘evaporating issue syndrome,’ when an issue is raised, and is then shunted through the organisation until it finishes up on some metaphorical siding, unresolved and, for the most part, forgotten. Any safety issue raised should be properly tracked, and at each stage of its progress through an organisation there should be written confirmation of to whom it has been referred, when and for what purpose. This information should be fed back to the originators so they can confirm that the matter is not being misunderstood.
- Redundant, separate and diverse communication
- There are occasions when safety issues are simply not addressed. This may be because the issue is regarded as trivial, because it is something that is outside the ‘mindset’ of the organisation, because the person raising the issue lacks credibility, because management intimidation prevents the issue being raised, or simply because the matter gets lost in the pressure of other work. Whatever the reason, an organisation must ensure that there is an alternate path available for staff to raise safety issues. Even malcontents and eccentrics can sometimes be right, and time and resources applied to encourage the free flow of safety discussion cannot be said to be wasted. Safety meetings, if conducted with more rigour and formality and reported via comprehensive and widely distributed minutes, could well have a part to play here. However, application of the ‘diversity’ and ‘separation’ principles suggests that in addition an ombudsman function is needed. The provision of diverse and redundant safety systems is a basic principle of nuclear safety – it is surely rational to apply a similar philosophy to the management of nuclear safety.
- Pressures on managers
- There are multitudinous reasons why any manager may make inappropriate decisions. Stress, in all its forms, and illness (including alcohol or drug abuse) are undoubtedly significant contributors. Obviously stress cannot be eliminated from the working environment, but an organisation can ensure that employee health is properly monitored, as is exposure to life-event stressors. And the working environment should be such that people feel free to self-report, without fear of punitive repercussions, when they feel unfit for whatever reason. External pressures are another important possible cause, particularly for top management (board of directors and CEO). These executives are the principal barrier between an organisation and those (for example, politicians or shareholders) who make demands that will challenge safety performance.
- The role of the regulator
- The regulatory authority must be prepared and resourced to investigate promptly the effects of any organisational changes in strategy or structure which may impinge on the management of a nuclear power plant. As well, regulatory oversight should include monitoring the functioning of the corrective action programme, the extent of provision of alternative communication channels and the functionality of those channels. The regulator would need to maintain an active and intrusive role in investigating licensee events, not only events at the nuclear plants themselves but also failures in the safety management process outside plant boundaries, and should be resourced to field investigation teams at short notice in order to anatomise such events and determine their institutional root causes.
Conclusion
Since the Three Mile Island accident the nuclear industry throughout the world has made great strides in improving human performance at nuclear installations, inside the ‘complex system.’ Indeed, today any visitor to a nuclear power plant almost anywhere in the world could not fail to be impressed by the dedication and professionalism of all those working there.
With the aid and encouragement of organisations such as WANO, many first-rate methods have been developed to establish, monitor and maintain very high standards of operational competence and to ensure that the lessons of experience are widely promulgated. As Ken Ellis argues, putting people in the mix is essential. We should make sure that all the people who should be put in the mix are.
About the author
David Mosey is the author of Reactor Accidents (2nd edition 2006, ISBN 1-903-07745-1, from the publishers of Nuclear Engineering International, available to buy online). He worked for 30 years in the Canadian nuclear industry, including 18 years in nuclear safety functions at Ontario Hydro and its successor companies.