Periodically, we discuss the investigation or root cause analysis of a disaster in an industry other than healthcare, most often an accident in aviation or another transportation industry. We do this because the typical cascade of human and system errors and root causes that come to light is so similar to those in healthcare incidents that result in patient harm. Hopefully, the lessons learned from these accidents can be used to avert similar incidents in healthcare.
AirAsia Flight QZ8501 crashed into the Java Sea on December 28, 2014, killing all 162 people aboard (KNKT 2015). It was cruising at 32,000 feet on a flight from Indonesia to Singapore when the cascade of precipitating events took place. Meteorological conditions did not appear to be a factor. The pilot-in-command had over 20,000 hours of flying experience and the second-in-command over 2,000 hours, most in the Airbus A320, the type of aircraft involved in the crash. Both had passed their most recent proficiency checks in the month prior to the crash. Though there was an issue regarding the airline's approval for the flight (a question about which days this airline could fly the route), that did not appear to be a factor in the crash.
At the center of the cascade: cracked soldering on a small circuit board in the Rudder Travel Limiter Unit (RTLU). Cracking of a solder joint on two channels resulted in loss of electrical continuity and led to RTLU failure. Maintenance records showed 23 Rudder Travel Limiter problems between January 2014 and December 27, 2014, with the interval between malfunctions becoming shorter in the last 3 months even though maintenance actions had been performed since the first malfunction was identified in January 2014.
Just 3 days before the accident, one of the pilots of the fatal flight was flying this same aircraft on another flight and encountered an alarm indicating RTLU dysfunction. Airline mechanics addressed the problem while the pilot observed. One of the corrective actions involved pulling and resetting a circuit breaker. The pilot asked whether he could do that himself if the alarm went off again and was told he could, if the computer/dashboard message so indicated.
On the fateful day, the Airbus A320 was cruising at 32,000 feet when the alarm regarding the RTLU triggered. In fact, it triggered multiple times. The first three times the pilot-in-command (PIC) and the second-in-command (SIC), who was flying the plane at the time (under autopilot), responded appropriately to the alarm and performed the specified actions. The fourth time, however, they did something different: they pulled the circuit breaker.
These events led to disengagement of the autopilot and auto-thrust and meant that the pilot flying (PF) was now manually flying the aircraft. You'll recall from some of our prior columns the phenomenon of "automation surprise", a concept describing what happens when computers working in the background cease functioning and multiple parameters that were visible to the computer are not readily apparent to the pilot now flying manually. Some of those parameters may include the speed of the plane, attitude (orientation to the ground), rotation, etc.
One of the parameters that changed was airspeed. Aerodynamic stalls occur when airspeed drops to the point where the airflow over the wings, at the current "angle of attack", no longer provides the "lift" that keeps the plane airborne. The plane also began to change its attitude and rotation. The "stall" alarm sounded.
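For readers who want the physics behind that sentence: lift grows with the square of airspeed, so below a certain speed even maximum angle of attack cannot generate enough lift to support the aircraft's weight. A minimal sketch of that relationship follows; all numbers are illustrative round values, not A320 specifications.

```python
# Illustrative only: the classic lift equation L = 0.5 * rho * v^2 * S * CL.
# All values below are hypothetical round numbers, not A320 data.

def lift(rho, v, wing_area, cl):
    """Aerodynamic lift in newtons for air density rho (kg/m^3),
    speed v (m/s), wing area (m^2), and lift coefficient cl."""
    return 0.5 * rho * v**2 * wing_area * cl

def stall_speed(weight_n, rho, wing_area, cl_max):
    """Speed below which even maximum lift no longer supports the weight."""
    return (2 * weight_n / (rho * wing_area * cl_max)) ** 0.5

# Hypothetical airliner: 60-tonne weight, 120 m^2 wing, CL_max of 1.5,
# thin air at cruise altitude (rho ~ 0.4 kg/m^3).
v_stall = stall_speed(60_000 * 9.81, 0.4, 120, 1.5)

# At any speed above v_stall, available lift exceeds weight.
assert lift(0.4, v_stall * 1.1, 120, 1.5) > 60_000 * 9.81
```

The quadratic dependence on speed is why thin high-altitude air narrows the margin between cruise speed and stall speed.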
Pilots, of course, train for aerodynamic stalls as a core element of their training and preparation for flying and do stall simulations as well. Most stalls, however, occur when a plane is changing speeds or angles of attack such as on takeoffs and landings. Stalls occurring on “level” flying at high altitude cruising are much rarer.
The usual response to an aerodynamic stall is to put the nose downward until lift is restored. This is done via a "stick" or lever next to the pilot that is somewhat like the joystick you might be familiar with on computer games. To go down you push the lever forward. To go up you pull it back. But when the stall began, the PIC (pilot-in-command) yelled "pull down…pull down" (repeated 4 times). This order is ambiguous because if you pull the lever/stick down (i.e., back toward you), the plane's nose goes up, which accentuates a stall.
In addition, the plane has dual controls: there is a second stick/lever at the side of the pilot not flying (PNF). But if one stick/lever system is not deactivated, both provide input and the net result is the algebraic sum of the two inputs. It appears that in this incident the PIC began using his stick/lever at the same time the SIC was using his, and they were likely moving them in opposite directions, negating each other's actions. The net effect was that the desired downward angle did not occur. In fact, the nose rose further and the plane began to rotate and spin.
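That "algebraic sum" behavior is easy to illustrate. The sketch below is a deliberate simplification of the real Airbus dual-input logic (which also includes priority pushbuttons and dual-input warnings); it shows only the summing-and-clipping behavior described above, with hypothetical normalized inputs:

```python
def combined_pitch_input(captain_stick, fo_stick):
    """Simplified model of dual-sidestick summing. Each input ranges
    from -1.0 (full nose-down) to +1.0 (full nose-up); the two are
    added, then clipped to the physical deflection limit. The real
    Airbus logic is more elaborate than this illustration."""
    total = captain_stick + fo_stick
    return max(-1.0, min(1.0, total))

# One pilot commands full nose-down while the other commands full
# nose-up: the inputs cancel and the aircraft receives no pitch command.
print(combined_pitch_input(-1.0, +1.0))   # 0.0

# Both pilots command nose-down: the sum is clipped at the limit.
print(combined_pitch_input(-1.0, -0.5))   # -1.0
```

The first case is exactly the failure mode described above: two pilots each acting correctly from their own perspective, yet jointly producing no corrective input at all.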
In addition, immediately increasing the thrust of the engines may result in the nose of the plane lifting up, which further accentuates the stall. You may actually need to reduce the thrust temporarily if the nose is not angled down.
Note that this is all somewhat counterintuitive. The natural human response might be to point up and increase the speed. That is why training for such emergencies is critical. Pilots need to know immediately what to do in such cases and to have practiced it at least in simulation exercises. These pilots were trained in, and had experience with, recovery from an approaching stall. But recovery from a stall at zero pitch (as in this instance) had never been trained; stall training was always conducted with a high pitch attitude.
The plane was now spinning out of control in what is known as an "upset". Upset recovery training was included in the aircraft operator's training manual. But the aircraft operator told investigators that the flight crew had not received upset recovery training on the Airbus A320, reflecting the Flight Crew Training Manual's "Operational Philosophy": "The effectiveness of fly-by-wire architecture, and the existence of control laws, eliminates the need for upset recovery maneuvers to be trained on protected Airbus". (And apparently the regulatory body in Indonesia did not find that to be out of compliance.)
Below we summarize some of the key factors that played a role in this tragic accident. In some cases we provide analogies to healthcare incidents. But, in general, we leave it to your imagination to envision how such human factors and system failures seen in this case might occur every day in various venues in your healthcare system.
Communication issues
In healthcare sentinel events, communication breakdowns contribute to 75-85% or more of events. The same obviously occurs in aviation and other transportation accidents. There were several examples of such miscommunication in the current case.
Prior to removal of the circuit breaker there should have been a discussion between the PIC and SIC about what might happen if and when the circuit breaker was removed. Their training manual has a chapter on "Computer Reset" that says: "In flight, as a general rule, the crew must restrict computer resets to those listed in the table. Before taking any action on other computers, the flight crew must consider and fully understand the consequences." The investigators note this statement was potentially ambiguous and open to multiple interpretations. But had it been followed, the crew might have realized the autopilot would disengage and been prepared for that.
A series of serious miscommunications occurred once the stall alarm triggered. The PIC shouted "level…level…level" (repeated 4 times). But it was not clear whether he meant to level the wings or level the "attitude" or orientation of the plane to the ground. Then he followed with the command to "pull down…pull down" (repeated 4 times). As above, this order is ambiguous because if you pull the lever/stick down, the plane's nose goes up, which accentuates a stall.
Another miscommunication was one that did not take place but should have. When the PIC began to manipulate his stick/lever, standard operating procedure would have been for him to call out "I HAVE CONTROL" and for the other pilot to transfer control by calling out "YOU HAVE CONTROL". Had that happened, perhaps the cancelling effect of operating two sticks/levers simultaneously would not have occurred.
"Hearback" is a critical feature of good communication in cockpit resource management. In all likelihood the events in this case unfolded so rapidly that there was not enough time for it. But that is one of the reasons that training and simulation exercises are so important. They prepare you for what to say, do, hear, and how to respond under the most emergent circumstances.
One factor not commented upon in this case that has been a contributing factor in other transportation accidents is language/cultural disparity. The PIC here was Indonesian and the SIC was French. We don't know whether this might have led to any difficulties in communication. And we are not simply talking about language. In some of the other accidents, hierarchical cultural factors have led to SICs being afraid to challenge PICs.
Automation surprise
Several of our previous columns have discussed automation surprises, whether from a computer changing things in the background or a system operating in several "modes" controlled via a single switch. Here the obvious automation surprise was the pilot flying (PF) not seeing all the issues that arose when control of the plane was abruptly turned over to him as the autopilot disengaged.
Most of you have already personally experienced automation surprises. Have you ever been driving on a highway and started slowing down, only to find that the cruise control on your car (you forgot you had set it!) speeds you up?
In healthcare, much of our equipment in ICUs (and other areas) is high tech, and computers are often doing things in the background that we are not immediately aware of. Similarly, we often have equipment that runs several different "modes" off a single switch. We gave an example of such a surprise back in 2007 when we described a ventilator operating on battery power when all thought it was operating on AC current from a wall socket.
Alarm fatigue
When we talk about alarm fatigue we are usually discussing situations in which an overabundance of clinically irrelevant alarms causes us to ignore alarms that are clinically important. But in this accident one of the proximate causes was the inordinate attention paid to shutting off an alarm that had previously produced so many false alarms. So it really was a form of alarm fatigue.
Loss of situational awareness
Loss of situational awareness is often a factor contributing to many accidents. In this case, there was some such loss in that the SIC probably was not immediately aware of multiple important flight parameters when the autopilot abruptly disengaged.
Loss of spatial orientation
Loss of spatial orientation can occur in a couple of ways. The investigators note "Mulder's Law", in which there is a threshold (called Mulder's Constant) below which accelerations are not sensed by the human vestibular system. That certainly might have come into play as the plane spiraled out of control. The other would be being unable to see the horizon or other ground references as the plane nosed upward. Such disorientation has been raised in previous aircraft accidents like the Mt. Erebus crash in Antarctica or the crash in the Atlantic involving John F. Kennedy, Jr.
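The idea behind Mulder's Law can be expressed as a simple threshold test: angular accelerations below a perceptual constant simply go unnoticed by the vestibular system. The sketch below is only an illustration of that idea; the threshold value used is a hypothetical placeholder, not the constant cited in the investigation report.

```python
# Illustrative model of Mulder's Law: sub-threshold angular
# accelerations are not perceived. The threshold value below is a
# hypothetical placeholder, not the figure from the KNKT report.
MULDER_THRESHOLD = 0.3  # deg/s^2, illustrative only

def is_rotation_perceived(angular_accel_deg_s2):
    """True if the vestibular system would sense this angular
    acceleration; False if it falls below the perceptual threshold."""
    return abs(angular_accel_deg_s2) >= MULDER_THRESHOLD

# A slow, creeping roll can develop entirely unnoticed...
print(is_rotation_perceived(0.1))   # False

# ...while an abrupt upset is felt immediately.
print(is_rotation_perceived(5.0))   # True
```

The danger, of course, is the first case: an aircraft can drift well away from level flight through a series of sub-threshold changes, so that by the time the pilots feel anything, the attitude is already badly wrong.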
The startle reflex
The investigators note that for pilots the main effects of the startle reflex are interruption of the ongoing process and distraction of attention toward the stimulus. These happen almost immediately and can be quickly dealt with if the cause is found to be non-threatening. However, the distraction can potentially reduce a pilot's concentration on flight-critical tasks.
Cognitive biases
When we discuss diagnostic error we note there are numerous cognitive biases (see our numerous columns on diagnostic error). Particularly when we are in the "pattern recognition" or "fast thinking" mode, we are more prone to such biases. In this case "availability bias" was likely operative in that the PIC resorted to pulling the circuit breaker because he had recently seen the mechanic do that to fix the faulty RTLU alarm.
Maintenance issues
In our August 7, 2007 Patient Safety Tip of the Week "Role of Maintenance in Incidents" we noted that many of the most famous disasters in industry history have followed equipment or facilities maintenance activities, whether planned, routine, or problem-oriented. Well-known examples include Chernobyl, Three Mile Island, the Bhopal chemical release, and a variety of airline incidents and oil/gas explosions. In this case the failure to perform a permanent fix for the faulty circuit board was a critical factor.
In that column we noted that James Reason and Alan Hobbs, in their 2003 book "Managing Maintenance Error: A Practical Guide", do an outstanding job of describing the types of errors encountered in maintenance activities, where and under what circumstances the various types of error are likely to occur, and steps to minimize the risks.
Failure to prepare for emergencies
Emergencies, particularly those involving events that rarely occur, are ripe for disastrous outcomes if all parties are not prepared to respond appropriately. Pilots use checklists to help them not only with routine procedures but also to help them in many emergent situations. We’ve also discussed the use of checklists for some OR emergencies (see our August 16, 2011 Patient Safety Tip of the Week “Crisis Checklists for the OR” and our February 2013 What's New in the Patient Safety World column “Checklists for Surgical Crises”). In the OR you might have a checklist to help you respond appropriately to an unexpected case of malignant hyperthermia.
But some emergencies, such as that in the current case, merit responses that don’t allow time to consult checklists. They need immediate rehearsed responses that can only be prepared for by simulation exercises or drills. While the pilots in the current case had practiced stall recoveries, they had inadequate training for “level attitude” stalls at high speed and no “upset recovery” training. In healthcare an example needing such training or drills would be a surgical fire.
Other training issues
One very interesting item in the investigation report was that the airline does not practice full "role reversal" in its operations training. In flight there is task sharing and duty allocation between the pilots and copilots, the PIC (pilot-in-command) and SIC (second-in-command), or between the PF (pilot flying) and PNF (pilot not flying), with one sitting in the "right-hand seat" and the other in the "left-hand seat". The investigators' implication is that it would be very useful during training to have these various actors switch places and roles to gain a better understanding of each other's perspective.
What a novel idea! This is one we most definitely should adopt in some of our simulation exercises in healthcare!
Overconfidence
What better example of overconfidence than the Flight Crew Training Manual "Operational Philosophy": "The effectiveness of fly-by-wire architecture, and the existence of control laws, eliminates the need for upset recovery maneuvers to be trained on protected Airbus". That's like saying the Titanic is so well designed and built that it will never sink, so don't train for evacuating it.
Other root causes
Most of the above are proximate causes or contributing factors. But if we dig deeper we would undoubtedly find other root causes. For example, why wasn't the faulty circuit board simply replaced rather than resoldered each time, when clearly such resoldering was failing? Were there financial considerations that led to that?
The same questions apply to the lack of training. Was that truly because the manual implied such events could never happen on the Airbus 320 (an obviously erroneous assumption)? Or were there financial or other operational barriers to providing that training?
And what about the regulatory oversight of the airline industry in Indonesia? Was it an oversight that they did not require all airlines to provide “upset recovery” training for all pilots? Should they not have seen that one particular part had required 23 maintenance interventions?
Failure to learn from previous accidents or incidents
The similarities to the 2009 Air France Flight 447 crash and the 2009 crash of Continental Flight 3407 near Buffalo, New York are striking. In the former, ice crystals obstructed the aircraft's pitot tubes, leading to inconsistencies between the displayed and actual airspeed measurements and causing the autopilot to disconnect, after which the crew reacted incorrectly and ultimately put the aircraft into an aerodynamic stall from which it did not recover (Wikipedia 2015). In the latter accident (see our Patient Safety Tip of the Week "Learning from Tragedies. Part II") the pilot also did the opposite of what he should have done when the plane went into an aerodynamic stall.
Lingo and abbreviations
In our Patient Safety Tip of the Week "Learning from Tragedies. Part II" we noted the dizzying array of abbreviations used in the aviation industry. When one reads the investigation report of the AirAsia Flight QZ8501 crash, one wonders how pilots manage to function in cockpits without confusing these abbreviations and terms. Again, we think aviation clearly needs a "do not use" abbreviation list similar to what we use in healthcare (see our Patient Safety Tips of the Week for July 14, 2009 "Is Your "Do Not Use" Abbreviations List Adequate?" and December 22, 2015 "").
Design issues
We don't pretend to be experts in design from a human factors perspective. But having a system with sticks/levers that makes it counterintuitive to operate in an aerodynamic stall situation seems to us an obvious design flaw. And a system that allows two pilots to simultaneously use controls that may counterbalance each other? We'll bet Don Norman, author of the classics "The Design of Everyday Things" and "The Design of Future Things" (see our November 6, 2007 Patient Safety Tip of the Week "Don Norman Does It Again!"), would never let that happen!
The second in command was flying
Perhaps it’s coincidence but in this crash, in the Air France Flight 447 crash, in the Asiana Flight 214 crash (see our January 7, 2014 Patient Safety Tip of the Week “Lessons from the Asiana Flight 214 Crash”) and others where there were aerodynamic stalls from which recovery failed, the plane was being flown by copilots or second-in-command pilots. That does not necessarily imply that these generally less experienced pilots are more likely to err. Remember, they are often flying those parts of a flight that are considered relatively safe (i.e. the “cruise” part). And we do understand the importance of allowing the copilots to get the hours of airtime experience.
But maybe we are also deluding ourselves and letting our guard down once we get into the “safe” part of a flight. Isn’t this a little like when the attending physician leaves one OR to begin a second case in another OR because the first case is now in the “safe” portion (see our November 10, 2015 Patient Safety Tip of the Week “”)?
Yes, almost every incident with serious harm, whether in healthcare or aviation, involves a combination of human error and system error. Many of the press reports on this accident seemed to vilify the pilots, using terms such as “unfathomable” to describe some of their actions. But the bottom line is that their actions are not really so “unfathomable” given that other pilots and copilots have made similar mistakes in similar circumstances. It took us many years to realize that giving fatal doses of concentrated potassium to patients was really a system problem rather than “unfathomable” actions of a few nurses left at the “sharp end”.
Interestingly, some of the other factors we have often seen contributing to such accidents (e.g., fatigue, violation of the sterile cockpit rule) do not appear to have contributed to the current accident.
And the most important lesson? Just as in serious healthcare incidents, there was a cascade of errors and events with multiple contributing factors that led to the disaster. Prevention of almost any one of many factors might have prevented this crash. What if the airline had provided training for "level" high-speed stalls? What if the sticks/levers had been designed so that both could not be used simultaneously and cancel each other out? What if the terminology used had been standardized and unambiguous? What if the circuit board had simply been replaced rather than repeatedly repaired by resoldering? What if the pilots had discussed what would happen when the circuit breaker was removed and realized that the autopilot would disengage? What if the actions of the pilots in the prior Air France or Buffalo crashes had been incorporated into training? What if the regulator had mandated that the airline fix the recurrently faulty circuit board?
Hopefully, some of the lessons learned from the investigation of this crash may be used to help prevent other disasters, not only in aviation but also in healthcare or other industries.
See some of our previous columns that use aviation analogies for healthcare:
May 15, 2007 “Communication, Hearback and Other Lessons from Aviation”
August 7, 2007 “Role of Maintenance in Incidents”
August 28, 2007 “Lessons Learned from Transportation Accidents”
October 2, 2007 “ ”
May 19, 2009 “Learning from Tragedies”
May 26, 2009 “Learning from Tragedies. Part II”
January 2010 “ ”
May 18, 2010 “Real Time Random Safety Audits”
April 5, 2011 “More Aviation Principles”
April 26, 2011 “Sleeping Air Traffic Controllers: What About Healthcare?”
May 8, 2012 “Importance of Non-Technical Skills in Healthcare”
March 5, 2013 “Underutilized Safety Tools: The Observational Audit”
April 16, 2013 “Distracted While Texting”
August 20, 2013 “Lessons from Canadian Analysis of Medical Air Transport Cases”
December 17, 2013 “The Second Victim”
January 7, 2014 “Lessons from the Asiana Flight 214 Crash”
KNKT (Komite Nasional Keselamatan Transportasi). Aircraft Accident Investigation Report. PT. Indonesia AirAsia Airbus A320-216, PK-AXC, Karimata Strait, coordinates 3°37’19”S 109°42’41”E, Republic of Indonesia, 28 December 2014. KNKT 2015
Wikipedia. Air France Flight 447. Accessed December 23, 2015