SMS PART TWO: MAKING IT USEFUL

Safety Management Systems: they seem complicated. But in this series we are aiming to help make them simple to implement.

In the first article, we examined some hazard identification strategies. In this article we’ll begin looking at the process of using risk assessment to analyze our identified hazards. If you don’t remember how to identify hazards, then look back at the Spring 2021 issue to refresh your memory (it is available online).

The point of hazard identification is to find the things that could go wrong in your system. In that article, we suggested documenting hazards in a centralized and comprehensive hazard log. We specifically recommended using a database. A database will allow you to analyze trends in hazards, reference the mitigations associated with each hazard, and even serve as a tool for change management (we will address all of these in future articles). Before we can start tying our hazards to mitigations, though, we first need to examine how to assess the risk posed by each hazard.
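
To make the idea concrete, here is a minimal sketch of how such a hazard log might be structured as a database, using SQLite from Python; the table and column names are hypothetical illustrations, not a prescribed format.

import sqlite3

# A minimal, hypothetical hazard log schema: one table of hazards and one
# table of mitigations, linked so that trends and mitigation coverage can
# be queried later.
conn = sqlite3.connect("hazard_log.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS hazards (
    hazard_id       INTEGER PRIMARY KEY,
    description     TEXT NOT NULL,
    category        TEXT,               -- taxonomy category, e.g. 'Human Factors'
    date_identified TEXT,               -- ISO date the hazard was logged
    likelihood      INTEGER,            -- assigned during risk assessment
    consequence     INTEGER             -- addressed in a later article
);
CREATE TABLE IF NOT EXISTS mitigations (
    mitigation_id   INTEGER PRIMARY KEY,
    hazard_id       INTEGER REFERENCES hazards(hazard_id),
    description     TEXT NOT NULL,
    implemented     INTEGER DEFAULT 0   -- 0 = planned, 1 = in place
);
""")
conn.commit()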

We assess risk for a number of reasons.

One reason for assessing risk is to better allocate limited resources. If you know that you have three hazards that you could mitigate, but you can only mitigate them one at a time, then a mechanism for deciding which hazard is most important to address will help you decide how to allocate your resources.

A second reason for assessing risk is to decide when you have done a sufficient job in reducing the risks posed by the hazard. By assessing risk, you can set a metric for when risks are considered to be adequately contained. This tells you, prima facie, when a mitigation is considered to be “good enough.”

A third reason for assessing risk is to permit the system to engage in constant improvement. If you assess the risk levels posed by a set of hazards, then you can mitigate the risks to the acceptable level that has been set by the company. Once the known hazards have all been mitigated to the acceptable level, the company can decide to pursue a higher level of safety by changing the acceptable level of risk! For example, if you create a system that assigns risk values to hazards, and you successfully build a system that mitigates all of the hazard-based risks to a value of 10 or less, then after achieving that goal, you might next seek to mitigate the risks valued at 9 and 10 to a value of 8 or less.

A fourth reason for assessing risk is to have a mechanism for judging your company’s progress on the safety continuum. By assessing and assigning numerical risk values to each hazard, you have an opportunity to record and assess the progress your company is making on its path toward safety. You can set risk-based goals (“performance indicators”) like reducing every risk below a certain threshold or reducing the average of all risks in the system below a threshold.
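
As a hedged illustration of how these performance indicators might be computed from a hazard log, the short Python sketch below assumes each hazard already carries a numeric risk value; the specific values and threshold are invented for the example.

# Hypothetical risk values already assigned to hazards in the log.
risk_values = [12, 9, 10, 4, 7, 15, 8]

acceptable_level = 10  # the company's current "acceptable risk" threshold

# Performance indicators drawn from the examples in this article:
worst_risk = max(risk_values)                       # reduce every risk below a threshold
average_risk = sum(risk_values) / len(risk_values)  # reduce the average of all risks
needs_mitigation = [r for r in risk_values if r > acceptable_level]

print(f"Worst risk: {worst_risk}")
print(f"Average risk: {average_risk:.1f}")
print(f"Hazards above the acceptable level: {len(needs_mitigation)}")

# Continuous improvement: once everything is at or below 10, the company
# might lower the acceptable level (for example, to 8) and repeat the exercise.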

So what does it mean to assess risk?

We typically assess risk in an SMS by assigning two values to each hazard. The first value is “likelihood,” and the second value (which we’ll examine in the next article) is “consequence.” Together, they can provide a measure of the risk posed by a particular hazard.

Likelihood reflects the prospect that the hazard condition will manifest itself. The purpose of this assignment is to rank more likely occurrences higher than less likely occurrences, so it is typically not an absolute measure of probability. The values used may vary based on the system and its needs. For example, in a manufacturing environment, you might assess likelihood values related to failures of the manufactured product based on the probability of hazard occurrence per operational hour. In the FAA certification system, a likelihood measured at one occurrence in less than 100,000 hours of operation is considered to be probable, while a likelihood measured at one occurrence in more than 1,000,000,000 hours of operation is deemed to be extremely improbable. These two metrics reflect the bookends of the likelihood range in an FAA certification project. The United States military also uses safety management; it deems a hazard to be probable if it will occur several times in the life of a system, and it reserves another value – frequent – for hazards that are likely to occur frequently in the life of a system. In other systems, the values may distinguish among hazards that will certainly arise in the life of a system (100% chance), hazards that are expected but may not arise 100% of the time, and hazards that are remote in the sense that they have not yet arisen but are nonetheless feasible.

The scale that you use should be tailored to the particular hazards in your system and to the factors that will provide meaningful distinctions among the hazards being analyzed. For example, FAA certification distinctions may not be appropriate for a repair station, because the repair station may want to identify hazards that happen every day and distinguish them from those that happen once per week, and distinguish those from hazards that arise once per month. All three categories likely fall into the “probable” likelihood on the FAA certification scale, but if they all fall into the same category, then the likelihood metric is not being successfully used to distinguish them.

In a repair station environment, you will encounter hazards, such as human factors issues, that arise far more often than the frequencies described in the FAA certification probabilities, so the FAA certification range probably does not provide the appropriate metrics for judging the likelihood of hazards in a repair station. For purposes of this article, we shall use the following as our likelihood values:

Notice that these values are based on narrative descriptions, rather than hard numerical probabilities. This is because the typical repair station may be unable to classify its hazards based on strict numerical probabilities. A repair station will also have to consider the scope of the narrative descriptions (which may be based in part on the sources of hazard data). For example, if you are examining the failure of a particular OEM part, then the repair station’s experience may suggest it is a level 2 likelihood (“never has occurred but the hazard could reasonably occur”); but expanding the scope to include data from other repair stations might shift it to level 3 (“has occurred, and without mitigation, the hazard would probably occur less often than once per month OR never has occurred but the hazard is likely to occur in the future”).
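
To show how such a scale might be captured for consistent use, here is a minimal Python sketch. Levels 2 and 3 use the narrative descriptions quoted in this article; the descriptions for levels 1, 4, and 5 are illustrative placeholders only, since your own table should reflect your own operation.

# A minimal sketch of one way to encode a five-level likelihood scale.
# Levels 2 and 3 use the narrative descriptions quoted in this article;
# levels 1, 4, and 5 are illustrative placeholders, not the article's table.
LIKELIHOOD_SCALE = {
    1: "Illustrative placeholder: not reasonably expected to occur",
    2: "Never has occurred but the hazard could reasonably occur",
    3: ("Has occurred, and without mitigation, the hazard would probably "
        "occur less often than once per month OR never has occurred but "
        "the hazard is likely to occur in the future"),
    4: "Illustrative placeholder: occurs more often than once per month",
    5: "Illustrative placeholder: occurs on a daily or weekly basis",
}

def likelihood_description(level: int) -> str:
    """Return the narrative description for a chosen likelihood level."""
    return LIKELIHOOD_SCALE[level]

# Example: the inspector assigns level 3 to a hazard that has occurred once.
print(likelihood_description(3))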

Let’s say that the hazard in question is the release without final inspection of a unit that was subject to overhaul procedures. Let’s also say that this hazard is identified because it occurred in the facility. Because it actually happened, this automatically gives it a level 3, 4, or 5 likelihood (based on the definitions above). It might be assigned a likelihood level based on past experience (if this has happened before, then the prior occurrence experience might help assign a likelihood level) or based on the intuition of the inspector responsible for the assignment. In this scenario, there is no precise answer, and therefore it makes sense to have one person or one group assessing the likelihood level in order to ensure risk assignments follow a reasonably standard pattern (so you do not have radically different risk assignments based upon different opinions of the narrative descriptions).

Because different people can come up with differing opinions about likelihood, a more objective standard can be valuable (so please do not assume that the likelihood values in the above table reflect an ideal). When you are establishing likelihood values and narratives, don’t be afraid to adjust them to suit the needs of your business (including the need to distinguish more-likely events from less-likely events). If you do adjust your values, though, then you may need to re-analyze past risk assessments to update them to the new standard so you can compare hazards according to the same metrics.

The table includes five different levels. Your table may include more or fewer levels. The important thing is that the table you develop for your own system must distinguish among hazards in a way that is useful to your analysis of those hazards.

Your likelihood assessments should permit you to distinguish hazards based upon the difference in their likelihood. If likelihood were the only metric that you used, then this would permit you to focus first on the most likely hazards, and save the less likely hazards to be mitigated later.

Likelihood is not the only metric we typically use to assess risk. Next issue, we will examine the metric known as “consequence,” which will help us to distinguish the most damaging hazards from the less damaging hazards. Using likelihood and consequence together, we will be able to judge which hazards pose a greater risk to safety.

In the next issue, we’ll look at the process of using “consequence” as part of our risk assessment, and we will examine how to analyze our identified hazards in a risk assessment environment. Want to learn more? We’ve been teaching classes in SMS elements, and we’ve advised aviation companies in multiple sectors on the development of SMS processes and systems. Send us an email if we can help you with your SMS questions.

Jason Dickstein is an aviation lawyer based in Washington, D.C. You can contact him at jason@washingtonaviation.com

SMS – What Makes it Useful?

Traditional safety approaches in aviation used to start with smoking holes in the ground. Once we had an accident, we analyzed the accident and developed corrective actions to prevent such an accident from recurring. The problem with this approach was that it required an accident before we could recognize a problem to solve. Aviation has evolved into an industry constantly seeking a better way to solve safety problems without waiting for the smoking hole.

SMS, or Safety Management Systems, is the buzzword on everyone’s lips; but what is SMS all about, anyway?

People have lauded SMS as the next great safety system. If it is implemented correctly then it has the potential to create a safer environment than traditional quality management systems because it is focused on proactive identification of hazards. This means that it doesn’t wait for a problem to find the solution … instead it identifies future potential issues and proactively mitigates them before they become real problems.

SMS is a system for managing a company’s compliance with the safety regulations of the relevant regulatory authorities. This makes it seem like a quality management system. While SMS does have some things in common with a quality management system, at its roots it is very different. It is important to understand this “root difference” when implementing SMS because misunderstanding it means that the implementer could miss many of the key benefits of SMS.

SMS, in aviation circles, is composed of four “pillars.” The four pillars are: 1. Safety Policy and Objectives; 2. Safety Risk Management; 3. Safety Assurance; and 4. Safety Promotion.

I like to focus on the second pillar: Safety Risk Management. Safety Risk Management has three elements: Hazard Identification, Safety Risk Assessment, and Safety Risk Mitigation.

The most important difference between SMS and other quality systems is hazard identification. It is normal in quality systems to identify occurrences, perform root cause analysis, and apply corrective action based on the root cause analysis. SMS takes this one step further by using processes to identify all of the possible hazards that could arise. Ideally, they are identified before they can become occurrences. This includes hazards that are adequately mitigated today. These hazard identification processes can be quite resource- and time-intensive, and they are often ongoing processes, as the company continues to supplement its database of possible hazards.

A reasonable method for a repair station to approach hazard identification – a method that can focus hazard identification on the processes most likely to yield hazards – is to begin with the processes contained in the repair station’s manuals, and then continue with the specific maintenance processes (usually from manuals) that are most often used in the facility. For example, in an engine shop this might be the overhaul manual for the engine that the facility most often handles. In each case, divide the processes into manageable chunks and analyze them. There are many ways to analyze the processes for hazards. One formal mechanism is a hazard and operability (HAZOP) study.

As you identify hazards, think about what can go wrong with each element of the process you are studying. If you are examining a cleaning step, then what happens if you skip the step? What happens if you use the wrong cleaning chemical? What happens if you apply too much or too little of the cleaning chemical? What happens if you apply the cleaning chemical using the wrong applicator (such as an abrasive applicator)? What could happen if the technicians are inadequately trained? These are the sorts of questions that a facilitator asks in a formal hazard study of cleaning processes.
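
As a rough illustration of how these questions can be applied systematically, the sketch below pairs guide-word-style prompts (loosely modeled on HAZOP practice) with a single process step; the prompts and the structure are illustrative assumptions, not a prescribed study format.

# Guide-word-style prompts, loosely modeled on HAZOP practice; these are
# illustrative examples only, not a prescribed study format.
GUIDE_WORDS = {
    "No / Not":   "What happens if the step is skipped entirely?",
    "More":       "What happens if too much of the chemical is applied?",
    "Less":       "What happens if too little of the chemical is applied?",
    "Other than": "What happens if the wrong chemical or applicator is used?",
}

def hazard_prompts(step_name: str) -> list[str]:
    """Generate facilitator prompts for one process step."""
    return [f"[{step_name}] {word}: {question}"
            for word, question in GUIDE_WORDS.items()]

for prompt in hazard_prompts("Cleaning"):
    print(prompt)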

As hazards are recognized, they should be documented in a centralized and comprehensive hazard log. The hazard log can be a database that is tied to mitigations (corrective actions). Among other benefits, such a log allows repeat hazards to be recognized and mitigated as a group. This allows a better assessment of the success of risk mitigation activities. The hazard log can also help to organize mitigations (including those already implemented) in order to build a successful safety assurance program (including auditing of the mitigations to ensure they are in place and working as expected).

It is normal to use a taxonomy – a tree-like structure of classifications – to group hazards together. The taxonomy will have high-level categories (like “Maintenance Instructions,” “Human Factors,” etc.). Below these will be additional sub-categories. For example, a “Maintenance Performance” category might have a sub-category like “Tooling,” and that might, in turn, have sub-sub-categories like “calibration,” “tool missing,” and “wrong tool.” A robust taxonomy allows similar hazards to be identified together in order to mitigate their risks together, and to look for trends. For example, a single tool that mysteriously does not have calibration records might lead you to send the tool out for calibration. A series of tools that do not have calibration records might lead you to look for a more systemic issue, like calibration records being removed by well-meaning cleaning staff!
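
One simple way to represent such a taxonomy is as a nested structure. The sketch below is a hypothetical Python illustration using the categories mentioned above; the hazard IDs are invented for the example.

# A hypothetical hazard taxonomy represented as nested dictionaries.
# Leaf lists hold the hazard IDs from the hazard log that were classified
# under each sub-sub-category.
taxonomy = {
    "Maintenance Performance": {
        "Tooling": {
            "calibration":  [101, 145, 162],   # hazard IDs (illustrative)
            "tool missing": [118],
            "wrong tool":   [127, 133],
        },
    },
    "Human Factors": {},
    "Maintenance Instructions": {},
}

def hazards_under(node) -> list[int]:
    """Collect every hazard ID under a taxonomy node, however deep."""
    if isinstance(node, list):
        return node
    collected = []
    for child in node.values():
        collected.extend(hazards_under(child))
    return collected

# Several hazards under "calibration" might prompt a look for a systemic
# cause rather than a one-off fix.
print(hazards_under(taxonomy["Maintenance Performance"]["Tooling"]))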

A robust taxonomy also makes it easier to track similar hazards for precursor data that might suggest the onset of contributing factors that could affect (or effect) the hazard. A set of hazards with a uniform cause (even if it is not the “root cause”) might be corrected through a single mitigation that targets that identified causal factor.

In other cases, grouping similar hazards together can help to recognize and document common corrections that already exist and that are already mitigating the risks posed by the hazards. For example, an inspection might be catching potential hazards before they can cause harm; if so, the inspection should be identified in the hazard database as an important mitigation related to each of those hazards. Using the database to research a future decision to potentially eliminate the inspection should then reveal that the inspection is an important mitigation related to a number of hazards; before eliminating the inspection, the facility will want to ensure that each of those hazards is adequately mitigated using other mechanisms. By creating our hazard database, we are taking the first step in creating a tool that will help to manage safety throughout the life cycle of the business.
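
Assuming a hazard database with hypothetical “hazards” and “mitigations” tables (one row per hazard, mitigations linked by hazard ID), a simple query could reveal every hazard that relies on a given inspection before anyone decides to eliminate it. The sketch below is one way that lookup might be written; the table and column names are assumptions.

import sqlite3

def hazards_relying_on(conn: sqlite3.Connection, mitigation_text: str):
    """List hazards whose documented mitigation matches the given description.

    Assumes hypothetical 'hazards' and 'mitigations' tables; run a check like
    this before removing an inspection step from the process.
    """
    return conn.execute(
        """
        SELECT h.hazard_id, h.description
        FROM hazards AS h
        JOIN mitigations AS m ON m.hazard_id = h.hazard_id
        WHERE m.description LIKE ?
        """,
        (f"%{mitigation_text}%",),
    ).fetchall()

# Example usage (assuming the database file sketched earlier exists):
# conn = sqlite3.connect("hazard_log.db")
# for hazard_id, description in hazards_relying_on(conn, "final inspection"):
#     print(hazard_id, description)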

Time to Audit Your Capability List!

Recently, a repair station lost its certificate because of the repair station’s failure to properly complete and maintain the paperwork associated with its capability list. This was not the first time I have seen this; I have represented other repair stations accused of the same sort of failure. But it shows how seriously the FAA is taking capability lists. Unless you are willing to jeopardize your repair station certificate, you should be taking the capability list process just as seriously.

This article reviews the history and evolution of repair station capability lists in the United States, makes recommendations about how to audit capability lists, and examines specific problems that arise with capability lists and offers solutions designed to facilitate compliance and ease business operation.

A Brief History of Capability Lists

Repair stations are typically given one or more ratings and operations specifications. The ratings and operations specifications provide the limits within which the repair station is authorized to perform maintenance and alterations.

The FAA issues airframe ratings. It was very common for repair stations to obtain an airframe rating. In order to maintain the airframe rating, a repair station applicant needed to “provide suitable permanent housing for at least one of the heaviest aircraft within the weight class of the rating.” For repair stations that wanted to maintain a class four rating (large all metal airframes) in the 1980s, that meant housing sufficient for a 747. Thus, hangars would be designed around the largest civil transport category aircraft: the 747.

That did not mean that the facilities actually serviced 747s. Usually, they did not. A facility that was designed around the 747-100 could fit two 737 Classics side-by-side with extra room for equipment. The class four airframe rating meant that the repair station could generally service any airframe, so long as it could meet the other applicable regulatory requirements (like personnel, equipment, etc.).

The 747-400 was a problem.

The 747-400 added about five meters of wingspan to the aircraft. When it entered into service in 1989, repair station facilities that were designed precisely around the 747-100/200/300 were no longer qualified to hold the class four rating. If only they could push the walls out by five meters!

The FAA recognized the foolishness of requiring a repair station to push out its walls to accommodate the 747-400 when that repair station only worked on narrow-body aircraft. In order to accommodate these repair stations, the FAA began using capability lists. These were lists of the airframes that the repair station was capable of repairing. The idea was that this self-imposed limitation on the rating meant that the repair station only had to accommodate the heaviest aircraft on the capability list.

The capability list was not a 100% novel solution. The regulations already required similar lists for class 2 propeller ratings and accessory ratings. These propeller and accessory lists were constrained by a need to obtain FAA approval of changes. The requirement for FAA involvement made it difficult to make timely revisions to these lists, a fact noted by the FAA itself.

As time marched on, the capability list became popular, and it was added to the FAA’s regulations when the FAA revised the repair station regulations in 2001 (it became effective in 2003). The new capability list rule permitted any repair station with a limited rating to adopt a capability list (the new regulation thus promoted adoption of limited ratings). The list would describe each article on which the repair station was authorized to work. The list could be amended upon a repair station self-evaluation; the self-evaluation would ensure that the repair station had the required facilities, equipment, materials, technical data, processes, housing, and trained personnel in place to properly perform the work on the article being added. The repair station would be required to notify the FAA periodically of the amendments, but the period and process for notification were intended to be set by the repair station, in its manual.

The new focus on limited ratings was a bit of a change. In fact, the original proposed rule (published in 1999) would have applied the capability list as a requirement for all repair stations. The final rule made it an option for limited rating repair stations. One of the specific reasons for codifying the capability list rule was to make it easier for repair stations to add capabilities, by removing the FAA approval that is required for operations specifications changes.

Today, it is common for repair stations to rely on capability lists; but a list that was once looked upon as a problem-solving device is now creating problems of its own.

Audit Your Capability List System

Capability list issues are a known problem in the industry and FAA penalties for capability list issues can be severe. It is important for repair stations to ensure that their capability lists are complete, and are being properly created and maintained.

Repair stations should certainly be auditing their capability lists; but capability list problems often arise in repair stations that did not notice them because they were operating on a business-as-usual basis. We recommend using periodic third-party auditing to ensure compliance.

What should the auditors look for? I usually like to have two major focus areas. First, ensure that the capability list accurately reflects the work you do (completeness). Second, ensure that you are creating, maintaining, and communicating the list correctly (correctness).

Auditing the capability list for completeness means ensuring that each job you’ve done was within the scope of your ratings and capability list at the time it was done. It is not enough that the job is listed on today’s capability list. It has to have been listed on the capability list at the time it was performed. For this analysis, it is important to be able to identify the dates on which the capability list was amended.

You ought to have the self-evaluation records – the FAA intended for repair stations to retain documentation of each capability list self-evaluation. The FAA anticipated, when it codified the capability list rule, that such lists could be maintained electronically, and if you keep your records electronically, then this may make it easier to pull and review those records.

What if you discover incongruities, like work that was performed before the article was added to the capability list? If you discover issues like this, then you should discuss the possibility of self-disclosure with an aviation attorney. You should also perform a root cause analysis to identify how this happened, so you can correct your capability list system and prevent the problem from recurring.

You may be able to use your information technology system as both an audit tool and as a compliance assurance tool.

As an audit tool, you can run a program to check when work began on each project and compare that date against the date on which the article (the subject of the work) was added to the capability list. This is far more efficient than reviewing paper records one-by-one. It does require a sufficiently robust taxonomy, though, to ensure that the capability list can be searched. For example, if the capability list permits work on a higher-level assembly and all subsidiary detail parts, then you have to anticipate that future work will include overhauls of those detail parts. If the program checks a detail part number against the capability list, then it might not generate a positive result unless the program can identify the article in question as a detail part of the higher-level assembly. Obviously, one way to implement this might be to reference the illustrated parts catalog (IPC) as a part of the system. The IPC listing should not necessarily reflect the depth of your capability list; listing a higher-level assembly and including “and all subsidiary detail parts” is a much better approach to creation of the capability list. The IPC reference should instead reflect a background function that could be programmed into the checking program. Because IPCs change over time – part numbers can be added or subtracted based on commercial variations like a change in vendors – it makes sense for the checking program’s log to include the date of the search and the revision level of any IPC reference.
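
To make the checking program concrete, here is a minimal Python sketch that compares each job’s start date against the date its article was added to the capability list. The record layouts, part numbers, and dates are hypothetical, and a real implementation would also log the search date and IPC revision as noted above.

from datetime import date

# Hypothetical records: when each article was added to the capability list
# (detail parts mapped to their higher-level assembly via an IPC reference),
# and when work began on each project.
capability_added = {
    "HLA-1000":       date(2020, 3, 15),   # higher-level assembly
    "HLA-1000/DP-17": date(2020, 3, 15),   # detail part covered by the HLA listing
}

work_orders = [
    {"wo": "WO-5512", "part": "HLA-1000",       "started": date(2021, 6, 1)},
    {"wo": "WO-5617", "part": "HLA-1000/DP-17", "started": date(2019, 11, 4)},
]

def audit(work_orders, capability_added):
    """Flag work that began before its article was on the capability list."""
    findings = []
    for order in work_orders:
        added = capability_added.get(order["part"])
        if added is None or order["started"] < added:
            findings.append(order["wo"])
    return findings

print(audit(work_orders, capability_added))   # -> ['WO-5617']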

Using a program to check past compliance, though, reflects a lagging indicator; and that is why it is important to ensure that you audit the system surrounding the creation, maintenance, and communication of the capability list. If properly drafted, the process should be designed to promote compliance, and should allow management to manage compliance.

If you create a checking program to audit past capability list compliance, then this can also serve as the basis for a program that runs in the background to ensure continued compliance.

As a compliance assurance tool, you can program the system to refuse to begin a project until the article in question has been added to the capability list. Practical ways to implement this can include:

  • refusing to permit the printing of a project traveler;
  • refusing to permit ordering of spare parts specific to the project;
  • refusing to permit the requisition of spare parts from inventory for the project; or
  • any other process that effectively prevents the project from beginning until the necessary capability list element has been successfully added.

This sort of compliance assurance may seem frustrating to those who want to start a project, but it forces compliance with the repair station’s capability list procedures before a non-compliance can occur.
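
A minimal sketch of one such gate in Python follows, assuming a simple in-memory view of the capability list; the function, exception, and data structure are hypothetical illustrations, and in practice the check would live inside the shop’s work order or ERP system.

from datetime import date

class CapabilityGateError(Exception):
    """Raised when a project is started for an article not yet on the list."""

def release_traveler(part_number: str, capability_added: dict[str, date]) -> str:
    """Refuse to print a project traveler until the article is on the list.

    `capability_added` maps part numbers to the date each article was added
    to the capability list (a hypothetical data structure).
    """
    if part_number not in capability_added:
        raise CapabilityGateError(
            f"{part_number} is not on the capability list; "
            "complete the self-evaluation and add it before starting work."
        )
    return f"Traveler released for {part_number}"

# Example usage:
# print(release_traveler("HLA-1000", {"HLA-1000": date(2020, 3, 15)}))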

Check Your Self-Evaluation Mechanisms

The regulations permit the repair station to establish many of the details of the capability list process in the repair station manual, like the self-evaluation process. Check the process defined in the repair station manual for two different dimensions. First, is the process sufficient to establish that the repair station is fully capable of performing the work?

The second check has to be for ease of use and compliance. Are there unnecessary steps in the process? For example, I have seen facilities that completed a checklist for the self-evaluation, and then added the capability to a database that was not the capability list. As a third step, the article would be added to the capability list. The database was the resource that the staff would check for a capability, but the separate capability list was the resource that was shared with the FAA to describe the repair station’s capabilities. If a disconnect happened between adding a capability to the database and adding it to the capability list, then this could create a non-compliance for the company. A much better process would be to define the database of capabilities as the capability list. This eliminates an unnecessary step that added no value and that created a compliance risk. Remember, the repair station is responsible for creating the processes in the repair station manual, and also for complying with them. It is like being allowed to write the final exam for a class you want to ace. It doesn’t make sense to write a process that you can’t follow, and it doesn’t make sense to add steps that add no value.

Once you are comfortable with the written process, then check the implementation of the process. Is it being followed? Every time? Follow the process from beginning to end with the employees who do this every day. Are they following the process as it is written? If the answer is no, then this could reflect a deeper systemic problem that requires the staff to be refocused on following the manual provisions.

Approval and Reporting

Another problem area is the approval and reporting mechanisms for the capability list. On the approval front, I have seen repair station problems arise because a single signature was missing from the list of departments that were supposed to approve a self-evaluation (according to the repair station’s own procedures). Similarly, I have seen reporting problems where the mechanisms for reporting changes to the FAA were unnecessarily complicated or burdensome.

You should be thinking about how amendments are approved internally – a simple signature from the accountable manager is often best. The accountable manager can review the self-evaluation and then be responsible for ensuring the other internal approvals and acquiescences are obtained before signing the amendment. An accountable manager signature is not a legal requirement – it is merely a recommendation that can be implemented to suit the needs of the repair station. The important thing is that a complicated approval mechanism – especially one that requires complicated or unnecessary record-keeping – can create non-compliance risks that add no safety value to the process, so the process should be simplified.

It is typically not necessary for the FAA to approve changes to the capability list. Nonetheless, I still see repair stations that require FAA approval or response before implementing such a change. The original purpose, when capability lists were added to the regulations, was to take the FAA out of this process, and to make the process of adding a capability less onerous. Adding the FAA back into the process when they don’t need to be in the process defeats the purpose of the rule. If your manual requires FAA approval or response before amending the capability list, then this is something you should be working to change with your Principal Maintenance Inspector.

Finally, make sure you have a realistic reporting mechanism for capability list changes. You are required to report such changes to the FAA on a schedule described in your manual. Normally, this will be reported by submitting the current version of the list (with changes redlined) on a periodic basis. The actual period may depend on your relationship with the Principal Maintenance Inspector, but every six to twelve months seems reasonable. I have seen repair stations that skip a report because there is no change. This makes the reporting process irregular, and irregularity in reporting can lead to a missed report when the report is needed; thus I recommend that reporting be accomplished on schedule even if there are no changes to the capability list. It is wise to align this capability list reporting process with other regular reporting obligations, in order to make the FAA reporting process as regular as possible. Calendaring systems can help ensure that those responsible for reporting remember to do so, and those responsible for management know to look for the completion of this task.

Conclusion

Focus on capability lists has shifted from the earlier priority of using them to ease compliance and facilitate the process of adding new capabilities. The new focus is on ensuring they are completed and used correctly. This focus is leading to repair station certificate actions that can be devastating to a business. Arguing that this is a mere paperwork exercise is an insufficient defense to a capability list enforcement action, even when you can prove that the underlying work was done correctly. Using information technology infrastructure, auditing, and careful review and improvement of procedures can help to maintain your company’s compliance.