Fault Tree Analysis Starter Kit: Part 2

Gulroz Singh
4 min read · Mar 12, 2023


Photo by Ben Griffiths on Unsplash

In Part 1 of the FTA starter kit, we introduced Fault Tree Analysis and some of the basics of constructing a fault tree diagram. But Fault Tree Analysis is a lot more than drawing a fault tree diagram. It involves system understanding, system analysis, and probabilistic evaluation, among other important steps.

Here are the basic steps that any analyst can use as a checklist to perform a complete Fault Tree Analysis:

  1. Define the System and its boundaries: Having a clear understanding of the scope of the analysis is the first important step. Ask yourself questions like: a) What exactly am I analyzing? b) Where does the system under analysis start and end? c) What are the main functions of the system under analysis? d) What are the limitations of the system?
  2. Define the Top-Level Event: Clearly understanding the top-level undesired event is a prerequisite to a correct fault tree analysis.
  3. Construct a Fault Tree: Follow the fault tree diagram construction steps defined in Part 1 of the starter kit.
  4. Validate the Fault Tree: Ensure the accuracy and completeness of the FT logic. A standard set of design rules can be applied as a checklist to validate FTs, and automation may help here.
  5. Evaluate the Fault Tree: Generate cut sets and, optionally, probabilities from the FT to identify the weak parts of the design and the related safety issues.
  6. Propagate insights to the requirements/system model: An analyst may gain additional system insights in the form of requirements, failure modes, safety mechanisms, etc. These can be documented and propagated systematically back to the system definition through a defined systems engineering process.
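As a rough illustration of how step 4 could be automated, here is a minimal sketch in Python. The tree encoding (a dictionary mapping each gate to its type and inputs) and the three rules checked here are assumptions made for this example, not a standard rule set: every input must be defined, every gate must have at least two inputs, and the tree must not contain a cycle.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical encoding: gate name -> (gate type, list of input names).
Tree = Dict[str, Tuple[str, List[str]]]


def validate_fault_tree(tree: Tree, top: str, basic_events: Set[str]) -> List[str]:
    """Return a list of rule violations; an empty list means all checks passed."""
    problems: List[str] = []
    if top not in tree:
        problems.append(f"top event {top!r} is not defined as a gate")

    for name, (kind, inputs) in tree.items():
        if kind not in ("AND", "OR"):
            problems.append(f"{name}: unknown gate type {kind!r}")
        if len(inputs) < 2:
            problems.append(f"{name}: gate has fewer than two inputs")
        for child in inputs:
            if child not in tree and child not in basic_events:
                problems.append(f"{name}: input {child!r} is undefined")

    # Depth-first walk from the top event to detect cycles.
    visiting: Set[str] = set()
    done: Set[str] = set()

    def visit(node: str) -> None:
        if node in done or node not in tree:
            return
        if node in visiting:
            problems.append(f"cycle detected through {node!r}")
            return
        visiting.add(node)
        for child in tree[node][1]:
            visit(child)
        visiting.discard(node)
        done.add(node)

    if top in tree:
        visit(top)
    return problems


# Illustrative tree with made-up names: G1 has a single input and X2 is undefined.
example = {"TOP": ("OR", ["G1", "X1"]), "G1": ("AND", ["X2"])}
for problem in validate_fault_tree(example, "TOP", {"X1"}):
    print(problem)
```

A real project would normally take its rules from its own FTA guideline; the point is only that such checks are mechanical and easy to run on every revision of the tree.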

Fault Tree Example

Below is the expanded version of the fault tree example that was introduced in Part 1 of the starter kit. We have taken a generic SoC as our system under analysis in this case and we will use this example for diving deeper into some key fault tree concepts.

Figure 1: Expanded generic SoC FTA Example

Find the enlarged left and right subtree figures below:

Figure 1A: Left Subtree
Figure 1B: Right Subtree

Evaluation of the Fault Tree

In figures 1A and 1B above, different failure causes leading to the undesired event (undetected failure of the SoC) are modeled. In the left subtree (figure 1A), a failure in power management in the SoC is protected by a Power Management Unit, which acts as a safety mechanism. That is the reason for using an AND gate: the failure goes undetected only if the fault occurs and the safety mechanism also fails. Similarly, the clock failure is protected by a Clock Monitor. In the right subtree (figure 1B), incorrect code execution in the processing element is not protected, which makes this event a single-point fault, whereas the failure in the interconnect is protected by E2E protection.

We will evaluate this fault tree to determine the results and significance of the analysis. As mentioned before, the evaluation can be qualitative or quantitative in nature. Qualitative evaluation is an inspection and evaluation of the FT cut sets. It is a great way to find where the risk associated with the top-level event can be reduced. Quantitative evaluation is a numerical analysis in which the failure rates of the different events are attached to the event descriptions and used to calculate probabilities at different levels of the tree.
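To make the quantitative side more concrete, here is a small sketch in Python of how probabilities combine through AND and OR gates. It assumes the input events are statistically independent, and it works with probabilities directly rather than deriving them from failure rates. All numbers are invented for illustration and do not come from the SoC example.

```python
from math import prod


def and_gate(probabilities):
    """Output probability of an AND gate: all independent inputs must occur."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output probability of an OR gate: at least one independent input occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical values, chosen only to show the arithmetic.
p_fault = 1e-3             # a fault occurs
p_mechanism_fails = 1e-2   # its safety mechanism fails to detect it
p_unprotected = 1e-4       # a fault with no safety mechanism

p_protected_branch = and_gate([p_fault, p_mechanism_fails])
p_top = or_gate([p_protected_branch, p_unprotected])

print(f"protected branch: {p_protected_branch:.3e}")
print(f"top-level event:  {p_top:.3e}")
```

With these made-up numbers the unprotected fault dominates the top-level probability, which is the kind of weak spot a quantitative evaluation is meant to expose.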

Qualitative evaluation includes generating cut sets and evaluating them from a safety perspective. A cut set is a set of events that together cause the top-level undesired event (system failure) to occur. Each cut set is a unique path to the top-level event.

A minimal cut set is a cut set that cannot be reduced any further: if any one of its events is removed, the remaining events no longer cause the top-level undesired event to occur.
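One simple way to reduce a list of cut sets to the minimal ones is to drop every cut set that strictly contains another cut set. The sketch below shows this in Python; the function name and the event labels A to D are placeholders for this example, and the quadratic comparison is fine for small trees but is not how large commercial tools do it.

```python
def minimal_cut_sets(cut_sets):
    """Keep only the cut sets that do not strictly contain another cut set."""
    unique = {frozenset(cs) for cs in cut_sets}
    minimal = [cs for cs in unique if not any(other < cs for other in unique)]
    # Sort for a stable, readable order: smallest cut sets first.
    return sorted((sorted(cs) for cs in minimal), key=lambda cs: (len(cs), cs))


# {"A", "B"} is not minimal because {"A"} alone is already a cut set.
print(minimal_cut_sets([{"A"}, {"A", "B"}, {"C", "D"}]))
```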

Let’s list the cut sets in the fault tree example above:

  1. E10
  2. E06, E07
  3. E08, E09
  4. E11, E12

As you can see, the FT example has 3 AND gates and 1 OR gate. An AND gate represents a safety mechanism protecting against the fault, whereas an OR gate represents no protection. As the cut sets indicate, the most critical event (fault) in the system is “Incorrect code execution”, represented by E10, which is a single-point fault. It is followed by the other 3 cut sets, which are considered dual-point faults.
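The cut sets listed above can also be derived mechanically. The Python sketch below rebuilds the example from the text alone: one OR gate at the top fed by E10 and three AND gates. The gate names (TOP, G_A, G_B, G_C) are made up for this example, and the exact structure of the figure is an assumption. An OR gate passes on the cut sets of each input, while an AND gate merges one cut set from every input.

```python
from itertools import product

# Assumed structure, inferred from the text (3 AND gates, 1 OR gate).
# Gate names are hypothetical; event IDs are the ones used in the article.
TREE = {
    "TOP": ("OR", ["G_A", "G_B", "E10", "G_C"]),
    "G_A": ("AND", ["E06", "E07"]),
    "G_B": ("AND", ["E08", "E09"]),
    "G_C": ("AND", ["E11", "E12"]),
}


def cut_sets(node, tree):
    """Return the cut sets of `node` as a list of frozensets of basic events."""
    if node not in tree:              # anything that is not a gate is a basic event
        return [frozenset([node])]
    kind, inputs = tree[node]
    child_sets = [cut_sets(child, tree) for child in inputs]
    if kind == "OR":                  # any single input is enough
        return [cs for sets in child_sets for cs in sets]
    # AND: combine one cut set from each input
    return [frozenset().union(*combo) for combo in product(*child_sets)]


result = sorted((sorted(cs) for cs in cut_sets("TOP", TREE)),
                key=lambda cs: (len(cs), cs))
for cs in result:
    label = "single-point" if len(cs) == 1 else f"{len(cs)}-point"
    print(f"{label}: {', '.join(cs)}")
```

Sorting by size puts the single-point fault first, which matches the order in which an analyst would normally review the results. For trees with repeated events, the output would still need to be reduced to minimal cut sets.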

Additional safety analyses such as Common Cause Analysis can be performed on the multi-point cut sets to further analyze the system. Common Cause Analysis allows the analyst to check whether all the events in a multi-point cut set could be triggered by a single cause, such as a common clock/power failure, environmental factors, EMI, corrosion, aging, etc.
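A first screening pass for common causes could look like the Python sketch below. It flags every multi-point cut set whose events all depend on the same resource. The dependency table is entirely hypothetical: the resource names and the mapping of events to resources are invented to show the mechanics and are not taken from the SoC example.

```python
# Hypothetical mapping of basic events to the shared resources they depend on.
DEPENDENCIES = {
    "E06": {"core_supply"},
    "E07": {"core_supply", "main_clock"},
    "E08": {"main_clock"},
    "E09": {"aux_clock"},
    "E11": {"core_supply"},
    "E12": {"io_supply"},
}

# Cut sets as listed in the article.
CUT_SETS = [("E10",), ("E06", "E07"), ("E08", "E09"), ("E11", "E12")]


def common_cause_candidates(cut_sets, dependencies):
    """Map each multi-point cut set to the resources shared by all of its events."""
    findings = {}
    for cs in cut_sets:
        if len(cs) < 2:
            continue  # single-point faults need no common cause to reach the top
        shared = set.intersection(*(set(dependencies.get(e, ())) for e in cs))
        if shared:
            findings[tuple(sorted(cs))] = shared
    return findings


for events, shared in common_cause_candidates(CUT_SETS, DEPENDENCIES).items():
    print(f"{' + '.join(events)} share: {', '.join(sorted(shared))}")
```

A flagged cut set is only a candidate. The analyst still has to judge whether the shared resource can really defeat both the function and its safety mechanism at the same time.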

To be continued…

