Summary & Conclusions
System-level FMEA that is highly focused on software behavior (defined as SS-FMEA) often fails to identify an effective failure mode, which may lead to a serious accident. The reason for this is the systemic rather than the physical nature of failures. The study has modeled insights derived from the best practices of the space system’s SS-FMEA and created a framework to assist the novice in identifying the high-impact failure mode.
Based on the analysis of three space system projects, it has been determined that important failure modes in SS-FMEA which contributed to improving the software designs can be simplified to the following model.
Important failure modes always involve a set of environmental elements that cannot be controlled or maintained by the target system, together with more than one module representing an abstracted capability of that system. A module is viewed as a set of the input, the function, its output, and the inner state change from pre-condition to post-condition resulting from the function. The state change at the first module, Module_a, affects the behavior of the second module, Module_b. Therefore, most failure modes of a systemic nature result from a violation of the expected state change that logically connects the modules. Because the violation of the expected state change is system-specific, there are only a few common principles for identifying systemic failure modes. That is why knowledge of the causes of a module’s state change violation is often acquired through experience, and the performance of SS-FMEA largely depends on the engineer’s competence and expertise. However, since it is impossible to assign only expert engineers to SS-FMEA, a method is needed that enables a novice to perform effective SS-FMEA.
In this paper, we assume an SS-FMEA performed by a novice for a novel product such as a space system. In this SS-FMEA, the novice needs to determine the elements on which the analysis should focus, yet there is no single way to identify them (Issue #1). Then, despite limited expertise, the novice needs to combine the elements of the failure mode, such as the relation with the environmental conditions under use, the function of the system, and how they deviate (Issue #2).
To solve these issues, we modeled insights derived from our best practice of SS-FMEA. As a result, frameworks have been created to assist the novice in defining failure modes for SS-FMEA. The basic idea behind the frameworks is the restriction of the elements of focus and the gradual expansion of focus. This idea is based on the logic which guides expert engineers as they perform SS-FMEA. A repetition of this restriction and focus expansion allows for excluding some worthless elements, leaving only the core ones for the failure mode. For instance, the framework guided the novice to consider only the deviation of the module input (Solution #1). The need to concentrate only on how the input for software affects the state change simplifies the analysis of the potential failure mode for the novice. The next step in the framework involves the expansion of information about the module and further combination of the elements using a two-by-two matrix (Solution #2). The results of the trial obtained during the space system development by the novice showed the precise determination of the elements of the failure mode in SS-FMEA.
Introduction
Predicting potential failure modes and reducing the failure rate or effects of a system before operational use is important because it helps to avoid serious accidents. For instance, the mishap of the ASTRO-H satellite resulted from a chain of events triggered by a failure of the attitude control software after maneuvers. It might have been prevented by a thorough analysis of possible failure scenarios associated with this software. System-level FMEA can be applied to analyze potential failure modes that may lead to accidents, as well as to identify weaknesses in the design. A system-level FMEA that is highly focused on software behavior is hereafter defined as a System-Level Software FMEA (SS-FMEA).
A successful SS-FMEA activity should thus yield possible actions to reduce the failure or its effects in the subsequent development phase. In hardware FMEA, an effective result can still be expected because failure modes stem from random failures that can be predicted with well-established electromechanical knowledge or stochastic distributions. In SS-FMEA, however, it is considerably more difficult to standardize the identification of failure modes due to the systemic nature of software failures as opposed to random hardware failures. This is because the role of software and the cause of the failure are strongly associated with the system features, including the operating environment and human factors. The failure mode for SS-FMEA is thus represented as a violation of logical relations in a combination of elements that are specific to each system. This specificity explains the lack of principles that could be used for failure mode identification in SS-FMEA.
Currently, there are only a few experienced engineers in space system development who determine implicit failure mode of systemic nature in SS-FMEA and identify the high impact potential failure. This is because the knowledge of the elements which cause the violation of logical relations is often acquired through experience. However, the impossibility of assigning only expert engineers to perform SS-FMEA highlights the importance of elaborating a method that would enable a novice to run effective analysis.
This paper considers SS-FMEA performed by a novice for a non-mass-produced product such as a space system. The first difficulty is that the novice must identify the elements on which the analysis will focus while balancing the various perspectives that could be relied on (Issue #1). The second difficulty arises as the novice needs to combine the elements of the failure mode, which include the relationship with the environmental conditions under use, the function of the system, and how they deviate (Issue #2). To solve these issues, we modeled insights derived from our best practice of SS-FMEA, together with the logic of the expert engineer, and incorporated them into frameworks that can assist the novice in defining failure modes for SS-FMEA. By using these frameworks, the novice is expected to obtain a result that is close to that obtained by an experienced engineer.
Related Work
Generally, FMEA has a limited scope in that only a single item (function, box, or component) is typically analyzed at a time, so failures happening in a combination cannot be detected. However, in real development, some FMEA may analyze software-intensive systems contextually, which allows for identifying the failure mode occurring in a combination. In most cases, these FMEA are performed by expert engineers who successfully identify important potential failures that are crucial for the subsequent development phase. In the given paper, such FMEA are referred to as SS-FMEA.
To assist the novice in FMEA, an idea of guide words for setting a proper failure mode was proposed by various studies. Common guide words for low-level software failure mode may include infinite loop, numerical overflow, and memory leak. Also, it is expected that the use of common failure scenarios for SS-FMEA will yield satisfactory results when applied to a serial product that has few differences in the contexts. Those common failure scenarios considered various contexts for the specific domain and underwent several improvements. Considering the specificity of contexts that most of the space systems feature, common guide words or scenarios cannot be used for defining failure modes. Reflecting upon the experience of high-impact failures in the space systems mentioned before, it can be stated that the single application of the guide words is insufficient and needs to be combined with other methods.
Modeling Best Practices of SS-FMEA
It is assumed that FMEA can consider only one element, though, in reality, there is SS-FMEA that contextually considers the failure mode happening in a combination. In this paper, we show the result of the analysis of the effective SS-FMEA which took into account several elements and identified the high impact systemic failure.
Analysis of Effective SS-FMEA Performance
The analysis of three space system projects was performed, and it has been determined that important failure modes in SS-FMEA, which are helpful in improving the software design, can be described as the following model.
As shown in equation (1), important failure modes are represented by environmental elements that cannot be controlled or sustained by the target system. Moreover, there is always more than one module accounting for an abstracted capability in the target system. A module consists of the input, the function, its output, and the inner state change from pre-condition to post-condition, which depends on the function. The inner state change represents the partial behavior of the system. The state change at the first module, which is Module_a from equation (2), impacts the behavior of the second module, Module_b from equation (3). If the environment is stable, no unexpected behavior in the modules will occur. However, if the environment deviates from the expectation, the state change may be inhibited, and a failure will occur.
A good example is a system that always deletes old files before starting an observation in order to save the new observation data. Assume that an urgent observation plan arises while observation data are being downlinked; this is the environment that may affect the state change of the module. Here, the behavior that deletes old files is Module_a, and the behavior that starts the observation is Module_b. In this case, Module_a cannot delete files because its function was not designed to delete files during data downlinks. As a result, Module_b is not able to start the urgent observation. The failure mode is then an unexpected pre-state for Module_b, in which the command upload preparation completion time conflicts with the currently visible time. This failure mode takes several elements into account and assumes that a failure of the systematic logic might lead to mission failure when the observation chance is once-in-a-lifetime.
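The observation example can be illustrated with a minimal Python sketch. It is not the authors' formal model: the names `Module`, `run_chain`, `storage_free`, and `downlink_in_progress` are hypothetical and introduced here only for illustration. It assumes that each module carries a pre-condition and a function that changes the inner state depending on the environment.

```python
from dataclasses import dataclass
from typing import Callable, Dict

State = Dict[str, bool]  # hypothetical flat representation of inner state


@dataclass
class Module:
    """A module: a pre-condition plus a function that changes the inner state."""
    name: str
    precondition: Callable[[State], bool]
    function: Callable[[State, State], State]  # (state, environment) -> new state


def run_chain(modules, state: State, environment: State):
    """Run modules in order; report the first module whose pre-condition is violated."""
    for module in modules:
        if not module.precondition(state):
            return state, f"{module.name}: pre-condition violated"
        state = module.function(state, environment)
    return state, None


def delete_old_files(state: State, env: State) -> State:
    # Assumed design limit from the example: no deletion during a data downlink.
    if env["downlink_in_progress"]:
        return state  # the expected state change does not happen
    return {**state, "storage_free": True}


def start_observation(state: State, env: State) -> State:
    return {**state, "observing": True}


module_a = Module("Module_a", lambda s: True, delete_old_files)
module_b = Module("Module_b", lambda s: s["storage_free"], start_observation)

initial = {"storage_free": False, "observing": False}
nominal_state, nominal_failure = run_chain(
    [module_a, module_b], initial, {"downlink_in_progress": False})
urgent_state, urgent_failure = run_chain(
    [module_a, module_b], initial, {"downlink_in_progress": True})
```

In the nominal environment both state changes occur. In the urgent-observation environment, Module_a leaves the state unchanged, so the pre-condition of Module_b is violated. This is the kind of broken logical connection between modules that the model describes.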
Problems of High Impact Failure Modes for Novices
Based on the observation described in section 3.1, a failure mode of a systemic nature can be characterized as a violation of the expected state change in a module. Understanding of the causes that lead to the violation of the state change usually comes with experience. Therefore, the outcomes of SS-FMEA are inextricably linked with the engineer’s expertise. We assume an SS-FMEA performed by a novice for a novel product such as a space system. In this SS-FMEA, the novice needs to determine the elements on which the analysis will focus, considering different ways of making that choice (Issue #1). After that, the novice has to combine the elements of the failure mode to establish a meaningful relationship between the environment and several modules, as well as how they deviate (Issue #2).
Framework to Assist the Novice
To solve the above-mentioned issues, we modeled insights derived from our best practice of SS-FMEA and created frameworks to assist the novice in defining failure mode for analysis. The underlying idea was the need for the framework to repeat the restriction of elements on which the novice should focus and the gradual expansion of that focus. The same algorithm is utilized by expert engineers as they perform SS-FMEA. A repetition of this restriction and the expansion of focus provide for the critical consideration of elements for the failure mode, so only core ones remain. In this paper, we introduce two frameworks, one is for the restriction and the other is for the spread of the ideas. These frameworks do not ensure exhaustive analysis and may be used as a training tool to perform SS-FMEA.
Restricting Information for the Failure Mode Analysis
The novice needs to choose the elements on which the analysis should focus out of a multitude of choices (Issue #1). Most novices have difficulty in deriving failure modes due to the large number of elements that need to be considered at once. Therefore, a framework that enables the novice to consider only one deviation at a time is proposed (Solution #1). Figure 1 shows an example of the restriction framework, which focuses only on the input deviation. The framework consists of the elements represented in equations (1) – (3).
To use this framework, the novice should fill in the columns from left to right. The order of the framework sections was set in accordance with the logic on which an expert engineer relies when performing FMEA. First, the novice recognizes the critical situations that may affect the mission. Second, the inputs that may deviate in that situation should be found. This restricts the elements that are to be considered. After the input that may deviate has been identified, the corresponding function and the effects for the chosen elements need to be determined. By combining all the information about the environment and modules, valuable insights into the nature of the possible failure may be gained.
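The left-to-right filling order can be sketched as a small worksheet structure in Python. This is an illustrative sketch only: the column names `critical_situation`, `input_deviation`, `function`, and `effect` are assumptions derived from the steps described above, not the actual column headers of Figure 1.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class RestrictionRow:
    # Column order mirrors the left-to-right order described in the text.
    critical_situation: Optional[str] = None
    input_deviation: Optional[str] = None
    function: Optional[str] = None
    effect: Optional[str] = None

    def next_column(self) -> Optional[str]:
        """Name of the first unfilled column, or None when the row is complete."""
        for f in fields(self):
            if getattr(self, f.name) is None:
                return f.name
        return None

    def fill(self, column: str, value: str) -> None:
        """Fill a column, refusing to skip ahead of the left-to-right order."""
        expected = self.next_column()
        if column != expected:
            raise ValueError(f"fill '{expected}' before '{column}'")
        setattr(self, column, value)


row = RestrictionRow()
row.fill("critical_situation", "urgent observation requested during data downlink")
row.fill("input_deviation", "observation command arrives while downlink is in progress")
row.fill("function", "delete old files before observation")
row.fill("effect", "observation cannot start")
```

Enforcing the order in `fill` reflects the restriction idea: the novice commits to one situation and one input deviation before thinking about functions and effects.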
Identification of Combinations of Elements
After the elements that need to be considered in the analysis have been identified, their possible combinations need to be assessed to determine the high-impact failure mode. The potential failure identified in section 4.1 is just a component of the possible failure mode. A real-world failure is more complex, as the elements affecting the occurrence of the accident may be arranged in all possible combinations. Therefore, the novice needs to combine the elements of the failure mode, despite the limited expertise (Issue #2). The proposed framework guides the novice to expand the information from section 4.1 and encourages the novice to combine the elements (Solution #2). Figure 2 shows an example of the framework, which considers various combinations of elements in terms of “time,” “amount,” and “property” related to the input. Figure 3 is an example of the two-by-two matrix in which the elements are combined in terms of time and amount. To use the framework, the novice should select the element that is likely to cause the failure identified in the previous section.
The element presented in section 4.1 gives a limited understanding of the failure, so the novice should use this element to expand the idea of the potential failure. As an example, distinctive features of the preparation completion time are presented at the top of Figure 2. Using these features, variations of possible failure modes were predicted.
Figure 1 illustrates only one idea of the failure mode, whereas the second framework enables the identification of five failure modes, as shown in Figure 2. Insights into the failure mode should then be combined by using a two-by-two matrix. Figure 3 shows just one example of the matrix, which combines the failure mode in terms of time and amount. Even though not all elements can be combined, the combination of some of them gives a clearer picture of the possible failure. The failure mode that is analyzed using such matrices is quite close to a real-world accident and can be modeled even by a person with limited expertise.
SS-FMEA Performed by the Novice Using the Proposed Framework
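The combination step with a two-by-two matrix can be sketched in Python as follows. The deviation labels and the function name `two_by_two` are hypothetical and chosen only for illustration; the actual entries of Figures 2 and 3 are not reproduced here.

```python
from itertools import product

# Hypothetical deviation labels for the two axes "time" and "amount".
time_axis = ("earlier than expected", "later than expected")
amount_axis = ("fewer than expected", "more than expected")


def two_by_two(rows, cols, subject):
    """Map each (row, col) cell to a candidate combined failure-mode prompt."""
    if len(rows) != 2 or len(cols) != 2:
        raise ValueError("a two-by-two matrix needs exactly two values per axis")
    return {(r, c): f"{subject}: {r} and {c}" for r, c in product(rows, cols)}


matrix = two_by_two(time_axis, amount_axis, "command upload preparation")
for cell, prompt in matrix.items():
    print(cell, "->", prompt)
```

Each cell is only a prompt for the analyst. As noted above, not all elements can be combined, so some cells may be discarded after review.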
A trial of the SS-FMEA proposed in section 4 was conducted. Spacecraft system development projects in the detailed design phase were selected for the SS-FMEA. A novice and an expert were selected to run SS-FMEA with and without the framework.
Table 1 presents the results of the trial by comparing the number of failure modes and possible failures identified by the novice using the framework and by the expert without the framework. The analysis targeted a single function in the software system. Although the number of failure modes and possible failures identified by the novice is smaller than that identified by the expert, the novice performed well. Without the framework, however, the novice could not detect any possible failure, as shown in Figure 4. This emphasizes the feasibility of relying on the logic that guides expert engineers as they perform FMEA and of integrating the repetition of the restriction and gradual expansion of the focus to ensure identification of the core elements of the failure mode.
Table 1 – Comparison of the Amount of the Failure Modes and Possible Failures
Figure 4 shows the percentage of elements which were identified without using the framework. The upper and the lower bar charts show the percentage of the elements detected by the expert and the novice correspondingly. It can be seen that the novice has limited expertise to conduct SS-FMEA, especially due to poor knowledge of the operational environment. The failure which has a systemic nature needs to be put in the context of the environment, so even though the failure mode was identified from the analysis, there were only a few meaningful failure modes and no possible failures at all identified by the novice.
Figure 5 shows that, when relying on the framework, the novice identified percentages of elements quite similar to those identified by the expert. Thus, the framework guided the novice toward the knowledge of the operational environment that otherwise comes with experience of real failures. Interestingly, the number of possible failures was two times greater than the number of failure modes for the novice and four times greater for the expert. At present, the proposed framework only suggests combining ideas about the failure mode. The expert, by contrast, appears to go further and combine several elements, such as knowledge of the environment or modules, and to consider various patterns of possible failure, even if there is only a single idea of the failure mode.
Current frameworks do not support the combination of elements in one failure mode. To train the novice to think like the expert, the framework can be improved to support the other combinations. As has been mentioned before, the framework is designed as a training tool for the novice and does not provide for exhaustive analysis to perform SS-FMEA during the real development process. The procedure of identifying the failure mode using this framework would be too time-consuming during the development phase due to the vast analysis. To put this framework into practical use, some improvements need to be made.
Conclusion
This paper considered system-level FMEA that is highly focused on software behavior (SS-FMEA), performed by a novice for a novel product such as a space system. The difficulty in identifying an effective failure mode using SS-FMEA stems from the systemic nature of failures as opposed to random hardware failures, and a missed failure mode may lead to a serious accident. Most failure modes of a systemic nature result from a violation of the expected state change that establishes a logical relation between modules.
Due to the specificity of the expected state change for each system, there is no common principle to guide systemic failure mode identification. That is why knowledge of factors contributing to the violation of the module state change usually comes with experience, and the novice often faces difficulties identifying the high impact failure mode. The study modeled insights derived from best practices of the space system’s SS-FMEA and created a framework to help the novice identify the high impact failure mode. The trial has shown that the proposed framework supports the novice. The framework has been designed solely for training and not for the usage in real development. To make the framework suitable for practical use, some improvements should be considered.
References
Japan Aerospace Exploration Agency, Hitomi Experience Report: Investigation of Anomalies Affecting the X-ray Astronomy Satellite “Hitomi” (ASTRO-H), 2016. Web.
M. Jones, K. Fretz, S. Kubota, C. A. Smith, “The Use of the Expanded FMEA in Spacecraft Fault Management,” RAMS, 2018.
B. Huang, H. Zhang, M. Lu, “Software FMEA approach based on failure modes database,” in 2009 8th International Conference on Reliability, Maintainability and Safety, IEEE, 2009.
H. H. Kim, “SW FMEA for ISO-26262 software development,” in 2014 21st Asia-Pacific Software Engineering Conference, vol. 2, 2014, pp. 19–22.