Application of SRE to Ultrareliable Systems - The Space Shuttle
by Dr. Norman F. Schneidewind
Introduction
The Space Shuttle avionics software represents a successful integration of many of the computer industry's most advanced software engineering practices and approaches. Beginning in the late 1970s, this software development and maintenance project has evolved one of the world's most mature software processes, applying the principles of Level 5 of the Carnegie Mellon University Software Engineering Institute's Capability Maturity Model. This article explores the successful use of extremely detailed fault and failure history, throughout the software life cycle, in the application of software reliability engineering techniques to gain insight into the flight-worthiness of the software.
Using the Shuttle application, we show how Software Reliability Engineering (SRE) can be applied to: interpret software reliability predictions, support verification and validation of the software, assess the risk of deploying the software, and predict the reliability of the software. Predictions are currently used by the software developer to add confidence to the reliability assessments of the Primary Avionics Shuttle Software (PASS) achieved through formal software certification processes.
Interpretation of Software Reliability Predictions
Successful use of statistical modeling in predicting the reliability of a software system requires a thorough understanding of precisely how the resulting predictions are to be interpreted and applied [6]. The PASS (430 KLOC) is frequently modified, at the request of NASA, to add or change capabilities, using a constantly improving process. Each of these successive PASS versions constitutes an upgrade to the preceding software version. Each new version of the PASS (designated as an Operational Increment, OI) contains software code which has been carried forward from each of the previous versions ("previous-version subset") as well as new code generated for that new version ("new-version subset"). We have found that by applying a reliability model independently to the code subsets according to the following rules, we can obtain satisfactory composite predictions for the total version:
(1) all new code developed for a particular version uses the same development process.
(2) all code introduced for the first time for a particular version is considered to have the same life and operational execution history.
(3) once new code is added to reflect new functionality in the PASS, this code is only changed thereafter to correct faults.
Estimating Execution Time
We estimate execution time of segments of the PASS software by analyzing records of test cases in digital simulations of operational flight scenarios as well as records of actual use in Shuttle operations. Test case executions are only counted as "operational execution time" for previous-version subsets of the version being tested if the simulation fidelity very closely matches actual operational conditions. Pre-release test execution time for the new code actually being tested in a version is never counted as operational execution time. We use the failure history and operational execution time history for the new-code subset of each version to generate an individual reliability prediction for that new code in each version by separate applications of the reliability model. This approach places every line of code in the total PASS into one of the subsets of "newly" developed code, whether "new" for the original version or any subsequent version. We then represent the total reliability of the entire software system as that of a composite system of separate components ("new-code subsets"), each having an individual execution history and reliability, connected in series. The developer uses this approach to apply the Schneidewind Model [5, 6] as a means of predicting a "conservative lower bound" for the PASS reliability. This prediction is important because the user can be confident that it is highly likely that the software reliability would be no worse than this bound in operation.
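The series composition described above can be sketched in code. The following Python fragment is an illustrative simplification that assumes each new-code subset has a constant failure rate over operational execution time; the actual PASS predictions come from the Schneidewind Model applied to each subset's failure history, not from this exponential shortcut, and the function names are mine.

```python
import math

def subset_reliability(failure_rate: float, t: float) -> float:
    """Reliability of one new-code subset over execution time t,
    assuming (for illustration only) a constant failure rate."""
    return math.exp(-failure_rate * t)

def composite_reliability(failure_rates, t: float) -> float:
    """Series combination: the total system survives only if every
    new-code subset survives, so subset reliabilities multiply."""
    r = 1.0
    for lam in failure_rates:
        r *= subset_reliability(lam, t)
    return r

# Hypothetical per-subset failure rates (per hour) for three OIs,
# evaluated over an 8-day (192-hour) mission:
print(composite_reliability([1e-4, 5e-5, 2e-5], 192.0))
```

Because the subsets are in series, the composite reliability can never exceed that of the weakest subset, which is one reason the approach yields a conservative bound.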
Verification and Validation
Software reliability measurement and prediction are useful approaches to verify and validate software. Measurement refers to collecting and analyzing data about the observed reliability of software, for example, the occurrence of failures during test. Prediction refers to using a model to forecast future software reliability, for example, failure rate during operation. Measurement also provides the failure data that is used to estimate the parameters of reliability models (i.e., make the best fit of the model to the observed failure data). Once the parameters have been estimated, the model is used to predict the future reliability of the software. Verification ensures that the software product, as it exists in a given project phase, satisfies the conditions imposed in the preceding phase (e.g., reliability measurements of ultrareliable systems software components obtained during test conform to reliability specifications made during design) [2]. Validation ensures that the software product, as it exists in a given project phase, which could be the end of the project, satisfies requirements (e.g., software reliability predictions obtained during test correspond to the reliability specified in the requirements) [2]. Another way to interpret verification and validation is that it builds confidence that software is ready to be released for operational use. The release decision is crucial for systems in which software failures could endanger the safety of the mission and crew (i.e., ultrareliable systems software). To assist in making an informed decision, we integrate software risk analysis and reliability prediction.
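The measurement-then-prediction cycle described above can be illustrated with a toy parameter fit. The sketch below fits a decaying-exponential curve to per-interval failure counts by log-linear least squares and then extrapolates remaining failures; this is a deliberately simplified stand-in for the maximum-likelihood estimation used by the Schneidewind Model, and the data are invented.

```python
import math

def fit_exponential_decay(counts):
    """Least-squares fit of ln(f_i) = ln(a) - b*i to observed
    per-interval failure counts (all counts must be positive).
    Returns (a, b) for the fitted curve f(i) = a * exp(-b * i)."""
    xs = list(range(1, len(counts) + 1))
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope

def predicted_remaining(a: float, b: float, t: float) -> float:
    """Expected failures still to come after time t under the fitted
    curve: the integral of a*exp(-b*s) from t to infinity."""
    return (a / b) * math.exp(-b * t)

# Invented failure counts over five test intervals:
a, b = fit_exponential_decay([6.1, 3.7, 2.2, 1.3, 0.8])
print(predicted_remaining(a, b, 5.0))
```

Once the parameters are estimated from the measured failure data, the same fitted curve supplies the forward-looking predictions used in verification and validation.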
Risk Assessment
Safety risk pertains to executing the software of an ultrareliable system where there is the chance of injury (e.g., astronaut injury or fatality), damage (e.g., destruction of the Shuttle), or loss (e.g., loss of the mission) if a serious software failure occurs during a mission. In the case of the PASS, where the occurrence of even trivial failures is extremely rare, the fraction of those failures that pose any safety or mission risk is too small to be statistically significant. As a result, for risk assessment to be feasible, all failures (of any severity) over the entire 20-year life of the project have been included in the failure history database for this analysis. Therefore, the risk criterion metrics to be discussed for the Shuttle quantify the degree of risk associated with the occurrence of any software failure, no matter how insignificant it may be. This approach can be applied to assessing safety risk where sufficient data exist.
The prediction methodology [3] provides bounds on total test time, remaining failures, and time to next failure that are necessary to perform the risk assessment. Two criteria for software reliability levels are defined. Then these criteria are applied to the risk analysis of ultrareliable systems software, using the PASS as an example. In the case of the Shuttle example, the "risk" represents the degree to which the occurrence of failures does not meet required reliability levels, regardless of how insignificant the failures may be. Next, selected prediction equations that are used in reliability prediction and risk analysis are defined and derived.
Criteria for Reliability
If the reliability goal is the reduction of failures of a specified severity to an acceptable level of risk [4], then for software to be ready to deploy, after having been tested for total time tt, it must satisfy the following criteria:
1) predicted remaining failures r(tt) < rc, (1) where rc is a specified critical value, and
2) predicted time to next failure TF(tt) > tm, (2) where tm is the mission duration.
For systems that are tested and operated continuously, like the Shuttle, tt, TF(tt), and tm are measured in execution time. Note that, as with any methodology for assuring software reliability, there is no guarantee that the expected level will be achieved. Rather, with these criteria, the objective is to reduce the risk of deploying the software to a "desired" level.
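The two criteria reduce to a simple readiness check on the model's predictions. A minimal Python sketch (the function name is mine, not from the paper):

```python
def ready_to_deploy(r_tt: float, r_c: float,
                    tf_tt: float, t_m: float) -> bool:
    """Deployment readiness per criteria (1) and (2):
    predicted remaining failures r(tt) must be below the critical
    value rc, AND predicted time to next failure TF(tt) must exceed
    the mission duration tm (all times in execution time)."""
    criterion_1 = r_tt < r_c    # equation (1)
    criterion_2 = tf_tt > t_m   # equation (2)
    return criterion_1 and criterion_2

# Hypothetical predictions: 0.6 remaining failures against rc = 1,
# 20 days to next failure against an 8-day mission:
print(ready_to_deploy(0.6, 1.0, 20.0, 8.0))
```

Both conditions must hold; satisfying either one alone is not sufficient for the mission to begin.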
Remaining Failures Criterion
Using the assumption that the faults that cause failures are removed (this is the case for the Shuttle), criterion 1 specifies that the residual failures and faults must be reduced to a level where the risk of operating the software is acceptable. As a practical matter, rc = 1 is suggested. That is, the goal is to reduce the expected remaining failures of a specified severity to less than one before deploying the software. The assumption behind this choice is that one or more remaining failures would constitute an undesirable risk of failures of the specified severity. Thus, one way to specify rc is by failure severity level, as classified by the developer (e.g., include only life-threatening failures). Another way, which imposes a more demanding criterion, is to specify that rc represents all severity levels, as in the Shuttle example. For example, r(tt) < 1 would mean that r(tt) must be less than one failure, independent of severity level.
If r(tt) ≥ rc is predicted, testing would continue for a total time tt′ > tt that is predicted to achieve r(tt′) < rc, using the assumption that more failures will be experienced and more faults will be corrected, so that the remaining failures will be reduced by the quantity r(tt) − r(tt′). If the developer does not have the resources to satisfy the criterion, or is unable to satisfy it through additional testing, the risk of deploying the software prematurely should be assessed (see the next section). It is known that it is impossible to demonstrate the absence of faults [1]; however, the risk of failures occurring can be reduced to an acceptable level, as represented by rc. This scenario is shown in Figure 1. In case A, r(tt) < rc is predicted and the mission begins at tt. In case B, r(tt) ≥ rc is predicted and the mission would be postponed until the software is tested for total time tt′, when r(tt′) < rc is predicted. In both cases, criterion 2 must also be satisfied for the mission to begin.
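The additional test time needed to drive remaining failures below the critical value can be computed in closed form if one assumes a particular shape for the remaining-failures curve. The sketch below uses an illustrative exponential form r(t) = (a/b)·exp(−b·t), which is not necessarily the developer's exact model; the parameters and function name are assumptions for demonstration.

```python
import math

def required_test_time(a: float, b: float, r_c: float) -> float:
    """Smallest total test time t with r(t) < r_c, assuming the
    remaining-failures curve r(t) = (a / b) * exp(-b * t).
    Solving (a/b)*exp(-b*t) = r_c for t gives t = ln(a/(b*r_c))/b."""
    return math.log(a / (b * r_c)) / b

# With illustrative parameters a = 10 failures, b = 0.5 per interval,
# and a critical value rc = 1 remaining failure:
print(required_test_time(10.0, 0.5, 1.0))
```

In the article's notation, if this value exceeds the test time tt already accumulated, it plays the role of tt′, the extended total test time at which criterion 1 is predicted to be satisfied.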
Time to Next Failure Criterion
Criterion 2 specifies that the software must survive for a time greater than the duration of the mission. If TF(tt) ≤ tm is predicted, the software is tested for a total time tt″ > tt that is predicted to achieve TF(tt″) > tm, using the assumption that more failures will be experienced and faults corrected, so that the time to next failure will be increased by the quantity TF(tt″) − TF(tt). Again, if it is infeasible for the developer to satisfy the criterion, for lack of resources or failure to achieve test objectives, the risk of deploying the software prematurely should be assessed. This scenario is shown in Figure 2. In case A, TF(tt) > tm is predicted and the mission begins at tt. In case B, TF(tt) ≤ tm is predicted, and the mission would be postponed until the software is tested for total time tt″, when TF(tt″) > tm is predicted. In both cases, criterion 1 must also be satisfied for the mission to begin. If neither criterion is satisfied, the software is tested for a time which is the greater of tt′ or tt″.
Remaining Failures Metric
The mean value of the risk criterion metric (RCM) for criterion 1 is formulated as follows:
RCM r(tt) = (r(tt) − rc) / rc = (r(tt) / rc) − 1 (3)
Equation (3) is plotted in Figure 3 as a function of tt for rc = 1, where positive, zero, and negative values correspond to r(tt) > rc, r(tt) = rc, and r(tt) < rc, respectively. In Figure 3, these values correspond to the following regions: CRITICAL (i.e., above the X-axis, predicted remaining failures are greater than the specified value); NEUTRAL (i.e., on the X-axis, predicted remaining failures are equal to the specified value); and DESIRED (i.e., below the X-axis, predicted remaining failures are less than the specified value, which could represent a "safe" threshold or, in the Shuttle example, an "error-free" condition boundary). This graph is for the Shuttle Operational Increment OID (with many years of operation): a software system composed of modules and configured from a series of builds to meet Shuttle mission functional requirements. In this example, it can be seen that at approximately tt = 57 the risk transitions from the CRITICAL region to the DESIRED region.
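Equation (3) and its region interpretation translate directly into code. A small Python sketch (function names are mine):

```python
def rcm_remaining(r_tt: float, r_c: float) -> float:
    """Risk criterion metric for criterion 1, equation (3):
    RCM = (r(tt) - rc) / rc = r(tt)/rc - 1."""
    return r_tt / r_c - 1.0

def region(rcm: float) -> str:
    """Map an RCM value to the regions of Figures 3 and 4:
    positive -> CRITICAL, zero -> NEUTRAL, negative -> DESIRED."""
    if rcm > 0:
        return "CRITICAL"
    if rcm < 0:
        return "DESIRED"
    return "NEUTRAL"

# With rc = 1, two remaining failures is critical; half a failure
# is in the desired region:
print(region(rcm_remaining(2.0, 1.0)), region(rcm_remaining(0.5, 1.0)))
```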
Time to Next Failure Metric
Similarly, the mean value of the risk criterion metric (RCM) for criterion 2 is formulated as follows:
RCM TF(tt) = (tm − TF(tt)) / tm = 1 − (TF(tt) / tm) (4)
Equation (4) is plotted in Figure 4 as a function of tt for tm = 8 days (a typical mission duration time for this OI), where positive, zero, and negative risk correspond to TF(tt) < tm, TF(tt) = tm, and TF(tt) > tm, respectively. In Figure 4, these values correspond to the following regions: CRITICAL (i.e., above the X-axis, predicted time to next failure is less than the specified value); NEUTRAL (i.e., on the X-axis, predicted time to next failure is equal to the specified value); and DESIRED (i.e., below the X-axis, predicted time to next failure is greater than the specified value). This graph is for the Shuttle operational increment OIC. In this example, the RCM is in the DESIRED region at all values of tt.
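Equation (4) has the same sign convention as equation (3), so the same region interpretation applies. A matching Python sketch (function name is mine):

```python
def rcm_time_to_next_failure(tf_tt: float, t_m: float) -> float:
    """Risk criterion metric for criterion 2, equation (4):
    RCM = (tm - TF(tt)) / tm = 1 - TF(tt)/tm.
    Positive when the predicted time to next failure falls short
    of the mission duration (CRITICAL region), negative when it
    exceeds it (DESIRED region)."""
    return 1.0 - tf_tt / t_m

# With an 8-day mission: a 4-day predicted time to next failure is
# critical; a 16-day prediction is well into the desired region:
print(rcm_time_to_next_failure(4.0, 8.0), rcm_time_to_next_failure(16.0, 8.0))
```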
Lessons Learned
Several important lessons have been learned from our twenty years of experience in developing and maintaining the PASS; you might consider adopting them in your own SRE process:
1) No one SRE process method is the "silver bullet" for achieving high reliability. Various methods, including formal inspections, failure modes analysis, verification and validation, testing, statistical process management, risk analysis, and reliability modeling and prediction must be integrated and applied.
2) The process must be continually improved and upgraded. For example, experiments with software metrics have demonstrated the potential of using metrics as early indicators of future reliability problems. This approach, combined with inspections, allows many reliability problems to be identified and resolved prior to testing.
3) The process must have feedback loops so that information about reliability problems discovered during inspection and testing is fed back not only to requirements analysis and design for the purpose of improving the reliability of future products, but also to the requirements analysis, design, inspection and testing processes themselves. In other words, the feedback is designed to improve not only the product but also the processes that produce the product.
4) Given the current state of the practice in software reliability modeling and prediction, practitioners should not view reliability models as having the ability to make highly accurate predictions of future software reliability. Rather, software managers should interpret these predictions in two significant ways: a) providing increased confidence, when used as part of an integrated SRE process, that the software is safe to deploy; and b) providing bounds on the reliability of the deployed software (e.g., high confidence that in operation the time to next failure will exceed the predicted value and the predicted value will safely exceed the mission duration).
References
1. E. W. Dijkstra, "Structured Programming", Software Engineering Techniques, eds. J. N. Buxton and B. Randell, NATO Scientific Affairs Division, Brussels 39, Belgium, April 1970, pp. 84-88.
2. IEEE Standard Glossary of Software Engineering Terminology, IEEE Std 610.12-1990, The Institute of Electrical and Electronics Engineers, New York, New York, March 30, 1990.
3. Ted Keller, Norman F. Schneidewind, and Patti A. Thornton, "Predictions for Increasing Confidence in the Reliability of the Space Shuttle Flight Software", Proceedings of the AIAA Computing in Aerospace 10, San Antonio, TX, March 28, 1995, pp. 1-8.
4. Norman F. Schneidewind, "Reliability Modeling for Safety Systems Software", IEEE Transactions on Reliability, Vol. 46, No. 1, March 1997, pp. 88-98.
5. Norman F. Schneidewind, "Software Reliability Model with Optimal Selection of Failure Data", IEEE Transactions on Software Engineering, Vol. 19, No. 11, November 1993, pp. 1095-1104.
6. Norman F. Schneidewind and T. W. Keller, "Application of Reliability Models to the Space Shuttle", IEEE Software, Vol. 9, No. 4, July 1992, pp. 28-33.
About the Author:
Norman F. Schneidewind is Professor of Information Sciences at the Naval Postgraduate School. Dr. Schneidewind was selected for an IEEE USA Congressional Fellowship for 2005 and will work on the Committee on Governmental Affairs in the U.S. Senate.
Dr. Schneidewind is a Fellow of the IEEE, elected in 1992 for "contributions to software measurement models in reliability and metrics, and for leadership in advancing the field of software maintenance". In 2001, he received the IEEE "Reliability Engineer of the Year" award from the IEEE Reliability Society.
In 1993 and 1999, he received awards for Outstanding Research Achievement by the Naval Postgraduate School. He is the developer of the Schneidewind software reliability model that is used by NASA to assist in the prediction of software reliability of the Space Shuttle. This model is one of the models recommended by the American Institute of Aeronautics and Astronautics Recommended Practice for Software Reliability.
Norman F. Schneidewind
[email protected]
Software Tech News, December 2004, Vol. 8, Number 1: Software Reliability Engineering