Software Rejuvenation: Avoiding Failures Even When There Are Faults
"Software Engineering Design Which Enables Faulty Software to Run Without Failure - a Study of ARQ Wireless Protocol Implementation in C++"
By Lawrence Bernstein, Professor, Computer Science; Yu-Dong Yao, Professor, Electrical Engineering; Kevin Yao, Master’s Candidate, Computer Science
Overview
Skeptics doubt that software rejuvenation technology can avoid failures even when there are faults in the system, as Larry Bernstein argued on the front page of the DoD Software Tech News (Volume 3, Number 4) in the article "A Software Engineering Course for Trustworthy Software." Here is a case study showing that his conclusions are valid. You may wish to repeat this work in your organization to convince your process-driven engineers and your hackers that there is a right and proper role for software design technology in realizing trustworthy software.
In this case study, communications protocol software was built by a skilled programmer. It contained a memory leak that would crash the system. The software failed under certain stressful tests. Once software rejuvenation library software was bound into the communications software, the protocol no longer failed, even though the bug was still there. By using known software design technology, bugs may become benign.
Software execution is very sensitive to its initial conditions and the external data it receives. What appear to be random failures are often repeatable. The difficulty in finding and fixing them lies in the detective work needed to discover the particular initial conditions and data sequences that trigger the fault so that it becomes a failure.
Prof. Lui Sha’s model of reliability is based on these postulates [2]:
- Complexity begets faults. For a given execution time, software
reliability decreases as complexity increases.
- Faults are not equal. Some are easy to find and fix and others are
Heisenbugs. Faults are not random.
- All budgets have limits. There is not unlimited time or money to pay
for exhaustive testing.
Prof. Bernstein’s effectiveness extension of the reliability model adds an effectiveness factor. In '7x24' systems, the longer the software system runs, the lower its reliability and the more likely a fault will be executed and become a failure. Reliability can be improved by investing in tools, simplifying the design, or increasing the development effort beyond that projected using Software Cost Estimation models such as COCOMO II. For example, one may inspect code twice or include a diabolic testing group in addition to the normal test groups.
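The article does not give the model in symbols. One way to write the relationship described above, offered here as an assumption and not as a quotation from [2], is an exponential reliability function in which k is a scaling constant, C is complexity, t is execution time, E is development effort, and ε is the effectiveness factor:

```latex
R(t) = e^{-\frac{k\,C\,t}{E\,\varepsilon}}
```

Read this way, reliability falls as complexity or continuous execution time grows and rises with effort or effectiveness, which is consistent with the postulates listed above. Restarting the software from a clean state resets t, which is the lever the case study pulls.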
Study Perspective
Our goal was to study and quantify the parameters for the software reliability model by gathering real data from a controlled implementation of a wireless communications protocol. The study included these steps:
- Review the requirements.
- Create a Software Architecture.
- Design the software.
- Review the Software Architecture and design.
- Implement (develop) Automatic Repeat reQuest (ARQ) communication
software protocols in C++ [3].
- Inspect the code.
- Test the software.
- Measure reliability data from running system tests.
- Once the developer is satisfied that the code works reliably, stress test
the software with diabolic tests. The test chosen was one where all frames fail
in a window of 1000 frames, where the software should have continued to try to
send the frames. In the case study, the software crashed.
- Do not fix any bugs detected during the diabolic stress test. Instead, add
a fault tolerance library to see if the bug can be avoided. If the bug is
not executed, the software is shown to run reliably even though it is
still defective.
Software Requirements
The code will simulate the standard Selective Repeat ARQ protocol [3]. In this protocol, the message bytes are grouped into a set of frames, each frame is delimited by special header and trailer bytes, and a number of frames are sent in a burst. The receiver software either acknowledges successful receipt of the frames or sends special control frames to the sender software signaling that the frames that had errors must be resent. The software stops after correctly sending the assigned number of frames.
A procedure call sends frames to the network. The network may deliver a frame correctly, corrupt it, or lose it. The protocol can detect many kinds of network corruption and frame loss. Users control the emulation through these settings:
- Loss. Users specify a frame loss probability. A value of 0.1 would
mean that one in ten frames (on average) will be lost. Frame loss means that the
frame is lost or that the header or trailer is corrupted so that the frame does
not reach its destination because of noise.
- Corruption. Users specify a frame corruption probability. A value
of 0.2 would mean that one in five frames (on average) will be corrupted.
Frame corruption is when a frame reaches the destination with a bit error detected
by special error detection algorithms. Note that the header, message or trailer
bytes may be corrupted.
- Tracing. Setting a tracing value of 1 or 2 will print out
useful information about what is going on inside the emulation (e.g.,
what’s happening to frames and timers). A tracing value of 0 will turn
this off.
- Average time between messages. Users set this value to any
non-zero, positive value. Note that the smaller the value chosen, the faster
the frames will arrive at the sender.
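The loss, corruption, and tracing settings above can be sketched as a small emulator. This is a minimal illustration and not the code from the study: the names `NetworkConfig`, `NetworkEmulator`, `Tally`, and `runFrames` are hypothetical, and the sketch assumes that loss is decided first and that corruption is then applied only to frames that survive.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

// Possible fates of a frame handed to the emulated network.
enum class Outcome { Delivered, Corrupted, Lost };

// Hypothetical settings mirroring the parameters listed above.
struct NetworkConfig {
    double lossProbability = 0.0;        // 0.1 -> about one frame in ten is lost
    double corruptionProbability = 0.0;  // 0.2 -> about one frame in five is corrupted
    int traceLevel = 0;                  // 0 = silent, 1 or 2 = print what happens
};

class NetworkEmulator {
public:
    NetworkEmulator(NetworkConfig cfg, std::uint32_t seed)
        : cfg_(cfg), rng_(seed), uniform_(0.0, 1.0) {
        assert(cfg_.lossProbability >= 0.0 && cfg_.lossProbability <= 1.0);
        assert(cfg_.corruptionProbability >= 0.0 && cfg_.corruptionProbability <= 1.0);
    }

    // Decide what happens to one frame. Assumption: loss is checked first,
    // and a frame that survives loss may still be corrupted.
    Outcome send(std::vector<std::uint8_t>& frame) {
        if (uniform_(rng_) < cfg_.lossProbability) {
            trace("frame lost");
            return Outcome::Lost;
        }
        if (!frame.empty() && uniform_(rng_) < cfg_.corruptionProbability) {
            frame[0] ^= 0x01;  // flip one bit so error detection can catch it
            trace("frame corrupted");
            return Outcome::Corrupted;
        }
        trace("frame delivered");
        return Outcome::Delivered;
    }

private:
    void trace(const char* msg) const {
        if (cfg_.traceLevel > 0) std::cout << msg << '\n';
    }
    NetworkConfig cfg_;
    std::mt19937 rng_;
    std::uniform_real_distribution<double> uniform_;
};

// Counts of each outcome over a run.
struct Tally { int delivered = 0; int corrupted = 0; int lost = 0; };

// Push `frames` identical dummy frames through the emulator and tally the results.
Tally runFrames(const NetworkConfig& cfg, std::uint32_t seed, int frames) {
    NetworkEmulator net(cfg, seed);
    Tally t;
    for (int i = 0; i < frames; ++i) {
        std::vector<std::uint8_t> frame{0x7E, 0x01, 0x7E};  // dummy header, payload, trailer
        switch (net.send(frame)) {
            case Outcome::Delivered: ++t.delivered; break;
            case Outcome::Corrupted: ++t.corrupted; break;
            case Outcome::Lost:      ++t.lost;      break;
        }
    }
    return t;
}
```

Calling `runFrames` with `lossProbability` set to 1.0 reproduces the diabolic condition used later in the study: every frame handed to the network is lost, so a correct sender must keep retrying indefinitely.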
Based on the requirements, the program will run in a simulated hardware/software environment. It will have three parts: A-side (sender), B-side (receiver) and a network emulator to simulate the network environment. The overall structure of the environment is shown in Figure 1.
Figure 1. Layered Structure and Design Diagram
The program only implements unidirectional transfer of data (from A to B). Of course, the B side will have to send frames to A to acknowledge receipt of data.
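The Selective Repeat behavior required here, in which only the frames that were not acknowledged are resent, can be sketched as follows. This is an illustrative simplification and not the study's implementation: `TransferStats`, `Channel`, and `selectiveRepeat` are hypothetical names, timers and acknowledgment frames are collapsed into a single callback result, and a `maxRounds` bound is added so the sketch terminates even when every frame is lost.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Result of one simulated transfer from the A-side to the B-side.
struct TransferStats {
    std::size_t transmissions = 0;  // frames handed to the network, including resends
    std::size_t delivered = 0;      // distinct frames accepted by the receiver
};

// Hypothetical channel callback: returns true if this attempt at sending
// frame `seq` arrives intact, false if it is lost or corrupted.
using Channel = std::function<bool(std::size_t seq, std::size_t attempt)>;

// Send `frameCount` frames using a sliding window of `windowSize` frames.
// Frames already acknowledged are never resent (Selective Repeat).
// `maxRounds` bounds the retries so the sketch always terminates.
TransferStats selectiveRepeat(std::size_t frameCount, std::size_t windowSize,
                              const Channel& channel, std::size_t maxRounds) {
    assert(windowSize > 0);
    std::vector<bool> acked(frameCount, false);        // acknowledgment state per frame
    std::vector<std::size_t> attempts(frameCount, 0);  // attempts made so far per frame
    TransferStats stats;
    std::size_t base = 0;  // oldest unacknowledged frame

    for (std::size_t round = 0; round < maxRounds && base < frameCount; ++round) {
        const std::size_t end = std::min(base + windowSize, frameCount);
        for (std::size_t seq = base; seq < end; ++seq) {
            if (acked[seq]) continue;  // already received: do not resend
            ++stats.transmissions;
            if (channel(seq, attempts[seq]++)) {
                acked[seq] = true;  // B-side acknowledges this frame
                ++stats.delivered;
            }
        }
        while (base < frameCount && acked[base]) ++base;  // slide the window forward
    }
    return stats;
}
```

With a channel that always returns false, the window never slides and the number of transmissions grows with `maxRounds` while `delivered` stays at zero. Without the artificial bound, that is the "never stops" condition exercised by the 100% loss stress test.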
Software Development Process History
(as recorded by the developer)
- I wrote documentation, tried to finish requirements document and
architecture document and even the test plan first before I began coding.
I became bored and under time pressure started coding before all documents
were ready.
- I rushed into coding without complete understanding of the problem in
the mistaken belief that coding would lead to insight into the problem.
- I built a simple version of the software and was proud to see that it
worked. Early success gave me confidence to plunge ahead.
- I was stuck after finishing a simple version of the protocol because it
would not scale to the Stop-N-Wait version. The early version lacked timers,
flow control counters and frame counters.
- I made a design change to an event driven architecture.
- I ran unit tests and fixed bugs. A bug surfaced when sending 1000 frames
with 50% loss rate. A pointer operation for one boundary condition exceeded the
design constraints.
- The stress test with a 100% frame loss rate exhausted the system memory.
The exhaustion came from the use of fprintf() for debugging and statement
recording, and the software hung.
- I added the Libft library to provide fault tolerance in the program [1].
A memory conflict happened at the start up of Windows 2000. Windows NT 4.0
ran with the latest service pack 6a and Swift. There were many problems when
configuring the Windows NT computer. These were resource limitation and
configuration conflict problems, typical of so many software developments.
- Reran stress tests and all worked perfectly.
Stress Test Without Rejuvenation
- Sending out 1000 frames 1000 times, with 0% loss rate and 0% corruption
rate - passed.
- Sending out 1000 frames 1000 times, with 10% loss rate and 10% corruption
rate - passed.
- Sending out 1000 frames 1000 times, with 50% loss rate and 50% corruption
rate - passed.
- Sending out 1000 frames 1000 times, with 100% loss rate and 100%
corruption rate - failed.
There was a memory leak in the code. The leak occurs under all conditions but causes the software to fail only under the most stressful conditions. The stress test uncovered the fault. If the system were used in production, an unpredictable amount of memory would have been consumed, and other programs running in the same memory space might have failed at random for lack of memory. In this way a fault in one program becomes coupled to another. The stress test was effective in detecting the fault so that a potential system failure could be avoided.
After the fault tolerance library (Libft) was bound to the ARQ product, the stress test was repeated. With a 100% loss rate, the program never failed and, as expected, never stopped. Memory usage did not increase. The failure caused by the memory leak was avoided.
The results from the stress test are illustrated in Figure 2.
After the fault tolerance library (Libft) was built in to reset the program every time 100 frames were sent, the 1000-frame test cases were repeated. This time there was no crash.
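The effect of resetting every 100 frames can be illustrated with a small in-process sketch. It does not use Libft or reproduce its API, which is not described in this article. Instead, a hypothetical `LeakySender` stands in for the faulty code, and the hypothetical function `peakRetained` discards and rebuilds it at a fixed interval, leaving the leak itself unfixed.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Stand-in for the faulty protocol code: every send appends to a debug log
// that is never trimmed, so memory use grows for as long as the object lives.
class LeakySender {
public:
    void sendFrame(std::size_t seq) {
        debugLog_.push_back("sent frame " + std::to_string(seq));  // the "leak"
    }
    std::size_t retainedEntries() const { return debugLog_.size(); }

private:
    std::vector<std::string> debugLog_;
};

// Run `totalFrames` sends and return the peak number of retained log entries.
// If `rejuvenateEvery` is non-zero, the sender is discarded and rebuilt from a
// clean state after that many frames. The faulty code itself is left untouched.
std::size_t peakRetained(std::size_t totalFrames, std::size_t rejuvenateEvery) {
    auto sender = std::make_unique<LeakySender>();
    std::size_t peak = 0;
    std::size_t sinceRestart = 0;
    for (std::size_t seq = 0; seq < totalFrames; ++seq) {
        sender->sendFrame(seq);
        ++sinceRestart;
        if (sender->retainedEntries() > peak) peak = sender->retainedEntries();
        if (rejuvenateEvery != 0 && sinceRestart == rejuvenateEvery) {
            sender = std::make_unique<LeakySender>();  // rejuvenation: fresh state
            sinceRestart = 0;
        }
    }
    // With rejuvenation enabled, growth is bounded by the restart interval.
    assert(rejuvenateEvery == 0 || peak <= rejuvenateEvery);
    return peak;
}
```

In this sketch, `peakRetained(1000, 0)` lets the retained state grow to 1000 entries, while `peakRetained(1000, 100)` caps it at 100. The defect is still present in both runs; only the time it has to accumulate damage differs.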
Figure 2. Stress Test Results
Software Reliability Analysis
Two defects were found during system test. One defect crashed the software due to a memory leak and was avoided once the software fault tolerance library was added to the ARQ product. The other defect could occasionally cause the wrong frame to be sent but it did not hang or crash the package and the code was fixed.
The rush to coding without careful design led to poor software architecture. Too often developers do not feel they have the time to start again and instead live with an ill-conceived architecture, leading to untrustworthy software execution. The need for a solid architecture and good prototype was illustrated by the events in the case study.
The stress tests were designed by Professor Bernstein and went well beyond the imagination of the developer. The stress test was successful in inducing a latent fault to hang the system. Too often a latent fault in one process consumes resources, causing other fault-free processes to fail.
The wisdom of exploiting fault tolerant software technology was demonstrated in the case study. Try it. You will like it, and so will those you live with, who will not be startled in the middle of the night as you handle a frantic phone call.
About the Author
Lawrence Bernstein is a recognized expert in software technology, network architecture, network management software, software project management, and technology conversion. He conceived of the notion of software rejuvenation. He is currently teaching graduate courses on Computer Networks and undergraduate Software Engineering at Stevens Institute of Technology in Hoboken, NJ.
During a distinguished 35-year career at Bell Laboratories he was Chief Technical Officer of the Operations Systems Business Unit and an Executive Director managing large software projects. Since retirement he heads his own consulting firm.
Author Contact Information
Larry Bernstein
Stevens Institute of Technology
4 Marion Avenue
Short Hills, NJ 07078
973-258-9213
http://guinness.cs.stevens-tech.edu/~lbernste/
References
[1] Huang, Y. and Kintala, C. M. R., "Software Implemented Fault Tolerance: Technologies and Experience," Proceedings of 23rd Intl. Symposium on Fault-Tolerant Computing, Toulouse, France, pp. 2-9, June 1993. Also appeared as a chapter in the book Software Fault Tolerance, M. Lyu (Ed.), John Wiley & Sons, March 1995.
[2] Sha, Lui, "Using Simplicity to Control Complexity," IEEE Software, July/August 2001, Volume 18, Number 4, IEEE 0740-7459/01, software@computer.org, page 27.
[3] Stallings, William, Data and Computer Communications, 5th Edition, New York: Prentice Hall PTR, (1996), ISBN: 0024154253.