Software Rejuvenation and Self-healing

By Lawrence Bernstein, Industry Research Professor, Computer Science Department,
and Chandra M. R. Kintala, Distinguished Professor, Electrical and Computer Engineering Department, Stevens Institute of Technology, Hoboken, NJ 07030

Software rejuvenation is a periodic, pre-emptive restart of a running system to prevent future failures. It is one aspect of a self-healing system. It was first introduced, described, implemented, modeled and analyzed in [1]. It is used in systems ranging from a data collector that most US telephone companies use to collect billing information to NASA's long-duration space mission to Pluto [2]. It is also implemented in IBM's Netfinity resource manager [3]. The billing system failures described in [1], and the use of software rejuvenation to prevent them, are quite similar to the failures and the fix that Nick van der Zweep described in the Computer World article (QuickLink Ref# 43636) dated January 12, 2004.

Software rejuvenation incurs overhead, so modeling to find the optimal times to rejuvenate is crucial. A simple and useful model based on continuous-time Markov chains was introduced in [1] to analyze the reliability improvements due to software rejuvenation; the model is also useful for finding optimal trigger rates or frequencies for rejuvenation. This model was then extended using Stochastic Petri Nets to study rejuvenation using the fail-over mechanisms in IBM's cluster-based systems [4]. X2000, for NASA's 12-year-long Pluto-Kuiper Express mission, used software rejuvenation to perform simultaneous on-board preventive maintenance of software and hardware components during the cruise and exploration phases. The reliability analysis showed an improvement of two orders of magnitude due to software rejuvenation [2]; the optimal interval was found to be 31.2 weeks in the 12-year-long cruise phase. A recent paper [5] described software rejuvenation in web servers and how it can be analyzed to determine the optimal interval for rejuvenation.
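
The trade-off between rejuvenation overhead and failure cost can be illustrated with a small continuous-time Markov chain. The sketch below is only an illustration in the spirit of that kind of model, not a reproduction of the model in [1]: the four states (robust, failure-probable, failed, rejuvenating), the function names, and all rates and costs are assumptions chosen for the example.

```python
def steady_state(aging, failure, repair, rejuvenate, restart):
    """Steady-state probabilities of a four-state rejuvenation CTMC.

    All arguments are rates (events per hour) and are illustrative:
      aging:      robust -> failure-probable
      failure:    failure-probable -> failed
      repair:     failed -> robust
      rejuvenate: failure-probable -> rejuvenating (0 disables rejuvenation)
      restart:    rejuvenating -> robust
    """
    # Solve the balance equations relative to the failure-probable state.
    weights = {
        "robust": (failure + rejuvenate) / aging,
        "failure_probable": 1.0,
        "failed": failure / repair,
        "rejuvenating": rejuvenate / restart,
    }
    total = sum(weights.values())
    return {state: w / total for state, w in weights.items()}


def downtime_cost(probs, cost_failed, cost_rejuvenating):
    """Expected downtime cost per hour of operation."""
    return (probs["failed"] * cost_failed
            + probs["rejuvenating"] * cost_rejuvenating)


# Hypothetical numbers, chosen only to exercise the model.
RATES = dict(aging=1 / 240, failure=1 / 2160, repair=2.0, restart=6.0)
COSTS = dict(cost_failed=5000.0, cost_rejuvenating=5.0)

baseline = steady_state(rejuvenate=0.0, **RATES)
# Rejuvenate on average two weeks after the system enters the aged state.
with_rejuvenation = steady_state(rejuvenate=1 / 336, **RATES)

print(f"no rejuvenation:   {downtime_cost(baseline, **COSTS):.3f} per hour")
print(f"with rejuvenation: {downtime_cost(with_rejuvenation, **COSTS):.3f} per hour")
```

In this sketch both the expected cost and the normalizing constant are linear in the rejuvenation rate, so the cost is monotone in that rate: rejuvenation pays off only when a planned restart is cheap enough relative to an unplanned failure. Richer models are needed to obtain an interior optimum such as the 31.2-week interval reported in [2].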

Recent experiments at Stevens Institute of Technology showed that data link protocols suffering memory leak failures could be made reliable using rejuvenation libraries without having to fix the memory leak bug [6]. In essence, rejuvenation bounds the execution space for the working software so that latent failure modes are not executed. Had this technology been used in the Patriot Missile system during the first Iraq war, the counter overflow problem that caused the anti-Scud system to fail would not have occurred. The need for this technology was first identified during field tests of the earlier Safeguard anti-missile system. It was then applied to avoid hash table problems in a data switch.
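
A toy simulation can show how a pre-emptive restart keeps a leaking component away from its failure region without fixing the leak. This is a minimal sketch under stated assumptions: `LeakyLink`, `RejuvenatingLink`, and all capacities and thresholds are hypothetical names and values invented for the example, and they do not represent the rejuvenation libraries used in [6].

```python
class LeakyLink:
    """Toy data-link handler that leaks memory on every frame (hypothetical)."""

    def __init__(self, capacity=1000, leak_per_frame=3):
        self.capacity = capacity
        self.leak_per_frame = leak_per_frame
        self.leaked = 0

    def send(self, frame):
        self.leaked += self.leak_per_frame  # the unfixed bug
        if self.leaked > self.capacity:
            raise MemoryError("link exhausted its memory budget")
        return len(frame)


class RejuvenatingLink:
    """Restarts the wrapped link after a fixed number of frames."""

    def __init__(self, factory, frames_per_life):
        self.factory = factory
        self.frames_per_life = frames_per_life
        self.link = factory()
        self.frames = 0
        self.restarts = 0

    def send(self, frame):
        if self.frames >= self.frames_per_life:
            # Pre-emptive restart: discard the aged instance before it fails.
            self.link = self.factory()
            self.frames = 0
            self.restarts += 1
        self.frames += 1
        return self.link.send(frame)


def frames_until_failure(link, limit):
    """Send up to `limit` frames; return how many succeeded."""
    for sent in range(limit):
        try:
            link.send(b"payload")
        except MemoryError:
            return sent
    return limit


plain = frames_until_failure(LeakyLink(), 10_000)
guarded_link = RejuvenatingLink(LeakyLink, frames_per_life=300)
guarded = frames_until_failure(guarded_link, 10_000)
print(plain, guarded, guarded_link.restarts)
```

The unguarded link fails once the accumulated leak exceeds its budget, while the guarded link is always replaced at 300 frames, well inside the budget, so the latent failure is never reached. The restart threshold here is fixed by hand; choosing it well is the modeling problem discussed earlier.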

Since the 1960s, data communication designers have known to have software modules restart a line when it hangs. Rejuvenation technology restarts a line before it hangs to avoid potential secondary problems. It is a low-cost, easy-to-implement technology that makes systems more trustworthy.

Software rejuvenation is one aspect of self-healing. Interesting new problems in the study of rejuvenation for large-scale systems include:

  • What is a state in a large-scale system for rejuvenation analysis when “state” is spread across several products and systems?
  • Failure symptoms are at a system/network (macro) level but rejuvenation actions are at a component (micro) level; how does one correlate the two?
  • What are the models and analytical methods for rejuvenation in large-scale systems?
  • How does one do rejuvenation in a large system? Through gradual load shedding?
  • What is a safe (clean internal) state to back up to? How does one back up to that state?
  • How does the technology become common practice?

1 Y. Huang, C. Kintala, N. Kolettis, and N. D. Fulton, “Software Rejuvenation: Analysis, Module and Applications”, in Proc. of 25th Symposium on Fault Tolerant Computing, FTCS-25, pages 381–390, Pasadena, California, June 1995.

2 A. T. Tai, L. Alkalai and S. N. Chau, “On-Board Preventive Maintenance: A Design-Oriented Analytic Study for Long-Life Applications”, in Performance Evaluation, Vol. 35, No. 3-4, pp. 215–232, June 1999.

3 V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan and W. P. Zeggert, “Proactive Management of Software Aging”, in IBM Journal of Research & Development, Vol. 45, No. 2, March 2001.

4 K. Vaidyanathan, R. E. Harper, S. W. Hunter, K. S. Trivedi, “Analysis and Implementation of Software Rejuvenation in Cluster Systems,” in Proc. of the Joint Intl. Conference on Measurement and Modeling of Computer Systems, ACM SIGMETRICS 2001/Performance 2001, Cambridge, MA, June 2001.

5 Y. Bao, X. Sun and K. Trivedi, “Adaptive Software Rejuvenation: Degradation Models and Rejuvenation Schemes,” in Proc. of The International Conference on Dependable Systems and Networks, DSN-2003, June 2003.

6 Lawrence Bernstein, Yu-Dong Yao, Kevin Yao, “Software Avoiding Failures Even When there are Faults,” The DoD SoftwareTech News, October 2003, Vol. 6, No. 2, pp. 8–11, https://www.softwaretechnews.com, http://iac.dtic.mil/dacs

July 2004
Vol. 7, Number 2

Software Costs
 
