OBAFEMI AWOLOWO UNIVERSITY, ILE-IFE, NIGERIA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ACHIEVING FAULT-TOLERANCE IN OPERATING SYSTEM DESIGN AND IMPLEMENTATION
Fault-tolerant computing is the art and science of building computing systems that continue to operate satisfactorily in the presence of faults. A fault-tolerant system may be able to tolerate one or more types of fault, including: i) transient, intermittent or permanent hardware faults; ii) software and hardware design errors; iii) operator errors; and iv) externally induced upsets or physical damage. An extensive methodology has been developed in this field over the past thirty years, and a number of fault-tolerant machines have been built. Most of them deal with random hardware faults, while a smaller number deal with software, design and operator faults to varying degrees. A large amount of supporting research has been reported.
Research on fault tolerance and dependable systems covers a wide spectrum of applications, including embedded real-time systems, commercial transaction systems, transportation systems, and military/space systems, to name a few. The supporting research includes system architecture, design techniques, coding theory, testing, validation, proof of correctness, modelling, software reliability, operating systems, parallel processing, and real-time processing. These areas often draw on widely diverse core expertise, ranging from formal logic, the mathematics of stochastic modelling and graph theory to hardware design and software engineering. Recent developments include the adaptation of existing fault-tolerance techniques to RAID disks, where information is striped across several disks to improve bandwidth and a redundant disk holds encoded information so that data can be reconstructed if a disk fails. Another area is the use of application-based fault-tolerance techniques to detect errors in high-performance parallel processors. Fault-tolerance techniques are expected to become increasingly important in deep sub-micron VLSI devices, both to combat increasing noise problems and to improve yield by tolerating the defects that are likely to occur on very large, complex chips.
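The RAID reconstruction mentioned above can be illustrated with a minimal sketch. It assumes the "encoded information" on the redundant disk is a simple byte-wise XOR parity of the data blocks, as in single-parity RAID schemes; the text does not say which code is used, and the names xor_blocks, stripe and reconstruct, as well as the block size, are hypothetical and chosen only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe(data, n_data_disks, block_size):
    """Split data across n_data_disks blocks and append one parity block.

    Assumption: data fits in a single stripe and is zero-padded to fill it.
    """
    capacity = n_data_disks * block_size
    if len(data) > capacity:
        raise ValueError("data does not fit in one stripe")
    padded = data.ljust(capacity, b"\0")
    blocks = [padded[i * block_size:(i + 1) * block_size]
              for i in range(n_data_disks)]
    return blocks + [xor_blocks(blocks)]  # last "disk" holds the parity


def reconstruct(disks, failed_index):
    """Rebuild the block of one failed disk from all surviving disks."""
    survivors = [b for i, b in enumerate(disks) if i != failed_index]
    return xor_blocks(survivors)


if __name__ == "__main__":
    disks = stripe(b"fault tolerant", n_data_disks=3, block_size=5)
    lost = 1  # pretend the second data disk has failed
    print(reconstruct(disks, lost) == disks[lost])
```

Because XOR is its own inverse, the same routine rebuilds either a data block or the parity block. The scheme tolerates the loss of only one disk per stripe; tolerating more failures needs a stronger code.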
Fault-tolerant computing already plays a major role in process control, transportation, electronic commerce, space, communications and many other areas that affect our lives. Many of its next advances will occur when it is applied to new state-of-the-art systems such as massively parallel scalable computing, promising unconventional architectures such as processor-in-memory or reconfigurable computing, mobile computing, and other emerging technologies.
Hardware Fault-Tolerance - The majority of fault-tolerant designs have been directed toward building computers that automatically recover from random faults occurring in hardware components. The techniques employed generally involve partitioning a computing system into modules that act as fault-containment regions. Each module is backed up with protective redundancy so that, if the module fails, others can assume its function. Special mechanisms are added to detect errors and implement recovery. Two general approaches to hardware fault recovery have been used: 1) fault masking, and 2) dynamic recovery. Fault masking is a structural redundancy technique that completely masks faults within a set of redundant modules. A number of identical modules execute the same functions, and their outputs are voted on to remove errors created by a faulty module. Triple modular redundancy (TMR) is a commonly used form of fault masking in which the circuitry is triplicated and its outputs are voted on. The voting circuitry can itself be triplicated so that individual voter failures are also corrected by the voting process. A TMR system fails whenever two modules in a redundant triplet create errors, so that the vote is no longer valid. Hybrid redundancy is an extension...
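The majority voting behind TMR, as described above, can be sketched in software as follows. This is only an illustration under stated assumptions: real TMR voters are hardware circuits, and the function names vote_bits, majority_vote and tmr are hypothetical. The bitwise form mirrors the usual two-out-of-three majority gate, while the value-based form shows what happens when no majority exists.

```python
from collections import Counter


def vote_bits(a, b, c):
    """Bitwise 2-out-of-3 majority of three integer words."""
    return (a & b) | (b & c) | (a & c)


def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules.

    Assumption: outputs are hashable values. Raises RuntimeError when no
    value has a strict majority, i.e. the vote is no longer valid.
    """
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):
        raise RuntimeError("no majority: too many modules disagree")
    return value


def tmr(modules, x):
    """Run three redundant modules on the same input and vote on the results."""
    if len(modules) != 3:
        raise ValueError("TMR needs exactly three modules")
    return majority_vote([module(x) for module in modules])


if __name__ == "__main__":
    def good(x):
        return x * x

    def faulty(x):
        return x * x + 1  # simulated faulty module

    # A single faulty module is masked by the two healthy ones.
    print(tmr([good, good, faulty], 4))
```

If two modules fail with different wrong outputs, majority_vote raises an error. If they fail with the same wrong output, the voter silently returns the wrong value, which is the failure case the text describes.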