On-line failure detection is an essential means of controlling and assessing the dependability of complex and critical software systems. In such contexts, effective detection strategies are required to minimize the possibility of catastrophic consequences. This objective is, however, difficult to achieve in complex systems, especially because of the several sources of non-determinism (e.g., multi-threading and distributed interaction) that may lead to software hangs, i.e., states in which the system is active but no longer capable of delivering its services. The paper proposes a detection approach to uncover application hangs. It exploits multiple kinds of indirect data gathered at the operating system level to monitor the system and to trigger alarms if the observed behavior deviates from the expected one. Fault injection experiments conducted on a research prototype show that combining several operating system monitors leads to a high quality of detection at an acceptable overhead.
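The idea of combining several operating system monitors can be illustrated with a minimal sketch. It is not the paper's implementation: the monitor names, the expected ranges, and the "at least `min_alarms` monitors must fire" combination rule are all assumptions made here for illustration, since the abstract specifies neither the monitored quantities nor how their alarms are merged.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Monitor:
    """One OS-level indicator with an expected range (hypothetical names and bounds)."""
    name: str
    low: float
    high: float

    def alarms(self, value: float) -> bool:
        # Alarm when the observed value falls outside the expected range.
        return not (self.low <= value <= self.high)


def detect_hang(monitors: List[Monitor], sample: Dict[str, float], min_alarms: int) -> bool:
    """Report a suspected hang when at least `min_alarms` monitors alarm.

    The k-of-n vote is an assumed combination rule, not the one used in the paper.
    """
    fired = [m.name for m in monitors if m.alarms(sample[m.name])]
    return len(fired) >= min_alarms


# Illustrative monitors; the quantities and thresholds are invented for this example.
MONITORS = [
    Monitor("syscalls_per_sec", 50, 5000),
    Monitor("ctx_switches_per_sec", 10, 2000),
    Monitor("lock_wait_ms", 0, 200),
]

healthy = {"syscalls_per_sec": 800, "ctx_switches_per_sec": 300, "lock_wait_ms": 15}
noisy = {"syscalls_per_sec": 20, "ctx_switches_per_sec": 300, "lock_wait_ms": 15}
hung = {"syscalls_per_sec": 0, "ctx_switches_per_sec": 2, "lock_wait_ms": 9000}

if __name__ == "__main__":
    for label, sample in [("healthy", healthy), ("noisy", noisy), ("hung", hung)]:
        print(label, detect_hang(MONITORS, sample, min_alarms=2))
```

Requiring agreement among several monitors is one plausible way to keep a single noisy indicator from raising a false alarm, at the cost of possibly missing hangs that only one monitor can observe.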
Author and article information
Contributors
G. Carrozza
M. Cinque
D. Cotroneo
R. Natella
Conference
Publication date: July 2008
Publication date (Print): July 2008
Pages: 1-11
Affiliations
(1) Dipartimento di Informatica e Sistemistica - Università degli Studi di Napoli Federico II, Via Claudio 21, 80125 - Naples, Italy