Performance/Cost Analysis of Software-Implemented Hardware Fault Tolerance Techniques
Abstract
As Moore's law projections continue to hold, ever more transistors are integrated on a single chip, and these transistors become smaller with each technology generation. Forecasts suggest that the reliable operation of future devices with continuously shrinking geometries cannot be guaranteed. Currently used hardware-based fault tolerance techniques are expensive in terms of hardware modifications, the corresponding verification and testing effort, and the loss of volume savings. Hence there is a need for high-level techniques that ensure reliability even when the underlying hardware is not reliable. The advent of multi-cores, coupled with the accompanying exponential increase in transistor counts, threatens to make the problem even more severe. This report compares the performance cost of Software-Implemented Hardware Fault Tolerance (SIHFT) techniques on two distinct architectures: CPUs and GP-GPUs. We focus on multi-core architectures in the belief that they will form the processing cores of the future. We implement several SIHFT techniques on four different computational kernels, present and analyze their relative overheads, and highlight the architectural features that lead to different tradeoffs on CPUs versus GPUs. The results of this study provide insight into the cost of implementing SIHFT-based fault tolerance and the role that different architectural models play in shaping these costs.