Volume 5, No. 1, January 2010

Adaptive Checkpointing

Zizhong Chen
Department of Mathematical and Computer Sciences Colorado School of Mines, Golden, USA

Abstract—Checkpointing is a typical approach to tolerating failures in today’s supercomputing clusters and computational grids. Checkpoint data can be saved in central stable storage, in processor memory (as in diskless checkpointing), or in local disk space (replacing memory with local disk in diskless checkpointing). Where the checkpoint data is saved has a great impact on the performance of a checkpointing scheme. Fault tolerance schemes with higher efficiency usually save the checkpoint data closer to the processor. However, when failures are handled at the application level, the storage hierarchy of a platform is often not known when the fault tolerance scheme is designed. Therefore, it is often difficult to decide which checkpointing scheme to choose at application design time. In this paper, we demonstrate that good fault tolerance efficiency can be achieved by adaptively choosing where to store the checkpoint data at run time according to the specific characteristics of the platform. We analyze the performance of different checkpointing schemes and propose an efficient adaptive checkpointing scheme to incorporate fault tolerance into high performance computing applications.

Index Terms—adaptivity, checkpointing, diskless checkpointing, fault tolerance, parallel and distributed computing, high performance computing

Cite: Zizhong Chen, "Adaptive Checkpointing," Journal of Communications, vol. 5, no. 1, pp. 81-87, 2011. DOI: 10.4304/jcm.5.1.81-87




 
