ABSTRACTION CHECKPOINTING LEVELS: PROBLEMS AND SOLUTIONS
DOI:
https://doi.org/10.47839/ijc.13.3.629
Keywords:
Checkpointing, abstraction level, system level, application level, compiler, transparency, portability.
Abstract
Checkpointing is a common approach to guaranteeing an acceptable level of fault tolerance in scientific computing. With this strategy, a failed task is restarted from its most recently checkpointed state rather than from the beginning, which reduces system loss and ensures reliability. Several kinds of systems use checkpointing for fault tolerance, such as HPC systems, distributed discrete event simulation, and clouds. The literature proposes several classifications of checkpointing techniques based on different metrics and criteria. In this paper we focus on the classification based on abstraction level, which divides checkpointing into two principal types: application level and system level. Each level has its advantages and suffers from many problems. Unlike the other surveys in the literature, this paper studies each level in detail. We also study and analyze works that propose solutions to the problems and limitations of each abstraction level.
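As an illustration of the restart behaviour described in the abstract, the following is a minimal sketch of application-level checkpointing, in which the program itself decides what state to save and when. It is not taken from the paper: the function names, the JSON file format, the checkpoint interval, and the `fail_at` parameter used to simulate a crash are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over the old checkpoint.

    The rename keeps the previous checkpoint intact if a crash happens mid-write.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh initial state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_step": 0, "total": 0}


def run_task(path, n_steps, interval, fail_at=None):
    """Sum the integers 0..n_steps-1, saving a checkpoint every `interval` steps.

    `fail_at` is a hypothetical knob that simulates a failure at the given step.
    """
    state = load_checkpoint(path)  # resume from the last checkpoint, if any
    for step in range(state["next_step"], n_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated failure at step {step}")
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

With `n_steps=100` and `interval=10`, a run that fails at step 57 leaves a checkpoint taken after step 49. Calling `run_task` again on the same path resumes at step 50 instead of step 0 and still returns the full sum. Only the work done since the last checkpoint is repeated, which is the trade-off the checkpoint interval controls.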
License
International Journal of Computing is an open access journal. Authors who publish with this journal agree to the following terms:
• Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
• Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
• Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.