CONSISTENCY OF DISTRIBUTED SYSTEM WITH ACTIVE INITIATOR PROCESS WITHOUT USELESS CHECKPOINTS

Authors

  • N. P. Gopalan
  • K. Nagarajan

DOI:

https://doi.org/10.47839/ijc.5.1.387

Keywords:

Asynchronous distributed system, software fault tolerance, consistent global checkpoint, useless checkpoint, checkpointing interval, initiator process, consistent state

Abstract

Checkpointing is one of the most attractive approaches for providing software fault tolerance in distributed message-passing systems. This paper implements a distributed checkpointing technique that eliminates drawbacks of the centralized approach such as the “domino effect”, “useless checkpoints” (checkpoints that do not contribute to global consistency), and “hidden and zigzag” dependencies. The proposed protocol has a checkpoint initiator, but coordination among the local checkpoints is done in a distributed fashion. There is no central initiator; instead, the processes take turns acting as the initiator. The guarantee that no message is lost when a failure occurs is maintained by exchanging information among the processes. Processes take local checkpoints only after being notified by the initiator, and they synchronize their activities for the current checkpointing interval before finally committing their checkpoints. Thus, the checkpointing pattern described in this paper takes only those checkpoints that contribute to a consistent global snapshot, thereby eliminating useless checkpoints.
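
The flow summarized in the abstract (a rotating initiator, checkpoints taken only on notification, and synchronization before commit) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Process`, `take_tentative`, `commit`, and `run_interval` are hypothetical, the round-robin choice of initiator and the tentative-then-commit split are assumptions, and message passing, in-transit messages, and failures are not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Process:
    pid: int
    state: int = 0                          # stand-in for application state
    tentative: Optional[dict] = None        # checkpoint awaiting commit
    committed: List[dict] = field(default_factory=list)

    def take_tentative(self, interval: int) -> None:
        # A checkpoint is taken only when the initiator asks for one.
        self.tentative = {"interval": interval, "state": self.state}

    def commit(self) -> None:
        # Make the tentative checkpoint permanent.
        self.committed.append(self.tentative)
        self.tentative = None


def run_interval(procs: List[Process], interval: int) -> int:
    """Run one checkpointing interval and return the initiator's pid."""
    # Assumption: processes take turns as initiator in round-robin order.
    initiator = procs[interval % len(procs)]

    # Step 1: the initiator notifies every process, which then takes
    # a tentative local checkpoint for this interval.
    for p in procs:
        p.take_tentative(interval)

    # Step 2: commit only if all processes hold a tentative checkpoint
    # for the same interval, so no committed checkpoint is left without
    # a matching set on the other processes.
    ready = all(
        p.tentative is not None and p.tentative["interval"] == interval
        for p in procs
    )
    if ready:
        for p in procs:
            p.commit()
    return initiator.pid


# Small demonstration: three processes, four checkpointing intervals.
procs = [Process(pid=i) for i in range(3)]
initiators = [run_interval(procs, k) for k in range(4)]
```

In this sketch every committed checkpoint belongs to a complete set for its interval, which is the sense in which no checkpoint is taken that cannot contribute to a global snapshot. A real protocol would also have to account for messages in flight between processes.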

Published

2014-08-01

How to Cite

Gopalan, N. P., & Nagarajan, K. (2014). CONSISTENCY OF DISTRIBUTED SYSTEM WITH ACTIVE INITIATOR PROCESS WITHOUT USELESS CHECKPOINTS. International Journal of Computing, 5(1), 92-99. https://doi.org/10.47839/ijc.5.1.387

Section

Articles