FAULT TOLERANCE IN GRIDS USING JOB REPLICATION

Mohammed Amoon

doi:10.47839/ijc.11.2.556

Keywords:

Job scheduling, Fault tolerance, Replication, Grid computing.

Abstract

Because grids consist of a large number of resources, fault tolerance is an important aspect of the scheduling process. In this paper, we address the problem of scheduling user jobs in grids so that job failures can be avoided in the presence of resource faults. We employ job replication as an effective mechanism to achieve an efficient and fault-tolerant scheduling system. Most existing replication-based algorithms use a fixed number of replicas for each job, which consumes more grid resources. We first propose an algorithm that adaptively determines the number of job replicas according to the grid failure history. We then propose an algorithm to schedule these replicas. The proposed algorithms have been evaluated through simulation and have shown better performance in terms of grid load, throughput, and failure tendency.
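The abstract does not state the rule used to choose the number of replicas or to place them, so the following is only a minimal sketch of the general idea under stated assumptions, not the paper's algorithm. It assumes that the failure history is summarized as a count of failed and total jobs, that replicas fail independently, and that a job needs enough replicas for the chance of all of them failing to drop below a target. The names `replica_count`, `pick_resources`, `target_failure`, and `max_replicas` are hypothetical.

```python
def replica_count(failed_jobs, total_jobs, target_failure=0.01, max_replicas=5):
    """Hypothetical rule: smallest r (capped) with p**r <= target_failure,
    where p is the failure rate observed in the grid's history."""
    if total_jobs == 0:
        # No history yet: fall back to the cap (an assumption, not from the paper).
        return max_replicas
    p = failed_jobs / total_jobs
    r = 1
    all_fail = p  # probability that every replica fails, assuming independence
    while all_fail > target_failure and r < max_replicas:
        r += 1
        all_fail *= p
    return r


def pick_resources(failure_rates, r):
    """Hypothetical placement: the r resources with the lowest observed failure rate."""
    ranked = sorted(failure_rates, key=failure_rates.get)
    return ranked[:r]


if __name__ == "__main__":
    r = replica_count(failed_jobs=20, total_jobs=100)
    print(r, pick_resources({"a": 0.3, "b": 0.05, "c": 0.1, "d": 0.2}, r))
```

With a fixed replica count, a grid that rarely fails would still pay for every extra copy; in this sketch a low observed failure rate yields a single replica, and the count grows only as the history worsens.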

International Journal of Computing
