Journal of Computers, Vol 6, No 10 (2011), 2013-2020, Oct 2011
doi:10.4304/jcp.6.10.2013-2020

A Novel Failure Detection Algorithm for Reliable Distributed Systems

Yunni Xia, Chuanjian Jiang, Tianhao Sun, Ruilong Yang

Abstract


A failure detection service is perfect if it eventually detects all failures and every detection correctly identifies a failure that has occurred. Such a perfect failure detection service serves as a basic building block for many reliable distributed systems, for example in distributed lock services. In this paper, we introduce a perfect failure detection scheme in order to improve the fault tolerance of the service. We provide the precise system model and specification for a failure detection service. We present two novel algorithms that implement the failure detection service. We further develop a set of quality-of-service (QoS) metrics for perfect failure detection services, and apply probabilistic analysis to quantify the QoS metrics of the two algorithms.


Keywords


failure detection; distributed system; quality of service

References


[1] N. Balakrishnan and A. C. Cohen, “Order Statistics and Inference,” Academic Press, 1991

[2] K. P. Birman and R. van Renesse, “Reliable Distributed Computing with the Isis Toolkit”. IEEE Computer Society Press, 1993.

[3] M. Burrows, “The Chubby lock service for loosely-coupled distributed systems”. In Proceedings of the 7th Symposium on Operating System Design and Implementation, Nov. 2006.

[4] T. D. Chandra and S. Toueg, “Unreliable failure detectors for reliable distributed systems”. Journal of the ACM, 43(2):225–267, Mar. 1996.
http://dx.doi.org/10.1145/226643.226647

[5] W. Chen, S. Toueg, and M. K. Aguilera, “On the quality of service of failure detectors”. IEEE Transactions on Computers, 51(5):561–580, May 2002.
http://dx.doi.org/10.1109/TC.2002.1004595

[6] C. Fetzer, “Perfect failure detection in timed asynchronous systems”. IEEE Transactions on Computers, 52(2):99–112, Feb. 2003.
http://dx.doi.org/10.1109/TC.2003.1176979

[7] C. G. Gray and D. R. Cheriton. Leases, “An efficient faulttolerant mechanism for distributed file cache consistency”. In Proceedings of the 12th ACM Symposium on Operating Systems Principles, pages 202–210, Dec. 1989.

[8] L. Lamport, “On interprocess communication; part I: Basic formalism”. Distributed Computing, 1(2):77–85, 1986.
http://dx.doi.org/10.1007/BF01786227

[9] L. Lamport, “On interprocess communication; part II: Algorithms”. Distributed Computing, 1(2):86–101, 1986.
http://dx.doi.org/10.1007/BF01786228

[10] L. Lamport, “The part-time parliament”. ACM Transactions on Computer Systems, 16(2):133–169, 1998.
http://dx.doi.org/10.1145/279227.279229

[11] B. Liskov, S. Ghemawat, R. Gruber, P. Johnson, and L. Shrira, “Replication in the Harp file system”. In Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages 226–238, Oct. 1991.

[12] J.MacCormick, N.Murphy, M. Najork, C. A. Thekkath, and L. Zhou, “Boxwood: Abstractions as the foundation for storage infrastructure”. In Proceedings of the 6th Symposium on Operating System Design and Implementation, Dec. 2004.

[13] D. Malkhi, F. Oprea, and L. Zhou, “Omega meets Paxos: Leader election and stability without eventual timely links”. In Proceedings of the 19th International Symposium on Distributed Computing, Sept. 2005.

[14] R. van Renesse and F. B. Schneider, “Chain replication for supporting high throughput and availability”. In Proceedings of the 6th Symposium on Operating System Design and Implementation, Dec. 2004


Full Text: PDF


Journal of Computers (JCP, ISSN 1796-203X)

Copyright @ 2006-2012 by ACADEMY PUBLISHER – All rights reserved.