Access Restriction Subscribed

Author Das, A. ♦ Gupta, I. ♦ Motivala, A.
Sponsorship IEEE Comput. Soc. Tech. Committee on Fault-Tolerant Comput. (TCFTC) ♦ IFIP Working Group 10.4 on Dependable Comput. & Fault Tolerance ♦ DARPA ♦ LAAS-CNRS ♦ Univ. Michigan
Source IEEE Xplore Digital Library
Content type Text
Publisher Institute of Electrical and Electronics Engineers, Inc. (IEEE)
File Format PDF
Copyright Year ©2002
Language English
Subject Domain (in DDC) Computer science, information & general works ♦ Data processing & computer science
Subject Keyword Protocols ♦ Peer to peer computing ♦ Application software ♦ Large-scale systems ♦ Delay ♦ Frequency ♦ Computer crashes ♦ Personal communication networks ♦ Condition monitoring ♦ Robustness
Abstract Several distributed peer-to-peer applications require weakly-consistent knowledge of process group membership information at all participating processes. SWIM is a generic software module that offers this service for large-scale process groups. The SWIM effort is motivated by the unscalability of traditional heart-beating protocols, which either impose network loads that grow quadratically with group size, or compromise response times or false positive frequency w.r.t. detecting process crashes. This paper reports on the design, implementation and performance of the SWIM sub-system on a large cluster of commodity PCs. Unlike traditional heart-beating protocols, SWIM separates the failure detection and membership update dissemination functionalities of the membership protocol. Processes are monitored through an efficient peer-to-peer periodic randomized probing protocol. Both the expected time to first detection of each process failure and the expected message load per member do not vary with group size. Information about membership changes, such as process joins, drop-outs and failures, is propagated via piggybacking on ping messages and acknowledgments. This results in a robust and fast infection-style (also epidemic or gossip-style) dissemination. The rate of false failure detections in the SWIM system is reduced by modifying the protocol to allow group members to suspect a process before declaring it as failed; this allows the system to discover and rectify false failure detections. Finally, the protocol guarantees a deterministic time bound to detect failures. Experimental results from the SWIM prototype are presented. We discuss the extensibility of the design to a WAN-wide scale.
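The abstract's claim that the expected time to first detection does not vary with group size can be illustrated with a short calculation. The sketch below is not taken from the paper. It assumes a simplified model in which, during each protocol period, every live member probes one target chosen uniformly at random from the other n - 1 members, choices are independent, and no messages are lost. The function names are hypothetical.

```python
import math


def prob_probed_in_period(n: int) -> float:
    """Probability that one given member is chosen as a probe target by at
    least one of the other n - 1 members in a single protocol period.

    Assumption (not from the source): each member picks exactly one target,
    uniformly at random among the other n - 1 members, independently.
    """
    if n < 2:
        raise ValueError("need at least two members")
    # Each of the n - 1 other members misses the given member with
    # probability 1 - 1/(n - 1); all of them miss with that value ** (n - 1).
    return 1.0 - (1.0 - 1.0 / (n - 1)) ** (n - 1)


def expected_periods_to_detection(n: int) -> float:
    """Expected number of protocol periods until a crashed member is first
    probed, treating periods as independent trials (geometric distribution)."""
    return 1.0 / prob_probed_in_period(n)


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, round(expected_periods_to_detection(n), 3))
    # Limiting value as n grows: 1 / (1 - e^-1).
    print("limit", round(1.0 / (1.0 - math.exp(-1.0)), 3))
```

Under these assumptions the expected number of periods approaches 1 / (1 - 1/e) as the group grows, so it stays bounded by a constant instead of growing with n. Because each member sends one probe per period in this model, the per-member message load is likewise independent of group size.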
Description Author affiliation: Dept. of Comput. Sci., Cornell Univ., Ithaca, NY, USA (Das, A.; Gupta, I.; Motivala, A.)
ISBN 0769511015
Educational Role Student ♦ Teacher
Age Range above 22 years
Educational Use Research ♦ Reading
Education Level UG and PG
Learning Resource Type Article
Publisher Date 2002-06-23
Publisher Place USA
Rights Holder Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Size 366.89 kB
Page Count 10
Starting Page 303
Ending Page 312

