A Scalable Failure Recovery Model for Tree-based Overlay Networks
Date
2008
Author
Arnold, Dorian
Miller, Barton P.
Publisher
University of Wisconsin-Madison Department of Computer Sciences
Abstract
We present a scalable failure recovery model for data aggregations in
large-scale tree-based overlay networks (TBONs). A TBON is a network of
hierarchically organized processes that exploits the logarithmic scaling
properties of trees to provide scalable data multicast, gather, and in-network
aggregation. TBONs are commonly used in debugging and performance tools,
system monitoring, information management systems, stream processing, and mobile
ad hoc networks. Our recovery model leverages inherent information redundancies
in TBON computations. This redundant information is gathered from non-failed
processes to compensate for computation and communication state lost due to
failures. This state compensation strategy is attractive because: (1) it avoids
the time and resource overheads of previous reliability approaches, which rely
on explicit replication; (2) recovery is rapid and only involves a small subset
of the network; and (3) it applies to many useful, complex computations. In this
paper, we formalize the TBON model and its fundamental properties to prove
that our state compensation model properly preserves computational semantics
across TBON process failures. These properties lead to an efficient
implementation of state compensation, which we use to empirically validate and
evaluate recovery performance. We show that state compensation can recover from
failures in extremely large TBONs in milliseconds, causing practically no
interruption of application service.
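
As a rough illustration of the in-network aggregation the abstract describes, the sketch below builds a toy tree, reduces leaf values toward the root, and then removes an internal process and lets its parent adopt the orphaned children. This is not the authors' implementation, nor the state compensation algorithm itself: all names (aggregate, reparent_orphans, the tree layout) are hypothetical, and the example covers only the simplest case, a stateless reduction that is associative and commutative, where the root's result does not depend on the shape of the tree.

```python
from functools import reduce


def aggregate(tree, node, leaf_values, op):
    """Reduce the leaf values in the subtree rooted at `node` using `op`."""
    children = tree.get(node, [])
    if not children:
        # A leaf (back-end process) contributes its own value.
        return leaf_values[node]
    return reduce(op, (aggregate(tree, c, leaf_values, op) for c in children))


def reparent_orphans(tree, failed, parent):
    """Return a new tree without `failed`, whose children are adopted by `parent`.

    Hypothetical helper: assumes `parent` is the failed node's parent.
    """
    new_tree = {n: list(cs) for n, cs in tree.items() if n != failed}
    survivors = [c for c in new_tree[parent] if c != failed]
    new_tree[parent] = survivors + list(tree.get(failed, []))
    return new_tree


def add(x, y):
    return x + y


# Toy TBON: a root, two internal processes, four leaves.
tree = {"root": ["a", "b"], "a": ["l1", "l2"], "b": ["l3", "l4"]}
leaf_values = {"l1": 3, "l2": 5, "l3": 7, "l4": 11}

before = aggregate(tree, "root", leaf_values, add)

# Internal process "a" fails; the root adopts its orphaned children.
after_tree = reparent_orphans(tree, failed="a", parent="root")
after = aggregate(after_tree, "root", leaf_values, add)
```

Here `before` and `after` are equal because addition is associative and commutative. Aggregations that keep state in flight at the failed process need more than re-parenting, which is the case the report's state compensation model addresses.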
Permanent Link
http://digital.library.wisc.edu/1793/60616
Citation
TR1626