Sunday, Dec 25, 2005, 11:15-12:15
Room 309
--------------------------------------------------------------------------------
Adnan Agbaria
Title:
Compiler-Driven Distributed Checkpointing
Abstract:
Distributed checkpointing is an
important concept in providing fault
tolerance in computer systems. Fault
tolerance is important for distributed
systems, where the failure rate
is high. In today's applications, e.g.,
grid and massively parallel applications,
the imposed overhead of taking a
distributed checkpoint using the
known approaches can often outweigh its
benefits, due to coordination and
other overhead incurred by the processes. In this
talk, I present an innovative
approach for distributed checkpointing. In
this approach, the checkpoints are
specified in the application code
during compilation, using
application-level analysis. During
execution, no coordination is
required, and every process takes a local
checkpoint as specified in the code,
independent of the other processes. In
addition, I present a performance
analysis using stochastic models to
compare the checkpoint overhead
of this approach with that of existing
checkpointing protocols.
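
Illustration (not from the talk): the execution-time behavior described
above, where each process checkpoints locally at a point fixed in its code
and exchanges no coordination messages, might look roughly like the sketch
below. Every name here (checkpoint, worker, restore, checkpoint_every) is
hypothetical, and the hand-placed checkpoint call only stands in for a
location that the compiler analysis would choose.

```python
import os
import pickle
import tempfile


def checkpoint(rank, step, state, directory):
    """Save one process's state to its own file; no messages are exchanged."""
    path = os.path.join(directory, f"rank{rank}_step{step}.ckpt")
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # rename so a crash never leaves a partial file
    return path


def worker(rank, n_steps, directory, checkpoint_every=5):
    """Hypothetical per-process loop with a checkpoint site fixed in the code."""
    state = {"acc": 0}
    written = []
    for step in range(1, n_steps + 1):
        state["acc"] += rank + step  # placeholder for real computation
        # Assumption: this site stands in for a compiler-selected location.
        # All processes run the same code, so they checkpoint at the same
        # program point without coordinating at run time.
        if step % checkpoint_every == 0:
            written.append(checkpoint(rank, step, state, directory))
    return state, written


def restore(path):
    """Load a previously saved local checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        _, files = worker(rank=0, n_steps=10, directory=d)
        print(restore(files[-1]))
```

The sketch does not show how the compile-time analysis picks checkpoint
locations so that the independent local checkpoints form a consistent
global state; that is the subject of the talk.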