Fault Tolerance in Distributed Systems: Can We Scale to Cloud Computing?

April 8th, 2010

Tom Bressoud, Denison University

On Monday, April 12, 2010, Computer Science will host Professor Tom Bressoud from Denison University. The talk will be held in King 221 at 4:30. Refreshments will be served at 4:00 in King 223.

Fault Tolerance in Distributed Systems: Can We Scale to Cloud Computing?

Distributed systems is a subfield of computer science in which an application or service is modeled as a collection of independently executing processes that cooperate toward a common goal and communicate with each other across some medium (e.g., a network or shared memory). Fault tolerance is an area of study that recognizes that computer hardware and software fail and that, for many application domains, it is simply unacceptable for the failure of one component to cause the failure of the entire system. The goal of a fault-tolerant system is therefore to continue to provide correct operation despite component failures.

At the intersection of fault tolerance and distributed systems, the problem becomes even more difficult. Distributing processes increases uncertainty, down to basic questions such as "knowing" that a component has failed. And as we scale our distributed systems we, by definition, increase the number of independent components, so the arrival rate of failures can grow linearly with the size of the system.

This talk will explore these issues and examine how fault tolerance scales in cluster systems. It will compare the traditional fault-tolerance technique of checkpointing with newly popular models for cluster-parallel applications: MapReduce, used by Google, and Dryad, by Microsoft, each vying for dominance in the currently "hot" area of Cloud Computing.