Failures and Fault Tolerance in Distributed Systems


A dependable system should have the following characteristics:

  • Availability
  • Reliability
  • Safety
  • Maintainability

Availability A ( t ) of a component in the time interval [ 0, t ) is defined as the average fraction of time that the component has been functioning correctly during that interval.

Reliability R ( t ) of a component in the time interval [ 0, t ) is formally defined as the conditional probability that it has been functioning correctly during that interval given that it was functioning correctly at time T = 0.

Metrics for Fault Tolerance

  1. Mean Time To Failure (MTTF) : average time until a component fails
  2. Mean Time To Repair (MTTR) : average time needed to repair a component.
  3. Mean Time Between Failures (MTBF) : MTTF + MTTR (a small availability calculation from these metrics is sketched below)
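As a quick worked example (not part of the original notes, with made-up numbers): steady-state availability can be computed from these metrics as MTTF / MTBF.

```python
# Sketch: estimating steady-state availability from MTTF and MTTR.
# The numbers are invented example values.

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run availability = MTTF / (MTTF + MTTR) = MTTF / MTBF."""
    mtbf = mttf_hours + mttr_hours
    return mttf_hours / mtbf

# A component that fails on average every 1000 hours and takes 2 hours to repair:
print(f"{availability(1000, 2):.4%}")   # -> 99.8004%
```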

Faults can be classified as

  1. Transient faults occur once and then disappear.
  2. Intermittent faults keep reappearing and vanishing of their own accord; they are more difficult to diagnose.
  3. Permanent faults continue to exist until the faulty component is fixed or replaced, e.g., a burnt-out IC or a disk head crash.

Failure Models

  • crash failure
  • omission failure
    • Receive omission – causes : infinite loops or improper memory management
    • Send omission – causes : buffer overflow before sending
  • Timing failure – e.g., no buffer space left if a streaming video's data is delivered too soon
  • Response failure
    • Value Failure – search engine returns web pages unrelated to search terms
    • State transition failure – server reacts unexpectedly to an incoming request
  • Arbitrary failure
    • Byzantine failures

Classification for halting failures

  1. Fail-stop failures
  2. Fail-noisy failures
  3. Fail-silent failures
  4. Fail-safe failures
  5. Fail-arbitrary failures

Failure masking by redundancy

  • Information redundancy – extra bits are added to allow recovery from garbled bits ( eg : Hamming code )
  • Time redundancy – retransmitting
  • Physical redundancy – extra hardware or processes are added

Triple modular redundancy (TMR) – each device is triplicated and followed by three voters at each stage; a voter forwards the value produced by at least two of its three inputs (see the sketch below).
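A minimal sketch of the voting step used in TMR; the replica outputs in the example are invented.

```python
# Sketch: a single TMR voter. It forwards whichever value is produced by
# at least two of its three inputs, masking one faulty replica.

from collections import Counter

def tmr_vote(a, b, c):
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # All three inputs disagree: more than one replica is faulty,
        # and a single voter can no longer mask the failure.
        raise RuntimeError("no majority among replica outputs")
    return value

# The faulty middle replica is outvoted by the two correct ones.
assert tmr_vote(42, 13, 42) == 42
```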

Primary-based replication

  • When the primary crashes, the backups execute an election algorithm to choose a new primary.

Replicated-write protocols

  • Active replication or quorum-based protocols.
  • Organize a collection of identical processes into a flat group.
  • (+) No single point of failure.
  • (-) Tougher distributed coordination.

Process Resilience

Flat group

  • (+) Symmetrical, with no single point of failure.
  • (+) The failure of one process does not affect the rest of the group.
  • (-) Harder decision making – voting incurs delays and overhead.

Hierarchical group

  • (+) Quick decisions as long as the coordinator is running.
  • (-) Loss of the coordinator halts everything.
  • (-) Requires a leader-election algorithm (a trivial selection rule is sketched below).
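Both organizations ultimately need a way to pick a new primary or coordinator. As a rough, hypothetical illustration (not the particular election algorithm these notes refer to), one simple rule is to let the highest-numbered process that still appears alive take over:

```python
# Sketch: a naive coordinator-selection rule. Each process is assumed to
# have a unique numeric id, and is_alive is a stand-in for failure detection.
# Real election algorithms add the message exchanges needed so that all
# surviving processes agree on the outcome despite further failures.

def elect_coordinator(process_ids, is_alive):
    alive = [pid for pid in process_ids if is_alive(pid)]
    if not alive:
        raise RuntimeError("no live process left to elect")
    return max(alive)

# Example: process 7 (the old coordinator) has crashed, so 5 takes over.
print(elect_coordinator([1, 3, 5, 7], is_alive=lambda pid: pid != 7))   # -> 5
```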

Membership management

Group server

  • The group server maintains a complete database of all the groups and their exact membership.
  • (-) A centralized technique, and therefore a single point of failure.

Distributed group membership

  • (-) Needs reliable multicasting.
  • (-) A protocol is needed to rebuild the group after a failure.

CAP Theorem

Since messages may be lost and a group may be partitioned by a faulty network, Eric Brewer proposed the CAP theorem in 2000.

Any networked system providing shared data can provide only two of the following three properties:
C: consistency, by which a shared and replicated data item appears as a single, up-to-date copy
A: availability, by which updates will always be eventually executed
P: partition tolerance, by which the system tolerates the partitioning of a process group (e.g., because of a failing network).

There is almost always a trade-off between safety and liveness; obtaining both in an inherently unreliable system cannot be achieved.

Failures in Client Server Communication

Even with a reliable transport protocol such as TCP, which masks omission failures, crash failures of connections are not masked. Different classes of failures can occur in RPC systems:

1. Client is unable to locate the server.

  • (+) possible solution – have the error raise an exception at the client.
  • (-) raising exceptions reduces the transparency of the distributed system.

2. Request message from the client to the server is lost

  • (+) possible solution – start a timer when sending the request and retransmit if no reply arrives in time.
  • (-) if many requests (and retransmissions) are lost, the client may incorrectly conclude that the server is down.

3. Server crashes after receiving a request.

4. Reply message from the server to the client is lost. Possible solutions:

– structure all requests in an idempotent way, so that repeating them is harmless

– distinguish an initial request from a retransmission by having the client assign each request a sequence number (both ideas are combined in the sketch below)
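A minimal sketch combining the two ideas above: the client retransmits on a timeout, and the server uses the client-assigned sequence number to return a cached reply for a retransmission instead of executing the request a second time. The class and method names are invented for illustration.

```python
# Sketch: duplicate detection with client-assigned sequence numbers.

import itertools

class Server:
    def __init__(self):
        self.seen = {}                      # (client_id, seq) -> cached reply

    def handle(self, client_id, seq, request):
        key = (client_id, seq)
        if key in self.seen:                # retransmission: don't re-execute,
            return self.seen[key]           # just return the earlier reply
        reply = f"processed: {request}"     # the (possibly non-idempotent) work
        self.seen[key] = reply
        return reply

class Client:
    def __init__(self, client_id, server, max_retries=3):
        self.client_id = client_id
        self.server = server
        self.max_retries = max_retries
        self.seq = itertools.count()

    def call(self, request):
        seq = next(self.seq)                # same number is reused on retransmission
        for _ in range(self.max_retries):
            try:
                return self.server.handle(self.client_id, seq, request)
            except TimeoutError:
                continue                    # reply lost or late: retransmit
        raise RuntimeError("no reply; the server may be down")

server = Server()
client = Client("c1", server)
print(client.call("debit account 42"))
```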

5. Client crashes after sending a request

orphan (computation) – a computation left running at the server because the client crashed before receiving the reply

orphans waste processing power and may lock files or otherwise tie up valuable resources

possible solutions

  • orphan extermination – the client logs each RPC before sending it; after a reboot, the log is checked and any orphan is explicitly killed off
    • (-) it may leave grandorphans (descendants of the orphan) that are difficult or impossible to locate
  • reincarnation – divide time into sequentially numbered epochs; after a reboot the client broadcasts the start of a new epoch, and remote computations on behalf of older epochs are killed
  • gentle reincarnation – on an epoch broadcast, each machine checks whether it hosts remote computations and kills one only if its owner cannot be located
  • expiration – each RPC is given a standard quantum of time to finish; if it cannot, it must explicitly ask for another quantum, so after a client crash its orphans eventually die on their own

Failures in Group Communication

Atomic Multicast Problem

Distributed commit problem: have an operation performed by every member of a process group, or by none at all. Possible solutions include commit protocols (a minimal 2PC sketch follows this list):

  • 1-phase commit protocol
  • 2-phase commit protocol (2PC)
  • 3-phase commit protocol (3PC)
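As a sketch of the coordinator-side decision rule in 2PC (the vote-request messages, timeouts, and write-ahead logging to stable storage that a real protocol needs are deliberately left out, and the participant objects are stand-ins):

```python
# Sketch: the decision logic of two-phase commit (2PC).
# Phase 1: collect votes; Phase 2: commit only if every participant voted
# COMMIT, otherwise abort.

def two_phase_commit(participants):
    # Phase 1: voting
    votes = [p.vote_request() for p in participants]          # "COMMIT" or "ABORT"
    decision = "GLOBAL_COMMIT" if all(v == "COMMIT" for v in votes) else "GLOBAL_ABORT"
    # Phase 2: inform everyone of the decision
    for p in participants:
        if decision == "GLOBAL_COMMIT":
            p.global_commit()
        else:
            p.global_abort()
    return decision

class Participant:
    def __init__(self, will_commit=True):
        self.will_commit = will_commit
        self.state = "INIT"
    def vote_request(self):
        self.state = "READY"
        return "COMMIT" if self.will_commit else "ABORT"
    def global_commit(self):
        self.state = "COMMIT"
    def global_abort(self):
        self.state = "ABORT"

print(two_phase_commit([Participant(), Participant(), Participant(will_commit=False)]))
# -> GLOBAL_ABORT: one participant voted to abort, so nobody commits
```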

Faulty systems with arbitrary failures

A process forwards a different value or operation than it is supposed to. This may be caused by malicious actions, omission failures, or commission failures. Possible solutions:

Recovery – backward recovery, forward error recovery, erasure correction.

Checkpointing

  • distributed snapshot
  • coordinated checkpointing – processes synchronize to jointly write their state to local storage.
  • incremental snapshot
  • Independent checkpointing – each process records its local state on its own, without coordinating with the others (a minimal local-checkpoint sketch follows this list)
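A rough illustration of what a single local checkpoint amounts to; the file name and state layout are invented. Coordinated checkpointing adds a synchronization step so that the set of local checkpoints together forms a consistent distributed snapshot.

```python
# Sketch: writing and restoring a local checkpoint.

import json, os

CHECKPOINT_FILE = "checkpoint.json"   # invented name for this example

def save_checkpoint(state: dict) -> None:
    # Write atomically: write to a temporary file, then rename over the old one.
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)

def restore_checkpoint() -> dict:
    with open(CHECKPOINT_FILE) as f:
        return json.load(f)

state = {"next_seq": 17, "balance": 250}
save_checkpoint(state)
# ... the process crashes and restarts ...
print(restore_checkpoint())   # -> {'next_seq': 17, 'balance': 250}
```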

Message logging

  • pessimistic logging protocols (sketched below)
  • optimistic logging protocols
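The idea behind pessimistic logging as a tiny sketch (the log format is invented): every incoming message is forced to stable storage before it is delivered, so that a recovering process can be replayed deterministically from its last checkpoint.

```python
# Sketch: pessimistic message logging. Each message is appended to a log on
# stable storage and flushed *before* it is delivered to the application.

import json, os

def log_then_deliver(logfile, message, deliver):
    with open(logfile, "a") as f:
        f.write(json.dumps(message) + "\n")
        f.flush()
        os.fsync(f.fileno())      # force the record to stable storage first
    deliver(message)              # only now is the message actually processed

def replay(logfile, deliver):
    # After a crash, re-deliver the logged messages in their original order.
    with open(logfile) as f:
        for line in f:
            deliver(json.loads(line))
```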

Recovery-oriented computing

Byzantine agreement requires

BA1: Every nonfaulty backup process stores the same value.
BA2: If the primary is nonfaulty, then every nonfaulty backup process stores exactly what the primary sent.

k-fault tolerant group

A group is k-fault tolerant if it can survive faults in k of its components and still meet its specifications.

  • if processes fail silently, then having k + 1 of them is enough to provide k-fault tolerance.
  • if processes exhibit arbitrary failures, continuing to run when faulty and sending out erroneous or random replies, a minimum of 2k + 1 processes is needed to achieve k-fault tolerance.

With k arbitrarily failing processes, 3k processes are not enough to reach consensus.

(Figure: a single arbitrarily failing process in a three-process system, shown in two cases.)

With k arbitrarily failing processes, 3k + 1 processes are enough to reach consensus.
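The group-size rules above can be summarized in a few lines of arithmetic (a reminder of the bounds, not an implementation of any protocol):

```python
# Sketch: minimum group size for tolerating k faulty processes under the
# failure models discussed above.

def min_group_size(k: int, failure_model: str) -> int:
    if failure_model == "fail-silent":
        return k + 1          # one surviving correct process suffices
    if failure_model == "arbitrary":
        return 2 * k + 1      # correct replies must outvote the k faulty ones
    if failure_model == "byzantine-consensus":
        return 3 * k + 1      # bound for reaching consensus (Byzantine agreement)
    raise ValueError(f"unknown failure model: {failure_model}")

for model in ("fail-silent", "arbitrary", "byzantine-consensus"):
    print(model, min_group_size(1, model))    # 2, 3, 4 respectively
```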

References

  • CAP theorem – Gilbert and Lynch [2002]
  • Reliable multicasting issues – Popescu et al. [2007]
  • Reliable multicasting in the context of publish/subscribe systems – Esposito et al. [2013]
  • Backward error recovery problems – Singhal and Shivaratri [1994]
  • Paxos – Lamport [1998]; see also Lampson [1996], Prisco et al. [1997], Lamport [2001], van Renesse and Altinbuken [2015]
  • Byzantine failures
    • Lamport et al. [1982]
    • Cachin et al. [2011]
    • PBFT – Castro and Liskov [2002]

Failure masking by redundancy

  • Johnson [1995]
  • flooding consensus – Cachin et al. [2011]
