Failures and Fault Tolerance in Distributed Systems


A dependable system should have the following characteristics:

  • Availability
  • Reliability
  • Safety
  • Maintainability

Availability A ( t ) of a component in the time interval [ 0, t ) is defined as the average fraction of time that the component has been functioning correctly during that interval.

Reliability R ( t ) of a component in the time interval [ 0, t ) is formally defined as the conditional probability that it has been functioning correctly during that interval given that it was functioning correctly at time T = 0.

Metrics for Fault Tolerance

  1. Mean Time To Failure (MTTF) : average time until a component fails
  2. Mean Time To Repair (MTTR) : average time needed to repair a component.
  3. Mean Time Between Failures (MTBF) : MTTF + MTTR (a small availability calculation from these metrics is sketched below)
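As a quick worked example (not part of the original notes, with made-up numbers): steady-state availability can be computed from these metrics as MTTF / MTBF.

```python
# Sketch: estimating steady-state availability from MTTF and MTTR.
# The numbers are invented example values.

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run availability = MTTF / (MTTF + MTTR) = MTTF / MTBF."""
    mtbf = mttf_hours + mttr_hours
    return mttf_hours / mtbf

# A component that fails on average every 1000 hours and takes 2 hours to repair:
print(f"{availability(1000, 2):.4%}")   # -> 99.8004%
```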

Faults can be classified as

  1. Transient faults occur once and then disappear.
  2. Intermittent faults keep reappearing and vanishing of their own accord; they are more difficult to diagnose.
  3. Permanent faults continue to exist until the faulty component is fixed or replaced, e.g., a burnt-out IC or a disk head crash.

Failure Models

  • crash failure
  • omission failure
    • Receive omission – causes : infinite loops or improper memory management
    • Send omission – causes : buffer overflow before sending
  • Timing failure – e.g., no buffer space left if a streaming video's data is delivered too soon
  • Response failure
    • Value Failure – search engine returns web pages unrelated to search terms
    • State transition failure – server reacts unexpectedly to an incoming request
  • Arbitrary failure
    • Byzantine failures

Classification for halting failures

  1. Fail-stop failures
  2. Fail-noisy failures
  3. Fail-silent failures
  4. Fail-safe failures
  5. Fail-arbitrary failures

Failure masking by redundancy

  • Information redundancy – extra bits are added to allow recovery from garbled bits ( eg : Hamming code )
  • Time redundancy – retransmitting
  • Physical redundancy – extra hardware or processes are added

Triple modular redundancy (TMR) – each device is triplicated and followed by three voters at each stage; a voter forwards the value produced by at least two of its three inputs (see the sketch below).
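A minimal sketch of the voting step used in TMR; the replica outputs in the example are invented.

```python
# Sketch: a single TMR voter. It forwards whichever value is produced by
# at least two of its three inputs, masking one faulty replica.

from collections import Counter

def tmr_vote(a, b, c):
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # All three inputs disagree: more than one replica is faulty,
        # and a single voter can no longer mask the failure.
        raise RuntimeError("no majority among replica outputs")
    return value

# The faulty middle replica is outvoted by the two correct ones.
assert tmr_vote(42, 13, 42) == 42
```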

Primary-based replication

  • When the primary crashes, the backups execute an election algorithm to choose a new primary.

Replicated-write protocols

  • Active replication or quorum-based protocols.
  • Organize a collection of identical processes into a flat group.
  • (+) No single point of failure.
  • (-) Tougher distributed coordination.

Process Resilience

Flat group

  • (+) Symmetrical, with no single point of failure.
  • (+) The failure of one process does not affect the rest of the group.
  • (-) Harder decision making – voting incurs delays and overhead.

Hierarchical group

  • (+) Quick decisions as long as the coordinator is running.
  • (-) Loss of the coordinator halts everything.
  • (-) Requires a leader-election algorithm (a trivial selection rule is sketched below).
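Both organizations ultimately need a way to pick a new primary or coordinator. As a rough, hypothetical illustration (not the particular election algorithm these notes refer to), one simple rule is to let the highest-numbered process that still appears alive take over:

```python
# Sketch: a naive coordinator-selection rule. Each process is assumed to
# have a unique numeric id, and is_alive is a stand-in for failure detection.
# Real election algorithms add the message exchanges needed so that all
# surviving processes agree on the outcome despite further failures.

def elect_coordinator(process_ids, is_alive):
    alive = [pid for pid in process_ids if is_alive(pid)]
    if not alive:
        raise RuntimeError("no live process left to elect")
    return max(alive)

# Example: process 7 (the old coordinator) has crashed, so 5 takes over.
print(elect_coordinator([1, 3, 5, 7], is_alive=lambda pid: pid != 7))   # -> 5
```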

Membership management

Group server

  • The group server maintains a complete database of all the groups and their exact membership.
  • (-) A centralized technique, and therefore a single point of failure.

Distributed group membership

  • (-) Needs reliable multicasting.
  • (-) A protocol is needed to rebuild the group after a failure.

CAP Theorem

Since messages may be lost and a group may be partitioned by a faulty network, Eric Brewer proposed the CAP theorem in 2000.

Any networked system providing shared data can provide only two of the following three properties:
C: consistency, by which a shared and replicated data item appears as a single, up-to-date copy
A: availability, by which updates will always be eventually executed
P: partition tolerance, by which the system tolerates the partitioning of a process group (e.g., because of a failing network).

There is almost always a trade-off between safety and liveness; obtaining both in an inherently unreliable system cannot be achieved.

Failures in Client Server Communication

Even with a reliable transport protocol such as TCP, which masks omission failures, crash failures of connections are not masked. Different classes of failures can occur in RPC systems:

1. Client is unable to locate the server.

  • (+) possible solution – have the error raise an exception at the client.
  • (-) raising exceptions reduces the transparency of the distributed system.

2. Request message from the client to the server is lost

  • (+) possible solution – start a timer when sending the request and retransmit if no reply arrives in time.
  • (-) if many requests (and retransmissions) are lost, the client may incorrectly conclude that the server is down.

3. Server crashes after receiving a request.

4. Reply message from the server to the client is lost. Possible solutions:

– structure all requests in an idempotent way, so that repeating them is harmless

– distinguish an initial request from a retransmission by having the client assign each request a sequence number (both ideas are combined in the sketch below)
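A minimal sketch combining the two ideas above: the client retransmits on a timeout, and the server uses the client-assigned sequence number to return a cached reply for a retransmission instead of executing the request a second time. The class and method names are invented for illustration.

```python
# Sketch: duplicate detection with client-assigned sequence numbers.

import itertools

class Server:
    def __init__(self):
        self.seen = {}                      # (client_id, seq) -> cached reply

    def handle(self, client_id, seq, request):
        key = (client_id, seq)
        if key in self.seen:                # retransmission: don't re-execute,
            return self.seen[key]           # just return the earlier reply
        reply = f"processed: {request}"     # the (possibly non-idempotent) work
        self.seen[key] = reply
        return reply

class Client:
    def __init__(self, client_id, server, max_retries=3):
        self.client_id = client_id
        self.server = server
        self.max_retries = max_retries
        self.seq = itertools.count()

    def call(self, request):
        seq = next(self.seq)                # same number is reused on retransmission
        for _ in range(self.max_retries):
            try:
                return self.server.handle(self.client_id, seq, request)
            except TimeoutError:
                continue                    # reply lost or late: retransmit
        raise RuntimeError("no reply; the server may be down")

server = Server()
client = Client("c1", server)
print(client.call("debit account 42"))
```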

5. Client crashes after sending a request

orphan (computation) – a computation left running at the server because the client crashed before receiving the reply

orphans waste processing power and may lock files or otherwise tie up valuable resources

possible solutions

  • orphan extermination – the client logs each RPC before sending it; after a reboot, the log is checked and any orphan is explicitly killed off
    • (-) it may leave grandorphans (descendants of the orphan) that are difficult or impossible to locate
  • reincarnation – divide time into sequentially numbered epochs; after a reboot the client broadcasts the start of a new epoch, and remote computations on behalf of older epochs are killed
  • gentle reincarnation – on an epoch broadcast, each machine checks whether it hosts remote computations and kills one only if its owner cannot be located
  • expiration – each RPC is given a standard quantum of time to finish; if it cannot, it must explicitly ask for another quantum, so after a client crash its orphans eventually die on their own

Failures in Group Communication

Atomic Multicast Problem

Distributed commit problem: have an operation performed by every member of a process group, or by none at all. Possible solutions include commit protocols (a minimal 2PC sketch follows this list):

  • 1-phase commit protocol
  • 2-phase commit protocol (2PC)
  • 3-phase commit protocol (3PC)
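As a sketch of the coordinator-side decision rule in 2PC (the vote-request messages, timeouts, and write-ahead logging to stable storage that a real protocol needs are deliberately left out, and the participant objects are stand-ins):

```python
# Sketch: the decision logic of two-phase commit (2PC).
# Phase 1: collect votes; Phase 2: commit only if every participant voted
# COMMIT, otherwise abort.

def two_phase_commit(participants):
    # Phase 1: voting
    votes = [p.vote_request() for p in participants]          # "COMMIT" or "ABORT"
    decision = "GLOBAL_COMMIT" if all(v == "COMMIT" for v in votes) else "GLOBAL_ABORT"
    # Phase 2: inform everyone of the decision
    for p in participants:
        if decision == "GLOBAL_COMMIT":
            p.global_commit()
        else:
            p.global_abort()
    return decision

class Participant:
    def __init__(self, will_commit=True):
        self.will_commit = will_commit
        self.state = "INIT"
    def vote_request(self):
        self.state = "READY"
        return "COMMIT" if self.will_commit else "ABORT"
    def global_commit(self):
        self.state = "COMMIT"
    def global_abort(self):
        self.state = "ABORT"

print(two_phase_commit([Participant(), Participant(), Participant(will_commit=False)]))
# -> GLOBAL_ABORT: one participant voted to abort, so nobody commits
```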

Faulty systems with arbitrary failures

A process forwards a different value or operation than it is supposed to. This may be caused by malicious actions, omission failures, or commission failures. Possible solutions:

Recovery – backward recovery, forward error recovery, erasure correction.

Checkpointing

  • distributed snapshot
  • coordinated checkpointing – processes synchronize to jointly write their state to local storage.
  • incremental snapshot
  • Independent checkpointing – each process records its local state on its own, without coordinating with the others (a minimal local-checkpoint sketch follows this list)
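A rough illustration of what a single local checkpoint amounts to; the file name and state layout are invented. Coordinated checkpointing adds a synchronization step so that the set of local checkpoints together forms a consistent distributed snapshot.

```python
# Sketch: writing and restoring a local checkpoint.

import json, os

CHECKPOINT_FILE = "checkpoint.json"   # invented name for this example

def save_checkpoint(state: dict) -> None:
    # Write atomically: write to a temporary file, then rename over the old one.
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)

def restore_checkpoint() -> dict:
    with open(CHECKPOINT_FILE) as f:
        return json.load(f)

state = {"next_seq": 17, "balance": 250}
save_checkpoint(state)
# ... the process crashes and restarts ...
print(restore_checkpoint())   # -> {'next_seq': 17, 'balance': 250}
```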

Message logging

  • pessimistic logging protocols (sketched below)
  • optimistic logging protocols
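The idea behind pessimistic logging as a tiny sketch (the log format is invented): every incoming message is forced to stable storage before it is delivered, so that a recovering process can be replayed deterministically from its last checkpoint.

```python
# Sketch: pessimistic message logging. Each message is appended to a log on
# stable storage and flushed *before* it is delivered to the application.

import json, os

def log_then_deliver(logfile, message, deliver):
    with open(logfile, "a") as f:
        f.write(json.dumps(message) + "\n")
        f.flush()
        os.fsync(f.fileno())      # force the record to stable storage first
    deliver(message)              # only now is the message actually processed

def replay(logfile, deliver):
    # After a crash, re-deliver the logged messages in their original order.
    with open(logfile) as f:
        for line in f:
            deliver(json.loads(line))
```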

Recovery-oriented computing

Byzantine agreement requires

BA1: Every nonfaulty backup process stores the same value.
BA2: If the primary is nonfaulty, then every nonfaulty backup process stores exactly what the primary sent.

k-fault tolerant group

A group is k-fault tolerant if it can survive faults in k of its components and still meet its specifications.

  • if processes fail silently, then having k + 1 of them is enough to provide k-fault tolerance.
  • if processes exhibit arbitrary failures, continuing to run when faulty and sending out erroneous or random replies, a minimum of 2k + 1 processes is needed to achieve k-fault tolerance.

With k arbitrarily failing processes, 3k processes are not enough to reach consensus.

(Figure: a single arbitrarily failing process in a three-process system, shown in two cases.)

With k arbitrarily failing processes, 3k + 1 processes are enough to reach consensus.
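The group-size rules above can be summarized in a few lines of arithmetic (a reminder of the bounds, not an implementation of any protocol):

```python
# Sketch: minimum group size for tolerating k faulty processes under the
# failure models discussed above.

def min_group_size(k: int, failure_model: str) -> int:
    if failure_model == "fail-silent":
        return k + 1          # one surviving correct process suffices
    if failure_model == "arbitrary":
        return 2 * k + 1      # correct replies must outvote the k faulty ones
    if failure_model == "byzantine-consensus":
        return 3 * k + 1      # bound for reaching consensus (Byzantine agreement)
    raise ValueError(f"unknown failure model: {failure_model}")

for model in ("fail-silent", "arbitrary", "byzantine-consensus"):
    print(model, min_group_size(1, model))    # 2, 3, 4 respectively
```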

References

  • CAP theorem – Gilbert and Lynch [2002]
  • Reliable multicasting issues – Popescu et al. [2007]
  • Reliable multicasting in the context of publish/subscribe systems – Esposito et al. [2013]
  • Backward error recovery problems – Singhal and Shivaratri [1994]
  • Paxos – Lamport [1998]; see also Lampson [1996], Prisco et al. [1997], Lamport [2001], van Renesse and Altinbuken [2015]
  • Byzantine failures
    • Lamport et al. [1982]
    • Cachin et al. [2011]
    • PBFT – Castro and Liskov [2002]

Failure masking by redundancy

  • Johnson [1995]
  • flooding consensus – Cachin et al. [2011]
