- Failure Models
- CAP Theorem
- Failure in Client-Server Communication
- Failures in Group Communication
- Faulty Systems with Arbitrary Failures
- K-Fault-Tolerant Groups
A dependable system should have the following characteristics:
- Availability
- Reliability
- Safety
- Maintainability
Availability A ( t ) of a component in the time interval [ 0, t ) is defined as the average fraction of time that the component has been functioning correctly during that interval.
Reliability R ( t ) of a component in the time interval [ 0, t ) is formally defined as the conditional probability that it has been functioning correctly during that interval, given that it was functioning correctly at time t = 0.
Metrics for Fault Tolerance
- Mean Time To Failure (MTTF) : average time until a component fails
- Mean Time To Repair (MTTR) : average time needed to repair a component.
- Mean Time Between Failures (MTBF) : MTTF + MTTR
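The relationship among these metrics can be sketched in a few lines (a minimal illustration; the function name and example figures are my own, not from the text):

```python
# Steady-state availability derived from MTTF and MTTR:
#   A = MTTF / (MTTF + MTTR) = MTTF / MTBF
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the component is functioning correctly."""
    mtbf = mttf_hours + mttr_hours  # Mean Time Between Failures
    return mttf_hours / mtbf

# A component with MTTF = 999 h and MTTR = 1 h is available 99.9% of the time.
print(round(availability(999.0, 1.0), 3))  # 0.999
```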
Faults can be classified as
- Transient faults occur once and then disappear
- Intermittent faults keep reappearing and vanishing of their own accord; they are more difficult to diagnose.
- Permanent faults continue to exist until fixed, e.g., a burned-out IC or a disk head crash.
- crash failure
- omission failure
- Receive omission – causes : infinite loops or improper memory management
- Send omission – causes : buffer overflow before sending
- Timing failure – e.g., no buffer space left if a streaming video’s data is delivered too soon
- Response failure
- Value Failure – search engine returns web pages unrelated to search terms
- State transition failure – server reacts unexpectedly to an incoming request
- Arbitrary failure
- Byzantine failures
Classification for halting failures
- Fail-stop failures
- Fail-noisy failures
- Fail-silent failures
- Fail-safe failures
- Fail-arbitrary failures
Failure masking by redundancy
- Information redundancy – extra bits are added to allow recovery from garbled bits ( eg : Hamming code )
- Time redundancy – retransmitting
- Physical redundancy – extra hardware or processes are added
Triple modular redundancy (TMR) – each component is replicated three times, and three voters at each stage take the majority of the outputs, masking one faulty component per stage.
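The voter at each stage can be sketched as a simple majority function (a hypothetical helper, not from the text; it assumes at most one of the three modules is faulty):

```python
from collections import Counter

def vote(replies):
    """Majority vote over replicated module outputs; masks a single faulty module."""
    value, count = Counter(replies).most_common(1)[0]
    if count < (len(replies) // 2) + 1:
        raise RuntimeError("no majority among replicas")
    return value

# One of three replicated modules emits a garbled value; the voter masks it.
print(vote([7, 7, 9]))  # 7
```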
| Primary-based replication | Replicated-write protocols |
| --- | --- |
| When the primary crashes, the backups execute an election algorithm to choose a new primary | Active replication or quorum-based protocols: a collection of identical processes organized in a flat group |
| | (+) no single point of failure (-) tougher distributed coordination |
Process Resilience
| Flat group | Hierarchical group |
| --- | --- |
| (+) symmetrical, no single point of failure (+) one process failure does not affect the group | (+) quick decisions as long as the coordinator is running |
| (-) harder decision making: voting incurs delay and overhead | (-) loss of the coordinator halts everything (-) requires a leader-election algorithm |
Membership management

| Group server | Distributed group membership |
| --- | --- |
| The group server maintains a complete database of all the groups and their exact membership | |
| (-) centralized technique: a single point of failure | (-) needs reliable multicasting (-) a protocol is needed to rebuild the group after a failure |
CAP Theorem
Since messages may be lost and a group may be partitioned due to a faulty network, Eric Brewer proposed the CAP theorem in 2000.
Any networked system providing shared data can provide only two of the following three properties:
• C: consistency, by which a shared and replicated data item appears as a single, up-to-date copy
• A: availability, by which updates will always be eventually executed
• P: partition tolerance, by which the system tolerates the partitioning of a process group (e.g., because of a failing network)
There is almost always a trade-off between safety and liveness: obtaining both in an inherently unreliable system cannot be achieved.
Failures in Client Server Communication
Even with a reliable transport protocol such as TCP, which masks omission failures, crash failures of connections are not masked. Different classes of failures can occur in RPC systems:
1. Client is unable to locate the server.
- (+) possible solution – have the client raise an exception.
- (-) raising exceptions eliminates transparency from the distributed system
2. Request message from the client to the server is lost
- (+) possible solution – use a timer and retransmit on expiry
- (-) if many requests are lost, the client may conclude that the server is down
3. Server crashes after receiving a request.
4. Reply message from the server to the client is lost. Possible solutions:
– structure all requests in an idempotent way
– distinguish an initial request from a retransmission by having the client assign each request a sequence number.
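The sequence-number approach can be sketched as a per-client reply cache on the server (names like `Server.handle` are illustrative, assuming at-most-once execution):

```python
class Server:
    """Caches the last reply per client so a retransmitted request (same
    sequence number) is answered from the cache instead of re-executed."""
    def __init__(self):
        self.last = {}  # client_id -> (seq, reply)

    def handle(self, client_id, seq, op, arg):
        cached = self.last.get(client_id)
        if cached and cached[0] == seq:
            return cached[1]          # duplicate: resend old reply, do not re-execute
        reply = op(arg)               # execute the (possibly non-idempotent) operation
        self.last[client_id] = (seq, reply)
        return reply

# A non-idempotent operation: appending to a log and returning its length.
log = []
append = lambda x: (log.append(x), len(log))[1]

s = Server()
print(s.handle("c1", 1, append, "a"))  # executed once
print(s.handle("c1", 1, append, "a"))  # retransmission detected: same reply, no re-execution
```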
5. Client crashes after sending a request
orphan (computation) – a computation left running on the server after its client crashed before receiving the reply
wastes processing power – may lock files or otherwise tie up valuable resources
possible solutions
- orphan extermination – the client logs each request; after a reboot, the log is checked and any orphan is explicitly killed off
- (-) it can create grandorphans or descendants that are difficult or impossible to locate
- reincarnation : divide time up into sequentially numbered epochs; after a reboot the client broadcasts the new epoch number, and remote computations started on its behalf in earlier epochs are killed
- gentle reincarnation : on an epoch broadcast, each machine checks whether it hosts remote computations and kills one only if its owner cannot be located
- expiration : each RPC is given a standard quantum of time to finish; if it cannot, it must explicitly ask for another quantum
Failures in Group Communication
Atomic Multicast Problem
Distributed Commit Problem : having an operation performed by each member of a process group, or by none at all. Possible solutions include commit protocols:
- 1-phase commit protocol
- 2-phase commit protocol (2PC)
- 3-phase commit protocol (3PC)
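The coordinator side of 2PC can be sketched in a few lines (a minimal in-process illustration; `Participant` and the message names are my own, and real 2PC also needs logging and timeouts):

```python
# Phase 1: coordinator sends VOTE_REQUEST and collects votes.
# Phase 2: coordinator broadcasts GLOBAL_COMMIT only if every vote was yes.
def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]
    decision = "GLOBAL_COMMIT" if all(votes) else "GLOBAL_ABORT"
    for p in participants:
        p.commit() if decision == "GLOBAL_COMMIT" else p.abort()
    return decision

class Participant:
    def __init__(self, ok=True):
        self.ok, self.state = ok, "INIT"
    def prepare(self):
        self.state = "READY" if self.ok else "ABORT"
        return self.ok
    def commit(self): self.state = "COMMIT"
    def abort(self):  self.state = "ABORT"

# One participant votes no, so nobody commits.
print(two_phase_commit([Participant(), Participant(ok=False)]))  # GLOBAL_ABORT
```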
Faulty systems with arbitrary failures
A process forwards a different value or operation than it is supposed to. This could be due to malicious actions, omission failures, or commission failures. Possible solutions:
Recovery – backward recovery, forward error recovery, erasure correction.
Checkpointing
- distributed snapshot
- coordinated checkpointing – processes synchronize to jointly write their state to local storage.
- incremental snapshot
- Independent checkpointing
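Coordinated checkpointing can be simulated with a barrier, assuming threads stand in for processes (a sketch of the synchronization idea only, not a real snapshot protocol):

```python
import threading

# All processes pause at the same logical point before recording state, so the
# resulting set of checkpoints forms a consistent global state (no message is
# recorded as received by one process but not yet sent by another).
N = 3
barrier = threading.Barrier(N)
checkpoints = {}

def worker(pid, state):
    barrier.wait()                  # synchronize: everybody checkpoints together
    checkpoints[pid] = dict(state)  # stand-in for "write state to local storage"

threads = [threading.Thread(target=worker, args=(i, {"counter": i * 10})) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(checkpoints))  # [0, 1, 2]
```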
Message logging
- pessimistic logging protocols.
- optimistic logging protocols.
Recovery-oriented computing
Byzantine agreement requires
BA1: Every nonfaulty backup process stores the same value.
BA2: If the primary is nonfaulty then every nonfaulty backup process stores exactly what the primary had sent.
k-fault tolerant group
survive faults in k components and still meet its specifications.
- if processes fail silently, then having k + 1 of them is enough to provide k-fault tolerance.
- if processes exhibit arbitrary failures, continuing to run when faulty and sending out erroneous or random replies, a minimum of 2k + 1 processes is needed to achieve k-fault tolerance
3k processes are not enough to reach (Byzantine) consensus:
a single arbitrarily failing process in a 3-process system (k = 1) can be represented in 2 cases that the nonfaulty processes cannot tell apart
3k + 1 processes are enough to reach consensus
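The group-size requirements above can be collected in one small table (a hypothetical helper; the Byzantine figure assumes agreement among the nonfaulty processes is required):

```python
def replicas_needed(k: int, failure_mode: str) -> int:
    """Minimum group size for k-fault tolerance under a given failure model."""
    return {
        "fail_silent": k + 1,           # crash/silent failures: one survivor suffices
        "arbitrary": 2 * k + 1,         # majority voting masks k wrong replies
        "byzantine_agreement": 3 * k + 1,  # consensus despite k Byzantine processes
    }[failure_mode]

print(replicas_needed(1, "byzantine_agreement"))  # 4
```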
Reference
- CAP theorem – Gilbert and Lynch [2002]
- Reliable multicasting issues – Popescu et al. [2007]
- reliable multicasting in the context of publish/subscribe systems [Esposito et al., 2013]
- Backward error Recovery problems – [Singhal and Shivaratri, 1994]
- Paxos – Leslie Lamport [1998] [Lampson, 1996; Prisco et al., 1997;Lamport, 2001; van Renesse and Altinbuken, 2015]
- Byzantine failures
- Lamport et al.[1982]
- Cachin et al. [2011]
- PBFT – [Castro and Liskov, 2002]
Failure masking by redundancy
- Johnson [1995]
- flooding consensus – Cachin et al. [2011]