**High Availability Computer Systems**
======================================

# Summary

Note: this is like a book chapter about fault tolerance and high availability ==> read the Salter chapter about this.

This paper sketches techniques for building highly available computer systems that deal with hardware, operation, environment, and software faults.

# Requirements for an HA System

HA requires systems designed to tolerate faults: detect a fault, report it, mask it, then continue service while the faulty component is repaired offline.

# Terminology

- A system can be viewed as a single module, but it typically consists of multiple modules.
- module:
    + has internal structure, which in turn is composed of submodules
    + has an ideal *specified* behavior and an observed *actual* behavior
- fault --> error --> failure
    + fault: the cause of an error
    + error: a defect in a module
    + failure: deviation of the actual behavior from the specified behavior ==> caused by an effective error
- error latency: the time between the occurrence of an error and the resulting failure
- latent error vs. effective error:
    + latent: a fault causes an error, but the error has not manifested yet (i.e., it has not led to a failure)
    + effective: an error that causes a failure
    E.g.: fault: a programmer's mistake (hence, an erroneous instruction) ==> this creates a *latent error* in the program; when the system executes the erroneous instructions, they cause a failure (the error now becomes effective)
- actual behavior: alternates between service accomplishment (while the module acts as specified) and service interruption (while the module deviates from the specified behavior)
- *module reliability*:
    + measures the time from an initial instant to the next failure event
    + statistically quantified as Mean Time To Failure (MTTF)
- service interruption: statistically quantified as Mean Time To Repair (MTTR)
- *module availability*:
    + measures the ratio of service accomplishment to elapsed time
    + statistically quantified as A = MTTF / (MTTF + MTTR)
    + (see the availability sketch after the fault-type list below)
==> hence, a high-availability system needs to improve MTTF (i.e., tolerate failures as much as possible) and decrease MTTR (i.e., repair faster)

*HOW TO IMPROVE MODULE RELIABILITY*

!!Note: this is just about improving reliability, i.e., extending MTTF; it does not necessarily imply that availability will improve, because A also depends on MTTR (i.e., fast repairs)

1) *valid construction*:
    + remove faults during the construction process
    + assure the constructed module conforms to its specification
    But physical components can fail during operation ==> this alone does not assure HA.
2) *error correction*: use redundancy to reduce failures by tolerating faults
    + latent error processing: detect and repair the error before it becomes effective
    + effective error processing:
        > correction of the error after it becomes effective
        > may *recover* from the error or *mask* the error
    + masking: use redundant information to deliver correct service
      E.g.: error-correcting codes
    + recovering: denies the request and sets the module to an error-free state, so that it can service subsequent requests (see the recovery sketch below)
        > backward error recovery: return to a previous correct state (e.g., checkpoint-restarts, or possibly Rx)
        > forward error recovery: construct a new correct state (e.g., re-sending a damaged message, or re-reading a disk block ==> redundancy in time)

TYPES OF FAULT
+ hardware faults: failing devices
+ design faults: faults in software (mostly) and hardware
+ operation faults: e.g., mistakes by operators lead to system outages (upgrades, database reorganizations, ...)
+ environmental faults (storm, earthquake, fire, power outage)
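To put the availability formula above into numbers, here is a tiny sketch (the MTTF/MTTR values are made up for illustration):

```python
# Availability from the definitions above: A = MTTF / (MTTF + MTTR).
def availability(mttf_hours: float, mttr_hours: float) -> float:
    return mttf_hours / (mttf_hours + mttr_hours)

print(availability(1000, 1.0))  # ~0.9990
print(availability(1000, 0.1))  # ~0.9999: faster repair alone buys another "nine"
```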
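And the backward vs. forward recovery distinction, as a minimal sketch (not the paper's mechanism): `read_block` and `checksum_ok` are hypothetical stand-ins for a device read and an error-detecting code.

```python
import copy

# Forward error recovery: construct a new correct state by retrying the
# operation (redundancy in time), e.g., re-reading a disk block.
def read_with_retry(read_block, checksum_ok, addr, retries=3):
    for _ in range(retries):
        data = read_block(addr)
        if checksum_ok(data):   # redundant check bits detect the error
            return data
    raise IOError("unrecoverable read error")  # fail-fast after retries

# Backward error recovery: return to a previous correct state
# (checkpoint-restart).
class CheckpointedModule:
    def __init__(self, state):
        self.state = state
        self.saved = copy.deepcopy(state)  # last known-good checkpoint

    def checkpoint(self):
        self.saved = copy.deepcopy(self.state)

    def rollback(self):  # on an effective error, restore the checkpoint
        self.state = copy.deepcopy(self.saved)
```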
# Empirical Experiment

- The "bathtub curve":
    + the failure rate is high for new units (infant mortality)
    + then it stabilizes at a low rate
    + when a module ages beyond a threshold, the failure rate increases
- device reliability has improved dramatically (that is, the time from the initial instant to a failure has increased):
    + long-lived devices
    + reduced power, hence reduced temperatures and slower device aging
    + fewer connectors: connections were a major source of faults because of mechanical wear and corrosion
Hence, to improve availability, the focus needs to be on faster repair.

# Fault-tolerant design concepts (*this is important*)

1) Modularity:
    - the system should be decomposed into modules
    - each module is a unit of service, fault containment, and repair
    - if a module fails, it is replaced by a new module
2) Fail-fast:
    - a module either operates correctly or *stops immediately*
3) Independent failure modes:
    - a failed module does not affect the others
4) Redundancy and repair:
    - use a spare module to replace the failed one instantly (the goal is to reduce MTTR, hence increase availability)

# Fault-tolerant hardware

Two main techniques (sketches of both follow this section):
1) Self-checking: a module performs the operation plus some *additional validation* work to validate its state
    + e.g., error-detecting codes on storage and messages
    + requires additional circuitry and design
    + its logic is simple and well understood ==> likely to dominate storage and communication designs
2) Comparison:
    + two or more modules perform the operation
    + a comparator examines the results
    + if they disagree, the modules stop
    (another form of comparison uses a voter circuit: the majority of the results is chosen as the final result)
    ==> because comparators are simple, comparison trades additional circuits for reduced design time

Some forms of systems:
- duplex:
    + either works or stops
    + cannot tolerate a single fault ==> pair-and-spare (using an OR combination)
- triplex (voter):
    + can tolerate a single fault
- N-plex: can detect comparator failures
- pair-and-spare (or dual-dual) (fig 3):
    + combines two fail-fast modules to produce a supermodule that continues operating even if one submodule fails
    + the combination is just an OR of the two submodules (possible because each is fail-fast)
    + the reason for 2 ORs is to tolerate comparator failure
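A rough illustration of self-checking storage (a toy stand-in for real ECC/parity hardware; `memory` is just a Python dict):

```python
# Self-checking storage: store each word with a redundant check bit and
# validate it on every read; on a mismatch, fail fast instead of returning
# corrupt data.
def parity(word: int) -> int:
    return bin(word).count("1") % 2

def write_word(memory: dict, addr: int, word: int) -> None:
    memory[addr] = (word, parity(word))        # data + check bit

def read_word(memory: dict, addr: int) -> int:
    word, check = memory[addr]
    if parity(word) != check:                  # latent error detected
        raise ValueError(f"parity error at {addr}: fail-fast")
    return word
```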
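And a minimal sketch of the comparison schemes (function names are hypothetical; real designs do this with comparator/voter circuits):

```python
from collections import Counter

def duplex(op_a, op_b, x):
    """Fail-fast pair: two modules compute; a comparator checks agreement."""
    a, b = op_a(x), op_b(x)
    if a != b:
        raise RuntimeError("comparator mismatch: fail-fast stop")  # detect, not tolerate
    return a

def triplex(ops, x):
    """Triple modular redundancy: a majority voter masks one faulty module."""
    results = [op(x) for op in ops]            # expects len(ops) == 3
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: the voter cannot mask the fault")
    return value
```

For example, `triplex([f, f, g], x)` still returns the majority result even if `g` disagrees, which is the single-fault masking claimed for triplex above.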
*Important*: Table 2. Where do the numbers come from?

- assume a single module has MTTF = 1 ==> the expected time until the first failure is 1/1 = 1
- duplex: with 2 modules, failures arrive at twice the rate ==> the expected time until the first failure is 1/2
- triplex:
  MTTF of triplex = expected time until the first failure + expected time until the second failure = 1/3 + 1/2 = 5/6
  (with 3 modules the first failure takes 1/3 on average; then 2 modules remain, so the next failure takes another 1/2)
Remember: there is no repair in these systems.

*And some takeaways about redundancy*:
+ redundancy by itself does not improve availability or reliability (though it does decrease the variance in failure rates)
+ adding redundancy actually lessens reliability in the duplex and triplex cases
==> redundant designs require *repair* to dramatically improve reliability

# Importance of repair

We can see that, with repair, the MTTF of duplex and triplex increases dramatically:

MTTF of duplex with repair = (MTTF/2) * (MTTF/MTTR)

The intuition here is that:
+ MTTF has been reduced by a factor of 2 (because there are 2 replicas)
+ but at the same time, due to repair, MTTF is multiplied by a factor of MTTF/MTTR: the pair only fails if the second module fails within the MTTR window of the first, which happens with probability roughly MTTR/MTTF
+ e.g., with MTTF = 1000 hours and MTTR = 1 hour, (1000/2) * (1000/1) = 500,000 hours

MTTF of triplex with repair = (MTTF/3) * (MTTF/2) / MTTR

# Improved device maintenance

The ideas of fail-fast modules and repair via retries or spare modules are applied to create fault-tolerant devices ==> but they do not mask failures caused by hardware design faults. Similarly, comparison techniques do not apply to software --> which is all design, unless design diversity is applied.

# Fault-tolerant software

1) Self-checking: i.e., defensive programming, sanity checking (see the watchdog sketch after this section)
    + if some item does not satisfy an integrity assertion, raise an exception or try to repair it
    + watchdogs or auditors observe the state; if they discover an inconsistency, they raise an exception and either fail-fast the state (erase it) or repair it
2) Comparison: run the same computation twice and compare the results (similar to hardware)
3) *Design diversity*:
    ==> goal: get designs with independent failure modes
    - N-version programming:
        + same spec, but implemented by different independent groups
        + pros: provides design diversity
        + cons:
            > even independent groups can make the same mistakes
            > common mistakes may arise from the original spec
            > how to repair is not clear
            > expensive: implementation and maintenance costs rise by a factor of N
            > repair is needed to improve MTTF, but repairing a design flaw may take a lot of time, hurting availability
            > difficult to reintegrate into a working system without interrupting service
    ==> software repair is not trivial

# Process pairs and transactions

- a completely different approach than software repair
- assumption: most errors caused by design faults in production hardware and software are transient ==> they can be tolerated by retries
- the best short-term approach for masking software faults is to restart the system

PROCESS PAIRS:
- a way to get an almost-instant restart
- the process is the unit of replacement: if a process fails, it is replaced
- how it works (a sketch follows this section):
    + the *primary process* performs all operations for its clients
    + the *backup process* passively watches the message flows
    + the primary occasionally sends checkpoint messages to the backup
    + when the primary detects some inconsistency, it fails fast and notifies the backup --> the backup becomes the primary and answers all requests
- pros:
    + instant replacement and repair
    + tolerates transient software failures
    + tolerates hardware faults if the backup process runs on different hardware
- cons: the complexity of writing the checkpoint and takeover logic

Alternative: use TRANSACTIONS (ACID)
+ atomicity: either all or nothing
+ consistency: preserves state invariants
+ isolation: isolated from other concurrent transactions
+ durability: once committed, the effects of its operations survive subsequent failures
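Circling back to software self-checking: a minimal sketch of an integrity assertion plus a watchdog/auditor, assuming a toy bank-account state (all names here are hypothetical):

```python
# Defensive programming: integrity assertions guard each operation.
def withdraw(account: dict, amount: int) -> int:
    assert amount > 0 and account["balance"] >= amount  # sanity-check the request
    account["balance"] -= amount
    assert account["balance"] >= 0                      # integrity assertion on the new state
    return account["balance"]

# Watchdog/auditor: periodically scan the state; on an inconsistency,
# raise an exception so the module can fail fast or repair the state.
def audit(accounts: dict) -> None:
    for name, acct in accounts.items():
        if acct["balance"] < 0:                         # latent error found
            raise RuntimeError(f"audit failed for {name!r}: fail-fast")
```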
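And the process-pair idea as toy code (a sketch only: real process pairs are separate processes exchanging checkpoint messages, which this collapses into one object; requests are modeled as functions over a state dict):

```python
class ProcessPair:
    def __init__(self, initial_state: dict):
        self.primary_state = dict(initial_state)
        self.backup_state = dict(initial_state)   # passively maintained copy
        self.primary_alive = True

    def handle(self, request):
        if not self.primary_alive:                # backup has become primary
            return request(self.backup_state)
        try:
            result = request(self.primary_state)
            # the primary occasionally checkpoints; here, after every request
            self.backup_state = dict(self.primary_state)
            return result
        except AssertionError:                    # inconsistency detected
            self.primary_alive = False            # fail-fast; backup takes over
            return request(self.backup_state)
```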
How to combine transactions with process pairs (a toy sketch follows at the end of these notes):
1) The designer declares a process pair's state to be *persistent*
2) When the primary fails:
    + all transactions in progress at the primary are aborted
    + the backup-process state is reconstructed as it stood at the start of the in-progress transactions
    + the backup process then reprocesses those transactions

# How about operation, maintenance, and environmental faults?

==> System pair:
- 2 nearly identical systems are placed far apart:
    - on different power grids
    - at sites on different earthquake faults, in different weather systems
    - maintenance personnel and operators are different
- each system carries half the load
- when one fails, the other takes over

Pros:
- masks many hardware failures
- masks maintenance faults
- either system can be repaired, moved, or replaced without affecting the other
- easy to upgrade or reorganize one system at a time
- masks environmental faults
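Finally, the transactions-plus-process-pairs combination as the toy sketch promised above (transactions are modeled as functions that mutate a state dict; a real system would log and replay messages rather than keep Python lists):

```python
class PersistentProcessPair:
    def __init__(self):
        self.state = {}          # primary's *persistent* state
        self.checkpoint = {}     # backup's copy, as of the last commit
        self.in_progress = []    # transactions begun but not yet committed

    def begin(self, txn):
        self.in_progress.append(txn)

    def commit(self, txn):
        txn(self.state)                       # apply the transaction's effects
        self.in_progress.remove(txn)
        self.checkpoint = dict(self.state)    # checkpoint message to the backup

    def takeover(self):
        """Primary failed: abort, restore the backup state, reprocess."""
        aborted, self.in_progress = self.in_progress, []
        self.state = dict(self.checkpoint)    # state as of the start of the
                                              # in-progress transactions
        for txn in aborted:                   # the backup reprocesses them
            self.begin(txn)
            self.commit(txn)
```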