**High Availability Computer Systems**
======================================

# Summary

Note: this is like a book chapter about fault tolerance and high availability ==> read the Salter chapter about this.

This paper sketches techniques for building highly available computer systems that deal with hardware, operation, environment, and software faults.

# Requirements for an HA System

HA requires systems designed to tolerate faults: detect a fault, report it, mask it, then continue service while the faulty component is repaired offline.

# Terminology

- A system can be viewed as a single module, but it typically consists of multiple modules.
- module:
    + has internal structure, which in turn is composed of submodules
    + has an ideal *specified* behavior and an observed *actual* behavior
- fault --> error --> failure
    + fault: the cause of an error
    + error: a defect in a module
    + failure: deviation of the actual behavior from the specified behavior ==> caused by an effective error
- error latency: the time between the occurrence of an error and the resulting failure
- latent error vs. effective error:
    + latent: a fault causes an error, but the error has not manifested yet (i.e., it has not led to a failure)
    + effective: an error that causes a failure
    E.g.: fault: a programmer's mistake (hence, an erroneous instruction) ==> this creates a *latent error* in the program; when the system executes the erroneous instructions, they cause a failure (the error now becomes effective)
- actual behavior: alternates between service accomplishment (while the module acts as specified) and service interruption (while the module deviates from the specified behavior)
- *module reliability*:
    + measures the time from an initial instant to the next failure event
    + statistically quantified as Mean Time To Failure (MTTF)
- service interruption: statistically quantified as Mean Time To Repair (MTTR)
- *module availability*:
    + measures the ratio of service accomplishment to elapsed time
    + statistically quantified as A = MTTF / (MTTF + MTTR)
    + (see the availability sketch after the fault-type list below)
==> hence, a high-availability system needs to improve MTTF (i.e., tolerate failures as much as possible) and decrease MTTR (i.e., repair faster)

*HOW TO IMPROVE MODULE RELIABILITY*

!!Note: this is just about improving reliability, i.e., extending MTTF; it does not necessarily imply that availability will improve, because A also depends on MTTR (i.e., fast repairs)

1) *valid construction*:
    + remove faults during the construction process
    + assure the constructed module conforms to its specification
    But physical components can fail during operation ==> this alone does not assure HA.
2) *error correction*: use redundancy to reduce failures by tolerating faults
    + latent error processing: detect and repair the error before it becomes effective
    + effective error processing:
        > correction of the error after it becomes effective
        > may *recover* from the error or *mask* the error
    + masking: use redundant information to deliver correct service
      E.g.: error-correcting codes
    + recovering: denies the request and sets the module to an error-free state, so that it can service subsequent requests (see the recovery sketch below)
        > backward error recovery: return to a previous correct state (e.g., checkpoint-restarts, or possibly Rx)
        > forward error recovery: construct a new correct state (e.g., re-sending a damaged message, or re-reading a disk block ==> redundancy in time)

TYPES OF FAULT
+ hardware faults: failing devices
+ design faults: faults in software (mostly) and hardware
+ operation faults: e.g., mistakes by operators lead to system outages (upgrades, database reorganizations, ...)
+ environmental faults (storm, earthquake, fire, power outage)
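To put the availability formula above into numbers, here is a tiny sketch (the MTTF/MTTR values are made up for illustration):

```python
# Availability from the definitions above: A = MTTF / (MTTF + MTTR).
def availability(mttf_hours: float, mttr_hours: float) -> float:
    return mttf_hours / (mttf_hours + mttr_hours)

print(availability(1000, 1.0))  # ~0.9990
print(availability(1000, 0.1))  # ~0.9999: faster repair alone buys another "nine"
```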
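And the backward vs. forward recovery distinction, as a minimal sketch (not the paper's mechanism): `read_block` and `checksum_ok` are hypothetical stand-ins for a device read and an error-detecting code.

```python
import copy

# Forward error recovery: construct a new correct state by retrying the
# operation (redundancy in time), e.g., re-reading a disk block.
def read_with_retry(read_block, checksum_ok, addr, retries=3):
    for _ in range(retries):
        data = read_block(addr)
        if checksum_ok(data):   # redundant check bits detect the error
            return data
    raise IOError("unrecoverable read error")  # fail-fast after retries

# Backward error recovery: return to a previous correct state
# (checkpoint-restart).
class CheckpointedModule:
    def __init__(self, state):
        self.state = state
        self.saved = copy.deepcopy(state)  # last known-good checkpoint

    def checkpoint(self):
        self.saved = copy.deepcopy(self.state)

    def rollback(self):  # on an effective error, restore the checkpoint
        self.state = copy.deepcopy(self.saved)
```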
# Empirical Experiment

- The "bathtub curve":
    + the failure rate is high for new units (infant mortality)
    + then it stabilizes at a low rate
    + when a module ages beyond a threshold, the failure rate increases
- device reliability has improved dramatically (that is, the time from the initial instant to a failure has increased):
    + long-lived devices
    + reduced power, hence reduced temperatures and slower device aging
    + fewer connectors: connections were a major source of faults because of mechanical wear and corrosion
Hence, to improve availability, the focus needs to be on faster repair.

# Fault-tolerant design concepts (*this is important*)

1) Modularity:
    - the system should be decomposed into modules
    - each module is a unit of service, fault containment, and repair
    - if a module fails, it is replaced by a new module
2) Fail-fast:
    - a module either operates correctly or *stops immediately*
3) Independent failure modes:
    - a failed module does not affect the others
4) Redundancy and repair:
    - use a spare module to replace the failed one instantly (the goal is to reduce MTTR, hence increase availability)

# Fault-tolerant hardware

Two main techniques (sketches of both follow this section):
1) Self-checking: a module performs the operation plus some *additional validation* work to validate its state
    + e.g., error-detecting codes on storage and messages
    + requires additional circuitry and design
    + its logic is simple and well understood ==> likely to dominate storage and communication designs
2) Comparison:
    + two or more modules perform the operation
    + a comparator examines the results
    + if they disagree, the modules stop
    (another form of comparison uses a voter circuit: the majority of the results is chosen as the final result)
    ==> because comparators are simple, comparison trades additional circuits for reduced design time

Some forms of systems:
- duplex:
    + either works or stops
    + cannot tolerate a single fault ==> pair-and-spare (using an OR combination)
- triplex (voter):
    + can tolerate a single fault
- N-plex: can detect comparator failures
- pair-and-spare (or dual-dual) (fig 3):
    + combines two fail-fast modules to produce a supermodule that continues operating even if one submodule fails
    + the combination is just an OR of the two submodules (possible because each is fail-fast)
    + the reason for 2 ORs is to tolerate comparator failure
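A rough illustration of self-checking storage (a toy stand-in for real ECC/parity hardware; `memory` is just a Python dict):

```python
# Self-checking storage: store each word with a redundant check bit and
# validate it on every read; on a mismatch, fail fast instead of returning
# corrupt data.
def parity(word: int) -> int:
    return bin(word).count("1") % 2

def write_word(memory: dict, addr: int, word: int) -> None:
    memory[addr] = (word, parity(word))        # data + check bit

def read_word(memory: dict, addr: int) -> int:
    word, check = memory[addr]
    if parity(word) != check:                  # latent error detected
        raise ValueError(f"parity error at {addr}: fail-fast")
    return word
```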
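And a minimal sketch of the comparison schemes (function names are hypothetical; real designs do this with comparator/voter circuits):

```python
from collections import Counter

def duplex(op_a, op_b, x):
    """Fail-fast pair: two modules compute; a comparator checks agreement."""
    a, b = op_a(x), op_b(x)
    if a != b:
        raise RuntimeError("comparator mismatch: fail-fast stop")  # detect, not tolerate
    return a

def triplex(ops, x):
    """Triple modular redundancy: a majority voter masks one faulty module."""
    results = [op(x) for op in ops]            # expects len(ops) == 3
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: the voter cannot mask the fault")
    return value
```

For example, `triplex([f, f, g], x)` still returns the majority result even if `g` disagrees, which is the single-fault masking claimed for triplex above.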
*Important*: Table 2. Where do the numbers come from?

- assume a single module has MTTF = 1 ==> the expected time until the first failure is 1/1 = 1
- duplex: with 2 modules, failures arrive at twice the rate ==> the expected time until the first failure is 1/2
- triplex:
  MTTF of triplex = expected time until the first failure + expected time until the second failure = 1/3 + 1/2 = 5/6
  (with 3 modules the first failure takes 1/3 on average; then 2 modules remain, so the next failure takes another 1/2)
Remember: there is no repair in these systems.

*And some takeaways about redundancy*:
+ redundancy by itself does not improve availability or reliability (though it does decrease the variance in failure rates)
+ adding redundancy actually lessens reliability in the duplex and triplex cases
==> redundant designs require *repair* to dramatically improve reliability

# Importance of repair

We can see that, with repair, the MTTF of duplex and triplex increases dramatically:

MTTF of duplex with repair = (MTTF/2) * (MTTF/MTTR)

The intuition here is that:
+ MTTF has been reduced by a factor of 2 (because there are 2 replicas)
+ but at the same time, due to repair, MTTF is multiplied by a factor of MTTF/MTTR: the pair only fails if the second module fails within the MTTR window of the first, which happens with probability roughly MTTR/MTTF
+ e.g., with MTTF = 1000 hours and MTTR = 1 hour, (1000/2) * (1000/1) = 500,000 hours

MTTF of triplex with repair = (MTTF/3) * (MTTF/2) / MTTR

# Improved device maintenance

The ideas of fail-fast modules and repair via retries or spare modules are applied to create fault-tolerant devices ==> but they do not mask failures caused by hardware design faults. Similarly, comparison techniques do not apply to software --> which is all design, unless design diversity is applied.

# Fault-tolerant software

1) Self-checking: i.e., defensive programming, sanity checking (see the watchdog sketch after this section)
    + if some item does not satisfy an integrity assertion, raise an exception or try to repair it
    + watchdogs or auditors observe the state; if they discover an inconsistency, they raise an exception and either fail-fast the state (erase it) or repair it
2) Comparison: run the same computation twice and compare the results (similar to hardware)
3) *Design diversity*:
    ==> goal: get designs with independent failure modes
    - N-version programming:
        + same spec, but implemented by different independent groups
        + pros: provides design diversity
        + cons:
            > even independent groups can make the same mistakes
            > common mistakes may arise from the original spec
            > how to repair is not clear
            > expensive: implementation and maintenance costs rise by a factor of N
            > repair is needed to improve MTTF, but repairing a design flaw may take a lot of time, hurting availability
            > difficult to reintegrate into a working system without interrupting service
    ==> software repair is not trivial

# Process pairs and transactions

- a completely different approach than software repair
- assumption: most errors caused by design faults in production hardware and software are transient ==> they can be tolerated by retries
- the best short-term approach for masking software faults is to restart the system

PROCESS PAIRS:
- a way to get an almost-instant restart
- the process is the unit of replacement: if a process fails, it is replaced
- how it works (a sketch follows this section):
    + the *primary process* performs all operations for its clients
    + the *backup process* passively watches the message flows
    + the primary occasionally sends checkpoint messages to the backup
    + when the primary detects some inconsistency, it fails fast and notifies the backup --> the backup becomes the primary and answers all requests
- pros:
    + instant replacement and repair
    + tolerates transient software failures
    + tolerates hardware faults if the backup process runs on different hardware
- cons: the complexity of writing the checkpoint and takeover logic

Alternative: use TRANSACTIONS (ACID)
+ atomicity: either all or nothing
+ consistency: preserves state invariants
+ isolation: isolated from other concurrent transactions
+ durability: once committed, the effects of its operations survive subsequent failures
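Circling back to software self-checking: a minimal sketch of an integrity assertion plus a watchdog/auditor, assuming a toy bank-account state (all names here are hypothetical):

```python
# Defensive programming: integrity assertions guard each operation.
def withdraw(account: dict, amount: int) -> int:
    assert amount > 0 and account["balance"] >= amount  # sanity-check the request
    account["balance"] -= amount
    assert account["balance"] >= 0                      # integrity assertion on the new state
    return account["balance"]

# Watchdog/auditor: periodically scan the state; on an inconsistency,
# raise an exception so the module can fail fast or repair the state.
def audit(accounts: dict) -> None:
    for name, acct in accounts.items():
        if acct["balance"] < 0:                         # latent error found
            raise RuntimeError(f"audit failed for {name!r}: fail-fast")
```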
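And the process-pair idea as toy code (a sketch only: real process pairs are separate processes exchanging checkpoint messages, which this collapses into one object; requests are modeled as functions over a state dict):

```python
class ProcessPair:
    def __init__(self, initial_state: dict):
        self.primary_state = dict(initial_state)
        self.backup_state = dict(initial_state)   # passively maintained copy
        self.primary_alive = True

    def handle(self, request):
        if not self.primary_alive:                # backup has become primary
            return request(self.backup_state)
        try:
            result = request(self.primary_state)
            # the primary occasionally checkpoints; here, after every request
            self.backup_state = dict(self.primary_state)
            return result
        except AssertionError:                    # inconsistency detected
            self.primary_alive = False            # fail-fast; backup takes over
            return request(self.backup_state)
```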
How to combine transactions with process pairs (a toy sketch follows at the end of these notes):
1) The designer declares a process pair's state to be *persistent*
2) When the primary fails:
    + all transactions in progress at the primary are aborted
    + the backup-process state is reconstructed as it stood at the start of the in-progress transactions
    + the backup process then reprocesses those transactions

# How about operation, maintenance, and environmental faults?

==> System pair:
- 2 nearly identical systems are placed far apart:
    - on different power grids
    - at sites on different earthquake faults, in different weather systems
    - maintenance personnel and operators are different
- each system carries half the load
- when one fails, the other takes over

Pros:
- masks many hardware failures
- masks maintenance faults
- either system can be repaired, moved, or replaced without affecting the other
- easy to upgrade or reorganize one system at a time
- masks environmental faults
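Finally, the transactions-plus-process-pairs combination as the toy sketch promised above (transactions are modeled as functions that mutate a state dict; a real system would log and replay messages rather than keep Python lists):

```python
class PersistentProcessPair:
    def __init__(self):
        self.state = {}          # primary's *persistent* state
        self.checkpoint = {}     # backup's copy, as of the last commit
        self.in_progress = []    # transactions begun but not yet committed

    def begin(self, txn):
        self.in_progress.append(txn)

    def commit(self, txn):
        txn(self.state)                       # apply the transaction's effects
        self.in_progress.remove(txn)
        self.checkpoint = dict(self.state)    # checkpoint message to the backup

    def takeover(self):
        """Primary failed: abort, restore the backup state, reprocess."""
        aborted, self.in_progress = self.in_progress, []
        self.state = dict(self.checkpoint)    # state as of the start of the
                                              # in-progress transactions
        for txn in aborted:                   # the backup reprocesses them
            self.begin(txn)
            self.commit(txn)
```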