**Rx: Treating Bugs as Allergies**
==================================

QUESTION:
* What aspect (reliability or availability) does Rx help to improve?
  - Availability: strictly speaking, Rx does not mask failures; it detects a failure and provides fast repair by changing the execution environment, so that the re-execution (hopefully) avoids the failure.
  - Moreover, the proxy in Rx makes re-execution transparent to clients: a client may see an increased response time, but it never sees that the server has crashed.
* Why not simply reboot?
  - Rebooting does not help with deterministic bugs.
  - Rebooting is slow: it takes time to initialize system state and to warm the cache.
* Trade-off: when to use reboot, and when to use Rx?

# 0. Take away:
- How to improve availability? How to tolerate software failures?
  + Alternative 1: reboot
    > cannot deal with *deterministic failures*
    > unavailability during the reboot
    > requires time to warm the cache
  + Alternative 2: checkpoint and recovery
    > cannot deal with deterministic bugs
    > solutions: message reordering, n-version programming
  + Alternative 3: application-specific recovery
    > requires the software to be failure-aware ==> affects programming difficulty and code readability
  + Alternative 4: non-conventional (like failure-oblivious computing)
    > for an out-of-bound read, just provide a dummy value rather than panic
    ==> may work for certain bugs and applications, but not for all
- Idea in this paper: a "safe" approach, i.e., rather than fixing the bug at run time, change the environment so that the bug does not manifest (a minimal sketch of this retry loop follows at the end of this section):
  + roll the program back to a recent checkpoint
  + *change* the execution environment based on the failure symptoms
  + re-execute the buggy code region in the new environment
  + after passing the region, disable the changes to avoid their overhead
  What if re-execution fails?
  - roll back again, with different changes or to an older checkpoint
  - after some threshold of iterations, switch to the alternative (reboot)
- Assumption:
  + bugs are related to the execution environment:
    > dirty (uninitialized) read ==> fill allocated buffers with zeros
    > dangling pointer ==> delay recycling of freed buffers
    > buffer overrun ==> pad allocated buffers
    > data race ==> change the timing of related events (e.g., increase the time slice)
    > malicious user ==> drop the request
  + such bugs can be avoided by removing the "allergen" from the environment
  Questions:
  1) HOW to find the *correct* allergen and remove it from the environment?
     (this is what the paper is all about)
  2) Changing the environment is expensive, so when should a change be made?
  3) Which change should be tried first?
     + learn from history (if a change worked for a similar failure, apply it first)
     + prefer changes with small overhead
     + try changes with negative effects last (e.g., dropping requests)

Advantages of Rx:
- comprehensive: survives many kinds of bugs
- safe: does not fix the bug but changes the environment; program execution semantics are not modified
- non-invasive: few to no modifications to the source code
- efficient: no reboot, no warm-up time
- informative: provides additional information for debugging

Disadvantages:
- depends on the accuracy of the sensors (i.e., the failure detectors)
- cannot handle bugs that do not depend on the environment (i.e., semantic bugs)
- cannot handle memory-leak bugs
- checkpointing and rollback add overhead
- cannot deal with latent bugs, i.e., bugs in which the fault is introduced long before any obvious symptom
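The retry loop referenced above can be made concrete with a short sketch. This is a minimal illustration in C, not the paper's implementation; the helper names (`rollback_to_checkpoint`, `apply_env_change`, `reexecute_buggy_region`, `disable_env_changes`, `fall_back_to_reboot`) and the `MAX_RETRIES` threshold are hypothetical stand-ins for Rx's real components (checkpoint/rollback, environment wrappers, control unit, sensors), stubbed here so the sketch compiles.

```c
/* Minimal sketch of Rx-style recovery: roll back, change the environment,
 * re-execute; escalate to older checkpoints / other changes, then give up
 * and reboot after a threshold. Helpers are hypothetical stubs. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_RETRIES 5   /* threshold before falling back to reboot */

static bool rollback_to_checkpoint(int depth)  { (void)depth; return true; }
static void apply_env_change(int change_id)    { (void)change_id; }
static void disable_env_changes(void)          { }
static bool reexecute_buggy_region(void)       { return true; } /* true = survived */
static void fall_back_to_reboot(void)          { puts("giving up: reboot"); }

void rx_recover(void)
{
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        /* Roll further into the past as attempts keep failing. */
        if (!rollback_to_checkpoint(attempt))
            break;

        /* Pick the next change: cheap and historically successful changes
         * first, harmful ones (e.g., dropping requests) last. */
        apply_env_change(attempt);

        if (reexecute_buggy_region()) {
            /* Past the buggy region: drop the changes to avoid overhead. */
            disable_env_changes();
            return;
        }
    }
    fall_back_to_reboot();
}

int main(void) { rx_recover(); return 0; }
```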
# 1. Design of Rx
Five components:
1) Sensors: detect failures at runtime
   - software errors: e.g., assertion failures, access violations
   - software bugs: buffer overflows, dangling pointers
2) Checkpoint and rollback
   - What to checkpoint?
     1) application state: copy-on-write (COW) on the application's memory image
     2) system state: file state, signals, messages
   - Dealing with space overhead?
     + write old checkpoints to disk when the system is idle
       but recovering from an on-disk checkpoint requires I/O
       ==> to bound recovery time, keep only a small number of checkpoints
3) Environment wrappers: change the environment during re-execution
   *Requirements*
   - correctness is preserved
   - future failures are avoided
   Now, the details:
   - memory wrapper: modifies the memory-related library to introduce changes (a minimal sketch is given at the end of these notes)
     + delayed free: avoids *dangling pointers* and *double frees*
       > a freed buffer is recycled only when no other free memory is available, or after a certain amount of time
     + padded buffers: avoid *buffer overflows*
       > but wastes memory :)
     + allocation isolation: all memory allocated during re-execution is placed in an isolated region ==> avoids memory corruption (buffer overflow)
     + zero-filling: avoids *uninitialized reads*
       > time overhead
   - message wrapper: implemented in the proxy
     + shuffle the order of requests
     + deliver random-sized packets
   - scheduling wrapper: implemented in the kernel
     + change a process's priority ==> reduces the chance that a process is switched out inside an unprotected critical region
   - signal delivery:
     + may affect the occurrence probability of a concurrency bug
     + record signals in a kernel table before delivering them
     + h/w interrupts: deliver at random times but preserve their order
     + s/w timer signals: ignore during rollback
     + s/w exceptions: captured by the sensors (as a sign of failure)
   - dropping user requests:
     + to avoid malicious users
4) Proxy: makes server recovery transparent to clients
   - two modes: normal and recovery
   - normal mode:
     + forward requests/responses between the server and its clients
     + buffer requests
     + record which messages have been answered
   - recovery mode:
     + replay the buffered requests
     + introduce environmental changes
     + buffer new requests arriving during re-execution
     + drop requests that have already been answered
   - for strict session consistency (i.e., ensuring that a replayed message produces exactly the same result)
     ==> hash the responses; if the hashes do not match, abort the session
   - the proxy needs to be tweaked per application protocol in order to interpret the messages
5) Control unit
   - directs checkpointing and rollback
   - diagnoses the failure and decides which changes to apply, based on the symptoms
     + symptoms can be:
       ~ type of exception
       ~ call chain
       ~ instruction counters
       ~ etc.
   - provides failure-related information for further debugging
   - learns from experience:
     + builds a failure table, with a score vector for each failure
     + once a failure is detected, search the failure table; on a match, apply those changes first

# Some issues
- inter-server communication: web server -> app server -> database server
  + apply Rx to all servers in the hierarchy
  + Rx instances in tiered servers take checkpoints in a coordinated way
  + when Rx detects a failure, it rolls back the failed server and broadcasts its rollback to the correlated servers, which then roll back correspondingly
- multi-threaded process checkpointing:
  + at checkpoint time, some threads may be blocked inside the kernel on a system call
  + potential problem:
    ~ such state may include kernel locks that have already been acquired
    ~ rolling back such state may leave two processes holding the same kernel locks
  + solution: force all threads to stay at user level before checkpointing
    ~ how: send a signal to all threads
    ~ then Rx resumes the prematurely returned system calls
  Implication: frequent checkpointing affects the performance of normal I/O
- may not work for all bugs (this is the weakness)
  + memory-leak bugs: take days to crash the server ==> better to reboot
  + semantic bugs (not related to the environment): Rx can do nothing about them
  + depends on the accuracy of the sensors
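As forward-referenced in the memory-wrapper bullet above, here is a minimal sketch in C of what such a wrapper might look like. It is illustrative only, not Rx's actual code: the names `rx_malloc`/`rx_free` and the constants `PAD_BYTES`/`DEFER_SLOTS` are hypothetical. It shows padding plus zero-filling on allocation and delayed recycling on free; a real wrapper would additionally implement allocation isolation and detect duplicate frees to fully mask double-free bugs.

```c
/* Sketch of an Rx-style memory wrapper used during re-execution:
 * padded, zero-filled allocations and deferred frees. Illustrative only. */
#include <stdlib.h>
#include <string.h>

#define PAD_BYTES   64     /* slack that absorbs small buffer overruns     */
#define DEFER_SLOTS 1024   /* how many freed pointers to hold back         */

static void *deferred[DEFER_SLOTS];
static int   deferred_n;

void *rx_malloc(size_t size)
{
    /* Padding masks small overruns; zero-filling masks uninitialized reads. */
    void *p = malloc(size + PAD_BYTES);
    if (p)
        memset(p, 0, size + PAD_BYTES);
    return p;
}

void rx_free(void *p)
{
    if (p == NULL)
        return;
    /* Delay recycling so dangling pointers keep referring to memory that
     * has not yet been handed out again. */
    if (deferred_n == DEFER_SLOTS) {
        for (int i = 0; i < DEFER_SLOTS; i++)   /* out of slots: flush */
            free(deferred[i]);
        deferred_n = 0;
    }
    deferred[deferred_n++] = p;
}

int main(void)
{
    char *buf = rx_malloc(16);   /* zero-filled, padded allocation */
    memcpy(buf, "hello", 6);
    rx_free(buf);                /* not actually recycled yet */
    return 0;
}
```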