On-demand fault isolation
=========================

What is the broad question we are answering?

- How do we recover drivers very quickly, using a mechanism that has a diverse range of applications?
	-- Can it be used for fault isolation, quick(er) suspend, switching devices in a VM environment?
	-- We show the fault tolerance application of recovery by introducing on-demand fault tolerance
	-- Use a combination of existing program partitioning and static analysis tools/techniques to provide on-demand fault tolerance

Paper:

1- Abstract
	+	While research on driver isolation has progressed, there is little research on recovery from driver failures.
		We present results/answers to the following questions:
		1. Is it possible to recover from driver failures at a finer granularity?
		2. Is it possible to perform device recovery from within the driver
		without resetting or restarting the driver and device?
 
	+ Proposed solution
 	(1) Improve recovery in drivers, where the current state of the art is restarting the whole driver
 	(2) Develop a driver isolation model that demonstrates the value of our recovery mechanism
			and uses existing static analysis tools to tolerate bugs
 	(3) Work with the existing set of drivers/operating systems
	+ Brief results

2- Introduction
	+ Modern computer systems
		-- increasing reliability concerns from different classes of failures
		-- many sophisticated tools, language and hardware techniques that handle failures at runtime and keep the system running
		-- resilient systems require better recovery solutions
		-- significant progress in isolation systems of drivers (trend towards finer grained protection)
		-- but all of them often fall back on shadow drivers for recovery, which requires resetting the device and driver
				(which is slow, as seen in shadow driver migration)
	 	-- Why do we need better recovery
			--- Drivers contain significant initialization code and take time to initialize
			--- problems with whole-kernel-extension recovery (can never capture state in a fully generic manner and requires
				changes per driver)
			--- cannot handle transient failures
		-- we develop a recovery mechanism that performs quick recovery at low overhead and has many use cases
		-- we develop an isolation model to complement the recovery 
	
3- Motivation
	+ Why do drivers/devices need to support fast recovery?
		-- Failure recovery
		-- I/O virtualization	
		-- Upgrade of drivers
	+ Describe driver code constitution (mostly initialization code, multiple chipsets, slow probing)
		-- describe time taken during probe
	+ Biggest challenges
		-- Device state
		-- Device/driver specific nuances
	+ Introduce On-demand fault tolerance = on-demand isolation + no overhead recovery
		-- Use the wealth of static analysis tools to guide on-demand isolation

4- Design (perhaps the last bullet of the previous section overrides the need for this section?)
	+ Picture of the whole system
	+ Abstract about isolation
	+ Specific about recovery piece

5- On-demand fault isolation
	 + Why do we need isolation?
		We need an introduction here that would be in section 1 if the paper 
		was more about fault tolerance than just recovery
	 
	 + Goals of isolation
		-- Work with existing drivers
	
	
		-- entry point isolation
	   
	
	  + How do we enforce isolation?	
		1. Split drivers
		-- Use existing program partitioning techniques and static analysis tools to modify driver at compile time
		-- Convert the driver into two components: a regular driver and an SFI'd driver. (They can overlap, but that will require wrappers around entry points)
		-- Use the wealth of existing static analysis tools and developer annotations to mark functions with the __isolate__ attribute
		-- Control transferred between the two using on-demand marshaling (stubs/wrappers)
		-- Object tracker?
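
Sketch of on-demand marshaling through a stub. This is a toy Python model, not the actual driver toolchain: the `isolate` decorator is a hypothetical stand-in for the `__isolate__` attribute, and the driver state is modeled as a plain dict. It only illustrates the copy-in path, where the stub hands the isolated function a copy of just the fields it declares.

```python
import copy


def isolate(*fields):
    """Hypothetical stand-in for the __isolate__ attribute: wrap an entry
    point in a stub that marshals in only the fields it declares."""
    def mark(fn):
        def stub(driver_state, *args):
            # Copy just the declared fields; everything else stays out of
            # reach of the isolated function.
            view = {name: copy.deepcopy(driver_state[name]) for name in fields}
            return fn(view, *args)
        stub.marshaled_fields = fields
        return stub
    return mark


@isolate("mtu")
def set_mtu(state, new_mtu):
    state["mtu"] = new_mtu  # the write lands on the marshaled copy
    return state["mtu"]


driver = {"mtu": 1500, "rx_ring": [0, 0, 0, 0]}
result = set_mtu(driver, 9000)
```

In this model the original `driver` dict is untouched after the call; propagating the change back is a separate commit step.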
	
		2. How do we ensure safety?
		 --  The required data structures are marshaled in and stored in range hash tables.
		 --  Range hash tables are used to provide SFI for reads/writes using weak type safety
		 -- No stack information to prevent control being diverted due to stack smashing
		 -- What do we gain from a separate module? copy + marshaling = reliability by redundancy + reliability by structure.
			Difference from Nooks/BGI: they ensure reliability either by redundancy or by structure; we do both. Redundancy helps us recover easily, while structure aids detection.
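
Sketch of the range check behind SFI for reads/writes. This is a toy model under stated assumptions: the class name `RangeTable` and its methods are hypothetical, ranges are assumed not to overlap, and for brevity it uses a sorted list with binary search instead of a hash table. An access is allowed only if it falls entirely inside one marshaled object, and a write additionally requires that object to be writable.

```python
import bisect


class RangeTable:
    """Toy table of marshaled objects, keyed by start address."""

    def __init__(self):
        self._starts = []   # sorted start addresses
        self._ranges = {}   # start -> (end, writable)

    def add(self, start, length, writable=True):
        bisect.insort(self._starts, start)
        self._ranges[start] = (start + length, writable)

    def check(self, addr, size, write=False):
        # Find the last range that starts at or before addr.
        i = bisect.bisect_right(self._starts, addr) - 1
        if i < 0:
            return False
        end, writable = self._ranges[self._starts[i]]
        if addr + size > end:
            return False    # access runs past the end of the object
        if write and not writable:
            return False
        return True


table = RangeTable()
table.add(0x1000, 64)                   # read/write object
table.add(0x2000, 16, writable=False)   # read-only object
```

For example, `table.check(0x103e, 4)` is rejected because the 4-byte access would cross the end of the 64-byte object at `0x1000`.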
	
	   + Duality of isolation and recovery 
		 -- How do we invoke recovery from isolation?
		 -- How do we detect failures?
		 -- Walk through of isolation + recovery
	

6- No-overhead recovery
	+ State of the art in driver recovery is to restart the whole driver
	+ Present no-overhead recovery
		-- Ability to restore a running driver to a safe state back in time
		-- Reuse suspend/resume in drivers to checkpoint device state and recover
		-- How do we recover the driver/device/kernel state?

		+ Device state
			--- Describe suspend-resume background
			--- use suspend/resume
				---- device specific functionality provided by driver
				---- mention quirks
			--- do minimum work during suspend
				---- how do we ensure that our system captures a consistent suspend snapshot?
			--- how do we handle quiescing of running threads?
				---- only wait during resume?
			--- can we do better?
			--- e1000 does take a lock during restore
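
Sketch of reusing suspend/resume as checkpoint/restore for device state. This is a toy model: `ToyDevice`, `ToyDriver`, `checkpoint` and `recover` are hypothetical names, the device is reduced to a dict of registers, and quiescing, locking and quirks are left out. The point is only that the driver's own suspend path already knows which device state to save, and its resume path knows how to write it back.

```python
class ToyDevice:
    def __init__(self):
        self.regs = {"ctrl": 0x1, "mtu": 1500}


class ToyDriver:
    def __init__(self, device):
        self.device = device
        self._saved = None

    def suspend(self):
        # Minimum work: only snapshot the device state.
        self._saved = dict(self.device.regs)

    def resume(self):
        if self._saved is None:
            raise RuntimeError("no checkpoint taken")
        # Write the saved state back to the device.
        self.device.regs = dict(self._saved)


def checkpoint(driver):
    driver.suspend()


def recover(driver):
    driver.resume()


device = ToyDevice()
driver = ToyDriver(device)
checkpoint(driver)
device.regs["ctrl"] = 0xdead    # a faulty entry point corrupts the device
recover(driver)
```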
			
		+ Kernel state (calls into kernel)
			--- wrappers around calls into the kernel to store args and return values.
			--- upon failure, compensate only for the non-idempotent function calls.
			--- clean up locks
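
Sketch of the kernel-call log and compensation. Everything here is hypothetical: `ToyKernel` and its methods stand in for real kernel interfaces, and the `COMPENSATE` table is an assumed way to record which calls are non-idempotent and how to undo them. The wrapper stores args and return values; on failure the log is walked in reverse and only calls with a registered undo action are compensated, which also releases any locks taken.

```python
class ToyKernel:
    def __init__(self):
        self.allocated = set()
        self.held_locks = set()
        self._next_handle = 1

    def alloc(self, size):
        handle = self._next_handle
        self._next_handle += 1
        self.allocated.add(handle)
        return handle

    def free(self, handle):
        self.allocated.discard(handle)

    def lock(self, name):
        self.held_locks.add(name)

    def unlock(self, name):
        self.held_locks.discard(name)

    def read_clock(self):
        return 42   # idempotent: nothing to undo


# Undo actions for the non-idempotent calls only.
COMPENSATE = {
    "alloc": lambda kernel, args, ret: kernel.free(ret),
    "lock": lambda kernel, args, ret: kernel.unlock(args[0]),
}


class KernelCallLog:
    def __init__(self, kernel):
        self.kernel = kernel
        self.entries = []

    def call(self, name, *args):
        ret = getattr(self.kernel, name)(*args)
        self.entries.append((name, args, ret))
        return ret

    def compensate(self):
        # Undo in reverse order; idempotent calls are skipped.
        for name, args, ret in reversed(self.entries):
            undo = COMPENSATE.get(name)
            if undo is not None:
                undo(self.kernel, args, ret)
        self.entries.clear()


kernel = ToyKernel()
log = KernelCallLog(kernel)
log.call("lock", "tx_lock")
buf = log.call("alloc", 256)
log.call("read_clock")
state_before_recovery = (set(kernel.allocated), set(kernel.held_locks))
log.compensate()    # the isolated function failed: roll kernel state back
```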

		 + Driver state
			--- make a copy using driver slicer (already handled)
			--- driver changes only propagated on commit
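
Sketch of commit-on-success for driver state. This is a toy model: `call_isolated` is a hypothetical helper, the driver state is a dict, and a Python exception stands in for a detected fault. The entry point runs on a copy; the copy is written back only if the call completes, so a failure leaves the original state as it was.

```python
import copy


def call_isolated(driver_state, entry_point, *args):
    working = copy.deepcopy(driver_state)   # copy made before the call
    try:
        result = entry_point(working, *args)
    except Exception:
        return False, None                  # fault: discard the copy
    driver_state.clear()
    driver_state.update(working)            # commit
    return True, result


def good_update(state, mtu):
    state["mtu"] = mtu
    return mtu


def faulty_update(state, mtu):
    state["mtu"] = mtu
    raise RuntimeError("injected fault")


driver = {"mtu": 1500}
first = call_isolated(driver, good_update, 9000)
second = call_isolated(driver, faulty_update, 1)
```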

		 + Concurrent driver threads
			--- Existing driver locks take care of synchronizing access to
			--- Need to clean up driver locks though
	
 		 + What about interrupts?
			--- USB state cannot be recovered in interrupt context

	     + Requirements from an isolation solution (if recovery is described before isolation)

	 + Trap calls using SFI, kernel traps and general protection fault

7- Evaluation

+ Performance
	-- Zero overhead during regular operation
	-- Performance overhead of isolated I/O 
+ Fault Tolerance
	-- Fault injection experiments
	-- Demonstrate isolation works
	-- Demonstrate recovery works
	-- Applicability (coverage) of solution

+ Recovery Time
	-- Time cost of isolation
	-- Time cost of anticipation
	-- Time cost of recovery

+ Developer overheads?
	- Ease of generating recovery functions
	- Number of annotations required is acceptable
	- Working with existing static analysis tools
		- Code patching tools
			-- stack guard
			-- heapify
			-- SFI
		- Bug detection tools
			-- Carburizer

	- No changes to kernel

8- Related Work (in progress)

- Revive I/O (buffers I/O with a pseudo driver)
	- cases of DMA etc (must read!)

- Recovery techniques
	-- Shadow drivers/Membrane/Recovery domains
- Isolation mechanisms

- Nooks, reference validation, LXFI, BGI
- SafeDrive
- SDV from Microsoft, built on SLAM
- Other static verifiers

9- Conclusion
- Make a case for a generic checkpoint/restore service in drivers
- Show it can be done with existing drivers and has low overhead