On-demand fault isolation
=========================

What is the broad question we are answering?
- How do we recover very quickly in drivers that have a diverse range of applications?
-- Can it be used for fault isolation, quick(er) suspend, switching devices in a VM environment?
-- We show the fault tolerance application of recovery by introducing on-demand fault tolerance
-- Use a combination of existing program partitioning and static analysis tools/techniques to provide on-demand fault tolerance

Paper:

1- Abstract
+ While research on driver isolation has progressed, there is little research on recovery from driver failures. We present results/answers to the following questions:
1. Is it possible to recover from driver failures at a finer granularity?
2. Is it possible to perform device recovery from within the driver without resetting or restarting the driver and device?
+ Proposed solution
(1) Improve recovery in drivers, where the current state of the art is restarting the whole driver
(2) Develop a driver isolation model that demonstrates the value of our recovery mechanism and uses existing static analysis tools to tolerate bugs
(3) Work with the existing set of drivers/operating systems
+ Brief results

2- Introduction
+ Modern computer systems
-- increasing reliability concerns from different classes of failures
-- many sophisticated tools, languages and hardware techniques handle failures at runtime and keep the system running
-- resilient systems require better recovery solutions
-- significant progress in driver isolation systems (trend towards finer-grained protection)
-- but they often fall back on shadow drivers for recovery, which requires resetting the device and driver (which is slow, as seen in shadow driver migration)
-- Why do we need better recovery?
--- Drivers contain significant initialization code and take time to initialize
--- problems with whole kernel extension recovery (can never capture state in a very generic manner and requires changes per driver)
--- cannot handle transient failures
-- we develop a recovery mechanism that performs quick recovery at low overhead and has many use cases
-- we develop an isolation model to complement the recovery

3- Motivation
+ Why do drivers/devices need to support fast recovery?
-- Failure recovery
-- I/O virtualization
-- Upgrade of drivers
+ Describe driver code constitution (mostly initialization code, multiple chipsets, slow probing)
-- describe time taken during probe
+ Biggest challenges
-- Device state
-- Device/driver specific nuances
+ Introduce on-demand fault tolerance = on-demand isolation + no-overhead recovery
-- Use wealth of static analysis tools to guide on-demand isolation

4- Design (perhaps the last bullet of the previous section overrides the need for this section?)
+ Picture of the whole system
+ Abstract about isolation
+ Specific about recovery piece

5- On-demand fault isolation
+ Why do we need isolation? We need an introduction here that would be in section 1 if the paper were more about fault tolerance than just recovery
+ Goals of isolation
-- Work with existing drivers
-- entry point isolation
+ How do we enforce isolation?
1. Split drivers
-- Use existing program partitioning techniques and static analysis tools to modify the driver at compile time
-- Convert the driver into two components: a regular driver and an SFI'd driver. (They can overlap, but that will require wrappers around entry points)
-- Use wealth of existing static analysis tools and developer annotations to mark functions with the __isolate__ attribute
-- Control is transferred between the two using on-demand marshaling (stubs/wrappers)
-- Object tracker?
2. How do we ensure safety?
-- The required data structures are marshaled in and stored in range hash tables
-- Range hash tables are used to provide SFI for reads/writes using weak type safety
-- No stack information, to prevent control being diverted by stack smashing
-- What do we get from a separate module? copy + marshaling = reliability by redundancy + reliability by structure. Difference from Nooks/BGI: they ensure reliability by either redundancy or structure; we do both. Redundancy helps us recover easily, while structure aids detection
+ Duality of isolation and recovery
-- How do we invoke recovery from isolation?
-- How do we detect failures?
-- Walkthrough of isolation + recovery

6- No-overhead recovery
+ State of the art in driver recovery is to restart the whole driver
+ Present no-overhead recovery
-- Ability to restore a running driver to a safe state back in time
-- Reuse suspend/resume in drivers to checkpoint device state and recover
-- How do we recover the driver/device/kernel state?
+ Device state
--- Describe suspend-resume background
--- use suspend/resume
---- device-specific functionality provided by the driver
---- mention quirks
--- do minimum work during suspend
---- how do we ensure that our system captures a consistent suspend snapshot?
--- how do we handle quiescing of running threads?
---- only wait during resume?
--- can we do better?
--- e1000 does take a lock during restore
+ Kernel state (calls into the kernel)
--- wrappers around calls into the kernel store args and return values
--- upon failure, compensate only the non-idempotent function calls
--- clean up locks
+ Driver state
--- make a copy using driver slicer (already handled)
--- driver changes only propagated on commit
+ Concurrent driver threads
--- Existing driver locks take care of synchronizing access to
--- Need to clean up driver locks though
+ What about interrupts?
--- USB state cannot be recovered in interrupt context
+ Requirements from an isolation solution (if recovery is described before isolation)
+ Trap calls using SFI, kernel traps and general protection faults

7- Evaluation
+ Performance
-- Zero overhead during regular operation
-- Performance overhead of isolated I/O
+ Fault Tolerance
-- Fault injection experiments
-- Demonstrate isolation works
-- Demonstrate recovery works
-- Applicability (coverage) of solution
+ Recovery Time
-- Time cost of isolation
-- Time cost of anticipation
-- Time cost of recovery
+ Developer overheads?
- Ease of generating recovery functions
- Number of annotations required is OK
- Working with existing static analysis tools
- Code patching tools
-- stack guard
-- heapify
-- SFI
- Bug detection tools
-- Carburizer
- No changes to kernel

8- Related Work (in progress)
- Revive I/O (buffers I/O with pseudo driver - cases of DMA etc.) (must read!)
- Recovery techniques
-- Shadow drivers/Membrane/Recovery domains
- Isolation mechanisms
- Nooks, reference validation, LXFI, BGI
- SafeDrive
- SDV from Microsoft, built on SLAM
- Other static verifiers

9- Conclusion
- Make a case for a generic checkpoint/restore service in drivers
- Show it can be done with existing drivers and has low overhead
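
Sketch for the "kernel state" item in section 6 (wrappers that record calls into the kernel, then compensate only the non-idempotent ones on failure). This is a minimal user-space illustration in C under stated assumptions, not driver code: every name here (`call_log`, `log_call`, `log_rollback`, `wrapped_alloc`, `wrapped_lock`, `fake_lock`) is hypothetical, `malloc`/`free` stand in for a kernel allocator, and a flag stands in for a kernel lock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CALL_LOG_CAP 64

/* Compensation action for one recorded call (hypothetical type). */
typedef void (*undo_fn)(void *arg);

struct call_record {
    undo_fn undo; /* NULL marks an idempotent call: nothing to compensate */
    void *arg;    /* argument or return value the compensation needs */
};

struct call_log {
    struct call_record rec[CALL_LOG_CAP];
    size_t count;
};

/* Start a fresh log at an isolated entry point. */
void log_begin(struct call_log *cl)
{
    assert(cl != NULL);
    cl->count = 0;
}

/* Record one call into the "kernel". Returns 0, or -1 if the log is full. */
int log_call(struct call_log *cl, undo_fn undo, void *arg)
{
    assert(cl != NULL);
    if (cl->count >= CALL_LOG_CAP)
        return -1;
    cl->rec[cl->count].undo = undo;
    cl->rec[cl->count].arg = arg;
    cl->count++;
    return 0;
}

/* Success path: the effects are kept, so the log is simply discarded. */
void log_commit(struct call_log *cl)
{
    assert(cl != NULL);
    cl->count = 0;
}

/* Failure path: walk the log newest-first and compensate only the
 * non-idempotent calls. Returns how many calls were compensated. */
size_t log_rollback(struct call_log *cl)
{
    size_t undone = 0;
    assert(cl != NULL);
    while (cl->count > 0) {
        struct call_record *r = &cl->rec[--cl->count];
        if (r->undo != NULL) {
            r->undo(r->arg);
            undone++;
        }
    }
    return undone;
}

/* Wrapper around an allocation; the compensation frees the memory. */
static void undo_free(void *p) { free(p); }

void *wrapped_alloc(struct call_log *cl, size_t n)
{
    void *p = malloc(n);
    if (p != NULL && log_call(cl, undo_free, p) != 0) {
        free(p); /* could not record the call, so do not leak it */
        return NULL;
    }
    return p;
}

/* Wrapper around a lock acquire; the compensation releases the lock,
 * which corresponds to the "clean up locks" item. */
struct fake_lock { int held; };

static void undo_unlock(void *l) { ((struct fake_lock *)l)->held = 0; }

int wrapped_lock(struct call_log *cl, struct fake_lock *l)
{
    if (log_call(cl, undo_unlock, l) != 0)
        return -1;
    l->held = 1;
    return 0;
}
```

Usage idea: call `log_begin` on entry to an isolated function, route kernel calls through the wrappers, then call `log_commit` on normal return or `log_rollback` when a fault is detected. Rolling back newest-first is a design choice in this sketch so that later calls are undone before the earlier calls they may depend on.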