- Does it require reading individual drivers?
- Performance hit?
Notes from reviews:
i. Improve reliability by tolerating dominant cause of failure
1. Don’t bother making everything reliable
2. Try to make it integrate well with existing OS
3. Make it compatible with existing drivers / OS / applications
ii. Key pieces
1. Isolation / fault containment: prevent a driver from corrupting the OS or applications
2. Recovery: get driver running again after a failure
i. Lightweight kernel protection domains
ii. Prevent driver from writing to OS
iii. Allow writes to driver-private data
iv. XPC (extension procedure call): invoke code in another domain
i. Inject code transparently
ii. Like VMM – but boundary is kernel/driver
iii. Done at load time, not compile time
1. Note: can choose where to put it!
iv. Wrappers on driver/kernel interface
1. Recompile driver because binary interface changes (macros -> functions)
2. Pretty much no code changes to drivers
vi. QUESTION: What happens when modules invoke other modules?
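A minimal sketch of the interposition idea in the notes above: wrappers are bound in when the driver is loaded, so the driver source does not change. This is an illustration only, not the paper's implementation. All names here (`KERNEL_EXPORTS`, `make_wrapper`, `load_driver`, `VALIDATORS`, the toy `raw_kmalloc`) are hypothetical, and a real system would also switch protection domains inside the wrapper.

```python
class BadParameter(Exception):
    """Raised when a wrapper rejects a driver-supplied argument."""


def raw_kmalloc(size):
    # Stand-in for a kernel allocator; returns a zeroed buffer.
    return bytearray(size)


# Hypothetical table of functions the kernel exports to drivers.
KERNEL_EXPORTS = {"kmalloc": raw_kmalloc}

# Hypothetical per-function parameter checks.
VALIDATORS = {"kmalloc": lambda size: isinstance(size, int) and 0 < size <= 4096}


def make_wrapper(kernel_fn, is_valid):
    """Return a function that checks arguments before calling the kernel."""
    def wrapper(*args):
        if not is_valid(*args):
            raise BadParameter(f"{kernel_fn.__name__}{args}")
        # A real wrapper would switch to the kernel's domain here (XPC).
        return kernel_fn(*args)
    return wrapper


def load_driver(driver_init, validators):
    """Bind the driver's imports to wrapped functions at load time."""
    symbols = {
        name: make_wrapper(fn, validators[name])
        for name, fn in KERNEL_EXPORTS.items()
    }
    return driver_init(symbols)


def good_driver(symbols):
    return symbols["kmalloc"](64)


def buggy_driver(symbols):
    return symbols["kmalloc"](-1)  # bad parameter, caught by the wrapper
```

Because the binding happens in `load_driver`, the same driver code could be loaded with or without wrappers, which is the "can choose where to put it" point.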
i. Allow safe-sharing
ii. Validate shared parameters
iii. Map between kernel and driver-private data
iv. QUESTION: What happens on a multi-processor?
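The object-tracking notes above (safe sharing, mapping between kernel and driver-private data) can be sketched roughly as follows. This is a toy model under stated assumptions: kernel objects are plain dicts, the class `ObjectTracker` and its method names are hypothetical, and concurrency (the multiprocessor question) is ignored entirely.

```python
class ObjectTracker:
    """Maps kernel objects to driver-private copies, per driver."""

    def __init__(self):
        # driver name -> {id(kernel_obj): (kernel_obj, private_copy)}
        self._by_driver = {}

    def to_driver(self, driver, kernel_obj):
        """Return the driver's private copy, creating it on first use."""
        table = self._by_driver.setdefault(driver, {})
        key = id(kernel_obj)
        if key not in table:
            table[key] = (kernel_obj, dict(kernel_obj))
        return table[key][1]

    def sync_to_kernel(self, driver, kernel_obj, allowed_fields):
        """Copy back only the fields the driver is allowed to change."""
        _, private = self._by_driver[driver][id(kernel_obj)]
        for field in allowed_fields:
            if field in private:
                kernel_obj[field] = private[field]

    def release_all(self, driver):
        """Forget everything a driver held; returns the kernel objects."""
        entries = self._by_driver.pop(driver, {})
        return [kernel_obj for kernel_obj, _ in entries.values()]
```

The driver only ever writes to its private copy; the kernel's object changes only through `sync_to_kernel`, which is where shared parameters would be validated. `release_all` is what makes cleanup possible without the driver's cooperation.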
i. Bad parameter
ii. Excessive resource consumption
ii. Software agent
i. Unload driver
i. Unload completely
ii. Protection domains and object tracking allow a complete unload without the driver's help
1. Analogous to how the OS can clean up after a process
iii. Restart driver
1. Needs user-level knowledge of how to restart
2. Issues: where does configuration data come from?
a. Solved in shadow drivers
i. Restart, then replay the log to bring the driver forward to its state at the crash
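The restart-and-replay idea above can be illustrated with a small sketch. It assumes the only driver state worth restoring is a set of configuration calls; the classes `Driver` and `RecoveryAgent` are hypothetical and are not taken from the shadow drivers work itself.

```python
class Driver:
    """Toy driver whose state is built up by configuration calls."""

    def __init__(self):
        self.state = {}

    def configure(self, key, value):
        self.state[key] = value


class RecoveryAgent:
    """Sits between the kernel and the driver and logs configuration."""

    def __init__(self, factory):
        self._factory = factory   # how to create a fresh driver instance
        self._log = []            # configuration calls seen so far
        self.driver = factory()

    def configure(self, key, value):
        # Record the call before forwarding it, so it can be replayed.
        self._log.append((key, value))
        self.driver.configure(key, value)

    def recover(self):
        """Restart the driver and replay the log to rebuild its state."""
        self.driver = self._factory()
        for key, value in self._log:
            self.driver.configure(key, value)
```

This answers the "where does configuration data come from?" question for the toy case: it comes from the log kept outside the driver, not from user-level knowledge of how to restart it.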
i. Where does it come from?
1. New code in system
b. Object tracking
c. Domain change (change page table)
2. Existing code running slower
a. More TLB misses
b. More cache misses due to copying
i. Are drivers fail stop?
1. What if driver writes bad data to device?
ii. Are driver failures heisenbugs?
iii. Can we virtualize this interface? Is it too ugly?
i. Pointing out that drivers are the problem
ii. Pointing out that compatible driver isolation is possible
iii. Pointing out that driver isolation can have reasonable performance
iv. Pointing out the importance of recovery
i. QUESTION: is this a good technique?
ii. QUESTION: What do we learn from these results?
1. Nooks stopped the faults we injected
iii. What are the limitations?
1. How realistic are faults?
a. Didn’t wait a long time for faults to have an effect
2. How realistic is the fault distribution?
a. Uniform distribution across fault types
3. How realistic was recovery?
a. Reloaded the same code without the injected faults
i. Need to show speedup and CPU utilization separately
ii. Otherwise the CPU increase is masked in tests that are not CPU-bound
iii. QUESTION: what about multiple drivers at once?
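A small worked example of why throughput alone can hide the overhead. The numbers are invented for illustration and are not results from the paper; the model (throughput is the lower of the link rate and what the CPU can sustain) and the function name `run` are assumptions.

```python
def run(link_mbps, cpu_sec_per_mb):
    """Return (throughput in Mb/s, CPU utilization as a fraction)."""
    cpu_bound_mbps = 1.0 / cpu_sec_per_mb      # most the CPU can push
    throughput = min(link_mbps, cpu_bound_mbps)
    utilization = throughput * cpu_sec_per_mb
    return throughput, utilization


# Hypothetical costs: isolation doubles the CPU time per megabit.
NATIVE_COST = 0.002
ISOLATED_COST = 0.004

# Link-bound test: same throughput, but CPU utilization doubles.
slow_native = run(100, NATIVE_COST)
slow_isolated = run(100, ISOLATED_COST)

# CPU-bound test: the same overhead now halves throughput.
fast_native = run(1000, NATIVE_COST)
fast_isolated = run(1000, ISOLATED_COST)
```

In the link-bound case the relative throughput is 1.0 even though the isolated run uses twice the CPU, so reporting only speedup would show no cost at all.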
i. 22,000 lines of code. Is this a lot or a little?
i. Could have written paper as “How to make Linux device drivers execute reliably”
1. Talk about changes to Linux data structures
ii. Instead, presented as:
a. Generic approach, not many choices
b. E.g., could use virtual machines, software fault isolation, or Java
a. Specific set of choices, specific OS, specific isolation technique
iii. Makes paper more general, stronger