Improving the Reliability of
Commodity Operating Systems
nook (nk)
n.
- A small corner, alcove, or recess, especially one in
a large room.
- A hidden or secluded spot.
- A lightweight kernel
protection domain for preventing device drivers from crashing an
operating system (new).
Overview
Despite decades of
research in extensible operating
system technology, extensions such as device drivers remain a
significant cause of system failures. In Windows XP, for example,
drivers account for 85% of recently reported failures.
Nooks is a reliability subsystem that seeks to greatly enhance
OS reliability by isolating the OS from driver failures. The Nooks
approach is practical: rather than guaranteeing complete fault
tolerance through a new (and incompatible) OS or driver architecture,
our goal is to prevent the vast majority of driver-caused
crashes with little or no change to existing driver and system
code. To achieve this, Nooks isolates drivers within lightweight
protection domains inside the kernel address space, where hardware and
software prevent them from corrupting the kernel. Nooks also tracks a
driver's use of kernel resources to hasten automatic clean-up during
recovery.
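The resource-tracking idea can be illustrated with a small user-level sketch. This is not Nooks code: the `ResourceTracker` class, its method names, and the driver and resource names are all hypothetical, and a real implementation would live inside the kernel. It only shows why recording each allocation against its driver makes clean-up after a failure straightforward.

```python
from collections import defaultdict


class ResourceTracker:
    """Toy tracker (hypothetical): records what each driver holds so it can be released."""

    def __init__(self):
        # driver name -> list of (resource description, release callback)
        self._held = defaultdict(list)

    def acquire(self, driver, resource, release):
        """Record that `driver` now holds `resource`; `release` undoes the allocation."""
        self._held[driver].append((resource, release))

    def cleanup(self, driver):
        """Release everything the driver still holds, newest first; return what was freed."""
        freed = []
        for resource, release in reversed(self._held.pop(driver, [])):
            release()
            freed.append(resource)
        return freed


# Illustrative use: the driver and resource names are made up for the example.
tracker = ResourceTracker()
released = []
tracker.acquire("e1000", "irq 11", lambda: released.append("irq 11"))
tracker.acquire("e1000", "dma buffer", lambda: released.append("dma buffer"))
freed = tracker.cleanup("e1000")  # called when the driver is found to have failed
```

Releasing in reverse order of acquisition is a design choice of this sketch, made so that later allocations that depend on earlier ones are undone first.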
More recently, we
have extended Nooks with shadow drivers. A shadow driver is
a kernel agent that (1) conceals a driver failure from its clients,
including the operating system and applications, and (2) transparently
restores the driver back to a functioning state. In this way,
applications and the operating system are unaware that the driver
failed, and hence continue executing correctly themselves.
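The two roles of a shadow driver can be sketched at user level as follows. This is an illustration under stated assumptions, not the Nooks implementation: `ToyDriver`, `ShadowDriver`, and `DriverFailure` are hypothetical names, and the recovery strategy shown (log configuration calls, restart the driver, replay the log, retry the failed request) is one plausible way to realize the behavior described above.

```python
class DriverFailure(Exception):
    """Raised by the toy driver to simulate a crash."""


class ToyDriver:
    """Hypothetical driver with one configuration call and one data call."""

    def __init__(self):
        self.config = {}
        self.broken = False

    def configure(self, key, value):
        self.config[key] = value

    def send(self, data):
        if self.broken:
            raise DriverFailure("driver crashed")
        return f"sent {data} at {self.config.get('speed')}"


class ShadowDriver:
    """Sits between clients and the driver; logs configuration calls and replays them."""

    def __init__(self, make_driver):
        self._make_driver = make_driver
        self._driver = make_driver()
        self._log = []  # configuration calls to replay after a restart
        self.recoveries = 0

    def configure(self, key, value):
        self._log.append((key, value))
        self._driver.configure(key, value)

    def send(self, data):
        try:
            return self._driver.send(data)
        except DriverFailure:
            # (1) Conceal the failure: the client never sees the exception.
            self._recover()
            return self._driver.send(data)  # retry once on the fresh instance

    def _recover(self):
        # (2) Restore the driver: start a new instance and replay its configuration.
        self._driver = self._make_driver()
        for key, value in self._log:
            self._driver.configure(key, value)
        self.recoveries += 1


shadow = ShadowDriver(ToyDriver)
shadow.configure("speed", "100Mb")
shadow._driver.broken = True  # simulate a fault in the running driver
result = shadow.send("packet")
```

From the client's point of view the call to `send` simply succeeds; only the `recoveries` counter reveals that the driver was restarted.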
People
Faculty
Hank Levy
Brian Bershad
Graduate Students
Mike Swift
Muthu Annamalai
Brian Milnes
Leo Shum
Undergraduate Students
Micah Brodsky
Eric Kochhar
Jordan Hom
Doug Buxton
Steve Martin
Exchange Students
Christophe Augier
Damien Martin-Guillerez
Publications
- Michael M. Swift, Muthukaruppan Annamalai, Brian N. Bershad, Henry M. Levy. Recovering
Device Drivers (pdf,
550k), In ACM Transactions on
Computer Systems, 24(4), Nov. 2006.
- Michael Swift. Improving
the Reliability of Commodity Operating Systems, Ph.D. Dissertation, Oct. 2005.
- Michael M. Swift, Muthukaruppan Annamalai, Brian N. Bershad, Henry M. Levy. Recovering
Device Drivers (pdf,
550k), In Proceedings of
the 6th ACM/USENIX Symposium on Operating Systems Design and
Implementation, San Francisco, CA, Dec. 2004.
- Michael M. Swift, Brian N. Bershad, and Henry
M. Levy. Improving
the Reliability of Commodity Operating Systems (pdf,
400k), to appear in ACM Transactions on Computer Systems,
23(1), Feb. 2005.
- Michael M. Swift, Brian N. Bershad, and Henry
M. Levy. Improving
the Reliability of Commodity Operating Systems (pdf,
300k), in Proceedings
of the 19th ACM Symposium on Operating Systems Principles, Bolton
Landing,
NY, Oct. 2003. Best paper award.
- Michael M. Swift, Steven Martin, Henry M. Levy, and Susan J. Eggers.
Nooks: an architecture for reliable
device drivers (pdf, 162k), in Proceedings
of the Tenth ACM SIGOPS European Workshop, Saint-Emilion, France,
Sept. 2002.
Presentations
- Recovering Device Drivers, or
Cleaning Up Nooks talk in UW class CSE551: Graduate Operating
Systems (pdf)
- Shadow Drivers: Transparent
Recovery for Kernel Extensions poster at UW
Industrial Affiliates 2004 (pdf)
- Improving the Reliability of
Commodity Operating Systems talk at SOSP 2003, October 2003 (pdf)
- Nooks poster at UW
Industrial Affiliates 2003 (pdf)
- Nooks: an architecture for
reliable device drivers talk at ACM SIGOPS workshop, September
2002 (ppt)
- Nooks: an architecture for reliable device
drivers talk at UW Networking and Systems Retreat, June
2002 (ppt)
Lessons Learned
- How to profile the Linux kernel: use kernprof, which provides call graphs
but may cause double faults when used with Nooks, or oprofile, which does
interrupt-based sampling.
- How to measure microarchitectural events on the
Pentium 4 with Linux: use Brink-Abyss,
but be sure to set the duration knob long enough to complete your
experiment.
- What architectural state must be writable for a
task gate to execute in Ring 0 on the Pentium: the GDT and the TSS of the
current task must be mapped writable in the current page table, so that
they can be updated before the switch to the new task.
Software Downloads
- Software Fault Injection for Linux. This patch applies to the
Linux 2.4.18 kernel and requires the
kdb-v2.1.2-2.4.18-common-1 and kdb-v2.1-2.4.18-i386.1 patches. It is
enabled by turning on kernel debugging and enabling the "Load all
symbols for debugging" and "Automatically inject faults" options when
configuring the kernel.
This tool was created by porting the fault injector used in the Rio file cache
project and is described in the paper The Systematic
Improvement of Fault Tolerance in the Rio File Cache.
- Nooks source code for the Linux 2.4.18 kernel. This tarball
includes a patch against the Linux 2.4.18 kernel (available from
kernel.org). We have tested our kernel with RedHat Linux 7.3, RedHat Linux 9, and
RedHat Enterprise Linux 3. It is known not to work with the Fedora Core 2
distribution.
Also included is a suite of user-mode tools:
- nooks-agent: the recovery agent, to be placed in /sbin
- systest: the test tool for manually creating and
manipulating nooks
- modutils: new module loading tools to replace insmod and
modprobe
Nooks source
Grant Support
- This work is supported in part by NSF grant CCR-0326546.
Driver Reliability Links
Last modified 9/30/2004