
Improving the Reliability of Commodity Operating Systems

Improving the Reliability of Commodity Operating Systems. Michael M. Swift, Brian N. Bershad, and Henry M. Levy. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, Oct. 2003.

Quick review due Tuesday, 4/19.

Comments

Summary: The paper presents Nooks, a reliability subsystem that aims to improve OS reliability by isolating the OS from driver failures. It emphasizes preventing the majority of driver-caused crashes with minimal change to existing driver and system code, rather than guaranteeing complete fault tolerance through a new OS or driver architecture, which can lead to issues such as incompatibility. Nooks runs drivers in lightweight protection domains inside the kernel address space, which prevents them from corrupting the kernel. It ensures automatic clean-up of kernel resources by tracking each driver’s use of them. The authors evaluate Nooks by implementing it in Linux. Their results show that Nooks considerably increases OS reliability by detecting and recovering from faults, thereby preventing system crashes. The empirical data showed a 99% recovery rate from faults that would otherwise have crashed Linux. A plus point for Nooks is that, even though it is designed for drivers, it can be extended to other kernel extensions. The authors demonstrate this by using it to isolate a file system and an in-kernel Internet service. Finally, because it runs on commodity systems and supports existing C-language extensions, it fares better than specialized legacy architectures and type-safe languages.
Confusion: 1. Impact on memory of the bookkeeping for protection domains, threads, kernel state, etc. 2. Control flow between the various components.

1. Summary
The authors attempt to isolate kernel extensions such as device drivers from the kernel in order to achieve fault resistance. Nooks runs extensions within lightweight protection domains inside the kernel address space and is backward compatible, requiring few code changes to drivers or the kernel. Evaluations with multiple device-driver workloads show that 99% of the crashes that occur under native Linux were contained. The Extension Procedure Call (XPC) mechanism, which is similar to LRPC, carries all extension-to-kernel and kernel-to-extension control flow. For data flow, Nooks' object-tracking mechanism is used to avoid DMA misuse. Recovery functions detect and recover from faults, for example when the processor raises an exception or an extension improperly invokes a kernel function. Wrappers are used for accessing all kernel structures. Overall, the solution provides isolation with compatibility and transparency for commodity systems.

2. Question
Why is it hard to prevent infinite loops inside an extension?

Summary

This paper describes Nooks, a new operating system subsystem that aims to enhance OS reliability by isolating the OS from driver failures. The authors first motivate the work by showing that extensions are the major contributors to failures. They then describe the main goals of Nooks (isolation, recovery and backward compatibility) and the functions of the subsystem: isolation, interposition, object tracking and recovery. Lastly, the authors evaluate the proposed solution by implementing it in Linux, and the initial results seem promising.

Confusion

The paper does not talk about how simultaneous access to the same kernel object by two extensions is handled. Is this a rare event? Also, the paper does not quantify the additional memory overhead incurred by the subsystem. A discussion on the same would be interesting.

1. Summary
The paper presents the architecture, performance and implementation of Nooks, an OS subsystem that allows extensions such as device drivers and file systems to execute safely in commodity OS kernels. These extensions remain a significant cause of OS failures. Nooks is a reliability subsystem that improves OS reliability by isolating the kernel from drivers using protection domains and by tracking their use of kernel resources. It requires minimal or almost no change to driver and system code, focusing heavily on backward compatibility. The Nooks Isolation Manager is added as a separate layer between the kernel and the extensions, handling isolation, interposition, object tracking and recovery. Nooks recovered from 99% of the extension faults that cause Linux to crash, with a moderate performance overhead varying from 10% to 60%.

5. Confusion
What will be the overhead in terms of application latency of making an XPC call every time control passes between the kernel and a driver? For object tracking, what is the policy for deciding whether a shadow copy of data should be present in the driver's protection domain?

This paper describes Nooks, a new reliability layer intended to significantly reduce extension-related failures. Its four key features are isolation, interposition, object tracking and recovery. It sacrifices complete fault tolerance and isolation for compatibility and transparency with the existing kernel and extensions. It is intended to prevent system crashes caused by faulty extensions; it is not designed to detect erroneous extension behavior.
Reliability was tested with synthetic fault injection, using a fault injector originally built for use with the Rio File Cache. Performance was also measured on a bare Linux machine.

Confusion:

As mentioned in the paper, Nooks does not provide complete fault tolerance because it avoids changing existing driver and system code. I am not clear on how the other faults arise and why Nooks can't fix them.
What is the memory overhead of this design, given that the isolation manager does a lot of tracking?

Summary
In this paper the authors present Nooks- a reliability subsystem that significantly reduces the device driver failures. It is designed as a transparent layer between the OS kernel and the extensions, which uses hardware and software techniques to isolate drivers within lightweight protection domain inside kernel address space, transparently integrate extensions through interposition mechanisms of XPC calls for control and data flow and wrapper stubs as interfaces for each call, maintain kernel structures information passed by the extensions in the object-tracker, and finally recover from software and hardware faults by returning error or triggering recovery. The results show that Nooks recovered from 99% of the faults, without any change in driver code.
Confusion
Providing isolation required the use of a global variable to hold the task pointer on uniprocessors. Could you explain the issue and the solution? The call and task flow between the recovery manager and the agent was a little confusing.

1. Summary
This paper describes Nooks, a system that uses new address spaces for kernel extensions and a mechanism similar to remote procedure call in order to provide better isolation between drivers and the main kernel.

4. Confusion
Could you go into more detail on how the control flow passes from kernel to driver?

Summary
Prior research has shown that the majority of OS crashes in Windows are induced by faulty drivers, so isolating failures in driver code should significantly improve operating system reliability. The authors propose Nooks, a system which wraps the kernel/driver interface in a specialized RPC mechanism, allowing the driver to operate on isolated copies of kernel data structures and permitting the identification of and recovery from driver errors.

Confusion
I'm still a bit confused on how Nooks passes complex data structures like trees between user and kernel space.

Summary

In this paper, the authors describe Nooks, a reliability subsystem that seeks to enhance OS reliability by isolating the OS from driver failures. It runs in a lightweight kernel protection domain thereby providing isolation, fault resistance and backward compatibility. It describes a practical approach as there is little or no change to the existing code. It supports existing C-language extensions, runs on a commodity operating system and hardware and enables automated recovery. Nooks was able to automatically recover from 99% of faults in fault-injection tests.

Confusion

What are memory overheads and how are different accesses synchronized?

Summary:
This paper introduces Nooks, a subsystem that improves reliability in commodity operating systems through isolation of the kernel extensions from the kernel. Typically kernel extensions such as device drivers account for most of the crashes. So the need for improving reliability arises, but the goal of this subsystem is not fault tolerance but fault resistance (reducing crashes) along with backward compatibility and minimal code changes. In order to isolate the kernel extensions, a lightweight protection domain with restricted write access to kernel memory is where extensions are executed. Transparent reliability layer between OS and kernel extensions. Layer handles and recovers from h/w and s/w faults. Object tracking of kernel data structures and accesses. Prototype implemented in the linux kernel. Recovered from 99% of the faults. Introduction of Nooks does impact the overall performance of the system due to addition of a new layer.

Confusion:
Impact of additional memory overhead on the system(object tracking and maintaining protection domains)? How are driver failures handled now?

Summary
This paper describes the design, implementation and performance of a reliability subsystem that aims to prevent OS crashes due to driver failures by providing isolation based on virtual memory and hardware protection, placing device drivers in lightweight protection domains in the kernel address space. Each device driver has its own copy of private data structures and has read-only access to shared kernel data structures; writes to kernel structures are carefully checked by interposing on them using XPC.

Confusion
What is the performance implication as the number of isolated drivers increases, given that this increases the number of page tables to be kept in sync?

1. Summary
This paper talks about "Nooks", a reliability subsystem that tries to protect the OS from driver failures. The goal is to prevent the vast majority of driver-caused failures with little or no change to existing driver or system code. It isolates drivers within lightweight protection domains inside the kernel address space, where hardware and software prevent them from corrupting the kernel. The results in the paper show a substantial increase in operating system reliability, with Nooks catching and quickly recovering from many faults that would have crashed the system. Nooks was implemented in the Linux kernel, and in an evaluation with 2000 fault injections Nooks recovered from 99% of the faults.

2. Confusion
How does Nooks perform in time-critical systems? Does the added layer have any major impact on performance?

Summary
This paper is about the design and implementation of Nooks, a subsystem for enhancing the reliability of commodity operating systems by isolating the execution of OS extensions from the kernel. The goals of the system are fault resistance and recovery, which are achieved by running extensions in lightweight protection domains to contain faults, using Extension Procedure Calls (XPC) and wrappers as stubs for communication between extension and kernel domains, and tracking kernel objects in each extension domain for recovery. The authors implement Nooks on top of the Linux 2.4.18 kernel on x86. Along with device drivers, Nooks provides isolation for VFAT, a kernel-mode file system, and kHTTPd, an application-specific kernel extension. In fault-injection experiments, Nooks eliminates 99% of system crashes and 60% of non-fatal extension failures.

Confusion
If there are shared kernel objects/memory pages, how are they tracked using the object-tracking mechanism?

Summary
This paper proposes an architectural design to minimize the frequency of system crashes caused by bugs in kernel extension code, which account for most system failures. The authors implemented Nooks, a subsystem layer that sits between the kernel and extensions, providing isolation and controlled communication between them by putting extension code in a separate protection domain with the same level of privilege. The key achievement of their design is a subsystem that successfully detects or recovers from extension errors without significant modification to either kernel or extension code. They used fault-injection tests to evaluate Nooks on a Linux machine, and the results showed that the Nooks layer was able to recover from 99% of the injected faults.
Confusion
How does Nooks support synchronization of multiple extensions accessing the same system resource?

Summary
This paper presents the design and implementation of a reliability subsystem called Nooks that aims to reduce crashes caused by drivers and other extensions in commodity operating systems. This is done by introducing a new layer between the kernel and device drivers that helps isolate the kernel from driver/extension failures by running each driver in a separate lightweight protection domain and recovering from some of the detected driver failures before they affect the rest of the kernel. The new subsystem is just an addition to commodity OSes and hence is backward compatible.

Confusion
- Memory overhead for the system ?
- Didn't really get the benefits in using deferred XPC.

Aim :
Isolation - Limit scope of corruption.
Recovery - Avoid loss of application state.
Avoid disruption – maintain compatibility – avoid API / architecture changes

Idea :
Isolate kernel extensions (drivers) within lightweight protection domains inside kernel address space.
Ie, drivers run within a sandbox (virtual memory)

On fault :
s/w fault – policy chooses between recovery and returning ctrl to extension with error code.
h/w fault – recovery
Default Recovery = unloading and reloading extension
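
A minimal sketch of the fault-handling policy in the notes above, written in Python for readability. All names here (`FaultKind`, `handle_fault`, `return_error_on_software_fault`, `demo_driver`) are hypothetical illustrations, not identifiers from Nooks itself.

```python
from enum import Enum


class FaultKind(Enum):
    SOFTWARE = "software"  # e.g. a wrapper rejects an improper kernel call
    HARDWARE = "hardware"  # e.g. the processor raises an exception


def default_recovery(extension):
    """Default recovery as described in the notes: unload, then reload."""
    return [f"unload {extension}", f"reload {extension}"]


def handle_fault(extension, kind, return_error_on_software_fault=False):
    """Return the list of actions taken for a fault (illustrative only)."""
    # For software faults, a policy picks between recovery and handing
    # control back to the extension with an error code.
    if kind is FaultKind.SOFTWARE and return_error_on_software_fault:
        return ["return error code to extension"]
    # Hardware faults, and software faults under the other policy, recover.
    return default_recovery(extension)


actions = handle_fault("demo_driver", FaultKind.HARDWARE)
```

The policy flag only matters for software faults; a hardware fault always goes to recovery in this sketch.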

Mechanism :
Reliability layer – Nooks Isolation Manager – between kernel and extensions.
XPC – control transfer between asymmetric but trusted domains.
Interface – wrapper stubs. Call by value result semantics
Object tracking – controls kernel data access by extensions.
Kernel mode execution, but has restricted write access to kernel data.
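
The call-by-value-result semantics of the wrapper stubs can be sketched roughly as follows. This is a Python toy model under stated assumptions: the kernel object is a plain dict, and `call_with_value_result`, `writable_fields` and `buggy_extension` are hypothetical names chosen for illustration, not part of Nooks.

```python
import copy


def call_with_value_result(extension_fn, kernel_obj, writable_fields):
    """Run extension_fn on a private copy, then copy back permitted fields."""
    # Copy in: the extension only ever sees its own copy of the object.
    shadow = copy.deepcopy(kernel_obj)
    result = extension_fn(shadow)
    # Copy out: only fields the extension may legitimately change are
    # propagated back to the kernel's version of the object.
    for field in writable_fields:
        kernel_obj[field] = shadow[field]
    return result


def buggy_extension(dev):
    dev["rx_packets"] += 1   # legitimate update
    dev["flags"] = 0xDEAD    # stray write that should not reach the kernel
    return 0


kernel_dev = {"rx_packets": 0, "flags": 1}
status = call_with_value_result(buggy_extension, kernel_dev, ["rx_packets"])
```

After the call, the legitimate counter update is visible to the kernel while the stray write to `flags` stays confined to the discarded copy.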

Lightweight Kernel Protection Domains :
Separate copy of the kernel page table for each domain
Each PD has private structures, such as a local heap, memory mapped i/o regions, kernel memory buffers, stack pool.
Changing PD – TLB flush
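
As a rough illustration of the points above, each protection domain can be modeled as holding its own copy of the kernel page table in which kernel pages are read-only and only the domain's private pages are writable. The Python below is a toy model; `ProtectionDomain`, `ProtectionFault`, `map_private` and the page addresses are all made-up names and values for the example.

```python
class ProtectionFault(Exception):
    """Raised when a domain writes to a page it may only read."""


class ProtectionDomain:
    def __init__(self, name, kernel_pages):
        self.name = name
        # Private copy of the kernel page table: same pages, but every
        # kernel page is downgraded to read-only inside this domain.
        self.page_table = {page: "r" for page in kernel_pages}

    def map_private(self, page):
        # Private structures (heap, stacks, buffers) are read-write.
        self.page_table[page] = "rw"

    def write(self, memory, page, value):
        if self.page_table.get(page) != "rw":
            raise ProtectionFault(f"{self.name}: write to page {page:#x}")
        memory[page] = value


memory = {0x1000: "kernel data", 0x9000: None}
domain = ProtectionDomain("demo_driver", kernel_pages=[0x1000])
domain.map_private(0x9000)
domain.write(memory, 0x9000, "driver data")  # allowed: private page

try:
    domain.write(memory, 0x1000, "corrupted")  # blocked: kernel page
    blocked = False
except ProtectionFault:
    blocked = True
```

In the real system the check is done by the MMU rather than in software, which is why switching domains involves changing page tables and flushing the TLB.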

Feasibility :
Few kernel functions are at perf-critical points

Questions :
1. In protection domains, explanations on what necessitated each facility provided would be great.
2. Contrast with other protection domain approaches.

Summary
Kernel extensions such as device drivers are a major source of failure for modern operating systems.
The paper describes Nooks, a reliability subsystem that aims to improve the reliability of commodity operating systems by isolating the OS from driver failures. Nooks was designed for fault resistance, not fault tolerance: rather than guaranteeing complete fault tolerance through a new (and incompatible) OS or driver architecture, the goal is to prevent the vast majority of driver-caused crashes with little or no change to existing driver and system code, preserving backward compatibility. The key idea of Nooks is to execute each extension in a lightweight kernel protection domain – a privileged kernel-mode environment with restricted write access to kernel memory. Nooks achieves this by introducing a new reliability layer between the extensions and the OS kernel which intercepts all the interactions between the extension and the kernel. It uses hardware and software techniques to isolate kernel extensions, trapping many common faults and permitting automated extension recovery. Nooks also tracks a driver’s use of kernel resources to facilitate automatic clean-up during recovery. The authors implemented a prototype of Nooks in the Linux operating system with modest engineering effort and experimented with a variety of kernel extension types, including several device drivers, a file system, and a kernel Web server. Using automatic fault injection, it was shown that when injecting synthetic bugs into extensions, Nooks can gracefully recover and restart the extension in 99% of the cases that cause Linux to crash. In addition, Nooks recovered from all of the common causes of kernel crashes that were manually inserted.
Confusion
What is the synchronization that is being talked about in the context of wrappers?

Summary:
The paper presents Nooks, a reliability subsystem for commodity operating systems. Nooks lets OS extensions execute within the kernel address space, albeit in a protection domain with controlled write permissions. Nooks provides fault resistance using a combination of Extension Procedure Calls (XPCs) for transferring control between domains and wrapper stubs to provide transparency and backward compatibility. Nooks also provides fault recovery mechanisms to minimize the impact of extension faults on the system.

Confusion:
How does this approach compare to the approach of moving device drivers to the user space? That approach would address isolation and reduce the complexity of kernel code as well.

Summary
This paper talks about Nooks, a framework for isolating and protecting against errors in kernel extensions. Unreliable and buggy device drivers/extensions are the most frequent cause of kernel crashes. Nooks departs from conventional approaches to this problem in terms of providing transparency and portability. Central to its design is the idea of running each extension in a lightweight protection domain, with read-only access to most kernel data structures. Secondly, all standard interface calls are interposed on using wrapper stubs that check and copy/synchronize arguments across protection domains. Nooks uses an object tracker in order to correctly free resources during a recovery operation. Extensive random fault injection shows that Nooks handles the majority of crashes and latent bugs. However, overheads can be significant if the kernel interface is complicated.
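
To illustrate how an object tracker can make recovery-time clean-up possible, here is a small Python sketch. It assumes a simplified model in which each tracked object is registered with a release callback; `ObjectTracker`, `record_alloc`, `record_free`, `recover` and the object names are hypothetical and not taken from the Nooks code.

```python
class ObjectTracker:
    """Toy tracker: remembers what each extension still holds."""

    def __init__(self):
        self._live = {}  # extension name -> {object id: release callback}

    def record_alloc(self, extension, obj_id, release):
        self._live.setdefault(extension, {})[obj_id] = release

    def record_free(self, extension, obj_id):
        # Normal path: the extension released the object itself.
        self._live.get(extension, {}).pop(obj_id, None)

    def recover(self, extension):
        """Release everything the failed extension never freed."""
        leaked = self._live.pop(extension, {})
        for release in leaked.values():
            release()
        return sorted(leaked)


released = []
tracker = ObjectTracker()
tracker.record_alloc("demo_driver", "irq_5", lambda: released.append("irq_5"))
tracker.record_alloc("demo_driver", "buf_a", lambda: released.append("buf_a"))
tracker.record_free("demo_driver", "buf_a")  # freed before the fault

cleaned = tracker.recover("demo_driver")     # fault: clean up the rest
```

Only the object still outstanding at the time of the fault is released by `recover`, which mirrors the idea that the tracker lets recovery free resources the extension can no longer free itself.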
Confusion
Why do the XPC wrapper functions not have to do marshaling? I did not fully understand the exception mechanism in Nooks.

Summary
The Nooks kernel-level subsystem was built to significantly improve existing commodity operating system reliability by providing isolation and automatic failure recovery of kernel-level extensions such as device drivers, while maintaining backward compatibility with the existing kernel and extension code. The key component of the Nooks design, the Nooks Isolation Manager, virtualizes the interface between the kernel and the extension using standard virtual memory and runtime techniques such as lightweight kernel protection domains, Extension Procedure Calls (XPC), object tracking functions and kernel and extension call wrappers. The garbage collection of extension-allocated data performed by the recovery manager and user-mode agent modules in Nooks help it recover from most well-behaved extension faults. The evaluation of the Nooks implementation on Linux highlighted that it can provide significant operating system reliability gains at the cost of a moderate to low degradation in performance.


Questions / confusion
1. In both the Nooks and the Optimistic Crash Consistency paper, we observe that the fault injection technique was being used to evaluate reliability of the respective designs. Are there any other techniques that can complement fault injection in evaluation of a system's reliability?

Summary
The paper starts with explaining the motivation behind this paper which was that most of the system failures were due to device driver failures and every design to solve this problem needed architectural changes. They propose a new subsystem/architecture called nooks whose main aim is to be fault resistant and avoid errors due to bugs. The paper then explains the main goals (isolation, recovery and backward compatibility) and functions (isolation, interposition, object tracking and recovery) of the system. It then explains the implementation/ limitation of all these functions in detail. Finally they talk about its evaluation (where they ran various benchmark experiments on 8 different extensions) in terms of performance and reliability.

Confusion
The paper has not spoken about memory overhead created due to many lightweight protection domain's private structure. Was this not significant? I did not understand the order of interaction between the recovery-manager and user-mode agent during recovery.

1. Summary
This paper introduces Nooks, lightweight containers for loadable kernel drivers. These were designed to improve the reliability of kernels by isolating bugs in loadable modules. The authors note that loadable modules simultaneously account for a large section of the kernel code base and an inordinately high number of kernel crashes. To keep mistakes in a driver from leading to system downtime (and not to protect against malicious drivers), the authors: isolate the module's address space from the rest of the kernel address space; add a shim layer between the module and the kernel, similar to an RPC interface, to sanitize any changes the module tries to make to the system state; and track all kernel data structures that the module modifies, to isolate faults and aid recovery of an individual module without having to reboot the entire machine. The authors implement a prototype in the Linux kernel to assess the viability of such a solution. The system is able to recover from 99% of crashes without a system panic. Nooks was also able to catch various non-fatal extension failures. However, there is a performance penalty, as Nooks adds time due to the layer of abstraction introduced between the kernel and the various modules, which previously worked at native speed. This is exacerbated by the use of XPCs, which include a context switch that causes the x86 architecture to invalidate all TLB entries. These costs can add up to 60% of the total execution time. Hence, the decision to use Nooks (as it can be used selectively) needs to be taken on a case-by-case basis, based on the performance and reliability requirements of a given system.
2. Confusion
Which learnings gained from Nooks and various Micro Kernels can be applied in each others environment, as they both have a similarity in architecture.

1. Summary
The authors have come up with a design for improving reliability of the system by introducing a layer between the kernel and the extensions (device driver code). They did this by 1) Isolating the failure of the extension from kernel 2) Providing XPC calls and wrappers for kernel - extension interaction 3) Object tracking for safely passing data. 4) Recovery by releasing all the misbehaving data objects and restarting the extension execution. Through their implementation (Nooks) the authors were able to isolate 99% of the injected faults.

2. Confusion
What are the current/state of the art mechanisms to prevent driver corrupting the kernel?

Summary
To isolate the OS from driver failures, which contribute the most to system failures, this paper discusses the design and implementation of the subsystem Nooks. Compared to solutions introducing new architectures or type safe languages, Nooks tries to work transparently with existing drivers and OSs by running extensions in isolated protection domains, interposing the communication between the kernel and the extension, and tracking the kernel objects and resources they are using. Once a failure is detected, Nooks will try to recover the system.

Confusion
It seems that to implement Nooks, each type of kernel objects and each function interface need to be taken care of individually by the developers. Could you make it clear which parts can be auto-generated and which parts are written by hand with the knowledge of their intended usage?

1. Summary
The authors introduce Nooks, a Linux kernel subsystem that isolates extensions such as device drivers within lightweight protection domains inside the kernel address space with little code change. It attempts to contain driver failures and recover quickly with no loss of application state by tracking kernel data structures, and it achieves transparent interposition through wrapper stubs. The solution is practical, backward compatible and reasonably efficient, and it prevents most but not all system failures.
2. Confusion
What are pros and cons of this capability based approach vs protection rings? How atomicity/consistency is guaranteed in the latest paper while creating protection domain and what new idea it introduces, does it handle driver functional failures as well?

Summary
This paper describes the architecture, implementation and performance of Nooks, which is a reliability subsystem aimed at reducing system crashes induced by failures of OS extensions like device drivers, etc. The novelty of the paper lies in the fact that unlike previous systems, Nooks proposes to achieve its goals for contemporary commodity operating systems with little or no change to the underlying code of either the existing drivers or the OS itself. From the evaluations, it is clear that Nooks is able to recover from 99% of the faults, which otherwise would have caused a typical OS like Linux to crash in similar situations.

Confusion
How is synchronization handled when many extensions modify the same kernel data structure shared by multiple extensions? Whose changes prevail from among the various stale copies- is there a policy?

Summary

This paper describes Nooks - a reliability layer intended to reduce extension related failures by isolating extensions within lightweight protection domains inside the kernel address space. This helps in preventing the kernel from getting corrupted, prevent driver failures and achieve automatic, graceful recovery from failures.

Confusion
Can there be recursive problems seen if there is a bug in the driver code accessing illegal memory location? How is the application made aware about the faults and its subsequent recovery? Is there an error code passed back to the application?

1. Summary
This paper talks about Nooks, a subsystem that seeks to improve OS reliability by isolating the OS from driver failures. Computer reliability is an important and unsolved problem. With the increasing number of extensions added to commodity OSes, there is an urgent need to validate these extensions and ensure they are bug free. However, these extensions are often buggy and end up crashing the system. Nooks tries to improve the reliability of an OS by ensuring that it does not crash when an extension is buggy. Nooks also provides a mechanism to support automatic recovery from failure. It provides these two features with the added advantage of being backward compatible. It does not use any type-safe language and is written entirely in C, making it even more attractive for adoption. Nooks’ key contribution is isolation, which it achieves by running the driver in the kernel address space but with reduced access permissions. The Nooks layer sits between the kernel and the extensions, interposing on their interactions to achieve its goals.

2. Questions
1) How would Nooks be implemented in a multicore system?
2) What are the pros and cons of a user-level driver with respect to Nooks?
3) What are the differences between a kernel-user context switch and a switch between the kernel and a lightweight protection domain?

Summary
The paper explains the implementation details of Nooks, a reliability layer added between the OS kernel and kernel extensions such as drivers, which provides isolation, interposition and recovery. Nooks aims to increase the reliability of commodity operating systems by confining any problem caused by an extension to the extension itself (using protection domains), and it also helps in recovering the extension. It also provides backward compatibility, requiring little or no change to the existing OS or extensions. The performance impact of Nooks ranges from 0 to 60%, depending on how often the extension interacts with the kernel.
Confusion
1]Can you please explain the advantage of deferred calls?
2]Isn’t it a memory overhead of maintaining a Copy of kernel page table for each driver?

Summary
To prevent system crashes caused by extensions such as device drivers, Nooks inserts a thin software layer between the kernel and device drivers to improve reliability. Nooks provides four functions: isolation, using lightweight kernel protection domains and extension procedure calls; interposition, using XPC and wrapper stubs; object tracking for checkpointing; and recovery when it detects hardware or software faults.

Confusion
It is very straightforward and sounds great. Why is the capability-based architecture more popular than this?

1. Summary
In this paper, the authors present design and implementation of Nooks, which is a reliability sub-system targeted at enhancing OS reliability by isolating OS kernel from failures in extensions (drivers). There are 3 main goals of this work - (i) isolation, which means isolating extension failures from the kernel, thus preventing system crashes, (ii) recovery, which means supporting automatic recovery after extension failure to allow applications to continue executing, and (iii) backward compatibility, to support all existing systems and extensions with minimal change. The most crucial part of the design is Nooks Isolation Manager which sits between the OS kernel and the extensions to provide the main functions of isolation, interposition, object tracking and recovery.
2. Confusion
* What is “light” about the lightweight protection domains (LPD) in which the extensions execute? What is the difference between kernel-user switch, and the switch between kernel and LPD?
* How is Nooks different from using user mode drivers? What are added benefits?
* Why does the object tracking mechanism need to record the addresses of all objects used by an extension? Why not track only the objects that can be written by an extension?
* Could you discuss the changes in design needed to support Nooks on a multi-processor?

This paper describes Nooks, a series of isolation mechanisms based on wrappers that provide lightweight protection around drivers in case of failure. Nooks takes a recovery-based approach consisting of a recovery manager (which returns the system to a clean state) and a user-mode recovery agent, which facilitates flexible recovery via configuration files. The authors test Nooks against synthetically generated faults and report overall positive results.

Confusion:
- how does Nooks detect livelock given detection is via processor exceptions?
- won't drivers just crash again if reloaded under the same circumstances?

1. Summary
The paper presents the design and implementation of Nooks, a transparent reliability subsystem that improves the reliability of commodity operating systems by isolating failures in OS extensions (Windows drivers, kernel modules, etc.). The proposed technique contains about 99% of the faults injected during testing, with a moderate performance overhead in most cases (with some exceptions like kHTTPD).

2. Confusions
In what ways is a lightweight kernel protection domain different from a user domain? What makes it suitable for kernel module execution?
What is the added memory pressure on the system from running Nooks?

1. Summary
The paper describes the architecture and design of "Nooks", a framework for increasing fault resistance in commodity operating systems. Nooks isolates the OS from its extensions, provides a recovery mechanism that lets applications depending on a failing extension continue executing after the extension is reloaded, and provides backward compatibility so that not every OS extension has to support the Nooks framework. Nooks was implemented on a Linux 2.4 kernel, and the evaluations show that it eliminated 99% of the crashes that occurred with native Linux.

2. Confusion
I did not understand why a global variable is not enough to hold the task pointer in SMP systems.

Summary
Since kernel extensions are a major source of system failure, this paper presents Nooks, a practical, backward-compatible reliability subsystem for commodity operating systems that isolates extensions in lightweight protection domains inside the kernel address space. To achieve the goals of isolation, recovery and backward compatibility, a new layer (NIM) is inserted between the kernel and the extensions. The main functions of this layer are isolation (through XPC), interposition (tracking data/control transfer between kernel and extension, and providing interface wrapper stubs), object tracking, and detection of and recovery from software and hardware faults.
Confusion
-> Why copy the kernel page table for each extension? Can COW be used here instead?
-> What are examples of how an extension can act maliciously? What privileges does it have?
-> How is the lifetime of an object determined during object tracking?

Summary: This paper presents Nooks, a framework to contain faults in device drivers (or, more generally, kernel extensions) by placing them in special protection domains within the kernel. The authors do this while preserving backward compatibility, so that existing code can be reused, by interposing on all kernel-driver interactions and generating wrapper stubs.
Confusion: Since all protection domains share the same kernel address space, there should be no TLB flush on a domain switch. So, is all the TLB-related overhead due to TLB misses caused by interposition? In such a case, would having an L2TLB-like structure as in Mach help? How would optimistic crash handling help Nooks’ performance? How would the Nooks design change if it were to use CPU protection rings?

1. Summary
The paper proposes a new kernel subsystem, Nooks, that protects the system against crashes due to buggy kernel extensions. Nooks sits between the kernel and the extensions, and through stubs, it passes control from one side to another. During this pass, Nooks checks for possible errors that may corrupt kernel data. Nooks also runs extensions in their own protection domain to avoid corruption of kernel data due to bugs.
2. Confusion
Are locks considered “data objects” as well? If so, does Nooks avoid using deferred XPC in those cases?
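
The stub-based control transfer described in this summary can be sketched in a few lines. This is a simplified model under stated assumptions, not the Nooks implementation: `kernel_stub`, `call_extension`, `ExtensionFault`, and the stand-in `kfree` service are all hypothetical names. It shows a kernel-side stub that checks arguments coming from an extension, and a call path into the extension that reports a fault instead of letting it propagate.

```python
class ExtensionFault(Exception):
    """Raised when a call crossing the kernel/extension boundary looks unsafe."""


def kernel_stub(kernel_fn, check_args):
    """Wrap a kernel function so arguments from an extension are checked first."""
    def stub(*args):
        if not check_args(*args):
            raise ExtensionFault(f"bad arguments to {kernel_fn.__name__}: {args!r}")
        return kernel_fn(*args)
    return stub


def call_extension(ext_fn, *args):
    """Call into an extension; report a fault instead of letting it propagate."""
    try:
        return ("ok", ext_fn(*args))
    except Exception as exc:
        return ("fault", type(exc).__name__)


def kfree(size):
    """Stand-in for a kernel service (hypothetical, for illustration only)."""
    return f"freed {size} bytes"


# Assumed validity rule: the size must be a positive integer.
safe_kfree = kernel_stub(kfree, lambda size: isinstance(size, int) and size > 0)


def good_driver():
    return safe_kfree(64)


def buggy_driver():
    return safe_kfree(-1)   # invalid argument, caught by the stub


good = call_extension(good_driver)
bad = call_extension(buggy_driver)
```

Here the buggy call is stopped at the stub and surfaces as a reported fault, while the well-behaved call passes through unchanged.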

Summary
The majority of operating system crashes are triggered by faulty device drivers, and this paper introduces an additional software layer between the core kernel structures/services and device drivers to improve the overall reliability of the system. Traditional approaches to improving reliability at the time focussed on hardware re-architectures, type-safe language systems, etc., while the OS had no explicit mechanisms. The paper clearly articulates the goals (Isolation, Recovery, Backward-Compatibility) and introduces novel design and implementation details for NIM, which isolates extensions into protection domains within the kernel address space, utilises XPC for control transfer, introduces wrappers/stubs for the kernel and extensions to provide transparency, and uses object tracking for recovery. The authors also employ techniques like deferred XPC and shadow copies to improve performance.

Confusion
What is the memory/performance cost of duplicating the kernel objects on a per-protection-domain basis?
What is the cost associated with switching the CR3 register on every control transfer?
How is the lifetime of an object estimated?
The control flow between the recovery manager and the user-mode agent during the recovery process is confusing.
An explanation of the co-location of task_struct instances, DMA, and the non-utilisation of ring-1/2 (x86) would be really helpful.
