Citation: Edouard Bugnion, Scott Devine, and Mendel Rosenblum, "Disco: Running Commodity Operating Systems on Scalable Multiprocessors", Proceedings of the 16th ACM Symposium on Operating Systems Principles, Saint-Malo, France, October 1997.

* Summary
Virtual machine monitors are reintroduced as a mechanism for running commodity operating systems efficiently on large-scale shared-memory multiprocessors without a large implementation effort.

* Supporting Scalable Systems
One of the largest impediments to the success of large, scalable multiprocessors is the lack of reliable operating systems to run on them. Rather than making extensive changes to existing operating systems, the authors insert an additional layer between the hardware and the operating system. This layer acts as a virtual machine monitor (VMM): multiple copies of commodity operating systems can run on a single scalable computer. The monitor allows efficient cooperation and sharing of resources between the commodity OSes. Using commodity OSes leads to systems that are both reliable and compatible with the existing computing base. The OSes are coupled using standard distributed-systems and network protocols. With this approach, only the VMM and the distributed-systems protocols need to scale to the size of the physical machine, and the simplicity of the monitor makes this task much easier than building a scalable OS.

* Virtual Machine Monitors
VMMs virtualize all resources of a machine and export a more convenient hardware interface to OSes. A monitor manages all resources so that multiple virtual machines can coexist on the same multiprocessor. Each virtual machine is configured with the processor and memory resources that its OS can effectively handle, and virtual machines communicate using standard distributed protocols to export the image of a cluster of machines. In addition, the virtual machine becomes the unit of fault containment, so failures need not spread over the entire machine.
Using a VMM has a number of disadvantages, including the overhead of virtualizing hardware resources, resource management problems, and communication and sharing problems. However, much of the impact of these problems can be limited by combining recent advances in OS technology with some specialized enhancements to the monitor.

* Disco
Disco is a VMM for a scalable, cache-coherent multiprocessor called FLASH. Each node in FLASH contains a processor, main memory, and I/O devices. The nodes are connected by a high-performance scalable interconnect. Cache coherence is provided by a directory, which gives software the view of a shared-memory NUMA multiprocessor.

To match FLASH, Disco provides the abstraction of a MIPS R10000 processor. Disco correctly emulates all instructions, the memory management unit, and the trap architecture of the processor. In addition, it extends the architecture so that kernel operations such as enabling or disabling CPU interrupts and accessing privileged registers can be performed using load and store instructions on special addresses.

Disco provides an abstraction of main memory residing in a contiguous physical address space. It uses dynamic page migration and replication to export a memory architecture with nearly uniform access time (UMA) to the software.

Each virtual machine is created with a specified set of I/O devices. Disco must intercept all communication to and from I/O devices to translate or emulate the operation. Disco provides a set of virtual disks that any virtual machine can mount, and supports various sharing and persistence models. Each virtual machine is assigned a distinct link-level address on an internal virtual subnet handled by Disco, and Disco acts as a gateway when communication with the outside world is needed.
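
The virtual subnet described above can be pictured with a small model. This is only an illustrative sketch, not Disco's implementation: the class `VirtualSubnet`, its methods, and the `GATEWAY` label are hypothetical names, and link-level addresses are modeled as plain strings.

```python
from dataclasses import dataclass, field

GATEWAY = "gateway"  # hypothetical label for traffic that leaves the machine


@dataclass
class VirtualSubnet:
    """Toy model of a monitor-managed internal subnet (all names are assumptions)."""

    # link-level address -> inbox of the virtual machine that owns it
    inboxes: dict = field(default_factory=dict)
    # frames the monitor would forward to the physical network
    outbound: list = field(default_factory=list)

    def attach(self, link_addr):
        """Register a virtual machine under a distinct link-level address."""
        if link_addr in self.inboxes:
            raise ValueError(f"address {link_addr!r} is already in use")
        self.inboxes[link_addr] = []

    def send(self, src, dst, payload):
        """Deliver internally if dst is a local VM, otherwise act as a gateway."""
        if src not in self.inboxes:
            raise KeyError(f"unknown sender {src!r}")
        if dst in self.inboxes:
            self.inboxes[dst].append((src, payload))
            return dst
        self.outbound.append((src, dst, payload))
        return GATEWAY
```

A frame addressed to another attached virtual machine stays on the internal subnet, while any other destination is queued for the outside network. That is the only routing decision this model makes.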
* Disco Implementation
Disco is implemented with careful attention to the NUMA architecture of FLASH, and it avoids data structures with poor cache behavior. To improve locality, the Disco code segment is replicated into the memory of every node of the FLASH machine, so that all instruction cache misses can be satisfied from the local node.

Disco uses direct execution on the real CPU to emulate execution of the virtual CPU. To schedule a virtual CPU, Disco sets the real processor's registers to those of the virtual CPU and jumps to the current PC of the virtual CPU. The challenge of direct execution is the detection and fast emulation of privileged operations such as TLB modification and direct access to physical memory and I/O devices. Disco contains a simple scheduler that time-shares the virtual processors across the physical processors of the machine. Disco runs in kernel mode, while virtual OSes run in supervisor mode, which lets the OS use a protected portion of the address space but does not give it access to privileged instructions or memory.

To virtualize physical memory, Disco adds a level of address translation and maintains a physical-to-machine address map. When an OS attempts to insert a virtual-to-physical address mapping into the TLB, Disco emulates the operation and installs the correct virtual-to-machine mapping. To compute the corrected TLB entry quickly, Disco maintains a per-virtual-machine pmap data structure that contains one entry for each physical page of the virtual machine. Each pmap entry contains a precomputed mapping of the physical page to a machine page, as well as virtual address backmaps for invalidating TLB mappings. Disco flushes the TLB whenever it switches between virtual machines, which increases the number of TLB misses.
To compensate, Disco maintains a second-level software TLB, which the TLB miss handler consults before forwarding a miss exception to a virtual OS.

Disco intercepts all device accesses and eventually forwards them to the physical devices. Each Disco device defines a monitor call that the device driver uses to pass all command arguments in a single trap. All DMA requests are intercepted, and their physical addresses are translated into machine addresses. Copy-on-write disks are used to share disk blocks through virtual memory page mapping.

The virtual subnet and networking interfaces of Disco also use copy-on-write mappings to reduce copying and to allow memory sharing. Messages sent between virtual machines cause the DMA unit to map the page read-only into both the sending and receiving VMs' physical address spaces. By combining copy-on-write disks with copy-on-write network sharing, Disco provides a global buffer cache that is transparently shared by independent VMs.
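
The memory-virtualization path in this section (pmap entries with backmaps, the second-level software TLB, and copy-on-write sharing of disk blocks) can be sketched as a toy model. Everything here is an assumption made for illustration: the names `Monitor`, `PmapEntry`, `tlb_insert`, `tlb_miss`, `disk_read`, and `write` are hypothetical, pages are plain integers, and the hardware TLB, page contents, and page reclamation are left out.

```python
from dataclasses import dataclass, field


@dataclass
class PmapEntry:
    """One entry per physical page of a VM (simplified, hypothetical layout)."""
    machine_page: int
    backmaps: set = field(default_factory=set)  # virtual pages that map this page


class Monitor:
    def __init__(self):
        self.next_free = 0
        self.refs = {}        # machine page -> number of references to it
        self.pmap = {}        # (vm, physical page) -> PmapEntry
        self.soft_tlb = {}    # (vm, virtual page) -> machine page
        self.disk_cache = {}  # disk block -> machine page holding the block

    def _alloc(self):
        """Hand out a fresh machine page; released pages are not recycled here."""
        page = self.next_free
        self.next_free += 1
        self.refs[page] = 1
        return page

    def _entry(self, vm, ppage):
        """Return the pmap entry, backing the physical page on first use."""
        key = (vm, ppage)
        if key not in self.pmap:
            self.pmap[key] = PmapEntry(self._alloc())
        return self.pmap[key]

    def _invalidate(self, vm, entry):
        """Use the backmaps to drop stale virtual-to-machine translations."""
        for vpage in entry.backmaps:
            self.soft_tlb.pop((vm, vpage), None)
        entry.backmaps.clear()

    def tlb_insert(self, vm, vpage, ppage):
        """Emulate a guest TLB write: virtual->physical becomes virtual->machine."""
        entry = self._entry(vm, ppage)
        entry.backmaps.add(vpage)
        self.soft_tlb[(vm, vpage)] = entry.machine_page
        return entry.machine_page

    def tlb_miss(self, vm, vpage):
        """Consult the software TLB; None means the miss is forwarded to the guest."""
        return self.soft_tlb.get((vm, vpage))

    def disk_read(self, vm, block, ppage):
        """Map a disk block into a VM, sharing the cached machine page if present."""
        if block not in self.disk_cache:
            self.disk_cache[block] = self._alloc()  # the cache holds one reference
        shared = self.disk_cache[block]
        old = self.pmap.get((vm, ppage))
        if old is not None:
            self._invalidate(vm, old)
            self.refs[old.machine_page] -= 1
        self.refs[shared] += 1
        self.pmap[(vm, ppage)] = PmapEntry(shared)
        return shared

    def write(self, vm, ppage):
        """Copy-on-write: a shared page is replaced by a private one before a write."""
        entry = self._entry(vm, ppage)
        if self.refs[entry.machine_page] > 1:
            self._invalidate(vm, entry)
            self.refs[entry.machine_page] -= 1
            entry.machine_page = self._alloc()
        return entry.machine_page
```

In this model, two virtual machines that read the same disk block end up mapped to the same machine page, which is the sharing behind the global buffer cache. The first write by either one breaks the sharing and invalidates that machine's cached translations through the backmaps.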