** Virtual Machine Monitors **

Years ago, IBM sold mainframes to large organizations, and a problem
arose: what if the organization wanted to run different operating
systems on the machine? (Some applications were developed on one OS and
some on others, hence the problem.) As a solution, IBM introduced yet
another level of indirection in the form of a *virtual machine monitor*,
or *VMM* or just *monitor* for short.

    --------  --------      --------
    | App1 |  | App2 |      | App3 |
    --------------------  --------------------
    | Operating System |  | Operating System |
    -----------------------------------------------------
    |              Virtual Machine Monitor              |
    -----------------------------------------------------

[FIGURE: VMM UNDER TWO OPERATING SYSTEMS]

Specifically, the monitor sits between one or more operating systems and
the hardware and gives each running OS the illusion that it controls the
machine. Behind the scenes, however, the monitor is actually in control
of the hardware, and must multiplex running OSes across the physical
resources of the machine. Indeed, the VMM serves as an operating system
for operating systems, but at a much lower level; the OS must still
think it is interacting with the physical hardware. Thus, *transparency*
is a major goal of VMMs.

Today, VMMs have become popular again for a multitude of reasons. Server
consolidation is one such reason. In many settings, people run services
on different machines which run different operating systems (or even OS
versions), and yet each machine is lightly utilized. In this case,
virtualization enables an administrator to *consolidate* multiple OSes
onto fewer hardware platforms, and thus lower costs and ease
administration. Virtualization has also become popular on desktops, as
many users wish to run one operating system (say Linux or Mac OS X) but
still have access to native applications on a different platform (say
Windows). Thus, for these and many other reasons, virtualization is back
again and likely here to stay.

This resurgence began in the mid-to-late 1990s, and was led by a group
of researchers at Stanford headed by Professor Mendel Rosenblum. His
group's work on Disco [1], a VMM for the MIPS processor, was an early
effort that revived VMMs and eventually led that group to found VMware
[2], now a market leader in virtualization technology. In this note, we
will discuss the primary technology underlying Disco and through that
window try to understand how virtualization works.

[VIRTUALIZING THE CPU]

To run a *virtual machine* (e.g., an OS and its applications) on top of
a virtual machine monitor, the basic technique that is used is *direct
execution*. Thus, when we wish to "boot" a new OS on top of the VMM, we
simply jump to the address of the first instruction and let the OS begin
running. It is as simple as that.

Assume we are running on a single processor, and that we wish to
multiplex between two virtual machines, that is, between two OSes and
their respective applications. In a manner quite similar to an operating
system switching between running processes (a *context switch*), a
virtual machine monitor must perform a *machine switch* between running
virtual machines. Thus, when performing such a switch, the VMM must save
the entire machine state of one OS (including registers, PC, and, unlike
in a context switch, any privileged hardware state), restore the machine
state of the to-be-run VM, and then jump to the PC of the to-be-run VM
and thus complete the switch. Note that the to-be-run VM's PC may be
within the OS itself (i.e., the system was executing a system call) or
it may simply be within a process that is running on that OS (i.e., a
user-mode application).

We get into some slightly trickier issues when a running application or
OS tries to perform some kind of *privileged operation*.
For example, on a system with a software-managed TLB, the OS will use
special privileged instructions to update the TLB with a translation
before restarting an instruction that suffered a TLB miss. In a
virtualized environment, the OS cannot be allowed to perform privileged
instructions, because then it, rather than the VMM beneath it, would
control the machine. Thus, the VMM must somehow intercept attempts to
perform privileged operations and thus retain control of the machine.

A simple example of how a VMM must interpose on certain operations
arises when a running process on a given OS tries to make a system call.
For example, the process may be trying to open() a file, or may be
calling read() to get data from it, or may be calling fork() to create a
new process. In a system without virtualization, a system call is
achieved with a special instruction; on MIPS, it is a *trap*
instruction, and on x86, it is the *int* (interrupt) instruction with
the argument 0x80. Here is an open system call in assembly on FreeBSD:

    open:
        push    dword mode
        push    dword flags
        push    dword path
        mov     eax, 5
        push    eax
        int     80h
        add     esp, byte 16

[FIGURE: OPEN SYSTEM CALL ON FREEBSD; SEE [3] FOR DETAILS]

On UNIX-based systems, open() takes three arguments:

    int open(char *path, int flags, mode_t mode);

and you can see in the code above how open() is implemented: first, the
arguments get pushed onto the stack (mode, flags, path), then the system
call number 5 is placed in eax and eax is pushed as well (FreeBSD expects
one extra dword on the stack at this point), and then "int 80h" is
executed, which transfers control to the kernel. The 5, if you were
wondering, is the pre-agreed upon convention between user-mode
applications and the kernel for the open() system call; different system
calls would place different numbers in eax before executing the
interrupt instruction.

When a trap instruction is executed, it usually does a number of
interesting things.
Most important in our example here is that it first transfers control
(i.e., changes the PC) to a well-defined *trap handler* within the
operating system. The OS, when it is first starting up, establishes the
address of such a routine with the hardware (also a privileged
operation!) and thus upon subsequent traps, the hardware knows where to
start running code to handle the trap. At the time of the trap, the
hardware also does one other crucial thing: it changes the mode of the
processor from *user mode* to *kernel mode*. In user mode, operations
are restricted, and attempts to perform privileged operations will lead
to a trap and likely the termination of the offending process; in kernel
mode, on the other hand, the full power of the machine is available, and
thus all privileged operations can be executed.

Thus, in a traditional setting (again, without virtualization), the flow
of control would be like this:

    Process                   Hardware                       Operating System
    1. execute typical
       instructions
       (e.g., add, load, etc.)
    2. need to do system
       call: trap!
                              3. Switch to kernel mode
                                 and jump to OS trap
                                 handler
                                                             4. Running in kernel mode
                                                                Handle the system call
                                                                Return from trap
                              5. Switch back to user mode
                                 and return to execution
                                 at the proper return site
    6. Go back to running
       typical instructions

[FIGURE: EXECUTING A SYSTEM CALL]

On a virtualized platform, things are a little more interesting. When an
application running on an OS wishes to perform a system call, it does
the exact same thing: it executes a trap instruction with the arguments
carefully placed on the stack (or in registers). However, it is the VMM
that controls the machine, and thus it is the VMM's trap handler that
will first get executed in kernel mode. So what should the VMM do to
handle this system call? The VMM doesn't really know *how* to handle the
call; after all, it does not know the details of each OS that is running
and therefore does not know what each call should do.
What the VMM does know, however, is *where* the OS's trap handler is. It
knows this because when the OS booted up, it tried to install its own
trap handlers; when the OS did so, it was trying to do something
privileged, and therefore trapped into the VMM; at that time, the VMM
recorded the necessary information (i.e., where this OS's trap handlers
are in memory). Now, when the VMM receives a trap from a user process
running on the given OS, it knows exactly what to do: it jumps to the
OS's trap handler and lets the OS handle the system call as it should.
When the OS is finished, it executes some kind of privileged instruction
to return from the trap ("eret" on MIPS), which again bounces into the
VMM, which then realizes that the OS is trying to return from the trap
and thus performs a real return-from-trap, which returns control to the
user process and puts the machine back in user mode.

The entire process is depicted here, both for the normal case without
virtualization and the case with virtualization (we leave out the exact
hardware operations from above to save space):

    Process                   Operating System
    (user mode)               (kernel mode)
    1. trap
                              2. OS trap handler
                                 decode trap and execute
                                 appropriate syscall routine
                                 when done: return from trap
    3. start running again

[FIGURE: SYSCALL FLOW WITHOUT VIRTUALIZATION]

    Process                   Operating System               Virtual Machine Monitor
    (user mode)               (which mode?)                  (kernel mode)
    1. trap
                                                             2. oh, a trap!
                                                                call OS trap handler
                              3. OS trap handler
                                 decode trap and execute
                                 appropriate syscall routine
                                 when done: try to execute
                                 return from trap instruction
                                                             4. oh, a return from trap!
                                                                do a real return from trap
    5. start running again

[FIGURE: SYSCALL FLOW WITH VIRTUALIZATION]

As you can see from the figures, a lot more has to happen when
virtualization is going on. Because of the extra jumping around (two
additional traps into the VMM per system call), virtualization can slow
down system calls and thus hurt performance.

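
To make the virtualized system call flow a bit more concrete, here is a
minimal sketch in C. It is a toy model, not how Disco or any real VMM is
written: traps are modeled as ordinary function calls, and every name in
it (struct vm, vmm_on_install_handler, vmm_on_syscall_trap,
vmm_on_return_from_trap, guest_os_trap_handler) is hypothetical. What it
does show is the bookkeeping from the text: the VMM records the guest
handler's address at boot, forwards later traps to it, and performs the
real return to user mode when the guest tries to return from the trap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical privilege levels for this toy model. */
enum cpu_mode { MODE_USER, MODE_GUEST_OS, MODE_VMM };

typedef void (*trap_handler_t)(int syscall_no);

/* Per-VM state the VMM keeps (assumed layout, for illustration only). */
struct vm {
    trap_handler_t guest_trap_handler; /* recorded when the guest OS boots */
    enum cpu_mode  mode;               /* mode this VM is currently running in */
    int            last_syscall;       /* last syscall number the guest OS saw */
};

static struct vm *current_vm;          /* VM that currently owns the CPU */

/* The guest OS tried to install its trap handler. That is privileged, so
 * it trapped into the VMM, which simply records where the handler lives. */
void vmm_on_install_handler(struct vm *vm, trap_handler_t handler) {
    vm->guest_trap_handler = handler;
}

/* The guest OS tried to execute return-from-trap. That also traps into the
 * VMM, which performs the real return and drops back to user mode. */
void vmm_on_return_from_trap(struct vm *vm) {
    vm->mode = MODE_USER;
}

/* A user process executed a trap instruction. The VMM runs first; it does
 * not know how to handle the call, only where the guest's handler is. */
void vmm_on_syscall_trap(struct vm *vm, int syscall_no) {
    assert(vm->guest_trap_handler != NULL); /* guest must have booted */
    current_vm = vm;
    vm->mode = MODE_GUEST_OS;               /* reduced privilege, not kernel mode */
    vm->guest_trap_handler(syscall_no);
}

/* Toy guest OS trap handler: "decode" the trap, then attempt to return. */
void guest_os_trap_handler(int syscall_no) {
    current_vm->last_syscall = syscall_no;
    vmm_on_return_from_trap(current_vm);    /* stands in for the trapping return */
}
```

If a VM installs guest_os_trap_handler and a user process then traps with
system call number 5, the guest sees the 5 and the VM ends up back in
MODE_USER, having bounced through the VMM twice along the way.
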
You might also notice that we have one remaining question: what mode
should the OS run in? It can't run in kernel mode, because then it would
have unrestricted access to the hardware. Thus, it must run in some less
privileged mode than before, be able to access its own data structures,
and simultaneously prevent access to its data structures from user
processes.

In the Disco work, Rosenblum and colleagues handled this problem quite
neatly by taking advantage of a special mode provided by the MIPS
hardware known as supervisor mode. When running in this mode, one still
doesn't have access to privileged instructions, but one can access a
little more memory than when in user mode; the OS can use this extra
memory for its data structures and all is well. On hardware that doesn't
have such a mode, one has to run the OS in user mode and use memory
protection (page tables and TLBs) to protect OS data structures
appropriately.

[VIRTUALIZING MEMORY]

You should now have a basic idea of how the processor is virtualized:
the VMM acts like an OS and schedules different virtual machines to run,
and some interesting interactions occur when privilege levels change.
But we have left out a big part of the equation: how does the VMM
virtualize memory?

Each OS normally thinks of physical memory as a linear array of pages,
and assigns each page to itself or to user processes. The OS itself, of
course, already virtualizes memory for its running processes, such that
each process has the illusion of its own private address space. Now we
must add another layer of virtualization underneath the OS, so that
multiple OSes can share the actual physical memory of the machine, and
we must do so transparently.

To understand this process, we must first make sure we understand what
happens when there is no virtual machine monitor. We thus review what
happens on a MIPS-based system during address translation.
Let us assume a user process generates an address (this could be for an
instruction fetch or an explicit load or store, for example); by
definition, the process generates a *virtual address*, as its address
space has been virtualized by the OS. It is the role of the OS, with
help from the hardware, to turn this into a *physical address* and thus
be able to fetch the desired contents from physical memory. The act of
turning a virtual address into a physical address is called *address
translation*, as you might recall.

Assume we have a 32-bit virtual address space and a 4-KB page size.
Thus, our 32-bit address is chopped into two parts: a 20-bit virtual
page number (VPN), and a 12-bit offset:

    ---------------------------------------------------------
    | Virtual Page Number (20 bits) | Page Offset (12 bits) |
    ---------------------------------------------------------

The role of the OS, with help from the TLB, is to translate the VPN into
a valid physical page number (PPN) and thus produce a fully-formed
physical address which can be sent to physical memory to fetch the
proper data. In the common case, we expect the TLB to handle the
translation in hardware, thus making the translation fast. When a TLB
miss occurs (at least, on a system with a software-managed TLB), the OS
must get involved to service the miss, as depicted here:

    Process                   Operating System
    (user mode)               (kernel mode)
    1. load from memory
       TLB miss! trap into OS
                              2. OS TLB miss trap handler
                                 figure out VPN of address and
                                 look up VPN in page table
                                 if present, get PPN and
                                 update TLB w/ priv. instructions
                                 return from trap
    3. start running again
       instruction is retried
       and this time a TLB hit

[FIGURE: TLB MISS FLOW WITHOUT VIRTUALIZATION]

As you can see from this flow, a TLB miss causes a trap into the OS,
which handles the miss by looking up the VPN in the page table and
installing the translation (VPN->PPN for this PID) in the TLB.

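
The translation steps above can be sketched in a few lines of C. This is
a toy model under stated assumptions, not MIPS code: the TLB is a tiny
four-entry array with round-robin replacement, the page table is a
linear array in which every page is assumed present, there is a single
process (so no PID tag), and the trap into the OS is modeled as a plain
function call. The names translate, tlb_lookup, os_tlb_miss_handler,
page_table and tlb_misses are all made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT  12u                        /* 4-KB pages: 12-bit offset */
#define OFFSET_MASK ((1u << PAGE_SHIFT) - 1u)  /* low 12 bits */
#define TLB_ENTRIES 4u                         /* toy size (assumption) */
#define NUM_PAGES   16u                        /* toy page table size (assumption) */

struct tlb_entry { bool valid; uint32_t vpn; uint32_t ppn; };

static struct tlb_entry tlb[TLB_ENTRIES];
static unsigned next_victim;      /* round-robin replacement index */

uint32_t page_table[NUM_PAGES];   /* VPN -> PPN, all pages assumed present */
unsigned tlb_misses;              /* number of times the "OS" had to step in */

/* Hardware side: search the TLB for a translation of this VPN. */
static bool tlb_lookup(uint32_t vpn, uint32_t *ppn) {
    for (unsigned i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;
            return true;
        }
    }
    return false;
}

/* OS side: look the VPN up in the page table and install VPN->PPN in the
 * TLB (on real hardware this update is a privileged instruction). */
static void os_tlb_miss_handler(uint32_t vpn) {
    assert(vpn < NUM_PAGES);
    tlb[next_victim] = (struct tlb_entry){ true, vpn, page_table[vpn] };
    next_victim = (next_victim + 1u) % TLB_ENTRIES;
    tlb_misses++;
}

/* Translate a 32-bit virtual address into a physical address. */
uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_SHIFT;   /* top 20 bits */
    uint32_t offset = vaddr & OFFSET_MASK;   /* bottom 12 bits */
    uint32_t ppn    = 0;

    if (!tlb_lookup(vpn, &ppn)) {
        os_tlb_miss_handler(vpn);            /* "trap into OS" */
        bool hit = tlb_lookup(vpn, &ppn);    /* retry: this time a TLB hit */
        assert(hit);
        (void)hit;
    }
    return (ppn << PAGE_SHIFT) | offset;
}
```

For example, with page_table[2] = 7, the address 0x00002ABC splits into
VPN 2 and offset 0xABC; the first access misses, the handler installs
2->7, and the result is 0x00007ABC. A second access to the same page
hits in the TLB without involving the OS.
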
With a virtual machine monitor underneath the OS, however, things again
get a little more interesting. Let's examine the flow again:

    Process                   Operating System               Virtual Machine Monitor
    (user mode)               (supervisor mode)              (kernel mode)
    1. load from memory
       TLB miss! trap into VMM
                                                             2. TLB miss handler
                                                                Call into OS TLB miss handler
                                                                to actually do the work
                              3. OS TLB miss trap handler
                                 figure out VPN of address
                                 and look up VPN in page
                                 table
                                 if present, get PPN and
                                 update TLB w/ priv.
                                 instructions
                                                             4. Trap handler
                                                                Somebody (OS) is trying to
                                                                update the TLB
                                                                See that OS is trying to
                                                                install VPN->PPN mapping
                                                                Change mapping to desired
                                                                VPN->MPN mapping
                                                                jump back to OS, pretending
                                                                that priv. instruction worked
                              5. return from trap
                                                             6. Trap handler
                                                                Somebody (OS) is trying to
                                                                return from a trap
                                                                Thus, actually return from trap
    7. start running again
       instruction is retried
       and this time a TLB hit

[FIGURE: TLB MISS FLOW WITH VIRTUALIZATION]

As you can see, the picture gets a little more complicated. Now, the VMM
is managing memory underneath the OS and thus must interpose on these
attempts by the OS to install direct virtual-to-physical mappings and
redirect such mappings to the mappings desired by the VMM. In other
words, the VMM is actually managing what we call real *machine memory*
underneath the "physical" memory that the OS thinks it is managing.
Thus, each process on a given OS thinks it has its own virtual address
space, which the OS multiplexes across physical memory. The VMM, in
turn, multiplexes each OS's physical memory onto the actual machine
memory, translating each physical page number (PPN) the OS uses into a
machine page number (MPN). By inserting another level of indirection,
the VMM can control which pages are allocated to which OSes.

[VIRTUALIZATION: OTHER ISSUES]

There are a huge number of other issues one could discuss about
virtualization. One general problem is what some have called the
*semantic gap* between the VMM and the OS.
The VMM, in general, doesn't know what the OS is trying to accomplish
and this situation can often lead to inefficiencies. For example, an OS,
when it has nothing else to run, will sometimes go into an *idle loop*
just spinning and waiting for the next interrupt to occur:

    while (1)
        ; // the idle loop

It makes sense to spin like this if the OS is in charge of the entire
machine and thus knows there is nothing else that needs to run. However,
when a VMM is running underneath two different OSes, one in the idle
loop and one usefully running user processes, it would be useful for the
VMM to know that one of the OSes is now idle so it can give more CPU
time to the OS that is doing useful work.

There are many other similar examples, e.g., the OS knows which
"physical" pages are free, but to the VMM they look as if the OS is
using them. All such cases require either some intelligence on the VMM's
part to infer what is happening, or a rewrite of a small portion of the
OS to pass the VMM the needed information and thus let it make better
decisions.

Many other topics could also be discussed:

- I/O virtualization
- Hosted virtualization, where an operating system is running and a VMM
  is loaded later on the side
- Memory management in more detail
- Paravirtualization, where the OS is modified to be more easily
  virtualized

But for now, that's all.

[SUMMARY]

Virtualization is in a renaissance. For many reasons, it makes sense to
allow users to run multiple OSes on the same machine at the same time.
The key is that VMMs generally provide this service *transparently*; the
OS above has little clue that it is not actually controlling the
hardware of the machine. The key method that VMMs use to do so is
*interposition*; by getting involved on critical privileged events such
as traps, the VMM can completely control how machine resources are
allocated while preserving the illusion that the OS requires.

[REFERENCES]

[1] "Disco: Running Commodity Operating Systems on Scalable
    Multiprocessors", Edouard Bugnion, Scott Devine, Kinshuk Govil,
    Mendel Rosenblum. SOSP '97.

[2] VMware corporation.
    Available: http://www.vmware.com/
    There, I saved you a Google search.

[3] "FreeBSD Developers' Handbook: Chapter 11 x86 Assembly Language
    Programming".
    Available: http://www.freebsd.org/doc/en/books/developers-handbook/x86-system-calls.html