
Disco: running commodity operating systems on scalable multiprocessors

E. Bugnion, S. Devine, K. Govil, and M. Rosenblum. Disco: running commodity operating systems on scalable multiprocessors. ACM Trans. Comput. Syst., 15(4):412-447, 1997.

Reviews due Tuesday, 2/10.

Comments

Summary:
The paper describes the idea of inserting a virtual machine monitor above the hardware to multiplex it among multiple operating systems and thereby achieve scalability. While virtual machines were not new at the time, the authors were among the first to revive them for multiprocessor architectures and commodity operating systems. The implementation targets ccNUMA machines.


Problem:
Since commodity operating systems are not scalable, which layer should be modified to achieve scalability as the hardware evolves? Operating systems can be changed, but doing so typically takes a great deal of time and development cost. The paper addresses the problem by introducing a virtual machine monitor layer capable of running multiple commodity operating systems, multiplexing the hardware to achieve maximum utilization.

Contributions:
Virtual Machine Monitor: An added layer between the hardware and the guest operating systems that manages all machine resources on their behalf. Since its code base is small, it can be evolved and modified more easily than a commodity operating system.
Virtual CPUs: The VCPUs provide an abstraction of the MIPS processor. There are three modes: kernel, supervisor, and user. The virtual machine monitor runs in kernel mode with complete access to the hardware, the guest operating system runs in supervisor mode, and user applications run in user mode. Per-VCPU data structures hold the registers and TLB contents that make up the state of each VCPU.
Memory: Disco presents each VM with an abstraction of physical memory as a contiguous chunk starting at address 0. This adds another level of indirection, the physical-to-machine translation, which Disco performs in software when servicing TLB misses. The hardware TLB must be flushed when a different VCPU is scheduled onto a processor.
I/O devices: The implementation intercepts all communication to and from I/O devices using special device drivers inserted into the guest OS.
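The extra level of indirection described above can be sketched in a few lines of Python. This is an illustrative toy, not Disco's actual code; the names `VMM` and `pmap` are assumptions for the sketch:

```python
# Illustrative sketch (not Disco's code) of the extra translation level a VMM
# adds: the guest OS maps virtual -> "physical" pages, and the monitor maps
# those physical pages -> machine pages before installing a TLB entry.

class VMM:
    def __init__(self):
        self.next_machine_page = 0
        self.pmap = {}  # (vm_id, physical_page) -> machine_page

    def physical_to_machine(self, vm_id, ppage):
        # Allocate a machine page lazily on first touch of a physical page.
        key = (vm_id, ppage)
        if key not in self.pmap:
            self.pmap[key] = self.next_machine_page
            self.next_machine_page += 1
        return self.pmap[key]

    def install_tlb_entry(self, vm_id, vpage, ppage):
        # A guest TLB write is intercepted; the monitor substitutes the
        # machine page so the hardware TLB maps virtual -> machine directly.
        return (vpage, self.physical_to_machine(vm_id, ppage))


vmm = VMM()
# Two VMs each believe their physical memory starts at page 0,
# yet they receive distinct machine pages.
e0 = vmm.install_tlb_entry(vm_id=0, vpage=0x10, ppage=0)
e1 = vmm.install_tlb_entry(vm_id=1, vpage=0x10, ppage=0)
```

The key point is that the guest never sees machine addresses: both VMs map "physical page 0", and only the monitor knows which machine page backs each one.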

Evaluation
The authors provide detailed evaluation results for their implementation with an extensive set of workloads, including engineering, scientific, and database workloads. Comparing IRIX with and without Disco, the VMM overhead is between 3% and 16%, arising mostly from traps and TLB reloads. Although memory is partitioned across virtual machines, transparent sharing (e.g., of the buffer cache accessed over NFS) keeps the aggregate footprint low. The dynamic page migration and replication results show roughly a 33% performance improvement over the commodity OS.

Confusions:
It would be great to have a discussion on similarities/differences between Disco, exokernel and microkernel and what each of them did right or wrong.

1. Summary
This paper introduces Disco, a virtual machine monitor that manages hardware resources on a shared-memory multiprocessor with NUMA memory, in an effort to support conventional commodity operating systems on scalable systems with minimal implementation complexity.

2. Problem
Hardware innovations such as scalable shared-memory multiprocessors require significant changes to operating system software. These changes have huge costs: they may bring instability, unreliability, and poor data sharing, and they need cooperation from commercial companies. Therefore, new ways must be found to develop system software for scalable multiprocessor machines.

3. Contributions
Disco adds an intermediate layer between the hardware and the OS, which virtualizes the hardware resources (CPU, I/O, memory) for the OS and hides the underlying non-uniform hardware implementation.
Data sharing is done efficiently through the software TLB and by mapping shared data structures. Communication is supported by standard distributed-system protocols such as TCP/IP and NFS.

4. Evaluation
Disco, designed for FLASH, is evaluated on the SimOS simulator. The workloads are carefully chosen from four categories. The virtualization overhead ranges from 3% to 16%, and the paper wisely avoids reporting an "average" of normalized values. The overhead comes from system services and trap handling.
Sharing code and the buffer cache reduces the memory overhead by almost half with eight virtual machines; most of the remaining footprint comes from IRIX's data.
The paper chose only two of the four workloads to show scalability benefits, without stating any reasons. I doubt whether the almost 40% improvement in execution time on eight virtual machines holds for the other two workloads.
Dynamic page migration and replication performance is measured on two workloads with poor memory behavior. Disco shows a 33%-38% improvement, although it does not come close to the UMA lower bound.

5. Confusion
I believe that a virtual machine monitor helps reduce the cost of changing operating systems and improves data-sharing performance, but what is the cost and complexity of modifying the monitor if the hardware is updated or replaced? Also, with monitors, the system software may not be able to fully manage and utilize the hardware resources. Why didn't the paper compare the performance of Disco with running a modified operating system directly on the scalable multiprocessor machine? Although we cannot modify Windows NT, we certainly can modify Linux.

1. summary
This paper proposes the idea of virtual machine monitors along with a prototype, Disco. A virtual machine monitor is an extra layer that sits between the hardware and the OS. It virtualizes the hardware and exposes a generic interface, enabling different commodity OSes to run simultaneously on a single scalable multiprocessor system with minimal changes and overhead to the existing operating systems. The evaluation of Disco shows that virtual machine monitors provide performance nearly comparable to the existing OS while offering better scalability and low-overhead support for hardware innovations.
2. Problem
System software was unable to keep up with innovations in system hardware. To bring out the full potential of these hardware innovations, many hardware-oriented changes had to be made to the OS. These changes involve a lot of human effort and are not future-proof. Apart from this, the authors also provide a solution to the problem of running multiple OSes on a single scalable multiprocessor system.
3. Contributions
The key contribution of this paper is a new layer of abstraction between the hardware and the OS, called the virtual machine monitor. Different commodity OSes can run in virtual machines on a single multiprocessor system. The VMM virtualizes the hardware resources, manages physical memory, and schedules the different OSes.
1. Memory management - Since multiple VMs run simultaneously, there is another layer of abstraction during address translation: the VMM translates the physical address provided by the guest OS into the real machine address.
2. I/O device management - I/O interactions by the OS are intercepted by the VMM to ensure consistency.
3. Page placement, dynamic page migration, and replication provide a uniform view of memory, hiding the NUMA-ness of the system. This enables OSes that are not CC-NUMA-aware to run on such hardware.
4. Fault containment - a bug or crash in one VM does not affect another.
4. Evaluation
The evaluation of Disco is carried out on SimOS, simulating multiple copies of IRIX under different kinds of workloads (i.e., pmake, engineering, scientific computing, and database). The results suggest that the VMM's virtualization adds 3-16% overhead, mainly due to Disco's emulation of TLB miss handling, so larger page sizes reduce the overhead. Page placement and dynamic page migration and replication provide a 33-38% performance improvement over a commodity OS tuned to the hardware.
5. Confusion
A few changes are required in a commodity OS to run over a VMM. But present-day VMMs require no such changes; what technical advances allow modern VMMs to achieve this?

1. Summary
Disco is a virtual machine monitor that runs many commodity operating systems on a scalable shared-memory multiprocessor by adding an extra layer of system software between the OSes and the hardware. The monitor is proposed as an alternative to scalable operating systems, which are complex, require many modifications, and have huge implementation costs. Disco is evaluated by running real workloads on a system simulator of an early FLASH machine, and the empirical data show that virtualization can be done with little overhead while gaining benefits similar to a scalable OS running on a large-scale cc-NUMA machine.

2. Problem
Large-scale cc-NUMA machines posed a challenge for system software development, as a scalable OS needed many modifications to achieve reliability, support the legacy computing base, manage cc-NUMA memory, and still support fault containment. The gap between innovative hardware and system software was hindering the delivery of good scalable products in the computer market. To address this problem, the article provides a simple yet effective solution: a virtual machine monitor that supports commodity OSes with minimal modifications while achieving scalability and fault containment.

3. Contributions
The important contribution of Disco is the non-NUMA interface it presents to the commodity OSes running on a scalable cc-NUMA machine. Disco provides an interface to all the running OSes and virtualizes resources like CPUs, memory, and I/O devices among them, acting as an extra insulation layer between the hardware and the OSes. Disco supports communication among virtual machines using standard network interfaces, making shared-memory communication easier and more efficient. All privileged instructions issued by a guest OS trap into Disco, which emulates them for each virtual CPU. It provides an efficient mechanism for virtual-to-machine address translation by using certain shared data structures. Facilities like dynamic page migration and page replication are provided to handle the NUMA-ness of the machine, as the commodity OSes are NUMA-unaware. All I/O communication and hardware interrupts are intercepted by Disco, which thereby efficiently shares the resources among the virtual machines.

4. Evaluation
Disco is evaluated on the SimOS simulator targeting FLASH, the Stanford scalable multiprocessor, running the IRIX OS. Four representative uniprocessor workloads were evaluated for execution time. The overhead of virtualization, measured by running IRIX with and without Disco, was found to range from 3% to 16%; this extra overhead is accounted for by traps and TLB misses in the virtualized environment. Up to 8 VMs were run using the pmake workload, and the results showed a significant reduction in memory footprint. Scalability effects and the performance benefits of page migration/replication are also empirically evaluated, and the results seem satisfactory. The evaluation methodology and thorough analysis, along with the characterization of workloads, make for good empirical evaluation of the system. However, evaluation on a real machine (FLASH hardware) with long-running program regions of varying dynamic instruction counts would have been more helpful.

5. Confusion
I am curious to know how Disco supported data-parallel workloads on the multiprocessor hardware it ran on; the data partitioning of workloads among the processor nodes is not explained with respect to the NUMA property. Was it a good idea to flush the TLB on every switch to a new VM, and how do modern VM monitors handle this scenario?

Summary:

In this paper, the authors explain how a virtual machine monitor can be used to provide an abstraction of advanced hardware features to commodity operating systems. They explain how their implementation of a VMM achieves performance comparable to an OS running on native hardware by reducing overheads.

Problem:

The main issue that the authors are trying to tackle is reducing the gap between hardware advances and system software capability. Their argument is that system software requires a significant amount of time to incorporate hardware innovations, and hardware vendors face the additional challenge of convincing OS companies to support these changes. Considering that Windows XP was still supported well into Windows 8's launch, having system software support multiple hardware generations is clearly an issue for OS vendors. In addition, the authors also need to tackle the issues, such as overhead, that normally affect VMs.

Contributions:

The authors revive the concept of virtual machine monitors and use it to abstract away the nuances of the underlying hardware from the OS. This allowed them to mask the non-uniform nature of the FLASH system, while also giving their system the capability to support a number of operating systems.

The authors raise a number of issues that current operating systems could face given the way hardware is evolving. I think that dealing with heterogeneous systems is a valid concern, and a distributed kernel structure does look like an interesting potential solution. However, the authors don't really address how applications could be compiled for these distributed kernel systems.

In addition, they introduced a number of enhancements to the VMM to overcome the issues that normally plague virtual machine systems. They reduced the overhead of resource management by making their VMM aware of the priority of processes. With page replication and migration, they introduced features that optimise the memory layout in a NUMA system.

They introduced a few features to VMs that are still used today. For instance, they allow all systems running on the VMM to access the same address space and place the mapping from a system's physical addresses to the actual machine addresses in the TLB.

Evaluation:

Since the FLASH system wasn't ready at the time of this paper's publication, the authors tested their system using a simulator, comparing the performance of a NUMA-unaware IRIX system with their VMM-supported version. Their results show that they were able to achieve performance comparable to a system running directly on the simulator.

Their tests also show that sharing memory between the systems running on the VMM reduces the amount of memory needed by the system. While these initial results are promising, they were still obtained using a simulator, which reduced the number of workloads they could test due to the amount of time needed for execution. It would be interesting to see whether their results translate to the actual hardware.

Confusion:

The way the authors describe it, it seems like virtual machines had fallen off the radar at this point of time. How did VMs gain popularity in their initial run and what led to their fall from grace?

Summary :
The paper describes a virtual machine monitor named Disco, on top of which various commodity operating systems can run without having to be tailored to the underlying scalable multiprocessor's non-uniform memory access architecture. This approach achieves scalability with minimal or no change to the operating systems of the virtual machines, along with the flexibility to support different kinds of workloads.

Problem :
The problem in question is to design the virtual machine monitor such that it effectively virtualizes the CPU, the memory and the I/O devices transparently and employs mechanisms to reduce the overhead of the additional level of indirection that is introduced due to the VMM.

Contributions :
1. Virtual CPUs provide an abstraction of the processor, Disco abstracts the underlying memory as physical memory, and I/O devices are abstracted as virtual disks; access to network devices is also virtualized.
2. Disco takes the NUMA architecture into consideration and replicates its own code on each node so that it can always be accessed locally.
3. Appropriate data structures store and retrieve the state of a virtual CPU when scheduling between virtual CPUs. A supervisor mode is provided for the virtual machines' operating systems so that they can access certain segments of the address space.
4. The VMM translates the physical address for a virtual address generated by the OS into the corresponding machine address and inserts it into the TLB. Since the TLB is flushed on each virtual CPU switch, Disco maintains a second-level software TLB caching recent virtual-to-machine address translations.
5. To handle the NUMA architecture, Disco replicates read-shared pages across multiple nodes when they are frequently accessed by all of them. Another mechanism is page migration, which moves a page (changing the TLB mapping) to the node that accesses it. The memmap data structure is used to perform TLB shootdowns on replication or migration.
6. Multiple virtual machines that share a disk can share machine memory, avoiding unnecessary disk accesses. Copy-on-write gives a VM a private copy when it writes, so the shared page is never modified in place.
7. Virtual machines can communicate using distributed protocols like NFS. This enables sharing of files between virtual machines.
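The second-level software TLB mentioned in point 4 can be sketched as a simple cache consulted after a hardware TLB miss. The structure and eviction policy below are assumptions for illustration, not the paper's actual data layout:

```python
# Sketch (illustrative, not Disco's code) of a second-level software TLB:
# after the hardware TLB is flushed on a VCPU switch, recent virtual ->
# machine translations can still be served from this cache instead of
# re-deriving them from the guest's page tables.

class SoftwareTLB:
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.entries = {}  # vpage -> machine_page

    def lookup(self, vpage):
        return self.entries.get(vpage)  # None models a miss (slow path)

    def insert(self, vpage, mpage):
        if len(self.entries) >= self.capacity:
            # Crude FIFO eviction: drop the oldest entry.
            self.entries.pop(next(iter(self.entries)))
        self.entries[vpage] = mpage


stlb = SoftwareTLB(capacity=2)
stlb.insert(0x1000, 7)
hit = stlb.lookup(0x1000)    # served without the costly slow path
miss = stlb.lookup(0x2000)   # would fall back to the full translation
stlb.insert(0x2000, 8)
stlb.insert(0x3000, 9)       # evicts the 0x1000 entry
```

The point is that a hardware TLB flush only empties the small hardware structure; the software cache survives the switch, so most post-switch misses are cheap.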

Evaluation :
SimOS is used to simulate the hardware, and the performance of IRIX alone is compared with that of IRIX running on Disco. The system is evaluated using four varied workloads, and performance is analysed extensively in terms of the time spent in each of the services. The results quantify how workload performance improves through replication and data migration (locality), memory sharing, etc. Since Disco is NUMA-aware, increasing the number of VMs improves scalability.

Confusion :
The changes to the hardware abstraction level in IRIX were not quite clear. The paper mentions wait-free synchronization using a MIPS instruction pair for data structures that are shared and accessed by multiple processors. How does that work?

1. Summary
This paper examines the problem of extending modern commodity operating systems to run efficiently on large-scale shared-memory multiprocessors without a significant development effort, by introducing an additional layer of software, the virtual machine monitor, between the hardware and the operating system. It describes the design and implementation of Disco, a prototype virtual machine monitor that can run multiple copies of operating systems on a multiprocessor, and how Disco minimizes some of the drawbacks of traditional virtual machine monitors.

2. Objective
Hardware innovation was rapid, and system software innovation was not keeping pace, which was becoming an impediment to hardware innovation. Existing commodity operating systems required a lot of effort and change to support and effectively use new hardware such as shared-memory multiprocessors. The idea of this paper is to use a layer of software between the hardware resources and the OS to run multiple copies of commodity and special-purpose operating systems on these multiprocessors without changing the operating systems, while still providing good performance compared to running them on uniprocessors.

3. Contributions
The main contribution of this paper is popularizing the use of virtual machine monitors and showing how Disco overcame some of the overheads associated with traditional virtual machine monitors by enhancing resource sharing among virtual machines.
Disco coupled the different virtual machines running on the same system using standard distributed-systems protocols like TCP/IP and NFS. It allows efficient sharing of resources like memory, processor, and disk between these virtual machines. This became possible by transparently sharing major data structures, such as program code and the file system buffer cache, among the virtual machines.
Disco enabled running multiple operating systems on scalable hardware without changing the structure of the OS to make effective use of the hardware. This allowed these operating systems to be used as commodity software on regular hardware, as well as concurrently with other commodity or special-purpose operating systems on scalable hardware, thereby addressing many problems of system software development such as development effort, reliability, and memory footprint.
Disco, the virtual machine monitor, schedules the virtual resources (processor and memory) of the virtual machines onto the physical resources of the scalable multiprocessor, generally catering to the different kinds of services running on each virtual machine. One can draw an analogy to today's schedulers scheduling different kinds of jobs on a cluster of machines.

5. Evaluation
The authors run experiments on realistic workloads to show that Disco achieves its goals. With some modifications to a commodity operating system of that era, the overhead associated with virtualization ranged from 3% to 16% across all their uniprocessor workloads. They showed that a system running eight virtual machines could run some workloads 1.7 times faster than a commercial symmetric multiprocessor operating system by increasing the scalability of the system software, without significantly increasing the system's memory footprint. They also showed that page migration and replication allowed Disco to hide the NUMA-ness of the memory system, reducing execution time by up to 37%. Thus, with Disco, they were able to overcome some of the drawbacks of traditional virtual machine monitors.

4. Confusions
Is it possible to run multiple operating systems on Disco without making any changes at all? The paper states that only small changes were required to run an OS on the virtual machine monitor.

1. summary

This paper discusses the approach of using virtual machine monitors to build system software for shared-memory multiprocessor computers. In particular, the design and implementation details of a prototype called Disco are discussed, and the performance of Disco is evaluated on a simulator as well as on an actual uniprocessor.

2. Problem

The authors contend that making extensive changes to existing operating systems to support scalable machines is complex and costly in terms of implementation. For example, significant changes such as building a single system image across the units and providing fault containment have to be built in. Furthermore, even when such changes are functionally complete, the system software tends to be unreliable and buggy.

3. Contributions

The major contribution of the paper is the introduction of the idea of a virtual machine monitor (Disco), a "software layer between the hardware and the multiple virtual machines that run independent operating systems". Disco virtualizes and manages the resources (memory and CPU) of the machine so that multiple virtual machines (running different operating systems) can co-exist on one multiprocessor.

Disco uses a number of abstractions to achieve its goal. The details of the implementation of these abstractions forms the key contributions of this paper.

CPU
--------
Each virtual CPU is directly executed on a real CPU. To enable this, each virtual CPU stores the saved state of the CPU for the periods when it is not scheduled. There is also a simple scheduler that shares the physical processors among the virtual processors. Furthermore, the architecture is extended to provide "efficient access to some processor functions", for example kernel operations such as enabling and disabling CPU interrupts.
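The saved-state-plus-scheduler idea can be sketched roughly as follows. The field names and round-robin policy here are illustrative assumptions, not Disco's actual structures:

```python
# Toy sketch of VCPU state saving and a round-robin scheduler that
# multiplexes virtual CPUs onto physical CPUs (names are illustrative).

from collections import deque

class VCPU:
    def __init__(self, vm_id):
        self.vm_id = vm_id
        # Register state preserved while the VCPU is descheduled, so direct
        # execution can resume exactly where it left off.
        self.saved_regs = {"pc": 0, "sp": 0}

class Scheduler:
    def __init__(self, vcpus):
        self.run_queue = deque(vcpus)

    def pick_next(self, running=None):
        # Save the outgoing VCPU (its state lives in saved_regs), requeue
        # it, and hand the physical CPU to the next VCPU in line.
        if running is not None:
            self.run_queue.append(running)
        return self.run_queue.popleft()


sched = Scheduler([VCPU(0), VCPU(1), VCPU(2)])
first = sched.pick_next()                 # VM 0's VCPU runs first
second = sched.pick_next(running=first)   # VM 0 is descheduled, VM 1 runs
```

A real monitor would also restore privileged state and flush or tag TLB entries on the switch; the sketch only captures the save/requeue/dispatch cycle.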

Memory
--------
An additional level of address translation is added. Virtual machines are provided physical addresses starting at zero. Disco maps these to the actual machine addresses using the TLB of the MIPS processor.

For each virtual machine, a "pmap" data structure is kept, which precomputes the mapping from each physical page to its location in real memory so that the corresponding TLB entries can be computed quickly.

Since the TLB is used for ALL operating system references, applications running on top of Disco will suffer from increased TLB misses. Thus, a second-level software TLB is also maintained. This caches recent virtual-to-machine translations.

Finally, a dynamic page migration and replication system is implemented. This moves or replicates pages to "maintain locality between a virtual CPU's cache misses and the memory pages to which the cache misses occur".
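The migration/replication policy can be caricatured as counters on remote cache misses. The threshold and data layout below are arbitrary illustrative assumptions, not the paper's actual heuristics:

```python
# Toy model of Disco-style page placement: count remote cache misses per
# (page, node). A hot read-shared page is replicated to the missing node;
# a hot page with a writer is migrated instead. THRESHOLD is arbitrary.

THRESHOLD = 4

class PagePlacer:
    def __init__(self):
        self.miss_counts = {}  # (page, node) -> remote miss count
        self.replicas = {}     # page -> set of nodes holding a local copy

    def record_miss(self, page, node, writable):
        key = (page, node)
        self.miss_counts[key] = self.miss_counts.get(key, 0) + 1
        if self.miss_counts[key] < THRESHOLD:
            return "leave"           # not hot enough to act on yet
        nodes = self.replicas.setdefault(page, set())
        if writable:
            nodes.clear()
            nodes.add(node)          # migrate: the one writable copy moves
            return "migrate"
        nodes.add(node)              # replicate read-shared page locally
        return "replicate"


p = PagePlacer()
# Node 1 repeatedly misses on a read-only page: eventually replicated.
actions = [p.record_miss(page=5, node=1, writable=False) for _ in range(4)]
```

A real implementation would also update the memmap structure and shoot down stale TLB entries on the other nodes; the sketch captures only the decision logic.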

I/O
--------
Disco intercepts all accesses to I/O devices from the virtual machines and forwards them to the physical devices through its device drivers.

Network
--------
Standard distributed protocols such as NFS are used for communication. To eliminate duplication, a "virtual subnet" is introduced which allows virtual machines to communicate with each other.

4. Evaluation

The performance of Disco is first evaluated on SimOS, a machine simulator, under different categories of workloads. The overhead of virtualization is shown to range between 3% and 16% in the uniprocessor case. However, in a system with 8 virtual machines, the execution speedup is shown to be as high as 1.7 times. Hence, this shows that Disco achieves its goal of scaling favorably.

5. Confusion

I am not sure about the part in the implementation of virtual memory where it states that having each "operating system trap into the monitor would lead to unacceptable performance". Why exactly does each operating system need to trap into the monitor (by which I assume they mean Disco), and what was the solution for this?

I am also not sure of how the copy-on-write mechanism helps to reduce copying in the case of virtual machines communicating with each other.

1. Summary
The paper presents the design of Disco, a virtual machine monitor. Disco acts as a layer of indirection between the hardware and operating system(s), and lets multiple operating systems run simultaneously as virtual machines on the same physical machine. Disco imposes minimal overheads in virtualizing the physical resources across the virtual machines.

2. Problem
According to the authors, extensive effort is required to adapt commodity operating systems to the innovations in hardware. The authors address this issue by designing a virtual machine monitor, which accepts the responsibility of effectively using modern hardware (e.g., CC-NUMA), and presents a simplified view of the hardware (e.g., UMA) to commodity OSes that they expect.

3. Contributions
The authors' primary contribution in Disco involves designing and implementing the mechanisms for virtualizing access to physical resources (CPU, memory, and I/O devices) from OSes. Disco tries to expose near uniform memory accesses to OSes running on it using dynamic page migration and page replication of hot pages. Disco also interposes and translates DMA requests by I/O devices, and, in the process, exploits opportunities to share storage and memory across virtual machines running on it.

4. Evaluation
The authors evaluate the virtualization overheads of Disco by comparing the performance of various application workloads on IRIX running on bare metal and on top of Disco. They evaluated the memory footprint of Disco running multiple virtual machines to show the effectiveness of transparent memory sharing. They also showed the benefits of page migration and replication in localizing memory accesses within virtual machines.

5. Confusion
Can a NUMA-aware operating system run on Disco and control page allocation across the NUMA nodes? If not, Disco seems like a step in the reverse direction, restricting access to newer hardware features for OSes that can adapt quickly. Also, what is an address space identifier, and why does it force Disco to flush a virtual machine's TLB entries when the VM is preempted?

1. Summary

Disco is a virtual machine monitor designed to run commodity operating systems scalably on multiprocessors. It uses several techniques, including page replication and page sharing, to implement the VMM efficiently while still providing a uniform environment for guest operating systems to work in.

2. Problem

Many operating systems lack efficient support for multiprocessor systems, especially those with Non-Uniform Memory Access (NUMA) properties. Furthermore, other VMMs often have unacceptably high overhead and are unable to manage system resources efficiently due to a lack of knowledge about the intentions of the OS running on them.

3. Contributions

First, to hide the NUMA nature of the architecture, Disco implements transparent replication of pages. This allows different virtual CPUs of the same virtual machine to execute on separate nodes of the system, while still allowing each to access memory with latency similar to what would be expected if the OS were running on a more typical uniform memory access system. In particular, a physical page may have different mappings to machine pages on different CPUs, so Disco can choose to map the physical page to a machine page that has lower latency for the physical node each CPU is running on.

The dual of this concept is page sharing, where physical pages, often from different VMs, can share the same machine page. For data read from disk, all DMA requests to the disk are intercepted, and if the data already exists in memory, the existing copy is used instead of rereading it. Furthermore, rather than actually making a copy in machine memory, copy-on-write techniques are used to share the page between the machines. The consequence is that only one copy exists of most read-only data. They implement a similar technique for networks, allowing data shared over the network between different VMs on the same hardware to be sent without copying; in particular, this lets them de-duplicate data shared over NFS.
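The DMA-interception-plus-sharing idea sketched above can be modeled as a cache keyed by disk block, with a write triggering a private copy. All names and the page-numbering scheme here are illustrative assumptions:

```python
# Sketch of transparent page sharing across VMs: a disk read (DMA) for a
# block already resident in machine memory maps the existing page into the
# requesting VM read-only; a later write gets a private copy (copy-on-write).

class SharedDiskCache:
    def __init__(self):
        self.block_to_page = {}  # disk block -> shared machine page id
        self.next_page = 0

    def dma_read(self, block):
        # Second and later readers share the resident page instead of
        # issuing another disk read.
        if block not in self.block_to_page:
            self.block_to_page[block] = self.next_page
            self.next_page += 1
            return self.block_to_page[block], "read from disk"
        return self.block_to_page[block], "shared existing page"

    def write(self, block):
        # Copy-on-write: the writer gets a fresh private page; the shared
        # copy and the other VMs' mappings stay untouched.
        private = self.next_page
        self.next_page += 1
        return private


cache = SharedDiskCache()
p1, how1 = cache.dma_read(block=42)   # first VM reads the block
p2, how2 = cache.dma_read(block=42)   # second VM shares the same page
p3 = cache.write(block=42)            # a write yields a private copy
```

Both readers end up mapping the same machine page, so only one copy of the read-only data exists; the write breaks the sharing for that VM only.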

There is also a group of techniques that I will group under paravirtualization, although this term is not used by the paper itself. These are calls from the operating systems to the monitor, mostly to reduce cases where performance would be hampered by repeated traps into the monitor, though a few of the changes simply make certain common operations, such as getting an empty page of physical memory, more efficient.

4. Evaluation

First they consider overhead. Looking at 4 different workloads, they find an overhead ranging from 3-16%, with the worst case occurring on disk-intensive workloads like Pmake and Database, and the best case on Raytrace, a CPU-intensive workload.

They also consider the memory usage of the system, in particular the savings they get from page sharing. To measure this, they track real memory usage as they vary the number of VMs, compared with the combined virtual footprint of the VMs. For 8 VMs running the same task, they end up using half as much machine memory as the virtual footprint would suggest, largely from the sharing of code and of the buffer cache.

For performance scalability, they show that for Pmake, a relatively parallelizable task, running the task on multiple VMs yields a significant speedup over running it either on only one VM or directly on IRIX. This likely represents a best-case scenario of a highly parallelizable task; nonetheless, the test does show that Disco can effectively exploit the multiprocessor nature of the system. Furthermore, they test the performance impact of page migration, showing that Disco significantly reduces cases where cache misses would require reading memory from a remote node and thus incur a latency penalty due to the NUMA nature of the system.

What they don't test is how Disco scales across a variety of hardware configurations, in particular what the effect of having more or fewer physical CPUs might be.

5. Confusion

To what extent do the virtualized drivers for Disco mirror real devices, and to what extent do they present a different interface?

Summary:
The paper provides an overview of the design principles and ideas that can be applied to system software to scale it up to multiprocessor systems: a layer of abstraction, called the virtual machine monitor, is added between the commodity OS and the hardware. This approach is demonstrated using Disco, a prototype used to measure performance, scalability, and overheads.

Problem:
Conventional OSes, when scaled up to meet the requirements of multiprocessor systems, are unreliable and inefficient due to the overheads introduced by their large code bases. The problem of fault containment in shared-memory multiprocessors adds to the difficulty, and NUMA architectures pose further challenges to the OS's memory managers. Modifying an existing OS to address these issues leads to huge and buggy systems.

Contributions:
The approach reintroduced the concept of virtual machines, using virtual machine monitors, which enabled the use of existing, reliable, standardized OSes to build systems for shared-memory multiprocessors. The system was treated as a cluster of machines, with communication over standard protocols such as NFS and shared memory. Having a virtual machine monitor also allowed different OSes to run on the same multiprocessor and offered the flexibility to extend to larger machines with little change to the monitor or the OSes. Dynamic page migration and replication abstracted away the underlying NUMA architecture, providing a UMA interface to the OS. A notable design feature was the use of loads/stores on special addresses to access privileged registers, which eliminated many overheads due to traps into the monitor. Interprocessor interrupts were used to signal state-change operations such as TLB shootdown, which I believe was more efficient than conveying the same with messages. The commodity OSes were given access to a few protected segments of the address space, but not to privileged operations, which is an extension of the limited direct execution principle. The software TLB, translating virtual addresses to machine memory addresses, was a key decision that enabled easy address translation.
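The extra level of translation described above (guest virtual to guest physical to machine) can be sketched with the paper's pmap and memmap structures. The structure and names below are my own simplification, not the actual Disco code:

```python
# Illustrative sketch of Disco's bookkeeping: a per-VM pmap translating
# physical to machine pages, and a global memmap recording which
# (vm, physical page) pairs reference each machine page (the back-pointers
# needed for TLB shootdown on migration).

pmap = {"vm0": {}, "vm1": {}}   # per-VM: physical page -> machine page
memmap = {}                     # machine page -> set of (vm, physical page)

def map_page(vm, ppage, mpage):
    pmap[vm][ppage] = mpage
    memmap.setdefault(mpage, set()).add((vm, ppage))

def translate(vm, vpage, guest_page_table):
    """Guest translates virtual->physical; the monitor then supplies the
    virtual->machine translation that actually lands in the hardware TLB."""
    ppage = guest_page_table[vpage]   # guest OS translation
    return pmap[vm][ppage]            # monitor's extra level of indirection

map_page("vm0", ppage=4, mpage=99)
guest_pt = {0x10: 4}                  # guest: virtual page 0x10 -> physical 4
assert translate("vm0", 0x10, guest_pt) == 99
assert memmap[99] == {("vm0", 4)}
```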

Evaluation:
The authors provide various evaluations showing the effectiveness of the approach. First, they describe the workloads and justify the choices made. The first evaluation shows that Disco caused an overhead of 3-16%, with the 16% overhead and a slowdown of 2.15 in the case of Pmake; the reason for this was also evaluated, and was found to be Pmake's heavy use of OS services. Next, memory overheads were evaluated, and effective sharing of the cache was seen to limit the overhead. Scalability was evaluated by comparing the execution times of multiple VMs against IRIX, and execution time improved as the number of VMs increased. Page migration and relocation also helped improve the execution times of the Raytrace and Engineering workloads. Graphs are provided for all the measurements.

Confusions:
The method of updating the TLBs at different levels using pmap and memmap is a little vague to me.

Summary
Rather than modifying existing operating systems to run on scalable shared-memory multiprocessors, the paper proposes a new layer of software, called the virtual machine monitor, that virtualizes all the resources of the machine (memory and processors) so that multiple, potentially different, operating systems can coexist on the same multiprocessor.
Problem
The paper mainly addresses the problem faced by the hardware vendors. The market did not have system software to support the new innovative hardware architecture. The existing "commodity" operating systems were not flexible and needed significant changes to be made compatible and run efficiently on the new hardware such as ccNUMA. Thus the success of this new hardware is greatly impeded by the cost of modifying these commodity operating systems.
Contribution
Design and implementation of Disco, a virtual machine monitor designed for the scalable cache-coherent FLASH multiprocessor. Disco allows non-NUMA-aware commodity operating systems to work well even on a NUMA machine through dynamic page migration and page replication. As a result, ccNUMA machines can continue to support the large number of applications developed to run on commodity operating systems. The abstraction of virtual physical memory is provided by adding an extra level of address translation using the TLB. Virtual-to-machine translations are also stored in a second-level software TLB to avoid the cost of TLB misses. pmap and memmap data structures are maintained to map between virtual, physical, and machine addresses. A copy-on-write mechanism allows sharing of memory resources across virtual machines. Disco supports communication between virtual machines through an internal virtual subnet, and with other real machines through standard network interfaces.
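The second-level software TLB mentioned above can be sketched as a cache consulted on a hardware TLB miss, before the expensive guest page-table walk. The capacity, eviction policy, and names below are invented for illustration:

```python
# Minimal sketch of a second-level software TLB: on a hardware TLB miss,
# the monitor first checks its own cache of recent virtual-to-machine
# translations before emulating the guest's page-table walk.

class SoftwareTLB:
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = {}               # (asid, vpage) -> machine page

    def lookup(self, asid, vpage):
        return self.entries.get((asid, vpage))

    def insert(self, asid, vpage, mpage):
        if len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))  # evict oldest entry
        self.entries[(asid, vpage)] = mpage

stlb = SoftwareTLB()

def on_tlb_miss(asid, vpage, slow_translate):
    cached = stlb.lookup(asid, vpage)
    if cached is not None:
        return cached                  # fast path: no page-table walk
    mpage = slow_translate(vpage)      # emulate guest walk + pmap lookup
    stlb.insert(asid, vpage, mpage)
    return mpage

calls = []
def slow(vpage):
    calls.append(vpage)
    return vpage + 1000

assert on_tlb_miss(1, 5, slow) == 1005   # first miss walks the tables
assert on_tlb_miss(1, 5, slow) == 1005   # second miss hits the cache
assert calls == [5]                      # slow path taken only once
```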
Evaluation
Since the FLASH machine was not available, the authors ran their experiments on the SimOS machine simulator to evaluate Disco. Simple, realistic, short workloads were run to study issues like the CPU and memory overhead of virtualization, the benefits of scalability, and NUMA memory management. Different setups and workloads were used to evaluate the execution and memory overhead of running Disco. Overall execution overhead of virtualization ranged from 3% to 16%. Disco clearly outperformed IRIX on NUMA and UMA for the Engineering and Raytrace workloads (workloads that have large memory footprints) because of its page migration and replication implementation, achieving up to a 38% improvement. Overall, Disco demonstrated excellent performance on these workloads.
Confusion
Disco keeps a data structure that contains the saved registers and state of a virtual CPU when it is not scheduled on a real CPU. Is the cache of the real CPU also saved when switching to a different virtual machine?

Disco: running commodity operating systems on scalable multiprocessors

1. summary
The paper discusses the idea of inserting a new software layer, a virtual machine monitor, between the hardware and the operating systems, to support system scalability over multiprocessors and evolving hardware at a lower cost. A prototype called Disco is designed and implemented to conduct experiments that prove the feasibility of the idea and show how the engineering difficulties can be solved.

2. Problem
This paper talks about the following problems that lead to the design of virtual machine monitors:


  • With the fast pace of hardware innovation, significant changes are constantly required in the corresponding system software. However, the engineering effort involved is too large to keep pace with the delivery of innovative hardware. As a result, system software is often unavailable or late for new hardware.

  • Newly developed system software is likely to be unstable and buggy, while users demand highly reliable computing systems. Given this dilemma, the increased costs might not be worth the benefits of innovative hardware.
  • Hardware vendors need operating system changes to support new hardware, yet the constraints and inflexibility of system software make it hard and costly to adapt to hardware innovations.


3. Contributions
This paper suggested a return to the old idea from the 1970s of virtual machine monitors, to address the problem of OS scalability over multiprocessors. The major contributions are:


  • The paper designed an extra layer of software called a virtual machine monitor, placed between the hardware and the operating systems. The virtual machine monitor virtualizes all the hardware resources of a machine and presents a conventional hardware interface to the operating systems above. This makes it much easier to achieve scalability as new hardware comes out.

  • With virtualized resources such as processors and memory, the virtual machine monitor is able to schedule the resources in a flexible way according to the needs of operating systems. It helps balance the load across the machine, and makes better use of free memory.
  • Commodity operating systems run on different virtual machines over the virtual machine monitor layer, making it possible to share memory regions explicitly among applications across virtual machine boundaries.
  • The method provides flexibility to support resource-intensive applications in specialized operating systems, and it is easy for operating systems that do not need the full functionality of the commodity operating systems to scale to the size of the machine. A wide variety of workloads are supported flexibly and efficiently.
  • Within a virtual machine, system software failures are contained without spreading over the entire machine. The approach also offers an excellent way of introducing new system software while still providing a stable environment for older applications.
4. Evaluation
The idea was evaluated by implementing and running a prototype, Disco, to measure the virtualization overheads of virtual machine monitors.

5. Confusion
What is the difference between the idea of Disco and the Multikernel (by Andrew Baumann et al.)? Why aren't they adopted in modern operating systems, while the problems discussed in those papers still exist?

    1. Summary
    The paper presents a virtual machine monitor (VMM) called Disco, specifically designed to tackle the loss of efficiency of operating systems running on NUMA architectures. When virtualizing OSes, all abstractions such as processes, memory, and I/O go through another level of abstraction, and the authors present a clear idea of how this is done correctly and efficiently using a simple and small VMM.

    2. Problem
    With the advent of ccNUMA architectures, existing operating systems could not fully exploit the advances in the hardware. Also, running different versions of an OS, or totally different OSes depending on the functionality required, was not possible without a VMM. The VMM is thus an important piece of software that sits on the hardware and virtualizes it for the OSes, but it had to be small and simple. This in turn provides scalability of resources such as memory and I/O devices.

    3. Contributions
    a) Physical Memory Abstraction: Since OSes run on top of the VMM like processes on top of an OS, there has to be another level of indirection to memory as part of the abstraction. Virtual addresses are translated to physical addresses by the OS, and physical addresses are in turn translated to machine addresses by the VMM. Disks are virtualized too, by mounting them for the OS.
    b) Implementation: To keep the Disco code simple and cache-efficient, it is purposely kept small and avoids linked lists so as to avoid random memory accesses. Locks and synchronization are also reduced.
    c) Virtual CPUs: Instructions execute pretty much as on a normal OS, while special instructions such as traps and page-fault handlers trap into the VMM. Kernel code is relocated into a specific area of memory so that the OS does not have to trap into the VMM while executing in supervisor mode.
    d) NUMA Memory Management: Frequently accessed "hot" pages are migrated or replicated across nodes for faster access. When pages are replicated or migrated, the corresponding TLB entries are shot down using back-pointers from a data structure called memmap, which acts like an inverted directory (machine addresses back to the pages mapping them).
    e) Inter-VM communication: The OSes communicate through shared memory by using a protocol similar to NFS where the server copies the data to the client’s buffer cache.
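The shootdown mechanism in point (d) can be sketched as follows. This is an illustrative simplification, not the real implementation; the structure names are invented:

```python
# Sketch of hot-page migration: the monitor moves a page, then uses
# back-pointers from machine page to TLB entries to invalidate every
# stale translation, so the next access faults and picks up the new page.

backmap = {}             # machine page -> set of (vcpu, vpage) entries mapping it
tlbs = {0: {}, 1: {}}    # per-vCPU TLB: vpage -> machine page

def tlb_fill(vcpu, vpage, mpage):
    tlbs[vcpu][vpage] = mpage
    backmap.setdefault(mpage, set()).add((vcpu, vpage))

def migrate(old_mpage, new_mpage):
    """Move the page, then shoot down every TLB entry found via backmap."""
    for vcpu, vpage in backmap.pop(old_mpage, set()):
        del tlbs[vcpu][vpage]   # invalidate; a refill faults to the new page

tlb_fill(0, 0x20, 7)
tlb_fill(1, 0x30, 7)
migrate(7, 42)
assert 0x20 not in tlbs[0] and 0x30 not in tlbs[1]   # stale entries gone
```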

    4. Evaluation
    The Disco VMM is evaluated on the SimOS simulator with IRIX running on it, although it was intended to target the FLASH architecture. Four different types of workloads are tested with and without the VMM, and the results suggest that Disco does not cause a lot of overhead; much of the kernel time in the OS is converted to Disco time in the VMM. Disco is also seen to work efficiently with data sharing between virtual machines, sometimes even at twice the efficiency for 8 VMs. Scalability and page migration are also efficient with Disco, where IRIX does relatively poorly.

    5. Confusions
    What is QUICK-FAULT? Why does it cause so much of a slowdown on a VMM?

    Summary:

    It presents a solution to the problem of using existing operating systems on large-scale shared-memory multiprocessors without many changes or much extra effort. The authors introduce the idea of virtual machine monitors, a middle layer of software between the hardware and the various operating systems, and present a prototype virtual machine monitor called Disco.

    Problem:

    Hardware innovation at a rapid pace demands changes to the operating system so that the OS can scale, but system software has often lagged behind hardware innovation because of the huge effort required to change the OS.

    Contributions:

    The main contribution of this paper is to introduce a new level of indirection between the hardware and the operating system. This new layer, called a virtual machine monitor, enables multiple "commodity" operating systems to run on, and efficiently cooperate and share resources on, a single set of hardware. Hence, the hardware is abstracted away from the operating systems. The virtual machine monitor provides coordination as follows:

    - Provides fast translation of the virtual machines' physical addresses to real machine pages. NUMA memory management uses techniques such as page migration and replication to deal with non-uniform memory access times.

    - Scheduling of virtual CPUs on real CPUs.

    - Various techniques to efficiently share resources between virtual machines.
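The vCPU scheduling point above can be sketched as a toy round-robin mapping of virtual CPUs onto physical CPUs. Disco's real scheduler also considers affinity and other concerns that this invented example omits:

```python
from collections import deque

# Toy round-robin scheduler: each time quantum, up to num_pcpus virtual
# CPUs from the ready queue run, then rotate to the back of the queue.

def schedule(vcpus, num_pcpus, quanta):
    """Yield, per time quantum, the list of vCPUs running on the pCPUs."""
    ready = deque(vcpus)
    timeline = []
    for _ in range(quanta):
        running = []
        for _ in range(min(num_pcpus, len(ready))):
            running.append(ready.popleft())
        timeline.append(running)
        ready.extend(running)          # rotate back to the ready queue
    return timeline

t = schedule(["vm0.cpu0", "vm0.cpu1", "vm1.cpu0"], num_pcpus=2, quanta=3)
assert t[0] == ["vm0.cpu0", "vm0.cpu1"]
assert t[1] == ["vm1.cpu0", "vm0.cpu0"]
assert t[2] == ["vm0.cpu1", "vm1.cpu0"]
```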


    Evaluation:

    Various workloads from various domains were considered for evaluation. The virtualization overhead was found to range from 3% to 16%, largely due to Disco's trap emulation and TLB reload misses. Effective sharing of kernel text and the buffer cache was found to limit the memory overheads of running multiple VMs. Partitioning the problem into different virtual machines helped improve the scalability of the system, and dynamic page migration and replication policies brought performance gains by enhancing the memory locality of the workloads.

    Confusion:

    The Copy-on-write section and the disk-handling are not clear.

    Summary:
    This paper describes the design, implementation, and performance of a virtual machine monitor (Disco) built for cache-coherent shared-memory multiprocessor systems.

    Problems/Motivation:
    The authors feel that the large development costs associated with rewriting operating systems for new hardware architectures and the instability and lack of legacy application support in new system software stymie immediate large scale adoption of hardware innovations.

    Contributions/Solution:
    The authors argue that a VMM-based approach to system software development, where commodity operating systems run on top of monitors that emulate a familiar hardware environment, is better suited to producing system software that adapts to hardware changes quickly, is reliable, and will see better adoption (due to legacy application support in commodity OSes). Some important points regarding the design and implementation of Disco:
    - It emulates the processor (through limited direct execution), memory (through a distinction between physical (OS-level) and machine (actual hardware/monitor-level) memory), and I/O devices (by interposing on and emulating programmed I/O, direct memory access, and all other I/O accesses).
    -It uses data migration and replication techniques to hide the NUMA-ness of memory access in the FLASH system from the NUMA-unaware operating systems.
    -It shares machine memory across virtual machines by remapping same machine memory pages into physical pages of different machines and using copy-on-write semantics.

    Evaluation:
    The authors provide a detailed analysis of the performance (on short workloads) of a simulation of the FLASH system (and on real hardware) with Disco running on top of it, comparing IRIX running directly against a varying number of VMs on Disco. Even on uniprocessor workloads the overall overheads appear small (6-8%), and with multiple VMs running on multiple processors they show that the memory overheads are offset by the memory-sharing schemes, that execution time scales down considerably due to avoidance of kernel synchronization across VMs, and that the data migration and replication schemes bring memory access latencies closer to UMA levels.
    But they do not provide any empirical analysis of improvement in system software development time and reliability, which were stated as important motivating factors for the VMM based design.

    Confusions:
    The authors say that the vmm layer provides the ability to quickly develop system software to keep pace with hardware innovations. But if we just change the vmm software to track the hardware changes, don't we degrade the ability to do 'direct execution' of commodity OSs on top of each new generation of hardware (and hence lose performance) ? Can you talk about the limits of the actual commercial viability of the vmm approach and the bounds beyond which we have no choice but to make a full scale OS change?
    Does vmm actually guarantee any more savings in system software development effort than a clearly defined HAL abstraction in the operating system?
    How do the trap-free load and store instructions for changing privileged registers work? i.e., how can the OS change the registers without trapping into the monitor?

    1. Summary
    This article describes an approach to running multiple operating systems on a multiprocessor. The solution described in the paper is the use of virtual machines to efficiently share the hardware.
    2. Problem
    Large-scale multiprocessors (10-100 CPUs) were becoming available without reliable operating system support, unless extensive modifications were undertaken to make the OS efficiently support the large-scale machines it would run on. Disco proposed to solve this problem without extensive changes to the existing operating systems of the time.
    3. Contributions
    Disco introduces an additional layer that runs between the commodity operating system and hardware. This layer is a "virtual machine monitor" which handles scheduling of the hardware resources and provides a set of virtualized hardware that each operating system can handle.
    With minor modifications to the guest operating systems, Disco allows memory to be shared between operating systems. This allows guests to make better use of resources that would otherwise be duplicated between instances.
    Disco also exports a view of memory as UMA (uniform memory access) even while running on NUMA physical hardware. All emulated "physical" addresses are translated to machine memory locations using the TLB of the MIPS processor it runs on.
    All I/O devices are also emulated by Disco, so that guest operating systems can take "exclusive" control of the virtual devices like they expect. Virtual disk accesses are mapped into memory and accesses to shared devices use the same memory, which leads to improved performance.
    4. Evaluation
    Disco was evaluated using a simulated version of the system it was developed for, but since simulator overheads were so high, the processor model used was significantly simpler than the MIPS R10000 that was originally targeted.
    Four different workloads were tested, once in the simulator and once on a physical uniprocessor. Disco was found to have an overhead of around 3%-16% when compared to the IRIX operating system being run directly. The authors present a good breakdown of what stages of the virtualization system caused the overhead.
    Memory overhead is also compared to a standard IRIX system and the source of the overhead is again broken down.
    5. Confusion
    One problem with a virtual machine monitor is said to be that it cannot distinguish idle-loop spinning from real computation and may schedule resources to this useless computation. At the time this paper was written, were there not machine instructions specifically for idling/sleeping the CPU that would have allowed the monitor to detect this? A reduced-power mode of the R10000 was mentioned, so I'd assume that an idle loop would issue some sort of instruction to enter it?

    Summary
    This paper presents Disco, system software suitable for scalable shared-memory multiprocessors. The authors share their work and discuss how the system scales to other commodity OSes with low development effort. The hardware targeted in this paper is the ccNUMA architecture. The concept of inserting a layer, the virtual machine monitor, is an extension of VM/370 (from the 1970s) and is the main focus of this paper. A detailed evaluation of Disco shows that the overhead of virtual machines is modest (on IRIX). Disco also deals with memory management issues in NUMA and ultimately presents a UMA-like architecture to the guest OS.

    Problem
    Until the late 90s, OSes for shared-memory multiprocessors lagged far behind the hardware, and extensive custom modifications to an OS were required for scalable machines. Such modifications are implementation-intensive and not reliable (for example, in the face of hardware or workload changes). Backwards compatibility was an issue too. Disco tries to tackle these problems in addition to the traditional challenges posed by VMs in general (execution overheads, resource management, communication, and sharing).

    Contribution
    The biggest contribution of Disco is the implementation of the concept of a "software layer between the OS code and the hardware". This layer is the virtual machine monitor, which virtualizes all system resources, enables multiple OSes to run concurrently, protects one virtual system from the others (fault containment), and allows data sharing via distributed-systems protocols.

    Disco achieves scalability: the shared multiprocessor appears like an interconnect of smaller machines. It is also flexible, as it can run commodity OSes or specialized OSes, such as those for large database systems or computationally expensive tasks.

    Lastly, the VMM is able to hide the non-uniformity of memory in the NUMA architecture and present uniform memory. Page migration and replication are introduced in order to reduce latency and keep cache misses local to the virtual CPU.

    Evaluation
    The authors have extensively tested and evaluated the system with varying workloads, numbers of virtual machines, and specialized application-based OSes. Quite a few parameters were checked across the paper and are listed below:

    1. Execution overhead – Programs were run once directly on a uniprocessor and once with Disco, and the execution time was measured across different types of workloads. An overhead ranging from 3%-16% was observed.
    2. Memory overhead – This test ran the same workload on eight different instances of pmake under six different configurations. Effective sharing of kernel text and the buffer cache reduced the overall memory overhead.
    3. Scalability – The same pmake was run in different configurations across VMs; with a single VM the overhead was greatest, while with 8 VMs the execution time was reduced to 60%.
    4. NUMAness – The performance of a UMA machine is the lower limit on the execution time of a NUMA machine, and Disco achieves significant UMAness by enhancing memory locality.

    Confusion
    SPLASHOS is termed an overly simplistic OS, but sadly I fail to understand what constitutes it. What makes it compatible with Disco, such that applications on SPLASHOS run faster than on other commodity OSes? Why is it called a thin OS, and what does SPLASHOS correspond to in modern-day VM technology?

    Summary:
    This article discusses Disco, a Virtual Machine Monitor (VMM) that allows multiple commodity operating systems to run on large scale shared-memory multiprocessors without a large software implementation effort. Disco achieves this by introducing a layer of software (monitor) between hardware and OS’s. This monitor, which aims at lowering virtualization overheads, allows the commodity operating systems running on top of it to efficiently cooperate and share resources with each other.

    Problem:
    The authors of this article were trying to bridge the gap between advances in hardware and adaptation of system software. The innovative hardware at the time of this article were CC-NUMA machines, which the commodity OS’s were not fully able to exploit. This warranted extensive changes to OS’s and legacy applications to make them portable for the new hardware. This process was not only time intensive, but also prone to system instability for a certain period of time.

    Contributions:
    - In order to avoid extensive changes to the OS's running on scalable hardware, the authors introduced a software layer called the monitor between the hardware and the commodity OS's. The monitor was kept to a few thousand lines of code and made to exploit distributed-system protocols. This made scalability easier, as the monitor and protocols only need to scale to the size of the machine.
    - The large overhead of memory virtualization was overcome by enabling sharing of memory across VM's. There was also provision for similar OS's to share the root file system and kernel code.
    - Page replication/migration was built into Disco to overcome remote access to shared pages. Disco can detect hot pages and migrate them to a particular VM or replicate them across certain VM's.
    - In order to assuage the performance damage from TLB flushes during a machine switch, a software-managed second-level TLB was introduced at the VM level.
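The hot-page detection bullet above can be sketched as a simple policy: count remote misses per machine page, and past a threshold, replicate read-shared pages or migrate pages written by a single node. The threshold and all names here are invented for the example, not Disco's actual policy parameters:

```python
# Illustrative hot-page policy: per-(page, node) remote miss counters drive
# a replicate-or-migrate decision once a page looks "hot" from one node.

HOT_THRESHOLD = 3

miss_count = {}     # (page, node) -> remote miss count
writers = {}        # page -> set of nodes that have written it

def record_miss(page, node, is_write=False):
    miss_count[(page, node)] = miss_count.get((page, node), 0) + 1
    if is_write:
        writers.setdefault(page, set()).add(node)
    if miss_count[(page, node)] >= HOT_THRESHOLD:
        if not writers.get(page):
            return ("replicate", page, node)   # read-shared: copy to node
        if writers.get(page) == {node}:
            return ("migrate", page, node)     # single writer: move to node
    return None

assert record_miss(7, node=1) is None
assert record_miss(7, node=1) is None
assert record_miss(7, node=1) == ("replicate", 7, 1)   # read-only page copied

assert record_miss(8, node=2, is_write=True) is None
assert record_miss(8, node=2, is_write=True) is None
assert record_miss(8, node=2, is_write=True) == ("migrate", 8, 2)
```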

    Evaluation:
    The article presents a detailed evaluation of Disco on axes such as execution overheads, memory overheads etc. The authors observe a slowdown of 3-16% for the gamut of workloads that are run. Execution overheads are mainly due to Disco trap emulation and TLB reload misses. They further observe the reduction in kernel time for some workloads and credit this to monitor handling several kernel related work. The authors perform a sensitivity analysis with respect to page size to show that execution overhead decreases with increased page size, as it increases the TLB reach. With respect to memory overhead, the authors observe that the effective sharing of kernel text and buffer cache limits the memory overheads of running multiple VMs. The authors also present satisfactory results for scalability and advantages of page migration/replication.

    Confusions:
    How would a VMM handle a multi- threaded application running atop a guest OS? Wouldn’t the OS be expected to be CC-NUMA aware for this?

    Summary:

    In this paper, the authors describe virtual machine monitors, a layer of software above the hardware that runs multiple commodity OSes on a scalable multiprocessor. The paper discusses the implementation of Disco, a virtual machine monitor for a ccNUMA machine, which is evaluated against different workloads.

    Problem:

    The central problem this paper tries to address is providing system software for scalable multiprocessors. When existing operating systems are adapted to run on scalable shared-memory multiprocessors, enormous effort is involved, as they need to be modified extensively. Instead, the authors discuss the idea of a layer of software between the hardware and the OS that virtualizes all the resources of the machine.

    Contribution:

    The idea of using a comparatively small layer of software, the virtual machine monitor, on top of the hardware to accommodate multiple commodity OSes is the most important contribution. Because the monitor manages only global policies, fine-grained resource management is left to the individual operating systems. By providing support for commodity software, users can migrate their applications without much difficulty, and that software need not be re-engineered to a vast extent to suit the innovative hardware. The VMM approach also supports specialized OSes for specific workloads, which helps scalability. Using dynamic page migration and page replication, the VMM provides a UMA view that makes non-NUMA-aware operating systems work on NUMA. Fault containment is possible, as failures in an OS are contained within its own VM. The implementation details of Disco contain important nuances: locality is improved by placing Disco's code segment on all memories of the FLASH system; by using direct execution on the real CPU to emulate a virtual CPU, the speed of most instructions is not compromised; a second-level software TLB mitigates the effect of TLB misses during CPU switches; copy-on-write semantics help Disco abstract the disk; and though VMs communicate via NFS, data duplication is prevented by a virtual subnet managed by Disco.

    Evaluation:

    The implementation of Disco was evaluated against different workloads. The execution overheads showed that Disco performed comparably on the Engineering and Raytrace workloads. The overheads were greater for the Pmake and Database workloads, as they make heavy use of OS services for file systems that cannot be handled directly and so cause traps into the VMM. Memory sharing in Disco was evaluated by comparing the virtual physical footprint against the actual machine footprint, which shows that the memory sharing optimizations are indeed effective. Dynamic page migration and replication also proved effective, producing a reduction in execution time. Though the evaluations seem convincing, they were run on short workloads, and hence we cannot draw strong conclusions about Disco's performance.

    Confused about:

    I could not fully understand the section which deals about the changes made to the HAL and IRIX.

    Another question I've been thinking about tonight -- are VMware ESXi and Windows Hyper-V basically modern day implementations of a Disco-inspired architecture? Or are there fundamental differences? Can we talk about this in class tomorrow?

    Summary:

    The paper implements a virtual machine monitor (VMM), namely Disco, which helps port different OSes to a scalable multiprocessor. Disco provides a VM for each OS and improves performance across multiple OSes by sharing program code and similar data structures. It provides a layer of abstraction between the OSes and the hardware.

    Problems:

    As scalable multiprocessors came into existence, it became increasingly complex to support existing operating systems so that they completely utilize this hardware capability. A lot of effort was required to modify an OS to run on this advanced hardware, and since each OS consists of many lines of code, the chances of introducing bugs were high too. A method was needed by which existing OSes could be migrated to the new hardware without much effort.

    Contributions:

    Disco revisited the idea of VMMs to provide an abstraction between the scalable multiprocessor hardware and the different OSes. It made it possible for different kinds of OSes to run in parallel, each on a separate VM. Disco virtualized the memory, the processors, and the I/O devices.

    Disco reduced the overhead of trapping into the VMM on every privileged instruction by letting the guest access frequently used privileged registers with ordinary load and store instructions on special addresses. The VMM code was also replicated on each processor so that each processor has the code locally. To optimize memory usage, Disco performs dynamic page migration and replication as needed: for example, if a page is read frequently by one node, it is replicated so that the node has the page available locally for fast access. If a page needs to be shared, a mapping is maintained between physical pages and machine pages; data structures like pmap and memmap are kept for mapping between virtual, physical, and machine addresses. For I/O devices, intercepted DMA is used to help avoid unnecessary traps. To share persistent data structures, distributed system protocols are used.
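A toy Python sketch of the two access paths described above (all class names, addresses, and register names are invented for illustration, not Disco's actual implementation): a rarely executed privileged instruction traps into the monitor and is emulated against per-VCPU state, while a frequently read privileged register is exposed at a special address that the guest can read with an ordinary, trap-free load.

```python
class VCPU:
    def __init__(self):
        self.state = {"status": 0, "epc": 0}   # privileged register file
        self.trap_count = 0

class ToyMonitor:
    SPECIAL_PAGE = 0xFFFF0000   # hypothetical address for trap-free reads

    def __init__(self, vcpu):
        self.vcpu = vcpu

    def emulate_privileged(self, op, reg, value=None):
        """Slow path: the guest ran a privileged instruction and trapped."""
        self.vcpu.trap_count += 1
        if op == "mtc0":                    # write a privileged register
            self.vcpu.state[reg] = value
        elif op == "mfc0":                  # read a privileged register
            return self.vcpu.state[reg]

    def load_special(self, addr):
        """Fast path: an ordinary load from the special page, no trap."""
        if addr == self.SPECIAL_PAGE:
            return self.vcpu.state["status"]

cpu = VCPU()
mon = ToyMonitor(cpu)
mon.emulate_privileged("mtc0", "status", 0b101)            # one trap taken
assert mon.load_special(ToyMonitor.SPECIAL_PAGE) == 0b101  # read, no trap
assert cpu.trap_count == 1
```

The point of the optimization is visible in the trap counter: repeated reads of the status register never increment it.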

    Evaluation:

    Disco has clearly discussed the problems that arise with using VMMs and has provided solutions for them. A few questions still remain: is the time spent synchronizing between the VMs small enough to compete with message-passing systems? Also, there is a lot of virtualization; is the overall overhead outweighed by the speed of execution on multiple processors? A full implementation and benchmarking on real hardware would help answer these questions.

    Confusions:

    How exactly do loads and stores on special addresses work for privileged instructions?

    Summary :
    The paper describes a virtual machine monitor named Disco, on top of which various commodity operating systems can run without being tailored to the underlying scalable non-uniform memory access multiprocessor architecture. This approach achieves scalability with minimal or no change to the operating systems in the virtual machines, as well as the flexibility to support different kinds of workloads.

    Problem :
    The problem in question is to design the virtual machine monitor such that it effectively virtualizes the CPU, the memory and the I/O devices transparently and employs mechanisms to reduce the overhead of the additional level of indirection that is introduced due to the VMM.

    Contributions :
    1. Uses appropriate data structures to store and retrieve the state of a virtual CPU when scheduling between virtual CPUs. Provides a supervisor mode for the virtual machine operating systems so that they can access certain segments of the address space.
    2. The VMM translates the physical address for a virtual address generated by the OS into the corresponding machine address and inserts it into the TLB. Since the hardware TLB is flushed on each virtual CPU switch, Disco maintains a second-level TLB caching recent virtual-to-machine address translations.
    3. In order to support the NUMA architecture, it replicates read-shared pages across multiple nodes if those nodes access them frequently. Another mechanism is page migration, which moves a page (changing the TLB mapping) to the node that accesses it most. The memmap data structure is used to perform the TLB shootdown on replication or migration.
    4. Enables multiple virtual machines that share a disk to share the machine memory avoiding unnecessary accesses to the disk each time. Uses the copy on write mechanism for writes that are not shared across VMs.
    5. Virtual machines can communicate using distributed protocols like NFS. Enables sharing of files between virtual machines.
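Point 2 above, the extra physical-to-machine level plus the software second-level TLB, can be sketched roughly as follows. This is a simplified Python model with invented names (the guest page table is stubbed out as an identity map); the real structures are per-VM pmaps and hardware TLB entries.

```python
PAGE = 4096

class VirtualMachine:
    def __init__(self, pmap):
        self.pmap = pmap          # physical page number -> machine page number
        self.l2_tlb = {}          # virtual page number  -> machine page number

def guest_translation(vpn):
    # stand-in for the guest OS page table: identity-map virtual to physical
    return vpn

def handle_tlb_miss(vm, vaddr):
    """Monitor's TLB miss handler: produce a virtual->machine translation."""
    vpn, offset = divmod(vaddr, PAGE)
    if vpn in vm.l2_tlb:                      # second-level TLB hit: cheap
        mpn = vm.l2_tlb[vpn]
    else:                                     # miss: consult guest + pmap
        ppn = guest_translation(vpn)
        mpn = vm.pmap[ppn]
        vm.l2_tlb[vpn] = mpn                  # cache the full translation
    return mpn * PAGE + offset                # entry inserted into hw TLB

vm = VirtualMachine(pmap={0: 7, 1: 3})
assert handle_tlb_miss(vm, PAGE + 5) == 3 * PAGE + 5
assert 1 in vm.l2_tlb    # after a VCPU switch flushes the hardware TLB,
                         # this cached entry avoids redoing the full walk
```

Because the second-level TLB caches complete virtual-to-machine translations, a hardware TLB flush on a virtual CPU switch costs only cheap refills rather than full pmap walks.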

    Evaluation :
    SimOS has been used to simulate the hardware, and the performance of IRIX alone is compared with that of IRIX on Disco. The system has been evaluated using 4 different workloads. The performance has been analysed extensively in terms of the time spent in each of the services. The results quantify how the performance of workloads has improved through replication, page migration, memory sharing, etc. As Disco is NUMA-aware, increasing the number of VMs led to improved scalability.

    Confusion :
    The changes to the hardware abstraction level in IRIX were not quite clear. The paper mentions wait-free synchronization using a MIPS instruction pair for data structures that are shared by multiple processors. How does that work?

    Summary: This paper introduces the idea of virtual machines to run multiple commodity OSes on a scalable multiprocessor. The authors insert an additional layer between the hardware and the OS, which acts as a virtual machine monitor so that multiple copies of commodity OSes can run on a single computer. They implement the idea for the FLASH shared-memory multiprocessor. Evaluation shows that the benefits of an OS customized for scalable multiprocessors can be achieved with much smaller implementation effort.

    Problem: It takes a great deal of implementation effort to change an OS to adapt to innovations in scalable shared-memory multiprocessors, which significantly delays software updates. Therefore, system developers must find new ways to deliver software more quickly and with fewer risks of incompatibility and instability.

    Contribution:

    1. They take another route, virtual machine monitors, to run multiple commodity OSes on a single scalable computer. The monitor allows OSes to efficiently cooperate and share resources with each other. The monitor decreases the implementation effort, and the commodity OSes bring compatibility and stability. The structure is (different OSes on virtual machines) -> (virtual machine monitor) -> hardware.

    2. Based on this idea, they implement a prototype monitor called Disco. Disco reduces monitor overheads and enhances resource sharing between virtual machines. It connects different virtual machines via standard distributed system protocols.

    3. Disco makes NUMA machines more approachable through dynamic page migration and replication. Disco moves pages to the nodes where they are being used, so the multiple OSes seem to be running on a single uniform shared memory.

    Evaluation: Disco is evaluated on realistic workloads. The basic overhead of the virtual machine monitor is smaller than 16% for all uniprocessor workloads. With 8 virtual machines, Disco can run some workloads 40% faster than a commercial symmetric multiprocessor OS.

    Confusion: In the 8-virtual-machine part of the evaluation, the workloads seem too short (less than 10 s) to represent real workloads. And I am not sure what would happen with 16, 32, or 64 virtual machines, given the 40% speedup at 8 virtual machines.

    Summary:
    The paper describes the implementation and benefits of Disco, a virtual machine monitor designed as an abstraction between the operating system and hardware, which allows for OSes to run efficiently on shared memory multiprocessors.

    Problem:
    When newer hardware is designed, significant changes must be made to the operating system, which creates a lag between system software and hardware. Apart from development costs, software bugs are more likely to be introduced with these changes, since many core OS modules such as the scheduler may have to be redesigned. The authors make their case by targeting the ccNUMA architecture, which was not being adopted by commodity OSes.

    Contributions:
    1. The VMM was an important abstraction because it virtualized all the machine resources for the OS. Since it managed all the resources, it allowed the flexibility of running multiple OSes on the same multiprocessor. Because the VMM is relatively lightweight in code, the risk of software bugs was significantly reduced.

    2. Disco made NUMA machines more compatible for commodity OSs by using dynamic page migration and replication. Keeping track of “hot” pages, DISCO moves pages to nodes where they are being used, which makes the machine look closer to uniform memory access.

    3. DISCO introduces another level of address translation by keeping its own physical-to-machine map. This allows for a software-based TLB and other mechanisms to manage the hardware TLBs in place.

    Evaluation:
    Disco was tested on SimOS, a machine simulator, with a range of memory- and I/O-intensive workloads. Overall, the measured virtualization overheads ranged from 3% for scientific computing workloads to 16% for the more OS- and I/O-intensive software development and database workloads. This overhead was mostly attributed to the common path to enter and leave the kernel for page faults, system calls, and interrupts.
    Running 8 VMs simultaneously showed that memory sharing across virtual machines led to a significant reduction in memory overheads. The workload scalability of Disco is also shown: execution time for similar workloads on 8 VMs was 60% of what it was on a non-VM system.

    Confusions:
    The authors say only nominal modifications are needed to make OSes run on Disco. That sort of defeats the purpose of a VMM, which should be able to run the OSes as they are.

    Summary:
    The authors believe that efficiency and scalability in multiprocessor environment can be achieved by implementing virtual machine monitors over which multiple OS can be executed. Rather than making modifications to the OS code to make it efficient on scalable multiprocessors, a new layer was introduced between the hardware and the operating system. They demonstrated their approach by developing a prototype monitor called DISCO and evaluated its performance.


    Problem:
    Optimizing commodity operating systems to run on scalable shared-memory multiprocessor hardware would require significant changes to the system software, as well as partitioning the system into scalable units. And even if changes are made to the system software, it will require porting again sooner or later once better hardware arrives.

    Contribution:
    With a smaller implementation effort, the benefits of operating systems customized for scalable multiprocessors can be achieved with the redesigned virtual machine monitor, Disco. Its new design solved many problems faced by older virtual machine monitors; for example, it incorporated efficient mechanisms for resource sharing between virtual machines and for communication via TCP/IP and NFS. Disco exposes a simple hardware interface to the OS running on each virtual machine by virtualizing all the resources of the machine and scheduling them on the actual hardware. The monitor also has global policies to distribute free memory and dynamically schedule processors for efficient load sharing. Commodity operating systems can share memory regions across virtual machine boundaries once a virtual memory segment driver is implemented in the OS. The authors' implementation of Disco as a multithreaded shared-memory program provides emulation of the MIPS processor, dynamic page migration and replication to present a nearly uniform memory access time architecture, and interception and remapping of all I/O commands. The virtual machines can share files via NFS using Disco's virtual subnet, which prevents duplication of data. Disco also uses the copy-on-write mechanism to avoid copying through memory sharing (remapping).


    Evaluation:
    Disco was executed on SimOS, a machine simulator for MIPS hardware, which could run the monitor and the IRIX operating system. Using a simulator resulted in a lot of overhead, hence the authors resorted to using only smaller workloads: software development (parallel compilation of a chess application), hardware development (a Verilog simulator), scientific computing (rendering of models and sorting integers), and a commercial database (running decision-support queries). The authors find that trap emulation and TLB reload misses led to an execution overhead of 3-16%.

    Confusions:
    I was not able to clearly understand the working of TLB miss handler emulated by the Disco.

    Summary

    Recent developments in large-scale multiprocessors offer major performance advantages, but existing operating systems require significant development to use them to their full potential. This paper proposes Disco, an abstraction layer that sits on top of the hardware, allowing multiple existing operating systems to run in virtual machines with little modification.

    Problem

    Updating existing operating systems to accommodate new hardware requires a massive development effort. Scalable shared memory multiprocessors in particular require major fundamental changes. With hardware generally evolving much faster than software, it's unlikely operating systems will be able to keep up with these demands. One possible solution is to use virtual machines to abstract the hardware layer, but these have typically been plagued by performance overheads, resource management and communication problems.

    Contributions

    The major contribution of this paper is the approach of adding a virtualization layer between the hardware and any number of commodity operating systems. These operating systems all run in virtual machines, which use a variety of abstractions to provide efficient and fast access to hardware resources. These include virtual physical memory (physical-to-machine address mappings), NUMA memory management via transparent page replication, intercepting all device accesses to virtualize I/O, and transparently sharing pages between virtual machines. Disco allows many operating systems to efficiently share complex system resources, and because it is only 13,000 lines of code, it can easily be adapted to meet future demands.

    Evaluation

    Disco was designed for a machine that was not yet available, so instead the authors use the SimOS machine simulator configured to resemble a large-scale multiprocessor. The execution overheads turned out to be fairly low, anywhere between 3% and 16% depending on the workload. TLB misses and I/O operations accounted for a large part of the overhead. Disco performs particularly well on workloads where it can enhance memory locality, achieving large gains on the Engineering (33%) and Raytrace (38%) workloads. Overall, these evaluations show it to be of comparable performance and better scalability than other systems -- but we have to consider that these tests were run on a simulator, and real hardware could vary.

    Confusions

    The low level technical implementations were difficult to understand. I was particularly confused about specifics on how Disco virtualizes physical memory and the problems with TLB misses.

    1. summary
    The paper introduces Disco, a prototype system which can run multiple operating systems on a scalable multiprocessor.

    2. Problem
    Scalable computers are configured with tens or even hundreds of processors. Extensive modifications to the operating system are required to efficiently support these machines, and the size and complexity of modern operating systems make those modifications resource-intensive. The Disco approach inserts an additional layer of software between the hardware and the operating system to avoid this extensive modification.

    3. Contributions
    (1) The Disco system, based on virtual machine monitors, provides a flexible system software solution for multiprocessor machines. Each virtual machine is allocated resources such as processors and memory and uses standard distributed protocols to communicate. The monitor uses global policies to manage the resources and achieve load balancing. The implementation effort is very small. When an application's resource needs exceed those of a single virtual machine, memory regions can be explicitly shared across VM boundaries, or a specialized OS can be run alongside. The approach also handles scalability and fault containment and addresses the NUMA memory management issues. It also allows the co-residence of multiple versions of a system.
    (2) Disco adds a level of address translation and maintains physical-to-machine address mappings to virtualize physical memory. The TLB is used to do the translation, and a second-level software TLB is used to reduce TLB misses. Disco uses dynamic page migration and replication to export a nearly uniform memory access time architecture. It exports special abstractions for the SCSI disk and network devices. The copy-on-write mechanism makes multiple virtual machines accessing a shared disk end up sharing machine memory. The networking interfaces likewise use copy-on-write mappings to allow memory sharing.

    4. Evaluation
    The paper gives simulation-based experimental results and studies 4 typical workloads. The performance overhead of virtualization ranges from 3% to 16%, depending on workload properties such as the TLB miss rate. The memory overhead is largely limited by effective sharing. Data on scalability is also given. The system achieves significant performance improvements by using page migration and replication, and it hides part of the NUMA memory latencies from the kernel.

    5. Confusion
    My questions are mainly about the evaluation. The authors say "using realistic but short workloads, we ...". From the table, most of the execution times are less than 10 s. Wouldn't that be too small for a systems experiment if you consider startup overhead, variance, etc.? And I am not sure whether using 8 virtual machines to reduce the execution time by 40% is a good sign of scalability.

    1. Summary
    The paper proposes a virtual machine monitor (VMM) based design that extends and enables multiple commodity operating systems to run efficiently on a large-scale shared memory multiprocessor system.
    2. Problem
    Even though building a scalable operating system can be more efficient than using virtual machines, it entails other issues: huge development cost, time to market, code stability, application compatibility, and portability. VMMs reduce the code changes to an operating system by separating its interaction with hardware through an additional software layer. But traditional virtual machines add overhead for processor and memory management. Disco attempts to minimize these overheads and enables cooperation and sharing among virtual machines.
    3. Contributions
    Disco uses multiple techniques to minimize the overheads of VMMs. The Disco code segment is replicated across all memories and data structures are partitioned to reduce remote accesses. It uses shared memory except for a few operations that require IPI.
    The virtual CPU is directly executed on the hardware. A few privileged operations are trapped and emulated by the VMM. The physical-to-machine translations are stored in the pmap structure of a VM and are cached in the TLB. A second-level TLB is used to reduce TLB misses. Disco uses a cache-miss-count-based policy for page migration and replication to reduce remote accesses on a NUMA system. A per-page memmap data structure is used to account for the usage of each page.
    I/O devices are handled by special Disco device drivers. Disco uses copy-on-write disks. DMA requests are intercepted and processed by changing the mapping for a read-only page, thus enabling multiple VMs to share disk space. Multiple VMs communicate using standard distributed protocols and use copy-on-write mechanisms to enable buffer sharing.
    4. Evaluation
    The evaluation is done on the SimOS R10000 machine simulator, which limited execution to only short benchmarks. Several parameters are evaluated and compared, including execution and memory overheads, scalability, and page migration policies. Some interesting observations are: high overheads due to the single-entry-point trap and emulation, reduced memory overheads because of memory sharing, and reduced synchronization overheads when the problem is distributed among multiple VMs. An overall 6-8% slowdown is observed, but considering that the applications are small and many other optimizations can still be done, I agree with the authors that the overheads are tolerable.
    5. Confusion
    I did not understand the disco device driver structure and the role of monitor call. Is this driver still emulating multiple accesses to a device or does it provide exclusive access to a VM?

    1. Summary
    This paper introduces how a virtual machine monitor can be used to bridge the gap between innovative hardware and the software modifications needed to accommodate it. It further illustrates the mechanisms, implementation, and evaluation of the Disco system, which is used to run commodity operating systems on scalable multiprocessors without too much overhead.

    2. Problems
    There are two major problems in running commodity operating systems. The first is that it is getting more and more difficult to extend modern operating systems to run efficiently on large-scale shared-memory multiprocessors as hardware develops; significant OS changes are required to achieve efficient designs. Virtual machine monitors seem to be the solution to this issue; however, the second problem arises because the virtual machine adds an extra layer between the OS and the hardware, so some loss of performance is inevitable. Thus, how to reduce the overhead of virtual machines is a critical problem in providing system software for scalable multiprocessors. Specific sources of overhead include additional exception processing, instruction execution, and memory for virtualization; resource management; and communication and sharing.

    3. Contributions
    Generally, the contribution of Disco is to keep the overhead of virtualization within a small range while requiring only simple modification of modern commercial operating systems. It uses global policies to manage all the resources of the machine and dynamically schedules virtual processors on the physical processors to balance workload across the machine. Since the monitor is a relatively simple piece of code, this flexible approach allows relatively simple modification of the system and simpler support for specialized operating systems.
    - Disco extends the architecture to support efficient access to some processor functions through direct execution, and provides an abstraction of main memory that adds a level of address translation and maintains physical-to-machine address mappings.
    - Implements dynamic page migration and page replication to reduce the cost of a virtual CPU's remote cache misses.
    - Virtualizes I/O devices for each virtual machine and controls access properly.
    - A copy-on-write disk mechanism, which serves a read of a disk block that is already in main memory without going to disk.
    - A virtual subnet used to allow virtual machines to communicate with each other.
    - Modifications for commodity OSes on the MIPS architecture: device drivers, the HAL (hardware abstraction level), and other related changes to IRIX.
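The dynamic page migration and replication mechanism can be sketched as below. The threshold and record layout are assumptions for illustration, loosely modeled on per-page miss counters like those in memmap: a page that is hot on a single node (or is writable) is migrated there, while a read-shared page that is hot on several nodes is replicated instead.

```python
from collections import Counter

MIGRATE_THRESHOLD = 4       # invented value; a real policy would tune this

class PageRecord:           # one entry of a memmap-like structure
    def __init__(self):
        self.miss_counts = Counter()   # node id -> remote cache misses
        self.writable = True
        self.home = 0
        self.replicas = {0}            # nodes currently holding a copy

def on_remote_miss(rec, node):
    """Called when `node` takes a remote cache miss on this page."""
    rec.miss_counts[node] += 1
    if rec.miss_counts[node] < MIGRATE_THRESHOLD:
        return "none"
    hot = [n for n, c in rec.miss_counts.items() if c >= MIGRATE_THRESHOLD]
    if rec.writable or len(hot) == 1:
        rec.home, rec.replicas = node, {node}    # migrate (TLB shootdown)
        return "migrate"
    rec.replicas.update(hot)                     # replicate read-shared page
    return "replicate"

rec = PageRecord()
rec.writable = False
for _ in range(4):
    action = on_remote_miss(rec, node=1)
assert action == "migrate" and rec.home == 1     # only node 1 is hot
for _ in range(4):
    action = on_remote_miss(rec, node=2)
assert action == "replicate" and rec.replicas == {1, 2}
```

Migration is the only option for writable pages (two writable copies would be incoherent), which is why replication is reserved for read-shared pages.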

    4. Evaluation
    This article provides evaluation sections to illustrate the results of the work. Four typical workloads are chosen: perhaps not sufficient, but representative. Each workload runs both on IRIX directly and on IRIX over Disco. The performance of each kernel service is compared separately in order to find the sources of overhead, and the slowdown is categorized into five parts.

    The evaluation covers execution, memory, scalability, and dynamic page migration and replication. The results show that the overhead of virtualization is modest; that the reduction in virtual memory faults and operating system work reduces system overhead substantially; that effective sharing of kernel text and the buffer cache limits the memory overheads; that partitioning the system into different virtual machines significantly improves scalability; and that page migration and replication enhance memory locality and therefore provide benefits.

    In my opinion, although the variety of workloads is not very wide, the evaluation is still persuasive. It would have been better if all the workloads could run on FLASH directly instead of on a similarly configured SimOS.

    5. Confusion
    Why is passing hints to the monitor so useful in making better resource management decisions, and what is a "hint"?

    What is a “remote cache miss”?

    Summary
    This paper discusses design and implementation of prototype Disco, a Virtual Machine Monitor (VMM) that adds a layer of abstraction between the operating system and hardware to achieve most of the benefits of operating systems for scalable shared memory multiprocessors without a lot of implementation effort.

    Problem
    The authors address the problem faced by computer vendors attempting to provide system software for their innovative hardware. With every new piece of hardware on the market, significant operating system changes are required to extend support and deal with issues such as non-uniform memory access time. These changes can incur heavy development costs and open a sizable gap between software and hardware releases. Moreover, frequent, late, incompatible changes to system software are prone to bugs and scalability problems.

    Contributions
    Rather than attempting to modify existing OSes, the authors' solution to the above problem is to add a level of abstraction between commodity operating systems and the raw hardware, which they call the VMM, and to run multiple OSes on different virtual machines, resulting in greater scalability and fault containment. The monitor allows these OSes to efficiently cooperate and share resources with each other through copy-on-write disks and a global buffer cache. Dynamic page migration and replication, which export a nearly uniform memory architecture to the software, were important for supporting OSes not developed for NUMA machines and can be considered another major contribution. By exporting multiple virtual machines, the Disco prototype can have multiple different OSes running on it simultaneously, providing a stable platform for legacy applications. A second-level TLB makes the virtual machines appear to have a much larger TLB.

    Evaluation
    The paper describes performance testing of the Disco system by running a set of workloads and comparing the results with a commodity operating system. The execution overheads ranged from 3% to 16%, mainly due to Disco's trap emulation of TLB reload misses. The authors also present the benefits of resource sharing by splitting a workload across multiple virtual machines. Results also show that performance increases with the page migration and replication policy.

    Confusion
    What is a virtual footprint? And how do a physical page and a machine page differ?

    1. Summary
    This paper describes Disco, a virtual machine monitor that allows multiple operating systems to run simultaneously on a shared-memory multiprocessor. The authors achieve this by adding an extra layer between the operating system and the hardware that provides the basic minimum abstractions required for multiplexing operating systems.

    2. Problem
    The problem that the authors consider is reducing the costs of keeping systems software updated according to the rapid changes in the hardware. Changing the software frequently is costly and it leaves the software susceptible to bugs.

    3. Contributions
    The paper proposes using virtual machine monitors to solve the problem. The authors add a software layer between the hardware and the operating system. The monitor, being a relatively small piece of code, is easily developed, and it provides a conventional hardware view to the running operating systems; thus very few changes are required in the operating system itself. The virtual machine monitor described in this paper provides abstractions for the processor, physical memory, I/O devices, disks, and network interfaces. It maintains address translations from the physical memory of a guest operating system to actual machine memory. All this is done while keeping the monitor's own memory and time overheads low. The Disco virtual machine monitor uses additional data structures like pmap and memmap to achieve the desired speedup.

    4. Evaluation
    The authors have evaluated the system extensively on SimOS, a machine simulator. They evaluate Disco using distinct targeted workloads to test the system's execution and memory overhead and its scalability. The authors demonstrate impressive performance of Disco on their experimental workloads. They also suggest that performance on actual hardware will be close to their simulation results. However, the authors do not mention any experiments measuring the performance of I/O devices, disks, or networking.

    5. Confusion
    The way disks are handled and the differences between persistent and non-persistent disks are not very clear.

    Summary:
    The paper introduces DISCO, a multithreaded shared-memory program that adds a software layer between the hardware and the operating system, virtualizing resources so that multiple virtual machines (running different OSes) can run on the same multiprocessor. The idea is to promote scalability and flexibility in multiprocessor systems. Evaluation showed that DISCO could run some workloads 1.7 times faster with 8 VMs, with a virtualization overhead of 3-16% depending on the workload.

    Problem:
    1. It was hard to extend system software to scale well with the new hardware developments (significant OS code changes, partitioning the system into scalable units, single system image across the units, fault containment, cache coherence etc.)
    2. Support multiple OS on the same hardware (and thus achieve scalability).
    3. Run/Support commodity OSes on ccNUMA processors.

    Contributions:
    1. Addition of a software layer between the OS and the hardware which manages all the resources, so that multiple virtual machines can coexist on the same multiprocessor.
    2. Scalability: explicit memory sharing between processes running on different VMs through a virtual segment driver.
    3. Flexibility: support for multiple operating systems.
    4. Fault containment: failures/bugs in the system software of one VM do not affect the entire machine.
    5. Page placement and dynamic page migration & replication provide a more conventional view of memory, hiding the NUMA-ness of the system and thus extending support to non-ccNUMA OSes (commodity OSes like Windows NT).
    6. A special network interface to handle large file sizes without fragmentation.
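The special network interface mentioned above avoids copying large transfers by remapping pages between virtual machines. A toy Python sketch (all names hypothetical) of delivering a page over the virtual subnet by mapping the sender's machine page read-only into the receiver:

```python
class Page:
    def __init__(self, data):
        self.data = data
        self.refs = 1            # number of VMs mapping this machine page

class VM:
    def __init__(self):
        self.pages = {}          # virtual address -> machine Page
        self.perms = {}          # virtual address -> "rw" or "ro"

class VirtualSubnet:
    def send(self, sender, receiver, addr):
        """Deliver a page-sized message by remapping, not copying."""
        page = sender.pages[addr]
        page.refs += 1
        sender.perms[addr] = "ro"       # copy-on-write protect the sender
        receiver.pages[addr] = page     # receiver maps the same machine page
        receiver.perms[addr] = "ro"
        return page

net, a, b = VirtualSubnet(), VM(), VM()
a.pages[0x1000], a.perms[0x1000] = Page(b"nfs file data"), "rw"
shared = net.send(a, b, 0x1000)
assert b.pages[0x1000] is a.pages[0x1000]   # one machine page, zero copies
assert shared.refs == 2 and a.perms[0x1000] == "ro"
```

Marking both mappings read-only means a later write by either side would take a copy-on-write fault, preserving the illusion that the data was copied.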

    Evaluation:
    An extensive evaluation has been carried out. Four different types of workloads (pmake, engineering, scientific computing, and database) have been considered, covering different domains: multiprogrammed short-lived processes (OS- and I/O-intensive), long-running OS-nonintensive jobs, and single shared-memory parallel applications. Execution time overheads range between 3% and 16% in Disco, due to Disco's trap emulation and TLB reload misses. Effective sharing of kernel text and the buffer cache lowers the memory overhead. Partitioning has been shown to reduce overhead and help in scaling the system. For evaluating dynamic page migration and replication, engineering workloads with poor memory system behavior were considered; significant performance gains are attributed to improved memory locality. A comparison with UMA IRIX shows the degree to which Disco has been able to hide its non-UMA architecture.

    Confusion:
    Copy-on-write section is not completely clear.

    Summary :
    This paper introduces the idea of Virtual Machine Monitors to run multiple commodity operating systems as virtual machines on a scalable multiprocessor. This approach helps in dealing with the nonuniform memory access time of the systems and reduces the memory overhead associated with running multiple operating systems.

    Problem :
    With continuously changing hardware, operating systems have to continuously adapt to match the performance that the hardware provides. This means changing the OS every time and this could be hard in terms of cost and complexity. Disco solves this problem by adding a new layer between the hardware and the operating system, which manages resources for multiple OSes to coexist.

    Contributions :
    1. Inserting the Virtual Machine Monitor between the Operating System and the hardware which virtualizes all the resources of the machine, exposing a new conventional hardware interface to the operating system.
    2. Enabling the VMM to execute multiple commodity operating systems on a CC-NUMA multiprocessor
    3. The technique of replication and partitioning of system-wide memory data structures for locality in individual nodes and employing wait-free synchronization widely.
    4. Implementing a three layer memory addressing hierarchy where the individual operating systems do the virtual to physical address translation and the VMM does another physical to machine address translation. For maintaining the TLB entries, VMM has a per virtual machine data structure, “pmap”. A software-level TLB is also implemented for performance reasons.
    5. The disks are implemented as copy-on-write. When multiple virtual machines want to access a single page, it is mapped read-only for them. Any write to this page results in a copy-on-write fault, which the VMM handles internally.
    6. Updating the HAL to pass on more hints to the VMM about the resource utilization of each of the virtual machines which results in the monitor making better resource management decisions.
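    The three-level addressing in point 4 (guest virtual to physical via the OS, physical to machine via the pmap, with a software second-level TLB in front) can be sketched roughly as follows; all names, the dict-based page tables, and the structure are invented for illustration, not Disco's actual data layout:

```python
# Hypothetical sketch of Disco's two-level address translation (names invented).
# Each VM's "pmap" maps guest-physical pages to machine pages; a software
# second-level TLB caches recent virtual-to-machine translations so that a
# hardware TLB miss does not always require a full two-level walk.

class VirtualMachine:
    def __init__(self, pmap):
        self.pmap = pmap          # guest-physical page -> machine page
        self.l2_tlb = {}          # virtual page -> machine page (software TLB)

    def translate(self, vpage, guest_page_table):
        """Translate a guest virtual page to a machine page."""
        # Fast path: hit in the software second-level TLB.
        if vpage in self.l2_tlb:
            return self.l2_tlb[vpage]
        # Slow path: guest translation (virtual -> physical), then the
        # monitor's pmap (physical -> machine).
        ppage = guest_page_table[vpage]
        mpage = self.pmap[ppage]
        self.l2_tlb[vpage] = mpage    # cache for future misses
        return mpage

vm = VirtualMachine(pmap={7: 42})
assert vm.translate(3, {3: 7}) == 42      # slow path fills the cache
assert vm.translate(3, {}) == 42          # fast path: no guest walk needed
```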

    Evaluations :
    The paper presents the overheads of running on Disco. This was evaluated by running IRIX once on hardware directly and once by running it over the monitor. The overhead of virtualization seems to vary based on workloads, being 3% for Raytrace and 16% for Pmake and Database workloads. The main overhead was because of the TLB misses resulting in trapping into the monitor a lot. Disco’s memory sharing techniques perform better than existing systems in that they provide around 38% performance boost by implementing dynamic page migration and replication.

    What I found confusing :
    Copy on write disks is not very clear to me.

    Summary:
    This paper discusses the use of virtual machine monitors in abstracting hardware from operating systems by managing memory, processors and fault containment for scalable memory-shared multiprocessors system. It also discusses the disadvantages of this abstraction like overheads in extra processing for managing resources, and ways to reduce those overheads by using interoperating capabilities of modern operating systems in distributed environment. Finally, the evaluation of this idea is shown by running a prototype system called Disco.

    Problem:
    The problem is to provide system software for scalable multiprocessors, such that the gap between hardware innovations and adaptation of system software does not hinder use of existing operating systems on scalable shared-memory multiprocessors. This has to be done without making a massive development effort for building a system from scratch.

    Solution:
    The idea is to abstract the hardware from the operating system by using virtual machine monitors (with some modifications). The virtual machine monitor handles the following:
    - Moving memory between virtual machines to avoid paging
    - Dynamic scheduling of virtual processors on physical processors
    - Allowing processes running on different virtual machines to share memory, in case a process exceeds the resource usage limits of the commodity OS
    - Hardware and system fault containment


    Evaluation:
    The paper shows a good comparative analysis for hardware abstraction using the prototype Disco implementation. The overhead in terms of execution time for applications turns out max to be 16%. One notable thing in the graph was the cut-down on time spent in kernel of OS, when Disco is running.

    Learning/Confusions:
    The rate at which the hardware is evolving, how far into the future this model keeps on scaling is important to consider. Specially for network-intensive features running in the OS.

    1. Summary
    This paper presents a novel approach to developing system software for highly scalable shared-memory multiprocessors: using a virtual machine monitor to present a conventional machine interface and run multiple commodity or specialized OSes. The approach is demonstrated using Disco, which is empirically evaluated using simulations and early hardware. The results suggest that Disco offers performance comparable to a native OS, with low virtualization overheads and greater potential for scalability.

    2. Problem
    The introduction of scalable shared-memory multiprocessors with new features such as NUMA memory represents a major challenge for system software engineering. Conventional OSes require vast modifications to efficiently exploit these scalable machines, which even acts as a brake on hardware innovation. This paper suggests adding a level of indirection between the native HW and the OS, using the idea of the virtual machine monitor. It claims that most machine specifics can be taken care of by the VMM, which can then present a conventional interface to commodity OSes.

    3. Contributions
    The idea of multiple VMs forming a partial single-system image through distributed-system protocols is very innovative. Disco provides a largely native HW interface to the guest OS, except for certain privileged operations such as enabling interrupts or accessing privileged registers. Guest OSes run in a less privileged mode than Disco, so privileged actions trap into Disco, which then emulates them. For virtualizing memory, Disco adds an extra level of translation between guest physical memory and machine memory. This increases TLB miss handling time, which is partially alleviated by a large second-level software TLB. Disco uses page migration and replication to hide the NUMA nature of the underlying HW and ensure that most cache misses are serviced from local memory. Disco intercepts accesses to all I/O devices and forwards them to the physical devices. This provides the opportunity to share disk and memory resources among virtual machines. Virtual machines can communicate with each other through standard protocols such as NFS. Thanks to these features, Disco can run commodity operating systems with minimal changes to their source code.
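    The trap-and-emulate flow for privileged operations might look roughly like this sketch; the operation names and vCPU fields are hypothetical, not Disco's actual interface:

```python
# Minimal trap-and-emulate sketch (all names invented). The guest OS runs in
# a less-privileged mode; privileged operations trap into the monitor, which
# updates the vCPU's saved privileged state instead of the real hardware.

class VCPU:
    def __init__(self):
        self.privileged_regs = {"status": 0}
        self.interrupts_enabled = False

def monitor_trap_handler(vcpu, op, *args):
    """Emulate a privileged operation on behalf of the guest."""
    if op == "enable_interrupts":
        vcpu.interrupts_enabled = True
    elif op == "write_privileged_reg":
        reg, value = args
        vcpu.privileged_regs[reg] = value
    else:
        raise ValueError(f"unhandled privileged op: {op}")

vcpu = VCPU()
monitor_trap_handler(vcpu, "enable_interrupts")
monitor_trap_handler(vcpu, "write_privileged_reg", "status", 0x1)
assert vcpu.interrupts_enabled
assert vcpu.privileged_regs["status"] == 0x1
```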

    4. Evaluation
    The overheads of virtualization are empirically evaluated through detailed simulations of four workloads. The results suggest that virtualization adds between 3-16% of overhead for uniprocessor workloads. This overhead reduces with larger page sizes, which increase TLB reach. Disco’s memory sharing techniques significantly reduce the memory footprint for running multiple VMs. Page migration and replication techniques seem to effectively hide the NUMA architecture and provide a 33-38% performance improvement over a commodity OS running on native HW.

    5. Confusions
    How is an OS UMA- or NUMA-aware? The concept of copy-on-write disks with temporary and persistent disks is not clear. During the evaluation, they first boot IRIX and then ‘jump’ into Disco. How does this work?

    1. summary
    This 1997 Stanford paper presents Disco: a virtual machine monitor prototype. This approach uses monitors to create virtual machines on which existing Operating Systems can run with only minor overhead.
    2. Problem
    Hardware - specifically the architecture of scalable shared-memory multiprocessors - is evolving quickly. New CC-NUMA architectures are increasingly commonplace, yet difficult to optimize for. Complicated, monolithic operating systems require modifications for each new type of architecture. In order to guarantee reliability, these changes mean software typically lags behind hardware by a significant time period; hardware evolution is hindered as a result. One possible solution presented in the paper is the virtual machine monitor. However, virtual machines have typically had several issues, notably high overhead along with difficulties in resource management and communication.
    3. Contributions
    The main contribution of the paper is the concept of adding a lightweight, “virtual machine monitor” (or VMM) layer between hardware and existing, commodity operating systems. This monitor essentially virtualizes the hardware, and allows for the multiplexing of virtual machines (and virtual CPUs) across multiple processors. The VMM maintains any necessary data structures to make this virtualization possible, such as TLB contents and privileged registers for each virtual machine. Interrupts are handled globally by the VMM rather than a specific operating system kernel. Virtual processors are scheduled on physical processors in a round-robin fashion, and these virtual machines run through limited direct execution. This allows each operating system to run its existing code (and as a result all legacy operations) with only minor modifications.
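    The round-robin scheduling of virtual CPUs with per-VM saved register state described above might be sketched like this; all names and the dict-based register representation are hypothetical:

```python
# Round-robin vCPU scheduling sketch (names invented). Each vCPU's register
# state lives in a per-VM structure; the monitor saves the outgoing vCPU's
# state and restores the incoming one before resuming direct execution on
# the physical processor.

from collections import deque

class VCPU:
    def __init__(self, name):
        self.name = name
        self.saved_regs = {}      # register state while descheduled

class PhysicalCPU:
    def __init__(self):
        self.run_queue = deque()
        self.current = None

    def schedule_next(self, hw_regs):
        """Round-robin: save the current vCPU, restore the next one."""
        if self.current is not None:
            self.current.saved_regs = dict(hw_regs)   # save outgoing state
            self.run_queue.append(self.current)
        self.current = self.run_queue.popleft()
        hw_regs.clear()
        hw_regs.update(self.current.saved_regs)       # restore incoming state
        return self.current

cpu = PhysicalCPU()
a, b = VCPU("a"), VCPU("b")
a.saved_regs = {"pc": 100}
b.saved_regs = {"pc": 200}
cpu.run_queue.extend([a, b])
regs = {}
assert cpu.schedule_next(regs).name == "a" and regs["pc"] == 100
regs["pc"] = 104                                   # vCPU a makes progress
assert cpu.schedule_next(regs).name == "b" and regs["pc"] == 200
assert cpu.schedule_next(regs).name == "a" and regs["pc"] == 104
```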

    A second major contribution was the set of specific enhancements for cc-NUMA architectures. First, Disco itself was designed to operate efficiently on such machines: the code is replicated in all processor memories, few locks are used, and processor-specific structures are stored on the respective processor. Additionally, enhancements are made to allow UMA operating systems to operate efficiently on a non-uniform memory access machine. This is done through page migration or replication. Hardware cache misses are recorded on a per-processor basis to detect “hot pages”. These hot pages can be migrated or replicated to the node of the processor accessing the memory, and the TLB entry is simply updated.
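    A rough sketch of the hot-page decision described above; the threshold, the counter representation, and the exact policy are invented for illustration, not taken from the paper:

```python
# Hypothetical hot-page policy sketch. The monitor samples per-processor
# cache-miss counters; a page one remote node misses on heavily is migrated,
# a page several nodes read heavily is replicated (only if read-only),
# and TLB entries are then updated to point at the local copy.

MIGRATE_THRESHOLD = 100   # invented threshold

def handle_hot_page(page, miss_counts, writable):
    """Decide what to do with a frequently missed page.

    miss_counts: dict mapping node id -> remote cache misses on this page.
    Returns 'migrate', 'replicate', or 'leave'.
    """
    hot_nodes = [n for n, c in miss_counts.items() if c >= MIGRATE_THRESHOLD]
    if not hot_nodes:
        return "leave"
    if len(hot_nodes) == 1:
        return "migrate"            # move the page to the single hot node
    # Several nodes hammer the page: replicate only if it is read-only.
    return "replicate" if not writable else "leave"

assert handle_hot_page(0, {1: 150}, writable=True) == "migrate"
assert handle_hot_page(0, {1: 150, 2: 200}, writable=False) == "replicate"
assert handle_hot_page(0, {1: 5}, writable=False) == "leave"
```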
    4. Evaluation
    The authors provide a variety of tests to show that their virtual machine approach provides similar results to existing, native operating system execution. They show that on uniprocessor workloads, Disco virtualization overhead ranges from 3-16%. This is due primarily to the overhead of trapping into the monitor. TLB misses are particularly costly. When compared to a non-NUMA aware kernel (in this case IRIX5.3) Disco performs favorably. It is found that partitioning a problem into only two virtual machines causes a net positive effect when compared to overhead. Finally, the memory management features perform well; applications which have data locality issues execute around 35% faster than IRIX. In all cases, the authors have shown that the Disco system performs comparably to existing solutions, and can scale efficiently.
    5. Confusion
    The copy-on-write disks are somewhat confusing in this paper. I understand the concept, but am confused by the sentence “Disco logs the modified sectors so that the copy-on-write disk is never actually modified.” What does this mean?
    Why does a message sent from one VM need to be mapped into its own address space? Can’t it forget about that message?

    1. Summary
    In the paper "Disco: running commodity operating systems on scalable multiprocessors", the authors revisit the idea of using virtual machines as a solution to the problem of developing system software for scalable shared-memory multiprocessors. They propose solutions to the challenges faced by earlier virtual machines (memory overhead, execution overhead, resource management, and I/O communication and sharing) and demonstrate them with a prototype, Disco, running multiple copies of the non-NUMA-aware IRIX operating system on the FLASH CC-NUMA multiprocessor.

    2. Problem
    To adapt to hardware innovation, the million-line code base of system software needs to be modified, which involves a huge development cost. Also, these changes might impact standard modules such as virtual memory management and scheduling. At the same time, the system software should provide backward compatibility to run legacy applications on older versions until the application vendor provides an upgrade. The authors use virtual machine monitors to solve the problem of flexibly supporting a wide variety of operating systems, including commodity operating systems that were not developed for the ccNUMA architecture.

    3. Contributions
    - Abstracting NUMA memory as UMA memory, allowing non-NUMA-aware OSes to run on NUMA machines
    - Replicating the 13,000 lines of Disco code across all memories of FLASH to serve cache misses locally, thereby improving NUMA locality
    - Using dynamic page migration and replication to effectively manage the non-uniform memory of the FLASH machine and achieve performance close to that of a uniform memory access time architecture
    - Fault containment, i.e., failure of one VM does not affect another
    - Emulating execution on a virtual CPU using direct execution on the real CPU, while maintaining the privileged registers, TLB contents, and other state of each virtual CPU
    - Since TLB entries are flushed on every virtual CPU context switch, TLB misses become more expensive. To reduce the overhead, Disco caches recent virtual-to-machine mappings in a second-level software TLB
    - Providing copy-on-write disks shared by multiple virtual machines (but disk writes issued by the individual OSes are private). I guess this implies the state of a VM is lost when it is shut down. However, only a single VM can have a persistent disk mounted at a time
    - Sharing memory between the VMs using the NFS protocol (Disco ensures the client and server don't both have the data duplicated in main memory)
    - Handling execution overhead by converting frequently used privileged instructions into loads and stores to a special page in the HAL of IRIX
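    The copy-on-write disk behavior in the list above (a private log of modified sectors, with the shared base disk never actually modified) can be sketched as follows; the classes and dict-based storage are invented for illustration:

```python
# Sketch of a copy-on-write virtual disk (structure assumed from the paper's
# description, details invented). Reads fall through to the shared base
# disk; writes go to a private log of modified sectors, so the shared disk
# itself is never changed.

class CowDisk:
    def __init__(self, base_disk):
        self.base = base_disk     # shared, read-only sector -> bytes
        self.log = {}             # this VM's private modified sectors

    def read(self, sector):
        # The private copy wins if this VM has written the sector.
        return self.log.get(sector, self.base[sector])

    def write(self, sector, data):
        self.log[sector] = data   # base disk stays untouched

base = {0: b"shared"}
vm_a, vm_b = CowDisk(base), CowDisk(base)
vm_a.write(0, b"private")
assert vm_a.read(0) == b"private"   # the writer sees its own copy
assert vm_b.read(0) == b"shared"    # other VMs still share the base
```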

    4. Evaluation
    - The evaluation was not run on the targeted FLASH machine (as it was not available); instead, SimOS was configured to resemble FLASH's characteristics. Also, the experiments were carried out only on short-running workloads, so they are not conclusive proof of Disco's performance.
    - Running multiple copies of a single IRIX OS and benchmarking Disco's performance cannot be used to generalize Disco's performance (even for running multiple non-NUMA-aware operating systems).
    - The dynamic page migration and replication policy of the monitor shows how close Disco comes to UMA performance.
    - Disco flushes TLB entries when scheduling another virtual machine on a context switch, which can hamper performance since an application starts with TLB misses whenever it is scheduled.

    5. Confusions
    - Passing hints to the monitor to help it make better resource management decisions is not very clear to me
    - Under scalability, I don't understand how increasing the number of virtual machines from 2 to 8 reduces the kernel execution time by 60% (since the time spent stalling is due to contention for semaphores)

    Summary - This paper proposes a solution to run commodity operating systems efficiently on modern hardware like scalable multiprocessors with minimal OS changes, through virtual machine monitors. The authors describe their prototype VMM, Disco, which is a software layer between the OS and the hardware allowing multiple OSes to run simultaneously on scalable hardware, while offering minimal virtualization overheads.

    Problem - The authors claim that modern operating systems are complex and inflexible. As a result, significant development costs are incurred to extend them to support scalable machines or tackle issues such as non-uniform memory access times which are associated with innovative hardware. This cost of modifying commodity OSes for running on newer hardware is thus hurting the adoption of such machines.

    Contributions - The main contribution of this paper seems to be rediscovering the benefits of an old concept - virtual machine monitors - for solving a modern problem. Virtual machine monitors improve the performance of system software on modern hardware by adding a simple, easily extensible software layer between the hardware and operating system. Thus, applications running on non-NUMA-aware OSes can benefit from a NUMA-aware monitor on NUMA systems. The paper discusses techniques like page migration and replication as means of providing NUMA awareness in the hypervisor. Also, various techniques are proposed to minimize virtualization overheads, like resource sharing between the virtual machines sharing the hardware.

    Evaluation - The authors demonstrate the performance of Disco, by running workloads from varied domains such as scientific computing and databases on simulating Flash-like hardware on SimOS, a machine simulator. Virtualization overheads are evaluated by running these workloads on uniprocessors with and without the VMM layer, and result in about 3-16% increase in the execution times. These are mostly a result of slow trap handling and other system services. The importance of resource sharing to limit memory overheads is also shown, by running multiple instances of a workload on different VMs, resulting in about 50% memory reduction for the case of similar workloads running on 8 VMs simultaneously. The improved scalability offered by virtual machines and the benefits of page migration and replication are also evaluated suitably. Overall, the evaluation aspect of this paper is highly satisfactory, barring the inability to show results for long-running workloads and the absence of I/O performance metrics.

    Confusions - Why were VMMs popular in the 1970s? Also, I couldn't understand this statement in section 5.2 - 'IRIX 5.3 uses the same TLB-wired entry for different purposes in user mode and in the kernel'.

    Summary:
    This paper describes the design and implementation of a type-1 VMM, Disco, which helps run multiple modern operating systems efficiently on large-scale shared-memory multiprocessors. Disco targets the ccNUMA architecture and tries to reduce the cost and time of building new operating systems for these new hardware advances.

    Problem:
    1. Commodity OSes at that time were not well suited to the ccNUMA architecture. Developing a new operating system was costly, time consuming, and prone to errors because of the large code base. The authors try to solve the problem of transparently running, with little or no modification, operating systems that were designed for single-processor execution on multiprocessor hardware.
    2. VMs at that time had high overhead, and resource sharing between virtual machines was not efficient.

    Contribution:
    1. Helps operating systems that were not originally designed for multiprocessors run on them, with the help of a new layer between the operating system and the hardware. This reduces the development cost of operating systems and makes them more reliable. The monitor's code is relatively small, and developers can add new code to the VMM without introducing software bugs.
    2. Disco separates the hardware and software layers, which helps the portability of VMs. Multiple virtual machines run at the same time, improving scalability. Failure of one VM doesn't affect the others, which helps with fault containment.
    3. Another important contribution was the introduction of a second-level software TLB. Since the hardware TLB must be flushed on a context switch, the second-level TLB effectively increases the TLB's size and thus improves performance.
    4. Introduces a dynamic page migration and replication policy with the help of cache miss counters maintained by the FLASH hardware. Accessing data from local memory increases performance.
    5. Disco's copy-on-write disks allow virtual machines to share both main memory and disk storage efficiently. Copy-on-write was implemented only for non-persistent disks to simplify the implementation.

    Evaluation:
    SimOS is used, as the FLASH hardware was not available during testing. Exhaustive testing was done with different kinds of workloads. The execution overhead was mainly due to trap emulation and TLB reload misses, and was between 3% and 16%. The authors claim a larger page size would reduce this overhead. With 8 VMs running, execution time was about 60% of the time required on a non-VM system.

    Confusion:
    Performance results may be skewed because of the use of a simulator. Would benchmarking on real hardware match the simulated performance?

    Summary
    This article describes the motivation and implementation of Disco, which is a virtual machine monitor that allows multiple operating systems to be run efficiently on one system. Disco provides another level of software in between the hardware and the operating systems that virtualizes all of the resources (such as I/O, network interface, and memory) of the system and provides a uniform interface to the operating system. Because of this, multiple operating systems that are customized to specific workloads can be run efficiently on the same machine.
    Problem
    Disco solves the problem that, with each iteration of hardware change, the system software had to be changed dramatically to exploit the new hardware, which was time consuming and challenging. Disco was an attempt to minimize the gap between hardware innovations and the accommodating software by introducing a layer of indirection over all of the system’s resources, while doing so efficiently and not adding too much overhead from the virtual machine monitor.
    Contributions
    The main contribution of Disco was the efficient implementation of a virtual machine monitor that allowed multiple operating systems to run on the same machine. For example, when addressing memory, Disco provided another layer of indirection that mapped physical addresses to machine addresses. This mapping was performed by the hypervisor, but to make it more efficient, instead of populating the TLB with physical addresses, it would insert the corresponding machine addresses. In the case of a TLB miss, the lookup is forwarded to a second-level software TLB, which is simply a cache of recent virtual-to-machine translations. Another contribution of Disco was the dynamic page migration and replication system to address the non-uniform memory access of the system. In order to provide the illusion of uniform memory access, Disco either replicates or moves pages to provide locality between a virtual CPU and its memory pages.
    Evaluation
    Overall, the introduction of the virtual machine monitor incurred a manageable amount of overhead, contributing only 3% to 16% to execution time. Disco was evaluated with 4 different workloads: software development, hardware development, scientific computing, and a commercial database. They found that the I/O-bound software development workload was particularly stressful on the system, producing the 16% overhead. The paper was also able to demonstrate how the page replication and migration implementation increased performance. For the engineering workload, they saw a 33% decrease in execution time from running on IRIX to running on Disco. The Raytrace workload saw a similar decrease of 38%.
    Confusions
    I’m a bit confused about the copy-on-write disks of Disco. I understand the usefulness of not copying the data until you have to write to it, but what was the use of the B-tree index in this implementation? Is it simply used to know which pages belong to the virtual machine’s virtual disk?

    Summary: This paper presents a novel way to support new multi-processor hardware using existing operating systems. The trick is to use virtualization: add a new layer of software called virtual machine monitors between OS and hardware, and make upper-level operating systems running as virtual machines.

    Problem: When new hardware is released, substantial changes have to be made to existing operating systems in order to support it. As modern operating systems become more and more complicated, these software changes require a lot of human effort and money. Sometimes the lack of software support has been one of the obstacles preventing hardware innovation.

    Contributions:
    1. Virtualization: an innovative approach to supporting large-scale multiprocessor machines. A software layer called the virtual machine monitor is added between the hardware and existing operating systems. Different operating systems can run as guests on top of it. The virtual machine monitor manages and schedules the hardware for the guest OSes, so every guest OS thinks it has entire control of the hardware, which is actually virtual hardware managed by the monitor. Disco is implemented in a way that keeps the overhead of virtualization very small.

    2. The way Disco manages virtual CPUs. There is a data structure that stores the state of every virtualized CPU while it is not scheduled on a real CPU. In addition, when the system is running Disco code, it is in kernel mode. When running guest OS code, it is in supervisor mode. When running user programs, it is in user mode.

    3. Virtual physical memory management. The physical memory that appears to guest OSes is actually virtual physical memory managed by Disco. There is an added layer of address translation where the virtual physical address is translated into the real machine address. When the guest OS tries to modify the TLB, Disco emulates the operation by replacing the virtual physical address with the real machine address.

    4. Emulating IO devices by intercepting every access to an IO device and forwarding it to a real device. In case of DMA accesses, Disco has to translate the physical address into machine address and then the access is done by Disco directly.
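    The DMA address rewriting described in point 4 might be sketched as follows; the function name and the dict-based pmap representation are assumptions for illustration, not Disco's actual code:

```python
# Hypothetical DMA interception sketch (names invented). The guest programs
# a DMA request with guest-physical addresses; the monitor intercepts it and
# rewrites each address into a machine address before handing the request
# to the real device.

PAGE_SIZE = 4096  # assume 4 KB pages

def translate_dma(request, pmap):
    """Rewrite a guest DMA request's physical addresses to machine addresses.

    request: list of guest-physical byte addresses.
    pmap: guest-physical page -> machine page, maintained per VM.
    """
    translated = []
    for paddr in request:
        page, offset = divmod(paddr, PAGE_SIZE)
        maddr = pmap[page] * PAGE_SIZE + offset   # keep the page offset
        translated.append(maddr)
    return translated

pmap = {1: 9}
assert translate_dma([PAGE_SIZE + 8], pmap) == [9 * PAGE_SIZE + 8]
```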

    Evaluation: The authors evaluated Disco by running a set of workloads on Disco and comparing their performance with the performance when running on commodity OSes. They demonstrated that the overhead of virtualization is acceptable and that virtualization can make certain workloads scale. However, the evaluation is too simplistic: benchmarks are done on simulators, not on real hardware, and the simulator does not even simulate a real MIPS R10000 processor.

    Confusion:
    I am confused about when Disco will be useful. Disco is designed to support large-scale multiprocessor systems, but applications eventually have to run on commodity operating systems that do not have good support for multiprocessor machines. It seems to me that what Disco has done is turn one powerful machine into individual normal machines. Under what circumstances will Disco be useful?
