CS 736 Reviews - Spring 2016: Remus: High Availability via Asynchronous Virtual Machine Replication

Summary : Designing a highly available system that survives hardware failure is expensive, as it involves redundant special-purpose hardware and re-engineering software to include complicated recovery logic. The authors aim to provide a high availability system that is general (regardless of applications and hardware), transparent (doesn’t require modification of the OS or applications) and offers seamless hardware failure recovery (no externally visible state is lost, and failure recovery is fast). They hence propose Remus, a virtual machine (VM) based whole-system replication scheme in which the whole VM state is frequently checkpointed and the protected and backup VMs are located on different physical hosts. Remus employs speculative execution: state is buffered and synchronized to the backup later, while execution continues ahead of the synchronization point. In addition, asynchronous replication is used: output at the primary server is buffered, which allows replication to proceed asynchronously, and primary VM execution overlaps state transmission. In Remus's failure model, the failure of a single host is tolerable, the protected system's data is left in a crash-consistent state, and output is not made externally visible until the associated system state is committed. They implemented it based on Xen’s live migration semantics to provide fine-grained checkpoints (25 ms) across 2 hosts connected via redundant gigabit Ethernet connections. At every failure there was an observed drop in network throughput, outbound packet latency and performance, whereas the backup disk image remained consistent.
Confusion : 1) Which live migration approach, pre-copy or post-copy, is being used? Which approach would fare better for contributions such as pipelined checkpointing?

Posted by: Shruthi Racha | April 21, 2016 09:11 AM

Summary:
This paper talks about Remus, a high availability service that allows unmodified existing software to be protected from physical machine failures. Remus runs the application in a VM on a physical machine paired with another machine as a backup host. It asynchronously propagates changed state to the backup using checkpointing, and uses speculative execution to run the active VM concurrently with the backup. On failure, the backup starts and resumes from the last checkpointed state. Remus guarantees not to lose any externally visible state regardless of the moment at which the primary fails.

Confusion:
How does execution start again after a failure? Is latency taken into account during recovery?

Posted by: Ankur Srivastava | April 21, 2016 09:00 AM

1. Summary
Remus was designed to provide a high availability (HA) service that requires no special-purpose hardware or modifications to the existing OS. It uses speculative execution and VM-based asynchronous system replication, with generality, transparency and seamless failure recovery as its primary goals. Dirtied pages are replicated in four steps by an epoch-based pipelined checkpoint system built on live migration. Heartbeat/timeout is used to detect failures. Since this was primarily designed for physical disaster recovery, there's a 25ms checkpoint with a 1s network delay in the evaluation carried out on top of Xen X11. Crash consistency is maintained if both machines fail. With its low cost and transparency, the paired configuration provides HA dynamically, with some performance overhead for low-latency applications.
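
As a rough illustration of the heartbeat/timeout idea mentioned above, here is a minimal Python sketch. The class name `HeartbeatMonitor`, its methods, and the timeout value are hypothetical and are not taken from the paper or from Xen; the sketch only shows that a peer is presumed failed once nothing has been heard from it for longer than a configured timeout.

```python
import time


class HeartbeatMonitor:
    """Hypothetical detector: the peer is presumed failed when no
    heartbeat has been recorded within `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout        # seconds of silence tolerated (assumed value)
        self.clock = clock            # injectable clock, so the logic can be tested
        self.last_seen = clock()      # time of the most recent heartbeat

    def beat(self):
        # Called whenever a message (e.g. a checkpoint or an ack) arrives.
        self.last_seen = self.clock()

    def peer_failed(self):
        # True once the silence has lasted longer than the timeout.
        return self.clock() - self.last_seen > self.timeout
```

In a primary/backup pair, each side could run one such monitor for its peer: the backup would treat missing checkpoints as a primary failure, and the primary would treat missing acknowledgments as a backup failure.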

2. Question
Specific on Xen architecture: domain 0.
What are lightweight copy-on-write snapshots used by Parallax?

Posted by: Sejal Chauhan | April 21, 2016 08:58 AM

Summary This paper presents Remus, a general and transparent high availability service offered by a virtualization platform that protects existing unmodified software from failure when run on commodity hardware. The whole state of the primary VM is asynchronously and frequently checkpointed to a backup VM on a different physical host (they maintain the same system state); using speculative execution, the primary continues execution beyond the synchronous state. The implementation is based on Xen’s support for live migration: using an epoch-based approach and an optimized stop-and-copy phase of live migration, the high-frequency checkpoint operation is pipelined. Outbound network packets are queued until the checkpoint is acknowledged (thus external state is synchronous), and writes to the active VM's disks are configured to be write-through to a memory buffer on the backup. The system's functionality is validated by having the backup take over for the primary upon failure, and the performance is evaluated using different workloads.
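
The disk handling described above can be pictured with a small Python sketch. Everything here is an assumption for illustration (the class `BackupDiskBuffer`, its method names, and the dictionary standing in for a disk image); it is not Remus's actual code. It shows only the idea that mirrored writes wait in the backup's memory and reach the backup disk once the matching checkpoint is committed.

```python
class BackupDiskBuffer:
    """Hypothetical model of the backup's in-memory disk write buffer."""

    def __init__(self):
        self.disk = {}      # committed image: block number -> bytes
        self.pending = {}   # writes mirrored since the last committed checkpoint

    def receive_write(self, block, data):
        # A write from the primary is held in memory, not applied yet.
        self.pending[block] = data

    def commit_checkpoint(self):
        # The checkpoint is complete: buffered writes become part of the image.
        self.disk.update(self.pending)
        self.pending = {}

    def failover(self):
        # The primary failed mid-epoch: drop uncommitted writes so the disk
        # matches the last checkpointed memory state.
        self.pending = {}
        return self.disk
```

Discarding the pending writes on failover is what keeps the disk consistent with the memory image the backup resumes from.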

Confusion How does the size of the memory buffer on the backup VM impact the HA service provided?

Posted by: shreya kamath | April 21, 2016 08:55 AM

Remus is designed to allow applications to survive hardware failures. It is a transparent (the application need not be changed at all) and high availability (complete replication) service that allows existing unmodified software to be protected from failure of the physical machine on which it runs. It asynchronously propagates changed state to a backup host at frequencies as high as forty times a second, and uses speculative execution to run the active VM ahead of the replicated system state. The network buffering required to ensure consistent replication imposes a performance overhead on applications that require very low latency.
The system has been evaluated with a variety of workloads, and key results show that workloads which are RAM-hungry and also very sensitive to network latency are a poor fit, as the implementation trades network delay for memory throughput. The effect of the disk buffering mechanisms has also been studied, and the results show that replication does not have a significant impact on disk performance.

Confusion:
Can we please discuss the flat curve observed with increase in checkpoints for network buffers?

Posted by: Vishakha Dhelia | April 21, 2016 08:54 AM

Summary:
This paper presents the design and implementation of Remus, a high availability solution that protects unmodified applications running on virtual machines from host machine failures by transparently letting them continue execution on a different physical machine, exploiting the live migration capability of VMs. The system uses a primary-backup mechanism in which changes are asynchronously replicated onto the backup. It achieves high performance through speculative execution, where the primary runs ahead of the replicated state; this state is not made externally visible, because external output is buffered and not released until the checkpoint is complete.

Confusion:
What happens when the primary recovers from failure? Does execution switch back to the primary, or does it become a backup and receive checkpoints?

Posted by: Aishwarya Ganesan | April 21, 2016 08:51 AM

1. Summary
Remus is a generic and transparent high availability service in the VMM layer that allows existing unmodified software to be protected from failure of the physical machine. Remus provides a high degree of fault tolerance by encapsulating protected software in a virtual machine and asynchronously copying changed state to the backup host/VM, which allows the running system to resume execution on a different host. Remus discretizes the execution of a virtual machine into a series of replicated snapshots. The system is designed around the goals of generality, transparency and failure recovery, which it meets by pipelining checkpoints, copying memory and CPU state, buffering network output between checkpoints, buffering disk writes, and detecting failures. Evaluation results show that Remus provides high availability at reasonable overheads for most workloads.

5. Confusion
What is speculative execution?
Could you explain network buffering?

Posted by: Anshul Purohit | April 21, 2016 08:50 AM

Summary
The paper discusses the design, implementation and evaluation of Remus, a software system that provides High Availability (HA) as a service to client applications by using asynchronous Virtual Machine(VM)-based whole state replication. Remus provides high fault tolerance in an OS and application agnostic way on general purpose commodity systems by frequently checkpointing the entire state of the running application's VM between the primary and backup (standby) hosts and by conducting a seamless and fast failure recovery. Remus internally uses techniques of speculative execution (primary host continues execution assuming that previous checkpoint will complete replication on backup), asynchronous replication (buffering application output at primary so that replication can be performed asynchronously), pipelined checkpointing, timeout based failure detection, and intelligent identification and handling of modified memory pages and disk blocks to reduce the overheads associated with ensuring transparent high availability. The evaluation of the Remus implementation indicated that it was able to provide high availability only at the cost of significant degradation in performance, which was especially high for network latency sensitive applications. These observations prompted the authors to suggest multiple improvements to their design.

Questions / Confusion
1. The part on modifications made in the suspend request handler for paravirtual guests in Xen was not clear.
2. What is the exact role of activation records in recovering from multi-host crashes ?

Posted by: Shantanu Bhate | April 21, 2016 08:49 AM

Summary
This paper talks about the Remus system, which provides transparent high availability using commodity hardware and operating systems. A key goal of the system is to recover quickly and seamlessly from a fail-stop failure. Remus uses frequent checkpoints that replicate the state of an active VM. In order to retain acceptable performance, Remus uses speculative execution and buffers the network output of the VM until the previous checkpoint commits.

Confusion
When they talk about crash consistency in this paper, are they referring to file system or application crash consistency? Further, the use of the disk activation record was not clear.

Posted by: Brian Coutinho | April 21, 2016 08:49 AM

1. Summary
This paper describes a system which leverages Xen to maintain two machines, an active VM and a backup machine which replicates the memory state of the active VM as frequently as 40 times a second. Network requests are held until the backup VM is updated.
4. Confusion
Has this work been extended to maintain, e.g., a backup cluster (so that network requests within the cluster are answered immediately)?

Posted by: Stephen N. Lee | April 21, 2016 08:40 AM

Summary Remus allows unmodified applications and operating systems to transparently enjoy the benefits of high availability via replication and checkpointing of virtual machine state. The system consists of a primary machine and one or more replica machines. Rather than attempting to synchronize multiple (possibly non-deterministic) executions, the primary machine's state is copied to the replicas periodically, and output is buffered and only released when the primary's state is checkpointed.

Confusion The performance doesn't seem great - how does Remus compare with other high availability solutions? For what use cases is Remus acceptable?

Posted by: Michael Vaughn | April 21, 2016 08:40 AM

Summary
With Remus, the authors provide a high availability system on commodity machines, as opposed to employing redundant components or expensive special-purpose hardware. It works transparently, without any modifications to the OS or applications, and achieves fast and seamless failure recovery. It does so in a virtualized environment by continually live migrating a copy of a running VM to a backup server, which automatically activates if the primary fails. Key features are asynchronous transfer of the entire system state (an up-to-date and exact copy) through output buffering, checkpointing implemented as repeated executions of the final round of live migration, and speculative execution, which allows the primary to continue execution ahead of the synchronization point. It is implemented on the Xen VMM; at every failure point there is low network throughput and performance, but no inconsistency in the backup disk image.
Confusion
How do they arrive at the parameter values for failure-detection timeouts, buffering, etc.? Some of the Xen terminology was not explained in the paper. The disk buffering mechanism was not clear.

Posted by: Tithy Sahu | April 21, 2016 08:39 AM

1. Summary
The paper talks about Remus, a transparent and general system that provides fault tolerance for software running on commodity hardware. Remus ensures high availability through speculative execution on a VM and asynchronous replication of its state to a backup VM. In the event of failure, the backup VM kicks in within a short interval and the system resumes execution from a recently known state.

2. Confusions
How is frequency of checkpointing related to the fault tolerance provided by Remus?
You mentioned in a previous class that only around a 1% performance hit is accepted for fault-tolerant solutions. Did ideas like Remus get implemented in real time systems? How is fault tolerance provided in current servers?

Posted by: Bharadwaj Krishnamurthy | April 21, 2016 08:33 AM

1. Summary
This paper talks about the design and implementation of Remus, a highly available system which does not require specialized hardware or changes to applications. Being OS and application agnostic allows Remus to provide high availability to a broad class of applications, including legacy ones. Remus ensures that the fail-stop failure of any single host is tolerable, and in the event of both the primary and the backup failing, the protected system's data will be left in a crash-consistent state. Remus does not make output externally visible until the associated system state has been committed to the replica. In the face of a failure, Remus switches to the replica without loss of any active state. Remus ensures high availability by using periodic checkpointing to back up the state of the primary host. This checkpointing is done asynchronously. Delayed commit and speculative execution help ensure that replication happens with reasonable degradation in the common case.

2. Questions
1. What are queueing disciplines in Linux?
2. How is high availability provided in current systems?

Posted by: Urmish Thakker | April 21, 2016 08:31 AM

Summary
The paper describes Remus, a software system that provides OS- and application-agnostic high availability on commodity hardware. It achieves this by frequently checkpointing whole virtual machine state and keeping the protected VM and backup VM on different physical hosts. State is buffered and synchronized to the backup later, while execution continues ahead of the synchronization point. Buffering output at the primary server allows replication to be performed asynchronously, and primary VM execution overlaps state transmission. The Remus implementation is based on Xen’s support for live migration to provide fine-grained checkpoints. Checkpointing runs at high frequency. The virtual machine does not actually execute on the backup host until a failure occurs.

Confusion
Terminologies of Xen

Posted by: Nivetha Singara Vadivelu | April 21, 2016 08:15 AM

Summary:
The paper discusses Remus, which provides high availability and fail-stop fault tolerance for applications running on commodity hardware through VM replication in the event of hardware failures. It does not require changes to the OS or hardware. Application state is simultaneously maintained in a VM on both a primary machine and a backup machine. It uses speculative execution and asynchronous checkpointing to efficiently maintain state on the backup for faster recovery. On the primary’s failure, the backup starts running from the latest checkpointed state. It performs output buffering and releases output only after checkpointing state to the backup host (consistency). Additional optimizations, like batching network output, improve performance. Evaluation: the network interface is the bottleneck, so Remus is not suitable for applications for which n/w latency is critical under an SLA.

Confusion:
What techniques are currently used for high availability of VMs in data centers? Does Remus also have a user-space component to interact with the backup host, or is it all in the VMM?

Posted by: Siddharth Suresh | April 21, 2016 08:13 AM

Summary
This paper discusses constructing a high availability service using a Virtual Machine Monitor, presenting a system extension to the Xen VMM called Remus that provides high quality HA support for common server hardware and software. HA is achieved by keeping a backup physical host mirroring the state of the serving host with high-frequency checkpointing, on the order of tens of milliseconds. This is achieved by highly pipelining the memory copy operations, borrowing and optimizing the live migration mechanism in the Xen system. Remus also guarantees that the VM state behind any outbound network packet is held at the backup before the packet is sent out, so that after a crash the VM is brought back up in a way that is invisible to any external observer.
Confusion
Can we have some background knowledge introduced on the different functionalities provided by Xen, as is mentioned in the paper?

Posted by: Fujie Zhan | April 21, 2016 06:28 AM

Summary
The paper presents Remus, a novel system for retrofitting high availability onto existing software running on commodity hardware. It describes an approach to bring high availability as a platform service for virtual machines. Remus provides fault tolerance against fail-stop failure of any single host and ensures crash consistency in the event both primary and backup hosts fail concurrently. Moreover, outputs are guaranteed to be externally visible only after the associated system state is committed to the replica. Thus a running system can transparently continue execution on an alternate physical host in the face of failure with only seconds of downtime. Host state such as active network connections is not lost. It encapsulates protected software in a virtual machine to facilitate whole-system replication and performs asynchronous replication. Remus runs paired servers in an active-passive configuration, but instead of running them in lock-step, it uses speculative execution to concurrently run the active VM slightly ahead of the replicated system state. Evaluations in the paper show that Remus provides a high degree of fault tolerance but is sensitive to network latency.
Confusions
What is meant by the statement that block devices are incorporated in the state replication protocol?

Posted by: Amrita Roy Chowdhury | April 21, 2016 04:16 AM

1. Summary
The paper introduces Remus, a solution to provide High Availability as a service without specialized hardware or application modification. It accomplishes this by replicating virtual machines instead of applications. The paper develops a checkpointing protocol to allow the backup VM to stay in lock step with the primary allowing seamless recovery, transparent to all network clients. This goal is accomplished by checkpointing both memory and disk contents in contrast with most solutions that only replicate storage leading to a crash consistent image. This solution does impose performance penalties especially on network heavy applications and would require many more cycles of optimizations till it offers a reasonable reliance/performance tradeoff.
2. Confusion
What are Linux queuing disciplines? Does there exist an alternative to memory state replication, as that is what takes a large chunk of the checkpointing time?

Posted by: Abhinav Mehra | April 21, 2016 03:36 AM

Summary
The paper describes a subsystem, Remus, that aims to provide high availability for applications even on a hardware failure, without any changes to them or the underlying commodity operating system. It does this by running applications in a VM on a physical machine which is paired with another machine as its backup. Remus uses techniques like speculative execution and asynchronous checkpointing to optimally keep backups of the system state. On a physical failure of the host, the backup starts up and resumes execution from the latest checkpoint. On evaluating such a system, it turns out that the system works well only for applications that are not sensitive to network latency.

Confusion
How are clients moved to the backup machine ?
Is such network latency tolerated in real world ?

Posted by: Akshay Kanfade | April 21, 2016 03:31 AM

Summary
This paper is about the design and implementation of Remus, a high availability system that allows execution of applications transparently on an alternate machine during hardware failures on virtualized infrastructure. This is achieved by asynchronously propagating changes to the system state such as memory, CPU, disk to a backup host, buffering of outputs and releasing them only after receiving checkpoint acknowledgments from backup, combined with speculative execution on the primary host. Remus implementation is based on the Xen VMM with changes introduced for features such as checkpoint support, asynchronous transmission.

Confusion
When does Live Migration happen in the case of Remus? Is it only during failure of primary host?

Posted by: Sharanya Devaraj | April 21, 2016 03:16 AM

Service Provided :
High availability / fail-stop fault tolerance by enabling a running system to transparently continue on an alternate physical host.

Goals :
Generality, Transparency, Seamless failure recovery.

Benefits :
Software system.
No externally visible state is lost
Transparent to applications.
Commodity hardware
OS and application agnostic.

How :
Whole system replication - Backup VM on another physical host.
Speculative execution - synchronization points, buffer output.
Asynchronous state replication (4 phases) and checkpointing.
Many optimizations to boost performance.

Limitations :
Introduces significant network delay.

Questions :
Which parts of Remus live where? If in the VMM, how does it coordinate across physical hosts?
Cost - Benefit analysis for Remus, real world examples that would use it?

Posted by: Adithya Bhat | April 21, 2016 02:44 AM

1. Summary
This paper presents the design and implementation of Remus, a software system that provides OS and application agnostic high availability on commodity hardware. Unlike existing solutions that required special-purpose hardware and complicated modifications in applications for recovery, Remus aims to provide high availability and rapid recovery in the face of hardware failure in a generic fashion (not specific to an application or hardware) without any modifications to the OS or application. It uses asynchronous-replication-based periodic checkpointing to back up the state of a primary VM running on one physical host onto a secondary VM running on a different physical host. By using a combination of delayed commit and speculative execution, it makes sure no state is lost on hardware failure, with a reasonable degradation of common-case performance.
2. Confusion
* The paper talks about crash consistency in a number of places. What exactly is the meaning of crash consistency in this context?
* Why are the writes to disk from the active VM treated as write-through?
* Could you talk about the queuing disciplines in Linux? How does Remus use them?
* Can you give examples of some applications/areas where high availability is required and such solutions would be valued?

Posted by: Lokesh Jindal | April 21, 2016 02:36 AM

1. Summary
The authors have come up with a design that provides seamless high availability (HA) for unmodified applications running on VMs, at the cost of higher latency. HA was achieved by asynchronously transferring changes to the backup host (at intervals as short as 25ms), maintaining checkpoints, and writing through to the physical machine. They provide it as a service in the virtualization layer which can simply be enabled. The main limitation was the network interface, which was evident in the evaluation.

2. Confusion
I would like to discuss more about terminologies in Xen - what is xend, xenstore, domain 0?
Also, what are the current migration optimizations which are used today?

Posted by: Vikas Goel | April 21, 2016 01:28 AM

Summary
Remus is a fault-tolerance system that essentially creates snapshots of software to protect them from failure. One advantage is that it does not require modification of the original software, and on the backup, may begin executing (almost) immediately once failure is detected. It uses a four-step process of checkpointing, replication, acknowledgment, and release with aggressive pipelining in each epoch in order to create backup copies. In exchange for this increased robustness, though, Remus introduces significant network latency.
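
The four steps named above (checkpoint, replication, acknowledgment, release) can be sketched as a single-threaded Python loop. This is a simplified illustration under stated assumptions: the `Primary` class, its method names, and the `send_checkpoint` callback are hypothetical, and a real system would overlap these stages with VM execution rather than run them in sequence.

```python
from collections import deque


class Primary:
    """Hypothetical primary host that withholds output until the backup acks."""

    def __init__(self, send_checkpoint):
        self.send_checkpoint = send_checkpoint  # callable(snapshot) -> True on ack
        self.dirty = {}            # pages modified during the current epoch
        self.out_queue = deque()   # outbound packets not yet visible to clients
        self.released = []         # packets that have actually left the host

    def write_page(self, page, data):
        self.dirty[page] = data

    def send_packet(self, pkt):
        self.out_queue.append(pkt)  # buffered, never sent directly

    def run_epoch(self):
        # 1. Checkpoint: capture the dirty state and note how much output
        #    belongs to this epoch. The VM could resume right after this.
        snapshot, self.dirty = self.dirty, {}
        boundary = len(self.out_queue)
        # 2. Replication and 3. acknowledgment.
        acked = self.send_checkpoint(snapshot)
        if not acked:
            # Keep the state for a retry; newer writes override older ones.
            self.dirty = {**snapshot, **self.dirty}
            return False
        # 4. Release: only output produced before the checkpoint goes out.
        for _ in range(boundary):
            self.released.append(self.out_queue.popleft())
        return True
```

Packets queued after the checkpoint boundary stay buffered until the next epoch is acknowledged, which is where the added network latency comes from.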

Confusion
What minimum requirements does Remus place on a network, in terms of bandwidth/loss, in order to maintain performance?

Posted by: En-Ui Lin | April 21, 2016 01:26 AM

Summary
The paper explains Remus, a fault-tolerant system which provides high availability by using a virtual machine to run unmodified applications and replicating the state of the machine asynchronously using checkpoints. Remus also allows the host to execute speculatively to avoid delays. Network packets are buffered until the backup acknowledges the checkpoint, whereas the disk is write-through and requests are sent simultaneously to a buffer at the backup. Overall the system performs well in case of failure but is not suited for network-sensitive systems; the paper does discuss how this could be overcome by using compression or XOR for checkpoints.
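
To make the "compression or XOR" idea concrete, here is a minimal Python sketch. It assumes that the sender keeps the previous version of each page and that the receiver holds the same previous version; the function names and the 4096-byte page size are illustrative choices, and `zlib` stands in for whatever compressor a real implementation might use.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size for the example


def encode_page(old, new):
    """XOR the new page against the old one, then compress the delta.
    Unchanged bytes become zeros, which compress very well."""
    if len(old) != len(new):
        raise ValueError("pages must have the same length")
    delta = bytes(a ^ b for a, b in zip(old, new))
    return zlib.compress(delta)


def decode_page(old, payload):
    """Rebuild the new page on the backup from its old copy and the payload."""
    delta = zlib.decompress(payload)
    if len(delta) != len(old):
        raise ValueError("delta does not match page length")
    return bytes(a ^ b for a, b in zip(old, delta))
```

When only a few bytes of a page change between checkpoints, the payload is far smaller than the page itself, which is the saving such a scheme would be aiming for.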
Confusion
Could you please explain how Remus can be scaled for multi-core systems?
How do AWS or other similar services provide high availability at lower cost? Do they maintain a N-1 backup mechanism or paired systems like Remus?

Posted by: Mushahid Alam | April 21, 2016 01:26 AM

Summary
This paper talks about a new design called Remus, whose main responsibility is to provide a high degree of fault tolerance while being both system and OS agnostic (no changes to hardware or software). It takes ideas from live migration of VMs, asynchronous replication and speculative execution. The paper starts by briefly explaining the goals (generality, transparency, seamless failure recovery) and approaches of the system. It then explains the design and implementation details, covering how checkpointing and migration of memory, CPU and disk (with buffering) work and how output to the client gets buffered (network buffering until the checkpoint acknowledgment is received). Finally, the authors talk about the significant performance overhead caused by using this system and some modifications made to improve its performance.

Confusion
How often does checkpointing/migration happen, and is there any correlation between the rate of checkpointing and the amount of memory modified? Does the conversion of inbound to outbound traffic through an intermediate queuing device (because Linux queuing disciplines only operate on outgoing traffic) still happen?

Posted by: Anubhavnidhi "Archie" Abhashkumar | April 21, 2016 01:00 AM

Summary:
The paper describes the design and implementation of Remus, a system that ensures availability of existing applications running on commodity hardware, in the event of hardware failure. Remus employs a service at the Virtual Machine Monitor that does regular, asynchronous checkpointing of externally visible state (includes memory and persistent data) of a given VM to a backup host (different physical machine) to ensure seamless continuation of execution at the backup host. Remus primarily follows suspend-and-copy semantics and ensures that outbound network packets are buffered until an acknowledgement of a successful checkpoint is received from the backup host to maintain an externally consistent state.

Confusion:
Are applications that sensitive to hardware failures? Why isn’t the plain vanilla style of recovering from crash consistent persistent state good enough?

Posted by: Prashanth Balasubramanian | April 21, 2016 12:46 AM

Summary
Remus aims to provide a high degree of tolerance to hardware failure without requiring specialized hardware or sophisticated software redesign. It chooses the virtual machine as the basic unit and tries to achieve fast and frequent replication. The primary host is allowed to continue execution while checkpointing, but the output is only visible to the client once an acknowledgment from the backup host is received.

Confusion
What does it mean by saying “the suspend program was converted from a one-shot procedure into a daemon process”? (p165 top-right corner)

Posted by: Xiangjin Wu | April 21, 2016 12:46 AM

1. Summary
This paper introduces Remus, whose goal is to provide a high availability system by using virtualization to encapsulate the protected VM and frequently checkpointing the whole system to asynchronously replicate the state of a single speculatively executing VM. It provides the properties of generality, transparency and seamless hardware failure recovery, thus handling fail-stop failures and leaving the system in a crash-consistent state. Such a system could be more useful for providing distributed services to customers.
2. Confusion
Could you please explain Xen specific terms like domain 0? How are VMM layer and VM management failures handled in event of hardware failures? How does Remus transfer checkpoints in multiprocessing environment?

Posted by: Unmesh Phalak | April 21, 2016 12:25 AM

Summary
This paper describes Remus, a software system that provides OS/application agnostic High Availability (HA) on commodity hardware. The system employs virtualisation and leverages the advances made for Xen's live-migration feature to provide frequent asynchronous whole-system replication with speculative execution to build a solution that is transparent, generic and yet bestows seamless failure recovery.

Confusion
Disk mirroring using blktap and tapdisk is not clearly explained
Per-domain watchdog counters are also not clearly explained.
The latest design for XenServer HA is available at http://xapi-project.github.io/features/HA/HA.html

Posted by: Vinothkumar Siddharth | April 21, 2016 12:13 AM

Summary

The paper discusses Remus, a software system that provides high availability on commodity hardware without requiring the software (OS + application) to be modified. Remus uses virtualization to encapsulate a protected VM and does not run the two hosts in lock-step. Instead it lets the primary host execute speculatively and then performs frequent whole-system checkpoints to replicate the state of the VM.

Confusion

1. I am confused why they have separate way of handling memory and disk. Any change to on-disk data will have to be done via memory. So clever checkpointing can/should ensure that consistency is guaranteed. Right?

2. Wouldn't ops see high latency in a cloud environment, given that external output is not released until the checkpoint is done? Frequent checkpointing would hurt a lot.

Posted by: Yuvraj | April 21, 2016 12:01 AM

Summary
This paper describes a software subsystem known as Remus that provides High Availability. Normally, high availability is achieved either by modifying the application software to include complex recovery logic or by using specialised hardware. Remus, however, strives to provide OS- and application-agnostic high availability on commodity hardware. The authors describe the various features of Remus: active-passive configuration, speculative execution, asynchronous high-frequency checkpointing (memory and CPU checkpointing), disk buffering and network buffering. Next, the authors evaluate Remus using 3 different workloads and conclude that Remus is not the best choice for applications that are critically sensitive to network latency.

Confusion
Does Xen still use the same approach described in the paper to carry out network buffering? Aren't the memory and CPU checkpoints flushed to disk?

Posted by: Arjun Singhvi | April 20, 2016 11:45 PM

1. Summary
The paper proposes a new system design for providing high availability in the face of common machine failures. In this design, a VM is used to encapsulate the state of a service, and this VM is replicated to another machine periodically. The entire process is transparent to the VM, and the VM is not stopped while the replication is performed. Through the use of speculative execution during replication, as well as withholding outbound network packets until a replication has succeeded, the system is able to guarantee sufficient performance as well as strong consistency and transparency to the users.
2. Confusion
I’m confused about how they find modified memory pages. They say they “quickly” filter out clean pages. What is their mechanism?

Posted by: Arman Shanjani | April 20, 2016 11:24 PM

Summary: This paper aims to increase the availability of systems, without making any changes to the application or hardware, by using a VMM, and replicating the guest VM state periodically using checkpoints. Realizing that this would seriously degrade the performance, the authors used optimizations like batching network output, pipelining checkpoints, and speculative execution.
Confusion: What is live migration? It is mentioned that the proposed system does not let any state become visible until the checkpoint has been ack'ed by the backup. Does this increase the response time of interactive applications? In case the primary fails, how is user input directed to the backup? Is the VMM of the primary contacted in this case to forward the request to the backup?

Posted by: Mohit | April 20, 2016 11:12 PM

1. Summary
The paper describes a system that transparently ensures high availability on commodity hardware. This system is called Remus and is designed so that there are minimal changes to the OS to ensure quick checkpointing and no changes to the software running on the OS. The evaluation showed that the system is not suitable for applications sensitive to network latency, even though it is very efficient at state replication.

2. Confusion
I did not understand how memory and CPU checkpointing can work in tandem with disk buffering and ensure the entire system is consistent.

Posted by: Mihir Shete | April 20, 2016 11:06 PM

Summary
This paper presents the design and implementation of Remus, a software system that provides high availability on commodity hardware without requiring any modifications to the underlying operating system or the applications. The novelty of the paper lies in the use of whole-system checkpoints to asynchronously replicate the state of a virtual machine between a primary and a backup host, while achieving higher performance through speculative execution on the primary host. The authors claim that Remus is a significant point in the design space of modern servers, as it enables high availability for VMs to be dynamically provided as a service at a “click of a button”.

Confusion
Is Remus still a good idea in current cloud computing settings where thin VMs have network attached cloud storage? Or in other words, what kind of applications are the benefits of speculative execution restricted to?

Posted by: Saket Saurabh | April 20, 2016 11:04 PM

Summary
Remus, software running on top of a virtual machine, provides high availability by copying any changed state to a backup machine. It replicates snapshots of an entire active OS at each epoch. All state is maintained on two machines so that the backup host can take over when the active host fails. Disk and network output are controlled with buffering to synchronize the state between the active and backup hosts. To improve performance, it uses asynchronous checkpointing, which copies the state while the active host runs speculatively, rather than a synchronous method in which the active host must wait until the checkpoint is done. In addition, Remus uses pipelined checkpoints: 1) pause the running VM and copy any changed state into a buffer, 2) transmit the buffered state and store it in memory on the backup, 3) acknowledge to the active host once the state has been received, 4) release the buffered network output.
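The four pipeline stages listed above can be sketched in Python roughly as follows. This is only an illustrative model under stated assumptions, not Remus's actual implementation: the `Primary` and `Backup` classes and all of their methods are hypothetical names, memory is modelled as a dict of pages, and the backup is assumed to always acknowledge.

```python
from dataclasses import dataclass, field


@dataclass
class Backup:
    """Hypothetical backup host: holds the last acknowledged memory image."""
    memory: dict = field(default_factory=dict)

    def receive(self, dirty_pages: dict) -> bool:
        # Stage 2: store the transmitted state in memory on the backup.
        self.memory.update(dirty_pages)
        # Stage 3: acknowledge receipt to the active host (always succeeds here).
        return True


@dataclass
class Primary:
    """Hypothetical active host running the protected VM."""
    memory: dict = field(default_factory=dict)
    dirty: set = field(default_factory=set)
    held_output: list = field(default_factory=list)

    def write(self, page: int, value: str) -> None:
        # Speculative execution: the VM keeps changing state between checkpoints.
        self.memory[page] = value
        self.dirty.add(page)

    def send(self, packet: str) -> None:
        # Outbound network traffic is held back instead of being released.
        self.held_output.append(packet)

    def checkpoint(self, backup: Backup) -> list:
        # Stage 1: pause briefly and copy the changed state into a buffer.
        staged = {page: self.memory[page] for page in self.dirty}
        pending = self.held_output
        self.dirty = set()
        self.held_output = []
        # In a real system the VM would resume here while stages 2-3 proceed.
        acked = backup.receive(staged)
        # Stage 4: release this epoch's output only once the backup has acked.
        return pending if acked else []


primary, backup = Primary(), Backup()
primary.write(1, "a")
primary.send("reply-1")
released = primary.checkpoint(backup)
```

The point of the sketch is the ordering: `reply-1` stays in `held_output` until the backup holds page 1, so a client never sees output that depends on state the backup lacks.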

Confusion
The active host runs on top of a virtual machine. How can the active host execute speculatively during checkpointing?

Posted by: Choungki Song | April 20, 2016 10:57 PM
