
Rx: Treating Bugs As Allergies— A Safe Method to Survive Software Failures

Rx: Treating Bugs As Allergies— A Safe Method to Survive Software Failures. Feng Qin, Joseph Tucek, Jagadeesan Sundaresan and Yuanyuan Zhou. SOSP 2005

Reviews due Tuesday 4/11

Comments

Summary:
This paper presents Rx, a checkpointing-based, client-oblivious system-recovery technique that survives software failures primarily by changing the execution environment during re-execution. Rx can recover from deterministic as well as non-deterministic bugs.

Problem:
Software defects account for roughly 40% of system failures and are thus a major threat to system availability. Pre-existing techniques for surviving software failures, such as whole-program restart, micro-rebooting of partial systems, software rejuvenation, progressive retry, and application-specific recovery mechanisms, were limited: they were not well suited to software failures, required restructuring legacy software, could not survive deterministic bugs, and were unsafe for correctness-critical systems. They also gave developers insufficient feedback for debugging. Rx is designed to overcome these issues.

Contributions:
The main idea of Rx is to roll a program back to a recent checkpoint when a bug is detected. Rx then explores rule-based, correctness-preserving modifications to the execution environment, such as memory-management changes (e.g., delaying the recycling of freed buffers), timing changes (e.g., reordering events during re-execution), and user-request changes (e.g., dropping malicious requests).

The Rx recovery system consists of the following components:
1. Sensors 2. Checkpoint-and-Rollback (CR) component 3. Environment Wrappers 4. Control Unit 5. Proxy

1. The sensors dynamically monitor the application's execution for failures and notify the control unit with information that helps identify the bug.

2. The checkpoint-and-rollback (CR) component saves checkpoints and transparently rolls the application back to a previous checkpoint.

3. The environment wrappers, implemented at user level, perform environment changes during re-execution to avert failures. These include delaying memory frees, padding buffers, allocation isolation, and changing process priorities. The message wrapper is implemented in the proxy; other techniques include delaying and dropping user requests.

4. The control unit coordinates all components. It directs the CR module to checkpoint the server periodically and to roll the server back on failure. It maintains a failure table that captures recovery experience for future reference and helps diagnose an occurring failure from its symptoms. It also gives programmers useful crash-related information for bug analysis.

5. The proxy makes server-side failure and recovery oblivious to clients. It replays to the server those requests received since the checkpoint to which the server is rolled back.
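The environment wrappers described above can be pictured with a small sketch (Python, purely illustrative; the class and constant names are ours, and real Rx interposes on the C allocator rather than using a Python object): every allocation is padded and zero-filled so that a small overflow during re-execution is absorbed harmlessly.

```python
# Illustrative sketch of Rx-style memory-wrapper ideas (names are ours,
# not from the paper): pad each allocation and zero-fill it, so a small
# off-by-one overflow during re-execution lands in the pad instead of
# corrupting a neighboring object.

PAD = 16  # extra slack bytes appended to every allocation

class PaddedHeap:
    def __init__(self):
        self.blocks = {}       # handle -> bytearray(size + PAD)
        self.next_handle = 0

    def alloc(self, size):
        handle = self.next_handle
        self.next_handle += 1
        self.blocks[handle] = bytearray(size + PAD)  # zero-filled by default
        return handle

    def write(self, handle, offset, data):
        buf = self.blocks[handle]
        if offset + len(data) > len(buf):
            raise MemoryError("overflow beyond pad: would corrupt the heap")
        buf[offset:offset + len(data)] = data  # writes into the pad are absorbed

heap = PaddedHeap()
h = heap.alloc(8)
heap.write(h, 0, b"123456789")  # 9 bytes into an 8-byte block: absorbed by pad
```

An overflow larger than the pad still faults, which is why padding only masks small overruns rather than "fixing" the bug.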


The authors also discuss mechanisms for rolling back hierarchical (multi-tier) servers, although this is not implemented in Rx. They also discuss issues with multi-threaded processes, for example that to avoid performance implications on the application's normal I/O calls, the checkpointing interval cannot be set too low.

Evaluation:
The authors implemented Rx on Linux and use it to evaluate four server applications containing a variety of bugs, including two synthetically introduced ones, covering buffer overflow, data races, dangling-pointer reads, etc. They evaluate the key aspects of Rx in four sets of experiments, comparing their approach with whole-program restart and with simple rollback and re-execution without environment changes, in terms of recovery time, throughput, and average response time. Rx successfully survives various common types of bugs, including deterministic ones, and thereby provides highly available service despite software defects. In most cases, applying environmental changes is the key reason for Rx's success. The authors also analyze the time overhead (frequency of checkpointing) and space overhead (per checkpoint); even with a 200 ms checkpointing interval, performance remains good. In all experiments Rx has a very low recovery time and can therefore cope with high bug-arrival rates.

Confusion:
The authors discuss extending Rx to support multi-server hierarchical rollback as future work. My question is how feasible it is to extend such an approach to the arbitrary server hierarchies used by practical applications.

1. Summary
This paper describes Rx, a tool to recover from bugs in production Linux servers. Rx attempts to rollback the program to a recent checkpoint and re-execute in a modified environment, reporting information about the environment to the programmer if the changes avoid the bug.

2. Problem
Programmers often introduce bugs into code, maliciously or otherwise. Automated test systems during development catch some, but not all, of these bugs. In production systems, software bugs cause many system failures. In certain cases, applications want to recover automatically from such bugs and continue running, and report information about the bug to developers so that they can fix it later.

3. Contributions
Rx periodically writes information about a running application to a checkpoint in memory. Rx contains sensors that detect certain types of software errors, such as assertion failures, buffer overflows, etc. When such an error occurs, the program is rolled back to the previous checkpoint and re-executed in a modified environment. Possible changes to the environment include padding allocated memory blocks and zero-filling new allocations to avoid some buffer overflow errors, changing thread priorities to reduce the likelihood of data races, etc. Rx applies these changes in a fixed order, based on rules created by the authors. If no change to the environment prevents the bug from crashing the program, Rx returns to an earlier checkpoint and tries again. This process continues until a time threshold chosen by the authors, after which the program simply restarts. Rx uses a proxy between the client and the server to buffer new requests during recovery, and to keep track of which old requests fall between different checkpoints. Rx can store information about previous bugs so that, if a bug recurs, it tries the most likely environmental change first.
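The retry loop described above can be sketched roughly as follows (the function names and the concrete change list are our own illustration, not the paper's API):

```python
import time

# Sketch of Rx's recovery loop (our own simplification): roll back to a
# checkpoint and re-execute under successively more drastic environment
# changes, falling back to a whole-program restart after a time budget.

def recover(execute, rollback, changes, budget_s=10.0):
    """execute(change) -> True if re-execution survives; rollback() restores
    the checkpoint. `changes` is ordered cheapest-first (e.g. zero-fill,
    pad buffers, shuffle scheduling, drop requests)."""
    deadline = time.monotonic() + budget_s
    for change in [None] + list(changes):    # first retry with no change at all
        rollback()
        if execute(change):
            return change                    # survived: report what worked
        if time.monotonic() > deadline:
            break
    return "restart"                         # give up: whole-program restart

# Toy failure that only a "pad-buffers" change avoids:
log = []
survived = recover(
    execute=lambda ch: ch == "pad-buffers",
    rollback=lambda: log.append("rollback"),
    changes=["zero-fill", "pad-buffers", "drop-requests"],
)
```

The returned change is exactly the diagnostic feedback the reviewers mention: Rx reports which environment modification made the failure disappear.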

The authors make many unstated assumptions. They assume that when a bug occurs, the code should be re-executed, rather than aborted completely. They also assume that a bug always causes a crash immediately, when in reality, certain bugs might set a memory location to the wrong value and not crash, or introduce memory corruption that causes a crash much later.

4. Evaluation
The evaluation section is fairly thorough. It tests six bugs in real systems, four of which were introduced by the original programmers. The authors show that the Squid web server experiences roughly 5 seconds of downtime if it restarts after a bug, but under Rx its performance is similar to the bug-free baseline, with only a minor spike in response time after the failure. If requests repeatedly trigger a bug, possibly due to a malicious attack, the restart-based approach degrades, while Rx maintains throughput and response time close to the baseline.

5. Confusion
This paper was cited many times, but I personally would not use it in any production system today. What ideas from the paper are still applicable to modern systems?

1. summary
The paper proposes a new technique, Rx, for recovering programs from bugs by rolling the program back to a recent checkpoint upon a software failure and changing the environment in which it re-executes. Environment changes include padding buffers, zero-filling buffers, delaying deallocation, changing scheduling quanta, and dropping user requests. From the client's view, the failure is invisible if the system recovers; the client observes only some extra latency.

2. Problem
Among existing techniques for handling software failures, rebooting cannot deal with deterministic failures and causes unavailability, and plain checkpoint-and-recovery also fails on deterministic bugs. The paper therefore tries to change the environment so that (with luck) the bug does not manifest. Multi-tier systems in particular are very hard to debug, and will still be hard for this technique.

3. Contributions
a) Sensors: detect failures at runtime, including software errors and software bugs.
b) Checkpoint and Rollback: checkpoints application and system state, reducing space overhead by bounding recovery time.
c) Environment wrappers: change the environment during re-execution while preserving correctness. The memory wrapper addresses dangling pointers, double frees, buffer overflows, and memory corruption through delayed frees, buffer padding, allocation isolation, and zero-filling. The message wrapper shuffles the order of requests, the scheduling wrapper changes process priorities, and altered signal delivery reduces concurrency bugs.
d) Proxy: makes recovery invisible to the client. In recovery mode, it buffers new requests arriving during re-execution and drops responses for requests already answered.
e) Control unit: diagnoses failures and decides which changes to apply based on symptoms and on experience learned from the failure table.
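The "delaying free" idea from the memory wrapper above can be sketched in miniature (Python, names ours; real Rx does this inside a malloc wrapper): freed blocks linger in a quarantine so a dangling read during re-execution still finds the old contents.

```python
from collections import deque

# Sketch (names ours) of the "delay free" memory-wrapper idea: a freed
# block sits in a quarantine until several later frees have happened,
# so a dangling-pointer read during re-execution still sees the old
# contents instead of reused memory.

class DelayedFreeAllocator:
    def __init__(self, quarantine_len=3):
        self.live = {}                 # handle -> bytearray
        self.quarantine = deque()      # handles awaiting real recycling
        self.quarantine_len = quarantine_len
        self.recycled = []             # handles actually released
        self.next_handle = 0

    def alloc(self, size):
        h = self.next_handle
        self.next_handle += 1
        self.live[h] = bytearray(size)
        return h

    def free(self, h):
        self.quarantine.append(h)      # don't recycle immediately
        if len(self.quarantine) > self.quarantine_len:
            victim = self.quarantine.popleft()
            del self.live[victim]
            self.recycled.append(victim)

    def read(self, h):
        return bytes(self.live[h])     # dangling reads survive while quarantined

a = DelayedFreeAllocator()
h = a.alloc(4)
a.live[h][:] = b"data"
a.free(h)
stale = a.read(h)                      # dangling read: block is only quarantined
for _ in range(3):
    a.free(a.alloc(1))                 # enough later frees to push h out
```

As with padding, this masks the symptom during re-execution; the underlying dangling-pointer bug is still reported to developers.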

4. Evaluation
They compared Rx with alternative approaches for correctness, and measured the throughput and average response time of Squid under Rx versus restart for one bug occurrence. They also compare MySQL under Rx against the application baseline in terms of performance.
a) They measure replay, but what is the frequency of replaying?
b) Why didn't they compare against failure-oblivious computing?

5. Confusion
I wonder about the accuracy of the sensors.

Summary:
The paper presents a novel technique, called Rx, to recover from many types of software bugs with the aim of improving system reliability. On encountering a software bug, Rx rolls the program back to an earlier (not necessarily the latest) checkpoint and re-executes it in a modified environment, with no or minimal invasion into the user program.

Problem:
Availability is one of the most important properties of a system. Unavailability can cost user trust, apart from millions of dollars an hour for business-critical systems. The paper claims that a large fraction (40%) of these failures are due to software bugs. Earlier work on managing system failures had its own shortcomings and falls into four classes: 1) partial/whole-system rebooting, which is better suited to hardware issues and unlikely to resolve software bugs; 2) general checkpoint and recovery, which shares the limitations of rebooting; 3) application-specific recovery mechanisms (e.g., the Apache HTTP server killing a failed process and spawning a new one), which cannot resolve deterministic bugs such as those triggered by malicious requests; and 4) failure-oblivious computing and reactive immune systems, which provide a "speculative" fix and cannot be used for correctness-critical applications such as online banking.

Contributions:
1. Comprehensive – Rx can recover from a variety (if not all) of bugs both deterministic and non-deterministic.
2. Safe – Unlike failure oblivious computing and reactive immune systems, Rx does not try to “speculatively” fix the issue. Rather, it modifies the execution environment to prevent bugs from manifesting themselves.
3. (Almost) Non-invasive: requires no or minimal changes to the user application, and can be applied to legacy software as well.
4. Efficient and Informative: significantly reduces downtime with reasonably good performance. Also, it does not hide bugs and provides additional diagnostic information.
5. Uses varieties of modified environments such as memory management based (for memory issues), Time based (for issues such as deadlocks), User Request based (for issues such as malicious requests).
6. Sensors: detect failures by dynamically monitoring the application's execution. Two kinds are caught: 1) software failures such as assertion failures and divide-by-zero errors, and 2) software defects such as buffer overflows and accesses to freed memory. On a failure, the sensors pass diagnostic information to the control unit.
7. Checkpoint and Rollback (CR): takes checkpoints of the target server application periodically, and automatically and transparently rolls the program back to an earlier checkpoint on a software failure. Using fork(), CR copies program memory in copy-on-write fashion to keep overhead low. Multiple checkpoints are supported, and rollback to any of them is allowed.
8. Environment Wrappers: Rx uses environmental wrappers, such as the memory wrapper, message wrapper, process scheduling, signal delivery, and dropping of user requests, to apply environmental changes while re-executing the program to recover from the failure.
9. Proxy: supports re-execution on a failed server and makes server-side errors and recovery oblivious to the client. It runs as a standalone process to avoid being corrupted by the target server's defects.
10. Control Unit: coordinates all other components of Rx. It 1) asks CR for periodic checkpoints and for rollback on failure, 2) uses the diagnostic information of the failure and its own experience to determine a course of action such as environmental changes, and 3) provides programmers an informative postmortem bug-analysis report.
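A toy stand-in for the CR component's interface (this sketch deep-copies an in-memory state dict; real Rx forks the process so the snapshot shares pages copy-on-write, which this does not attempt to reproduce):

```python
import copy

# Miniature stand-in for Rx's checkpoint-and-rollback (CR) component.
# Real Rx uses fork() so snapshots share pages copy-on-write; here we
# deep-copy an in-memory state dict just to show the interface.

class CheckpointManager:
    def __init__(self, state, max_checkpoints=4):
        self.state = state
        self.snapshots = []                  # oldest .. newest
        self.max = max_checkpoints

    def checkpoint(self):
        self.snapshots.append(copy.deepcopy(self.state))
        if len(self.snapshots) > self.max:   # retire the oldest snapshot
            self.snapshots.pop(0)

    def rollback(self, n_back=1):
        """Restore the n_back-th most recent checkpoint (1 = latest)."""
        snap = self.snapshots[-n_back]
        self.state.clear()
        self.state.update(copy.deepcopy(snap))

state = {"requests_served": 0}
cr = CheckpointManager(state)
cr.checkpoint()                  # snapshot at 0 requests served
state["requests_served"] = 7     # ... server runs on, then hits a bug ...
cr.rollback()                    # back to the snapshot
```

Keeping several snapshots is what lets Rx roll further back when re-execution from the latest checkpoint still fails.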

Evaluation:
The paper is an easy read, albeit occasionally verbose. The authors design four experiments to evaluate key aspects of Rx: 1) functionality under common software failures, 2) performance overhead in server throughput and average response time, 3) behavior under malicious attacks, and 4) the benefit of learning from heuristics stored in the failure table. The results show that Rx successfully recovered from most of the bugs and was broader in scope and more performant than rebooting and the other alternatives. The authors also note limitations of Rx, such as its inability to recover from semantic bugs and resource-leak issues.

Confusion:
What techniques are used in sensors to detect various errors? Also, how prevalent is Rx (or its successor, if any) currently?

Rx
1. Summary
Services needing high availability require extensive testing and robust implementation, and failures in production require immediate, decisive action to minimise downtime. This paper presents Rx, an automated system that bypasses detected failures by modifying the execution environment, with low recovery overhead.
2. Problem
The typical response to a failing service in production is to reboot the application, but a reboot can only tackle non-deterministic bugs, and manual intervention is costly due to high downtime. The authors propose Rx, which restores the application from a checkpoint and replays events under a modified execution environment. They discuss the challenges in periodic checkpointing, environment modification, and recovery.
3. Contributions
Rx technique can survive more software defects than traditional reboot. The techniques employed by Rx are safe and do not cause any side effects. Rx does not require any modification to existing source code. Rx offers faster recovery from checkpoint as compared to reboot. Also, Rx records more information in case of failures to ease root cause analysis.
Rx uses sensors to monitor for software failures such as access violations, exceptions, and segmentation faults. The sensors notify the control unit about detected failures, and the control unit acts on them: it restores the last suitable checkpoint, and the events after that checkpoint are replayed under a modified execution environment. The modifications are of three main types (memory management, timing, and user requests) and span kernel and user space. They remain active only until the system gets past the point of failure.
Rx places a proxy between clients and the server. The proxy is central during recovery: to hide failures from clients, it buffers incoming client requests, and it also tracks responses generated during recovery mode to ensure exactly-once semantics for clients.
The control unit maintains a failure table mapping failure signatures to the solutions that worked in the past. This information helps determine the required modifications during the recovery phase. The control unit is also responsible for taking periodic checkpoints.
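The failure table can be sketched as a small signature-to-remedy map (field and method names are ours, not the paper's):

```python
from collections import defaultdict

# Sketch of the control unit's failure table (names ours): map a failure
# signature to the environment changes that worked before, so a recurring
# bug is attacked with the most successful change first.

class FailureTable:
    def __init__(self):
        self.successes = defaultdict(lambda: defaultdict(int))

    def record(self, signature, change):
        """Remember that `change` cured a failure with this signature."""
        self.successes[signature][change] += 1

    def suggest(self, signature, default_order):
        """Order candidate changes by past success for this signature,
        falling back to the default cheapest-first order (stable sort)."""
        seen = self.successes.get(signature, {})
        return sorted(default_order, key=lambda c: -seen.get(c, 0))

table = FailureTable()
table.record(("SIGSEGV", "0x4006f0"), "pad-buffers")
table.record(("SIGSEGV", "0x4006f0"), "pad-buffers")
plan = table.suggest(("SIGSEGV", "0x4006f0"),
                     ["zero-fill", "pad-buffers", "drop-requests"])
```

With two recorded successes, "pad-buffers" is promoted to the front of the plan, which is exactly the learning effect the evaluation measures.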
4. Evaluation
The authors performed an extensive evaluation of Rx on four servers: MySQL, Squid, Apache, and CVS. The measurements include the overhead of periodic checkpointing, recovery overhead, effectiveness against failures, and the effect of the failure table. For three of the four servers, Rx has minimal negative impact on throughput and response time, and recovery overhead is small for each server. The space overhead of a checkpoint is on the order of kilobytes and taking one costs on the order of milliseconds; these are acceptable overheads for production-grade systems requiring high availability.
5. Confusion
How does Rx tackle non-idempotent applications? Such applications cannot allow re-execution of code.


Summary

The paper discusses a new approach for recovering from software failures both deterministic and non-deterministic.

Problem

System availability is very crucial in online services, as downtime directly causes revenue loss, and software crashes are a major contributor to service unavailability. Existing methods for software crash recovery include system reboot, checkpointing and recovery, application-specific recovery methods, and failure-oblivious computing. Since many of these techniques were originally designed to handle hardware failures, most are ill-suited to surviving software failures. For example, they cannot deal with deterministic software bugs, a major cause of software failures, because such bugs recur even after rebooting. Another important limitation, of restarting the service for example, is that the service is unavailable while restarting, which can take up to several seconds.

Contribution

The key idea introduced by Rx is changing the environment during crash recovery. The environment here includes everything from the hardware (processor architecture, devices) to the OS kernel (scheduling, virtual memory management, device drivers) and user-level libraries.

Rx design is made of five components: 1. Sensors for detecting and identifying software failures or software defects at run time, 2. A Checkpoint-and-Rollback (CR) component for taking checkpoints of the target server application and rolling back the application to a previous checkpoint upon failure, 3. environment wrappers for changing execution environments during re-execution, 4. a proxy for making server recovery process transparent to clients, and 5. a control unit for maintaining checkpoints during normal execution, and devising a recovery strategy once software failures are reported by sensors.

Evaluation

Rx, implemented on Linux 2.6.10, was evaluated on four real-world server applications: a web server (Apache), a web cache/proxy server (Squid), a database server (MySQL), and a concurrent version control server (CVS). Mainly, the following aspects of Rx were evaluated:
1. Functionality when faced with a variety of bugs.
2. Performance overhead of Rx for both server throughput and average response time when no bug occurs.
3. Behavior under a certain degree of malicious attack that continuously sends bug-exposing requests triggering buffer overflows or other software defects.
4. Benefits of Rx's mechanism of learning from previous failure experiences.

It is observed that Rx outperformed all the existing alternative recovery mechanisms in terms of functionality and performance (except in the case of CVS).

Confusion

I don’t understand how Rx is different from speculative recovery, except that it provides better reporting. My point is that Rx can have correctness issues as well.

Summary
This paper discusses a checkpoint-rollback technique that provides a safe way to handle software failures. Rx recovers quickly from failures by re-running the failed program from a checkpoint in a modified environment.

Problem
Availability of software services matters more than ever, and failures carry a business cost. To be reliable, software would have to be failure-proof, which is practically impossible since sophisticated software cannot be exhaustively tested. To survive software failures, many approaches have been taken: flavours of rebooting, checkpointing and recovery, application-specific recovery, and speculative fixes. All of these have drawbacks of their own, and none provides a safe and (almost) non-invasive recovery solution. Rx does, via a version of rollback and recovery in which the failed application is re-run in a modified environment.

Contribution
The paper's main contribution is a technique for surviving software failures that is i) safe, ii) (almost) non-invasive, iii) quick to recover, and iv) informative through diagnostic feedback. The cardinal technique is to dynamically change the execution environment depending on the failure symptoms. There are five components of Rx:
1. Sensors: detect software failures (assertions, exceptions, access violations, etc.) by monitoring the application's execution. Sensors also provide failure information to the control unit for postmortem bug diagnosis.
2. Checkpoint and Rollback: Rx takes snapshots of the application at regular intervals to create checkpoints, using copy-on-write to avoid memory and time overheads. When a failure occurs, execution is replayed from these checkpoints.
3. Environment Wrappers: change the execution environment to avert failures. For each wrapper, or combination of wrappers, the application is re-run to check whether the failure recurs. Wrappers are ordered by ascending overhead. Rx uses memory wrappers, message wrappers, process scheduling, signal delivery, and dropping of user requests.
4. Proxy: in server settings, clients are interfaced through a proxy so that failure recovery at the server is transparent to them.
5. Control Unit: glues the other components together. It directs CR to checkpoint periodically and to roll back on failures; diagnoses a failure from its symptoms and accumulates these "experiences"; provides diagnostic information; and maintains a failure table that gradually "learns" symptoms for efficient future recovery.
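Item 1's sensors can be imagined as a thin observation layer. In this sketch (our own simplification), the "sensor" classifies raised Python exceptions into failure signatures for the control unit, whereas real Rx catches OS signals such as SIGSEGV and reports from dynamic checkers:

```python
# Hedged sketch of a sensor (our simplification): run one application
# step under observation and turn a raised error into a failure
# signature. Real Rx sensors intercept OS-level signals and dynamic
# checker reports, not Python exceptions.

def sensor(step):
    try:
        step()
        return None                          # no failure observed
    except ZeroDivisionError:
        return ("divide-by-zero",)
    except AssertionError as e:
        return ("assertion-failure", str(e))
    except MemoryError:
        return ("memory-error",)

sig = sensor(lambda: 1 / 0)                  # classified failure signature
```

The signature is what the control unit would match against the failure table when choosing an environment change.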

Evaluation
Rx has been implemented on Linux and tested with four server applications. The authors show that Rx can recover from different types of bugs, such as buffer overflow, data races, and double free, where the alternatives cannot. They also show that Rx outperforms a traditional application restart in throughput and average response time on Squid and MySQL.


Question & Comments
Consider this case: the application makes a syscall to write to a register of some hardware. After this, a checkpoint is created. Later, the hardware raises an interrupt, the application is delivered the event, and the application fails while handling it. How does Rx handle this case? Replaying from the checkpoint will not reproduce the issue, since we do not write to the hardware again.

It is funny to relate how this paper pulled a Dr. Strange in its approach :D Like Dr. Strange using the Time Stone to lock Dormammu in a time loop and defeat him, Rx uses a dynamically changed execution environment to re-run the software till it confesses :P


1. Summary
Rx is a program which tries to cure bugs and software failures like allergies. It rolls back the program to a recent checkpoint, and re-executes the program in an environment with some modifications made to prevent the allergies i.e. the software bugs.

2. Problem
High availability is crucial for most applications. Software defects, or bugs, are a main cause of system failures that take a system down. To provide high availability, one must dodge these software bugs (which may have seeped in even after rigorous testing) as much as possible. Prior work on surviving software failures falls mainly into rebooting techniques, checkpointing and recovery, application-specific recovery, and failure-oblivious computing and reactive immune systems. Though these may seem attractive for certain applications, they have many shortcomings, such as the inability to deal with deterministic bugs, service unavailability, etc.

3. Contributions
The core idea of Rx is to roll the program back to a recent checkpoint and re-execute it in a modified environment chosen from the diagnosed failure symptoms. Rx learns from experience and tries to prevent software bugs it has seen before. The environment changes are not restricted to the application; they can be applied anywhere from the hardware level up, wherever prevention seems necessary. To ensure the environment isn't changed arbitrarily, Rx follows two principles: 1. changes must be correctness-preserving; 2. a useful environmental change should potentially avoid software bugs.
First, to diagnose the problem, Rx has units called sensors, which detect software failures by dynamically monitoring execution. Rx has an efficient checkpoint-and-rollback mechanism that bounds memory overhead by retiring checkpoints under a 2-competitive rule, similar to a working set. The heart of Rx is the set of environment wrappers: like different medicines or preventive measures for different allergies, Rx provides different remedies for different kinds of "allergies", namely the memory wrapper, message wrapper, process scheduling changes, signal delivery changes, and dropping of user requests.
Rx also introduces a proxy between server and client so that the client remains oblivious to the recovery going on at the server. The proxy handles how server and client communicate and how requests and responses are served as efficiently as possible even in recovery mode. The master of all the units is the control unit: it directs checkpoint creation, diagnoses an occurring failure, tries to cure it using accumulated experience (a failure table), and provides a post-mortem report.
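The proxy's bookkeeping can be sketched in a few lines (class and method names are ours; real Rx works at the network-message level): old requests are re-fed to the rolled-back server, already-answered responses are suppressed, and requests arriving mid-recovery are buffered.

```python
# Miniature sketch (names ours) of the Rx proxy's role during recovery:
# replay requests received since the restored checkpoint, suppress
# responses the client has already received, and buffer requests that
# arrive mid-recovery, so each client sees exactly-once behavior.

class Proxy:
    def __init__(self, server):
        self.server = server
        self.since_checkpoint = []   # requests seen after the last checkpoint
        self.recovery_buffer = []    # new requests arriving during recovery
        self.responded = set()       # request ids already answered
        self.recovering = False

    def handle(self, req_id, payload):
        if self.recovering:
            self.recovery_buffer.append((req_id, payload))
            return None                        # client just sees latency
        self.since_checkpoint.append((req_id, payload))
        resp = self.server(payload)
        self.responded.add(req_id)
        return resp

    def recover(self):
        """Server rolled back: re-feed old requests, suppressing duplicates."""
        self.recovering = True
        replayed = []
        for req_id, payload in self.since_checkpoint:
            resp = self.server(payload)        # re-execute since checkpoint
            if req_id not in self.responded:   # only unanswered ones go out
                replayed.append((req_id, resp))
                self.responded.add(req_id)
        self.recovering = False
        pending, self.recovery_buffer = self.recovery_buffer, []
        return replayed + [(r, self.handle(r, p)) for r, p in pending]

proxy = Proxy(server=lambda p: p.upper())
first = proxy.handle(1, "hi")      # answered normally before the failure
out = proxy.recover()              # request 1 re-executed; response suppressed
```

Because request 1 was already answered, the replay produces no duplicate response, which is the exactly-once property the reviews describe.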

4. Evaluation
The authors implement the Rx system on Linux kernel 2.6.10 and evaluate it on four server applications, with four real bugs and two injected bugs of different types (buffer overflow, double free, stack overflow, data race, etc.). Rx is evaluated along four dimensions: functionality, performance overhead, behaviour under malicious attack, and the benefit of learning from the past. Rx is compared with alternative approaches for different kinds of bugs and servers. The authors show that Rx provides transparent and fast recovery in all cases but one (CVS), and present statistics demonstrating improvements in throughput, response time, and recovery time when using Rx.

5. Confusion
1. Can you please shed more light on checkpoint maintenance? How does exponential landmark checkpointing work?
2. Is performing binary search on dropping user requests really the best approach? Though this might help in finding out malicious request, doesn’t it severely affect performance? Can you please discuss this?

1. Summary
The paper talks about Rx which is a safe and non-invasive method to survive software failures caused by common software defects. This is done by re-executing the buggy region of the program in a modified environment. It can handle both deterministic and non-deterministic bugs.

2. Problem
High availability is often crucial for business and productivity. Previous methods of surviving software failures typically involved various forms of rebooting, general checkpointing and recovery, or sometimes application-specific recovery. However, these methods are often unsafe and invasive and can lead to unintended behavior. There was thus a need for better recovery methods.

3. Contributions
The design of Rx is a novel contribution that includes both user-level and kernel-level components to monitor and control the execution environment. Software failures are detected by sensors that dynamically monitor the application's execution. To avert failures, environment wrappers change the environment during re-execution; these include the memory wrapper (user-level), message wrapper (in the proxy), and scheduling wrapper (kernel-level). The proxy helps a failed server re-execute: on server failure, it replays all messages received since the checkpoint, along with other message-based environment changes. Finally, the control unit coordinates the different components of Rx; it provides developers failure-related information and diagnoses an occurring failure based on its symptoms.
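A timing-based change like the message-wrapper reordering mentioned above can be sketched as a seeded shuffle of the replay order (function name and seeding are our illustration):

```python
import random

# Sketch of a timing-based environment change (our illustration): during
# re-execution, shuffle the order in which buffered requests are replayed,
# so an interleaving that triggered a race is likely perturbed. A seed
# keeps the sketch deterministic for testing.

def replay_shuffled(requests, process, seed=None):
    reordered = list(requests)
    random.Random(seed).shuffle(reordered)   # deterministic given a seed
    return [process(r) for r in reordered]

out = replay_shuffled(["a", "b", "c"], process=str.upper, seed=42)
```

Since only the order changes, every request is still processed exactly once, which is what keeps this change correctness-preserving for order-independent requests.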

4. Evaluation
The authors implemented Rx on Linux and tested it with four server applications. Rx survived all the different bugs in these applications and provided user-transparent recovery 21-53 times faster than the whole-program-restart approach. The overhead introduced by Rx is also negligible.

5. Confusion
I am not sure if I understood the whole of section 3.4 on Proxy.

1. Summary
Qin et al. present Rx, a system that facilitates recovery from failures triggered by specifics of the execution environment. They describe such a trigger as an allergen: something in the environment that provokes the failure. Using some basic techniques, they are able to restore the system without requiring failover. While software defects are not hidden, Rx is able to make progress past bugs that are tricky to debug while providing a continuous user experience.

2. Problem
Developing enterprise code is difficult, and even with rigorous testing through automation and design reviews, bugs still slip through into production. How can we reduce the downtime caused by these errors as well as improve the time to fix them?

3. Contribution
Rx contains five components: sensors, checkpoint and rollback (CR), environment wrappers, a proxy, and the control unit. The system operates in either normal or recovery mode and uses previously developed techniques to do checkpointing of the system. For the sensors, they do basic monitoring just by interposing themselves on OS-raised exceptions. Checkpoints are cleverly maintained using a notion of recovery cost and therefore only keeping a history that is within some threshold. Environment wrappers allow for the changes to the execution environment and enabling some simple safety measures such as zeroing and/or padding buffers. The proxy is a basic middleman component that allows for gracefully maintaining connections through retries.

The control unit is the brains of the operation performing three functions. First, it manages checkpointing. Second, it diagnoses failures based on the symptoms and learns from past experiences to decide what environment change should be applied. Third, it provides programmers failure related feedback for postmortem analysis. I see parallels in this work and recent work by Avrilia Floratou with Dhalion and how she is developing an automated recovery system for Heron.

4. Evaluation
Evaluation was basic. They presented 6 cases, 4 of which were production bugs. The non-determinism introduced by Rx is an asset, as it can rerun the same code while perturbing the execution order. Overall Rx demonstrates very good performance, incurring only a small overhead. I think the OSDI ’14 paper by Yuan et al. of the University of Toronto is a nice extension (in failure exploration), as they actively sampled production bugs; Remzi also has work analyzing the implications of file system bugs for crash consistency (offline work).

5. Discussion
It makes sense to focus on this class of bugs (non-deterministic). I think it is somewhat accepted that deterministic bugs can be vetted through code review and QA testing. While the authors point out that this handles bugs in several scenarios, I’m curious whether our processes and coding standards have since become more resilient to these types of bugs.

Summary:
This paper proposes a safe technique called Rx to recover from software failures and thereby improve system availability. It is inspired by the real-life method of treating allergies, and is thus based on the idea that bugs are correlated with the execution environment. Rx maintains checkpoints and, upon failure, rolls back to a recent checkpoint and re-executes the program in a modified environment. It also provides diagnostic information for postmortem analysis.
Problem:
Software failures are the bottleneck to achieving the high system availability required by many current applications like e-commerce; server downtime can lead to huge losses. Previously proposed solutions like whole-program restart and general checkpointing and recovery do not work well for deterministic failures. Application-specific recovery mechanisms increase programming difficulty. Others, like failure-oblivious computing and the reactive immune system, can be unsafe for correctness-critical applications.
Contributions:
> The main idea of Rx to handle software failures (both deterministic and nondeterministic) is to periodically take lightweight checkpoints and, on failure, roll back to one of the checkpoints and re-execute the program with a changed execution environment.
> Rx sensors detect software failures dynamically and notify the control unit along with information about the bug.
> Environmental wrappers perform environmental changes during re-execution. These changes consist of memory-management-based changes (delaying frees, padding buffers to avoid overflow, etc.), timing-based changes (e.g. changing the process scheduling time quantum), and message-related changes for concurrency bugs (reordering requests from different connections, varying message sizes, etc.).
> The least intrusive change is applied first, and the most extreme (like dropping requests) is deferred until later.
> The proxy helps make server failure transparent to the clients. It buffers requests for replaying and also makes sure a response is not sent twice.
> The control unit maintains a failure table with a score vector of environment changes for each failure, so it can make better decisions based on experience.
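The failure table with score vectors can be sketched as follows. The change names, the signature hash, and the +1/-1 scoring rule are illustrative assumptions, not details from the paper; the point is only that past recovery outcomes bias which environment change is tried first.

```c
#include <string.h>

/* Hypothetical set of environment changes the control unit can apply. */
enum { PAD_BUFFERS, DELAY_FREE, ZERO_FILL, RESCHEDULE, DROP_REQUEST, N_CHANGES };

struct failure_entry {
    unsigned long sig;      /* hash of the failure's symptoms (PC, signal, ...) */
    int score[N_CHANGES];   /* successes minus failures for each change */
};

/* Pick the change with the best past record for this failure signature. */
int best_change(const struct failure_entry *e) {
    int best = 0;
    for (int i = 1; i < N_CHANGES; i++)
        if (e->score[i] > e->score[best]) best = i;
    return best;
}

/* After a recovery attempt, reward or penalize the change that was used. */
void record_outcome(struct failure_entry *e, int change, int succeeded) {
    e->score[change] += succeeded ? 1 : -1;
}
```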
Evaluation:
Rx is implemented on Linux and evaluated using four real-world server applications: Apache httpd, Squid, MySQL and CVS. These applications had four real bugs, and two more bugs were injected for testing. Rx could successfully cope with all six failures, in contrast to other recovery techniques like whole-program restart. Rx is also 21-53 times faster.
Confusion:
The paper says that the proxy will buffer the response during re-execution temporarily, but if the buffer is full it sends the response to the corresponding client. How is correctness ensured in this situation? (There is a possibility that re-execution is not successful)

1. Summary:
The paper introduces Rx, a system for improving software availability by providing automated crash detection and recovery. The authors motivate the increasing importance of software availability and hence the relevance of their system. The proposed solution takes inspiration from existing techniques such as checkpoint and restore, and provides wrapper methods for managing memory. Finally, the most important aspect of the solution is that it requires little to no modification of the program under scrutiny.

2. Problem:
Detecting and recovering from crashes due to buggy and/or malicious code.
This is a big problem, as the authors motivate, due to the huge economic losses for web services dependent on server availability. The problem only intensifies as the world moves towards cloud services for its day-to-day computing needs (as is the case today); it becomes essential that these services have high availability. The work tries to identify opportunities and provide solutions to make this possible.

3. Contributions:
The work takes known defensive programming techniques and uses them to develop an automated bug detection and recovery system. The essential contribution, in my opinion, is to show that we can apply these mechanisms external to the program source. Hence providing foundations for a non-invasive method to prevent crashes in buggy/malicious code, with strategies to recover from them as well.
They also show that introducing non-deterministic changes or having a non-deterministic replay from a checkpoint can help avoid certain kinds of bugs.
The authors discuss in detail possible recovery methods with their pros and cons, providing the readers with a breadth overview of the field itself.
The authors also do a good job defining the kinds of bugs they target providing broad classifications for future work.
The Rx proxy mechanism is a good design for providing the infrastructure isolation necessary to allow checkpointing and seamless recovery/replay of server-side processes.

4. Evaluation:
The authors provide a clear explanation of their experimental design and setup. They do a good job of evaluating their system against a naive failure-recovery approach (restart), but the reader is left wondering about competing strategies (from related work) and how they compare to the proposed method. Overall the authors do a decent job of convincing the reader that their approach works. I would have liked to see a longer trace of the experiments in a real-world setting.

5. Confusions:
One pain point of the paper is the lack of supporting evidence for its bug classification; from reading it, some of the claimed features seem like truthful hyperbole. Bugs can be non-deterministic, hence I am a bit skeptical of the coverage of the tool; it would be interesting to talk about the relevance of identifying such bugs and their mitigation measures.

1. summary
This paper describes a system for recovering from certain failures and allowing a buggy server to continue running afterwards.
2. Problem
Server systems have come to require extremely high availability with any and all downtime costing considerable amounts of money. When server software fails, a full program restart is generally required to recover from the problem. Many common bugs are dependent on the environment in which the program is run and the paper points out that a change in environment can often eliminate the crashes while requiring only minor checkpoint rollbacks.
3. Contributions
The Rx system presented uses a series of detection methods and checkpointing to recover from errors, modify the execution environment, and retry the section that caused the error. Sensors are added to detect OS-raised exceptions and erroneous memory accesses. A checkpointing mechanism periodically stores program state before allowing the server program to continue execution. When the sensors detect a problem in execution, Rx returns the program state to a previous checkpoint and attempts to rerun it with a slightly modified execution environment. Environment changes include padding memory allocations, isolating allocated memory regions, and changing process scheduling order, all of which serve to avoid various types of common program errors. A proxy module is also added to track incoming server requests so that they can be replayed in the event of a rollback.
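The checkpointing mechanism can be approximated at user level with `fork()`, which snapshots the address space copy-on-write: the child sleeps and serves as the saved image, and killing it discards the checkpoint. This is only a sketch of the COW idea; the real Rx implements checkpointing inside the kernel, and an actual rollback (transferring control back to the child) is elided here.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Take a COW snapshot: the child process is the checkpoint. */
static pid_t take_checkpoint(void) {
    pid_t pid = fork();
    if (pid == 0) {        /* checkpoint child: the saved COW image */
        pause();           /* sleep until discarded (or resumed on rollback) */
        _exit(0);
    }
    return pid;            /* parent continues; pid names the checkpoint */
}

/* Drop a snapshot that is no longer within the recovery window. */
static void discard_checkpoint(pid_t cp) {
    kill(cp, SIGKILL);
    waitpid(cp, NULL, 0);  /* reap the child */
}
```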
4. Evaluation
Evaluation is done over a small selection of server applications with known, exploitable bugs. They show the Rx system is able to recover from the errors present while introducing very small overheads. I'm not sure exactly how it would be done, but a situation where the triggering bug wasn't already known would have been interesting. One of the main draws of the system seemed to be logging which change helped avoid the bug, as an aid to debugging, but that was never highlighted in the evaluation.
5. Confusion
With highly redundant server systems used today, is this even relevant anymore outside of maybe the part helping to point out what changes could be used to avoid the bug?

1. Summary
This paper introduces Rx, a method to save software from failures by combining checkpoints and execution environment modifications.
2. Problem
Previous work on surviving software failures can be classified into four categories. The first two, rebooting and checkpointing, were originally developed for hardware failures and are ill-suited to software; for example, they can’t deal with deterministic bugs. The third category is application-specific recovery mechanisms, which require software to be failure-aware or need to kill a process and retry. The fourth category is non-conventional approaches such as failure-oblivious computing, but these are unsafe for correctness-critical applications.
3.Contribution
This paper proposes Rx, a new method to survive software failures by taking checkpoints and re-executing in modified environments. Execution-environment changes used in Rx are memory-management based (e.g. buffer padding, delaying the recycling of freed buffers), timing based (e.g. increasing the scheduling time slice to avoid context switches in buggy critical sections) and user-request based (dropping potentially malicious user requests). Rx consists of five components: sensors for detecting and identifying software failures at run time; a checkpoint-and-rollback (CR) component for taking checkpoints and rolling back upon failure; environment wrappers for changing execution environments during re-execution; a proxy which makes server recovery transparent to clients; and a control unit which maintains checkpoints during normal execution. Checkpoints are lightweight and kept in memory in the common case, and the number of checkpoints is bounded by the constraint that the worst-case Rx recovery time cannot exceed twice the rebooting time. Rx uses a failure table to gather statistics about each failure and makes predictions based on them, thus improving recovery speed over time. The newest checkpoints are tried first upon failure with various environment changes, and older checkpoints are tried later if the failure is not resolved. The authors also consider implementing checkpoints across a server hierarchy and for multi-threaded applications.
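The checkpoint-retention bound mentioned above can be sketched as a pruning rule: drop the oldest checkpoints while the worst-case recovery cost (re-executing from the oldest retained checkpoint) would exceed twice the cost of a full restart. The cost model and function shape are illustrative assumptions, not the paper's code.

```c
/* age_cost[i]: estimated re-execution time from checkpoint i, oldest first.
 * Returns the index of the oldest checkpoint to keep so that the
 * worst-case recovery time stays within 2x the restart cost. */
int prune_checkpoints(const double age_cost[], int n, double restart_cost) {
    int first = 0;
    while (first < n - 1 && age_cost[first] > 2.0 * restart_cost)
        first++;    /* this checkpoint is too expensive to recover from */
    return first;
}
```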
4.Evaluation
This paper measures Rx with real-world server applications on various aspects, including functionality to survive failures, recovery time, performance overheads, performance under malicious attacks and efficiency of failure table, and demonstrates that Rx outperforms checkpointing-only approach and rebooting, and introduces negligible overheads compared with baseline (no recovery).
5.Confusion
The proxy ignores new responses for old requests during replay. But if the server generates a different response, won’t the client and server be in an inconsistent state, and will the application function correctly? Or is this not a serious problem in most cases? Also, regarding buffer padding: what if the application reads senseless data? Is this considered correct behavior in Rx?

Summary
This paper presents Rx, a safe method to recover from software bugs. Rx creates checkpoints, rolls back in case of failure and re-executes from those checkpoints in a modified environment. It can resolve both non-deterministic and deterministic bugs, requires few or no modifications to applications, provides feedback to programmers for bug diagnosis, and is much more efficient than other techniques.

Problem
There are several applications, such as online transaction processing, that need high availability. Even after intense testing, some software bugs remain in an application and cause system failures. Previous techniques have focussed on rebooting the system, which takes a lot of time and makes the system unavailable for that duration; moreover, rebooting cannot resolve deterministic bugs. Other techniques use speculative values, which can lead to program misbehaviour.

Contribution
The basic idea behind Rx is to create checkpoints, rollback the program to a recent checkpoint when a bug is detected, modify the execution environment and re-execute the program in the modified environment. The primary components of Rx are:
1. Sensors: They detect software failures and notify the control unit to take appropriate action.
2. Checkpoint and Rollback: This component takes snapshots of the system periodically and rolls back to one of the recent checkpoints in case of failure.
3. Environment Wrappers: They perform environment changes during re-execution of the program. Several types of changes can be made; for example, delayed recycling of freed buffers can avoid bugs related to double free and dangling pointers, and different scheduling can avoid data-race bugs.
4. Proxy: It makes the server recovery process transparent to clients. It can operate in two modes. In normal mode, it forwards request/response messages between client and server, buffers requests, and marks the waiting-for-sending requests for each checkpoint. In recovery mode, it rolls back to the last checkpoint, introduces message-related environmental changes, and buffers incoming messages from clients without forwarding them to the server until the server successfully survives the software failure.
5. Control Unit: It maintains checkpoints during normal execution and devises a recovery strategy once the software fails and the failure is reported by sensors.
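The delayed recycling of freed buffers mentioned in item 3 can be sketched as a quarantine ring: freed blocks sit in a fixed-size buffer before actually being released, so a dangling pointer or double free during re-execution touches memory that is still valid rather than corrupting the allocator. The ring size and names are illustrative assumptions.

```c
#include <stdlib.h>

#define QUARANTINE 64   /* illustrative quarantine depth */

static void *ring[QUARANTINE];
static int head;

/* Instead of releasing p immediately, quarantine it and release the
 * oldest quarantined block in its place. */
void delayed_free(void *p) {
    if (!p) return;
    void *evicted = ring[head];
    ring[head] = p;                 /* quarantine the newly freed block */
    head = (head + 1) % QUARANTINE;
    free(evicted);                  /* actually release the oldest one */
}
```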

Evaluation
The authors implemented Rx on Linux and evaluated it in four sets of experiments with six different software bugs: data race, buffer overflow, uninitialized read, dangling pointer, stack overflow and double free. Experimental results showed that Rx could survive all six software failures and provided 21-53 times faster recovery than the whole-program-restart approach in all but one case (CVS). In contrast, other recovery methods could not survive deterministic bugs and had only a 40% recovery rate for non-deterministic bugs.

Confusion
Could you please explain the working of the proxy in detail?

Summary
The paper introduces Rx, a safe technique which can quickly recover programs from software failures. Rx uses a technique motivated by allergy treatment in real life: upon failure it rolls back to a recent checkpoint and re-executes in a modified environment (removing the allergen), with few or no modifications to the program code.

Problem
There were many issues with existing recovery approaches like whole-program restart, simple rollback and re-execution without environment change, micro-rebooting, message reordering, and application-specific recovery. These existing approaches cannot successfully recover from both deterministic and non-deterministic bugs, and some are unsafe to use because they speculate on the programmer’s intentions, which can lead to program misbehaviour.

Contributions
Rx tries to solve the above problems using a safe recovery technique.
Rx rolls back to the last checkpoint and executes in a dynamically changed environment; if the recovery is successful it disables the changes. Rx applies these changes lazily to decrease the implementation cost.
Rx, as the paper claims, is comprehensive (handles many software defects), safe (not speculative), noninvasive (few or no modifications to program code), efficient (requires no rebooting), and informative (provides extra information about the bug to programmers).

The paper then discusses what the execution environment is and what requirements a modification to it must satisfy:
1. it must be correctness-preserving
2. it should potentially avoid some software bugs.
Some of the useful execution-environment changes are
1. memory management based
- padding buffers
- recycling late
- zero filling
2. timing based
- change timings of events
3. user request based
- drop request

Rx architecture has a few main components
1. Sensors
- detect software failures
- implemented by taking over OS-raised exceptions
2. Checkpoint and Rollback
- CR stores a snapshot of application memory in COW fashion in memory
- a worst-case recovery time parameter is used to manage checkpoints: the oldest checkpoint is deleted when the worst-case recovery time would exceed the time T (the maximum time Rx may take for the rollback/re-execution process)
3. Environment Wrappers
- perform environment changes during reexecution for averting failures.
- There are a few types: Memory wrapper, message wrapper, process scheduling, signal delivery, dropping user requests
4. Proxy
- helps a failed server re-execute and makes server-side failure and recovery oblivious to clients
- replays all the messages from the last checkpoint
- stand alone process
- executes in normal and recovery mode
- in normal mode buffers request messages from clients, and does not buffer response messages (except in applications where requests depend on previous responses, e.g. e-commerce, where an MD5 hash of the response is stored)
- in both modes request and response are forwarded at request granularity and response granularity.
5. Control Unit
- maintains checkpoints during normal execution and devises recovery strategies
- directs CR to checkpoint and roll back, diagnoses failures and decides what environment changes to apply
- maintains a failure table for future use in case of failure
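The proxy's duplicate-response rule above (compare a hash of the replayed response against the one already sent) can be sketched as follows. The paper stores an MD5 hash; the FNV-1a hash here is just a compact stand-in, and the function names are illustrative.

```c
#include <stdint.h>
#include <string.h>

/* FNV-1a: a simple stand-in for the MD5 digest the paper describes. */
static uint64_t fnv1a(const void *buf, size_t len) {
    const unsigned char *p = buf;
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Returns 1 if the replayed response matches what was already sent,
 * so the proxy can safely drop it instead of forwarding it twice. */
int response_matches(uint64_t sent_hash, const char *replayed) {
    return fnv1a(replayed, strlen(replayed)) == sent_hash;
}
```

Storing only the hash keeps the proxy's memory footprint small while still letting it detect a divergent re-execution.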

Evaluations
The authors evaluated Rx on client and server machines, using several known bugs and new bugs injected by the authors. Rx is compared with the existing techniques of whole-program restart and plain checkpointing. Rx can survive all the software failures and provides much faster recovery (21-53 times) than these existing techniques in all but one case (CVS).

Confusions
- The paper evaluates only a limited number of bugs (6); how does this make Rx comprehensive, as the paper claims?
- Dropping user requests doesn’t look like a practical approach for recovery; has this been incorporated in any real system?
- The paper says very little about the sensors, which are crucial for detecting failures; I would like to know more about how they are implemented in successful recovery systems.

1) Summary

Availability is a common requirement of software services. However, buggy software can lead to downtime. The authors propose a technique for recovering from crashes and faults without modifying the application.

2) Problem

A lot of work has gone into making applications fault-tolerant. Fault-tolerance allows a system to maintain correctness and/or liveness despite failures. This is a well-known problem in distributed systems.

The authors claim that a large portion of bugs are due to the environment, including the memory allocator or timing errors. They propose changing the environment to allow execution to get past buggy code.

At the same time, traditional error reporting usually gives information that is hard to use for debugging, such as core dumps or logs. The authors propose a technique that may improve the diagnosis of bugs.

3) Contributions

The authors propose a technique for recovering from failures caused by buggy software. Their system includes a kernel-level orchestration component, a failure detector, a proxy, and checkpointing mechanism.

The failure detector is intended to be a modular component which detects when a failure occurs and notifies the orchestration layer. The checkpointing mechanism provides periodic checkpoints from which the application can recover from failures. The proxy allows the buffering and replay of messages for use during recovery.

Upon a failure, the application is rolled back to a recent checkpoint, part of the environment is changed, and the application is replayed with the change. The hope is that this causes the bug not to manifest, so the application can proceed.
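The rollback-and-retry strategy above can be sketched as a search over checkpoints and environment changes: newest checkpoint first, least intrusive change first. Here `survives[cp][c]` stands in for "re-executing from checkpoint cp under change c succeeds"; a real implementation would perform the re-execution instead, and the change names are illustrative.

```c
#define N_CP 3
enum change { NOCHANGE, PAD_BUFFERS, DELAY_FREE, DROP_REQUEST, N_CHANGES };

/* Try checkpoints newest-first and changes least-intrusive-first;
 * report which combination survived, or -1 to fall back to a restart. */
int try_recover(int survives[N_CP][N_CHANGES], int *cp_out, int *ch_out) {
    for (int cp = 0; cp < N_CP; cp++)
        for (int c = 0; c < N_CHANGES; c++)
            if (survives[cp][c]) { *cp_out = cp; *ch_out = c; return 0; }
    return -1;   /* nothing worked: whole-program restart */
}
```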

4) Evaluation

I'll be honest, I had very strong negative feelings about this paper for several reasons.

First, the motivation is unconvincing. To my knowledge, most downtime comes from improper configuration, unrecoverable (correctness) bugs, hardware/network failures, or hacking. None of these is really targeted strongly by Rx. I have never really heard that environmental factors are commonly to blame.

Second, the premise of wanting software to survive bug-crashes seems like a special case. I can understand why this might be desirable for a webserver that posts cat-GIFs, but for most other online services (and indeed most programs), a crash/failure due to a bug indicates a correctness issue that should be resolved. In fact, that's the whole point of crashing! We could just build systems that fail silently if we wanted to. Crashes exist for a reason; we should not hide them. If fact, I would go further to argue that often causing the program to proceed when it would have otherwise crashed is harmful or could lead to invalid state. For example, consider the following incorrect C code:

void foo() {
    struct bar *b;

    if (b->x) {
        send("Account += $100");
    } else {
        send("Account -= $100");
    }
}

This is undefined behavior in C. It may cause a segfault or it may evaluate to something and send a message changing the balance in the account. Whatever it does, it is clearly incorrect for either message to be sent. We would never like this code to survive a segfault, if one occurs. Yet, Rx would increase the chances of this code completing without error. The problem is that sometimes we want code to crash and stay dead.

Third, this is the wrong way to fix bugs. I am a TA for Intro Programming, and one of the first things we teach people is that they should understand why their code is failing rather than changing random things until it works, because if it does work, you don't know why. Consider what would happen if a data race was avoided by zeroing out memory: the programmer reading Rx's report might be misled into thinking they had a memory bug when they really have a race condition. I would further argue that the proper way to fix memory bugs isn't to randomly change properties of the memory allocator; rather, the answer is to use safe programming languages like Rust.

Fourth, the authors' definition of the problem and their solution is not very rigorous. They claim very boldly that their solution is "safe", but this is clearly not true, as shown above. The authors never define their assumptions about bugs, program behaviors, or crash types. They never prove their claim that their technique is "safe" in any way -- not even informally. As a matter of fact, in retrospect, it appears that their solution assumes that bugs do not exhibit undefined behavior, as my example above demonstrates. They also don't define what it means to "recover" from a bug. One might define this as "execution continues after the bug occurs", but as shown above this is contrary to the notion of "safe" execution.

Finally, their evaluation seems somewhat contrived. They pick 6 bugs that they know they can recover from, and they recover from them. The evaluation does not show they can recover or properly diagnose previously unknown bugs. In my opinion they should have used a wider selection of bugs, not just those for which they design their failure detectors.

That said, I did find the related work and methodology sections satisfying. The related work is fairly complete and touches on a number of other techniques for fault-tolerance. The authors also do a good job of explaining the benefits of their technique compared to previous ones, such as the ability to "tolerate" deterministic bugs or to run without modifying the application. Their methodology and results sections for the bugs they did test illustrate the "fault-tolerance" of the system well, and they use several useful metrics to measure and compare their performance with alternate solutions, such as full rebooting.

5) Confusions

How is this an SOSP paper?

1. Summary
The paper presents a system called Rx that provides high availability in the face of deterministic and non-deterministic software failures by rolling the program back to a recent checkpoint and re-executing it in a modified environment. These modifications include techniques such as block padding and delayed freeing to deal with memory-related bugs, and changing CPU priority to deal with scheduling bugs.

2. Problem
Software defects account for 40% of all system failures. The traditional techniques for dealing with these failures have several issues. Rebooting results in unavailability for several seconds. Microrebooting requires software restructuring. Checkpoint-and-rollback mechanisms cannot deal with deterministic failures. N-version programming requires multiple software versions, which incurs prohibitive development costs. Application-specific recovery approaches can help but are generally limited and hard to retrofit. The authors propose a solution to several deterministic and non-deterministic bugs: roll back to a recent checkpoint and re-execute the program in a modified environment.

3. Contributions
Their system Rx makes the following contributions:
1. Detection of software failures through sensors and periodic checkpointing to aid in recovery in case of failures.
2. Environment wrappers for changing the execution environment in case of a failure. The wrappers include:
2.1. Memory wrappers - Performs actions such as delaying freeing of buffers, padding buffers, allocation isolation and zero filling to deal with memory bugs
2.2. Scheduling - Changing process priority to get bigger time quantum
2.3. Signal delivery - Record signals and replay them in random order after recovery.
2.4. Dropping user requests - To deal with client requests causing crashes
3. Proxy for storing messages exchanged with clients. This is important in order to regenerate state from an old checkpoint
4. Control Unit which decides the changes needed in the execution environment based on failure symptoms. It also keeps a history of failures and steps applied to improve its diagnosis, and attempts to apply changes based on their cost.
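The memory-wrapper behaviors listed above can be pictured as a thin shim over the allocator. The following is a hypothetical Python simulation (class and parameter names like `RxAllocator`, `pad`, and `delay_free` are my own, not from the paper) of delayed recycling, padded allocation, and zero-filling:

```python
# Hypothetical simulation of Rx-style memory wrappers: zero-filling,
# padded allocation, and delayed recycling of freed buffers.

class RxAllocator:
    def __init__(self, pad=0, delay_free=0, zero_fill=False):
        self.pad = pad                # extra bytes before/after each buffer
        self.delay_free = delay_free  # number of frees to defer recycling
        self.zero_fill = zero_fill
        self.free_queue = []          # freed buffers not yet recycled
        self.next_addr = 0            # toy bump-pointer address space

    def malloc(self, size):
        # Padding absorbs small buffer overruns during re-execution.
        addr = self.next_addr + self.pad
        self.next_addr += size + 2 * self.pad
        # Zero-filling makes uninitialized reads see 0 instead of garbage.
        fill = 0 if self.zero_fill else None  # None models "garbage"
        return addr, [fill] * size

    def free(self, buf):
        # Deferring recycling masks dangling-pointer / double-free bugs.
        self.free_queue.append(buf)
        if len(self.free_queue) > self.delay_free:
            self.free_queue.pop(0)    # finally recycle the oldest buffer

alloc = RxAllocator(pad=16, delay_free=2, zero_fill=True)
addr, buf = alloc.malloc(8)
assert addr == 16 and buf == [0] * 8   # starts after pad, zero-filled
for old in (["a"], ["b"], ["c"]):
    alloc.free(old)
assert alloc.free_queue == [["b"], ["c"]]  # oldest buffer recycled first
```

Note that none of these changes fixes the underlying bug: they only make the buggy execution less likely to misbehave during one re-execution.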

4. Evaluation
Rx was implemented on Linux kernel 2.6.10. To verify that Rx can deal with software failures, four benchmarks with different server applications (Apache httpd, Squid, MySQL, CVS) are run. The servers contain four real bugs (buffer overflow, data race, etc.) introduced by the original programmers; two further bugs (an uninitialized read and a dangling pointer) were deliberately injected into Squid. Rx is able to handle all six bugs (five deterministic, one concurrency), while the alternatives of restart and simple rollback/re-execution cannot recover from most of them. Rx also has significantly better recovery time (21-53x), except for one test case (CVS). Rx's performance impact during recovery is modest: for a short period (17-161 ms), server throughput drops by 33% and average response time doubles, whereas restart causes a downtime of 5 seconds with zero throughput.

The evaluation section fails to cover several things. The efficiency of the program checkpointing mechanism is not discussed. Are their mechanisms able to efficiently generate checkpoints even if the server apps have a lot of state? How do they deal with clock timers, FDs etc? They also do not talk about the scalability of the Proxy component. How many flows can it handle? How does it deal with persistent TCP connections?

5. Confusion
1. A discussion about the scalability of the Proxy buffering techniques would be good.

1. Summary
This paper examines the implementation of Rx, a new way of recovering from software failures. Because it is almost impossible to produce bug-free software, especially for large applications, mechanisms like Rx are important for surviving software bugs. The basic idea behind Rx is that many of these bugs are dependent on the execution environment.

2. Problem
Previous techniques for surviving software failures, such as restarting or checkpointing and rolling back in the same environment, still suffered from deterministic bugs when rerun. These techniques also incurred significant downtime, leading to lost availability compared to the Rx design. Another potential fix was failure-oblivious computing, but its speculative fixes could lead to silent errors that are hard to diagnose.

3. Contributions
Rx is meant to solve the problems of these previous recovery techniques by modifying the environment the code is rerun in. The paper points out that, much like allergies, software bugs are often dependent on the environment they run in. By using a combination of checkpointing and changing the environment on rerun, Rx achieves vastly superior recovery time and handles more bug types than previous techniques. On top of these performance improvements, Rx accomplishes all this with minimal source code changes and can provide debugging support. To combat storage overhead, the checkpoint mechanism only stores enough checkpoints to handle several iterations of the recovery loop. The recovery loop only iterates as long as recovery would still take less time than a full restart. To eliminate extra overhead, Rx does not apply environmental changes from the beginning; instead it applies them only when a bug is found, starting with the lowest-overhead fixes first.
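The bounded recovery loop described above can be sketched as follows. This is a hypothetical simplification: the `try_reexecute` callback and the `cost` fields stand in for actual re-execution and its measured time, neither of which the paper specifies in this form.

```python
# Hypothetical sketch of Rx's recovery loop: roll back to successively
# older checkpoints, trying environmental changes cheapest-first, and
# give up (full restart) once recovery would cost more than restarting.

def recover(checkpoints, changes, try_reexecute, restart_cost):
    spent = 0
    for ckpt in reversed(checkpoints):       # newest checkpoint first
        for change in changes:               # cheapest change first
            spent += change["cost"]
            if spent >= restart_cost:
                return "restart"             # no longer worth retrying
            if try_reexecute(ckpt, change["name"]):
                return change["name"]        # this change avoided the bug
    return "restart"

# Toy example: only zero-filling avoids this (uninitialized-read) bug.
ok = lambda ckpt, change: change == "zero-fill"
changes = [{"name": "pad", "cost": 1}, {"name": "zero-fill", "cost": 2}]
assert recover([1, 2], changes, ok, restart_cost=100) == "zero-fill"
assert recover([1, 2], changes, ok, restart_cost=3) == "restart"
```

The second assertion shows the bound in action: once the accumulated retry cost reaches the cost of a restart, Rx stops experimenting and falls back to the traditional reboot.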

4. Evaluation
To evaluate Rx, the authors ran the technique on client and server machines using several known and injected bugs. The paper compares Rx against two of the more common previous techniques, full restart and checkpointing. All applications except CVS have much higher average recovery times under the alternatives, and only Rx is able to recover from all the bugs tested. The reason given for CVS's fast restart recovery is that CVS is designed to start up quickly. For availability, it is also important to note that with Rx the server's throughput drops by only 33% for a short period, whereas with full restart throughput is zero for several seconds.

5. Confusion
The extra help with debugging seems like an added bonus for the Rx technique. However, I was questioning how often there might be a false positive when choosing a new environment. If there was a non-deterministic bug and in the rerun trial with a new environment the bug did not show then it is not necessarily the change in the environment that “fixed” the bug.

1. Summary
This paper introduces Rx, a system that works as an execution environment for software to help software survive bugs, failures or malicious requests.

2. Problem
Server applications require high availability; however, complex server applications contain many bugs that can cause failures. We can debug the software, but we also want it to keep working and not crash even while bugs remain in the code. This paper observes that most failures can be avoided by changing the application's environment (such as the memory management or scheduling policy), and proposes a system, transparent to users, that changes the environment and re-executes automatically.

3.Contribution
The main idea in this paper for dealing with failures is: when a failure happens, roll back to a previous state, change some aspect of the environment, re-execute the application, and hope it works this time. To achieve this, the system has a checkpointing component that periodically snapshots the application state (memory, files, registers) using copy-on-write, and can roll back to previous checkpoints. This ensures the system does not need to restart from the very beginning, reducing re-execution time. To change the environment, the paper introduces environment wrappers that keep the memory and timing APIs the same but change the underlying implementation. With these, the application source code need not be modified, and environment changes can be applied quickly. Different environment changes are applied with different priorities; successful changes are recorded for future failures and reported to programmers for debugging.
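A minimal sketch of the checkpoint-and-rollback idea, assuming a deep copy of a state dictionary in place of the real fork-based copy-on-write snapshots (the `Checkpointer` class and its interface are my own, not the paper's):

```python
# Hypothetical sketch of checkpoint-and-rollback. Rx snapshots memory
# and file state with copy-on-write; here a deep copy of a state dict
# stands in for that, which is simpler but not COW.
import copy

class Checkpointer:
    def __init__(self, keep=3):
        self.keep = keep
        self.snapshots = []

    def checkpoint(self, state):
        self.snapshots.append(copy.deepcopy(state))
        if len(self.snapshots) > self.keep:  # bound the space overhead
            self.snapshots.pop(0)            # drop the oldest snapshot

    def rollback(self, n_back=1):
        # Return the n_back-th most recent snapshot, discarding newer ones.
        for _ in range(n_back - 1):
            self.snapshots.pop()
        return copy.deepcopy(self.snapshots[-1])

cr = Checkpointer()
state = {"balance": 100}
cr.checkpoint(state)
state["balance"] = -1          # a buggy update corrupts the state
state = cr.rollback()          # restore the pre-bug snapshot
assert state["balance"] == 100
```

Keeping several snapshots (rather than one) is what lets the recovery loop retreat to progressively older checkpoints when re-execution from the newest one keeps failing.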

4.Evaluation
This paper uses several real server applications with bugs (some of the bugs were injected by the authors to test the system). Experimental results show that Rx helps applications survive different kinds of failures, deterministic or not, with a much higher recovery rate than other approaches. The proxy acts as an interface to the client, so the client does not even know that a failure happened. Average response time for Rx is also lower than for other approaches because of checkpointing, and Rx can respond quickly even when the bug arrival rate is very high.

5.Confusion
(1)Not very sure about how to do the checkpoint for database applications like MySQL, because this application requires a large amount of memory, and the data in memory is frequently updated. I think COW may not help much in this case?
(2)Not very clear about the background of recovery, like those techniques mentioned in section 1.1, micro-rebooting or backup server. Can you introduce something about this?

1. Summary
Rx is a recovery technique for server applications that uses checkpoints to roll back from errors during execution and re-executes the program in a modified environment to avoid the possible cause of the error. Rx can avoid even deterministic bugs and can be used to provide highly available services with a low overhead.

2. Problem
The paper claims that software defects constitute a large number of system failures. Lack of availability due to a server application crashing can lead to enormous financial losses as well. Previous studies and techniques in availability mainly focused on hardware failures. The techniques that did target software failures needed significant changes to the application, or were expensive to deploy, or couldn't handle deterministic bugs.

3. Contributions
The paper proposes a system which regularly checkpoints the state of an application during execution and rolls back to a previous state when an error is encountered. Execution resumes in a slightly modified environment in order to address some of the possible causes of the failure. If the execution fails again, a different modification is tried. This cycle continues until either the execution proceeds successfully or the system is forced to reboot. A proxy unit acts as an interface between servers and clients, and makes recovery from errors transparent to clients. When a particular modification to the environment succeeds, a notification describing the error and the modification that worked is provided. The kinds of errors that can be handled are memory management errors, timing errors, and user-request-related errors. This set seems too small and limits the applicability of the system to other kinds of software failures. Also, a modification to the environment may hide a bug instead of fixing it; this might give the developer an incorrect root cause and lead them down the wrong path during debugging. The underlying bug may not cause a crash during re-execution under Rx, but it is not guaranteed that the program now behaves correctly.

4. Evaluation
The authors evaluate four different real-world applications with 4 real bugs and 2 bugs that were introduced. Rx ensures that clients don't experience failures with these applications and their bugs. Rx also has good recovery performance, because of which the throughput and average response times of applications are not dramatically affected. Space overhead for checkpoints is 2-3 MB which is surprisingly small. The failure table to learn about previous errors helps improve recovery performance.

5. Confusion
What is the guarantee that re-executing with a modified environment fixes the bug instead of masking it?

1. Summary

This paper describes Rx, a system to ensure availability of online software systems. Rx works by taking periodic snapshots of system state and reverting to a recent snapshot when a failure occurs. It then tries various changes to the system's execution environment depending on the inferred failure cause. If no change works, it falls back to the traditional technique of rebooting the entire system; otherwise it records the failure and the change that resolved it for future use.

2. Problem

System availability in the face of software failures is a crucial requirement for online systems; any break in availability leads to direct loss of revenue. Existing approaches for surviving software failures include whole-system reboot, n-version programming, and periodic checkpointing. Not only are these approaches inefficient, they only work well against non-deterministic failures. Rx is based on the premise that many system bugs manifest themselves only under certain environmental conditions. By changing the environment it may be possible to evade the expression of the bug at that particular time, ensuring availability. Note that this doesn't remove the bug itself from the system.

3. Contributions

Rx consists of several important components. These include sensors, which monitor the application for the occurrence of bugs and errors; they are implemented by catching OS-raised exceptions during execution. The second important component is checkpointing and rollback. The checkpointing frequency is carefully chosen to avoid excessive overhead, and instead of maintaining a single checkpoint, Rx maintains multiple checkpoints so it can revert to whichever one helps avoid the bug. Finally, Rx has environment wrappers, which implement library and system calls in modified ways during replay. These modifications are the key to avoiding the expression of the bug: padding allocated buffers to avoid buffer overflows, zeroing allocated memory to avoid uninitialized reads, allocating memory at different locations to avoid memory corruption, increasing the process's time quantum to avoid concurrency bugs, and reordering or dropping received messages to avoid malformed user requests. If none of these approaches works on the last snapshot, Rx goes back one snapshot and tries them again until the failure is avoided or the system is rebooted. To make this process more efficient, the fix and the bug are remembered for future failures. Rx also has a proxy, which reorders user messages and handles the system's responses; it also ensures that a client is not sent conflicting responses during replay.

4. Evaluation

Surprisingly, the authors only evaluated their system on 6 bugs across several systems, 4 of which were already known to them. They found that Rx was able to recover from all the bugs in a short time, much lower than a full restart. The authors report a number of different metrics from their evaluation, including average response time, throughput, and time and space overheads. However, the fact that all this was done for just 6 bugs creates doubt about the system's reliability.

5. Confusion

Some discussion on the functioning of the proxy would be good.

1.Summary
Rx is a new approach for dealing with server failures caused by common software errors such as memory corruption and concurrency issues. The idea of Rx is to replay operations from before the point where the error occurred after changing the execution environment. Rx resolves many deterministic and non-deterministic server errors in a manner that is transparent to the client.

2. Problem
Current methods for dealing with server failures, such as restarting the whole program or replaying from a checkpoint, have various limitations: they may not avoid deterministic errors, can lead to long recovery times during which clients experience downtime, and may require the applications themselves to be altered to fix the bugs.

3. Contribution
Rx rolls execution back to the latest checkpoint when a bug is detected, dynamically modifies the execution environment based on the failure symptoms, and then re-executes the failed region in the new environment. Rx's sensors detect errors such as assertion failures, access violations, divide-by-zero exceptions, buffer overflows, and accesses to freed memory, and notify the control unit upon a software failure. The control unit chooses an execution environment that might prevent the bug and suggests it to the environment wrappers, which cover memory management, messages, process scheduling, signal delivery, and dropping user requests. The memory wrapper avoids double-free and dangling-pointer issues by delaying the freeing of memory; padding buffers deals with buffer overruns; zero-filling eliminates uninitialized reads; and so on. The message wrapper reorders messages from different sources; scheduling a process with a different time quantum may avoid a context switch in the middle of an unprotected critical section; and dropping requests from malicious users may prevent the server from crashing. The proxy helps the server recover from the crash in a manner transparent to the client. The control unit also builds a failure table to remember failures and the solutions that fixed them, so that subsequent failures of the same kind are recovered from sooner than the first.

4. Evaluation
Rx is evaluated in four ways. First, it is compared against whole-program restart and simple rollback on surviving failures caused by common software defects. Second, the performance overhead of Rx for both server throughput and average response time is measured. The behaviour of Rx under a malicious attack, and the benefit of Rx learning from previous experience via its failure table, are also examined. Rx performs better than the other approaches on all the servers except CVS, where restarting the whole program takes less time than recovering with Rx.

5. Confusion
Is the approach proposed by Rx used in real systems to deal with the failures in the server?

1. summary
This paper avoids failures due to bugs at runtime by changing the execution environment to prevent the bugs from being triggered.

2. Problem
The problem this paper wants to solve is the high availability of software. High availability is necessary for many applications, such as trading systems. However, even with the strictest testing, software may still fail due to bugs. As bugs are inevitable, a mechanism is needed to allow systems to survive the effects of uneliminated bugs.

3. Contributions
This paper presents Rx, a safe technique that enables quick recovery from many types of software failures caused by common bugs. The idea comes from research showing that a large fraction of failures are dependent on the execution environment. When a bug is detected, Rx rolls the program back to a recent checkpoint, dynamically changes the execution environment based on the failure symptoms and Rx's experience, and re-executes the buggy code region in the new environment. After the buggy region executes successfully, the environment changes are disabled. The Rx system has several advantages:


  • comprehensive: Rx can survive many common software defects.

  • safe: Rx does not introduce uncertainty or misbehavior.

  • noninvasive: Rx requires few modifications.

  • efficient: Rx does not involve any rebooting or warm-up.

  • informative: Rx reports software bugs to the programmer.

4. Evaluation
The experiments are done on two machines (one server and one client) with a 2.4GHz Pentium processor, 512KB L2 cache, 1GB of memory, and a 100Mbps Ethernet connection. The OS is Linux 2.6.10. The authors chose four applications with various kinds of bugs and ran four sets of experiments:


  • Evaluates the functionality of Rx in surviving software failures caused by common software defects by rollback and re-execution with environmental changes.

  • Evaluates the performance overhead of Rx for both server throughput and average response time without bug occurrence.

  • Evaluates how Rx behaves under a certain degree of malicious attack that continuously sends bug-exposing requests triggering buffer overflows or other software defects.

  • Evaluates the benefits of Rx’s mechanism of learning from previous failure experiences, which are stored in the failure table to speed up recovery.


5. Confusion
How does the proxy communicate with the server in detail? Is it possible that the communication between the proxy and the server triggers unexpected bugs?
The failure table algorithm: how is it maintained, and how is it used?


Summary:
In this paper, the authors present a new technique called Rx to recover from software failures for server applications. Unlike prior work on surviving software failures, Rx can recover from many deterministic software bugs as well.

Problem:
Software defects account for up to 40% of system failures, thus severely reducing system availability. Earlier work on surviving software failures suffers from limitations such as need for application restructuring, unsafe speculation on program execution, long recovery time and inability to address deterministic software bugs. Rx is aimed at addressing all these issues while efficiently improving software availability.

Contributions:
In this technique, checkpoints are used for rolling back upon software failure and the program is re-executed in a modified environment. Their intuition is that many bugs are correlated with the execution environment and can be removed by modifying it. Some of the techniques used for changing execution environment are padding allocated memory blocks, zero filling newly allocated memory buffers, message reordering, scheduling changes etc.
Rx detects software failures using sensors which dynamically monitor the application's execution. Sensors notify the control unit upon failure with information to help identify the occurring bug for recovery. Environment wrappers perform environmental changes during re-execution to avert failures. The server recovery process is kept transparent to clients with the help of a module named the proxy. Rx maintains a failure table to record previous failures and the environmental changes required to fix them; this table is used on subsequent failures for fast recovery. The worst-case recovery time in Rx is twice the time required to restart the whole program.

Evaluation:
The authors evaluated four real-world applications which contained six types of bugs: data race, buffer overflow, uninitialized read, dangling pointer, stack overflow, and double free. The results show that Rx can recover from all of these bugs, whereas the alternative approaches (whole-program restart, and a simple rollback and re-execution without environmental changes) could not recover from the deterministic bugs and had only a 40% recovery rate for the non-deterministic concurrency bug. Recovery time is much faster with Rx than with the whole-program restart approach, except for the CVS application.

Confusion:
The authors seem to be fixing simplest possible bugs in server applications. The paper mentions hierarchical server rollback and multi-threaded process checkpointing which were not implemented in Rx yet. Do you think Rx would work for these cases?

1. Summary
This paper introduces Rx, a tool to recover programs from software bugs. The key feature of Rx is to roll the program back to a previous checkpoint and re-execute it in a modified environment (memory, network messages, etc.).

2. Problem
Detection of and recovery from bugs in software is an important problem in computer systems. Previous work such as rebooting or simple checkpointing and replay cannot handle deterministic bugs, because such bugs generally reappear in the same environment. Other work either requires restructuring the application program or may lead to program misbehavior.

3. Contributions
The most important contribution is modifying the execution environment when replaying the program from a checkpoint. The system (Rx) contains five parts. The first part (the sensor) monitors the program's execution: it intercepts OS-raised exceptions and notifies the rest of the system of the failure. The second part (checkpoint-and-rollback) takes charge of snapshotting the application's state (memory, file states) using copy-on-write for efficiency. Rx rolls the program's state back to a previous checkpoint and replays with a modified environment; if the program then runs without failure past a time threshold, the environment modification is switched back off. Several rounds of rollback are attempted before falling back to restarting the program. The third part (the environment wrappers) modifies the program's environment: memory wrappers around memory-related library calls delay free operations and pad, isolate, or zero-fill memory. A network message wrapper (the fourth part, the proxy) reorders messages and delivers them in random-sized packets; other wrappers change process scheduling, delay signal delivery, or drop user requests (drop messages). The fifth part (the control unit) builds a failure table that records previous recovery experience for more efficient future recovery: for each failure (exception), each kind of environment change (e.g., padding allocated memory) has an integer score denoting the number of successful re-executions under that change, and highly scored changes are tried first when re-executing the program.
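The failure table scoring described above might look roughly like this. This is a hypothetical sketch: the symptom and change names, and the `FailureTable` interface, are invented for illustration.

```python
# Hypothetical sketch of Rx's failure table: map a failure symptom to
# environmental changes, scored by how often each one worked before.
from collections import defaultdict

class FailureTable:
    def __init__(self):
        # symptom -> {change name: number of successful re-executions}
        self.table = defaultdict(lambda: defaultdict(int))

    def record_success(self, symptom, change):
        self.table[symptom][change] += 1

    def best_changes(self, symptom):
        # Try previously successful changes first, highest score first.
        scores = self.table[symptom]
        return sorted(scores, key=scores.get, reverse=True)

ft = FailureTable()
ft.record_success("SIGSEGV", "pad")
ft.record_success("SIGSEGV", "zero-fill")
ft.record_success("SIGSEGV", "zero-fill")
assert ft.best_changes("SIGSEGV") == ["zero-fill", "pad"]
```

Reusing the highest-scored change for a recurring symptom is what makes subsequent recoveries from the same kind of failure faster than the first one.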

4. Evaluation
The authors used 4 applications (a web server, a web cache/proxy server, a database server, and a version control server) to compare Rx against whole-program restart and against Rx without environment modification. Each application has bugs during execution, and the results showed Rx was able to recover from all of them, while the other two methods failed to recover from the deterministic bugs and survived MySQL's concurrency bug only about 40% of the time. The average recovery time for Rx was much better than the whole-program restart approach, except for CVS. The results also showed that the average memory overhead for each Rx checkpoint was small (no more than 500 KB), and subsequent recoveries were much faster than the first thanks to the failure table.

5. Confusion
Was there any more realistic workload to evaluate Rx? What kinds of workload/application makes Rx perform worse or better?

Summary

The paper introduced Rx, which is a safe technique aiming to quickly recover programs from both deterministic and non-deterministic software bugs.

The main idea of Rx is to roll the program back to a recent checkpoint upon a software failure and then re-execute it in a modified environment. The key component is the modified execution environment: Rx was proposed based on the observation that many bugs are correlated with the execution environment and can therefore be avoided by modifying it. Rx also collects diagnostic information about which environment modifications allowed the server to survive a bug; this can be used to speed up future recovery and to help developers fix software errors. In addition, Rx's checkpointing system is designed to be lightweight, imposing small time and space overheads.

Experiments show that Rx can survive a wide variety of software failures and provide transparent, fast recovery, 21-53 times faster than the whole-program restart approach in most cases.

Problem

- Motivation

An increasing number of applications, like process control or online transaction monitoring, require high availability of the server. At the same time, software failures account for up to 40% of system failures.

- Existing methods and drawbacks

This fact motivates mechanisms designed to help systems survive the effects of software bugs to the largest extent possible. The paper categorizes previously proposed mechanisms into four types and points out the drawbacks of each:

1. Rebooting methods. Drawbacks include the inability to survive deterministic errors, and service unavailability during the restart.
2. General checkpointing and recovery. These also cannot survive deterministic errors; adding more non-determinism during replay has been proposed for this type of approach.
3. Application-specific recovery mechanisms, such as the multi-process model, exception handling, etc. Drawbacks include the inability to survive deterministic errors and to deal with corrupted shared data structures.
4. Non-conventional proposals such as failure-oblivious computing and the reactive immune system. According to the paper, these approaches are not safe because they "speculate" on programmers' intentions, which can lead to misbehavior.

The paper also argues that current mechanisms do not provide enough diagnostic feedback on failures to help developers debug.

Contribution

According to the paper, Rx has five advantages over existing systems:
1. Comprehensive: it can survive many common software defects, both non-deterministic and deterministic.
2. Safe: Rx only changes the program's execution environment, so it does not introduce uncertainty or misbehavior.
3. Noninvasive: it requires few to no modifications to applications' source code.
4. Efficient: it can usually avoid rebooting.
5. Informative: it generates additional diagnostic information beyond the usual bug-report package.

The paper gives a clear definition of the execution environment: "In our idea, the execution environment can include almost everything that is external to the target application but can affect the execution of the target application." (section 2)

It also discusses the order of applying environmental changes: (1) use experience from past failures first, (2) apply lower-overhead changes before higher-overhead ones, and (3) apply changes with negative side effects last.

The paper discusses the five main design components in detail:

1. Sensors: detect software failures by dynamically monitoring application’s execution.
2. Checkpoint and Rollback (CR): keeps track of other system states such as file states as well as memory states. Supports multiple checkpoints, and uses an RTime calculation to drop the oldest checkpoint to save space.
3. Environment Wrappers: a variety of environment wrappers (including memory wrapper etc.) to perform environmental changes during re-execution for averting failures. The key is that environment wrapper cannot violate system policies (even though they might introduce overheads), so that they are “safe” to apply. Since these wrappers cause overheads, they should be turned off after the server survives the software error.
4. Proxy:
a. Normal mode: buffer requests, mark requests waiting to be sent, do not buffer responses, and avoid repeated or self-conflicting responses.
b. Recovery mode: roll back, replay buffered requests, buffer incoming requests, and avoid repeated or self-conflicting responses.
5. Control Unit:
a. Make checkpoint and rollback
b. Diagnostic failure and use experiment to decide environmental changes to apply
c. Output diagnostic info to developers
The cool part of Control Unit is the failure table. It records “past experience” to help future recovery. More or less, it reminds me of “reinforcement learning”.
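A toy sketch of the failure-table idea, assuming a signature built from the failure type and faulting program counter (the paper's actual keying and table contents are more detailed than this):

```python
# Hypothetical failure table: failure signature -> change that worked.
failure_table = {}

def signature(failure_type, faulting_pc):
    """An assumed failure signature: type of failure plus program counter."""
    return (failure_type, faulting_pc)

def record_success(sig, change):
    """After a successful recovery, remember which change cured this failure."""
    failure_table[sig] = change

def suggest_changes(sig, default_order):
    """On a new failure, try the remembered cure first, then fall back
    to the default ordering of environmental changes."""
    remembered = failure_table.get(sig)
    if remembered is None:
        return list(default_order)
    return [remembered] + [c for c in default_order if c != remembered]

# A prior recovery learned that delaying frees cured this crash:
record_success(signature("SIGSEGV", 0x4005D0), "delay_free")
print(suggest_changes(signature("SIGSEGV", 0x4005D0),
                      ["pad_buffers", "delay_free", "drop_requests"]))
# -> ['delay_free', 'pad_buffers', 'drop_requests']
```

This is what makes repeated recoveries fast: a failure seen before skips the trial-and-error over candidate changes.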

On the implementation side, the paper discusses inter-server communication, checkpointing multi-threaded processes, and bugs that Rx cannot avoid.

Evaluation

Rx was evaluated on Linux kernel 2.6.10. The paper chose four real-world server applications: a web server, a web cache and proxy server, a database server, and a concurrent version control server. Various types of bugs were investigated, including bugs originally present in the server applications as well as two newly injected ones.

During the evaluation, Rx outperformed its alternatives (whole-program restart, and simple rollback and re-execution without environmental changes) in recovery time, ability to recover, and hiding the recovery process from clients. Rx also demonstrated higher throughput and lower average response time than whole-program restart.

Rx consistently showed very little performance overhead and moderate space overhead compared to the baseline (a configuration without Rx). Meanwhile, the failure table can greatly reduce future recovery time.

Confusion

1. In section 3.3, why is signal delivery a factor for concurrency bugs? How can rescheduling signal delivery solve these concurrency bugs?
2. In section 3.4, for Proxy behavior in recovery mode, the proxy sends out response if the buffer is full. Wouldn’t this be “unsafe”, as the re-execution may turn out to fail later?

1. Summary
Rx is a runtime system that allows software to quickly and dynamically recover from bugs by replaying buggy regions of execution in an altered environment. This allows many bugs to be avoided without changing the program’s contents or affecting its correctness.

2. Problem
Server software has bugs that cause crashes, and downtime is expensive for service providers. Current systems for dynamic bug recovery involve resetting the whole system, re-executing checkpoints without changing anything, or redesigning applications for reliability. These solutions incur too much overhead, cannot avoid deterministic bugs, or put too much burden on the programmer.

3. Contributions
The authors present a solution to dynamically handling bugs that combines runtime support, a checkpointing system, and failure sensing. I think their main contribution is demonstrating that such a system is tractable. It seems like an overly complicated prospect, but they implement their system with low performance and storage overhead.
The idea of altering the execution environment to avoid bugs is another contribution. Other approaches either do nothing to avoid bugs or require the software itself to be changed; this approach is an interesting middle ground that allows dynamic bug avoidance in a way that is client and server-software transparent.
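As one concrete illustration of such an environment change, here is a toy sketch (in Python, not Rx's actual C-level allocator wrapper) of delaying the recycling of freed buffers, which can mask double-free and dangling-pointer bugs during re-execution:

```python
# Illustrative only: a quarantine-style allocator that delays buffer reuse,
# so a dangling read during re-execution still sees the old contents
# instead of freshly recycled, corrupted memory.
class DelayedFreeAllocator:
    def __init__(self, delay=8):
        self.delay = delay      # how many freed buffers to hold back
        self.quarantine = []    # freed buffers not yet eligible for reuse
        self.free_pool = []     # buffers that have left quarantine

    def alloc(self, size):
        # Reuse only buffers that have left quarantine; otherwise allocate fresh.
        for i, buf in enumerate(self.free_pool):
            if len(buf) >= size:
                return self.free_pool.pop(i)
        return bytearray(size)

    def free(self, buf):
        # Instead of recycling immediately, park the buffer in quarantine.
        self.quarantine.append(buf)
        if len(self.quarantine) > self.delay:
            self.free_pool.append(self.quarantine.pop(0))

alloc = DelayedFreeAllocator(delay=2)
a = alloc.alloc(16)
a[:5] = b"hello"
alloc.free(a)
b = alloc.alloc(16)   # with delay=2, this is a fresh buffer, not `a`
print(b is a)         # -> False: the freed contents of `a` survive for now
```

The point of the sketch is the safety argument from the paper: the allocator's API contract is unchanged, only its internal timing of reuse, so correctness is preserved while the buggy interleaving is avoided.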

4. Evaluation
The authors implement their system as a user-level process. Their test environment features two machines: one server and one client. The client’s request streams are generated synthetically and are designed to exercise bugs latent in the server software. The server software they run is real software: an Apache webserver, a cache/proxy, a DBMS, and version control software. They take advantage of documented bugs in the software, as well as injecting their own bugs in a few cases.
They perform several tests. They compare their system’s throughput and latency in the presence of bugs versus that of a full-restart system. Since a full-restart system goes down while recovering from a bug, their system performs much better. Their system also maintains its throughput and latency as the frequency of bugs increases. They show their system does not incur significant performance overhead versus a system with no dynamic bug detection. They calculate the storage overhead of their system (in memory) as 2-3 MB. Finally, they show that the learning behavior of their system (via the Failure Table) helps Rx’s performance improve over time.

5. Confusion
I was confused by how the buffering system in the proxy service works.
