**Rx: Treating Bugs as Allergies**
==================================

QUESTION:
* What aspect (reliability or availability) does Rx help to improve?
  - Availability: strictly speaking, Rx does not mask failures; it detects a failure and provides fast repair by changing the execution environment, so that the re-execution (hopefully) avoids the failure.
  - Moreover, the proxy in Rx makes re-execution transparent to clients: a client may see an increased response time, but it never sees that the server has crashed.
* Why not simply reboot?
  - Rebooting does not help with deterministic bugs.
  - Rebooting is slow: it takes time to initialize system state and to warm the cache.
* Trade-off: when to use reboot, and when to use Rx?

# 0. Take away:
- How to improve availability? How to tolerate software failures?
  + Alternative 1: reboot
    > cannot deal with *deterministic failures*
    > unavailability during the reboot
    > requires time to warm the cache
  + Alternative 2: checkpoint and recovery
    > cannot deal with deterministic bugs
    > solutions: message reordering, n-version programming
  + Alternative 3: application-specific recovery
    > requires the software to be failure-aware ==> affects programming difficulty and code readability
  + Alternative 4: non-conventional (like failure-oblivious computing)
    > for an out-of-bound read, just provide a dummy value rather than panic
    ==> may work for certain bugs and applications, but not for all
- Idea in this paper: a "safe" approach, i.e., rather than fixing the bug at run time, change the environment so that the bug does not manifest (a minimal sketch of this retry loop follows at the end of this section):
  + roll the program back to a recent checkpoint
  + *change* the execution environment based on the failure symptoms
  + re-execute the buggy code region in the new environment
  + after passing the region, disable the changes to avoid their overhead
  What if re-execution fails?
  - roll back again, with different changes or to an older checkpoint
  - after some threshold of iterations, switch to the alternative (reboot)
- Assumption:
  + bugs are related to the execution environment:
    > dirty (uninitialized) read ==> fill allocated buffers with zeros
    > dangling pointer ==> delay recycling of freed buffers
    > buffer overrun ==> pad allocated buffers
    > data race ==> change the timing of related events (e.g., increase the time slice)
    > malicious user ==> drop the request
  + such bugs can be avoided by removing the "allergen" from the environment
  Questions:
  1) HOW to find the *correct* allergen and remove it from the environment?
     (this is what the paper is all about)
  2) Changing the environment is expensive, so when should a change be made?
  3) Which change should be tried first?
     + learn from history (if a change worked for a similar failure, apply it first)
     + prefer changes with small overhead
     + try changes with negative effects last (e.g., dropping requests)

Advantages of Rx:
- comprehensive: survives many kinds of bugs
- safe: does not fix the bug but changes the environment; program execution semantics are not modified
- non-invasive: few to no modifications to the source code
- efficient: no reboot, no warm-up time
- informative: provides additional information for debugging

Disadvantages:
- depends on the accuracy of the sensors (i.e., the failure detectors)
- cannot handle bugs that do not depend on the environment (i.e., semantic bugs)
- cannot handle memory-leak bugs
- checkpointing and rollback add overhead
- cannot deal with latent bugs, i.e., bugs in which the fault is introduced long before any obvious symptom
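The retry loop referenced above can be made concrete with a short sketch. This is a minimal illustration in C, not the paper's implementation; the helper names (`rollback_to_checkpoint`, `apply_env_change`, `reexecute_buggy_region`, `disable_env_changes`, `fall_back_to_reboot`) and the `MAX_RETRIES` threshold are hypothetical stand-ins for Rx's real components (checkpoint/rollback, environment wrappers, control unit, sensors), stubbed here so the sketch compiles.

```c
/* Minimal sketch of Rx-style recovery: roll back, change the environment,
 * re-execute; escalate to older checkpoints / other changes, then give up
 * and reboot after a threshold. Helpers are hypothetical stubs. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_RETRIES 5   /* threshold before falling back to reboot */

static bool rollback_to_checkpoint(int depth)  { (void)depth; return true; }
static void apply_env_change(int change_id)    { (void)change_id; }
static void disable_env_changes(void)          { }
static bool reexecute_buggy_region(void)       { return true; } /* true = survived */
static void fall_back_to_reboot(void)          { puts("giving up: reboot"); }

void rx_recover(void)
{
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        /* Roll further into the past as attempts keep failing. */
        if (!rollback_to_checkpoint(attempt))
            break;

        /* Pick the next change: cheap and historically successful changes
         * first, harmful ones (e.g., dropping requests) last. */
        apply_env_change(attempt);

        if (reexecute_buggy_region()) {
            /* Past the buggy region: drop the changes to avoid overhead. */
            disable_env_changes();
            return;
        }
    }
    fall_back_to_reboot();
}

int main(void) { rx_recover(); return 0; }
```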
# 1. Design of Rx
Five components:
1) Sensors: detect failures at runtime
   - software errors: e.g., assertion failures, access violations
   - software bugs: buffer overflows, dangling pointers
2) Checkpoint and rollback
   - What to checkpoint?
     1) application state: copy-on-write (COW) on the application's memory image
     2) system state: file state, signals, messages
   - Dealing with space overhead?
     + write old checkpoints to disk when the system is idle
       but recovering from an on-disk checkpoint requires I/O
       ==> to bound recovery time, keep only a small number of checkpoints
3) Environment wrappers: change the environment during re-execution
   *Requirements*
   - correctness is preserved
   - future failures are avoided
   Now, the details:
   - memory wrapper: modifies the memory-related library to introduce changes (a minimal sketch is given at the end of these notes)
     + delayed free: avoids *dangling pointers* and *double frees*
       > a freed buffer is recycled only when no other free memory is available, or after a certain amount of time
     + padded buffers: avoid *buffer overflows*
       > but wastes memory :)
     + allocation isolation: all memory allocated during re-execution is placed in an isolated region ==> avoids memory corruption (buffer overflow)
     + zero-filling: avoids *uninitialized reads*
       > time overhead
   - message wrapper: implemented in the proxy
     + shuffle the order of requests
     + deliver random-sized packets
   - scheduling wrapper: implemented in the kernel
     + change a process's priority ==> reduces the chance that a process is switched out inside an unprotected critical region
   - signal delivery:
     + may affect the occurrence probability of a concurrency bug
     + record signals in a kernel table before delivering them
     + h/w interrupts: deliver at random times but preserve their order
     + s/w timer signals: ignore during rollback
     + s/w exceptions: captured by the sensors (as a sign of failure)
   - dropping user requests:
     + to avoid malicious users
4) Proxy: makes server recovery transparent to clients
   - two modes: normal and recovery
   - normal mode:
     + forward requests/responses between the server and its clients
     + buffer requests
     + record which messages have been answered
   - recovery mode:
     + replay the buffered requests
     + introduce environmental changes
     + buffer new requests arriving during re-execution
     + drop requests that have already been answered
   - for strict session consistency (i.e., ensuring that a replayed message produces exactly the same result)
     ==> hash the responses; if the hashes do not match, abort the session
   - the proxy needs to be tweaked per application protocol in order to interpret the messages
5) Control unit
   - directs checkpointing and rollback
   - diagnoses the failure and decides which changes to apply, based on the symptoms
     + symptoms can be:
       ~ type of exception
       ~ call chain
       ~ instruction counters
       ~ etc.
   - provides failure-related information for further debugging
   - learns from experience:
     + builds a failure table, with a score vector for each failure
     + once a failure is detected, search the failure table; on a match, apply those changes first

# Some issues
- inter-server communication: web server -> app server -> database server
  + apply Rx to all servers in the hierarchy
  + Rx instances in tiered servers take checkpoints in a coordinated way
  + when Rx detects a failure, it rolls back the failed server and broadcasts its rollback to the correlated servers, which then roll back correspondingly
- multi-threaded process checkpointing:
  + at checkpoint time, some threads may be blocked inside the kernel on a system call
  + potential problem:
    ~ such state may include kernel locks that have already been acquired
    ~ rolling back such state may leave two processes holding the same kernel locks
  + solution: force all threads to stay at user level before checkpointing
    ~ how: send a signal to all threads
    ~ then Rx resumes the prematurely returned system calls
  Implication: frequent checkpointing affects the performance of normal I/O
- may not work for all bugs (this is the weakness)
  + memory-leak bugs: take days to crash the server ==> better to reboot
  + semantic bugs (not related to the environment): Rx can do nothing about them
  + depends on the accuracy of the sensors
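As forward-referenced in the memory-wrapper bullet above, here is a minimal sketch in C of what such a wrapper might look like. It is illustrative only, not Rx's actual code: the names `rx_malloc`/`rx_free` and the constants `PAD_BYTES`/`DEFER_SLOTS` are hypothetical. It shows padding plus zero-filling on allocation and delayed recycling on free; a real wrapper would additionally implement allocation isolation and detect duplicate frees to fully mask double-free bugs.

```c
/* Sketch of an Rx-style memory wrapper used during re-execution:
 * padded, zero-filled allocations and deferred frees. Illustrative only. */
#include <stdlib.h>
#include <string.h>

#define PAD_BYTES   64     /* slack that absorbs small buffer overruns     */
#define DEFER_SLOTS 1024   /* how many freed pointers to hold back         */

static void *deferred[DEFER_SLOTS];
static int   deferred_n;

void *rx_malloc(size_t size)
{
    /* Padding masks small overruns; zero-filling masks uninitialized reads. */
    void *p = malloc(size + PAD_BYTES);
    if (p)
        memset(p, 0, size + PAD_BYTES);
    return p;
}

void rx_free(void *p)
{
    if (p == NULL)
        return;
    /* Delay recycling so dangling pointers keep referring to memory that
     * has not yet been handed out again. */
    if (deferred_n == DEFER_SLOTS) {
        for (int i = 0; i < DEFER_SLOTS; i++)   /* out of slots: flush */
            free(deferred[i]);
        deferred_n = 0;
    }
    deferred[deferred_n++] = p;
}

int main(void)
{
    char *buf = rx_malloc(16);   /* zero-filled, padded allocation */
    memcpy(buf, "hello", 6);
    rx_free(buf);                /* not actually recycled yet */
    return 0;
}
```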