CS 736 – Fall 2006
Midterm 2 Review
- Performance
- Papers:
i. AFS
ii. Frangipani / Petal
iii. Chubby
- General Problems
i. Latency: do things faster
1. E.g. RPC turnaround time
ii. Throughput
1. Handle more requests/operations per second
iii. Scale out:
1. run on bigger data sets on more machines
2. Handle more data on more computers / faster
computers
3. Run well on a cluster with a billion clients
- General Solutions
i. Locality
ii. Partitioning – distribute load to multiple
servers
1. AFS
2. Petal
3. Frangipani
iii. Replication – more read throughput
1. AFS
2. Petal
iv. Caching
1. AFS
2. Frangipani
3. Chubby
v. Change data structures
1. Logging in Petal, Frangipani
vi. Batching – reduce startup costs
1. Delayed write / group commit in LFS / Frangipani,
NFS
vii. Callbacks / leases – reduce server load
1. AFS
2. Frangipani
3. Chubby
viii. Move work to client
1. AFS name translation
ix. Notifications – notify other participant of
semantically interesting events
1. AFS callbacks, Petal locks
x. Multi-level policy
1. Petal global / physical maps
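As an illustration of how leases reduce server load, here is a minimal sketch in Python. It is not the protocol of AFS, Frangipani, or Chubby: the classes `LeaseServer` and `LeaseClient`, the 10-second lease, and the explicit `now` argument are all assumptions made for the example. The client answers reads from its cache while its lease is valid and contacts the server only after the lease expires.

```python
class LeaseServer:
    """Hypothetical server that returns each value with a lease expiry time."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.data = {}
        self.reads = 0  # number of read requests that reached the server

    def read(self, key, now):
        self.reads += 1
        return self.data[key], now + self.lease_seconds


class LeaseClient:
    """Hypothetical client that trusts cached data until its lease expires."""

    def __init__(self, server):
        self.server = server
        self.cache = {}  # key -> (value, lease expiry time)

    def read(self, key, now):
        entry = self.cache.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]  # lease still valid: no server traffic
        value, expiry = self.server.read(key, now)
        self.cache[key] = (value, expiry)
        return value


server = LeaseServer(lease_seconds=10)
server.data["motd"] = "hello"
client = LeaseClient(server)

# Four reads at times 0, 3, 9, and 12: only the first and last reach the server.
values = [client.read("motd", now=t) for t in (0, 3, 9, 12)]
print(values, server.reads)
```

The sketch leaves out writes. A real system must either recall the lease (a callback) or wait for it to expire before changing the data.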
- Reliability
- Terms:
i. Fault = code bug
ii. Error = memory corrupted by that bug
iii. Failure = system misbehavior
- metrics
i. MTTF = mean time to failure == reliability
ii. MTTR = mean time to repair
iii. Availability = MTTF / (MTTF + MTTR), measured in
9s: 0.9, 0.99, 0.999
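A short worked example of the availability formula, written as a Python sketch. The MTTF and MTTR figures are made-up numbers chosen for illustration, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mttf_hours, mttr_hours):
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail):
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Illustrative numbers: fail once every 999 hours, take 1 hour to repair.
base = availability(mttf_hours=999, mttr_hours=1)

# Two ways to improve: halve the repair time, or double the time to failure.
half_mttr = availability(mttf_hours=999, mttr_hours=0.5)
double_mttf = availability(mttf_hours=1998, mttr_hours=1)

for label, a in [("base", base), ("half MTTR", half_mttr), ("double MTTF", double_mttf)]:
    print(f"{label}: availability={a:.6f}, downtime={downtime_hours_per_year(a):.2f} h/year")
```

With these numbers, halving MTTR and doubling MTTF give the same availability, since 999 / 999.5 equals 1998 / 1999.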
- General design principles
i. End-to-end design
1. If code has to handle a problem anyway (middle layers
can't completely solve a problem), it is a good place to handle the problem
completely
ii. Recovery-oriented computing
1. It may be easier to let things fail and recover
quickly than to make them perfect
a. Improve MTTR instead of MTTF
iii. Transactions
1. Provide a general purpose error handling mechanism
by abort
- Failure Models
i. Timing – miss a deadline
ii. Output – produce incorrect output
iii. Omission – skip an output
iv. Crash – skip an output, produce no more
output
v. Byzantine
1. Anything can happen, including malicious behaviors
- General Approaches
i. Fault Avoidance
1. Prevention: make sure bugs never enter code
2. Removal: Remove bugs from existing programs
3. Work-around: don't execute buggy code
a. E.g. firewalls
ii. Fault Tolerance
1. Redundancy – execute multiple times
2. Diversity: multiple versions for deterministic
bugs. Can also be diverse environment (change how memory allocated, scheduling
works)
3. Isolation: confine errors to a single component
4. Modularity: keep components small
5. Error detection: why is it important if you already have isolation?
a. A: Needed for availability: a component that is doing the wrong thing
must be noticed before it can be recovered
6. Recovery
a. Forwards / Backwards
b. Concealing / revealing
7. Where do you provide fault tolerance
a. In the application?
b. In a library
c. In the OS
d. Around a component (e.g. Nooks)
e. In the HW
f. If everything above layer X is identical, can
tolerate faults at X or below automatically
g. If have some diversity above X, can tolerate
heisenbugs above layer X
- Systems
i. Process Pairs
1. All work done in persistent transactions
2. Process 1 sends request to its pair, tries to do
work. On failure, transaction aborts and process 2 retries.
ii. Nooks
1. Isolating device drivers
2. Recovery by restart
iii. Recovery Oriented Computing
1. FIG: inject system call failures to determine bugs
in error handling
a. Goal: gracefully handle failures
2. Recursive restartability: allow an application to
be partially rebooted
a. Goal: improve MTTR
3. Undo for Operators: log system state & mgmt
operations, allow for rollback, repair, replay
a. Goal: improve MTTR of management
iv. QuickSilver
1. Goal: transactions to simplify error handling in a
distributed system
2. Optimized for fast cases: read only, volatile data
3. Common TM, LM per system
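The process-pair idea above (work done inside a transaction, abort on failure, the second process retries) can be sketched as follows. This is a single-process Python simulation under stated assumptions, not the mechanism of any real system: the two "processes" are plain functions, a crash is modeled as a `RuntimeError`, and the `Transaction` class and `run_with_pair` helper are hypothetical names.

```python
class Transaction:
    """Hypothetical all-or-nothing update to a dict-based store."""

    def __init__(self, store):
        self.store = store
        self.pending = {}

    def write(self, key, value):
        self.pending[key] = value  # buffered; invisible until commit

    def commit(self):
        self.store.update(self.pending)

    def abort(self):
        self.pending.clear()  # discard all partial work


def run_with_pair(store, request, primary, backup):
    """Try the primary; if it fails, abort its transaction and let the backup retry."""
    for worker in (primary, backup):
        txn = Transaction(store)
        try:
            worker(txn, request)
            txn.commit()
            return worker.__name__
        except RuntimeError:
            txn.abort()
    raise RuntimeError("both processes failed")


def crashing_primary(txn, request):
    txn.write("balance", request)          # partial work
    raise RuntimeError("primary crashed")  # simulated crash before commit


def healthy_backup(txn, request):
    txn.write("balance", request)
    txn.write("status", "done")


store = {"balance": 0}
winner = run_with_pair(store, 100, crashing_primary, healthy_backup)
print(winner, store)
```

Because the primary's writes are only buffered, its failure leaves the store untouched, so the backup can redo the request from a clean state.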
- Security
- Key threats:
i. Privacy – uncontrolled release of sensitive
information
ii. Integrity – uncontrolled change to sensitive
information
iii. Denial of service – uncontrolled prevention
of service
- Guard model
i. Guard enforcing access control : impenetrable wall
with a door
ii. Authorization check : guard demands something it
can check against its database
iii. Protected information : database of tamperproof
information
iv. Decision procedure : mechanism for making a
decision
v. SIMPLEST solution: complete isolation
1. Reject everything
- Systems
i. Needham and Schroeder
1. Secret Key vs. Public key
2. On-line vs off-line
3. QUESTION: When use secret key?
a. When humans need to remember a key
b. When servers have little power
4. QUESTION: When use public key
a. When you can distribute certificates out-of-band
(e.g. with browser / OS)
b. When no simple trust relationship
5. QUESTION: How handle authentication between realms
/ domains?
a. Referrals
b. Certificate chains
ii. SFS
1. Self-certifying names encode hash of key in name
a. Client verifies that server has key corresponding
to name
2. Key management (e.g. what keys a client should
trust) separated from protocol
3. Client authentication separated from server
authentication
4. Perfect forward secrecy of communication if client
public key changes regularly
iii. Terra
1. QUESTION: What does it provide?
a. Assurance of what SW running on a remote computer
b. How provided? Each layer acts as AS for next upper
layer, signs key + certificate
c. Remote system can verify certificate chain
2. QUESTION: What problems can it solve?
a. Spyware?
b. Gaming?
c. Password sniffing terminals?
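Returning to the SFS item above, the self-certifying name check (the name encodes a hash of the server's key, and the client verifies the key against the name) can be sketched in Python. The details here are assumptions for illustration only: SHA-256 stands in for the hash, the `/sfs/location:hostid` layout is simplified, the key is an arbitrary byte string, and both function names are hypothetical.

```python
import hashlib
import hmac


def make_self_certifying_name(location, public_key):
    """Build a name that embeds a hash of the server's location and public key."""
    host_id = hashlib.sha256(location.encode() + public_key).hexdigest()
    return f"/sfs/{location}:{host_id}"


def server_key_matches_name(name, presented_key):
    """Client-side check: does the key the server presented hash to the name's host ID?"""
    location, host_id = name[len("/sfs/"):].split(":", 1)
    expected = hashlib.sha256(location.encode() + presented_key).hexdigest()
    return hmac.compare_digest(expected, host_id)


real_key = b"example-server-public-key"  # placeholder bytes, not a real key
name = make_self_certifying_name("sfs.example.org", real_key)

print(server_key_matches_name(name, real_key))          # genuine server
print(server_key_matches_name(name, b"attacker-key"))   # impostor
```

The sketch checks only that the key matches the name. A complete protocol would also require the server to prove it holds the matching private key. Because the name itself carries the hash, the client needs no separate key-management step to run this check.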