Notes: telecom is 5 nines, not 7 nines
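To make the "nines" concrete, here is a small worked calculation of allowed downtime per year. It assumes a 365-day year and treats "n nines" as an availability of 1 - 10^-n; the function name is illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines."""
    unavailability = 10.0 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in (3, 5, 7):
    print(f"{n} nines: {downtime_minutes_per_year(n):.4f} minutes/year")
```

Five nines works out to roughly five minutes of downtime a year, while seven nines would allow only about three seconds.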
Microsoft's first major attempt at a high-availability service (not fault tolerant!)
- What kinds of OS infrastructure do you need?
- What kinds of apps / client do you need?
- Big picture: lots of complexity in making clusters work
- This paper is about platform services for clusters
- Definition: a collection of nodes that work in concert to provide a more powerful / more reliable service
o can grow larger than a single node can
o Can be more reliable
o Can be built from less expensive components
o NOTE: FT clusters built from fairly expensive parts
- MS approach: software clusters on commodity hardware
- All persistent state goes on a disk accessible to all cluster members
- All private state (e.g. volatile) must be consistent
- Clients see a single machine
- Support server applications (file servers, databases, web servers, SAP R3 servers)
- Recovery by migrating resources off a failed machine
- Shared nothing
o Dual ported disks, one machine uses it at a time
o Shared disk: multiple machines access disk at same time, use lock manager to negotiate
o Shared memory: e.g. SMP
- Unaware clients: migrate network addresses as well as services
QUESTION: what kind of failures are they targeting? HW? SW? Which SW?
- membership management
o On awaken, nodes:
¤ Check for other alive nodes
¤ If none, form new cluster
¤ If some, join existing cluster
o QUESTION: what makes it hard?
o ANSWER: failures during joining
¤ Partition: may have two halves of cluster, both think they are the only survivors, make independent contradictory decisions
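The "on awaken" decision above can be sketched as follows. This is a minimal illustration, not the real protocol: `on_awaken`, `peers`, and `is_alive` are hypothetical names, and the sketch deliberately ignores the hard part (failures and partitions while joining).

```python
from typing import Callable, Iterable, Optional, Tuple


def on_awaken(
    node: str,
    peers: Iterable[str],
    is_alive: Callable[[str], bool],
) -> Tuple[str, Optional[str]]:
    """Decide whether a starting node joins an existing cluster or forms one.

    Returns ("join", sponsor) if some other node answers, else ("form", None).
    `is_alive` stands in for whatever probe the real system uses.
    """
    for peer in peers:
        if peer != node and is_alive(peer):
            return ("join", peer)  # an existing member sponsors the join
    return ("form", None)  # nobody answered: start a new cluster


print(on_awaken("A", ["A", "B", "C"], lambda p: p == "C"))
print(on_awaken("A", ["A", "B", "C"], lambda p: False))
```

Note why this is unsafe on its own: if the network is partitioned, a node that hears nobody will form a second cluster. That is the contradiction the partition bullet describes.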
- Failure detection
o Heartbeats to services to detect failures
o Migrate services from the failed machine elsewhere (or restart them)
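A minimal sketch of heartbeat-based failure detection, assuming a simple "no heartbeat within a timeout means failed" rule. The class name, the timeout value, and the explicit `now` timestamps are assumptions made to keep the example deterministic; the notes do not give the actual intervals.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat time per service and reports the silent ones."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def beat(self, name: str, now: float) -> None:
        """Record a heartbeat from `name` at time `now` (seconds)."""
        self.last_seen[name] = now

    def failed(self, now: float) -> List[str]:
        """Services whose last heartbeat is older than the timeout."""
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)
monitor.beat("web", now=0.0)
monitor.beat("db", now=0.0)
monitor.beat("web", now=2.0)      # "db" stays silent after t=0
print(monitor.failed(now=4.0))    # candidates to migrate or restart
```

A short timeout detects failures faster but risks declaring a slow node dead, which is one reason the failure model matters.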
- Cluster: group of nodes providing some services
- Resource: functionality offered at a node
o example: printer, IP address, application, server share, web site
- Quorum resource: resource which, if owned, makes you part of the quorum so you can win elections (see membership)
- Resource Dependencies:
o resources may depend on each other (e.g. web site on database).
o Can't restore a resource if dependencies not present
o Tracking dependencies lets you know what to restart, in what order
¤ Like Apple LaunchD, MS Service Controller
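Restart order from a dependency graph is a topological sort. The sketch below uses Python's standard `graphlib`; the resource names follow the examples in these notes, but the specific dependency edges (and the "disk" resource) are made up for illustration.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each resource maps to the set of resources it depends on (hypothetical edges).
DEPENDS_ON: Dict[str, Set[str]] = {
    "web site": {"database", "IP address"},
    "database": {"disk"},
    "IP address": set(),
    "disk": set(),
}


def start_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Order in which resources can be brought online: dependencies first."""
    return list(TopologicalSorter(deps).static_order())


def stop_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Take resources offline in the reverse order: dependents first."""
    return list(reversed(start_order(deps)))


print(start_order(DEPENDS_ON))
print(stop_order(DEPENDS_ON))
```

The relative order of independent resources (here "disk" and "IP address") is not fixed; only the dependency constraints are guaranteed. A cyclic graph raises `graphlib.CycleError`, which is the case where no valid restart order exists.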
- Resource Groups: explicitly named resources treated as a unit
o Simplifies management
- resources have states
- Resources have dependencies
- Resource information maintained in a shared database – replicated everywhere via logs
- Resources implement a generic mgmt interface to allow them to be managed, migrated (start, stop)
o Push: node containing a resource picks a place for it to go, pushes it there (with dependencies).
¤ QUESTION: when does this work?
¤ ANSWER: when node is healthy enough
o Pull: other nodes in cluster pull resources from a failed node
¤ QUESTION: which node gets which resources?
¤ ANSWER: all have same shared info, up to apps to decide
- Client access: via a single network name that is a resource that migrates
o HOW: announce IP address via ARP
- Manages who is a member
o Join: 5 phase protocol
¤ tell everyone else, tell the new node, once it joins, tell everyone the join completed, ack new member
¤ WHY? must handle failures during the protocol.
o Regroup: on node failure to establish new membership
¤ Trickiness: handling node failure, partition
á Must pick 1 partition to be the real one, kill others
á QUESTION: How decide?
o Majority of old membership
o Half members + tie breaker node from original cluster
o 1 node + quorum resource (a disk)
¤ Parts of protocol:
á Test clock tick
á Determine which partition is winner
á Pruning: kill all nodes not connected
á Cleanup 1: notify others, filter requests from dead nodes
á Cleanup 2: second phase of cleanup (so have knowledge of how others have progressed)
o HOW SLOW?
¤ Join < 1.5 seconds for 1-12 nodes
¤ Regroup: 2 seconds
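The three "how decide?" rules for picking the surviving partition can be written as predicates. This is a sketch under assumptions: the notes list the rules but not how the real protocol combines them, so each is shown as a separate policy, and the function names and the notion of a single `quorum_owner` node are hypothetical.

```python
from typing import Set


def wins_by_majority(partition: Set[str], old_members: Set[str]) -> bool:
    """Rule 1: the partition holds a strict majority of the old membership."""
    return 2 * len(partition & old_members) > len(old_members)


def wins_by_tie_breaker(
    partition: Set[str], old_members: Set[str], tie_breaker: str
) -> bool:
    """Rule 2: a majority, or exactly half plus the designated tie-breaker node."""
    size = len(partition & old_members)
    if 2 * size > len(old_members):
        return True
    return 2 * size == len(old_members) and tie_breaker in partition


def wins_by_quorum_resource(partition: Set[str], quorum_owner: str) -> bool:
    """Rule 3: the partition contains the node that owns the quorum disk."""
    return quorum_owner in partition


old = {"A", "B", "C", "D"}
left, right = {"A", "B"}, {"C", "D"}
print(wins_by_majority(left, old), wins_by_majority(right, old))
print(wins_by_tie_breaker(left, old, "A"), wins_by_tie_breaker(right, old, "A"))
print(wins_by_quorum_resource({"D"}, "D"), wins_by_quorum_resource({"A", "B", "C"}, "D"))
```

Each rule lets at most one partition win, which is the point: an even split has no majority, so a tie breaker or a quorum resource is needed. The quorum-resource rule is the only one that lets a single surviving node continue.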
- Global update manager
o For propagating shared state, assure everyone in same state
¤ see state machine approach?
o Goal: atomic broadcast
¤ When a message is sent, either all alive nodes hear messages in the same order, or nobody does
o Approach: lock
¤ Grab a lock from lock node
¤ Update other nodes in order
¤ release lock
¤ On failure: lock migrates to next alive node. If it doesn't have data, nobody does ...
o HOW SLOW?
¤ 32 nodes ~ 6 small updates /sec
¤ under load, 10 nodes -> 2-5 seconds to complete, breaks down with 12 nodes
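The lock-based update scheme can be simulated in a few lines. This is a toy model under stated assumptions, not the actual global update manager: nodes are kept in a fixed order, the first alive node acts as the lock node, and `crash_after` is a hypothetical knob that kills the sender part-way through.

```python
from typing import List, Optional


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True
        self.log: List[str] = []  # updates applied, in order


class Cluster:
    def __init__(self, names: List[str]) -> None:
        self.nodes = [Node(n) for n in names]  # fixed order; first alive = lock node
        self.lock_holder: Optional[str] = None

    def alive(self) -> List[Node]:
        return [n for n in self.nodes if n.alive]

    def update(self, sender: str, data: str, crash_after: Optional[int] = None) -> bool:
        """Grab the lock, update nodes in order, release. Returns False on a crash."""
        assert self.lock_holder is None, "another update is in progress"
        self.lock_holder = sender                       # 1. grab the lock
        for i, node in enumerate(self.alive()):         # 2. update in order
            if crash_after is not None and i == crash_after:
                return False                            # sender died, lock still held
            node.log.append(data)
        self.lock_holder = None                         # 3. release the lock
        return True

    def recover(self, data: str) -> None:
        """The first alive node takes over: finish the update if it has the data."""
        alive = self.alive()
        if alive and data in alive[0].log:
            for node in alive[1:]:
                if data not in node.log:
                    node.log.append(data)
        # Otherwise delivery was in order, so no later node has the data either.
        self.lock_holder = None


cluster = Cluster(["A", "B", "C", "D"])
cluster.update("C", "u1")
cluster.update("C", "u2", crash_after=2)   # only A and B receive u2
cluster.nodes[2].alive = False             # the sender C is gone
cluster.nodes[0].alive = False             # the old lock node A fails too
cluster.recover("u2")
print({n.name: n.log for n in cluster.alive()})
```

Because updates go out in a fixed order starting at the lock node, the new lock node is always at least as up to date as everyone after it. That is what makes the "if it doesn't have data, nobody does" rule safe, and the serial order is also why throughput is low.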
- heartbeat mgmt
- disk drivers: allows having a dual-ported disk
- cluster event logging
- time service
- virtual servers: encapsulate app state that is normally relative to a specific machine and bind it instead to a virtual OS, so it can be migrated. E.g. computer name, address, registry, endpoints of other services
- SQL server
o Failover at machine level, not db level
¤ Does not need identical machines (the cluster handles that), so app settings can migrate
¤ Handles all protocols (e.g. replication), not just the client access protocol (ODBC)
- Isolation (to a single machine)
- Fail fast (detect failure quickly with heartbeats)
- Fast recovery (restart on separate machine)
- Persistent shared state (on disk only)
- only scales to two nodes
- What kind of performance do you get? What if you have a failure?