MYNOTE, 2nd revision
====================
Good notes here: http://www-cs-students.stanford.edu/~dbfaria/quals/summaries/Kronenberg-1986.txt

GOALS
- highly available
- scalable

Locking, quorum, etc. all require coordination among the nodes.
All nodes have equal responsibility; there is no centralized node, hence a
failure of one node need not affect the whole cluster.

WHERE IS THE RELIABILITY?
- well, it is a layered design, from hardware support up through software
- both hardware and software implement mechanisms to deal with / mask failures
- Hardware:
  + dual-path CI (Computer Interconnect)
  + the CI port supports reliable message and block transfer
  + mass storage controllers (similar to NAS: network-attached storage)
    ~ easy sharing, using message passing
    ~ node failures do not affect the storage
    ~ dual-ported disks provide redundancy in case of failure
- Software:
  + connection managers coordinate the cluster
    ~ transparently handle the cluster transition when a node leaves/joins
    ~ prevent network partitions by using a quorum voting scheme
  + lock managers:
    ~ transparently deal with a node failure

WHAT IS SIMILAR TO V-KERNEL/RPC?
- The communication support: datagrams, messages, and block transfer
- datagrams:
  + unreliable, cheap, fast
  + used for status and information messages whose loss is not critical
- messages:
  + small, but reliable (no duplication, no out-of-order delivery)
  + used to carry requests to a service, for example disk reads and writes
- block data:
  + can be large (4 KB)
  + reliable transfer
  + the CI port can copy block data directly from/to process virtual memory
    ~ avoids OS-level copying (as in message transfer)
  + optimized for large data transfers
  + used to carry the results of disk read/write requests
- The point is: different types of transfer can use different support
  + for example, the disk class driver uses messages to transfer requests, and
    block data transfers to move the actual data

INTERESTING PART: THE DISTRIBUTED LOCK MANAGER
- goals:
  + distribute the load
    ~ every node has a lock manager
  + minimize message traffic
- operations: acquire/release a lock
- Resources can form a tree
- Every resource has a master node responsible for granting locks on it
- The node on which the lock request for the root resource is first made becomes
  the master of the whole tree
- What needs to be stored?
  + for every resource: a resource lock description
    ~ the list of granted locks
    ~ a queue of waiting requests for that resource
  + a resource lock directory: stores the mapping (resource name -> master node)
    ~ distributed over the nodes
    ~ plus some lookup mechanism: given a resource name, find its directory node
- What happens when a node wants a lock on a resource R? (see the sketch after
  this section)
  + find the directory node D holding the mapping for R
  + the lock manager sends a message to D, asking for the lock
    ~ if D is the master node for R, D performs the lock request and confirms
    ~ if D is not the master but R is defined, D returns a response containing
      the master's ID; the lock manager then goes ahead and contacts the master
    ~ if D finds R to be undefined, it returns a response telling the requesting
      node to become the master of the resource itself (cool)
  + SCALE: the number of messages is independent of the number of nodes in the
    cluster
NOTE: a resource becomes defined to the lock manager when some node first
acquires a lock on it ...
NOTE: this is better than a centralized approach
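To make the request path above concrete, here is a minimal Python sketch of it.
The `Cluster` class, the hash-based choice of directory node, and the message
counts are my own illustration under those assumptions, not the actual VMS lock
manager interface.

```python
# Toy model of the lock-request path sketched above. The class/function names
# and the hash-based directory mapping are assumptions for illustration only.

class Cluster:
    def __init__(self, nodes):
        self.nodes = sorted(nodes)   # cluster members, e.g. ["A", "B", "C"]
        self.directory = {}          # resource name -> master node (in the real
                                     # system this table is split across nodes)

    def directory_node(self, resource):
        """Deterministically map a resource name to the node that holds its
        directory entry (assumed here to be a simple hash)."""
        return self.nodes[hash(resource) % len(self.nodes)]

    def request_lock(self, requester, resource):
        """Return (master_node, number_of_messages) for one lock request.
        Note the message count never depends on the cluster size."""
        d = self.directory_node(resource)      # message: requester -> directory
        master = self.directory.get(resource)
        if master is None:
            # Resource undefined: the requester is told to become the master.
            self.directory[resource] = requester
            return requester, 2                # request + "you are the master"
        if master == d:
            # Directory node is also the master: it grants the lock directly.
            return d, 2                        # request + grant
        # Directory node only knows the master: one extra round trip.
        return master, 4                       # request + referral + request + grant


if __name__ == "__main__":
    c = Cluster(["A", "B", "C"])
    print(c.request_lock("A", "disk1/file.dat"))   # first request defines the resource
    print(c.request_lock("B", "disk1/file.dat"))   # later requests go via the directory
```

Whatever the cluster size, a request costs at most a few messages, which is the
SCALE point above.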
DATA CACHING?
- a 16-byte *value block* associated with each resource supports caching
- the value block can act as a version number for the resource
- releasing the lock on a resource increments the associated version number
- How does it work? (see the sketch after the next section)
  + a process caches the data together with its version
  + to modify the data, the process locks it and receives the current version
  + the process compares that version with the one it cached
  + if they match, the cached copy is still valid and the process can use it

What is DEFERRED WRITEBACK?
- another optimization for data caching
- a process holding the lock on a data item can be notified when another process
  blocks on that lock
  + so the current holder can defer releasing the lock (and writing back) until
    someone actually needs the data
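A minimal Python sketch of the value-block version check described under DATA
CACHING. The `LockManager`/`CachingClient` names and their APIs are assumptions
for this note (in the real system the value block travels with the lock grant);
here the version is bumped only when the holder actually modified the data.

```python
# Illustrative sketch of version-number caching via a per-resource value block.
# The LockManager/CachingClient interfaces are assumptions, not the VMS APIs.

class LockManager:
    def __init__(self):
        self.versions = {}            # resource name -> value block (version)

    def acquire(self, resource):
        """Grant the lock and return the resource's current version."""
        return self.versions.get(resource, 0)

    def release(self, resource, modified):
        """Release the lock; bump the version if the data was modified."""
        if modified:
            self.versions[resource] = self.versions.get(resource, 0) + 1


class CachingClient:
    def __init__(self, lock_mgr, storage):
        self.lock_mgr = lock_mgr
        self.storage = storage        # resource name -> authoritative data
        self.cache = {}               # resource name -> (version, data)

    def read(self, resource):
        version = self.lock_mgr.acquire(resource)
        cached = self.cache.get(resource)
        if cached and cached[0] == version:
            data = cached[1]          # cache hit: versions match, skip the read
        else:
            data = self.storage[resource]
            self.cache[resource] = (version, data)
        self.lock_mgr.release(resource, modified=False)
        return data

    def write(self, resource, data):
        version = self.lock_mgr.acquire(resource)
        self.storage[resource] = data
        self.cache[resource] = (version + 1, data)
        self.lock_mgr.release(resource, modified=True)


if __name__ == "__main__":
    mgr = LockManager()
    storage = {"R": "v0"}
    alice, bob = CachingClient(mgr, storage), CachingClient(mgr, storage)
    print(alice.read("R"))      # miss: fetches "v0" from storage
    print(alice.read("R"))      # hit: version unchanged, served from cache
    bob.write("R", "v1")        # bumps the version on release
    print(alice.read("R"))      # miss again: version mismatch forces a re-read
```

A reader whose cached version still matches the lock manager's version skips the
storage read entirely; any write bumps the version, so a stale cache is detected
on the next lock acquisition.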
LOCK RECOVERY ON NODE FAILURE?
- the connection manager tells the lock managers about the cluster transition
- each lock manager recovers, and all lock managers must complete recovery before
  cluster operation can continue
- Steps:
  + a lock manager deallocates all locks acquired on behalf of other nodes
    ~ but retains its own local lock and resource info
    ~ at this point there are no resource masters or directory nodes
  + each lock manager (on behalf of its host node) reacquires EACH lock it held
    when the cluster transition began
    • this establishes new directory nodes and re-arranges the assignment of
      master nodes (effectively in random order)
- OK, so if a node left ==> NET result: all locks held by that node are released
- if a new node joins:
  • this recovery is not strictly necessary
  • but it helps redistribute the directory and lock-mastering overhead

BUT IS IT WORTH IT? NO, I think this is actually bad. Why? Every node join/leave
triggers a recovery action that recalculates the directory and mastering
databases. Moreover, every node has to hold on to its locking info across the
transition. That is a lot of message exchange, and it gets worse as the number
of nodes in the system grows. I don't think this is going to scale.

DOES STAR TOPOLOGY MATTER?

WHAT IS A NULL LOCK?
- well, the initial creation of a lock is expensive:
  + establish a directory entry
  + create a master
  + a bunch of messages go back and forth
- Hence the idea: create the lock before you actually use it. You pay the setup
  overhead up front, and later, when you really acquire it, it is faster because
  the database for that lock has already been created
- INTUITION: locking is expensive

WHAT IS GOOD ABOUT THE LOCKING PROTOCOL?
- Automatic; the upper levels are not aware of it
- Auto-adjusts when a node leaves/joins
  + but may be too eager
  + for example, if the leaving node held no locks (how would you know?) there is
    no need to recover
  + when a new node joins, the recovery only redistributes load; what if the
    redistribution gain is smaller than the overhead paid for the recovery?
- Fixed number of messages in the locking protocol --> scales
  + I think this is a small point ...

**VAXcluster: a closely-coupled distributed system**
====================================================
+ Ask Mike about this: how does message passing relate to the VAXcluster? Compare
  to NFS, AFS? How does this handle large data transfers? What about caching in
  the file system?
  > because the CI port copies data directly into the process's VM ...

# 0. Take away:
- how to make a closely-coupled distributed system fast:
  + star topology
  + efficient hardware architecture
  + large-size message/block transfers
  + deal with the problem of locking
  + reduce the number of copies
    > the CI port copies into process VM directly
    > no interruption to the host (need to revise this note)

# 1. Introduction
- tightly-coupled architecture:
  + processors close together
  + shared memory (high bandwidth)
  + single copy of the OS
- loosely-coupled architecture:
  + processors physically separated
  + low-bandwidth, message-oriented interprocessor communication
  + independent OSs
- This paper: a *closely-coupled* structure, with features of both:
  + separate processors
  + message-oriented interconnect among memories
  + separate copies of a distributed OS
- Single security domain
- Shared disk storage
- High-speed memory-to-memory block transfer between nodes
- Question: how do you design such a system to make it fast?
  + try to overcome software and communication overhead
  + specialized communication hardware architecture
  + efficient software

# 2. Hardware structure
- star topology:
  + nodes can be added and removed safely
  + longer operating distance
- CI: Computer Interconnect
  + arbitration, path selection, data transmission
  + CSMA
  + each node has a specific delay time
  + two delay times (short & long) for fairness
  + an ack is sent *without* rearbitrating
  + the sender retries if no ack arrives

# 3. CI PORT
- connects a host/storage controller to the CI
- supports: datagram, message, and block data transfer
- the transfer protocol is optimized to reduce interrupts to the VAX host
  + large amounts of data move per transfer (in the case of block data transfer)
  + a confirmation packet is generated only when a failure occurs

# 4. Mass storage control
- message-oriented:
  + easy sharing (how?)
    ~ just send messages (similar to NAS)
  + ease of extension to new devices:
    > a single disk class driver, independent of drive specifics
    > new disks/tapes can be added with little or no modification to host software
  + improved performance:
    > the queue of requests creates opportunities to optimize

# 5. VAXcluster Software
- Connection manager: coordinates the cluster
  + maintains the list of all members
  + prevents partitioning using quorum voting (see the sketch at the end of these
    notes)
    > each node has some number of votes
    > the connection manager computes the total votes of all member nodes; if this
      total is less than a quorum value, cluster activity is suspended
- Cluster-wide shared file system:
  + a cluster filename = disk device name : directory : file name
  + synchronized using the lock manager
- Distributed lock manager:
  + must be efficient? How?
  + bypasses cluster-specific overhead in the single-node case
  + uses null locks at the beginning to pay the lock-creation overhead early
  + supports hierarchical locks
  + distributed among the nodes (how? see the paper)
    > lock description: data can be cached, implemented using a version number or
      a callback (i.e., a process requesting an exclusive lock can specify that it
      should be notified when another lock request on the resource blocks)
    > lock directory: the mapping between a resource and its master node
      (responsible for granting locks on that resource)
  + on failure of a node:
    > all lock managers must perform a recovery action before cluster operation
      can resume: 1) deallocate all locks, 2) reacquire each lock held prior to
      the failure
    ==> locks held by the failed node are released
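To close, a small Python sketch of the quorum rule from the connection manager
notes above. The vote assignments, the quorum value, and the function name are
illustrative assumptions; the paper's actual vote/quorum configuration details
are not reproduced here.

```python
# Toy illustration of the connection manager's quorum voting rule. Vote counts
# and the quorum value below are assumptions for illustration only.

def cluster_has_quorum(member_votes, quorum):
    """Return True if the members this node can currently reach hold enough
    votes for the cluster to keep operating.

    member_votes: dict of node name -> votes, for the reachable members.
    quorum:       minimum total votes required to continue cluster activity.
    """
    return sum(member_votes.values()) >= quorum


if __name__ == "__main__":
    votes = {"A": 1, "B": 1, "C": 1}
    quorum = 2                                       # a majority of the 3 votes

    # Full cluster: 3 votes >= 2, so activity continues.
    print(cluster_has_quorum(votes, quorum))                     # True

    # A partition that can only see node A holds 1 vote < 2, so it suspends
    # cluster activity rather than running as a second, divergent cluster.
    print(cluster_has_quorum({"A": votes["A"]}, quorum))         # False
```

Since at most one partition can hold a majority of the total votes, two disjoint
partitions can never both stay active, which is how the scheme prevents a
partitioned cluster from diverging.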