(3.1.2) Algorithms for scalable synchronization

John M. Mellor-Crummey, Michael L. Scott: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Trans. Comput. Syst. 9(1): 21-65 (1991). ACM DL link

review from 757

Modern processors include hardware support, atomic instructions in particular, to implement synchronization correctly in shared-memory systems. As the number of processors contending for a data element increases, so do wait times and network traffic, degrading performance. The paper first discusses spin locks, typically used by busy-waiting programs. At a bare minimum, a requesting thread can test_and_set a lock before accessing the shared memory. However, this does not scale: every test_and_set is an expensive read-modify-write operation. To reduce these operations, the lock can be read first and test_and_set attempted only when it appears free (test-and-test-and-set). Since no request order is maintained, this method can still lead to starvation. Instead, counters of acquisition requests and releases can be compared to implement a "ticket" lock, which grants the lock in request order. Backoff now becomes a problem: a waiting thread does not know when to re-check availability, and a delayed check directly makes later threads wait. Array-based queuing locks maintain a FIFO of requests, with each waiter spinning on its own array slot; this eliminates starvation (assuming the required FIFO depth is available), but the statically sized, centrally allocated array can still cause network traffic. The authors propose the new "MCS" lock, a linked-list queue in which each processor spins on a flag in its own locally allocated queue node. On release, however, a processor may need to spin briefly until its successor finishes linking itself into the queue before it can hand the lock off.
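The MCS queue-lock idea above can be sketched with C11 atomics. This is my illustrative reconstruction, not the paper's pseudocode; names such as `mcs_node` and `run_demo` are mine. Each waiter spins only on the `locked` flag in its own node, so contention never concentrates on a single shared word.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    struct mcs_node *_Atomic next; // successor in the queue, if any
    atomic_bool locked;            // this waiter's local spin flag
} mcs_node;

typedef struct { mcs_node *_Atomic tail; } mcs_lock;

static void mcs_acquire(mcs_lock *lk, mcs_node *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    mcs_node *pred = atomic_exchange(&lk->tail, me); // enqueue at tail
    if (pred) {
        atomic_store(&pred->next, me);     // link behind predecessor
        while (atomic_load(&me->locked)) ; // spin on OUR node only
    }
}

static void mcs_release(mcs_lock *lk, mcs_node *me) {
    mcs_node *succ = atomic_load(&me->next);
    if (!succ) {
        mcs_node *expected = me;
        // No visible successor: try to swing the tail back to empty.
        if (atomic_compare_exchange_strong(&lk->tail, &expected, NULL))
            return;
        // A successor swapped the tail but has not linked in yet: wait.
        while (!(succ = atomic_load(&me->next))) ;
    }
    atomic_store(&succ->locked, false); // hand off by clearing its flag
}

/* Demo: several threads increment a shared counter under the lock. */
enum { NTHREADS = 4, NITERS = 10000 };
static mcs_lock demo_lock;
static long demo_counter;

static void *demo_worker(void *arg) {
    (void)arg;
    mcs_node me; // queue node lives in the waiter's own (local) memory
    for (int i = 0; i < NITERS; i++) {
        mcs_acquire(&demo_lock, &me);
        demo_counter++;
        mcs_release(&demo_lock, &me);
    }
    return NULL;
}

long run_demo(void) {
    pthread_t t[NTHREADS];
    demo_counter = 0;
    atomic_store(&demo_lock.tail, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, demo_worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return demo_counter;
}
```

The release-time spin in `mcs_release` is exactly the window the review mentions: a successor has already swapped itself into the tail but has not yet written the predecessor's `next` pointer.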



Barriers are essential synchronization constructs: they ensure that threads proceed only after all threads have arrived. The location or data structure that maintains barrier state is shared. The most obvious design is a counter that each arriving thread increments; the last thread to arrive resets it, signaling everyone to go ahead. This is not scalable, since many processors hammering a single counter overload it. A combining-tree structure ensures that each leaf is updated by only a small number of processors. Dissemination barriers work by having processors signal each other about their arrival in rounds, while tournament barriers forward a "winner" processor up at every node. The authors propose a new tree-based barrier in which each processor spins on a locally stored flag that is set by the winning processor.
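The counter-based barrier described above is usually written with "sense reversal" so the same counter can be reused across episodes. The sketch below is mine, not the paper's code; the paper's tree barrier improves on this by replacing the single shared `count` and `sense` words with per-processor local flags.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    atomic_int count; // threads that have not yet arrived this episode
    atomic_int sense; // global sense, flipped by the last arrival
    int nthreads;
} barrier_t;

// Each thread carries a private local_sense, flipped once per episode.
static void barrier_wait(barrier_t *b, int *local_sense) {
    *local_sense = !*local_sense;
    if (atomic_fetch_sub(&b->count, 1) == 1) {
        // Last arrival: reset the counter and release everyone.
        atomic_store(&b->count, b->nthreads);
        atomic_store(&b->sense, *local_sense);
    } else {
        while (atomic_load(&b->sense) != *local_sense) ; // busy-wait
    }
}

/* Demo: every thread publishes a value, then sums all values after the
 * barrier; each thread must observe the complete set. */
enum { N = 4 };
static barrier_t bar;
static int vals[N];
static long sums[N];

static void *worker(void *arg) {
    int id = (int)(intptr_t)arg;
    int local_sense = 0;
    vals[id] = id + 1;
    barrier_wait(&bar, &local_sense);
    long s = 0;
    for (int i = 0; i < N; i++) s += vals[i];
    sums[id] = s;
    return NULL;
}

int run_demo(void) {
    pthread_t t[N];
    atomic_store(&bar.count, N);
    atomic_store(&bar.sense, 0);
    bar.nthreads = N;
    for (int i = 0; i < N; i++)
        pthread_create(&t[i], NULL, worker, (void *)(intptr_t)i);
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    for (int i = 0; i < N; i++)
        if (sums[i] != 1 + 2 + 3 + 4) return 0;
    return 1;
}
```

With one shared `count` word, every arrival is a remote read-modify-write, which is precisely the scalability problem the tree-based alternatives address.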



The authors measure the performance of the spin-lock and barrier algorithms on two multiprocessors: the BBN Butterfly, a distributed-shared-memory machine, and the Sequent Symmetry, a bus-based cache-coherent machine. The time taken for each lock acquisition or barrier episode is recorded, averaged, and reported against the number of contending processors. Beyond the new MCS lock and barrier themselves, the paper excels in the evaluation criteria it defines.



When locks are cached locally, heavy network traffic occurs on every change in order to maintain coherence. Also, all of these algorithms are built on top of atomic instructions (notable exception: Lamport's bakery algorithm), which implies strict in-order completion of loads and stores. Lock behavior when precise interrupts or exceptions occur must also be defined. This comprehensive paper traces the evolution of ideas in locks and barriers up to 1991. The criteria and flowchart the authors suggest for choosing an approach under given constraints are its key contribution.