** Locks (continued) **

To go beyond Peterson's algorithm and to build a working lock, we will need some help from our old friend, the hardware. Over the years, a number of different hardware primitives have been added to the instruction sets of various computer architectures; while we won't study how these instructions are implemented (that, after all, is the topic of a computer architecture class), we will study how to use them in order to build a mutual exclusion primitive like a lock.

[TEST AND SET]

Some architectures provide a form of what is called "test and set" in a single atomically-executed instruction. You can think of this instruction like this:

    testandset address, rnew, rold

What the instruction does is as follows. It loads the old value at "address" into register "rold", and simultaneously takes the value in register "rnew" and stores it into memory at "address". The key is that these two things are performed atomically. We can use such a primitive to build a simple lock that works, like this:

--------------------------------------------------------------------------------
int flag; // the lock variable, shared by all threads

void init() {
    // 0 indicates that lock is available, 1 that it is held by a thread
    flag = 0;
}

void lock() {
    while (TestAndSet(&flag, 1) == 1)
        ; // spin-wait (do nothing)
}

void unlock() {
    flag = 0;
}
--------------------------------------------------------------------------------
[FIGURE: SIMPLE LOCK WITH TEST-AND-SET INSTRUCTION]

In the figure above, we assume that there is a C wrapper around this instruction that looks something like this:

    int TestAndSet(void *address, int newValue)

and it thus puts newValue in register "rnew" from above and returns the value that ends up in register "rold".

Let's make sure we understand why this works. Imagine first the case where a thread calls lock() and no other thread currently holds the lock; thus, flag should be 0. When the thread then calls TestAndSet(&flag, 1), the routine will return the old value of the flag, which is 0; thus, the calling thread, which is *testing* the value of flag, will not get caught spinning in the while loop and will acquire the lock. The thread will also atomically *set* the value to 1, thus indicating that the lock is now held. When the thread is finished with its critical section, it simply calls unlock() to set the flag back to zero.

The second case we can imagine arises when one thread already has the lock held (i.e., flag = 1). In this case, a second thread will call lock() and then call TestAndSet(&flag, 1) as well. This time, however, TestAndSet() will return the old value at flag, which is 1 (because the lock is held), while simultaneously setting it to 1 again.
While the lock is held by another thread, TestAndSet() will repeatedly return 1, and thus this thread will spin and spin until the lock is finally released. When the flag is finally set to 0 by some other thread, this thread will call TestAndSet() again, which will finally return 0 while atomically setting the value to 1 and thus acquire the lock.

Thus, by making both the *test* and *set* of the lock one atomic operation, we can ensure that only one thread acquires the lock, and thus we can build a successful mutual exclusion primitive.

[SO MUCH SPINNING, MY HEAD HURTS]

Our test-and-set lock is simple and it works, which are two excellent properties of any system or code. However, in some cases, it can be quite inefficient. Imagine you are running two threads on a single processor (you can imagine similar scenarios with N threads on M processors, where N > M). Now imagine that one thread (thread 0) is in a critical section and thus has a lock held, and unfortunately gets interrupted. The second thread (thread 1) now tries to acquire the lock, but finds that it is held. Thus, it begins to spin. And spin. Then it spins some more. And finally, a timer interrupt goes off, thread 0 is run again and releases the lock, and finally (the next time it runs, say), thread 1 won't have to spin so much and will be able to acquire the lock. Thus, any time a thread gets caught spinning in a situation like this, it wastes an entire *time slice* doing nothing but checking a value that isn't going to change!

Also, when there are many threads spinning, we have no control over which thread acquires the lock. As you might guess, this could lead to starvation of a given thread, as other threads might repeatedly grab the lock and thus prevent the progress of others. A good solution will not allow starvation to occur.
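
The test-and-set lock above assumes that a TestAndSet() wrapper exists. As a minimal sketch, assuming a C11 compiler that provides <stdatomic.h>, the wrapper could be written with atomic_exchange(), which stores a new value and returns the old one as a single atomic step. The names spinlock_t, spin_init(), spin_lock(), and spin_unlock() are made up for this illustration and do not come from these notes; this is one possible realization, not a description of how any particular system does it.

```c
#include <assert.h>
#include <stdatomic.h>

// Hypothetical realization of the TestAndSet() wrapper with C11 atomics:
// atomically store new_value at *address and return the value that was there.
int TestAndSet(atomic_int *address, int new_value) {
    return atomic_exchange(address, new_value);
}

// Illustrative lock type: flag is 0 when available, 1 when held.
typedef struct {
    atomic_int flag;
} spinlock_t;

void spin_init(spinlock_t *lk) {
    atomic_init(&lk->flag, 0);
}

void spin_lock(spinlock_t *lk) {
    while (TestAndSet(&lk->flag, 1) == 1)
        ; // spin-wait until the old value was 0
}

void spin_unlock(spinlock_t *lk) {
    // Releasing a lock that is not held would be a bug in the caller.
    assert(atomic_load(&lk->flag) == 1);
    atomic_store(&lk->flag, 0);
}
```

Note that this sketch still spin-waits, so it inherits both problems discussed in this section: wasted time slices and no control over which waiting thread gets the lock next.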

[THE CRUX OF THE PROBLEM]

Thus, we have a problem: how can we develop a lock that doesn't needlessly waste time spinning on the CPU, and also prevents starvation from occurring?

[A SIMPLE APPROACH: JUST YIELD]

Our first try is a simple and friendly approach: when you are going to spin, instead give up the CPU to another thread. Or, as Al Davis might say, just yield, baby. Here is the code for this approach:

--------------------------------------------------------------------------------
void init() {
    flag = 0;
}

void lock() {
    while (TestAndSet(&flag, 1) == 1)
        yield(); // give up the CPU
}

void unlock() {
    flag = 0;
}
--------------------------------------------------------------------------------
[FIGURE: LOCK WITH TEST-AND-SET INSTRUCTION AND YIELD]

In this approach, we assume an operating system primitive *yield()* which a thread can call when it wants to give up the CPU and let another thread run. Because a thread can be in one of three states (running, ready, or blocked), you can think of this as an OS system call that moves the caller from the *running* state to the *ready* state, and thus promotes another thread to running.

Think about the example with two threads on one CPU; in this case, our yield-based approach works quite well. If a thread happens to call lock() and find a lock held, it will simply yield the CPU, and thus the other thread will run and finish its critical section.

Let us now consider the case where there are many threads (say 100) contending for a lock repeatedly. In this case, if one thread acquires the lock and is preempted before releasing it, the other 99 will each call lock(), find the lock held, and yield the CPU. Assuming some kind of round-robin scheduler, each of the 99 will execute this run-and-yield pattern before the thread holding the lock gets to run again.
While better than our spinning approach (which would waste 99 time slices spinning), this approach is still costly; the cost of a context switch can be substantial, and there is thus plenty of waste.

Worse, we have not tackled the starvation problem at all. A thread may get caught in an endless yield loop while other threads repeatedly enter and exit the critical section. We clearly will need an approach that addresses this problem directly.

[QUEUED LOCKS: CONTROLLING WHO GETS THE LOCK EXPLICITLY]

The real problem with our previous approaches is that they leave too much to chance. The scheduler determines which thread runs next; if the scheduler makes a bad choice, a thread runs that must either spin waiting for the lock (our first approach), or yield the CPU immediately (our second approach). Either way, there is potential for waste and no prevention of starvation.

Thus, we must explicitly exert some control over who gets to acquire the lock next after the current holder releases it. To do this, we will need a little more OS support, as well as a queue to keep track of which threads are waiting to acquire the lock. Let's look at the code:

--------------------------------------------------------------------------------
typedef struct __mutex_t {
    int flag;
    int guard;
    queue_t *q;
} mutex_t;

void lock_init(mutex_t *m) {
    m->flag = 0;
    m->guard = 0;
    queue_init(m->q); // assumes m->q already points to queue storage
}

void lock_acquire(mutex_t *m) {
    while (TestAndSet(&m->guard, 1) == 1)
        ; // acquire guard lock by spinning
    if (m->flag == 0) {
        m->flag = 1; // lock is acquired
        m->guard = 0;
    } else {
        queue_add(m->q, gettid());
        m->guard = 0;
        yield();
    }
}

void lock_release(mutex_t *m) {
    while (TestAndSet(&m->guard, 1) == 1)
        ; // acquire guard lock by spinning
    if (queue_empty(m->q))
        m->flag = 0; // let go of lock; no one wants it
    else
        wakeup(queue_remove(m->q)); // hold lock (for next thread!)
    m->guard = 0;
}
--------------------------------------------------------------------------------
[FIGURE: LOCK WITH QUEUES, TEST-AND-SET, YIELD, AND WAKEUP]

We do a couple of things in this example. First, we are showing how to build a more realistic mutex implementation, in that we actually have a *mutex_t* data type (which gets passed to all of the lock routines). Second, we combine the old test-and-set idea with an explicit queue of lock waiters to make a more efficient lock. Third, we use a queue to help control who gets the lock next and thus avoid starvation.

You might notice how the guard is used, basically as a spin-lock around the flag and queue manipulations the lock is using. This approach doesn't avoid spin-waiting entirely; a thread might be interrupted while acquiring or releasing the lock, and thus cause other threads to spin-wait for this one to run again. However, the time spent spinning is quite limited (just a few instructions inside the lock and unlock code, rather than a whole critical section), and thus this approach may be a reasonable one.

You might also notice that in lock_acquire(), when a thread cannot acquire the lock (it is already held), we are careful to add ourselves to a queue (by calling the *gettid()* call to get the thread ID of the current thread), set guard to 0, and yield the CPU. A question for the reader: What would happen if the release of the guard lock came *after* the yield(), and not before?

Finally, you might notice the interesting fact that the flag does not get set back to 0 when another thread gets woken up. Why is this? Well, it is not an error, but rather a necessity! When a thread is woken up, it will be as if it is returning from yield(); however, it does not hold the guard at that point in the code and thus cannot even try to set the flag to 1. Thus, we just pass the lock directly from the thread releasing the lock to the next thread acquiring it; flag is not set to 0 in-between.

You could probably work through some more examples to convince yourself that this approach works.
I could do that too, but am out of writing steam.

[MORE REALISTIC LOCKS: THE REAL WORLD]

The above approach shows how real locks are built these days: some hardware support (in the form of a more powerful instruction) plus some operating system support (in this case, in the form of yield() and wakeup() primitives). In today's systems, this basic approach is followed, but not surprisingly, some of the details differ.

As for hardware primitives, the x86 platform provides something similar to test-and-set, known as *compare and exchange*, and on SPARCv9, there is a *compare-and-swap* instruction. A *fetch-and-add* is also available on x86 (the *xadd* instruction), and on MIPS and the DEC Alpha, there is a pair of instructions known as *load linked* (*load locked* on the Alpha) and *store conditional*. All of these can be used to build locks. Look up how these work, and see if you can figure out how to build some simple locks with these primitives.

As for operating system support, there are some interesting alternatives out there. For example, in Linux, there is a single system call *futex()* which is used both to put a waiting thread to sleep and to wake up sleeping threads.

(some more details could go here)
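
As a starting point for the exercise of building locks from the other hardware primitives, here is a minimal sketch of a spin lock built on compare-and-swap, assuming C11 atomics (<stdatomic.h>). The CompareAndSwap() wrapper and the cas_lock_* names are hypothetical, chosen to mirror TestAndSet() from earlier in these notes; atomic_compare_exchange_strong() is the standard C11 call underneath.

```c
#include <assert.h>
#include <stdatomic.h>

// Hypothetical wrapper: if *ptr equals expected, store new_value there.
// Either way, return the value that was at *ptr before the call.
int CompareAndSwap(atomic_int *ptr, int expected, int new_value) {
    // On failure, C11 writes the current value of *ptr into expected;
    // on success, expected is left alone and already equals the old value.
    atomic_compare_exchange_strong(ptr, &expected, new_value);
    return expected;
}

// Illustrative lock type: flag is 0 when available, 1 when held.
typedef struct {
    atomic_int flag;
} cas_lock_t;

void cas_lock_init(cas_lock_t *lk) {
    atomic_init(&lk->flag, 0);
}

void cas_lock(cas_lock_t *lk) {
    // Try to change flag from 0 to 1; an old value of 1 means it is held.
    while (CompareAndSwap(&lk->flag, 0, 1) == 1)
        ; // spin-wait
}

void cas_unlock(cas_lock_t *lk) {
    // Releasing a lock that is not held would be a bug in the caller.
    assert(atomic_load(&lk->flag) == 1);
    atomic_store(&lk->flag, 0);
}
```

This lock behaves like the test-and-set version, except that a waiting thread writes to the flag only when it finds it at 0. It still spins and still offers no fairness, so the queue-based ideas above apply to it equally.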