** Threads: Intro **

Thus far, we have seen the development of the basic abstractions that the OS provides. We have seen how to take a single physical CPU and turn it into multiple *virtual CPUs*, thus enabling the illusion of multiple programs running at the same time. We have also seen how to create the illusion of a large, private *virtual memory* for each process; this abstraction of the *address space* enables each program to behave as if it has its own memory when indeed the OS is secretly multiplexing address spaces across physical memory (and sometimes, across the disk). Finally, we have seen the abstractions of the file and the directory, which virtualize the disk to give users a way to organize and share data that they want to keep persistent.

In this note, we introduce a new abstraction for a single running process: that of a *thread*. Instead of our classic view of a single point of execution within a program (i.e., a single PC from which instructions are being fetched and executed), a *multi-threaded* program has more than one point of execution (i.e., multiple PCs, each of which is being fetched from and executed). Perhaps another way to think of this is that each thread is very much like a separate process, except for one major difference: threads *share* the same address space and thus can access the same data.

The state of a single thread is thus very similar to that of a process. A thread has a program counter (PC) that tracks where the program is fetching instructions from. Each thread also has its own private set of registers it uses for computation; thus, if two threads are running on a single processor, when switching from running one (T1) to running the other (T2), a *context switch* must take place. The context switch between threads is quite similar to the context switch between processes: the register state of T1 must be saved and the register state of T2 restored before running T2. With processes, we saved state to a *process control block* (PCB); now, we'll need one or more *thread control blocks* (TCBs) to store the state of each thread of a process. There is one major difference, though, in the context switch we perform between threads as compared to processes: the address space remains the same (i.e., there is no need to switch which page table we are using).

One other major difference between threads and processes concerns the stack. In our simple model of the address space of a classic process (which we can now call a *single-threaded* process), there is a single stack, perhaps residing at the bottom of the address space:

|---------------------|
|    program code     |
|                     |
|---------------------|
|        heap         |
|                     |
|---------------------|
|         |           |
|         v           |
|                     |
|        free         |
|                     |
|         ^           |
|         |           |
|---------------------|
|        stack        |
|                     |
|---------------------|

[FIGURE: SINGLE-THREADED ADDRESS SPACE]

However, in a multi-threaded process, each thread runs independently and of course may call into various routines to do whatever work it is doing. Thus, instead of a single stack within the address space, there will be one for each thread.
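Before we look at how those stacks are laid out, let's make the TCB a bit more concrete. Here is a minimal sketch of what a TCB might contain; the field names and layout are purely illustrative (real systems differ in the details), but any TCB must record enough state to stop a thread and resume it later:

------------------------------------------------------------------------------
// A sketch of a thread control block (TCB); fields are illustrative only.
typedef struct __tcb_t {
    int tid;                 // thread identifier
    int state;               // e.g., RUNNING, READY, or BLOCKED
    void *stack;             // base of this thread's stack
    // register state, saved here on a context switch:
    unsigned long pc;        // program counter: where to resume fetching
    unsigned long sp;        // stack pointer: top of this thread's stack
    unsigned long regs[16];  // general-purpose regs (count is arch-specific)
} tcb_t;
------------------------------------------------------------------------------

Note the stack pointer in particular: because each thread has its own stack, it is per-thread state, not per-process state.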
Let's say we have a multi-threaded process that has four threads in it. In such a case, our address space might look like this:

|---------------------|
|    program code     |
|                     |
|---------------------|
|        heap         |
|                     |
|---------------------|
|         |           |
|         v           |
|        free         |
|         ^           |
|         |           |
|---------------------|
|       stack 1       |
|                     |
|---------------------|
|        free         |
|         ^           |
|         |           |
|---------------------|
|       stack 2       |
|                     |
|---------------------|
|        free         |
|         ^           |
|         |           |
|---------------------|
|       stack 3       |
|                     |
|---------------------|
|        free         |
|         ^           |
|         |           |
|---------------------|
|       stack 0       |
|                     |
|---------------------|

[FIGURE: MULTI-THREADED ADDRESS SPACE]

In this figure, you can see four stacks (0, 1, 2, and 3) spread throughout the address space of the process. Thus, any stack-allocated variables, parameters, return values, and other things that we put on the stack will be placed in what is sometimes called *thread-local* storage, i.e., the stack of the relevant thread.

[AN EXAMPLE: THREAD CREATION]

Let's say we wanted to run a program that creates two threads, each of which does some independent work, in this case printing "A" or "B". The code might look something like this:

#include <stdio.h>
#include <assert.h>
#include <pthread.h>

void *t1(void *arg) {
    printf("A\n");
    return NULL;
}

void *t2(void *arg) {
    printf("B\n");
    return NULL;
}

int main(int argc, char *argv[]) {
    pthread_t p1, p2;
    int rc;
    printf("main: begin\n");
    rc = pthread_create(&p1, NULL, t1, NULL); assert(rc == 0);
    rc = pthread_create(&p2, NULL, t2, NULL); assert(rc == 0);
    // join waits for the threads to finish
    rc = pthread_join(p1, NULL); assert(rc == 0);
    rc = pthread_join(p2, NULL); assert(rc == 0);
    printf("main: done with both\n");
    return 0;
}

The main program creates two threads, one of which will start running at function t1, and the other at function t2. Once a thread is created, it may start running right away (depending on the whims of the scheduler); alternately, it may be put in a "ready" but not "running" state and thus not run yet. After creating the two threads, the main thread calls pthread_join(), which waits for a particular thread to complete.
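If you'd like to try this yourself, the program can be compiled and run just like the later examples in this note (assuming the code is saved as main.c). Here is one possible run; as we will see next, other output orderings are possible:

------------------------------------------------------------------------------
prompt> gcc -o main main.c -Wall -lpthread
prompt> ./main
main: begin
A
B
main: done with both
prompt>
------------------------------------------------------------------------------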
Let us examine the possible execution orderings of this little program. In the diagrams below, time moves downwards.

------------------------------------------------------------------------------
main starts running
main prints "main: begin"
main creates thread 1
main creates thread 2
main waits for thread 1
thread 1 runs
thread 1 prints "A"
thread 1 returns
main waits for thread 2
thread 2 runs
thread 2 prints "B"
thread 2 returns
main prints "main: done with both"
------------------------------------------------------------------------------

Note, however, that this is not the only possible ordering. In fact, there are many, depending on what the scheduler decides to run at a given point. For example, once a thread is created, it may run immediately, which would lead to the following execution:

------------------------------------------------------------------------------
main starts running
main prints "main: begin"
main creates thread 1
thread 1 runs
thread 1 prints "A"
thread 1 returns
main creates thread 2
thread 2 runs
thread 2 prints "B"
thread 2 returns
main waits for thread 1 (returns immediately; thread 1 is finished)
main waits for thread 2 (returns immediately; thread 2 is finished)
main prints "main: done with both"
------------------------------------------------------------------------------

We could even see "B" printed before "A", if, say, the scheduler decided to run thread 2 first even though thread 1 was created first:

------------------------------------------------------------------------------
main starts running
main prints "main: begin"
main creates thread 1
main creates thread 2
thread 2 runs
thread 2 prints "B"
thread 2 returns
main waits for thread 1
thread 1 runs
thread 1 prints "A"
thread 1 returns
main waits for thread 2 (returns immediately; thread 2 is finished)
main prints "main: done with both"
------------------------------------------------------------------------------

As you might be able to see, one way to think about thread creation is that it is a bit like making a function call; however, instead of first executing the function and then returning to the caller, the system instead creates a new thread of execution for the routine that is being called, and the new thread runs independently of the caller, perhaps before the create call returns, but perhaps much later.

As you also might be able to tell from this example, threads make life complicated: it is already hard to tell what will run when! Unfortunately, it gets worse. Much worse. Hold on to your hats!

[WHY IT GETS WORSE: SHARED DATA]

The simple thread example we showed above was useful in showing how threads are created and how they can run in different orders depending on how the scheduler decides to run them. What it doesn't show you, though, is how threads interact when they access shared data.

Let us imagine a simple example where two threads wish to update a global shared variable. The code might look something like this:

------------------------------------------------------------------------------
#include <stdio.h>
#include <pthread.h>
#include "mythreads.h"

static volatile int balance = 0;

void *mythread(void *arg) {
    char *letter = arg;
    int i;
    printf("%s: begin\n", letter);
    for (i = 0; i < 1e7; i++) {
        balance = balance + 1;
    }
    printf("%s: done\n", letter);
    return NULL;
}

int main(int argc, char *argv[]) {
    pthread_t p1, p2;
    printf("main: begin [balance = %d]\n", balance);
    Pthread_create(&p1, NULL, mythread, "A");
    Pthread_create(&p2, NULL, mythread, "B");
    // join waits for the threads to finish
    Pthread_join(p1, NULL);
    Pthread_join(p2, NULL);
    printf("main: done with both [balance = %d]\n", balance);
    return 0;
}
------------------------------------------------------------------------------

A few notes about the code. First, as Stevens suggests [1], we wrap the thread creation and join routines to simply exit on failure; for a program as simple as this one, we want to at least notice that an error occurred (if it did), but not do anything very smart about it. Thus, Pthread_create() just calls pthread_create() and makes sure the return code is 0; if it isn't, Pthread_create() just prints a message and exits.
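This note does not show the contents of mythreads.h, but based on that description, here is a minimal sketch of what such wrappers might look like (one plausible version, not necessarily the exact header used here):

------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

// Wrap pthread_create(): on failure, print a message and exit.
void Pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                    void *(*start_routine)(void *), void *arg) {
    int rc = pthread_create(thread, attr, start_routine, arg);
    if (rc != 0) {
        fprintf(stderr, "pthread_create failed: %d\n", rc);
        exit(1);
    }
}

// Wrap pthread_join() in the same style.
void Pthread_join(pthread_t thread, void **value_ptr) {
    int rc = pthread_join(thread, value_ptr);
    if (rc != 0) {
        fprintf(stderr, "pthread_join failed: %d\n", rc);
        exit(1);
    }
}
------------------------------------------------------------------------------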
Second, you can see that instead of using two separate function bodies for the worker threads, we just use a single piece of code, and pass the thread an argument (in this case, a string) so each thread can print a different letter before its messages.

Finally, and most importantly, we can now look at what each worker is trying to do: add a number to the shared variable balance, and do so 10 million times (1e7) in a loop. Thus, the desired final result is 20,000,000, as we see in this potential output:

------------------------------------------------------------------------------
prompt> gcc -o main main.c -Wall -lpthread
prompt> ./main
main: begin [balance = 0]
A: begin
B: begin
A: done
B: done
main: done with both [balance = 20000000]
prompt>
------------------------------------------------------------------------------

Unfortunately, when we run this code, even on a single processor, we don't necessarily get the desired result:

------------------------------------------------------------------------------
prompt> gcc -o main main.c -Wall -lpthread
prompt> ./main
main: begin [balance = 0]
A: begin
B: begin
A: done
B: done
main: done with both [balance = 19345221]
prompt> ./main
main: begin [balance = 0]
A: begin
B: begin
A: done
B: done
main: done with both [balance = 19221041]
prompt>
------------------------------------------------------------------------------

In fact, each run yields a *different* result! A big question remains: why does this happen?

[THE CRUX OF THE PROBLEM]

To understand why this happens, we must understand the code sequence that the compiler generates for the update to balance. In this case, we wish to simply add a number (1) to balance. Thus, the code sequence for doing so might look something like this (an x86 example):

mov 0x8049a1c, %eax
add $0x1, %eax
mov %eax, 0x8049a1c

This example assumes that the variable balance is located at address 0x8049a1c. In this three-instruction sequence, the x86 mov instruction is used first to get the memory value at that address and put it into register eax. Then, the add is performed, adding 1 (0x1) to the contents of the eax register, and finally, the contents of eax are stored back into memory at the same address.
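(As an aside, you can inspect the sequence your own compiler produces; the exact instructions, registers, and addresses will differ by compiler, architecture, and optimization level, so take the listing above as representative rather than exact. For example, with gcc:)

------------------------------------------------------------------------------
prompt> gcc -O0 -S main.c   # emit assembly to main.s instead of a binary
prompt> less main.s         # look for the load/add/store around "balance"
------------------------------------------------------------------------------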
Let us imagine one of our two threads (thread 1) enters this region of code, and is thus about to increment balance by one. It loads the value of balance (let's say it is zero to begin with) into its register eax. Thus, eax=0 for thread 1. Now, something unfortunate happens: a timer interrupt goes off. This causes the OS to save the state of the currently running thread (its PC, its registers including eax, etc.) to the TCB for this thread.

Now something worse happens: thread 2 is chosen to run, and it enters this same piece of code. It also executes the first instruction, getting the value of balance and putting it into its eax (remember: each thread, when running, has its own private registers; the registers are *virtualized* by the context-switch code that saves and restores them). The value of balance is still zero at this point, and thus thread 2 has eax=0. Let's then assume that thread 2 executes the next two instructions, incrementing eax by 1 (thus eax=1), and then saving the contents of eax into balance (address 0x8049a1c). Thus, the global variable balance now has the value 1.

Finally, another context switch occurs, and thread 1 resumes running. Recall that it had just executed the first mov instruction, and is now about to add 1 to eax. Recall also that eax=0. Thus, the add instruction increments eax by 1 (thus eax=1), and then the mov instruction saves it to memory (thus balance is set to 1 again).

Put simply, what has happened is this: the code to increment balance has been run twice, but balance, which started at 0, is now only equal to 1. A "correct" version of this program should have resulted in balance equal to 2.

Here is a pictorial depiction of what happened and when in the example above. Assume, for this depiction, that the above code is loaded at address 100 in memory, like the following sequence (note: as it turns out, x86 has variable-length instructions; the mov instructions here take up 5 bytes each, whereas the add takes only 3):

100 mov 0x8049a1c, %eax
105 add $0x1, %eax
108 mov %eax, 0x8049a1c

With these assumptions, the timing that occurs in the example is as follows (your screen must be wide to see all the columns in a sensible format):

OS                            THREAD 1                      THREAD 2                      BALANCE
                              mov 0x8049a1c, %eax (eax=0)                                    0
Interrupt                                                                                    0
  Save T1 (pc=105, eax=0)                                                                    0
  Restore T2 (pc=100, eax=0)                                                                 0
Run T2                                                                                       0
                                                            mov 0x8049a1c, %eax (eax=0)      0
                                                            add $0x1, %eax (eax=1)           0
                                                            mov %eax, 0x8049a1c (eax=1)      1
Interrupt                                                                                    1
  Save T2 (pc=113, eax=1)                                                                    1
  Restore T1 (pc=105, eax=0)                                                                 1
Run T1                                                                                       1
                              add $0x1, %eax (eax=1)                                         1
                              mov %eax, 0x8049a1c                                            1 (not 2!)

What we have demonstrated here is called a *race condition*: the results depend on the timing of the code's execution. With some bad luck (i.e., context switches that occur at untimely points in the execution), we get the wrong result. In fact, we may get a different result each time; thus, instead of a nice deterministic computation (which we are used to from computers), we call this result *indeterminate*, where it is not known what the output will be, and it is indeed likely to be different across runs.

Because multiple threads executing this code can result in a race condition, we call this code a *critical section*. A critical section is a piece of code that accesses a shared variable (or, more generally, a shared resource) and must not be concurrently executed by more than one thread.

What we really want for this code is what we call *mutual exclusion*. This property guarantees that if one thread is executing within the critical section, the others will be prevented from doing so.
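As a preview of where we are heading, here is a sketch of how the update loop could enforce mutual exclusion with a standard POSIX lock; only the changed declarations and worker are shown (the begin/done printfs and main() are unchanged from above), and we cover locks, and why this works, in detail in upcoming notes:

------------------------------------------------------------------------------
#include <pthread.h>

static volatile int balance = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *mythread(void *arg) {
    int i;
    for (i = 0; i < 1e7; i++) {
        pthread_mutex_lock(&lock);    // at most one thread gets past this point
        balance = balance + 1;        // the critical section
        pthread_mutex_unlock(&lock);  // release so another thread may enter
    }
    return NULL;
}
------------------------------------------------------------------------------

With the lock held around the load/add/store sequence, an untimely interrupt can still occur, but no other thread can enter the critical section until the lock is released, and thus the final count comes out right.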
[WHAT TO DO? THE WISH FOR ATOMICITY]

One way to solve this problem would be to have more powerful instructions that, in a single step, did exactly whatever we needed done, thus removing the possibility of an untimely interrupt. For example, if we had a super instruction that looked like this:

add 0x8049a1c, $0x1

which added a value to a memory location, the hardware would guarantee that it executed *atomically*; when the instruction ran, it would perform the update as desired. It could not be interrupted mid-instruction, because that is precisely the guarantee we receive from the hardware: when an interrupt occurs, either the instruction has not run at all, or it has run to completion; there is no in-between state. "Atomically" in this context means "as a unit," which sometimes we take as "all or none."

What we'd like is to execute the three-instruction sequence atomically:

mov 0x8049a1c, %eax
add $0x1, %eax
mov %eax, 0x8049a1c

As we said, if we had a single instruction to do this, we could just use that instruction and be done. But in the general case, we won't have such an instruction. Imagine we were building a concurrent B-tree and wished to update it; would we really want the hardware to support a generic "atomic update of B-tree" instruction? Probably not. Thus, what we will instead do is ask the hardware for a few useful instructions upon which we can build a general set of what we call *synchronization primitives*. By using these synchronization primitives, we will be able to build multi-threaded code that accesses critical sections in a synchronized and controlled manner, and thus reliably produces the correct result despite the challenging nature of concurrent execution.

[1] "Advanced Programming in the UNIX Environment" by the late, great W. Richard Stevens; the latest edition adds a co-author, Stephen Rago. An amazing book, full of good tips for future UNIX hackers. Also amazing are Stevens' books on network programming, another must for your bookshelf.