Instruction Level Parallelism: Pipelining

Programmer's model: one instruction is fetched and executed at a time.

Computer architect's model: The effects of a program's execution are those given by the programmer's model, but the implementation may be different.

To make execution of programs faster, we attempt to exploit parallelism: doing more than one thing at one time.

Pipelining (ILP)

The concept:

A task is broken down into steps. Assume that there are N steps, each takes the same amount of time.

(Mark Hill's) EXAMPLE: car wash

     steps:  P -- prep
	     W -- wash
	     R -- rinse
	     D -- dry
	     X -- wax

     assume each step takes 1 time unit

     time to wash 1 car (red) = 5 time units
     time to wash 3 cars (red, green, blue) = 15 time units

     which car      time units
		1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
       red      P  W  R  D  X
       green                   P  W  R  D  X
       blue                                   P  W  R  D  X

A pipeline overlaps the steps.

     which car      time units
		1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
       red      P  W  R  D  X
       green       P  W  R  D  X
       blue           P  W  R  D  X
       yellow            P  W  R  D  X
	  etc.

IT STILL TAKES 5 TIME UNITS TO WASH 1 CAR, BUT THE RATE OF CAR WASHES GOES UP!
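
The arithmetic behind the two diagrams can be sketched in a few lines of Python. This is only an illustration under the notes' assumptions (5 steps, 1 time unit each, a new car entering every time unit); the function names are made up for this example.

```python
STEPS = ["P", "W", "R", "D", "X"]  # prep, wash, rinse, dry, wax


def sequential_time(num_cars, num_steps=len(STEPS)):
    """Total time when each car finishes before the next one starts."""
    return num_cars * num_steps


def pipelined_time(num_cars, num_steps=len(STEPS)):
    """Total time when a new car enters the wash every time unit."""
    if num_cars == 0:
        return 0
    return num_steps + (num_cars - 1)


def pipelined_schedule(cars):
    """Map each car to {time unit: step}; the i-th car (from 0) starts at time i + 1."""
    return {
        car: {start + offset + 1: step for offset, step in enumerate(STEPS)}
        for start, car in enumerate(cars)
    }


if __name__ == "__main__":
    cars = ["red", "green", "blue", "yellow"]
    last = pipelined_time(len(cars))
    for car, slots in pipelined_schedule(cars).items():
        row = "".join(f"{slots.get(t, ''):>3}" for t in range(1, last + 1))
        print(f"{car:<8}{row}")
    print("3 cars, sequential:", sequential_time(3))
    print("3 cars, pipelined: ", pipelined_time(3))
```

With 3 cars, the sequential wash needs 15 time units and the pipelined wash needs 7, even though each individual car still spends 5 time units in the wash.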

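Two very important terms when discussing pipelining: LATENCY (the time it takes to complete one task, from beginning to end) and THROUGHPUT (the rate at which tasks are completed).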

Pipelining does not affect the latency of car washes. It increases the throughput of car washes.

Pipelining can be done in computer hardware.

2-stage pipeline

  steps:
    F -- instruction fetch (and PC update!)
    E -- instruction execute (everything else)


    which instruction       time units
			1  2  3  4  5  6  7  8 . . .
       1                F  E
       2                   F  E
       3                      F  E
       4                         F  E

       time for 1 instruction =  2 time units
	 (INSTRUCTION LATENCY)

       rate of instruction execution (INSTRUCTION THROUGHPUT)
         = pipeline depth * (1 / time for 1 instruction)
         =        2       * (1 / 2)
         = 1 instruction per time unit

5-stage pipeline

A popular pipelined implementation:
(Note: the R2000/3000 has 5 stages, the R6000 has 5 stages (but different), and the R4000 has 8 stages)

     steps:
	IF -- instruction fetch (and PC update)
	ID -- instruction decode (and get operands from registers)
	EX -- ALU operation (can be effective address calculation)
	MA -- memory access
	WB -- write back (results written to register(s))



    which       time units
instruction   1   2   3   4   5   6   7  8 . . .
     1        IF  ID  EX  MA  WB
     2            IF  ID  EX  MA  WB
     3                IF  ID  EX  MA  WB



    INSTRUCTION LATENCY = 5 time units
    INSTRUCTION THROUGHPUT = 5 * (1 / 5) = 1 instruction per time unit
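
The throughput figure above is a steady-state number: it ignores the time needed to fill the pipeline. The following Python sketch (function names are hypothetical, and every stage is assumed to take exactly 1 time unit with no stalls) compares the notes' formula with the average rate over a finite run of instructions.

```python
def latency(depth, stage_time=1):
    """Time for one instruction to pass through every stage."""
    return depth * stage_time


def steady_state_throughput(depth, stage_time=1):
    """The notes' formula: pipeline depth * (1 / time for 1 instruction)."""
    return depth * (1 / latency(depth, stage_time))


def average_throughput(num_instructions, depth, stage_time=1):
    """Instructions completed per time unit, counting pipeline fill time."""
    total_time = (depth + num_instructions - 1) * stage_time
    return num_instructions / total_time


if __name__ == "__main__":
    for depth in (2, 5):
        print("depth", depth,
              "latency", latency(depth),
              "steady-state throughput", steady_state_throughput(depth))
    for n in (3, 100, 10000):
        print(n, "instructions, 5 stages:", average_throughput(n, 5))
```

For short runs the average rate is well below 1 (3 instructions take 7 time units), and it approaches 1 instruction per time unit as the run gets longer.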

Unfortunately, pipelining introduces other difficulties. . .

Data dependencies

Suppose we have the following code:

 
   lw   $8, data1
   addi $9, $8, 1
 
 

The data loaded does not get written to $8 until WB, but the addi instruction wants to get the data out of $8 in its ID stage. . .

    which       time units
instruction   1   2   3   4   5   6   7  8 . . .
    lw        IF  ID  EX  MA  WB
			      ^^
    addi          IF  ID  EX  MA  WB
		      ^^

The simplest solution is to STALL the pipeline. (Also called HOLES, HICCOUGHS or BUBBLES in the pipe.)

    which       time units
instruction   1   2   3   4   5   6   7   8 . . .
    lw        IF  ID  EX  MA  WB
			      ^^
    addi          IF  ID  ID  ID  EX  MA  WB
		      ^^  ^^  ^^ (pipeline stalling)

A data dependency (also called a hazard) causes performance to decrease.
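
The stall diagram above can be reproduced with a small Python model. This is a sketch under stated assumptions, not a description of any real processor: there is no forwarding, instructions stay in program order, and a register written in WB can be read by an ID stage in that same time unit (which is what the diagram shows). The function name and the tuple format are invented for this example.

```python
def stall_schedule(instructions):
    """Return a list of (name, first_cycle, stages), one stage per time unit.

    instructions: list of (name, dest_register_or_None, source_registers).
    """
    written_in = {}               # register -> time unit of the WB that writes it
    prev_id_start, prev_ex = 1, 0
    rows = []
    for name, dest, sources in instructions:
        # IF can start once the previous instruction has moved on to ID.
        if_start = prev_id_start
        # ID can start once the previous instruction has moved on to EX.
        id_start = max(if_start + 1, prev_ex)
        # The last ID time unit must not come before the producer's WB.
        operands_ready = max((written_in.get(r, 0) for r in sources), default=0)
        ex = max(id_start + 1, operands_ready + 1)
        stages = (["IF"] * (id_start - if_start)
                  + ["ID"] * (ex - id_start)
                  + ["EX", "MA", "WB"])
        if dest is not None:
            written_in[dest] = ex + 2   # WB is two time units after EX
        rows.append((name, if_start, stages))
        prev_id_start, prev_ex = id_start, ex
    return rows


if __name__ == "__main__":
    program = [("lw", "$8", []), ("addi", "$9", ["$8"])]
    for name, first, stages in stall_schedule(program):
        print(f"{name:<6}" + "    " * (first - 1) + "".join(f"{s:<4}" for s in stages))
```

For the lw/addi pair, the model holds addi in ID for three time units, so the pair finishes in time unit 8 instead of 6.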

Classification of data dependencies:
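
The usual textbook classification has three cases: read after write (RAW), write after read (WAR), and write after write (WAW). The Python sketch below checks a pair of instructions for each case. The names RAW, WAR and WAW are standard terminology and are assumed here; the `classify` function and its tuple format are hypothetical.

```python
def classify(first, second):
    """Classify the dependencies of `second` on `first` (program order).

    Each instruction is (dest_register_or_None, list_of_source_registers).
    """
    first_dest, first_srcs = first
    second_dest, second_srcs = second
    kinds = []
    if first_dest is not None and first_dest in second_srcs:
        kinds.append("RAW")   # second reads what first writes
    if second_dest is not None and second_dest in first_srcs:
        kinds.append("WAR")   # second overwrites what first reads
    if first_dest is not None and first_dest == second_dest:
        kinds.append("WAW")   # both write the same register
    return kinds


if __name__ == "__main__":
    lw = ("$8", [])            # lw   $8, data1
    addi = ("$9", ["$8"])      # addi $9, $8, 1
    print(classify(lw, addi))  # the lw/addi pair from the notes
```

The lw/addi pair used in the notes is a RAW dependency: addi reads $8 after lw writes it.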

NOTE: data dependencies cause no difficulty in a 2-stage pipeline! Each instruction completes its E step before the next instruction begins its own E step.

Control dependencies

What happens to a pipeline in the case of branch instructions?

MAL CODE SEQUENCE:


        b  label1
        addi  $9, $8, 1
label1: mult $8, $9



    which       time units
instruction   1   2   3   4   5   6   7  8 . . .
     b        IF  ID  EX  MA  WB
			      ^^ (PC changed here)
    addi          IF  ID  EX  MA  WB
		  ^^  (WRONG instruction fetched here!)

Whenever the PC changes (except for PC <- PC + 4), we have a control dependency.

Control dependencies break pipelines. They cause performance to plummet.

So, lots of (partial) solutions have been implemented to try to help the situation. In the worst case, the pipeline must be stalled so that instructions go through one at a time.

Note that just stalling does not really help, since the (potentially) wrong instruction is fetched before it is determined that the previous instruction is a branch.
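
A rough way to see how much control dependencies cost is to charge a fixed number of wasted time units for every change of the PC other than PC <- PC + 4. The Python sketch below does that. The `penalty` parameter and all of the numbers in the usage example are hypothetical, chosen only for illustration, and the model assumes no other stalls.

```python
def total_time(num_instructions, pc_changes, penalty, depth=5):
    """Time units to run a program, with a fixed cost per PC change.

    pc_changes: how many executed instructions change the PC
                (other than PC <- PC + 4).
    penalty:    wasted time units per PC change (an assumed constant).
    """
    if num_instructions == 0:
        return 0
    return depth + (num_instructions - 1) + pc_changes * penalty


def throughput(num_instructions, pc_changes, penalty, depth=5):
    """Average instructions completed per time unit."""
    return num_instructions / total_time(num_instructions, pc_changes, penalty, depth)


if __name__ == "__main__":
    # Hypothetical workload: 100 instructions, 20 of which change the PC.
    print("no PC changes:  ", throughput(100, 0, 4))
    print("20 PC changes:  ", throughput(100, 20, 4))
```

Under these made-up numbers, the run grows from 104 to 184 time units, which is the sense in which control dependencies cause performance to plummet.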

How to minimize the effect of control dependencies on pipelines.

An aside, on condition codes
A historically significant way of branching. Condition codes were used on MANY machines before pipelining became popular.

4 1-bit registers (condition code register):

The result of an instruction set these 4 bits. Conditional branches were then based on these flags.

Example: bn label # branch to label if the N bit is set

Earlier computers had virtually every instruction set the condition codes. This had the effect that the test (for the branch) needed to come directly before the branch.

Example:

  
	sub r3, r4, r5    # with the bn below, same effect as MAL: blt $4, $5, label
	bn  label
  
  

A later refinement, which sometimes improved performance, allowed the programmer to specify explicitly which instructions should set the condition codes. On a pipelined machine, the test could then be separated from the branch by other instructions, resulting in fewer pipeline holes due to data dependencies.
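
The behavior described in this aside can be sketched with a toy machine in Python. Only the N bit is modeled, since it is the one flag the example uses; the class, its method names, and the `set_cc` parameter are all hypothetical and do not correspond to any particular instruction set.

```python
class ToyMachine:
    """A toy register machine with a single condition code bit (N)."""

    def __init__(self):
        self.regs = {}
        self.n = False  # N bit: last flag-setting result was negative

    def sub(self, dest, a, b, set_cc=True):
        """dest <- a - b; update the N bit only if set_cc is True."""
        result = self.regs[a] - self.regs[b]
        self.regs[dest] = result
        if set_cc:
            self.n = result < 0

    def bn(self):
        """Return True if a 'branch if N bit set' would be taken."""
        return self.n


if __name__ == "__main__":
    m = ToyMachine()
    m.regs.update({"r4": 3, "r5": 7, "r6": 10, "r7": 1})
    m.sub("r3", "r4", "r5")                # the test: 3 - 7 is negative
    m.sub("r8", "r6", "r7", set_cc=False)  # unrelated work, flags untouched
    print("branch taken:", m.bn())
```

Because the second subtraction does not set the condition codes, it can sit between the test and the branch without disturbing the N bit.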

Copyright © Karen Miller, 2006