Prerequisite Material from 252 (Starts Here)

Motivation for Registers


REGISTERS and MAL
-----------------

An introduction to the subject of registers -- from a motivational
point of view.

This lecture is an attempt to explain a bit about why computers
are designed (currently) the way they are.  Try to remember that
speed of program execution is an important goal.  Desire for increased
speed drives the design of computer hardware.


The impediment to speed (currently):  transferring data to and from
memory.

look at an invented instruction:
    add  x, y, z

    - x, y, and z must all be addresses of data in memory.
    - each address is 32 bits.
    - what does the machine code look like?
    
      ----------------------------------------
      | add   |    x     |    y    |    z    |
      ----------------------------------------

      ----------------------------------------
      | opcode|  address | address | address |
      ----------------------------------------
          8(?)     32        32        32

       so, this instruction requires more than 96 bits (8 + 3 x 32 = 104).

    IF each read from memory delivers 32 bits of data,
    then it takes a lot of reads before this instruction can
    be completed.
       at least 3 for instruction fetch
       1 to load y
       1 to load z
       1 to store x

       that's 6 transactions with memory for 1 instruction!


How bad is the problem?
  Assume that a 32-bit 2's complement addition takes 1 time unit. 
  A read/write from/to memory takes about 10 time units.
  (Note that this is a very conservative estimate; the ratio is
  more like 1:16.)

  So we get
     fetch instruction   30 time units
     (and update PC)
     decode               1 time unit
     load y              10 time units
     load z              10 time units
     add                  1 time unit
     store x             10 time units
     ---------------------------------
       total time:       62 time units

     60/62 = 96.8 % of the processor's time is spent doing memory operations.
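The time accounting above can be sketched as a tiny cost model (a sketch
only; the 1-unit/10-unit figures are the assumptions stated in these notes):

```python
# Cost model for the invented 3-address instruction  add x, y, z
# (assumed figures from these notes: 1 time unit per decode or ALU op,
#  10 time units per 32-bit memory read/write).
MEM = 10     # one 32-bit read/write from/to memory
DECODE = 1
ALU = 1      # one 32-bit add

fetch = 3 * MEM                 # 3 reads to fetch the long instruction
loads = 2 * MEM                 # read y, read z
store = 1 * MEM                 # write x
total = fetch + DECODE + loads + ALU + store
memory = fetch + loads + store

print(total)                           # 62 time units
print(round(100 * memory / total, 1))  # 96.8 (% of time on memory traffic)
```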



what do we do to reduce this number?

  1. transfer more data at one time
     if we transfer 64 bits at one time, then it takes only 2 reads
     to get the instruction.  There are no savings in loading/storing
     the operands, and an extra (unneeded) word of data is transferred
     with each operand load -- a waste of bus bandwidth.
     So, this idea saves only 1 memory transaction.
     
     With the invented example instruction:
                          64 bits               128 bits
     fetch instruction:   20                      10
     decode                1                       1
     load y               10                      10
     load z               10                      10
     add                   1                       1
     store x              10                      10
     ---------------------------------          -----
       total time:        52                      42
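The effect of a wider bus can be folded into the same kind of sketch
(fetch takes 3, 2, or 1 reads on a 32-, 64-, or 128-bit bus, per the
table above; operand traffic is unchanged):

```python
# Total time for the 3-address add as a function of how many reads the
# instruction fetch needs (figures assumed from these notes: 10 units
# per memory transaction, 1 unit per decode or ALU op).
MEM, DECODE, ALU = 10, 1, 1

def total_time(fetch_reads):
    # fetch + decode + load y + load z + add + store x
    return fetch_reads * MEM + DECODE + 2 * MEM + ALU + MEM

print(total_time(3))   # 62  (32-bit bus)
print(total_time(2))   # 52  (64-bit bus)
print(total_time(1))   # 42  (128-bit bus)
```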
  

  2.  shorten addresses.  This restricts where variables can be placed.
      First, make each address be 16 bits (instead of 32).  Then
	 add  x, y, z
      requires 2 words for instruction fetch.

      Shorten addresses even more . . . make them each 5 bits long.
      Problem:  that leaves only 32 words of data for operand storage.
      So, use extra move instructions that allow moving data from
      a 32-bit address to one of these special 32 words.

      Then, the add can fit into 1 transferred word.
     With the invented example instruction:
                          32 bits transferred   32 bits transferred
			  16-bit addr           5-bit addr
     fetch instruction:   20                      10
     decode                1                       1
     load y               10                      10
     load z               10                      10
     add                   1                       1
     store x              10                      10
     ---------------------------------          -----
       total time:        52                      42



  3. modify the instruction set such that instructions are smaller.
     This was common on machines designed more than a decade ago.
     (It is still part of Intel's IA-32 architecture.)
     Here's how it works:

     The invented instruction implies what is called a 3-address machine.
     Each arithmetic type instruction contains 3 operands, 2 for sources
     and 1 for the destination of the result.

     To reduce the number of operands (and thereby reduce the number
     of reads for the instruction fetch), develop an instruction set
     that uses 2 operands for arithmetic type instructions.
     (Called a 2-address machine.)

     Now, instead of       add  x, y, z

     we will have          move x, z      (copies the value of z into x)
			   add  x, y      ( x <- x + y )

	   so, arithmetic type instructions always use one of the operands
	   as both a source and a destination.


    There are a couple of problems with this approach:
       - where 1 instruction was executed before, 2 are now executed.
	 It actually takes more memory transactions to execute this sequence!
	    at least 2 to fetch each instruction
	    1 for each load/store of an operand
	    (read z, write x, read y, write x)

	    that is 8 reads/writes for the same sequence.

                          32 bits                64 bits
                          move  add              move  add
     fetch instruction:   20    20                10   10
     decode                1     1                 1    1
     load operand         10    10                10   10
     operation             0     1                 0    1
     store                10    10                10   10
     ---------------------------------          -----------
               sum:       41    42                31   32
            total:           83                     63

  (Is this better than for the 3-address machine?)
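The question can be answered with the same kind of sketch (per-instruction
breakdowns follow the 32-bit column of the table above):

```python
# Cost of the 2-address sequence  move x, z ; add x, y  on a 32-bit bus
# (figures assumed from these notes: 2 reads per instruction fetch,
#  10 units per memory transaction, 1 unit per decode or ALU op).
MEM, DECODE, ALU = 10, 1, 1

move = 2 * MEM + DECODE + MEM + MEM        # fetch, decode, read z, write x
add  = 2 * MEM + DECODE + MEM + ALU + MEM  # fetch, decode, read y, add, write x

print(move, add, move + add)   # 41 42 83 -- worse than the 3-address 62
```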



  So, allow only 1 operand -- called a 1-address format.
     
     now, the instruction     add  x, y, z   will be accomplished
     by something like

     load  z
     add   y
     store x

     to facilitate this, there is an implied word of storage
     associated with the ALU.  All results of instructions
     are placed into this word -- called an ACCUMULATOR.

     the operation of the sequence:
	 load z --  read from memory at address z, and place value into
                    the accumulator
	 add  y --  implied operation is to add the contents of the
		    accumulator with the operand, and place the result
		    back into the accumulator.
	 store x--  write to memory at address x; the value is the contents
                    of the accumulator

     Notice that this 1-address instruction format implies the use
     of a variable (the accumulator).

     How many memory transactions does it take?
	2 -- (load) at least 1 for instruction fetch, 1 for read of z
	2 -- (add) at least 1 for instruction fetch, 1 for read of y
	2 -- (store) at least 1 for instruction fetch, 1 for write of x
       ---
	6   the same as for the 3-address machine -- no savings.

			   32 bits transferred
                          load  add store
     fetch instruction:   10    10   10
     decode                1     1    1
     load operand         10    10    0
     operation             0     1    0
     store                 0     0   10      
     --------------------------------- 
               sum:       21    22   21
            total:           64      
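The 1-address totals can be sketched the same way:

```python
# Cost of the 1-address sequence  load z ; add y ; store x
# (figures assumed from these notes: 1 read per instruction fetch,
#  10 units per memory transaction, 1 unit per decode or ALU op).
MEM, DECODE, ALU = 10, 1, 1

load  = MEM + DECODE + MEM        # fetch, decode, read z into accumulator
add   = MEM + DECODE + MEM + ALU  # fetch, decode, read y, add to accumulator
store = MEM + DECODE + MEM        # fetch, decode, write accumulator to x

print(load, add, store, load + add + store)   # 21 22 21 64
```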



  BUT, what if we wanted
    x = (y + z) / 3

  For the 3-address machine, the operation following the add is
	 div x, x, 3

     3-address machine     32 bits
                          add  div
     fetch instruction     30   30
     decode                 1    1
     load one operand      10   10
     load other operand    10    0 (immediate is in instruction)
     add                    1    1
     store x               10   10
     ---------------------------------
              sum:         62   52
              total:         114

  For the 1-address machine, the intermediate sum (y + z) is already
  sitting in the accumulator after the add, so the code on the
  1-address machine could be
    load z
    add  y
    div  3
    store x
  there is only 1 extra instruction (1 extra memory transaction, for
  its fetch) in this whole sequence!

     1-address machine      32 bits
                          load  add div store
     fetch instruction    10    10   10   10
     decode                1     1    1    1
     load operand         10    10    0    0
     operation             0     1    1    0
     store                 0     0    0   10
     ------------------------------------------- 
               sum:       21    22   12   21
            total:           76      
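Putting the two tables side by side as a sketch (same assumed figures;
the divide's operand 3 is an immediate, so it costs no operand read):

```python
# x = (y + z) / 3 on both machines (figures assumed from these notes:
# 10 units per memory transaction, 1 unit per decode or ALU op).
MEM, DECODE, ALU = 10, 1, 1

# 3-address:  add x, y, z ; div x, x, 3
three_addr = (3*MEM + DECODE + 2*MEM + ALU + MEM) \
           + (3*MEM + DECODE + MEM + ALU + MEM)   # reload x, divide, store x

# 1-address:  load z ; add y ; div 3 ; store x (sum stays in accumulator)
one_addr = (MEM + DECODE + MEM)       \
         + (MEM + DECODE + MEM + ALU) \
         + (MEM + DECODE + ALU)       \
         + (MEM + DECODE + MEM)

print(three_addr, one_addr)   # 114 76 -- re-using the accumulator wins
```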


REMEMBER this:  the 1-address machine uses an extra word of storage
		that is located in the CPU (processor).

		the example shows a savings in memory transactions
		when a value is re-used.




NOW, put a couple of these ideas together.

Use of storage in the processor (the accumulator) allowed re-use of data.
The idea is easy to generalize -- put a bunch of storage words in the
processor, and call them REGISTERS.  How about 32 of them?  Then, restrict
arithmetic instructions to use only registers as operands.

   add  x, y, z

   becomes something more like

   load  reg10, y
   load  reg11, z
   add   reg12, reg11, reg10
   store x, reg12

presuming that the values of x, y, and z will be used again (while
still in registers), the cost of the loads is amortized over those
later uses.
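That amortization claim can be made concrete with a small sketch (the
re-use count k is hypothetical, and instruction-fetch cost is ignored
to keep the point visible):

```python
# Why registers pay off: y and z are loaded once, the result stored once,
# and every add in between costs only ALU time (figures assumed from
# these notes: 10 units per memory transaction, 1 per ALU op).
MEM, ALU = 10, 1

def avg_cost_per_add(k):
    """Average time per add when the register operands are re-used
    for k adds in total (instruction fetches ignored)."""
    return (2 * MEM + k * ALU + MEM) / k   # 2 loads + k adds + 1 store

print(avg_cost_per_add(1))    # 31.0 -- used once: the loads dominate
print(avg_cost_per_add(10))   # 4.0  -- re-used: load cost amortized away
```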


The MIPS R2000 architecture does this.  It has
  1. 32  32-bit registers.
  2. Arithmetic/logical instructions use register values as operands.

A set up like this where arith/logical instr. use only registers
for operands is called a LOAD/STORE architecture.  The only way to
access data within memory is to use an explicit instruction (load)
to read the data from memory and copy it into a register.

A computer that allows operands to come from main memory is often
called a MEMORY TO MEMORY architecture, although that term is not
universal.


Load/store architectures are common today.  They have two advantages:
  1.  instructions can be fixed length (and short)
  2.  the design easily permits pipelining, making load/store
      architectures faster

Prerequisite Material from 252 (Ends Here)


Copyright © Karen Miller, 2006