
Lecture Notes for Chapter 6 -- Floating Point Arithmetic

CS/ECE 354: Machine Organization and Programming

Mark D. Hill's Section 4 for Fall 2001




REVIEW OF FLOATING POINT NUMBERS (From Chapter 4)
--------------------------------


Scientific Notation

     -1234.567
                   3
     -1.234567 * 10


Binary

     64.2
     1000000.0011001100110011. . .
                              6
     1.00000000110011. . . x 2	
                               110
     1.00000000110011. . . x 10

         s        e
     (-1)  x f x 2

     s = 0
     f = 1.00000000110011. . .
     e = 110


IEEE Single-Precision (32 bits)

	 -------------------
	 | S |   E   |  F  |
	 -------------------

    S is one bit representing the sign of the number
    E is an 8 bit biased integer representing the exponent
    F is an unsigned integer


    S = s
    E = e + 127
                 23
    F = (f-1) x 2

    S = 0
    E = 6 + 127 = 133 = 10000101
                                       23
    F = (1.00000000110011. . . - 1) x 2

                                  23
    F =  0.00000000110011. . . x 2
    F =    00000000110011001100110

    S     E               F
    0  10000101  00000000110011001100110

    the value is often given in hex, so here it is

    0100 0010 1000 0000 0110 0110 0110 0110

  0x   4    2    8    0    6    6    6    6

    0x42806666
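
    A quick way to check an encoding like this is to reinterpret a
    float's bytes as an integer.  A minimal C sketch (the memcpy idiom
    avoids undefined behavior from pointer casts):

        #include <stdio.h>
        #include <string.h>
        #include <stdint.h>

        int main(void) {
            float x = 64.2f;                    /* nearest float to 64.2 */
            uint32_t bits;
            memcpy(&bits, &x, sizeof bits);     /* reinterpret the float's bytes */
            printf("0x%08X\n", bits);           /* prints 0x42806666 */

            uint32_t S = bits >> 31;            /* sign bit */
            uint32_t E = (bits >> 23) & 0xFF;   /* biased exponent */
            uint32_t F = bits & 0x7FFFFF;       /* fraction; hidden bit not stored */
            printf("S=%u E=%u F=0x%06X\n", S, E, F);   /* S=0 E=133 F=0x006666 */
            return 0;
        }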


From IEEE Single-Precision back to a number

    S     E               F
    0  10000101  00000000110011001100110

    s = S
    e = E - 127
           23
    f = F/2   + 1

    s = 0
    e = 133 - 127 = 6 = 110

                                        23
    f =  00000000110011001100110  /  2    + 1

                                     -23
    f =  100000000110011001100110 x 2

    f = 1.00000000110011001100110

        s        e
    (-1)  x f x 2
                               110
    1.00000000110011. . . x 10
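
Going the other way in C, a small sketch that rebuilds the value from
the three fields (8388608 is 2^23; ldexp(x, n) computes x * 2^n):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        unsigned S = 0, E = 133, F = 0x006666;     /* fields from above */
        double f = 1.0 + F / 8388608.0;            /* restore hidden bit: F/2^23 + 1 */
        double value = (S ? -1.0 : 1.0) * ldexp(f, (int)E - 127);
        printf("%f\n", value);                     /* prints 64.199997 */
        return 0;
    }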



And now onto Chapter 6 ...
--------------------------

arithmetic operations on floating point numbers consist of
 addition, subtraction, multiplication and division

the operations are done with algorithms similar to those used
  on sign magnitude integers (because of the similarity of
  representation) -- example, only add numbers of the same
  sign.  If the numbers are of opposite sign, must do subtraction.


ADDITION

 example on decimal value given in scientific notation:

       9.997 x 10 ** 2
     + 4.631 x 10 ** -1
     -----------------

     first step:  align decimal points
     second step:  add

      
       9.997    x 10 ** 2
     + 0.004631 x 10 ** 2
     --------------------
      10.001631 x 10 ** 2

     third step:  normalize the result

	often already normalized
	otherwise move only one digit

     1.0001631 x 10 ** 3
	
    Example presumes infinite precision; with FP must round.


 example on fl. pt. values given in binary:

 .25 =   0 01111101 00000000000000000000000

 100 =   0 10000101 10010000000000000000000

 explicitly add hidden bit:
		    |
		    V
 .25 =   0 01111101 1 00000000000000000000000

 100 =   0 10000101 1 10010000000000000000000


    to add these fl. pt. representations,
    step 1:  align radix points



	 shifting the mantissa LEFT by 1 bit DECREASES THE EXPONENT by 1

	 shifting the mantissa RIGHT by 1 bit INCREASES THE EXPONENT by 1

	 we want to shift the mantissa right, because the bits that
	 fall off the end should come from the least significant end
	 of the mantissa

       -> choose to shift the .25, since we want to increase its exponent.
	 01111101 - 127 = 125 - 127 = -2
	 10000101 - 127 = (128 + 5) - (128 - 1) = 6

	 shift smaller by 6 - (-2) = 8 places

       -> note: can do subtraction directly since biases cancel
       
		    10000101
		   -01111101
		   ---------
		    00001000   (8) places.

            0 01111101 1 00000000000000000000000 (original value)
            0 01111110 0 10000000000000000000000 (shifted 1 place)
		       (note that hidden bit is shifted into msb of mantissa)
            0 01111111 0 01000000000000000000000 (shifted 2 places)
            0 10000000 0 00100000000000000000000 (shifted 3 places)
            0 10000001 0 00010000000000000000000 (shifted 4 places)
            0 10000010 0 00001000000000000000000 (shifted 5 places)
            0 10000011 0 00000100000000000000000 (shifted 6 places)
            0 10000100 0 00000010000000000000000 (shifted 7 places)
            0 10000101 0 00000001000000000000000 (shifted 8 places)


    step 2: add (don't forget the hidden bit for the 100)

         0 10000101 1 10010000000000000000000  (100)
      +  0 10000101 0 00000001000000000000000  (.25)
      ---------------------------------------
	 0 10000101 1 10010001000000000000000



    step 3:  normalize the result (get the "hidden bit" to be a 1)

	     it already is for this example.

   result is
	 0 10000101 10010001000000000000000
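
    The same result can be checked in C by letting the hardware do the
    add and dumping the bits (same memcpy idiom as before; a sketch):

        #include <stdio.h>
        #include <string.h>
        #include <stdint.h>

        int main(void) {
            float sum = 100.0f + 0.25f;
            uint32_t bits;
            memcpy(&bits, &sum, sizeof bits);
            printf("0x%08X\n", bits);   /* 0x42C88000 = 0 10000101 10010001000... */
            return 0;
        }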


Example with post-normalization:

	s exp  h frtn
	- ---  - ----
	0 011  1 1100
      + 0 011  1 1011
	------------
	0 011 11 0111
	0 100  1 1011   (shifted right; trailing 1 --> discarded)




SUBTRACTION

     Like addition, but watch out when the numbers are close:

       1.23456  x 10 ** 2
     - 1.23455  x 10 ** 2
     -----------------
       0.00001  x 10 ** 2

       1.00000  x 10 ** -3

     A many-digit normalization is possible!
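
     This hazard (catastrophic cancellation) is easy to see from C;
     a minimal sketch of the example above, done in single precision:

         #include <stdio.h>

         int main(void) {
             float a = 1.23456e2f, b = 1.23455e2f;
             /* mathematically 1.0e-3, but a float keeps only ~7
                significant decimal digits, so the leading digits cancel
                and little more than rounding noise remains */
             printf("%.10e\n", a - b);
             return 0;
         }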

     first step:  align decimal points

     like addition as far as alignment of radix points

     then the algorithm for subtraction of sign mag. numbers takes over.


     before subtracting,
       compare magnitudes (don't forget the hidden bit!)
       change sign bit if order of operands is changed.

     don't forget to normalize number afterward.

	s exp  h frtn
	- ---  - ----
        0 011  1 1011	smaller
      -	0 011  1 1101	bigger
	------------
	switch operands and make the difference's sign negative

	0 011  1 1101   bigger
      - 0 011  1 1011	smaller
	------------
	1 011  0 0010   (difference; needs normalization)
	1 000  1 0000   (shift left 3, exponent 3 - 3 = 0)



MULTIPLICATION

 example on decimal values given in scientific notation:

       3.0 x 10 ** 1
     * 5.0 x 10 ** 2
     -----------------

     algorithm:  multiply mantissas
		 add exponents
		 normalize

       3.0 x 10 ** 1
     * 5.0 x 10 ** 2
     -----------------
      15.00 x 10 ** 3

      1.50 x 10 ** 4	


 example in binary:    use a mantissa that is only 4 bits so that
		       I don't spend all day just doing the multiplication
		       part.


     0 10000100 0100
   x 1 00111100 1100
   -----------------


   mantissa multiplication:           1.0100
    (don't forget hidden bit)	    x 1.1100
				    ------
				     00000
				    00000
				   10100
				  10100
				 10100
				 ---------
				1000110000
                      becomes   10.00110000



    add exponents:       always add true exponents
			 (otherwise the bias gets added in twice)

     biased:
     10000100
   + 00111100
   ----------


   10000100         01111111  (switch the order of the subtraction,
 - 01111111       - 00111100   so that we can get a negative value)
 ----------       ----------
   00000101         01000011
   true exp         true exp
     is 5.           is -67


     add true exponents      5 + (-67) is -62.

     re-bias exponent:     -62 + 127 is 65.
	  unsigned representation for 65 is  01000001.



     put the result back together (and add sign bit).


     1 01000001  10.00110000


     normalize the result:
	 (moving the radix point one place to the left increases
	  the exponent by 1.)

     1 01000001  10.00110000
       becomes
     1 01000010  1.000110000


     this is the value stored (note: the hidden bit is not stored!):
     1 01000010  000110000
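
     A C check of this example, building the operands from their true
     exponents (printf's %a format prints hexadecimal floating point,
     which exposes the mantissa and exponent directly):

         #include <stdio.h>
         #include <math.h>

         int main(void) {
             double a = ldexp(1.25, 5);        /*  1.0100 x 2^5   */
             double b = -ldexp(1.75, -67);     /* -1.1100 x 2^-67 */
             printf("%a\n", a * b);            /* -0x1.18p-61 = -1.000110000 x 2^-61 */
             return 0;
         }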



DIVISION

   similar to multiplication.

   true division:
   do unsigned division on the mantissas (don't forget the hidden bit)
   subtract TRUE exponents


   The IEEE standard is very specific about how all this is done.
   Unfortunately, the hardware to do all this is pretty slow.

   Some comparisons of approximate times:
       2's complement integer add      1 cycle
       fl. pt add                      1-2 cycles
       fl. pt multiply                 1-2 cycles
       fl. pt. divide                  4-8 cycles?


Some machines (e.g., Cray-1) used to do reciprocal approximation.

   Division by reciprocal approximation:


      instead of doing     a / b

      they do   a x  1/b.

      figure out a reciprocal for b, and then use the fl. pt.
      multiplication hardware.

      fast, but not completely correct.
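
      One classic way to get the reciprocal (not necessarily what the
      Cray-1 did; this is a generic Newton-Raphson sketch) is to start
      from a table-lookup guess and refine it, roughly squaring the
      error each iteration:

          #include <stdio.h>

          int main(void) {
              double b = 3.0;
              double x = 0.3;             /* initial guess at 1/b (HW uses a table) */
              for (int i = 0; i < 5; i++)
                  x = x * (2.0 - b * x);  /* Newton-Raphson step */
              printf("%.17g\n", x);       /* ~0.3333333333333333 */
              printf("%.17g\n", 7.0 * x); /* a/b computed as a * (1/b) */
              return 0;
          }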


Current fastest (and correct) division algorithm is SRT (optional)

    Sweeney, Robertson, and Tocher

    Uses redundant quotient representation
    E.g., base 4 usually has digits {0,1,2,3}
    SRT's redundant base 4 has digits  {-2,-1,0,+1,+2}

    Allows division algorithm to guess digits approximately
    with a table lookup.
    
    Approximations are fixed up when less-significant digits
    are calculated

    Final result in completely-accurate binary.

    In 1994, Intel got a few corner cases of the table wrong
    Maximum error less than one part in 10,000
    Nevertheless, Intel took a $300M write-off to replace chip

    Compare with software bugs that give the wrong answer
    and the customer pays for the upgrade


Rounding
--------
arithmetic operations on fl. pt. values often compute results that cannot
be represented exactly in the given amount of precision.  So, we must
round results.

There are MANY ways of rounding.  They each have "correct" uses, and
exist for different reasons.  The goal in a computation is to have the
computer round such that the end result is as "correct" as possible.
There are even arguments as to what is really correct.

lecture note:  a number line will help to get the message across.

Round to Nearest (Even)
-----------------------
        6-9 up
        5   to even, to make rounding unbiased
        1-4 down
        0   unchanged

E.g.,
	 .7783      if 3 decimal places available, .778
		    if 2 decimal places available, .78
	1.5	    if 1 digit available, 2
	2.5	    if 1 digit available, 2   (tie rounds to even)


In binary
        xxx.1....1...    up
        xxx.100000000... to even
        xxx.0....1...    down
        xxx.000000000... unchanged

Need infinite bits? No
        guard -- One extra bit 
		(the bit to the right of the radix point in example above)
        sticky -- OR of all the bits to the right of the guard bit

	Guard	Sticky	Round
	-----	------	-----	
	1	1	Up (add one in LSB position)
	1	0	To even (Force LSB to zero)
	0	1	Down (Truncate guard and sticky bits)
	0	0	Unchanged
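
The table can be coded directly.  A sketch on an integer mantissa
(round_ne and its arguments are names made up here; a real FPU would
also renormalize if the +1 carries out of the mantissa):

    #include <stdio.h>
    #include <stdint.h>

    uint32_t round_ne(uint32_t mant, int guard, int sticky) {
        if (guard && sticky)  return mant + 1;           /* up: add one in LSB position */
        if (guard && !sticky) return (mant + 1) & ~1u;   /* tie: force LSB to zero */
        return mant;                                     /* guard 0: truncate */
    }

    int main(void) {
        printf("%u\n", round_ne(5, 1, 1));   /* 6: round up */
        printf("%u\n", round_ne(5, 1, 0));   /* 6: tie goes to even */
        printf("%u\n", round_ne(4, 1, 0));   /* 4: tie, already even */
        printf("%u\n", round_ne(4, 0, 1));   /* 4: round down */
        return 0;
    }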

Round to nearest (even) is the default rounding mode in IEEE floating point


3 other methods of rounding:
  round toward 0 --  also called truncation.
     figure out how many bits (digits) are available.  Take that many
     bits (digits) for the result and throw away the rest.
     This has the effect of making the value represented closer
     to 0.

     example:
	 .7783      if 3 decimal places available, .778
		    if 2 decimal places available, .77

  round toward + infinity --
     regardless of the value, round towards +infinity.

     example:
	 1.23       if 2 digits, 1.3
	-2.86       if 2 digits, -2.8

  round toward - infinity --
     regardless of the value, round towards -infinity.

     example:
	 1.23       if 2 digits, 1.2
	-2.86       if 2 digits, -2.9

Round to +/- infinity is sometimes used to check the robustness of a
calculation.  It can also be used as a crude approximation of interval 
arithmetic, where one calculates error bars for results.
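
C99 exposes these modes through <fenv.h>.  A sketch (strictly portable
code would also enable #pragma STDC FENV_ACCESS ON; the volatiles below
discourage the compiler from folding the sum at compile time):

    #include <stdio.h>
    #include <fenv.h>

    int main(void) {
        volatile float one  = 1.0f;
        volatile float tiny = 5.9604645e-8f;      /* 2^-24: half an LSB at 1.0f */

        fesetround(FE_UPWARD);
        printf("%.8f\n", (double)(one + tiny));   /* 1.00000012 */
        fesetround(FE_DOWNWARD);
        printf("%.8f\n", (double)(one + tiny));   /* 1.00000000 */
        fesetround(FE_TONEAREST);
        printf("%.8f\n", (double)(one + tiny));   /* 1.00000000: tie goes to even */
        return 0;
    }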



overflow 
----------------------
Just as with integer arithmetic, floating point arithmetic operations
can cause overflow.  Detection of overflow in fl. pt. comes by checking
exponents before/during normalization.

Once overflow has occurred, an infinity value can be represented and
propagated through a calculation.

1/0 = infinity

infinity * x = infinity
1/infinity = 0
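
A quick C illustration (for IEEE floating point, division by zero is
well defined, unlike integer division):

    #include <stdio.h>

    int main(void) {
        double inf = 1.0 / 0.0;        /* divide by zero gives infinity */
        printf("%f\n", inf);           /* inf */
        printf("%f\n", inf * 42.0);    /* inf: infinity propagates */
        printf("%f\n", 1.0 / inf);     /* 0.000000 */
        return 0;
    }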


NaNs
------
Recall that NaN stands for Not A Number
e.g., sqrt(-1) = NaN

NaNs propagate through calculations
NaN * x = NaN, including NaN * 0 = NaN
1/NaN = NaN
Any operation on a NaN produces a NaN.
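
A quick C illustration (one quirk worth knowing: NaN is the only value
that compares unequal to itself):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double n = sqrt(-1.0);       /* invalid operation yields NaN */
        printf("%f\n", n);           /* nan */
        printf("%f\n", n * 0.0);     /* nan: NaN propagates, even times zero */
        printf("%d\n", n == n);      /* 0: NaN compares unequal to everything */
        return 0;
    }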


Underflow and Denormalized numbers
--------------------

Underflow occurs in fl. pt. representations when a number is
too small (close to 0) to be represented.  (show number line!)

if a fl. pt. value cannot be normalized
    (getting a 1 just to the left of the radix point would cause
     the exponent field to be all 0's)
    then underflow occurs.

IEEE Floating Point handles underflow with DENORMALIZED NUMBERS
	i.e., The hidden bit to the left of the radix point is ZERO
	This is indicated by E=0 and F != 0

Denormalized value = (-1)^S * F/2^n * 2^(E-bias+1)
                   = (-1)^S * F/2^23 * 2^-126

Notice that the value of the true exponent is the same for the smallest
normalized numbers as it is for denormalized numbers.

ADVANCED TOPIC: Why denorms?
Q: What is the maximum error in a number represented in the range 2^-126 to 2^-125?
Q: With denorms, what is the maximum error in a number represented in the 
   range 0 to 2^-126?
Q: Without denorms, what is the maximum error in a number represented in the 
   range 0 to 2^-126?

Show number line: denormalized numbers equalize the maximum error in the ranges 
2^-126 to 2^-125 and 0 to 2^-126.  

Without denorms, floating point FLUSHES TO ZERO, which results in loss of
precision, and hence larger errors, near zero.

With denorms, floating point provides GRADUAL UNDERFLOW

Denormalized numbers are difficult to implement efficiently, and some systems
trap to software to perform these operations.  
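
Gradual underflow can be watched from C via FLT_MIN, the smallest
normalized float (2^-126); a minimal sketch:

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        float x = FLT_MIN;               /* 2^-126, smallest normalized float */
        for (int i = 0; i < 3; i++) {
            x /= 2.0f;                   /* steps down into the denormalized range */
            printf("%g\n", x);           /* still nonzero: gradual underflow */
        }
        printf("%g\n", FLT_MIN / 16777216.0f);   /* below 2^-149: rounds to 0 */
        return 0;
    }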


How represent?
--------------

Approx way:

       S       E       F       number
      ---     ---     ---      ------
       S       0       F       (-1)^S x 1.F x 2^(-127)    # to be updated
       S       1       F       (-1)^S x 1.F x 2^(-126)
       S       2       F       (-1)^S x 1.F x 2^(-125)
       ...
       S      254      F       (-1)^S x 1.F x 2^(+127)
       S      255      F       (-1)^S x 1.F x 2^(+128)    # to be updated


Actual way:

       S       E       F       number
      ---     ---     ---      ------
       S       0       F       (-1)^S x 0.F x 2^(-126)    # denormalized
       S       1       F       (-1)^S x 1.F x 2^(-126)    # unchanged
       S       2       F       (-1)^S x 1.F x 2^(-125)    # unchanged
       ... 
       S      254      F       (-1)^S x 1.F x 2^(+127)    # unchanged
       S      255      0        (-1)^S x infinity         # infinities
       S      255    F!=0      NaN                        # not a number
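
The whole table fits in a few lines of C.  A sketch of a decoder
(8388608 is 2^23):

    #include <stdio.h>
    #include <math.h>
    #include <stdint.h>

    void decode(uint32_t bits) {
        uint32_t S = bits >> 31, E = (bits >> 23) & 0xFF, F = bits & 0x7FFFFF;
        double sign = S ? -1.0 : 1.0;
        if (E == 255 && F != 0)   printf("NaN\n");
        else if (E == 255)        printf("%cinf\n", S ? '-' : '+');
        else if (E == 0)          /* denormalized: 0.F x 2^-126 */
            printf("%g\n", sign * ldexp(F / 8388608.0, -126));
        else                      /* normalized: 1.F x 2^(E-127) */
            printf("%g\n", sign * ldexp(1.0 + F / 8388608.0, (int)E - 127));
    }

    int main(void) {
        decode(0x42806666);   /* ~64.2, the running example */
        decode(0x00000001);   /* smallest denorm: 2^-149 */
        decode(0x7F800000);   /* +inf */
        decode(0x7FC00000);   /* a NaN */
        return 0;
    }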


"Numerical analysis" Warning
----------------------------

Beware of mixing small and large numbers in FP:
        (3.1415... + 6*10^23) - 6*10^23 != 3.1415... + (6*10^23 - 6*10^23)
        The study of such effects is called numerical analysis.
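
A two-line C demonstration (values chosen so the small term is far
below one LSB of the large one):

    #include <stdio.h>

    int main(void) {
        double pi = 3.14159265358979, big = 6e23;
        printf("%g\n", (pi + big) - big);   /* 0: pi is absorbed by big's rounding */
        printf("%g\n", pi + (big - big));   /* 3.14159: FP add is not associative */
        return 0;
    }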



MIPS floating point hardware (CHAPTER 8)
----------------------------------------

Floating point arithmetic could be done by hardware, or by software.
Hardware is fast, but takes up chip real estate.
Software is slow, but takes up no space (memory for the software --
  an insignificant amount).

The MIPS architecture specifies and offers a HW approach.

FP is done in "Coprocessor 1" (but this is not important)


        --------       --------
	|      |       |      |
	| C0   |       | C1   |
	|      |       |      |
        --------       --------
	   |              |
	   |--------------|
           |
        --------
	|      |
	| MEM  |
	|      |
        --------


Just as there are registers meant for integers, there are registers
meant for floating pt. values.

  MIPS has 32 32-bit FP registers.

  Integer instructions have no access to these registers, just as
    fl. pt. instructions have no access to the integer registers.

  Single precision FP numbers (32b) must use even FP registers.

  Double precision FP numbers (64b) must use an even-odd pair of registers.

  Thus, there are really only 16 FP registers: 0, 2, 4, ..., 30.


    bit   31 . . .   0
	 --------------
     f0  |            |
	 +------------+
     f1  |            |
	 +------------+
             .
	     .
	     .
	 +------------+
    f29  |            |
	 +------------+
    f30  |            |
	 +------------+
    f31  |            |
	 --------------



FP Instructions

 (1) fl. pt. operations
      add, subtract, multiply, divide -- each specifies 3 fl. pt. registers.

      add.s ft, fr, fs (where .s means single precision)

      add.s $5, $6, $7


 (2) fp load/store instructions

     l.s  ft, x(rb)
	  Address of data is    x + (rb)  -- note that rb is an integer register
	  Read the data, and place it into fl. pt. register ft.

	  Address calculation is the same.  Where the data goes is different.
     
     s.s -- store
     
     l.s $5, 120($6)   # $5 is an FP register while $6 is a regular register

 (3) convert, moves, etc. -- idiosyncratic and not covered
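
For a feel of how these instructions appear in practice, here is a C
fragment; a MIPS compiler would typically turn the loop body into an
l.s / l.s / add.s / s.s sequence (the exact code depends on the
compiler and optimization level):

    /* element-wise single-precision vector add */
    void vadd(float *a, float *b, float *c, int n) {
        int i;
        for (i = 0; i < n; i++)
            c[i] = a[i] + b[i];   /* l.s, l.s, add.s, s.s per iteration */
    }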