Floating Point Representation

Computers represent real values in a form similar to that of scientific notation. Consider the value

1.23 x 10^4

The number has a sign (+ in this case)
The significand (1.23) is written with one non-zero digit to the left of the decimal point.
The base (radix) is 10.
The exponent (an integer value) is 4. It too must have a sign.

There are standards which define what the representation means, so that there will be consistency across computers.

Note that this is not the only way to represent floating point numbers; it is just the IEEE standard way of doing it.

Here is what we do:

the representation has three fields:

     ----------------------------
     | S |   E     |     F      |
     ----------------------------

S is one bit representing the sign of the number
E is an 8-bit biased integer representing the exponent
F is a 23-bit unsigned integer (the fraction, or mantissa, field)

the decimal value represented is:

	  (-1)^S  x  f  x  2^e
where
	    e = E - bias

	    f = ( F/(2^n) ) + 1

for single precision representation (the emphasis in this class)
n = 23
bias = 127

for double precision representation (a 64-bit representation)
n = 52 (there are 52 bits for the mantissa field)
bias = 1023 (there are 11 bits for the exponent field)
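The decode formula above can be checked directly. Here is a small Python sketch (the function name decode_single is mine, and the special-case patterns E = 0 and E = 255 are ignored) that pulls S, E, and F out of a 32-bit pattern and applies the single precision parameters:

```python
def decode_single(bits):
    """Decode a 32-bit IEEE single precision pattern (given as an int)
    using  (-1)^S x (1 + F/2^23) x 2^(E - 127).
    The special cases E == 0 and E == 255 are ignored here."""
    S = (bits >> 31) & 0x1       # 1 sign bit
    E = (bits >> 23) & 0xFF      # 8-bit biased exponent field
    F = bits & 0x7FFFFF          # 23-bit mantissa field
    e = E - 127                  # subtract out the bias
    f = 1 + F / 2**23            # put back the implied leading 1
    return (-1)**S * f * 2**e

print(decode_single(0x3F800000))   # 1.0
print(decode_single(0xC0490000))   # -3.140625
```

The same code handles double precision if the shifts, masks, n, and bias are swapped for the 64-bit parameters (n = 52, bias = 1023).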


Biased Integer Representation

Since floating point representations use biased integer representations for the exponent field, here is a brief discussion of biased integers.

Biased representation is an integer representation that skews the bit patterns so that they look just like unsigned values but actually represent negative numbers as well.

It represents a range of values (different from unsigned representation) using the unsigned representation. Another way of saying this: biased representation is a re-mapping of the unsigned integers.

visual example (of the re-mapping):

        bit pattern:        000  001  010  011  100  101  110  111

        unsigned value:      0    1    2    3    4    5    6    7

        biased-2 value:     -2   -1    0    1    2    3    4    5

This is biased-2. Note the dash character in the name of this representation. It is not a negative sign.

Example:

Given 4 bits, bias values by 2^3 = 8
(This choice of bias results in approximately half the represented values being negative.)

	  TRUE VALUE to be represented      3
	  add in the bias                  +8
					 ----
	  unsigned value                   11

	  so the 4-bit, biased-8 representation of the value 3
	  will be  1011

Example:

	  Going the other way, suppose we were given a
	  4-bit, biased-8 representation of   0110

	  unsigned 0110  represents 6
	  subtract out the bias   - 8
				  ----
	  TRUE VALUE represented   -2
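The two examples above are just an add and a subtract, which a short Python sketch makes explicit (the function names are mine):

```python
def to_biased(value, bias):
    """Return the unsigned bit pattern (as an int) that represents
    value in biased-`bias` form: just add in the bias."""
    return value + bias

def from_biased(pattern, bias):
    """Recover the true value from an unsigned biased pattern:
    just subtract out the bias."""
    return pattern - bias

# the 4-bit, biased-8 examples from the text:
print(format(to_biased(3, 8), '04b'))   # 1011
print(from_biased(0b0110, 8))           # -2
```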

On choosing a bias:
The bias chosen is most often based on the number of bits available for representing an integer. To get an approximately equal distribution of values above and below 0, the bias should be

      2 ^ (n-1)      or   (2^(n-1)) - 1

Now, what does all this mean?

An example: Put the decimal number 64.2 into the IEEE standard single precision floating point representation.

	first step:
	  get a binary representation for 64.2
	  to do this, get unsigned binary representations for the stuff to the left
	  and right of the decimal point separately.

	  64  is   1000000

	  .2 can be converted using the repeated-multiplication algorithm:

	  .2 x 2 =  0.4      0
	  .4 x 2 =  0.8      0
	  .8 x 2 =  1.6      1
	  .6 x 2 =  1.2      1

	  .2 x 2 =  0.4      0  now this whole pattern (0011) repeats.
	  .4 x 2 =  0.8      0
	  .8 x 2 =  1.6      1
	  .6 x 2 =  1.2      1
	    

	    so a binary representation for .2  is    .001100110011. . .

	         ----
	    or  .0011  (The bar over the top shows which bits repeat.)
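The multiply-by-2 steps above can be sketched in Python (the function name frac_bits is mine; Fraction is used so that .2 is held exactly while the bits are peeled off):

```python
from fractions import Fraction

def frac_bits(x, n):
    """First n binary digits of the fraction x (0 <= x < 1),
    found by repeatedly multiplying by 2: each multiply either
    crosses 1 (emit a 1 and subtract it off) or does not (emit a 0)."""
    bits = []
    for _ in range(n):
        x *= 2
        bit = int(x >= 1)      # the digit to the left of the point
        bits.append(str(bit))
        if bit:
            x -= 1             # keep only the fractional part
    return ''.join(bits)

print(frac_bits(Fraction(1, 5), 12))   # 001100110011
```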


	Putting the halves back together again:
	   64.2  is     1000000.0011001100110011. . .


      second step:
	Normalize the binary representation. (make it look like
	scientific notation)

	1.000000 00110011. . .  x  2^6

      third step:
	6 is the true exponent.  For the standard form, it needs to
	be in 8-bit, biased-127 representation.

	      6
	  + 127
	  -----
	    133

	133 in 8-bit, unsigned representation is 1000 0101

	This is the bit pattern used for E in the standard form.

      fourth step:
	the mantissa stored (F) is the stuff to the right of the radix point
	in the normalized form.  We need 23 bits of it.

	  000000 00110011001100110


      put it all together (and include the correct sign bit):

	 S     E               F
	 0  10000101  00000000110011001100110

      the values are often given in hex, so here it is

	 0100 0010 1000 0000 0110 0110 0110 0110

     0x   4    2    8    0    6    6    6    6
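As a check, the three fields can be reassembled into one 32-bit word and compared against Python's own single precision packing (a sketch; struct rounds to nearest, which here agrees with the truncation above because the first discarded bit is a 0):

```python
import struct

# assemble the fields worked out above into one 32-bit word
S = 0
E = 0b10000101                   # the biased exponent, 133
F = 0b00000000110011001100110    # the 23-bit mantissa field
word = (S << 31) | (E << 23) | F
print(hex(word))                 # 0x42806666

# cross-check against Python's single precision packing of 64.2
packed, = struct.unpack('>I', struct.pack('>f', 64.2))
assert packed == word
```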

Some extra details:

Copyright © Karen Miller, 2006