Floating Point Representation

Computers represent real values in a form similar to that of scientific notation. Consider the value

1.23 x 10^4

The number has a sign (+ in this case).
The significand (1.23) is written with one non-zero digit to the left of the decimal point.
The base (radix) is 10.
The exponent (an integer value) is 4. It too must have a sign.

There are standards that define what the representation means, so that it is consistent across computers.

Note that this is not the only way to represent floating point numbers; it is just the IEEE standard way of doing it.

Here is what we do:

The representation has three fields:

     ----------------------------
     | S |   E     |     F      |
     ----------------------------

S is one bit representing the sign of the number.
E is an 8-bit biased integer representing the exponent.
F is an unsigned integer holding the fraction bits (23 bits in single precision).

The decimal value represented is:

	  (-1)^S x f x 2^e
where
	    e = E - bias

	    f = ( F/(2^n) ) + 1

for single precision representation (the emphasis in this class)
n = 23
bias = 127

for double precision representation (a 64-bit representation)
n = 52 (there are 52 bits for the mantissa field)
bias = 1023 (there are 11 bits for the exponent field)
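
The formula can be tried out on a real bit pattern. What follows is a minimal sketch in Python, not part of the standard itself. The function name decode_single is made up for illustration, and the sketch assumes E is neither all 0s nor all 1s, since those two patterns are not covered by the formula above.

```python
def decode_single(bits):
    """Split a 32-bit pattern into S, E, F and apply (-1)^S x f x 2^e.

    Hypothetical helper: assumes 0 < E < 255.
    """
    n, bias = 23, 127                # single precision parameters
    S = (bits >> 31) & 0x1           # 1 sign bit
    E = (bits >> n) & 0xFF           # 8-bit biased exponent
    F = bits & ((1 << n) - 1)        # 23-bit fraction field
    e = E - bias
    f = F / 2**n + 1
    return S, E, F, (-1)**S * f * 2.0**e

S, E, F, value = decode_single(0xC0400000)
print(S, E, F, value)
```

For the pattern 0xC0400000 the fields are S = 1, E = 128, and F = 2^22, so e = 1, f = 1.5, and the value is -3.0.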

Now, what does all this mean?

An example: Put the decimal number 64.2 into the IEEE standard single precision floating point representation.

	first step:
	  get a binary representation for 64.2
	  to do this, get unsigned binary representations for the stuff to the left
	  and right of the decimal point separately.

	  64  is   1000000

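	  .2 can be converted by repeatedly multiplying by 2. The integer
	  part of each product (right column) is the next bit, and the
	  fractional part is carried to the next line: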

	  .2 x 2 =  0.4      0
	  .4 x 2 =  0.8      0
	  .8 x 2 =  1.6      1
	  .6 x 2 =  1.2      1

	  .2 x 2 =  0.4      0  now this whole pattern (0011) repeats.
	  .4 x 2 =  0.8      0
	  .8 x 2 =  1.6      1
	  .6 x 2 =  1.2      1
	    

	    so a binary representation for .2  is    .001100110011. . .

                 ----
	    or  .0011  (The bar over the top shows which bits repeat.)


	Putting the halves back together again:
	   64.2  is     1000000.0011001100110011. . .
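
The multiply-by-2 algorithm from this step can be sketched in Python. The function name fraction_bits is made up for illustration. The fraction is passed as a numerator and denominator (2/10 for .2) so that the arithmetic is exact; this is a choice made for this sketch, not something the algorithm requires.

```python
def fraction_bits(numerator, denominator, count):
    """Return the first `count` binary digits of numerator/denominator.

    Hypothetical helper: assumes 0 <= numerator/denominator < 1.
    """
    bits = []
    for _ in range(count):
        numerator *= 2                    # multiply the fraction by 2
        if numerator >= denominator:      # integer part of the product is 1
            bits.append("1")
            numerator -= denominator      # keep only the fractional part
        else:                             # integer part of the product is 0
            bits.append("0")
    return "".join(bits)

# bin(64) is '0b1000000'; [2:] strips the '0b' prefix.
print(bin(64)[2:] + "." + fraction_bits(2, 10, 16))
```

This prints 1000000.0011001100110011, the same bits as above.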


      second step:
	Normalize the binary representation. (make it look like
	scientific notation)

	1.000000 00110011. . . x 2^6
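
Normalizing amounts to counting how many places the radix point moves. Here is a rough sketch in Python that works on strings of bits. The function name normalize is made up for illustration, and the sketch only handles values of at least 1 (the integer part starts with a 1), which is the case in this example.

```python
def normalize(int_bits, frac_bits):
    """Move the radix point to just after the leading 1.

    Hypothetical helper: assumes int_bits starts with '1'.
    Returns (bits after the radix point, true exponent).
    """
    if not int_bits.startswith("1"):
        raise ValueError("this sketch only handles values >= 1")
    exponent = len(int_bits) - 1      # places the radix point moved left
    return int_bits[1:] + frac_bits, exponent

after_point, e = normalize("1000000", "0011001100110011")
print("1." + after_point, "x 2^" + str(e))
```

This prints 1.0000000011001100110011 x 2^6.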

      third step:
	6 is the true exponent.  For the standard form, it needs to
	be in 8-bit, biased-127 representation.

	      6
	  + 127
	  -----
	    133

	133 in 8-bit, unsigned representation is 1000 0101

	This is the bit pattern used for E in the standard form.
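
	The same arithmetic as a short Python sketch. The names BIAS and
	biased_exponent_bits are made up for illustration. The range check
	is an assumption of this sketch: it rejects results of all 0s or
	all 1s, which the steps here do not cover.

```python
BIAS = 127  # single precision

def biased_exponent_bits(true_exponent):
    """Return the 8-bit pattern for E = true exponent + bias (hypothetical helper)."""
    E = true_exponent + BIAS
    if not 0 < E < 255:
        raise ValueError("exponent out of range for this sketch")
    return format(E, "08b")           # 8 bits, padded with leading zeros

print(biased_exponent_bits(6))        # 10000101
```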

      fourth step:
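	The mantissa stored (F) is the stuff to the right of the radix point
	in the normalized form.  We need 23 bits of it.  The bits past the
	23rd are dropped.  Here they begin 011..., which is less than half
	of the last place kept, so rounding to nearest gives the same 23 bits.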

	  000000 00110011001100110


      Put it all together (and include the correct sign bit):

	 S     E               F
	 0  10000101  00000000110011001100110

      The values are often given in hex, so here it is:

	 0100 0010 1000 0000 0110 0110 0110 0110

     0x   4    2    8    0    6    6    6    6
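
The result can be checked in Python. This is a sketch that assembles the three fields by hand and compares the word with what the standard struct module produces when it packs 64.2 as a single precision value. The variable names are made up for illustration.

```python
import struct

S = 0
E = 0b10000101                        # 133, the biased exponent
F = 0b00000000110011001100110         # the 23 stored mantissa bits

# Place S in bit 31, E in bits 30..23, and F in bits 22..0.
word = (S << 31) | (E << 23) | F
print(f"0x{word:08X}")                # 0x42806666

# Pack 64.2 as a 32-bit float, then reread the same 4 bytes as an integer.
packed = struct.unpack(">I", struct.pack(">f", 64.2))[0]
print(packed == word)                 # True
```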

Some extra details:

Copyright © Karen Miller, 2006