Chapter 6 -- floating point arithmetic
about FLOATING POINT ARITHMETIC
-------------------------------
arithmetic operations on floating point numbers consist of
addition, subtraction, multiplication and division
the operations are done with algorithms similar to those used
on sign magnitude integers (because of the similarity of
representation) -- example, only add numbers of the same
sign. If the numbers are of opposite sign, must do subtraction.
ADDITION
example on decimal value given in scientific notation:
3.25 x 10 ** 3
+ 2.63 x 10 ** -1
-----------------
first step: align decimal points
second step: add
3.25 x 10 ** 3
+ 0.000263 x 10 ** 3
--------------------
3.250263 x 10 ** 3
(presumes use of infinite precision, without regard for accuracy)
third step: normalize the result (already normalized!)
example on fl pt. value given in binary:
S E F
.25 = 0 01111101 00000000000000000000000
100 = 0 10000101 10010000000000000000000
to add these fl. pt. representations,
step 1: align radix points
shifting the mantissa LEFT by 1 bit DECREASES THE EXPONENT by 1
shifting the mantissa RIGHT by 1 bit INCREASES THE EXPONENT by 1
we want to shift the mantissa right, because the bits that
fall off the end should come from the least significant end
of the mantissa
-> choose to shift the .25, since we want to increase it's exponent.
-> shift by 10000101
-01111101
---------
00001000 (8) places.
with hidden bit and radix point shown, for clarity:
0 01111101 1.00000000000000000000000 (original value)
0 01111110 0.10000000000000000000000 (shifted 1 place)
(note that hidden bit is shifted into msb of mantissa)
0 01111111 0.01000000000000000000000 (shifted 2 places)
0 10000000 0.00100000000000000000000 (shifted 3 places)
0 10000001 0.00010000000000000000000 (shifted 4 places)
0 10000010 0.00001000000000000000000 (shifted 5 places)
0 10000011 0.00000100000000000000000 (shifted 6 places)
0 10000100 0.00000010000000000000000 (shifted 7 places)
0 10000101 0.00000001000000000000000 (shifted 8 places)
step 2: add (don't forget the hidden bit for the 100)
0 10000101 1.10010000000000000000000 (100)
+ 0 10000101 0.00000001000000000000000 (.25)
---------------------------------------
0 10000101 1.10010001000000000000000
step 3: normalize the result (get the "hidden bit" to be a 1)
it already is for this example.
result is
0 10000101 10010001000000000000000
suppose that the result of an addition of aligned mantissas
gives
10.11110000000000000000000
and the exponent to go with this is 10000000.
We must put the mantissa back in the normalized form.
Shift the mantissa to the right by one place, and increase the
exponent by 1.
The exponent and mantissa become
10000001 1.01111000000000000000000 0 (1 bit is lost off the least
significant end)
SUBTRACTION
like addition as far as alignment of radix points
then the algorithm for subtraction of sign mag. numbers takes over.
before subtracting,
compare magnitudes (don't forget the hidden bit!)
change sign bit if order of operands is changed.
don't forget to normalize number afterward.
EXAMPLE:
0 10000001 10010001000000000000000 (the representations)
- 0 10000000 11100000000000000000000
---------------------------------------
step 1: align radix points
0 10000000 11100000000000000000000
becomes
0 10000001 11110000000000000000000 (notice hidden bit shifted in)
0 10000001 1.10010001000000000000000
- 0 10000001 0.11110000000000000000000
---------------------------------------
step 2: subtract mantissa
1.10010001000000000000000
- 0.11110000000000000000000
-------------------------------
0.10100001000000000000000
step 3: put result in normalized form
Shift mantissa left by 1 place, implying a subtraction of 1 from
the exponent.
0 10000000 01000010000000000000000
MULTIPLICATION
example on decimal values given in scientific notation:
3.0 x 10 ** 1
+ 0.5 x 10 ** 2
-----------------
algorithm: multiply mantissas (use unsigned multiplication)
add exponents
3.0 x 10 ** 1
+ 0.5 x 10 ** 2
-----------------
1.50 x 10 ** 3
example in binary: use a mantissa that is only 4 bits as an example
0 10000100 0100
x 1 00111100 1100
-----------------
The multiplication is unsigned multiplication (NOT two's complement),
so, make sure that all bits are included in the answer. Do NOT
do any sign extension!
mantissa multiplication: 1.0100
(don't forget hidden bit) x 1.1100
------
00000
00000
10100
10100
10100
---------
1000110000
becomes 10.00110000
add exponents: always add true exponents
(otherwise the bias gets added in twice)
biased:
10000100
+ 00111100
----------
10000100 01111111 (switch the order of the subtraction,
- 01111111 - 00111100 so that we can get a negative value)
---------- ----------
00000101 01000011
true exp true exp
is 5. is -67
add true exponents 5 + (-67) is -62.
re-bias exponent: -62 + 127 is 65.
unsigned representation for 65 is 01000001.
put the result back together (and add sign bit).
1 01000001 10.00110000
normalize the result:
(moving the radix point one place to the left increases
the exponent by 1. Can also think of this as a logical
right shift.)
1 01000001 10.00110000
becomes
1 01000010 1.000110000
this is the value stored (not the hidden bit!):
1 01000010 000110000
DIVISION
similar to multiplication.
true division:
do unsigned division on the mantissas (don't forget the hidden bit)
subtract TRUE exponents
The IEEE standard is very specific about how all this is done.
Unfortunately, the hardware to do all this is pretty slow.
Some comparisons of approximate times:
2's complement integer add 1 time unit
fl. pt add 4 time units
fl. pt multiply 6 time units
fl. pt. divide 13 time units
There is a faster way to do division. Its called
division by reciprocal approximation. It takes about the same
time as a fl. pt. multiply. Unfortunately, the results are
not always correct.
Division by reciprocal approximation:
instead of doing a / b
they do a x 1/b.
figure out a reciprocal for b, and then use the fl. pt.
multiplication hardware.
example of a result that isn't the same as with true division.
true division: 3/3 = 1 (exactly)
reciprocal approx: 1/3 = .33333333
3 x .33333333 = .99999999, not 1
It is not always possible to get a perfectly accurate reciprocal.
Current fastest (and correct) division algorithm is called SRT
Sweeney, Robertson, and Tocher
Uses redundant quotient representation
E.g., base 4 usually has digits {0,1,2,3}
SRT's redundant base 4 has digits {-2,-1,0,+1,+2}
Allows division algorithm to guess digits approximately
with a table lookup.
Approximations are fixed up when less-sigificant digits
are calculated
Final result in completely-accurate binary.
In 1994, Intel got a few corner cases of the table wrong
Maximum error less than one part in 10,000
Nevertheless, Intel took a $300M write-off to replace chip
Compare with software bugs that give the wrong answer
and the customer pays for the upgrade
ISSUES in floating point
note: this discussion only touches the surface of some issues that
people deal with. Entire courses are taught on floating point
arithmetic.
use of standards
----------------
--> allows all machines following the standard to exchange data
and to calculate the exact same results.
--> IEEE fl. pt. standard sets
parameters of data representation (# bits for mantissa vs. exponent)
--> MIPS architecture follows the standard
(All architectures follow the standard now.)
overflow and underflow
----------------------
Just as with integer arithmetic, floating point arithmetic operations
can cause overflow. Detection of overflow in fl. pt. comes by checking
exponents before/during normalization.
Once overflow has occurred, an infinity value can be represented and
propagated through a calculation.
Underflow occurs in fl. pt. representations when a number is
to small (close to 0) to be represented. (show number line!)
if a fl. pt. value cannot be normalized
(getting a 1 just to the left of the radix point would cause
the exponent field to be all 0's)
then underflow occurs.
Underflow may result in the representation of denormalized values.
These are ones in which the hidden bit is a 0. The exponent field
will also be all 0s in this case. Note that there would be a
reduced precision for representing the mantissa (less than 23 bits
to the right of the radix point).
Defining the results of certain operations
------------------------------------------
Many operations result in values that cannot be represented.
Other operations have undefined results. Here is a list
of some operations and their results as defined in the standard.
1/0 = infinity
any positive value / 0 = positive infinity
any negative value / 0 = negative infinity
infinity * x = infinity
1/infinity = 0
0/0 = NaN
0 * infinity = NaN
infinity * infinity = NaN
infinity - infinity = NaN
infinity/infinity = NaN
x + NaN = NaN
NaN + x = NaN
x - NaN = NaN
NaN - x = NaN
sqrt(negative value) = NaN
NaN * x = NaN
NaN * 0 = NaN
1/NaN = NaN
(Any operation on a NaN produces a NaN.)
HW vs. SW computing
-------------------
floating point operations can be done by hardware (circuitry)
or by software (program code).
-> a programmer won't know which is occuring, without prior knowledge
of the HW.
-> SW is much slower than HW. by approx. 1000 times.
A difficult (but good) exercize for students would be to design
a SW algorithm for doing fl. pt. addition using only integer
operations.
SW to do fl. pt. operations is tedious. It takes lots of shifting
and masking to get the data in the right form to use integer arithmetic
operations to get a result -- and then more shifting and masking to put
the number back into fl. pt. format.
A common thing for manufacturers to do is to offer 2 versions of the
same architecture, one with HW, and the other with SW fl. pt. ops.
Copyright © Karen Miller, 2006
|