In STEP 2, floating point representation was introduced as a convenient way of dealing with large or small numbers. Since most scientific computations involve such numbers, many students will be familiar with floating point arithmetic and will appreciate the way in which it facilitates calculations involving multiplication or division.

In order to investigate the implications of **finite number representation**, one must examine the way in which arithmetic is carried out with **floating point numbers**. The following specifications apply to most computers which **round**, and are easily adapted to those which **chop**. For the sake of simplicity in the examples, we will use a three-digit decimal mantissa normalized to lie in the range [1, 10) (most computers use **binary representation** and the mantissa is commonly normalized to lie in the range [½, 1)). Note that up to six digits are used for intermediate results, but the final result of each operation is a normalized three-digit decimal floating point number.

In addition and subtraction, the mantissae are added or subtracted (after shifting the mantissa and increasing the exponent of the smaller number, if necessary, to make the exponents agree); the final normalized result is obtained by rounding (after shifting the mantissa and adjusting the exponent, if necessary). Thus:

$3.12 \times 10^{1} + 4.26 \times 10^{1} = 7.38 \times 10^{1}$,

$2.77 \times 10^{2} + 7.55 \times 10^{2} = 10.32 \times 10^{2} \to 1.03 \times 10^{3}$,

$6.18 \times 10^{1} + 1.84 \times 10^{-1} = 6.18 \times 10^{1} + 0.0184 \times 10^{1} = 6.1984 \times 10^{1} \to 6.20 \times 10^{1}$,

$3.65 \times 10^{-1} - 2.78 \times 10^{-1} = 0.87 \times 10^{-1} \to 8.70 \times 10^{-2}$.
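The four sums and differences above can be checked with a minimal sketch that uses Python's standard `decimal` module, which computes each result exactly and then rounds it to the precision of the chosen context. The helper names `fl_add` and `fl_sub` are hypothetical, and the sketch assumes that "rounding" means round-half-up to three significant digits.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

# Assumption: three significant decimal digits, ties rounded away from zero.
ctx = Context(prec=3, rounding=ROUND_HALF_UP)


def fl_add(a: str, b: str) -> Decimal:
    """Add exactly, then round the sum to three significant digits."""
    return ctx.add(Decimal(a), Decimal(b))


def fl_sub(a: str, b: str) -> Decimal:
    """Subtract exactly, then round the difference to three significant digits."""
    return ctx.subtract(Decimal(a), Decimal(b))


results = [
    fl_add("3.12E1", "4.26E1"),
    fl_add("2.77E2", "7.55E2"),
    fl_add("6.18E1", "1.84E-1"),
    fl_sub("3.65E-1", "2.78E-1"),
]
for value in results:
    # Print each result as a normalized three-digit mantissa and exponent.
    print(f"{value:.2E}")
```

The alignment of exponents and the final normalization are handled inside `decimal`, so this sketch confirms the results rather than showing the individual shifts.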

In multiplication, the exponents are added and the mantissae are multiplied; the final result is obtained by rounding (after shifting the mantissa right and increasing the exponent by 1, if necessary). Thus:

$(4.27 \times 10^{1}) \times (3.68 \times 10^{1}) = 15.7136 \times 10^{2} \to 1.57 \times 10^{3}$,

$(2.73 \times 10^{2}) \times (-3.64 \times 10^{-2}) = -9.9372 \times 10^{0} \to -9.94 \times 10^{0}$.

In division, the exponents are subtracted and the mantissae are divided; the final result is obtained by rounding (after shifting the mantissa left and reducing the exponent by 1, if necessary). Thus:

$(5.43 \times 10^{1}) / (4.55 \times 10^{2}) = 1.19340\ldots \times 10^{-1} \to 1.19 \times 10^{-1}$,

$(-2.75 \times 10^{2}) / (9.87 \times 10^{-2}) = -0.278622\ldots \times 10^{4} \to -2.79 \times 10^{3}$.
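The multiplication and division rules can be sketched directly on (mantissa, exponent) pairs, which makes the shifting step explicit. The names `normalize`, `fl_mul` and `fl_div` are hypothetical. The sketch assumes round-half-up, and it carries quotients at Python's default 28-digit `decimal` precision rather than the six intermediate digits mentioned in the text.

```python
from decimal import Decimal, ROUND_HALF_UP

HUNDREDTH = Decimal("0.01")  # three-digit mantissa: one digit before the point, two after


def normalize(m: Decimal, e: int) -> tuple:
    """Shift the mantissa into [1, 10) and round it to three digits."""
    if m == 0:
        return Decimal("0.00"), 0
    while abs(m) >= 10:  # shift right and increase the exponent
        m, e = m.scaleb(-1), e + 1
    while abs(m) < 1:    # shift left and reduce the exponent
        m, e = m.scaleb(1), e - 1
    m = m.quantize(HUNDREDTH, rounding=ROUND_HALF_UP)
    if abs(m) >= 10:     # rounding can carry, e.g. 9.996 -> 10.00
        m, e = m.scaleb(-1).quantize(HUNDREDTH, rounding=ROUND_HALF_UP), e + 1
    return m, e


def fl_mul(x: tuple, y: tuple) -> tuple:
    """Multiply the mantissae, add the exponents, then normalize."""
    return normalize(x[0] * y[0], x[1] + y[1])


def fl_div(x: tuple, y: tuple) -> tuple:
    """Divide the mantissae, subtract the exponents, then normalize."""
    return normalize(x[0] / y[0], x[1] - y[1])


print(fl_mul((Decimal("4.27"), 1), (Decimal("3.68"), 1)))
print(fl_div((Decimal("-2.75"), 2), (Decimal("9.87"), -2)))
```

Each number is passed as a pair such as `(Decimal("4.27"), 1)` for $4.27 \times 10^{1}$, and the returned pair is the normalized three-digit result.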

In an expression, the order of evaluation is determined in a standard way and the result of each operation is a normalized floating point number. Thus:

$(6.18 \times 10^{1} + 1.84 \times 10^{-1}) / ((4.27 \times 10^{1}) \times (3.68 \times 10^{1})) \to (6.20 \times 10^{1}) / (1.57 \times 10^{3}) = 3.94904\ldots \times 10^{-2} \to 3.95 \times 10^{-2}$.
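As a rough illustration of why rounding after every operation matters, the sketch below evaluates the same expression twice: once with each intermediate result rounded to three digits, and once with a single rounding at the end. It relies on Python's `decimal` module and assumes round-half-up; the variable names are illustrative only.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

# Assumption: three significant decimal digits with round-half-up.
ctx = Context(prec=3, rounding=ROUND_HALF_UP)

a, b = Decimal("6.18E1"), Decimal("1.84E-1")
c, d = Decimal("4.27E1"), Decimal("3.68E1")

# Every operation returns a rounded three-digit floating point number.
numerator = ctx.add(a, b)           # 6.1984E1 is rounded to 6.20E1
denominator = ctx.multiply(c, d)    # 15.7136E2 is rounded to 1.57E3
stepwise = ctx.divide(numerator, denominator)

# For comparison: evaluate at the default 28-digit precision, round once.
rounded_once = ctx.plus((a + b) / (c * d))

print(f"rounded after each operation: {stepwise:.2E}")
print(f"rounded once at the end:      {rounded_once:.2E}")
```

The two values differ in the last digit, which shows how the errors generated in the numerator and denominator carry through to the quotient.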

Note that all the above examples (except the subtraction and the first addition) involve generated errors which are relatively large due to the small number of digits in the mantissae. Thus the generated error in

$2.77 \times 10^{2} + 7.55 \times 10^{2} = 10.32 \times 10^{2} \to 1.03 \times 10^{3}$

is $0.002 \times 10^{3}$. Since the propagated error in this example may be as large as $0.01 \times 10^{2}$ (assuming the operands are correct to 3*S*), one can use the result given in Error propagation to deduce that the accumulated error cannot exceed $0.002 \times 10^{3} + 0.01 \times 10^{2} = 0.003 \times 10^{3}$.
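The bound above can be reproduced numerically. In this sketch the helper `half_unit` is hypothetical: it returns half a unit in the third significant digit, which is taken here as the largest error in an operand that is correct to 3*S*. Round-half-up is again an assumption.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

# Assumption: three significant decimal digits with round-half-up.
ctx = Context(prec=3, rounding=ROUND_HALF_UP)

a, b = Decimal("2.77E2"), Decimal("7.55E2")


def half_unit(x: Decimal) -> Decimal:
    """Half a unit in the third significant digit of x."""
    return Decimal(1).scaleb(x.adjusted() - 2) / 2


exact_sum = a + b                      # 1032, exact at the default precision
computed = ctx.add(a, b)               # rounded to 1.03E3
generated = abs(computed - exact_sum)  # error introduced by the final rounding

# For a sum, the operand error bounds simply add.
propagated = half_unit(a) + half_unit(b)
accumulated_bound = generated + propagated

print(generated, propagated, accumulated_bound)
```

Here `generated` corresponds to $0.002 \times 10^{3}$, `propagated` to $0.01 \times 10^{2}$ and `accumulated_bound` to $0.003 \times 10^{3}$.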

The peculiarities of floating point arithmetic lead to some unexpected and unfortunate consequences, including the following:

- Addition or subtraction of a small (but nonzero) number may have no effect; for example,

  $5.18 \times 10^{2} + 4.37 \times 10^{-1} = 5.18 \times 10^{2} + 0.00437 \times 10^{2} = 5.18437 \times 10^{2} \to 5.18 \times 10^{2}$,

  whence the **additive identity is not unique**.
- Frequently, the result of *a* × (1/*a*) is not 1; for example, if *a* = $3.00 \times 10^{0}$, then 1/*a* is $3.33 \times 10^{-1}$ and *a* × (1/*a*) is $9.99 \times 10^{-1}$, whence the **multiplicative inverse** may not exist.
- The result of (*a* + *b*) + *c* is not always the same as the result of *a* + (*b* + *c*); for example, if *a* = $6.31 \times 10^{1}$, *b* = $4.24 \times 10^{0}$, *c* = $2.47 \times 10^{-1}$, then

  $(a + b) + c = (6.31 \times 10^{1} + 0.424 \times 10^{1}) + 2.47 \times 10^{-1} \to 6.73 \times 10^{1} + 0.0247 \times 10^{1} \to 6.75 \times 10^{1}$,

  whereas

  $a + (b + c) = 6.31 \times 10^{1} + (4.24 \times 10^{0} + 2.47 \times 10^{-1}) \to 6.31 \times 10^{1} + 4.49 \times 10^{0} = 6.31 \times 10^{1} + 0.449 \times 10^{1} \to 6.76 \times 10^{1}$,

  whence the **associative law for addition** does not always apply. Examples involving adding many numbers of varying size indicate that adding in order of increasing magnitude is preferable to adding in the reverse order.

- Subtracting a number from another nearly equal number may result in **loss of significance** or **cancellation errors**. In order to illustrate this loss of accuracy, suppose that we evaluate *f*(*x*) = 1 − cos *x* for *x* = 0.05, using three-digit, decimal, normalized, floating point arithmetic with rounding. Then

  $1 - \cos(0.05) = 1 - 0.99875\ldots \to 1.00 \times 10^{0} - 0.999 \times 10^{0} \to 1.00 \times 10^{-3}$.

  Although the value of 1 is exact and cos(0.05) is correct to 3*S* when expressed as a three-digit floating point number, their computed difference is correct to only 1*S*! (The two zeros after the decimal point in $1.00 \times 10^{-3}$ **pad** the number.)

  The approximation 0.999 ≈ cos(0.05) has a relative error of about $2.5 \times 10^{-4}$. By comparison, the relative error of $1.00 \times 10^{-3}$ ≈ 1 − cos(0.05) is about 0.2, i.e., it is much larger. Thus, subtraction of two nearly equal numbers should be avoided whenever possible.

  In the case of *f*(*x*) = 1 − cos *x*, one can avoid loss of significant digits by rewriting the expression, for example by means of the identity $1 - \cos x = 2\sin^{2}(x/2)$, which also appears in the exercises. A formula of this kind is more suitable for calculations when *x* is close to 0. It can be verified that the more accurate approximation of $1.25 \times 10^{-3}$ is obtained for 1 − cos(0.05) when three-digit floating point arithmetic is used.

### Checkpoint

- Why is it sometimes necessary to shift the mantissa and adjust the exponent of a **floating point number**?
- Does **floating point arithmetic** obey the usual laws of arithmetic?
- Why should the **subtraction of two nearly equal numbers** be avoided?

**Exercises**

1. Evaluate the following expressions, using three-digit decimal normalized floating point arithmetic with rounding:
   - $6.19 \times 10^{2} + 5.82 \times 10^{2}$,
   - $6.19 \times 10^{2} + 3.61 \times 10^{1}$,
   - $6.19 \times 10^{2} - 5.82 \times 10^{2}$,
   - $6.19 \times 10^{2} - 3.61 \times 10^{1}$,
   - $(3.60 \times 10^{3}) \times (1.01 \times 10^{-1})$,
   - $(-7.50 \times 10^{-1}) \times (-4.44 \times 10^{1})$,
   - $(6.45 \times 10^{2}) / (5.16 \times 10^{-1})$,
   - $(-2.86 \times 10^{-2}) / (3.29 \times 10^{3})$.
2. Estimate the accumulated errors in the results of Exercise 1, assuming that all values are correct to 3*S*.
3. Evaluate the following expressions, using four-digit decimal normalized floating point arithmetic with rounding, then recalculate them, carrying all decimal places, and estimate the propagated error.
   - Given *a* = $6.842 \times 10^{-1}$, *b* = $5.685 \times 10^{1}$, *c* = $5.641 \times 10^{1}$, find *a*(*b* − *c*) and *ab* − *ac*.
   - Given *a* = $9.812 \times 10^{1}$, *b* = $4.631 \times 10^{-1}$, *c* = $8.340 \times 10^{-1}$, find (*a* + *b*) + *c* and *a* + (*b* + *c*).
4. Use four-digit decimal normalized floating point arithmetic with rounding to calculate *f*(*x*) = tan *x* − sin *x* for *x* = 0.1. Since

   $\tan x - \sin x = \tan x\,(1 - \cos x) = \tan x\,(2\sin^{2}(x/2))$,

   *f*(*x*) may be written as *f*(*x*) = 2 tan *x* sin²(*x*/2). Repeat the calculation using this alternative expression. Which of the two values is more accurate?


Answers