### Interval arithmetic

In STEP 2, floating point representation was introduced as a convenient way of dealing with large or small numbers. Since most scientific computations involve such numbers, many students will be familiar with floating point arithmetic and will appreciate the way in which it facilitates calculations involving multiplication or division.

In order to investigate the implications of finite number representation, one must examine the way in which arithmetic is carried out with floating point numbers. The following specifications apply to most computers which round, and are easily adapted to those which chop. For the sake of simplicity in the examples, we will use a three-digit decimal mantissa normalized to lie in the range $[1, 10)$ (most computers use binary representation, and the mantissa is commonly normalized to lie in the range $[\tfrac{1}{2}, 1)$). Note that up to six digits are used for intermediate results, but the final result of each operation is a normalized three-digit decimal floating point number.

### Addition and subtraction

Mantissae are added or subtracted (after shifting the mantissa and increasing the exponent of the smaller number, if necessary, to make the exponents agree); the final normalized result is obtained by rounding (after shifting the mantissa and adjusting the exponent, if necessary). Thus:

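$$
\begin{aligned}
3.12 \times 10^{1} + 4.26 \times 10^{1} &= 7.38 \times 10^{1}, \\
2.77 \times 10^{2} + 7.55 \times 10^{2} &= 10.32 \times 10^{2} \to 1.03 \times 10^{3}, \\
6.18 \times 10^{1} + 1.84 \times 10^{-1} &= 6.18 \times 10^{1} + 0.0184 \times 10^{1} = 6.1984 \times 10^{1} \to 6.20 \times 10^{1}, \\
3.65 \times 10^{-1} - 2.78 \times 10^{-1} &= 0.87 \times 10^{-1} \to 8.70 \times 10^{-2}.
\end{aligned}
$$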
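The align-then-round behaviour of these sums can be imitated with Python's standard `decimal` module. The following is only a sketch under stated assumptions: the names `FL3`, `fl_add`, and `fl_sub` are hypothetical, ties are assumed to round away from zero (`ROUND_HALF_UP`), and the exact result is rounded once instead of being held in a six-digit intermediate register, which makes no difference for the four examples above.

```python
from decimal import Decimal, Context, ROUND_HALF_UP

# Assumed model: three significant decimal digits, ties rounded away from zero.
FL3 = Context(prec=3, rounding=ROUND_HALF_UP)

def fl_add(x: str, y: str) -> Decimal:
    """Add two decimal literals exactly, then round the sum to three digits."""
    return FL3.add(Decimal(x), Decimal(y))

def fl_sub(x: str, y: str) -> Decimal:
    """Subtract two decimal literals exactly, then round to three digits."""
    return FL3.subtract(Decimal(x), Decimal(y))

print(fl_add("3.12E1", "4.26E1"))    # exact sum 73.8 needs no rounding
print(fl_add("2.77E2", "7.55E2"))    # exact sum 1032 is rounded to 1.03 x 10^3
print(fl_add("6.18E1", "1.84E-1"))   # exact sum 61.984 is rounded to 6.20 x 10^1
print(fl_sub("3.65E-1", "2.78E-1"))  # exact difference 0.087, i.e. 8.70 x 10^-2
```

The operands are passed as strings so that they are read as exact decimal values rather than as binary floats.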

### Multiplication

The exponents are added and the mantissae are multiplied; the final result is obtained by rounding (after shifting the mantissa right and increasing the exponent by 1, if necessary). Thus:

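$$
\begin{aligned}
(4.27 \times 10^{1}) \times (3.68 \times 10^{1}) &= 15.7136 \times 10^{2} \to 1.57 \times 10^{3}, \\
(2.73 \times 10^{2}) \times (-3.64 \times 10^{-2}) &= -9.9372 \times 10^{0} \to -9.94 \times 10^{0}.
\end{aligned}
$$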

### Division

The exponents are subtracted and the mantissae are divided; the final result is obtained by rounding (after shifting the mantissa left and reducing the exponent by 1, if necessary). Thus:

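$$
\begin{aligned}
(5.43 \times 10^{1}) / (4.55 \times 10^{2}) &= 1.19340\ldots \times 10^{-1} \to 1.19 \times 10^{-1}, \\
(-2.75 \times 10^{2}) / (9.87 \times 10^{-2}) &= -0.278622\ldots \times 10^{4} \to -2.79 \times 10^{3}.
\end{aligned}
$$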
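The rules for multiplication and division can also be traced step by step on explicit (mantissa, exponent) pairs. The sketch below is illustrative only: `normalize`, `fl_mul`, and `fl_div` are hypothetical helper names, rounding is assumed to be half-up, and the quotient of the mantissae is carried to Python's default 28 decimal digits rather than six.

```python
from decimal import Decimal, ROUND_HALF_UP

TWO_PLACES = Decimal("0.01")  # three significant digits for a mantissa in [1, 10)

def normalize(mantissa, exponent):
    """Return (mantissa, exponent) with |mantissa| in [1, 10), rounded to 3 digits."""
    if mantissa == 0:
        return Decimal("0.00"), 0
    while abs(mantissa) >= 10:   # shift right and increase the exponent
        mantissa, exponent = mantissa / 10, exponent + 1
    while abs(mantissa) < 1:     # shift left and reduce the exponent
        mantissa, exponent = mantissa * 10, exponent - 1
    mantissa = mantissa.quantize(TWO_PLACES, rounding=ROUND_HALF_UP)
    if abs(mantissa) >= 10:      # rounding can carry, e.g. 9.996 -> 10.00
        mantissa = (mantissa / 10).quantize(TWO_PLACES, rounding=ROUND_HALF_UP)
        exponent += 1
    return mantissa, exponent

def fl_mul(a, b):
    """Multiply the mantissae, add the exponents, then normalize and round."""
    (m1, e1), (m2, e2) = a, b
    return normalize(m1 * m2, e1 + e2)

def fl_div(a, b):
    """Divide the mantissae, subtract the exponents, then normalize and round."""
    (m1, e1), (m2, e2) = a, b
    return normalize(m1 / m2, e1 - e2)

# The four worked examples, written as (mantissa, exponent) pairs.
print(fl_mul((Decimal("4.27"), 1), (Decimal("3.68"), 1)))    # 1.57 x 10^3
print(fl_mul((Decimal("2.73"), 2), (Decimal("-3.64"), -2)))  # -9.94 x 10^0
print(fl_div((Decimal("5.43"), 1), (Decimal("4.55"), 2)))    # 1.19 x 10^-1
print(fl_div((Decimal("-2.75"), 2), (Decimal("9.87"), -2)))  # -2.79 x 10^3
```

Keeping the exponent as a separate integer makes the right shift after multiplication and the left shift after division visible as explicit steps.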

### Expressions

The order of evaluation is determined in a standard way and the result of each operation is a normalized floating point number. Thus:

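$$
\frac{6.18 \times 10^{1} + 1.84 \times 10^{-1}}{(4.27 \times 10^{1}) \times (3.68 \times 10^{1})} \to \frac{6.20 \times 10^{1}}{1.57 \times 10^{3}} = 3.94904\ldots \times 10^{-2} \to 3.95 \times 10^{-2}.
$$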
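Rounding after every operation can be compared with rounding once at the end. The following sketch, again using the `decimal` module with an assumed half-up rounding rule and a hypothetical context named `FL3`, evaluates the expression above both ways.

```python
from decimal import Decimal, Context, ROUND_HALF_UP

# Assumed model: three significant decimal digits, ties rounded away from zero.
FL3 = Context(prec=3, rounding=ROUND_HALF_UP)

a, b = Decimal("6.18E1"), Decimal("1.84E-1")
c, d = Decimal("4.27E1"), Decimal("3.68E1")

# Each operation returns a rounded three-digit result before the next one starts.
numerator = FL3.add(a, b)            # 6.20 x 10^1
denominator = FL3.multiply(c, d)     # 1.57 x 10^3
stepwise = FL3.divide(numerator, denominator)

# Reference: evaluate with Python's default 28-digit context, round once at the end.
reference = (a + b) / (c * d)
rounded_once = FL3.plus(reference)

print(stepwise, rounded_once)
```

For this expression the two results differ in the last digit ($3.95 \times 10^{-2}$ against $3.94 \times 10^{-2}$), because the rounding errors of the intermediate sum and product are carried into the quotient.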

### Generated error

Note that all the above examples (except the subtraction and the first addition) involve generated errors which are relatively large due to the small number of digits in the mantissae. Thus the generated error in

$$
2.77 \times 10^{2} + 7.55 \times 10^{2} = 10.32 \times 10^{2} \to 1.03 \times 10^{3}
$$

is $0.002 \times 10^{3}$. Since the propagated error in this example may be as large as $0.01 \times 10^{2}$ (assuming the operands are correct to 3S), one can use the result given in Error propagation to deduce that the accumulated error cannot exceed $0.002 \times 10^{3} + 0.01 \times 10^{2} = 0.003 \times 10^{3}$.
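The arithmetic behind this bound can be checked numerically. The sketch below is an illustration with hypothetical variable names; it assumes that "correct to 3S" means each operand is in error by at most half a unit in its last digit, and that rounding is half-up.

```python
from decimal import Decimal, Context, ROUND_HALF_UP

# Assumed model: three significant decimal digits, ties rounded away from zero.
FL3 = Context(prec=3, rounding=ROUND_HALF_UP)

x, y = Decimal("2.77E2"), Decimal("7.55E2")

exact_sum = x + y                 # 10.32 x 10^2, no rounding needed at 28 digits
rounded_sum = FL3.add(x, y)       # 1.03 x 10^3
generated = abs(exact_sum - rounded_sum)

# Each operand may be off by half a unit in its third digit: 0.005 x 10^2.
operand_bound = Decimal("0.005E2")
propagated_bound = 2 * operand_bound   # errors add in the worst case for a sum

accumulated_bound = generated + propagated_bound
print(generated, propagated_bound, accumulated_bound)
```

The three printed values correspond to $0.002 \times 10^{3}$, $0.01 \times 10^{2}$, and $0.003 \times 10^{3}$ respectively.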

### Consequences

The peculiarities of floating point arithmetic lead to some unexpected and unfortunate consequences, including the following:

1. Addition or subtraction of a small (but nonzero) number may have no effect, for example

$$
5.18 \times 10^{2} + 4.37 \times 10^{-1} = 5.18 \times 10^{2} + 0.00437 \times 10^{2} = 5.18437 \times 10^{2} \to 5.18 \times 10^{2},
$$

whence the additive identity is not unique.

2. Frequently, the result of $a \times (1/a)$ is not 1; for example, if $a = 3.00 \times 10^{0}$, then

$1/a$ is $3.33 \times 10^{-1}$

and

$a \times (1/a)$ is $9.99 \times 10^{-1}$,

whence the multiplicative inverse may not exist.

3. The result of $(a + b) + c$ is not always the same as the result of $a + (b + c)$; for example, if

$$
a = 6.31 \times 10^{1}, \quad b = 4.24 \times 10^{0}, \quad c = 2.47 \times 10^{-1},
$$

then

$$
(a + b) + c = (6.31 \times 10^{1} + 0.424 \times 10^{1}) + 2.47 \times 10^{-1} \to 6.73 \times 10^{1} + 0.0247 \times 10^{1} \to 6.75 \times 10^{1},
$$

whereas

$$
a + (b + c) = 6.31 \times 10^{1} + (4.24 \times 10^{0} + 0.247 \times 10^{0}) \to 6.31 \times 10^{1} + 4.49 \times 10^{0} = 6.31 \times 10^{1} + 0.449 \times 10^{1} \to 6.76 \times 10^{1},
$$

whence the associative law for addition does not always apply.

Examples involving adding many numbers of varying size indicate that adding in order of increasing magnitude is preferable to adding in the reverse order.

4. Subtracting a number from another nearly equal number may result in loss of significance or cancellation errors. In order to illustrate this loss of accuracy, suppose that we evaluate $f(x) = 1 - \cos x$ for $x = 0.05$, using three-digit, decimal, normalized, floating point arithmetic with rounding. Then

$$
1 - \cos(0.05) = 1 - 0.99875\ldots \to 1.00 \times 10^{0} - 0.999 \times 10^{0} \to 1.00 \times 10^{-3}.
$$

Although the value of 1 is exact and $\cos(0.05)$ is correct to 3S when expressed as a three-digit floating point number, their computed difference is correct to only 1S! (The two zeros after the decimal point in $1.00 \times 10^{-3}$ pad the number.)

The approximation $0.999 \approx \cos(0.05)$ has a relative error of about $2.5 \times 10^{-4}$. By comparison, the relative error of the approximation $1.00 \times 10^{-3} \approx 1 - \cos(0.05)$ is about 0.2, i.e., it is much larger. Thus, subtraction of two nearly equal numbers should be avoided whenever possible.

In the case of $f(x) = 1 - \cos x$, one can avoid loss of significant digits by writing

$$
1 - \cos x = 2\sin^{2}(x/2).
$$

This formula is more suitable for calculations when $x$ is close to 0. It can be verified that the more accurate approximation of $1.25 \times 10^{-3}$ is obtained for $1 - \cos(0.05)$ when three-digit floating point arithmetic is used.
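The four consequences listed above can be reproduced in a few lines. This is a sketch rather than a definitive implementation: `FL3` and `fl` are hypothetical names, rounding is assumed to be half-up, `math.cos` and `math.sin` are evaluated in binary double precision before being rounded to three decimal digits, and the rearrangement used for item 4 is the identity $1 - \cos x = 2\sin^{2}(x/2)$.

```python
import math
from decimal import Decimal, Context, ROUND_HALF_UP

# Assumed model: three significant decimal digits, ties rounded away from zero.
FL3 = Context(prec=3, rounding=ROUND_HALF_UP)

def fl(value):
    """Round a float or a decimal literal to three significant digits."""
    return FL3.create_decimal(str(value))

# 1. Adding a small nonzero number may have no effect.
absorbed = FL3.add(fl("5.18E2"), fl("4.37E-1"))

# 2. a * (1/a) need not equal 1.
a = fl("3.00")
product = FL3.multiply(a, FL3.divide(Decimal(1), a))

# 3. Addition is not associative.
p, q, r = fl("6.31E1"), fl("4.24E0"), fl("2.47E-1")
left = FL3.add(FL3.add(p, q), r)     # (p + q) + r
right = FL3.add(p, FL3.add(q, r))    # p + (q + r)

# 4. Cancellation in 1 - cos x, and the rearranged form 2 sin^2(x/2).
x = 0.05
naive = FL3.subtract(Decimal(1), fl(math.cos(x)))
s = fl(math.sin(x / 2))
stable = FL3.multiply(Decimal(2), FL3.multiply(s, s))

print(absorbed, product, left, right)

# Double-precision reference value, computed without the cancelling subtraction.
reference = 2 * math.sin(x / 2) ** 2
for name, value in (("naive", naive), ("stable", stable)):
    rel = abs(float(value) - reference) / reference
    print(name, value, f"relative error {rel:.1e}")
```

Under these assumptions the naive difference has a relative error of about 0.2, while the rearranged form has a relative error of about $2 \times 10^{-4}$.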

### Checkpoint

1. Why is it sometimes necessary to shift the mantissa and adjust the exponent of a floating point number?
2. Does floating point arithmetic obey the usual laws of arithmetic?
3. Why should the subtraction of two nearly equal numbers be avoided?
### Exercises

1. Evaluate the following expressions, using three-digit decimal normalized floating point arithmetic with rounding:
   1. $6.19 \times 10^{2} + 5.82 \times 10^{2}$,
   2. $6.19 \times 10^{2} + 3.61 \times 10^{1}$,
   3. $6.19 \times 10^{2} - 5.82 \times 10^{2}$,
   4. $6.19 \times 10^{2} - 3.61 \times 10^{1}$,
   5. $(3.60 \times 10^{3}) \times (1.01 \times 10^{-1})$,
   6. $(-7.50 \times 10^{-1}) \times (-4.44 \times 10^{1})$,
   7. $(6.45 \times 10^{2}) / (5.16 \times 10^{-1})$,
   8. $(-2.86 \times 10^{-2}) / (3.29 \times 10^{3})$.
2. Estimate the accumulated errors in the results of Exercise 1, assuming that all values are correct to 3S.
3. Evaluate the following expressions, using four-digit decimal normalized floating point arithmetic with rounding, then recalculate them, carrying all decimal places, and estimate the propagated error.
   1. Given $a = 6.842 \times 10^{-1}$, $b = 5.685 \times 10^{1}$, $c = 5.641 \times 10^{1}$, find $a(b - c)$ and $ab - ac$.
   2. Given $a = 9.812 \times 10^{1}$, $b = 4.631 \times 10^{-1}$, $c = 8.340 \times 10^{-1}$, find $(a + b) + c$ and $a + (b + c)$.
4. Use four-digit decimal normalized floating point arithmetic with rounding to calculate $f(x) = \tan x - \sin x$ for $x = 0.1$.

   Since

   $$
   \tan x - \sin x = \tan x \,(1 - \cos x) = \tan x \,\bigl(2\sin^{2}(x/2)\bigr),
   $$

   $f(x)$ may be written as $f(x) = 2 \tan x \sin^{2}(x/2)$. Repeat the calculation using this alternative expression. Which of the two values is more accurate?