Floating Point Numbers

Floating Point Number Systems

Real Numbers

Real numbers in general require infinitely many digits, so they cannot be stored exactly in a finite number of bits.

Standard Solution (Partial)

Floating point systems

  • An approximate representation of real numbers using a finite number of bits

Normalized Form of a Real Number

After expressing the real number in the desired base $\beta$, we multiply by a power of the base, $\beta^p$, to shift it into a normalized form:

$0.d_1 d_2 d_3 d_4 \ldots \times \beta^p$

where each digit $d_i$ satisfies $0 \le d_i \le \beta - 1$, and $d_1 \neq 0$ (normalization).

Summary

$\pm\, 0.d_1 d_2 d_3 d_4 \ldots d_t \times \beta^p$

for $L \le p \le U$ and $d_1 \neq 0$.

The four integer parameters $\{\beta, t, L, U\}$ characterize a specific floating point system $F$, where $\beta$ is the base, $t$ is the number of mantissa digits, and $L$ and $U$ are the lower and upper bounds on the exponent $p$.
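As a concrete illustration, here is a minimal Python sketch that enumerates every positive normalized number in a toy system (the parameters $\{\beta=2, t=3, L=-1, U=1\}$ are chosen purely for illustration):

```python
import itertools

# Toy floating point system; parameters chosen for illustration only.
beta, t, L, U = 2, 3, -1, 1

values = set()
for p in range(L, U + 1):                 # every allowed exponent
    for d1 in range(1, beta):             # leading digit must be nonzero
        for rest in itertools.product(range(beta), repeat=t - 1):
            # mantissa = 0.d1 d2 ... dt in base beta
            mantissa = d1 / beta + sum(d / beta**(i + 2) for i, d in enumerate(rest))
            values.add(mantissa * beta**p)

# 4 mantissas x 3 exponents = 12 positive values (negatives mirror them)
print(sorted(values))
```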

Overflow / Underflow Errors

Overflow occurs when a result's exponent would exceed $U$; underflow occurs when it would fall below $L$.

Floating Point Standards

IEEE Single Precision (32 bits): $\{\beta = 2,\ t = 24,\ L = -126,\ U = 127\}$
IEEE Double Precision (64 bits): $\{\beta = 2,\ t = 53,\ L = -1022,\ U = 1023\}$
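In Python, whose floats are IEEE doubles, these parameters can be inspected directly; a quick standard-library check:

```python
import sys

info = sys.float_info
print(info.radix)     # 2  -> beta
print(info.mant_dig)  # 53 -> t
print(info.min_exp)   # -1021 (C's convention places the radix point one digit off from L above)
print(info.max_exp)   # 1024  (likewise offset from U)
print(info.epsilon)   # 2.220446049250313e-16 -> machine epsilon (see below)
```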

Floating Point Density

Floating point numbers are not evenly spaced: the spacing is constant within one exponent range, but the gap between adjacent numbers grows by a factor of $\beta$ each time the exponent increases.
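This can be observed directly with math.ulp (Python 3.9+), which returns the gap between a float and the next larger one:

```python
import math

# The spacing between adjacent doubles grows with magnitude.
print(math.ulp(1.0))   # 2.220446049250313e-16
print(math.ulp(1e8))   # ~1.49e-08
print(math.ulp(1e16))  # 2.0 -- whole integers are no longer all representable
```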

Rounding vs. Truncation

Round-to-nearest

Rounds to the closest available number in $F$.

Truncation

Rounds to the next number in $F$ toward 0.
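A small sketch contrasting the two modes on decimal significant digits (round_sig and chop are illustrative helpers, not library functions):

```python
import math

def round_sig(x, t):
    """Round x to t significant decimal digits (round-to-nearest)."""
    if x == 0:
        return 0.0
    e = math.floor(math.log10(abs(x))) + 1  # exponent of the leading digit
    scale = 10.0 ** (t - e)
    return round(x * scale) / scale

def chop(x, t):
    """Truncate x to t significant decimal digits (round toward 0)."""
    if x == 0:
        return 0.0
    e = math.floor(math.log10(abs(x))) + 1
    scale = 10.0 ** (t - e)
    return math.trunc(x * scale) / scale

print(round_sig(2 / 3, 3))  # 0.667 -- closest 3-digit number
print(chop(2 / 3, 3))       # 0.666 -- next 3-digit number toward 0
```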

Measuring Error

The error is the difference between the exact value $x_{exact}$ and its approximation $x_{approx}$.
We distinguish between absolute error and relative error.

$E_{abs} = |x_{exact} - x_{approx}| \qquad E_{rel} = \dfrac{|x_{exact} - x_{approx}|}{|x_{exact}|}$
Relative error is often more useful:

  • Is independent of the magnitudes of the numbers involved
  • Relates to the number of significant digits in the result

A result is correct to roughly $s$ significant digits if $E_{rel} \approx 10^{-s}$, or more precisely if

$0.5 \times 10^{-s} \le E_{rel} \le 5 \times 10^{-s}$
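A quick sketch computing both error measures and the implied digit count (the values are illustrative):

```python
import math

x_exact = math.pi
x_approx = 3.14159

e_abs = abs(x_exact - x_approx)
e_rel = e_abs / abs(x_exact)
print(e_abs)               # ~2.65e-06
print(e_rel)               # ~8.45e-07
print(-math.log10(e_rel))  # ~6.07 -> correct to roughly 6 digits
```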

Floating Point vs. Reals

For an FP system $F$, the relative error between $x \in \mathbb{R}$ and its FP approximation $fl(x)$ has a bound $E$:

$E_{rel} = \dfrac{|x - fl(x)|}{|x|} \le E \quad \Longrightarrow \quad (1-E)\,|x| \le |fl(x)| \le (1+E)\,|x|$

Machine Epsilon

The maximum relative error, $E$, for an FP system is called machine epsilon or unit round-off error.
It is defined as the smallest value such that $fl(1 + E) > 1$ under the given floating point system.
We have the rule $fl(x) = x(1 + \delta)$ for some $|\delta| \le E$.
For an FP system $(\beta, t, L, U)$, $E = \beta^{1-t}$ with truncation and $E = \tfrac{1}{2}\beta^{1-t}$ with round-to-nearest.
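A common empirical sketch: halve a candidate until $1 + \varepsilon$ becomes indistinguishable from 1, then compare with the library's value:

```python
import sys

eps = 1.0
while 1.0 + eps / 2 > 1.0:  # stop once eps/2 no longer changes the sum
    eps /= 2

print(eps)                     # 2.220446049250313e-16 (= 2**-52)
print(sys.float_info.epsilon)  # the same value for IEEE double
```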

Arithmetic with Floating Point

What guarantees are there on an FP arithmetic operation?

The IEEE standard requires that for $w, x \in F$:

$w \oplus x = fl(w + x) = (w + x)(1 + \delta)$ for some $|\delta| \le E$
This rule applies only to individual FP operations, not sequences of them!

$(a \oplus b) \oplus c \;\neq\; a \oplus (b \oplus c) \;\neq\; fl(a + b + c)$

Result is order-dependent; associativity is broken!
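Both effects are easy to reproduce in Python:

```python
# Associativity fails: grouping changes the rounded result.
print((0.1 + 0.2) + 0.3)  # 0.6000000000000001
print(0.1 + (0.2 + 0.3))  # 0.6

# Ordering can destroy information entirely.
print((1e16 + 1.0) - 1e16)  # 0.0 -- the 1.0 is absorbed and lost
print((1e16 - 1e16) + 1.0)  # 1.0 -- reordering recovers it
```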

Round-off Error Analysis

What can we say about $(a \oplus b) \oplus c$?

Let us use $E$ to give a bound on the relative error:

$$
\begin{aligned}
E_{rel} &= \frac{|(a \oplus b) \oplus c - (a+b+c)|}{|a+b+c|}
         = \frac{\bigl|[(a+b)(1+\delta_1)] \oplus c - (a+b+c)\bigr|}{|a+b+c|} \\
        &= \frac{\bigl|[a+b+(a+b)\delta_1+c](1+\delta_2) - (a+b+c)\bigr|}{|a+b+c|}
         = \frac{\bigl|(a+b)\delta_1 + \bigl(a+b+c+(a+b)\delta_1\bigr)\delta_2\bigr|}{|a+b+c|} \\
        &= \frac{\bigl|(a+b)\delta_1 + (a+b+c)\delta_2 + (a+b)\delta_1\delta_2\bigr|}{|a+b+c|}
         \le \frac{|(a+b)\delta_1| + |(a+b+c)\delta_2| + |(a+b)\delta_1\delta_2|}{|a+b+c|} \\
        &\le \frac{|a+b|\,E + |a+b+c|\,E + |a+b|\,E^2}{|a+b+c|}
         \le \frac{|a|+|b|+|c|}{|a+b+c|}\,(2E + E^2)
\end{aligned}
$$
This analysis describes only the worst case error magnification. Actual error could be much less.

Behaviour of Error Bounds

We showed that $(a \oplus b) \oplus c$ satisfies:

$E_{rel} \le \dfrac{|a|+|b|+|c|}{|a+b+c|}\,(2E + E^2)$

The term $\frac{|a|+|b|+|c|}{|a+b+c|}$ describes how much $E$ could be scaled by to give the total error when summing three numbers $a, b, c$.
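A sketch comparing the bound's scaling factor with the actual error on a sum chosen (for illustration) to cancel heavily:

```python
E = 2.0 ** -53  # unit round-off for IEEE double with round-to-nearest

a, b, c = 1.0, 1e16, -1e16
exact = 1.0               # the true value of a + b + c
computed = (a + b) + c    # left-to-right FP evaluation gives 0.0

factor = (abs(a) + abs(b) + abs(c)) / abs(exact)
print(factor * (2 * E + E * E))            # bound: ~4.44 (up to 444% relative error)
print(abs(computed - exact) / abs(exact))  # actual: 1.0 -- within the bound
```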

Cancellation Errors

When is $\frac{|a|+|b|+|c|}{|a+b+c|}$ large? Precisely when $|a+b+c| \ll |a|+|b|+|c|$, i.e., when the summands nearly cancel.

Expressions without cancellation often have better bounds on error

Catastrophic cancellation occurs when subtracting numbers of approximately equal magnitude whose values already contain error: the leading digits cancel, leaving the erroneous trailing digits to dominate the result.
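A classic sketch: evaluating $1 - \cos x$ directly for small $x$ versus the algebraically identical, cancellation-free form $2\sin^2(x/2)$:

```python
import math

x = 1e-8
# cos(x) rounds to exactly 1.0, so the subtraction cancels catastrophically.
print(1.0 - math.cos(x))         # 0.0 -- all information lost
# The rewritten form never subtracts nearly equal numbers.
print(2.0 * math.sin(x / 2)**2)  # ~5e-17, the correct value (~x**2 / 2)
```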

Stability of Algorithms

If any initial error in the data is magnified by an algorithm, the algorithm is considered numerically unstable.

Conditioning of Problems

For a problem $P$ with input $I$ and output $O$: if a small change to the input, $\Delta I$, gives only a small change in the output, $\Delta O$, then $P$ is well-conditioned. Otherwise, $P$ is ill-conditioned.
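For instance, subtracting nearly equal numbers is an ill-conditioned problem regardless of the algorithm used; a sketch with illustrative inputs:

```python
# Problem P: compute O = a - b, with a and b nearly equal.
a, b = 1.000000001, 1.0
output = a - b  # ~1e-9

# Perturb the input a by a tiny relative amount.
delta = 1e-12
output_perturbed = a * (1 + delta) - b

rel_change = abs(output_perturbed - output) / abs(output)
print(rel_change)  # ~1e-3: a 1e-12 input change was magnified ~10^9 times
```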

Conditioning vs. Stability

Conditioning is a property of the problem; stability is a property of the algorithm:

  1. An algorithm can be unstable even for a well-conditioned problem!
  2. An ill-conditioned problem limits how well we can expect any algorithm to perform

Stability Analysis of an Algorithm

We can analyse whether errors accumulate or shrink as an algorithm proceeds to determine its stability.
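As a final sketch, one such comparison in practice: naive left-to-right summation lets round-off accumulate, while math.fsum tracks and compensates for it:

```python
import math

values = [0.1] * 10_000_000  # the stored values sum to (almost exactly) 1e6

naive = 0.0
for v in values:
    naive += v  # each addition commits a fresh round-off error

print(naive)              # slightly off from 1000000.0 -- errors accumulated
print(math.fsum(values))  # 1000000.0 -- correctly rounded sum
```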