
CONTENTS

Preface to the second edition xi

Preface to the first edition xiii

Prologue 1
1 Historical 1
2 Numerical Analysis today 2
3 This book 3

ERRORS

Step 1 Sources of error 4


1 Example 4

Step 2 Approximation to numbers 7


1 Number representation 7
2 Round-off error 8
3 Truncation error 8
4 Mistakes 8
5 Examples 8

Step 3 Error propagation and generation 10


1 Absolute error 10
2 Relative error 10
3 Error propagation 11
4 Error generation 11
5 Example 12

Step 4 Floating point arithmetic 14


1 Addition and subtraction 14
2 Multiplication 14
3 Division 15
4 Expressions 15
5 Generated error 15
6 Consequences 15

Step 5 Approximation to functions 18


1 The Taylor series 18
2 Polynomial approximation 20
3 Other series expansions 20
4 Recursive procedures 20

NONLINEAR EQUATIONS

Step 6 Nonlinear algebraic and transcendental equations 23


1 A transcendental equation 23
2 Locating roots 24

Step 7 The bisection method 27


1 Procedure 27
2 Effectiveness 28
3 Example 28

Step 8 Method of false position 30


1 Procedure 30
2 Effectiveness and the secant method 31
3 Example 32

Step 9 The method of simple iteration 34


1 Procedure 34
2 Example 34
3 Convergence 35

Step 10 The Newton-Raphson iterative method 37


1 Procedure 37
2 Example 38
3 Convergence 39
4 Speed of convergence 40
5 The square root 41

SYSTEMS OF LINEAR EQUATIONS

Step 11 Solution by elimination 42


1 Notation and definitions 42
2 The existence of solutions 43
3 Gaussian elimination method 44
4 The transformation operations 45
5 General treatment of the elimination process 45
6 Numerical example 48

Step 12 Errors and ill-conditioning 51


1 Errors in the coefficients and constants 51
2 Round-off errors and numbers of operations 52
3 Partial pivoting 52
4 Ill-conditioning 53

Step 13 The Gauss-Seidel iterative method 56


1 Iterative methods 56
2 The Gauss-Seidel method 56
3 Convergence 57

Step 14 Matrix inversion* 59


1 The inverse matrix 59
2 Method for inverting a matrix 59
3 Solution of linear systems using the inverse matrix 61

Step 15 Use of LU decomposition* 64


1 Procedure 64
2 Example 65
3 Effecting an LU decomposition 66

Step 16 Testing for ill-conditioning* 69


1 Norms 69
2 Testing for ill-conditioning 70

THE EIGENVALUE PROBLEM

Step 17 The power method 73


1 Power method 74
2 Example 74
3 Variants 75
4 Other aspects 77

FINITE DIFFERENCES

Step 18 Tables 79
1 Tables of values 79
2 Finite differences 80
3 Influence of round-off errors 80

Step 19 Forward, backward, and central difference notations 83


1 The shift operator E 83
2 The forward difference operator Δ 83
3 The backward difference operator ∇ 84

4 The central difference operator δ 84


5 Difference display 85

Step 20 Polynomials 88
1 Finite differences of a polynomial 88
2 Example 89
3 Approximation of a function by a polynomial 89

INTERPOLATION

Step 21 Linear and quadratic interpolation 92


1 Linear interpolation 92
2 Quadratic interpolation 94

Step 22 Newton interpolation formulae 96


1 Newton’s forward difference formula 96
2 Newton’s backward difference formula 96
3 Use of Newton’s interpolation formulae 97
4 Uniqueness of the interpolating polynomial 98
5 Analogy with Taylor series 99

Step 23 Lagrange interpolation formula 101


1 Procedure 101
2 Example 102
3 Notes of caution 103

Step 24 Divided differences* 104


1 Divided differences 104
2 Newton’s divided difference formula 105
3 Example 105
4 Error in interpolating polynomial 106
5 Aitken’s method 107

Step 25 Inverse interpolation* 110


1 Linear inverse interpolation 110
2 Iterative inverse interpolation 110
3 Divided differences 111

CURVE FITTING

Step 26 Least squares 114


1 The problem illustrated 114
2 General approach to the problem 115
3 Errors ‘as small as possible’ 116

4 The least squares method and normal equations 116


5 Example 117

Step 27 Least squares and linear equations* 122


1 Pseudo-inverse 122
2 Normal equations 123
3 QR factorization 124
4 The QR factorization process 126

Step 28 Splines* 129


1 Construction of cubic splines 129
2 Examples 132

NUMERICAL DIFFERENTIATION

Step 29 Finite differences 135


1 Procedure 135
2 Error in numerical differentiation 136
3 Example 137

NUMERICAL INTEGRATION

Step 30 The trapezoidal rule 139


1 The trapezoidal rule 139
2 Accuracy 140
3 Example 141

Step 31 Simpson’s rule 143


1 Simpson’s rule 143
2 Accuracy 144
3 Example 145

Step 32 Gaussian integration formulae 146


1 Gauss two-point integration formula 146
2 Other Gauss formulae 147
3 Application of Gaussian quadrature 148

ORDINARY DIFFERENTIAL EQUATIONS

Step 33 Single-step methods 149


1 Taylor series 149
2 Runge-Kutta methods 150
3 Example 151

Step 34 Multistep methods 153


1 Introduction 153
2 Stability 154

Step 35 Higher order differential equations* 156


1 Systems of first-order initial value problems 156
2 Numerical methods for first-order systems 157
3 Numerical example 158

Applied Exercises 160

Appendix: Pseudo-code 163

Answers to the Exercises 173

Bibliography 216

Index 217
PREFACE TO THE SECOND EDITION

First Steps in Numerical Analysis, originally published in 1978, is now in its twelfth
impression. It has been widely used in schools, polytechnics, and universities
throughout the world. However, we decided that after a life of seventeen years in
the classroom and lecture theatre, the contents of the book should be reviewed.
Feedback from many users, both teachers and students, could be incorporated; and
the development of the subject suggested that some new topics might be included.
This Second Edition of the book is the outcome of our consideration of these
matters.
The changes we have made are not very extensive, which reflects our view that
the syllabus for a first course in Numerical Analysis must continue to include
most of the basic topics in the First Edition. However, the result of rapid changes
in computer technology is that some aspects are obviously less important than
they were, and other topics have become more important. We decided that less
should be said about finite differences, for example, but more should be said about
systems of linear equations and matrices. New material has been added on curve
fitting (for example, use of splines), and more has been given on the solution of
differential equations. The total number of Steps has increased from 31 to 35.
For the benefit of both teachers and students, additional exercises have been set
at the end of many of the Steps, and brief answers again supplied. Also, a set
of Applied Exercises has been included, to challenge students to apply numerical
methods in the context of ‘real world’ applications. To make it easier for users
to implement the given algorithms in a computer program, the flowcharts in the
Appendix of the First Edition have been replaced by pseudo-code. The method of
organizing the material into STEPS (of a length suitable for presentation in one or
two hours) has been retained, for this has been a popular feature of the book.
We hope that these changes and additions, together with the new typesetting
used, will be found acceptable, enhancing and attractive; and that the book will
continue to be widely used. Many of the ideas presented should be accessible to
students in mathematics at the level of Seventh Form in New Zealand, Year 12 in
Australia, or GCE A level in the United Kingdom. The addition of more (optional)
starred Steps in this Edition makes this book also suitable for first and second year
introductory Numerical Analysis courses in polytechnics and universities.
R. J. Hosking
S. Joe
D. C. Joyce
J. C. Turner
1995
PREFACE TO THE FIRST EDITION

As its title suggests, this book is intended to provide an introduction to elementary


concepts and methods of Numerical Analysis for students meeting the subject for
the first time. In particular, the ideas presented should be accessible at the level of
Seventh Form Applied Mathematics in New Zealand or at Advanced Level G.C.E.
in the United Kingdom. We expect this book will also be found useful for many
courses in polytechnics and universities.
For ease of teaching and learning, the material is divided into short ‘Steps’,
most of which would be included in any first course. A discussion of the content
and plan of the book is given in section 3 of the Prologue.
R. J. Hosking
D. C. Joyce
J. C. Turner
1978
PROLOGUE

1 Historical
Although some may regard Numerical Analysis as a subject of recent origin, this
in fact is not so. In the first place, it is concerned with the provision of results in the
form of numbers, which no doubt were in use by very early man. More recently,
the Babylonian and ancient Egyptian cultures were noteworthy for numerical
expertise, particularly in association with astronomy and civil engineering. There
is a Babylonian tablet dated approximately 2000 B.C. giving the squares of the
integers 1–60, and another which records the eclipses back to about 750 B.C. The
Egyptians dealt with fractions, and even invented the method of false position for
the solution of nonlinear algebraic equations (see Step 8).
It is probably unnecessary to point out that the Greeks produced a number of
outstanding mathematicians, many of whom provided important numerical results.
In about 220 B.C. Archimedes gave the result

    3 10/71 < π < 3 1/7

The iterative procedure for √a involving (1/2)(xn + a/xn), usually attributed to
Newton (see Step 10), was in fact used by Heron the elder in about 100 B.C. The
Pythagoreans considered the numerical summation of series, and Diophantus in
about 250 A.D. gave a process for the solution of quadratic equations.
Subsequently, progress in numerical work occurred in the Middle East. Apart
from the development of the modern arithmetical notation commonly referred to
as Arabic, tables of the trigonometric functions sine and tangent were constructed
by the tenth century. Further east, in India and China, there was parallel (although
not altogether separate) mathematical evolution.
In the West, the Renaissance and scientific revolution involved a rapid expansion
of mathematical knowledge, including the field of Numerical Analysis. Such
great names of mathematics as Newton, Euler, Lagrange, Gauss, and Bessel
are associated with modern methods of Numerical Analysis, and testify to the
widespread interest in the subject.
In the seventeenth century, Napier produced a table of logarithms, Oughtred
invented the slide rule, and Pascal and Leibniz pioneered the development of
calculating machines (although these were not produced in quantity until the
nineteenth century). The provision of such machines brought a revolution in
numerical work, a revolution greatly accelerated since the late 1940’s by the
development of modern computers.

The extent of this revolution certainly becomes clearer when we consider the
great advances in computing power in the past fifty years. The fastest single-
processor supercomputers currently available are hundreds of thousands of times
faster than the earliest computers. The micro-computers that students have in
their place of study are many times faster (and smaller) than the mini-computers
that were available when the first edition of this book came out. Even hand-
held scientific calculators can perform calculations that were once the domain of
big mainframe computers. New procedures have been and are being developed;
computations and data analyses which could not have been contemplated even
as a life’s work a few decades ago are now solved in quite acceptable times.
There is now quite widespread use of vector machines for large-scale scientific
computation, and increasing use is being made of parallel computers with two or
more processors (perhaps even thousands) over which a computing task can be
spread. The equipment at our disposal is the dominant new feature in the field of
Numerical Analysis.

2 Numerical Analysis today


Theoretical science involves the construction of models to interpret experimental
results, and to predict results for future experimental check. Since these results are
often numerical, the applied mathematician attempts to construct a mathematical
model of a complex situation arising in some field such as physics or economics by
describing the important features in mathematical terms. The art of good applied
mathematics is to retain only those features essential for useful deductions, for
otherwise there is usually unnecessary extra work.
The abstract nature of such a mathematical model can be a real advantage, for
it may well be similar to others, previously studied in quite different contexts
but whose solutions are known. Occasionally, there may be a formal analytical
solution procedure available, but even then it may yield expressions so unwieldy
that any subsequent necessary interpretation of the mathematical results is difficult.
In many cases, a numerical procedure leading to meaningful numerical results is
available and preferable. Numerical Analysis remains a branch of mathematics in
which such numerical procedures are studied, with emphasis today on techniques
for use on computers.
There are various main problem areas in Numerical Analysis, including find-
ing roots of nonlinear equations, solving systems of linear equations, eigenvalue
problems, interpolation, approximation of functions, evaluating integrals, solv-
ing differential equations, and optimization. Equations involving transcendental
functions (for example, logarithm or sine) often arise in areas such as science
or engineering, and are usually solved numerically. Systems of linear equations
are common in both science and social science (for example, the rotation of a
set of coordinate axes or the movement of goods in an economy). The solu-
tion of differential equations is a major requirement in various fields, such as
mathematical physics or environmental studies. Since many of these differential

equations are nonlinear and therefore not normally amenable to analytic solution,
their numerical solution is important.
In an introductory text, of course, it is not possible to deal in depth with other
than a few basic topics. Nevertheless, we hope by these few remarks to encourage
students not only to view their progress through this book as worthwhile, but also
to venture beyond it with enthusiasm and success.

3 This book
Each main topic treated in the book has been divided into a number of Steps. The
first five are devoted to the question of errors arising in numerical work. We believe
that a thorough understanding of errors is necessary for a proper appreciation of
the art of using numerical methods. The succeeding Steps deal with concepts
and methods used in the problem areas of nonlinear equations, systems of linear
equations, the eigenvalue problem, interpolation, curve fitting, differentiation,
integration, and ordinary differential equations.
Most of the unstarred Steps in the book will be included in any first course.
The starred Steps (‘side-steps’) include material which the authors consider to
be extra, but not necessarily extensive, to a first course. The material in each
Step is intended to be an increment of convenient size, perhaps dependent on the
understanding of earlier (but not later) unstarred Steps. Ideally, the consideration
of each Step should involve at least the Exercises, carried out under the supervision
of the teacher where necessary. We emphasize that Numerical Analysis demands
considerable practical experience, and further exercises could also be valuable.
Some additional exercises of an applied nature are given towards the end of the
book (see pages 160–162).
Within each Step, the concepts and method to be learned are presented first,
followed by illustrative examples. Students are then invited to test their immediate
understanding of the text by answering two or three Checkpoint questions. These
concentrate on salient points made in the Step, and induce the student to think
about and re-read the text just covered; they may also be useful for revision
purposes. Brief answers are provided at the end of the book for the Exercises set
in each Step.
After much consideration, the authors decided not to include computer programs
for the various algorithms introduced in the Steps. However, they have provided
pseudo-code in an Appendix. In our experience, students do benefit if they study
the pseudo-code of a method at the same time as they learn it in a Step. If they
are familiar with a programming language they should be encouraged to convert
at least some of the pseudo-code into computer programs, and apply them to the
set Exercises.
To encourage further reading, reference is made at various places in the text to
books listed in the Bibliography on page 216.
STEP 1

ERRORS 1
Sources of error

The main sources of error in obtaining numerical solutions to mathematical prob-


lems are:
(a) the model – its construction usually involves simplifications and omissions;
(b) the data – there may be errors in measuring or estimating values;
(c) the numerical method – generally based on some approximation;
(d) the representation of numbers – for example, π cannot be represented exactly
by a finite number of digits;
(e) the arithmetic – frequently errors are introduced in carrying out operations
such as addition (+) and multiplication (×).
We can pass responsibility for (a) onto the applied mathematician, but the others
are not so easy to dismiss. Thus, if the errors in the data are known to lie within
certain bounds, we should be able to estimate the consequential errors in the
results. Similarly, given the characteristics of the computer, we should be able
to account for the effects of (d) and (e). As for (c), when a numerical method is
devised it is customary to investigate its error properties.

1 Example
To illustrate the ways in which the above errors arise, let us take the example of
the simple pendulum (see Figure 1). If various physical assumptions are made,
including that air resistance and friction at the pivot are negligible, we obtain the
simple (nonlinear) differential equation

    mℓ d²θ/dt² = −mg sin θ
In introductory mechanics courses† the customary next step is to use the ap-
proximation sin θ ≈ θ (assuming that θ is small) to produce the even simpler
(linear) differential equation

    d²θ/dt² = −ω²θ,  where ω² = g/ℓ
† In practice one could reduce the type (a) error by using a numerical method (see Step 35) to
solve the more realistic (nonlinear) differential equation
    d²θ/dt² = −ω² sin θ

..
........ ..
........ ...
.........
.
θ
...
. ..
........
........ ........ ..
` .
...
.. .
.....
...
......... ........
.................
..
.... .
..
...
.. ...
.......
........ ..
..
...
.......... ..
..
...
...
.... ..
..
...
..... ..
...
...
.... ..
..
...
..... ..
...
...
.... ..
.
• .......... ..
... ..
... ..
... ..
... ..
... ..
...... ..
.. ..
..
..
mg ..
..
..
..

FIGURE 1. Simple pendulum.

This has the analytical solution

θ(t) = A sin ωt + B cos ωt

where A and B are suitable constants.


We can then deduce that the period of the simple pendulum (that is, the smallest
positive value of T such that θ(t + T) = θ(t)) is

    T = 2π/ω = 2π√(ℓ/g)

Up to this point we have encountered only errors of type (a); the other errors
are introduced when we try to obtain a numerical value for T in a particular case.
Thus both ℓ and g will be subject to measurement errors; π must be represented as
a finite decimal number, the square root must be computed (usually by an iterative
process) after dividing ℓ by g (which may involve a rounding error), and finally
the square root must be multiplied by 2π.
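To see where the representation and arithmetic errors enter in practice, here is a
minimal Python sketch (our illustration, not part of the original text; the variable
names are ours) of the calculation for the data of Exercise 1 below, ℓ = 75 cm and
g = 981 cm/s²:

    import math

    length = 75.0      # measured length ℓ in cm (subject to measurement error)
    g = 981.0          # measured g in cm/s² (subject to measurement error)

    # math.pi is a finite (double precision) approximation to π, and each
    # operation below may introduce a further small rounding error.
    ratio = length / g             # dividing ℓ by g
    root = math.sqrt(ratio)        # square root, computed iteratively by the library
    period = 2.0 * math.pi * root  # final multiplication by 2π

    print(period)                  # about 1.737 (seconds)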

Checkpoint

1. What sources of error are of concern to the numerical analyst?


2. Which types of error depend upon the computer used?

EXERCISES
When carrying out the following calculations, notice all the points at which errors
of one kind or another arise.
1. Calculate the period of a simple pendulum of length 75 cm, given that g is
981 cm/s².

2. The rate of flow of a liquid through a circular hole of diameter d is given by
   the formula
       R = C (πd²/4) √(2gH)
where C is a coefficient of discharge and H is the head of liquid causing
the flow. Calculate R for a head of 650 cm, given that d = 15 cm and the
coefficient of discharge is estimated to be 0.028.
STEP 2

ERRORS 2
Approximation to numbers

Although it may not seem so to the beginner, it is important to examine ways in


which numbers are represented.

1 Number representation
We humans normally represent a number in decimal (base 10) form, although
modern computers use binary (base 2) and also hexadecimal (base 16) forms. Nu-
merical calculations usually involve numbers that cannot be represented exactly
by a finite number of digits. For instance, the arithmetical operation of division
often gives a number which does not terminate; the decimal (base 10) representa-
tion of 2/3 is one example. Even a number such as 0.1 which terminates in decimal
form would not terminate if expressed in binary form. There are also the irrational
numbers such as the value of π, which do not terminate. In order to carry out
a numerical calculation involving such numbers, we are forced to approximate
them by a representation involving a finite number of significant digits (S ). For
practical reasons (for example, the size of the back of an envelope or the ‘storage’
available in a machine), this number is usually quite small. Typically, a ‘single
precision’ number on a computer has an accuracy of only about 6 or 7 decimal
digits (see below).
To five significant digits (5S), 2/3 is represented by 0.66667, π by 3.1416, and √2
by 1.4142. None of these is an exact representation, but all are correct to within
half a unit of the fifth significant digit. Numbers should normally be presented in
this sense, correct to the number of digits given.
If the numbers to be represented are very large or very small, it is convenient to
write them in floating point notation (for example, the speed of light 2.99792 × 10⁸
m/s, or the electronic charge 1.6022 × 10⁻¹⁹ coulomb). As indicated, we separate
the significant digits (the mantissa) from the power of ten (the exponent); the form
in which the exponent is chosen so that the magnitude of the mantissa is less than
10 but not less than 1 is referred to as scientific notation.
In 1985 the Institute of Electrical and Electronics Engineers published a stan-
dard for binary floating point arithmetic. This standard, known as the IEEE
Standard 754, has been widely adopted (it is very common on workstations used
for scientific computation). The standard specifies a format for ‘single precision’
numbers and a format for ‘double precision’ numbers. The single precision format
allows 32 binary digits (known as bits) for a floating point number with 23 of these
bits allocated for the mantissa. In the double precision format the values are 64
and 52 bits, respectively. On conversion from binary to decimal, it turns out that

any IEEE Standard 754 single precision number has an accuracy of about six or
seven decimal digits, and a double precision number an accuracy of about 15 or
16 decimal digits.
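The difference between the two precisions is easy to observe in practice. The
following Python sketch (our illustration, assuming the numpy library is available)
stores 1/3 in IEEE single and double precision and prints both; roughly the first
seven and the first sixteen decimal digits respectively agree with 0.333…:

    import numpy as np

    single = np.float32(1.0) / np.float32(3.0)   # IEEE 754 single precision (23-bit mantissa)
    double = 1.0 / 3.0                           # Python floats are IEEE 754 double precision

    print(f"{single:.20f}")   # 0.33333334326744079590  (about 7 correct digits)
    print(f"{double:.20f}")   # 0.33333333333333331483  (about 16 correct digits)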

2 Round-off error
The simplest way of reducing the number of significant digits in the representation
of a number is merely to ignore the unwanted digits. This procedure, known as
chopping, was used by many early computers. A more common and better
procedure is rounding, which involves adding 5 to the first unwanted digit, and
then chopping. For example, π chopped to four decimal places (4D ) is 3.1415, but
it is 3.1416 when rounded; the representation 3.1416 is correct to five significant
digits (5S ). The error involved in the reduction of the number of digits is called
round-off error. Since π is 3.14159 . . ., we could remark that chopping has
introduced much more round-off error than rounding.
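A minimal Python sketch of the two procedures for a positive number (our
illustration; the function names are not from the text) is:

    import math

    def chop(x, d):
        # ignore the unwanted digits beyond d decimal places
        return math.floor(x * 10**d) / 10**d

    def round_half_up(x, d):
        # add 5 to the first unwanted digit, then chop
        return math.floor(x * 10**d + 0.5) / 10**d

    print(chop(math.pi, 4))           # 3.1415
    print(round_half_up(math.pi, 4))  # 3.1416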

3 Truncation error
Numerical results are often obtained by truncating an infinite series or iterative
process (see Step 5). Whereas round-off error can be reduced by working to more
significant digits, truncation error can be reduced by retaining more terms in the
series or more steps in the iteration; this, of course, involves extra work (and
perhaps expense!).

4 Mistakes
In the language of Numerical Analysis, a mistake (or blunder) is not an error!
A mistake is due to fallibility (usually human, not machine). Mistakes may be
trivial, with little or no effect on the accuracy of the calculation, or they may be
so serious as to render the calculated results quite wrong. There are three things
which may help to eliminate mistakes:
(a) care;
(b) checks, avoiding repetition;
(c) knowledge of the common sources of mistakes.
Common mistakes include: transposing digits (for example, reading 6238 as
6328); misreading repeated digits (for example, reading 62238 as 62338); misread-
ing tables (for example, referring to a wrong line or a wrong column); incorrectly
positioning a decimal point; overlooking signs (especially near sign changes).

5 Examples
The following illustrate rounding to four decimal places (4D ):

4/3 → 1.3333; π/2 → 1.5708; 1/√2 → 0.7071

The following illustrate rounding to four significant digits (4S ):



4/3 → 1.333; π/2 → 1.571; 1/√2 → 0.7071

Checkpoint

1. What may limit the accuracy of a number in a calculation?


2. What is the convention adopted in rounding?
3. How can mistakes be avoided?

EXERCISES

1. What are the floating point representations of the following numbers:


12.345, 0.80059, 296.844, 0.00519?
2. For each of the following numbers:
34.78219, 3.478219, 0.3478219, 0.03478219,
(a) chop to three significant digits (3S ),
(b) chop to three decimal places (3D ),
(c) round to three significant digits (3S ),
(d) round to three decimal places (3D ).
3. For the number
       5/3 = 1.66666 . . . ,

determine the magnitude of the round-off error when it is represented by a


number obtained from the decimal form by:
(a) chopping to 3S,
(b) chopping to 3D,
(c) rounding to 3S,
(d) rounding to 3D.
STEP 3

ERRORS 3
Error propagation and generation

We have noted that a number is to be represented by a finite number of digits,


and hence often by an approximation. It is to be expected that the result of any
arithmetic procedure (any algorithm) involving a set of numbers will have an
implicit error relating to the error of the original numbers. We say that the initial
errors propagate through the computation. In addition, errors may be generated
at each step in the algorithm, and we may speak of the total cumulative error at
any step as the accumulated error.
Since we wish to produce results within some chosen limit of error, it is useful
to consider error propagation. Roughly speaking from experience, the propagated
error depends on the mathematical algorithm chosen, whereas the generated error
is more sensitive to the actual ordering of the computational steps. It is possible
to be more precise, as described below.

1 Absolute error
The absolute error is the absolute difference between the exact number x and the
approximate number x ∗ ; that is,
eabs = |x − x ∗ |
A number correct to n decimal places has
    eabs ≤ 0.5 × 10⁻ⁿ;
we expect that the absolute error involved in any approximate number is no more
than five units at the first neglected digit.

2 Relative error
The relative error is the ratio of the absolute error to the absolute exact number;
that is,
    erel = eabs/|x| ≤ eabs/(|x*| − eabs)
(Note that the upper bound follows from the triangle inequality; thus
    |x*| = |x + x* − x| ≤ |x| + |x* − x|
so that |x| ≥ |x*| − eabs.) If eabs ≪ |x*|, then
    erel ≈ eabs/|x*|

A decimal number correct to n significant digits has

    erel ≤ 5 × 10⁻ⁿ

3 Error propagation
Consider two numbers x = x ∗ + e1 , y = y ∗ + e2
(a) Under the operations addition or subtraction, we have

x ∓ y = x ∗ ∓ y ∗ + e1 ∓ e2

so that
e ≡ (x ∓ y) − (x ∗ ∓ y ∗ ) = e1 ∓ e2
and hence
|e| ≤ |e1 | + |e2 |
that is,
max(|e|) = |e1 | + |e2 |
The magnitude of the propagated error is therefore not more than the sum of
the initial absolute errors; of course, it may be zero.
(b) Under the operation multiplication,
        xy − x*y* = x*e2 + y*e1 + e1e2
    so that
        |(xy − x*y*)/(x*y*)| ≤ |e1/x*| + |e2/y*| + |e1e2/(x*y*)|
    and so
        max(erel) ≈ |e1/x*| + |e2/y*|
    assuming e1e2/(x*y*) is negligible. The maximum relative error propagated
    is approximately the sum of the initial relative errors. The same result is
    obtained when the operation is division.

4 Error generation
Often (for example, in a computer) an operation ⊗ is also approximated, by an
operation ⊗∗ , say. Consequently, x ⊗ y is represented by x ∗ ⊗∗ y ∗ . Indeed, one
has

|x ⊗ y − x ∗ ⊗∗ y ∗ | = |(x ⊗ y − x ∗ ⊗ y ∗ ) + (x ∗ ⊗ y ∗ − x ∗ ⊗∗ y ∗ )|
≤ |x ⊗ y − x ∗ ⊗ y ∗ | + |x ∗ ⊗ y ∗ − x ∗ ⊗∗ y ∗ |

so that the accumulated error does not exceed the sum of the propagated and
generated errors. Examples may be found in Step 4.

5 Example
Here we evaluate (as accurately as possible) the following:
(i) 3.45 + 4.87 − 5.16
(ii) 3.55 × 2.73
There are two methods which the student may consider, the first of which
is to invoke the concepts of absolute and relative error as defined in this Step.
Thus the result for (i) is 3.16 ± 0.015, since the maximum absolute error is
0.005 + 0.005 + 0.005 = 0.015. One concludes that the answer is 3 (to 1S ), for
the number certainly lies between 3.145 and 3.175. In (ii), the product 9.6915 is
subject to the maximum relative error
    0.005/3.55 + 0.005/2.73 + (0.005/3.55) × (0.005/2.73) ≈ (1/3.55 + 1/2.73) × 0.005
hence the maximum (absolute) error ≈ (2.73 + 3.55) × 0.005 ≈ 0.03, so that the
answer is 9.7.
A second approach is to use ‘interval arithmetic’. Thus, the approximate number
3.45 represents a number in the interval (3.445, 3.455), etc. Consequently, the
result for (i) lies in the interval bounded below by

3.445 + 4.865 − 5.165 = 3.145

and above by
3.455 + 4.875 − 5.155 = 3.175
Similarly, in (ii) the result lies in the interval bounded below by

3.545 × 2.725 ≈ 9.66

and above by
3.555 × 2.735 ≈ 9.72
Hence one again concludes that the approximate numbers 3 and 9.7 correctly
represent the respective results to (i) and (ii).
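The second approach is easy to mechanize. The following Python sketch (our
illustration, not from the text) propagates the lower and upper interval bounds
through the two calculations of the Example:

    # each value correct to 2D lies within ±0.005 of the stated number

    # (i) 3.45 + 4.87 − 5.16
    lo = (3.45 - 0.005) + (4.87 - 0.005) - (5.16 + 0.005)   # smallest possible result
    hi = (3.45 + 0.005) + (4.87 + 0.005) - (5.16 - 0.005)   # largest possible result
    print(lo, hi)        # approximately 3.145 and 3.175

    # (ii) 3.55 × 2.73 (all factors positive, so the bounds multiply directly)
    lo = (3.55 - 0.005) * (2.73 - 0.005)
    hi = (3.55 + 0.005) * (2.73 + 0.005)
    print(lo, hi)        # approximately 9.66 and 9.72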

Checkpoint

1. What distinguishes propagated and generated error?


2. How may the propagated error for the operations addition (subtrac-
tion) and multiplication (division) be determined?

EXERCISES
Evaluate the following as accurately as possible, assuming all values are correct
to the number of digits given:
1. 8.24 + 5.33.
2. 124.53 − 124.52.
3. 4.27 × 3.13.
4. 9.48 × 0.513 − 6.72.
5. 0.25 × 2.84/0.64.
6. 1.73 − 2.16 + 0.08 + 1.00 − 2.23 − 0.97 + 3.02.
STEP 4

ERRORS 4
Floating point arithmetic

In Step 2, floating point representation was introduced as a convenient way of


dealing with large or small numbers. Since most scientific computation involves
such numbers, many students will be familiar with floating point arithmetic and
will appreciate the way in which it facilitates calculations involving multiplication
or division.
To investigate the implications of finite number representation we need to
examine the way in which arithmetic is carried out with floating point numbers.
The specifications below apply to most computers that round, and are easily
adapted to those that chop. For simplicity in our examples, we will use a three-digit
decimal mantissa normalized to lie in the range [1, 10), that is, 1 ≤ |mantissa| <
10 (most computers use binary representation and the mantissa is commonly
normalized to lie in the range [1/2, 1)). Note that up to six digits are used for
intermediate results but the final result of each operation is a normalized three-
digit decimal floating point number.

1 Addition and subtraction


The mantissae are added or subtracted (after shifting the mantissa and increasing
the exponent of the smaller number, if necessary, to make the exponents agree);
the final normalized result is obtained by rounding (after shifting the mantissa and
adjusting the exponent, if necessary). Thus:

3.12 × 10¹ + 4.26 × 10¹ = 7.38 × 10¹
2.77 × 10² + 7.55 × 10² = 10.32 × 10² → 1.03 × 10³
6.18 × 10¹ + 1.84 × 10⁻¹ = 6.18 × 10¹ + 0.0184 × 10¹
                         = 6.1984 × 10¹ → 6.20 × 10¹
3.65 × 10⁻¹ − 2.78 × 10⁻¹ = 0.87 × 10⁻¹ → 8.70 × 10⁻²

2 Multiplication
The exponents are added and the mantissae are multiplied; the final result is
obtained by rounding (after shifting the mantissa right and increasing the exponent
by 1, if necessary). Thus:

(4.27 × 10¹) × (3.68 × 10¹) = 15.7136 × 10² → 1.57 × 10³
(2.73 × 10²) × (−3.64 × 10⁻²) = −9.9372 × 10⁰ → −9.94 × 10⁰

3 Division
The exponents are subtracted and the mantissae are divided; the final result is
obtained by rounding (after shifting the mantissa left and reducing the exponent
by 1, if necessary). Thus:
5.43 × 10¹/(4.55 × 10²) = 1.19340 . . . × 10⁻¹ → 1.19 × 10⁻¹
−2.75 × 10²/(9.87 × 10⁻²) = −0.278622 . . . × 10⁴ → −2.79 × 10³

4 Expressions
The order of evaluation is determined in a standard way and the result of each
operation is a normalized floating point number. Thus:
(6.18 × 10¹ + 1.84 × 10⁻¹)/((4.27 × 10¹) × (3.68 × 10¹))
    → 6.20 × 10¹/(1.57 × 10³) = 3.94904 . . . × 10⁻² → 3.95 × 10⁻²

5 Generated error
We note that all the above examples (except the subtraction and the first addition)
involve generated errors which are relatively large because of the small number
of digits in the mantissae. Thus the generated error in
2.77 × 10² + 7.55 × 10² = 10.32 × 10² → 1.03 × 10³
is 0.002 × 10³. Since the propagated error in this example may be as large as
0.01 × 10² (assuming the operands are correct to 3S), we can use the result
given in Section 4 of Step 3 to deduce that the accumulated error cannot exceed
0.002 × 10³ + 0.01 × 10² = 0.003 × 10³.

6 Consequences
The peculiarities of floating point arithmetic lead to some unexpected and unfor-
tunate consequences, including the following:
(a) Addition or subtraction of a small (but nonzero) number may have no effect,
for example,
5.18 × 10² + 4.37 × 10⁻¹ = 5.18 × 10² + 0.00437 × 10²
                         = 5.18437 × 10² → 5.18 × 10²
thus, the additive identity is not unique.
(b) Frequently the result of a × (1/a) is not 1, for example, if a = 3.00 × 10⁰,
then
    1/a → 3.33 × 10⁻¹
and
    a × (1/a) → 9.99 × 10⁻¹
thus, the multiplicative inverse may not exist.

(c) The result of (a + b) + c is not always the same as the result of a + (b + c),
for example, if
a = 6.31 × 10¹, b = 4.24 × 10⁰, c = 2.47 × 10⁻¹
then
(a + b) + c = (6.31 × 10¹ + 0.424 × 10¹) + 2.47 × 10⁻¹
            → 6.73 × 10¹ + 0.0247 × 10¹
            → 6.75 × 10¹
whereas
a + (b + c) = 6.31 × 10¹ + (4.24 × 10⁰ + 0.247 × 10⁰)
            → 6.31 × 10¹ + 4.49 × 10⁰
            → 6.31 × 10¹ + 0.449 × 10¹
            → 6.76 × 10¹
thus, the associative law for addition does not always hold.
Examples involving adding many numbers of varying size indicate that adding
in order of increasing magnitude is preferable to adding in the reverse order.
(d) Subtracting a number from another nearly equal number may result in loss
of significance or cancellation error. To illustrate this loss of accuracy, sup-
pose we evaluate f (x) = 1 − cos x for x = 0.05 using three-digit decimal
normalized floating point arithmetic with rounding. Then
1 − cos(0.05) = 1 − 0.99875 . . .
             → 1.00 × 10⁰ − 0.999 × 10⁰
             → 1.00 × 10⁻³
Although the value of 1 is exact and cos(0.05) is correct to 3S when expressed
as a three-digit floating point number, their computed difference is correct
to only 1S! (The two zeros after the decimal point in 1.00 × 10⁻³ ‘pad’ the
number.)
The approximation 0.999 ≈ cos(0.05) has a relative error of about 2.5 × 10⁻⁴.
By comparison, the relative error of 1.00 × 10⁻³ ≈ 1 − cos(0.05) is about
0.2 and so much larger. Thus subtraction of two nearly equal numbers should
be avoided whenever possible.
For f (x) = 1 − cos x we can avoid this loss of significant digits by writing
    1 − cos x = (1 − cos x)(1 + cos x)/(1 + cos x) = (1 − cos²x)/(1 + cos x) = sin²x/(1 + cos x)
This last formula is more suitable for calculations when x is close to 0. It can
be verified that the more accurate approximation of 1.25 × 10⁻³ is obtained
for 1 − cos(0.05) when three-digit floating point arithmetic is used.
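Many of these effects can be reproduced by simulating the three-digit arithmetic
directly. The following Python sketch (our illustration; the helper fl3 is not from
the text) rounds every result to three significant digits and repeats the associativity
example of (c) and the cancellation example of (d):

    import math

    def fl3(x):
        # round x to a three-digit decimal mantissa (three significant figures)
        return float(f"{x:.3g}")

    a, b, c = 6.31e1, 4.24e0, 2.47e-1
    left = fl3(fl3(a + b) + c)    # (a + b) + c
    right = fl3(a + fl3(b + c))   # a + (b + c)
    print(left, right)            # 67.5 67.6 — the associative law fails

    print(fl3(fl3(1.0) - fl3(math.cos(0.05))))                       # 0.001, only 1S correct
    print(fl3(fl3(math.sin(0.05)) ** 2 / fl3(1.0 + math.cos(0.05)))) # 0.00125, much closer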

Checkpoint

1. Why is it sometimes necessary to shift the mantissa and adjust the


exponent of a floating point number?
2. Does floating point arithmetic obey the usual laws of arithmetic?
3. Why should the subtraction of two nearly equal numbers be avoided?

EXERCISES

1. Evaluate the following using three-digit decimal normalized floating point


arithmetic with rounding:
(a) 6.19 × 10² + 5.82 × 10².
(b) 6.19 × 10² + 3.61 × 10¹.
(c) 6.19 × 10² − 5.82 × 10².
(d) 6.19 × 10² − 3.61 × 10¹.
(e) (3.60 × 10³) × (1.01 × 10⁻¹).
(f) (−7.50 × 10⁻¹) × (−4.44 × 10¹).
(g) (6.45 × 10²)/(5.16 × 10⁻¹).
(h) (−2.86 × 10⁻²)/(3.29 × 10³).
2. Estimate the accumulated errors in the results of Exercise 1, assuming that
all values are correct to 3S.
3. Evaluate the following, using four-digit decimal normalized floating point
arithmetic with rounding, then recalculate carrying all decimal places and
estimate the propagated error.
(a) Given a = 6.842 × 10⁻¹, b = 5.685 × 10¹, c = 5.641 × 10¹, find
    a(b − c) and ab − ac.
(b) Given a = 9.812 × 10¹, b = 4.631 × 10⁻¹, c = 8.340 × 10⁻¹, find
    (a + b) + c and a + (b + c).
4. Use four-digit decimal normalized floating point arithmetic with rounding to
calculate f (x) = tan x − sin x for x = 0.1. Since
tan x − sin x = tan x(1 − cos x) = tan x(2 sin²(x/2))
f (x) may be written as f (x) = 2 tan x sin²(x/2). Repeat the calculation
using this alternative expression. Which of the two values is more accurate?
STEP 5

ERRORS 5
Approximation to functions

An important procedure in Analysis is to represent a given function as an infinite


series of terms involving simpler or otherwise more appropriate functions. Thus,
if f is the given function, it may be represented as the series expansion

f (x) = a0 φ0 (x) + a1 φ1 (x) + · · · + an φn (x) + · · ·

involving the set of functions {φ j }. Mathematicians have spent a lot of effort in


discussing the convergence of series; that is, in defining conditions for which the
partial sum
sn (x) = a0 φ0 (x) + a1 φ1 (x) + · · · + an φn (x)
approximates the function value f (x) ever more closely as n increases. In Numer-
ical Analysis, we are primarily concerned with such convergent series; computing
the sequence of partial sums is an approximation process in which the truncation
error may be made as small as we please by taking sufficient terms into account.

1 The Taylor series


The most important expansion to represent a function is the Taylor series. If f is
suitably smooth in the neighbourhood of some chosen point x0 we have
    f(x) = f(x0) + h f′(x0) + (h²/2!) f″(x0) + · · · + (hⁿ/n!) f⁽ⁿ⁾(x0) + Rn
where
    f⁽ᵏ⁾(x0) ≡ dᵏf/dxᵏ evaluated at x = x0,
h = x − x0 denotes the displacement from x0 to point x in the neighbourhood,
and the remainder term is
    Rn = (hⁿ⁺¹/(n + 1)!) f⁽ⁿ⁺¹⁾(ξ)
for some point ξ between x0 and x. (This is known as the Lagrange form of the
remainder; its derivation may be found in Section 8.7 of Thomas and Finney (1992)
cited in the Bibliography.) Note that the ξ in this expression for Rn may be written
as ξ = x0 + θh, where 0 < θ < 1.
The Taylor expansion converges for x within some range including the point x0 ,
a range which lies within the neighbourhood of x0 mentioned above. Within this
range of convergence, the truncation error due to discarding terms after the xⁿ
term (equal to the value of Rn at point x) can be made smaller in magnitude than

any positive constant by choosing n sufficiently large. In other words, by using


Rn to decide how many terms are needed, one may evaluate the function at any
point in the range of convergence as accurately as the accumulation of round-off
error permits.
From the viewpoint of the numerical analyst, it is most important that the
convergence be fast enough. For example, if we consider f (x) = sin x we have

    f′(x) = cos x
    f″(x) = −sin x
    etc.

and the expansion (about x0 = 0) for n = 2k − 1 is given by

    sin x = x − x³/3! + x⁵/5! − · · · + (−1)ᵏ⁻¹x²ᵏ⁻¹/(2k − 1)! + R2k−1
with
    R2k−1 = ((−1)ᵏx²ᵏ⁺¹/(2k + 1)!) cos ξ
Note that this expansion has only odd-powered terms so, although the polynomial
approximation is of degree (2k − 1), it has only k terms. Moreover, the absence of
even-powered terms means that the same polynomial approximation is obtained
with n = 2k, and hence R2k−1 = R2k; the remainder term R2k−1 given above is
actually the expression for R2k. Since |cos ξ| ≤ 1, then
    |R2k−1| ≤ |x|²ᵏ⁺¹/(2k + 1)!;
if 5D accuracy is required, it follows that we need only take k = 2 at x = 0.1, and
k = 4 at x = 1 (since 9! = 362 880). On the other hand, the expansion for the
natural (base e) logarithm,

    ln(1 + x) = x − x²/2 + x³/3 − · · · + (−1)ⁿ⁻¹xⁿ/n + Rn
where
    Rn = (−1)ⁿxⁿ⁺¹/((n + 1)(1 + ξ)ⁿ⁺¹)
is less suitable. Although only n = 4 terms are needed to give 5D accuracy at
x = 0.1, n = 13 is required for 5D accuracy at x = 0.5, and n = 19 gives just 1D
accuracy at x = 1!
Further, we remark that the Taylor series is not only used extensively to represent
functions numerically, but also to analyse the errors involved in various algorithms
(for example, see Steps 8, 9, 10, 30, and 31).
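To see the speed of convergence in practice, the following Python sketch (our
illustration; the function and parameter names are ours) sums the truncated series
for sin x, taking just enough odd-powered terms for the remainder bound
|x|²ᵏ⁺¹/(2k + 1)! to fall below 0.5 × 10⁻⁵ (5D accuracy):

    import math

    def taylor_sin(x, tol=0.5e-5):
        term, total, k = x, 0.0, 0
        while True:
            total += term
            k += 1
            # bound on the neglected remainder: |x|**(2k+1) / (2k+1)!
            if abs(x) ** (2 * k + 1) / math.factorial(2 * k + 1) < tol:
                return total, k            # k terms of the series were used
            term *= -x * x / ((2 * k) * (2 * k + 1))   # next term from the previous one

    print(taylor_sin(0.1))   # about (0.0998333, 2): k = 2 suffices at x = 0.1
    print(taylor_sin(1.0))   # about (0.8414683, 4): k = 4 suffices at x = 1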

2 Polynomial approximation
The Taylor series provides a simple method of polynomial approximation (of
chosen degree n),

    f(x) ≈ a0 + a1x + a2x² + · · · + anxⁿ

which is basic to the discussion of various elementary numerical procedures in this


textbook. Because f is often complicated, one may prefer to execute operations
such as differentiation and integration on a polynomial approximation. Interpol-
ation formulae (see Steps 22 and 23) may also be used to construct polynomial
approximations.

3 Other series expansions


There are many other series expansions, such as the Fourier series (in terms
of sines and cosines), or those involving various orthogonal functions (Legendre
polynomials, Chebyshev polynomials, Bessel functions, etc.). From the numerical
standpoint, truncated Fourier series and Chebyshev polynomial series have proven
to be the most useful. Fourier series are appropriate in dealing with functions with
natural periodicity, while Chebyshev series provide the most rapid convergence
of all known approximations based on polynomials.
Occasionally, it is possible to represent a function adequately (from the numer-
ical standpoint) by truncating a series which does not converge in the mathematical
sense. For example, solutions are sometimes obtained in the form of asymptotic
series with leading terms which provide sufficiently accurate numerical results.
While we confine our attention in this book to truncated Taylor series, the
interested reader should be aware that such alternative expansions exist (see, for
example, Burden and Faires (1993)).

4 Recursive procedures
While a truncated series with few terms may be a practical way to compute values
of a function, there is a number of arithmetic operations involved, so if available
some recursive procedure which reduces the arithmetic required may be preferred.
For example, the values of the polynomial
    P(x) = a0 + a1x + a2x² + · · · + anxⁿ

and its derivative


    P′(x) = a1 + 2a2x + · · · + nanxⁿ⁻¹

for x = x̄ may be generated recursively under the scheme:


pk = pk−1 x̄ + an−k , qk = qk−1 x̄ + pk−1 , k = 1, 2, . . . , n

with p0 = an and q0 = 0.

Thus, for successive values of k one has

    p1 = p0 x̄ + an−1 = an x̄ + an−1                q1 = q0 x̄ + p0 = an
    p2 = p1 x̄ + an−2 = an x̄² + an−1 x̄ + an−2      q2 = q1 x̄ + p1 = 2an x̄ + an−1
     . . .                                          . . .
    pn = P(x̄)                                      qn = P′(x̄)

The technique just described is known as nested multiplication. (Perhaps the


student may be able to suggest a recursive procedure for even higher derivatives
of P.)
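A minimal Python version of this scheme (our illustration; the function name is
not from the text) is:

    def nested_multiplication(coeffs, xbar):
        # coeffs = [a0, a1, ..., an]; returns P(xbar) and P'(xbar)
        p = coeffs[-1]      # p0 = an
        q = 0.0             # q0 = 0
        for a in reversed(coeffs[:-1]):
            q = q * xbar + p        # qk = q(k-1)*xbar + p(k-1)
            p = p * xbar + a        # pk = p(k-1)*xbar + a(n-k)
        return p, q         # pn = P(xbar), qn = P'(xbar)

    # Exercise 5 below: P(x) = x³ − 2x² + 2x + 3 at x = 3.1
    print(nested_multiplication([3.0, 2.0, -2.0, 1.0], 3.1))   # approximately (19.771, 18.43)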
Finally, it should be noted that it is common to generate members of a set of
orthogonal functions recursively.

Checkpoint

1. How do numerical analysts use the remainder term Rn in Taylor


series?
2. Why is ‘speed of convergence’ so important from the numerical
standpoint?
3. From the numerical standpoint, is it essential for a series represen-
tation to converge in the mathematical sense?

EXERCISES

1. Find the Taylor series expansions about x = 0 for each of the following
functions.
(a) cos x.
(b) 1/(1 − x).
(c) eˣ.
For each series also determine a general remainder term.
2. For each of the functions in Exercise 1, evaluate f (0.5) using a calculator
and by using the first four terms of your Taylor expansion.
3. Use the remainder term found in Exercise 1(c) to find the value of n required
in the Taylor series for f (x) = eˣ about x = 0 to give 5D accuracy for all x
between 0 and 1.
4. Truncate the Taylor series found in Exercise 1(c) to give linear, quadratic,
and cubic polynomial approximations for f (x) = eˣ in the neighbourhood
of x = 0. Use the remainder term to estimate (to the nearest 0.1) the range
over which each polynomial approximation yields results correct to 2D.

5. Evaluate P(3.1) and P′(3.1), where P(x) = x³ − 2x² + 2x + 3, using the
   technique of nested multiplication.
6. Evaluate P(2.6) and P′(2.6), where P(x) = 2x⁴ − x³ + 3x² + 5, using the
   technique of nested multiplication. Check your values using a calculator.
STEP 6

NONLINEAR EQUATIONS 1
Nonlinear algebraic and transcendental equations

The first nonlinear equation encountered in algebra courses is usually the quadratic
equation
    ax² + bx + c = 0
and all students will be familiar with the formula for its roots:

    x = (−b ± √(b² − 4ac)) / (2a)
The formula for the roots of a general cubic is somewhat more complicated and
that for a general quartic usually takes several pages to describe! We are spared
further effort by a theorem which states that there is no such formula for general
polynomials of degree higher than four. Accordingly, except in special cases (for
example, when factorization is easy), we prefer in practice to use a numerical
method to solve polynomial equations of degree higher than two.
Another class of nonlinear equations consists of those which involve transcen-
dental functions such as eˣ, ln x, sin x, and tan x. Useful analytic solutions of
such equations are rare so we are usually forced to use numerical methods.

1 A transcendental equation
We shall use a simple mathematical problem to show that transcendental equations
do arise quite naturally. Suppose we seek the height of liquid in a cylindrical tank
of radius r , lying with its axis horizontal, when the tank is a quarter full (see
Figure 2). Suppose the height of liquid is h (DB in the diagram). The condition
to be satisfied is that the area of the segment ABC should be 1/4 of the area of the
circle. This reduces to
    2[(1/2)r²θ − (1/2)(r sin θ)(r cos θ)] = (1/4)πr²
((1/2)r²θ is the area of the sector OAB, r sin θ is the base and r cos θ the height of
the triangle OAD.) Hence
    2θ − 2 sin θ cos θ = π/2
or
    x + cos x = 0,  where x = π/2 − 2θ
(since 2 sin θ cos θ = sin 2θ = sin(π/2 − x) = cos x).

When we have solved the transcendental equation

f (x) ≡ x + cos x = 0

we obtain h from
    h = OB − OD = r − r cos θ = r[1 − cos(π/4 − x/2)]

FIGURE 2. Cylindrical tank (cross-section).

2 Locating roots
Let us suppose that our problem is to find some or all of the roots of the nonlinear
equation f (x) = 0. Before we use a numerical method (compare Steps 7–10) we
should have some idea about the number, nature and approximate location of the
roots. The usual approach involves the construction of graphs and perhaps a table
of values of the function f to confirm the information obtained from the graph.
We now illustrate this approach by a few examples.
(i) sin x − x + 0.5 = 0
If we do not have a calculator or computer available to immediately plot the
graph of f (x) = sin x −x +0.5, we can separate f into two parts, sketch two curves
on the one set of axes, and see where they intersect. Because sin x − x + 0.5 = 0
is equivalent to sin x = x − 0.5, we sketch y = sin x and y = x − 0.5. Since
| sin x| ≤ 1 we are only interested in the interval −0.5 ≤ x ≤ 1.5 (outside which
|x − 0.5| > 1). Thus we deduce from the graph (Figure 3) that the equation has
only one real root, near x = 1.5. We can then tabulate f (x) = sin x − x + 0.5
near x = 1.5 as follows (the argument to the sine function should be in radians):

    x        1.5       1.45      1.49
    sin x    0.9975    0.9927    0.9967
    f(x)    −0.0025    0.0427    0.0067
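The tabulation itself is easily produced by machine. A short Python sketch (our
illustration, not part of the original text) for this example is:

    import math

    def f(x):
        return math.sin(x) - x + 0.5

    # tabulate near the root suggested by the graph (x in radians)
    for x in (1.5, 1.45, 1.49):
        print(f"{x:5.2f}  {math.sin(x):7.4f}  {f(x):8.4f}")
    # the sign change between 1.49 and 1.50 brackets the root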

FIGURE 3. Graphs of y = x − 0.5 and y = sin x.

We now know that the root lies between 1.49 and 1.50, and we can use a numerical
method to obtain a more accurate answer, as discussed in the following Steps.
(ii) e^(−0.2x) = x(x − 2)(x − 3)
Again we sketch two curves:

    y = e^(−0.2x)

and
y = x(x − 2)(x − 3)
In sketching the second curve we use the three obvious zeros at x = 0, 2, and 3; as
well as the knowledge that x(x − 2)(x − 3) is negative for x < 0 and 2 < x < 3,
but positive and increasing steadily for x > 3. We deduce from the graph (Figure
4) that there are three real roots, near x = 0.2, 1.8, and 3.1, and tabulate as follows
(with f (x) = e^(−0.2x) − x(x − 2)(x − 3)):

    x                  0.2       0.15      1.8       1.6       3.1       3.2
    e^(−0.2x)          0.9608    0.9704    0.6977    0.7261    0.5379    0.5273
    x(x − 2)(x − 3)    1.0080    0.7909    0.4320    0.8960    0.3410    0.7680
    f(x)              −0.0472    0.1796    0.2657   −0.1699    0.1969   −0.2407

We conclude that the roots lie between 0.15 and 0.2, 1.6 and 1.8, and 3.1 and 3.2,
respectively. Note that the values in the table were calculated using an accuracy
of at least 5S, but are displayed to only 4D. For example, working to 5S accuracy
we have f (0.15) = 0.97045 − 0.79088 = 0.17957, which is then rounded to

FIGURE 4. Graphs of y = e^(−0.2x) and y = x(x − 2)(x − 3).

0.1796. Thus the entry in the table for f (0.15) is 0.1796 and not 0.1795 as one
might expect from calculating 0.9704 − 0.7909.

Checkpoint

1. Why are numerical methods used in solving nonlinear equations?


2. How does a transcendental equation differ from an algebraic equa-
tion?
3. What kind of information is used when sketching curves for the
location of roots?

EXERCISES

1. Locate the roots of the equation


x + cos x = 0
2. Use sketch curves to roughly locate all the roots of the following equations.
(a) x + 2 cos x = 0.
(b) x + eˣ = 0.
(c) x(x − 1) − eˣ = 0.
(d) x(x − 1) − sin x = 0.
STEP 7

NONLINEAR EQUATIONS 2
The bisection method

The bisection method † for finding the roots of the equation f (x) = 0 is based on
the following theorem.

Theorem: If f is continuous for x between a and b and if f (a) and f (b) have
opposite signs, then there exists at least one real root of f (x) = 0 between a and
b.

1 Procedure
Suppose that a continuous function f is negative at x = a and positive at x = b, so
that there is at least one real root between a and b. (Usually a and b may be found
from a graph of f .) If we calculate f ((a + b)/2), which is the function value at
the point of bisection of the interval a < x < b, there are three possibilities:
(a) f ((a + b)/2) = 0, in which case (a + b)/2 is the root;
(b) f ((a + b)/2) < 0, in which case the root lies between (a + b)/2 and b;
(c) f ((a + b)/2) > 0, in which case the root lies between a and (a + b)/2.
Presuming there is just one root, if case (a) occurs the process is terminated. If
either case (b) or case (c) occurs, the process of bisection of the interval containing
the root can be repeated until the root is obtained to the desired accuracy. In Figure
5, the successive points of bisection are denoted by x1 , x2 , and x3 .
FIGURE 5. Successive bisection.

† This method is suitable for implementation on a computer. Pseudo-code for study and use in
programming may be found on page 164.

2 Effectiveness
The bisection method is almost certain to give a root. Provided the conditions of
the above theorem hold, it can only fail if the accumulated error in the calculation
of f at a bisection point gives it a small negative value when actually it should
have a small positive value (or vice versa); the interval subsequently chosen would
therefore be wrong. This can be overcome by working to sufficient accuracy, and
this almost-assured convergence is not true of many other methods of finding a
root.
One drawback of the bisection method is that it applies only for roots of f
about which f (x) changes sign. In particular, double roots can be overlooked;
one should be careful to examine f (x) in any range where it is small, so that
repeated roots about which f (x) does not change sign are otherwise evaluated
(for example, see Steps 9 and 10). Of course, such a close examination also avoids
another nearby root being overlooked.
Finally, note that bisection is rather slow; after n iterations the interval con-
taining the root is of length (b − a)/2ⁿ. However, provided values of f can be
generated readily, as when a computer is used, the rather large number of itera-
tions which can be involved in the application of bisection is of relatively little
consequence.

3 Example
Let us solve 3x eˣ = 1 to three decimal places by the bisection method.
We can consider f (x) = 3x − e⁻ˣ, which changes sign in the interval 0.25 <
x < 0.27; one may tabulate (working to 4D ) as follows:
    x       3x      e⁻ˣ       f(x)
    0.25    0.75    0.7788   −0.0288
    0.27    0.81    0.7634    0.0466
(The student should ascertain graphically that there is just one root.)
Let us denote the lower and upper endpoints of the interval bracketing the root
at the n-th iteration by an and bn respectively (with a1 = 0.25 and b1 = 0.27).
Then the approximation to the root at the n-th iteration is given by xn = (an +
bn )/2. Since the root is either in [an , xn ] or [xn , bn ] and both intervals are of
length (bn − an )/2, we see that xn will be accurate to three decimal places when
(bn − an )/2 < 5 × 10−4 . Proceeding to bisection:
    n    an       bn       xn = (an + bn)/2    3xn      e^(−xn)    f(xn)
    1    0.25     0.27     0.26                0.78     0.7711     0.0089
    2    0.25     0.26     0.255               0.765    0.7749    −0.0099
    3    0.255    0.26     0.2575              0.7725   0.7730    −0.0005
    4    0.2575   0.26     0.2588              0.7763   0.7720     0.0042
    5    0.2575   0.2588   0.2581              0.7744   0.7725     0.0019
    6    0.2575   0.2581   0.2578

(Note that the values in the table are displayed to only 4D.) Hence the root accurate
to three decimal places is 0.258.
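For comparison with the pseudo-code referred to on page 164, a minimal Python
version of the method (our illustration; the function and parameter names are
ours), applied to this example, is:

    import math

    def bisect(f, a, b, tol):
        # assumes f is continuous and f(a), f(b) have opposite signs
        while (b - a) / 2.0 >= tol:
            x = (a + b) / 2.0
            if f(x) == 0.0:
                return x            # case (a): x is the root
            if f(a) * f(x) < 0.0:
                b = x               # the root lies between a and x
            else:
                a = x               # the root lies between x and b
        return (a + b) / 2.0

    root = bisect(lambda x: 3.0 * x - math.exp(-x), 0.25, 0.27, 5e-4)
    print(round(root, 3))           # 0.258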

Checkpoint

1. When may the bisection method be used to find a root of the equation
f (x) = 0?
2. What are the three possible choices after a bisection value is calcu-
lated?
3. What is the maximum error after n iterations of the bisection
method?

EXERCISES

1. Use the bisection method to find the root of the equation


x + cos x = 0
correct to two decimal places (2D ).
2. Use the bisection method to find the positive root of the equation
x − 0.2 sin x − 0.5 = 0
to 3D.
3. Each equation in Exercises 2(a)–2(c) of Step 6 on page 26 has only one root.
For each equation use the bisection method to find the root correct to 2D.
STEP 8

NONLINEAR EQUATIONS 3
Method of false position

As mentioned in the Prologue, the method of false position† dates back to the
ancient Egyptians. It remains an effective alternative to the bisection method for
solving the equation f (x) = 0 for a real root between a and b, given that f is
continuous and f (a) and f (b) have opposite signs.

1 Procedure
The curve y = f (x) is not generally a straight line. However, one may join the
points
(a, f (a)) and (b, f (b))
by the straight line
   [y − f(a)] / [f(b) − f(a)] = (x − a) / (b − a)

The straight line cuts the x-axis at (x̄, 0), where

   [0 − f(a)] / [f(b) − f(a)] = (x̄ − a) / (b − a)

so that

   x̄ = a − f(a)(b − a) / [f(b) − f(a)]

     = [a f(b) − b f(a)] / [f(b) − f(a)]  =  1/[f(b) − f(a)] × | a   f(a) |
                                                               | b   f(b) |

Let us suppose that f (a) is negative and f (b) is positive. As in the bisection
method, there are three possibilities:
(a) f (x̄) = 0, in which case x̄ is the root;
(b) f (x̄) < 0, in which case the root lies between x̄ and b;
(c) f (x̄) > 0, in which case the root lies between a and x̄.
Again, if case (a) occurs, the process is terminated; if either case (b) or case
(c) occurs, the process can be repeated until the root is obtained to the desired
accuracy. In Figure 6 the successive points where the straight lines cut the x-axis
are denoted by x1 , x2 , and x3 .
† This method is suitable for implementation on a computer. Pseudo-code for study and use in
programming may be found on page 165.

[Figure: the curve y = f(x) on [a, b]; each chord joining the end points cuts the x-axis at the successive approximations x1, x2 and x3.]
FIGURE 6. Method of false position.

2 Effectiveness and the secant method


Like the bisection method, the method of false position has almost assured con-
vergence, and it may converge to a root faster. However, it may happen that most
or all of the calculated approximations xn are on the same side of the root (see
Figure 7). One consequence of this is that, as iterations of the method of false
position are carried out, although the length of the interval bracketing the root gets
smaller, it may not go to zero. Thus the length of the interval may be unsuitable
for use as a stopping criterion for the method. Instead it is common to look at
the size of | f (xn )| and at the difference between successive estimates from the
method.
Another consequence which results from having the calculated approximations
xn on the same side of the root is that convergence may be slow. This is avoided
[Figure: a case in which all the calculated false-position approximations x1, x2, . . . lie on the same side of the root.]
FIGURE 7. Method of false position.

in the secant method, which resembles the method of false position except that
no attempt is made to ensure that the root α is enclosed. Starting with two
approximations (x0 and x1 ) to the root α, further approximations x2 , x3 , . . . are
computed from
   xn+1 = xn − f(xn)(xn − xn−1) / [f(xn) − f(xn−1)]
We no longer have assured convergence, but the process is simpler (the sign
of f (xn+1 ) is not tested) and often converges faster. With respect to speed of
convergence of the secant method, we have the error at the (n + 1)-th iteration:
   en+1 = α − xn+1
        = [(α − xn−1) f(xn) − (α − xn) f(xn−1)] / [f(xn) − f(xn−1)]
        = [en−1 f(α − en) − en f(α − en−1)] / [f(α − en) − f(α − en−1)]

Hence, expanding in terms of Taylor series,

   en+1 = {en−1 [f(α) − en f′(α) + (en²/2!) f″(α) − · · ·] − en [f(α) − en−1 f′(α) + (en−1²/2!) f″(α) − · · ·]}
          / {[f(α) − en f′(α) + · · ·] − [f(α) − en−1 f′(α) + · · ·]}

        ≈ −[f″(α) / (2 f′(α))] en−1 en

where we have used the fact that f(α) = 0. Thus we see that en+1 is proportional to
en en−1, which may be expressed in mathematical notation as en+1 ∼ en−1 en. We
seek k such that en ∼ en−1^k; then en+1 ∼ en^k ∼ en−1^(k²) and en−1 en ∼ en−1^(k+1), so that we
deduce k² ≈ k + 1, whence k ≈ (1 + √5)/2 ≈ 1.618. The speed of convergence
is therefore faster than linear (k = 1), but slower than quadratic (k = 2). This rate
of convergence is sometimes referred to as superlinear convergence.
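
A minimal Python sketch of the secant iteration described above may help; the stopping test on |f(xn)| and the iteration cap are illustrative choices, not taken from the text:

import math

def secant(f, x0, x1, ftol=5e-6, max_iter=50):
    # Secant method: no bracketing of the root, so convergence is not guaranteed.
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)   # secant step
        x0, f0 = x1, f1
        x1, f1 = x2, f(x2)
        if abs(f1) < ftol:
            return x1
    raise RuntimeError("secant method did not converge")

print(secant(lambda x: 3*x - math.exp(-x), 0.25, 0.27))   # about 0.2576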

3 Example
We solve 3xe^x = 1 by the method of false position, stopping when |f(xn)| <
5 × 10−6, where f(x) = 3x − e−x.
In the previous Step, we observed that the root lies in the interval 0.25 < x <
0.27. Consequently, with calculations displayed to 6D, the first approximation is
given by

   x1 = 1/(0.046621 + 0.028801) × | 0.25   −0.028801 |
                                  | 0.27    0.046621 |

      = (0.011655 + 0.007776)/0.075421 = 0.257637

Then
f (x1 ) = f (0.257637) = 3 × 0.257637 − 0.772875
= 0.772912 − 0.772875 = 0.000036
The student may verify that doing one more iteration of the method of false
position yields an estimate x2 = 0.257628 for which the function value is less
than 5 × 10−6 . Since x1 and x2 agree to 4D, we conclude that the root is 0.2576
correct to 4D.
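
The iteration used in this example can be sketched in Python as follows (a minimal illustration; the names are not from the text):

import math

def false_position(f, a, b, ftol=5e-6, max_iter=100):
    # Method of false position: the root stays bracketed between a and b.
    fa, fb = f(a), f(b)
    if fa * fb > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    for _ in range(max_iter):
        x = (a * fb - b * fa) / (fb - fa)   # where the chord cuts the x-axis
        fx = f(x)
        if abs(fx) < ftol:
            return x
        if fa * fx < 0:                     # root lies between a and x
            b, fb = x, fx
        else:                               # root lies between x and b
            a, fa = x, fx
    raise RuntimeError("false position did not converge")

print(round(false_position(lambda x: 3*x - math.exp(-x), 0.25, 0.27), 4))   # 0.2576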

Checkpoint

1. When may the method of false position be used to find a root of the
equation f (x) = 0?
2. On what geometric construction is the method of false position
based?

EXERCISES

1. Use the method of false position to find the smallest positive root of the
equation f (x) ≡ 2 sin x + x − 2 = 0, stopping when xn satisfies | f (xn )| <
5 × 10−5 .
2. Compare the results obtained when
(a) the bisection method,
(b) the method of false position, and
(c) the secant method
are used (with starting values 0.7 and 0.9) to solve the equation
1
3 sin x = x +
x
3. Use the method of false position to find the root of the equation
f (x) ≡ x + cos x = 0
stopping when | f (xn )| < 5 × 10−6 .
4. Each equation in Exercises 2(a)–2(c) of Step 6 on page 26 has only one
root. Use the method of false position to find each root, stopping when
| f (xn )| < 5 × 10−6 .
STEP 9

NONLINEAR EQUATIONS 4
The method of simple iteration

The method of simple iteration involves writing the equation f (x) = 0 in a form
x = φ(x) suitable for the construction of a sequence of approximations to some
root, in a repetitive fashion.

1 Procedure
The iteration procedure is as follows. In some way we obtain a rough approxi-
mation x0 of the desired root, which may then be substituted into the right-hand
side to give a new approximation, x1 = φ(x0 ). The new approximation is again
substituted into the right-hand side to give a further approximation x2 = φ(x1 ),
and so on until (hopefully) a sufficiently accurate approximation to the root is ob-
tained. This repetitive process, based on xn+1 = φ(xn ), is called simple iteration;
provided that |xn+1 − xn | decreases as n increases, the process tends to α = φ(α),
where α denotes the root.

2 Example
The method of simple iteration is used to find the root of the equation 3xe^x = 1
to an accuracy of 4D.
One first writes
   x = (1/3)e−x ≡ φ(x)
Assuming x0 = 1 and with numbers displayed to 5D, successive iterations produce

x1 = 0.12263
x2 = 0.29486
x3 = 0.24821
x4 = 0.26007
x5 = 0.25700
x6 = 0.25779
x7 = 0.25759
x8 = 0.25764

Thus we see that after eight iterations the root is 0.2576 to 4D. A graphical
interpretation of the first three iterations is shown in Figure 8.

[Figure: the line y = x and the curve y = (1/3)e−x; the iterates x0 = 1, x1, x2, x3 are traced between the two graphs towards their intersection.]
FIGURE 8. Iterative method.
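
A minimal Python sketch of the repetitive substitution xn+1 = φ(xn) used above (the tolerance on |xn+1 − xn| and the iteration cap are illustrative choices):

import math

def simple_iteration(phi, x0, tol=5e-5, max_iter=100):
    # Iterate x_{n+1} = phi(x_n) until successive values agree to within tol.
    x = x0
    for _ in range(max_iter):
        x_new = phi(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    raise RuntimeError("simple iteration did not converge")

root = simple_iteration(lambda x: math.exp(-x) / 3, 1.0)
print(round(root, 4))    # 0.2576, as found above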

3 Convergence
Whether or not the iteration procedure converges quickly, or indeed at all, depends
on the choice of the function φ, as well as the starting value x0. For example, the
equation x² = 3 has two real roots, ±√3 (≈ ±1.732). It can be rewritten in the
form
   x = 3/x ≡ φ(x)
which suggests the iteration
   xn+1 = 3/xn
However, if the starting value x0 = 1 is used, successive iterations give
   x1 = 3/x0 = 3
   x2 = 3/x1 = 1
   x3 = 3/x2 = 3,   etc.
so that there is no convergence!
We can examine the convergence of the iteration process
xn+1 = φ(xn )
to
α = φ(α)
with the help of the Taylor series (see page 18)
   φ(α) = φ(xk) + (α − xk)φ′(ζk),   k = 0, 1, 2, . . . , n
where ζk is a value between the root α and the approximation xk. We have

   α − x1   = φ(α) − φ(x0) = (α − x0)φ′(ζ0)
   α − x2   = φ(α) − φ(x1) = (α − x1)φ′(ζ1)
     ..                         ..
   α − xn+1 = φ(α) − φ(xn) = (α − xn)φ′(ζn)

Multiplying the n + 1 rows together and cancelling the common factors α − x1,
α − x2, . . . , α − xn leaves
   α − xn+1 = (α − x0)φ′(ζ0)φ′(ζ1) · · · φ′(ζn)
Consequently,
   |α − xn+1| = |α − x0| |φ′(ζ0)| |φ′(ζ1)| · · · |φ′(ζn)|
so that the absolute error |α − xn+1| can be made as small as we please by sufficient
iteration if |φ′| < 1 in the neighbourhood of the root. (Note that the derivative of
φ(x) = 3/x is such that |φ′(x)| = |−3/x²| > 1 for |x| < √3.)

Checkpoint

1. What should a programmer guard against in a computer program


using the method of simple iteration?
2. What is necessary to ensure that the method of simple iteration does
converge to a root?

EXERCISES

1. Assuming the initial guess x0 = 1, show by the method of simple iteration


that one root of the equation 2x − 1 − 2 sin x = 0 is 1.4973.
2. Use the method of simple iteration to find (to 4D ) the root of the equation
x + cos x = 0.
3. Use the method of simple iteration to find to 3D the root of the equation given
in Exercise 2(b) of Step 6 on page 26.
STEP 10

NONLINEAR EQUATIONS 5
The Newton-Raphson iterative method

The Newton-Raphson method † is a process for the determination of a real root of


an equation f (x) = 0, given just one point close to the desired root. It can be
viewed as a limiting case of the secant method (see Step 8) or as a special case of
the method of simple iteration (see Step 9).

1 Procedure
Let x0 denote the known approximate value of the root of f (x) = 0, and let h
denote the difference between the true value α and the approximate value; that is,

α = x0 + h

The second degree terminated Taylor expansion (see page 18) about x0 is

h 2 00
f (α) = f (x0 + h) = f (x0 ) + h f 0 (x0 ) + f (ξ )
2!
where ξ = x0 + θ h, 0 < θ < 1, lies between α and x0 . Ignoring the remainder
term and writing f (α) = 0,

f (x0 ) + h f 0 (x0 ) ≈ 0

so that
f (x0 )
h≈−
f 0 (x0 )
and consequently
f (x0 )
x1 = x0 −
f 0 (x0 )
should be a better estimate of the root than x0 .
Even better approximations may be obtained by repetition (iteration) of the
process, which may then be written as
   xn+1 = xn − f(xn)/f′(xn)
Note that if f is a polynomial we can use the recursive procedure of Step 5 to
compute f (xn ) and f 0 (xn ).
† This method is suitable for implementation on a computer. Pseudo-code for study and use in
programming may be found on page 166.

The geometrical interpretation is that each iteration provides the point at which
the tangent at the original point cuts the x-axis (see Figure 9). Thus the equation
of the tangent at (xn , f (xn )) is

   y − f(xn) = f′(xn)(x − xn)

so that (xn+1, 0) corresponds to

   −f(xn) = f′(xn)(xn+1 − xn)

which leads to
   xn+1 = xn − f(xn)/f′(xn)

[Figure: the curve y = f(x) with tangents drawn at successive approximations; each tangent cuts the x-axis at the next approximation x1, x2, x3 from the starting value x0.]
FIGURE 9. Newton-Raphson method.

2 Example
We find the positive root of the equation sin x = x², correct to 3D, using the
Newton-Raphson method.
It is convenient to use the method of false position to obtain an initial approxi-
mation. Tabulating, one has

   x        f(x) = sin x − x²
   0         0
   0.25      0.1849
   0.5       0.2294
   0.75      0.1191
   1        −0.1585

With working displayed to 4D, we see that there is a root in the interval 0.75 <
x < 1 at approximately

   x0 = 1/(−0.1585 − 0.1191) × | 0.75    0.1191 |
                               | 1      −0.1585 |

      = −(1/0.2777) × (−0.1189 − 0.1191)

      = 0.2380/0.2777 = 0.8573
We now use the Newton-Raphson method; we have

   f(0.8573) = sin(0.8573) − (0.8573)²
             = 0.7561 − 0.7349 = 0.0211

and
   f′(x) = cos x − 2x
giving
   f′(0.8573) = 0.6545 − 1.7145 = −1.0600
Consequently, a better approximation is
   x1 = 0.8573 + 0.0211/1.0600 = 0.8573 + 0.0200 = 0.8772
Repeating the procedure, we obtain

   f(x1) = f(0.8772) = −0.0005

and
   f′(x1) = f′(0.8772) = −1.1151
so that
   x2 = 0.8772 − 0.0005/1.1151 = 0.8772 − 0.0005 = 0.8767
Since f (x2 ) = 0.0000, we conclude that the root is 0.877 to 3D.
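
A minimal Python sketch of the Newton-Raphson iteration applied to this example (the tolerance and iteration cap are illustrative; the book's own pseudo-code is on page 166):

import math

def newton_raphson(f, df, x0, tol=5e-5, max_iter=50):
    # Newton-Raphson: x_{n+1} = x_n - f(x_n)/f'(x_n).
    x = x0
    for _ in range(max_iter):
        x_new = x - f(x) / df(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    raise RuntimeError("Newton-Raphson did not converge")

f  = lambda x: math.sin(x) - x**2
df = lambda x: math.cos(x) - 2*x
print(round(newton_raphson(f, df, 0.8573), 3))   # 0.877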

3 Convergence
If we write
   φ(x) = x − f(x)/f′(x)
the Newton-Raphson iteration expression
   xn+1 = xn − f(xn)/f′(xn)
may be written
   xn+1 = φ(xn)

We observed (see page 36) that in general the iteration method converges when
|φ′(x)| < 1 near the root. In the Newton-Raphson case we have

   φ′(x) = 1 − {[f′(x)]² − f(x) f″(x)}/[f′(x)]² = f(x) f″(x)/[f′(x)]²

so that the criterion for convergence is

   |f(x) f″(x)| < [f′(x)]²

Convergence is therefore not so assured as for the bisection method (say).

4 Speed of convergence
The second degree terminated Taylor expansion about xn is

   f(α) = f(xn + en) = f(xn) + en f′(xn) + (en²/2!) f″(ξn)

where en = α − xn is the error at the n-th iteration and ξn = xn + θen, 0 < θ < 1.
Since f(α) = 0 we have

   0 = f(xn)/f′(xn) + (α − xn) + (en²/2) f″(ξn)/f′(xn)

But from the Newton-Raphson formula we have

   f(xn)/f′(xn) − xn = −xn+1

and so the error at the (n + 1)-th iteration is

   en+1 = α − xn+1 = −(en²/2) f″(ξn)/f′(xn) ≈ −(en²/2) f″(α)/f′(α)
when en is sufficiently small. This result states that the error at iteration (n + 1)
is proportional to the square of the error at iteration n; hence (if f″(α) ≈ 4f′(α))
an answer correct to one decimal place at one iteration should be accurate to two
places at the next iteration, four at the next, eight at the next, etc. This quadratic
(‘second-order’) convergence outstrips the rate of convergence of the methods of
bisection and false position.
In relatively little used computer programs, it may be wise to prefer the methods
of bisection or false position, since convergence is virtually assured. However, for
hand calculations or for computer routines in constant use, the Newton-Raphson
method is usually preferred.

5 The square root


One application of the Newton-Raphson method is in the computation of square
roots. Now, finding √a is equivalent to finding the positive root of x² = a, or
   f(x) = x² − a = 0
Since f′(x) = 2x, we have the Newton-Raphson iteration formula:
   xn+1 = xn − (xn² − a)/(2xn) = (1/2)(xn + a/xn)
(As mentioned in the Prologue, this formula was known to the ancient Greeks.)
Thus, if a = 16 and x0 = 5, we have x1 = ½(5 + 3.2) = 4.1, x2 = ½(4.1 +
3.9024) = 4.0012, and x3 = ½(4.0012 + 3.9988) = 4.0000, with working shown
to 4D.
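
A minimal Python sketch of this square-root recurrence (the stopping rule on successive iterates is an illustrative choice; a > 0 and x0 > 0 are assumed):

def newton_sqrt(a, x0, tol=1e-10):
    # Compute the square root of a by x_{n+1} = (x_n + a/x_n)/2; assumes a > 0, x0 > 0.
    x = x0
    while True:
        x_new = 0.5 * (x + a / x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new

print(newton_sqrt(16.0, 5.0))   # 4.0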

Checkpoint

1. What is the geometrical interpretation of the Newton-Raphson iter-


ative procedure?
2. What is the convergence criterion for the Newton-Raphson method?
3. What major advantage has the Newton-Raphson method over some
other methods?

EXERCISES

1. Use the Newton-Raphson method to solve for the (positive) root of 3xe^x = 1
to four significant digits.
2. Derive the Newton-Raphson iteration formula
   xn+1 = xn − (xn^k − a)/(k xn^(k−1))
for finding the k-th root of a.
3. Use the formula xn+1 = (xn + a/xn )/2 to compute the square root of 10 to
five significant digits, from the initial guess 1.
4. Use the Newton-Raphson method to find (to 4D ) the root of the equation
x + cos x = 0
5. Use the Newton-Raphson method to find (to 4D ) the root of each equation
in Exercises 2(a)–2(c) of Step 6 on page 26.
STEP 11

SYSTEMS OF LINEAR EQUATIONS 1


Solution by elimination

Many phenomena can be modelled by a set of linear equations which describe


relationships between system variables. In simple cases there are two or three
variables; in complex systems (for example, in a linear model of the economy of
a country) there may be several hundred variables. Linear systems also arise in
connection with many problems of numerical analysis. Examples of these are the
solution of partial differential equations by finite difference methods, statistical
regression analysis, and the solution of eigenvalue problems (see, for example,
Gerald and Wheatley (1994)). A brief introduction to this last topic may be found
in Step 17.
It is necessary, therefore, to have available rapid and accurate methods for
solving systems of linear equations. The student will already be familiar with
solving systems of equations with two or three variables by elimination methods.
In this Step we shall give a formal description of the Gaussian elimination method
for n-variable systems and discuss certain errors which might arise in solutions.
We discuss partial pivoting, a technique to enhance the accuracy of this method,
in the next Step.

1 Notation and definitions


We first consider an example in three variables:

x+ y−z =2
x + 2y + z = 6
2x − y + z = 1

This is a set of three linear equations in the three variables (or unknowns) x, y,
and z. By solution of the system we mean the determination of a set of values for
x, y, and z which satisfies each one of the equations. In other words, if values
(X, Y, Z ) satisfy all equations simultaneously, then (X, Y, Z ) constitute a solution
of the system.
Let us now consider the general system of n equations in n variables, which
may be written as follows:

   a11 x1 + a12 x2 + · · · + a1n xn = b1
   a21 x1 + a22 x2 + · · · + a2n xn = b2
     ..       ..                ..    ..        (n equations)
   an1 x1 + an2 x2 + · · · + ann xn = bn


The dots indicate, of course, similar terms in the variables x3 , x4 etc., and the
remaining (n − 3) equations which complete the system.
In this notation, the variables are denoted by x1 , x2 , . . . , xn ; sometimes we write
xi , i = 1, 2, . . . , n, to represent the variables. The coefficients of the variables
may be detached and written in a coefficient matrix thus:
 
   A = [ a11  a12  · · ·  a1n ]
       [ a21  a22  · · ·  a2n ]
       [  ..   ..           .. ]
       [ an1  an2  · · ·  ann ]
The notation ai j will be used to denote the coefficient of x j in the i-th equation.
Note that it occurs in the i-th row and j-th column of the matrix.
The numbers on the right-hand side of the equations are called constants, and
may be written in a column vector, thus:
 
   b = [ b1 ]
       [ b2 ]
       [ .. ]
       [ bn ]
The coefficient matrix may be combined with the constant vector to form the
augmented matrix, thus:
 
   [ a11  a12  · · ·  a1n   b1 ]
   [ a21  a22  · · ·  a2n   b2 ]
   [  ..   ..           ..   .. ]
   [ an1  an2  · · ·  ann   bn ]
It is usual to work directly with the augmented matrix when using elimination
methods of solution.

2 The existence of solutions


For any particular system of n linear equations there may be a single solution
(X 1 , X 2 , . . . X n ), or no solution, or an infinity of solutions. In the theory of
linear algebra, theorems are given and conditions stated that enable us to decide
the category into which a given system falls. We shall not treat the question of
existence of solutions in this book, but for the benefit of students familiar with
matrices and determinants we state the following theorem.
Theorem: A linear system of n equations in n variables, with coefficient matrix
A and constants vector b ≠ 0, has a unique solution if and only if the determinant
of A is not zero.
If b = 0, the system has the trivial solution x = 0. It has no other solution
unless the determinant of A is zero, when it has an infinite number of solutions.

Provided that the determinant of A is nonzero, there exists an n × n matrix


called the inverse of A (denoted by A−1 ) which is such that the matrix product
of A−1 and A is equal to the n × n identity (or unit) matrix I. The elements
of the identity matrix are 1 on the main diagonal and 0 elsewhere. Its algebraic
properties include Ix = x for any n × 1 vector x, and IM = MI = M for any
n × n matrix M. For example, the 3 × 3 identity matrix is given by
 
   I = [ 1  0  0 ]
       [ 0  1  0 ]
       [ 0  0  1 ]

By multiplying the equation Ax = b from the left by the inverse matrix A−1 we
obtain A−1 Ax = A−1 b, so the unique solution is x = A−1 b (since A−1 A = I and
Ix = x). Thus in principle a linear system with a unique solution may be solved
by first evaluating A−1 and then A−1 b. This approach is discussed in more detail
in the optional Step 14. The Gaussian elimination method is a more general and
efficient direct procedure for solving systems of linear equations.

3 Gaussian elimination method


In the method of Gaussian elimination, the given system of equations is trans-
formed into an equivalent system which is in upper triangular form; this new
form can be solved easily by a process called back-substitution. We shall demon-
strate the process by solving the example of Section 1.
(a) Transformation to upper triangular form.
      x +  y −  z = 2     R1   (Row 1)
      x + 2y +  z = 6     R2   (Row 2)
     2x −  y +  z = 1     R3   (Row 3)

First stage: eliminate x from equations R2 and R3 using equation R1.

      x +  y −  z = 2     R1′
           y + 2z = 4     R2′  (R2 − R1)
        − 3y + 3z = −3    R3′  (R3 − 2 × R1)

Second stage: eliminate y from R3′ using equation R2′.

      x +  y −  z = 2     R1″
           y + 2z = 4     R2″
               9z = 9     R3″  (R3′ − (−3) × R2′)
The system is now in upper triangular form. The coefficient matrix is
 
   [ 1  1  −1 ]
   [ 0  1   2 ]
   [ 0  0   9 ]

(b) Solution by back-substitution.


The system in upper triangular form is easily solved by obtaining z from
R3″, then y from R2″, and finally x from R1″. This procedure is called
back-substitution. Thus
      z = 1               dividing R3″ by 9
      y = 4 − 2z          from R2″
        = 2               using z = 1
      x = 2 − y + z       from R1″
        = 1               using z = 1 and y = 2

4 The transformation operations


When transforming a system to upper triangular form we use one or more of the
following elementary operations at every step:
(a) multiplication of an equation by a constant;
(b) subtraction from one equation some multiple of another equation;
(c) interchange of two equations.
Mathematically speaking, it should be clear to the student that performing
elementary operations on a system of linear equations leads to equivalent systems
which have the same solutions. This statement requires proof, and may be found
as a theorem in books on linear algebra such as Anton (1993). It forms the basis
of all elimination methods for solving systems of linear equations.

5 General treatment of the elimination process


In this section, we describe the elimination process as applied to a general n × n
linear system written in general notation† . Before considering the general n × n
system, we demonstrate the process using a system of three equations. We begin
with the augmented matrix, and show the multipliers necessary (in the column
headed m) to perform the transforming operations.
   Multipliers                Augmented matrix
       m
                        [ a11  a12  a13   b1 ]    R1
                        [ a21  a22  a23   b2 ]    R2
                        [ a31  a32  a33   b3 ]    R3

First stage: eliminate the coefficients a21 and a31 using row R1.

                        [ a11  a12   a13    b1  ]    R1′
   m21 = a21/a11        [  0   a22′  a23′   b2′ ]    R2′  (R2 − m21 × R1)
   m31 = a31/a11        [  0   a32′  a33′   b3′ ]    R3′  (R3 − m31 × R1)

† This process is suitable for implementation on a computer. Pseudo-code for study and use in
programming may be found on page 167.
Second stage: eliminate the coefficient a32′ using row R2′.

                        [ a11  a12   a13    b1  ]    R1″
                        [  0   a22′  a23′   b2′ ]    R2″
   m32 = a32′/a22′      [  0    0    a33″   b3″ ]    R3″  (R3′ − m32 × R2′)

The matrix is now in the form necessary for back-substitution. The full system
of equations at this point, equivalent to the original system, is

   a11 x1 + a12 x2  + a13 x3  = b1
            a22′ x2 + a23′ x3 = b2′
                      a33″ x3 = b3″

The solution from back-substitution is thus:

   x3 = b3″/a33″
   x2 = (b2′ − a23′ x3)/a22′
   x1 = (b1 − a12 x2 − a13 x3)/a11

We now display the process for the general n × n system, omitting the primes
(′) for convenience. Recall that the original augmented matrix is

   [ a11  a12  · · ·  a1n   b1 ]
   [ a21  a22  · · ·  a2n   b2 ]
   [  ..   ..           ..   .. ]
   [ an1  an2  · · ·  ann   bn ]

First stage: eliminate the coefficients a21, a31, . . . , an1 by calculating the
multipliers
   mi1 = ai1/a11,   i = 2, 3, . . . , n
and then calculating
   aij = aij − mi1 a1j,   bi = bi − mi1 b1,   i, j = 2, 3, . . . , n
This leads to the modified augmented system

   [ a11  a12  · · ·  a1n   b1 ]
   [  0   a22  · · ·  a2n   b2 ]
   [  ..   ..           ..   .. ]
   [  0   an2  · · ·  ann   bn ]

Second stage: eliminate the coefficients a32, a42, . . . , an2 by calculating the
multipliers
   mi2 = ai2/a22,   i = 3, 4, . . . , n
and then calculating
   aij = aij − mi2 a2j,   bi = bi − mi2 b2,   i, j = 3, 4, . . . , n
This gives

   [ a11  a12  a13  · · ·  a1n   b1 ]
   [  0   a22  a23  · · ·  a2n   b2 ]
   [  0    0   a33  · · ·  a3n   b3 ]
   [  ..   ..   ..           ..   .. ]
   [  0    0   an3  · · ·  ann   bn ]

We continue to eliminate unknowns, going on to columns 3, 4, . . . so that by
the beginning of the k-th stage we have the augmented matrix

   [ a11  a12  · · ·  · · ·  · · ·  a1n   b1 ]
   [  0   a22  · · ·  · · ·  · · ·  a2n   b2 ]
   [  ..   ..   ..     ..     ..     ..    .. ]
   [  0    0   · · ·  akk    · · ·  akn   bk ]
   [  ..   ..   ..     ..     ..     ..    .. ]
   [  0    0   · · ·  ank    · · ·  ann   bn ]
k-th stage: eliminate ak+1,k, ak+2,k, . . . , an,k by calculating the multipliers

   mik = aik/akk,   i = k + 1, k + 2, . . . , n

and then calculating

   aij = aij − mik akj,   bi = bi − mik bk,   i, j = k + 1, k + 2, . . . , n

Thus at the end of the k-th stage we have the augmented system

   [ a11  a12  · · ·  · · ·    · · ·       · · ·  a1n      b1   ]
   [  0   a22  · · ·  · · ·    · · ·       · · ·  a2n      b2   ]
   [  ..   ..   ..     ..       ..          ..     ..       ..  ]
   [  0    0   · · ·  akk      ak,k+1      · · ·  akn      bk   ]
   [  0    0   · · ·   0       ak+1,k+1    · · ·  ak+1,n   bk+1 ]
   [  ..   ..   ..     ..       ..          ..     ..       ..  ]
   [  0    0   · · ·   0       an,k+1      · · ·  ann      bn   ]

Continuing in this way, we obtain after n − 1 stages the augmented matrix

   [ a11  a12  · · ·  a1,n−1     a1n      b1   ]
   [  0   a22  · · ·  a2,n−1     a2n      b2   ]
   [  ..   ..   ..     ..         ..       ..  ]
   [  0    0   · · ·  an−1,n−1   an−1,n   bn−1 ]
   [  0    0   · · ·   0         ann      bn   ]

Note that the original coefficient matrix has been transformed into upper triangular
form.
We now back-substitute. Clearly we have xn = bn/ann, and subsequently

   xi = (1/aii) [ bi − Σ(j=i+1 to n) aij xj ],   i = n − 1, n − 2, . . . , 2, 1

Notes
(a) The diagonal elements akk used in the k-th stage of the successive elimination
are called pivot elements.
(b) To proceed from one stage to the next, it is necessary for the pivot element
to be nonzero (notice that the pivot elements are used as divisors in the
multipliers and in the final solution). If at any stage a pivot element vanishes,
we rearrange the remaining rows of the matrix so as to obtain a nonzero pivot;
if this is not possible, then the system of linear equations does not have a unique solution.
(c) If a pivot element is small compared with the elements in its column which
have to be eliminated, the corresponding multipliers used at that stage will be
greater than one in magnitude. The use of large multipliers in the elimination
and back-substitution processes leads to magnification of round-off errors,
and this can be avoided by using partial pivoting as described in the next
Step.
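
For readers who wish to experiment, the elimination and back-substitution process of this Step can be sketched in Python as follows. This illustration does no pivoting (so every pivot akk is assumed nonzero) and is not the book's pseudo-code on page 167:

def gauss_eliminate(A, b):
    # Solve Ax = b by Gaussian elimination without pivoting, then back-substitution.
    # A is a list of n rows of n numbers, b a list of n numbers; both are modified in place.
    n = len(A)
    for k in range(n - 1):                     # elimination stages
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]              # multiplier m_ik (pivot assumed nonzero)
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n                              # back-substitution
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

A = [[1.0, 1.0, -1.0], [1.0, 2.0, 1.0], [2.0, -1.0, 1.0]]
b = [2.0, 6.0, 1.0]
print(gauss_eliminate(A, b))   # [1.0, 2.0, 1.0], the solution found in Section 3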

6 Numerical example
Here we shall solve the system

0.34x1 − 0.58x2 + 0.94x3 = 2.0


0.27x1 + 0.42x2 + 0.13x3 = 1.5
0.20x1 − 0.51x2 + 0.54x3 = 0.8

The working required to obtain the solution is set out in tabular form on the
next page. For illustrative purposes, the calculations were done using three-digit
decimal floating point arithmetic. For example, in the first stage the multiplier
0.794 comes from

   2.70 × 10⁻¹ / 3.40 × 10⁻¹ = 0.79411 . . . × 10⁰ → 7.94 × 10⁻¹

while the value −0.0900 is obtained from the sequence of operations

   1.50 × 10⁰ − (7.94 × 10⁻¹ × 2.00 × 10⁰) = 1.50 × 10⁰ − (15.88 × 10⁻¹)
                                            → 1.50 × 10⁰ − (1.59 × 10⁰)
                                            = −0.09 × 10⁰ → −9.00 × 10⁻²

Working with so few significant digits leads to errors in the solution, as is shown
below by an examination of the residuals.

                      m        Augmented matrix
                               0.34   −0.58     0.94     2.0
                               0.27    0.42     0.13     1.5
                               0.20   −0.51     0.54     0.8

   First stage                 0.34   −0.58     0.94     2.0
                      0.794    0       0.881   −0.616   −0.0900
                      0.588    0      −0.169   −0.0130  −0.380

   Second stage                0.34   −0.58     0.94     2.0
                               0       0.881   −0.616   −0.0900
                     −0.192    0       0       −0.131   −0.397
We now do back-substitution:

−0.131x3 = −0.397 ⇒ x3 ≈ 3.03


0.881x2 − 0.616 × 3.03 = −0.0900 ⇒ x2 ≈ 2.02
0.34x1 − 0.58 × 2.02 + 0.94 × 3.03 = 2.0 ⇒ x1 ≈ 0.941

As a check, we can sum the original three equations to obtain 0.81x1 −0.67x2 +
1.61x3 = 4.3. Inserting the solution yields 0.81 × 0.941 − 0.67 × 2.02 + 1.61 ×
3.03 = 4.28711.
In order to judge the accuracy of the solution, we may insert the solution into the
left-hand side of each of the original equations, and compare the results with the
right-hand side constants. The differences between the results and the constants
are called residuals. For the example we have:

0.34 × 0.941 − 0.58 × 2.02 + 0.94 × 3.03 = 1.99654


0.27 × 0.941 + 0.42 × 2.02 + 0.13 × 3.03 = 1.49637
0.20 × 0.941 − 0.51 × 2.02 + 0.54 × 3.03 = 0.7942

so the residuals are

2.00 − 1.99654 = 0.00346


1.50 − 1.49637 = 0.00363
0.80 − 0.7942 = 0.0058

It would seem reasonable to believe that if the residuals are small the solution
is a good one. This is usually the case. Sometimes, however, small residuals are
not indicative of a good solution. This point is taken up under ‘ill-conditioning’,
in the next Step.

Checkpoint

1. When transforming the augmented matrix, what kinds of operation


are permissible?

2. What is the final form of the coefficient matrix, before back-substi-


tution begins?
3. What are pivot elements? Why must small pivot elements be avoided
if possible?

EXERCISES
Solve the following systems by Gaussian elimination.
1. x1 + x2 − x3 = 0
2x1 − x2 + x3 = 6
3x1 + 2x2 − 4x3 = −4
2. 5.6x + 3.8y + 1.2z = 1.4
3.1x + 7.1y − 4.7z = 5.1
1.4x − 3.4y + 8.3z = 2.4
3. 2x + 6y + 4z = 5
6x + 19y + 12z = 6
2x + 8y + 14z = 7
4. 1.3x + 4.6y + 3.1z = −1
5.6x + 5.8y + 7.9z = 2
4.2x + 3.2y + 4.5z = −3
STEP 12

SYSTEMS OF LINEAR EQUATIONS 2


Errors and ill-conditioning

For any system of linear equations, the question of how much error there may be
in a solution obtained by a numerical method is a very difficult one to answer.
A general discussion of the problems it raises is beyond the scope of this book.
However, some of the sources of error are indicated.

1 Errors in the coefficients and constants


In many practical cases the coefficients of the variables, and also the constants on
the right-hand sides of the equations, are obtained from observations of experi-
ments or from other numerical calculations. They will have error; and therefore
when the solution of the system is found, it too will contain errors. To show
how this kind of error is carried through in calculations, we shall solve a simple
example in two variables, assuming that the constants have error at most ±0.01.
Consider the system

2x + y = 4 (±0.01)
−x + y = 1 (±0.01)

Solving by Gaussian elimination and back-substitution yields

   2x + y = 4 (±0.01)
   (3/2)y = 1 (±0.01) + 2 (±0.005)

Therefore (3/2)y lies between 2.985 and 3.015, so y lies between 1.990 and 2.010.
From the first equation we now obtain

2x = 4 (±0.01) − 2 (±0.01)

so x lies between 0.99 and 1.01.


If the system were exact in its coefficients and constants, its exact solution would
be x = 1, y = 2. Since the constants are not known exactly, it is meaningless to
talk of an exact solution; the best that can be said is that 0.99 ≤ x ≤ 1.01 and
1.99 ≤ y ≤ 2.01.
In this example the error in the solution is of the same order as that in the
constants. Generally, however, the error in the solutions is greater than that in the
constants.

2 Round-off errors and numbers of operations


Any numerical method for solving systems of linear equations involves large num-
bers of arithmetic operations. For example, in the Gaussian elimination method of
the previous Step, we see from Atkinson (1993) that there are (n³ + 3n² − n)/3
multiplications/divisions and (2n³ + 3n² − 5n)/6 additions/subtractions required
to arrive at the solution of a system which has n unknowns.
Since round-off errors are propagated at each step of an algorithm, the growth
of round-off errors can be such as to lead to a solution very far from the true one
when n is large.

3 Partial pivoting
In the Gaussian elimination method, the buildup of round-off errors may be
reduced by arranging the equations so that the use of large multipliers in the
elimination operations is avoided. The procedure to be carried out is known as
partial pivoting (or pivotal condensation). The general rule to follow is: at each
elimination stage, arrange the rows of the augmented matrix so that the new pivot
element is larger in absolute value than (or equal to) any element beneath it in its
column.
Use of this rule ensures that the multipliers used at each stage have magnitude
less than or equal to one. To show the rule in operation we treat a simple example,
using three-digit decimal floating point arithmetic. We solve

2x + 5y + 8z = 36
4x + 7y − 12z = −16
x + 8y + z = 20

The tabular solution is as follows, the pivot elements being printed in boldface
numerals. (Note that all the multipliers have magnitude less than 1.)

   Stage                        m       Augmented matrix         Explanation

                                        4     7    −12   −16    The first and second
                                        2     5      8    36    equations have been
                                        1     8      1    20    interchanged; the pivot
                                                                 element 4 is now the
                                                                 largest in the x-column.

   1. Eliminate the           0.500     4     7    −12   −16    Rows 2 and 3 must be
      x-terms from the        0.250     0   1.50   14.0  44.0   interchanged, so that
      second and third                  0   6.25   4.00  24.0   the next pivot element
      equations.                                                 is 6.25 rather than 1.50.

   2. Eliminate the           0.240     4     7    −12   −16
      y-term in the third               0   6.25   4.00  24.0
      equation.                         0     0    13.0  38.2

   Solve by back-substitution: z = 2.94, y = 1.95, x = 1.40
If no pivoting is done, it may be verified that using three-digit floating point
arithmetic yields the solution z = 2.93, y = 2.00, and x = 1.30. Since the true
solution to 3S is given by z = 2.93, y = 1.96, and x = 1.36, the solution obtained
using partial pivoting is better than the one obtained without any pivoting.
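
A minimal Python sketch of elimination with partial pivoting (illustrative only; a unique solution is assumed):

def gauss_eliminate_pp(A, b):
    # Gaussian elimination with partial pivoting, followed by back-substitution.
    n = len(A)
    for k in range(n - 1):
        # Bring the largest |a_ik| (for i >= k) into the pivot position.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        if p != k:
            A[k], A[p] = A[p], A[k]
            b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]          # |m| <= 1 thanks to the pivoting rule
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

A = [[2.0, 5.0, 8.0], [4.0, 7.0, -12.0], [1.0, 8.0, 1.0]]
b = [36.0, -16.0, 20.0]
print(gauss_eliminate_pp(A, b))   # approximately [1.36, 1.96, 2.93]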

4 Ill-conditioning
Certain systems of linear equations are such that their solutions are very sensitive to
small changes (and therefore to errors) in their coefficients and constants. We give
an example below in which 1% changes in two coefficients change the solution by
a factor of 10 or more. Such systems are said to be ill-conditioned. If a system is
ill-conditioned, a solution obtained by a numerical method may be very different
from the exact solution, even though great care is taken to keep round-off and
other errors very small.
As an example, consider the following system of equations:

2x + y = 4
2x + 1.01y = 4.02

This has the exact solution x = 1, y = 2. Making 1% changes in the coefficients


of the second equation and a 5% change in the constant of the first, gives the
system
2x + y = 3.8
2.02x + y = 4.02
It is easily verified that the exact solution to this system is x = 11, y = −18.2.
This is very different from the solution to the first system. Both these systems are
said to be ill-conditioned.
If a system is ill-conditioned, then the usual procedure of checking a numerical
solution by calculating the residuals may not be valid. To see why, suppose we
have an approximation x̃ to the true solution x. The vector of residuals r is then
given by r = b − Ax̃ = A(x − x̃). Thus e = x − x̃ satisfies the linear system
Ae = r. In general r will be a vector having small components. However, in an
ill-conditioned system, even if the components of r are small so that r is ‘close’
to 0, the solution of the linear system Ae = r could be very different from the
solution of Ae = 0, namely 0. It then follows that x̃ may be a poor approximation
to x despite the residuals in r being small.
Obtaining accurate solutions to ill-conditioned linear systems can be difficult,
and many tests have been proposed for determining whether or not a system is
ill-conditioned. A simple introduction to this topic is given in the optional Step
16.

Checkpoint

1. Describe the types of error that may affect the solution of a system
of linear equations.
2. How can partial pivoting contribute to a reduction of errors?
3. Is it true to say that an ill-conditioned system has not got an exact
solution?

EXERCISES

1. Find the range of solutions for the following system, assuming maximum
errors in the constants as shown:

x − y = 1.4 (±0.01)
x + y = 3.8 (±0.05)

2. Solve the following systems by Gaussian elimination:


(a) x − 10y = −21.8
10x + y = 14.3
(b) x + 5y − z = 4
2x − y + 3z = 7
3x − y + 5z = 12
(c) 2.1x1 + 2.4x2 + 8.1x3 = 62.76
7.2x1 + 8.5x2 − 6.3x3 = −1.93
3.4x1 − 6.4x2 + 5.4x3 = 16.24
3. Use four-digit decimal normalized floating point arithmetic to solve the fol-
lowing system with and without using partial pivoting. Compare your an-
swers with the exact answer, which is x = 1.000 × 10⁰, y = 5.000 × 10⁻¹.
   (2.310 × 10⁻³)x + (4.104 × 10⁻²)y = 2.283 × 10⁻²
   (4.200 × 10⁻¹)x + (5.368 × 10⁰)y = 3.104 × 10⁰
4. Show that for a linear system of three unknowns, the Gaussian elimina-
tion procedure requires three divisions, eight multiplications, and eight sub-
tractions to complete the triangularization; and a further three divisions,
three multiplications, and three additions/subtractions to carry out the back-
substitution.
5. Derive the general formulae given in Section 2 for the numbers of required
arithmetic operations.
6. Study the ill-conditioning example given in Section 4 in the following ways.
(a) Plot the lines of the first system on graph paper; now describe
ill-conditioning in geometrical terms when only two unknowns are
involved.

(b) Insert the solution of the first system into the left-hand side of the second
system. Does x = 1, y = 2 ‘look like’ a good solution to the second
system? Comment.
(c) Insert the solution of the second system into the left-hand side of the
first system. Comment.
7. The system
10x1 + 7x2 + 8x3 + 7x4 = 32
7x1 + 5x2 + 6x3 + 5x4 = 23
8x1 + 6x2 + 10x3 + 9x4 = 33
7x1 + 5x2 + 9x3 + 10x4 = 31
is an example of ill-conditioning due to T. S. Wilson. Insert the ‘solution’
(6.0, −7.2, 2.9, −0.1) into the left-hand side. Would you claim this solution
to be a good one? Now insert the solution (1.0, 1.0, 1.0, 1.0). Comment on
the dangers of making claims!
STEP 13

SYSTEMS OF LINEAR EQUATIONS 3


The Gauss-Seidel iterative method

The methods used in the previous Steps for solving systems of linear equations
are termed direct methods. When a direct method is used, and if round-off and
other errors do not arise, an exact solution is reached after a finite number of
arithmetic operations. In general, of course, round-off errors do arise; and when
large systems are being solved by direct methods, the growing errors can become
so large as to render the results obtained quite unacceptable.

1 Iterative methods
Iterative methods provide an alternative approach. Recall that an iterative method
starts with an approximate solution, and uses it in a recurrence formula to provide
another approximate solution; by repeatedly applying the formula, a sequence
of solutions is obtained which (under suitable conditions) converges to the exact
solution. Iterative methods have the advantages of simplicity of operation and
ease of implementation on computers, and they are relatively insensitive to propa-
gation of errors; they would be used in preference to direct methods for solving
linear systems involving several hundred variables, particularly if many of the
coefficients were zero. Systems of over 100 000 variables have been successfully
solved on computers by iterative methods, whereas systems of 10 000 or more
variables are difficult or impossible to solve by direct methods.

2 The Gauss-Seidel method


Only one iterative method for linear equations, due to Gauss and improved by
Seidel, will be presented in this text. We shall solve the system
10x1 + 2x2 + x3 = 13
2x1 + 10x2 + x3 = 13
2x1 + x2 + 10x3 = 13
by using the Gauss-Seidel iterative method † .
The first step is to solve the first equation for x1 , the second for x2 , and the third
for x3 . This transforms the system to the following:
x1 = 1.3 − 0.2x2 − 0.1x3 (1)
x2 = 1.3 − 0.2x1 − 0.1x3 (2)
x3 = 1.3 − 0.2x1 − 0.1x2 (3)
† This method is suitable for implementation on a computer. Pseudo-code for study and use in
programming may be found on page 168.

An initial solution is now assumed; we shall use x1 = 0, x2 = 0, and x3 = 0.


Inserting this into the right-hand side of (1) gives x1 = 1.3. This value for x1 is
used immediately together with the remainder of the initial solution (that is, x2 = 0
and x3 = 0) in the right-hand side of (2), giving x2 = 1.3 − 0.2 × 1.3 − 0 = 1.04.
Finally, x1 = 1.3 and x2 = 1.04 are inserted in (3) to produce x3 = 0.936. This
completes the first iteration; we have obtained a second approximate solution
(1.3, 1.04, 0.936).
Beginning with the second solution, we can repeat the process to obtain a
third. Clearly we can continue in this way, and obtain a sequence of approximate
solutions. Under certain conditions on the coefficients of the system, the sequence
will converge to the exact solution.
We can set up recurrence relations which show clearly how the iterative process
proceeds. Using (x1^(k), x2^(k), x3^(k)) and (x1^(k+1), x2^(k+1), x3^(k+1)) to denote the k-th
and (k + 1)-th solutions respectively, we have

   x1^(k+1) = 1.3 − 0.2 x2^(k)   − 0.1 x3^(k)        (1)′
   x2^(k+1) = 1.3 − 0.2 x1^(k+1) − 0.1 x3^(k)        (2)′
   x3^(k+1) = 1.3 − 0.2 x1^(k+1) − 0.1 x2^(k+1)      (3)′

We begin with the starting vector x^(0) whose components x1^(0), x2^(0), and x3^(0) are
all 0, and then apply these relations repeatedly in the order (1)′, (2)′, and (3)′.
Note that when we insert values for x1 , x2 and x3 into the right-hand sides we
always use the most recent estimates found for each unknown.

3 Convergence
The sequence of solutions produced by the iterative process may be displayed in
a table, thus:
   Iteration       Approximate solution (Gauss-Seidel)
       k           x1^(k)         x2^(k)         x3^(k)
       0           0              0              0
       1           1.3            1.04           0.936
       2           0.9984         1.00672        0.999648
       3           0.998691       1.000297       1.000232
The student may check that the exact solution for this system is (1, 1, 1). It is
seen that the Gauss-Seidel solutions are rapidly approaching this; in other words,
the method is converging.
In practice, of course, the exact solution is not known. It is customary to end
the iterative procedure as soon as the differences between the x^(k+1) values and
the x^(k) values are suitably small. One stopping rule is to end the iteration when

   Sk = Σ(i=1 to n) | xi^(k+1) − xi^(k) |

becomes less than a prescribed small number (usually chosen according to the
accuracy of the machine on which the calculations are carried out).
The question of convergence with a given system of equations is crucial. As
in the above example, the Gauss-Seidel method may quickly lead to a solution
very close to the exact one; on the other hand, it may converge too slowly to be of
practical use, or it may produce a sequence which diverges from the exact solution.
The reader is referred to more advanced texts (such as Conte and de Boor (1980))
for treatments of this question.
To improve the chance (and rate) of convergence, before applying the iterative
method the system of equations should be arranged so that as far as possible each
leading-diagonal coefficient is the largest (in absolute value) in its row.
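
A minimal Python sketch of the iteration, using the stopping quantity Sk described above (the tolerance and iteration cap are illustrative; the book's pseudo-code is on page 168):

def gauss_seidel(A, b, tol=1e-6, max_iter=100):
    # Gauss-Seidel iteration for Ax = b, starting from the zero vector.
    # Assumes the diagonal entries A[i][i] are nonzero (ideally dominant in their rows).
    n = len(A)
    x = [0.0] * n
    for _ in range(max_iter):
        S = 0.0
        for i in range(n):
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            new_xi = (b[i] - s) / A[i][i]      # always uses the most recent estimates
            S += abs(new_xi - x[i])
            x[i] = new_xi
        if S < tol:                            # stopping rule on S_k
            return x
    raise RuntimeError("Gauss-Seidel did not converge")

A = [[10.0, 2.0, 1.0], [2.0, 10.0, 1.0], [2.0, 1.0, 10.0]]
b = [13.0, 13.0, 13.0]
print(gauss_seidel(A, b))   # close to [1.0, 1.0, 1.0]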

Checkpoint

1. What is an essential difference between a direct method and an


iterative method?
2. Give some advantages of the use of iterative methods rather than
direct methods.
3. How can the chance of success with the Gauss-Seidel method be
improved?

EXERCISES

1. For the example treated above, compute the value of S3 , the quantity used in
the suggested stopping rule after the third iteration.
2. Use the Gauss-Seidel method to solve the following systems to 5D accuracy
(remember to rearrange the equations if appropriate). Compute the value of
Sk (to 6D ) after each iteration.
(a) x − y + 10z = −7
20x + 3y − 2z = 51
2x + 8y + 4z = 25
(b) 10x − y =1
−x + 10y − z =1
−y + 10z − w =1
−z + 10w =1
STEP 14

SYSTEMS OF LINEAR EQUATIONS 4*


Matrix inversion*

The general system of n linear equations in n variables (see Step 11, Section 1)
can be written in matrix form Ax = b, and we seek a vector x which satisfies this
equation. Here we make use of the inverse matrix A−1 to find this vector.

1 The inverse matrix


In Step 11 we observed that if the determinant of A is nonzero, then A has an
inverse matrix A−1 . Moreover, we could then write the solution of the linear
system as x = A−1 b. Thus the solution to the system of linear equations can be
obtained by first finding the inverse of the coefficient matrix A, and then forming
the product A−1 b.
Nevertheless, this approach is not normally adopted in practice. The problem of
finding the inverse matrix is itself a numerical one, which generally requires for its
solution many more operations (and therefore involves more round-off and other
errors) than any of the methods described in previous Steps. However, it would be
sensible to compute the inverse first if it were required for some additional reason.
For example, the inverse may contain theoretical or statistical information or be
of use in some other formula or calculation.

2 Method for inverting a matrix


There are many numerical methods for finding the inverse of a matrix. We
shall describe one which uses the Gaussian elimination and back-substitution
procedures of Step 11. It is simple to apply and is computationally efficient. We
shall illustrate the method by applying it to a 2 × 2 matrix and a 3 × 3 matrix; it
should then be clear to the reader how the method may be extended for use with
n × n matrices.
As a 2 × 2 example, suppose

   A = [ 2  1 ]
       [ 4  5 ]

We seek the inverse matrix

   A−1 = [ u1  u2 ]
         [ v1  v2 ]

such that

   AA−1 = I = [ 1  0 ]
              [ 0  1 ]

This is equivalent to solving the two systems

   A [ u1 ] = [ 1 ]    and    A [ u2 ] = [ 0 ]
     [ v1 ]   [ 0 ]             [ v2 ]   [ 1 ]
The method proceeds as follows:
(i) Form the augmented matrix

      [A | I] = [ 2  1 | 1  0 ]
                [ 4  5 | 0  1 ]

(ii) Apply elementary row operations to the augmented matrix such that A is
     transformed to an upper triangular matrix Ã (see Step 11, Section 5):

        A       I             Ã        Ĩ
      [ 2  1 | 1  0 ]  →  [ 2  1 |  1  0 ]
      [ 4  5 | 0  1 ]     [ 0  3 | −2  1 ]   (row 2 − twice row 1)

(iii) Solve the two systems

      [ 2  1 ] [ u1 ] = [  1 ]    and    [ 2  1 ] [ u2 ] = [ 0 ]
      [ 0  3 ] [ v1 ]   [ −2 ]           [ 0  3 ] [ v2 ]   [ 1 ]

      using the back-substitution method. Note how the systems are constructed,
      using Ã and columns of Ĩ. From the first system, 3v1 = −2, v1 = −2/3, and
      2u1 + v1 = 1, so 2u1 = 1 + 2/3, u1 = 5/6. From the second system, 3v2 = 1,
      v2 = 1/3, and 2u2 + v2 = 0, so 2u2 = −1/3, u2 = −1/6. The required inverse
      matrix is

      A−1 = [ u1  u2 ] = [  5/6  −1/6 ]
            [ v1  v2 ]   [ −2/3   1/3 ]

(iv) Check: AA−1 should equal I. By multiplication we find

      [ 2  1 ] [  5/6  −1/6 ] = [ 1  0 ]
      [ 4  5 ] [ −2/3   1/3 ]   [ 0  1 ]

      so A−1 is correct.
In this simple example it has been possible to work with fractions, so no round-
off errors occur and the resulting inverse matrix is exact. More generally, when
doing calculations by hand the final result should be checked by computing AA−1 ,
which should be approximately equal to the identity matrix I.
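
The same idea can also be organized column by column: each column of A−1 is the solution of Ax = ej, where ej is the j-th column of I. The following minimal Python sketch takes that route (it solves for the columns one at a time rather than carrying all of Ĩ through a single elimination, and it assumes A is nonsingular; no pivoting is done, as in the worked examples):

import copy

def solve(A, b):
    # Gaussian elimination (no pivoting) plus back-substitution, working on copies.
    n = len(A)
    A, b = copy.deepcopy(A), list(b)
    for k in range(n - 1):
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

def invert(A):
    # The columns of the inverse are the solutions of A x = e_j for the unit vectors e_j.
    n = len(A)
    cols = [solve(A, [1.0 if i == j else 0.0 for i in range(n)]) for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]

print(invert([[2.0, 1.0], [4.0, 5.0]]))   # [[0.833..., -0.166...], [-0.666..., 0.333...]]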
As a 3 × 3 example, we shall find the inverse matrix A−1 of
 
   A = [ 0.20  0.24  0.12 ]
       [ 0.10  0.24  0.24 ]
       [ 0.05  0.30  0.49 ]
To show the effects of errors we shall work to 3S in the calculation of A−1 . The
results of the calculations are displayed below in tabular form.

Multipliers A transforms to à I transforms to Ĩ


0.20 0.24 0.12 1 0 0
0.10 0.24 0.24 0 1 0
0.05 0.30 0.49 0 0 1
0.20 0.24 0.12 1 0 0
0.5 0 0.12 0.18 −0.5 1 0
0.25 0 0.24 0.46 −0.25 0 1
0.20 0.24 0.12 1 0 0
0 0.12 0.18 −0.5 1 0
2 0 0 0.10 0.75 −2 1
The inverse matrix A−1 is placed 3 19.0 6 −34.0 9 12.0
here (→); its elements are calcu- 2 −15.4 5 38.3 8 −15.0
lated to 3S by back-substitution 1 7.5 4 −20.0 7 10.0
in the order shown (numbered 1,
2, . . . , 9).

As an example of the back-substitution, Ã taken with the second column of Ĩ
yields the second column of A−1. Thus:

   [ 0.20  0.24  0.12 ] [ u2 ]   [  0 ]
   [ 0     0.12  0.18 ] [ v2 ] = [  1 ]
   [ 0     0     0.10 ] [ w2 ]   [ −2 ]

yields w2 = −20.0, v2 = 38.3, and u2 = −34.0, found in that order. One might
check by multiplication that AA−1 is

   [ 1.004  −0.008  0 ]
   [ 0.004   0.992  0 ]
   [ 0.005  −0.01   1 ]

which is approximately equal to I. The noticeable inaccuracy is due to carrying


out the calculation of the elements of A−1 to 3S only.

3 Solution of linear systems using the inverse matrix


As previously noted, the unique solution of a linear system Ax = b is x = A−1 b,
when the coefficient matrix A has an inverse A−1 . We shall illustrate this, by
using the inverse A−1 obtained in Section 2 to compute the solution to the linear
system:
0.20x + 0.24y + 0.12z = 1
0.10x + 0.24y + 0.24z = 2
0.05x + 0.30y + 0.49z = 3

The coefficient matrix is

   A = [ 0.20  0.24  0.12 ]
       [ 0.10  0.24  0.24 ]
       [ 0.05  0.30  0.49 ]

We can use the A−1 calculated in the previous section in the following manner:

   x = [ x ] = A−1 b = [  19     −34    12 ] [ 1 ]   [ −13   ]
       [ y ]           [ −15.4    38.3 −15 ] [ 2 ] = [  16.2 ]
       [ z ]           [  7.5    −20    10 ] [ 3 ]   [ −2.5  ]

Thus we get the solution to 2S given by x = −13, y = 16, and z = −2.5.


We may check the solution by adding the three equations. This yields

0.35x + 0.78y + 0.85z = 6

Inserting the solution in the left-hand side gives

0.35 × (−13) + 0.78 × 16 + 0.85 × (−2.5) = 5.8 to 2S

Checkpoint

1. In the method for finding the inverse of A, what is the final form of
A after the elementary row operations have been carried out?
2. Is the solution of the system Mx = d, x = dM−1 or x = M−1 d (or
neither)?
3. Give a condition for a matrix not to have an inverse.

EXERCISES

1. Find the inverses of the following matrices, using the elimination and back-
substitution method.
 
   (a) [ 2    6    4 ]
       [ 6   19   12 ]
       [ 2    8   14 ]

   (b) [ 1.3  4.6  3.1 ]
       [ 5.6  5.8  7.9 ]
       [ 4.2  3.2  4.5 ]

   (c) [ 0.37  0.65  0.81 ]
       [ 0.41  0.71  0.34 ]
       [ 0.11  0.82  0.52 ]


2. Solve the following systems of equations (each with two right-hand side
vectors).
 
   (a) 2x + 6y + 4z   = 5     (= 1)
       6x + 19y + 12z = 6     (= 2)
       2x + 8y + 14z  = 7     (= 3)

   (b) 1.3x + 4.6y + 3.1z = −1     (= 0)
       5.6x + 5.8y + 7.9z = 2      (= 1)
       4.2x + 3.2y + 4.5z = −3     (= 1)

   (c) 0.37x1 + 0.65x2 + 0.81x3 = 1.1      (= 0.5)
       0.41x1 + 0.71x2 + 0.34x3 = 2.2      (= 2.1)
       0.11x1 + 0.82x2 + 0.52x3 = −0.1     (= 1.2)


STEP 15

SYSTEMS OF LINEAR EQUATIONS 5*


Use of LU decomposition*

We have shown in Step 11 how to solve a linear system Ax = b using Gaussian


elimination, applied to the augmented matrix [A|b]. In the previous Step, we
extended the elimination process to calculate the inverse A−1 of the coefficient
matrix A, assuming it exists.
Another general approach to solving Ax = b is known as the method of LU
decomposition, which provides new insights into matrix algebra and has many
theoretical and practical uses. Efficient computer algorithms for handling practical
problems can be developed from it.
The symbols L and U denote a lower triangular matrix and an upper triangular
matrix, respectively. Examples of lower triangular matrices are
   
   L1 = [ 1    0   0 ]     and     L2 = [ 2   0  0 ]
        [ 0    1   0 ]                  [ 1  −1  0 ]
        [ 2  −0.5  1 ]                  [ 2   3  1 ]

Note that all the elements above the leading diagonal in a lower triangular matrix
are zero. Examples of upper triangular matrices are

   U1 = [ −1  2  1 ]     and     U2 = [ −1  2   0 ]
        [  0  8  6 ]                  [  0  1   2 ]
        [  0  0  6 ]                  [  0  0  −1 ]

where all the elements below the leading diagonal are zero. The product of L1
and U1 is given by

   A = L1 U1 = [ −1  2  1 ]
               [  0  8  6 ]
               [ −2  0  5 ]

1 Procedure
Suppose we have to solve a linear system Ax = b, and that we can express the
coefficient matrix A in the form A = LU. This form is called an LU decomposition
of A.
Then we may solve the linear system by the following procedure:
Stage 1: Write Ax = LUx = b.

Stage 2: Set y = Ux, so that Ax = Ly = b. Use forward substitution on Ly = b


to find y1, y2, . . . , yn in that order. In more detail, suppose the augmented matrix
for the system Ly = b is

   [ ℓ11       0         · · ·  0           0     b1   ]
   [ ℓ21       ℓ22       · · ·  0           0     b2   ]
   [  ..        ..        ..     ..          ..    ..   ]
   [ ℓn−1,1    ℓn−1,2    · · ·  ℓn−1,n−1    0     bn−1 ]
   [ ℓn1       ℓn2       · · ·  ℓn,n−1      ℓnn   bn   ]

Then the forward substitution procedure yields y1 = b1/ℓ11, and subsequently

   yi = (1/ℓii) [ bi − Σ(j=1 to i−1) ℓij yj ],   i = 2, 3, . . .

Note that the value of yi depends on the values y1 , y2 , . . . , yi−1 already calculated.
Stage 3: Finally, use back-substitution on Ux = y to find xn , . . . , x1 in that order.
Later on we shall outline a general method for finding LU decompositions
of square matrices. There follows an example showing this method in action
involving the matrix A = L1 U1 given above. If we wish to solve Ax = b
with a number of different b’s, then this method is more efficient than applying
the Gaussian elimination technique to each separate linear system. Once we
have found an LU decomposition of A, we need only do forward and backward
substitutions to solve the system for any b.
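
As an illustration of Stages 2 and 3, here is a minimal Python sketch of the forward and back-substitutions, applied to the matrices L1 and U1 shown above with an illustrative right-hand side:

def lu_solve(L, U, b):
    # Solve LUx = b: forward substitution for Ly = b, then back-substitution for Ux = y.
    n = len(b)
    y = [0.0] * n
    for i in range(n):                         # Stage 2: forward substitution
        s = sum(L[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - s) / L[i][i]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):             # Stage 3: back-substitution
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x

L1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [2.0, -0.5, 1.0]]
U1 = [[-1.0, 2.0, 1.0], [0.0, 8.0, 6.0], [0.0, 0.0, 6.0]]
print(lu_solve(L1, U1, [0.0, 10.0, -11.0]))    # [3.0, 2.0, -1.0]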

2 Example
We solve the system
−x1 + 2x2 + x3 = 0
8x2 + 6x3 = 10
−2x1 + 5x3 = −11
Stage 1: An LU decomposition of the system is

         [ 1    0    0 ] [ −1  2  1 ] [ x1 ]   [   0 ]
   Ax =  [ 0    1    0 ] [  0  8  6 ] [ x2 ] = [  10 ]
         [ 2  −0.5   1 ] [  0  0  6 ] [ x3 ]   [ −11 ]
               L1             U1        x         b

Stage 2: Set y = U1 x and then solve the system L1 y = b, that is,

   [ 1    0    0 ] [ y1 ]   [   0 ]
   [ 0    1    0 ] [ y2 ] = [  10 ]
   [ 2  −0.5   1 ] [ y3 ]   [ −11 ]

Using forward substitution we obtain:

   y1 = 0
   y2 = 10
   2y1 − 0.5 × y2 + y3 = −11   ⇒   y3 = −6

Stage 3: Solve

   [ −1  2  1 ] [ x1 ]   [  0 ]
   [  0  8  6 ] [ x2 ] = [ 10 ]
   [  0  0  6 ] [ x3 ]   [ −6 ]
        U1        x         y

Back-substitution yields:

   6x3 = −6             ⇒  x3 = −1
   8x2 + 6x3 = 10       ⇒  x2 = 2
   −x1 + 2x2 + x3 = 0   ⇒  x1 = 3

Thus the solution of Ax = b is x = [3, 2, −1]^T, which may be checked in the
original equations. We turn now to the problem of finding an LU decomposition
of a given square matrix A.

3 Effecting an LU decomposition
For an LU decomposition of a given matrix A of order n × n, we seek a lower
triangular matrix L and an upper triangular matrix U (both of order n × n) such
that A = LU. The matrix U may be taken to be the upper triangular matrix
resulting from the process of Gaussian elimination without partial pivoting (see
Sections 3 and 5 of Step 11), and the matrix L may be taken to be the lower
triangular matrix which has diagonal elements 1 and which, for k < i, has as its
(i, k)-th element the multiplier m_ik. This multiplier is calculated at the k-th stage
of Gaussian elimination and is required to transform the current value of a_ik to
0. In the notation of Step 11, these multipliers were given by m_ik = a_ik / a_kk,
i = k + 1, k + 2, . . . , n.
An example will help clarify this. From Step 11, we recall that the Gaussian
elimination procedure applied to the system

x+ y−z =2
x + 2y + z = 6
2x − y + z = 1

yielded the upper triangular matrix


 
        [ 1  1  −1 ]
   U =  [ 0  1   2 ]
        [ 0  0   9 ]

Also, we saw that in the first stage we calculated the multipliers m_21 = a_21/a_11 =
1/1 = 1 and m_31 = a_31/a_11 = 2/1 = 2, while in the second stage we calculated
the multiplier m_32 = a_32/a_22 = −3/1 = −3. Thus

        [ 1      0     0 ]   [ 1   0   0 ]
   L =  [ m_21   1     0 ] = [ 1   1   0 ]
        [ m_31   m_32  1 ]   [ 2  −3   1 ]

It may be readily verified that

         [ 1   1  −1 ]
   LU =  [ 1   2   1 ]
         [ 2  −1   1 ]
the coefficient matrix of the original system.
Another technique that may be used to find an LU decomposition of an n × n
matrix is by a direct decomposition. To illustrate, suppose we wish to find an LU
decomposition for the 3 × 3 coefficient matrix of the system given above. Then
the required L and U are of the form
        [ ℓ11  0    0   ]        [ u11   u12   u13 ]
   L =  [ ℓ21  ℓ22  0   ] ,  U = [ 0     u22   u23 ]
        [ ℓ31  ℓ32  ℓ33 ]        [ 0     0     u33 ]
Note that the total number of unknowns in L and U is 12, whereas there are only
9 elements in the 3 × 3 coefficient matrix A. To ensure that L and U are unique,
we need to impose 12 − 9 = 3 extra conditions on the elements of these two
triangular matrices. (In the general n × n case, n extra conditions are required.)
One common choice is to require all the diagonal elements of L to have the value
1; the resulting method is known as Doolittle’s method. Another choice is to
require all the diagonal elements of U to be 1; this is called Crout’s method. Since
Doolittle’s method will result in the same LU decomposition for A as given above,
we shall use Crout’s method to illustrate this direct decomposition procedure.
We then require
   [ ℓ11  0    0   ] [ 1  u12  u13 ]   [ 1   1  −1 ]
   [ ℓ21  ℓ22  0   ] [ 0  1    u23 ] = [ 1   2   1 ]
   [ ℓ31  ℓ32  ℓ33 ] [ 0  0    1   ]   [ 2  −1   1 ]
By multiplying out L and U, we obtain:

   ℓ11 × 1 = 1                     ⇒  ℓ11 = 1
   ℓ11 u12 = 1                     ⇒  u12 = 1
   ℓ11 u13 = −1                    ⇒  u13 = −1
   ℓ21 × 1 = 1                     ⇒  ℓ21 = 1
   ℓ21 u12 + ℓ22 = 2               ⇒  ℓ22 = 1
   ℓ21 u13 + ℓ22 u23 = 1           ⇒  u23 = 2
   ℓ31 × 1 = 2                     ⇒  ℓ31 = 2
   ℓ31 u12 + ℓ32 = −1              ⇒  ℓ32 = −3
   ℓ31 u13 + ℓ32 u23 + ℓ33 = 1     ⇒  ℓ33 = 9

It is clear that this construction from Crout’s method yields triangular matrices L
and U for which A = LU.
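The elimination construction of this section is also easy to program. Below is a
minimal Python sketch (illustrative names, not from the book) of Doolittle's
choice: U is produced by Gaussian elimination without pivoting, and L records
the multipliers m_ik with 1's on its diagonal.

    # Sketch: Doolittle LU decomposition obtained by recording the Gaussian
    # elimination multipliers m_ik = a_ik / a_kk.

    def lu_doolittle(A):
        n = len(A)
        U = [row[:] for row in A]                      # working copy, becomes U
        L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
        for k in range(n - 1):
            for i in range(k + 1, n):
                m = U[i][k] / U[k][k]                  # multiplier m_ik
                L[i][k] = m
                for j in range(k, n):
                    U[i][j] -= m * U[k][j]
        return L, U

    A = [[1.0,  1.0, -1.0],
         [1.0,  2.0,  1.0],
         [2.0, -1.0,  1.0]]
    L, U = lu_doolittle(A)
    print(L)   # [[1, 0, 0], [1, 1, 0], [2, -3, 1]]
    print(U)   # [[1, 1, -1], [0, 1, 2], [0, 0, 9]]

The output reproduces the L and U found above; Crout's method would instead
place the 1's on the diagonal of U.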

Checkpoint

1. What constitutes an LU decomposition of a matrix A?


2. How is a decomposition A = LU used to solve a linear system
Ax = b?
3. How may an LU decomposition be obtained from Gaussian elimin-
ation?

EXERCISES

1. Find an LU decomposition of the matrix


 
   [ a  b ]
   [ c  d ]

where a, b, c, d ≠ 0.
2. Solve each of the following systems by first finding an LU decomposition
of the coefficient matrix and then using forward and backward substitutions.
(These systems are from Exercises 1 and 3 in Step 11.)
(a) x1 + x2 − x3 = 0
2x1 − x2 + x3 = 6
3x1 + 2x2 − 4x3 = −4
(b) 2x + 6y + 4z = 5
6x + 19y + 12z = 6
2x + 8y + 14z = 7
STEP 16

SYSTEMS OF LINEAR EQUATIONS 6*


Testing for ill-conditioning*

We recall from Section 4 in Step 12 that ill-conditioned systems of linear equations


are such that their solutions are very sensitive to small changes in their coefficients
and constants. In this optional Step we show how one may test for such ill-
conditioning.

1 Norms
One of the most common tests for ill-conditioning of a linear system involves the
condition number of the coefficient matrix. In order to define this quantity, we
need to first consider the concept of the norm of a vector or matrix, which in some
way measures the size of their elements.
Let x and y be vectors. Then a vector norm ‖·‖ is a real number with the
following properties:
(a) ‖x‖ ≥ 0 and ‖x‖ = 0 if and only if x is a vector with all components zero;
(b) ‖αx‖ = |α| ‖x‖ for any real number α;
(c) ‖x + y‖ ≤ ‖x‖ + ‖y‖ (triangle inequality).
There are many possible ways to choose a vector norm with the above three
properties. One vector norm that is probably familiar to the student is the Euclidean
or 2-norm. Thus if x is an n × 1 vector, then the 2-norm is denoted and defined by

   ‖x‖2 ≡ [ Σ(i=1 to n) xi² ]^(1/2)

As an example, if x is the 5 × 1 vector with components x1 = 1, x2 = −3, x3 = 4,
x4 = −6, and x5 = 2, then

   ‖x‖2 = √(1² + (−3)² + 4² + (−6)² + 2²) = √66

Another possible choice of norm, which is more suitable for our purposes here, is
the infinity norm defined by

   ‖x‖∞ ≡ max(i=1,2,...,n) |xi|

Thus for the vector in the previous example we have ‖x‖∞ = 6. It is easily verified
that ‖x‖∞ has the three properties in the above definition. For ‖x‖2 the first two
properties are easy to verify; the triangle inequality (c) is a bit more difficult and

requires use of the so-called Cauchy-Schwarz inequality (for example, see Cheney
and Kincaid (1994)).
The defining properties of a matrix norm are similar, except that there is an extra
property. Let A and M be matrices. Then a matrix norm ‖·‖ is a real number
with the following properties:
(a) ‖A‖ ≥ 0 and ‖A‖ = 0 if and only if A is a matrix with all elements zero;
(b) ‖αA‖ = |α| ‖A‖ for any real number α;
(c) ‖A + M‖ ≤ ‖A‖ + ‖M‖;
(d) ‖AM‖ ≤ ‖A‖ ‖M‖.
As for vector norms, there are many ways of choosing matrix norms with the
four properties above, but here we consider only the infinity norm. If A is an n × n
matrix, then the infinity norm is defined by

   ‖A‖∞ ≡ max(i=1,2,...,n) Σ(j=1 to n) |aij|

From this definition, we see that this norm is the maximum of the sums obtained
from adding the absolute values of the elements in each row, so it is commonly
referred to as the maximum row sum norm.
As an example, suppose

        [ −3   3   4   4 ]
   A =  [  5   1   2  −3 ]
        [ −4   4  −3  −4 ]
        [ −3  −2   4  −2 ]

Then the row sums of absolute values are

   Σj |a1j| = 14,   Σj |a2j| = 11,   Σj |a3j| = 15,   Σj |a4j| = 11

so that ‖A‖∞ = max(14, 11, 15, 11) = 15.
A useful property relating the matrix and vector infinity norms is

   ‖Ax‖∞ ≤ ‖A‖∞ ‖x‖∞

This follows from property (d) of a matrix norm.
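These norms are simple to compute directly. The following Python sketch
(illustrative, not from the book) evaluates ‖x‖2, ‖x‖∞ and the maximum row sum
norm ‖A‖∞ for the two examples given above.

    import math

    def norm2(x):
        """Euclidean (2-) norm of a vector."""
        return math.sqrt(sum(xi * xi for xi in x))

    def norm_inf_vec(x):
        """Infinity norm of a vector: largest magnitude component."""
        return max(abs(xi) for xi in x)

    def norm_inf_mat(A):
        """Infinity norm of a matrix: maximum row sum of absolute values."""
        return max(sum(abs(a) for a in row) for row in A)

    x = [1, -3, 4, -6, 2]
    print(norm2(x))          # sqrt(66), about 8.124
    print(norm_inf_vec(x))   # 6

    A = [[-3,  3,  4,  4],
         [ 5,  1,  2, -3],
         [-4,  4, -3, -4],
         [-3, -2,  4, -2]]
    print(norm_inf_mat(A))   # 15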

2 Testing for ill-conditioning


We now proceed to test whether or not a system is ill-conditioned, by using the
condition number of the coefficient matrix. If A is an n × n matrix and A−1 is
its inverse, then the condition number of A is denoted and defined by

   cond(A) ≡ ‖A‖∞ ‖A−1‖∞

The condition number is bounded below by 1, since ‖I‖∞ = 1 and

   1 = ‖AA−1‖∞ ≤ ‖A‖∞ ‖A−1‖∞ = cond(A)

where we have used the matrix norm property (d) given in the previous section.
Large values of the condition number usually indicate ill-conditioning. As a
justification for this last statement, we state and prove the following theorem.
Theorem: Suppose x satisfies the linear system Ax = b and x̃ satisfies the linear
system Ax̃ = b̃. Then

   ‖x − x̃‖∞ / ‖x‖∞  ≤  cond(A) ‖b − b̃‖∞ / ‖b‖∞

Proof: We have x − x̃ = A−1(b − b̃). Since

   ‖A−1(b − b̃)‖∞ ≤ ‖A−1‖∞ ‖b − b̃‖∞

we see that
   ‖x − x̃‖∞ ≤ ‖A−1‖∞ ‖b − b̃‖∞

However, since b = Ax, we have ‖b‖∞ ≤ ‖A‖∞ ‖x‖∞, or

   1/‖x‖∞ ≤ ‖A‖∞/‖b‖∞

It then follows that

   ‖x − x̃‖∞ × (1/‖x‖∞) ≤ ‖A−1‖∞ ‖b − b̃‖∞ × (‖A‖∞/‖b‖∞)

from which the result follows.


From the theorem we see that even if the difference between b and b̃ is small, the
change in the solution as measured by the ‘relative error’ ‖x − x̃‖∞/‖x‖∞ may be
large when the condition number is large. It follows that a large condition number
is an indication of possible ill-conditioning of the system. A similar theorem for
the case when there are small changes to the coefficients of A may be found in
more advanced texts such as Atkinson (1993). Such a theorem also shows that a
large condition number is an indicator of ill-conditioning.
The question then arises as to how large the condition number has to be for
ill-conditioning to be a problem. Roughly speaking, if the condition number is
10^m and the machine being used to solve the linear system has k decimal digits of
accuracy, then the solution of the linear system will be accurate to k − m decimal
digits.

In Step 12 we had the coefficient matrix


 
        [ 2   1    ]
   A =  [ 2   1.01 ]

which was associated with an ill-conditioned system. Then

          [  50.5   −50 ]
   A−1 =  [ −100    100 ]

and cond(A) = ‖A‖∞ ‖A−1‖∞ = 3.01 × 200 = 602. This suggests that a
numerical solution would not be very accurate if only two decimal digits of
accuracy were used in the calculations. Indeed, if the components of A were
rounded to two decimal digits, the two rows of A would be identical. Then the
determinant of A would be zero, and it follows from the theorem in Step 11 that
this system would not have a unique solution.
We recall that as defined, the condition number requires A−1, but it is compu-
tationally expensive to compute the inverse matrix. Moreover, even if the inverse
were calculated, this approximation might not be very accurate if the system is
ill-conditioned. It is therefore common in software packages to estimate the con-
dition number by obtaining an estimate of ‖A−1‖∞ without explicitly finding
A−1.
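For the small 2 × 2 example above the inverse is available explicitly, so cond(A)
can be evaluated directly. The Python sketch below (illustrative only) does this
using the maximum row sum norm of Section 1.

    # Sketch: cond(A) = ||A||_inf * ||A^{-1}||_inf for the 2x2 example of Step 12.

    def norm_inf(A):
        return max(sum(abs(a) for a in row) for row in A)

    def inverse_2x2(A):
        (a, b), (c, d) = A
        det = a * d - b * c
        return [[ d / det, -b / det],
                [-c / det,  a / det]]

    A = [[2.0, 1.0],
         [2.0, 1.01]]
    A_inv = inverse_2x2(A)                   # [[50.5, -50], [-100, 100]]
    cond = norm_inf(A) * norm_inf(A_inv)     # 3.01 * 200 = 602
    print(A_inv, cond)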

Checkpoint

1. What are the three properties of a vector norm?


2. What are the four properties of a matrix norm?
3. How is the condition number of a matrix defined and how is it used
as a test for ill-conditioning?

EXERCISES

1. For the 5 × 1 vector with elements x1 = 4, x2 = −6, x3 = −5, x4 = 1, and


x5 = −1, calculate ‖x‖2 and ‖x‖∞.
2. Calculate the infinity norm for each of the matrices given in Exercise 1 of
Step 14 on page 62.
3. Calculate the condition number for each of the matrices given in Exercise 1
of Step 14 on page 62.
STEP 17

THE EIGENVALUE PROBLEM


The power method

Suppose A is an n×n matrix. If there exists a number λ and a nonzero vector x such
that Ax = λx, then λ is said to be an eigenvalue of A, and x the corresponding
eigenvector. The evaluation of eigenvalues and eigenvectors of matrices is a
problem that arises in a variety of contexts. Note that if we have an eigenvalue λ
and an eigenvector x, then βx (where β is any real number) is also an eigenvector
since
A(βx) = βAx = βλx = λ(βx)
This shows that the eigenvector is not unique and may be scaled if desired (for
instance, we might want the sum of the components of the eigenvector to be 1).
Writing Ax = λx as
(A − λI)x = 0
we conclude from the theorem on page 43 that this can have a nonzero solution
only if the determinant of A − λI is zero. If we expand out this determinant, then
we get an n-th degree polynomial in λ known as the characteristic polynomial
of A. Thus one way to find the eigenvalues of A is to obtain its characteristic
polynomial, and then find the n zeros (some may be complex) of this polynomial.
For example, suppose

   A =  [ a  b ]
        [ c  d ]

Then

   A − λI =  [ a − λ     b    ]
             [   c     d − λ  ]

This last matrix has determinant

   (a − λ)(d − λ) − bc = λ² − (a + d)λ + (ad − bc)

The zeros of this quadratic yield the eigenvalues.


Although the characteristic polynomial is easy to work out in this simple 2 × 2
case, as n increases it becomes more complicated (and, of course, of correspond-
ingly higher degree). Moreover, even if we can get the characteristic polynomial,
the analytic formulae for the roots of a cubic or quartic are somewhat inconvenient
to use, and in any case, we must use some numerical method (see Steps 7–10)
to find the roots of the polynomial when n > 4. It is therefore common to use
alternative direct numerical methods to find the eigenvalues and eigenvectors of a
matrix.
If we are interested in only the eigenvalue of largest magnitude, then a popular
approach to the evaluation of this eigenvalue is the power method. We shall
later discuss how this method may be modified to find the eigenvalue of smallest

magnitude. Methods for finding all the eigenvalues are beyond the scope of this
book. (One such method, called the QR method, is based on the QR factorization
to be discussed in Section 3 of Step 27.)

1 Power method
Suppose that the n eigenvalues of A are λ1 , λ2 , . . . , λn and that they are ordered
in such a way that
|λ1 | > |λ2 | ≥ · · · ≥ |λn−1 | ≥ |λn |
Then the power method† can be used to find λ1 . We begin with a starting vector
w(0) and calculate the vectors

w( j) = Aw( j−1)

for j = 1, 2, . . ., so by induction we have

   w(j) = A^j w(0)

where A^j is A multiplied by itself j times. Thus w(j) is the product of w(0) and the
j-th power of A, which explains why this approach is called the power method.
It turns out that at the j-th iteration an approximation to the eigenvector x
associated with λ1 is given by w(j). Moreover, if wk(j) and wk(j−1) are the k-th
components of w(j) and w(j−1) respectively, then an approximation to λ1 is given
by

   λ1(j) = wk(j) / wk(j−1)

for any k ∈ {1, 2, . . . , n}. Although there are n possible choices for k, it is usual
to choose k so that wk(j) is the component of w(j) with the largest magnitude.
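A minimal Python sketch of this basic (unscaled) iteration follows (names are
illustrative, not from the book); each pass multiplies by A and estimates λ1 from
the component of largest magnitude.

    # Sketch: unscaled power method. The eigenvalue estimate at iteration j is
    # w_k(j) / w_k(j-1), with k the index of the largest-magnitude component.

    def matvec(A, v):
        return [sum(a * x for a, x in zip(row, v)) for row in A]

    def power_method(A, w0, iterations):
        w = w0[:]
        estimates = []
        for _ in range(iterations):
            w_new = matvec(A, w)
            k = max(range(len(w_new)), key=lambda i: abs(w_new[i]))
            estimates.append(w_new[k] / w[k])
            w = w_new
        return estimates, w

    A = [[1, 1, -1],
         [1, 2,  1],
         [2, -1, 1]]
    estimates, w = power_method(A, [1, 1, 1], 9)
    print(estimates)   # 4, 2.75, 2.27273, 2.36, ..., 2.58344 (as in Section 2)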

2 Example
Let us use the power method to find the largest eigenvalue of the matrix
 
        [ 1   1  −1 ]
   A =  [ 1   2   1 ]
        [ 2  −1   1 ]

As a starting vector we take


   
   w(0) = [1, 1, 1]^T   so that   w(1) = A w(0) = [1, 4, 2]^T

† The procedure is quite easy to implement in a computer program.



Since the second component of w(1) has the largest magnitude we take k = 2 so
that the first approximation to λ1 is
   λ1(1) = w2(1) / w2(0) = 4/1 = 4

By doing more iterations of the power method we find


 
   w(2) = [3, 11, 0]^T,          λ1(2) = 11/4     = 2.75
   w(3) = [14, 25, −5]^T,        λ1(3) = 25/11    = 2.27273
   w(4) = [44, 59, −2]^T,        λ1(4) = 59/25    = 2.36
   w(5) = [105, 160, 27]^T,      λ1(5) = 160/59   = 2.71186
   w(6) = [238, 452, 77]^T,      λ1(6) = 452/160  = 2.825
   w(7) = [613, 1219, 101]^T,    λ1(7) = 1219/452 = 2.69690
   w(8) = [1731, 3152, 108]^T,   λ1(8) = 3152/1219 = 2.58573
   w(9) = [4775, 8143, 418]^T,   λ1(9) = 8143/3152 = 2.58344   etc.

From these calculations we conclude that the largest eigenvalue is about 2.6.

3 Variants
In the previous example, the reader would have noticed that the components of
w( j) were growing in size as j increases. Overflow problems would arise if this
growth were to continue, so in practice it is usual to use the scaled power method
instead. This is identical to the power method except we scale the vectors w( j)
at each iteration. Thus suppose w(0) is given and set y(0) = w(0) . Then for
j = 1, 2, . . . let us carry out the following steps:

(a) Calculate w(j) = A y(j−1);
(b) find p such that |wp(j)| = max(i=1,...,n) |wi(j)|, so the p-th component of w(j)
    has the largest magnitude;
(c) evaluate an approximation to λ1, namely

       λ1(j) = wk(j) / yk(j−1)

    for some k ∈ {1, 2, . . . , n};
(d) calculate y(j) = w(j) / wp(j).
In step (c) there are n choices for k. As for the unscaled power method, k is
usually chosen to be the value for which wk(j) has the largest magnitude, that is, k
is taken to be the value p obtained in step (b). Another option is to choose k to be
the value of p from the previous iteration, although often this results in the same
value of k. The effect of step (d) is to produce a vector y(j) with components of
magnitude not more than 1.
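Steps (a)–(d) carry over directly to code; the Python sketch below (illustrative
names, not from the book) takes k = p at each iteration, as in the example that
follows.

    # Sketch: scaled power method, steps (a)-(d), with k = p at each iteration.

    def matvec(A, v):
        return [sum(a * x for a, x in zip(row, v)) for row in A]

    def scaled_power_method(A, w0, iterations):
        y = w0[:]
        estimates = []
        for _ in range(iterations):
            w = matvec(A, y)                                   # (a)
            p = max(range(len(w)), key=lambda i: abs(w[i]))    # (b)
            estimates.append(w[p] / y[p])                      # (c) with k = p
            y = [wi / w[p] for wi in w]                        # (d)
        return estimates, y

    A = [[1, 1, -1],
         [1, 2,  1],
         [2, -1, 1]]
    estimates, y = scaled_power_method(A, [1, 1, 1], 4)
    print(estimates)   # [4.0, 2.75, 2.2727..., 2.36]
    print(y)           # approximately [0.74576, 1, -0.03390]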
As an example, we apply the scaled power method to the matrix in the previous
section. We take the value of k in each iteration to be p. The starting vector w(0)
is the same as before and y(0) = w(0) . Then the first four iterations of the scaled
power method yield:

   w(1) = [1, 4, 2]^T,                     p = 2,   λ1(1) = 4/1 = 4
   y(1) = [0.25, 1, 0.5]^T

   w(2) = [0.75, 2.75, 0]^T,               p = 2,   λ1(2) = 2.75/1 = 2.75
   y(2) = [0.27273, 1, 0]^T

   w(3) = [1.27273, 2.27273, −0.45455]^T,  p = 2,   λ1(3) = 2.27273/1 = 2.27273
   y(3) = [0.56, 1, −0.2]^T

   w(4) = [1.76, 2.36, −0.08]^T,           p = 2,   λ1(4) = 2.36/1 = 2.36
   y(4) = [0.74576, 1, −0.03390]^T
Note that the eigenvalue estimates are as before and the w( j) ’s are just multiples
of those obtained in the previous section.
We now discuss how the power method may be used to find the eigenvalue λn
with the smallest magnitude. If A has an inverse, then Ax = λx may be written as
   x = λ A−1 x     or     A−1 x = (1/λ) x
It follows that the smallest eigenvalue of A may be found by finding the largest
eigenvalue of A−1 and then taking the reciprocal. Thus if the unscaled power
method were used, we would calculate the vectors

w( j) = A−1 w( j−1)

In general it is more efficient to solve the linear system Aw( j) = w( j−1) than to
find the inverse of A (see Step 14).
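A sketch of this idea in Python is given below (illustrative, not from the book);
each iteration solves A w(j) = w(j−1) by Gaussian elimination with partial
pivoting rather than forming A−1, and the reciprocal of the converged ratio
estimates the smallest-magnitude eigenvalue. The 3 × 3 matrix used here is the
one from the exercises of this Step.

    # Sketch: inverse power method, solving A w_new = w at each step.

    def solve(A, b):
        """Solve Ax = b by Gaussian elimination with partial pivoting."""
        n = len(b)
        M = [row[:] + [bi] for row, bi in zip(A, b)]
        for k in range(n):
            p = max(range(k, n), key=lambda i: abs(M[i][k]))
            M[k], M[p] = M[p], M[k]
            for i in range(k + 1, n):
                m = M[i][k] / M[k][k]
                for j in range(k, n + 1):
                    M[i][j] -= m * M[k][j]
        x = [0.0] * n
        for i in range(n - 1, -1, -1):
            s = sum(M[i][j] * x[j] for j in range(i + 1, n))
            x[i] = (M[i][n] - s) / M[i][i]
        return x

    def smallest_eigenvalue(A, w0, iterations):
        w = w0[:]
        mu = None
        for _ in range(iterations):
            w_new = solve(A, w)            # w_new = A^{-1} w without forming A^{-1}
            k = max(range(len(w_new)), key=lambda i: abs(w_new[i]))
            mu = w_new[k] / w[k]           # estimate of largest eigenvalue of A^{-1}
            w = w_new
        return 1.0 / mu                    # smallest-magnitude eigenvalue of A

    A = [[2, 6, 4], [6, 19, 12], [2, 8, 14]]
    print(smallest_eigenvalue(A, [1.0, 1.0, 1.0], 20))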

4 Other aspects
It may be shown that the convergence rate of the power method is linear and that
under suitable conditions

   λ1(j) − λ1 ≈ C |λ2/λ1|^j
where C is some positive constant. Thus the bigger the gap between λ2 and λ1 ,
the faster the rate of convergence.
Since the power method is an iterative method, one has to stop at some stage. It
is usual to carry on the process until successive estimates of the eigenvalue agree
to a certain tolerance or a maximum number of iterations is exceeded.
Difficulties with the power method usually arise when our assumptions about
the eigenvalues are not valid. For instance, if |λ1 | = |λ2 |, then the sequence of
estimates for λ1 may not converge. Even if the sequence does converge, one may
not be able to get an approximation to the eigenvector associated with λ1 . A short
discussion of such difficulties may be found in Conte and de Boor (1980).

Checkpoint

1. How are the eigenvalues and eigenvectors of a matrix defined?


2. What is the power method for finding the eigenvalue having the
largest magnitude?
3. What advantage does the scaled power method have over the power
method?

EXERCISES

1. For the 2 × 2 matrix

      A =  [ −1.2    1.1 ]
           [  3.6   −0.8 ]

   apply eight iterations of the power method. Find the characteristic polyno-
   mial, and hence find the two true eigenvalues of the matrix. Verify that the
   approximations are converging to the eigenvalue with the larger magnitude.
2. Apply five iterations of the normal and scaled power methods to the 3 × 3
   matrix

      [ 2    6    4 ]
      [ 6   19   12 ]
      [ 2    8   14 ]
STEP 18

FINITE DIFFERENCES 1
Tables

Historically, numerical analysts have been concerned with tables of numbers, and
many techniques have been developed for dealing with mathematical functions
represented in this way. For example, the value of the function at an untabulated
point may be required, so that an interpolation procedure is necessary. It is also
possible to estimate the derivative or the definite integral of a tabulated function,
using some finite processes to approximate the corresponding (infinitesimal) lim-
iting procedures of calculus. In each case, it has been traditional to use finite
differences. Another application of finite differences, which is outside the scope
of this book, is the numerical solution of partial differential equations.

1 Tables of values
Many books contain tables of mathematical functions. One of the most com-
prehensive is Handbook of Mathematical Functions, edited by Abramowitz and
Stegun (see the Bibliography for publication details), which also contains useful
information about numerical methods.
Although most tables use constant argument intervals, some functions do change
rapidly in value in particular regions of the argument, and hence may best be
tabulated using intervals varying according to the local behaviour of the function.
Tables with varying argument interval are more difficult to work with, however, and
it is common to adopt uniform argument intervals wherever possible. As a simple
example consider the 6S table of the exponential function over 0.10 (0.01) 0.18
(this notation specifies the domain 0.10 ≤ x ≤ 0.18 spanned in intervals of 0.01).

   x      f (x) = e^x     x      f (x) = e^x     x      f (x) = e^x
   0.10   1.10517         0.13   1.13883         0.16   1.17351
   0.11   1.11628         0.14   1.15027         0.17   1.18530
   0.12   1.12750         0.15   1.16183         0.18   1.19722

It is extremely important that the interval between successive values is small


enough to display the variation of the tabulated function, because usually the value
of the function will be needed at some argument value between values specified
(for example, e x at x = 0.105 from the above table). If the table is so constructed,
we can obtain such intermediate values to reasonable accuracy by assuming a
polynomial representation (hopefully, of low degree) of the function f .

2 Finite differences
Since Newton, finite differences have been used extensively. The construction of
a table of finite differences for a tabulated function is simple: first differences are
obtained by subtracting each value from the succeeding value in a table, second
differences by repeating this operation on the first differences, and so on for higher
orders. From the above table of e x for x = 0.10 (0.01) 0.18, one has the following
table (note the customary layout, with decimal points and leading zeros omitted
from the differences).

                               Differences
   x      f (x) = e^x     1st.    2nd.    3rd.
   0.10   1.10517
                          1111
   0.11   1.11628                  11
                          1122              0
   0.12   1.12750                  11
                          1133              0
   0.13   1.13883                  11
                          1144              1
   0.14   1.15027                  12
                          1156              0
   0.15   1.16183                  12
                          1168             −1
   0.16   1.17351                  11
                          1179              2
   0.17   1.18530                  13
                          1192
   0.18   1.19722

(In this case, the differences must be multiplied by 10^−5 for comparison with the
function values.)
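Such a table is easy to build by repeated subtraction. The short Python sketch
below (illustrative names) reproduces the difference columns of the table above.

    import math

    def difference_table(values, orders):
        """Return [values, 1st differences, 2nd differences, ...]."""
        columns = [list(values)]
        for _ in range(orders):
            prev = columns[-1]
            columns.append([b - a for a, b in zip(prev, prev[1:])])
        return columns

    xs = [0.10 + 0.01 * i for i in range(9)]          # 0.10 (0.01) 0.18
    fs = [round(math.exp(x), 5) for x in xs]          # 6S values of e^x
    cols = difference_table(fs, 3)

    for order, col in enumerate(cols[1:], start=1):
        # differences printed in units of 10^-5, as in the table above
        print(order, [round(d * 1e5) for d in col])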

3 Influence of round-off errors


Consider the difference table given below for f (x) = e^x : 0.1 (0.05) 0.5 to six
significant digits constructed as in the preceding section. As before, differences
of increasing order decrease rapidly in magnitude, but the third differences are
irregular. This is largely a consequence of round-off errors, as tabulation of the
function to seven significant digits and differencing to fourth order illustrates
(compare Exercise 3 at the end of this Step).

                               Differences
   x      f (x) = e^x     1st.    2nd.    3rd.
   0.10   1.10517
                          5666
   0.15   1.16183                 291
                          5957             15
   0.20   1.22140                 306
                          6263             14
   0.25   1.28403                 320
                          6583             18
   0.30   1.34986                 338
                          6921             16
   0.35   1.41907                 354
                          7275             20
   0.40   1.49182                 374
                          7649             18
   0.45   1.56831                 392
                          8041
   0.50   1.64872

Although the round-off errors in f should be less than ½ in the last significant
place, they may accumulate; the greatest error that can be obtained corresponds
to:
                              Differences
   Tabular error    1st.   2nd.   3rd.   4th.   5th.   6th.
      +½
                    −1
      −½                   +2
                    +1            −4
      +½                   −2            +8
                    −1            +4            −16
      −½                   +2            −8            +32
                    +1            −4            +16
      +½                   −2            +8
                    −1            +4
      −½                   +2
                    +1
      +½

A rough working criterion for the expected fluctuations (‘noise level’) due to
round-off error is shown in the following table.
Order of difference 1 2 3 4 5 6
Expected error limits ±1 ±2 ±3 ±6 ±12 ±22

Checkpoint

1. What factors determine the intervals of tabulation of a function?


2. What is the name of the procedure to determine a value of a tabulated
function at an intermediate point?
3. What may be the cause of irregularity in the highest order differences
in a difference table?

EXERCISES

1. Construct a table of differences for the function f (x) = x³ for x = 0 (1) 6.

2. Construct a table of differences for each of the following polynomial func-
   tions.
   (a) 2x − 1 for x = 0 (1) 3.
   (b) 3x² + 2x − 4 for x = 0 (1) 4.
   (c) 2x³ + 5x − 3 for x = 0 (1) 5.
Study your resulting tables carefully; note what happens in the final few
columns of each table. Suggest a general result for polynomials of degree n
and compare your answer with the theorem on page 88.
3. Construct a difference table for the function f (x) = e^x (given to seven
   significant digits) for x = 0.1 (0.05) 0.5:
      x      f (x)        x      f (x)        x      f (x)
      0.10   1.105171     0.25   1.284025     0.40   1.491825
      0.15   1.161834     0.30   1.349859     0.45   1.568312
      0.20   1.221403     0.35   1.419068     0.50   1.648721
STEP 19

FINITE DIFFERENCES 2
Forward, backward, and central difference notations

There are several different notations for the single set of finite differences described
in the previous Step. Here we shall consider only the forward, backward, and
central differences. We introduce each of these three notations in terms of the
so-called shift operator, which we define first.

1 The shift operator E


Let { f 0 , f 1 , . . . , f n−1 , f n } denote a set of values of the function f defined by
f j ≡ f (x j ), where x j = x0 + j h, j = 0, 1, 2, . . . , n. The shift operator E is
defined by
   E f_j ≡ f_(j+1)
Consequently,
   E² f_j = E(E f_j) = E f_(j+1) = f_(j+2)
and so on; that is,
   E^k f_j = f_(j+k)
where k is any positive integer. Moreover, let us extend this last formula to
negative integers, and indeed to all real values of j and k, so that for example,

   E^(−1) f_j = f_(j−1)

and
   E^(1/2) f_j = f_(j+1/2) = f (x_j + ½h) = f (x_0 + (j + ½)h)

2 The forward difference operator ∆


If we define the forward difference operator Δ by

   Δ ≡ E − 1

it follows that

   Δf_j = (E − 1) f_j = E f_j − f_j = f_(j+1) − f_j

which is the first-order forward difference at x_j. Similarly,

   Δ²f_j = Δ(Δf_j) = Δf_(j+1) − Δf_j = f_(j+2) − 2 f_(j+1) + f_j

is the second-order forward difference at x_j, and so on. The forward difference of
order k is

   Δ^k f_j = Δ^(k−1)(Δf_j) = Δ^(k−1)( f_(j+1) − f_j ) = Δ^(k−1) f_(j+1) − Δ^(k−1) f_j

where k is any integer.

3 The backward difference operator ∇


If we define the backward difference operator ∇ by

   ∇ ≡ 1 − E^(−1)

it follows that

   ∇f_j = (1 − E^(−1)) f_j = f_j − E^(−1) f_j = f_j − f_(j−1)

which is the first-order backward difference at x_j. Similarly,

   ∇²f_j = ∇(∇f_j) = ∇f_j − ∇f_(j−1) = f_j − 2 f_(j−1) + f_(j−2)

is the second-order backward difference at x_j, and so on. The backward difference
of order k is

   ∇^k f_j = ∇^(k−1)(∇f_j) = ∇^(k−1)( f_j − f_(j−1) ) = ∇^(k−1) f_j − ∇^(k−1) f_(j−1)

where k is any integer. Note that ∇f_j = Δf_(j−1), and ∇^k f_j = Δ^k f_(j−k).

4 The central difference operator δ


If we define the central difference operator δ by

   δ ≡ E^(1/2) − E^(−1/2)

it follows that

   δf_j = (E^(1/2) − E^(−1/2)) f_j = E^(1/2) f_j − E^(−1/2) f_j = f_(j+1/2) − f_(j−1/2)

which is the first-order central difference at x_j. Similarly,

   δ²f_j = δ(δf_j) = δ( f_(j+1/2) − f_(j−1/2) ) = f_(j+1) − 2 f_j + f_(j−1)

is the second-order central difference at x_j, and so on. The central difference of
order k is

   δ^k f_j = δ^(k−1)(δf_j) = δ^(k−1)( f_(j+1/2) − f_(j−1/2) ) = δ^(k−1) f_(j+1/2) − δ^(k−1) f_(j−1/2)

where k is any integer. Note that δf_(j+1/2) = Δf_j = ∇f_(j+1).

5 Difference display
The role of the forward, central, and backward differences is displayed by the
difference table:
                                     Differences
   x         f (x)        1st.         2nd.         3rd.          4th.
   x_0       f_0
                          Δf_0
   x_1       f_1                       Δ²f_0
                          Δf_1                      Δ³f_0
   x_2       f_2                       Δ²f_1                      Δ⁴f_0
                          Δf_2                      Δ³f_1
   x_3       f_3                       Δ²f_2
                          Δf_3
   x_4       f_4
    .         .
    .         .
   x_(j−2)   f_(j−2)
                          δf_(j−3/2)
   x_(j−1)   f_(j−1)                   δ²f_(j−1)
                          δf_(j−1/2)                δ³f_(j−1/2)
   x_j       f_j                       δ²f_j                      δ⁴f_j
                          δf_(j+1/2)                δ³f_(j+1/2)
   x_(j+1)   f_(j+1)                   δ²f_(j+1)
                          δf_(j+3/2)
   x_(j+2)   f_(j+2)
    .         .
    .         .
   x_(n−4)   f_(n−4)
                          ∇f_(n−3)
   x_(n−3)   f_(n−3)                   ∇²f_(n−2)
                          ∇f_(n−2)                  ∇³f_(n−1)
   x_(n−2)   f_(n−2)                   ∇²f_(n−1)                  ∇⁴f_n
                          ∇f_(n−1)                  ∇³f_n
   x_(n−1)   f_(n−1)                   ∇²f_n
                          ∇f_n
   x_n       f_n

Although forward, central, and backward differences represent precisely the same
set of numbers:
(a) forward differences are especially useful near the start of a table, since they
involve tabulated function values below x j ;
(b) central differences are especially useful away from the ends of the table,
where there are available tabulated function values above and below x j ;
(c) backward differences are especially useful near the end of a table, since they
involve tabulated function values above x j .

Checkpoint

1. What is the definition of the shift operator?


2. How are the forward, backward, and central difference operators
defined?
3. When are the respective forward, backward, and central difference
notations likely to be used?

EXERCISES

1. Draw up a table of differences for the polynomial

      f (x) = 3x³ − 2x² + x + 5

   for x = 0 (1) 4. Use the table to obtain the values of

   (a) Δf_1, Δ²f_1, Δ³f_1, Δ³f_0, Δ²f_2.

   (b) ∇f_1, ∇f_2, ∇²f_2, ∇²f_3, ∇³f_4.

   (c) δf_(1/2), δ²f_1, δ³f_(3/2), δ³f_(5/2), δ²f_2.

2. For the difference table on page 81 of f (x) = e^x for x = 0.1 (0.05) 0.5 to
   six significant digits, determine the following (taking x_0 = 0.1):
   (a) Δf_2, Δ²f_2, Δ³f_2, Δ⁴f_2.

   (b) ∇f_6, ∇²f_6, ∇³f_6, ∇⁴f_6.

   (c) δ²f_4, δ⁴f_4.

   (d) Δ²f_1, δ²f_2, ∇²f_3.

   (e) Δ³f_3, ∇³f_6, δ³f_(9/2).

3. Prove the following:

   (a) E x_j = x_(j+1).

   (b) Δ³f_j = f_(j+3) − 3 f_(j+2) + 3 f_(j+1) − f_j.

   (c) ∇³f_j = f_j − 3 f_(j−1) + 3 f_(j−2) − f_(j−3).

   (d) δ³f_j = f_(j+3/2) − 3 f_(j+1/2) + 3 f_(j−1/2) − f_(j−3/2).
STEP 20

FINITE DIFFERENCES 3
Polynomials

Since polynomial approximations are used in many areas of Numerical Analysis,


it is important to investigate the effects of differencing polynomials.

1 Finite differences of a polynomial


Consider the finite differences of an n-th degree polynomial

   f (x) = a_n x^n + a_(n−1) x^(n−1) + · · · + a_1 x + a_0

tabulated for equidistant points at tabular interval h.

Theorem: The n-th difference of a polynomial of degree n is a constant propor-
tional to h^n, and higher order differences are zero.
Proof: For any positive integer k, the binomial expansion

   (x_j + h)^k = Σ(i=0 to k) [ k! / (i!(k − i)!) ] x_j^(k−i) h^i

yields
   (x_j + h)^k − x_j^k = k x_j^(k−1) h + polynomial of degree (k − 2)

Omitting the subscript on x_j, we then have

   Δf (x) = f (x + h) − f (x)
          = a_n[(x + h)^n − x^n] + a_(n−1)[(x + h)^(n−1) − x^(n−1)]
            + · · · + a_1[(x + h) − x]
          = a_n n x^(n−1) h + polynomial of degree (n − 2)

   Δ²f (x) = a_n n h[(x + h)^(n−1) − x^(n−1)] + · · ·
           = a_n n(n − 1) x^(n−2) h² + polynomial of degree (n − 3)
      .
      .
   Δ^n f (x) = a_n n! h^n = constant
   Δ^(n+1) f (x) = 0

In passing, the student may recall that in differential calculus the increment
Δf (x) = f (x + h) − f (x) is related to the derivative of f (x) at the point x.

2 Example
For f (x) = x³ for x = 5.0 (0.1) 5.5 we obtain the following difference table.

   x     f (x) = x³     Δ       Δ²     Δ³     Δ⁴
   5.0   125.000
                        7651
   5.1   132.651                306
                        7957            6
   5.2   140.608                312            0
                        8269            6
   5.3   148.877                318            0
                        8587            6
   5.4   157.464                324
                        8911
   5.5   166.375

In this case n = 3, a_n = 1, h = 0.1, whence Δ³f (x) = 1 × 3! × (0.1)³ = 0.006.
Note that round-off error noise may occur: for example, consider the tabulation
of f (x) = x³ for 5.0 (0.1) 5.5 rounded to two decimal places.

   x     f (x) = x³     Δ      Δ²     Δ³     Δ⁴
   5.0   125.00
                        765
   5.1   132.65                31
                        796            0
   5.2   140.61                31             0
                        827            0
   5.3   148.88                31             3
                        858            3
   5.4   157.46                34
                        892
   5.5   166.38
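The theorem of Section 1 is easy to check numerically. A short Python sketch
(illustrative) differencing f (x) = x³ at interval h = 0.1 gives constant third
differences equal to 3! × (0.1)³ = 0.006 and (essentially) zero fourth differences.

    # Sketch: the n-th differences of a degree-n polynomial are constant
    # (equal to a_n * n! * h^n); here f(x) = x^3, h = 0.1.

    def differences(values):
        return [b - a for a, b in zip(values, values[1:])]

    h = 0.1
    xs = [5.0 + h * i for i in range(6)]          # 5.0 (0.1) 5.5
    fs = [x ** 3 for x in xs]

    col = fs
    for order in range(1, 5):
        col = differences(col)
        print(order, [round(d, 6) for d in col])
    # third differences: 0.006 (to within round-off); fourth differences: 0.0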

3 Approximation of a function by a polynomial


Whenever the higher differences of a table become small (allowing for round-off
noise), the function represented may be well approximated by a polynomial. For
example, reconsider the difference table of f (x) = e^x for x = 0.1 (0.05) 0.5 to
six significant digits.

   x      f (x) = e^x     Δ       Δ²     Δ³     Δ⁴
   0.10   1.10517
                          5666
   0.15   1.16183                 291
                          5957            15
   0.20   1.22140                 306            −1
                          6263            14
   0.25   1.28403                 320             4
                          6583            18
   0.30   1.34986                 338            −2
                          6921            16
   0.35   1.41907                 354             4
                          7275            20
   0.40   1.49182                 374            −2
                          7649            18
   0.45   1.56831                 392
                          8041
   0.50   1.64872
Since the estimate for round-off error in Δ³ is ±3 (see page 82), we say that
third differences are constant within round-off error, and deduce that a cubic
approximation is appropriate for e^x over the range 0.1 < x < 0.5 at interval
0.05. In this fashion, differences can be used to decide what (if any) degree of
approximating polynomial is appropriate.
An example in which polynomial approximation is inappropriate is when
f (x) = 10^x for x = 0 (1) 4, thus:

   x    f (x)      Δ       Δ²      Δ³      Δ⁴
   0    1
                   9
   1    10                 81
                   90              729
   2    100                810             6561
                   900             7290
   3    1000               8100
                   9000
   4    10000

Although f (x) = 10^x is ‘smooth’, the large tabular interval (h = 1) produces
large higher order finite differences. It should also be understood that there exist
functions that cannot usefully be tabulated at all, at least in some neighbourhood;
for example, f (x) = sin(1/x) near the origin x = 0. Nevertheless, these are
fairly exceptional cases.

Finally, we remark that the approximation of a function by a polynomial is


fundamental to the widespread use of finite difference methods.

Checkpoint

1. What may be said about the higher order (exact) differences of a


polynomial?
2. What is the effect of round-off error on the higher order differences
of a polynomial?
3. When may a function be approximated by a polynomial?

EXERCISES

1. Construct a difference table for the polynomial f (x) = x⁴ for x = 0 (0.1) 1
   when
   (a) the values of f are exact;
   (b) the values of f have been rounded to 3 decimal places.
   Compare the fourth difference round-off errors with the estimate ±6.
2. Find the degree of the polynomial which fits the data in the following table.
      x   f (x)      x   f (x)
      0     3        3    24
      1     2        4    59
      2     7        5   118
STEP 21

INTERPOLATION 1
Linear and quadratic interpolation

Interpolation is ‘the art of reading between the lines in a table’ and may be regarded
as a special case of the general process of curve fitting (see Steps 26 and 28). More
precisely, interpolation is the process whereby untabulated values of a function
tabulated only at certain values are estimated, on the assumption that the function
behaves sufficiently smoothly between tabular points for it to be approximated by
a polynomial of fairly low degree.
Interpolation is not as important in Numerical Analysis as it was, now that
computers (and calculators with built-in functions) are available, and function
values may often be obtained readily by an algorithm (probably from a standard
subroutine). However,
(a) interpolation is still important for functions that are available only in tabular
form (perhaps from the results of an experiment); and
(b) interpolation serves to introduce the wider application of finite differences.
In Step 20, we observed that when the differences of order k are constant
(within round-off fluctuation), the tabulated function may be approximated by a
polynomial of degree k. Linear and quadratic interpolation correspond to the
cases k = 1 and k = 2, respectively.

1 Linear interpolation
When a tabulated function varies so slowly that first differences are approximately
constant, it may be approximated closely by a straight line between adjacent
tabular points. This is the basic idea of linear interpolation. In Figure 10, the two
function points (x j , f j ) and (x j+1 , f j+1 ) are connected by a straight line. Any x
between x j and x j+1 may be defined by a value θ such that

x − x j = θ(x j+1 − x j ) ≡ θ h, 0<θ <1

Provided f (x) is only slowly varying in the interval, a value of the function at
x is approximately given by the ordinate to the straight line at x. Elementary
geometrical considerations yield

   θ = (x − x_j)/(x_(j+1) − x_j) ≈ ( f (x) − f_j )/( f_(j+1) − f_j )

so that

   f (x) ≈ f_j + θ( f_(j+1) − f_j ) = f_j + θ Δf_j = f_j + θ ∇f_(j+1) = f_j + θ δf_(j+1/2)
FIGURE 10. Linear interpolation.

In analytical terms, we have approximated f (x) by


   P_1(x) = f_j + [(x − x_j)/(x_(j+1) − x_j)] ( f_(j+1) − f_j )
a linear function of x which satisfies

P1 (x j ) = f j = f (x j ), P1 (x j+1 ) = f j+1 = f (x j+1 )

As an example, consider the following difference table from a 4D table of e^(−x).

   x      f (x) = e^(−x)     Δ      Δ²
   0.90   0.4066
                             −41
   0.91   0.4025                     1
                             −40
   0.92   0.3985                     1
                             −39
   0.93   0.3946                    −1
                             −40
   0.94   0.3906                     1
                             −39
   0.95   0.3867                     1
                             −38
   0.96   0.3829

The first differences are almost constant locally, so that the table is suitable for
linear interpolation. For example,
   f (0.934) ≈ 0.3946 + (4/10)(−0.0040) = 0.3930

2 Quadratic interpolation
As previously indicated, linear interpolation is appropriate only for slowly varying
functions. The next simplest process is quadratic interpolation, based on an
approximating polynomial of degree two; one might expect that this approximation
would give better accuracy for functions with larger variation.
Given three adjacent points x j , x j+1 = x j + h, and x j+2 = x j + 2h, suppose
that f (x) is approximated by
P2 (x) = a + b(x − x j ) + c(x − x j )(x − x j+1 )
where a, b, and c are chosen so that
P2 (x j+k ) = f (x j+k ) = f j+k , k = 0, 1, 2
Thus
   P_2(x_j) = a = f_j
   P_2(x_(j+1)) = a + bh = f_(j+1)
   P_2(x_(j+2)) = a + 2bh + 2ch² = f_(j+2)
whence
   a = f_j
   b = ( f_(j+1) − a)/h = ( f_(j+1) − f_j )/h = Δf_j / h
   c = ( f_(j+2) − 2bh − a)/(2h²) = ( f_(j+2) − 2 f_(j+1) + f_j )/(2h²)
     = Δ²f_j /(2h²)
Setting x = x_j + θh, we obtain the quadratic interpolation formula

   f (x_j + θh) ≈ f_j + θ Δf_j + ½ θ(θ − 1) Δ²f_j

We note immediately that the quadratic interpolation formula introduces a second
term (involving Δ²f_j) not included in the linear interpolation formula.
As an example, we determine the second-order correction to the value of
f (0.934) obtained above using linear interpolation. The extra term is

   ½ × (4/10) × (−6/10) × 0.0001 = −0.0024/200

so that the quadratic interpolation formula gives

   f (0.934) ≈ 0.3930 − 0.0024/200 = 0.3930

(In this case, the extra term −0.0024/200 is negligible.)
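Both formulae are one-liners in code. The Python sketch below (illustrative
names) applies them to the 4D table of e^(−x) above to estimate f (0.934).

    # Sketch: linear and quadratic interpolation at constant interval h,
    # using forward differences at the base point x_j.

    def linear_interp(f_j, f_j1, theta):
        return f_j + theta * (f_j1 - f_j)

    def quadratic_interp(f_j, f_j1, f_j2, theta):
        d1 = f_j1 - f_j                    # first forward difference
        d2 = f_j2 - 2 * f_j1 + f_j         # second forward difference
        return f_j + theta * d1 + 0.5 * theta * (theta - 1) * d2

    # 4D table of e^{-x}: base point x_j = 0.93, h = 0.01, x = 0.934, theta = 0.4
    f93, f94, f95 = 0.3946, 0.3906, 0.3867
    theta = 0.4
    print(round(linear_interp(f93, f94, theta), 4))          # 0.3930
    print(round(quadratic_interp(f93, f94, f95, theta), 4))  # 0.3930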

Checkpoint

1. What is the process of obtaining an untabulated value of a function


called?
2. When is linear interpolation adequate?
3. When is quadratic interpolation needed and adequate?

EXERCISES

1. Obtain an estimate of sin(0.55) by constructing the linear interpolation poly-


nomial to f (x) = sin x over the interval [0.5, 0.6] using the data
x 0.5 0.6
f (x) 0.47943 0.56464
Compare your estimate with the value of sin(0.55) given by your calculator.
2. Entries in a table of cos x are:
            0′      10′     20′     30′     40′     50′
   80°    0.1736  0.1708  0.1679  0.1650  0.1622  0.1593
   Obtain an estimate of cos(80° 35′) by using
   (a) linear interpolation, and
   (b) quadratic interpolation.
3. Entries in a table of tan x are:
            0′     10′    20′    30′    40′    50′
   80°    5.671  5.769  5.871  5.976  6.084  6.197
   Determine whether it is appropriate to use either linear or quadratic interpol-
   ation. If so, obtain an estimate of tan(80° 35′).
STEP 22

INTERPOLATION 2
Newton interpolation formulae

The linear and quadratic interpolation formulae are based on first and second de-
gree polynomial approximation. Newton derived general forward and backward
difference interpolation formulae, corresponding to approximation by a poly-
nomial of degree n, for tables of constant interval h. (For tables that do not
have constant interval, we can use an interpolation procedure involving divided
differences – see Step 24.)
1 Newton’s forward difference formula
Consider the points x j , x j + h, x j + 2h, . . ., and recall that

E f j = f j+1 = f (x j + h), E θ f j = f j+θ = f (x j + θ h)

where θ is any real number. Formally, one has (since 1 = E − 1)

f (x j + θ h) = E θ f j
= (1 + 1)θ f j
θ (θ − 1) 2 θ (θ − 1)(θ − 2) 3
 
= 1 + θ1 + 1 + 1 + · · · fj
2! 3!

which is Newton’s forward difference formula. The linear and quadratic (for-
ward) interpolation formulae correspond to truncation at first and second order,
respectively. If we truncate at n-th order, we obtain

θ(θ − 1) 2 θ(θ − 1) · · · (θ − n + 1) n
 
f (x j + θ h) ≈ 1 + θ1 + 1 + ··· + 1 fj
2! n!

which is an approximation based on the values f j , f j+1 , . . . , f j+n . It will be exact


if (within round-off error)

1n+k f j = 0, k = 1, 2, . . .

which is the case if f is a polynomial of degree n.

2 Newton’s backward difference formula


Formally, one has (since ∇ = 1 − E^(−1))

   f (x_j + θh) = E^θ f_j
                = (1 − ∇)^(−θ) f_j
                = [ 1 + θ∇ + (θ(θ + 1)/2!) ∇² + (θ(θ + 1)(θ + 2)/3!) ∇³ + · · · ] f_j

which is Newton's backward difference formula. The linear and quadratic (back-
ward) interpolation formulae correspond to truncation at first and second order,
respectively. The approximation based on the values f_(j−n), f_(j−n+1), . . . , f_(j−1), f_j
is

   f (x_j + θh) ≈ [ 1 + θ∇ + (θ(θ + 1)/2!) ∇² + · · · + (θ(θ + 1) · · · (θ + n − 1)/n!) ∇^n ] f_j
2! n!

3 Use of Newton’s interpolation formulae


The Newton forward and backward difference formulae are well suited for use at
the beginning and end of a difference table, respectively. (Other formulae that
make use of central differences may be more convenient elsewhere.)
As an example, consider the following difference table of f (x) = sin x for
x = 0◦ (10◦ ) 50◦ .

   x°    f (x) = sin x    Δ       Δ²      Δ³     Δ⁴     Δ⁵
   0     0
                          1736
   10    0.1736                   −52
                          1684            −52
   20    0.3420                  −104              4
                          1580            −48             0
   30    0.5000                  −152              4
                          1428            −44
   40    0.6428                  −196
                          1232
   50    0.7660

Since constant differences occur at fourth order, we conclude that a quartic ap-
proximation is appropriate. (Third-order differences are not quite constant within
expected round-off, and we anticipate that a cubic approximation is not quite good
enough.) To determine sin 5◦ from the table, we use Newton’s forward difference
formula (to fourth order); thus taking x_j = 0, we have θ = (5 − 0)/10 = ½ (h = 10),
and

   sin 5° ≈ sin 0° + ½(0.1736) + ½ · ½(−½)(−0.0052) + (1/6) · ½(−½)(−3/2)(−0.0052)
            + (1/24) · ½(−½)(−3/2)(−5/2)(0.0004)
          = 0 + 0.0868 + 0.0006(5) − 0.0003(3) − 0.0000(2)
          = 0.0871   (compare with the actual value 0.0872 to 4D)

Note that we have kept a guard digit (in parentheses) to minimize accumulated
round-off error.
To determine sin 45° from the table, we use Newton's backward difference
formula (to fourth order); thus taking x_j = 40, we have θ = (45 − 40)/10 = ½, and

   sin 45° ≈ sin 40° + ½(0.1428) + ½ · ½(3/2)(−0.0152) + (1/6) · ½(3/2)(5/2)(−0.0048)
             + (1/24) · ½(3/2)(5/2)(7/2)(0.0004)
           = 0.6428 + 0.0714 − 0.0057 − 0.0015 + 0.0001(1)
           = 0.7071   (compare with the actual value 0.7071 to 4D)
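The truncated series is conveniently evaluated by accumulating the coefficient
θ(θ − 1) · · · (θ − k + 1)/k! term by term. The Python sketch below (illustrative,
not from the book) reproduces the sin 5° estimate from the table above.

    # Sketch: Newton's forward difference formula evaluated term by term.

    def forward_differences(values):
        """Leading forward differences f_0, Δf_0, Δ²f_0, ..."""
        diffs, col = [values[0]], list(values)
        while len(col) > 1:
            col = [b - a for a, b in zip(col, col[1:])]
            diffs.append(col[0])
        return diffs

    def newton_forward(values, theta):
        diffs = forward_differences(values)
        total, coeff = 0.0, 1.0
        for k, d in enumerate(diffs):
            total += coeff * d
            coeff *= (theta - k) / (k + 1)     # next binomial-type coefficient
        return total

    # sin x tabulated at x = 0 (10) 50 degrees (4D); estimate sin 5°, theta = 0.5
    sines = [0.0, 0.1736, 0.3420, 0.5000, 0.6428, 0.7660]
    print(round(newton_forward(sines, 0.5), 4))   # 0.0871 (true value 0.0872 to 4D)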

4 Uniqueness of the interpolating polynomial


Given a set of values f (x_0), f (x_1), . . . , f (x_n) with x_j = x_0 + jh, we have two
interpolation formulae of order n available:

   f (x) ≈ P_n(x)
         = [ 1 + θΔ + (θ(θ − 1)/2!) Δ² + · · · + (θ(θ − 1) · · · (θ − n + 1)/n!) Δ^n ] f_0

and

   f (x) ≈ Q_n(x)
         = [ 1 + φ∇ + (φ(φ + 1)/2!) ∇² + · · · + (φ(φ + 1) · · · (φ + n − 1)/n!) ∇^n ] f_n

where θ = (x − x_0)/h and φ = (x − x_n)/h.
Clearly Pn and Q n are both polynomials of degree n. It can be verified (see
Exercise 2 at the end of this Step) that Pn (x j ) = Q n (x j ) = f (x j ) for j =
0, 1, 2, . . . , n, which implies that Pn − Q n is a polynomial of degree n which
vanishes at (n + 1) points. This in turn implies that Pn − Q n ≡ 0, or Pn ≡ Q n . In
fact a polynomial of degree n through any given (n+1) (distinct but not necessarily
equidistant) points is unique, and is called the interpolating polynomial.

5 Analogy with Taylor series


If we define for integer k

   D^k f_j ≡ (d^k f / dx^k) evaluated at x = x_j

the Taylor series (see page 18) about x_j becomes

   f (x) = f_j + (x − x_j) D f_j + ((x − x_j)²/2!) D² f_j + · · ·

Setting x = x_j + θh, we have formally

   f (x_j + θh) = f_j + θh D f_j + (θ²h²/2!) D² f_j + · · ·
                = [ 1 + θh D + (θ²h²/2!) D² + · · · ] f_j
                = e^(θhD) f_j

Comparison with the Newton interpolation form

   f (x_j + θh) = E^θ f_j

shows that the operator e^(hD) (on functions of a continuous variable) is analogous


to the operator E (on functions of a discrete variable).

Checkpoint

1. What is the relationship between the forward and backward linear


and quadratic interpolation formulae (for a table of constant interval
h) and Newton’s interpolation formulae?
2. When is the Newton forward difference formula convenient to use?
3. When is the Newton backward difference formula convenient to
use?

EXERCISES

1. From a difference table of f (x) = e^x (using five decimal places) for x =
   0.10 (0.05) 0.40, estimate
   (a) e^0.14 using Newton's forward difference formula; and
   (b) e^0.315 using Newton's backward difference formula.

2. Show that for j = 0, 1, 2, . . ., we have

      f_j = f (x_0 + jh) = Σ(m=0 to j) [ j! / (m!( j − m)!) ] Δ^m f (x_0)

          = [ 1 + jΔ + ( j( j − 1)/2) Δ² + · · · + Δ^j ] f (x_0)
3. Derive the equation of the interpolating polynomial for the following data.
x f (x) x f (x)
0 3 3 24
1 2 4 59
2 7 5 118
STEP 23

INTERPOLATION 3
Lagrange interpolation formula

The linear and quadratic interpolation formulae of Step 21 correspond to first and
second degree polynomial approximation, respectively. In Step 22, we discussed
the Newton forward and backward interpolation formulae and noted that higher
order interpolation corresponds to higher degree polynomial approximation. In
this Step we consider an interpolation formula attributed to Lagrange, which does
not require function values at equal intervals of the argument. The Lagrange
interpolation formula has the disadvantage that the degree of the approximating
polynomial must be chosen at the outset, and in the next Step we shall discuss
another approach. Thus the Lagrange formula is mainly of theoretical interest
for us here, but in passing we mention that there are some important applications
beyond the scope of this book – for example, the construction of basis functions
to solve differential equations using a spectral (‘discrete ordinate’) method.

1 Procedure
Suppose that the function f is tabulated at (n + 1) (not necessarily equidistant)
points {x_0, x_1, . . . , x_n} and is to be approximated by a polynomial

   P_n(x) = a_n x^n + a_(n−1) x^(n−1) + · · · + a_1 x + a_0

of degree at most n, such that

   f_j = f (x_j) = P_n(x_j)   for j = 0, 1, 2, . . . , n

Now, for k = 0, 1, 2, . . . , n,

   L_k(x) = [ (x − x_0)(x − x_1) · · · (x − x_(k−1))(x − x_(k+1)) · · · (x − x_n) ] /
            [ (x_k − x_0)(x_k − x_1) · · · (x_k − x_(k−1))(x_k − x_(k+1)) · · · (x_k − x_n) ]

is a polynomial of degree n which satisfies

   L_k(x_j) = 0,  j ≠ k,  j = 0, 1, 2, . . . , n,   and   L_k(x_k) = 1

Hence

   P_n(x) = Σ(k=0 to n) L_k(x) f_k

is a polynomial of degree (at most) n such that

   P_n(x_j) = f_j,   j = 0, 1, 2, . . . , n

that is, it is the (unique) interpolating polynomial. Note that for x = x_j all terms
in the sum vanish except the j-th, which is f_j; L_k(x) is called the k-th Lagrange
interpolation coefficient, and the identity

   Σ(k=0 to n) L_k(x) = 1

(established by setting f ≡ 1) may be used as a check. Note also that with n = 1
we recover the linear interpolation formula

   P_1(x) = [(x − x_1)/(x_0 − x_1)] f_0 + [(x − x_0)/(x_1 − x_0)] f_1
          = f_0 + [(x − x_0)/(x_1 − x_0)]( f_1 − f_0)

of Step 21.

2 Example
We use the Lagrange interpolation formula to find the interpolating polynomial
P_3 through the points (0, 3), (1, 2), (2, 7), and (4, 59), and then approximate f (3)
by P_3(3).
The Lagrange coefficients are

   L_0(x) = (x − 1)(x − 2)(x − 4) / [(0 − 1)(0 − 2)(0 − 4)] = −(1/8)(x³ − 7x² + 14x − 8)

   L_1(x) = (x − 0)(x − 2)(x − 4) / [(1 − 0)(1 − 2)(1 − 4)] = (1/3)(x³ − 6x² + 8x)

   L_2(x) = (x − 0)(x − 1)(x − 4) / [(2 − 0)(2 − 1)(2 − 4)] = −(1/4)(x³ − 5x² + 4x)

   L_3(x) = (x − 0)(x − 1)(x − 2) / [(4 − 0)(4 − 1)(4 − 2)] = (1/24)(x³ − 3x² + 2x)

(The student can verify that L_0(x) + L_1(x) + L_2(x) + L_3(x) = 1.) Hence, the
required polynomial is

   P_3(x) = −(3/8)(x³ − 7x² + 14x − 8) + (2/3)(x³ − 6x² + 8x)
            − (7/4)(x³ − 5x² + 4x) + (59/24)(x³ − 3x² + 2x)
          = (1/24)(−9x³ + 63x² − 126x + 72 + 16x³ − 96x² + 128x
            − 42x³ + 210x² − 168x + 59x³ − 177x² + 118x)
          = (1/24)(24x³ + 0x² − 48x + 72)
          = x³ − 2x + 3

Consequently, f (3) ≈ P_3(3) = 27 − 6 + 3 = 24. However, note that if the
explicit form of the interpolating polynomial was not required, one would proceed
to evaluate P_3(x) for some x directly from the factored forms of L_k(x). Thus, to
evaluate P_3(3), one has

   L_0(3) = (3 − 1)(3 − 2)(3 − 4) / [(0 − 1)(0 − 2)(0 − 4)] = 1/4,   etc.
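That direct evaluation is simple to program. The Python sketch below
(illustrative names) evaluates the Lagrange form at a point from the factored
coefficients, reproducing the value found above.

    # Sketch: evaluate the Lagrange interpolating polynomial at x directly
    # from the factored forms of the coefficients L_k(x).

    def lagrange(points, x):
        total = 0.0
        for k, (xk, fk) in enumerate(points):
            Lk = 1.0
            for j, (xj, _) in enumerate(points):
                if j != k:
                    Lk *= (x - xj) / (xk - xj)
            total += Lk * fk
        return total

    points = [(0, 3), (1, 2), (2, 7), (4, 59)]
    print(lagrange(points, 3))     # 24.0, in agreement with P_3(3) above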

3 Notes of caution
In the case of the Newton interpolation formulae considered in the previous Step,
or the formulae to be discussed in the next Step, the degree of the required
approximating polynomial may be determined merely by computing terms until
they no longer appear significant. In the Lagrange procedure, the polynomial
degree must be chosen at the outset. Also, note that
(a) a change of degree involves a completely new computation of all terms; and
(b) for a polynomial of high degree the process involves a large number of
multiplications and therefore may be quite slow.
Lagrange interpolation should be used with considerable caution. For example,
suppose we use Lagrange interpolation to estimate ∛20 from the points (0, 0),
(1, 1), (8, 2), (27, 3), and (64, 4) on f (x) = ∛x. We have

   f (x) ≈ [ x(x − 8)(x − 27)(x − 64) / (1(−7)(−26)(−63)) ] × 1
         + [ x(x − 1)(x − 27)(x − 64) / (8(7)(−19)(−56)) ] × 2
         + [ x(x − 1)(x − 8)(x − 64) / (27(26)(19)(−37)) ] × 3
         + [ x(x − 1)(x − 8)(x − 27) / (64(63)(56)(37)) ] × 4

so that f (20) ≈ −1.3139, which is not very close to the correct value 2.7144! A
better result (2.6316) can be obtained by linear interpolation between (8, 2) and
(27, 3). The problem is that the Lagrange method gives no indication as to how
well f (x) = ∛x is represented by a quartic. In practice, therefore, Lagrange
interpolation is used only rarely.

Checkpoint

1. When is the Lagrange interpolation formula used in practical com-


putation?
2. What distinguishes the Lagrange formula from many other interpol-
ation formulae?
3. Why should the Lagrange formula be used in practice only with
caution?

EXERCISE

Given that f (−2) = 46, f (−1) = 4, f (1) = 4, f (3) = 156, and f (4) =
484, use the Lagrange interpolation formula to estimate f (0).
STEP 24

INTERPOLATION 4*
Divided differences*

We noted that the Lagrange interpolation formula is mainly of theoretical interest,


for at best it involves very considerable computation in practice, and it can be
quite dangerous to use. It is much more efficient to use divided differences to
interpolate a tabulated function (especially if the arguments are unequally spaced),
and at the same time it is relatively safe since the necessary degree of interpolating
polynomial can be decided. An allied procedure due to Aitken is also commonly
adopted in practice.

1 Divided differences
Again, suppose the function f is tabulated at the (not necessarily equidistant)
points {x_0, x_1, . . . , x_n}. We define the divided differences between points thus:
first divided difference (say, between x_0 and x_1) by

   f (x_0, x_1) = ( f (x_1) − f (x_0))/(x_1 − x_0) = ( f_1 − f_0)/(x_1 − x_0) = f (x_1, x_0)

second divided difference (say, between x_0, x_1, and x_2) by

   f (x_0, x_1, x_2) = ( f (x_1, x_2) − f (x_0, x_1))/(x_2 − x_0)

and so on to the n-th divided difference (between x_0, x_1, . . . , x_n)

   f (x_0, x_1, . . . , x_n) = ( f (x_1, x_2, . . . , x_n) − f (x_0, x_1, . . . , x_(n−1)))/(x_n − x_0)

As an example, we construct a divided difference table from the following data:

   x       0    1    3     6    10
   f (x)   1   −6    4   169   921

The divided difference table is as follows:

   x      f (x)
   0        1
                   −7
   1       −6               4
                    5               1
   3        4              10               0
                   55               1
   6      169              19
                  188
   10     921

It is notable that the third divided differences are constant. In Section 3 we


shall interpolate from the table by using Newton’s divided difference formula, and
determine the corresponding interpolating cubic.

2 Newton’s divided difference formula


From the definitions of divided differences, we have

f (x) = f (x0 ) + (x − x0 ) f (x, x0 )


f (x, x0 ) = f (x0 , x1 ) + (x − x1 ) f (x, x0 , x1 )
f (x, x0 , x1 ) = f (x0 , x1 , x2 ) + (x − x2 ) f (x, x0 , x1 , x2 )
.. .. ..
. . .
f (x, x0 , . . . , xn−1 ) = f (x0 , x1 , . . . , xn ) + (x − xn ) f (x, x0 , . . . , xn )

Multiplying the second equation by (x − x0 ), the third by (x − x0 )(x − x1 ), etc.,


and adding yields the Newton divided difference formula† given by
f (x) = f (x0 ) + (x − x0 ) f (x0 , x1 ) + (x − x0 )(x − x1 ) f (x0 , x1 , x2 )
+ · · · + (x − x0 ) · · · (x − xn−1 ) f (x0 , x1 , . . . , xn ) + R
= Pn (x) + R
where R = (x − x0 )(x − x1 ) · · · (x − xn ) f (x, x0 , x1 , . . . , xn )
Note that the remainder term R is zero at x0 , x1 , . . . , xn , and we may infer that
the other terms on the right-hand side constitute the interpolating polynomial or,
equivalently, the Lagrange polynomial. If the degree of interpolating polynomial
necessary is not known in advance, it is customary to order the points x0 , x1 , . . . , xn
according to increasing distance from x and add terms until R is small enough.

3 Example
From the tabulated function in Section 1 of this Step, we estimate f (2) by Newton’s
divided difference formula and find the corresponding interpolating polynomial.
The same is done for f (4).
Since the third divided difference is constant, we can fit a cubic through the five
points. By Newton’s divided difference formula, using x0 = 0, x1 = 1, x2 = 3,
x3 = 6, the interpolation cubic is

P3 (x) = f (0) + x f (0, 1) + x(x − 1) f (0, 1, 3) + x(x − 1)(x − 3) f (0, 1, 3, 6)


= 1 − 7x + 4x(x − 1) + x(x − 1)(x − 3)

so that
f (2) ≈ P3 (2) = 1 − 14 + 8 − 2 = −7
† Thisformula is suitable for implementation on a computer. Pseudo-code for study and use in
programming may be found on page 169.

The interpolating polynomial is obviously

   1 − 7x + 4x² − 4x + x³ − 4x² + 3x = x³ − 8x + 1

To estimate f (4), let us identify x_0 = 1, x_1 = 3, x_2 = 6, x_3 = 10, so that

   P_3(x) = −6 + 5(x − 1) + 10(x − 1)(x − 3) + (x − 1)(x − 3)(x − 6)

and
   f (4) ≈ P_3(4) = −6 + 15 + 30 − 6 = 33

As expected, the interpolating polynomial is the same cubic – namely, x³ − 8x + 1.
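The coefficients f (x_0), f (x_0, x_1), . . . along the top edge of the divided difference
table, and the evaluation of the Newton form, are easily computed; the Python
sketch below (illustrative, not the pseudo-code referred to in the footnote)
reproduces the estimates above.

    # Sketch: Newton's divided difference formula.

    def divided_difference_coefficients(xs, fs):
        coeffs = list(fs)
        n = len(xs)
        for order in range(1, n):
            for i in range(n - 1, order - 1, -1):
                coeffs[i] = (coeffs[i] - coeffs[i - 1]) / (xs[i] - xs[i - order])
        return coeffs          # coeffs[k] = f(x0, x1, ..., xk)

    def newton_eval(xs, coeffs, x):
        total, prod = 0.0, 1.0
        for k, c in enumerate(coeffs):
            total += c * prod
            prod *= (x - xs[k])
        return total

    xs = [0, 1, 3, 6, 10]
    fs = [1, -6, 4, 169, 921]
    coeffs = divided_difference_coefficients(xs, fs)
    print(coeffs)                        # [1, -7.0, 4.0, 1.0, 0.0]
    print(newton_eval(xs, coeffs, 2))    # -7.0
    print(newton_eval(xs, coeffs, 4))    # 33.0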

4 Error in interpolating polynomial

In Section 2 we saw that the error in the interpolating polynomial of degree n was
given by

R = (x − x0)(x − x1) ··· (x − xn) f(x, x0, x1, ..., xn)

As it stands, this expression is not very useful because of the unknown quantity
f(x, x0, x1, ..., xn). However, it may be shown (for example, see Conte and de
Boor (1980)) that if a = min(x0, x1, ..., xn), b = max(x0, x1, ..., xn), and f is
(n + 1)-times differentiable on (a, b), then there exists a ξ ∈ (a, b) such that

f(x, x0, x1, ..., xn) = f^(n+1)(ξ) / (n + 1)!

from which it follows that

f(x) − Pn(x) = [f^(n+1)(ξ) / (n + 1)!] (x − x0)(x − x1) ··· (x − xn)

This formula may be useful when we know the function giving the data and we
wish to find lower and upper bounds on the error.
As an example, to 6D we have sin 0 = 0, sin(0.2) = 0.198669, and sin(0.4) =
0.389418 (where the arguments to the sine function are in radians). Then we can
form the following divided difference table.

x      sin x
0      0
                  0.993345
0.2    0.198669              −0.099000
                  0.953745
0.4    0.389418

Thus the quadratic approximation to sin(0.1) is

0 + 0.993345 × (0.1 − 0) − 0.099000 × (0.1 − 0) × (0.1 − 0.2) = 0.100325

Since n = 2, the magnitude of the error in the approximation is given by

| f‴(ξ)/3! × (0.1 − 0)(0.1 − 0.2)(0.1 − 0.4) | = | f‴(ξ) | / 2000

where ξ lies between 0 and 0.4. For f(x) = sin x, we have f‴(x) = −cos x so
that cos(0.4) ≤ | f‴(ξ) | ≤ cos 0. It then follows that

0.000461 = cos(0.4)/2000 ≤ | sin(0.1) − 0.100325 | ≤ cos 0/2000 = 0.000500

The actual error has magnitude 0.000492, which is within these bounds.
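
A few lines of Python (illustrative only, and using full-precision sine values rather
than the 6D table) confirm both the quadratic estimate and the error bounds:

import math

x0, x1, x2 = 0.0, 0.2, 0.4
f = [math.sin(x0), math.sin(x1), math.sin(x2)]

# first and second divided differences
d01 = (f[1] - f[0]) / (x1 - x0)
d12 = (f[2] - f[1]) / (x2 - x1)
d012 = (d12 - d01) / (x2 - x0)

x = 0.1
p2 = f[0] + (x - x0) * d01 + (x - x0) * (x - x1) * d012   # about 0.100325

# error bound: |f'''(xi)|/3! * |(x - x0)(x - x1)(x - x2)|, with f'''(x) = -cos x
factor = abs((x - x0) * (x - x1) * (x - x2)) / 6.0         # equals 1/2000
lower, upper = math.cos(0.4) * factor, math.cos(0.0) * factor
actual = abs(math.sin(x) - p2)
print(p2, lower, actual, upper)
# the bounds 0.000461 and 0.000500 bracket the actual error, about 0.00049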

5 Aitken's method

In practice, a procedure due to Aitken is often adopted, in which successively better
interpolation polynomials (corresponding to successively higher order truncation
of Newton's divided difference formula) are determined systematically. Thus, one
has

f(x) ≈ f0 + (x − x0)(f1 − f0)/(x1 − x0)
     = [f0(x1 − x) − f1(x0 − x)]/(x1 − x0) ≡ I0,1(x)

and obviously

f0 = I0,1(x0),   f1 = I0,1(x1)

Next, since f(x0, x1, x2) = f(x1, x0, x2) = (f(x0, x2) − f(x1, x0))/(x2 − x1),
one has

f(x) ≈ f0 + (x − x0) f(x0, x1) + (x − x0)(x − x1)(f(x0, x2) − f(x1, x0))/(x2 − x1)
     = [I0,1(x)(x2 − x) − I0,2(x)(x1 − x)]/(x2 − x1) ≡ I0,1,2(x)

noting that

I0,2(x) = f0 + (x − x0) f(x0, x2)

and so on. In passing, one may note that

f0 = I0,1,2(x0),   f1 = I0,1,2(x1),   f2 = I0,1,2(x2)

At first sight, the procedure may look complicated, but it is systematic, and
therefore computationally straightforward: it may be represented by the scheme

x0   f0                                            x0 − x
x1   f1   I0,1(x)                                  x1 − x
x2   f2   I0,2(x)   I0,1,2(x)                      x2 − x
x3   f3   I0,3(x)   I0,1,3(x)   I0,1,2,3(x)        x3 − x
..   ..   ..        ..          ..                 ..

One major advantage is that the accuracy may be gauged by comparing suc-
cessive steps. (This of course corresponds to gauging the appropriate truncation
of the Newton divided difference formula.) As in the case of Newton’s div-
ided difference formula, usually the points x0 , x1 , x2 , . . . are ordered such that
x0 − x, x1 − x, x2 − x, . . . form a sequence with increasing magnitude. Finally,
we remark that although the derivation of Aitken’s method emphasizes its rela-
tionship with the Newton formula, it is notable that Aitken’s method ultimately
does not involve divided differences at all!
As an example, we estimate f (2) by Aitken’s method from the tabulated func-
tion given in Section 1 of this Step.
We have x = 2, so that we choose x0 = 1, x1 = 3, x2 = 0, x3 = 6, and
x4 = 10: thus the scheme yields

k xk fk xk − x
0 1 −6 −1
1 3 4 −1 +1
2 0 1 −13 −5 −2
3 6 169 29 −11 −7 +4
4 10 921 97 −15 −7 −7 +8

The computation proceeds from the left, row by row, with an appropriately divided
‘cross multiplication’ of the respective entries with those in the (xk − x) column
on the right: thus,

I0,1 = [(−6)(+1) − (+4)(−1)]/(3 − 1) = −1,
I0,2 = [(−6)(−2) − (+1)(−1)]/(0 − 1) = −13,
I0,1,2 = [(−1)(−2) − (−13)(+1)]/(0 − 3) = −5, etc.

The entry −7 appears twice successively along the diagonal of the scheme, so
one may conclude that f(2) ≈ −7.
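
Aitken's scheme is equally easy to program. The following Python sketch is
illustrative only (the stopping test simply compares successive diagonal entries);
it reproduces f(2) ≈ −7 with the ordering used above:

def aitken(xs, fs, x, tol=1e-9):
    """Aitken's iterated linear interpolation at x."""
    n = len(xs)
    I = [list(fs)]                           # I[m][k] interpolates on x0..x(m-1), xk
    for m in range(1, n):
        prev, row = I[m - 1], []
        for k in range(m, n):
            num = prev[m - 1] * (xs[k] - x) - prev[k] * (xs[m - 1] - x)
            row.append(num / (xs[k] - xs[m - 1]))
        I.append([None] * m + row)
        if abs(I[m][m] - I[m - 1][m - 1]) < tol:   # successive diagonal entries agree
            return I[m][m]
    return I[n - 1][n - 1]

xs = [1, 3, 0, 6, 10]            # ordered by increasing |xk - 2|
fs = [-6, 4, 1, 169, 921]
print(aitken(xs, fs, 2.0))       # -7.0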
Checkpoint

1. What major practical advantage has Newton's divided difference
   interpolation formula over Lagrange's formula?
2. How are the tabular points usually ordered for interpolation by
   Newton's divided difference formula or Aitken's method?
3. Are divided differences actually used in interpolation by Aitken's
   method?
EXERCISES

1. Use Newton's divided difference formula to show that an interpolation for
   ∛20 from the points (0, 0), (1, 1), (8, 2), (27, 3), (64, 4) on f(x) = ∛x is
   quite invalid.
2. Given that e0 = 1, e0.5 = 1.64872, and e1 = 2.71828, use Newton’s divided
difference formula to estimate e0.25 . Find lower and upper bounds on the
magnitude of the error and verify that the actual magnitude is within the
calculated bounds.
3. Given that f (−2) = 46, f (−1) = 4, f (1) = 4, f (3) = 156, and f (4) =
484, estimate f (0) from
(a) Newton’s divided difference formula, and
(b) Aitken’s method.
Comment on the validity of this interpolation.
4. Given that f (0) = 2.3913, f (1) = 2.3919, f (3) = 2.3938, and f (4) =
2.3951, use Aitken’s method to estimate f (2).
STEP 25

INTERPOLATION 5*
Inverse interpolation*

Rather than the value of a function f (x) for a certain x, one might seek the value of
x corresponding to a given value of f (x); this is called inverse interpolation. For
example, perhaps the reader may have contemplated the possibility of obtaining
roots of f (x) = 0 by inverse interpolation.

1 Linear inverse interpolation


An obvious elementary procedure is to tabulate the function in the neighbourhood
of the given value at an interval so small that linear inverse interpolation may be
used. Reference to Step 21 provides

x = x j + θ (x j+1 − x j )

where
f (x) − f j
θ≈
f j+1 − f j
in the linear approximation. (Note that if f (x) = 0, we recover the method of
false position – see Step 8).
For example, from a 4D table of f (x) = e−x one has f (0.91) = 0.4025,
f (0.92) = 0.3985 so that f (x) = 0.4 corresponds to
0.4 − 0.4025
x ≈ 0.91 + × (0.92 − 0.91)
0.3985 − 0.4025
= 0.91 + 0.00625 = 0.91625

An immediate check is to use (direct) interpolation to recover f (x) = 0.4. Thus,


0.91625 − 0.91
f (0.91625) ≈ 0.4025 + × (0.3985 − 0.4025)
0.92 − 0.91
= 0.4000

2 Iterative inverse interpolation


As no doubt the reader may appreciate, it may be preferable to adopt (at least
approximately) an interpolating polynomial of degree greater than one, rather than
seek to tabulate at a small enough interval to permit linear inverse interpolation.
The degree of the approximating polynomial may be decided implicitly by an
iterative (successive approximation) method.
For example, Newton's forward difference formula may be rearranged as

θ = [f(x) − fj − ½θ(θ − 1)Δ²fj − ···] / Δfj

Since terms involving second and higher differences may be expected to decrease
fairly quickly, we have successive approximations {θs} to θ given by

θ1 = [f(x) − fj] / Δfj
θ2 = [f(x) − fj − ½θ1(θ1 − 1)Δ²fj] / Δfj, etc.

Similar iterative procedures may be based on other interpolation formulae, such
as the Newton backward difference formulae.
To illustrate, consider the table of f(x) = sin x given on page 97 for x =
0°(10°)50°, and suppose we seek x for which f(x) = 0.2. Clearly, 10° < x <
20°. From Newton's formula,

θ1 = (0.2 − 0.1736)/0.1684 = 0.0264/0.1684 = 0.1568 ≈ 0.16
θ2 = [0.0264 − ½(0.16)(−0.84)(−0.0104)]/0.1684
   = [0.0264 − 0.0007]/0.1684 = 0.1526 ≈ 0.153
θ3 = [0.0264 − ½(0.153)(−0.847)(−0.0104)
      − ⅙(0.153)(−0.847)(−1.847)(−0.0048)]/0.1684
   = [0.0264 − 0.0006(7) + 0.0001(9)]/0.1684 = 0.1539

(Note that it is unnecessary to carry many digits in the first estimates of θ.)
Consequently,

θ = 0.1539 = (x − 10)/10

which yields x = 11.539°.
Checking, either by the usual method of direct interpolation or in this case
directly, yields sin(11.539°) = 0.2000.
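
The successive-approximation scheme can be programmed directly once the
forward differences of the table are formed. The Python sketch below is
illustrative only; it types in the 4D values of sin x at 0°(10°)50° rather than
reading the table on page 97:

# Solve f(x) = 0.2 by iterating theta = [f* - f_j - C2 - C3]/(delta f_j),
# working from the tabular point x_j = 10 degrees.
fvals = [0.0000, 0.1736, 0.3420, 0.5000, 0.6428, 0.7660]   # sin 0..50 deg, 4D

# forward differences based at x_j = 10 degrees (index 1)
d1 = fvals[2] - fvals[1]                                    #  0.1684
d2 = fvals[3] - 2 * fvals[2] + fvals[1]                     # -0.0104
d3 = fvals[4] - 3 * fvals[3] + 3 * fvals[2] - fvals[1]      # -0.0048

target, fj = 0.2, fvals[1]
theta = (target - fj) / d1                                   # first estimate
for _ in range(5):
    corr2 = 0.5 * theta * (theta - 1) * d2
    corr3 = theta * (theta - 1) * (theta - 2) * d3 / 6
    theta = (target - fj - corr2 - corr3) / d1

print(theta, 10 + 10 * theta)     # about 0.1539 and 11.54 degrees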

3 Divided differences
Since divided differences are suitable for interpolation with tabular values that
are unequally spaced, they may be used for inverse interpolation. Let us again
consider the function f (x) = sin x for x = 10◦ (10◦ ) 50◦ , and determine x for
which f (x) = 0.2. Ordering with increasing distance from f (x) = 0.2, one has
the divided difference table (entries multiplied by 100):
f (x) x
0.1736 10
5938
0.3420 20 518
5848 1360
0.0000 0 962 1338
6000 1988 3486
0.5000 30 1560 3403
7003 3431
0.6428 40 4188
8117
0.7660 50
Consequently,

x = 10 + (0.2 − 0.1736)59.38
+ (0.2 − 0.1736)(0.2 − 0.3420)5.18
+ (0.2 − 0.1736)(0.2 − 0.3420)(0.2 − 0)13.60
+ (0.2 − 0.1736)(0.2 − 0.3420)(0.2 − 0)(0.2 − 0.5)13.38
+ (0.2 − 0.1736)(0.2 − 0.3420)(0.2 − 0)(0.2 − 0.5)
×(0.2 − 0.6428)34.86
= 10 + 1.5676 − 0.0194 − 0.0102 + 0.0030 − 0.0035
= 11.5375
Alternatively, the Aitken scheme could be used. With either alternative, however,
it is noticeable that any advantage in accuracy compared with iterative inverse
interpolation may not justify the additional computational demand.

Checkpoint

1. Why may linear inverse interpolation be either tedious or impractical?
2. What is the usual method for checking inverse interpolation?
3. What potential advantage has inverse interpolation, using either divided
   differences or the Aitken scheme, compared with the iterative method?
   What is a likely disadvantage?

EXERCISES

1. Use linear inverse interpolation to find the root of x + cos x = 0 correct to
   4D.
2. Solve 3xeˣ = 1 to 3D.
3. A table of values for a cubic f is given by:


x f (x) x f (x)
2 3.0671 8 24.2573
3 6.4088 9 28.0592
4 9.8257 10 31.9399
5 13.3184 11 35.9000
6 16.8875 12 39.9401
7 20.5336 13 44.0608
Without knowledge of the explicit form of f , find x for which f (x) = 10, 20,
and 40, respectively. Check your answers by (direct) interpolation. Finally,
obtain the equation of the cubic, and use it to check your answers again.
STEP 26

CURVE FITTING 1
Least squares

Scientists and social scientists often wish to fit a smooth curve to some experi-
mental data. Given (n + 1) points an obvious approach is to use the interpolating
polynomial of degree n, but when n is large this is usually unsatisfactory. Better
results can be obtained by using piecewise polynomials – that is, fitting lower de-
gree polynomials through subsets of the data points. The use of spline functions,
which usually provide a particularly smooth fit, has become widespread (see Step
28).
A rather different but often quite suitable approach is the least squares fit in
which, rather than try to fit the points exactly, we find a polynomial of low degree
(often first or second) which fits the points closely (after all, the points themselves
are not generally exact, being subject to experimental error).

1 The problem illustrated


Suppose we are studying experimentally the relationship between two variables
x and y – for example, quantity x of drug injected and observed response y,
measured in a laboratory experiment. By carrying out the appropriate experiment
six times (say), we obtain six pairs of values (xi , yi ); these can be plotted on a
diagram such as Figure 11(a).
We may believe that the relationship between the variables can be described
satisfactorily by a function y = f (x), but that the y-values obtained experimen-
tally are subject to errors (or noise). The mathematical model of this situation is
as follows:
f(xi) = yi + εi,   i = 1, 2, ..., n

when there are n observed points. Here f(xi) is the value of y corresponding to
the value xi used in the experiment, and εi is the experimental error involved in
the measurement of the y-variable at the point. Thus the error in y at the observed
point is εi = f(xi) − yi.
The problem of curve fitting is to use the information of the sample data points to
determine a suitable curve (that is, find a suitable function f ) so that the equation
y = f (x) can be used as a description of the (x, y) relationship; the hope is that
predictions made from this equation will not be too much in error.
How is the function f to be chosen? There is an unlimited number of functions
from which to choose, and Figure 11(b) shows four possibilities. The polygon A
passes through all six points; intuitively, however, we might prefer to fit a straight
line such as B, or an exponential curve such as C. The curve D is clearly not a
good candidate for our model.
[FIGURE 11. Response to a drug: (a) the observed points (xi, yi); (b) fitted curves –
a polygon (A), a straight line (B), an exponential (C), and an ill-fitting curve (D).]

2 General approach to the problem


Let us begin to answer the question of which function to choose. Given a set
of values (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ) we shall pick a function which we can
specify completely except for the values of a set of k parameters c1 , c2 , . . . , ck ;
we denote this function y = f (x; c1 , c2 , . . . , ck ). We then choose values for the
parameters which will make the errors at the measured points (xi , yi ) as small as
possible. We shall suggest three ways by which the phrase ‘as small as possible’
can be given specific meaning.
Some examples of functions to use are:
(a) y(x) = c1 + c2 x + c3 x 2 + · · · + ck x k−1 (polynomial),
(b) y(x) = c1 sin ωx + c2 sin 2ωx + c3 sin 3ωx + · · · + ck sin kωx (combination
of sine functions),
(c) y(x) = c1 cos ωx +c2 cos 2ωx +c3 cos 3ωx +· · ·+ck cos kωx (combination
of cosine functions).
Examples (a), (b) and (c) are themselves examples of what may be termed the
general linear form:
(d) y(x) = c1 φ1 (x) + c2 φ2 (x) + c3 φ3 (x) + · · · + ck φk (x), where the functions
φ1 , φ2 , . . . , φk constitute a pre-selected set of functions.
In (a) the set of functions is {1, x, x 2 , . . . , x k−1 }; in (b) it is {sin ωx, sin 2ωx,
. . . , sin kωx}, with ω being a constant chosen to coincide with a periodicity in
the data; while in (c) the set is {cos ωx, cos 2ωx, . . . , cos kωx}. Other functions
commonly used in curve fitting are exponential functions, Bessel functions, Leg-
endre polynomials, and Chebyshev polynomials (see, for example, Burden and
Faires (1993)).
3 Errors ‘as small as possible’


We now present criteria which make precise the concept of choosing a function
to make measurement errors as small as possible. We suppose that the curve to
be fitted can be expressed in general linear form, with a known set of functions
{φ1 , φ2 , . . . , φk }.
The errors εi = y(xi) − yi at the n data points are as follows:

ε1 = c1 φ1(x1) + c2 φ2(x1) + ··· + ck φk(x1) − y1
ε2 = c1 φ1(x2) + c2 φ2(x2) + ··· + ck φk(x2) − y2
   ...
εn = c1 φ1(xn) + c2 φ2(xn) + ··· + ck φk(xn) − yn

If the number of data points is less than or equal to the number of parameters
(that is, n ≤ k), it is possible to find values for {c1, c2, ..., ck} which make all
the errors εi zero. If n < k there is an infinite number of solutions for {ci} which
make all the errors zero (and therefore an infinite number of curves with the given
form pass through all the experimental points); in this case the problem is not fully
determined – more information is needed to choose an appropriate curve.
If n > k, which in practice is usually the case, then it is not normally possible
to make all the errors zero by a choice of the {ci}. Three possible procedures are
as follows:
(a) choose a set {ci} which minimizes the total absolute error; that is, minimize
    the sum Σ_{i=1}^{n} |εi|;
(b) choose a set {ci} which minimizes the maximum absolute error; that is,
    minimize max_{i=1,...,n} |εi|;
(c) choose a set {ci} which minimizes the sum of squares of errors; that is,
    minimize S = Σ_{i=1}^{n} εi².
Procedures (a) and (b) are generally difficult to apply. Procedure (c) leads to
a linear system of equations to solve for the set {ci }; it is called the principle of
least squares, and is the one customarily used.

4 The least squares method and normal equations


In order to apply the principle of least squares it is necessary to use partial
differentiation, a calculus technique which may not be known to readers of this
text. For that reason a general description of the method will not be given until
the next Step (which is optional), but we outline it here and give examples to show
how it is used.
The sum of squared errors to be minimized is

S = Σ_{i=1}^{n} εi² = Σ_{i=1}^{n} [c1 φ1(xi) + c2 φ2(xi) + ··· + ck φk(xi) − yi]²
The n values of (xi , yi ) are the known measurements taken from n experiments.
When they are inserted in the right-hand side, S becomes an expression involving
only k unknowns, namely c1 , c2 , . . . , ck . In other words S may be regarded as a
function of the ci ; that is, S ≡ S(c1 , c2 , . . . , ck ). The problem is to choose that
set of values for {ci } which makes S a minimum.
A theorem in calculus tells us that (under certain conditions which are usually
satisfied in practice) the minimum of S occurs at the point where all the partial
derivatives

∂S/∂c1, ∂S/∂c2, ..., ∂S/∂ck

vanish. The partial derivative ∂S/∂c1 (for example) is the same as the differential
coefficient dS/dc1 with all the other ci held constant; for instance, if S = 3c1 + 5c2,
then

∂S/∂c1 = 3   and   ∂S/∂c2 = 5
Thus we have to solve the following system of k equations:

∂S/∂c1 = 0,   ∂S/∂c2 = 0,   ...,   ∂S/∂ck = 0
This system is a set of equations linear in the variables c1, c2, ..., ck, and they
are called the normal equations for the least squares approximation. One of
the numerical methods presented in Steps 11, 13, or 15 may be used to obtain
the required set {ci} which minimizes S. However, we remark that the normal
equations may be ill-conditioned, in which case it is better to invoke QR factorization
as outlined in the next (optional) Step, or to make use of orthogonal basis functions
– for example, see Conte and de Boor (1980).

5 Example
The following points were obtained in an experiment:
x 1 2 3 4 5 6
y 1 3 4 3 4 2
We shall plot the points on a diagram, and use the method of least squares to fit
(a) a straight line, and (b) a parabola through them.
(a) The plotted points are shown in Figure 12(a). To fit a straight line, we have to
find a function y = c1 + c2 x (that is, a first degree polynomial) which minimizes

S = Σ_{i=1}^{6} εi² = Σ_{i=1}^{6} [yi − c1 − c2 xi]²

Differentiating first with respect to c1 (keeping c2 constant) and then with respect
to c2 (keeping c1 constant), and setting the results equal to zero, gives the normal
equations:

∂S/∂c1 ≡ −2 Σ_{i=1}^{6} (yi − c1 − c2 xi) = 0
∂S/∂c2 ≡ −2 Σ_{i=1}^{6} xi (yi − c1 − c2 xi) = 0
We may divide both equations by −2, take the summation operations through the
brackets, and rearrange, to obtain:

Σ yi = 6 c1 + (Σ xi) c2
Σ xi yi = (Σ xi) c1 + (Σ xi²) c2

It is seen that to proceed to a solution we have to evaluate the four sums Σ xi,
Σ xi², Σ yi, Σ xi yi, and insert them in the last equations. We can arrange the
work in a table thus (the last three columns are for fitting the parabola and the
required sums are in the last row):

i    xi    yi    xi²    xi yi    xi² yi    xi³    xi⁴
1 1 1 1 1 1 1 1
2 2 3 4 6 12 8 16
3 3 4 9 12 36 27 81
4 4 3 16 12 48 64 256
5 5 4 25 20 100 125 625
6 6 2 36 12 72 216 1296
21 17 91 63 269 441 2275
The normal equations for fitting the straight line are therefore
)
17 = 6c1 + 21c2
63 = 21c1 + 91c2

The solution to 2D is c1 = 2.13 and c2 = 0.20. Consequently, the required line is

y = 2.13 + 0.20x

and is plotted in Figure 12(b).


(b) To fit a parabola we have to find the second degree polynomial

y = c1 + c2 x + c3 x 2

which minimizes

S = Σ_{i=1}^{6} εi² = Σ_{i=1}^{6} [yi − c1 − c2 xi − c3 xi²]²

Taking partial derivatives and proceeding as above we obtain the normal equations

Σ yi = 6 c1 + (Σ xi) c2 + (Σ xi²) c3
Σ xi yi = (Σ xi) c1 + (Σ xi²) c2 + (Σ xi³) c3
Σ xi² yi = (Σ xi²) c1 + (Σ xi³) c2 + (Σ xi⁴) c3
[FIGURE 12. Fitting a line and parabola by least squares: (a) the points; (b) the fitted
line y = 2.13 + 0.20x and parabola y = −1.20 + 2.70x − 0.36x².]
Inserting the values for the sums (see the above table) we obtain the system of
linear equations:

17 = 6c1 + 21c2 + 91c3
63 = 21c1 + 91c2 + 441c3
269 = 91c1 + 441c2 + 2275c3

The solution to 3D is c1 = −1.200, c2 = 2.700, and c3 = −0.357. The required
parabola is therefore (retaining 2D):

y = −1.20 + 2.70x − 0.36x²

It is also plotted in Figure 12(b). It is clear that the parabola is a better fit than the
straight line!
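
Assembling and solving the normal equations is routine on a computer. The
following numpy sketch is illustrative only (the basis {1, x, x²} is hard-coded);
it reproduces both fits:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([1, 3, 4, 3, 4, 2], dtype=float)

# design matrix for the basis {1, x, x^2}; drop the last column for a straight line
A = np.column_stack([np.ones_like(x), x, x * x])

# normal equations A^T A c = A^T y
c = np.linalg.solve(A.T @ A, A.T @ y)
print(np.round(c, 3))        # [-1.2    2.7   -0.357]

line = np.linalg.solve(A[:, :2].T @ A[:, :2], A[:, :2].T @ y)
print(np.round(line, 2))     # [2.13  0.2]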

Checkpoint

1. What is meant by ‘error’ at a point?


2. Give three criteria which may be applied to choose the set {ci }.
3. How are the normal equations obtained?

EXERCISES

1. For the example above (with data points shown in Figure 12(a)), compute
the value of S, the sum of squares of errors of points from (a) the fitted line,
and (b) the fitted parabola. Plot the points on graph paper, and fit a straight
line ‘by eye’ (that is, use a ruler to draw a line, guessing its best position).
Determine the value of S for this line and compare with the value for the least
squares line.
2. Fit a straight line by the least squares method to each of the following sets of
data:
(a) toughness x and percentage of nickel y in eight specimens of alloy steel.
toughness x 36 41 42 43 44 45 47 50
% nickel y 2.5 2.7 2.8 2.9 3.0 3.2 3.3 3.5
(b) aptitude test mark x given to six trainee salespeople, and their first-year
sales y in thousands of dollars.
aptitude test x 25 29 33 36 42 54
first-year sales y 42 45 50 48 73 90
For both sets, plot the points and draw the least squares line on a graph. Use
the lines to predict the % nickel of a specimen of steel whose toughness is
38, and the likely first-year sales of a trainee salesperson who obtains a mark
of 48 on the aptitude test.
3. Obtain the normal equations for fitting a third-degree polynomial y = c1 +
   c2x + c3x² + c4x³ to a set of n points. Show that they can be written in
   matrix form (all sums are from i = 1 to i = n):

   [ Σ yi     ]   [ n      Σ xi    Σ xi²   Σ xi³ ] [ c1 ]
   [ Σ xi yi  ] = [ Σ xi   Σ xi²   Σ xi³   Σ xi⁴ ] [ c2 ]
   [ Σ xi² yi ]   [ Σ xi²  Σ xi³   Σ xi⁴   Σ xi⁵ ] [ c3 ]
   [ Σ xi³ yi ]   [ Σ xi³  Σ xi⁴   Σ xi⁵   Σ xi⁶ ] [ c4 ]

   Deduce the matrix form of the normal equations for fitting a fourth-degree
   polynomial.
4. Fit a parabola by the least squares method to the points (0, 0), (1, 1), (2, 3),
(3, 3), and (4, 2). Find the value of S for this fit.
5. Find the normal equations that arise from fitting, by the least squares method,
an equation of the form y = c1 + c2 sin x to the set of points (0, 0), (π/6, 1),
(π/2, 3), and (5π/6, 2). Solve them for c1 and c2 .
STEP 27

CURVE FITTING 2*
Least squares and linear equations*

In this optional Step, we consider some advanced topics in systems of linear


equations that arise out of the discrete least squares problem discussed in the
previous Step.

1 Pseudo-inverse
Recall that we have data points (x1 , y1 ), . . . , (xn , yn ) and that we wish to find the
parameters c1, ..., ck for the basis functions φ1, ..., φk such that

S = Σ_{i=1}^{n} |εi|²

is minimized. Here the errors εi are given by

ε1 = c1 φ1(x1) + c2 φ2(x1) + ··· + ck φk(x1) − y1
ε2 = c1 φ1(x2) + c2 φ2(x2) + ··· + ck φk(x2) − y2
   ...
εn = c1 φ1(xn) + c2 φ2(xn) + ··· + ck φk(xn) − yn

Ideally, we would like S to be zero, corresponding to each of the εi being zero. If
c and y are the vectors

c = (c1, c2, ..., ck)ᵀ   and   y = (y1, y2, ..., yn)ᵀ

respectively, and A is the n × k matrix

[ φ1(x1)  φ2(x1)  ···  φk(x1) ]
[ φ1(x2)  φ2(x2)  ···  φk(x2) ]
[   ..      ..    ···    ..   ]
[ φ1(xn)  φ2(xn)  ···  φk(xn) ]

then requiring S to be zero is equivalent to requiring

Ac = y
If n > k we have an overdetermined system of linear equations, since the number
of equations is greater than the number of unknowns. As pointed out in the
previous Step, it is generally not possible to find a solution to this system, but we
can find the c1, ..., ck such that Ac is ‘close’ to y (in the least squares sense). If
there is a solution c* for the least squares problem, then we write

c* = A⁺y

The matrix A⁺ is called the pseudo-inverse or generalized inverse of A. We remark
that when n = k and A is invertible (that is, A has an inverse), then A⁺ = A⁻¹.

2 Normal equations

To minimize S, we need to minimize

Σ_{i=1}^{n} [c1 φ1(xi) + c2 φ2(xi) + ··· + ck φk(xi) − yi]²

If for j = 1, 2, ..., k we take the partial derivative with respect to cj and set it
equal to zero, then we get the normal equations

2 Σ_{i=1}^{n} [c1 φ1(xi) + c2 φ2(xi) + ··· + ck φk(xi) − yi] φj(xi) = 0

or equivalently

Σ_{i=1}^{n} [c1 φ1(xi) + c2 φ2(xi) + ··· + ck φk(xi)] φj(xi) = Σ_{i=1}^{n} φj(xi) yi

If M is a matrix (or vector) with (i, j)-th element mij, then recall from linear
algebra that its transpose, denoted by Mᵀ, is the matrix with (i, j)-th element
mji – that is, Mᵀ is obtained from M by swapping the rows and columns. For
example, if

    [ 1 2 3 ]
M = [ 4 5 6 ]
    [ 7 8 9 ]

then

     [ 1 4 7 ]
Mᵀ = [ 2 5 8 ]
     [ 3 6 9 ]

It is evident that the normal equations may be written as

AᵀAc = Aᵀy

where the matrix A has entries aij = φj(xi).
As an example, suppose k = 4 and φj(x) = x^(j−1). Then

    [ 1   x1   x1²   x1³ ]
    [ 1   x2   x2²   x2³ ]
A = [ ..  ..   ..    ..  ]
    [ 1   xn   xn²   xn³ ]

and

     [ 1     1     ···   1   ]
     [ x1    x2    ···   xn  ]
Aᵀ = [ x1²   x2²   ···   xn² ]
     [ x1³   x2³   ···   xn³ ]

so that

      [ n       Σ xi     Σ xi²    Σ xi³ ]
      [ Σ xi    Σ xi²    Σ xi³    Σ xi⁴ ]
AᵀA = [ Σ xi²   Σ xi³    Σ xi⁴    Σ xi⁵ ]
      [ Σ xi³   Σ xi⁴    Σ xi⁵    Σ xi⁶ ]

where all the sums have i running from 1 to n, and

      [ Σ yi     ]
      [ Σ xi yi  ]
Aᵀy = [ Σ xi² yi ]
      [ Σ xi³ yi ]

Then the system AᵀAc = Aᵀy is precisely the system given in Exercise 3 of the
previous Step.
If the matrix AᵀA is invertible, then there is a solution to the least squares
problem. In this situation we have c = (AᵀA)⁻¹Aᵀy, so that the pseudo-inverse
A⁺ is given by

A⁺ = (AᵀA)⁻¹Aᵀ

The case when AᵀA does not have an inverse is beyond the scope of this text. It
is in this situation that the importance of the pseudo-inverse becomes apparent.

3 QR factorization

If the matrix AᵀA has an inverse, then in principle the normal equations can be
solved to find the least squares solution. However, as we remarked in the previous
Step, the normal equations may be ill-conditioned. If so, then an alternative
technique for finding the least squares solution is to use a QR factorization of A.
In this technique A is expressed as

A = QR

where Q is an n × n orthogonal matrix and R is an n × k matrix whose first k rows
form an upper triangular matrix and whose last n − k rows contain only zeros.
(An orthogonal matrix M is a matrix whose inverse is equal to its transpose, that
is, M⁻¹ = Mᵀ.) A technique for finding a QR factorization of a matrix will be
given in the next section.
As an example, suppose

    [ 1  1   1 ]
    [ 1  2   4 ]
A = [ 1  3   9 ]
    [ 1  4  16 ]
    [ 1  5  25 ]
    [ 1  6  36 ]

Then we can write A = QR, where Q is given to 5D by

[ −0.40825  −0.59761   0.54554   0.02715  −0.09552  −0.41074 ]
[ −0.40825  −0.35857  −0.10911   0.16867   0.42678   0.69446 ]
[ −0.40825  −0.11952  −0.43644  −0.66002  −0.43546   0.05763 ]
[ −0.40825   0.11952  −0.43644   0.69655  −0.29913  −0.23220 ]
[ −0.40825   0.35857  −0.10911  −0.22346   0.67511  −0.43261 ]
[ −0.40825   0.59761   0.54554  −0.00889  −0.27178   0.32346 ]

and

    [ −2.44949  −8.57321  −37.15059 ]
    [  0.00000   4.18330   29.28310 ]
R = [  0.00000   0.00000    6.11010 ]
    [  0.00000   0.00000    0.00000 ]
    [  0.00000   0.00000    0.00000 ]
    [  0.00000   0.00000    0.00000 ]

It may be verified that QᵀQ is the 6 × 6 identity matrix.
Recalling that the overdetermined system of equations is Ac = y, we substitute
A = QR and multiply through by Qᵀ = Q⁻¹ to obtain

Rc = Qᵀy

Since the matrix R has the form

R = [ R1 ]
    [ O  ]

where O is an (n − k) × k matrix of zeros, we partition Qᵀy as

Qᵀy = [ q1 ]
      [ q2 ]

The vector q1 is k × 1 while q2 is (n − k) × 1, and it turns out that the solution of
the least squares problem is obtained by solving

R1 c = q1

Thus once we have a QR factorization of A, we can find the least squares approxi-
mation by solving an upper triangular system using back-substitution.
For example, suppose we wish to fit a parabola to the experimental data pre-
sented on page 117. The relevant matrix A with its QR factorization is given on
the previous page, and we see that

     [ −2.44949  −8.57321  −37.15059 ]
R1 = [  0.00000   4.18330   29.28310 ]
     [  0.00000   0.00000    6.11010 ]

Calculation of Qᵀy yields

     [ −6.94022 ]              [ −0.92889 ]
q1 = [  0.83666 ]   and   q2 = [  0.70246 ]
     [ −2.18218 ]              [  0.12305 ]

Upon solving R1c = q1, we find the same solution as before – namely c3 =
−0.357, c2 = 2.700, and c1 = −1.200 to 3D.
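
A library QR routine gives the same answer. The numpy sketch below is
illustrative; numpy's ‘reduced’ factorization returns R1 directly (possibly with
different signs from those printed above), and the triangular system Rc = Qᵀy is
then solved:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([1, 3, 4, 3, 4, 2], dtype=float)
A = np.column_stack([np.ones_like(x), x, x * x])

Q, R = np.linalg.qr(A)           # reduced factorization: Q is 6x3, R is 3x3 (= R1)
c = np.linalg.solve(R, Q.T @ y)  # back-substitution on the triangular system
print(np.round(c, 3))            # [-1.2    2.7   -0.357]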

4 The QR factorization process


A complete explanation of how to determine a QR factorization is beyond the
scope of this book, but we now give a brief outline that should be sufficient for a
reader to write an appropriate computer program. (The calculations are normally
too complex for hand calculation.) Modifications of the process to improve the
computational efficiency are usually implemented in computer software packages.
For instance, it is not necessary to calculate Q explicitly in order to obtain the
system Rc = QT y for the least squares problem.
Most techniques to find a QR factorization are based on orthogonal transform-
ations, in which A is pre-multiplied by a sequence of orthogonal matrices that
successively reduce the elements below the main diagonal in each column to
zero. There are a number of ways of choosing these transformations. Here we
shall only consider Householder transformations, in which we pre-multiply A by
Householder matrices.
An n × n Householder matrix is a matrix of the form

H = I − 2wwᵀ

where w is an n × 1 vector such that

wᵀw = Σ_{i=1}^{n} wi² = 1

It is notable that H is symmetric and orthogonal (H = Hᵀ = H⁻¹).
In order to find a QR factorization of A, we apply appropriate Householder
transformations to A which transform A into R. If we set A^(0) = A, then this may
be achieved by calculating the sequence

A^(ℓ) = H^(ℓ) A^(ℓ−1)

for ℓ = 1, 2, ..., k, where H^(ℓ) is a Householder matrix. The effect of this
Householder matrix is to make the last n − ℓ elements in the ℓ-th column of A^(ℓ)
zero. The final matrix A^(k) is the required R.
The corresponding Householder matrix H^(ℓ) is of the form I − 2w^(ℓ)(w^(ℓ))ᵀ,
where the first ℓ − 1 components of w^(ℓ) are zero. If the components of the ℓ-th
column of A^(ℓ−1) are a_{jℓ}^(ℓ−1), then the ℓ-th element of w^(ℓ) should be taken to be

w_ℓ^(ℓ) = [ ½ (1 − a_{ℓℓ}^(ℓ−1) / S^(ℓ)) ]^(1/2)

where

(S^(ℓ))² = Σ_{j=ℓ}^{n} [a_{jℓ}^(ℓ−1)]²

Since S^(ℓ) can be either positive or negative, it is best to choose the sign to be
opposite to a_{ℓℓ}^(ℓ−1) – that is, take S^(ℓ) to be positive if a_{ℓℓ}^(ℓ−1) is negative and
vice versa. This choice maximizes w_ℓ^(ℓ) and minimizes the round-off error. The
remaining elements of w^(ℓ) are given by

w_j^(ℓ) = a_{jℓ}^(ℓ−1) / (−2 S^(ℓ) w_ℓ^(ℓ)),   j = ℓ + 1, ℓ + 2, ..., n

Finally, the matrix Qᵀ is given by the product H^(k) H^(k−1) ··· H^(1), from which
Q follows by taking the transpose. However, as pointed out earlier, it is not
necessary to find Q explicitly in order to obtain the least squares solution. Instead,
we set y^(0) = y, and when A^(ℓ) = H^(ℓ) A^(ℓ−1) is being calculated, we may also
calculate

y^(ℓ) = H^(ℓ) y^(ℓ−1),   ℓ = 1, 2, ..., k

The end result is a transformation of the original system Ac = y into the system
A^(k) c = y^(k), that is, the system Rc = Qᵀy. Furthermore, the calculations may
be carried out without the need to explicitly form the Householder matrices. Once
we have w^(ℓ), then

A^(ℓ) = H^(ℓ) A^(ℓ−1)
      = (I − 2w^(ℓ)(w^(ℓ))ᵀ) A^(ℓ−1)
      = A^(ℓ−1) − 2w^(ℓ) [(w^(ℓ))ᵀ A^(ℓ−1)]

so A^(ℓ) may be calculated by first calculating (w^(ℓ))ᵀ A^(ℓ−1). In a similar manner,
y^(ℓ) may be calculated without explicitly forming H^(ℓ).
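
A direct, unoptimized transcription of this process into Python might look as
follows. It is a sketch only (no safeguard against a zero column, and the
Householder vectors are formed explicitly); it reproduces the least squares
parabola of the earlier example:

import numpy as np

def householder_qr_solve(A, y):
    """Least squares solution of Ac ~ y via Householder reduction of A to R."""
    A, y = A.astype(float), y.astype(float)
    n, k = A.shape
    for l in range(k):
        col = A[l:, l]
        S = np.sqrt(np.sum(col ** 2))
        if A[l, l] > 0:                      # choose the sign of S opposite to a_ll
            S = -S
        w = np.zeros(n)
        w[l] = np.sqrt(0.5 * (1.0 - A[l, l] / S))
        w[l + 1:] = A[l + 1:, l] / (-2.0 * S * w[l])
        A -= 2.0 * np.outer(w, w @ A)        # A <- (I - 2 w w^T) A
        y -= 2.0 * w * (w @ y)               # y <- (I - 2 w w^T) y
    # back-substitution on the upper triangular k x k block (R1 c = q1)
    c = np.zeros(k)
    for i in range(k - 1, -1, -1):
        c[i] = (y[i] - A[i, i + 1:k] @ c[i + 1:]) / A[i, i]
    return c

x = np.arange(1.0, 7.0)
A = np.column_stack([np.ones_like(x), x, x * x])
y = np.array([1, 3, 4, 3, 4, 2], dtype=float)
print(np.round(householder_qr_solve(A, y), 3))    # [-1.2   2.7  -0.357]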

Checkpoint

1. If A is the matrix with elements ai j = φ j (xi ), how may the normal


equations be expressed in terms of A, c, and y?
2. How may the the least squares solution be found by using a QR
factorization of A?
3. What type of transformations are used to produce a QR factoriza-
tion?

EXERCISES

1. Given the data points (0, 0), (π/6, 1), (π/2, 3), and (5π/6, 2), take φ1 (x) =
1 and φ2 (x) = sin x. Write down the matrix A with ai j = φ j (xi ) and obtain
the normal equations. Verify that these normal equations are the same as
those obtained in Exercise 5 of the previous Step.
2. For the A of Exercise 1, find H(1) (take A(0) = A) and hence calculate
A(1) = H(1) A. Verify that the second, third, and fourth components in the
first column of this last matrix are all zero.
STEP 28

CURVE FITTING 3*
Splines*

Suppose we want to fit a smooth curve which actually goes through n + 1 given
data points, where n is quite large. Since an interpolating polynomial of corre-
spondingly high degree n tends to be highly oscillatory and therefore gives an
unsatisfactory fit, at least in some places, the interpolation is often constructed by
linking lower degree polynomials (piecewise polynomials) at some or all of the
given data points (called knots or nodes). This interpolation is smooth if we also
insist that the piecewise polynomials have matching derivatives at these knots, and
this smoothness is enhanced by matching higher order derivatives.
Suppose the data points (x0 , f 0 ), (x1 , f 1 ), . . . , (xn , f n ), are ordered so that
x0 < x1 < · · · < xn−1 < xn
Here we seek a function S which is a polynomial of degree d on each subinterval
[x j−1 , x j ], j = 1, 2, . . . , n, and for which
S(x j ) = f j , j = 0, 1, 2, . . . , n
For maximum smoothness in this case, where the knots are all the given data
points, it turns out that we can allow S to have up to d − 1 continuous derivatives.
Such a function is known as a spline. An example of a spline for the linear
(d = 1) case is the polygon (Curve A) of Figure 11(b) on page 115. It is clear
that this spline is continuous, but it does not have a continuous first derivative.
The most popular in practice are cubic splines, constructed from polynomials of
degree three with continuous first and second derivatives at the data points, and
discussed further below. An example of a cubic spline S for n = 5 is displayed in
Figure 13. (The data points are taken from the table on page 117.) We see that S
goes through all the data points. Each function Sj on the subinterval [x j−1 , x j ] is
a cubic. As has already been indicated, the first and second derivatives of Sj and
Sj+1 match at (x j , f j ), the point where they meet.
The term ‘spline’ refers to the thin flexible rods that in the past were used by
draughtsmen to draw smooth curves. The graph of a cubic spline approximates
the shape that arises when such a rod is forced to pass through the given n + 1
data points, and corresponds to minimum ‘strain energy’.

1 Construction of cubic splines


As indicated, the cubic spline S is constructed by fitting a cubic on each subinterval
[x j−1 , x j ] for j = 1, 2, . . . , n, so it is convenient to suppose that S has values
Sj (x) for x ∈ [x j−1 , x j ], where
Sj (x) = a j + b j (x − x j ) + c j (x − x j )2 + d j (x − x j )3
[FIGURE 13. Schematic example of a cubic spline over subintervals [x0, x1], [x1, x2],
[x2, x3], [x3, x4], and [x4, x5]; each function Sj on [xj−1, xj] is a cubic.]

Then we require S(xj) = fj, from which it follows that aj = fj for j =
1, 2, ..., n; and for S to be continuous and to have continuous first and second
derivatives at the given data points, we require

Sj(xj) = Sj+1(xj)
S′j(xj) = S′j+1(xj)
S″j(xj) = S″j+1(xj)

for j = 1, 2, ..., n − 1.
Since we have a cubic with four unknowns (a j , b j , c j , d j ) on each of the n
subintervals, and so a total of 4n unknowns, we need 4n equations to specify
them. The requirement S(x j ) = f j , j = 0, 1, 2, . . . , n, yields n + 1 equations,
while 3(n − 1) equations arise from the continuity requirement on S and its first
two derivatives given above. This yields a total of n + 1 + 3(n − 1) = 4n − 2
equations, so we need to impose two more conditions to specify S completely. The
choice of these two extra conditions determines the type of cubic spline obtained.
Two common choices are the following:
(a) natural cubic spline: S 00 (x0 ) = S 00 (xn ) = 0;
(b) clamped cubic spline: S 0 (x0 ) = β0 , S 0 (xn ) = βn for some given constants
β0 and βn . If the values of f 0 (x0 ) and f 0 (xn ) are known, then β0 and βn can
be set to these values.
We shall not go into the algebraic details here; but it turns out that if we write
hj = xj − xj−1 and mj = S″(xj), then the coefficients of Sj for j = 1, 2, ..., n
are given by

aj = fj
bj = (fj − fj−1)/hj + hj(2mj + mj−1)/6
cj = mj/2
dj = (mj − mj−1)/(6hj)

The spline is thus determined by the values of {mj}, j = 0, 1, ..., n, which depend
on whether we have a natural or a clamped cubic spline.
For a natural cubic spline we have m0 = mn = 0, and the equations

hj mj−1 + 2(hj + hj+1) mj + hj+1 mj+1 = 6 [(fj+1 − fj)/hj+1 − (fj − fj−1)/hj]

for j = 1, 2, ..., n − 1. (We remark that if all the values of hj are the same,
then the right-hand side of this last equation is just 6Δ²fj−1/hj.) Setting αj =
2(hj + hj+1), these linear equations can be written as the (n − 1) × (n − 1) system

[ α1   h2    0    ···   0      0      0    ] [ m1   ]
[ h2   α2    h3   ···   0      0      0    ] [ m2   ]
[ 0    h3    α3   ···   0      0      0    ] [ m3   ]
[ ..   ..    ..   ···   ..     ..     ..   ] [ ..   ]  =  b
[ 0    0     0    ···   αn−3   hn−2   0    ] [ mn−3 ]
[ 0    0     0    ···   hn−2   αn−2   hn−1 ] [ mn−2 ]
[ 0    0     0    ···   0      hn−1   αn−1 ] [ mn−1 ]

where

        [ (f2 − f1)/h2 − (f1 − f0)/h1             ]
        [ (f3 − f2)/h3 − (f2 − f1)/h2             ]
b = 6 × [ ...                                     ]
        [ (fn−1 − fn−2)/hn−1 − (fn−2 − fn−3)/hn−2 ]
        [ (fn − fn−1)/hn − (fn−1 − fn−2)/hn−1     ]

It is notable that the coefficient matrix has nonzero entries only on the leading
diagonal and the two subdiagonals either side of it. Such a system is called a
tridiagonal system. Because most of the entries below the leading diagonal are
already zero, it is possible to modify Gaussian elimination (see Step 11) to produce
a very efficient method for solving tridiagonal systems.
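
Such a modification is often called the Thomas algorithm. The following Python
sketch is illustrative only (the argument names are made up, and there is no
pivoting, which is acceptable for the diagonally dominant spline systems
considered here):

def solve_tridiagonal(sub, diag, sup, rhs):
    """Solve a tridiagonal system; sub[0] and sup[-1] are unused."""
    n = len(diag)
    d, r = list(diag), list(rhs)
    for i in range(1, n):                 # forward elimination
        factor = sub[i] / d[i - 1]
        d[i] -= factor * sup[i - 1]
        r[i] -= factor * r[i - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):        # back-substitution
        x[i] = (r[i] - sup[i] * x[i + 1]) / d[i]
    return x

# the 4 x 4 natural-spline system of the example in the next section
print(solve_tridiagonal([0, 1, 1, 1], [4, 4, 4, 4], [1, 1, 1, 0],
                        [-6, -12, 12, -18]))
# approximately [-0.43062, -4.27751, 5.54067, -5.88517]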
For the clamped boundary conditions, the equations are

2h1 m0 + h1 m1 = 6 (f1 − f0)/h1 − 6β0
hn mn−1 + 2hn mn = 6βn − 6 (fn − fn−1)/hn

and

hj mj−1 + 2(hj + hj+1) mj + hj+1 mj+1 = 6 [(fj+1 − fj)/hj+1 − (fj − fj−1)/hj]

for j = 1, 2, ..., n − 1. It may be verified that these equations for m0, m1, ..., mn
can be written as an (n + 1) × (n + 1) tridiagonal system.

2 Examples
We fit a natural cubic spline to the following data from Step 26 on page 117:
j 0 1 2 3 4 5
xj 1 2 3 4 5 6
fj 1 3 4 3 4 2
Since the values of the x j are equally spaced, we have h j = 1 for j = 1, 2, . . . , 5.
Also, m 0 = m 5 = 0 and the remaining values m 1 , m 2 , m 3 , and m 4 satisfy the
linear system

[ 4 1 0 0 ] [ m1 ]   [  −6 ]
[ 1 4 1 0 ] [ m2 ] = [ −12 ]
[ 0 1 4 1 ] [ m3 ]   [  12 ]
[ 0 0 1 4 ] [ m4 ]   [ −18 ]

Using Gaussian elimination to solve this system, to 5D we obtain

m1 = −0.43062,   m2 = −4.27751,   m3 = 5.54067,   m4 = −5.88517

Calculating the coefficients, we then find that the natural spline S is given by

       { S1(x), 1 ≤ x < 2
       { S2(x), 2 ≤ x < 3
S(x) = { S3(x), 3 ≤ x < 4
       { S4(x), 4 ≤ x < 5
       { S5(x), 5 ≤ x ≤ 6

where

S1(x) = 3 + 1.85646(x − 2) − 0.21531(x − 2)² − 0.07177(x − 2)³
S2(x) = 4 − 0.49761(x − 3) − 2.13876(x − 3)² − 0.64115(x − 3)³
S3(x) = 3 + 0.13397(x − 4) + 2.77033(x − 4)² + 1.63636(x − 4)³
S4(x) = 4 − 0.03828(x − 5) − 2.94258(x − 5)² − 1.90431(x − 5)³
S5(x) = 2 − 2.98086(x − 6) + 0.98086(x − 6)³

The data points and the cubic spline were previously displayed in Figure 13.
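
The whole construction can be checked with a few lines of numpy. The sketch
below is illustrative (it uses a general solver where the tridiagonal routine shown
earlier would do); it rebuilds the coefficients aj, bj, cj, dj of S1, ..., S5 from the
formulas of Section 1:

import numpy as np

xs = np.array([1., 2., 3., 4., 5., 6.])
fs = np.array([1., 3., 4., 3., 4., 2.])
n = len(xs) - 1
h = np.diff(xs)

# natural spline: m0 = mn = 0, tridiagonal system for m1, ..., m(n-1)
M = np.diag(2 * (h[:-1] + h[1:])) + np.diag(h[1:-1], 1) + np.diag(h[1:-1], -1)
rhs = 6 * np.diff(np.diff(fs) / h)
m = np.concatenate([[0.0], np.linalg.solve(M, rhs), [0.0]])

# coefficients of S_j(x) = a_j + b_j (x - x_j) + c_j (x - x_j)^2 + d_j (x - x_j)^3
for j in range(1, n + 1):
    a = fs[j]
    b = (fs[j] - fs[j - 1]) / h[j - 1] + h[j - 1] * (2 * m[j] + m[j - 1]) / 6
    c = m[j] / 2
    d = (m[j] - m[j - 1]) / (6 * h[j - 1])
    print(j, round(a, 5), round(b, 5), round(c, 5), round(d, 5))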
In the next example we demonstrate graphically a comment made earlier, namely
that interpolating polynomials of high degree tend to be highly oscillatory in
their behaviour. To do this, we consider the following data from the function
f (x) = 10/(1 + x 2 ).
j 0 1 2 3 4 5 6
xj −3 −2 −1 0 1 2 3
fj 1 2 5 10 5 2 1
One may verify that the interpolating polynomial of degree 6 is given by

P6(x) = 10 − 6.4x² + 1.5x⁴ − 0.1x⁶

The function f is plotted (solid line) in Figure 14 along with the interpolating
polynomial (dashed line). (Since both f and P6 are symmetric about the y-axis,
only the part of the graph on [0, 3] has been displayed.) The oscillatory behaviour
of the interpolating polynomial of degree 6 is clearly seen.
Suppose we now fit a natural cubic spline to the data. To do this, we need to
solve the linear system

[ 4 1 0 0 0 ] [ m1 ]   [  12 ]
[ 1 4 1 0 0 ] [ m2 ]   [  12 ]
[ 0 1 4 1 0 ] [ m3 ] = [ −60 ]
[ 0 0 1 4 1 ] [ m4 ]   [  12 ]
[ 0 0 0 1 4 ] [ m5 ]   [  12 ]

Gaussian elimination yields (to 5D)

m1 = m5 = 1.15385,   m2 = m4 = 7.38462,   m3 = −18.69231
Calculating the coefficients, we then find that the natural spline S is symmetric
about the y-axis. Moreover, on [0, 3] it is given by

       { S4(x), 0 ≤ x < 1
S(x) = { S5(x), 1 ≤ x < 2
       { S6(x), 2 ≤ x ≤ 3

where

S4(x) = 5 − 5.65385(x − 1) + 3.69231(x − 1)² + 4.34615(x − 1)³
S5(x) = 2 − 1.38462(x − 2) + 0.57692(x − 2)² − 1.03846(x − 2)³
S6(x) = 1 − 0.80769(x − 3) − 0.19231(x − 3)³

The spline is also plotted in Figure 14 (using a dotted line). It is clear that it is a
much better approximation to f than the interpolating polynomial.
[FIGURE 14. The function f(x) = 10/(1 + x²) (solid line) approximated by an interpo-
lating polynomial (dashed line) and a natural cubic spline (dotted line).]

Checkpoint

1. What characterises a spline?


2. What are two common types of cubic spline?
3. What type of linear system arises when determining a cubic spline?

EXERCISE

Given the data points (0, 1), (1, 4), (2, 15), and (3, 40), find the natural cubic
spline fitting this data. Use the spline to estimate the value of y when x = 2.3.
STEP 29

NUMERICAL DIFFERENTIATION
Finite differences

In Analysis, we are usually able to obtain the derivative of a function by the


methods of elementary calculus. If the function is very complicated or known
only from values in a table however, it may be necessary to resort to numerical
differentiation.

1 Procedure

Formulae for numerical differentiation may easily be obtained by differentiating
the interpolation polynomial. The essential idea is that the derivatives f′, f″, ...
of a function f are represented by the derivatives P′n, P″n, ... of the interpolating
polynomial Pn. For example, differentiating the Newton forward difference
formula (see page 96)

f(x) = f(xj + θh)
     = [1 + θΔ + θ(θ − 1)/2! Δ² + θ(θ − 1)(θ − 2)/3! Δ³ + ···] fj

with respect to x gives formally (since x = xj + θh, df/dx = (df/dθ)(dθ/dx), etc.)

f′(x) = (1/h) df/dθ = (1/h) [Δ + (θ − ½)Δ² + (3θ² − 6θ + 2)/6 Δ³ + ···] fj
f″(x) = (1/h²) d²f/dθ² = (1/h²) [Δ² + (θ − 1)Δ³ + ···] fj, etc.

In particular, if we set θ = 0 we have formulae for derivatives at the tabular
points {xj}:

f′(xj) = (1/h) [Δ − ½Δ² + ⅓Δ³ − ···] fj
f″(xj) = (1/h²) [Δ² − Δ³ + (11/12)Δ⁴ − ···] fj, etc.

If we set θ = ½, we have a relatively accurate formula at half-way points
(without second differences)

f′(xj + ½h) = (1/h) [Δ − (1/24)Δ³ + ···] fj

If we set θ = 1 in the formula for the second derivative, we have the result (without
third differences)

f″(xj+1) = (1/h²) [Δ² − (1/12)Δ⁴ + ···] fj

a formula for the second derivative at the next point.
Note that if only one term is retained, the well-known formulae

f′(xj) ≈ [f(xj + h) − f(xj)]/h
f″(xj) ≈ [f(xj + 2h) − 2f(xj + h) + f(xj)]/h²
f′(xj + ½h) ≈ [f(xj + h) − f(xj)]/h
f″(xj+1) ≈ [f(xj + 2h) − 2f(xj + h) + f(xj)]/h²

etc. are recovered.

2 Error in numerical differentiation


It must be recognized that numerical differentiation is subject to considerable error;
the basic difficulty is that while ( f (x) − Pn (x)) may be small, the differences
( f 0 (x) − Pn0 (x)) and ( f 00 (x) − Pn00 (x)) etc. may be very large. In geometrical
terms, although two curves may be close together, they may differ considerably
in slope, variation in slope, etc. (see Figure 15).
[FIGURE 15. Interpolating f: the curve y = f(x) and an interpolating curve may lie
close together yet differ considerably in slope.]
It should also be noted that the formulae all involve dividing a combination
of differences (which are prone to loss of significance or cancellation errors,
especially if h is small), by a positive power of h. Consequently if we want to
keep round-off errors down, we should use a large value of h. On the other hand,
it can be shown (see Exercise 3 at the end of this Step) that the truncation error is
approximately proportional to h p , where p is a positive integer, so that h must be
sufficiently small for the truncation error to be tolerable. We are in a ‘cleft stick’
and must compromise with some optimum choice of h.
In brief, large errors may occur in numerical differentiation based on direct
polynomial approximation, so that an error check is always advisable. There are
alternative methods based on polynomials which use more sophisticated proce-
dures such as least-squares or mini-max, and other alternatives involving other
basis functions (for example, trigonometric functions). However, the best policy
is probably to use numerical differentiation only when it cannot be avoided!

3 Example

We estimate f′(0.1) and f″(0.1) for f(x) = eˣ using the data in Step 20 (page 90).
If we use the formulae from page 135 (with θ = 0) we obtain (ignoring fourth
and higher differences):

f′(0.1) ≈ (1/0.05) [0.05666 − ½(0.00291) + ⅓(0.00015)]
        = 20(0.05666 − 0.00145(5) + 0.00005)
        = 1.1051
f″(0.1) ≈ 400(0.00291 − 0.00015)
        = 1.104

Since f″(0.1) = f′(0.1) = f(0.1) = 1.10517, it is obvious that the second result
is much less accurate (because of round-off errors).
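
The same estimates are obtained by regenerating the 5D table of eˣ and its forward
differences directly, as in the Python sketch below (illustrative only):

import math

h, x0 = 0.05, 0.1
f = [round(math.exp(x0 + i * h), 5) for i in range(5)]   # 5D values of e^x

# forward differences at x0
d1 = f[1] - f[0]
d2 = f[2] - 2 * f[1] + f[0]
d3 = f[3] - 3 * f[2] + 3 * f[1] - f[0]

deriv1 = (d1 - d2 / 2 + d3 / 3) / h            # f'(0.1),  about 1.1051
deriv2 = (d2 - d3) / h ** 2                    # f''(0.1), about 1.104
print(deriv1, deriv2, math.exp(0.1))           # exact value 1.10517...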

Checkpoint

1. How are formulae for the derivatives of a function obtained from


interpolation formulae?
2. Why is the accuracy of the usual numerical differentiation process
not necessarily increased if the argument interval is reduced?
3. When should numerical differentiation be used?

EXERCISES

1. Derive formulae involving backward differences for the first and second
derivatives of a function.
2. The function f(x) = √x is tabulated for x = 1.00 (0.05) 1.30 to five
decimal places:
x f (x)
1.00 1.00000
1.05 1.02470
1.10 1.04881
1.15 1.07238
1.20 1.09545
1.25 1.11803
1.30 1.14018
(a) Estimate f 0 (1.00) and f 00 (1.00) using Newton’s forward difference
formula.
(b) Estimate f 0 (1.30) and f 00 (1.30) using Newton’s backward difference
formula.
3. Use the Taylor series to find the truncation errors in the following formulae.
(a) f′(xj) ≈ [f(xj + h) − f(xj)]/h.
(b) f′(xj + ½h) ≈ [f(xj + h) − f(xj)]/h.
(c) f″(xj) ≈ [f(xj + 2h) − 2f(xj + h) + f(xj)]/h².
(d) f″(xj + h) ≈ [f(xj + 2h) − 2f(xj + h) + f(xj)]/h².
STEP 30

NUMERICAL INTEGRATION 1
The trapezoidal rule

It is often either difficult or impossible to evaluate definite integrals of the form

∫_a^b f(x) dx

by analytical methods, so numerical integration or quadrature is used.


It is well known that the definite integral may be interpreted as the area under
the curve y = f (x) for a ≤ x ≤ b and may be evaluated by subdivision of
the interval and summation of the component areas. This additive property of
the definite integral permits evaluation in a piecewise sense. For any subinterval
x j ≤ x ≤ x j+n of the interval a ≤ x ≤ b, we may approximate f (x) by the
interpolating polynomial Pn (x). Then we have the approximation
∫_{xj}^{xj+n} Pn(x) dx ≈ ∫_{xj}^{xj+n} f(x) dx

which will be a good approximation if n is chosen so that the error ( f (x) − Pn (x))
in each tabular subinterval x j+k−1 ≤ x ≤ x j+k (k = 1, 2, . . . , n) is sufficiently
small. It is notable that (for n > 1) the error is often alternately positive and
negative in successive subintervals, and considerable cancellation of error occurs;
in contrast with numerical differentiation, quadrature is inherently accurate! It is
usually sufficient to use a rather low degree polynomial approximation over any
subinterval x j ≤ x ≤ x j+n .

1 The trapezoidal rule

Perhaps the most straightforward quadrature is to divide the interval a ≤ x ≤ b
into N equal strips of width h by the points

xj = a + jh,   j = 0, 1, 2, ..., N

such that b = a + Nh. Then we can use the additive property

∫_a^b f(x) dx = ∫_{x0}^{x1} f(x) dx + ∫_{x1}^{x2} f(x) dx + ··· + ∫_{xN−1}^{xN} f(x) dx

and the linear approximations (involving x = xj + θh)

∫_{xj}^{xj+1} f(x) dx = h ∫_0^1 f(xj + θh) dθ
                     ≈ h ∫_0^1 [1 + θΔ] fj dθ = h [θ + (θ²/2)Δ] fj |_0^1
                     = h [1 + ½Δ] fj = h [fj + ½(fj+1 − fj)]
                     = (h/2)(fj + fj+1)

to obtain the trapezoidal rule†

∫_a^b f(x) dx ≈ (h/2)(f0 + f1) + ··· + (h/2)(fN−1 + fN)
             = (h/2)(f0 + fN) + h(f1 + f2 + ··· + fN−1)

Integration by the trapezoidal rule therefore involves computing a finite sum of
values given by the integrand f, and is very quick. Note that this procedure can be
interpreted geometrically (see Figure 16) as the sum of the areas of N trapeziums
of width h and average height ½(fj + fj+1).

† This rule is suitable for implementation on a computer. Pseudo-code for study and use in
programming may be found on page 170.

[FIGURE 16. The trapezoidal rule: the area under y = f(x) from x0 = a to xN = b is
approximated by the sum of the areas of N trapeziums of width h.]

2 Accuracy

The trapezoidal rule corresponds to a rather crude polynomial approximation (a
straight line) between successive points xj and xj+1 = xj + h, and hence can only
be accurate for sufficiently small h. An approximate (upper) bound on the error
may be derived as follows. From the Taylor expansion

fj+1 = f(xj + h) = fj + h f′(xj) + (h²/2!) f″(xj) + ···

one has the trapezoidal form

∫_{xj}^{xj+1} f(x) dx ≈ (h/2)(fj + fj+1) = h [fj + (h/2) f′(xj) + (h²/4) f″(xj) + ···]

while we may expand f(x) in xj ≤ x ≤ xj+1 as

f(x) = fj + (x − xj) f′(xj) + ((x − xj)²/2!) f″(xj) + ···

to get the exact form

∫_{xj}^{xj+1} f(x) dx = h [fj + (h/2) f′(xj) + (h²/6) f″(xj) + ···]

Comparison of these two forms shows that the truncation error is

h(1/6 − 1/4) h² f″(xj) + ··· = −(h³/12) f″(xj) + ···

(The concept of truncation error was introduced on page 8.) If we ignore higher-
order terms, an approximate bound on this error in using the trapezoidal rule (over
N subintervals) is therefore

(N h³/12) max_{a≤x≤b} |f″(x)| = ((b − a)h²/12) max_{a≤x≤b} |f″(x)|

Where possible, we choose h small enough to make this error negligible. In
the case of hand computation from tables, it may not be possible. On the other
hand, in a computer program in which f(x) may be generated anywhere in a ≤
x ≤ b, the interval may be subdivided smaller and smaller until there is sufficient
accuracy. (The integral value for successive subdivisions can be compared, and
the subdivision process terminated when there is adequate agreement between
successive values.)

3 Example

The integral

∫_{0.1}^{0.3} eˣ dx

is estimated using the trapezoidal rule and the data in Step 20 (page 90).
If we use T(h) to denote the approximation with strip width h, we obtain

T(0.2) = (0.2/2)[1.10517 + 1.34986] = 0.24550
T(0.1) = (0.1/2)[1.10517 + 2(1.22140) + 1.34986] = 0.24489
T(0.05) = (0.05/2)[1.10517 + 2(1.16183 + 1.22140 + 1.28403) + 1.34986] = 0.24474

Since ∫_{0.1}^{0.3} eˣ dx = e^0.3 − e^0.1 = 0.24469, we may observe that the error
sequence 0.00081, 0.00020, 0.00005 decreases with h² as expected.
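
A minimal implementation of the composite rule (a sketch; the pseudo-code
referred to in the footnote is on page 170) reproduces these values when 5D
tabular values are used:

import math

def trapezoidal(fvals, h):
    """Composite trapezoidal rule for equally spaced values fvals."""
    return h * (0.5 * (fvals[0] + fvals[-1]) + sum(fvals[1:-1]))

for N in (1, 2, 4):                    # strip widths h = 0.2, 0.1, 0.05
    h = 0.2 / N
    fvals = [round(math.exp(0.1 + j * h), 5) for j in range(N + 1)]
    print(h, round(trapezoidal(fvals, h), 5))
# 0.2455, 0.24489, 0.24474 (the exact value is 0.24469)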

Checkpoint

1. Why is quadrature using a polynomial approximation for the in-


tegrand likely to be satisfactory, even if the polynomial is of low
degree?
2. What is the degree of the approximating polynomial corresponding
to the trapezoidal rule?
3. Why is the trapezoidal rule well suited for implementation on a
computer?

EXERCISES

1. Estimate the integral

   ∫_{1.0}^{1.3} √x dx

   by using the trapezoidal rule and the data given in Exercise 2 of the previous
   Step.
2. Use the trapezoidal rule with h = 1, 0.5, and 0.25 to estimate

   ∫_0^1 1/(1 + x) dx
STEP 31

NUMERICAL INTEGRATION 2
Simpson’s rule

If it is undesirable (for example, in the use of tables) to increasingly subdivide an


interval a ≤ x ≤ b in order to get increasingly accurate quadrature, the alternative
is to use an approximating polynomial of higher degree. An integration based on
quadratic (that is, parabolic) approximation called Simpson’s rule is adequate for
most quadratures that one is likely to encounter.

1 Simpson's rule

Simpson's rule corresponds to quadratic approximation; thus, for xj ≤ x ≤ xj + 2h,

∫_{xj}^{xj+2h} f(x) dx = h ∫_0^2 f(xj + θh) dθ
                      ≈ h ∫_0^2 [1 + θΔ + θ(θ − 1)/2! Δ²] fj dθ
                      = h [θ + (θ²/2)Δ + (θ³/6 − θ²/4)Δ²] fj |_0^2
                      = h [2fj + 2(fj+1 − fj) + ⅓(fj+2 − 2fj+1 + fj)]
                      = (h/3)(fj + 4fj+1 + fj+2)

A parabolic arc is fitted to the curve y = f(x) at the three tabular points xj, xj + h,
and xj + 2h. Consequently, if N = (b − a)/h is even, one obtains Simpson's rule:

∫_a^b f(x) dx = ∫_{x0}^{x2} f(x) dx + ∫_{x2}^{x4} f(x) dx + ··· + ∫_{xN−2}^{xN} f(x) dx
             ≈ (h/3)[f0 + 4f1 + 2f2 + 4f3 + ··· + 4fN−1 + fN]

where

fj = f(xj) = f(a + jh),   j = 0, 1, 2, ..., N

Integration by Simpson's rule involves computing a finite sum of values given
by the integrand f, as does the trapezoidal rule. Simpson's rule is also effective
for implementation on a computer, and one direct application in hand calculation
usually gives sufficient accuracy.
2 Accuracy
For a known integrand f , we emphasize that it is quite appropriate to program
increased interval subdivision to provide the desired accuracy, but for hand calcu-
lation an error bound is again useful.
Suppose that in x j ≤ x ≤ x j + 2h the function f (x) has the Taylor expansion

    f(x) = f_{j+1} + (x − x_{j+1})f′(x_{j+1}) + ((x − x_{j+1})²/2!)f″(x_{j+1}) + ···

then

    ∫_{x_j}^{x_j+2h} f(x) dx = 2h[f_{j+1} + (1/3)(h²/2!)f″(x_{j+1}) + (1/5)(h⁴/4!)f⁽⁴⁾(x_{j+1}) + ···]

One may re-express the quadrature rule for x_j ≤ x ≤ x_j + 2h by writing f_{j+2} = f(x_{j+1} + h)
and f_j = f(x_{j+1} − h) as Taylor series; thus

    ∫_{x_j}^{x_j+2h} f(x) dx ≈ (h/3)(f_j + 4f_{j+1} + f_{j+2})
        = (h/3)[ (f_{j+1} − hf′(x_{j+1}) + (h²/2!)f″(x_{j+1}) − ···) + 4f_{j+1}
                 + (f_{j+1} + hf′(x_{j+1}) + (h²/2!)f″(x_{j+1}) + ···) ]
        = 2h[f_{j+1} + (1/3)(h²/2!)f″(x_{j+1}) + (1/3)(h⁴/4!)f⁽⁴⁾(x_{j+1}) + ···]

Comparison of these two forms shows that the truncation error is

    2h(1/5 − 1/3)(h⁴/4!)f⁽⁴⁾(x_{j+1}) + ··· = −(h⁵/90)f⁽⁴⁾(x_{j+1}) + ···
Ignoring higher-order terms, we conclude that the approximate bound on this error
in estimating ∫_a^b f(x) dx by Simpson's rule (with N/2 subintervals of width 2h) is

    (N/2)(h⁵/90) max_{a≤x≤b} |f⁽⁴⁾(x)| = ((b − a)h⁴/180) max_{a≤x≤b} |f⁽⁴⁾(x)|

It is notable that the error bound is proportional to h 4 , compared with h 2 for the
cruder trapezoidal rule. In passing, one may note that Simpson’s rule is exact for
a cubic.

3 Example
We estimate the integral
    ∫_{1.0}^{1.3} √x dx
by using Simpson’s rule and the data in Exercise 2 of Step 29 on page 138.
There will be an even number of intervals if we choose h = 0.15 or h = 0.05.
If we use S(h) to denote the approximation with strip width h, we obtain
    S(0.15) = (0.15/3)[1 + 4(1.07238) + 1.14018] = 0.32148(5)

and

    S(0.05) = (0.05/3)[1 + 4(1.02470 + 1.07238 + 1.11803) + 2(1.04881 + 1.09545) + 1.14018] = 0.32148(6)
Since f⁽⁴⁾(x) = −(15/16)x^(−7/2), an approximate bound on the truncation error is

    (0.30/180) × (15/16) × h⁴ = 0.0015625h⁴
whence 0.0000008 for h = 0.15 and 0.00000001 for h = 0.05. Note that the
truncation error is negligible; within round-off error, the estimate is 0.32148(6).
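Using the simpson sketch from Section 1 (our illustrative code, not part of the text) and evaluating √x directly rather than reading it from tables, the two estimates are easily checked:

    import math

    # Reproduce S(0.15) and S(0.05) for the integral of sqrt(x) over [1.0, 1.3]
    for n in (2, 6):
        h = 0.3 / n
        print(f"S({h:.2f}) = {simpson(math.sqrt, 1.0, 1.3, n):.5f}")

The printed values agree with the hand calculation to within a unit in the last displayed digit (the small discrepancy arises because the hand calculation works from 5D table values).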

Checkpoint

1. What is the degree of the approximating polynomial corresponding


to Simpson’s rule?
2. What is the error bound for Simpson’s rule?
3. Why is Simpson’s rule well suited for implementation on a com-
puter?

EXERCISES

1. Estimate
    ∫_0^1 1/(1 + x) dx
to 4D, using numerical integration.
2. Use Simpson’s rule with N = 2 to obtain an approximation to
    ∫_0^{π/4} x cos x dx
Compute the resulting error, given that the true value of the integral is 0.26247
(5D ).
STEP 32

NUMERICAL INTEGRATION 3
Gaussian integration formulae

The numerical integration procedures previously discussed (namely the trapez-


oidal rule and Simpson’s rule) involve equally spaced values of the argument.
However, for a fixed number of points the accuracy may be increased if we do
not insist that the points are equidistant. This is the background of an alternative
integration process due to Gauss, which will now be considered. Briefly, assuming
some specified number of values of the integrand (of unspecified position), we
construct a formula by choosing the arguments (or abscissae) within the range of
integration so that they produce the most accurate integration rule.

1 Gauss two-point integration formula


Consider any two-point formula of the form

    ∫_{−1}^{1} f(x) dx ≈ w₁ f(x₁) + w₂ f(x₂)

where the weights w₁, w₂ and the abscissae x₁, x₂ are to be determined such that
the formula integrates 1, x, x², and x³ (and hence all cubic functions) exactly. We
have four conditions on the four unknowns, as follows:
(i) f(x) = 1 integrates exactly if 2 = w₁ + w₂
(ii) f(x) = x integrates exactly if 0 = w₁x₁ + w₂x₂
(iii) f(x) = x² integrates exactly if 2/3 = w₁x₁² + w₂x₂²
(iv) f(x) = x³ integrates exactly if 0 = w₁x₁³ + w₂x₂³
It is easily verified that

    w₁ = w₂ = 1,   x₂ = −x₁,   x₁² = 1/3

satisfies the four equations given in (i)–(iv). Thus we have the Gauss two-point
integration formula†

    ∫_{−1}^{1} f(x) dx ≈ f(−1/√3) + f(1/√3) ≈ f(−0.57735027) + f(0.57735027)

A change of variable also renders this last form applicable to any interval; we
make the substitution

    u = ½[(b − a)x + (b + a)]

in the integral we seek to evaluate,

    ∫_a^b φ(u) du   say.

If we write

    φ(u) = φ(½[(b − a)x + (b + a)]) ≡ g(x)

then

    ∫_a^b φ(u) du = ((b − a)/2) ∫_{−1}^{1} g(x) dx

since du = ½(b − a)dx, and u = a when x = −1, u = b when x = 1.
† This formula is suitable for implementation on a computer. Pseudo-code for study and use in
programming may be found on page 171.
It is important to note that the Gauss two-point formula is exact for cubic
polynomials, and hence may be compared in accuracy with Simpson’s rule. (In
fact, the error for the Gauss formula is about 2/3 that for Simpson’s rule.) Since
one fewer function value is required for the Gauss formula, it may be preferred
provided the function evaluations at the irrational abscissae values are available.
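A minimal Python sketch of the two-point rule combined with this change of variable (the function name gauss2 is our illustrative assumption) could read:

    import math

    def gauss2(phi, a, b):
        """Gauss-Legendre two-point rule on [a, b], using the substitution
        u = ((b - a)*x + (b + a))/2, for which du = (b - a)/2 dx."""
        x1 = 1.0 / math.sqrt(3.0)
        g = lambda x: phi(((b - a) * x + (b + a)) / 2.0)
        return 0.5 * (b - a) * (g(-x1) + g(x1))

For example, gauss2(math.sin, 0.0, math.pi / 2) gives roughly 0.998473, the value obtained by hand in Section 3 below.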

2 Other Gauss formulae


The Gauss two-point integration formula discussed is but one of a large family of
such formulae. Thus, we might derive the Gauss three-point integration formula
    ∫_{−1}^{1} f(x) dx ≈ (5/9) f(−√(3/5)) + (8/9) f(0) + (5/9) f(√(3/5))

which is exact for quintics; indeed, the error is less than

    (1/15 750) max_{−1≤x≤1} |f⁽⁶⁾(x)|

This and the previous two-point formula represent the lowest order in a series of
formulae commonly referred to as Gauss-Legendre, because of their association
with Legendre polynomials.
There are yet other formulae associated with other orthogonal polynomials
(Laguerre, Hermite, Chebyshev, etc.); the general form of Gaussian integration
may be represented by the formula
    ∫_a^b W(x) f(x) dx ≈ Σ_{i=1}^{n} wᵢ f(xᵢ)

where W (x) is the weight function in the integral, {x1 , x2 , . . . , xn } is the set of
points in the integration range a ≤ x ≤ b, and the weights wi in the summation
are again constants.

3 Application of Gaussian quadrature


In general, the sets {xi } and {wi } are tabulated ready for reference, so that appli-
cation of Gaussian quadrature is immediate.
As an example, we apply the Gauss (Gauss-Legendre) two-point and four-point
formulae to calculate

    ∫_0^{π/2} sin t dt

The two-point formula (n = 2) is

    ∫_{−1}^{1} f(x) dx ≈ f(−0.57735027) + f(0.57735027)

The change of variable

    t = ½((π/2)x + π/2) = (π/4)(x + 1)

yields

    ∫_0^{π/2} sin t dt = (π/4) ∫_{−1}^{1} sin(π(x + 1)/4) dx

If we take g(x) = sin(π(x + 1)/4), then we have g(−0.57735027) = 0.325886
and g(0.57735027) = 0.945409, so that

    ∫_0^{π/2} sin t dt ≈ (π/4)(0.325886 + 0.945409) = 0.998473

The four-point formula (n = 4) is

    ∫_{−1}^{1} f(x) dx ≈ 0.34785485[f(−0.86113631) + f(0.86113631)]
                       + 0.65214515[f(−0.33998104) + f(0.33998104)]

which leads to

    ∫_0^{π/2} sin t dt ≈ 1.0000000
correct to seven decimal places. This accuracy is impressive enough; Simpson’s
rule with 64 points produces 0.99999983!
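The four-point rule is just as easy to program once the tabulated abscissae and weights are stored; a hedged sketch (reusing the change of variable from the earlier sketch, with names of our own choosing) is:

    import math

    # Tabulated Gauss-Legendre abscissae and weights for n = 4, as quoted above
    GAUSS4 = [(-0.86113631, 0.34785485), (0.86113631, 0.34785485),
              (-0.33998104, 0.65214515), (0.33998104, 0.65214515)]

    def gauss4(phi, a, b):
        """Four-point Gauss-Legendre rule on [a, b]."""
        g = lambda x: phi(((b - a) * x + (b + a)) / 2.0)
        return 0.5 * (b - a) * sum(w * g(x) for x, w in GAUSS4)

    print(gauss4(math.sin, 0.0, math.pi / 2))   # about 1.0000000, as above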
Checkpoint

1. What is a disadvantage of integration formulae using equally spaced


values of the argument?
2. What is the general form of the Gaussian integration formula?
3. How accurate are the Gauss-Legendre two-point and three-point
formulae?

EXERCISE

Apply the Gauss two-point and four-point formulae to evaluate ∫_0^1 1/(1 + u) du.
STEP 33

ORDINARY DIFFERENTIAL EQUATIONS 1


Single-step methods

In pure mathematics courses a lot of attention is paid to the properties of differ-


ential equations and ‘analytic’ techniques for solving them. Unfortunately, many
differential equations (including nearly all the nonlinear ones) encountered in the
‘real world’ are not amenable to analytic solution. Even the apparently simple
problem of solving
    y′ = (x + y)/(x − y)   with y = 0 when x = 1

involves considerable manipulation before the unwieldy solution

    ln(x² + y²) = 2 tan⁻¹(y/x)
is obtained. Even then a lot more effort is required just to extract the value of
y corresponding to one value of x. In such situations it is preferable to use a
numerical approach from the start.
Partial differential equations are beyond the scope of this text, but in this Step
and the next one we shall have a brief look at some methods for solving the single
first-order ordinary differential equation
    y′ = f(x, y)
with given initial value y(x0 ) = y0 . The first-order differential equation and the
given initial value constitute a first-order initial value problem. The numerical
solution of this initial value problem involves estimating values of y(x) at (usually
equidistant) points x1 , x2 , . . . , x N . For convenience we shall assume that these
points are indeed equidistant and use h to denote the constant step length. In
practice, it is sometimes desirable to adjust the step length as the numerical
method proceeds. For instance, we may wish to use a smaller step length when
we reach a point at which the derivative is particularly large.
These numerical methods for first-order initial value problems may be used
(in slightly modified form) to solve higher-order differential equations. A simple
(optional) introduction to this topic is given in Step 35.

1 Taylor series
We already have one technique available for this problem; we can estimate y(x1 )
by a Taylor series of order p (the particular value of p will depend on the size of
h and the accuracy required):
    y(x₁) ≈ y₁ = y(x₀) + hy′(x₀) + (h²/2!)y″(x₀) + ··· + (h^p/p!)y⁽ᵖ⁾(x₀)

Here, y(x₀) is given, and y′(x₀) can be found by substituting x = x₀ and y = y₀
in the differential equation, but y″(x₀), ..., y⁽ᵖ⁾(x₀) require differentiation of
f(x, y), which can be messy. Note that y₁, y₂, ..., y_N will be used to denote the
estimates of y(x₁), y(x₂), ..., y(x_N).
Once y1 has been computed, we can estimate y(x2 ) by a Taylor series based
either on x1 (in which case the error in y1 will be propagated) or on x0 (in which
case p may have to be increased). In the local approach, yn+1 is computed
from a Taylor series based on xn , while in the global approach y1 , y2 , . . . , y N
are all computed from Taylor series based on x0 . The local approach is the one
more commonly used in practice. The Taylor series method based on the local
approach is called a single-step method since yn+1 depends only on the previous
approximation yn . All the methods covered in this Step are single-step methods;
multistep methods are considered in the next Step.
One way of avoiding the differentiation of f (x, y) is to fix p = 1 and compute

yn+1 = yn + h f (xn , yn ), n = 0, 1, 2, . . . , N − 1

This is known as Euler’s method. However, unless the step length h is very small,
the truncation error will be large and the results inaccurate.
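As a concrete illustration (the function name euler and the driver line are our own, not the book's), Euler's method is only a few lines of Python:

    def euler(f, x0, y0, h, n_steps):
        """Euler's method: y_{n+1} = y_n + h*f(x_n, y_n)."""
        x, y = x0, y0
        for _ in range(n_steps):
            y += h * f(x, y)
            x += h
        return y

    # Five steps of size h = 0.1 for y' = x + y, y(0) = 1 (the example of Section 3)
    print(euler(lambda x, y: x + y, 0.0, 1.0, 0.1, 5))   # about 1.72102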

2 Runge-Kutta methods
A popular way of avoiding the differentiation of f (x, y) without sacrificing ac-
curacy involves estimating yn+1 from yn and a weighted average of values of
f (x, y), chosen so that the truncation error is comparable to that of a Taylor series
of order p. The details of the derivation lie beyond the scope of this book, but we
can quote two of the simpler Runge-Kutta methods† .
The first has the same order of accuracy as the Taylor series with p = 2 and is
usually written as three steps:

    k₁ = hf(x_n, y_n)
    k₂ = hf(x_n + h, y_n + k₁)
    y_{n+1} = y_n + ½(k₁ + k₂)

The second is the fourth-order method:

    k₁ = hf(x_n, y_n)
    k₂ = hf(x_n + h/2, y_n + k₁/2)
    k₃ = hf(x_n + h/2, y_n + k₂/2)
    k₄ = hf(x_n + h, y_n + k₃)
    y_{n+1} = y_n + (1/6)(k₁ + 2k₂ + 2k₃ + k₄)

† These methods are suitable for implementation on a computer. Pseudo-code for study and use
in programming may be found on page 172.

Neither method involves evaluating derivatives of f (x, y); instead f (x, y) itself
is evaluated several times (twice in the second-order method, four times in the
fourth-order method).
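A hedged Python sketch of single steps of these two methods (the names rk2_step and rk4_step are our own) is:

    def rk2_step(f, x, y, h):
        """One step of the second-order Runge-Kutta method quoted above."""
        k1 = h * f(x, y)
        k2 = h * f(x + h, y + k1)
        return y + 0.5 * (k1 + k2)

    def rk4_step(f, x, y, h):
        """One step of the fourth-order Runge-Kutta method quoted above."""
        k1 = h * f(x, y)
        k2 = h * f(x + h / 2, y + k1 / 2)
        k3 = h * f(x + h / 2, y + k2 / 2)
        k4 = h * f(x + h, y + k3)
        return y + (k1 + 2 * k2 + 2 * k3 + k4) / 6

Stepping rk2_step five times from (0, 1) with h = 0.1 for y′ = x + y essentially reproduces the value 1.79490 found by hand in Section 3 (small differences in the last digit arise because the hand calculation rounds to 5D at each step).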

3 Example
It is instructive to compare some of the methods given above on a very simple
problem. For example, let us estimate y(0.5) given that

    y′ = x + y   with y(0) = 1,   that is, x₀ = 0, y₀ = 1

The exact solution is y(x) = 2e^x − x − 1, so y(0.5) = 1.79744. We shall use a


fixed step length h = 0.1 and work to 5D.
(i) Euler’s method (first order):

yn+1 = yn + 0.1(xn + yn ) = 0.1xn + 1.1yn

so
y1 = 0.1(0) + 1.1(1) = 1.1
y2 = 0.1(0.1) + 1.1(1.1) = 1.22
y3 = 0.1(0.2) + 1.1(1.22) = 1.362
y4 = 0.1(0.3) + 1.1(1.362) = 1.5282
and y5 = 0.1(0.4) + 1.1(1.5282) = 1.72102
which is not even accurate to 1D (the error is approximately 0.08).
(ii) Taylor series (fourth order): Since

    y′ = x + y,   y″ = 1 + y′,   y‴ = y″,   and y⁽⁴⁾ = y‴

we have
    y_{n+1} = y_n + 0.1(x_n + y_n) + (0.1²/2!)(1 + x_n + y_n)
                  + (0.1³/3!)(1 + x_n + y_n) + (0.1⁴/4!)(1 + x_n + y_n)
            ≈ 0.00517 + 0.10517x_n + 1.10517y_n

Thus

y1 = 0.00517 + 0.10517(0) + 1.10517(1) = 1.11034


y2 = 0.00517 + 0.10517(0.1) + 1.10517(1.11034) = 1.24280
y3 = 0.00517 + 0.10517(0.2) + 1.10517(1.24280) = 1.39971
y4 = 0.00517 + 0.10517(0.3) + 1.10517(1.39971) = 1.58364
y5 = 0.00517 + 0.10517(0.4) + 1.10517(1.58364) = 1.79743

which is accurate to 4D (the error is approximately 0.00001).



(iii) Runge-Kutta (second order):


    k₁ = 0.1(x_n + y_n),   k₂ = 0.1(x_n + 0.1 + y_n + k₁)
    and y_{n+1} = y_n + ½(k₁ + k₂)

    n = 0 :  k₁ = 0.1(0 + 1) = 0.1
             k₂ = 0.1(0.1 + 1 + 0.1) = 0.12
             y₁ = 1 + ½(0.1 + 0.12) = 1.11
    n = 1 :  k₁ = 0.1(0.1 + 1.11) = 0.121
             k₂ = 0.1(0.2 + 1.11 + 0.121) = 0.1431
             y₂ = 1.11 + ½(0.121 + 0.1431) = 1.24205
    n = 2 :  k₁ = 0.1(0.2 + 1.24205) = 0.14421
             k₂ = 0.1(0.3 + 1.24205 + 0.14421) = 0.16863
             y₃ = 1.24205 + ½(0.14421 + 0.16863) = 1.39847
    n = 3 :  k₁ = 0.1(0.3 + 1.39847) = 0.16985
             k₂ = 0.1(0.4 + 1.39847 + 0.16985) = 0.19683
             y₄ = 1.39847 + ½(0.16985 + 0.19683) = 1.58181
    n = 4 :  k₁ = 0.1(0.4 + 1.58181) = 0.19818
             k₂ = 0.1(0.5 + 1.58181 + 0.19818) = 0.22800
             y₅ = 1.58181 + ½(0.19818 + 0.22800) = 1.79490
which is accurate to 2D (the error is approximately 0.003).
As we might expect, the fourth-order method is clearly superior, the first-order
method is clearly inferior, and the second-order method falls in between.

Checkpoint

1. For each of the two types of method outlined in this Step, what is
the main disadvantage?
2. Why might we expect higher order methods to be more accurate?

EXERCISES
1. For the initial value problem y 0 = x + y with y(0) = 1 considered in the
previous section, obtain estimates of y(0.8) by doing three more steps of
(a) Euler’s method,
(b) the fourth-order Taylor series method,
(c) the second-order Runge-Kutta method,
with h = 0.1. Compare the accuracy of the three methods.
2. Use Euler’s method with step length h = 0.2 to estimate y(1) given that
y 0 = −x y 2 with y(0) = 2.
STEP 34

ORDINARY DIFFERENTIAL EQUATIONS 2


Multistep methods

As mentioned earlier, the methods covered in the previous Step are classified as
single-step methods, because the only value of the approximate solution used in
constructing yn+1 is yn , the result of the previous step. In contrast, multistep
methods make use of earlier values like yn−1 , yn−2 , . . ., in order to reduce the
number of times that f (x, y) or its derivatives have to be evaluated.

1 Introduction
Among the multistep methods that can be derived by integrating interpolating
polynomials we have (using f n to denote f (xn , yn )):
(a) the midpoint method (second order): y_{n+1} = y_{n−1} + 2hf_n
(b) Milne's method (fourth order): y_{n+1} = y_{n−3} + (4h/3)(2f_n − f_{n−1} + 2f_{n−2})
(c) the family of Adams-Bashforth methods: the second-order formula in the
    family is given by
        y_{n+1} = y_n + (h/2)(3f_n − f_{n−1})
    while the formula of order 4 is
        y_{n+1} = y_n + (h/24)(55f_n − 59f_{n−1} + 37f_{n−2} − 9f_{n−3})
(d) the family of Adams-Moulton methods: the second-order formula in this
    family, given by
        y_{n+1} = y_n + (h/2)(f_{n+1} + f_n)
    is often referred to as the trapezoidal method. The Adams-Moulton formula
    of order 4 is
        y_{n+1} = y_n + (h/24)(9f_{n+1} + 19f_n − 5f_{n−1} + f_{n−2})
Note that the family of Adams-Moulton methods in (d) requires evaluation of
f n+1 = f (xn+1 , yn+1 ). Because yn+1 is therefore involved on both the left and
right-hand sides of the expressions, such methods are known as implicit methods.
On the other hand, since yn+1 appears only as the term on the left-hand side in all
the families listed under (a)–(c), they are called explicit methods. Implicit methods
have the disadvantage that one usually requires some numerical technique (see
Steps 7–10) to solve for yn+1 . However, it is common to use an explicit method

and an implicit method together to produce a predictor-corrector method, and this


approach is discussed in more advanced texts such as Mathews (1992).
We will not go into the various ways in which multistep methods are used,
but clearly we will need more than one ‘starting value’, which may be obtained
by first using a single-step method (see the previous Step). An advantage of a
multistep method is that we need to evaluate f (x, y) only once to obtain yn+1 ,
since f n−1 , f n−2 , . . ., will already have been computed. In contrast, any (single-
step) Runge-Kutta method involves more than one function evaluation at each
step, which for complicated functions f (x, y) can be computationally expensive.
Thus the comparative efficiency of multistep methods is often attractive, but a
multistep method may be numerically unstable.
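A minimal Python sketch of the second-order Adams-Bashforth method (the name and argument layout are our illustrative assumptions; the extra starting value y1 would in practice come from a single-step method) is:

    def adams_bashforth2(f, x0, y0, y1, h, n_steps):
        """Second-order Adams-Bashforth: y_{n+1} = y_n + (h/2)*(3*f_n - f_{n-1}).
        The additional starting value y1, an estimate of y(x0 + h), must be supplied."""
        xs = [x0, x0 + h]
        ys = [y0, y1]
        for n in range(1, n_steps):
            f_n = f(xs[n], ys[n])
            f_nm1 = f(xs[n - 1], ys[n - 1])
            ys.append(ys[n] + 0.5 * h * (3 * f_n - f_nm1))
            xs.append(xs[n] + h)
        return xs, ys

Only one new evaluation of f is really needed per step, since f_{n−1} can be remembered from the previous step; the sketch recomputes it for clarity.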

2 Stability
Numerical stability is discussed in depth in more advanced texts such as Burden
and Faires (1993). In general, a method is unstable if any errors introduced into
the computation are amplified as the computation progresses. It turns out that the
Adams-Bashforth and Adams-Moulton families of methods have good stability
properties.
As an example of a multistep method with poor stability properties, let us
apply the midpoint method given above with h = 0.1 to the differential equation
y′ = −5y, y(0) = 1. The true solution to this problem is given by y(x) = e^(−5x).
We introduce error by taking y₁ to be the value obtained by rounding the true
value e^(−0.5) to 5D, namely, y₁ = 0.60653. The resulting method is then given by

yn+1 = yn−1 + 2 × 0.1 × f n = yn−1 + 0.2 × (−5yn ) = yn−1 − yn

Working to 5D, we construct the following table which allows us to compare the
consequent estimates yn with the true values y(xn ).

    n     x_n      y_n        y(x_n) = e^(−5x_n)     |y(x_n) − y_n|


0 0.0 1.0 1.0 0.00000
1 0.1 0.60653 0.60653 0.00000
2 0.2 0.39347 0.36788 0.02559
3 0.3 0.21306 0.22313 0.01007
4 0.4 0.18041 0.13534 0.04507
5 0.5 0.03265 0.08208 0.04943
6 0.6 0.14776 0.04979 0.09797
7 0.7 −0.11511 0.03020 0.14531
8 0.8 0.26287 0.01832 0.24455
9 0.9 −0.37798 0.01111 0.38909
10 1.0 0.64085 0.00674 0.63411
The estimates get worse as n increases. Not only do the approximations alternate
in sign after x6 , but their magnitudes also increase. Further calculation shows that

y20 has the value 77.82455 with an error over a million times larger than the true
value!
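This behaviour is easy to reproduce in a few lines of Python (an illustrative sketch of our own; it matches the table above to the displayed accuracy):

    import math

    # Midpoint rule applied to y' = -5y, y(0) = 1, with y1 rounded to 5D as above
    h = 0.1
    f = lambda x, y: -5.0 * y
    ys = [1.0, 0.60653]
    for n in range(1, 10):
        ys.append(ys[n - 1] + 2 * h * f(n * h, ys[n]))   # here y_{n+1} = y_{n-1} - y_n
    for n, y in enumerate(ys):
        print(f"n = {n:2d}   x = {n * h:.1f}   y_n = {y:9.5f}   true = {math.exp(-5 * n * h):.5f}")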

Checkpoint

1. What distinguishes an explicit multistep method from an implicit


one?
2. Give an advantage of multistep methods.

EXERCISES

1. Apply the second-order Adams-Bashforth method with h = 0.1 to the prob-


lem y 0 = −5y, y(0) = 1, to obtain the approximations y2 , . . . , y10 . (Take
y1 = 0.60653.)
Confirm that the approximations do not exhibit the instability behaviour of
the midpoint method as seen in the table of Section 2.
2. Retaining up to six digits, use the second-order Adams-Bashforth method
with step length h = 0.1 to estimate y(0.5) given that y 0 = x + y with
y(0) = 1. (Take y1 = 1.11 as the second starting value.)
STEP 35

ORDINARY DIFFERENTIAL EQUATIONS 3*


Higher order differential equations*

In the previous two Steps we discussed numerical methods for solving the first-
order initial value problem y 0 = f (x, y), y(x0 ) = y0 . However, ordinary differ-
ential equations that arise in practice are often of higher order. For example, as
explained in the footnote on page 4, a more realistic differential equation for the
motion of a pendulum is given by

    y″ = −ω² sin y

which may be solved subject to y(x₀) = y₀ and y′(x₀) = y₀′, where y₀, y₀′ are
given values. (For notational consistency with the previous two Steps, we have
changed the variables from θ and t to y and x, respectively.) More generally, an
n-th order differential equation may be written in the form

    y⁽ⁿ⁾ = g(x, y, y′, y″, ..., y⁽ⁿ⁻²⁾, y⁽ⁿ⁻¹⁾)

and solved subject to the n initial conditions

    y(x₀) = y₀,   y′(x₀) = y₀′,   ...,   y⁽ⁿ⁻¹⁾(x₀) = y₀⁽ⁿ⁻¹⁾

where y₀, y₀′, ..., y₀⁽ⁿ⁻¹⁾ are given values. We shall see how this n-th order initial
value problem may be written as a system of first-order initial value problems,
which leads us to numerical procedures to solve the general initial value problem
that are extensions of the numerical methods considered in the previous two Steps.

1 Systems of first-order initial value problems


If we set w_j = y⁽ʲ⁻¹⁾ for j = 1, 2, ..., n, then the n-th order differential equation
y⁽ⁿ⁾ = g(x, y, y′, y″, ..., y⁽ⁿ⁻²⁾, y⁽ⁿ⁻¹⁾) becomes

    w_n′ = g(x, w₁, w₂, ..., w_n)

Moreover, since w_j′ = w_{j+1} for j = 1, 2, ..., n − 1, we have an equivalent system


of n first-order differential equations, which is subject to the n initial conditions
w₁(x₀) = y₀, w₂(x₀) = y₀′, ..., w_n(x₀) = y₀⁽ⁿ⁻¹⁾.
For computational purposes, we choose to regard this hierarchy as a system of
n first-order initial value problems. Thus if the initial conditions for the simple

pendulum are y(0) = 0 and y 0 (0) = 1 for example, then the system of two
first-order initial value problems is given by
    w₁′ = w₂,  w₁(0) = 0,     w₂′ = −ω² sin w₁,  w₂(0) = 1
We remark that a more general system of n first-order differential equations is
given by
    w_j′ = g_j(x, w₁, w₂, ..., w_n)
for j = 1, 2, . . . , n.

2 Numerical methods for first-order systems


For simplicity we shall consider only the n = 2 case. Then the second-order
initial value problem

    y″ = g(x, y, y′),   y(x₀) = y₀,   y′(x₀) = y₀′

leads to the first-order system

    w₁′ = w₂,  w₁(x₀) = y₀,     w₂′ = g(x, w₁, w₂),  w₂(x₀) = y₀′

To extend any of the numerical methods considered in the previous two Steps, we
simply apply the method to each of these two initial value problems. The easiest
way to see how this may be done is to write the system in vector form. Setting

    w = [w₁, w₂]T,   g(x, w) = [w₂, g(x, w₁, w₂)]T,   w₀ = [y₀, y₀′]T

we see that the system may be expressed as

    w′ = g(x, w),   w(x₀) = w₀

Now recall that Euler's method for solving y′ = f(x, y) is

    y_{n+1} = y_n + hf(x_n, y_n)

with the given starting (initial) value y(x₀) = y₀. The analogous method for
solving the system is given by

    w_{n+1} = w_n + hg(x_n, w_n)

with w(x₀) = w₀ given. Here we have again adopted the convention that the
subscript n denotes the estimate at x_n = x₀ + nh, as in the previous Steps; and we
shall denote the components of w_n by w_{1,n} and w_{2,n}. These are the approximations
to w₁(x_n) = y(x_n) and w₂(x_n) = y′(x_n), respectively. If we write out the separate
components in these vectors, we obtain the equations

    w_{1,n+1} = w_{1,n} + hw_{2,n},   w_{1,0} = y₀

and

    w_{2,n+1} = w_{2,n} + hg(x_n, w_{1,n}, w_{2,n}),   w_{2,0} = y₀′

As another illustration, recall the Runge-Kutta method given by

    k₁ = hf(x_n, y_n)
    k₂ = hf(x_n + h, y_n + k₁)
    y_{n+1} = y_n + ½(k₁ + k₂)

Then the analogous method for solving the second-order initial value problem is

    k₁ = hg(x_n, w_n)
    k₂ = hg(x_n + h, w_n + k₁)
    w_{n+1} = w_n + ½(k₁ + k₂)

In component form the equations are

    k₁₁ = hw_{2,n}
    k₁₂ = hg(x_n, w_{1,n}, w_{2,n})
    k₂₁ = h(w_{2,n} + k₁₂)
    k₂₂ = hg(x_n + h, w_{1,n} + k₁₁, w_{2,n} + k₁₂)
    w_{1,n+1} = w_{1,n} + ½(k₁₁ + k₂₁)
    w_{2,n+1} = w_{2,n} + ½(k₁₂ + k₂₂)

3 Numerical example
If we use the Euler method to solve the pendulum problem
    w₁′ = w₂,  w₁(0) = 0,     w₂′ = −ω² sin w₁,  w₂(0) = 1

the resulting equations are

    w_{1,n+1} = w_{1,n} + hw_{2,n},   w_{1,0} = 0

and

    w_{2,n+1} = w_{2,n} − hω² sin w_{1,n},   w_{2,0} = 1

With ω = 1 and h = 0.2, we obtain the values given in the following table:

n xn w1,n w2,n
0 0.0 0 1
1 0.2 0.20000 1.00000
2 0.4 0.40000 0.96027
3 0.6 0.59205 0.88238
4 0.8 0.76853 0.77077
5 1.0 0.92268 0.63175

If we use the Runge-Kutta method given in the previous section, we obtain the
following values:

n xn w1,n w2,n
0 0.0 0 1
1 0.2 0.20000 0.98013
2 0.4 0.39205 0.92169
3 0.6 0.56875 0.82898
4 0.8 0.72377 0.70810
5 1.0 0.85215 0.56574
Since this Runge-Kutta method is second-order and the Euler method is only
first-order, we might expect the values in this table to be more accurate than those
displayed in the previous table. By obtaining very accurate approximations using
much smaller values of h, it may be verified that this is indeed the case.
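A hedged Python sketch of the vector Euler method applied to this pendulum system (the names are ours; plain lists stand in for the vectors) reproduces the first table to the displayed accuracy:

    import math

    def euler_system(g, x0, w0, h, n_steps):
        """Euler's method for a first-order system w' = g(x, w)."""
        x, w = x0, list(w0)
        history = [(x, list(w))]
        for _ in range(n_steps):
            dw = g(x, w)
            w = [wi + h * dwi for wi, dwi in zip(w, dw)]
            x += h
            history.append((x, list(w)))
        return history

    # Pendulum as a first-order system: w1' = w2, w2' = -omega**2 * sin(w1)
    omega = 1.0
    g = lambda x, w: [w[1], -omega**2 * math.sin(w[0])]
    for x, (w1, w2) in euler_system(g, 0.0, [0.0, 1.0], 0.2, 5):
        print(f"x = {x:.1f}   w1 = {w1:.5f}   w2 = {w2:.5f}")

The second-order Runge-Kutta version is obtained in the same way, by replacing the single Euler update with the two-stage update of the previous section.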

Checkpoint

1. How may an n-th order initial value problem be written as a system


of n first-order initial value problems?
2. How may a numerical method for solving a first-order initial value
problem be extended to solve a system of two first-order initial value
problems?

EXERCISE

Apply the Euler method with step length h = 0.2 to obtain approximations
to y(1) and y 0 (1) for the second-order initial value problem
y 00 + y 0 + y = sin x, y(0) = y 0 (0) = 0
APPLIED EXERCISES

Here we give some exercises which have a more ‘applied’ nature than most of
the exercises found previously. Some of these exercises are not suitable for hand
calculation, but require the use of a computer.

EXERCISES

1. Consider the cylindrical tank of radius r lying with its axis horizontal as
described in Section 1 of Step 6. Suppose we wish to calibrate a measuring
stick so that it has markings showing when the tank is 10%, 20%, 30%, and
40% full. Find the values of h required (see Figure 2 on page 24) for doing
this calibration.
2. If H (t) is the population of a prey species at time t and P(t) is the population
of a predator species at time t, then a simple model relating these two
populations is given by the two differential equations
    dH/dt = H(1 − 0.05P)
    dP/dt = P(0.01H − 0.6)
It may be shown that one solution of this problem is obtained when H (t) and
P(t) satisfy
0.6 ln H (t) + ln P(t) − 0.01H (t) − 0.05P(t) = −30
If the population of the prey at t = 0 is H (0) = 1000, find the value of P(0).
3. Carbohydrates, proteins, fats, and alcohol are the main sources of energy in
food. The number of grams of these nutrients in 100 gram servings of each
of bread, lean steak, ice cream, and red wine is given in the following table.
Carbohydrate Protein Fat Alcohol
Bread 47 8 2 0
Steak 0 27 12 0
Ice cream 25 4 7 0
Red wine 0 0 0 10

Given that 100 grams of bread, lean steak, ice cream, and red wine provide
227, 218, 170, and 68 kilocalories respectively, find the number of kilocalories
provided by 100 grams of carbohydrate, protein, fat, and alcohol.

4. A lake populated with fish is divided into three regions X , Y , and Z . Let
x (t) , y (t) , and z (t) denote the proportions of the fish in regions X , Y , and
Z respectively after t days. Then as the fish swim around the lake, the
proportions after (t + 1) days satisfy

    x(t+1) = (1/4)x(t) + (1/4)y(t) + (1/5)z(t)
    y(t+1) = (2/5)x(t) + (1/2)y(t) + (2/5)z(t)
    z(t+1) = (7/20)x(t) + (1/4)y(t) + (2/5)z(t)

which may be written as x(t+1) = Ax(t), where

    x(t) = [x(t), y(t), z(t)]T

and A is the matrix

    [ 1/4    1/4    1/5 ]
    [ 2/5    1/2    2/5 ]
    [ 7/20   1/4    2/5 ]

Given that after day 1 we have x(1) = (0.24, 0.43, 0.33)T , find the initial
population distribution x(0) .
5. For the linear system x(t+1) = Ax(t) given in the previous exercise, it is
interesting to see whether there is an equilibrium population distribution;
that is, is there an x for which Ax = x? Use the power method to show that
there is such an x and find its value.
6. In applied mathematics the Bessel functions Jn of order n often arise. The
values of the Bessel function J0 (x) for x = 0.0 (0.1) 0.5 are given to 4D in
the following table.

x J0 (x)
0.0 1.0000
0.1 0.9975
0.2 0.9900
0.3 0.9776
0.4 0.9604
0.5 0.9385

Find the degree of the polynomial which fits the data and obtain an approxi-
mation to J0 (0.25) based on the interpolating polynomial of this degree.
7. Corrugated iron is manufactured by using a machine that presses a flat sheet
of iron into one whose cross section has the form of a sine wave. Suppose
a corrugated sheet 69 cm wide is required, the height of each wave from the
centre line is 1 cm and each wave has a period of 10 cm. The width of flat

sheet required is then given by the arc length of the curve f (x) = sin(π x/5)
from x = 0 to x = 69. From calculus, this arc length is
    L = ∫_0^69 √(1 + (f′(x))²) dx = ∫_0^69 √(1 + π² cos²(πx/5)/25) dx

Find the value of L.


8. The error function given by
    erf(x) = (2/√π) ∫_0^x e^(−t²) dt

commonly arises in applied mathematics and statistics. Produce a table of


values of the error function for x = 0.1 (0.1) 1.0.
9. As explained in the footnote on page 4, a more realistic differential equation
for the motion of a pendulum is given by
    d²θ/dt² = −ω² sin θ

On multiplying this equation by dθ/dt and integrating with respect to t, we have

    dθ/dt = ω√(2 cos θ + c)
where c is an arbitrary constant. Assuming that ω = 1, θ (0) = 0, and c = −1
(so that θ 0 (0) = 1), use this first-order equation to obtain an estimate of θ (1).
(You may wish to compare your answer with the values obtained in Section
3 of Step 35.)
10. A skydiver experiences a downward force of mg and an upward force of
mkv 2 , where m is the mass of the skydiver, g is the force due to gravity, k is
a constant, and v(t) is the speed at time t. A differential equation modelling
this situation is given by
    m dv/dt = mg − mkv²   or   dv/dt = g − kv²

It may be shown that as t → ∞, v(t) → √(g/k), the terminal velocity. If
g = 9.81, k = 1/(100g), and the skydiver starts from rest, find (to the nearest
tenth of a second) the time it takes for the skydiver to reach a speed of half
the terminal velocity.
APPENDIX: PSEUDO-CODE

Basic pseudo-code is given for some of the algorithms introduced in the book. In
our experience, students do benefit if they study the pseudo-code of a method at
the same time as they learn it in a Step. If they are familiar with a programming
language, they should attempt to convert at least some of the pseudo-code into
computer programs, and apply them to the set Exercises.

Pseudo-code is given in this Appendix for the following procedures.

Nonlinear equations
1. Bisection method (Step 7)
2. Method of false position (Step 8)
3. Newton-Raphson iterative method (Step 10)

Systems of linear equations


4. Gaussian elimination (Step 11)
5. Gauss-Seidel iteration (Step 13)

Interpolation
6. Newton’s divided difference formula (Step 24)

Numerical integration
7. Trapezoidal rule (Step 30)
8. Gaussian integration formula (Step 32)

Differential equations
9. Runge-Kutta method (Step 33)

1. Bisection Method (Step 7)


Assume the equation is f (x) = 0.
1 read a, b, 
2 repeat
3 x = (a + b)/2
4 if f (x) = 0 then do:
5 print ‘Root is’, x
6 stop
7 endif
8 if f (a) ∗ f (x) > 0 then do:
9 a=x
10 else do:
11 b=x
12 endif
13 until b − a < 
14 print ‘Approximation to root is’, x

Points for study


(a) What are the input values used for?
(b) Explain the purpose of lines 8–12.
(c) Amend the pseudo-code so that the process will always stop after exactly M
iterations.
(d) Amend the pseudo-code so that the process will stop as soon as | f (x)| < .
(e) Write a computer program based on the pseudo-code.
(f) Use the computer program on Exercises 1 and 2 of the Applied Exercises
(page 160).

2. Method of False Position (Step 8)


Assume the equation is f (x) = 0.
1 read a, b, 
2 repeat
3 x = (a ∗ f (b) − b ∗ f (a))/( f (b) − f (a))
4 if f (x) = 0 then do:
5 print ‘Root is’, x
6 stop
7 endif
8 if f (a) ∗ f (x) > 0 then do:
9 a=x
10 else do:
11 b=x
12 endif
13 until | f (x)| < 
14 print ‘Approximation to root is’, x

Points for study


(a) What are the input values used for?
(b) Under what circumstances may the process stop with a large error in x?
(c) Amend the pseudo-code so that the process will stop after M iterations if the
condition in line 13 is not satisfied.
(d) Write a computer program based on the pseudo-code.
(e) Use the computer program on Exercises 1 and 2 of the Applied Exercises
(page 160).

3. Newton-Raphson Iterative Method (Step 10)


Assume the equation is f (x) = 0.
1 read a, M, 
2 N =0
3 repeat
4 δ = f (a)/ f ′(a)
5 a =a−δ
6 N = N +1
7 until |δ| <  or N = M
8 print ‘Approximation to root is’, a
9 if |δ| ≥  then do:
10 print ‘required accuracy not reached in’, M, ‘iterations’
11 endif

Points for study


(a) What are the input values used for?
(b) Why is M given in the output of line 10?
(c) What happens if f 0 (a) is very small?
(d) Amend the pseudo-code to take suitable action if f 0 (a) is very small.
(e) Write a computer program based on the pseudo-code.
(f) Use the computer program on Exercises 1 and 2 of the Applied Exercises
(page 160).

4. Gaussian Elimination (Step 11)


Assume the system is
a11 x1 + a12 x2 + · · · + a1n xn = b1
a21 x1 + a22 x2 + · · · + a2n xn = b2
.. .. .. .. ..
. . . . .
an1 x1 + an2 x2 + · · · + ann xn = bn

1 read n, a11 , . . . , ann , b1 , . . . , bn


2 for k = 1 to n − 1 do:
3 for i = k + 1 to n do:
4 m = aik /akk
5 for j = k + 1 to n do:
6 ai j = ai j − m ∗ ak j
7 endfor
8 bi = bi − m ∗ bk
9 endfor
10 endfor
11 xn = bn /ann
12 for i = n − 1 downto 1 do:
13 xi = bi
14 for j = i + 1 to n do:
15 xi = xi − ai j ∗ x j
16 endfor
17 xi = xi /aii
18 endfor
19 print ‘Approximate solution is’, x1 , . . . , xn

Points for study


(a) Explain what happens in lines 2–10.
(b) What process is implemented in lines 11–18?
(c) Amend the pseudo-code so that the code terminates with an informative
message when a zero pivot element is found.
(d) Write a computer program based on the pseudo-code.
(e) Use the computer program on Exercises 3 and 4 of the Applied Exercises
(pages 160–161).

5. Gauss-Seidel Iteration (Step 13)


Assume the system is
a11 x1 + a12 x2 + · · · + a1n xn = b1
a21 x1 + a22 x2 + · · · + a2n xn = b2
.. .. .. .. ..
. . . . .
an1 x1 + an2 x2 + · · · + ann xn = bn

1 read n, a11 , . . . , ann , b1 , . . . , bn , x1 , . . . , xn , 


2 repeat
3 s = 0.0
4 for i = 1 to n do:
5 yi = xi
6 endfor
7 for i = 1 to n do:
8 xi = bi
9 for j = 1 to i − 1 do:
10 xi = xi − ai j ∗ x j
11 endfor
12 for j = i + 1 to n do:
13 xi = xi − ai j ∗ y j
14 endfor
15 xi = xi /aii
16 s = s + |yi − xi |
17 endfor
18 until s < 
19 print ‘Approximate solution is’, x1 , . . . , xn
20 print ‘Value of s is’, s

Points for study


(a) What is the purpose of the number s?
(b) What are the y1 , y2 , . . . , yn used for?
(c) Why is it possible to replace the y j in line 13 with x j ?
(d) Amend the pseudo-code to allow a maximum of M iterations.
(e) Write a computer program based on the pseudo-code.
(f) Use the computer program to solve the system
8x + y − 2z = 5
x − 7y + z = 9
2x + 9z = 11
(g) Use the computer program on Exercises 3 and 4 of the Applied Exercises
(pages 160–161). Do the iterations appear to be converging?

6. Newton Divided Difference Formula (Step 24)


Assume that for given data x0 , . . . , xn , f (x0 ), . . . , f (xn ), and given x ∈ [x0 , xn ],
we wish to calculate Pn (x), where Pn is the interpolating polynomial of degree n.
(The following algorithm is based on divided differences.)
1 read n, x, x0 , . . . , xn , f (x0 ), . . . , f (xn )
2 for i = 0 to n do:
3 di,0 = f (xi )
4 endfor
5 for i = 1 to n do:
6 for j = 1 to i do:
7 di, j = (di, j−1 − di−1, j−1 )/(xi − xi− j )
8 endfor
9 endfor
10 sum = d0,0
11 prod = 1.0
12 for i = 1 to n do:
13 prod = prod ∗ (x − xi−1 )
14 sum = sum + di,i ∗ prod
15 endfor
16 print ‘Approximation at x =’, x, ‘is’, sum

Points for study


(a) Follow the pseudo-code through with the data n = 2, x = 1.5, x0 = 0,
f (x0 ) = 2.5, x1 = 1, f (x1 ) = 4.7, x2 = 3, and f (x2 ) = 3.1. Verify that
the values dii calculated are the divided differences f (x0 , . . . , xi ).
(b) What quantity (in algebraic terms) is calculated in lines 10–15?
(c) Amend the pseudo-code so that the values P1 (x), P2 (x), . . . , Pn−1 (x) are
also printed out.
(d) Write a computer program based on the pseudo-code.
(e) Use the computer program to estimate f (2) for the data given in (a) above.
(f) For the data given in Exercise 6 of the Applied Exercises (page 161), use the
computer program to obtain an estimate of J0 (0.25).

7. Trapezoidal Rule (Step 30)


Assume the integral is ∫_a^b f(x) dx.

1 read a, b, N , M, 
2 done = false
3 U = 0.0
4 repeat
5 h = (b − a)/N
6 s = ( f (a) + f (b))/2
7 for i = 1 to N − 1 do:
8 x =a+i ∗h
9 s = s + f (x)
10 endfor
11 T =h∗s
12 if |T − U | <  then do:
13 done = true
14 else do:
15 N =2∗ N
16 U =T
17 endif
18 until N > M or done
19 print ‘Approximation to integral is’, T
20 if N > M then do:
21 print ‘required accuracy not reached with M =’, M
22 endif

Points for study


(a) What are the input values used for?
(b) What value (in algebraic terms) does T have after line 11?
(c) What is the purpose of lines 12–17?
(d) Write a computer program based on the pseudo-code.
(e) Use the computer program on Exercises 7 and 8 of the Applied Exercises
(pages 161–162).

8. Gaussian Integration Formula (Step 32)


Assume the integral is ∫_a^b f(x) dx. Use the Gauss two-point formula.

1 read a, b
2 x1 = (b + a − (b − a)/√3)/2
3 x2 = (b + a + (b − a)/√3)/2
4 I = (b − a) ∗ ( f (x1 ) + f (x2 ))/2
5 print ‘Approximation to integral is’, I

Points for study


(a) What is the purpose of lines 2 and 3?
(b) What changes are required to produce an algorithm based on the Gauss
three-point formula?
(c) Write a computer program based on the pseudo-code.
(d) Use the computer program on Exercises 7 and 8 of the Applied Exercises
(pages 161–162).

9. Runge-Kutta Method (Step 33)


Assume the equation is y 0 = f (x, y) and use the usual fourth-order method.
1 read x, y, h, M
2 print x, y
3 N =0
4 repeat
5 k1 = h ∗ f (x, y)
6 x = x + h/2
7 z = y + k1 /2
8 k2 = h ∗ f (x, z)
9 z = y + k2 /2
10 k3 = h ∗ f (x, z)
11 x = x + h/2
12 z = y + k3
13 k4 = h ∗ f (x, z)
14 y = y + (k1 + 2k2 + 2k3 + k4 )/6
15 print x, y
16 N = N +1
17 until N = M

Points for study


(a) What are the input values used for?
(b) How many times is the function f evaluated between lines 4 and 17?
(c) Amend the pseudo-code to use the second-order Runge-Kutta method.
(d) Write a computer program based on the pseudo-code.
(e) Use the computer program on Exercises 9 and 10 of the Applied Exercises
(page 162).
ANSWERS TO THE EXERCISES

STEP 1 EXERCISES (page 5)


Error points are indicated by asterisks. Note that errors of types (a) and (b) are
involved in the use of both formulae.

1. T ≈ 2 × 3.14∗ × √(75∗ /981∗ ) ≈ 6.28 × √0.0765∗ ≈ 6.28 × 0.277∗ ≈ 1.74∗
   seconds

2. R ≈ 0.028∗ × 3.14∗ × 56.25∗ × √(2 × 981∗ × 650∗ ) ≈ 4.946∗ × 1129∗ ≈
   5.58∗ × 10³ cm³/s

STEP 2 EXERCISES (page 9)

1. 1.2345 × 101 , 8.0059 × 10−1 , 2.96844 × 102 , 5.19 × 10−3 .

2. (a) 34.7, 3.47, 0.347, 0.0347.


(b) 34.782, 3.478, 0.347, 0.034.
(c) 34.8, 3.48, 0.348, 0.0348.
(d) 34.782, 3.478, 0.348, 0.035.
3. (a) |5/3 − 1.66| = 1/150.
(b) |5/3 − 1.666| = 1/1500.
(c) |5/3 − 1.67| = 1/300.
(d) |5/3 − 1.667| = 1/3000.

STEP 3 EXERCISES (page 13)

1. The result 13.57, max |eabs | = 0.005 + 0.005 = 0.01, so the answer is
13.57 ± 0.01 or 13.6 correct to 3S.
2. The result 0.01, max |eabs | = 0.01, so that although operands are correct to
5S, the answer may not even be correct to 1S ! This phenomenon is known as
loss of significance or cancellation (see Step 4 for more details).
3. The result 13.3651, max |eabs | ≈ (4.27+3.13)×0.005 = 0.037, so the answer
is 13.3651 ± 0.037 or 13 correct to 2S.
4. The result −1.85676, max |eabs | ≈ 0.513 × 0.005 + 9.48 × 0.0005 + 0.005 ≈
0.012, so the answer is −1.85676 ± 0.012 or −2 correct to 1S.

5. The result 1.109 . . .,


    max |erel | ≈ Σ|erel | = 0.005/0.25 + 0.005/2.84 + 0.005/0.64 ≈ 0.030
Since max |eabs | is approximately 1.109 max |erel |, then the value is 1.109 ±
0.033 or 1.1 correct to 2S.
6. The result 0.47, max |eabs | = 7×0.005 = 0.035, so the answer is 0.47±0.035
and we cannot even guarantee 1S.

STEP 4 EXERCISES (page 17)


1. (a) 12.01 × 102 → 1.20 × 103 .
(b) 6.19 × 102 + 0.361 × 102 = 6.551 × 102 → 6.55 × 102 .
(c) 0.37 × 102 → 3.70 × 101 .
(d) 6.19 × 102 − 0.361 × 102 = 5.829 × 102 → 5.83 × 102 .
(e) 3.63600 × 102 → 3.64 × 102 .
(f) 33.3000 × 100 → 3.33 × 101 .
(g) 1.25000 × 103 → 1.25 × 103 .
(h) −0.869300 . . . × 10−5 → −8.69 × 10−6 .
2. Since the initial errors are of unknown sign and size, we estimate E, the
maximum magnitude of the accumulated error, from the results of Step 3,
assuming the worst about the initial errors. To estimate the propagated error
we use max |eabs | = |e1 | + |e2 | for addition and subtraction, and
    max |erel | ≈ |e₁ /x∗ | + |e₂ /y∗ |
for multiplication and division. The magnitude of the generated error is
denoted by ε.
(a) ε = 0.001 × 10³, max |eabs | = 0.005 × 10² + 0.005 × 10² = 0.01 × 10²,
    E = 0.02 × 10².
(b) ε = 0.001 × 10², max |eabs | = 0.005 × 10² + 0.005 × 10¹ = 0.0055 × 10²,
    E = 0.0065 × 10².
(c) ε = 0, max |eabs | = 0.005 × 10² + 0.005 × 10² = 0.01 × 10²,
    E = 0.01 × 10² (relatively large).
(d) As for (b).
(e) ε = 0.004 × 10², max |erel | ≈ 0.005/3.60 + 0.005/1.01,
    max |eabs | ≈ 0.005 × (1.01 + 3.60) × 10² ≈ 0.023 × 10², E ≈ 0.027 × 10².
(f) ε = 0, max |erel | ≈ 0.005/7.50 + 0.005/4.44,
    max |eabs | ≈ 0.005 × (7.50 + 4.44) × 10⁰ ≈ 0.06 × 10⁰, E ≈ 0.06 × 10⁰.
(g) ε = 0, max |erel | ≈ 0.005/6.45 + 0.005/5.16,
    max |eabs | ≈ (0.005/6.45 + 0.005/5.16) × 1.25 × 10³ ≈ 0.002 × 10³,
    E ≈ 0.002 × 10³.
(h) ε ≈ 0.0003 × 10⁻⁵, max |erel | ≈ 0.005/2.86 + 0.005/3.29,
    max |eabs | ≈ (0.005/2.86 + 0.005/3.29) × 8.69 × 10⁻⁶ ≈ 0.028 × 10⁻⁶,
    E ≈ 0.031 × 10⁻⁶.
3. (a) b − c = 5.685 × 101 − 5.641 × 101 = 0.044 × 101 → 4.400 × 10−1 .
a(b − c) = 6.842 × 10−1 × 4.400 × 10−1 = 30.1048 × 10−2 → 3.010 ×
10−1 .
ab = 6.842 × 10−1 × 5.685 × 101 = 38.896770 × 100 → 3.890 × 101 .
ac = 6.842 × 10−1 × 5.641 × 101 = 38.595722 × 100 → 3.860 × 101 .
ab − ac = 3.890 × 101 − 3.860 × 101 = 0.030 × 101 → 3.000 × 10−1 .
The answer obtained (working to 6S ) is 3.01048 × 10−1 with propagated
error at most 0.069 × 10−1 , so we can only rely on the first digit!
(b) a + b = 9.812 × 101 + 0.04631 × 101 = 9.85831 × 101 → 9.858 × 101 .
(a+b)+c = 9.858×101 +0.08340×101 = 9.94140×101 → 9.941×101 .
b + c = 4.631 × 10−1 + 8.340 × 10−1 = 12.971 × 10−1 → 1.297 × 100 .
a +(b +c) = 9.812×101 +0.1297×101 = 9.9417×101 → 9.942×101 .
The answer obtained (working to 6S ) is 9.94171 × 101 with propagated
error at most 0.00051 × 101 .
4. Direct use of f (x) = tan x − sin x leads to the approximation to f (0.1) given
by

tan(0.1) − sin(0.1) = 0.10033 . . . − 0.099833 . . .


→ 1.003 × 10−1 − 0.9983 × 10−1
→ 4.700 × 10−4

while using the alternative expression f (x) = 2 tan x sin2 (x/2) leads to the
approximation

2 × 0.10033 . . . × (0.049979 . . .)2


→ (2.000 × 100 ) × (1.003 × 10−1 ) × (4.998 × 10−2 )2
→ (2.000 × 100 ) × (1.003 × 10−1 ) × (2.498 × 10−3 )
→ (2.006 × 10−1 ) × (2.498 × 10−3 )
→ 5.011 × 10−4

This second approximation is more accurate.



STEP 5 EXERCISES (page 21)


1. (a) The Taylor expansion for n = 2k is given by

        cos x = 1 − x²/2! + x⁴/4! − x⁶/6! + ··· + (−1)^k x^(2k)/(2k)! + R_{2k}

    where

        R_{2k} = (−1)^(k+1) x^(2k+1) sin ξ / (2k + 1)!

    and ξ lies between 0 and x. Alternatively, since the coefficient of the
    x^(2k+1) term will be zero, the same polynomial approximation is obtained
    with n = 2k + 1 so that R_{2k} may be replaced by

        R_{2k+1} = (−1)^(k+1) x^(2k+2) cos η / (2k + 2)!

    where η lies between 0 and x.
   (b) Since f⁽ʲ⁾(x) = j!/(1 − x)^(j+1), we have f⁽ʲ⁾(0) = j! so that the Taylor
    expansion (for x ≠ 1) is

        1/(1 − x) = 1 + x + x² + x³ + ··· + x^n + R_n

    where

        R_n = x^(n+1)/(1 − ξ)^(n+2)

    and ξ lies between 0 and x. An alternative expression for R_n is

        R_n = x^(n+1)/(1 − x)

    which may be obtained from the formula for the sum of a geometric series.
   (c) Since f⁽ʲ⁾(x) = e^x, the Taylor expansion is

        e^x = 1 + x + x²/2! + x³/3! + ··· + x^n/n! + R_n

    where

        R_n = x^(n+1) e^ξ/(n + 1)!

    and ξ lies between 0 and x.
2. (a) cos(0.5) = 0.87758 (to 5D ) while the first four terms of the Taylor
expansion yield 0.87758.
(b) 1/(1 − 0.5) = 2 while the first four terms of the Taylor expansion yield
1.875.
(c) e0.5 = 1.64872 (to 5D ) while the first four terms of the Taylor expansion
yield 1.64583.

3. From 1(c) we see that


    |R_n| = x^(n+1) e^ξ/(n + 1)! < 1^(n+1) e^1/(n + 1)!   for all x between 0 and 1.

Thus |R_n| < 5 × 10⁻⁶ if (n + 1)! ≥ 2e × 10⁵ ≈ 543 656; that is, we require
n = 9, since 9! = 362 880 and 10! = 3 628 800.
4. linear: 1 + x over the range −0.1 < x < 0.1.
   quadratic: 1 + x + ½x² over the range −0.3 < x < 0.3.
   cubic: 1 + x + ½x² + (1/6)x³ over the range −0.5 < x < 0.5.
5. p0 = 1, q0 = 0,
p1 = p0 (3.1) + (−2) = 1.1, q1 = q0 (3.1) + (1) = 1,
p2 = p1 (3.1) + (2) = 5.41, q2 = q1 (3.1) + (1.1) = 4.2,
p3 = p2 (3.1) + (3) = 19.771, q3 = q2 (3.1) + (5.41) = 18.43.
Only three multiplications and three additions are required to evaluate P(3.1)
whereas 3.1×3.1×31−2×3.1×3.1+2×3.1+3 requires five multiplications
and three additions.
6. p0 = 2, q0 = 0,
p1 = p0 (2.6) + (−1) = 4.2, q1 = q0 (2.6) + (2) = 2,
p2 = p1 (2.6) + (3) = 13.92, q2 = q1 (2.6) + (4.2) = 9.4,
p3 = p2 (2.6) + (0) = 36.192, q3 = q2 (2.6) + (13.92) = 38.36,
p4 = p3 (2.6) + (5) = 99.0992, q4 = q3 (2.6) + (36.192) = 135.928.

STEP 6 EXERCISES (page 26)


1. The curves sketched in Figure 17 are similar to those in Section 2 of Step 6,
and enable us to deduce that there is one real root near x = −0.7. Tabulating
confirms its location:
x −0.7 −0.8 −0.75
cos x 0.7648 0.6967 0.7317
x + cos x 0.0648 −0.1033 −0.0183

2. (a) There is only one root which is near −1.


(b) There is only one root which is near −1/2.
(c) There is only one root which is near −1/2.
(d) There are two roots, one at 0 and the other near 3/2.

STEP 7 EXERCISES (page 29)


1. In Step 6 we saw that the root lies in the interval (−0.75, −0.7). Successive
bisections produce the following sequence of intervals containing the root:
(−0.75, −0.725), (−0.75, −0.7375), (−0.74375, −0.7375). Thus the root
to 2D is −0.74, since it is enclosed in the interval (−0.74375, −0.7375), of
half-length 3.125 × 10⁻³, which is less than the required 5 × 10⁻³ for 2D
accuracy.

[FIGURE 17. Graphs of y = x and y = −cos x.]
2. Root is 0.615 to 3D.
3. (a) With an initial interval of (−1.1, −1) say, application of the bisection
method shows that the root to 2D is −1.03.
(b) With an initial interval of (−0.6, −0.5) say, the root to 2D is −0.57.
(c) With an initial interval of (−0.5, −0.4) say, the root to 2D is −0.44.

STEP 8 EXERCISES (page 33)

1. Tabulate f :
x f (x)
0 −2
0.2 −1.40266
0.6 −0.27072
0.8 +0.23471
There is a root in the interval 0.6 < x < 0.8. We have

    x₁ = (0.6 × 0.23471 − 0.8 × (−0.27072)) / (0.23471 + 0.27072)
       = (0.14083 + 0.21657)/0.50543 = 0.70712

    f(x₁) = f(0.70712) = 2 sin(0.70712) + 0.70712 − 2
          = 1.29929 + 0.70712 − 2 = 0.00642

Since f(0.6) and f(0.70712) have opposite signs, the root is in the interval
0.6 < x < 0.70712. Repeating the process,

    x₂ = (0.6 × 0.00642 − 0.70712 × (−0.27072)) / (0.00642 + 0.27072)
       = (0.00385 + 0.19143)/0.27714 = 0.70464

Since f(x₂) = f(0.70464) = 0.00016, we know the root lies between 0.6
and 0.70464, so we compute

    x₃ = (0.6 × 0.00016 − 0.70464 × (−0.27072)) / (0.00016 + 0.27072) = 0.70458

Since | f (0.70458)| is less than the requested value of 5 × 10−5 we may stop.
Because x2 and x3 agree to 4D, we conclude that the root accurate to 4D is
0.7046. Note that all the xn computed have f (xn ) positive.
2. Let us take f (x) = 3 sin x − x − 1/x. We note that f (0.7) = −0.19592 and
f (0.9) = 0.33887, that is, the root is enclosed. We shall obtain a solution
accurate to four decimal places. The following results are obtained.
(a) Bisection gives the sequence of intervals: (0.7, 0.9), (0.7, 0.8), (0.75, 0.8),
(0.75, 0.775), (0.7625, 0.775), (0.7625, 0.76875), (0.7625, 0.76563),
(0.7625, 0.76406), (0.7625, 0.76328), (0.76289, 0.76328), (0.76309,
0.76328), (0.76309, 0.76318). Thus the root to 4D is 0.7631, since it
is enclosed in the interval (0.76309, 0.76318), of half-length less than
5 × 10−5 .
(b) If [an , bn ] is the interval bracketing the root at the n-th iteration of false
position, then the first iteration with a1 = 0.7 and b1 = 0.9 yields the
approximation x1 = 0.77327. Since f (0.77327) = 0.02896, the process
is repeated, now with a2 = 0.7 and b2 = 0.77327. This yields x2 =
0.76383. Since f (0.76383) = 0.00207, we take a3 = 0.7 and b3 =
0.76383 to obtain x3 = 0.76317 and f (x3 ) = 0.00015. Then a4 = 0.7
and b4 = 0.76317 gives x4 = 0.76312. One more iteration yields the
approximation 0.76312 again, so we conclude that the root is 0.7631 to
4D. Note that all the values of f (xn ) are positive.
(c) The secant method with x0 = 0.7, x1 = 0.9 gives x2 = 0.77327, x3 =
0.76143, x4 = 0.76314, and x5 = 0.76312. Again we conclude that the
root is 0.7631 to 4D.
3. In Step 7 we found that the root lies in the interval (−0.74375, −0.7375). False
position with a1 = −0.75 and b1 = −0.73 (using f (a1 ) = −0.01831 and
f (b1 ) = 0.01517) gives x1 = −0.73906. Since f (−0.73906) = 0.00004,
the process is repeated with a2 = −0.75 and b2 = −0.73906 to give x2 =
−0.73909. Since the magnitude of f (−0.73909) is less than the specified
value of 5 × 10−6 , we terminate the process and give the root as −0.7391.

4. (a) With an initial interval of (−1.1, −1), the stopping criterion is satisfied
after three iterations of the method of false position and the root accurate
to 4D is −1.0299.
(b) With an initial interval of (−0.6, −0.5), the stopping criterion is satisfied
after three iterations and the root accurate to 4D is −0.5671.
(c) With an initial interval of (−0.5, −0.4), the stopping criterion is satisfied
after three iterations and the root accurate to 4D is −0.4441.

STEP 9 EXERCISES (page 36)


1. Using the iteration formula
xn+1 = 0.5 + sin xn
only six iterations are required:
x1 = 1.34147
x2 = 1.47382
x3 = 1.49530
x4 = 1.49715
x5 = 1.49729
x6 = 1.49730
Note that φ 0 (x) = cos x ≈ 0.07 near the root, so convergence is fast (and
‘one-sided’).
2. In Step 7 we found that the root is −0.74 to 2D. Using the iteration formula
xn+1 = − cos xn
with x0 = −0.74, we obtain
x1 = −0.73847
x2 = −0.73950
x3 = −0.73881
x4 = −0.73927
x5 = −0.73896
x6 = −0.73917
x7 = −0.73903
x8 = −0.73912
x9 = −0.73906
x10 = −0.73910

Since x9 and x10 agree to 4D we can give the root as −0.7391. Note that
φ 0 (x) = sin x ≈ −0.67 near the root, so convergence is slow (and ‘oscilla-
tory’).

3. Using the iteration formula x_{n+1} = −e^(x_n) with a starting value of x₀ = −1/2


yields the values −0.6065, −0.5452, −0.5797, −0.5601, −0.5712, −0.5649,
−0.5684, −0.5664, −0.5676, −0.5669, −0.5673. Thus we conclude that the
root accurate to 3D is −0.567.

STEP 10 EXERCISES (page 41)

1. If f (x) = 3xe x − 1, then f (0) = −1 while f (1/3) is clearly positive because


e1/3 > 1. Thus the root must lie in the interval 0 < x < 1/3. If x0 = 0.25 is
the initial guess,

f (0.25) = 0.75 × e0.25 − 1 = −0.03698

Since
f 0 (x) = 3(x + 1)e x
then f 0 (0.25) = 4.81510 and

    x₁ = 0.25 + 0.03698/4.81510 = 0.25 + 0.00768 = 0.25768

Then

f (0.25768) = 3 × 0.25768 × e0.25768 − 1 = 0.00026


f 0 (0.25768) = 3 × (1.25768) × e0.25768 = 4.88203

and
    x₂ = 0.25768 − 0.00026/4.88203 = 0.25768 − 0.00005 = 0.25763
Doing one more iteration yields x3 = 0.25763, so we conclude that the
root is 0.2576 to 4S. Note that only two or three iterations are required for
the Newton-Raphson process, whereas eight iterations were needed for the
iteration method based on
    x_{n+1} = (1/3)e^(−x_n)

2. Since x^k = a, we take f(x) = x^k − a = 0 with f′(x) = kx^(k−1) to yield

    x_{n+1} = x_n − (x_n^k − a)/(kx_n^(k−1)) = (1 − 1/k)x_n + a/(kx_n^(k−1))

With k = −1 we have an iterative formula for computing inverses without
division: x_{n+1} = x_n(2 − ax_n).

3. With x₀ = 1 and a = 10, we find:

    x₁ = ½(1 + 10/1) = 5.5
    x₂ = ½(5.5 + 10/5.5) = 3.65909
    x₃ = ½(3.65909 + 10/3.65909) = 3.19601
    x₄ = ½(3.19601 + 10/3.19601) = 3.16246
    x₅ = ½(3.16246 + 10/3.16246) = 3.16228
    x₆ = ½(3.16228 + 10/3.16228) = 3.16228

Thus √10 is 3.1623 to 4D.
4. In Step 7 we found that the root is −0.74 to 2D. Taking x0 = −0.74 and
f(x) = x + cos x, so that f′(x) = 1 − sin x, we obtain

    x₁ = −0.74 − (−0.00153)/1.67429 = −0.73909

and

    x₂ = −0.73909 − (−0.00001)/1.67361 = −0.73909
Since x1 and x2 agree to more than 4D we can give the root as −0.7391.
5. (a) The Newton-Raphson method is
        x_{n+1} = x_n − (x_n + 2 cos x_n)/(1 − 2 sin x_n)

    With a starting value of −1 we obtain the approximations −1.03004,
    −1.02987, and −1.02987, so the root to 4D is −1.0299.
    (b) The Newton-Raphson method is

        x_{n+1} = x_n − (x_n + e^(x_n))/(1 + e^(x_n))

    With a starting value of −0.5 we obtain the approximations −0.56631,
    −0.56714, and −0.56714, so the root to 4D is −0.5671.
    (c) The Newton-Raphson method is

        x_{n+1} = x_n − (x_n(x_n − 1) − e^(x_n))/(2x_n − 1 − e^(x_n))
With a starting value of −0.5 we obtain the approximations −0.44496,
−0.44413, and −0.44413, so the root to 4D is −0.4441.

STEP 11 EXERCISES (page 50)


Full answers are given for Questions 1 and 2 only.
1.
m Augmented Matrix
1 1 −1 0
2 −1 1 6
3 2 −4 −4
1 1 −1 0
2 −3 3 6
3 −1 −1 −4
1 1 −1 0
−3 3 6
1/3 −2 −6

Solution by back-substitution

−2x3 = −6 ⇒ x3 = 3
−3x2 + 9 = 6 ⇒ x2 = 1
x1 + 1 − 3 = 0 ⇒ x1 = 2

2.
m Augmented Matrix
5.6 3.8 1.2 1.4
3.1 7.1 −4.7 5.1
1.4 −3.4 8.3 2.4
5.6 3.8 1.2 1.4
0.55 5.00 −5.36 4.33
0.25 −4.35 8.00 2.05
5.6 3.8 1.2 1.4
5.00 −5.36 4.33
−0.87 3.33 5.82

(The numbers in the table are displayed to 2D.)


Solution by back-substitution

3.33z = 5.82 ⇒ z = 1.75


5.00y − 5.36 × 1.75 = 4.33 ⇒ y = 2.74
5.6x + 3.8 × 2.74 + 1.2 × 1.75 = 1.4 ⇒ x = −1.98
3. x = 51/2, y = −9, z = 2.
4. To 2D the solution is x = −4.35, y = −2.45, z = 5.13.

STEP 12 EXERCISES (page 54)


1. If there were no errors in the constants, the exact solution would be x = 2.6,
y = 1.2. With errors the solution intervals are 2.57 ≤ x ≤ 2.63 and 1.17 ≤
y ≤ 1.23.
2. (a) x = 1.2, y = 2.3.
(b) x = 1, y = 1, z = 2.
(c) x1 = 1.2, x2 = 3.5, x3 = 6.4.
3. Without partial pivoting, x = 1.004 × 100 ; y = 4.998 × 10−1 . With partial
pivoting, x = 1.000 × 100 ; y = 5.000 × 10−1 .

STEP 13 EXERCISES (page 58)


1. S₃ = |x₁(4) − x₁(3)| + |x₂(4) − x₂(3)| + |x₃(4) − x₃(3)|. The values for the fourth
iteration x (4) are obtained by continuing the table of Section 3 for one more
row, thus:
4 0.999917 0.999993 1.000017
Then S3 is found to be: S3 = 0.001226 + 0.000304 + 0.000215 = 0.001745.
2. (a) We first rearrange the equations to place the largest coefficients on the
leading diagonal:
20x + 3y − 2z = 51
2x + 8y + 4z = 25
x − y + 10z = −7
The recurrence relations are:
x (k+1) = 2.55 − 0.15y (k) + 0.1z (k)
y (k+1) = 3.125 − 0.25x (k+1) − 0.5z (k)
z (k+1) = −0.7 − 0.1x (k+1) + 0.1y (k+1)

Taking Sk < 0.000005 as the stopping criterion, these relations yield the
following table of results.
Iteration
k x (k) y (k) z (k) Sk (to 6D )
0 0 0 0 5.743750
1 2.550000 2.487500 −0.706250 0.998594
2 2.106250 2.951563 −0.615469 0.093816
3 2.045719 2.921305 −0.612441 0.008322
4 2.050560 2.918581 −0.613198 0.000632
5 2.050893 2.918876 −0.613202 0.000063
6 2.050848 2.918889 −0.613196 0.000004
7 2.050847 2.918886 −0.613196

The solution to 5D is: x = 2.05085, y = 2.91889, z = −0.61320.


(b) x = 0.11236, y = 0.12360, z = 0.12360, w = 0.11236.

STEP 14 EXERCISES (page 62)


1. (a) (full solution)
m A I
2 6 4 1 0 0
6 19 12 0 1 0
2 8 14 0 0 1
2 6 4 1 0 0
3 0 1 0 −3 1 0
1 0 2 10 −1 0 1
2 6 4 1 0 0
0 1 0 −3 1 0
2 0 0 10 5 −2 1
8.5 −2.6 −0.2
Inverse matrix −3 1 0
0.5 −0.2 0.1
 
u1
Note: The first column of A−1 is  v1 , and is found by solving
w1
    
2 6 4 u1 1
 0 1 0   v1  =  −3 
0 0 10 w1 5

by back-substitution. The second column is found by solving


    
2 6 4 u2 0
 0 1 0   v2  =  1 
0 0 10 w2 −2

The third is from


    
26 4 u3 0
 01 0   v3  =  0 
00 10 w3 1
 
0.046 −0.605 1.031
(b) To 3D the inverse is  0.448 −0.403 0.398 
−0.362 0.851 −1.023
 
0.705 2.544 −2.761
(c) To 3D the inverse is  −1.371 0.806 1.609 
2.013 −1.808 −0.030
   
2. (a) x = [51/2, −9, 2]^T,  y = [2.7, −1.0, 0.4]^T
(b) To 3D the solutions are x = [−4.349, −2.448, 5.133]^T,  y = [0.426, −0.004, −0.172]^T
(c) To 3D the solutions are x = [6.648, 0.103, −1.761]^T,  y = [2.381, 2.937, −2.827]^T
STEP 15 EXERCISES (page 68)
1. If we apply the Gaussian elimination technique to the matrix, then we have
the multiplier m_{21} = c/a and the matrix reduces to

    [ a        b       ]
    [ 0   d − (c/a)b   ]

This matrix can be taken to be U, and we take L to be

    L = [   1     0 ]   [  1    0 ]
        [ m_{21}  1 ] = [ c/a   1 ]

It is easily verified that the product LU is the original matrix.
2. (a) From the answer to Exercise 1 of Step 11 we obtain

    L = [ 1    0    0 ]             U = [ 1    1   −1 ]
        [ 2    1    0 ]    and          [ 0   −3    3 ]
        [ 3   1/3   1 ]                 [ 0    0   −2 ]

Solving Ly = [0, 6, −4]^T yields y = [0, 6, −6]^T. To find the solution x we
solve Ux = y, from which we obtain x3 = 3, x2 = 1, and x1 = 2.

(b) Here we have

    L = [ 1   0   0 ]             U = [ 2   6    4 ]
        [ 3   1   0 ]    and          [ 0   1    0 ]
        [ 1   2   1 ]                 [ 0   0   10 ]

Solving Ly = [5, 6, 7]^T yields y = [5, −9, 20]^T.

Finally, solving Ux = y yields x3 = 2, x2 = −9, and x1 = 51/2.
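The two triangular solves are easily mechanised. A minimal Python sketch for part (b), using the L and U quoted above:

    def forward_sub(L, b):
        # solve Ly = b for lower-triangular L with unit diagonal
        y = []
        for i in range(len(b)):
            y.append(b[i] - sum(L[i][j] * y[j] for j in range(i)))
        return y

    def back_sub(U, y):
        # solve Ux = y for upper-triangular U
        n = len(y)
        x = [0.0] * n
        for i in range(n - 1, -1, -1):
            s = sum(U[i][j] * x[j] for j in range(i + 1, n))
            x[i] = (y[i] - s) / U[i][i]
        return x

    L = [[1, 0, 0], [3, 1, 0], [1, 2, 1]]
    U = [[2, 6, 4], [0, 1, 0], [0, 0, 10]]
    y = forward_sub(L, [5, 6, 7])   # -> [5, -9, 20]
    x = back_sub(U, y)              # -> [25.5, -9.0, 2.0], i.e. x1 = 51/2
    print(y, x)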

STEP 16 EXERCISES (page 72)



1. ‖x‖_2 = 79, ‖x‖_∞ = 6.
2. The infinity norms of the matrices are 37, 19.3, and 1.83, respectively.
3. By making use of the answers to Exercise 1 of Step 14 the condition numbers
of the matrices are 418.1, 43.2, and 11.0, respectively.

STEP 17 EXERCISES (page 78)


1. Application of the power method with the starting vector w^{(0)} = [1, 1]^T yields:

    w^{(1)} = [−0.1, 2.8]^T,        λ_1^{(1)} = 2.8/1 = 2.8
    w^{(2)} = [3.2, −2.6]^T,        λ_1^{(2)} = 3.2/(−0.1) = −32
    w^{(3)} = [−6.7, 13.6]^T,       λ_1^{(3)} = 13.6/(−2.6) = −5.23077
    w^{(4)} = [23, −35]^T,          λ_1^{(4)} = −35/13.6 = −2.57353
    w^{(5)} = [−66.1, 110.8]^T,     λ_1^{(5)} = 110.8/(−35) = −3.16571
    w^{(6)} = [201.2, −326.6]^T,    λ_1^{(6)} = −326.6/110.8 = −2.94765
    w^{(7)} = [−600.7, 985.6]^T,    λ_1^{(7)} = 985.6/(−326.6) = −3.01776
    w^{(8)} = [1805, −2951]^T,      λ_1^{(8)} = −2951/985.6 = −2.99412

The characteristic polynomial of A is given by det(A − λI). Since

    A − λI = [ −1.2 − λ       1.1      ]
             [    3.6       −0.8 − λ   ]

the characteristic polynomial is then

    (−1.2 − λ)(−0.8 − λ) − 3.6 × 1.1 = λ² + 2λ − 3 = (λ + 3)(λ − 1)

Thus the eigenvalues are −3 and 1. The approximations from the power
method do appear to be converging to −3, the eigenvalue with the larger
magnitude.

2. Five iterations of the normal power method with starting vector w^{(0)} = [1, 1, 1]^T yield:

    w^{(1)} = [12, 37, 24]^T,                      λ_1^{(1)} = 37/1 = 37
    w^{(2)} = [342, 1063, 656]^T,                  λ_1^{(2)} = 1063/37 = 28.72973
    w^{(3)} = [9686, 30121, 18372]^T,              λ_1^{(3)} = 30121/1063 = 28.33584
    w^{(4)} = [273586, 850879, 517548]^T,          λ_1^{(4)} = 850879/30121 = 28.24870
    w^{(5)} = [7722638, 24018793, 14599876]^T,     λ_1^{(5)} = 24018793/850879 = 28.22821

Note the rapid growth in the size of the components of the vectors. For the
scaled power method with the same starting vector we obtain:

    w^{(1)} = [12, 37, 24]^T,                      p = 2,   λ_1^{(1)} = 37/1 = 37
    y^{(1)} = [0.32432, 1, 0.64865]^T
    w^{(2)} = [9.24324, 28.72973, 17.72973]^T,     p = 2,   λ_1^{(2)} = 28.72973/1 = 28.72973
    y^{(2)} = [0.32173, 1, 0.61712]^T
    w^{(3)} = [9.11195, 28.33584, 17.28316]^T,     p = 2,   λ_1^{(3)} = 28.33584/1 = 28.33584
    y^{(3)} = [0.32157, 1, 0.60994]^T
    w^{(4)} = [9.08290, 28.24870, 17.18230]^T,     p = 2,   λ_1^{(4)} = 28.24870/1 = 28.24870
 
    y^{(4)} = [0.32153, 1, 0.60825]^T
    w^{(5)} = [9.07607, 28.22821, 17.15858]^T,     p = 2,   λ_1^{(5)} = 28.22821/1 = 28.22821
    y^{(5)} = [0.32152, 1, 0.60785]^T
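A short routine makes the scaling explicit. The Python sketch below applies the scaled power method to the 2 × 2 matrix of Exercise 1 (whose entries are stated there); the same routine, given the 3 × 3 matrix of this exercise, produces the iterates listed above.

    def scaled_power_method(A, y, steps=8):
        # w = A y; estimate lambda from the component p of largest magnitude,
        # then rescale y = w / w[p] before the next step
        for k in range(1, steps + 1):
            w = [sum(A[i][j] * y[j] for j in range(len(y))) for i in range(len(A))]
            p = max(range(len(w)), key=lambda i: abs(w[i]))
            print(k, round(w[p] / y[p], 5))
            y = [wi / w[p] for wi in w]

    A = [[-1.2, 1.1], [3.6, -0.8]]      # the matrix of Exercise 1
    scaled_power_method(A, [1.0, 1.0])
    # estimates 2.8, -32.0, -5.23077, -2.57353, ... approaching -3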

STEP 18 EXERCISES (page 82)

1.
x      f(x) = x^3      First diff.      Second      Third      Fourth
0 0
1
1 1 6
7 6
2 8 12 0
19 6
3 27 18 0
37 6
4 64 24 0
61 6
5 125 30
91
6 216

2. (a)
x      f(x) = 2x − 1      First diff.      Second
0 −1
2
1 1 0
2
2 3 0
2
3 5
(b)
x      f(x) = 3x² + 2x − 4      First diff.      Second      Third
0 −4
5
1 1 6
11 0
2 12 6
17 0
3 29 6
23
4 52

(c)
x      f(x) = 2x³ + 5x − 3      First diff.      Second      Third      Fourth
0 −3
7
1 4 12
19 12
2 23 24 0
43 12
3 66 36 0
79 12
4 145 48
127
5 272
If the polynomial has degree n, then the differences of order n are all equal
so that the differences of order (n + 1) are 0.
3.
x       f(x) = e^x       First diff.      Second      Third      Fourth
0.10 1.105171
56663
0.15 1.161834 2906
59569 147
0.20 1.221403 3053 12
62622 159
0.25 1.284025 3212 4
65834 163
0.30 1.349859 3375 10
69209 173
0.35 1.419068 3548 9
72757 182
0.40 1.491825 3730 10
76487 192
0.45 1.568312 3922
80409
0.50 1.648721
There is just a hint of excessive ‘noise’ at the fourth differences.
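Difference columns such as these are conveniently generated by repeated differencing. A Python sketch reproducing the e^x table, working in units of the sixth decimal place:

    import math

    x = [0.10 + 0.05 * i for i in range(9)]
    f = [round(math.exp(xi) * 1e6) for xi in x]   # 6D values in units of 1e-6

    cols = [f]
    while len(cols[-1]) > 1:
        prev = cols[-1]
        cols.append([prev[i + 1] - prev[i] for i in range(len(prev) - 1)])

    print(cols[1])   # first differences: 56663, 59569, 62622, ...
    print(cols[4])   # fourth differences: 12, 4, 10, 9, 10 (the 'noisy' column)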

STEP 19 EXERCISES (page 86)


1.
x      f(x) = 3x³ − 2x² + x + 5      Δ      Δ²      Δ³      Δ⁴
0 5
2
1 7 14
16 18
2 23 32 0
48 18
3 71 50
98
4 169

(a) 16, 32, 18, 18, 50.



(b) 2, 16, 14, 32, 18.


(c) 2, 14, 18, 18, 32.
2. (a) 0.06263, 0.00320, 0.00018, −0.00002.
(b) 0.07275, 0.00354, 0.00016, −0.00002.
(c) 0.00338, −0.00002.
(d) 0.00306 in each case.
(e) 0.00016 in each case.
3. (a) Consider f (x) = x.
(b) Δ³f_j = Δ²(f_{j+1} − f_j) = Δ²f_{j+1} − Δ²f_j
          = (f_{j+3} − 2f_{j+2} + f_{j+1}) − (f_{j+2} − 2f_{j+1} + f_j)
          = f_{j+3} − 3f_{j+2} + 3f_{j+1} − f_j
(c) ∇³f_j = ∇²(f_j − f_{j−1}) = ∇²f_j − ∇²f_{j−1}
          = (f_j − 2f_{j−1} + f_{j−2}) − (f_{j−1} − 2f_{j−2} + f_{j−3})
          = f_j − 3f_{j−1} + 3f_{j−2} − f_{j−3}
(d) δ³f_j = δ²(δf_j) = δ²(f_{j+1/2} − f_{j−1/2})
          = (f_{j+3/2} − 2f_{j+1/2} + f_{j−1/2}) − (f_{j+1/2} − 2f_{j−1/2} + f_{j−3/2})
          = f_{j+3/2} − 3f_{j+1/2} + 3f_{j−1/2} − f_{j−3/2}

STEP 20 EXERCISES (page 91)


1. (a)
x      f(x) = x⁴      Δ      Δ²      Δ³      Δ⁴
0.0 0.0000
1
0.1 0.0001 14
15 36
0.2 0.0016 50 24
65 60
0.3 0.0081 110 24
175 84
0.4 0.0256 194 24
369 108
0.5 0.0625 302 24
671 132
0.6 0.1296 434 24
1105 156
0.7 0.2401 590 24
1695 180
0.8 0.4096 770 24
2465 204
0.9 0.6561 974
3439
1.0 1.0000

(b)
x      f(x) = x⁴      Δ      Δ²      Δ³      Δ⁴
0.0 0.000
0
0.1 0.000 2
2 2
0.2 0.002 4 6
6 8
0.3 0.008 12 −1
18 7
0.4 0.026 19 4
37 11
0.5 0.063 30 2
67 13
0.6 0.130 43 4
110 17
0.7 0.240 60 −1
170 16
0.8 0.410 76 6
246 22
0.9 0.656 98
344
1.0 1.000
From the table in part (a), the true value of the fourth difference is 0.0024.
Thus the values in this last column should be 2.4. The worst round-off
error is therefore 6.0 − 2.4 = 3.6, which is within expectations.

2.
x      f(x)      Δ      Δ²      Δ³
0 3
−1
1 2 6
5 6
2 7 12
17 6
3 24 18
35 6
4 59 24
59
5 118

Data could be fitted by a cubic.



STEP 21 EXERCISES (page 95)

1. The first difference is 0.56464 − 0.47943 = 0.08521 so that the linear inter-
polating polynomial is

    P_1(x) = 0.47943 + ((x − 0.5)/0.1) × 0.08521

Thus
sin(0.55) ≈ 0.47943 + 0.5 × 0.08521 = 0.52204
The true value of sin(0.55) to 5D is 0.52269.
2. Difference table:

x           f(x) = cos x      Δ       Δ²
80°00′         0.1736
                             −28
80°10′         0.1708                 −1
                             −29
80°20′         0.1679                  0
                             −29
80°30′         0.1650                  1
                             −28
80°40′         0.1622                 −1
                             −29
80°50′         0.1593

(a) We have

    cos(80°35′) ≈ f(80°30′) + ½ Δf(80°30′)
                = 0.1650 + ½(−0.0028)
                = 0.1636

(b) We have

    cos(80°35′) ≈ 0.1650 + ½(−0.0028) + [½(½ − 1)/2](−0.0001)
                = 0.1636

(The second-order correction is 0.0000125.)



3. Difference table:

x           f(x) = tan x      Δ       Δ²      Δ³
80°00′         5.671
                              98
80°10′         5.769                   4
                             102              −1
80°20′         5.871                   3
                             105               0
80°30′         5.976                   3
                             108               2
80°40′         6.084                   5
                             113
80°50′         6.197
The second-order differences are approximately constant, so that quadratic
approximation is appropriate: setting θ = ½,

    tan(80°35′) ≈ f(80°30′) + ½ Δf(80°30′) + [½(½ − 1)/2] Δ²f(80°30′)
                = 5.976 + ½(0.108) − (1/8)(0.005)
                = 6.029

STEP 22 EXERCISES (page 99)


1.
x        f(x) = e^x        Δ        Δ²        Δ³
0.10 1.10517
5666
0.15 1.16183 291
5957 15
0.20 1.22140 306
6263 14
0.25 1.28403 320
6583 18
0.30 1.34986 338
6921 16
0.35 1.41907 354
7275
0.40 1.49182

(a) We have

    e^{0.14} = f(0.14)
             ≈ f(0.1) + (4/5)(0.05666) + ½(4/5)(−1/5)(0.00291) + (1/6)(4/5)(−1/5)(−6/5)(0.00015)
             = 1.10517 + 0.04532(8) − 0.00023(3) + 0.00000(5)
             = 1.15027

(b) We have

    e^{0.315} = f(0.315)
              ≈ f(0.30) + (3/10)(0.06583) + ½(3/10)(13/10)(0.00320) + (1/6)(3/10)(13/10)(23/10)(0.00014)
              = 1.34986 + 0.01974(9) + 0.00062(4) + 0.00002(1)
              = 1.37025
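The arithmetic in part (a) is quickly checked. A Python sketch using the tabulated differences and θ = 4/5:

    f0, d1, d2, d3 = 1.10517, 0.05666, 0.00291, 0.00015   # f(0.1) and its differences
    theta = (0.14 - 0.10) / 0.05                          # = 0.8

    approx = (f0
              + theta * d1
              + theta * (theta - 1) / 2 * d2
              + theta * (theta - 1) * (theta - 2) / 6 * d3)
    print(round(approx, 5))   # -> 1.15027, in agreement with the value above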

2. The relation obviously holds for j = 0, and for j = 1 since

    Δf(x_0) = f(x_0 + h) − f(x_0)   ⇒   f(x_0 + h) = (1 + Δ) f(x_0)

We proceed to a 'proof by induction'; suppose the relation holds for j = k, so that

    f_k = f_0 + kΔf_0 + [k(k − 1)/2]Δ²f_0 + ··· + Δ^k f_0

Then

    Δf_k = Δf_0 + kΔ²f_0 + [k(k − 1)/2]Δ³f_0 + ··· + Δ^{k+1} f_0

But Δf_k = f_{k+1} − f_k, so that

    f_{k+1} = f_k + Δf_k
            = f_0 + (k + 1)Δf_0 + [k(k − 1)/2 + k]Δ²f_0 + ··· + Δ^{k+1} f_0
            = f_0 + (k + 1)Δf_0 + [(k + 1)k/2]Δ²f_0 + ··· + Δ^{k+1} f_0

that is, the relation holds for j = k + 1. We conclude that it holds for
j = 0, 1, 2, ...
With reference to Section 4 of Step 22, note that

    f_j = f(x_j) = P_n(x_j)

on setting θ = j for j = 0, 1, 2, ...

3. The relevant difference table is given in the answer to Exercise 2 of Step 20.
Since f_0 = 3, Δf_0 = −1, Δ²f_0 = 6, Δ³f_0 = 6, and Δ⁴f_0 = 0, we obtain

    P_3(x) = f_0 + θΔf_0 + [θ(θ − 1)/2]Δ²f_0 + [θ(θ − 1)(θ − 2)/6]Δ³f_0
           = 3 − θ + 3θ(θ − 1) + θ(θ − 1)(θ − 2)
           = θ³ − 2θ + 3

Now x = x_0 + θh = θ (since x_0 = 0, h = 1), so that the interpolating
polynomial for the first four tabular entries is

    P_3(x) = x³ − 2x + 3

The student may verify that any four adjacent tabular points have the same
interpolating cubic. This suggests that the tabulated function f is a cubic
in which case we have f ≡ P3 . However, it is by no means certain that
f is a cubic. The data could have come from any function of the form
f (x) = P3 (x) + g(x), where g(x) is zero at the points x = 0, 1, 2, 3, 4, 5. A
simple example is g(x) = x(x − 1)(x − 2)(x − 3)(x − 4)(x − 5).

STEP 23 EXERCISE (page 103)

The Lagrange coefficients are

    L_0(x) = (x + 1)(x − 1)(x − 3)(x − 4) / [(−1)(−3)(−5)(−6)]     for x_0 = −2
    L_1(x) = (x + 2)(x − 1)(x − 3)(x − 4) / [(1)(−2)(−4)(−5)]      for x_1 = −1
    L_2(x) = (x + 2)(x + 1)(x − 3)(x − 4) / [(3)(2)(−2)(−3)]       for x_2 = 1
    L_3(x) = (x + 2)(x + 1)(x − 1)(x − 4) / [(5)(4)(2)(−1)]        for x_3 = 3
    L_4(x) = (x + 2)(x + 1)(x − 1)(x − 3) / [(6)(5)(3)(1)]         for x_4 = 4

Thus

    f(0) ≈ L_0(0) × 46 + L_1(0) × 4 + L_2(0) × 4 + L_3(0) × 156 + L_4(0) × 484
         = (−92 + 36 + 40 − 468 + 484)/15
         = 0
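The same value may be checked by evaluating the Lagrange form directly. A Python sketch using the data pairs above:

    xs = [-2, -1, 1, 3, 4]
    fs = [46, 4, 4, 156, 484]

    def lagrange(x, xs, fs):
        # evaluate the Lagrange interpolating polynomial at x
        total = 0.0
        for k, (xk, fk) in enumerate(zip(xs, fs)):
            L = 1.0
            for j, xj in enumerate(xs):
                if j != k:
                    L *= (x - xj) / (xk - xj)
            total += L * fk
        return total

    print(lagrange(0, xs, fs))   # -> 0 (to within rounding), as found above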

STEP 24 EXERCISES (page 109)

1. Let us order the points such that x_0 = 27, x_1 = 8, x_2 = 1, x_3 = 0, and
x_4 = 64, to get the divided difference table (entries multiplied by 10^5):

x f (x)
27 3.00000
5263
8 2.00000 −347
14286 384
1 1.00000 −10714 −6
100000 165
0 0.00000 −1488
6250
64 4.00000

From Newton's formula:

    f(20) ≈ f(27) + (−7) f(27, 8) + (−7)(12) f(27, 8, 1)
              + (−7)(12)(19) f(27, 8, 1, 0) + (−7)(12)(19)(20) f(27, 8, 1, 0, 64)
          = 3 − 7(0.05263) − 84(−0.00347) − 1596(0.00384) − 31920(−0.00006)
          = 3 − 0.36841 + 0.29148 − 6.12864 + 1.91520
          = −1.29037

Since the terms are not decreasing we cannot have much confidence in this
result. The reader may recall that this example was quoted in Section 3 of Step
23, in a warning concerning the use of the Lagrange interpolation formula in
practice. With divided differences, we can at least see that interpolation for
f (20) is invalid!
2. Let us order the points as x_0 = 0, x_1 = 0.5, and x_2 = 1. Then the divided
difference table is as follows (the differences are shown multiplied by 10^5):

x       f(x) = e^x
0 1
129744
0.5 1.64872 84168
213912
1 2.71828

The quadratic approximation to e^{0.25} is given by

    1 + 1.29744 × (0.25 − 0) + 0.84168 × (0.25 − 0) × (0.25 − 0.5) = 1.27176

Since n = 2, the magnitude of the error in the approximation is given by

    | f′′′(ξ)/3! × (0.25 − 0.0)(0.25 − 0.5)(0.25 − 1.0) | = f′′′(ξ)/128

where ξ lies between 0 and 1. For f(x) = e^x, f′′′(x) = e^x and thus

    e^0 ≤ f′′′(ξ) ≤ e^1

It then follows that

    0.00781 = 1/128 ≤ |e^{0.25} − 1.27176| ≤ e/128 = 0.02124

The actual error has magnitude 0.01227, which is within the bounds.

3. (a) Let us order the points such that x_0 = −1, x_1 = 1, x_2 = −2, x_3 = 3,
x_4 = 4, to get the divided difference scheme:

k xk f (xk )
0 −1 4
0
1 1 4 14
−14 1
2 −2 46 18 2
22 11
3 3 156 51
328
4 4 484

Then

    f(0) ≈ f(−1) + (1) f(−1, 1) + (1)(−1) f(−1, 1, −2)
             + (1)(−1)(2) f(−1, 1, −2, 3) + (1)(−1)(2)(−3) f(−1, 1, −2, 3, 4)
         = 4 + 1 × 0 − 1 × 14 − 2 × 1 + 6 × 2
         = 0

(b) Let us again order the points such that x0 = −1, x1 = 1, x2 = −2, x3 = 3,
x4 = 4, to get the Aitken scheme:

k x f (x) xk − x
0 −1 4 −1
1 1 4 4 1
2 −2 46 −38 −10 −2
3 3 156 42 −15 −12 3
4 4 484 100 −28 −16 0 4

The validity of this interpolation is dubious. The terms in Newton’s divided


difference formula are not decreasing notably; in the Aitken scheme, we
do not obtain a repeated value on the diagonal.
4.
x f (x) xk − x
1 2.3919 −1
3 2.3938 2.3928(5) 1
0 2.3913 2.3925 2.3927(3) −2
4 2.3951 2.3929(7) 2.3927(3) 2.3927(3) 2
Thus f (2) ≈ 2.3927.

STEP 25 EXERCISES (page 112)

1. The root of f (x) = x + cos x is in the interval −0.8 < x < −0.7; in fact,
f (−0.8) = −0.10329 and f (−0.7) = 0.06484. Since f is known explicitly,
one may readily subtabulate (by successive interval bisection, say) and use
linear inverse interpolation:

    f(−0.75) = −0.01831,     θ = (0 + 0.01831)/(0.06484 + 0.01831) = 0.22021,
    whence x = −0.75 + (0.22021)(0.05) = −0.73899;

    f(−0.725) = 0.02350,     θ = (0 + 0.01831)/(0.02350 + 0.01831) = 0.43795,
    whence x = −0.75 + (0.43795)(0.025) = −0.73905;

    f(−0.7375) = 0.00265,    θ = (0 + 0.01831)/(0.00265 + 0.01831) = 0.87349,
    whence x = −0.75 + (0.87349)(0.0125) = −0.73908.

We then conclude that the root to 4D is −0.7391 and as a check we have f(−0.7391) = 0.0000.
2. The function f(x) = 3x e^x increases as x increases, so that there is a unique
α satisfying f(α) = 1. Indeed, in Step 7 we noted that 0.25 < α < 0.27,

and this interval is quite small enough for linear inverse interpolation: since
f(0.27) = 1.0611 and f(0.25) = 0.9630, we have

    θ = (1.0000 − 0.9630)/(1.0611 − 0.9630) = 0.3772

whence α = 0.25 + (0.3772)(0.02) = 0.2575. Checking α = 0.258, we have
f (0.258) = 1.0018, which is closer to 1 than f (0.257) = 0.9969. (While the
value to 3D is obtained immediately by linear inverse interpolation, the method
of bisection described in Step 7 may be preferred when greater accuracy is
demanded.)
3. If the explicit form of the function is unknown so that it is not possible to
subtabulate readily, one may use iterative inverse interpolation. The relevant
difference table is:
x        f(x)        Δf        Δ²f        Δ³f
2 3.0671
33417
3 6.4088 752
34169 6
4 9.8257 758
34927 6
5 13.3184 764
35691 6
6 16.8875 770
36461 6
7 20.5336 776
37237 6
8 24.2573 782
38019 6
9 28.0592 788
38807 6
10 31.9399 794
39601 6
11 35.9000 800
40401 6
12 39.9401 806
41207
13 44.0608
To find x for which f (x) = 10, one may use inverse interpolation based on
Newton’s forward formula:
θ1 = (10 − 9.8257)/3.4927 = 0.1743/3.4927 = 0.0499 ≈ 0.05
    θ2 = [0.1743 − ½(0.05)(−0.95)(0.0764)] / 3.4927
       = (0.1743 + 0.0018)/3.4927 = 0.0504
and further corrections are negligible so that
x = 4 + 0.0504 = 4.0504
To find x for which f (x) = 20 one again may choose inverse interpolation
based on Newton’s forward formula:
θ1 = (20 − 16.8875)/3.6461 = 3.1125/3.6461 = 0.85365
≈ 0.85
    θ2 = [3.1125 − ½(0.85)(−0.15)(0.0776)] / 3.6461
       = (3.1125 + 0.0049)/3.6461 = 0.8550
and further corrections are negligible so that
x = 6 + 0.8550 = 6.8550
To find x for which f (x) = 40, one may choose inverse interpolation based
on Newton’s backward formula. Thus,
    θ1 = (f(x) − f_j)/∇f_j
    θ2 = [f(x) − f_j − (θ1(θ1 + 1)/2)∇²f_j] / ∇f_j
etc. Consequently,
θ1 = (40 − 39.9401)/4.0401 = 0.0599/4.0401 = 0.0148
≈ 0.015
    θ2 = [0.0599 − ½(0.015)(1.015)(0.0800)] / 4.0401
       = (0.0599 − 0.0006)/4.0401 = 0.0147
and further corrections are negligible so that
x = 12 + 0.0147 = 12.0147
Let us now consider the check by direct interpolation. We have from Newton’s
forward formula
    f(4.0504) = 9.8257 + (0.0504)(3.4927) + ½(0.0504)(−0.9496)(0.0764)
              = 9.9999

and

    f(6.8550) = 16.8875 + (0.8550)(3.6461) + ½(0.8550)(−0.1450)(0.0776)
              = 20.0001

while from Newton’s backward formula

    f(12.0147) = 39.9401 + (0.0147)(4.0401) + ½(0.0147)(1.0147)(0.0800)
               = 40.0001

Finally, we may determine the cubic f and use it to check the answers:

    f(x) = f_j + θΔf_j + [θ(θ − 1)/2!]Δ²f_j + [θ(θ − 1)(θ − 2)/3!]Δ³f_j
         = 9.8257 + (x − 4)(3.4927) + ½(x − 4)(x − 5)(0.0764) + (1/6)(x − 4)(x − 5)(x − 6)(0.0006)
         = [9.8257 − 4(3.4927) + 10(0.0764) − 20(0.0006)]
           + [3.4927 − (9/2)(0.0764) + (37/3)(0.0006)] x
           + [½(0.0764) − (5/2)(0.0006)] x² + (1/6)(0.0006) x³
         = 0.0001x³ + 0.0367x² + 3.1563x − 3.3931

Hence f (4.0504) = 9.9999, f (6.8550) = 20.0001, and f (12.0147) =


40.0001. (In each case, the value obtained by iterative inverse interpolation in
fact renders the corresponding function value accurate to 3D.)

STEP 26 EXERCISES (page 120)


1. The following table displays the line and parabola values for y (given by ℓ and
p respectively), the respective errors, and squared errors.
Line equation: y = 2.13 + 0.20x
Parabola equation: y = −1.20 + 2.70x − 0.36x²

x            1         2         3         4         5         6
y            1         3         4         3         4         2
ℓ            2.33      2.53      2.73      2.93      3.13      3.33
y − ℓ       −1.33      0.47      1.27      0.07      0.87     −1.33
(y − ℓ)²     1.7689    0.2209    1.6129    0.0049    0.7569    1.7689
p            1.14      2.76      3.66      3.84      3.30      2.04
y − p       −0.14      0.24      0.34     −0.84      0.70     −0.04
(y − p)²     0.0196    0.0576    0.1156    0.7056    0.4900    0.0016

For the line we have S = Σ(y − ℓ)² = 6.1334, while for the parabola
S = Σ(y − p)² = 1.3900. How good was your line fitted 'by eye'? How did
the value of S for your line compare with 6.1334?

2. Computing n, Σx_i, Σy_i, Σx_i y_i, Σx_i², and inserting in the normal equations
and solving gives:

(a) Normal equations: 23.9 = 8c1 + 348c2


1049.1 = 348c1 + 15260c2
Least squares line (to 2S ): y = −0.38 + 0.077x
Prediction: % nickel y when x = 38 is 2.6 (to 2S ).

(b) Normal equations: 348 = 6c1 + 219c2


13659 = 219c1 + 8531c2
Least squares line (to 3S ): y = −6.99 + 1.78x
Prediction: sales y when x = 48 is 78 (×$1000).

4. The matrix form for the normal equations is

    [  9 ]   [  5    10    30 ] [ c1 ]
    [ 24 ] = [ 10    30   100 ] [ c2 ]
    [ 72 ]   [ 30   100   354 ] [ c3 ]

the elements being the sums Σy_i = 9, Σx_i y_i = 24, etc. The solution is

    y = c1 + c2 x + c3 x² = −0.2571 + 2.3143x − 0.4286x²

and S = 0.6286 to 4D.
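The quoted coefficients follow from solving the normal equations numerically; a sketch using NumPy (an illustration only):

    import numpy as np

    # normal equations  N c = rhs, assembled as above
    N = np.array([[5.0, 10.0, 30.0],
                  [10.0, 30.0, 100.0],
                  [30.0, 100.0, 354.0]])
    rhs = np.array([9.0, 24.0, 72.0])

    c = np.linalg.solve(N, rhs)
    print(np.round(c, 4))   # -> [-0.2571  2.3143 -0.4286]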


5. We seek the values of c1 and c2 which minimize

    S = Σ_{i=1}^{4} (y_i − c1 − c2 sin x_i)²

Now

    ∂S/∂c1 = −2 Σ_{i=1}^{4} (y_i − c1 − c2 sin x_i)

and

    ∂S/∂c2 = −2 Σ_{i=1}^{4} (y_i − c1 − c2 sin x_i) sin x_i

so the normal equations may be written as

    Σ y_i = 4c1 + (Σ sin x_i) c2
    Σ (sin x_i) y_i = (Σ sin x_i) c1 + (Σ sin² x_i) c2

Tabulating:

x_i        y_i       sin x_i      (sin x_i) y_i      sin² x_i
0           0         0             0                  0
π/6         1         0.5           0.5                0.25
π/2         3         1             3                  1
5π/6        2         0.5           1                  0.25
Σ           6         2             4.5                1.5

Solving the equations

6 = 4c1 + 2c2
4.5 = 2c1 + 1.5c2

we obtain c1 = 0, c2 = 3.

STEP 27 EXERCISES (page 128)


1. The matrix A is given by

    A = [ 1    0   ]
        [ 1   0.5  ]
        [ 1    1   ]
        [ 1   0.5  ]

Then

    A^T A = [ 4    2   ]
            [ 2   1.5  ]

and

    A^T y = [ 1    1    1    1  ] [ 0 ]   [  6  ]
            [ 0   0.5   1   0.5 ] [ 1 ] = [ 4.5 ]
                                  [ 3 ]
                                  [ 2 ]

The normal equations A^T A c = A^T y are thus the same as those obtained
above, in the answer to Exercise 5 of the previous Step.
2. Let A^{(0)} = A. Then (S^{(1)})² = 1² + 1² + 1² + 1² = 4. Since a_{11}^{(0)} > 0, we
take S^{(1)} = −√4 = −2. Thus the first component of w^{(1)} is given by

    w_1^{(1)} = [½(1 − 1/(−2))]^{1/2} = √3/2

Since the last three elements of the first column of A^{(0)} are the same, the three
remaining components of w^{(1)} all have the same value, namely

    1/[(−2) × (−2) × (√3/2)] = 1/√12

Thus

    H^{(1)} = I − 2w^{(1)}(w^{(1)})^T = [ −1/2   −1/2   −1/2   −1/2 ]
                                        [ −1/2    5/6   −1/6   −1/6 ]
                                        [ −1/2   −1/6    5/6   −1/6 ]
                                        [ −1/2   −1/6   −1/6    5/6 ]

and

    A^{(1)} = H^{(1)} A^{(0)} = [ −2    −1  ]
                                [  0    1/6 ]
                                [  0    2/3 ]
                                [  0    1/6 ]

STEP 28 EXERCISE (page 134)


Here n = 3 and the values of h_1, h_2, and h_3 are all 1. The linear system that
is obtained is given by

    [ 4   1 ] [ m_1 ]   [ 48 ]
    [ 1   4 ] [ m_2 ] = [ 84 ]

Upon solving, we find that m_1 = 36/5 and m_2 = 96/5. These two values,
along with m_0 = m_3 = 0, can then be used in the formulae for the spline
coefficients given on page 131. So we have

    a_1 = f_1 = 4
    b_1 = (f_1 − f_0)/h_1 + h_1(2m_1 + m_0)/6 = 27/5
    c_1 = m_1/2 = 18/5,      d_1 = (m_1 − m_0)/(6h_1) = 6/5

Thus for x ∈ [0, 1] the spline is given by S_1(x), where

    S_1(x) = 4 + (27/5)(x − 1) + (18/5)(x − 1)² + (6/5)(x − 1)³

Similarly, we obtain

    a_2 = f_2 = 15
    b_2 = (f_2 − f_1)/h_2 + h_2(2m_2 + m_1)/6 = 93/5
    c_2 = m_2/2 = 48/5,      d_2 = (m_2 − m_1)/(6h_2) = 2

so on the interval (1, 2] the spline is the cubic

    S_2(x) = 15 + (93/5)(x − 2) + (48/5)(x − 2)² + 2(x − 2)³

Finally, we have

    a_3 = f_3 = 40
    b_3 = (f_3 − f_2)/h_3 + h_3(2m_3 + m_2)/6 = 141/5
    c_3 = m_3/2 = 0,      d_3 = (m_3 − m_2)/(6h_3) = −16/5

and hence on the interval (2, 3] the spline is given by

    S_3(x) = 40 + (141/5)(x − 3) − (16/5)(x − 3)³

The spline is plotted in Figure 18. The required estimate at x = 2.3 is given by

    40 + (141/5)(−0.7) − (16/5)(−0.7)³ = 21.3576

FIGURE 18. Natural cubic spline.
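A few lines suffice to evaluate the spline and confirm the interpolated value. A Python sketch assembling the three cubics above:

    def spline(x):
        # natural cubic spline built from the coefficients above
        if x <= 1:
            t = x - 1
            return 4 + 27/5 * t + 18/5 * t**2 + 6/5 * t**3
        if x <= 2:
            t = x - 2
            return 15 + 93/5 * t + 48/5 * t**2 + 2 * t**3
        t = x - 3
        return 40 + 141/5 * t - 16/5 * t**3

    print([spline(v) for v in (0, 1, 2, 3)])   # the knot values 1, 4, 15, 40
    print(spline(2.3))                         # about 21.3576, as above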

STEP 29 EXERCISES (page 137)


1. The Newton backward difference formula is given by

    f(x) = f(x_j + θh) = [1 + θ∇ + (θ(θ + 1)/2!)∇² + (θ(θ + 1)(θ + 2)/3!)∇³ + ···] f_j

and hence

    f′(x) = (1/h) df/dθ = (1/h)[∇ + (θ + ½)∇² + ((3θ² + 6θ + 2)/6)∇³ + ···] f_j

    f″(x) = (1/h²) d²f/dθ² = (1/h²)[∇² + (θ + 1)∇³ + ···] f_j

2. The difference table with h = 0.05 is:


x        f(x)        Δ        Δ²        Δ³
1.00 1.00000
2470
1.05 1.02470 −59
2411 5
1.10 1.04881 −54
2357 4
1.15 1.07238 −50
2307 1
1.20 1.09545 −49
2258 6
1.25 1.11803 −43
2215
1.30 1.14018
(a) We then have the approximations

    f′(1.00) ≈ 20 [Δ − ½Δ² + (1/3)Δ³] f(1.00)
             = 20[0.02470 + 0.00029(5) + 0.00001(7)]
             = 0.50024

    f″(1.00) ≈ (20)² [Δ² − Δ³] f(1.00)
             = 400[−0.00059 − 0.00005]
             = −0.256

The correct values are of course

    f′(1.00) = [1/(2√x)]_{x=1.00} = 0.5
    f″(1.00) = [−1/(4x^{3/2})]_{x=1.00} = −0.25

Although the input data are correct to 5D, the results are accurate to only
3D and 1D, respectively.
(b) The approximations are:

    f′(1.30) ≈ 20 [∇ + ½∇² + (1/3)∇³] f(1.30)
             = 20[0.02215 − 0.00021(5) + 0.00002]
             = 0.4391

    f″(1.30) ≈ (20)² [∇² + ∇³] f(1.30)
             = 400[−0.00043 + 0.00006]
             = −0.148

To 4D the correct values are 0.4385 and −0.1687. Thus the first approxi-
mation is accurate to only 2D (the error is about 0.0006), while the second
approximation is accurate to only 1D (the error is about 0.02).
3. (a) Expanding about x = x_j:

    f(x_j + h) = f(x_j) + h f′(x_j) + (h²/2!) f″(x_j) + ···,

so

    [f(x_j + h) − f(x_j)]/h = f′(x_j) + (h/2) f″(x_j) + ···

and the truncation error ≈ ½ h f″(x_j).

(b) Denoting x_j + ½h by x_{j+1/2}, we expand about x = x_{j+1/2}:

    f(x_j + h) = f(x_{j+1/2}) + (h/2) f′(x_{j+1/2}) + (1/2!)(h/2)² f″(x_{j+1/2}) + (1/3!)(h/2)³ f′′′(x_{j+1/2}) + ···

and

    f(x_j) = f(x_{j+1/2}) − (h/2) f′(x_{j+1/2}) + (1/2!)(h/2)² f″(x_{j+1/2}) − (1/3!)(h/2)³ f′′′(x_{j+1/2}) + ···,

so

    [f(x_j + h) − f(x_j)]/h = f′(x_{j+1/2}) + (h²/24) f′′′(x_{j+1/2}) + ···

and the truncation error ≈ (1/24) h² f′′′(x_j + ½h).

(c) Expanding about x = x_j:

    f(x_j + 2h) = f(x_j) + 2h f′(x_j) + 2h² f″(x_j) + (4/3)h³ f′′′(x_j) + ···,

so

    [f(x_j + 2h) − 2f(x_j + h) + f(x_j)]/h² = f″(x_j) + h f′′′(x_j) + ···

and the truncation error ≈ h f′′′(x_j).

(d) Expanding about x = x_j + h:

    f(x_j + 2h) = f(x_j + h) + h f′(x_j + h) + (h²/2!) f″(x_j + h) + (h³/3!) f′′′(x_j + h) + (h⁴/4!) f^{(4)}(x_j + h) + ···

and

    f(x_j) = f(x_j + h) − h f′(x_j + h) + (h²/2!) f″(x_j + h) − (h³/3!) f′′′(x_j + h) + (h⁴/4!) f^{(4)}(x_j + h) + ···,

so

    [f(x_j + 2h) − 2f(x_j + h) + f(x_j)]/h² = f″(x_j + h) + (h²/12) f^{(4)}(x_j + h) + ···

and the truncation error ≈ (1/12) h² f^{(4)}(x_j + h).

STEP 30 EXERCISES (page 142)


1. With b − a = 1.30 − 1.00 = 0.30, we may choose h = 0.30, 0.15, 0.10,
and 0.05 for the tabulated function. If T (h) denotes the approximation corre-
sponding to strip width h, we get

    T(0.30) = (0.30/2)(1.00000 + 1.14018) = 0.32102(7)
    T(0.15) = (0.15/2)(1.00000 + 1.14018) + (0.15)(1.07238)
            = 0.16051(4) + 0.16085(7) = 0.32137(1)
    T(0.10) = (0.10/2)(1.00000 + 1.14018) + (0.10)(1.04881 + 1.09545)
            = 0.10700(9) + 0.21442(6) = 0.32143(5)
    T(0.05) = (0.05/2)(1.00000 + 1.14018)
              + (0.05)(1.02470 + 1.04881 + 1.07238 + 1.09545 + 1.11803)
            = 0.05350(5) + 0.26796(9) = 0.32147(4)

To 8D, the answer is in fact 0.32148537, so that we may observe that the error
sequence 0.00045(8), 0.00011(4), 0.00005(0), 0.00001(1) decreases with h 2
(the truncation error dominates the round-off error).

2. We have:

    T(1) = (1/2)[1/(1 + 0) + 1/(1 + 1)] = 0.75
    T(0.5) = (0.5/2)[1/(1 + 0) + 1/(1 + 1)] + 0.5 × 1/(1 + 0.5)
           = 0.7083 (to 4D)
    T(0.25) = (0.25/2)[1/(1 + 0) + 1/(1 + 1)] + 0.25[1/(1 + 0.25) + 1/(1 + 0.5) + 1/(1 + 0.75)]
            = 0.6970 (to 4D)

The correct value is ln 2 ≈ 0.6931, so the errors are (approximately) 0.0569,
0.0152, and 0.0039, respectively (note again the decrease with h²).
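A composite trapezoidal routine reproduces these values. A minimal Python sketch for Exercise 2:

    def trapezoid(f, a, b, n):
        # composite trapezoidal rule with n strips of width h = (b - a)/n
        h = (b - a) / n
        total = 0.5 * (f(a) + f(b))
        for i in range(1, n):
            total += f(a + i * h)
        return h * total

    f = lambda u: 1.0 / (1.0 + u)
    for n in (1, 2, 4):
        print(n, round(trapezoid(f, 0.0, 1.0, n), 4))
    # -> 0.75, 0.7083, 0.697; compare ln 2 = 0.6931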

STEP 31 EXERCISES (page 145)

1. We have

    f(x) = 1/(1 + x),     f″(x) = 2/(1 + x)³,     f^{(4)}(x) = 24/(1 + x)⁵

A bound on the truncation error for the trapezoidal rule is (2/12)h² = (1/6)h², so that
we would need to choose h ≤ 0.017 to obtain 4D accuracy. For Simpson's
rule, however, the truncation error bound is (24/180)h⁴ = (2/15)h⁴, so that we may
choose h = 0.1. Tabulating:

x f (x) x f (x) x f (x)


0 1.000000 0.4 0.714286 0.8 0.555556
0.1 0.909091 0.5 0.666667 0.9 0.526316
0.2 0.833333 0.6 0.625000 1.0 0.500000
0.3 0.769231 0.7 0.588235

By Simpson's rule,

    ∫_0^1 dx/(1 + x) ≈ (0.1/3)[1 + 4(0.909091 + 0.769231 + 0.666667 + 0.588235 + 0.526316)
                               + 2(0.833333 + 0.714286 + 0.625000 + 0.555556) + 0.500000]
                    = 0.693150

Thus to 4D, the approximation to the integral is 0.6932. (To 6D the true value
is 0.693147.)
2. Simpson's rule with N = 2 yields the approximation

    (π/24)[0 + 4 × (π/8) cos(π/8) + (π/4) cos(π/4)] = 0.26266 (to 5D)

To 5D the true value of the integral is 0.26247, so that the magnitude of the
error is approximately |0.26247 − 0.26266| = 0.00019.

STEP 32 EXERCISE (page 148)


Change of variable: u = ½(x + 1), so that

    ∫_0^1 du/(1 + u) = ½ ∫_{−1}^{1} dx/(1 + ½(x + 1)) = ∫_{−1}^{1} dx/(3 + x)

Two-point formula:

    ∫_0^1 du/(1 + u) ≈ 1/(3 − 0.57735027) + 1/(3 + 0.57735027)
                     = 0.41277119 + 0.27953651 = 0.6923077

which is correct to 2D.

Four-point formula:

    ∫_0^1 du/(1 + u) ≈ 0.34785485 [1/(3 − 0.86113631) + 1/(3 + 0.86113631)]
                       + 0.65214515 [1/(3 − 0.33998104) + 1/(3 + 0.33998104)]
                     = 0.34785485[0.46753798 + 0.25899112] + 0.65214515[0.37593717 + 0.29940290]
                     = (0.34785485)(0.72652909) + (0.65214515)(0.67534007)
                     = 0.25272667 + 0.44041975 = 0.69314642

This approximation is correct to 5D.
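Both results can be regenerated from the Gauss-Legendre nodes and weights quoted above. A Python sketch:

    def gauss_rule(g, nodes_weights):
        # apply an n-point Gauss-Legendre rule to the integral of g over [-1, 1]
        return sum(w * g(t) for t, w in nodes_weights)

    g = lambda t: 1.0 / (3.0 + t)    # integrand after the change of variable

    two_point = [(-0.57735027, 1.0), (0.57735027, 1.0)]
    four_point = [(-0.86113631, 0.34785485), (0.86113631, 0.34785485),
                  (-0.33998104, 0.65214515), (0.33998104, 0.65214515)]

    print(gauss_rule(g, two_point))    # -> 0.69231..., correct to 2D
    print(gauss_rule(g, four_point))   # -> 0.69315..., correct to 5D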

STEP 33 EXERCISES (page 152)


1. (a) The approximations from Euler’s method are y6 = 1.94312, y7 = 2.19743,
and y8 = 2.48718. The true value is 2.65108 to 5D, so the estimate is not
even accurate to 1D (the error in y8 is approximately 0.16).

(b) The approximations from the fourth-order Taylor series method are y6 =
2.04422, y7 = 2.32748, and y8 = 2.65105. The estimate is accurate to
4D (the error in y8 is approximately 0.00003).
(c) For the second-order Runge-Kutta method we calculate k1 = 0.22949 and
k2 = 0.26244, and obtain y6 = 2.04086. Further calculation yields k1 =
0.26409, k2 = 0.30049, y7 = 2.32315, k1 = 0.30231, k2 = 0.34255, and
y8 = 2.64558. The estimate is accurate to 1D, but not to 2D (the error
in y8 is approximately 0.0055, which is larger than the maximum error of
0.005 allowable for 2D accuracy).
2. Euler's method is y_{n+1} = y_n − 0.2 x_n y_n² = y_n(1 − 0.2 x_n y_n), and thus we
obtain (with working displayed to 5D):

    y_1 = 2(1 − 0.2 × 0 × 2) = 2
    y_2 = 2(1 − 0.2 × 0.2 × 2) = 1.84
    y_3 = 1.84(1 − 0.2 × 0.4 × 1.84) = 1.56915
    y_4 = 1.56915(1 − 0.2 × 0.6 × 1.56915) = 1.27368
    y_5 = 1.27368(1 − 0.2 × 0.8 × 1.27368) = 1.01412

The exact solution is y(x) = 2/(1 + x²), so y(1) = 1 and the estimate is
accurate to 1D (the error in y_5 is approximately 0.014).
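The recurrence is a two-line loop. A Python sketch reproducing the values above:

    h, x, y = 0.2, 0.0, 2.0
    for n in range(5):
        y = y * (1 - 0.2 * x * y)   # y_{n+1} = y_n - h x_n y_n^2 with h = 0.2
        x += h
        print(round(x, 1), round(y, 5))
    # ends with y5 = 1.01412 at x = 1.0; the exact solution gives y(1) = 1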

STEP 34 EXERCISES (page 155)


1. Application of the Adams-Bashforth method of order 2 to the given problem
yields the formula

    y_{n+1} = y_n + (0.1/2)[−15y_n + 5y_{n−1}] = 0.25(y_n + y_{n−1})

With working to 5D, the results are:

n      x_n      y_n          y(x_n) = e^{−5x_n}      |y(x_n) − y_n|


0 0.0 1.0 1.0 0.00000
1 0.1 0.60653 0.60653 0.00000
2 0.2 0.40163 0.36788 0.03775
3 0.3 0.25204 0.22313 0.02891
4 0.4 0.16342 0.13534 0.02808
5 0.5 0.10386 0.08208 0.02178
6 0.6 0.06682 0.04979 0.01703
7 0.7 0.04267 0.03020 0.01247
8 0.8 0.02737 0.01832 0.00905
9 0.9 0.01751 0.01111 0.00640
10 1.0 0.01122 0.00674 0.00448

The accuracy does vary, but the estimates decrease in magnitude as they should,
and do not change sign.
2. The second-order Adams-Bashforth method is

    y_{n+1} = y_n + (h/2)(3f_n − f_{n−1}) = y_n + 0.05(3x_n + 3y_n − x_{n−1} − y_{n−1})

and thus we obtain:

y2 = 1.11 + 0.05(0.3 + 3.33 − 0.0 − 1.0) = 1.2415


y3 = 1.2415 + 0.05(0.6 + 3.7245 − 0.1 − 1.11) = 1.39723
y4 = 1.39723 + 0.05(0.9 + 4.19168 − 0.2 − 1.2415)
= 1.57973
y5 = 1.57973 + 0.05(1.2 + 4.73920 − 0.3 − 1.39723)
= 1.79183

which is accurate to 1D (the error is approximately 0.006; the true value of


1.79744 was given in Section 3 of Step 33).

STEP 35 EXERCISE (page 159)

The initial value problem may be written as the system

    w_1′ = w_2,   w_1(0) = 0,      w_2′ = sin x − w_1 − w_2,   w_2(0) = 0

Thus the equations for Euler’s method are

w1,n+1 = w1,n + hw2,n , w1,0 = 0

and
w2,n+1 = w2,n + h(sin xn − w1,n − w2,n ), w2,0 = 0
Some computations then yield the values given in the following table.

n xn w1,n w2,n
0 0.0 0 0
1 0.2 0.00000 0.03973
2 0.4 0.00795 0.10967
3 0.6 0.02988 0.19908
4 0.8 0.06970 0.29676
5 1.0 0.12905 0.39176

The required approximations are y(1) ≈ 0.12905 and y′(1) ≈ 0.39176.



APPLIED EXERCISES (page 160)

1. We see from the derivation on page 23 that the equation to be solved is

θ − sin θ cos θ = cπ

where c takes the values 0.1, 0.2, 0.3, and 0.4. Application of the bisection
method shows that the corresponding values of θ are given to 4D by 0.8134,
1.0566, 1.2454, and 1.4124. The values of h are then given by r (1 − cos θ).
Hence the calibration markings should be at 0.3130r , 0.5082r , 0.6803r , and
0.8423r .
2. Application of the bisection method shows that P(0) = 611.
3. Let x1 , x2 , x3 , and x4 be the number of kilocalories provided by 100 grams
of carbohydrates, proteins, fats, and alcohol, respectively. Then the given
information shows that we need to solve the linear system
    
    [ 0.47   0.08   0.02   0    ] [ x1 ]   [ 227 ]
    [ 0      0.27   0.12   0    ] [ x2 ] = [ 218 ]
    [ 0.25   0.04   0.07   0    ] [ x3 ]   [ 170 ]
    [ 0      0      0      0.10 ] [ x4 ]   [  68 ]

Gaussian elimination shows that to 3S the values are x1 = 374, x2 = 430,
x3 = 848, and x4 = 680.
4. The initial population distribution was x (0) = 0.5, y (0) = 0.3, and z (0) = 0.2.
5. Application of the power method shows that the largest eigenvalue is 1. The
method also shows that the corresponding equilibrium population distribution
is x = (0.234, 0.444, 0.322)^T. (This is the eigenvector scaled so that the sum
of the components is equal to 1.)
6. The finite difference table is:

x        f(x)        Δ        Δ²        Δ³
0.0 1.0000
−25
0.1 0.9975 −50
−75 1
0.2 0.9900 −49
−124 1
0.3 0.9776 −48
−172 1
0.4 0.9604 −47
−219
0.5 0.9385

Since the second-order differences are approximately constant, we conclude
that a quadratic interpolating polynomial is appropriate. Then

    J_0(0.25) ≈ 0.9900 + ½(−0.0124) + ½·½(−½)(−0.0048)
              = 0.9844.

7. The length of sheet iron required is 75.3 cm.


8. To 4D, the table of the error function is as follows.
x erf(x) x erf(x)
0.1 0.1125 0.6 0.6039
0.2 0.2227 0.7 0.6778
0.3 0.3286 0.8 0.7421
0.4 0.4284 0.9 0.7969
0.5 0.5205 1.0 0.8427

9. To 4D the value of θ(1) is 0.8478.


10. Half the terminal velocity is 49.05. Some numerical experimentation then
shows that it takes the skydiver 5.5 seconds to reach this speed.
BIBLIOGRAPHY

The following is a short list of books which may be referred to for complementary
reading, proofs omitted in this text, or further study in Numerical Analysis.
Calculus
G. B. Thomas and R. L. Finney (1992). Calculus and Analytic Geometry (8th
edn). Addison-Wesley, Reading, Mass.
Linear Algebra
H. Anton (1993). Elementary Linear Algebra (7th edn). Wiley, New York.
Numerical Analysis
K. E. Atkinson (1993). Elementary Numerical Analysis (2nd edn). Wiley, New
York.
R. L. Burden and J. D. Faires (1993). Numerical Analysis (5th edn). PWS-Kent,
Boston.
E. W. Cheney and D. R. Kincaid (1994). Numerical Mathematics and Computing
(3rd edn). Brooks/Cole, Belmont, Calif.
S. D. Conte and C. de Boor (1980). Elementary Numerical Analysis (3rd edn).
McGraw-Hill, New York.
C. F. Gerald and P. O. Wheatley (1994). Applied Numerical Analysis (5th edn).
Addison-Wesley, Reading, Mass.
J. H. Mathews (1992). Numerical Methods for Mathematics, Science, and Engi-
neering (2nd edn). Prentice-Hall, Englewood Cliffs, N.J.
Tables
M. Abramowitz and I. A. Stegun (1965). Handbook of Mathematical Functions.
Dover, New York.
INDEX

Abscissae, 146 Condition number, 70–72


Absolute error, 10, 36, 116 Convergence
Accumulated error, 10, 11 bisection method, 28
Adams-Bashforth methods, 153, false position, 31
154 Gauss-Seidel, 57–58
stability, 154 linear, 32, 77
Adams-Moulton methods, 153, 154 Newton-Raphson, 39–40
stability, 154 power method, 77
Aitken’s interpolation method, 107– quadratic, 32, 40
108 range of, 18
Algorithm, 10 secant method, 32
Approximation to functions, 18–20 series, 18
linear, 92–94 simple iteration, 35–36
polynomial, 20, 89–91, 139 superlinear, 32
quadratic, 94, 143 Crout’s method, 67–68
Asymptotic series, 20 Cubic splines
Augmented matrix, 43, 46, 64 clamped, 130–132
construction, 129–132
Back-substitution, 44, 45, 48, 65 natural, 130–131
Backward differences, 84, 85–86 Curve fitting, 92, 114–134
Newton’s formula, 96–97, 98 errors, 115–116
operator, 84 least squares, 116–117
Bessel functions, 20, 115, 161 splines, 129–134
Bisection method, 27–29, 164 Curve sketching, 24–25
convergence, 28
effectiveness, 28 Definite integral, 79, 139
pseudo-code, 164 Derivative, 79, 135
Bit, 7 numerical, see Numerical differ-
entiation
Cancellation error, 16, 137 partial, 116–117
Cauchy-Schwarz inequality, 70 Determinant, 43–44, 59
Central differences, 84, 85–86 Differences
operator, 84 backward, 84, 85–86
Characteristic polynomial, 73 central, 84, 85–86
Chebyshev polynomials, 20, 115, divided, 96, 104–105
147 finite, 79–91
Chopping, 8 forward, 83–84, 85–86
Coefficient matrix, 43, 48, 59, 64 notation, 83–84

polynomial, 88–89 Errors


table, 80, 85–86, 89, 97 absolute, 10, 36, 116
Differential equations, 2–3, 4–5, accumulated, 10, 11
149–159 cancellation, 16, 137
Adams-Bashforth methods, 153, chopping, 8
154 curve fitting, 115–116
Adams-Moulton methods, 153, generated, 10, 11, 15
154 linear systems, 51, 56
Euler’s method, 150, 157 loss of significance, 16, 137
explicit methods, 153 measurement, 4
higher order, 156 propagated, 10, 11, 15, 52
implicit methods, 153 relative, 10–11
midpoint method, 153, 154–155 round-off, 8, 48, 52, 80–82, 89,
Milne’s method, 153 90, 137
multistep methods, 153–155 sources of, 4
Runge-Kutta methods, 150–151, truncation, 8, 18, 137, 141, 144,
158 150
single-step methods, 149–152 Euler’s method, 150, 157
systems, 156–159 Existence of solutions, 43–44
Taylor series method, 149–150 Expansion series, 18, 20
Divided differences, 96, 104–105 Taylor, 18–19, 35, 37, 40, 99,
inverse interpolation, 111–112 141, 144, 149–150
Newton’s formula, 105 Exponent, 7
Doolittle’s method, 67
Double precision number, 7
False position, 1, 30–33, 38, 165
Effectiveness, 28, 31 convergence, 31
bisection method, 28 pseudo-code, 165
false position method, 31 Finite differences, 79–91
Eigenvalues, 42, 73–78 notation, 83–84
power method, 74–78 Floating point arithmetic, 14–17
Eigenvector, see Eigenvalues Floating point notation, 7
Elementary operations, 45 Forward differences, 83–84, 85–86
Elimination, see Gaussian elimin- inverse interpolation, 111
ation method Newton’s formula, 96, 97, 135
Equations operator, 83–84
algebraic, 1, 23 Forward substitution, 65
differential, 2–3, 4–5, 149–159 Fourier series, 20
linear systems, 2, 42–72 Functions, 18
nonlinear, 23–41 approximation, 18–20
normal, 117, 123–124 Bessel, 20, 115, 161
quadratic, 23 orthogonal, 20, 21, 117
transcendental, 23–24 spline, 114, 129–134
Error generation, 10, 11, 15 transcendental, 2, 23
Error propagation, 10, 11, 52 weight, 147

Gaussian elimination method, 44– Interval arithmetic, 12


50, 52, 131, 167 Inverse interpolation, 110–113
number of operations, 52 divided differences, 111–112
pseudo-code, 167 forward differences, 111
Gaussian integration formula, 146– linear, 110
148, 171 Inverse matrix, 44, 59–61, 64
four point, 148 calculation of, 59–61
Gauss-Legendre, 147 generalized, 123
general form, 147 pseudo-inverse, 123, 124
pseudo-code, 171 solution of linear systems, 44,
three point, 147 59, 61–62
two point, 146–147, 148 Iterative methods, 34–36, 56
Gauss-Seidel iterative method, 56– Gauss-Seidel, 56–58
58, 168 inverse interpolation, 110–111
convergence, 57–58
pseudo-code, 168 Knots, 129
Generation of errors, 10, 11, 15
Graphs, 24 Lagrange interpolation formula,
101–103
Hermite polynomials, 147 Laguerre polynomials, 147
Householder matrices, 126–128 Least squares, 116–117
Householder transformations, 126– normal equations, 117, 123–124
127 solution using QR factorization,
126
IEEE Standard, 7 Legendre polynomials, 20, 115, 147
Ill-conditioning, 49, 53, 69, 70–72, Linear interpolation, 92–94, 96, 97,
117, 124 102
testing for, 70–72 inverse, 110
Increment, 88 Linear systems, 42–72
Initial value problem, 149, 156–157 errors, 51, 56
Integration, numerical, see Numer- existence of solutions, 43–44
ical integration overdetermined, 122
Interpolating polynomial, 98, 101– Loss of significance, 16, 137
102, 105, 114, 129, 135, 139 LU decomposition, 64
error, 106 Crout’s method, 67–68
uniqueness, 98 Doolittle’s method, 67
Interpolation, 79, 92–113 finding, 66–68
Aitken’s method, 107–108 solution of linear systems, 64–66
divided differences, 96, 105
inverse, 110–113 Mantissa, 7, 14
Lagrange’s formula, 101–103 normalized, 14
linear, 92–94, 96, 97, 102 Matrix, 43
Newton’s formulae, 96–100, augmented, 43, 46, 64
105, 135 coefficient, 43, 48, 59, 64
quadratic, 94, 96, 97 condition number, 70–72

determinant, 43–44, 59 secant, 31–32, 37


eigenvalues, see Eigenvalues simple iteration, 34–36
generalized inverse, 123 single-step, 149–152
Householder, 126–128 Taylor series, 149–150
identity, 44 Midpoint method, 153, 154–155
ill-conditioning, see Ill-condi- stability, 154–155
tioning Milne’s method, 153
inverse, 44, 59–61, 64 Mistakes, 8
lower triangular, 64, 66 Model, mathematical, 2
LU decomposition, 64 Multipliers, 45, 48, 52, 66
norm, 70 Multistep methods, 153–155
orthogonal, 125 explicit, 153
pseudo-inverse, 123, 124 implicit, 153
QR factorization, 124–128 stability, 154–155
transpose, 123
tridiagonal, 131, 132 Nested multiplication, 21
upper triangular, 44, 48, 64, 66 Newton’s divided difference for-
Measurement errors, 4 mula, 105, 169
Methods pseudo-code, 169
Adams-Bashforth, 153, 154 Newton’s interpolation formulae,
Adams-Moulton, 153, 154 96–100, 105
Aitken, 107–108 backward difference, 96–97, 98
bisection, 27–29, 164 divided differences, 105
Crout, 67–68 forward difference, 96, 97, 135
direct, 56 Newton-Raphson method, 37–41,
Doolittle, 67 166
elimination, 42, 44–50, 52, 167 convergence, 39–40
Euler, 150, 157 pseudo-code, 166
explicit, 153 Nodes, 129
false position, 1, 30–33, 38, 165 Nonlinear equations, 23–41
Gaussian elimination, 44–50, Norm, 69–70
52, 167 matrix, 70
Gauss-Seidel iterative, 56–58, vector, 69–70
168 Normal equations, 117, 123–124
implicit, 153 Normalized mantissa, 14
LU decomposition, 64–68 Notation
matrix inversion, 59–61 finite differences, 83–84
midpoint, 153, 154–155 floating point, 7
Milne, 153 scientific, 7
multistep, 153–155 Number representation, 4, 7–8, 14
Newton-Raphson, 37–41, 166 binary, 7
power, 74–78 decimal, 7
predictor-corrector, 154 floating point, 7, 14
Runge-Kutta, 150–151, 158, 172 hexadecimal, 7
scaled power, 75–77 Numerical differentiation, 135–138

error, 136–137 interpolating, see Interpolating


Numerical integration, 139–148 polynomial
Gauss four-point formula, 148 Laguerre, 147
Gauss, general form, 147 Legendre, 20, 115, 147
Gauss-Legendre, 147 orthogonal, 147
Gauss three-point formula, 147 piecewise, 114, 129
Gauss two-point formula, 146– Power method, 74–78
147, 148 convergence, 77
Simpson’s rule, 143–145, 147 scaled, 75–77
trapezoidal rule, 139–142 Predictor-corrector methods, 154
Principle of least squares, 116–117
Operations Propagation of errors, 10, 11, 52
elementary, 45 Pseudo-code, 3, 163–172
transformation, 45 Pseudo-inverse matrix, 123, 124
Operator, 83–84
backward difference, 84 QR factorization, 74, 124–128
central difference, 84 procedure, 126–128
Quadratic
forward difference, 83–84
convergence, 32, 40
shift, 83
equation, 23
Ordinary differential equations, see
interpolation, 94, 96, 97
Differential equations
Quadrature, see Numerical integra-
Orthogonal functions, 20, 21, 117
tion
polynomials, 147
Orthogonal matrix, 125
Range of convergence, 18
Overdetermined linear system, 122
Recursive procedure, 20–21
Relative error, 10–11
Parabola, 119 Remainder term, 18
Parameters, 115–116 Residuals, 49, 53
Partial derivative, 116–117 Roots, 23–26, 27
Partial differential equations, 79, double, 28
149 location of, 24–25
Partial pivoting, 52–53 repeated, 28
Pendulum problem, 4–5, 156, 158– Rounding, 8
159, 162 Round-off error, 8, 48, 52, 80–82,
Piecewise polynomials, 114, 129 89, 90, 137
Pivot elements, 48 Runge-Kutta methods, 150–151,
Pivotal condensation, see Partial piv- 158, 172
oting pseudo-code, 172
Polynomial, 20, 23
approximation, 20, 89–91 Scientific notation, 7
characteristic, 73 Secant method, 31–32, 37
Chebyshev, 20, 115, 147 convergence, 32
finite differences, 88–89 Series, 18
Hermite, 147 asymptotic, 20

Bessel, 20 accuracy, 140–141


Chebyshev, 20 error bound, 141
convergence, 18 pseudo-code, 170
expansions, 18, 20 Triangular form, see Upper triangu-
Fourier, 20 lar form
Legendre, 20 Triangular matrix
Taylor, 18–19, 35, 37, 40, 99, lower, 64, 66
141, 144, 149–150 upper, 44, 48, 64, 66
truncation, 18 Tridiagonal system, 131, 132
Shift operator, 83 Truncation, 8, 18, 96, 97
Significant digits, 7 error, 8, 18, 137, 141, 144, 150
Simple iteration method, 34–36 Two-point integration (Gauss), 146–
convergence, 35–36 147, 148
Simpson’s rule, 143–145, 147
accuracy, 144 Uniqueness of interpolating polyno-
error bound, 144 mial, 98
Single precision number, 7 Upper triangular form, 44, 48
Single-step methods, 149–152
Sketching curves, 24–25 Vector norm, 69–70
Spline functions, 114, 129–134
Weight function, 147
cubic, see Cubic splines
Weights, 146
Square root, 1, 41
Stability, 154–155
Step length, 149
Substitution
back, 44, 45, 48, 65
forward, 65
Superlinear convergence, 32
Systems of differential equations,
156–159

Tables, 24
differences, 80, 85–86, 89, 97
Taylor series expansion, 18–19, 35,
37, 40, 99, 141, 144, 149–
150
Three-point integration (Gauss), 147
Transcendental, 2, 23
equations, 23–24
functions, 2, 23
Transformation operations, 45
Transformations
Householder, 126–127
orthogonal, 126
Trapezoidal rule, 139–142, 170
