These are my annotations for ECE204 Numerical Methods, Winter 2024, taught by Professor Douglas Harder. Notes directly from the course are presented as text, while annotations are in blockquotes.

The course is an introduction to how computational methods may be used to solve engineering problems numerically. Lectures give background on the techniques while labs use MATLAB and C++.

Main takeaways:

Prerequisites

The seven tools of numerical analysis

These seven tools allow us to approximate solutions and to estimate the error incurred when simulations are run using floating-point numbers.

  1. Weighted averages
  2. Iteration
  3. Linear algebra
  4. Interpolation
  5. Taylor series
  6. Bracketing
  7. Intermediate-value theorem

Four classes of problems

There are four classes of problems.

  1. Evaluate an expression.
  2. Find a solution to a system of algebraic equations.
  3. Find a solution to a system of analytic equations.
  4. Unconstrained optimization.

Evaluating an expression

We can evaluate an expression to a numeric value. Horner's rule can be used to evaluate polynomials. The effect of floating-point arithmetic is mitigated if the variable is small in magnitude and the coefficients of higher-order terms decrease.

We can estimate a value between given equally spaced points in space or time. For space, given values at the two points surrounding the target, shift and scale them to -0.5 and 0.5. If there are four points, use -1.5, -0.5, 0.5, and 1.5. Then use the interpolating linear or cubic polynomial to approximate, evaluating it with Horner's rule.

For time, given that the most recent point lies to the right of the interval to approximate, shift and scale the most recent three points to -1.5, -0.5, 0.5, or the most recent four points to -5/2, -3/2, -1/2, 1/2, and use the interpolating quadratic or cubic polynomial, respectively.

To approximate the derivative, we can use rise-over-run on the two points, but the error is $\mathcal{O}(h)$, where $h$ is the width of the interval.

Divided-difference formulae

Given equally spaced points in space $x_k$, find the interpolating (quadratic) polynomial, differentiate it once or twice, then evaluate at $x_k$. Error analysis is done using Taylor series and the intermediate value theorem (since the errors can be written as convex combinations).
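
As an annotation (a minimal sketch, not course code), the centered formulas that this procedure produces are the standard ones: $f'(x_k)\approx\frac{f(x_k+h)-f(x_k-h)}{2h}$ and $f''(x_k)\approx\frac{f(x_k+h)-2f(x_k)+f(x_k-h)}{h^2}$.

// Centered divided differences on equally spaced points with spacing h.
// fm1, f0, fp1 denote f(x_k - h), f(x_k), f(x_k + h).
double centered_first_derivative(double fm1, double fp1, double h) {
    return (fp1 - fm1) / (2.0 * h);
}

double centered_second_derivative(double fm1, double f0, double fp1, double h) {
    return (fp1 - 2.0 * f0 + fm1) / (h * h);
}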

Given equally spaced points in time $t_k$, we can find the interpolating quadratic polynomial through the last three points, differentiate it, and evaluate it at $t_k$ for the first derivative. For the second derivative, do the same with the cubic through the last four points, differentiated twice. Error analysis is done with Taylor series. This is called the three-point backward divided-difference method.
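
A corresponding sketch for the backward case (the weights below are the standard second-order three-point backward coefficients; they should match those derived in the course):

// Three-point backward divided difference for the first derivative at t_k,
// using the current sample and the two previous equally spaced samples.
double backward_first_derivative(double fkm2, double fkm1, double fk, double h) {
    return (3.0 * fk - 4.0 * fkm1 + fkm2) / (2.0 * h);
}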

Estimating integrals between given points

For any integral

$$\int_a^b f(x)\, dx = \bar{f}_{[a, b]}\, (b-a)$$

where $\bar{f}_{[a, b]}$ is the average value of $f$ on $[a, b]$. We can approximate the average value of the function by evaluating it at various points and taking a weighted average of those values.

Actual formulae may be found by taking interpolating polynomials between a number of points and integrating that interpolating polynomial.

Weighted averages

Suppose you have $n$ values $x_1, \ldots, x_n$, all of which approximate one value $x$. The average of these $n$ values is

$$\bar{x} := \frac{1}{n} \sum_{j=1}^n x_j$$

If we have $n$ coefficients $c_1, \ldots, c_n$ with the property $\sum_{k=1}^n c_k = 1$, then we say that

$$\bar{x}_W := \sum_{k=1}^n c_k x_k$$

is the weighted average of the values $x_1$ through $x_n$.

Definition 1: Convex combination

Any weighted average in which all coefficients satisfy $c_k \ge 0$ for all $k\in \{1, \ldots, n\}$.

Any convex combination of $n$ real numbers must evaluate to a number within the inclusive range of the minimum and maximum bounds of the numbers provided.

In some cases, a weighted average gives a better approximation than a simple average. For example, when approximating the average value of $\sin$ on $[2, 3]$ from three points, double-weighting the midpoint is far better than a simple average of the three values.

$$\frac{1}4 \sin{2} + \frac{1}{2}\sin{2.5} + \frac{1}4 \sin{3}$$

This particular approximation is known as the composite trapezoidal rule. In general, if $n$ values each approximate a real value $x$, then a convex combination of those $n$ measurements stays within the range of the measurements, so it is not badly affected by floating-point arithmetic and will represent $x$ well.
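
As an illustration (a sketch, not course code), we can compare the simple and trapezoidal-weighted averages against the exact average value of $\sin$ on $[2, 3]$, which is $\cos 2 - \cos 3 \approx 0.574$:

#include <cmath>
#include <cstdio>

int main() {
    double simple   = (std::sin(2.0) + std::sin(2.5) + std::sin(3.0)) / 3.0;
    double weighted = 0.25 * std::sin(2.0) + 0.5 * std::sin(2.5) + 0.25 * std::sin(3.0);
    double exact    = std::cos(2.0) - std::cos(3.0);   // exact average value of sin on [2, 3]

    std::printf("simple %.4f  weighted %.4f  exact %.4f\n", simple, weighted, exact);
}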

Approximating $\sqrt{2}$ requires taking an average. Estimating $a$ and $b$ given samples drawn from a uniform distribution on $[a, b]$ uses a weighted average with some negative coefficients.

Fixed-point iteration

Definition 2: Fixed-point iteration

Fixed-point iteration is the process of repeatedly applying an operation to its own output so that the result converges to a value.

Fixed-point iteration is useful for problems that can be expressed in terms of a function $f: \mathbb{A}\mapsto\mathbb{B}$ and an input $x$ such that there exists a value

$$x = f(x)$$

If repeated recursive evaluation of the form

$$x_{k+1} = f(x_k)$$

has a suitable radius of convergence for our problem, then the fixed point theorem says that the sequence must converge to a solution of $x = f(x)$. This can be used as a numerical way to find solutions to certain problems, like:

Example 1: Fixed-point iteration function, C++

We can use a function reference to implement a general routine that solves fixed-point functions.

double fixed_point(double f(double), double x_guess) {
    double x_previous{ 0 };

    do {
        x_previous = x_guess;
        x_guess = f(x_guess);
    } while (x_previous != x_guess); // Convergence condition

    return x_guess;
}

Because of the floating-point representation, x_previous and x_guess may never be exactly equal, so we can amend this to use a step tolerance. The step tolerance is the largest difference between x_previous and x_guess that we will accept for the iteration to be considered converged.

Further, note that if the iteration does not converge we need a way to exit the loop. Therefore, we also add a maximum number of iterations.

#include <cmath>
#include <cstddef>
#include <stdexcept>

double fixed_point(double f(double), double x_guess, double step_tolerance,
                   std::size_t max_iterations) {
    double x_previous{ 0 };

    for (std::size_t i{0}; i < max_iterations; ++i) {
        x_previous = x_guess;
        x_guess = f(x_guess);

        // Converged: successive iterates agree to within the step tolerance.
        if (std::abs(x_previous - x_guess) < step_tolerance) {
            return x_guess;
        }
    }
    throw std::runtime_error("fixed_point did not converge");
}
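
For example (an annotation, not from the notes), the routine above can find the fixed point of $\cos$, since iterating $x_{k+1} = \cos x_k$ converges from any starting guess:

#include <cmath>
#include <cstdio>

double cos_once(double x) {
    return std::cos(x);
}

int main() {
    // Iterating x = cos(x) from x0 = 1 converges to roughly 0.739085.
    std::printf("%.6f\n", fixed_point(cos_once, 1.0, 1e-10, 100));
}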

Linear algebra

Many numerical solutions can be found by finding solutions to a system of linear equations.

Polynomial interpolation

Given $n$ points of form $(x_k, y_k)$, there is a polynomial of degree $n-1$ that passes through all of them as long as all $x_k$ are unique.

The coefficients of the polynomial can be found by solving a system of linear equations. The coefficient matrix of this system is known as the Vandermonde matrix; its entries are the given $x$ coordinates raised to the powers $0$ through $n-1$.

$$V(x_1, x_2, \ldots, x_n) = \begin{bmatrix} x_1^{n-1} & x_1^{n-2} & \ldots & x_1 & 1 \\ x_2^{n-1} & x_2^{n-2} & \ldots & x_2 & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x_{n-1}^{n-1} & x_{n-1}^{n-2} & \ldots & x_{n-1} & 1 \\ x_n^{n-1} & x_n^{n-2} & \ldots & x_n & 1\end{bmatrix}$$

The coefficients are found by solving the below system:

$$V \begin{bmatrix}a_{n-1} \\ a_{n-2} \\ \vdots \\ a_1 \\ a_0 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n-1} \\ y_n \end{bmatrix}$$

This makes sense because the interpolating polynomial in the $n=3$ (three-point) case must pass through each of the three points, giving three equations in the three unknown coefficients.
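
For instance (a worked annotation, not from the notes), the quadratic through $(0, 1)$, $(1, 2)$, and $(2, 5)$ satisfies

$$\begin{bmatrix} 0 & 0 & 1 \\ 1 & 1 & 1 \\ 4 & 2 & 1 \end{bmatrix} \begin{bmatrix} a_2 \\ a_1 \\ a_0 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 5 \end{bmatrix}$$

whose solution is $a_2 = 1$, $a_1 = 0$, $a_0 = 1$, so $p(x) = x^2 + 1$.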

Theorem 1: Existence and uniqueness of interpolation polynomials

Given $n+1$ points $(x_i, y_i)_{i=0}^n$ with distinct $x$ values, there is one and only one polynomial $p_n(x)\in\mathbb{P}_n$ satisfying the condition

$$p_n(x_i) = y_i,\quad i\in\{0,\ldots,n\}$$

Using partial pivoting, we avoid adding a large multiple of one equation onto another, which would lose information in the row being modified.

Shifting and scaling

These are two techniques that reduce error from floating-point arithmetic. Shifting is necessary if the $x$ values are large in magnitude; scaling works best if they are equally spaced.

Taylor series

We use the representation:

$$f(x + h) = f(x) + f^{(1)}(x)h + \frac{1}2 f^{(2)}(x)h^2 + \ldots$$

A definition of error

Definition 3: Error

If $x_\text{approx}$ is an approximation of $x$, the error $\epsilon$ is such that

$$x = x_\text{approx} + \epsilon$$

The absolute error is defined as

$$\epsilon_\text{abs} := \lvert x - x_\text{approx}\rvert$$

The relative error is given as:

$$\epsilon_\text{rel} := {\epsilon_\text{abs} \over \lvert x \rvert}$$

The relative error is preferred because it is unit-free. For example, $\epsilon_\text{rel} = 0.01$ means the approximation is good to one part in a hundred regardless of the magnitude of the exact value.

A solution is only useful if its maximum error is acceptable. The maximum absolute error of each of the algorithms below can be determined, and should be when evaluating potential solutions!

Definition 4: Significant digits

We say an approximation $x_\text{approx}$ is to (or within) $d$ significant digits if its relative error $\epsilon_\text{rel}$ satisfies the inequality:

$$\epsilon_\text{rel} \le 5\times 10^{-d}$$

Rounding

Computers cannot store real numbers with infinite precision. To compute results, algorithms must therefore round values during operations.

We round to $N$ digits by looking at the value of the $(N+1)$th digit. If that digit is $>5$ we round up. If it is $<5$ we round down. If it is exactly $5$, we round to even: if the $N$th digit is even we round down (leave it unchanged), and if it is odd we round up. For example, rounding to three digits, $2.345$ becomes $2.34$ (the $4$ is even) and $2.335$ also becomes $2.34$ (the $3$ is odd, so we round up).

This is so that there is no biasing of rounding up or down for that specific case.

Precision and accuracy

Two algorithms can be compared based on their precision versus accuracy.

Definition 5: Precision

An algorithm is more precise if given uniformly-random inputs, it produces output that has less variance.

Definition 6: Accuracy

An algorithm is more accurate if, given uniformly random inputs, it produces output that deviates less from the expected (true) outputs.

Sources of error

Subtractive cancellation

Subtractive cancellation occurs when subtracting good approximations of two nearby numbers yields a far worse relative approximation than you might expect. [1]

Cancellation is inherent to the subtraction operation: even if the subtraction itself is computed exactly, if the inputs are approximations then the result suffers from subtractive cancellation.

Theorem 2: Subtractive cancellation

Subtractive cancellation occurs because subtraction is ill-conditioned at nearby inputs.

Let $\bar{x}= x + \epsilon_x \cdot x$ and $\bar{y}= y + \epsilon_y\cdot y$ be approximations of $x$ and $y$ with relative errors $\epsilon_x, \epsilon_y$.

Then

$$\begin{align}\bar{x} - \bar{y} &= (x + \epsilon_x x) - (y + \epsilon_y y) \\ &= x - y + \epsilon_x x - \epsilon_y y \\ &= (x-y)\left (1 + \frac{x \epsilon_x - y \epsilon_y}{x - y}\right )\end{align}$$

Therefore the relative error of the exact computation on the approximate values is

$$\epsilon_{x-y} = \left\lvert \frac{x\epsilon_x - y\epsilon_y}{x-y}\right\rvert$$

Given the $x-y$ in the denominator, that means that $\epsilon_{x-y}\to\infty$ as $x-y \to 0$.

This is a problem when $x$ and $y$ are very close to each other.

Floating-point error

Floating-point addition is not associative: in general,

$$(x \oplus y) \oplus z \ne x \oplus (y \oplus z)$$

where $\oplus$ denotes correctly rounded floating-point addition.
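
A small sketch demonstrating this in double precision:

#include <cstdio>

int main() {
    double left  = (0.1 + 0.2) + 0.3;   // typically prints 0.6000000000000001
    double right = 0.1 + (0.2 + 0.3);   // typically prints 0.6
    std::printf("%.17g\n%.17g\n", left, right);   // the two groupings differ
}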

To mitigate these issues:

The Sterbenz lemma is a theorem that states when floating-point differences are computed exactly.

Theorem 3: Sterbenz lemma

In a floating-point number system with subnormal numbers (extended specification of IEEE 754), if $x$ and $y$ are floating point numbers such that

$$\frac{y}2 \le x \le 2y$$

then $x-y$ is also a floating-point number. Thus, a correctly rounded floating-point subtraction

$$x \ominus y = \text{fl}(x - y) = x - y$$

is computed exactly. [1]

Contrast the Sterbenz lemma to the phenomenon of subtractive cancellation. Even if the lemma guarantees that the subtraction will be performed exactly, there may still be inherent error in the subtraction that results if the floating-point operands are themselves approximations!

Horner's method

Horner's method is an algorithm to evaluate polynomials.

$$a_0 + a_1 x + a_2 x^2 + \ldots + a_n x^n = a_0 + x(a_1 + x(a_2 + \ldots + xa_n))$$

For a polynomial of degree $n$, this requires only $n$ multiplications and $n$ additions.
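
A minimal sketch of the method (the coefficient layout, constant term first, is an assumption of this sketch):

#include <cstddef>
#include <vector>

// Evaluate a_0 + a_1 x + ... + a_n x^n using Horner's rule.
// coeffs[k] holds a_k; exactly n multiplications and n additions are performed.
double horner(const std::vector<double>& coeffs, double x) {
    if (coeffs.empty()) { return 0.0; }
    double result{ coeffs.back() };
    for (std::size_t k{ coeffs.size() - 1 }; k-- > 0; ) {
        result = result * x + coeffs[k];
    }
    return result;
}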

Condition number of a matrix

The condition number of a matrix measures how sensitive it is to change. It is defined using the matrix norm.

Matrix norm

The matrix norm measures how much multiplication by the matrix can grow a vector. [2]

Definition 7: Matrix norm

The norm of $A$ is the largest ratio

$$\lvert\lvert A \rvert\rvert := \text{max}_{x\ne 0} \frac{\lvert \lvert Ax\rvert\rvert}{\lvert\lvert x\rvert\rvert}$$

Through derivation, this is equal to the square root of the largest eigenvalue of $A^T A$:

$$\lvert\lvert A \rvert\rvert^2 = \max_{x\ne 0} \frac{x^T A^T Ax}{x^T x} = \lambda_\text{max} (A^T A)$$

Sensitivity to error

A matrix's sensitivity to error is measured by its condition number, $\kappa(A) := \lvert\lvert A\rvert\rvert \,\lvert\lvert A^{-1}\rvert\rvert$; a large condition number means small relative changes in the data can produce large relative changes in the solution.

Partial pivoting: solving linear systems

Motivation: Many physical phenomena can be modelled as systems of linear equations. Further, many non-linear systems may be approximated using linear systems.

Indeed, in 1947 John von Neumann and H. H. Goldstine published a comprehensive treatise on the presence of essentially random error, and how to minimize that error, while solving linear systems in their work Numerical Inverting of Matrices of High Order.

Gaussian elimination is a common technique for solving such a system. Given a system of $n$ linear equations in $n$ unknowns expressed as the coefficient matrix $A$, the variable vector $\vec u$ and the solution vector $\vec v$:

$$A\vec u = \vec v$$

we create the augmented matrix $A\lvert\vec v$ and apply the elementary row operations on the matrix to reach row-echelon form. Backward substitution may be used to find the unique solution if $\text{rank}(A)=n$.

Definition 8: Elementary row operations
  • Swapping two rows
  • Adding a multiple of one row onto another
  • Multiplying a row by a non-zero scalar

However, naive elimination presents numerical issues: dividing by a small pivot produces large multipliers, so a large multiple of one row is added onto another and information in the modified row is lost.

The partial pivoting method maintains precision by swapping the row containing the largest-magnitude entry on or below the pivot into the pivot row.

If the largest entry in absolute value is moved to the pivot, then when we add a multiple of one row onto another, we are guaranteed that

$$\left\lvert \frac{a_{k,j}}{a_{i,j}}\right\rvert \le 1$$

and so avoid adding a large multiple of one row onto another.
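
A sketch of elimination with partial pivoting on an augmented matrix (the row-major storage and error handling are assumptions of this sketch):

#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Solve A u = v by Gaussian elimination with partial pivoting.
// 'a' is the n x (n + 1) augmented matrix [A | v], stored row by row.
std::vector<double> solve_partial_pivoting(std::vector<std::vector<double>> a) {
    const std::size_t n{ a.size() };

    for (std::size_t j{ 0 }; j < n; ++j) {
        // Find the row with the largest entry (in magnitude) on or below the pivot.
        std::size_t pivot{ j };
        for (std::size_t i{ j + 1 }; i < n; ++i) {
            if (std::abs(a[i][j]) > std::abs(a[pivot][j])) { pivot = i; }
        }
        if (a[pivot][j] == 0.0) { throw std::runtime_error("matrix is singular"); }
        std::swap(a[j], a[pivot]);

        // Eliminate below the pivot; each multiplier has magnitude at most 1.
        for (std::size_t i{ j + 1 }; i < n; ++i) {
            double m{ a[i][j] / a[j][j] };
            for (std::size_t k{ j }; k <= n; ++k) { a[i][k] -= m * a[j][k]; }
        }
    }

    // Backward substitution.
    std::vector<double> u(n);
    for (std::size_t j{ n }; j-- > 0; ) {
        double sum{ a[j][n] };
        for (std::size_t k{ j + 1 }; k < n; ++k) { sum -= a[j][k] * u[k]; }
        u[j] = sum / a[j][j];
    }
    return u;
}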

Jacobi method

Theorem 4: Jacobi method

The Jacobi method is a fixed-point iteration. The theorem states that given a linear system of the form

$$A \vec u = \vec v$$

with strictly diagonally-dominant matrix $A$ and known $\vec v$, then the recursive iteration

$$u_k = A_\text{diag}^{-1}(v - A_\text{off}\cdot u_{k-1})$$

always converges to the solution of the system $u$ given any initial guess $u_0$.
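
A sketch of the iteration for a dense matrix (the stopping rule based on the largest componentwise step is an assumption of this sketch):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Jacobi iteration: u_k = D^{-1} (v - (A - D) u_{k-1}) for a strictly
// diagonally dominant matrix A, repeated until successive iterates agree.
std::vector<double> jacobi(const std::vector<std::vector<double>>& A,
                           const std::vector<double>& v,
                           std::vector<double> u,
                           double step_tolerance, std::size_t max_iterations) {
    const std::size_t n{ A.size() };
    std::vector<double> u_next(n);

    for (std::size_t iter{ 0 }; iter < max_iterations; ++iter) {
        double largest_step{ 0.0 };
        for (std::size_t i{ 0 }; i < n; ++i) {
            double sum{ v[i] };
            for (std::size_t j{ 0 }; j < n; ++j) {
                if (j != i) { sum -= A[i][j] * u[j]; }
            }
            u_next[i] = sum / A[i][i];
            largest_step = std::max(largest_step, std::abs(u_next[i] - u[i]));
        }
        u = u_next;
        if (largest_step < step_tolerance) { break; }
    }
    return u;
}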

Approximating integrals

There are three main methods:

  1. Midpoint rule: estimate definite integral using piecewise constant functions.
  2. Trapezoidal rule: estimate using piecewise linear functions.
  3. Simpson's rule: estimate using piecewise quadratic functions.

Trapezoid rule

Definition 9: Trapezoid rule

Approximates the definite integral.

Given a function $f(x)$, we take a selection of points. We then connect them into a set of trapezoids and calculate the area.

$$\int_a^b f(x)\, dx \approx (b-a)\,\frac{1}{2}\left(f(a)+f(b)\right)$$

Given equal spacing $\Delta x$ between the points $x_0$ through $x_N$, this becomes [1]

$$\int_a^b f(x)\, dx \approx \frac{\Delta x}{2} \sum^N_{k=1} \left(f(x_{k-1}) + f(x_k)\right)$$

Comparisons can be drawn between this and Riemann summing.
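
A sketch of the composite rule as code:

#include <cstddef>

// Composite trapezoidal rule on n equal subintervals of [a, b].
double trapezoidal(double f(double), double a, double b, std::size_t n) {
    const double dx{ (b - a) / static_cast<double>(n) };
    double sum{ 0.5 * (f(a) + f(b)) };   // endpoints weighted 1/2
    for (std::size_t k{ 1 }; k < n; ++k) {
        sum += f(a + static_cast<double>(k) * dx);   // interior points weighted 1
    }
    return dx * sum;
}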

Centered $O(h^5)$ backwards approximation

We may integrate the interpolating polynomial through the relevant points for each $k$; integrating that interpolating polynomial gives the weights of the corresponding rule.

Simpson's rule

Definition 10: Simpson's rule

Approximate the function $f$ using piecewise quadratic functions.

  • Partition the interval into an even number of subintervals of the same length.

  • Over the first pair of subintervals, approximate $\int_{x_0}^{x_2} f(x)\, dx$ with $\int_{x_0}^{x_2} p(x)\, dx$, where $p(x) = Ax^2 + Bx + C$ is the interpolating polynomial passing through

  • $(x_0, f(x_0))$

  • $(x_1, f(x_1))$

  • $(x_2, f(x_2))$

Theorem 5: Simpson's rule

Assume that $f(x)$ is continuous over $I = [a, b]$. Let $n$ be a positive even integer and $\Delta x = \frac{b-a}n$. Let $I$ be divided into $n$ subintervals, each of length $\Delta x$, with endpoints at $P=\{x_0, x_1, x_2, \ldots, x_n\}$. Set

$$S_n = \frac{\Delta x}{3} \left [ f(x_0) + 4f(x_1) + 2 f(x_2) + \ldots + 2f(x_{n-2}) + 4f(x_{n-1}) + f(x_n) \right ]$$

Then:

$$\lim_{n\to +\infty} S_n = \int_a^b f(x)\, dx$$

The error bound for the statement above approximating $f(x)$ can be given as:

If $M$ is the maximum value of $\lvert f^{(4)}(x)\rvert$ over $I$, then the upper bound for the actual error $\delta$ in using $S_n$ to estimate $\int_a^b f(x)\, dx$ is given by

$$\delta \le \frac{M(b-a)^5}{180 n^4}$$
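
A sketch of the composite rule $S_n$ as code:

#include <cstddef>
#include <stdexcept>

// Composite Simpson's rule on n equal subintervals of [a, b]; n must be even.
double simpson(double f(double), double a, double b, std::size_t n) {
    if (n == 0 || n % 2 != 0) { throw std::invalid_argument("n must be a positive even integer"); }
    const double dx{ (b - a) / static_cast<double>(n) };
    double sum{ f(a) + f(b) };
    for (std::size_t k{ 1 }; k < n; ++k) {
        // Odd-indexed interior points get weight 4, even-indexed get weight 2.
        sum += (k % 2 == 1 ? 4.0 : 2.0) * f(a + static_cast<double>(k) * dx);
    }
    return dx * sum / 3.0;
}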

Bracketing

Bracketing is the numerical equivalent of the binary search. It can be modified to become the interpolation search to make convergence even faster.
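
A minimal sketch of the simplest bracketing method, bisection, which halves the bracket until it is narrower than a tolerance:

#include <stdexcept>

// Find a root of f on [a, b], assuming f(a) and f(b) have opposite signs.
double bisection(double f(double), double a, double b, double tolerance) {
    if (f(a) * f(b) > 0.0) { throw std::invalid_argument("f(a) and f(b) must bracket a root"); }
    while (b - a > tolerance) {
        double mid{ 0.5 * (a + b) };
        // Keep whichever half-interval still brackets the root.
        if (f(a) * f(mid) <= 0.0) { b = mid; } else { a = mid; }
    }
    return 0.5 * (a + b);
}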

Least-squares best-fitting polynomials

Suppose that we are given a linear system that is overdetermined: there is no linear combination of the columns of $A$ that equals the target vector $\vec v$.

What linear combination is closest to $\vec v$?

Let $A: \mathcal{U}\to \mathcal{V}$ be a linear map, usually from $\mathbb{R}^n$ to $\mathbb{R}^m$.

We want to minimize the norm:

$$\lvert\lvert A \vec u - \vec v\rvert\rvert_{2}$$

and find a $\vec u$ that makes it as small as possible.

The objective is to find $\vec u_0$ such that $A\vec u_0 - \vec v$ is perpendicular to every vector in the range of $A$.

We derive the normal equations

$$A^T A \vec u_0 = A^T \vec v$$

so that

$$\vec u_0 = (A^T A)^{-1} A^T \vec v$$

is the vector that minimizes the norm.
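
As a sketch of the simplest case (the function name and point layout are assumptions of this sketch), fitting a best-fit line $y = a_1 x + a_0$ reduces the normal equations to a $2\times 2$ system that can be solved in closed form:

#include <cstddef>
#include <utility>
#include <vector>

// Least-squares line y = a1 * x + a0 through the points (x[i], y[i]),
// found by solving the 2 x 2 normal equations A^T A u = A^T v directly.
std::pair<double, double> best_fit_line(const std::vector<double>& x,
                                        const std::vector<double>& y) {
    double sx{ 0 }, sxx{ 0 }, sy{ 0 }, sxy{ 0 };
    const double n{ static_cast<double>(x.size()) };
    for (std::size_t i{ 0 }; i < x.size(); ++i) {
        sx += x[i];  sxx += x[i] * x[i];
        sy += y[i];  sxy += x[i] * y[i];
    }
    const double det{ n * sxx - sx * sx };           // determinant of A^T A
    const double a1{ (n * sxy - sx * sy) / det };    // slope
    const double a0{ (sxx * sy - sx * sxy) / det };  // intercept
    return { a1, a0 };
}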

Newton's method

The Taylor series for a real-valued function of a vector variable

$$f(\vec u) \approx f(\vec u_0) + \nabla f(\vec u_0)^T (\vec u - \vec u_0)$$
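
Truncating this expansion after the linear term and solving for the zero gives, in one dimension, the familiar update $x_{k+1} = x_k - f(x_k)/f'(x_k)$; a minimal sketch:

#include <cmath>
#include <cstddef>
#include <stdexcept>

// Newton's method in one dimension: repeatedly replace the guess by the
// root of the tangent-line approximation at the current guess.
double newton(double f(double), double df(double), double x,
              double step_tolerance, std::size_t max_iterations) {
    for (std::size_t i{ 0 }; i < max_iterations; ++i) {
        double step{ f(x) / df(x) };
        x -= step;
        if (std::abs(step) < step_tolerance) { return x; }
    }
    throw std::runtime_error("newton did not converge");
}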

Generalization to finding simultaneous zeros in multiple dimensions

We can use linear algebra and a clever formulation of Newton's method to find simultaneous roots of $n$ functions in $n$ dimensions.

It retains Newton's method's quadratic convergence.

Secant method

Generalizing the secant method to higher dimensions: start with three guesses and calculate the functions' values at each guess, which gives a plane for each function. Then you can eliminate the worst guess.

Solving methods extend from linear equations, to non-linear equations, to systems of linear equations, to systems of non-linear equations (Newton's method in $n$ dimensions or the generalized secant method).

The methods covered include bisection, Newton's method, the secant method, inverse parabolic interpolation, and successive over-relaxation.
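
For reference (an annotation, not course code), a sketch of the one-dimensional secant method, which replaces the derivative in Newton's method with a rise-over-run estimate from the two most recent iterates:

#include <cmath>
#include <cstddef>
#include <stdexcept>

// Secant method: like Newton's method, but the derivative is approximated
// by the slope of the secant through the two most recent iterates.
double secant(double f(double), double x0, double x1,
              double step_tolerance, std::size_t max_iterations) {
    for (std::size_t i{ 0 }; i < max_iterations; ++i) {
        double x2{ x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0)) };
        if (std::abs(x2 - x1) < step_tolerance) { return x2; }
        x0 = x1;
        x1 = x2;
    }
    throw std::runtime_error("secant did not converge");
}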

Approximating integrals

The backward formulas allow real-time calculation of integrals using historical data; a code sketch follows the steps below.

Steps:

  1. Initialize variable holding complete integral, $S$.
  2. Store either 2 or 3 previous data points in memory buffer.
  3. When new data point comes in, do the following:
    • Calculate backwards-composite three point integral using previous 2 data points + new data point.
    • Add result to $S$ (integral for one step)
    • or calculate backwards-composite four point integral using previous 3 data points + new data point.
    • Add result to $S$ (integral for one step)
  4. If there is a dropout (that is, no data arrives for one step):
    • Use the trapezoid rule with $h=2$ (two steps) and the previous data point.
    • Add the result to $S$ (this has a large error, but the least among the available approximations).
    • Then revert to the previous method.
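
A sketch of the step-by-step accumulation using the three-point backward rule. The weights $\frac{h}{12}(5 f_k + 8 f_{k-1} - f_{k-2})$ come from integrating the interpolating quadratic through the last three samples over the most recent step (they should be checked against the course's formula sheet); the handling of the very first samples is an assumption of this sketch.

#include <deque>

// Running integral updated in real time with the three-point backward rule:
// each new sample adds the integral of the interpolating quadratic through
// the last three samples, taken over the most recent step of width h.
class RunningIntegral {
public:
    explicit RunningIntegral(double h) : h_{ h } {}

    void add_sample(double f) {
        buffer_.push_back(f);
        if (buffer_.size() > 3) { buffer_.pop_front(); }
        if (buffer_.size() == 3) {
            // Integral over [t_{k-1}, t_k] of the quadratic through the buffer.
            total_ += h_ * (5.0 * buffer_[2] + 8.0 * buffer_[1] - buffer_[0]) / 12.0;
        } else if (buffer_.size() == 2) {
            // Only two samples so far: fall back to the trapezoid rule.
            total_ += 0.5 * h_ * (buffer_[0] + buffer_[1]);
        }
    }

    double value() const { return total_; }

private:
    double h_;
    double total_{ 0.0 };
    std::deque<double> buffer_;
};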

References

[3] Wikipedia Contributors, "Trapezoidal rule". Accessed 2024.

[1] Wikipedia Contributors, "Catastrophic cancellation". Accessed 2024-02-12. [Online]

[4] Wikipedia Contributors, "Sterbenz lemma". Accessed 2024-02-12. [Online]

[5] LibreTexts Mathematics, "7.7: Approximate Integration". Accessed 2024.

[6] Douglas Wilhelm Harder, "In a nutshell: Least-squares best-fitting polynomial". [PDF]

[7] Douglas Wilhelm Harder, ECE 204 Numerical Methods, "Course introduction". [PDF]

[8] Douglas Wilhelm Harder, ECE 204 Numerical Methods, "3. Tools". [PDF]

[9] Oleksandr Kaleniuk, Geometry for Programmers, 2023. [Book]

[2] Gilbert Strang, Introduction to Linear Algebra, 2023. [Online]