Numerical Linear Algebra

The study of algorithms for performing linear algebra operations numerically

Page Rank

Idea: If many web pages link to your website, there must be a consensus that it is important

We represent the web’s structure as a directed graph

Adjacency Matrix

$$G_{ij}=\begin{cases}1, & \text{if link } j \to i \text{ exists}\\ 0, & \text{otherwise}\end{cases}$$

Then the degree for node q is the sum of entries in column q
Notice: Matrix G is not necessarily symmetric about the diagonal

If page j links to page i, this is considered a vote by j that i is important

Global Importance

If page i has many incoming links, it is probably important

The Random Surfer Model

Imagine an internet user who starts at a page, and follows links at random from page to page for K steps

Criticisms

Potential issues with this algorithm?

A Markov Chain Matrix

Let P be a (large!) matrix of probabilities, where Pij is the probability of randomly transitioning from page j to page i

$$P_{ij}=\begin{cases}\frac{1}{\deg j}, & \text{if link } j \to i \text{ exists}\\ 0, & \text{otherwise}\end{cases}$$

We can build this matrix P from our adjacency matrix G
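A minimal sketch of this construction, assuming `G` is a dense NumPy 0/1 array with `G[i, j] = 1` when page `j` links to page `i` (function name is illustrative):

```python
import numpy as np

def transition_matrix(G):
    """Build the column-stochastic transition matrix P from adjacency matrix G.

    Column j of G lists the outgoing links of page j, so the out-degree of
    page j is the sum of column j. Dead-end columns (zero out-degree) are
    left as zero here; teleportation handles them later.
    """
    G = np.asarray(G, dtype=float)
    deg = G.sum(axis=0)                  # out-degree of each page (column sums)
    P = np.zeros_like(G)
    nonzero = deg > 0
    P[:, nonzero] = G[:, nonzero] / deg[nonzero]
    return P
```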

Dead Ends

To deal with dead-end links, we will simply teleport to a new page at random!

Escaping Cycles

Most of the time (a fraction α), follow links randomly, via P

Some Useful Properties of M

The entries of M satisfy $0\leq M_{ij}\leq 1$
Each column of M sums to 1: $$\sum_{i=1}^R M_{ij} = 1$$
Interpretation: If we are on a webpage, probability of being on some webpage after a transition is 1

Markov Transition Matrices

The Google Matrix M is an example of a Markov matrix
We define a Markov matrix Q by the two properties:

$$0\leq Q_{ij}\leq 1 \quad\text{and}\quad \sum_{i=1}^R Q_{ij}=1$$

Probability Vector

Now, define a probability vector as a vector q such that

$$0\leq q_i\leq 1 \quad\text{and}\quad \sum_{i=1}^R q_i=1$$

If the surfer starts at a random page with equal probabilities, this can be represented by a probability vector, where $p_i=\frac{1}{R}$

Evolving the Probability Vector

Now we have:

$$p^1=Mp^0$$
Likewise, for any step $n$, the next step's probabilities are
$$p^{n+1}=Mp^n$$

Preserving a Probability Vector

If $p^n$ is a probability vector, is $p^{n+1}=Mp^n$ also a probability vector?

Page Rank Idea

With what probability does our surfer end up at each page after many steps, starting from $p^0=\frac{1}{R}e$?
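A small sketch of "run the chain for many steps", assuming a dense Google matrix `M` is already available (a matrix-free version appears later):

```python
import numpy as np

def pagerank_power_iteration(M, max_steps=200, tol=1e-10):
    """Evolve p^{n+1} = M p^n starting from the uniform vector p^0 = e/R."""
    R = M.shape[0]
    p = np.full(R, 1.0 / R)             # p^0: equal probability on every page
    for _ in range(max_steps):
        p_next = M @ p
        if np.linalg.norm(p_next - p, 1) < tol:
            return p_next
        p = p_next
    return p
```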

Making Page Rank Efficient

How can we apply/implement Page Rank in a way that is computationally feasible?

Precomputation

The ranking vector p can be precomputed once and stored, independent of any specific query

Sparsity

In numerical linear algebra, we often deal with two kinds of matrices: dense matrices and sparse matrices (whose entries are mostly zero)

Sparse Matrix-Vector Multiplication

Multiplying a sparse matrix with a vector can be done efficiently

Exploiting Sparsity

Note: M is fully dense, because of the teleportation term used to escape cycles
The trick: Use linear algebra manipulations to perform the main iteration $$p^{n+1}=Mp^n$$ without ever creating/storing M!
Recall:

$$M=\alpha\left(P+\frac{1}{R}ed^T\right)+(1-\alpha)\frac{1}{R}ee^T$$

where $e$ is the all-ones vector and $d$ is the dead-end indicator vector ($d_j=1$ if page $j$ has no outgoing links)

Consider computing:

$$Mp^n=\alpha Pp^n+\frac{\alpha}{R}e\,d^Tp^n+\frac{(1-\alpha)}{R}ee^Tp^n$$
Since $e^Tp^n=1$ for a probability vector, this simplifies to
$$Mp^n=\alpha Pp^n+\frac{\alpha}{R}e\,d^Tp^n+\frac{(1-\alpha)}{R}e$$

We never form M explicitly
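A sketch of that matrix-free product, assuming `P` is stored as a sparse (e.g. `scipy.sparse`) matrix and `d` is the 0/1 dead-end indicator vector described above; all names are illustrative:

```python
import numpy as np

def google_matvec(P, d, p, alpha=0.85):
    """Compute M p without forming the dense matrix M.

    P: (sparse) column-stochastic link matrix, d: dead-end indicator vector,
    p: probability vector (so e^T p = 1). The two scalar terms below are
    multiples of the all-ones vector e, added to every entry by broadcasting.
    """
    R = P.shape[0]
    return alpha * (P @ p) + (alpha / R) * (d @ p) + (1.0 - alpha) / R

def pagerank(P, d, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration p^{n+1} = M p^n using only matrix-free products."""
    p = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(max_iter):
        p_next = google_matvec(P, d, p, alpha)
        if np.linalg.norm(p_next - p, 1) < tol:
            return p_next
        p = p_next
    return p
```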

Google Search: Other Factors

Page Rank can be “tweaked” to incorporate other (commercial?) factors.

Replace standard teleportation $(1-\alpha)\frac{1}{R}ee^T$ with $(1-\alpha)ve^T$, where a special probability vector $v$ places extra weight on whatever sites you like.

Convergence of Page Rank

We will need some additional facts about Markov Matrices, involving eigenvalues and eigenvectors

Eigenvalues and Eigenvectors

An eigenvalue λ and corresponding eigenvector x of a matrix Q are a scalar and non-zero vector, respectively, which satisfy$$Qx=\lambda x$$
Equivalently, this can be written as$$Qx=\lambda I x$$where I is the identity matrix
Rearranging gives$$(\lambda I - Q)x = 0$$which implies that the matrix $\lambda I - Q$ must be singular for $\lambda$ and $x$ to be an eigenvalue/eigenvector pair, since we want $x\neq 0$

Singular

A singular matrix A satisfies $\det A=0$
Thus to find the eigenvalues $\lambda$ of Q, we can solve the characteristic polynomial given by $$\det(\lambda I -Q) = 0$$which, for a 2×2 matrix Q, is $$(\lambda - Q_{00})(\lambda-Q_{11})-(Q_{01}\times Q_{10})=0$$
Solve for $\lambda$, then plug each $\lambda$ back into $Qx=\lambda x$ to find the corresponding eigenvectors
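As a quick check of this recipe on an illustrative 2×2 Markov matrix:

```python
import numpy as np

# Illustrative 2x2 Markov matrix (columns sum to 1)
Q = np.array([[0.9, 0.2],
              [0.1, 0.8]])

# Characteristic polynomial: (lam - Q00)(lam - Q11) - Q01*Q10
#                          = lam^2 - trace(Q)*lam + det(Q) = 0
lam = np.roots([1.0, -Q.trace(), np.linalg.det(Q)])
print(lam)                      # [1.0, 0.7] -- note the eigenvalue 1
print(np.linalg.eigvals(Q))     # same values, up to ordering
```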

Properties

det(A − λI) = 0 → eigenvalues

Convergence Properties

  1. Every Markov matrix Q has 1 as an eigenvalue
  2. Every eigenvalue of a Markov matrix Q satisfies $|\lambda|\leq 1$
    1. So 1 is the largest eigenvalue
  3. A Markov matrix Q is a positive Markov matrix if $Q_{ij}>0$ for all $i,j$
  4. If Q is a positive Markov matrix, then there is only one linearly independent eigenvector of Q with $|\lambda|=1$

The number of iterations required for Page Rank to converge to the final vector p depends on the size of the 2nd largest eigenvalue, $|\lambda_2|$
Expanding $p^0$ in the eigenvector basis of M:
$$p^k=M^kp^0=c_1x_1+\sum_{l=2}^{R}c_l(\lambda_l)^kx_l$$
Result

For the Google matrix, $|\lambda_2|\leq\alpha$

if $\alpha=0.85$, then $|\lambda_2|^{114}\leq|0.85|^{114}\approx 10^{-8}$. What does this say?

After 114 iterations, any vector components of $p^0$ not corresponding to the eigenvalue $|\lambda_1|$ will be scaled down by a factor of about $10^{-8}$

  • The resulting vector p114 is likely to be a good approximation of the dominant eigenvector, x1

A smaller value of $|\lambda_2|$ (which is bounded by $\alpha$) implies faster convergence

Gaussian Elimination

Solving Linear Systems of Equations

$$Ax=b$$
where $A$ is a matrix, $b$ is a **right-hand-side** (column) vector, and $x$ is a (column) **vector of unknowns**

### Review: Gaussian Elimination

* Eliminating variables via row operations, until one remains
* **back-substituting** to recover the value of all the other variables

This was done by applying combinations of:

1. **Multiplying a row** by a constant
2. **Swapping rows**
3. Adding a multiple of one row to another row

Our view will be the following:

1. **Factor** matrix $A$ into $A=LU$ where $L$ and $U$ are triangular
2. **Solve** $Lz=b$ for intermediate vector $z$
3. **Solve** $Ux=z$ for $x$

The upper triangular matrix after row operations is the $U$ in our matrix factorization $A=LU$
* $L$ contains the multiplicative factors
* The diagonal entries of $L$ are always 1's

>[!tldr]- Algorithm
>![Pasted image 20250423075839.png](/img/user/assets/Pasted%20image%2020250423075839.png)

### Factorization and Triangular Solves

Given the factorization $A=LU$, we can rapidly solve $Ax=b$. How?
* This is the same as solving $LUx=b$. Rewrite it as $L(Ux)=b$
* Define $z=Ux$, and rewrite this as two separate solves:
    * First: Solve $Lz=b$ for $z$
    * Then: Solve $Ux = z$ for $x$

#### Advantages

$L$ and $U$ are both **triangular**:
* all entries above, or below, the diagonal are zero, respectively
* this makes them more **efficient** to solve

The RHS vector $b$ is not needed to factor $A$ into $A=LU$
* Factoring work can be reused for different $b$'s

#### Numerical Problems

During factoring, what if a diagonal entry $a_{k,k}$ is zero? Or close to zero?
* A division by zero is obviously bad news

What about **nearly** zero?
* This will make the multiplicative factor, $a_{ik}/a_{kk}$, large in magnitude
* A large factor can cause **large floating point error** during subtraction and magnify existing floating point error
* leads to numerical instability!

##### Solution: Row/Partial Pivoting

We will **always** swap, to **minimize** floating point error incurred

**Strategy:**
1. **Find** the row with the **largest magnitude** entry in the current column beneath the current row
2. **Swap** those rows if that entry is larger than the current one

##### Pivoting with a Permutation Matrix

How can our factorization view account for row swaps?

Find a modified factorization of $A$ such that $PA=LU$ where $P$ is the **permutation matrix**
* A permutation matrix $P$ is a matrix whose effect is to swap rows of the matrix it is applied to

>[!tldr] $P$ is simply a permuted (row-swapped) version of the identity matrix, $I$
>e.g. to swap rows 2 and 3 of a $3\times 3$ matrix, swap rows 2 and 3 of $I$
>![Pasted image 20250423081417.png](/img/user/assets/Pasted%20image%2020250423081417.png)

Given the factorization $PA=LU$, multiplying $Ax=b$ by $P$ shows that $PAx=Pb$
* Using our factorization to replace $PA$, we have $LUx = Pb = b'$

Solve by **first** permuting entries of $b$ according to $P$ to form $b'$
* Then **forward and backward substitution** lets us find $x$ as usual

##### Finding the permutation matrix, $P$

Start with $P$ set to be an $n\times n$ identity matrix, $I$
* Whenever we swap a pair of rows during $LU$ factorization, also swap the corresponding rows of $P$
    * including the already stored factors
* The final $P$ will be the desired permutation matrix

### Cost of Gaussian Elimination

We will measure cost in **total FLOPs:** *FLoating point OPerations*
* Approximate as the number of: **adds + subtracts + multiplies + divides**

#### Cost of Factorization

![Pasted image 20250423083144.png](/img/user/assets/Pasted%20image%2020250423083144.png)

Summing over all the loops we get:
$$\sum_{k=1}^n\sum_{i=k+1}^n\left(1+\sum_{j=k+1}^n2\right) = \frac{2n^3}{3}+ O(n^2)$$
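A minimal sketch of the $PA=LU$ factorization with partial pivoting described above, assuming a dense square NumPy array (function and variable names are illustrative):

```python
import numpy as np

def lu_partial_pivot(A):
    """Factor PA = LU with partial (row) pivoting.

    Returns P, L, U with L unit lower triangular and U upper triangular.
    """
    A = np.array(A, dtype=float)
    n = A.shape[0]
    P = np.eye(n)
    L = np.eye(n)
    U = A.copy()
    for k in range(n - 1):
        # Find the largest-magnitude entry in column k, on or below row k
        pivot = k + np.argmax(np.abs(U[k:, k]))
        if pivot != k:
            U[[k, pivot], :] = U[[pivot, k], :]
            P[[k, pivot], :] = P[[pivot, k], :]
            L[[k, pivot], :k] = L[[pivot, k], :k]   # swap already-stored factors
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]             # multiplicative factor
            U[i, k:] -= L[i, k] * U[k, k:]          # row subtraction
    return P, L, U

# Usage: solve Ax = b via Lz = Pb, then Ux = z
A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
b = np.array([1.0, 2.0, 3.0])
P, L, U = lu_partial_pivot(A)
z = np.linalg.solve(L, P @ b)     # forward substitution (generic solve here)
x = np.linalg.solve(U, z)         # back substitution
print(np.allclose(A @ x, b))      # True
```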

Cost of Triangular Solve

![Pasted image 20250423083447.png](/img/user/assets/Pasted%20image%2020250423083447.png)

$$\sum_{i=1}^n\left(1+\sum_{j=i+1}^n2\right)$$

Triangular solves cost $n^2+O(n)$ FLOPs each, so the total is $2n^2+O(n)$
Factorization cost dominates when n is large, and scales worse
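To make the $n^2$ count concrete, here is a sketch of the two triangular solves as explicit loops (illustrative, unoptimized):

```python
import numpy as np

def forward_substitution(L, b):
    """Solve Lz = b for lower-triangular L."""
    n = len(b)
    z = np.zeros(n)
    for i in range(n):
        z[i] = (b[i] - L[i, :i] @ z[:i]) / L[i, i]   # ~2i FLOPs plus one divide
    return z

def back_substitution(U, z):
    """Solve Ux = z for upper-triangular U."""
    n = len(z)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (z[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x
```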

Row Subtraction via Matrices

Zeroing a (sub-diagonal) entry of a column by row subtraction can be written as applying a specific matrix M such that

$$MA_{\text{old}}=A_{\text{new}}$$

where

$$R_2 := R_2 - \frac{a_{21}}{a_{11}}R_1$$

can be written as a matrix:
![Pasted image 20250423084312.png](/img/user/assets/Pasted%20image%2020250423084312.png)

$$M^{(3)}M^{(2)}M^{(1)}A=U$$

Therefore:

$$A=(M^{(3)}M^{(2)}M^{(1)})^{-1}U=(M^{(1)})^{-1}(M^{(2)})^{-1}(M^{(3)})^{-1}U$$

Inverse of $M^{(i)}$

What is $(M^{(k)})^{-1}$?
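The notes pose this as a question; a quick numerical check with illustrative values shows the pattern: the inverse of such an elimination matrix just flips the sign of the stored multiplier.

```python
import numpy as np

# Illustrative 3x3 check: the elimination matrix that performs
# R2 := R2 - factor * R1 is inverted by negating the sub-diagonal factor.
factor = 2.5
M1 = np.eye(3)
M1[1, 0] = -factor

print(np.linalg.inv(M1))    # identity with +factor in position (1, 0)
```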

Costs: Solving Ax=b by Matrix Inversion

An obvious alternative for solving Ax=b is:

  1. Invert $A$ to get $A^{-1}$
  2. Multiply $A^{-1}b$ to get $x$
    One can show that the above is actually more expensive than using our factor and triangular solve strategy

Norms and Conditioning

Norms are measurements of “size”/magnitude for vectors or matrices
Conditioning describes how the output of a function/operation/matrix changes due to changes in input

Vector Norms

There are many reasonable norms for a vector, $x=[x_1,x_2,\dots,x_n]^T$
Common choices are:

  1. 1-norm (or taxicab/Manhattan norm): $|x|_1=\sum_{i=1}^n|x_i|$
  2. 2-norm (or Euclidean norm): $|x|_2=\sqrt{\sum_{i=1}^n x_i^2}$
  3. $\infty$-norm (or max norm): $|x|_\infty=\max_i|x_i|$

p-norms

These are collectively called p-norms, for $p=1,2,\dots,\infty$, and can be written: $$|x|_p=\left(\sum_{i=1}^n|x_i|^p\right)^\frac{1}{p}$$
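These norms are all available through `numpy.linalg.norm`; a quick illustration on a small example vector:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])
print(np.linalg.norm(x, 1))        # 1-norm:   |3| + |-4| + |0| = 7
print(np.linalg.norm(x, 2))        # 2-norm:   sqrt(9 + 16)     = 5
print(np.linalg.norm(x, np.inf))   # inf-norm: max(|3|,|4|,|0|) = 4
```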

Key Properties of Norms

If the norm is zero, then the vector must be the zero-vector

$$|x|=0 \iff x_i=0\;\forall\, i$$

The norm of a scaled vector must satisfy

$$|\alpha x|=|\alpha|\cdot|x| \text{ for scalar } \alpha$$

The triangle inequality holds:

$$|x+y|\leq|x|+|y|$$

Defining Norms for Matrices

Matrix norms are often defined/”induced” as follows, using p-norms of vectors:

$$|A|=\max_{x\neq 0}\frac{|Ax|}{|x|}$$

Using the vector 1-norm gives the matrix 1-norm:

$$|A|_1=\max_j\sum_{i=1}^n|A_{ij}|$$

* Max absolute column sum

Using the vector $\infty$-norm gives the matrix $\infty$-norm:

$$|A|_\infty=\max_i\sum_{j=1}^n|A_{ij}|$$

* Max absolute row sum
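NumPy computes these induced norms directly; a small illustrative check of the column-sum and row-sum formulas:

```python
import numpy as np

A = np.array([[1.0, -2.0],
              [3.0,  4.0]])

print(np.linalg.norm(A, 1))        # max absolute column sum: max(1+3, 2+4) = 6
print(np.linalg.norm(A, np.inf))   # max absolute row sum:    max(1+2, 3+4) = 7
```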

Matrix 2-norm

Using the vector 2-norm, we get the matrix 2-norm, or spectral norm:

$$|A|_2=\max_{x\neq 0}\frac{|Ax|_2}{|x|_2}$$

The matrix’s 2-norm relates to the eigenvalues
Specifically, if $\lambda_i$ are the eigenvalues of $A^TA$, then

$$|A|_2=\sqrt{\max_i|\lambda_i|}$$
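A quick numerical check of this relationship on an illustrative matrix:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

eigvals = np.linalg.eigvals(A.T @ A)       # eigenvalues of A^T A
print(np.sqrt(np.max(np.abs(eigvals))))    # sqrt of the largest eigenvalue
print(np.linalg.norm(A, 2))                # matrix 2-norm (spectral norm): same value
```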

Matrix Norm Properties

  1. $|A|=0 \iff A_{ij}=0\;\forall\, i,j$
  2. $|\alpha A|=|\alpha|\cdot|A|$ for all scalars $\alpha$
  3. $|A+B|\leq|A|+|B|$
  4. $|Ax|\leq|A|\cdot|x|$
  5. $|AB|\leq|A|\cdot|B|$
  6. $|I|=1$

Conditioning of Linear Systems

Conditioning describes how the output of a function/operation/problem changes due to changes in input

For a linear system Ax=b, we ask:

  1. How much does a perturbation of b cause the solution x to change?
  2. How much does a perturbation of A cause the solution x to change?

For a given perturbation, we say the system is:

  1. Well-conditioned if x changes little
  2. Ill-conditioned if x changes lots
    For an ill-conditioned system, small errors can be radically magnified!

Geometric Intuition: Line Intersection

![Pasted image 20250423093135.png](/img/user/assets/Pasted%20image%2020250423093135.png)
Well-conditioned case: Nearly perpendicular lines
Ill-conditioned case: Nearly parallel lines

Condition Number Summary

Condition number of a matrix A is denoted $\kappa(A)=|A|\cdot|A^{-1}|$

  1. $\kappa\approx 1$: A is well-conditioned
  2. $\kappa\gg 1$: A is ill-conditioned

For system $Ax=b$, $\kappa(A)$ provides upper bounds on the relative change in $x$ due to a relative change in $b$:
$$\frac{|\Delta x|}{|x|}\leq\kappa(A)\frac{|\Delta b|}{|b|}$$

or in A

$$\frac{|\Delta x|}{|x+\Delta x|}\leq\kappa(A)\frac{|\Delta A|}{|A|}$$
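A small experiment (with an illustrative nearly-singular matrix and perturbation) checking the first bound:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])          # nearly parallel rows: ill-conditioned
b = np.array([2.0, 2.0001])

kappa = np.linalg.cond(A, 2)           # kappa(A) in the 2-norm (~4e4)
x = np.linalg.solve(A, b)

delta_b = np.array([1e-6, -1e-6])      # tiny perturbation of b
x_pert = np.linalg.solve(A, b + delta_b)
delta_x = x_pert - x

lhs = np.linalg.norm(delta_x) / np.linalg.norm(x)
rhs = kappa * np.linalg.norm(delta_b) / np.linalg.norm(b)
print("relative change in x        :", lhs)   # far larger than the change in b
print("kappa * relative change in b:", rhs)   # the bound; lhs <= rhs
```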

κ Depends on the Norm

We defined the condition number as

$$\kappa(A)=|A|\cdot|A^{-1}|$$

without specifying which norm. Different norms will give different κ
We can specify the norm with a subscript, e.g., $$\kappa(A)_2=|A|_2\cdot|A^{-1}|_2$$
If unspecified, always assume the 2-norm

Equivalence of Norms

The norms we’ve looked at differ from one another by no more than a constant factor
That is$$C_1|x|_a\leq |x|_b\leq C_2|x|_a$$
for constants C1,C2, and norms a and b.

Numerical Solutions: Residuals and Errors

Condition number plays a role in understanding/bounding the accuracy of numerical solutions
If we compute an approximate solution xapprox, how good is it?

Residual

As a proxy for error, we often use the residual r:

$$r=b-Ax_{\text{approx}}$$

i.e., by how much does our computed solution fail to satisfy the original problem?

Residual vs Errors

Assuming $x_{\text{approx}}=x+\Delta x$, we have$$r = b-A(x+\Delta x)$$or $$A(x+\Delta x) = b - r$$($r$ looks just like a perturbation of $b$!)
So, applying our earlier bound using Δb=r, we have$$\frac{|\Delta x|}{|x|}\leq \kappa(A)\frac{|r|}{|b|}$$
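A small check of this bound, using an illustrative system and a deliberately perturbed "approximate" solution:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

x_exact = np.linalg.solve(A, b)
x_approx = x_exact + np.array([1e-3, -2e-3])   # pretend this came from a solver

r = b - A @ x_approx                           # residual
kappa = np.linalg.cond(A, 2)

rel_error = np.linalg.norm(x_approx - x_exact) / np.linalg.norm(x_exact)
bound = kappa * np.linalg.norm(r) / np.linalg.norm(b)
print(rel_error, bound)                        # rel_error <= bound
```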

Interpreting this Bound

The solution's relative error, $\frac{|\Delta x|}{|x|}$, is bounded by the condition number times the relative size of the residual $r$ with respect to $b$
Moral: If we (roughly) know A’s condition number, the computed residual indicates how large the error might be

Use of the Residual - Iterative Methods

Many alternate algorithms for solving Ax=b are called iterative

Gaussian Elimination & Error

In floating point, Gaussian elimination with pivoting on $Ax=b$ is quite stable and accurate: the computed solution $\hat{x}$ exactly satisfies a nearby system

$$(A+\Delta A)\hat{x}=b$$

where $|\Delta A|\approx|A|\,\epsilon_{\text{machine}}$
Again, applying our earlier bound gives

$$\frac{|x-\hat{x}|}{|\hat{x}|}\leq\kappa(A)\frac{|A|\,\epsilon_{\text{machine}}}{|A|}=\kappa(A)\,\epsilon_{\text{machine}}$$

e.g. if $\kappa(A)=10^{10}$ and $\epsilon_{\text{machine}}=10^{-16}$, then the relative error in $x$ is around $10^{-6}$, or around 6 accurate digits

Conditioning is algorithm-independent

Conditioning is a property of the problem itself