SP20

Computational Methods

Numerical root-finding, linear-system solvers, and time-stepping ODEs — algorithms that let a computer chase analytic problems past the point of paper.

Concepts learned

Tech

Python
NumPy
SciPy
Matplotlib

Hero demo Lorenz Solver (RK4)

t = 0.00

(x, z) = (1.00, 1.00)

RK4 trajectory for the lorenz attractor system, integrated with step size dt = 0.010 from t = 0 to t = 50. The butterfly: chaotic 3D flow projected onto the xz-plane.

dt (integration step)0.010

Smaller dt → more accurate RK4 integration but more total work.

Integration time t_end50

How far in simulated time to integrate forward from t = 0.

Speed (points / frame)10

How many trajectory samples are revealed per animation frame.

step 0 / 5001

Reflection

Pre-interview preview. Akwasi’s first-person reflection on this course is pending — track at issue #46.

Bisection method

Bracket-and-halve root finding with a live convergence trace.

Walk me through this step by step

Suppose I ask you to find x such that x² = 2. Your calculator returns 1.41421356… in a millisecond, but how does it actually do that? You can’t solve it on paper. Bisection is the simplest answer — the slowest, most reliable algorithm in the toolkit, and the one every “fancy” method falls back to when it gets confused.
The whole trick is one observation about continuous functions. If f(a) is negative and f(b) is positive, then the graph has to cross zero somewhere between a and b. We don’t know where yet, but we know the root is trapped in the interval. Bisection’s job is to shrink that trap until it is small enough to call the answer.
Look at $f (x) = x^{2} - 2$ f(x) = x^2 - 2 on the interval $[0, 2]$ [0, 2]. We have $f (0) = - 2$ f(0) = -2 (negative) and $f (2) = 2$ f(2) = 2 (positive), so a root is trapped inside. Cut the interval in half. The midpoint is 1, and $f (1) = - 1$ f(1) = -1 is still negative. So the sign change must happen between 1 and 2 — we just threw half the work away.
Do it again on $[1, 2]$ [1, 2]: midpoint 1.5, $f (1.5) = 0.25$ f(1.5) = 0.25 (positive), so the root is now in $[1, 1.5]$ [1, 1.5]. Again on $[1, 1.5]$ [1, 1.5]: midpoint 1.25, $f (1.25) = - 0.4375$ f(1.25) = -0.4375, so root is in $[1.25, 1.5]$ [1.25, 1.5]. Four steps in we are at $[1.375, 1.4375]$ [1.375, 1.4375] — already two decimal digits of $2$ \sqrt{2}.
Every step exactly halves the width of the bracket. After 10 steps the trap is 1/1024 of its original size. After 20 steps it is one in a million. After 50 it is smaller than a double-precision float can even represent. That is the entire convergence story — the bracket width is $(b_{0} - a_{0}) / 2^{n}$ (b_0 - a_0) / 2^n after n steps, and you stop when it is below your tolerance.
To get one more decimal digit of precision, you need just over three more bisection steps (because 2³ = 8 and 2⁴ = 16, sandwich 10 between them). So 21 steps gets you 6 decimal digits, 30 gets you 9, 50 gets you ~15. The demo’s default tolerance is $1 0^{- 6}$ 10^{-6}, which is why it halts around iteration 21.
The catch is that bisection is slow. Every step buys you the same constant factor of improvement — one bit. Newton’s method, which we will meet next, can buy you doubling-digits per step when it works. But bisection’s superpower is that it always works. It never diverges, never oscillates, never needs a derivative. It cannot fail as long as its preconditions hold.
Those preconditions are everything: f must be continuous, and the starting interval must have a sign change. Skip either and the IVT argument breaks. A jump discontinuity inside $[a, b]$ [a, b] can let $f (a) \cdot f (b) < 0$ f(a) \cdot f(b) < 0 be true with no actual root in between — the graph “crosses” the axis by teleporting over it.
Try it yourself.

Load the Slow shrink preset ( $f (x) = (x - 1)^{7}$ f(x) = (x-1)^7). The function sits very close to zero on both sides of the root, so just checking $∣ f (m) ∣ < ε$ |f(m)| < \varepsilon would let the method “converge” while the bracket is still huge. The demo also checks that the bracket width has shrunk — that “or” keeps the high-multiplicity case honest.
So that is bisection: the bracketing baseline that every faster root-finder is compared to. The very next demo, Newton’s method, throws away bracketing in exchange for using the derivative to take much bigger steps — and the featured problem “Quadratic convergence of Newton’s method near a simple root” counts exactly how much faster Newton is when it works. Read them back-to-back: bisection gives you one bit per step always; Newton gives you doubling-digits per step until it doesn’t.

[a, b] = [0.0000, 2.0000]

∣ b - a ∣ = 2.00 e + 0

Bisection root finder on f(x) = x² − 2, bracketed by [0.000, 2.000], with tolerance 1e-6 and up to 30 iterations.

Left endpoint a0.000

Left side of the bracket [a, b]. f(a) and f(b) must have opposite signs.

Right endpoint b2.000

Right side of the bracket [a, b].

Tolerance0.00000100

Stop once |f(mid)| or half the bracket width is below this.

Max iterations30

Hard cap on the number of bracket halvings.

step 0 / 30

Newton’s method

Tangent-line iteration with quadratic-convergence visualisation.

Walk me through this step by step

Bisection earns you one bit of precision per step. If you also know the slope of the function, you can do much better — converging in about five steps instead of twenty for the same accuracy. That is Newton’s method, and the leap from “one bit per step” to “doubling digits per step” is one of the most useful speedups in all of numerical computing.
The geometric idea: at your current guess, draw the tangent line to the graph. The tangent crosses the x-axis at one specific point — that crossing becomes your next guess. Up close, any smooth function looks almost like a straight line, so the tangent’s prediction of where the graph hits zero is almost exactly right. Then you repeat from the new spot.
A worked example with $f (x) = x^{2} - 2$ f(x) = x^2 - 2 and starting guess $x_{0} = 1.5$ x_0 = 1.5: we have $f (1.5) = 0.25$ f(1.5) = 0.25 and slope $f^{'} (1.5) = 3$ f'(1.5) = 3. The tangent line crosses zero at $1.5 - 0.25/3 \approx 1.4167$ 1.5 - 0.25/3 \approx 1.4167. That’s our next guess. Check it: $f (1.4167) \approx 0.0069$ f(1.4167) \approx 0.0069 — already two decimal digits closer to $2$ \sqrt{2} from one step.
The pattern that produces is the famous Newton formula:
$x_{n + 1} = x_{n} - \frac{f ( x _{n} )}{f ^{'} ( x _{n} )}$ x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}
Read it as: “from where you are now, step backward by (current value) divided by (current slope).” If the value is already small, the step is small. If the slope is steep, the step is small. If the slope is gentle, the step is big.
What makes Newton fast is the famous doubling rule: each step squares the error rather than halving it. If you start with one digit of accuracy, the next step gives two, then four, then eight, then sixteen. With the square-root preset, five iterations take you from 1.5 to $2$ \sqrt{2} accurate to a dozen digits — past anything a single-precision float could represent.
Why does this work? Near a smooth root, the function genuinely is almost a straight line, and the tangent is that line. So the new error is roughly the previous error squared — small numbers squared get tiny fast. This is what mathematicians call quadratic convergence, and it is the gold standard for iterative algorithms.
The catch is that Newton is fragile. If your starting guess is far from the root, or the slope is nearly horizontal, or the function has a kink, Newton can wander off in surprising directions. It can cycle. It can shoot to infinity. It can land on a different root than the one you wanted. Bisection, in contrast, just slowly does its job.
The “Bad seed (cycle)” preset shows the worst case. With $f (x) = x^{3} - 2 x + 2$ f(x) = x^3 - 2x + 2 seeded at $x_{0} = 0$ x_0 = 0: f(0) = 2, slope = -2, tangent crosses zero at $0 - 2/ (- 2) = 1$ 0 - 2/(-2) = 1. Now from x = 1: f(1) = 1, slope = 1, tangent crosses at $1 - 1/1 = 0$ 1 - 1/1 = 0. We’re back where we started, and the iteration ping-pongs between 0 and 1 forever.
Try it yourself.

Run the “Quadratic convergence” preset (sine, seed 3.0). Pause after each step and read off $∣ f (x_{n}) ∣$ |f(x_n)| in the iterate table. You should see $1 0^{- 2} \to 1 0^{- 4} \to 1 0^{- 8} \to 1 0^{- 16}$ 10^{-2} \to 10^{-4} \to 10^{-8} \to 10^{-16}. That is the digit-doubling, made visible: each step squares the previous error.
So the trade is concrete: bisection gives you one bit per step always; Newton gives you doubling-digits per step until it doesn’t. The featured problem “Quadratic convergence of Newton’s method near a simple root” below derives the error-squaring rule from a Taylor expansion. For the slow-but-bulletproof counterpart that needs no derivative, see the bisection demo just above. Production root finders like Brent’s method actually cheat and combine both — Newton when it is racing, bisection as a safety net when it isn’t.

x_{n} = 1.500000

∣ f (x_{n}) ∣ = 2.50 e - 1

Newton's method on x² − 2, starting at x₀ = 1.50, hunting for a root with tolerance 1e-6 over up to 20 iterations.

Initial guess x₀1.50

Where Newton's method starts. Different seeds can converge to different roots — or not at all.

Tolerance0.00000100

Stop once |f(x_n)| drops below this. Smaller = stricter.

Max iterations20

Hard cap — protects against non-convergent seeds.

step 0 / 20

Numerical integration comparator

Side-by-side midpoint, trapezoidal, and Simpson’s rule error analysis.

\int_{a}^{b} f (x) d x \approx 0.3320

error = 1.30 e - 3

Quadrature: integrating Quadratic (x^2) from 0.00 to 1.00 with 8 subintervals using the midpoint rule.

Lower limit a0.00

Upper limit b1.00

Steps n8

More steps tighten the estimate. Simpson's rule rounds up to the next even count.

n = 8 subintervals

Monte Carlo integration

Compare convergence rates of Monte Carlo integration against the trapezoidal rule on the same integrand.

Walk me through this step by step

Imagine you want to know the area of an oddly-shaped pond in a rectangular field. You can’t measure it directly, but you can throw a thousand pebbles at random across the field and count how many splash into the water. The fraction that splash, multiplied by the field’s area, is a surprisingly good estimate of the pond’s area. Monte Carlo integration is exactly that idea, dressed up for any function you can evaluate. No calculus required — just averages and a good random number generator.
Suppose you want to compute the integral of f(x) from a to b. Instead of slicing the interval into neat strips like the trapezoidal or Simpson’s rule, you pick N points at random inside $[a, b]$ [a, b] and evaluate f at each one. The average f-value tells you roughly how tall the function is on average, and multiplying by the interval width $(b - a)$ (b - a) gives you the area underneath. That is the whole algorithm. The cleverness is in believing it actually converges.
Let’s estimate $\int_{0}^{1} x^{2} d x$ \int_0^1 x^2\, dx by hand. The exact answer is $1/3 \approx 0.3333$ 1/3 \approx 0.3333. Pick four random x-values in $[0, 1]$ [0, 1] — say 0.17, 0.42, 0.68, 0.91 — and square them: 0.029, 0.176, 0.462, 0.828. The average is 0.374, and the interval width is 1, so our estimate is 0.374. With only four darts we are already in the right neighbourhood, off by about 12%. Bump the count to 1000 and the error typically drops below 2%.
Writing the recipe down:
$\int_{a}^{b} f (x) d x \approx \frac{b - a}{N} i = 1 \sum N f (x_{i})$ \int_a^b f(x)\, dx \approx \frac{b - a}{N} \sum_{i=1}^{N} f(x_i)
where the points $x_{i}$ x_i are drawn uniformly from $[a, b]$ [a, b]. That’s it — no derivatives, no smoothness assumptions, no fancy weights. The same formula generalises to two, three, or twenty dimensions: replace $(b - a)$ (b - a) with the volume of the bounding box and draw points uniformly inside it. The integrator in this demo is roughly thirty lines of code.
Why does averaging random samples work at all? The Law of Large Numbers is the engine: as N grows, the sample average converges to the true average value of f on the interval. Each new dart you throw nudges the running mean a little closer to the right answer. More darts, more accurate average — that is the whole intuition. The same theorem underpins every opinion poll, every clinical trial, and every casino’s pricing model.
The catch is how slowly it converges. The error shrinks like $1/ N$ 1/\sqrt{N}, not like $1/ N$ 1/N. To gain one decimal digit of accuracy you must multiply N by 100. Going from 1000 samples to 100,000 only buys you one extra digit. The trapezoidal rule, by contrast, gives roughly two extra digits per ten-times-finer mesh in 1D. So for a smooth one-dimensional integrand, Monte Carlo is genuinely worse than the deterministic rules.
So why bother? Because deterministic rules suffer the curse of dimensionality. To halve the trapezoidal-rule error in 20 dimensions you need roughly a million times more grid points. Monte Carlo does not care — its $1/ N$ 1/\sqrt{N} rate is independent of dimension. By around four or five dimensions Monte Carlo starts winning, and by twenty it is the only thing that fits in memory. Financial risk modelling, particle physics, and graphics rendering all live in that high-dimensional regime.
A nice bonus: the sample variance gives you a free error bar. The demo reports a 95% confidence half-width of about $1.96 \cdot s / N$ 1.96 \cdot s / \sqrt{N}, where s is the sample standard deviation of the f-values. So you do not just get an estimate — you get an honest “I’m 95% sure the true value lies within plus-or-minus this much” range. Plain quadrature has nothing comparable without a separate error analysis. With Monte Carlo the uncertainty is baked in.
Try it yourself.

Load the Runge preset ( $\int_{- 1}^{1} 1/ (1 + 25 x^{2}) d x$ \int_{-1}^{1} 1/(1 + 25 x^2)\, dx). Start at 100 samples and watch the estimate jump around — the confidence half-width is wide. Crank N up to 10,000 and the half-width should shrink by roughly $100 = 10$ \sqrt{100} = 10. Then imagine sampling more densely where the integrand is tall and less where it is flat — clever sampling so each dart counts for more. That intuition is exactly what variance-reduction tricks like importance sampling do.
Read this one paired with the Numerical integration comparator above — they are the two halves of numerical integration. Quadrature spends a careful, deterministic budget of function evaluations and crushes smooth low-dimensional integrals. Monte Carlo spends a sloppy random budget and crushes high-dimensional ones. The same Strang chapter connects them: deterministic rules optimise the placement of N points; Monte Carlo gives up and lets randomness do the placing. Pick the right tool for your dimension and you will never look back.

n = 1000

\hat{I} = 0.3404

I = 0.3333

\pm 0.0183

Monte Carlo integration of x squared on [0.00, 1.00] using 1000 random samples. The exact integral equals 0.3333; each estimate averages f(x) over uniformly drawn x_i.

n samples1000

Number of uniformly random x_i drawn from [a, b].

seed42

Deterministic PRNG seed — same seed reproduces the same samples.

n = 1000

Gaussian elimination — step by step

Watch row operations reduce an augmented matrix to row-echelon form.

s t e p 1/4

s in g u l a r : n o

Gaussian elimination on the 3x3 simple (no swap needed) 3x3 system. 4 row operations (pivot + eliminate) drive the matrix to upper-triangular form.

Step delay700 ms

Milliseconds between row operations. Lower = faster animation.

step 1 / 4

LU decomposition

Animate the LU factorisation step by step on a 4×4 matrix.

det (A) = 12.000

singular: no

sign: +1

LU decomposition of "Well-conditioned". det(A) = 12.000, factored cleanly. For b = (1, 2, 3), x = (-0.17, 0.00, 0.83).

b01.0

b12.0

b23.0

det(A) = 12.00

Least squares curve fitting

Click to drop points; fit polynomials of varying degree with sum-of-squared-residuals shown live.

slope m = 1.537

intercept b = 0.264

R^{2} = 0.994

Fitting y = mx + b to 24 points from preset "Linear" using closed-form normal equations. Estimated slope 1.537 vs true 1.50. R² = 0.994.

Noise0.30

Standard deviation of Gaussian noise added to each true y value.

N points24

How many sample points to draw from the true line.

n = 24 points (ŷ at x=5.0 = 7.95)

Lagrange vs cubic spline

Add interpolation nodes; compare Runge’s-phenomenon-prone Lagrange against natural cubic splines.

n = 11

max ∣ L - f ∣ = 1.916

max ∣ S - f ∣ = 0.022

Interpolating runge's function (equispaced) through 11 nodes with Lagrange polynomial and natural cubic spline.

Show Lagrange polynomialShow natural cubic splineShow original f(x)

11 nodes

Eigenvalue power iteration

Iterate Ax → x until the dominant eigenvector emerges.

λ \approx 3.500

iteration k = 0

converged: n o

Power iteration on matrix preset "Symmetric 2×2" — well-conditioned symmetric matrix, dominant eigenvalue ≈ 3.618. v ← Av/‖Av‖ converges to the dominant eigenvector at rate |λ₂/λ₁|. Currently λ ≈ 3.500.

Step delay200 ms

Milliseconds between successive v ← Av/‖Av‖ steps.

Max iterations100

Cap on the number of power-iteration steps before the demo halts.

iteration 0

Discrete Fourier transform

Stem plots for a signal’s DFT amplitude and phase, with windowing options.

N = 64

peak bin = 4

f_{peak} = 4.00 Hz

Discrete Fourier transform of the single sinusoid (k=4) signal sampled at N = 64 points (sample rate 64 Hz). Peak frequency component sits at bin k = 4, or about 4.00 Hz.

N (sample count)64

Number of samples in the signal. Snaps to a multiple of 8; powers of 2 preferred.

Sample rate64 Hz

Sampling rate in hertz; used to label the spectrum's frequency axis.

N = 64

Featured problems

Quadratic convergence of Newton's method near a simple root

Newton’s iteration $x_{n + 1} = x_{n} - f (x_{n}) / f^{'} (x_{n})$ x_{n+1} = x_n - f(x_n)/f'(x_n) is famous for doubling the number of correct digits each step. Show this explicitly — and pin down what “near a simple root” means.

Let $r$ r be a simple root ( $f (r) = 0$ f(r)=0, $f^{'} (r) \neq = 0$ f'(r) \neq 0) and write $e_{n} = x_{n} - r$ e_n = x_n - r for the error at step $n$ n. Taylor-expand $f$ f around $r$ r:

f (x_{n}) = f (r) + f^{'} (r) e_{n} + \frac{1}{2} f^{''} (ξ) e_{n}^{2} = f^{'} (r) e_{n} + \frac{1}{2} f^{''} (ξ) e_{n}^{2}

Substitute into the Newton update:

e_{n + 1} = e_{n} - \frac{f ^{'} ( r ) e _{n} + \frac{1}{2} f ^{''} ( ξ ) e _{n}^{2}}{f ^{'} ( x _{n} )} = \frac{f ^{''} ( ξ )}{2 f ^{'} ( x _{n} )} e_{n}^{2}

In the limit $x_{n} \to r$ x_n \to r, $f^{'} (x_{n}) \to f^{'} (r)$ f'(x_n) \to f'(r) and the prefactor converges to $C = f^{''} (r) / (2 f^{'} (r))$ C = f''(r) / (2 f'(r)). So asymptotically $∣ e_{n + 1} ∣ \approx C e_{n}^{2}$ |e_{n+1}| \approx C\, e_n^2. Errors square per step — if $e_{n} = 1 0^{- 3}$ e_n = 10^{-3}, then $e_{n + 1} \approx C \cdot 1 0^{- 6}$ e_{n+1} \approx C \cdot 10^{-6}.

The two caveats. First, “near”: you need $∣ C \cdot e_{0} ∣ < 1$ |C \cdot e_0| < 1 for the recursion to contract — far from the root Newton can overshoot or diverge. The bisection visualizer above is the always-converges-but-linearly backstop you reach for when you don’t have a good starting guess. Second, “simple”: at a multiple root $f^{'} (r) = 0$ f'(r)=0 so the analysis breaks; Newton degrades to linear convergence unless you reweight (modified Newton with multiplicity $m$ m: $x_{n + 1} = x_{n} - m f / f^{'}$ x_{n+1} = x_n - m\, f/f').

Why RK4 is fourth-order — counting matched Taylor terms

The classical Runge–Kutta integrator advances $y^{'} = f (t, y)$ y' = f(t,y) by:

y_{n + 1} = y_{n} + \frac{h}{6} (k_{1} + 2 k_{2} + 2 k_{3} + k_{4})

with slopes evaluated at the start, two midpoints, and the endpoint. Why the $1 : 2 : 2 : 1$ 1{:}2{:}2{:}1 weighting? And what does “fourth order” actually buy us?

The exact solution Taylor-expanded around $t_{n}$ t_n gives

y (t_{n} + h) = y_{n} + h y_{n}^{'} + \frac{h ^{2}}{2} y_{n}^{''} + \frac{h ^{3}}{6} y_{n}^{'''} + \frac{h ^{4}}{24} y_{n}^{(4)} + O (h^{5})

where each $y^{(k)}$ y^{(k)} can be rewritten in terms of $f$ f and its partial derivatives using the chain rule (e.g. $y^{''} = f_{t} + f f_{y}$ y'' = f_t + f f_y). Now expand each slope $k_{i}$ k_i the same way; the weights $1, 2, 2, 1$ 1, 2, 2, 1 divided by 6 are precisely the unique choice that matches every Taylor coefficient up through $h^{4}$ h^4. The leftover error sits in the $h^{5}$ h^5 term, so the local truncation error is $O (h^{5})$ O(h^5) and the global error over a fixed interval is $O (h^{4})$ O(h^4) — halving the step size cuts the error by a factor of 16.

This is the trade-off RK4 hits: four function evaluations per step for an order-of-magnitude advantage over forward Euler at any practical step size. Drop in the Lorenz system at the top of the page (chaotic, sensitive to step size) and RK4 is what keeps the trajectory recognisable while Euler at the same $h$ h blows up in a few hundred steps. The cost-vs-accuracy sweet spot is the reason it’s still everyone’s default for non-stiff problems.

Conjugate Gradient solves an n×n SPD system in at most n steps

Steepest Descent on an ill-conditioned SPD quadratic

ϕ (x) = \frac{1}{2} x^{T} A x - b^{T} x

zig-zags down the long axis of the contour ellipses and converges geometrically with rate $(κ - 1) / (κ + 1)$ (\kappa - 1)/(\kappa + 1), where $κ$ \kappa is the condition number of $A$ A. Conjugate Gradient does something startling: on an $n \times n$ n \times n SPD system, it produces the exact solution in at most $n$ n iterations, regardless of conditioning.

The mechanism is a strictly enlarging Krylov subspace. After $k$ k iterations, the CG iterate $x_{k}$ x_k is the unique minimiser of $ϕ$ \phi (equivalently, of the $A$ A-energy norm $∥ x - x^{⋆} ∥_{A}$ \|x - x^\star\|_A) over the affine space $x_{0} + K_{k} (A, r_{0})$ x_0 + \mathcal{K}_k(A, r_0) with

K_{k} (A, r_{0}) = span {r_{0}, A r_{0}, A^{2} r_{0}, \dots, A^{k - 1} r_{0}}

where $r_{0} = b - A x_{0}$ r_0 = b - A x_0 is the initial residual. Each CG step appends a fresh power of $A$ A to the subspace. After $n$ n steps the Krylov subspace fills $R^{n}$ \mathbb{R}^n (or terminates earlier if $r_{0}$ r_0 happens to live in a smaller $A$ A-invariant subspace), so the constrained minimiser equals the unconstrained minimiser equals $x^{⋆}$ x^\star.

The HW3 problem on a 3×3 SPD matrix bore this out exactly: CG converged in 3 iterations, Gauss–Seidel took about 6, and Steepest Descent ground through 16 before reaching comparable error. Practical CG on a 10,000×10,000 sparse system never actually runs $n$ n iterations — it terminates much earlier on a residual tolerance, exploiting clustered eigenvalues — but the $\leq n$ \le n guarantee is what removes the conditioning anxiety that plagues most iterative solvers. The bisection visualizer up the page is the slow-but-sure cousin; CG is what you reach for when you have structure to exploit.

What’s coming

The author’s written reflection lands alongside the v5 interview. The demos and featured problems above are wired and playable today.