SP20

Machine Learning & Statistical Data Analysis

Foundations of ML and the math behind it — and the habit of thinking about bias that has generalized far beyond the course.

Concepts learned

Tech

Python
NumPy
scikit-learn
Matplotlib

Hero demo: Gradient Descent Visualizer

Walk me through this step by step

Imagine you’re hiking down into a valley wrapped in thick fog. You can’t see the bottom, you can’t see ten metres ahead — but you can feel the slope of the ground under your boots. So you do the only sensible thing: turn to face whichever way is steepest downhill, take a step, and repeat. Eventually, if the valley is well-behaved, you reach the lowest point. That is gradient descent. The whole algorithm. Every other detail is figuring out how big the step should be, and what to do when the terrain misbehaves.
To turn that picture into maths, you need three things. The terrain is the loss function $L(x, y)$: every point in the plane has a height, and we want the lowest. The slope under your feet is the gradient $\nabla L$, which points in the direction of steepest increase, so the negative gradient points downhill. And the learning rate $\eta$ sets how big a step you take. Writing $x_t$ for your whole position (both coordinates) after $t$ steps, the update is one line:
$x_{t+1} = x_t - \eta\, \nabla L(x_t)$
Let’s do one step by hand on the simplest surface. The Bowl is $L(x, y) = x^2 + y^2$, shaped like the inside of a salad bowl with its minimum at the origin. Its gradient is $\nabla L = (2x, 2y)$. Start at (2.5, 1.8); the gradient there is (5.0, 3.6). With $\eta = 0.05$ the update subtracts 0.05·(5.0, 3.6) = (0.25, 0.18), so your new position is (2.25, 1.62). The loss dropped from 9.49 to 7.69. One step closer to the bottom.
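The hand calculation above can be checked in a few lines of NumPy. This is only a sketch: the helper names `bowl_loss` and `bowl_grad` are mine, not taken from the demo's source.

```python
import numpy as np

def bowl_loss(p):
    """L(x, y) = x^2 + y^2 for a position p = (x, y)."""
    return float(np.sum(p ** 2))

def bowl_grad(p):
    """Gradient of the bowl: (2x, 2y)."""
    return 2.0 * p

eta = 0.05
p0 = np.array([2.5, 1.8])       # starting position from the text
g0 = bowl_grad(p0)              # slope under your feet
p1 = p0 - eta * g0              # one gradient descent step

print("gradient:", g0)
print("new position:", p1)
print(f"loss: {bowl_loss(p0):.2f} -> {bowl_loss(p1):.2f}")
```

Wrapping the last two lines of the update in a loop gives the full trajectory the green dot traces.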
Repeat that update and the trajectory walks straight into (0, 0). On this bowl the negative gradient points directly at the centre, so every step gets you closer. Load the Bowl (vanilla) preset and watch the green dot do exactly this. It slows down as it approaches the minimum because the gradient itself shrinks: near the bottom the slope is gentle, so the steps shrink in proportion and the iterates settle in.
The single most important knob is $\eta$. Drag the Learning rate slider on the Bowl preset with momentum at 0. At $\eta = 0.001$ the dot crawls; each step is so timid it barely moves. At $\eta = 0.05$ it lands neatly. On this bowl every step multiplies the position by $1 - 2\eta$, so $\eta = 0.5$ reaches the bottom in a single step. Past that, the step overshoots the bottom and lands on the opposite wall of the bowl; the dot still converges, zig-zagging across the centre, as long as $\eta < 1$. Crank it past $\eta = 1$ and every step lands higher than the last; the iterates fling off to infinity.
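To see the learning-rate regimes without the slider, here is a small sketch that runs plain gradient descent on the bowl $x^2 + y^2$ for a few values of $\eta$. The function name `run_gd` and the choice of 50 steps are assumptions for illustration.

```python
import numpy as np

def run_gd(eta, start=(2.5, 1.8), steps=50):
    """Run vanilla gradient descent on x^2 + y^2 and return the final loss."""
    p = np.array(start, dtype=float)
    for _ in range(steps):
        p = p - eta * 2.0 * p   # gradient of x^2 + y^2 is (2x, 2y)
    return float(np.sum(p ** 2))

final_loss = {eta: run_gd(eta) for eta in (0.001, 0.05, 0.5, 0.9, 1.1)}
for eta, loss in final_loss.items():
    # On this surface each step scales the position by (1 - 2 * eta).
    print(f"eta={eta:<5}  per-step factor={1 - 2 * eta:+.3f}  "
          f"loss after 50 steps={loss:.3e}")
```

The per-step factor tells the whole story on this surface: between 0 and 1 the dot approaches from one side, between -1 and 0 it zig-zags inward, and below -1 it blows up.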
Vanilla gradient descent treats each step independently: every iteration restarts from rest. Momentum keeps a running velocity $v$ that the gradient nudges, with coefficient $\beta$ recycling the previous velocity:
$v_t = \beta\, v_{t-1} - \eta\, \nabla L(x_{t-1}), \quad x_t = x_{t-1} + v_t$
A consistent downhill direction builds speed; transient zig-zags cancel. Compare Bowl (vanilla) ($\beta = 0$) with Bowl (momentum) ($\beta = 0.9$): momentum reaches the minimum in roughly a quarter of the steps.
Where momentum really pays off: load the Rosenbrock valley preset. The surface is a narrow, curving canyon with the minimum at (1, 1). The gradient mostly points across the canyon walls — there’s barely any slope along the long axis. Vanilla GD bounces wall-to-wall and crawls toward the minimum at a snail’s pace. Momentum lets the long-axis component build up while the cross-wall oscillations cancel themselves out, so the dot threads the canyon at speed.
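A minimal sketch of the momentum update, run on a stretched quadratic rather than the Rosenbrock surface. The surface $L(x, y) = \tfrac{1}{2}(x^2 + 0.01\,y^2)$, the start point, and the step count are all my assumptions, chosen because the canyon effect is easy to reason about there: steep across the valley, nearly flat along it.

```python
import numpy as np

# Assumed stand-in surface: L(x, y) = 0.5 * (x^2 + 0.01 * y^2),
# steep across the valley (x) and nearly flat along it (y).
CURVATURE = np.array([1.0, 0.01])

def loss(p):
    return 0.5 * float(np.sum(CURVATURE * p ** 2))

def grad(p):
    return CURVATURE * p

def descend(eta, beta, start=(1.0, 1.0), steps=200):
    """Gradient descent with momentum; beta = 0 recovers vanilla GD."""
    p = np.array(start, dtype=float)
    v = np.zeros_like(p)
    for _ in range(steps):
        v = beta * v - eta * grad(p)   # v_t = beta * v_{t-1} - eta * grad L(x_{t-1})
        p = p + v                      # x_t = x_{t-1} + v_t
    return loss(p)

vanilla_loss = descend(eta=1.0, beta=0.0)
momentum_loss = descend(eta=1.0, beta=0.9)
print(f"vanilla:  {vanilla_loss:.3e}")
print(f"momentum: {momentum_loss:.3e}")
```

With the same $\eta$ and the same number of steps, the momentum run ends several orders of magnitude lower, because the velocity keeps accumulating along the flat axis where each individual gradient is tiny.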
The other failure modes are on display in the remaining presets. Saddle escape drops you near a saddle point where the gradient is almost zero: vanilla GD stalls there, barely moving, while momentum eventually drifts off the ridge. Vanishing plateau is the deep-learning nightmare: far from the minimum the gradient shrinks to almost nothing, so each step is tiny at any moderate $\eta$. Cranking the learning rate or adding momentum is what unsticks you.
Try it yourself.

Load Rosenbrock valley, then drag momentum $\beta$ down to 0 while keeping $\eta$ at 0.001. Watch the dot ricochet between the canyon walls and barely move along the valley floor. Now push $\beta$ back up to 0.95 with the same $\eta$ and the dot suddenly threads the canyon. The gradient hasn’t changed; you’ve just stopped throwing away the history of where you were going.
This same update is the engine underneath almost every model on this page. The polynomial regression demo below fits its coefficients by gradient descent on a least-squares loss, and the K-means image compression demo runs an analogous alternating minimisation that descends a clustering objective at each pass. The Featured Solution “Gradient descent convergence for an L²-regularised quadratic” makes the $\eta$ story from step 5 precise: it proves the sharp bound on how large the learning rate can be before the iterates bounce away instead of converging, and generalises what you just watched on the Bowl to any convex quadratic.

L = 8.000

$\|\nabla L\| = 5.657$

Gradient descent on the quadratic surface, starting at (2.0, 2.0) with learning rate 0.050 and momentum 0.90, descending toward the minimum at (0, 0).

Learning rate η: 0.050

How big each gradient step is. Too small → slow; too big → overshoots.

Momentum β: 0.90

0 = vanilla GD; 0.9 typically accelerates over flat valleys.

Start x: 2.00

Start y: 2.00

step 0 / 400

Reflection

Machine Learning & Statistical Data Analysis was where the foundations clicked for me — the math behind the algorithms more than the algorithms themselves. Linear algebra from a prior term meant the gradient updates and matrix manipulations weren’t the obstacle; the conceptual hurdle was K-Nearest Neighbors, of all things — wrapping my head around why “just compare to the nearest points” was a principled approach took longer than expected.

The projects ranged across spam detection, price prediction, and cancer classification, and each one drove home the same lesson: there isn’t a single right model — there’s iteration toward the right balance for the problem you’re solving. The piece that’s stayed with me longest, though, isn’t an algorithm. It’s the habit of thinking about bias — in the data, in the model, in the framing of the question. That generalizes far beyond ML.

The course was during COVID, and if I took it again I’d push deeper on the final project and chase the extra-credit material harder.

Polynomial regression — the bias / variance trade-off in one demo

MSE = 0.0430

nonzero coeffs = 5/5

Polynomial regression of degree 4 fitted to 50 noisy samples (σ = 0.20) using no regularization. The orange curve is the underlying truth, the green curve is the model's fit.

Polynomial degree: 4

Higher degree → more flexible curve. Easy to overfit past degree 8.

Regularization λ (log): 0.0000

0 = no penalty. Higher λ shrinks coefficients toward zero.

Noise σ: 0.20

How noisy the data is around the truth function.

Sample size n: 50

Number of points sampled from the truth function.

Regularization

λ = 0

Slide the degree up and the curve gains expressive power; past degree 8 it starts chasing the noise instead of the signal. Then turn on Ridge or Lasso with a small λ and watch the same high-degree fit smooth back out. Ridge shrinks every coefficient toward zero; Lasso goes further and sets some of them exactly to zero, which gives you a sparse model as a by-product.
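The shrinking effect of ridge can be sketched with NumPy alone. Everything specific here is an assumption for illustration: the truth function $\sin(\pi x)$, the degree, the sample size, and the closed-form ridge solve (which, for brevity, also penalises the intercept). It is not the demo's implementation, and Lasso is left out because it has no closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: 20 noisy samples of sin(pi * x) on [-1, 1], sigma = 0.2.
n, degree, sigma = 20, 12, 0.2
x = np.linspace(-1.0, 1.0, n)
y = np.sin(np.pi * x) + rng.normal(0.0, sigma, size=n)

# Design matrix with columns 1, x, x^2, ..., x^degree.
X = np.vander(x, degree + 1, increasing=True)

def fit(X, y, lam):
    """Least-squares (lam = 0) or ridge (lam > 0) coefficients."""
    if lam == 0.0:
        return np.linalg.lstsq(X, y, rcond=None)[0]
    # Ridge closed form: (X^T X + lam * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def train_mse(w):
    return float(np.mean((X @ w - y) ** 2))

w_ols = fit(X, y, 0.0)
w_ridge = fit(X, y, 1e-2)

print(f"no penalty: |w| = {np.linalg.norm(w_ols):10.3f}  train MSE = {train_mse(w_ols):.4f}")
print(f"ridge:      |w| = {np.linalg.norm(w_ridge):10.3f}  train MSE = {train_mse(w_ridge):.4f}")
```

The unpenalised fit always wins on training error, since that is exactly what it minimises. Ridge gives up a little of that in exchange for much smaller coefficients and a smoother curve.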

K-means image compression — clustering as quantisation

Walk me through this step by step

Open a JPEG of a sunset on your phone and the file claims it uses thousands of distinct colours — sweeping oranges, pinks, dusty purples, near-blacks in the corners. Your eye can’t actually distinguish most of them. Now imagine you’re only allowed to keep sixteen colours, and you have to snap every pixel to whichever of those sixteen it most resembles. Pick the sixteen well and the image still looks like a sunset; pick them badly and it looks like paint-by-numbers. K-means is the algorithm that picks them well — automatically, from the image itself.
Strip the image away and start with six dots scattered on a page. We want two groups; that’s it. Drop two markers anywhere — call them seeds. Walk to each dot in turn, ask which seed it is closer to, and colour the dot to match. You’ve made a first guess at the clustering.
Now slide each seed to the average position of the dots wearing its colour — its new centre of gravity. Some dots are suddenly closer to the other seed than the one they’re wearing, so reassign them. Recompute the two centres again. Repeat. After a handful of rounds nothing changes — the seeds and the colourings stop moving and the clustering has settled.
That assign-then-update loop is the whole algorithm. There’s no derivative, no learning rate, no gradient. Each iteration does two cheap things — relabel every point, then recompute every centre — and the structure emerges from repetition alone. It’s the simplest non-trivial unsupervised method in the toolkit.
What is the loop actually optimising? Call $x_i$ a data point and $\mu_{c(i)}$ the centre of the cluster it currently belongs to. Then the quantity we drive down is the within-cluster sum of squares $J = \sum_i \|x_i - \mu_{c(i)}\|^2$, sometimes called the inertia: the total squared distance from every point to its own centre.
Here is the aha: the loop is coordinate descent on $J$. The assignment step holds the centres fixed and picks the labels that minimise $J$: each point picks its nearest centre, which is by definition its smallest possible contribution. The update step holds the labels fixed and picks the centres that minimise $J$: the mean is provably the point that minimises the sum of squared distances to a set. Neither step can ever increase $J$, and there are only finitely many ways to label the points, so the loop must eventually stop changing. The algorithm always converges.
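The assign-then-update loop and the claim that $J$ never increases can be sketched on six dots. The coordinates and the two deliberately bad seeds are made up for illustration.

```python
import numpy as np

# Six hypothetical dots: two loose groups of three.
points = np.array([[0.0, 0.0], [0.5, 0.2], [0.2, 0.6],
                   [4.0, 4.0], [4.5, 3.8], [3.9, 4.4]])
# Two deliberately poor seeds, both dropped on the first group.
centres = np.array([[0.0, 0.0], [0.5, 0.2]])

def assign(points, centres):
    """Label each point with the index of its nearest centre."""
    d2 = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def inertia(points, centres, labels):
    """J: total squared distance from each point to its own centre."""
    return float(((points - centres[labels]) ** 2).sum())

history = []   # J recorded after every assignment step and every update step
for _ in range(10):
    labels = assign(points, centres)                        # assignment step
    history.append(inertia(points, centres, labels))
    new_centres = np.array([points[labels == k].mean(axis=0)
                            for k in range(len(centres))])  # update step
    history.append(inertia(points, new_centres, labels))
    if np.allclose(new_centres, centres):                   # nothing moved: done
        break
    centres = new_centres

print("labels:", labels)
print("J after each half-step:", np.round(history, 3))
```

This sketch does not handle a cluster that loses all its points, which cannot happen with these particular seeds but can in general.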
Converges, yes, but not always to the best clustering. The loop is greedy and gets stuck at whichever local minimum of $J$ the initial seeds happen to fall into. Drop two seeds into the same dense blob and one cluster will starve while the other does all the work. k-means++ fixes most of this by spreading the seeds out: each new seed is a data point sampled with probability proportional to its squared distance from the closest seed already chosen.
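The seeding rule just described fits in a short function. This is a sketch of the sampling idea, not the demo's code: the function names and the toy data (a dense blob plus a small far-away group) are assumptions.

```python
import numpy as np

def seed_probabilities(points, seeds):
    """Probability of each point becoming the next seed:
    proportional to squared distance from its closest existing seed."""
    d2 = ((points[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    return d2 / d2.sum()

def kmeans_pp_seeds(points, k, rng):
    """Pick k seeds from `points` with k-means++ style sampling."""
    chosen = [int(rng.integers(len(points)))]      # first seed: uniform at random
    for _ in range(k - 1):
        p = seed_probabilities(points, points[chosen])
        chosen.append(int(rng.choice(len(points), p=p)))
    return points[chosen]

# Hypothetical data: 50 points in a dense blob near the origin,
# plus 5 points in a small group far away.
rng = np.random.default_rng(7)
blob = rng.normal(0.0, 0.1, size=(50, 2))
far = rng.normal(10.0, 0.1, size=(5, 2))
points = np.vstack([blob, far])

# If the first seed lands in the blob, how likely is the second to land far away?
p_next = seed_probabilities(points, points[[0]])
mass_on_far_group = float(p_next[50:].sum())
print(f"probability the second seed is in the far group: {mass_on_far_group:.4f}")

seeds = kmeans_pp_seeds(points, k=2, rng=rng)
```

Uniform seeding would put the second seed in the far group only 5 times in 55. Because an already-chosen point sits at distance zero from itself, it can never be picked twice.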
Now snap back to the image. Each pixel is a point in 3D RGB space (red, green and blue, each between 0 and 255), so a photo with thousands of unique colours is just thousands of points in a cube. K-means with $K = 16$ hunts for the sixteen points in that cube whose neighbourhoods cover the colours best, then every pixel is recoloured to its nearest cluster centre. That is the entire compression.
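Here is a rough sketch of that recolouring on a tiny synthetic image. The 8×8 checkerboard, its palette, and the simple initialisation (K distinct colours picked at random, standing in for k-means++) are assumptions of mine. The function expects K to be no larger than the number of distinct colours in the image.

```python
import numpy as np

def quantise(image, k, rng, iters=20):
    """Recolour an (H, W, 3) uint8 image using k cluster-mean colours."""
    pixels = image.reshape(-1, 3).astype(float)     # one RGB point per pixel
    # Assumed initialisation: k distinct colours chosen at random.
    unique = np.unique(pixels, axis=0)
    centres = unique[rng.choice(len(unique), size=k, replace=False)]
    for _ in range(iters):
        d2 = ((pixels[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                  # assignment step
        for j in range(k):
            members = pixels[labels == j]
            if len(members):                        # leave an empty cluster in place
                centres[j] = members.mean(axis=0)   # update step
    d2 = ((pixels[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    out = np.rint(centres[labels]).astype(np.uint8) # snap each pixel to its centre
    return out.reshape(image.shape), centres

# Hypothetical 8x8 four-colour checkerboard built from 4x4 tiles.
palette = np.array([[230, 60, 40], [40, 90, 200],
                    [250, 210, 60], [30, 30, 30]], dtype=np.uint8)
tile_index = np.array([[0, 1], [2, 3]]).repeat(4, axis=0).repeat(4, axis=1)
image = palette[tile_index]

recon4, _ = quantise(image, k=4, rng=np.random.default_rng(7))
recon2, _ = quantise(image, k=2, rng=np.random.default_rng(7))
print("K = 4 reconstruction is exact:", np.array_equal(recon4, image))
print("colours left at K = 2:", len(np.unique(recon2.reshape(-1, 3), axis=0)))
```

The distance matrix here has one row per pixel and one column per centre, which is fine for a toy image but would need batching for a full-size photo.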
Try it yourself.

Load the Sunset palette sample and slide K down to 2. The image collapses to two colours — a warm and a cool — and the soft gradient becomes a hard band. Slide K up to 16 and the bands fade back into a smooth sky. Then flip to Four-color checkerboard with K = 4: K-means recovers the exact four originals and the reconstruction is pixel-perfect.
This assign-then-update structure is more familiar than it looks. The gradient descent demo at the top of the page follows the same recipe with a different inner step: instead of exactly minimising over one block of variables at a time, it takes a small downhill step in all of them at once. The polynomial regression demo next door is yet another loop that minimises a sum of squared errors, just with a parameter vector instead of a label per point. The same picture keeps recurring across the course: define an objective, repeat a step that lowers it, and watch it converge.

Original

K-means quantised (K = 4)

K = 4

inertia = 3.72e+6

K-means image compression of "Four-color checkerboard" with K = 4 clusters. The right pane re-renders every pixel in its assigned cluster's mean colour, so larger K yields a higher-fidelity (but less compressed) approximation.

Number of clusters K: 4

Each pixel maps to its cluster's mean colour. Smaller K = more compression, less fidelity.

Seed: 7

Changes k-means++ initialisation; same seed = identical run.

Pixels: 16,384; Iterations: 1; Compression: 4,108 B vs. 49,152 B (12.0× smaller)

Pick a sample (or upload your own image), then slide K between 2 and 32. Every pixel gets mapped to its cluster’s mean colour, so small K = aggressive compression (and a posterised look), large K = near-original fidelity. The palette under the panes is the K colours K-means picked.

Featured problems

Bias–variance decomposition for a polynomial fit

Suppose we draw $n$ noisy samples from a smooth target $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, fit a polynomial $\hat f$ of degree $d$ by least squares, and ask: what determines the test error at a fixed query point $x$?

Adding and subtracting $\mathbb{E}[\hat f(x)]$ inside the squared term and taking expectations over both the training set and the noise gives the canonical decomposition:

$\mathbb{E}\big[(y - \hat f(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathrm{Var}\big(\hat f(x)\big)}_{\text{variance}} + \sigma^2$

The lesson is that the three pieces pull in different directions. As $d$ grows, bias shrinks (the model can chase finer wiggles in $f$) but variance grows (the model also chases the noise). The irreducible $\sigma^2$ sits there regardless of how clever the model is. In the polynomial-regression demo above, sliding the degree past about 8 on a 20-point sample is where I watch variance start to dominate: the curve starts threading individual noisy points instead of the underlying trend. Turning on ridge with a small $\lambda$ doesn’t change the decomposition; it shifts the operating point by trading a little extra bias for a much larger reduction in variance.
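The decomposition can be estimated by brute force: refit the polynomial on many fresh noisy training sets and look at the spread of its predictions at one query point. The sketch below makes several assumptions of its own: the truth function $\sin(\pi x)$, a fixed grid of 20 training inputs (so only the noise is redrawn), the query point 0.9, and 1000 trials.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(np.pi * x)            # assumed "truth" function

n, sigma, trials = 20, 0.2, 1000
x_train = np.linspace(-1.0, 1.0, n)     # fixed design; only the noise is redrawn
x_query = 0.9

def bias2_and_variance(degree):
    """Monte Carlo estimate of bias^2 and variance of the fit at x_query."""
    preds = np.empty(trials)
    for t in range(trials):
        y_train = f(x_train) + rng.normal(0.0, sigma, size=n)
        coeffs = np.polyfit(x_train, y_train, degree)
        preds[t] = np.polyval(coeffs, x_query)
    bias2 = (preds.mean() - f(x_query)) ** 2
    return float(bias2), float(preds.var())

for degree in (1, 3, 10):
    b2, var = bias2_and_variance(degree)
    print(f"degree {degree:2d}: bias^2 = {b2:.4f}  variance = {var:.4f}  "
          f"bias^2 + variance + sigma^2 = {b2 + var + sigma ** 2:.4f}")
```

The straight line should show a large bias term and little variance, and the degree-10 fit the reverse, with the $\sigma^2$ floor the same in every row.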

Why K-Nearest Neighbors is principled, not just intuitive

KNN felt like a hand-wave the first time I saw it — “just compare against the nearest points” — and of every algorithm in the course it took the longest to feel like a principled estimator rather than a heuristic. Gradient descent on a regression loss I can write down and differentiate; KNN doesn’t train, it just memorises. The discomfort was: where is the math?

The reframe that fixed it: KNN is a non-parametric estimate of the posterior $P(y = c \mid x)$ with a bandwidth set implicitly by $k$. Writing $N_k(x)$ for the indices of the $k$ training points nearest to $x$:

$\hat P(y = c \mid x) = \frac{1}{k} \sum_{i \in N_k(x)} \mathbb{1}\{y_i = c\}$

Once it’s a posterior estimate, the rest of the picture falls into place. The decision rule $\hat y = \arg\max_c \hat P(y = c \mid x)$ is just the plug-in Bayes classifier. Cover and Hart’s classical result that 1-NN error is at most twice the Bayes error in the asymptotic limit stops being mystical: it’s a statement about how well a 1-sample posterior estimate approximates the truth. Cross-validating $k$ is selecting a bandwidth, exactly the same problem as picking a kernel width.
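Read as code, the posterior estimate is just a vote count divided by $k$. A small sketch with made-up training points (the data and the name `knn_posterior` are illustrative assumptions):

```python
import numpy as np

def knn_posterior(X_train, y_train, x, k, n_classes):
    """Estimate P(y = c | x) as class frequencies among the k nearest points."""
    d2 = ((X_train - x) ** 2).sum(axis=1)          # squared Euclidean distances
    nearest = np.argsort(d2)[:k]                   # indices in N_k(x)
    counts = np.bincount(y_train[nearest], minlength=n_classes)
    return counts / k

# Hypothetical toy data: class 0 near the origin, class 1 mostly near (3, 3),
# with one class-1 point sitting close to the class-0 group.
X_train = np.array([[0.0, 0.0], [0.3, 0.1], [0.1, 0.4], [0.5, 0.5],
                    [3.0, 3.0], [2.8, 3.1], [3.2, 2.7], [1.2, 1.1]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

x_query = np.array([0.8, 0.8])
posterior = knn_posterior(X_train, y_train, x_query, k=5, n_classes=2)
prediction = int(posterior.argmax())               # plug-in Bayes decision

print("posterior estimate:", posterior)
print("predicted class:", prediction)
```

Changing `k` changes how many neighbours share the vote, which is the bandwidth role described above: a small `k` gives a jumpy estimate, a large `k` a smoother but blurrier one.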

The lesson generalised beyond KNN: every “lazy” or memory-based learner I’ve met since — kernel ridge regression, Gaussian processes, even retrieval-augmented language models — sits at some point on the same axis of “how much structure do I impose on the posterior, and how much do I read off the data”.

Gradient descent convergence for an L²-regularised quadratic

Take a convex quadratic with an L² penalty and ask the question every implementer of gradient descent eventually has to answer: how large can the learning rate $\eta$ be before the iterates stop converging?

With objective

$J(\theta) = \frac{1}{2} \theta^\top H \theta + b^\top \theta + \frac{\lambda}{2} \|\theta\|^2$

(symmetric positive semi-definite $H$ and $\lambda > 0$), the batch gradient step is:

$\theta_{t+1} = \theta_t - \eta \big((H + \lambda I)\, \theta_t + b\big)$

Let $\theta^\star = -(H + \lambda I)^{-1} b$ be the unique minimiser (the regulariser makes the inverse exist even when $H$ is singular). Subtracting the fixed-point equation from the update gives a linear recurrence for the error $e_t = \theta_t - \theta^\star$:

$e_{t+1} = \big(I - \eta (H + \lambda I)\big)\, e_t$

Diagonalise $H + \lambda I = Q \Lambda Q^\top$. In the eigenbasis each coordinate of the error is multiplied by the scalar $1 - \eta \mu_i$ per step, where $\mu_i$ ranges over the eigenvalues of $H + \lambda I$. Convergence in every coordinate requires $|1 - \eta \mu_i| < 1$ for all $i$, that is, $0 < \eta < 2 / \mu_i$. The largest eigenvalue, $\mu_{\max} = \lambda_{\max}(H) + \lambda$, is the binding constraint, which gives the sharp bound:

$0 < \eta < \frac{2}{\lambda_{\max}(H) + \lambda}$

Two things I find satisfying about this. First, the convergence rate is set by the condition number of $H + \lambda I$, $\kappa = \mu_{\max} / \mu_{\min} = (\lambda_{\max}(H) + \lambda) / (\lambda_{\min}(H) + \lambda)$, so ridge regularisation literally improves convergence by shrinking $\kappa$: push $\lambda$ up and the smallest eigenvalue in the denominator grows proportionally faster than the largest one in the numerator. Second, the gradient descent visualizer at the top of the page is exactly this story in two dimensions: pick an elongated loss surface, crank the learning rate, and you see the eigenvalue-1 direction overshoot while the eigenvalue-0.1 direction still creeps along the long axis.
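The bound is easy to probe numerically. The sketch below uses a made-up two-dimensional problem ($H$ with eigenvalues 1 and 0.1, $\lambda = 0.1$, and an arbitrary $b$) and runs the batch update with a learning rate 5% below and 5% above $2 / (\lambda_{\max}(H) + \lambda)$.

```python
import numpy as np

# Hypothetical problem: H has eigenvalues 1 and 0.1, ridge strength lam = 0.1.
H = np.diag([1.0, 0.1])
b = np.array([1.0, -2.0])
lam = 0.1

A = H + lam * np.eye(2)                       # Hessian of the regularised objective
theta_star = -np.linalg.solve(A, b)           # unique minimiser
eta_max = 2.0 / (np.linalg.eigvalsh(H).max() + lam)

def final_error(eta, steps=500):
    """Distance to the minimiser after `steps` batch gradient updates."""
    theta = np.zeros(2)
    for _ in range(steps):
        theta = theta - eta * (A @ theta + b)
    return float(np.linalg.norm(theta - theta_star))

err_below = final_error(0.95 * eta_max)
err_above = final_error(1.05 * eta_max)
print(f"eta_max = {eta_max:.4f}")
print(f"5% below the bound: error = {err_below:.3e}")
print(f"5% above the bound: error = {err_above:.3e}")
```

Just under the bound the steep direction flips sign every step but shrinks by a factor of about 0.9 each time. Just over it, the same direction grows by about 1.1 per step and the run diverges.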

What’s coming

A written reflection from the author’s interview is in development. Track progress at github.com/Akosah285/engineering-portfolio.