Esc to close · to navigate · to open

SP20

Machine Learning & Statistical Data Analysis

Foundations of ML and the math behind it — and the habit of thinking about bias that has generalized far beyond the course.

Concepts learned

Tech

  • Python
  • NumPy
  • scikit-learn
  • Matplotlib

Hero demo Gradient Descent Visualizer

Walk me through this step by step
  1. Imagine you’re hiking down into a valley wrapped in thick fog. You can’t see the bottom, you can’t see ten metres ahead — but you can feel the slope of the ground under your boots. So you do the only sensible thing: turn to face whichever way is steepest downhill, take a step, and repeat. Eventually, if the valley is well-behaved, you reach the lowest point. That is gradient descent. The whole algorithm. Every other detail is figuring out how big the step should be, and what to do when the terrain misbehaves.

  2. To turn that picture into maths, you need three things. The terrain is the loss function L(x, y) — every point in the plane has a height, and we want the lowest. The slope under your feet is the gradient \nabla L — it points in the direction of steepest increase, so the negative gradient points downhill. And the learning rate \eta sets how big a step you take. The update is one line:

    x_{t+1} = x_t - \eta\, \nabla L(x_t)
  3. Let’s do one step by hand on the simplest surface. The Bowl is L(x, y) = x^2 + y^2, shaped like the inside of a salad bowl with its minimum at the origin. Its gradient is \nabla L = (2x, 2y). Start at (2.5, 1.8); the gradient there is (5.0, 3.6). With \eta = 0.05 the update subtracts 0.05·(5.0, 3.6) = (0.25, 0.18). Your new position is (2.25, 1.62). The loss dropped from 9.49 to 7.69. One step closer to the bottom.

  4. Repeat that update and the trajectory walks straight into (0, 0). On a bowl the negative gradient always points roughly toward the centre, so every step gets you closer. Load the Bowl (vanilla) preset and watch the green dot do exactly this — it slows down as it approaches the minimum because the gradient itself shrinks. Near the bottom the slope is gentle, so the steps shrink in proportion and the iterates settle in.

  5. The single most important knob is \eta. Drag the Learning rate slider on the Bowl preset. At \eta = 0.001 the dot crawls — each step is so timid it barely moves. At \eta = 0.05 it lands neatly. Past about \eta = 0.5 the step is so long it overshoots the bottom and lands on the opposite wall of the bowl, higher than it started. Crank it further and every step is longer than the last; the iterates fling off to infinity.

  6. Vanilla gradient descent treats each step independently — every iteration restarts from rest. Momentum keeps a running velocity v that the gradient nudges, with coefficient \beta recycling the previous velocity:

    v_t = \beta\, v_{t-1} - \eta\, \nabla L(x_{t-1}), \quad x_t = x_{t-1} + v_t

    A consistent downhill direction builds speed; transient zig-zags cancel. Compare Bowl (vanilla) (\beta = 0) with Bowl (momentum) (\beta = 0.9) — momentum reaches the minimum in roughly a quarter of the steps.

  7. Where momentum really pays off: load the Rosenbrock valley preset. The surface is a narrow, curving canyon with the minimum at (1, 1). The gradient mostly points across the canyon walls — there’s barely any slope along the long axis. Vanilla GD bounces wall-to-wall and crawls toward the minimum at a snail’s pace. Momentum lets the long-axis component build up while the cross-wall oscillations cancel themselves out, so the dot threads the canyon at speed.

  8. The other failure modes are on display in the remaining presets. Saddle escape drops you near a saddle point where the gradient is almost zero — vanilla GD stalls there indefinitely; momentum eventually drifts off the ridge. Vanishing plateau is the deep-learning nightmare: far from the minimum the gradient shrinks to almost nothing, so each step is tiny no matter what \eta you pick. Cranking the learning rate or adding momentum is what unsticks you.

  9. Try it yourself.

  10. This same update is the engine underneath almost every model on this page. The polynomial regression demo below fits its coefficients by gradient descent on a least-squares loss, and the K-means image compression demo runs an analogous alternating minimisation that descends a clustering objective at each pass. The Featured Solution Gradient descent convergence for an L²-regularised quadratic makes the η story from step 5 precise — it proves the sharp bound on how large the learning rate can be before the iterates bounce instead of converging, and generalises everything you just watched to any convex quadratic.

L = 8.000\|\nabla L\| = 5.657

Gradient descent on the quadratic surface, starting at (2.0, 2.0) with learning rate 0.050 and momentum 0.90, descending toward the minimum at (0, 0).

0.050

How big each gradient step is. Too small → slow; too big → overshoots.

0.90

0 = vanilla GD; 0.9 typically accelerates over flat valleys.

2.00
2.00
step 0 / 400

Reflection

Machine Learning & Statistical Data Analysis was where the foundations clicked for me — the math behind the algorithms more than the algorithms themselves. Linear algebra from a prior term meant the gradient updates and matrix manipulations weren’t the obstacle; the conceptual hurdle was K-Nearest Neighbors, of all things — wrapping my head around why “just compare to the nearest points” was a principled approach took longer than expected.

The projects ranged across spam detection, price prediction, and cancer classification, and each one drove home the same lesson: there isn’t a single right model — there’s iteration toward the right balance for the problem you’re solving. The piece that’s stayed with me longest, though, isn’t an algorithm. It’s the habit of thinking about bias — in the data, in the model, in the framing of the question. That generalizes far beyond ML.

The course was during COVID, and if I took it again I’d push deeper on the final project and chase the extra-credit material harder.

Polynomial regression — the bias / variance trade-off in one demo

\text{MSE} = 0.0430\text{nonzero coeffs} = 5/5

Polynomial regression of degree 4 fitted to 50 noisy samples (σ = 0.20) using no regularization. The orange curve is the underlying truth, the green curve is the model's fit.

4

Higher degree → more flexible curve. Easy to overfit past degree 8.

0.0000

0 = no penalty. Higher λ shrinks coefficients toward zero.

0.20

How noisy the data is around the truth function.

50

Number of points sampled from the truth function.

Regularization
λ = 0

Slide the degree up and the curve gains expressive power; past degree 8 it starts chasing the noise instead of the signal. Then turn on Ridge or Lasso with a small λ and watch the same high-degree fit smooth back out. Lasso pushes most coefficients exactly to zero — sparsity for free.

K-means image compression — clustering as quantisation

Walk me through this step by step
  1. Open a JPEG of a sunset on your phone and the file claims it uses thousands of distinct colours — sweeping oranges, pinks, dusty purples, near-blacks in the corners. Your eye can’t actually distinguish most of them. Now imagine you’re only allowed to keep sixteen colours, and you have to snap every pixel to whichever of those sixteen it most resembles. Pick the sixteen well and the image still looks like a sunset; pick them badly and it looks like paint-by-numbers. K-means is the algorithm that picks them well — automatically, from the image itself.

  2. Strip the image away and start with six dots scattered on a page. We want two groups; that’s it. Drop two markers anywhere — call them seeds. Walk to each dot in turn, ask which seed it is closer to, and colour the dot to match. You’ve made a first guess at the clustering.

  3. Now slide each seed to the average position of the dots wearing its colour — its new centre of gravity. Some dots are suddenly closer to the other seed than the one they’re wearing, so reassign them. Recompute the two centres again. Repeat. After a handful of rounds nothing changes — the seeds and the colourings stop moving and the clustering has settled.

  4. That assign-then-update loop is the whole algorithm. There’s no derivative, no learning rate, no gradient. Each iteration does two cheap things — relabel every point, then recompute every centre — and the structure emerges from repetition alone. It’s the simplest non-trivial unsupervised method in the toolkit.

  5. What is the loop actually optimising? Call x_i a data point and \mu_{c(i)} the centre of the cluster it currently belongs to. Then the quantity we drive down is the within-cluster sum of squares J = \sum_i \|x_i - \mu_{c(i)}\|^2, sometimes called the inertia — total squared distance from every point to its own centre.

  6. Here is the aha: the loop is coordinate descent on J. The assignment step holds the centres fixed and picks the labels that minimise J — each point picks its nearest centre, by definition the smallest contribution. The update step holds the labels fixed and picks the centres that minimise J — the mean is provably the point that minimises sum of squared distances to a set. Neither step can ever increase J, so the algorithm always converges.

  7. Converges, yes — but not always to the best clustering. The loop is greedy and gets stuck at whichever local minimum of J the initial seeds happen to fall into. Drop two seeds into the same dense blob and one cluster will starve while the other does all the work. k-means++ fixes most of this by spreading the seeds out — each new seed is sampled with probability proportional to its squared distance from the closest seed already chosen.

  8. Now snap back to the image. Each pixel is a point in 3D RGB space — red, green, blue all between 0 and 255 — so a photo with thousands of unique colours is just thousands of points in a cube. K-means with K = 16 hunts for the sixteen points in that cube whose neighbourhoods cover the colours best, then every pixel is recoloured to its nearest cluster centre. That is the entire compression.

  9. Try it yourself.

  10. This assign-then-update structure is more familiar than it looks. The gradient descent demo at the top of the page is coordinate descent in disguise once you fix the step size, and the polynomial regression demo next door is yet another loop that optimises a sum of squared distances — just with a parameter vector instead of a label per point. The same picture keeps recurring across the course: define an objective, alternate the variables you minimise over, and watch it converge.

Original

K-means quantised (K = 4)

K = 4\text{inertia} = 3.72e+6

K-means image compression of "Four-color checkerboard" with K = 4 clusters. The right pane re-renders every pixel in its assigned cluster's mean colour, so larger K yields a higher-fidelity (but less compressed) approximation.

4

Each pixel maps to its cluster's mean colour. Smaller K = more compression, less fidelity.

7

Changes k-means++ initialisation; same seed = identical run.

Pixels: 16,384Iterations: 1Compression: 4,108 B vs. 49,152 B (12.0× smaller)

Pick a sample (or upload your own image), then slide K between 2 and 32. Every pixel gets mapped to its cluster’s mean colour, so small K = aggressive compression (and a posterised look), large K = near-original fidelity. The palette under the panes is the K colours K-means picked.

What’s coming

A written reflection from the author’s interview is in development. Track progress at github.com/Akosah285/engineering-portfolio.