FA19

Discrete & Probabilistic Systems

Counting structures, modelling random processes, and the surprising amount of analysis that falls out of carefully placed independence assumptions.

Concepts learned

Tech

Python
NumPy
Matplotlib
Jupyter

Hero demo: Markov Chain Visualizer

Walk me through this step by step

Suppose today is sunny. What’s tomorrow? It might be sunny again, it might rain — but here’s the assumption that makes the whole machinery work: tomorrow depends only on today, not on whether it rained last Tuesday or stayed sunny all of June. The past doesn’t matter once you know where you are right now. That memoryless property is the entire Markov trick, and the surprising payoff is that whole systems of weather, gamblers’ bankrolls, and even the web graph fall neatly into the same little bit of algebra.
So how do you write that down? You list the states (Sun, Rain) and give a transition probability for each ordered pair. From Sun: stay sunny with probability 0.8, switch to rain with 0.2. From Rain: dry out with probability 0.3, stay wet with 0.7. Stack those rows and you have a transition matrix $P$. Each row sums to 1 because something has to happen tomorrow.
$$P = \begin{pmatrix} 0.8 & 0.2 \\ 0.3 & 0.7 \end{pmatrix}$$
Now walk one day forward. Today is definitely sunny, so the distribution is $\pi_0 = (1, 0)$. Tomorrow’s sun probability is “chance I started in Sun × chance Sun→Sun” plus “chance I started in Rain × chance Rain→Sun”, which is $1 \cdot 0.8 + 0 \cdot 0.3 = 0.8$. The rain half works the same way and lands on 0.2. So $\pi_1 = (0.8, 0.2)$. That whole calculation is just the row-vector matrix product $\pi_1 = \pi_0 P$.
Do it again. $\pi_2 = \pi_1 P$ gives $(0.8 \cdot 0.8 + 0.2 \cdot 0.3,\; 0.8 \cdot 0.2 + 0.2 \cdot 0.7) = (0.70, 0.30)$. One more step lands on $(0.65, 0.35)$, then $(0.625, 0.375)$, then $(0.6125, 0.3875)$. The numbers are clearly crawling towards something, and that something is $(0.6, 0.4)$. Sixty percent sun, forty percent rain, forever, regardless of what happens this afternoon.
That limit is called the stationary distribution, written $\pi$. It’s the unique mix of states the chain settles into no matter where you started. The defining property is the cleanest equation in the course: applying one more step changes nothing.
$$\pi P = \pi$$
Read literally, that says $\pi$ is a left eigenvector of $P$ with eigenvalue 1. The probabilistic story (long-run occupancy) and the linear-algebra story (special eigenvector) are the same story.
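The day-by-day walk above is easy to reproduce. What follows is a minimal sketch in plain Python, not the visualizer's own code; the helper name `step` and the choice of 50 iterations are assumptions made for illustration.

```python
# Weather chain from the walkthrough; state order is (Sun, Rain).
P = [
    [0.8, 0.2],  # from Sun: stay sunny, switch to rain
    [0.3, 0.7],  # from Rain: dry out, stay wet
]

def step(pi, P):
    """One day forward: the row-vector product pi P."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

pi = [1.0, 0.0]  # today is definitely sunny
history = [pi]
for _ in range(50):
    pi = step(pi, P)
    history.append(pi)

# First few days of the walk.
for t in range(6):
    print(t, [round(x, 4) for x in history[t]])

# After many steps the distribution stops changing, i.e. pi P = pi.
print("limit:", [round(x, 4) for x in pi])
```

Starting from `[0.0, 1.0]` instead should land on the same limit, which is the "forgets the start" property in action.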
Why does the algebra promise a unique answer? Because for “nice” chains, a theorem (Perron–Frobenius) guarantees exactly one stationary distribution with all-positive entries. “Nice” means two things in plain English: you can get from any state to any other given enough time (no isolated islands), and the chain isn’t stuck in a clockwork cycle that revisits states only on multiples of some period. Weather satisfies both — sun and rain are reachable from each other, and the self-loops break any rigid periodicity.
Not every chain mixes like that. Swap to the Absorbing chain preset: state Sink has a row of $(0, 0, 1)$, so once you enter, you never leave. The long-run behaviour becomes degenerate: everything eventually piles up at the sink with probability 1, so the stationary distribution is a point mass on Sink and says nothing about the journey there. Absorbing chains are how you model “gambler goes broke” or “particle reaches the boundary”; the interesting question shifts from “long-run mix” to “how long until I’m absorbed, and from which non-sink state?”
Try it yourself.

Load the Random walk on 5-node ring preset and start the chain in state S0. Watch the distribution sweep around the ring for the first few steps, then flatten out: the stationary distribution is uniform, $(0.2, 0.2, 0.2, 0.2, 0.2)$, by symmetry. Now restart from the uniform distribution: it doesn’t move at all. That’s what “stationary” literally means: the distribution is a fixed point of one step of the dynamics, even though individual walkers keep wandering.
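The same experiment can be run off-page. This sketch assumes the preset moves one node clockwise or anticlockwise with probability 1/2 each; the preset's actual probabilities are not stated here, and `ring_step` is a hypothetical helper name.

```python
def ring_step(pi):
    """One step of a ring walk that moves left or right with probability 1/2."""
    n = len(pi)
    return [0.5 * pi[(j - 1) % n] + 0.5 * pi[(j + 1) % n] for j in range(n)]

# Start with all the mass on S0 and watch it spread around the ring.
pi = [1.0, 0.0, 0.0, 0.0, 0.0]
for t in range(1, 101):
    pi = ring_step(pi)
    if t in (1, 2, 3, 10, 100):
        print(t, [round(x, 4) for x in pi])

# The uniform distribution is a fixed point: one step leaves it unchanged.
uniform = [0.2] * 5
print(ring_step(uniform))
```

The ring has an odd number of nodes, which is what stops the walk from alternating forever between two halves of the state space.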
So what is this good for? Anywhere you have a system that bounces between states with fixed transition probabilities. The Gambler’s ruin demo further down is a Markov chain on bankrolls with two absorbing states (broke, target). And the PageRank toy at the bottom of this page is the headline application: it models a random surfer clicking links on the web, and the PageRank of a page is literally its entry in the stationary distribution of that chain. The featured solution “Stationary distribution of a 3-state Markov chain via the eigenvalue 1” works the $\pi P = \pi$ arithmetic on a three-state weather model end-to-end; read it next to see the same machinery one dimension up.

t = 0

$\lVert \pi - \pi^{*} \rVert_1 = 0.800$

Markov chain "Weather (sun/rain)" with 2 states; after 0 steps, the distribution is 0.8000 (L1) away from its stationary distribution.

Step delay (ms): 600

Time between Markov steps. Lower = faster animation.

Initial state

step 0

Reflection

Pre-interview preview. Akwasi’s first-person reflection on this course is pending — track at issue #45.

Bayes’ theorem

Sliders for prior, sensitivity, and specificity; live posterior and 2×2 table update as you drag.

$P(D \mid +) = 0.0902$

$P(D \mid -) = 0.0000$

Bayes' theorem on a population of 10000: with prior P(D) = 0.0010, sensitivity 0.990, and specificity 0.990, the posterior P(D|+) is 0.0902. When the prior is small the posterior after a positive test is often much lower than the test's accuracy suggests.

Prior P(D): 0.0010

Base rate of the condition in the population before testing.

Sensitivity P(+|D): 0.990

Probability the test is positive given the person has the condition.

Specificity P(-|¬D): 0.990

Probability the test is negative given the person does not have the condition.

Population:

pop = 10000
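The 2×2 table behind those readouts is just four expected counts. Here is an illustrative reconstruction in plain Python, not the demo's source; `two_by_two` is a hypothetical helper name.

```python
def two_by_two(prior, sensitivity, specificity, population=10_000):
    """Expected counts for the table of disease status vs test result."""
    sick = prior * population
    healthy = population - sick
    true_pos = sensitivity * sick
    false_neg = sick - true_pos
    true_neg = specificity * healthy
    false_pos = healthy - true_neg
    return {"TP": true_pos, "FN": false_neg, "FP": false_pos, "TN": true_neg}

table = two_by_two(prior=0.001, sensitivity=0.99, specificity=0.99)

# Posterior = share of the relevant test-result column that is actually sick.
posterior_pos = table["TP"] / (table["TP"] + table["FP"])  # P(D | +)
posterior_neg = table["FN"] / (table["FN"] + table["TN"])  # P(D | -)

print({k: round(v, 1) for k, v in table.items()})
print(f"P(D|+) = {posterior_pos:.4f}")
print(f"P(D|-) = {posterior_neg:.4f}")
```

Reading the counts makes the small posterior less mysterious: roughly 10 true positives are competing with roughly 100 false positives.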

Birthday paradox simulator

Drag the room size and watch the collision probability climb past 50% near 23 people.

Walk me through this step by step

Imagine a room of 23 strangers. What are the chances that two of them share a birthday? Most people guess one or two percent — there are 365 days, and 23 people feels tiny. The real answer is just over 50 percent. This walkthrough is about why that gap exists, and it turns out to be one specific counting mistake everyone makes.
The mistake is asking the wrong question. Without realising it, your gut estimates “what is the chance someone in the room matches me?” — but the real puzzle is “what is the chance any two people in the room match each other?” Those are very different questions.
Here is the trick. With 23 people, how many pairs are there? Pick any two people and you have a pair. The count is $23 \times 22/2 = 253$. So you are not doing 23 comparisons against the calendar; you are running 253 mini-comparisons between people, all at once. Now 365 days suddenly feels a lot less roomy.
To compute the probability we count the easy thing first: the chance that nobody matches anyone. Then collision probability is just 1 minus that. (This “count the complement” move is one of the handful of probability tricks worth memorising — the “bad” event is usually one tidy product; the “good” event is a messy union.)
Walk through 23 people one at a time. Person 1 picks any day, so no conflict is possible. Person 2 needs to avoid one taken day, so they have $364/365$ of the calendar free. Person 3 needs to avoid two taken days: $363/365$. Multiply them all the way to person 23.
That product is roughly $(364/365)(363/365)\cdots(343/365) \approx 0.493$. So the probability nobody shares is about 49.3%, and the probability that at least one pair does share is about 50.7%. That is where the magic “23 people gives 50%” number actually comes from: it is just this multiplication, no fancy math required.
The bigger lesson is the rule of thumb. Whenever you have D equally-likely options (calendars, hash outputs, fingerprint buckets), you start expecting matches after about $\sqrt{D}$ samples, not D. The square root falls out of the pair-counting we just did: there are roughly $n^2/2$ pairs and each one matches with probability $1/D$, so the expected number of matching pairs reaches order 1 once $n \approx \sqrt{D}$.
For birthdays, $\sqrt{365} \approx 19$, and the exact threshold for a 50/50 collision is 23, close to the square-root estimate, exactly as predicted. The demo’s chart shows both the exact curve and the $1 - e^{-n^2/(2D)}$ approximation. For $D = 365$ the two curves stay within about one and a half percentage points of each other, so feel free to use the simpler one for back-of-the-envelope estimates.
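To see the exact product and the exponential approximation side by side, a short sketch like the one below is enough. The function names are made up for illustration and this is not the demo's code.

```python
import math

def p_collision_exact(n, D=365):
    """1 - P(all n birthdays distinct), multiplying one person at a time."""
    p_distinct = 1.0
    for taken in range(n):
        p_distinct *= (D - taken) / D
    return 1.0 - p_distinct

def p_collision_approx(n, D=365):
    """The 1 - exp(-n^2 / (2D)) approximation."""
    return 1.0 - math.exp(-n * n / (2 * D))

for n in (5, 10, 15, 23, 30, 50):
    print(n, round(p_collision_exact(n), 3), round(p_collision_approx(n), 3))

# Smallest room size whose exact collision probability reaches 50%.
n_star = next(n for n in range(1, 366) if p_collision_exact(n) >= 0.5)
print("n* =", n_star)
```

Passing `D=256` to both functions reproduces the one-byte hash setting from the demo's preset list.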
Try it yourself.

Switch to the “Hash collision (D=256)” preset. Now the calendar has only 256 slots, and $\sqrt{256} = 16$. Sure enough, the collision probability crosses 50% between n = 19 and n = 20. This is exactly why a “128-bit” hash function is only safe to about $2^{64}$ queries when an attacker is hunting for any collision: the security budget, measured in bits, effectively halves.
So the trick is this. Linear intuition asks “out of 365 days, how many will match me?” Pair intuition asks “out of 253 pairs, how many will match each other?” Pairs grow as $n^2$, days stay flat at D, and the crossover where pairs catch up to days is right at $\sqrt{D}$. That single shift in question explains every “birthday-style” collision result you will ever meet.
The Birthday paradox featured problem below walks the full algebra and derives the $\sqrt{2 \ln 2}$ constant for the 50% crossover. For a different counting trick where indicator variables turn a messy “how long until all coupons collected?” problem into a one-line sum, see the coupon collector featured problem in the same section.

n = 1

P (collision) = 0.000

Birthday paradox with 365 days in the year: the probability of at least one shared birthday hits 50% at n = 23 people.

Days in year D: 365

Size of the birthday alphabet. 365 = real calendar; 256 = one-byte hash.

Max n: 50

Upper bound of people swept on the x-axis.

Target probability: 0.50

Horizontal reference line; n* is the smallest n that meets this.

Step delay (ms): 200

Pause between successive n increments in the animation.

n = 1

Central limit theorem

Pick a base distribution, stack the sample means, watch a Gaussian emerge.

μ = 0.500

$\sigma / \sqrt{n} = 0.053$

empirical σ = 0.052

Central Limit Theorem demo: drawing 2000 sample means of size n = 30 from a Uniform(0, 1) distribution. As n grows the histogram of sample means approaches a normal curve for any underlying distribution with finite variance.

Distribution

n (sample size): 30

How many i.i.d. draws are averaged into each sample mean. CLT predicts the distribution of means approaches normal as n grows.

Number of samples: 2000

How many sample means are drawn for the histogram. More samples → smoother histogram.

Seed: 42

Seed for the deterministic PRNG so runs are reproducible and shareable.

n = 30, samples = 2000
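The readouts above can be checked with a quick simulation. This sketch uses Python's built-in Mersenne Twister rather than the demo's PRNG, so a seed of 42 will not reproduce the page's exact digits; `sample_means` is a hypothetical helper name.

```python
import random
import statistics

def sample_means(n, num_samples, seed=42):
    """Draw num_samples means, each averaging n Uniform(0, 1) draws."""
    rng = random.Random(seed)
    return [
        sum(rng.random() for _ in range(n)) / n
        for _ in range(num_samples)
    ]

means = sample_means(n=30, num_samples=2000)

sigma = (1 / 12) ** 0.5        # standard deviation of Uniform(0, 1)
predicted = sigma / 30 ** 0.5  # CLT prediction: sigma / sqrt(n)

print(f"mean of means = {statistics.fmean(means):.3f}")
print(f"predicted std = {predicted:.3f}")
print(f"empirical std = {statistics.stdev(means):.3f}")
```

Swapping `rng.random()` for a skewed draw such as `rng.expovariate(1.0)` is a cheap way to watch the bell shape survive a lopsided base distribution.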

Combinations & permutations

Calculator + visual enumeration for nPk and nCk with small n.

$\binom{6}{2} = 15$

$P_{6,2} = 30$

$\binom{6}{2} = \frac{6!}{2!\,(6 - 2)!}$

Scenario "Pick 6 of 20". There are 15 ways to choose 2 items from 6 (unordered), or 30 ways if order matters.

n (total items): 6

Total number of items to choose from.

k (chosen): 2

How many items are chosen.

C(6, 2) = 15
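For small n the formulas can be cross-checked against brute-force enumeration. A minimal sketch using only the standard library:

```python
import math
from itertools import combinations, permutations

n, k = 6, 2
items = range(n)

n_choose_k = math.comb(n, k)  # unordered selections
n_perm_k = math.perm(n, k)    # ordered selections

# Enumerate every selection explicitly and count them.
all_combos = list(combinations(items, k))
all_perms = list(permutations(items, k))

print(n_choose_k, len(all_combos))
print(n_perm_k, len(all_perms))

# Each unordered pick can be arranged in k! orders.
print(n_perm_k == n_choose_k * math.factorial(k))
```

Enumeration grows quickly, so it is only practical for the small n the demo targets; `math.comb` and `math.perm` stay cheap for large inputs.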

Erdős-Rényi random graph

Tune the edge probability p and watch the giant component appear past the threshold.

n = 50

p = 0.050

mean deg = 2.36

largest comp = 45/50

Erdős-Rényi random graph G(n = 50, p = 0.050), seed 42: above the p ≈ 1/n threshold (supercritical, giant component). Each of the 50 nodes has expected degree 2.45.

n nodes: 50

Number of vertices in the graph. Edges are drawn between every unordered pair independently with probability p.

p (edge probability): 0.050

Per-pair edge probability. The phase transition sits near p ≈ 1/n: below it the graph is dust; above it a giant component emerges.

Seed: 42

PRNG seed for reproducible sampling. Change the seed to draw a different random graph from the same G(n, p) distribution.

50 nodes, 59 edges
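A rough way to watch the threshold without the demo is to sample G(n, p) directly and measure the largest component. This is a sketch under stated assumptions: it uses Python's `random` module, so seed 42 will not reproduce the exact graph shown on the page, and both function names are hypothetical.

```python
import random
from collections import deque

def erdos_renyi(n, p, seed=42):
    """Adjacency lists for G(n, p): each unordered pair is an edge w.p. p."""
    rng = random.Random(seed)
    adj = {v: [] for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj

def largest_component(adj):
    """Size of the largest connected component, found by BFS."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best

n = 50
for p in (0.005, 0.02, 0.05, 0.10):
    adj = erdos_renyi(n, p)
    edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    print(f"p={p:.3f}  edges={edges:3d}  mean deg={2 * edges / n:.2f}  "
          f"largest comp={largest_component(adj)}/{n}")
```

With n = 50 the threshold 1/n is 0.02, so the four values of p sit below, at, and above it.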

Gambler’s ruin / Monte Carlo

Biased coin walks to ruin or fortune; theoretical vs simulated absorption probabilities side by side.

P (ruin) = 0.500

Empirical = 0.000

E [T] = 625

Gambler's ruin random walk on {0..50} starting at wealth k=25 with win probability p=0.50. Simulating 50 walks; 0 completed so far. Analytical P(ruin) ≈ 0.500, empirical 0.000.

Target wealth N: 50

Upper absorbing barrier: the gambler wins by reaching N.

Starting wealth k: 25

Initial wealth; clamped to [1, N-1].

Win probability p: 0.50

Probability of a +1 step. p=0.5 is a fair walk.

Number of walks: 50

How many independent random walks to simulate.

Seed: 42

PRNG seed (mulberry32). Same seed → same walks.

walks 0 / 50
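The theoretical readouts come from the standard closed forms for a ±1 walk with absorbing barriers at 0 and N. The sketch below pairs them with a small simulation; it uses Python's `random` module instead of mulberry32, so its empirical numbers will differ from the demo's, and all function names are assumptions.

```python
import random

def ruin_probability(k, N, p):
    """Analytical P(hit 0 before N) for a +1/-1 walk started at k."""
    if p == 0.5:
        return 1 - k / N
    r = (1 - p) / p
    return 1 - (1 - r**k) / (1 - r**N)

def expected_duration(k, N, p):
    """Analytical expected number of steps until absorption at 0 or N."""
    if p == 0.5:
        return k * (N - k)
    q = 1 - p
    r = q / p
    win = (1 - r**k) / (1 - r**N)  # probability of reaching N first
    return (k - N * win) / (q - p)

def simulate_ruin(k, N, p, walks, seed=42):
    """Fraction of simulated walks that end at 0."""
    rng = random.Random(seed)
    ruined = 0
    for _ in range(walks):
        wealth = k
        while 0 < wealth < N:
            wealth += 1 if rng.random() < p else -1
        ruined += wealth == 0
    return ruined / walks

print("P(ruin) =", ruin_probability(25, 50, 0.5))
print("E[T]    =", expected_duration(25, 50, 0.5))
print("empirical over 50 walks =", simulate_ruin(25, 50, 0.5, walks=50))
```

With only 50 walks the empirical fraction is noisy; its standard error is about 0.07, so a reading anywhere from roughly 0.35 to 0.65 is unremarkable.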

Hypothesis testing

Visualise α, β, p-values, and effect size for a one- or two-sided z-test.

Scenario

z = 1.643

p = 0.1003

fail to reject H0

Testing H₀: μ = 0.00 vs two-sided alternative with x̄ = 0.30, σ = 1.00, sample size 30. Observed z = 1.643, p = 0.1003. At α = 0.05, fail to reject H0.

Alternative

x̄: 0.30

Observed sample mean of the data.

μ₀: 0.00

Hypothesised population mean under H₀.

σ: 1.0

Known population standard deviation.

n: 30

Number of observations. Bigger samples → smaller standard error.

n = 30
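The z and p readouts follow from two lines of arithmetic. Here is a sketch of that calculation with the standard library only; the function names and the `alternative` labels are assumptions for illustration, not the demo's API.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def z_test(xbar, mu0, sigma, n, alternative="two-sided"):
    """Return (z, p-value) for a z-test with known sigma."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    if alternative == "two-sided":
        p = 2 * (1 - normal_cdf(abs(z)))
    elif alternative == "greater":
        p = 1 - normal_cdf(z)
    elif alternative == "less":
        p = normal_cdf(z)
    else:
        raise ValueError(f"unknown alternative: {alternative}")
    return z, p

z, p = z_test(xbar=0.30, mu0=0.00, sigma=1.00, n=30)
print(f"z = {z:.3f}, p = {p:.4f}")
print("reject H0" if p < 0.05 else "fail to reject H0")
```

Calling `z_test(..., alternative="greater")` halves the p-value for the same data, which is why the choice of alternative has to be made before looking at the sample.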

Monte Carlo π estimator

Drop random darts into a unit square; convergence of 4·hits/total to π.

$\hat{\pi} = 0.0000$

SE = 0.0000

n = 0

Monte Carlo π estimation with seed 42: 0 of 10000 samples drawn so far, current estimate π̂ ≈ 0.0000 (true π ≈ 3.14159).

Seed: 42

PRNG seed (mulberry32). Same seed → same sequence of darts.

Target samples: 10000

How many darts to throw in total.

Speed: 100

Darts thrown per animation frame.

n 0 / 10000
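The estimator and its standard error fit in a dozen lines. This sketch assumes the darts are scored against the quarter disc of radius 1 inside the unit square, and it uses Python's own PRNG rather than mulberry32, so the digits will not match the demo for the same seed.

```python
import math
import random

def estimate_pi(num_darts, seed=42):
    """Return (estimate, standard error) from darts in the unit square."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_darts):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # inside the quarter disc of radius 1
            hits += 1
    frac = hits / num_darts
    estimate = 4 * frac
    # Each dart is a Bernoulli trial, so the standard error of the
    # estimate is 4 * sqrt(frac * (1 - frac) / n).
    std_err = 4 * math.sqrt(frac * (1 - frac) / num_darts)
    return estimate, std_err

for n in (100, 1_000, 10_000):
    est, se = estimate_pi(n)
    print(f"n={n:6d}  pi_hat={est:.4f}  SE={se:.4f}")
```

The standard error shrinks like 1/sqrt(n), so each extra correct digit costs about a hundred times more darts.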

PageRank toy

A tiny web graph where power iteration converges to the dominant eigenvector of the transition matrix.

iterations = 26

converged = yes

top rank = 0.308 at node 0

PageRank on the Web example graph (7 nodes, 13 edges), damping d = 0.85, up to 200 power-iteration steps. Node size and color encode each node's rank.

Damping d: 0.85

Probability of following a link; 1 − d is the teleport probability.

Max iterations: 200

Cap on power-iteration steps before bailing out.

Tolerance

7 nodes, 13 edges
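Power iteration with damping is short enough to write out. The sketch below runs on a hypothetical 4-page graph, not the demo's 7-node example (whose edge list is not reproduced on this page), and spreading dangling-node mass uniformly is one common convention that the demo may or may not share.

```python
def pagerank(links, d=0.85, tol=1e-10, max_iter=200):
    """Power iteration on a dict {node: [nodes it links to]}."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for iteration in range(1, max_iter + 1):
        # Teleport mass, plus mass from dangling nodes spread uniformly.
        dangling = sum(rank[v] for v in nodes if not links[v])
        new = {v: (1 - d) / n + d * dangling / n for v in nodes}
        # Each page splits its damped rank evenly among its out-links.
        for v in nodes:
            for target in links[v]:
                new[target] += d * rank[v] / len(links[v])
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            return rank, iteration, True
    return rank, max_iter, False

# Hypothetical toy web: page 3 links out but nobody links to it.
toy_web = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
rank, iterations, converged = pagerank(toy_web)

print("iterations =", iterations, "converged =", converged)
for node, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 3))
```

A page with no inbound links keeps only its teleport share, $(1 - d)/n$, which is the floor every page gets regardless of the link structure.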

Featured problems

Stationary distribution of a 3-state Markov chain via the eigenvalue 1

Given a row-stochastic transition matrix $P$ for a three-state ergodic chain, find the stationary distribution $\pi$ (the long-run fraction of time spent in each state) without simulating the chain.

By definition $\pi$ is the left eigenvector of $P$ with eigenvalue 1, normalised to a probability vector:

$$\pi P = \pi, \qquad \sum_i \pi_i = 1$$

Take the concrete weather chain from the Markov-chain visualizer above (sun → rain → cloud transition probabilities):

$$P = \begin{pmatrix} 0.7 & 0.2 & 0.1 \\ 0.3 & 0.5 & 0.2 \\ 0.2 & 0.3 & 0.5 \end{pmatrix}$$

Writing out $\pi P = \pi$ gives three equations (only two are independent) plus the normalisation $\pi_1+\pi_2+\pi_3 = 1$. Solving: $\pi_1 = 19/41 \approx 0.463$, $\pi_2 = 13/41 \approx 0.317$, $\pi_3 = 9/41 \approx 0.220$ (rounded).

Two checks: the Perron–Frobenius theorem guarantees a unique positive stationary distribution for an irreducible aperiodic chain, and you can verify by iterating $P^n$ for any starting row: it converges to a matrix whose rows are exactly $\pi$. Watch the demo: no matter which state you start in, the time-averaged occupancy approaches these numbers. This is the bridge from “Markov chain as random walk” to “Markov chain as linear operator on the simplex”: the eigenvalue structure is the dynamics.
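The linear system can also be solved exactly rather than by iteration. This sketch assumes the nine matrix entries are laid out so that each row holds the probabilities of leaving one state and sums to 1, i.e. rows (0.7, 0.2, 0.1), (0.3, 0.5, 0.2), (0.2, 0.3, 0.5); the function name `stationary` and the use of exact fractions are illustrative choices.

```python
from fractions import Fraction as F

# Assumed layout: row i holds the probabilities of leaving state i.
P = [
    [F(7, 10), F(2, 10), F(1, 10)],
    [F(3, 10), F(5, 10), F(2, 10)],
    [F(2, 10), F(3, 10), F(5, 10)],
]

def stationary(P):
    """Solve pi P = pi with sum(pi) = 1 by Gauss-Jordan elimination."""
    n = len(P)
    # Augmented rows of (P^T - I) pi^T = 0. One of these equations is
    # redundant, so the last is swapped for the normalisation constraint.
    A = [[P[j][i] - (1 if i == j else 0) for j in range(n)] + [F(0)]
         for i in range(n)]
    A[-1] = [F(1)] * n + [F(1)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]

pi = stationary(P)
print(pi)
print([round(float(x), 3) for x in pi])

# Fixed-point check: one more step of the chain changes nothing.
pi_next = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
print(pi_next == pi)
```

Working in fractions keeps the fixed-point check exact, so there is no tolerance to choose.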

Base-rate fallacy: a 99%-accurate test for a 1%-prevalent disease

A disease has prevalence $P(D) = 0.01$ in the screened population. A test has sensitivity $P(+ \mid D) = 0.99$ and specificity $P(- \mid \neg D) = 0.99$. You test positive. What is the probability you actually have the disease?

Bayes’ rule, written for clarity:

$$P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + P(+ \mid \neg D)\,P(\neg D)}$$

Plug in the numbers:

$$P(D \mid +) = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.01 \times 0.99} = \frac{0.0099}{0.0198} = 0.5$$

Half. A “99%-accurate” positive on a disease that only 1% of the population has gives you a coin flip. The intuition trap is that “99% accuracy” sounds like the answer should be near 99%, but the likelihood ratio $P(+ \mid D)/P(+ \mid \neg D) = 99$ only multiplies the prior odds. Prior odds were 1:99, so posterior odds are 99:99 = 1:1.

Drop the prevalence to 0.1% and the same test gives $P(D \mid +) \approx 0.09$. This is exactly why screening guidelines target high-risk subgroups: shifting the prior odds by an order of magnitude shifts the posterior odds by an order of magnitude. The Bayes’ theorem visualizer above lets you slide the prior, sensitivity, and specificity and watch the posterior move; it’s the most direct illustration I know of why “base rate matters” is one of the highest-leverage habits this course gave me.
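The odds form of the argument translates directly into code. A minimal sketch follows; the function name is hypothetical, and the 10% prevalence row is an extra illustration of a high-risk subgroup rather than a figure from the problem.

```python
def posterior_from_odds(prior, sensitivity, specificity):
    """P(D | +) via posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1 - prior)
    likelihood_ratio = sensitivity / (1 - specificity)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

for prevalence in (0.10, 0.01, 0.001):
    p = posterior_from_odds(prevalence, sensitivity=0.99, specificity=0.99)
    print(f"prevalence {prevalence:.3f} -> P(D|+) = {p:.3f}")
```

Keeping the likelihood ratio separate from the prior makes it obvious which half of the answer belongs to the test and which to the population being screened.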

Coupon collector — n·H_n draws via the indicator-variable trick

A cereal box contains one of $n$ equally likely coupons. On average, how many boxes must you buy to collect the full set? Naïve intuition says “around $n$”, but the actual answer is $n H_n \approx n \ln n + \gamma n$, which grows noticeably faster.

Decomposition is the trick. Let $T_k$ be the additional draws needed to pick up a new coupon once you already hold $k - 1$ distinct types. Each draw is novel with probability $(n - k + 1)/n$, so $T_k$ is geometric with mean $n/(n - k + 1)$. Total draws:

$$T = T_1 + T_2 + \dots + T_n, \qquad E[T] = \sum_{k=1}^{n} \frac{n}{n - k + 1} = n \sum_{j=1}^{n} \frac{1}{j} = n H_n$$

Linearity of expectation collapses the sum without ever needing independence (although in this problem the $T_k$ are in fact independent). For $n = 100$, the expected wait is about 519 draws (versus 100 if every draw magically produced a new coupon); for $n = 1000$ it’s about 7485.

The variance is $\pi^2 n^2/6$ to leading order, so the typical fluctuation is about $1.28\,n$: smaller than the mean by a factor of roughly $\ln n$, but still large enough that coupon collection is a high-variance process. Monte Carlo it on a few thousand trials (the random-process visualizer above is the same indicator-variable pattern in motion) and the histogram is conspicuously right-skewed, with a long tail driven by the always-painful hunt for the last coupon. The result generalises immediately: any time you’re sampling-with-replacement until you cover a set, expect a $\log n$ blow-up relative to the sampling-without-replacement baseline.
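A Monte Carlo run of the kind suggested above might look like this. It is a sketch with assumed names and an assumed trial count of 1000, using Python's `random` module.

```python
import random
import statistics

def expected_draws(n):
    """The closed form n * H_n."""
    return n * sum(1 / j for j in range(1, n + 1))

def collect_once(n, rng):
    """Number of draws until all n coupon types have been seen."""
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws

rng = random.Random(42)
trials = [collect_once(100, rng) for _ in range(1000)]

print(f"n * H_n       = {expected_draws(100):.2f}")
print(f"simulated mean = {statistics.fmean(trials):.1f}")
print(f"simulated std  = {statistics.stdev(trials):.1f}")
print(f"median, max    = {statistics.median(trials)}, {max(trials)}")
```

Comparing the median with the maximum gives a quick feel for the right skew without drawing a histogram.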

Birthday paradox — why 23 people gives a 50% collision probability

Take $k$ people whose birthdays are independent and uniform over a year of $n = 365$ equally likely days. What is the probability that at least two share a birthday? Most people guess “you’d need ~180 people to get to 50/50”. The real answer is 23, and the gap between intuition and arithmetic is the whole pedagogical payoff.

Solve the complement. Let $A_k$ be the event “all $k$ birthdays distinct”. Lay the people out one at a time: person 1 lands anywhere, person 2 must avoid one date, person 3 must avoid two, and so on, giving

$$\Pr[A_k] = \prod_{j=0}^{k-1} \frac{n - j}{n} = \frac{n!}{(n - k)!\, n^k}$$

For small $k/n$ the product is well approximated by $\exp\!\left(-\sum_{j=0}^{k-1} j/n\right) = \exp\!\left(-k(k-1)/(2n)\right)$, using $1 - x \approx e^{-x}$ on each factor. Setting that equal to $1/2$, treating $k(k-1) \approx k^2$, and solving gives the famous $k \approx \sqrt{2 n \ln 2} \approx 22.49$.

Plugging the exact formula in: at $k = 23$ the collision probability is 50.7%; at $k = 30$ it is 70.6%; at $k = 50$ it is 97.0%. The deeper lesson travels well beyond birthdays: any time you sample $k$ items uniformly from $n$ bins, expect a collision once $k$ is on the order of $\sqrt{n}$. That is exactly the $\sqrt{n}$ “square-root attack” on hash functions and the reason a 256-bit hash gives 128-bit collision resistance: the same combinatorial accounting, re-skinned for cryptography.
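To see how well the square-root estimate tracks the exact 50% threshold as the number of bins grows, one can compare the two directly. This is a sketch; the function name and the particular bin counts are chosen for illustration.

```python
import math

def smallest_k_for_half(n):
    """Smallest k whose exact collision probability is at least 1/2."""
    p_distinct, k = 1.0, 0
    while p_distinct > 0.5:
        p_distinct *= (n - k) / n  # the next sample avoids the k bins taken
        k += 1
    return k

for n in (256, 365, 2**16, 2**20):
    estimate = math.sqrt(2 * n * math.log(2))
    print(f"n={n:8d}  exact k={smallest_k_for_half(n):5d}  "
          f"sqrt(2 n ln 2)={estimate:8.2f}")
```

The loop runs for about $\sqrt{n}$ iterations, so it stays fast for toy sizes but is no substitute for the formula at cryptographic scales.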

What’s coming

The author’s written reflection lands alongside the v4 interview. The demos and featured problems above are wired and playable today.