11.1 Random variables and distributions

This chapter develops the probability and statistics that the rest of the bookshelf uses: random walks and Brownian motion for Sound 1.3, Poisson spike trains for Hearing Ch 5, Bayesian inference for Hearing 8, and the Gaussian distribution as the asymptotic shape that almost every noise model relaxes to.

This first lesson is the working vocabulary: what a random variable is, how to describe its distribution, and the six named distributions that appear repeatedly across the bookshelf.

What is a random variable?

A random variable is a quantity whose value is not known with certainty until you measure it. Examples:

A random variable is not the same as a single random outcome. It is the entire family of possible outcomes plus the probabilities of each. Two random variables are equal in distribution if they have the same probabilities of producing the same outcomes — they need not produce the same value in any particular trial.

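Random variables come in two flavours: discrete, taking values in a countable set such as the integers, and continuous, taking values anywhere in an interval of the real line.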

Describing a distribution

For a discrete random variable $X$, the probability mass function (PMF) gives the probability of each possible value:

$$p(k) \;\equiv\; \mathrm{Pr}(X = k), \qquad \sum_k p(k) = 1.$$
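As a minimal sketch, a PMF can be stored as a mapping from each value to its probability. The fair six-sided die and the names `die_pmf` and `prob` are illustrative assumptions, not examples taken from the lesson.

```python
from fractions import Fraction

# Hypothetical example: PMF of a fair six-sided die, p(k) = 1/6 for k = 1..6.
die_pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def prob(pmf, event):
    """Probability that X lands in `event`, a collection of values."""
    return sum(p for k, p in pmf.items() if k in event)

# Normalisation: the probabilities of all possible values sum to 1.
total = sum(die_pmf.values())

# Probability of an even roll: p(2) + p(4) + p(6).
p_even = prob(die_pmf, {2, 4, 6})
```

Exact fractions are used here only so that the normalisation check holds without rounding error; floats would work equally well up to a small tolerance.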

For a continuous random variable, individual points have probability zero (a continuous variable could be exactly $\pi$ or exactly $\sqrt{2}$, but the probability of landing on any single value with infinite precision is zero). Instead we describe the distribution by a probability density function (PDF):

$$f(x), \qquad \mathrm{Pr}(a \leq X \leq b) \;=\; \int_a^b f(x)\, dx, \qquad \int_{-\infty}^{\infty} f(x)\, dx = 1.$$

The density $f(x)$ has units of “probability per unit $x$.” For voltage it would have units of $1/\text{V}$; for time, units of $1/\text{s}$.
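Because a density is "probability per unit $x$", it only becomes a probability once integrated over an interval. The sketch below assumes a hypothetical density $f(x) = 2x$ on $[0, 1]$ (chosen for illustration, not taken from the lesson) and integrates it with a simple midpoint rule.

```python
def density(x):
    """Hypothetical PDF: f(x) = 2x on [0, 1], zero elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Normalisation: the density integrates to 1 over its support.
total = integrate(density, 0.0, 1.0)

# Pr(0.5 <= X <= 1) is the area under f between 0.5 and 1 (exactly 0.75).
p_upper_half = integrate(density, 0.5, 1.0)

# A density value is not a probability: here f(1) = 2, which exceeds 1.
peak = density(1.0)
```

The value `peak = 2.0` is not a contradiction: only areas under the curve are probabilities, and those never exceed 1.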

Both kinds of distribution can be described equivalently by their cumulative distribution function (CDF):

$$F(x) \;\equiv\; \mathrm{Pr}(X \leq x).$$

The CDF goes from 0 at $-\infty$ to 1 at $+\infty$, monotonically. For continuous $X$, $f(x) = F'(x)$.
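The relation between the CDF and the density can be checked numerically. This sketch assumes a hypothetical continuous variable with $F(x) = x^2$ on $[0, 1]$, whose density is $f(x) = 2x$; the function names are illustrative.

```python
def cdf(x):
    """CDF of a hypothetical variable on [0, 1]: F(x) = x^2."""
    if x < 0.0:
        return 0.0
    if x > 1.0:
        return 1.0
    return x * x

def pdf_from_cdf(F, x, h=1e-5):
    """Central-difference estimate of the derivative F'(x)."""
    return (F(x + h) - F(x - h)) / (2.0 * h)

# For a continuous variable, Pr(a <= X <= b) = F(b) - F(a).
p_interval = cdf(1.0) - cdf(0.5)

# The slope of the CDF at x = 0.3 should be close to f(0.3) = 0.6.
slope = pdf_from_cdf(cdf, 0.3)
```

Reading interval probabilities off the CDF as a difference is usually more convenient than integrating the density each time.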

Expectation and variance

The expected value (or mean) of a random variable is the probability-weighted average of its possible values:

$$\mathbb{E}[X] \;=\; \sum_k k\, p(k) \qquad \text{or} \qquad \mathbb{E}[X] \;=\; \int x\, f(x)\, dx.$$

The expected value is a deterministic number — a property of the distribution, not of any particular realisation. It is the long-run average if you repeated the experiment many times.
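The "long-run average" reading can be illustrated by simulation. The sketch below assumes a fair six-sided die as a stand-in example and compares the expected value computed from the PMF with the average of many simulated rolls; the seed is fixed only to make the run repeatable.

```python
import random

def expected_value(pmf):
    """E[X] = sum over k of k * p(k), for a PMF given as {value: probability}."""
    return sum(k * p for k, p in pmf.items())

# Hypothetical example: a fair six-sided die.
die_pmf = {k: 1 / 6 for k in range(1, 7)}
mu = expected_value(die_pmf)  # a fixed number, 3.5 up to rounding

# Repeat the experiment many times and average the realisations.
rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(100_000)]
sample_mean = sum(rolls) / len(rolls)
```

No single roll can ever equal `mu`, which is the point: the expected value describes the distribution, not any one outcome, and `sample_mean` only approaches it as the number of rolls grows.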

The variance measures how spread out the distribution is around the mean:

$$\mathrm{Var}[X] \;\equiv\; \mathbb{E}\bigl[(X - \mu)^2\bigr] \;=\; \mathbb{E}[X^2] - \mu^2,$$

where $\mu = \mathbb{E}[X]$. The variance has units of $[X]^2$; the square root, called the standard deviation $\sigma = \sqrt{\mathrm{Var}[X]}$, has the same units as $X$ and is the more natural “typical spread” measure.

The shortcut formula Var = E[X²] − μ²

Expand the definition:

$$\mathrm{Var}[X] \;=\; \mathbb{E}\bigl[(X - \mu)^2\bigr] \;=\; \mathbb{E}\bigl[X^2 - 2\mu X + \mu^2\bigr].$$

Linearity of expectation (which holds because the integral or sum is linear in the integrand):

$$\mathrm{Var}[X] \;=\; \mathbb{E}[X^2] - 2 \mu\, \mathbb{E}[X] + \mu^2 \;=\; \mathbb{E}[X^2] - 2 \mu \cdot \mu + \mu^2 \;=\; \mathbb{E}[X^2] - \mu^2.$$

The variance is therefore “the mean of the square minus the square of the mean.” This is useful in practice: $\mathbb{E}[X^2]$ and $\mu$ can be computed directly and subtracted, which is often less work than first forming the centred quantity $X - \mu$ and then squaring it.
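The two forms of the variance can be compared on a small data set. This is a sketch that treats the data as an empirical distribution with equal weight on each value (so it divides by $n$, not $n - 1$); the data values are made up for illustration.

```python
def variance_definition(xs):
    """Var = E[(X - mu)^2]: centre each value, square, then average."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def variance_shortcut(xs):
    """Var = E[X^2] - mu^2: mean of the square minus square of the mean."""
    n = len(xs)
    mean = sum(xs) / n
    mean_of_square = sum(x * x for x in xs) / n
    return mean_of_square - mean * mean

# Hypothetical data with mean 5 and variance 4.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
v_def = variance_definition(data)
v_short = variance_shortcut(data)
```

One caveat worth stating: in floating-point arithmetic the shortcut subtracts two nearly equal numbers when the mean is large compared with the spread, so the centred form is usually the safer choice for numerical work.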

Higher-order moments, such as $\mathbb{E}[X^3]$ and $\mathbb{E}[X^4]$, measure further aspects of the distribution. The third moment around the mean (normalised by $\sigma^3$) is skewness, measuring asymmetry. The fourth moment around the mean (normalised by $\sigma^4$, minus 3) is excess kurtosis, measuring tail heaviness relative to a Gaussian. For most distributions in the bookshelf, mean and variance suffice.
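The normalised third and fourth central moments can be computed directly from a PMF. The sketch below uses two hypothetical two-outcome distributions, a fair coin and a rare event, to show a symmetric and an asymmetric case; the helper names are illustrative.

```python
def central_moment(pmf, order):
    """E[(X - mu)^order] for a PMF given as {value: probability}."""
    mu = sum(k * p for k, p in pmf.items())
    return sum((k - mu) ** order * p for k, p in pmf.items())

def skewness(pmf):
    """Third central moment divided by sigma^3."""
    sigma = central_moment(pmf, 2) ** 0.5
    return central_moment(pmf, 3) / sigma ** 3

def excess_kurtosis(pmf):
    """Fourth central moment divided by sigma^4, minus 3."""
    return central_moment(pmf, 4) / central_moment(pmf, 2) ** 2 - 3.0

fair_coin = {0: 0.5, 1: 0.5}    # symmetric: skewness is 0
rare_event = {0: 0.9, 1: 0.1}   # long tail to the right: skewness is positive

skew_fair = skewness(fair_coin)
skew_rare = skewness(rare_event)
kurt_fair = excess_kurtosis(fair_coin)
```

Subtracting 3 sets the Gaussian to zero, so the fair coin's value of −2 says its tails are much lighter than a bell curve's.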

Six distributions to know

Six named distributions appear repeatedly across the bookshelf. They are worth knowing by name, mean, variance, and shape.

1. Uniform on $[a, b]$

Continuous, PDF is constant on $[a, b]$ and zero outside:

$$f(x) \;=\; \frac{1}{b - a} \quad \text{for } x \in [a, b].$$

Mean $(a + b)/2$, variance $(b - a)^2 / 12$. The “default” distribution when you have no information beyond bounds.
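The variance of the uniform distribution follows from the shortcut formula in a few lines. A worked derivation, using only the density and the mean stated here:

```latex
\begin{align*}
\mathbb{E}[X^2] &= \int_a^b \frac{x^2}{b - a}\, dx
                 = \frac{b^3 - a^3}{3(b - a)}
                 = \frac{a^2 + ab + b^2}{3}, \\
\mathrm{Var}[X] &= \mathbb{E}[X^2] - \mu^2
                 = \frac{a^2 + ab + b^2}{3} - \frac{(a + b)^2}{4} \\
                &= \frac{4a^2 + 4ab + 4b^2 - 3a^2 - 6ab - 3b^2}{12}
                 = \frac{(b - a)^2}{12}.
\end{align*}
```

The standard deviation is therefore $(b - a)/\sqrt{12}$, roughly 29% of the interval width.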

2. Bernoulli($p$)

Discrete, two outcomes: 1 with probability $p$, 0 with probability $1 - p$. A single coin toss. Mean $p$, variance $p(1 - p)$. The building block of the binomial and Poisson.

3. Binomial($n, p$)

Discrete, the number of successes in $n$ independent Bernoulli($p$) trials:

$$\mathrm{Pr}(X = k) \;=\; \binom{n}{k}\, p^k (1 - p)^{n - k}, \qquad k = 0, 1, \ldots, n.$$

Mean $np$, variance $np(1 - p)$. As $n \to \infty$ with $p$ fixed, approaches a Gaussian (Central Limit Theorem). As $n \to \infty$ with $np$ fixed (i.e. $p \to 0$), approaches a Poisson; see lesson 11.4.
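The binomial mean, variance, and Poisson limit can all be checked from the PMF. This is a sketch with arbitrarily chosen parameters ($n = 20$, $p = 0.3$ for the moments; $n = 1000$, $p = 0.002$ for the limit); the Poisson PMF it compares against is the one stated later in this lesson.

```python
import math

def binomial_pmf(k, n, p):
    """Pr(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Pr(X = k) for X ~ Poisson(lam)."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Moments from the PMF, to compare with n*p and n*p*(1-p).
n, p = 20, 0.3
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
total = sum(pmf)
mean = sum(k * q for k, q in enumerate(pmf))
var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))

# Poisson limit: many trials, small p, with n*p = 2 held fixed.
max_gap = max(
    abs(binomial_pmf(k, 1000, 0.002) - poisson_pmf(k, 2.0))
    for k in range(15)
)
```

With these values `mean` and `var` come out at $np = 6$ and $np(1 - p) = 4.2$ up to rounding, and `max_gap` is small, showing the two PMFs nearly coincide point by point.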

4. Gaussian (normal), $\mathcal{N}(\mu, \sigma^2)$

Continuous, the bell curve:

$$f(x) \;=\; \frac{1}{\sigma \sqrt{2\pi}}\, \exp\!\left( -\frac{(x - \mu)^2}{2 \sigma^2} \right).$$

Mean $\mu$, variance $\sigma^2$. The asymptotic shape of any suitably normalised sum of many independent identically-distributed (i.i.d.) random variables with finite variance; this is the Central Limit Theorem. We develop it in lesson 11.2.
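The Gaussian CDF has no elementary closed form, but it can be written with the error function, which Python exposes as `math.erf`. The sketch below uses it to compute how much probability lies within $k$ standard deviations of the mean; the function names are illustrative.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    """Pr(X <= x) for X ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def prob_within(k):
    """Pr(|X - mu| <= k * sigma); the result does not depend on mu or sigma."""
    return gaussian_cdf(k) - gaussian_cdf(-k)

within_1 = prob_within(1.0)   # about 0.683
within_2 = prob_within(2.0)   # about 0.954
within_3 = prob_within(3.0)   # about 0.997
```

These three numbers are the familiar 68-95-99.7 rule, a handy way to read a standard deviation as a "typical spread".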

5. Exponential(λ)

Continuous, supported on $[0, \infty)$:

$$f(t) \;=\; \lambda\, e^{-\lambda t}, \qquad t \geq 0.$$

Mean $1/\lambda$, variance $1/\lambda^2$. The waiting-time distribution for a memoryless process: the inter-arrival times of a Poisson process. We develop it in lesson 11.4.
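"Memoryless" means that having already waited a time $s$ does not change the distribution of the remaining wait: $\mathrm{Pr}(T > s + t \mid T > s) = \mathrm{Pr}(T > t) = e^{-\lambda t}$. The sketch below checks this by simulation using the standard library's `random.expovariate`; the rate, the times `s` and `t`, and the seed are arbitrary choices.

```python
import math
import random

rng = random.Random(1)
lam = 2.0
samples = [rng.expovariate(lam) for _ in range(200_000)]

# Sample mean should be close to 1 / lam = 0.5.
sample_mean = sum(samples) / len(samples)

def survival(ts, t):
    """Fraction of waiting times in ts that exceed t."""
    return sum(1 for x in ts if x > t) / len(ts)

s, t = 0.4, 0.3

# Keep only waits longer than s, and measure the time remaining after s.
remaining = [x - s for x in samples if x > s]

p_conditional = survival(remaining, t)  # estimate of Pr(T > s + t | T > s)
p_fresh = survival(samples, t)          # estimate of Pr(T > t)
p_theory = math.exp(-lam * t)           # exact value of Pr(T > t)
```

Both estimates should land close to `p_theory`, whatever value of `s` is chosen.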

6. Poisson(λ)

Discrete, supported on the non-negative integers:

$$\mathrm{Pr}(X = k) \;=\; \frac{\lambda^k\, e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots$$

Mean $\lambda$, variance $\lambda$ (variance equals mean, a Poisson-process signature). The number of events in a fixed time interval for a Poisson process. We develop it in lesson 11.4.
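The link between exponential waiting times and Poisson counts can be seen by simulation. This sketch assumes a process with a hypothetical event rate of 5 per unit time observed over a window of length 2, so the Poisson parameter for the count is $\lambda = 5 \times 2 = 10$; all names and values are illustrative.

```python
import random

def count_events(rng, rate, window):
    """Count arrivals in [0, window] when the gaps are Exponential(rate)."""
    t, count = 0.0, 0
    while True:
        t += rng.expovariate(rate)   # time of the next arrival
        if t > window:
            return count
        count += 1

rng = random.Random(2)
rate, window = 5.0, 2.0              # expected count = rate * window = 10
counts = [count_events(rng, rate, window) for _ in range(20_000)]

mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)

# Variance-to-mean ratio; close to 1 for Poisson counts.
ratio = var / mean
```

A variance-to-mean ratio near 1 is the signature mentioned above; ratios well above or below 1 suggest the events are not arriving as a Poisson process.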

Joint, marginal, and conditional

When two random variables $X$ and $Y$ appear together we need new tools.

The joint distribution $p(x, y)$ or $f(x, y)$ gives the probability or density of $(X, Y)$ taking the pair $(x, y)$. The marginal distribution of $X$ alone is recovered by summing or integrating out $Y$:

$$p(x) \;=\; \sum_y p(x, y), \qquad f(x) \;=\; \int f(x, y)\, dy.$$

The conditional distribution of $X$ given $Y = y$ is

$$p(x \mid y) \;=\; \frac{p(x, y)}{p(y)}, \qquad f(x \mid y) \;=\; \frac{f(x, y)}{f(y)}.$$

Two random variables are independent if $p(x, y) = p(x)\, p(y)$ for all $x$ and $y$, i.e. learning the value of one tells you nothing about the other.

These three concepts — joint, marginal, conditional — are the algebraic infrastructure of Bayesian inference. Bayes’ rule rearranges them.
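The three operations can be carried out on a small joint table. The sketch below assumes a hypothetical joint PMF for two binary variables, with made-up probabilities, and uses exact fractions so the equalities can be checked without rounding.

```python
from fractions import Fraction as F

# Hypothetical joint PMF p(x, y) for x, y in {0, 1}; the entries sum to 1.
joint = {(0, 0): F(3, 10), (0, 1): F(1, 5), (1, 0): F(1, 10), (1, 1): F(2, 5)}

def marginal(joint, axis):
    """Sum out the other variable; axis 0 gives p(x), axis 1 gives p(y)."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0) + p
    return out

def conditional_x_given_y(joint, y):
    """p(x | y) = p(x, y) / p(y)."""
    p_y = marginal(joint, 1)[y]
    return {x: p / p_y for (x, yy), p in joint.items() if yy == y}

def is_independent(joint):
    """True if p(x, y) = p(x) * p(y) for every pair in the table."""
    p_x, p_y = marginal(joint, 0), marginal(joint, 1)
    return all(p == p_x[x] * p_y[y] for (x, y), p in joint.items())

p_x = marginal(joint, 0)                         # {0: 1/2, 1: 1/2}
p_y = marginal(joint, 1)                         # {0: 2/5, 1: 3/5}
p_x_given_y1 = conditional_x_given_y(joint, 1)   # {0: 1/3, 1: 2/3}
independent = is_independent(joint)              # False for this table

# Bayes' rule: p(y | x) = p(x | y) * p(y) / p(x), here for x = 1, y = 1.
p_y1_given_x1 = p_x_given_y1[1] * p_y[1] / p_x[1]
```

Knowing $Y = 1$ shifts the probability of $X = 1$ from 1/2 to 2/3, which is exactly what the failed independence check reports.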

Two more facts that will come up

Linearity of expectation holds with or without independence: $\mathbb{E}[aX + bY] = a\, \mathbb{E}[X] + b\, \mathbb{E}[Y]$ for any constants $a, b$.

Variance is not linear in general, but for independent random variables it adds: $\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$. The same identity rescaled: standard deviations do not add for independent sums, but their squares do. This is the algebraic content of “errors add in quadrature.”
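A quick simulation shows both the quadrature rule and why independence matters. The sketch assumes two hypothetical independent Gaussian noise sources with standard deviations 3 and 4; the sample size and seed are arbitrary.

```python
import math
import random

def variance(xs):
    """Average squared deviation from the sample mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(3)
n = 200_000
x = [rng.gauss(0.0, 3.0) for _ in range(n)]
y = [rng.gauss(0.0, 4.0) for _ in range(n)]  # drawn independently of x

# Independent sum: variances add, 9 + 16 = 25.
var_sum = variance([a + b for a, b in zip(x, y)])

# Standard deviations combine in quadrature: sqrt(3^2 + 4^2) = 5, not 3 + 4 = 7.
sd_sum = math.sqrt(var_sum)

# Fully dependent case, Y = X: Var[X + X] = 4 Var[X] = 36, not 2 Var[X] = 18.
var_dependent = variance([a + a for a in x])
```

The dependent case is the reminder that the additive rule is a statement about independent variables, while linearity of expectation needs no such condition.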

What we use this for

The vocabulary of this lesson is the prerequisite for everything else in the chapter:

Across the rest of the bookshelf: