11.1 Random variables and distributions

A random variable is a quantity whose value is uncertain: the outcome of a die roll, the velocity of a single molecule, the spike count of an auditory-nerve fibre over a fixed interval. Its distribution specifies how probability is spread over the possible values. This lesson develops the vocabulary (random variable, distribution, expected value, variance) and the six named distributions (uniform, Bernoulli, binomial, Gaussian, exponential, Poisson) that recur throughout the chapter.

What is a random variable?

A random variable is a quantity whose value is not known with certainty until you measure it. Examples:

The number of times a tossed coin lands heads in 10 trials.
The arrival time of the next photon from a faint star.
The voltage measured across a noisy resistor at a particular instant.
The displacement of a Brownian particle after one second.

A random variable is not the same as a single random outcome. It is the entire family of possible outcomes plus the probabilities of each. Two random variables are equal in distribution if they have the same probabilities of producing the same outcomes — they need not produce the same value in any particular trial.

Random variables come in two flavours:

Discrete: takes values from a countable set, such as $\{0, 1, 2, 3, \ldots\}$, $\{H, T\}$, or the integers. Examples: coin tosses, spike counts, photon counts.
Continuous: takes values from a continuum, typically $\mathbb{R}$, an interval such as $[0, 1]$, or some other subset of the real line. Examples: position of a Brownian particle, voltage, arrival time.

Describing a distribution

For a discrete random variable $X$ , the probability mass function (PMF) gives the probability of each possible value:

p(k) \;\equiv\; \mathrm{Pr}(X = k), \qquad \sum_k p(k) = 1.

For a continuous random variable, individual points have probability zero (a continuous variable could be exactly $\pi$ or exactly $\sqrt{2}$ , but the probability of landing on any single value with infinite precision is zero). Instead we describe the distribution by a probability density function (PDF):

f(x), \qquad \mathrm{Pr}(a \leq X \leq b) \;=\; \int_a^b f(x)\, dx, \qquad \int_{-\infty}^{\infty} f(x)\, dx = 1.

The density $f(x)$ has units of “probability per unit $x$ .” For voltage it would have units of $1/\text{V}$ ; for time, units of $1/\text{s}$ .

Both kinds of distribution can be described equivalently by their cumulative distribution function (CDF):

F(x) \;\equiv\; \mathrm{Pr}(X \leq x).

The CDF is non-decreasing, rising from 0 as $x \to -\infty$ to 1 as $x \to +\infty$. For continuous $X$, $f(x) = F'(x)$ wherever $F$ is differentiable.
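As a rough illustration of these three descriptions, the sketch below builds the PMF and CDF of a fair six-sided die and then checks, for the exponential density introduced later in this lesson, that integrating the PDF over an interval matches the difference of CDF values. The die, the rate of 2 per second, the interval, the helper names, and the midpoint-rule step count are all illustrative assumptions, not part of the lesson.

```python
import math

# PMF of a fair six-sided die (illustrative choice): p(k) = 1/6 for k = 1..6.
die_pmf = {k: 1 / 6 for k in range(1, 7)}

def cdf_from_pmf(pmf, x):
    """F(x) = Pr(X <= x): add up the PMF over all values not exceeding x."""
    return sum(p for k, p in pmf.items() if k <= x)

def exp_pdf(t, rate):
    """Exponential density f(t) = rate * exp(-rate * t) for t >= 0, else 0."""
    return rate * math.exp(-rate * t) if t >= 0 else 0.0

def exp_cdf(t, rate):
    """Closed-form CDF F(t) = 1 - exp(-rate * t) for t >= 0, else 0."""
    return 1.0 - math.exp(-rate * t) if t >= 0 else 0.0

def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

rate = 2.0  # assumed rate, in 1/s
prob_by_integral = integrate(lambda t: exp_pdf(t, rate), 0.5, 1.5)
prob_by_cdf = exp_cdf(1.5, rate) - exp_cdf(0.5, rate)

print(sum(die_pmf.values()))          # total probability, 1 up to rounding
print(cdf_from_pmf(die_pmf, 4))       # Pr(X <= 4), about 0.667
print(prob_by_integral, prob_by_cdf)  # the two should agree closely
```

The discrete CDF is a staircase that jumps by $p(k)$ at each value $k$, while the continuous CDF is the running integral of the density.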

Expectation and variance

The expected value (or mean) of a random variable is the probability-weighted average of its possible values:

\mathbb{E}[X] \;=\; \sum_k k\, p(k) \qquad \text{or} \qquad \mathbb{E}[X] \;=\; \int x\, f(x)\, dx.

The expected value is a deterministic number — a property of the distribution, not of any particular realisation. It is the long-run average if you repeated the experiment many times.

The variance measures how spread out the distribution is around the mean:

\mathrm{Var}[X] \;\equiv\; \mathbb{E}\bigl[(X - \mu)^2\bigr] \;=\; \mathbb{E}[X^2] - \mu^2,

where $\mu = \mathbb{E}[X]$ . The variance has units of $[X]^2$ ; the square root, called the standard deviation $\sigma = \sqrt{\mathrm{Var}[X]}$ , has the same units as $X$ and is the more natural “typical spread” measure.
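The discrete forms of these definitions can be sketched in a few lines. The example below assumes a PMF stored as a dictionary mapping each value to its probability and uses a fair die as the test case; both choices, and the function names, are illustrative.

```python
import math

def expectation(pmf, g=lambda k: k):
    """E[g(X)] = sum over k of g(k) * p(k), for a PMF given as {value: probability}."""
    return sum(g(k) * p for k, p in pmf.items())

def variance(pmf):
    """Var[X] = E[(X - mu)^2], computed directly from the definition."""
    mu = expectation(pmf)
    return expectation(pmf, lambda k: (k - mu) ** 2)

die_pmf = {k: 1 / 6 for k in range(1, 7)}  # fair die, an illustrative choice

mu = expectation(die_pmf)   # 3.5
var = variance(die_pmf)     # 35/12, about 2.917
sigma = math.sqrt(var)      # about 1.708

print(mu, var, sigma)
```

Note that the mean 3.5 is not itself a possible outcome of the die: the expected value is a property of the distribution, not a value the variable must take.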

Derivation: the shortcut formula $\mathrm{Var}[X] = \mathbb{E}[X^2] - \mu^2$

Expand the definition:

\mathrm{Var}[X] \;=\; \mathbb{E}\bigl[(X - \mu)^2\bigr] \;=\; \mathbb{E}\bigl[X^2 - 2\mu X + \mu^2\bigr].

Apply linearity of expectation (which holds because the sum or integral is linear in its summand or integrand), treating $\mu$ as a constant:

\mathrm{Var}[X] \;=\; \mathbb{E}[X^2] - 2 \mu\, \mathbb{E}[X] + \mu^2 \;=\; \mathbb{E}[X^2] - 2 \mu \cdot \mu + \mu^2 \;=\; \mathbb{E}[X^2] - \mu^2.

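The variance is therefore “the mean of the square minus the square of the mean.” This is useful because $\mathbb{E}[X]$ and $\mathbb{E}[X^2]$ can be accumulated together in a single pass over the data, without knowing $\mu$ in advance. In floating-point arithmetic, however, the final subtraction can lose precision when $\mu^2$ is much larger than the variance.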
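A minimal sketch comparing the definition with the shortcut on a small made-up sample follows. The function names and the data are hypothetical. The second half illustrates a practical caveat: when the mean is very large compared with the spread, the shortcut subtracts two nearly equal numbers in floating point and its result can become unreliable, whereas the two-pass definition does not suffer from this.

```python
def variance_two_pass(xs):
    """Population variance from the definition: mean first, then mean squared deviation."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n

def variance_one_pass(xs):
    """Population variance via the shortcut E[X^2] - mu^2, using running sums only."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in xs:
        n += 1
        total += x
        total_sq += x * x
    mu = total / n
    return total_sq / n - mu * mu

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample
print(variance_two_pass(data), variance_one_pass(data))  # both give 4.0

# Same spread, shifted by a large offset: the shortcut now subtracts two
# nearly equal huge numbers and can lose most of its significant digits.
shifted = [x + 1e9 for x in data]
print(variance_two_pass(shifted), variance_one_pass(shifted))
```

Both functions divide by $n$, matching the population variance defined in this lesson rather than the $n - 1$ sample estimator.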

Higher-order moments such as $\mathbb{E}[X^3]$ and $\mathbb{E}[X^4]$ measure further aspects of the distribution. The third moment around the mean, normalised by $\sigma^3$, is the skewness, which measures asymmetry. The fourth moment around the mean, normalised by $\sigma^4$, is the kurtosis, which measures tail heaviness; subtracting 3 gives the excess kurtosis, which is zero for a Gaussian. For most distributions in the bookshelf, mean and variance suffice.
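To make the normalised moments concrete, the sketch below computes skewness and excess kurtosis (kurtosis minus 3) directly from a PMF. The helper name and both example PMFs (a fair die and a made-up right-skewed distribution) are illustrative assumptions.

```python
import math

def standardised_moment(pmf, order):
    """E[((X - mu) / sigma)^order] for a PMF given as {value: probability}."""
    mu = sum(k * p for k, p in pmf.items())
    var = sum((k - mu) ** 2 * p for k, p in pmf.items())
    sigma = math.sqrt(var)
    return sum(((k - mu) / sigma) ** order * p for k, p in pmf.items())

die_pmf = {k: 1 / 6 for k in range(1, 7)}  # symmetric about 3.5
skewed_pmf = {0: 0.5, 1: 0.3, 2: 0.2}      # made-up PMF with a longer right tail

die_skewness = standardised_moment(die_pmf, 3)               # 0 by symmetry
die_excess_kurtosis = standardised_moment(die_pmf, 4) - 3.0  # about -1.27
skewed_skewness = standardised_moment(skewed_pmf, 3)         # positive

print(die_skewness, die_excess_kurtosis, skewed_skewness)
```

The negative excess kurtosis of the die reflects tails that are lighter than a Gaussian's: a bounded, flat distribution has no extreme outcomes at all.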

Six distributions to know

Six named distributions appear repeatedly across the bookshelf. They are worth knowing by name, mean, variance, and shape.

1. Uniform on $[a, b]$

Continuous, PDF is constant on $[a, b]$ and zero outside:

f(x) \;=\; \frac{1}{b - a} \quad \text{for } x \in [a, b].

Mean $(a + b)/2$ , variance $(b - a)^2 / 12$ . The “default” distribution when you have no information beyond bounds.
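One way to check the stated mean and variance is to integrate directly against the constant density and then apply the shortcut $\mathrm{Var}[X] = \mathbb{E}[X^2] - \mu^2$:

\mathbb{E}[X] \;=\; \int_a^b \frac{x}{b - a}\, dx \;=\; \frac{b^2 - a^2}{2 (b - a)} \;=\; \frac{a + b}{2},

\mathbb{E}[X^2] \;=\; \int_a^b \frac{x^2}{b - a}\, dx \;=\; \frac{b^3 - a^3}{3 (b - a)} \;=\; \frac{a^2 + ab + b^2}{3},

\mathrm{Var}[X] \;=\; \frac{a^2 + ab + b^2}{3} - \frac{(a + b)^2}{4} \;=\; \frac{(b - a)^2}{12}.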

2. Bernoulli( $p$ )

Discrete, two outcomes: 1 with probability $p$ , 0 with probability $1 - p$ . A single coin toss. Mean $p$ , variance $p(1 - p)$ . The building block of the binomial and Poisson.

3. Binomial( $n, p$ )

Discrete, the number of successes in $n$ independent Bernoulli( $p$ ) trials:

\mathrm{Pr}(X = k) \;=\; \binom{n}{k}\, p^k (1 - p)^{n - k}, \qquad k = 0, 1, \ldots, n.

Mean $np$, variance $np(1 - p)$. As $n \to \infty$ with $p$ fixed, the standardised count $(X - np)/\sqrt{np(1 - p)}$ approaches a Gaussian (Central Limit Theorem). As $n \to \infty$ with $np$ fixed (i.e. $p \to 0$), the distribution approaches a Poisson with mean $np$; see lesson 11.4.
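The following sketch evaluates the binomial PMF, confirms the mean and variance numerically for one choice of parameters, and illustrates the Poisson limit by holding $np = 4$ fixed while $n$ grows. The parameter values, the function names, and the cut-off at $k = 20$ when comparing the two PMFs are assumptions made for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """Pr(X = k) for Binomial(n, p); zero outside 0..n."""
    if k < 0 or k > n:
        return 0.0
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Pr(X = k) for Poisson(lam)."""
    return lam**k * math.exp(-lam) / math.factorial(k)

# Mean and variance for n = 10, p = 0.5 (illustrative values).
n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
mean = sum(k * q for k, q in enumerate(pmf))               # n * p = 5.0
var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))  # n * p * (1 - p) = 2.5

# Poisson limit: keep n * p = lam fixed and let n grow.
lam = 4.0

def max_gap(n_trials):
    """Largest pointwise difference between Binomial(n, lam / n) and Poisson(lam)."""
    prob = lam / n_trials
    return max(abs(binomial_pmf(k, n_trials, prob) - poisson_pmf(k, lam))
               for k in range(21))

gaps = [max_gap(m) for m in (10, 100, 1000)]
print(mean, var)
print(gaps)  # should shrink as n grows
```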

4. Gaussian (normal), $\mathcal{N}(\mu, \sigma^2)$

Continuous, the bell curve:

f(x) \;=\; \frac{1}{\sigma \sqrt{2\pi}}\, \exp\!\left( -\frac{(x - \mu)^2}{2 \sigma^2} \right).

Mean $\mu$, variance $\sigma^2$. By the Central Limit Theorem, this is the asymptotic shape of a suitably centred and scaled sum of many independent, identically distributed (i.i.d.) random variables with finite variance. Developed in lesson 11.2.
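As a quick sanity check on the density formula, the sketch below integrates it numerically, once over a wide range to confirm normalisation and once over $[\mu - \sigma, \mu + \sigma]$, and compares the latter with the CDF from Python's standard-library `statistics.NormalDist`. The parameter values, the integration range of ten standard deviations, and the midpoint rule are illustrative choices.

```python
import math
from statistics import NormalDist

def gaussian_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def integrate(f, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

mu, sigma = 1.0, 2.0  # illustrative parameters

def density(x):
    return gaussian_pdf(x, mu, sigma)

total = integrate(density, mu - 10 * sigma, mu + 10 * sigma)  # close to 1
within_one_sigma = integrate(density, mu - sigma, mu + sigma)  # about 0.683

dist = NormalDist(mu, sigma)
reference = dist.cdf(mu + sigma) - dist.cdf(mu - sigma)

print(total, within_one_sigma, reference)
```

Roughly 68% of the probability lies within one standard deviation of the mean, whatever the values of $\mu$ and $\sigma$.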

5. Exponential( $\lambda$ )

Continuous, supported on $[0, \infty)$ :

f(t) \;=\; \lambda\, e^{-\lambda t}, \qquad t \geq 0.

Mean $1/\lambda$, variance $1/\lambda^2$. The waiting-time distribution for a memoryless process: the inter-arrival times of a Poisson process with rate $\lambda$. Developed in lesson 11.4.
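A small simulation can make the mean, the variance, and the memoryless property tangible. The sketch below draws exponential samples by inverting the CDF $F(t) = 1 - e^{-\lambda t}$, then checks that $\mathrm{Pr}(T > s + t \mid T > s)$ is close to $\mathrm{Pr}(T > t) = e^{-\lambda t}$. The rate, the seed, the sample size, and the values of $s$ and $t$ are arbitrary assumptions; the results are Monte Carlo estimates, so they agree with theory only approximately.

```python
import math
import random

def sample_exponential(rate, rng):
    """Inverse-transform sample: solve u = 1 - exp(-rate * t) for t."""
    u = rng.random()  # uniform on [0, 1)
    return -math.log(1.0 - u) / rate

rng = random.Random(0)  # fixed seed so the run is reproducible
rate = 2.0              # assumed rate, in 1/s
samples = [sample_exponential(rate, rng) for _ in range(200_000)]

n = len(samples)
mean = sum(samples) / n                          # theory: 1 / rate = 0.5
var = sum((x - mean) ** 2 for x in samples) / n  # theory: 1 / rate^2 = 0.25

# Memorylessness: having already waited s, the remaining wait is distributed
# exactly like a fresh wait.
s, t = 0.4, 0.3
survivors = [x for x in samples if x > s]
conditional = sum(1 for x in survivors if x > s + t) / len(survivors)
unconditional = sum(1 for x in samples if x > t) / n
exact = math.exp(-rate * t)

print(mean, var)
print(conditional, unconditional, exact)
```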

6. Poisson( $\lambda$ )

Discrete, supported on the non-negative integers:

\mathrm{Pr}(X = k) \;=\; \frac{\lambda^k\, e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots

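Mean $\lambda$, variance $\lambda$ (variance equal to the mean is a signature of Poisson statistics). The number of events in a fixed time interval for a Poisson process. The parameter here is the dimensionless mean count: for a process with event rate $r$ observed over a window $T$, it equals $rT$. Developed in lesson 11.4.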

Worked example: mean and variance of a Poisson spike count

An auditory-nerve fibre fires at a mean rate $\lambda = 80\,\text{spikes/s}$ . In a $T = 50\,\text{ms}$ window, the spike count $N$ is Poisson with parameter $\mu = \lambda T = 80 \times 0.05 = 4$ .

Mean: $\mathbb{E}[N] = \mu = 4$ spikes.

Variance: $\mathrm{Var}[N] = \mu = 4$ (variance equals mean for Poisson).

Standard deviation: $\sigma = \sqrt{4} = 2$ spikes.

The coefficient of variation is $\sigma/\mu = 2/4 = 0.5$ . Doubling the window to 100 ms gives $\mu = 8$ , $\sigma = 2\sqrt{2} \approx 2.83$ , and CV drops to $\approx 0.35$ — longer counting windows yield more reliable rate estimates.
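The arithmetic of this worked example can be reproduced and extended to other window lengths with a short sketch. The function name and the extra window lengths (200 ms and 400 ms) are illustrative additions; the 80 spikes/s rate and the Poisson assumption come from the example above.

```python
import math

def poisson_count_stats(rate_hz, window_s):
    """Mean, standard deviation, and coefficient of variation of a Poisson count."""
    mu = rate_hz * window_s  # Poisson parameter: rate times window
    sigma = math.sqrt(mu)    # variance equals the mean
    return mu, sigma, sigma / mu

rate_hz = 80.0
for window_ms in (50, 100, 200, 400):
    mu, sigma, cv = poisson_count_stats(rate_hz, window_ms / 1000)
    print(f"T = {window_ms:3d} ms: mean = {mu:5.1f}, sd = {sigma:4.2f}, CV = {cv:4.2f}")
```

Because $\sigma/\mu = 1/\sqrt{\mu}$ for a Poisson count, each fourfold increase in the window halves the coefficient of variation.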

Joint, marginal, and conditional

When two random variables $X$ and $Y$ appear together we need new tools.

The joint distribution $p(x, y)$ or $f(x, y)$ gives the probability or density of $(X, Y)$ taking the pair $(x, y)$ . The marginal distribution of $X$ alone is recovered by summing or integrating out $Y$ :

p(x) \;=\; \sum_y p(x, y), \qquad f(x) \;=\; \int f(x, y)\, dy.

The conditional distribution of $X$ given $Y = y$, defined wherever $p(y) > 0$ (or $f(y) > 0$), is

p(x \mid y) \;=\; \frac{p(x, y)}{p(y)}, \qquad f(x \mid y) \;=\; \frac{f(x, y)}{f(y)}.

Two random variables are independent if $p(x, y) = p(x)\, p(y)$ for all $x$ and $y$ (or $f(x, y) = f(x)\, f(y)$ in the continuous case), i.e. learning the value of one tells you nothing about the other.

These three concepts — joint, marginal, conditional — are the algebraic infrastructure of Bayesian inference. Bayes’ rule rearranges them.
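A minimal sketch of these operations on a small discrete example follows. The joint PMF values are made up for illustration, and the dictionary-of-pairs representation and function names are assumptions rather than anything prescribed by the lesson.

```python
# Hypothetical joint PMF p(x, y) over x in {0, 1} and y in {0, 1}.
joint = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}

def marginal(joint_pmf, axis):
    """Sum out the other variable: axis=0 gives p(x), axis=1 gives p(y)."""
    out = {}
    for pair, p in joint_pmf.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

def conditional_x_given_y(joint_pmf, y):
    """p(x | y) = p(x, y) / p(y), defined only when p(y) > 0."""
    p_y = marginal(joint_pmf, 1)[y]
    return {x: p / p_y for (x, y_val), p in joint_pmf.items() if y_val == y}

def is_independent(joint_pmf, tol=1e-12):
    """Check p(x, y) = p(x) p(y) for every pair in the table."""
    p_x, p_y = marginal(joint_pmf, 0), marginal(joint_pmf, 1)
    return all(abs(p - p_x[x] * p_y[y]) <= tol for (x, y), p in joint_pmf.items())

p_x = marginal(joint, 0)                       # {0: 0.5, 1: 0.5}
p_y = marginal(joint, 1)                       # {0: 0.4, 1: 0.6}
p_x_given_y1 = conditional_x_given_y(joint, 1)  # {0: 1/3, 1: 2/3}

# Bayes' rule: p(y=1 | x=1) = p(x=1 | y=1) p(y=1) / p(x=1).
bayes = p_x_given_y1[1] * p_y[1] / p_x[1]
direct = joint[(1, 1)] / p_x[1]

# An independent joint with the same marginals, for comparison.
product_joint = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}

print(p_x, p_y, p_x_given_y1)
print(bayes, direct)  # both about 0.8
print(is_independent(joint), is_independent(product_joint))
```

In this table, knowing $Y = 1$ shifts the probability of $X = 1$ from 0.5 to about 0.67, which is exactly what failing the independence test means.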

Two more facts that will come up

Linearity of expectation holds with or without independence: $\mathbb{E}[aX + bY] = a \mathbb{E}[X] + b \mathbb{E}[Y]$ for any constants $a, b$ .

Variance is not linear in general, but for independent random variables it adds: $\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$. In terms of standard deviations, $\sigma_{X+Y} = \sqrt{\sigma_X^2 + \sigma_Y^2}$: the standard deviations themselves do not add, but their squares do. This is the algebraic content of “errors add in quadrature.”
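Both facts can be checked exactly by enumeration. The sketch below compares the sum of two independent fair dice with the fully dependent case $Y = X$; the dice and the function names are illustrative choices. In both cases the means add, but only in the independent case do the variances add.

```python
import math
from itertools import product

def pmf_of_sum(joint_pmf):
    """PMF of X + Y from a joint PMF given as {(x, y): probability}."""
    out = {}
    for (x, y), p in joint_pmf.items():
        out[x + y] = out.get(x + y, 0.0) + p
    return out

def mean(pmf):
    return sum(k * p for k, p in pmf.items())

def variance(pmf):
    mu = mean(pmf)
    return sum((k - mu) ** 2 * p for k, p in pmf.items())

die = {k: 1 / 6 for k in range(1, 7)}

# Independent dice: the joint PMF is the product of the marginals.
independent = {(x, y): die[x] * die[y] for x, y in product(die, die)}
# Fully dependent: the second "die" always copies the first, so Y = X.
copied = {(x, x): die[x] for x in die}

var_die = variance(die)                             # 35/12
mean_indep_sum = mean(pmf_of_sum(independent))      # 7.0
mean_dep_sum = mean(pmf_of_sum(copied))             # 7.0 as well
var_indep_sum = variance(pmf_of_sum(independent))   # 2 * 35/12
var_dep_sum = variance(pmf_of_sum(copied))          # 4 * 35/12

print(mean_indep_sum, mean_dep_sum)
print(var_indep_sum, 2 * var_die)
print(var_dep_sum, 4 * var_die)
print(math.sqrt(var_indep_sum), 2 * math.sqrt(var_die))  # quadrature sum vs naive sum
```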

The history: Kolmogorov's axioms for probability

For well over two centuries after Pascal and Fermat’s 1654 correspondence on games of chance, probability was treated as a collection of computational recipes: useful, intuitive, and logically unmoored. The frequentist interpretation (“probability is long-run frequency”) and the Laplacian interpretation (“probability is equally likely cases”) each worked in specific settings but failed in others, and there was no agreed answer to what “probability” meant.

Andrey Kolmogorov’s 1933 monograph Grundbegriffe der Wahrscheinlichkeitsrechnung (Foundations of the Theory of Probability) gave the axiomatic definition that is now standard: a probability is a non-negative, countably-additive measure on a sigma-algebra of events, normalised so that the certain event has measure 1. Random variables are measurable functions on this space; expectations are Lebesgue integrals.

The axiomatisation did three things at once. It unified frequentist and subjective probabilities into a single mathematical object (only the interpretation differs). It connected probability to measure theory and so to the rest of 20th-century analysis. And it provided the formal ground for the convergence theorems (law of large numbers, central limit theorem, martingale convergence) that underwrite all of modern statistical inference.

What we use this for

The vocabulary of this lesson is the prerequisite for everything else in the chapter:

11.2 develops the Gaussian and the CLT.
11.3 builds random walks and Brownian motion from sums of i.i.d. Bernoulli or Gaussian steps.
11.4 develops the Poisson process from a memoryless-arrivals argument.
11.5 does Bayesian inference, signal detection theory, and ROC curves.

Across the rest of the bookshelf:

Thermal noise voltage on a resistor is Gaussian with $\mathrm{Var}[V] = 4 k_B T R \Delta f$ (Johnson–Nyquist).
Spike counts in an auditory-nerve fibre are Poisson-distributed with mean equal to the firing rate times the window.
Brownian motion is the continuum limit of an i.i.d. random walk.
Bayesian perception models the brain’s posterior over stimuli given sensory evidence.
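To give a feel for the scale of the Johnson–Nyquist formula quoted above, the sketch below evaluates $\mathrm{Var}[V] = 4 k_B T R \Delta f$ for one hypothetical case: a 1 kΩ resistor at 300 K measured over a 10 kHz bandwidth. These component values and the function name are assumptions chosen purely for illustration.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact in the 2019 SI)

def johnson_noise_variance(resistance_ohm, temperature_k, bandwidth_hz):
    """Var[V] = 4 k_B T R delta_f, in volts squared."""
    return 4.0 * K_B * temperature_k * resistance_ohm * bandwidth_hz

# Hypothetical values: 1 kOhm resistor, 300 K, 10 kHz bandwidth.
variance_v2 = johnson_noise_variance(1e3, 300.0, 1e4)
sigma_v = math.sqrt(variance_v2)  # rms noise voltage in volts

print(f"{sigma_v * 1e6:.3f} microvolts rms")  # about 0.407
```

The standard deviation, not the variance, is what an rms voltmeter would report, which is why the square root is taken at the end.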