11.2 The Gaussian and the central limit theorem

The Gaussian distribution (or normal distribution) is the single most-used probability distribution in science. There is a reason: by the Central Limit Theorem, the suitably centred and scaled sum of many independent, identically distributed random variables with finite variance converges to it. This is why so much of the “noise” we measure in physics, biology, finance, and engineering is approximately Gaussian: such noise is typically a sum of many independent micro-fluctuations.

This lesson develops the Gaussian, derives why it has the form it does, states the Central Limit Theorem, and demonstrates the CLT with an interactive. We also touch on the multivariate Gaussian, which becomes central in Bayesian inference.

The Gaussian PDF

The one-dimensional Gaussian (or normal) distribution with mean $\mu$ and variance $\sigma^2$ is

$$\boxed{\;f(x) \;=\; \frac{1}{\sigma\sqrt{2\pi}}\, \exp\!\left( -\frac{(x - \mu)^2}{2 \sigma^2} \right).\;}$$

It is the familiar “bell curve” centred at $\mu$ with width $\sigma$. The notation $\mathcal{N}(\mu, \sigma^2)$ denotes “a Gaussian with mean $\mu$ and variance $\sigma^2$.” A standard normal is $\mathcal{N}(0, 1)$.

Three features encode almost everything about the Gaussian’s behaviour:

The normalisation constant $1/(\sigma\sqrt{2\pi})$ is fixed by the requirement that $\int f(x)\, dx = 1$:

Why the normalisation involves √(2π)

Compute the integral

$$I \;=\; \int_{-\infty}^{\infty} e^{-x^2/2}\, dx.$$

There is no elementary antiderivative. The trick is to square it:

$$I^2 \;=\; \left( \int_{-\infty}^{\infty} e^{-x^2/2}\, dx \right) \left( \int_{-\infty}^{\infty} e^{-y^2/2}\, dy \right) \;=\; \int\!\!\int_{\mathbb{R}^2} e^{-(x^2 + y^2)/2}\, dx\, dy.$$

Switch to polar coordinates $(r, \theta)$ with $r^2 = x^2 + y^2$ and $dx\, dy = r\, dr\, d\theta$:

$$I^2 \;=\; \int_0^{2\pi} d\theta \int_0^\infty e^{-r^2/2}\, r\, dr \;=\; 2\pi \cdot \left[ -e^{-r^2/2} \right]_0^\infty \;=\; 2\pi.$$

So $I = \sqrt{2\pi}$. For a Gaussian with general $\mu$ and $\sigma$, substitute $u = (x - \mu)/\sigma$, $du = dx/\sigma$:

$$\int e^{-(x - \mu)^2 / (2 \sigma^2)}\, dx \;=\; \sigma \int e^{-u^2/2}\, du \;=\; \sigma \sqrt{2\pi}.$$

Hence the normalisation factor $1/(\sigma\sqrt{2\pi})$.
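The normalisation result can be checked numerically. The sketch below is only an illustration: the helper names `gaussian_pdf` and `trapezoid`, the integration range of ten standard deviations, and the parameter values are all assumptions chosen for the example, not part of the lesson.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def trapezoid(f, a, b, steps=20_000):
    """Composite trapezoid rule for the integral of f over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

# Unnormalised integral I = ∫ exp(-x^2/2) dx, truncated to [-10, 10].
# The tails beyond 10 are far below floating-point resolution here.
I = trapezoid(lambda x: math.exp(-0.5 * x * x), -10.0, 10.0)

# Normalised density with illustrative parameters, integrated over mu ± 10 sigma.
mu, sigma = 1.5, 0.7
area = trapezoid(lambda x: gaussian_pdf(x, mu, sigma),
                 mu - 10.0 * sigma, mu + 10.0 * sigma)

print("I        =", I)
print("sqrt(2pi) =", math.sqrt(2.0 * math.pi))
print("area under normalised PDF =", area)
```

The first two printed values should agree closely, and the last should be very close to 1 for any choice of `mu` and positive `sigma`.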

Why the Gaussian is everywhere: the Central Limit Theorem

The Gaussian’s ubiquity is not a coincidence. The Central Limit Theorem says:

Let $X_1, X_2, \ldots, X_n$ be independent, identically distributed random variables with mean $\mu$ and finite variance $\sigma^2$. Define the sample sum $S_n = X_1 + X_2 + \cdots + X_n$. Then as $n \to \infty$,

$$\frac{S_n - n\mu}{\sigma \sqrt{n}} \;\longrightarrow\; \mathcal{N}(0, 1)$$

in distribution.

In words: the sum of $n$ independent samples is approximately Gaussian with mean $n\mu$ and variance $n\sigma^2$, regardless of the underlying distribution of the $X_i$. The convergence is in distribution: the CDF approaches the Gaussian CDF point-by-point.

This is one of the deepest theorems in probability. The limit does not depend on what the $X_i$ actually are (Bernoulli, exponential, uniform, an irregular bimodal mixture), only on independence and the finite-variance condition. The Gaussian is the attractor under summation, much as the Gaussian heat kernel is the attractor for solutions of the heat equation under time evolution.
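To make "convergence in distribution" concrete, the following sketch compares the empirical CDF of the standardised sum with the standard normal CDF at a few points. The choice of Exp(1) summands (for which $\mu = \sigma = 1$), the value $n = 50$, the seed, and the helper names are assumptions made for illustration.

```python
import math
import random

def standard_normal_cdf(z):
    """CDF of N(0, 1), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def standardized_sum(n, rng):
    """(S_n - n*mu) / (sigma*sqrt(n)) for Exp(1) summands, where mu = sigma = 1."""
    s = sum(rng.expovariate(1.0) for _ in range(n))
    return (s - n * 1.0) / (1.0 * math.sqrt(n))

rng = random.Random(0)          # fixed seed so the run is repeatable
n, trials = 50, 20_000
samples = [standardized_sum(n, rng) for _ in range(trials)]

def empirical_cdf(z):
    """Fraction of simulated standardised sums that are <= z."""
    return sum(1 for s in samples if s <= z) / trials

check_points = (-2.0, -1.0, 0.0, 1.0, 2.0)
for z in check_points:
    print(z, empirical_cdf(z), standard_normal_cdf(z))

# Largest pointwise disagreement over the check points.
max_gap = max(abs(empirical_cdf(z) - standard_normal_cdf(z)) for z in check_points)
print("max gap:", max_gap)
```

The gap should shrink as `n` grows. Because the exponential is skewed, some residual disagreement near the centre is expected at moderate `n`.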

The CLT, made visible

[Interactive figure: histogram of the sum of N samples (20,000 trials), with the Gaussian CLT prediction overlaid in red. Axes: sum value, count. Example readout at N = 2: 𝒩(μ = 1.00, σ = 0.408). Control: distribution selector.]

The Central Limit Theorem: regardless of the distribution we sample from, the *sum* of N independent samples approaches a Gaussian as N grows. At N = 1 the histogram traces the underlying distribution itself: uniform, exponential, or bimodal. By N = 5 the shape is already close to Gaussian; by N = 10 it is nearly indistinguishable by eye from a Gaussian with mean Nμ and variance Nσ², where μ and σ² are the mean and variance of a single sample (skewed inputs such as the exponential take longer to lose their asymmetry). The red curve is the theoretical CLT prediction. The convergence happens for any distribution with finite variance; the only thing that changes is how fast.

Pick an underlying distribution (uniform, exponential, or bimodal) and slide the number of summands $N$ from 1 to 30. The histogram is the empirical distribution of $\sum_{i=1}^N X_i$ over 20,000 trials; the red curve is the theoretical CLT prediction $\mathcal{N}(N\mu, N\sigma^2)$.
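For readers without access to the interactive, a rough offline stand-in is sketched below. It assumes the uniform case is uniform on [0, 1] and the exponential case has rate 1; the bimodal option is omitted because its definition is not given here. The function names and the use of skewness as a shape summary are choices made for this example. Note that the mean and variance of a sum add exactly for any N; what the CLT adds is that the shape converges, which the skewness (zero for a Gaussian) tracks.

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# name -> (sampler, mean of one draw, variance of one draw); assumed definitions
DISTRIBUTIONS = {
    "uniform": (lambda: rng.random(), 0.5, 1.0 / 12.0),
    "exponential": (lambda: rng.expovariate(1.0), 1.0, 1.0),
}

def skewness(values):
    """Sample skewness: mean of the cubed standardised values."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

def simulate_sums(name, n_summands, trials=20_000):
    """Simulate sums of n_summands draws; return CLT prediction and observed summary."""
    draw, mu, var = DISTRIBUTIONS[name]
    sums = [sum(draw() for _ in range(n_summands)) for _ in range(trials)]
    predicted = {"mean": n_summands * mu, "sd": math.sqrt(n_summands * var)}
    observed = {
        "mean": statistics.fmean(sums),
        "sd": statistics.pstdev(sums),
        "skew": skewness(sums),
    }
    return predicted, observed

for name in DISTRIBUTIONS:
    for n in (1, 2, 10):
        predicted, observed = simulate_sums(name, n)
        print(name, n, predicted, observed)
```

For the uniform case with N = 2 the predicted parameters are mean 1.00 and standard deviation about 0.408. For the exponential case the observed skewness should fall towards zero as N increases.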

A few things to take from playing with this:

This “everyone’s the same after enough adding” is what makes the Gaussian so dominant in physics. Almost every noise we measure is a sum of many tiny independent fluctuations — molecular collisions, photon counts, thermal motions — and the CLT says the sum is Gaussian regardless of the underlying micro-distribution.

History

From de Moivre to Laplace to Gauss

The bell curve’s first appearance was in 1733, when Abraham de Moivre computed the limiting shape of the binomial distribution as $n \to \infty$. He showed that the binomial probabilities $\binom{n}{k} p^k (1-p)^{n-k}$ are approximated by a Gaussian curve for large $n$, what we’d now call a special case of the Central Limit Theorem. The result was buried in an obscure pamphlet; few people read it.

The curve was rediscovered and popularised by Pierre-Simon Laplace, who derived a more general central-limit result in his 1812 Théorie analytique des probabilités. Laplace argued that sums of many independent measurement errors should be Gaussian-distributed, regardless of the individual error distributions — the modern CLT framing.

Carl Friedrich Gauss developed the distribution from a completely different angle in 1809: he asked, what distribution makes the sample mean the maximum-likelihood estimator of the true value? The unique answer is the Gaussian. This is why we call it Gaussian today, even though de Moivre had the curve some 75 years earlier and Laplace had the limit theorem.

The proof of the CLT in its modern form is due to Aleksandr Lyapunov in 1901 and Jarl Waldemar Lindeberg in 1922. The Lindeberg condition, a precise statement of “no individual $X_i$ should dominate the sum”, is what makes the theorem rigorous.

Multivariate Gaussian

For a vector-valued random variable $\mathbf{X} = (X_1, \ldots, X_d)$ in $\mathbb{R}^d$, the multivariate Gaussian with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$ has PDF

$$f(\mathbf{x}) \;=\; \frac{1}{\sqrt{(2\pi)^d\, \det \Sigma}}\, \exp\!\left( -\frac{1}{2}\, (\mathbf{x} - \boldsymbol{\mu})^T\, \Sigma^{-1}\, (\mathbf{x} - \boldsymbol{\mu}) \right).$$

The covariance matrix $\Sigma_{ij} = \mathbb{E}[(X_i - \mu_i)(X_j - \mu_j)]$ encodes both the spread of each component (diagonal entries) and the linear correlation between components (off-diagonal entries). For a multivariate Gaussian, a diagonal $\Sigma$ means the components are independent; for other distributions, zero correlation does not by itself imply independence.

The contours of constant density are ellipsoids aligned with the eigenvectors of $\Sigma$, a direct application of the eigenvalue analysis from Linear Algebra. The principal axes are the eigenvectors, and the semi-axis lengths are proportional to $\sqrt{\lambda_i}$, where $\lambda_i$ are the eigenvalues. Principal-component analysis (PCA) is exactly the eigendecomposition of an empirical covariance matrix.
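As a small worked instance of the eigen-picture, the sketch below handles the two-dimensional case in closed form. The function name `eigen_2x2_symmetric` and the example covariance matrix are hypothetical, chosen so the eigenvalues come out as round numbers; for larger matrices a numerical linear-algebra library would be the usual tool.

```python
import math

def eigen_2x2_symmetric(a, b, d):
    """Eigenpairs of the symmetric matrix [[a, b], [b, d]], largest eigenvalue first."""
    centre = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    lam1, lam2 = centre + radius, centre - radius
    # Angle of the major axis measured from the first coordinate axis.
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    v1 = (math.cos(theta), math.sin(theta))    # unit eigenvector for lam1
    v2 = (-math.sin(theta), math.cos(theta))   # unit eigenvector for lam2
    return (lam1, v1), (lam2, v2)

# Illustrative covariance matrix: variances 2.0 and 1.0, covariance 1.2.
a, b, d = 2.0, 1.2, 1.0
(lam1, v1), (lam2, v2) = eigen_2x2_symmetric(a, b, d)

print("eigenvalues:", lam1, lam2)
print("major axis direction:", v1)
print("semi-axis lengths of the 1-sigma ellipse:", math.sqrt(lam1), math.sqrt(lam2))

# Sanity check: Sigma @ v1 should equal lam1 * v1.
sigma_v1 = (a * v1[0] + b * v1[1], b * v1[0] + d * v1[1])
print("Sigma v1:", sigma_v1, " lam1 v1:", (lam1 * v1[0], lam1 * v1[1]))
```

The positive covariance tilts the major axis into the quadrant where both components increase together, which is the geometric meaning of a positive off-diagonal entry.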

The multivariate Gaussian is the natural workhorse for Bayesian inference on multi-dimensional parameters: conjugate Gaussian priors and Gaussian likelihoods produce Gaussian posteriors via a closed-form update, allowing entire belief states to be passed through algorithms as $(\boldsymbol{\mu}, \Sigma)$ pairs. The Kalman filter, the Gauss–Markov theorem, and most of “linear filtering theory” live in this corner of the world.
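The closed-form update is easiest to see in one dimension. The sketch below assumes a scalar unknown with a Gaussian prior and direct observations corrupted by Gaussian noise of known variance; the function name `gaussian_update` and all numbers are illustrative. The multivariate version replaces the scalar divisions with matrix inverses.

```python
def gaussian_update(prior_mean, prior_var, observation, noise_var):
    """Posterior (mean, variance) of a scalar theta with prior N(prior_mean, prior_var)
    after observing y = theta + noise, where noise ~ N(0, noise_var)."""
    gain = prior_var / (prior_var + noise_var)   # weight given to the new observation
    post_mean = prior_mean + gain * (observation - prior_mean)
    post_var = (1.0 - gain) * prior_var          # always smaller than prior_var
    return post_mean, post_var

# The belief state is just a (mean, variance) pair carried from step to step.
belief = (0.0, 4.0)                 # illustrative prior: N(0, 4)
for y in (2.0, 1.5, 1.8):           # made-up observations, noise variance 1
    belief = gaussian_update(belief[0], belief[1], y, noise_var=1.0)
    print(belief)
```

Each update adds precisions (inverse variances), so the posterior variance shrinks with every observation regardless of the values observed. This is the scalar form of the measurement step in a Kalman filter.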

Standard error and confidence

A practical corollary of the CLT concerns the sample mean of $n$ independent samples,

$$\bar X \;=\; \frac{1}{n} \sum_{i=1}^n X_i.$$

By the CLT, $\bar X$ is approximately Gaussian with mean $\mu$ (the true population mean) and variance $\sigma^2 / n$. The standard error of the sample mean is $\sigma / \sqrt{n}$.
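The mean and variance quoted here do not need the CLT at all; they follow from linearity of expectation and the independence of the $X_i$. A short derivation, using only the definitions already introduced:

$$\mathbb{E}[\bar X] \;=\; \frac{1}{n} \sum_{i=1}^n \mathbb{E}[X_i] \;=\; \mu, \qquad \operatorname{Var}(\bar X) \;=\; \frac{1}{n^2} \sum_{i=1}^n \operatorname{Var}(X_i) \;=\; \frac{n \sigma^2}{n^2} \;=\; \frac{\sigma^2}{n}.$$

What the CLT contributes is the shape: for large $n$ the distribution of $\bar X$ around $\mu$ is approximately Gaussian.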

This is the famous “$1/\sqrt{n}$ scaling” of measurement uncertainty. Averaging four independent measurements halves the uncertainty. Averaging 100 measurements reduces it by a factor of 10. Averaging a million measurements reduces it by 1000.
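The scaling is easy to see in a simulation. The sketch below assumes Gaussian measurement noise with unit standard deviation, a fixed seed, and 5,000 repeated experiments per sample size; all of these are arbitrary illustrative choices, as is the function name.

```python
import math
import random
import statistics

rng = random.Random(2)   # fixed seed so the run is repeatable
SIGMA = 1.0              # standard deviation of a single measurement (assumed)

def spread_of_sample_mean(n, repeats=5_000):
    """Standard deviation of the sample mean, estimated over many repeated experiments."""
    means = [
        statistics.fmean([rng.gauss(0.0, SIGMA) for _ in range(n)])
        for _ in range(repeats)
    ]
    return statistics.pstdev(means)

for n in (1, 4, 100):
    observed = spread_of_sample_mean(n)
    predicted = SIGMA / math.sqrt(n)
    print(f"n={n:4d}  observed={observed:.4f}  predicted={predicted:.4f}")
```

The observed spread should track $\sigma/\sqrt{n}$ to within the sampling noise of the simulation itself.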

The confidence interval for the true mean, given a sample mean $\bar X$ and a known $\sigma$, is $\bar X \pm z\, \sigma / \sqrt{n}$, where $z$ depends on the desired confidence level: $z = 1.96$ for 95% confidence, $z = 2.58$ for 99%, $z = 3$ for $\sim 99.7\%$. This is the $\pm 2\sigma$ / $\pm 3\sigma$ rule from earlier, with 1.96 rounded to 2.
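A minimal sketch of this interval in code, using `statistics.NormalDist` from the Python standard library to obtain $z$. The function name `mean_confidence_interval`, the measurement values, and the assumed known $\sigma = 0.2$ are hypothetical.

```python
import math
from statistics import NormalDist, fmean

def mean_confidence_interval(samples, sigma, confidence=0.95):
    """Two-sided interval for the true mean, assuming sigma is known."""
    n = len(samples)
    # z such that the central `confidence` mass of N(0, 1) lies in [-z, z].
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * sigma / math.sqrt(n)
    centre = fmean(samples)
    return centre - half_width, centre + half_width

# Made-up measurements with an assumed known per-measurement sigma.
measurements = [9.8, 10.1, 10.0, 9.9, 10.2]
low, high = mean_confidence_interval(measurements, sigma=0.2)
print("95% interval:", low, high)

# The z values quoted in the text, recovered from the standard normal.
z95 = NormalDist().inv_cdf(0.975)
z99 = NormalDist().inv_cdf(0.995)
coverage_3sigma = NormalDist().cdf(3.0) - NormalDist().cdf(-3.0)
print(z95, z99, coverage_3sigma)
```

Raising `confidence` widens the interval, while collecting more samples narrows it in proportion to $1/\sqrt{n}$.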

When $\sigma$ is unknown and must be estimated from the same sample, the Gaussian is replaced by Student’s t-distribution, which has heavier tails to account for the additional uncertainty in $\sigma$. For sample sizes $n > 30$ the t-distribution is close to the Gaussian (at 30 degrees of freedom the 95% critical value is about 2.04, against 1.96), and much practical work uses the Gaussian approximation throughout.

What we use this for

Gaussians and the CLT show up wherever many small noises add:

The next lesson, 11.3, develops random walks — the sum-of-i.i.d. picture of the CLT used to derive Brownian motion and the diffusion equation.