11.2 The Gaussian and the central limit theorem

The Gaussian distribution (or normal distribution) is the single most-used probability distribution in science. There is a reason: the suitably centred and scaled sum of many independent, identically distributed random variables with finite variance converges to it. This is the Central Limit Theorem. It is why so much of the “noise” we measure in physics, biology, finance, and engineering is approximately Gaussian: such noise is often a sum of many independent micro-fluctuations.

This lesson develops the Gaussian, derives why it has the form it does, states the Central Limit Theorem, and demonstrates the CLT with an interactive figure. We also touch on the multivariate Gaussian, which becomes central in Bayesian inference.

The Gaussian PDF

The one-dimensional Gaussian (or normal) distribution with mean $\mu$ and variance $\sigma^2$ is

\boxed{\;f(x) \;=\; \frac{1}{\sigma\sqrt{2\pi}}\, \exp\!\left( -\frac{(x - \mu)^2}{2 \sigma^2} \right).\;}

It is the familiar “bell curve” centred at $\mu$ with width $\sigma$ . The notation $\mathcal{N}(\mu, \sigma^2)$ denotes “a Gaussian with mean $\mu$ and variance $\sigma^2$ .” A standard normal is $\mathcal{N}(0, 1)$ .

Three features encode almost everything about the Gaussian’s behaviour:

Symmetric about the mean. $f(\mu + t) = f(\mu - t)$ for any $t$ .
Exponentially small tails. Relative to its peak value, $f(\mu + k\sigma)$ falls as $e^{-k^2/2}$, which is faster than any simple exponential in $k$. At $|x - \mu| = 3\sigma$ the density is already $\sim 1\%$ of its peak. By $5\sigma$ it is $\sim 4 \times 10^{-6}$.
The 68-95-99.7 rule. About 68% of the probability lies within $\pm 1\sigma$ of the mean, 95% within $\pm 2\sigma$ , 99.7% within $\pm 3\sigma$ .
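The tail and coverage figures above can be checked numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper names `prob_within` and `density_ratio` are illustrative, not part of the lesson.

```python
import math
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def prob_within(k: float) -> float:
    """P(|X - mu| <= k*sigma) for any Gaussian; it depends only on k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

def density_ratio(k: float) -> float:
    """f(mu + k*sigma) / f(mu), which equals exp(-k**2 / 2)."""
    return std_normal.pdf(k) / std_normal.pdf(0.0)

for k in (1, 2, 3, 5):
    print(f"k={k}: P(within k sigma)={prob_within(k):.5f}, "
          f"density/peak={density_ratio(k):.2e}")
```

Because both quantities depend only on the number of standard deviations $k$, the same numbers apply to every $\mathcal{N}(\mu, \sigma^2)$.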

The normalisation constant $1/(\sigma\sqrt{2\pi})$ is fixed by the requirement that $\int f(x)\, dx = 1$ :

▶ Why the normalisation involves √(2π) Derivation

Compute the integral

I \;=\; \int_{-\infty}^{\infty} e^{-x^2/2}\, dx.

There is no elementary antiderivative. The trick is to square it:

I^2 \;=\; \left( \int_{-\infty}^{\infty} e^{-x^2/2}\, dx \right) \left( \int_{-\infty}^{\infty} e^{-y^2/2}\, dy \right) \;=\; \int\!\!\int_{\mathbb{R}^2} e^{-(x^2 + y^2)/2}\, dx\, dy.

Switch to polar coordinates $(r, \theta)$ with $r^2 = x^2 + y^2$ and $dx\, dy = r\, dr\, d\theta$ :

I^2 \;=\; \int_0^{2\pi} d\theta \int_0^\infty e^{-r^2/2}\, r\, dr \;=\; 2\pi \cdot \left[ -e^{-r^2/2} \right]_0^\infty \;=\; 2\pi.

So $I = \sqrt{2\pi}$ . For a Gaussian with general $\mu$ and $\sigma$ , substitute $u = (x - \mu)/\sigma$ , $du = dx/\sigma$ :

\int e^{-(x - \mu)^2 / (2 \sigma^2)}\, dx \;=\; \sigma \int e^{-u^2/2}\, du \;=\; \sigma \sqrt{2\pi}.

Hence the normalisation factor $1/(\sigma\sqrt{2\pi})$ .
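As a sanity check on the derivation, the integral can also be estimated numerically. This is a rough sketch using the trapezoid rule on a truncated interval; the function name, the $\pm 10\sigma$ cut-off, and the step count are arbitrary choices made for illustration.

```python
import math

def gaussian_integral(mu=0.0, sigma=1.0, half_width=10.0, steps=20_000):
    """Trapezoid-rule estimate of the integral of exp(-(x - mu)^2 / (2 sigma^2)).

    The range is truncated to mu +/- half_width * sigma, where the
    integrand is negligibly small.
    """
    a = mu - half_width * sigma
    b = mu + half_width * sigma
    h = (b - a) / steps

    def g(x):
        return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

    total = 0.5 * (g(a) + g(b))
    for i in range(1, steps):
        total += g(a + i * h)
    return total * h

print(gaussian_integral(), math.sqrt(2 * math.pi))               # sigma = 1
print(gaussian_integral(1.5, 3.0), 3.0 * math.sqrt(2 * math.pi))  # sigma = 3
```

The two numbers on each printed line should agree closely, matching $\sigma\sqrt{2\pi}$ for the chosen $\sigma$.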

Why the Gaussian is everywhere: the Central Limit Theorem

The Gaussian’s ubiquity is not a coincidence. The Central Limit Theorem says:

Let $X_1, X_2, \ldots, X_n$ be independent, identically distributed random variables with mean $\mu$ and finite variance $\sigma^2$ . Define the sample sum $S_n = X_1 + X_2 + \cdots + X_n$ . Then as $n \to \infty$ ,

$\frac{S_n - n\mu}{\sigma \sqrt{n}} \;\longrightarrow\; \mathcal{N}(0, 1)$

in distribution.

In words: for large $n$, the sum of $n$ independent samples is approximately Gaussian with mean $n\mu$ and variance $n\sigma^2$, whatever the underlying distribution of the $X_i$, provided its variance is finite. The convergence is in distribution: the CDF of the standardised sum approaches the standard Gaussian CDF at every point.

This is one of the deepest theorems in probability. It does not depend on what the $X_i$ actually are (Bernoulli, exponential, uniform, an irregular bimodal mixture), only on the finite-variance condition. The Gaussian is the attractor under summation, much as the Gaussian heat kernel is the attractor for solutions of the heat equation under time evolution.

The CLT, made visible

[Interactive figure: histogram of the sum of N independent samples, with controls for the underlying distribution (uniform, exponential, or bimodal) and the number of summands N. The red curve is the theoretical CLT prediction, a Gaussian with mean Nμ and variance Nσ², where μ and σ² are the mean and variance of a single sample. The convergence happens for any distribution with finite variance; the only thing that changes is how fast.]

Pick an underlying distribution — uniform, exponential, or bimodal — and slide the number of summands $N$ from 1 to 30. The histogram is the empirical distribution of $\sum_{i=1}^N X_i$ over 20,000 trials; the red curve is the theoretical CLT prediction $\mathcal{N}(N\mu, N\sigma^2)$ .

A few things to take from playing with this:

At $N = 1$ , the histogram traces the underlying distribution itself.
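By $N = 5$ the shape is already roughly bell-shaped, and for the uniform distribution it is nearly Gaussian.
By $N = 10$ it is visually close to the theoretical Gaussian, although a skewed underlying distribution such as the exponential still shows some right skew.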
The mean of the sum grows linearly: $\mathbb{E}[S_N] = N \mu$ . The standard deviation grows as the square root: $\mathrm{std}(S_N) = \sigma \sqrt{N}$ . The relative width $\sigma_{S_N} / |\mathbb{E}[S_N]|$ therefore shrinks as $1/\sqrt{N}$ — the sum becomes proportionally tighter around its mean.
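For the bimodal distribution, the histogram retains a slight bumpiness at small $N$, but the convergence still occurs. The exponential is the slowest to lose its asymmetry: a sum of $N$ exponential samples has skewness $2/\sqrt{N}$, so a right skew persists beyond $N = 10$. The symmetric uniform distribution has no skewness to lose and converges quickly.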
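The observations above can be reproduced offline with a small simulation. This is a sketch under stated assumptions: the uniform is taken on $[0, 1]$ and the exponential has rate 1 (the interactive's actual parameters are not given here), and the helper names are illustrative.

```python
import math
import random
import statistics

def sample_sums(draw, n_summands, trials, rng):
    """Return `trials` independent realisations of X_1 + ... + X_N."""
    return [sum(draw(rng) for _ in range(n_summands)) for _ in range(trials)]

def skewness(values):
    """Sample skewness: third central moment divided by std**3."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)

def uniform(r):
    return r.random()            # mean 1/2, variance 1/12, skewness 0

def exponential(r):
    return r.expovariate(1.0)    # mean 1, variance 1, skewness 2

rng = random.Random(0)  # fixed seed so the run is repeatable
cases = (("uniform", uniform, 0.5, 1 / 12), ("exponential", exponential, 1.0, 1.0))
for name, draw, mu, var in cases:
    for n in (1, 5, 30):
        sums = sample_sums(draw, n, 20_000, rng)
        mean_ratio = statistics.fmean(sums) / (n * mu)      # should be near 1
        var_ratio = statistics.pvariance(sums) / (n * var)  # should be near 1
        print(f"{name:<12} N={n:>2}  mean/(N mu)={mean_ratio:.3f}  "
              f"var/(N sigma^2)={var_ratio:.3f}  skew={skewness(sums):.3f}")
```

The mean and variance ratios should sit near 1 for every $N$, while the skewness of the exponential sums should decay roughly as $2/\sqrt{N}$. A Gaussian has skewness 0, so this gives one simple numerical measure of how far the sum still is from its limit.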

This “everyone’s the same after enough adding” is what makes the Gaussian so dominant in physics. Almost every noise we measure is a sum of many tiny independent fluctuations — molecular collisions, photon counts, thermal motions — and the CLT says the sum is Gaussian regardless of the underlying micro-distribution.

History

⏳ The history — From de Moivre to Laplace to Gauss

The bell curve’s first appearance was in 1733, when Abraham de Moivre computed the limiting shape of the binomial distribution as $n \to \infty$ . He showed that the binomial probabilities $\binom{n}{k} p^k (1-p)^{n-k}$ are approximated by a Gaussian for large $n$ , what we’d now call a special case of the Central Limit Theorem. The result was buried in an obscure pamphlet; few people read it.

The curve was rediscovered and popularised by Pierre-Simon Laplace, who derived a more general central-limit result in his 1812 Théorie analytique des probabilités. Laplace argued that sums of many independent measurement errors should be Gaussian-distributed, regardless of the individual error distributions — the modern CLT framing.

Carl Friedrich Gauss developed the distribution from a completely different angle in 1809: he asked, what distribution of errors makes the sample mean the maximum-likelihood estimator of the true value? The unique answer is the Gaussian. This is why we call it Gaussian today, even though de Moivre had the curve some 75 years earlier and Laplace had the limit theorem.

The proof of the CLT in its modern form is due to Aleksandr Lyapunov in 1901 and Jarl Waldemar Lindeberg in 1922. The Lindeberg condition, a precise statement of “no individual $X_i$ should dominate the sum”, is a sufficient condition that also covers independent summands that are not identically distributed.

Read the original: Théorie analytique des probabilités (Pierre-Simon Laplace, 1812) — French

Multivariate Gaussian

For a vector-valued random variable $\mathbf{X} = (X_1, \ldots, X_d)$ in $\mathbb{R}^d$ , the multivariate Gaussian with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$ has PDF

f(\mathbf{x}) \;=\; \frac{1}{\sqrt{(2\pi)^d\, \det \Sigma}}\, \exp\!\left( -\frac{1}{2}\, (\mathbf{x} - \boldsymbol{\mu})^T\, \Sigma^{-1}\, (\mathbf{x} - \boldsymbol{\mu}) \right).

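The covariance matrix $\Sigma_{ij} = \mathbb{E}[(X_i - \mu_i)(X_j - \mu_j)]$ encodes both the spread of each component (diagonal entries) and the linear correlation between components (off-diagonal entries). A diagonal $\Sigma$ means the components are uncorrelated, and for a multivariate Gaussian uncorrelated components are also independent. This implication does not hold for general distributions.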

The contours of constant density are ellipsoids centred on $\boldsymbol{\mu}$ and aligned with the eigenvectors of $\Sigma$, which is directly the eigenvalue analysis from Linear Algebra. The principal axes are the eigenvectors, and the semi-axis lengths are proportional to $\sqrt{\lambda_i}$, where $\lambda_i$ are the eigenvalues. Principal-component analysis (PCA) is exactly the eigendecomposition of an empirical covariance matrix.
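For a two-dimensional Gaussian the eigen-analysis has a closed form, which makes the geometry easy to inspect. The sketch below assumes a valid (positive semi-definite) $2 \times 2$ covariance matrix; the function name and the example numbers are hypothetical.

```python
import math

def ellipse_axes_2d(sxx, sxy, syy):
    """Eigen-decomposition of the symmetric matrix [[sxx, sxy], [sxy, syy]].

    Returns ((major, minor), angle): the square roots of the two
    eigenvalues and the angle (radians, from the x-axis) of the
    eigenvector belonging to the larger eigenvalue.
    """
    centre = 0.5 * (sxx + syy)
    radius = math.hypot(0.5 * (sxx - syy), sxy)
    lam_major = centre + radius
    lam_minor = max(centre - radius, 0.0)  # guard against rounding below zero
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (math.sqrt(lam_major), math.sqrt(lam_minor)), angle

# Hypothetical covariance: variances 4 and 1, covariance 1.2 (correlation 0.6).
(major, minor), angle = ellipse_axes_2d(4.0, 1.2, 1.0)
print(f"semi-axes {major:.3f} and {minor:.3f}, "
      f"major axis tilted {math.degrees(angle):.1f} degrees")
```

Scaling both semi-axes by the same factor $c$ gives the contour at Mahalanobis distance $c$ from the mean. In higher dimensions a numerical eigensolver would replace the closed form.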

The multivariate Gaussian is the natural workhorse for Bayesian inference on multi-dimensional parameters: conjugate Gaussian priors and Gaussian likelihoods produce Gaussian posteriors via a closed-form update, allowing entire belief states to be passed through algorithms as $(\boldsymbol{\mu}, \Sigma)$ pairs. The Kalman filter, the Gauss–Markov theorem, and most of “linear filtering theory” live in this corner of the world.
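To make the closed-form update concrete, here is a sketch of the simplest scalar case: a Gaussian prior on an unknown $\theta$ combined with a single observation with Gaussian noise. The function and parameter names are illustrative; in the multivariate case variances become covariance matrices and the division becomes a matrix inverse.

```python
def gaussian_update(prior_mean, prior_var, obs, obs_var):
    """Posterior for theta given prior N(prior_mean, prior_var) and one
    observation obs ~ N(theta, obs_var).

    Returns (post_mean, post_var). Equivalent to adding precisions:
    1/post_var = 1/prior_var + 1/obs_var.
    """
    gain = prior_var / (prior_var + obs_var)   # weight given to the data
    post_mean = prior_mean + gain * (obs - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var

# Hypothetical numbers: vague prior N(0, 4), observation 2.0 with noise variance 1.
mean, var = gaussian_update(0.0, 4.0, 2.0, 1.0)
print(mean, var)
```

The belief state before and after the update is just a (mean, variance) pair, which is the scalar version of passing $(\boldsymbol{\mu}, \Sigma)$ through an algorithm.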

▶ Probability within an interval using the 68-95-99.7 rule Worked Example

Hearing thresholds across a population are approximately Gaussian with $\mu = 20\,\text{dB HL}$ and $\sigma = 8\,\text{dB}$ . What fraction of the population has thresholds between 4 dB and 36 dB?

Convert to standard deviations: the interval $[4, 36]$ is $[\mu - 2\sigma,\; \mu + 2\sigma] = [20 - 16,\; 20 + 16]$ .

By the 68-95-99.7 rule, approximately 95% of the population falls within $\pm 2\sigma$ of the mean. So about 95% have thresholds in $[4, 36]\,\text{dB HL}$ .

Equivalently: about 2.5% have thresholds below 4 dB (exceptionally sensitive), and 2.5% above 36 dB (mild hearing loss by this criterion).
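The rule-of-thumb answer can be compared with the exact Gaussian probabilities. This is a short sketch using the standard-library `statistics.NormalDist` with the $\mu$ and $\sigma$ from the example; the variable names are illustrative.

```python
from statistics import NormalDist

thresholds = NormalDist(mu=20.0, sigma=8.0)   # dB HL, values from the example

within = thresholds.cdf(36.0) - thresholds.cdf(4.0)  # P(4 <= X <= 36)
below = thresholds.cdf(4.0)                          # P(X < 4)
above = 1.0 - thresholds.cdf(36.0)                   # P(X > 36)

print(f"within: {within:.4f}, below: {below:.4f}, above: {above:.4f}")
```

The exact $\pm 2\sigma$ probability is about 95.45%, so each tail holds about 2.3% rather than the rounded 2.5%.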

Standard error and confidence

A practical corollary of the CLT. The sample mean of $n$ independent samples is

\bar X \;=\; \frac{1}{n} \sum_{i=1}^n X_i.

By the CLT, $\bar X$ is approximately Gaussian with mean $\mu$ (the true population mean) and variance $\sigma^2 / n$ . The standard error of the sample mean is $\sigma / \sqrt{n}$ .

This is the famous “$1/\sqrt{n}$ scaling” of measurement uncertainty. Averaging four independent measurements halves the uncertainty. Averaging 100 measurements reduces it by a factor of 10. Averaging a million measurements reduces it by a factor of 1000.
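The scaling is easy to see in simulation. The sketch below draws Gaussian samples with $\sigma = 1$, averages them in groups of $n$, and compares the spread of those averages with $\sigma/\sqrt{n}$; the function name, seed, and trial counts are arbitrary choices for illustration.

```python
import math
import random
import statistics

def spread_of_sample_mean(n, trials, rng, sigma=1.0):
    """Empirical standard deviation of the mean of n Gaussian samples."""
    means = [
        statistics.fmean([rng.gauss(0.0, sigma) for _ in range(n)])
        for _ in range(trials)
    ]
    return statistics.pstdev(means)

rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (1, 4, 100):
    observed = spread_of_sample_mean(n, trials=5_000, rng=rng)
    predicted = 1.0 / math.sqrt(n)
    print(f"n={n:>3}: observed {observed:.3f}, sigma/sqrt(n) = {predicted:.3f}")
```

The observed and predicted columns should agree to within sampling noise of a few percent.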

When $\sigma$ is known, the confidence interval for the true mean, given a sample mean $\bar X$ , is $\bar X \pm z\, \sigma / \sqrt{n}$ , where $z$ depends on the desired confidence level: $z = 1.96$ for 95% confidence, $z = 2.58$ for 99%, $z = 3$ for $\sim 99.7\%$ . These are the $\pm 2\sigma$ / $\pm 3\sigma$ figures from the 68-95-99.7 rule, with 1.96 rounded to 2.
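As an illustration, the interval can be computed directly, with $z$ obtained from the inverse Gaussian CDF instead of a table. This is a sketch assuming a known $\sigma$; the function name `z_interval` and the sample values are hypothetical.

```python
import math
from statistics import NormalDist, fmean

def z_interval(samples, sigma, level=0.95):
    """Confidence interval for the mean, assuming sigma is known.

    Returns (lower, upper) = sample mean -/+ z * sigma / sqrt(n).
    """
    n = len(samples)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # two-sided critical value
    half_width = z * sigma / math.sqrt(n)
    centre = fmean(samples)
    return centre - half_width, centre + half_width

# Hypothetical readings with an assumed known sigma of 2.0.
lower, upper = z_interval([1.0, 2.0, 3.0, 4.0], sigma=2.0)
print(f"95% interval: [{lower:.3f}, {upper:.3f}]")
print(f"z at 99%: {NormalDist().inv_cdf(0.995):.3f}")
```

Passing a different `level` changes only $z$. The interval width still shrinks as $1/\sqrt{n}$.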

When $\sigma$ is unknown and must be estimated from the same sample, the Gaussian is replaced by Student’s t-distribution, which has slightly heavier tails to account for the additional uncertainty in $\sigma$ . For sample sizes $n > 30$ the t-distribution is close to the Gaussian (at $n = 31$ the 95% critical value is about 2.04, against 1.96), and much practical work uses the Gaussian approximation throughout.

What we use this for

Gaussians and the CLT show up wherever many small noises add:

Thermal noise: Johnson–Nyquist voltage fluctuations across a resistor, Gaussian with variance $4 k_B T R \Delta f$ .
Measurement error: any reading of a noisy instrument is approximately Gaussian about the true value if the underlying physics has many independent error sources.
Diffusion: the spatial distribution of a Brownian particle is Gaussian with variance growing linearly in time. Developed properly in 11.3.
Photon shot noise at high intensity: a Poisson distribution with large mean is approximately Gaussian, with variance equal to the mean (CLT applied to the Poisson distribution).
Bayesian inference with Gaussian priors and likelihoods: closed-form Gaussian posteriors. Developed in 11.5.
Modal density and statistical room acoustics: the random superposition of many room modes gives a sound pressure that is approximately Gaussian by the CLT. Connects to Sound 7.8.
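The shot-noise item can be made quantitative by comparing the Poisson probabilities with a Gaussian density of the same mean and variance. This is a rough sketch; the function names and the choice of a $\pm 6\sigma$ comparison window are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def poisson_pmf(k, lam):
    """Poisson probability P(K = k), computed in log space to avoid overflow."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def relative_gap(lam):
    """Largest |Poisson pmf - Gaussian pdf| within 6 sigma of the mean,
    expressed as a fraction of the Gaussian peak height."""
    sigma = math.sqrt(lam)
    approx = NormalDist(mu=lam, sigma=sigma)  # same mean, variance = mean
    lo = max(0, int(lam - 6 * sigma))
    hi = int(lam + 6 * sigma)
    gap = max(abs(poisson_pmf(k, lam) - approx.pdf(k)) for k in range(lo, hi + 1))
    return gap / approx.pdf(lam)

for lam in (4, 100, 10_000):
    print(f"mean count {lam:>6}: relative gap {relative_gap(lam):.4f}")
```

The gap should shrink as the mean count grows, which is the sense in which high-intensity shot noise is approximately Gaussian.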

The next lesson, 11.3, develops random walks — the sum-of-i.i.d. picture of the CLT used to derive Brownian motion and the diffusion equation.