Probability is half the story. The other half is inference: given observed data, what can we conclude about the underlying state of the world? This lesson develops Bayes’ rule — the single most-used identity in inferential statistics — and applies it to two settings the bookshelf needs: Bayesian inference (continuous-parameter estimation) and signal detection theory (discriminating signal from noise). Both are central to perception, which is the bridge to Hearing Ch 8.
Bayes’ rule
The joint probability Pr(A and B) of two events can be factored two ways:
Pr(A,B)=Pr(A∣B)Pr(B)=Pr(B∣A)Pr(A).
Equating the two right-hand sides and rearranging:
$$\Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}.$$
This is Bayes’ rule. It’s a one-line consequence of the definition of conditional probability. The novelty is the interpretation — and a whole school of statistics is built on it.
In an inferential setting, replace A with a hypothesis H (the parameter being estimated) and B with the data D (the observation). Bayes’ rule becomes
$$\Pr(H \mid D) = \frac{\Pr(D \mid H)\,\Pr(H)}{\Pr(D)},$$
where the four terms are:
Prior Pr(H): your belief about the hypothesis before seeing the data. Encodes background knowledge, default expectations, or “ignorance” if you want to be uninformative.
Likelihood Pr(D∣H): the probability of observing the data if the hypothesis were true. This is what the physical/biological model gives you (Gaussian noise, Poisson statistics, etc.).
Posterior Pr(H∣D): your updated belief about the hypothesis after seeing the data. The output of inference.
Evidence (or marginal likelihood) $\Pr(D) = \sum_H \Pr(D \mid H)\,\Pr(H)$: a normalising constant that makes the posterior sum to 1.
The structure is symmetric: data updates the prior to a posterior via the likelihood. If you then collect more data, the current posterior becomes the new prior for the next update. This sequential property — Bayesian updating — is what makes Bayes’ rule the inferential workhorse of online estimation, Kalman filtering, and modern probabilistic-machine-learning systems.
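The update and its sequential property can be sketched for a discrete set of hypotheses. This is a minimal illustration, not a library API: the function name `bayes_update` and the two-coin numbers are hypothetical, chosen only to show that updating one observation at a time gives the same posterior as a single update on the whole data set.

```python
def bayes_update(prior, likelihood):
    """Return the posterior over hypotheses given a prior and Pr(D | H).

    Both arguments are dicts keyed by hypothesis name.
    """
    # Unnormalised posterior: likelihood times prior, hypothesis by hypothesis.
    unnorm = {h: likelihood[h] * prior[h] for h in prior}
    # Evidence Pr(D): the sum over hypotheses, used as the normalising constant.
    evidence = sum(unnorm.values())
    return {h: p / evidence for h, p in unnorm.items()}


# Hypothetical numbers: a fair coin versus a coin biased towards heads.
prior = {"fair": 0.5, "biased": 0.5}
pr_heads = {"fair": 0.5, "biased": 0.8}
pr_tails = {h: 1.0 - p for h, p in pr_heads.items()}

# Sequential updating: each posterior becomes the prior for the next flip.
belief = prior
for flip in "HHT":
    belief = bayes_update(belief, pr_heads if flip == "H" else pr_tails)

# Batch updating: one update with the likelihood of the whole sequence.
pr_sequence = {h: pr_heads[h] ** 2 * pr_tails[h] for h in prior}
batch = bayes_update(prior, pr_sequence)

print(belief)
print(batch)
```

The two printed posteriors agree up to floating-point rounding, because the likelihood of independent observations is a product and the normalisation can be applied at any stage.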
A continuous example: estimating a Gaussian mean
The textbook starter example. We want to estimate the mean μ of a Gaussian distribution with known variance $\sigma_{\text{obs}}^2$. We observe n independent samples $x_1, x_2, \ldots, x_n$. We have a Gaussian prior on μ: $\mathcal{N}(\mu_0, \sigma_0^2)$.
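The posterior is also Gaussian (a Gaussian prior is conjugate to a Gaussian likelihood with known variance, a beautiful algebraic gift), with parameters
$$\boxed{\;\frac{1}{\sigma_{\text{post}}^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma_{\text{obs}}^2}, \qquad \mu_{\text{post}} = \sigma_{\text{post}}^2\left(\frac{\mu_0}{\sigma_0^2} + \frac{n\,\bar{x}}{\sigma_{\text{obs}}^2}\right)\;}$$
where $\bar{x} = \frac{1}{n}\sum_i x_i$ is the sample mean.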
In words: the posterior precision (reciprocal of variance) is the sum of the prior precision and the data precision. The posterior mean is the precision-weighted average of the prior mean and the sample mean.
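The precision-weighted rule just described in words can be written as a few lines of code. This is a minimal sketch under the lesson's assumptions (Gaussian prior, known observation noise); the function name `gaussian_posterior` and the sample values are illustrative, not taken from the source.

```python
from statistics import fmean


def gaussian_posterior(mu0, sigma0, sigma_obs, xs):
    """Posterior mean and standard deviation for a Gaussian mean.

    Assumes a N(mu0, sigma0^2) prior and known observation noise sigma_obs.
    """
    n = len(xs)
    prior_precision = 1.0 / sigma0**2      # reciprocal of the prior variance
    data_precision = n / sigma_obs**2      # n samples, each with variance sigma_obs^2
    post_precision = prior_precision + data_precision
    # Precision-weighted average of the prior mean and the sample mean.
    post_mean = (prior_precision * mu0 + data_precision * fmean(xs)) / post_precision
    return post_mean, post_precision ** -0.5


# Hypothetical data: four samples with mean 2.0, noise sigma_obs = 2, prior N(0, 1).
mu_post, sigma_post = gaussian_posterior(mu0=0.0, sigma0=1.0, sigma_obs=2.0,
                                         xs=[1.8, 2.4, 2.1, 1.7])
print(mu_post, sigma_post)
```

In this example the prior precision (1) equals the data precision (4/4 = 1), so the posterior mean lands halfway between the prior mean 0 and the sample mean 2, and the posterior variance is half the prior variance.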
Why Gaussian × Gaussian = Gaussian
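The prior is $f(\mu) \propto \exp\!\left(-(\mu-\mu_0)^2/(2\sigma_0^2)\right)$. The likelihood for observations $x_1, \ldots, x_n$ is
$$\Pr(x \mid \mu) \propto \exp\!\left(-\frac{\sum_i (x_i-\mu)^2}{2\sigma_{\text{obs}}^2}\right).$$
The posterior (up to normalisation) is the product:
$$\Pr(\mu \mid x) \propto \exp\!\left(-\frac{(\mu-\mu_0)^2}{2\sigma_0^2} - \frac{\sum_i (x_i-\mu)^2}{2\sigma_{\text{obs}}^2}\right).$$
The exponent is a quadratic in μ with a negative μ² coefficient, and any such exponent is a Gaussian in μ. Complete the square: write the exponent as $-(\mu-\mu_{\text{post}})^2/(2\sigma_{\text{post}}^2)$ plus a μ-independent constant. Matching coefficients of μ² and μ gives
$$\frac{1}{\sigma_{\text{post}}^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma_{\text{obs}}^2}, \qquad \frac{\mu_{\text{post}}}{\sigma_{\text{post}}^2} = \frac{\mu_0}{\sigma_0^2} + \frac{\sum_i x_i}{\sigma_{\text{obs}}^2}.$$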
Solving for $\mu_{\text{post}}$ and $\sigma_{\text{post}}^2$ gives the boxed formulas. A Gaussian prior combined with a Gaussian likelihood yields a Gaussian posterior, a property called conjugacy, and the algebraic update is just precision-weighted addition.
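The completing-the-square result can be sanity-checked numerically: evaluate prior × likelihood on a grid of μ values, normalise, and compare the grid mean and variance with the precision-weighted closed form. The sketch below is only a check under assumed example values; the function name, grid limits, and data are hypothetical.

```python
import math


def grid_posterior_moments(mu0, sigma0, sigma_obs, xs, lo=-10.0, hi=10.0, steps=20001):
    """Mean and variance of prior x likelihood, evaluated on a grid of mu values."""
    step = (hi - lo) / (steps - 1)
    grid = [lo + i * step for i in range(steps)]
    # Log of the unnormalised posterior: prior exponent plus likelihood exponent.
    log_post = [
        -(mu - mu0) ** 2 / (2 * sigma0**2)
        - sum((x - mu) ** 2 for x in xs) / (2 * sigma_obs**2)
        for mu in grid
    ]
    peak = max(log_post)
    # Subtract the peak before exponentiating to avoid underflow.
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    mean = sum(w * mu for w, mu in zip(weights, grid)) / total
    var = sum(w * (mu - mean) ** 2 for w, mu in zip(weights, grid)) / total
    return mean, var


# Hypothetical example values: prior N(0, 1), sigma_obs = 2, four samples.
xs = [1.8, 2.4, 2.1, 1.7]
grid_mean, grid_var = grid_posterior_moments(0.0, 1.0, 2.0, xs)

# Closed form: precisions add, and the mean is the precision-weighted average.
post_var = 1.0 / (1.0 / 1.0**2 + len(xs) / 2.0**2)
post_mean = post_var * (0.0 / 1.0**2 + sum(xs) / 2.0**2)

print(grid_mean, post_mean)
print(grid_var, post_var)
```

The brute-force grid makes no use of conjugacy, so agreement between the two pairs of numbers is an independent confirmation of the algebra.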
The blue prior is your belief about parameter μ before seeing any data. The red likelihood peaks at the empirical mean of the observations and is sharpened by both larger sample size and lower observation noise. The green posterior is Bayes' rule: prior × likelihood, renormalised. With many observations, the posterior tracks the likelihood (data dominates). With few observations or high noise, the prior pulls the posterior toward itself. This is the inferential engine underwriting [Hearing 8 — perception as Bayesian inference](/hearing/meaning/bayes).
Drag the prior mean and width; drag the observations; watch the posterior reform as the precision-weighted compromise. Three things to feel for:
With a tight prior, the posterior tracks the prior. Even when the data say otherwise, a confident prior pulls hard. You’d need many strong observations to overcome it.
With a flat prior, the posterior tracks the data. Letting $\sigma_0 \to \infty$ makes the prior uninformative; the posterior becomes the normalised likelihood, centred at the sample mean.
More observations sharpen the posterior. Once the data precision dominates the prior precision, the posterior variance shrinks like 1/n, a direct consequence of the precision-additivity rule.
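For readers without the interactive figure, the three behaviours can be reproduced numerically. The sketch below uses the precision-weighted update with hypothetical settings (prior centred at 0, sample mean 2, unit observation noise); the helper name `posterior` is illustrative.

```python
def posterior(mu0, sigma0, sigma_obs, xbar, n):
    """Posterior mean and variance from the precision-weighted update."""
    precision = 1.0 / sigma0**2 + n / sigma_obs**2
    mean = (mu0 / sigma0**2 + n * xbar / sigma_obs**2) / precision
    return mean, 1.0 / precision


# 1. Tight prior (sigma0 = 0.1): the posterior mean stays close to the prior mean 0.
tight_mean, _ = posterior(mu0=0.0, sigma0=0.1, sigma_obs=1.0, xbar=2.0, n=5)

# 2. Nearly flat prior (sigma0 = 1e6): the posterior mean sits at the sample mean 2.
flat_mean, _ = posterior(mu0=0.0, sigma0=1e6, sigma_obs=1.0, xbar=2.0, n=5)

# 3. More observations: with a flat prior the variance is close to sigma_obs^2 / n.
variances = [posterior(0.0, 1e6, 1.0, 2.0, n)[1] for n in (1, 10, 100)]

print(tight_mean, flat_mean, variances)
```

With the tight prior the five observations move the mean less than a tenth of the way towards the data, while the near-flat prior leaves the data in full control.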
A note on priors
The Bayesian framework requires you to specify a prior: what you believed before seeing the data. This is sometimes felt as a weakness (“subjective!”), and various attempts have been made to extract “objective” priors from symmetry or invariance arguments. In practice the prior matters most when data is scarce. Once you have many observations, the data dominates and the choice of prior becomes largely irrelevant (provided the prior gives the true value non-zero probability), which is exactly what one would want.
The Bayesian and frequentist schools of statistics differ chiefly in whether they treat the parameter as having a distribution. To a Bayesian, μ is a random variable with a posterior; to a frequentist, μ is a fixed (unknown) number and the data is random. The two formalisms produce numerically identical answers in many practical settings — the disagreement is philosophical, not arithmetic.
Signal detection theory
A related but distinct inferential setting: given a single noisy observation, decide between two hypotheses. Is there a signal, or is it just noise? This is the signal detection problem, and it underpins all of psychophysics (including hearing thresholds), radar processing, medical screening, and audio compression artefact detection.
The classical setup: under the “noise-only” hypothesis H0, the observation X is drawn from a distribution f0. Under the “signal-plus-noise” hypothesis H1, X is drawn from a distribution f1 (typically the same shape as f0 shifted by the signal amplitude). The optimal Bayesian decision rule is to compute the likelihood ratio
$$L(x) = \frac{f_1(x)}{f_0(x)}$$
and compare to a threshold. If L>c, declare signal; otherwise declare noise. The threshold c encodes the costs of the two types of error and the prior probabilities of the hypotheses.
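A likelihood-ratio test for two Gaussians can be sketched as follows. The threshold formula used here, c = [Pr(H0)/Pr(H1)] × [cost of a false alarm / cost of a miss], is the standard minimum-expected-cost choice when correct decisions cost nothing; the text above does not spell it out, so treat it as an assumption, along with the function names and example numbers.

```python
from statistics import NormalDist


def likelihood_ratio(x, noise, signal):
    """L(x) = f1(x) / f0(x) for the two candidate distributions."""
    return signal.pdf(x) / noise.pdf(x)


def decide(x, noise, signal, p_signal=0.5, cost_false_alarm=1.0, cost_miss=1.0):
    """Declare signal (True) when the likelihood ratio exceeds the threshold c.

    Assumed threshold: c = [Pr(H0) / Pr(H1)] * [cost_false_alarm / cost_miss],
    which treats correct decisions as costless.
    """
    c = (1.0 - p_signal) / p_signal * cost_false_alarm / cost_miss
    return likelihood_ratio(x, noise, signal) > c


# Hypothetical setup: unit-variance noise, signal shifts the mean by 2.
noise = NormalDist(mu=0.0, sigma=1.0)
signal = NormalDist(mu=2.0, sigma=1.0)

print(decide(1.5, noise, signal))                 # equal priors and costs
print(decide(1.5, noise, signal, p_signal=0.01))  # signal assumed rare
```

With equal priors and costs the threshold is c = 1, which for this setup puts the decision boundary at the midpoint x = 1. Making the signal rare raises c, so the same observation x = 1.5 is no longer enough to declare a signal.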
Four outcomes
The decision produces one of four outcomes:
| | Truth: H0 (no signal) | Truth: H1 (signal) |
|---|---|---|
| Declare H0 | True negative | Miss (false negative) |
| Declare H1 | False alarm (false positive) | Hit (true positive) |
Two summary statistics matter:
Hit rate (or true positive rate, or sensitivity) H = Pr(declare H1 ∣ H1).
False-alarm rate (or false positive rate) F = Pr(declare H1 ∣ H0).
As you lower the decision threshold, both H and F go up: you catch more signals but also flag more noise. The ROC curve (Receiver Operating Characteristic) plots H versus F as the threshold sweeps. A useless detector — one whose output is independent of the true class — lies on the diagonal H=F. A perfect detector reaches the upper-left corner H=1,F=0. The area under the curve (AUC) measures detector quality; AUC = 0.5 is chance, AUC = 1 is perfect.
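The threshold sweep can be sketched for the equal-variance Gaussian case, where thresholding the observation x is equivalent to thresholding the likelihood ratio. The function names, the separation of 1.5 standard deviations, and the threshold grid are hypothetical choices for illustration.

```python
from statistics import NormalDist


def roc_points(noise, signal, thresholds):
    """(F, H) pairs for the rule 'declare signal when x exceeds the threshold'."""
    points = []
    for t in thresholds:
        false_alarm = 1.0 - noise.cdf(t)   # F = Pr(declare H1 | H0)
        hit = 1.0 - signal.cdf(t)          # H = Pr(declare H1 | H1)
        points.append((false_alarm, hit))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoid rule."""
    pts = sorted(points)
    return sum((f1 - f0) * (h0 + h1) / 2.0
               for (f0, h0), (f1, h1) in zip(pts, pts[1:]))


noise = NormalDist(0.0, 1.0)
signal = NormalDist(1.5, 1.0)
thresholds = [-8.0 + 0.01 * i for i in range(1901)]  # sweep from -8 to 11

auc = auc_trapezoid(roc_points(noise, signal, thresholds))
# A detector whose output ignores the true class: both classes look like noise.
useless = auc_trapezoid(roc_points(noise, noise, thresholds))

print(auc, useless)
```

The useless detector traces the diagonal and its area comes out at 0.5, while the separated distributions give an area well above chance.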
The d′ statistic
When the signal-plus-noise and noise-only distributions are both Gaussian with the same standard deviation σ but means differing by Δμ, the detector’s quality is summarised by the sensitivity index
$$d' = \frac{\Delta\mu}{\sigma}.$$
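d′ is the signal amplitude in units of the noise standard deviation. d′=1 is a marginal detector (typical psychophysical threshold); d′=3 is comfortable; d′=5 or more is essentially unambiguous. In this equal-variance Gaussian case the AUC of the ROC curve and d′ are equivalent measures: $\mathrm{AUC} = \Phi(d'/\sqrt{2})$, so $d' = \sqrt{2}\,\Phi^{-1}(\mathrm{AUC})$, where Φ is the standard normal cumulative distribution function.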
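In practice d′ is usually estimated from a measured hit rate and false-alarm rate. The sketch below uses the standard equal-variance Gaussian estimator d′ = z(H) − z(F), where z is the inverse standard normal CDF; that estimator is not stated in the text above and is added here as an assumption, and the function names are illustrative.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal: cdf is Phi, inv_cdf is its inverse


def d_prime_from_rates(hit_rate, false_alarm_rate):
    """Equal-variance Gaussian estimate d' = z(H) - z(F).

    Both rates must lie strictly between 0 and 1; inv_cdf raises otherwise.
    """
    return _STD.inv_cdf(hit_rate) - _STD.inv_cdf(false_alarm_rate)


def auc_from_d_prime(d_prime):
    """AUC = Phi(d' / sqrt(2)) under the equal-variance Gaussian model."""
    return _STD.cdf(d_prime / 2 ** 0.5)


def d_prime_from_auc(auc):
    """Inverse relation: d' = sqrt(2) * Phi^{-1}(AUC)."""
    return 2 ** 0.5 * _STD.inv_cdf(auc)


# Hypothetical detector: means 2 sigma apart, threshold at the midpoint,
# so H = Phi(1) and F = 1 - Phi(1).
hit = _STD.cdf(1.0)
false_alarm = 1.0 - _STD.cdf(1.0)
print(d_prime_from_rates(hit, false_alarm))
print(auc_from_d_prime(1.0))
```

The first value recovers the separation of 2 that was built into the example, and the second shows that a marginal detector with d′ = 1 has an AUC of roughly 0.76.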
In psychophysics, the experimentally-measured d′ tells you the signal-to-noise ratio at which a perceptual system can discriminate. The auditory-nerve threshold for detecting a tone in noise has d′≈1 at the just-noticeable level, by definition.
History
The history: Bayes 1763, Laplace 1774, and a 200-year argument
Thomas Bayes was a Presbyterian minister and amateur mathematician in 18th-century England. He wrote An Essay towards solving a Problem in the Doctrine of Chances sometime before his death in 1761, but never published it. The manuscript was found among his papers by Richard Price, who edited and submitted it to the Royal Society; it appeared in the Philosophical Transactions in 1763, two years after Bayes had died.
The paper introduced what we now call Bayes’ rule — initially as a special case for the binomial distribution — and applied it to the problem of estimating an unknown probability from observed successes and failures. The crucial conceptual move was to treat the unknown parameter (the probability of success) as itself having a distribution. This was philosophically radical: parameters were generally thought of as fixed unknowns, not as random variables.
Pierre-Simon Laplace independently rediscovered and generalised the rule in his 1774 Mémoire sur la probabilité des causes par les événements. Laplace took it much further — using Bayesian arguments throughout his career to tackle problems from celestial mechanics (determining the orbits of comets) to demography (estimating population sizes from birth-rate data).
The Bayesian / frequentist split crystallised in the early 20th century, with Ronald Fisher, Jerzy Neyman, and Karl Pearson on the frequentist side arguing for objective, prior-free statistics, and Harold Jeffreys, Bruno de Finetti, and L. J. Savage on the Bayesian side defending the subjective-probability interpretation. The argument lasted decades; modern statistics largely shrugs and uses both. The rise of computational Bayesian methods (Markov-chain Monte Carlo, variational inference) in the 1990s tipped the practical balance toward Bayesian methods for complex models, and machine learning’s adoption of probabilistic-programming languages (Stan, PyMC, Pyro) has made Bayesian inference a routine choice for such models today.
What we use this for
Bayesian inference and signal detection appear repeatedly:
Bayesian perception (Hearing 8.2): the brain combines a prior over stimuli with sensory likelihoods to compute a perceptual posterior. The McGurk effect, phonemic restoration, and the Shepard tone can all be read as consequences of this inferential structure.
Predictive coding (Hearing 8.4): a neural-circuit-level implementation of approximate Bayesian inference.
Psychophysical thresholds: measured by d′, plotted as ROC curves, fitted with signal-detection-theory models. In a 2-alternative-forced-choice task, 50% correct is chance (d′=0); a threshold criterion such as 76% correct corresponds to d′≈1.
Speech perception in noise: every speech-in-noise audiometric test is a signal-detection problem.
Bayesian inference in modern engineering: Kalman filters, particle filters, ensemble Kalman filters, and probabilistic-graphical-model algorithms are all built on Bayes’ rule, many of them on conjugate Gaussian updates.
Closing the chapter
That closes Foundations 11. The five lessons developed the working subset of probability and statistics the bookshelf uses: random variables and the named distributions (11.1), the Gaussian and the Central Limit Theorem that makes it ubiquitous (11.2), random walks and Brownian motion (11.3), Poisson processes (11.4), and Bayesian inference and signal detection theory (this lesson).
The arc of the chapter, in one paragraph: the world is statistical because most physical signals are sums of many independent micro-fluctuations (the CLT picture); random walks and Poisson processes are the two canonical stochastic models that capture most of what physical noise looks like at the macroscale; Bayes’ rule is the inferential engine for going from data back to underlying state. Almost everything probabilistic in the rest of the bookshelf is an instance of one of these.