8.2 Spectrograms and the time-frequency picture

Most interesting sounds are not stationary: their frequency content changes over time. Speech is a sequence of phonemes; music is a sequence of notes; a slamming door is a transient. The Fourier transform of the whole signal flattens this temporal structure into a single spectrum, which is useful but lossy. The fix is the short-time Fourier transform (STFT): chop the signal into overlapping windows, Fourier-transform each window, and plot the magnitude as a 2-D image with time on one axis and frequency on the other.

The result is a spectrogram, and it is the single most useful visualisation in audio analysis.

Construction

For a signal $x(t)$ and a window function $w(t)$ (e.g., a Gaussian or a Hamming window) of width $T_w$,

$$X(t, \omega) \;=\; \int_{-\infty}^\infty x(\tau)\, w(\tau - t)\, e^{-i\omega \tau}\, d\tau.$$

This is a function of two variables: time $t$ (where the window is centred) and frequency $\omega$. Take its magnitude squared, $|X(t, \omega)|^2$, and you have a 2-D heatmap: the spectrogram.
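A discrete version of this construction can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the Hann window, the function name `stft_spectrogram`, the 8 kHz sample rate, and the two-tone test signal are all assumptions made for the example. Each frame is transformed relative to its own start rather than to absolute time, which changes only the phase of $X(t, \omega)$ and leaves the magnitude squared untouched.

```python
import numpy as np

def stft_spectrogram(x, fs, win_len, hop):
    """Return (times, freqs, power) for a Hann-windowed discrete STFT.

    x: 1-D signal, fs: sample rate in Hz,
    win_len: window length in samples, hop: step between windows in samples.
    """
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    # Slice the signal into overlapping frames and apply the window to each.
    frames = np.stack([x[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    # One FFT per frame; rows index time, columns index frequency.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    times = (np.arange(n_frames) * hop + win_len / 2) / fs  # window centres
    freqs = np.fft.rfftfreq(win_len, d=1 / fs)
    return times, freqs, power

# Test signal (assumed): 440 Hz for the first half second, then 880 Hz.
fs = 8000
t = np.arange(fs) / fs
x = np.where(t < 0.5, np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 880 * t))

times, freqs, power = stft_spectrogram(x, fs, win_len=400, hop=100)
early_peak = freqs[np.argmax(power[0])]    # strongest frequency in the first frame
late_peak = freqs[np.argmax(power[-1])]    # strongest frequency in the last frame
```

The whole-signal Fourier transform would report both tones without saying when each occurred; here the peak frequency moves from the first frame to the last, which is exactly the information the spectrogram adds.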

A small interactive synthesizer

[Interactive widget: six sliders set the harmonic amplitudes a₁ to a₆ (initially a₁ = 1.00 and a₂ to a₆ = 0.00), with a row of waveform presets.]

Adjust the amplitudes of the first six harmonics of a 220 Hz fundamental, watch the time-domain waveform and the line spectrum side by side, and press "play sound" to hear it. The presets exhibit the canonical waveforms whose Fourier series are textbook material: pure sine, square wave (odd harmonics with $1/n$ falloff), sawtooth (all harmonics with $1/n$ falloff, alternating signs), triangle ($1/n^2$ falloff in odd harmonics), and a stylised vowel-like spectrum.

This synthesizer doesn’t yet show a spectrogram (a time series of such spectra); that is a more elaborate visualisation we’ll build in a later iteration. What it does show is the equivalence: the time-domain waveform and the frequency-domain spectrum are two views of the same object, neither more complete than the other.
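The synthesizer's two views can be imitated offline. The sketch below is an assumption-laden stand-in for the widget, not its actual code: the `PRESETS` table uses the textbook sine-series coefficients truncated to six harmonics, and the 44.1 kHz sample rate and 0.5 s duration are chosen so that the record holds a whole number of periods of the 220 Hz fundamental.

```python
import numpy as np

F0 = 220.0    # fundamental from the text, in Hz
FS = 44100    # sample rate in Hz (assumed)

# Hypothetical preset table: signed amplitudes a_1..a_6 of sine harmonics.
PRESETS = {
    "sine":     [1, 0, 0, 0, 0, 0],
    "square":   [1, 0, 1/3, 0, 1/5, 0],             # odd harmonics, 1/n
    "sawtooth": [1, -1/2, 1/3, -1/4, 1/5, -1/6],    # all harmonics, 1/n, alternating signs
    "triangle": [1, 0, -1/9, 0, 1/25, 0],           # odd harmonics, 1/n^2
}

def synthesize(amplitudes, duration=0.5, f0=F0, fs=FS):
    """Time-domain view: sum of sine harmonics with the given amplitudes."""
    t = np.arange(int(duration * fs)) / fs
    n = np.arange(1, len(amplitudes) + 1)[:, None]   # harmonic numbers as a column
    a = np.asarray(amplitudes, dtype=float)[:, None]
    return t, (a * np.sin(2 * np.pi * f0 * n * t)).sum(axis=0)

def line_spectrum(wave, n_harmonics=6, f0=F0, fs=FS):
    """Frequency-domain view: magnitude of each harmonic, read from the FFT."""
    spectrum = np.fft.rfft(wave)
    bin_hz = fs / len(wave)
    bins = np.round(np.arange(1, n_harmonics + 1) * f0 / bin_hz).astype(int)
    return 2 * np.abs(spectrum[bins]) / len(wave)

t, square = synthesize(PRESETS["square"])
recovered = line_spectrum(square)   # magnitudes of harmonics 1..6
```

Because the record spans exactly 110 periods, each harmonic lands on a single FFT bin and the line spectrum returns the magnitudes that went into the waveform. That round trip is the equivalence the paragraph above describes.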

The window-width tradeoff

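The width $T_w$ of the analysis window sets the resolution. The uncertainty principle bounds the product of the window's RMS duration $\Delta t$ and its RMS bandwidth $\Delta \omega$:

$$\Delta t \cdot \Delta \omega \;\geq\; \tfrac12,$$

with equality only for a Gaussian window. As a working rule, a window of width $T_w$ resolves frequencies to about $\Delta f \sim 1/T_w$.

Narrow window ($T_w$ small): good time resolution; poor frequency resolution. A 5 ms window resolves frequencies only to $\Delta f \sim 200$ Hz.
Wide window ($T_w$ large): good frequency resolution; poor time resolution. A 100 ms window resolves frequencies to $\Delta f \sim 10$ Hz but smears the time axis.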
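The two cases can be checked numerically. In the sketch below, everything beyond the 5 ms and 100 ms window widths is an assumption: the 8 kHz sample rate, the pair of tones 50 Hz apart, the Hann window, and the helper name `count_peaks`. It counts how many distinct spectral peaks rise above half the maximum in a single windowed frame.

```python
import numpy as np

FS = 8000                   # sample rate in Hz (assumed)
F1, F2 = 1000.0, 1050.0     # two tones 50 Hz apart (assumed test signal)

def count_peaks(window_s, n_fft=1 << 16):
    """Count spectral peaks above half the maximum in one Hann-windowed frame."""
    n = int(round(window_s * FS))
    t = np.arange(n) / FS
    frame = (np.cos(2 * np.pi * F1 * t) + np.cos(2 * np.pi * F2 * t)) * np.hanning(n)
    # Zero-padding interpolates the spectrum; it does not add resolution.
    mag = np.abs(np.fft.rfft(frame, n_fft))
    inner = mag[1:-1]
    is_peak = (inner > mag[:-2]) & (inner > mag[2:]) & (inner > 0.5 * mag.max())
    return int(is_peak.sum())

narrow = count_peaks(0.005)   # 5 ms window, resolution of roughly 200 Hz
wide = count_peaks(0.100)     # 100 ms window, resolution of roughly 10 Hz
```

With the 5 ms window the two tones merge into one broad peak; with the 100 ms window they separate into two. The price of the wide window is that any change in the signal within those 100 ms is averaged away.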

There is no window that gives sharp resolution in both. The choice is the central design decision in spectrogram analysis. Different applications want different windows:

Speech analysis: 25 ms windows, giving ~40 Hz frequency resolution and ~25 ms time resolution. This captures phonemes (~50 ms duration) and formants (separated by hundreds of Hz).
Music transcription: 50 ms windows for chord detection and 10 ms windows for onset detection, often computed in parallel.
Bird-call analysis: 5–10 ms windows to capture sweep rate; long-window passes for harmonic structure.
Vibrational analysis of machinery: 0.5–1 s windows for sub-Hz frequency resolution.
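The resolution figures quoted in this section are consistent with the rule of thumb $\Delta f \approx 1/T_w$, which makes it easy to tabulate the choices above. The helper names and the 16 kHz sample rate used to convert a window width into a sample count are assumptions for illustration; the window widths are taken from the list, picking one end of each quoted range.

```python
# Window widths from the list above, in seconds.
WINDOWS_S = {
    "speech": 0.025,
    "music (chords)": 0.050,
    "music (onsets)": 0.010,
    "bird calls (short window)": 0.005,
    "machinery (long window)": 1.0,
}

def frequency_resolution_hz(window_s):
    """Rule-of-thumb frequency resolution: roughly 1 / T_w."""
    return 1.0 / window_s

def window_samples(window_s, fs):
    """Window length in samples at sample rate fs (Hz)."""
    return int(round(window_s * fs))

resolution = {name: frequency_resolution_hz(tw) for name, tw in WINDOWS_S.items()}
speech_samples = window_samples(WINDOWS_S["speech"], fs=16000)  # fs is assumed

for name, df in resolution.items():
    print(f"{name:28s} T_w = {WINDOWS_S[name] * 1e3:7.1f} ms   delta_f ~ {df:6.1f} Hz")
```

The exact constant in front of $1/T_w$ depends on the window shape and on how resolution is defined, so these numbers are order-of-magnitude guides rather than specifications.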

What spectrograms reveal

A spectrogram of human speech makes phonetic structure visible: vowels appear as horizontal bars at the formant frequencies; fricatives as broadband high-frequency noise; stops as silences followed by bursts; voiced segments show the harmonic ladder of the vocal folds.

A spectrogram of music shows the harmonic series of pitched notes (vertical stacks of evenly spaced horizontal lines), chord changes (one harmonic stack disappears and another appears), rhythm (the temporal periodicity), and timbre (the relative strengths of the harmonics of each note).

A spectrogram of birdsong reveals sweeps, trills, and species-specific frequency patterns. A spectrogram of underwater sound shows ship noise, marine mammal calls, and seismic activity, each in its own frequency band.

The same tool, applied differently, becomes the foundation for speech recognition (input features for ASR systems), music information retrieval, bioacoustic monitoring, mechanical fault diagnosis, and ultrasonic medical imaging. The list does not end.

What we use this for in the rest of the book

The cochlea performs an approximate STFT in hardware (Hearing chapter 4). Each place on the basilar membrane is tuned to a different frequency; the membrane’s local vibration amplitude as a function of time and place is a kind of biological spectrogram.
The auditory cortex’s spectro-temporal receptive fields (STRFs — Hearing 7.3) are 2-D filters operating on the cochlear spectrogram.
Speech, music, and animal vocalisations carry information that is much more compactly described in the spectrogram than in the time-domain waveform.

⏳ The history — Shannon, Nyquist, and the sampling theorem

Claude Shannon’s 1948 paper “A Mathematical Theory of Communication” established information theory and, along the way, provided the definitive statement of the sampling theorem: a band-limited signal with maximum frequency $B$ is completely determined by samples taken at rate $2B$ or higher. The result had precursors — Harry Nyquist stated a version in 1928, and the Soviet mathematician Vladimir Kotelnikov proved a similar theorem independently in 1933 — but Shannon’s formulation embedded sampling in a complete theory of communication and gave it the form used today.

The sampling theorem is the bridge between continuous acoustics and digital signal processing. Every digital audio recording, every cochlear-implant stimulation strategy, every FFT computation on a finite data record rests on it. The theorem also establishes the fundamental tradeoff: higher sample rates preserve more bandwidth but require more data. CD audio’s 44.1 kHz sample rate captures frequencies up to 22.05 kHz — just above the nominal upper limit of human hearing.
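What goes wrong above the limit can be seen in a few lines. The sketch below is illustrative only: the 8 kHz sample rate and the 3 kHz and 5 kHz tones are assumptions chosen so that one tone lies below half the sample rate and the other the same distance above it.

```python
import numpy as np

FS = 8000            # sample rate in Hz (assumed for the demo)
n = np.arange(64)    # sample indices

def sampled_cosine(freq_hz, fs=FS):
    """Samples of a unit cosine at freq_hz taken at rate fs."""
    return np.cos(2 * np.pi * freq_hz * n / fs)

def max_capturable_hz(fs):
    """Highest frequency a sample rate fs can represent: fs / 2."""
    return fs / 2

below = sampled_cosine(3000)   # below FS / 2 = 4000 Hz
above = sampled_cosine(5000)   # above FS / 2

aliased = np.allclose(below, above)   # the two sample sequences coincide
cd_limit = max_capturable_hz(44100)   # the CD figure quoted above
```

Once sampled, the 5 kHz tone is indistinguishable from the 3 kHz one, which is why the band limit $B$ has to be enforced before sampling rather than after.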

The next lesson treats the room and the system that filters this spectrum on its way to the listener.