8.2 Spectrograms and the time-frequency picture
Most interesting sounds are not stationary: their frequency content changes over time. Speech is a sequence of phonemes; music is a sequence of notes; a slamming door is a transient. The magnitude spectrum of the whole signal flattens this temporal structure into a single picture, which is useful but hides when each component occurs. The fix is the short-time Fourier transform (STFT): chop the signal into overlapping windowed segments, Fourier-transform each one, and plot the magnitude as a 2-D image with time on one axis and frequency on the other.
The result is a spectrogram, and it is the single most useful visualisation in audio analysis.
Construction
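For a signal $x(t)$ and a window function $w(t)$ (e.g., a Gaussian or a Hamming window) of width $T$,
$$X(t, f) = \int_{-\infty}^{\infty} x(\tau)\, w(\tau - t)\, e^{-2\pi i f \tau}\, d\tau .$$
This is a function of two variables: time $t$ (where the window is centred) and frequency $f$. Take its magnitude squared, $|X(t, f)|^2$, and you have a 2-D heatmap: the spectrogram.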
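The construction can be sketched in discrete form with NumPy. This is a minimal illustration, not a reference implementation: the function name `spectrogram`, the 25 ms window, the 10 ms hop, and the two-tone test signal are all assumptions chosen for the example.

```python
import numpy as np

def spectrogram(x, fs, win_s=0.025, hop_s=0.010):
    """Return (times, freqs, power) for a real 1-D signal x sampled at fs Hz.

    win_s and hop_s are illustrative defaults, not values fixed by the text.
    """
    n_win = int(round(win_s * fs))            # window length in samples
    n_hop = int(round(hop_s * fs))            # step between window starts
    w = np.hamming(n_win)                     # Hamming analysis window
    n_frames = 1 + (len(x) - n_win) // n_hop  # frames that fit entirely in x

    # Row j holds the sample indices of frame j, so frames overlap
    # whenever n_hop < n_win.
    idx = np.arange(n_win)[None, :] + n_hop * np.arange(n_frames)[:, None]
    frames = x[idx] * w

    # One FFT per frame; rfft keeps only the non-negative frequencies.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    times = (np.arange(n_frames) * n_hop + n_win / 2) / fs  # window centres
    freqs = np.fft.rfftfreq(n_win, d=1 / fs)
    return times, freqs, power

# Hypothetical test signal: 440 Hz for the first half second, then 880 Hz.
fs = 8000
t = np.arange(fs) / fs
x = np.where(t < 0.5, np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 880 * t))

times, freqs, power = spectrogram(x, fs)
peak_hz = freqs[power.argmax(axis=1)]         # strongest frequency per frame
```

Each row of `power` is one column of the heatmap. Here `peak_hz` tracks the jump from 440 Hz to 880 Hz, which a single Fourier transform of the whole second would show only as two peaks with no indication of their order.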
A small interactive synthesizer
Adjust the amplitudes of the first six harmonics of a 220 Hz fundamental, watch the time-domain waveform and the line spectrum side by side, and press play sound to hear it. The presets exhibit the canonical waveforms whose Fourier series are textbook material: pure sine, square wave (odd harmonics with $1/n$ falloff), sawtooth (all harmonics, alternating signs), triangle ($1/n^2$ falloff in odd harmonics), and a stylised vowel-like spectrum.
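For readers without the interactive widget, the presets can be approximated offline. The sketch below assumes sine-phase harmonics and drops the overall normalisation constants of the textbook Fourier series; the names `presets` and `synthesize` are made up for this example, and the vowel-like preset is omitted because its amplitudes are not given here.

```python
import numpy as np

F0 = 220.0        # fundamental in Hz, as in the text
N_HARMONICS = 6   # the first six harmonics

n = np.arange(1, N_HARMONICS + 1)   # harmonic numbers 1..6
odd = (n % 2 == 1)

# Signed amplitude of the n-th sine harmonic for each preset.
presets = {
    "sine":     np.where(n == 1, 1.0, 0.0),
    "square":   np.where(odd, 1.0 / n, 0.0),           # odd harmonics, 1/n
    "sawtooth": (-1.0) ** (n + 1) / n,                 # all harmonics, alternating signs
    "triangle": np.where(odd, (-1.0) ** ((n - 1) // 2) / n**2, 0.0),  # odd, 1/n^2
}

def synthesize(amplitudes, fs=44100, duration=0.05):
    """Sum the harmonics of F0 with the given amplitudes; returns (t, waveform)."""
    t = np.arange(int(fs * duration)) / fs
    # Row k of the outer product is the phase of harmonic n[k] over time.
    partials = np.sin(2 * np.pi * F0 * np.outer(n, t))
    return t, amplitudes @ partials

t, square = synthesize(presets["square"])
```

With only six harmonics the square and sawtooth outputs show visible ripple near their jumps; the approximation sharpens as more harmonics are added.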
This synthesizer doesn’t yet show a spectrogram (a time-series of such spectra) — that’s a more elaborate visualisation we’ll build in a later iteration. What it does show is the equivalence: the time-domain waveform and the frequency-domain spectrum are two views of the same object, neither more complete than the other.
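The equivalence can be checked numerically: transform a waveform to its spectrum and back, and the original samples return to within rounding error. The amplitudes below are hypothetical, and the one-second record length is chosen so that FFT bins fall on whole hertz.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs          # exactly one second of samples
amps = [1.0, 0.5, 0.25]         # hypothetical amplitudes of harmonics 1..3

# Time-domain view: three harmonics of a 220 Hz fundamental.
x = sum(a * np.sin(2 * np.pi * 220 * (k + 1) * t) for k, a in enumerate(amps))

X = np.fft.rfft(x)                    # frequency-domain view (complex)
x_back = np.fft.irfft(X, n=len(x))    # back to the time domain

# With a 1 s record the bins are 1 Hz apart, so bin k sits at k Hz.
# Scaling by 2/N turns FFT magnitudes into sine amplitudes.
line_amplitudes = 2 * np.abs(X) / len(x)

max_error = np.max(np.abs(x - x_back))
```

The round trip works only because `X` is complex. The line spectrum shows magnitudes alone; the phases carry the rest of the information needed to rebuild the waveform.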
The window-width tradeoff
The width $T$ of the analysis window sets the resolution. From the uncertainty principle, $\Delta t\,\Delta f \gtrsim 1$, so a window of width $T$ resolves frequencies to about $1/T$:
- Narrow window ($T$ small): good time resolution; poor frequency resolution. A 5 ms window resolves frequencies only to ~200 Hz.
- Wide window ($T$ large): good frequency resolution; poor time resolution. A 100 ms window resolves frequencies to ~10 Hz but smears the time axis.
There is no window that gives sharp resolution in both. The choice is the central design decision in spectrogram analysis. Different applications want different windows:
- Speech analysis: 25 ms windows, ~40 Hz frequency resolution, ~25 ms time resolution. Captures phonemes (~50 ms duration) and formants (separated by hundreds of Hz).
- Music transcription: 50 ms windows for chord detection, 10 ms for onset detection — often computed in parallel.
- Bird-call analysis: 5–10 ms windows to capture sweep rate; long-window passes for harmonic structure.
- Vibrational analysis of machinery: 0.5–1 s windows for frequency resolution of about 1–2 Hz.
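The tradeoff behind these choices can be tried on a toy signal. The sketch below uses the rule of thumb that a window of duration T resolves frequencies to roughly 1/T (the exact figure depends on the window shape), then checks whether two tones 50 Hz apart show up as separate peaks in a single Hamming-windowed frame. The helper names, the tone frequencies, and the half-maximum peak criterion are all assumptions made for this illustration.

```python
import numpy as np

def resolution_hz(win_s):
    """Rule-of-thumb frequency resolution of a window lasting win_s seconds."""
    return 1.0 / win_s

def count_peaks(x, fs, win_s):
    """Count local spectral maxima above half the largest magnitude
    in the first Hamming-windowed frame of length win_s seconds."""
    n = int(round(win_s * fs))
    mag = np.abs(np.fft.rfft(x[:n] * np.hamming(n)))
    inner = mag[1:-1]
    is_peak = (inner > mag[:-2]) & (inner > mag[2:]) & (inner > 0.5 * mag.max())
    return int(is_peak.sum())

# Hypothetical test signal: two equal tones 50 Hz apart.
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 1050 * t)

for win_s in (0.005, 0.025, 0.100, 1.0):
    print(f"{win_s * 1e3:6.0f} ms window: about {resolution_hz(win_s):.0f} Hz")

narrow = count_peaks(x, fs, 0.005)   # 5 ms frame
wide = count_peaks(x, fs, 0.100)     # 100 ms frame
```

The 100 ms frame separates the two tones, while the 5 ms frame merges them into a single broad peak. The price of the wide frame is that any event shorter than about 100 ms is smeared across it.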
What spectrograms reveal
A spectrogram of human speech makes phonetic structure visible: vowels appear as horizontal bars at the formant frequencies; fricatives as broadband high-frequency noise; stops as silences followed by bursts; voiced segments show the harmonic ladder of the vocal folds.
A spectrogram of music shows: the harmonic series of pitched notes (vertical stacks of evenly-spaced lines), chord changes (when one harmonic stack disappears and another appears), rhythm (the temporal periodicity), and timbre (the relative strengths of harmonics for each note).
A spectrogram of birdsong reveals sweeps, trills, and species-specific frequency patterns. A spectrogram of underwater sound shows ship noise, marine mammal calls, and seismic activity, each in its own frequency band.
The same tool, applied differently, becomes the foundation for speech recognition (input features for ASR systems), music information retrieval, bioacoustic monitoring, mechanical fault diagnosis, and ultrasonic medical imaging. The list does not end.
What we use this for in the rest of the book
- The cochlea performs an approximate STFT in hardware (Hearing chapter 4). Each place on the basilar membrane is tuned to a different frequency; the membrane’s local vibration amplitude as a function of time and place is a kind of biological spectrogram.
- The auditory cortex’s spectro-temporal receptive fields (STRFs — Hearing 7.3) are 2-D filters operating on the cochlear spectrogram.
- Speech, music, and animal vocalisations carry information that is much more compactly described in the spectrogram than in the time-domain waveform. The next lesson treats the room and the system that filters this spectrum on its way to the listener.