7.4 Streaming and auditory objects

The big perceptual problem the cortex solves is grouping. In any natural auditory scene, the cochlea is being driven by many sound sources at once: voices, footsteps, traffic, music, your own voice. The cochlea has no idea which spectral components go together as a single source. The brain has to figure that out.

This is called auditory scene analysis, and Albert Bregman’s 1990 book of the same name is the foundational text. The cortex’s job — and especially the job of regions beyond A1, in the surrounding “belt” and “parabelt” — is to take the bottom-up spectro-temporal representation and segregate it into perceptual objects called streams: one stream per sound source.

The bistable galloping triplet from movement 1, replayed below, is the classical demonstration of streaming in action. The auditory-nerve input is the same however you perceive it, but the cortical representation differs depending on whether you organize the input into one stream (gallop) or two (separated). At parameters where the percept is bistable, neural recordings in A1 of monkeys show that cortical activity itself flips with the percept, even though the stimulus is identical. The cortex is constructing the streams, and what we hear is the construction.

[Interactive demo: ABA_ galloping triplet. Controls: tone A frequency (700 Hz), A − B separation (4 semitones), rate (6.0 /s).]

Slow rate + small ΔF → galloping (one stream). Fast rate + large ΔF → two streams.
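The control values above map onto a concrete stimulus. As a minimal sketch, assume that "A − B = 4 st" means tone B lies 4 equal-tempered semitones below tone A, and that "rate" counts tone slots per second. Both are readings of the labels, not something the demo states. Under those assumptions the ABA_ sequence could be laid out as follows. The function names are illustrative only.

```python
def shift_by_semitones(f_ref_hz, semitones):
    """Return the frequency `semitones` equal-tempered semitones away from f_ref_hz."""
    return f_ref_hz * 2.0 ** (semitones / 12.0)


def aba_schedule(f_a_hz=700.0, delta_st=4.0, rate_per_s=6.0, n_triplets=2):
    """Lay out an ABA_ sequence as (onset_seconds, frequency_hz) pairs.

    A frequency of None marks the silent slot that ends each triplet.
    """
    slot_s = 1.0 / rate_per_s                       # assumption: rate = slots per second
    f_b_hz = shift_by_semitones(f_a_hz, -delta_st)  # assumption: B sits below A
    pattern = [f_a_hz, f_b_hz, f_a_hz, None]        # A, B, A, silence
    return [(i * slot_s, pattern[i % 4]) for i in range(4 * n_triplets)]


if __name__ == "__main__":
    for onset_s, freq_hz in aba_schedule():
        label = "silence" if freq_hz is None else f"{freq_hz:.1f} Hz"
        print(f"{onset_s:.3f} s  {label}")
```

With the default values, tone B comes out at roughly 556 Hz. Raising `rate_per_s` or `delta_st` moves the sequence toward the two-stream regime described in the caption. Lowering them moves it toward the gallop.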

Streaming relies on multiple grouping cues. Frequency proximity is one (the cue varied by the A − B slider above). Temporal coherence is another (components that turn on and off together belong to the same source). Harmonicity is a third (components that are integer multiples of a fundamental belong to one harmonic source; this is how you hear a violin’s fundamental and overtones as a single note). Common spatial location (from movement 7’s ITDs/ILDs) is a fourth. The cortex integrates all of these and arrives at a segregation that, most of the time, matches the actual sound sources in the world.
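The harmonicity cue can be made concrete with a toy example. This is a sketch, not a model of what the cortex computes. It assumes the candidate fundamentals are already known and uses an arbitrary 2% mistuning tolerance. It keeps only those components that sit close to an integer multiple of a given fundamental. The function name, the tolerance, and the example frequencies are all hypothetical.

```python
def harmonic_members(components_hz, f0_hz, tolerance=0.02):
    """Return the components within `tolerance` (relative) of an integer multiple of f0_hz."""
    members = []
    for f_hz in components_hz:
        n = round(f_hz / f0_hz)                    # nearest harmonic number
        if n < 1:
            continue                               # far below the fundamental: not a harmonic
        mistuning = abs(f_hz / (n * f0_hz) - 1.0)  # relative distance from that harmonic
        if mistuning <= tolerance:
            members.append(f_hz)
    return members


# Two interleaved harmonic series (illustrative values, not measured data).
mixture_hz = [200.0, 310.0, 400.0, 600.0, 620.0, 930.0]
source_1 = harmonic_members(mixture_hz, 200.0)
source_2 = harmonic_members(mixture_hz, 310.0)
```

Here the mixture splits cleanly into the 200 Hz series and the 310 Hz series. In a real scene the fundamentals would have to be estimated from the mixture itself, and harmonicity would be weighed against the other cues rather than applied alone.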

The streaming step is constructive. There is no privileged “true” segregation in the acoustic signal; the brain decides what counts as one source and what counts as two. Different listeners can hear the same acoustic input as one stream or two, depending on their attention, their expectations, and where in the bistable region the stimulus happens to fall. The percept is a commitment, not a discovery.

⏳ The history — Bregman and auditory scene analysis

Albert Bregman’s 1990 monograph Auditory Scene Analysis synthesised decades of psychophysical research into a unified framework for how the auditory system parses a complex acoustic mixture into separate perceptual objects — voices, instruments, environmental sounds. Bregman identified two classes of grouping process: primitive (bottom-up, driven by physical regularities like harmonicity, common onset, and frequency proximity) and schema-based (top-down, driven by learned templates and attention).

The framework gave the field a vocabulary and an experimental program. The streaming phenomena explored in this lesson — the bistable ABA_ triplet, the role of frequency separation, the build-up of streaming over time — are all experiments motivated by Bregman’s taxonomy. Modern computational models of auditory scene analysis, including Bayesian and deep-learning approaches, remain organised around the primitive/schema-based distinction he articulated.