7.4 Streaming and auditory objects

The big perceptual problem the cortex solves is grouping. In any natural auditory scene, the cochlea is being driven by many sound sources at once: voices, footsteps, traffic, music, your own voice. The cochlea has no idea which spectral components go together as a single source. The brain has to figure that out.

This is called auditory scene analysis, and Albert Bregman’s 1990 book of the same name is the foundational text. The cortex’s job — and especially the job of regions beyond A1, in the surrounding “belt” and “parabelt” — is to take the bottom-up spectro-temporal representation and segregate it into perceptual objects called streams: one stream per sound source.

The bistable galloping triplet from movement 1, replayed below, is the classical demonstration of streaming in action. The cortex receives the same auditory-nerve input regardless of how you perceive it, but the cortical representation differs depending on whether you organize the input into one stream (gallop) or two (separated). At parameters where the percept is bistable, neural recordings in A1 of monkeys show that cortical activity itself flips with the percept — even though the stimulus is identical. The cortex is constructing the streams, and what we hear is the construction.

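[Interactive demo: A-B-A-rest galloping triplet, 6 repetitions, time running left to right. A (high) = 700 Hz, B (low) = 556 Hz. Slider settings shown: A frequency 700 Hz, ΔF 4 st (semitones), rate 6.0 /s.]
Slow rate + small ΔF → galloping (one stream). Fast rate + large ΔF → two streams.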
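If you want to reproduce the stimulus outside the page, the sketch below builds the A-B-A-rest pattern from the settings shown in the demo (A = 700 Hz, ΔF = 4 semitones, 6.0 /s, 6 repetitions). Two things here are assumptions rather than anything the demo states: that "6.0 /s" counts tone slots per second (so each of the four slots in a triplet lasts 1/6 s), and that tones are plain sinusoids with short raised-cosine ramps. The function names `semitones_below`, `aba_sequence`, and `render` are hypothetical, chosen only for this illustration.

```python
import math

def semitones_below(freq_hz, semitones):
    """Frequency lying `semitones` equal-tempered semitones below `freq_hz`."""
    return freq_hz * 2.0 ** (-semitones / 12.0)

def aba_sequence(a_hz=700.0, delta_st=4.0, rate_per_s=6.0, repetitions=6):
    """Return a list of (onset_s, freq_hz) slots for the A-B-A-rest pattern.

    A rest slot has freq_hz = None.
    Assumption: rate_per_s counts slots (tones or rests) per second.
    """
    b_hz = semitones_below(a_hz, delta_st)
    slot_s = 1.0 / rate_per_s
    pattern = [a_hz, b_hz, a_hz, None]
    slots = []
    for rep in range(repetitions):
        for i, freq in enumerate(pattern):
            onset_s = (rep * len(pattern) + i) * slot_s
            slots.append((onset_s, freq))
    return slots

def render(slots, slot_s, sample_rate=16000, ramp_s=0.005):
    """Render the slots as float samples in [-1, 1], one sinusoid per tone slot."""
    n_slot = int(round(slot_s * sample_rate))
    n_ramp = int(round(ramp_s * sample_rate))
    samples = []
    for _, freq in slots:
        for n in range(n_slot):
            if freq is None:
                samples.append(0.0)  # silent rest slot
                continue
            # Raised-cosine onset and offset ramps to avoid audible clicks.
            edge = min(n, n_slot - 1 - n)
            if edge >= n_ramp:
                gain = 1.0
            else:
                gain = 0.5 * (1.0 - math.cos(math.pi * edge / n_ramp))
            samples.append(gain * math.sin(2.0 * math.pi * freq * n / sample_rate))
    return samples

slots = aba_sequence()
samples = render(slots, slot_s=1.0 / 6.0)
print(round(semitones_below(700.0, 4.0), 1))  # 555.6, shown as 556 Hz in the demo
```

Raising `rate_per_s` or `delta_st` moves the sequence toward the two-stream side of the caption above; lowering them moves it toward the gallop. The samples can be written to a WAV file with Python's standard `wave` module if you want to listen to the result.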

Streaming relies on multiple grouping cues. Frequency proximity is one (the cue varied by the slider above). Temporal coherence is another (components that turn on and off together belong to the same source). Harmonicity is a third (components that are integer multiples of a fundamental belong to one harmonic source — this is how you hear a violin’s fundamental and overtones as a single note). Common spatial location (from movement 7’s ITDs/ILDs) is a fourth. The cortex integrates all of these and arrives at a segregation that, most of the time, matches the actual sound sources in the world.
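Two of these cues are easy to make concrete. The sketch below is a toy illustration, not a model of what the cortex computes: `harmonic_group` sorts partials by whether they sit near an integer multiple of a candidate fundamental, and `envelope_coherence` uses the Pearson correlation between two amplitude envelopes as one simple stand-in for "turning on and off together". Both function names, the 3% mistuning tolerance, and the example frequencies are assumptions made for illustration.

```python
import math

def harmonic_group(partials_hz, f0_hz, tolerance=0.03):
    """Split partials into (harmonics of f0, everything else).

    A partial counts as a harmonic if it lies within `tolerance`
    (as a fraction) of the nearest integer multiple of f0_hz.
    """
    grouped, rest = [], []
    for f in partials_hz:
        n = round(f / f0_hz)  # nearest harmonic number
        if n >= 1 and abs(f / (n * f0_hz) - 1.0) <= tolerance:
            grouped.append(f)
        else:
            rest.append(f)
    return grouped, rest

def envelope_coherence(env_a, env_b):
    """Pearson correlation between two equal-length amplitude envelopes."""
    n = len(env_a)
    mean_a = sum(env_a) / n
    mean_b = sum(env_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(env_a, env_b))
    var_a = sum((a - mean_a) ** 2 for a in env_a)
    var_b = sum((b - mean_b) ** 2 for b in env_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat envelope carries no timing information
    return cov / math.sqrt(var_a * var_b)

# Hypothetical mixture: a 220 Hz harmonic tone plus two unrelated partials.
harmonics, others = harmonic_group([220, 300, 440, 510, 663, 880], f0_hz=220)

# Hypothetical on/off envelopes: two components gated together, one in antiphase.
together = envelope_coherence([1, 1, 0, 0, 1, 1, 0, 0], [1, 1, 0, 0, 1, 1, 0, 0])
antiphase = envelope_coherence([1, 1, 0, 0, 1, 1, 0, 0], [0, 0, 1, 1, 0, 0, 1, 1])
```

Here the slightly mistuned 663 Hz partial is still grouped with the 220 Hz series, while 300 Hz and 510 Hz are left out. The envelope score is high for components gated together and strongly negative for components that alternate, which is the pattern of the A and B tones in the galloping sequence.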

It is worth pausing here. The streaming step is constructive. There is no privileged “true” segregation in the acoustic signal; the brain decides what counts as one source and what counts as two. Different listeners can hear the same acoustic input as one stream or two, depending on their attention, their expectations, and where the stimulus falls relative to their own bistable region. The percept is a commitment, not a discovery.