TR#252: A simulation of vowel segregation based on across-channel glottal-pulse synchrony

D. P. W. Ellis

Presented to:
126th meeting of the Acoustical Society of America
Denver CO, October 1993

As part of the broader question of how human listeners succeed in extracting a single voice of interest under the most adverse noise conditions (the 'cocktail-party effect'), a great deal of attention has been focused on the problem of separating simultaneously presented vowels, primarily by exploiting assumed differences of fundamental frequency (f0) (see [A. de Cheveigne, "Separation of concurrent harmonic sounds: Fundamental frequency estimation and a time-domain cancellation model of auditory processing," J. Acous. Soc. Am. 93(6), 1993] for a review).

While acknowledging the very good agreement with experimental data achieved by some of these models (e.g. [R. Meddis and M. J. Hewitt, "Modeling the identification of concurrent vowels with different fundamental frequencies," J. Acous. Soc. Am. 91(1), 1992]), we propose a different mechanism that does not rely on the differing periods of the two voices, but rather on the assumption that, in the majority of cases, their glottal pitch pulses will occur at distinct instants.

A modification of the Meddis & Hewitt model is proposed that segregates the regions of spectral dominance of the different vowels by detecting their synchronization to a common underlying glottal pulse train, as will be the case for each distinct human voice. Although phase dispersion from numerous sources complicates this approach, our results show that with suitable integration across time, it is possible to separate vowels on this basis alone.
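
The grouping idea can be illustrated with a toy sketch. This is not the model described in the report: it assumes idealized impulse trains for the glottal pulses, a one-pole smoother as a crude stand-in for each channel's envelope, and a zero-lag correlation as the synchrony measure, and it ignores the phase dispersion mentioned above. All function names and parameter values are hypothetical, chosen only for illustration. Both voices are given the same period, so that only the timing of their pulses distinguishes them.

```python
import math

def pulse_train(period, offset, n):
    """Unit impulses every `period` samples, starting at sample `offset`."""
    x = [0.0] * n
    for t in range(offset, n, period):
        x[t] = 1.0
    return x

def channel_envelope(pulses, decay):
    """One-pole smoothing: an assumed, crude stand-in for a channel envelope."""
    env, y = [], 0.0
    for p in pulses:
        y = decay * y + p
        env.append(y)
    return env

def synchrony(a, b):
    """Normalized zero-lag correlation of two mean-removed envelopes."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den if den else 0.0

def group_channels(envelopes, threshold=0.5):
    """Greedy grouping: a channel joins the first group whose seed channel
    it is synchronized with; otherwise it starts a new group."""
    groups = []
    for i, env in enumerate(envelopes):
        for g in groups:
            if synchrony(envelopes[g[0]], env) > threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    return groups

n = 2000                           # long window: integration across time
voice_a = pulse_train(100, 0, n)   # same period for both voices...
voice_b = pulse_train(100, 37, n)  # ...but pulses at different instants

# Four hypothetical channels, alternately dominated by voice A and voice B.
envelopes = [
    channel_envelope(voice_a, 0.90),
    channel_envelope(voice_b, 0.90),
    channel_envelope(voice_a, 0.80),
    channel_envelope(voice_b, 0.85),
]
groups = group_channels(envelopes)
print(groups)
```

In this idealized setting, channels driven by the same pulse train end up in the same group even though the two voices have identical f0. A more realistic version would need to tolerate per-channel delays, for instance by searching over a small range of lags rather than using zero lag only.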

The possible advantages of such a mechanism include its ability to exploit the period fluctuations due to frequency modulation and jitter in order to separate voices whose f0s may otherwise be close and difficult to distinguish. Since small amounts of modulation do indeed enhance the prominence of voices [S. McAdams, "Segregation of concurrent sounds. I: Effects of frequency modulation coherence," J. Acous. Soc. Am. 86(6), 1989], we suggest that human listeners may be employing something akin to this strategy when pitch-based cues are absent or ambiguous.