Welcome to The Neuromorphic Engineer
The story of AudioSapiana

Antje Ihlefeld and Malcolm Slaney

1 December 2005

Though thwarted on race day, this auditory robot successfully used salience and binaural hearing to reach its goal.

AudioSapiana is a listening and walking robot designed by the audio group at the 2005 Neuromorphic Workshop. We added ears to a RoboSapien robot from WowWee Entertainment to help it navigate in a difficult, obstacle-strewn environment. Previously, robots that oriented themselves towards a target beacon were designed to work in quiet environments. AudioSapiana, however, was designed so she could navigate in a noisy, multi-source acoustic environment.

The underlying algorithm was based on a model of human auditory perception: specifically, how listeners identify and attend to novel sounds. We combined a simple model of auditory saliency with a model of sound localization that allowed AudioSapiana to attend to novel sounds, turn towards her mating call, and navigate around obstacles. Our approach is novel in that information from both monaural and binaural pathways is integrated in a psychophysically plausible way. This is an important step towards a model of how listeners understand speech in a cocktail-party environment.1


Motivation

The processing architecture and time scale for perceptual integration of monaural and binaural properties into primitive groups are still open research questions. One possible strategy the brain may use is to analyze the location of each frequency channel separately and attend to the frequency components that share common interaural cues. Alternatively, the brain may first group the auditory scene into different objects based on cues such as harmonicity and common onsets, and then determine the location of each object.

Traditionally, speech-recognition algorithms that use both monaural and binaural information filter the signal first through a binaural pathway and then pass the result to a monaural recognizer. However, some psychophysical data suggest that, for very short signals, monaural stream segregation precedes binaural processing.2 This idea was implemented in our computational model.


Segregation using monaural and binaural integration

Upon entering the ears, the ongoing sound signal is segregated into auditory objects by the monaural pathways (see also Figure 2). Simultaneously, the auditory scene is analyzed by a binaural processor that groups the scene into frequency patches with coherent location and similar onset. The monaural processor sends the estimated target frequency channels to the binaural processor. The latter determines the corresponding target locations, detects spatial mismatches within the proposed frequency channels, and feeds this information back to the monaural processor, along with any non-target frequencies that originate from the estimated target location.
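
As a rough illustration of this feedback loop (not the workshop code, which was written in MATLAB), the Python sketch below has a monaural stage propose target channels and a binaural stage confirm or reject them. The function names, the energy-based channel selection, and the 3dB mismatch tolerance are all illustrative assumptions.

    import numpy as np

    def monaural_segment(spectrogram, threshold=3.0):
        # Propose candidate target channels: here, channels whose average
        # energy stands well above the median channel energy.
        channel_energy = spectrogram.mean(axis=1)
        return np.flatnonzero(channel_energy > threshold * np.median(channel_energy))

    def binaural_check(left_spec, right_spec, channels, tolerance_db=3.0):
        # Estimate a location cue (ILD, in dB) for each proposed channel, take
        # the median as the target location, and reject channels that disagree.
        eps = 1e-12
        ild = 10 * np.log10((left_spec[channels].mean(axis=1) + eps) /
                            (right_spec[channels].mean(axis=1) + eps))
        target_ild = np.median(ild)
        confirmed = channels[np.abs(ild - target_ild) < tolerance_db]
        return target_ild, confirmed

    def segregate(left_spec, right_spec):
        # One pass of the loop: monaural proposal, then binaural confirmation.
        # In the full model the confirmed channels are fed back to refine the
        # next monaural pass.
        proposed = monaural_segment(left_spec + right_spec)
        return binaural_check(left_spec, right_spec, proposed)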


AudioSapiana on race day, ready to search for her mate.


Two models of monaural and binaural processing. AudioSapiana implemented the model on the right.

Only the first 10 ms after the detected onset of the target are used for location estimation. This is advantageous in two ways: it limits the processing time, and it reduces the influence of reverberation. Human listeners use a similar strategy: when localizing sounds in reverberant environments, they typically give more weight to transient than to steady-state cues (the precedence effect).
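
A minimal sketch of this onset gating, assuming frame-based processing and a simple energy-ratio onset detector (both assumptions, not details of the robot's actual code):

    import numpy as np

    def onset_frame(envelope, rise=2.0):
        # First frame at which the energy envelope jumps by a factor of `rise`.
        ratio = envelope[1:] / (envelope[:-1] + 1e-12)
        hits = np.flatnonzero(ratio > rise)
        return int(hits[0]) + 1 if hits.size else 0

    def localization_window(left_spec, right_spec, envelope, frame_s, window_s=0.010):
        # Keep only the ~10 ms of binaural signal following the detected onset;
        # later, more reverberant frames are excluded from the location estimate.
        start = onset_frame(envelope)
        n_frames = max(1, int(round(window_s / frame_s)))
        return (left_spec[:, start:start + n_frames],
                right_spec[:, start:start + n_frames])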

The binaural processor also has a decay constant that smooths the estimated target location across time. Perceptually, this corresponds to binaural sluggishness.
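
Such smoothing can be implemented as a first-order (exponential) filter on the frame-by-frame location estimate. In the sketch below the ~100 ms time constant is an illustrative value, not a figure from the implementation.

    import numpy as np

    def smooth_location(ild_per_frame, frame_s, tau_s=0.1):
        # Exponential smoothing of the per-frame ILD estimate of target
        # location; tau_s plays the role of the decay constant.
        alpha = 1.0 - np.exp(-frame_s / tau_s)
        smoothed = np.empty(len(ild_per_frame))
        smoothed[0] = ild_per_frame[0]
        for t in range(1, len(ild_per_frame)):
            smoothed[t] = (1.0 - alpha) * smoothed[t - 1] + alpha * ild_per_frame[t]
        return smoothed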

In summary, the monaural and binaural pathways work together to enhance the estimated target frequency channels, allowing more of the central processor's power to be used for detailed analysis of the target signal.


Practical implementation

AudioSapiana was designed to navigate towards her mating call: an acoustic beacon provided by Andreas Andreou. Two small microphone ears were mounted on the robot's feet. This unusual placement permitted detection of acoustic shadows behind short obstacles as well as tall ones.

The neuromorphic algorithms were implemented in MATLAB. The mating call detection was accomplished using Gaussian mixture models (GMMs) trained on some of the calls. Time-frequency bins with significant acoustic-call energy were passed to the binaural system.
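
The sketch below shows how such a detector might look. It substitutes scikit-learn's GaussianMixture (in Python) for the original MATLAB code, and the feature layout and thresholds are assumptions.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_call_model(call_frames, n_components=8):
        # call_frames: one row of log spectral energies per time frame, taken
        # from recordings of the mating call.
        return GaussianMixture(n_components=n_components,
                               covariance_type='diag').fit(call_frames)

    def select_target_bins(gmm, frames, spectrogram,
                           loglik_floor=-50.0, energy_floor_db=-40.0):
        # Keep time-frequency bins that (a) lie in frames the GMM accepts as
        # the call and (b) carry significant energy; this mask is what gets
        # passed on to the binaural system.
        call_like = gmm.score_samples(frames) > loglik_floor              # (time,)
        energetic = 10 * np.log10(spectrogram + 1e-12) > energy_floor_db  # (freq, time)
        return energetic & call_like[np.newaxis, :]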

AudioSapiana navigated through the environment using interaural level difference (ILD) cues and a simple decision rule: for ILDs with a magnitude of 1 dB or larger, the robot turned by 45° towards the louder ear; for smaller ILDs, the robot continued to walk straight. ILDs were averaged across the five estimated target frequency bands with the highest signal-to-noise ratio, with each band weighted by its average signal energy.
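
The 1 dB dead zone, the 45° turn, and the five-band, energy-weighted average come from the description above; the array shapes and helper names in this sketch are assumed.

    import numpy as np

    def steering_decision(left_energy, right_energy, snr, n_best=5):
        # left_energy/right_energy: per-band energy of the estimated target at
        # each ear; snr: per-band signal-to-noise ratio of the target estimate.
        best = np.argsort(snr)[-n_best:]                        # five highest-SNR bands
        ild_db = 10 * np.log10((left_energy[best] + 1e-12) /
                               (right_energy[best] + 1e-12))    # per-band ILD in dB
        weights = 0.5 * (left_energy[best] + right_energy[best])
        ild = np.average(ild_db, weights=weights)               # energy-weighted average
        if ild >= 1.0:
            return 'turn 45 degrees toward the left ear'
        if ild <= -1.0:
            return 'turn 45 degrees toward the right ear'
        return 'walk straight'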

We chose ILD cues over interaural phase differences (IPDs) for four reasons. Firstly, ILDs are easier to calculate. Secondly, the robot noise was mostly in the low-frequency range and therefore masked the IPDs. Thirdly, the mismatched phase delays of the two microphones introduced an IPD bias that we could not remove with our limited recording equipment. Lastly, for navigating AudioSapiana through the maze, ILD cues are advantageous because they carry information about the acoustic shadows cast by the obstacles.

In the real-time implementation, AudioSapiana listened for her mating call, made a binaural decision, and then moved a few steps to the left, to the right, or straight ahead. While moving, AudioSapiana listened again and repeated the process. AudioSapiana did not have any route planning or other artificial intelligence. Instead, when she bumped into an obstacle, she backed up a few steps (a pre-programmed feature of RoboSapien) and listened again for a new direction. This strategy proved successful, enabling AudioSapiana to consistently find a path to her mate.
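
A schematic version of that loop, with the robot and audio interfaces reduced to hypothetical callables; nothing here beyond the behaviour just described is taken from the actual workshop code.

    def navigation_loop(listen, decide, move, back_up, collided, arrived, max_moves=200):
        # listen() -> (left, right) audio; decide(left, right) -> movement command;
        # move(command) takes a few steps; collided()/arrived() report robot state.
        for _ in range(max_moves):
            if arrived():
                break
            left, right = listen()
            move(decide(left, right))
            if collided():
                back_up()   # RoboSapien's pre-programmed back-up, then listen again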

In the future, the audio team hopes to investigate more realistic implementations of the auditory-saliency binaural model. We intend to close the feedback loops, and study how this model fits human perception.

The people listed below all contributed to AudioSapiana's success.

Dave Anderson (Gatech): Noise suppression model.

Jay Kadis (Stanford): Microphones.

Jonathan Tapson (Cape Town): Hardware baking.

Mark Tilden (WowWee): Father of our girl.

Mounya Elhilali (Maryland): Crew chief and audio wrangler.

Nima Mesgarani (Maryland): Wireless audio.

Sue Denham (Plymouth): Salient object detector.

Shihab Shamma (Maryland): Random (mostly great:-) ideas.

Steven David (Maryland): Control and hardware.

Tara Hamilton (Sydney): Wireless and hardware.

Tobi Delbruck (ETH): Wireless and control.

Malcolm Slaney (Yahoo!): Primitive auditory front end.

Antje Ihlefeld (Boston Univ.): Binaural Model.




Authors

Antje Ihlefeld
Boston University

Malcolm Slaney
Yahoo! Research


References
  1. A. Bregman, Auditory Scene Analysis: The Perceptual Organization of Sound, The MIT Press, 1990.

  2. C. J. Darwin and R. W. Hukin, Auditory objects of attention: the role of interaural time-differences, J. Exp. Psychol. 25, pp. 617-629, 1999.


 
DOI:  10.2417/1200512.0029



