Welcome to The Neuromorphic Engineer
Methods » Learning

Audio-visual sensor fusion for object localization


Vincent Chan

8 June 2009

Using the onset time of stimuli, a biologically-inspired system learns to identify the sources of sounds.

Sound localization is the ability to identify the direction from which a sound comes, and it is key to survival in the animal world. Many predators use it to hunt effectively, sometimes in complete darkness. While sound localization has received considerably less attention than vision in robotics, it is expected to become increasingly important as machines are required to operate in the real world and respond to both visual and auditory stimuli.

In this article, we describe a novel two-microphone neuromorphic sound-localization system based on interaural time difference (ITD), the difference in sound arrival time caused by the spatial separation of the two ears. The system is biologically realistic: it employs a pair of silicon cochleae, and all subsequent processing is inspired by biology. Unlike previous implementations involving cochleae,1,2 our system is adaptive and does not require any prior knowledge of the ITD model. More importantly, it can learn sound localization from self-motion and visual feedback, just as biological systems do.3,4
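To make the ITD cue concrete, the standard far-field approximation relates arrival-time difference to source azimuth via the microphone spacing and the speed of sound. The sketch below is purely illustrative (our adaptive system deliberately avoids assuming any such model); the 15 cm spacing is an assumed, hypothetical value.

```python
import math

def itd_far_field(azimuth_deg, mic_distance_m=0.15, speed_of_sound=343.0):
    """Textbook far-field ITD (seconds) for a source at the given azimuth.
    Illustrative only: the adaptive system described in the article learns
    the ITD-to-direction mapping instead of assuming this formula."""
    return mic_distance_m * math.sin(math.radians(azimuth_deg)) / speed_of_sound

# A source 30 degrees off-center with mics 15 cm apart gives an ITD of
# roughly 0.15 * 0.5 / 343, i.e. about 219 microseconds.
```

The microsecond scale of this quantity is why the cross-correlation stage described next needs fine-grained delay elements.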

The architecture of the localization system is shown in Figure 1. Sound is separated into different frequency bands and then converted into electrical signals at the cochleae. The signals then go through a network of coincidence detectors and delay elements that perform cross correlation as suggested in Jeffress's model.5 The cross-correlation results are sharpened by a soft winner-take-all (WTA) network6 before matrix multiplication is applied. This transforms the WTA result into a function that represents the probability of the source coming from each discrete direction. Lastly, the results from the different frequency bands are combined to produce a global estimate. The matrix multiplication is essentially a single-layer neural network, and by updating the weights in this network, the system can be trained to localize sound.

Block diagram of the sound-localization system. The left and right audio signals are broken into different frequency bands and processed independently, before being re-combined at the end. The block arrows represent signals in multiple bands.
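The pipeline described above can be sketched in a few lines: per-band cross-correlation over a range of delays (the Jeffress-style coincidence/delay network), a soft winner-take-all implemented here as a sharp softmax, and a learned matrix W mapping the lag axis to direction probabilities. All array shapes, parameter names, and the softmax stand-in for the WTA circuit are illustrative assumptions, not the original analog implementation.

```python
import numpy as np

def localize(left_bands, right_bands, W, max_lag=16, tau=0.1):
    """Sketch of the localization pipeline: per-band cross-correlation,
    soft winner-take-all sharpening, and a learned matrix W mapping lag
    space to direction probabilities, averaged across frequency bands."""
    probs = []
    for l, r in zip(left_bands, right_bands):
        lags = np.arange(-max_lag, max_lag + 1)
        # Cross-correlation: coincidence detection at each candidate delay.
        xcorr = np.array([np.dot(l, np.roll(r, k)) for k in lags])
        # Soft WTA: a low-temperature softmax sharpens the peak.
        wta = np.exp((xcorr - xcorr.max()) / tau)
        wta /= wta.sum()
        probs.append(W @ wta)  # per-band direction estimate
    combined = np.mean(probs, axis=0)  # combine frequency bands
    return combined / combined.sum()
```

With W set to the identity, the output simply reports the dominant lag; training (described below) shapes W so the output instead represents direction probabilities.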

We tested our sound-localization architecture in an audio-visual source-localization experiment, the setup for which is shown in Figure 2. For each trial, the robot was turned in a random direction and a stimulus was played from the loudspeaker. The direction of the source was computed using the algorithm described above, and the robot then turned to face the estimated direction. A flashing light attached to the loudspeaker, easily picked up by the vision sensor on the front of the robot, allowed it to detect any error and adjust the weights according to a simple update rule based on gradient descent.7

Top: The setup for our audio-visual sensor fusion experiment. Bottom: The actual robot.
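The visually supervised update can be written as a simple delta rule: the vision sensor supplies the true direction, and the weight matrix is nudged down the gradient of the squared error between the estimated and true direction vectors. Function names, the one-hot target encoding, and the learning rate are illustrative assumptions consistent with the gradient-descent rule cited in the text.

```python
import numpy as np

def update_weights(W, wta_vec, estimated, target_onehot, lr=0.05):
    """One visually supervised gradient-descent step on the
    direction-mapping matrix W. `estimated` is W @ wta_vec; the vision
    sensor provides `target_onehot`, the true source direction."""
    error = target_onehot - estimated        # per-direction error signal
    W += lr * np.outer(error, wta_vec)       # delta rule for a linear layer
    return W
```

Repeating this step over many trials drives the estimate toward the visually confirmed direction, which is the behavior seen over the training epochs reported below.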

As we can see from Figure 3, localization accuracy was initially poor, with the robot consistently underestimating the direction of the source. As training went on, however, accuracy improved gradually, and the average error fell below 5° after 40 training epochs (where every training sample is applied once per epoch).

Localization performance before and after training: after 40 training epochs, the estimates are much closer to the target.

Though in this simple experiment we used a flashing light to let the robot identify the source visually, this is unrealistic for practical applications. A more plausible method is to bind audio and visual sources based on their onset times, on the assumption that events that occur at the same time are likely to originate from the same place. We tested the effectiveness of this method in a second experiment, in which the system was asked to visually locate an audio-visual source (emitting light and sound in sync) against a background of many flashing lights. For each visual object detected, its onset was captured and compared with the onset of the audio stimulus, and the visual object with the best match was selected (see Figure 4). Despite the background activity, the correct object was selected 75% of the time: an encouraging result given the simplicity of our method.

The outputs of the onset detector for each visual object detected (numbered 1 to 7), relative to the onset detected from the audio signals (dashed line). In this example, object 3 shows the best match and will be selected as the audio-visual source.
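The onset-matching heuristic amounts to picking the visual object whose onset time lies closest to the audio onset, within some acceptance window. The sketch below is a minimal version of that idea; the dictionary representation of detected objects and the 100 ms window are assumptions for illustration.

```python
def match_onset(audio_onset_t, visual_onsets, window=0.1):
    """Bind the audio source to the visual object whose onset time is
    closest to the audio onset, within an acceptance window (seconds).
    `visual_onsets` maps object id -> onset time. Returns None when no
    visual onset falls inside the window."""
    best, best_dt = None, window
    for obj_id, t in visual_onsets.items():
        dt = abs(t - audio_onset_t)
        if dt <= best_dt:
            best, best_dt = obj_id, dt
    return best
```

In the Figure 4 example, the object whose onset trace sits closest to the dashed audio-onset line (object 3) would be the one returned.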

More information about this bio-inspired audio-visual source-localization system can be found elsewhere.8 To the best of our knowledge, this is the first example of neuromorphic multi-modal sensor fusion. However, it is only a starting point, and many challenges remain. Future developments should include handling multiple sound sources, more robust methods of binding audio and visual objects, and faster learning. All of these are important for operating in a real-world environment, where many sources can emit sound simultaneously and training data may not be explicitly provided. Such improvements may ultimately enable researchers to develop a robot that, as in biology, learns sound localization simply by listening to and watching its own surroundings.


Vincent Chan
School of Electrical and Information Engineering, University of Sydney

Vincent recently finished his PhD at the University of Sydney. His research interests include neuromorphic engineering, vision sensors, sound localization, and analog integrated-circuit design.

  1. J. Lazzaro and C. A. Mead, A silicon model of auditory localization, Neural Computation 1 (1), pp. 47-57, 1989.

  2. N. A. Bhadkamkar and B. Fowler, Sound localization system based on biological analogy, Proc. IEEE Int'l Conf. on Neural Networks 3, pp. 1902-1907, 1993.

  3. E. I. Knudsen and P. F. Knudsen, Vision calibrates sound localization in developing barn owls, J. Neuroscience 9 (9), pp. 3306-3313, 1989.

  4. A. J. King, J. W. Schnupp and I. D. Thompson, Signals from the superficial layers of the superior colliculus enable the development of the auditory space map in the deeper layers, J. Neuroscience 18 (22), pp. 9394-9408, 1998.

  5. L. A. Jeffress, A place theory of sound localization, J. Comparative and Physiological Psychology 41, pp. 35-39, 1948.

  6. G. Indiveri and T. Delbruck, Current-mode circuits, in Analog VLSI: Circuits and Principles, pp. 145-175, MIT Press, Cambridge, MA; London, 2002.

  7. J. A. Anderson, Gradient descent algorithms, in An Introduction to Neural Networks, pp. 239-279, MIT Press, Cambridge, MA; London, 1995.

  8. V. Y. Chan, Audio-Visual Sensor Fusion for Robotic Source Localisation, PhD thesis, School of Electrical and Information Engineering, University of Sydney, Australia, 2008.

DOI:  10.2417/1200906.1640

