Welcome to The Neuromorphic Engineer
Audio-visual sensor fusion for object localization
Sound localization is the ability to identify the direction from which sound comes, and it is key to survival in the animal world. It is used by many predators to hunt effectively, sometimes in complete darkness. While sound localization has received considerably less attention than vision in robotics, it is expected to become increasingly important as machines are required to operate in the real world and respond to both visual and audio stimuli.
In this article, we describe a novel two-microphone neuromorphic sound-localization system based on interaural time difference (ITD), the difference in sound arrival time caused by the spatial separation of the two ears. The system is biologically realistic: it employs a pair of silicon cochleae, and all subsequent processing is inspired by biology. Unlike previous implementations involving cochleae,1,2 our system is adaptive and does not require any prior knowledge of the ITD models. More importantly, it can learn sound localization based on self-motion and visual feedback, just as biological systems do.3,4
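To make the ITD cue concrete: for a far-field source, the extra path length to the far microphone is approximately d·sin(θ), giving an arrival-time difference of d·sin(θ)/c. The following is a minimal sketch of this idealized model; the 15 cm microphone spacing is an assumed value for illustration, not taken from the system described here.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, in air at ~20 degrees C


def itd(azimuth_deg, mic_spacing=0.15):
    """Idealized far-field ITD for two microphones `mic_spacing` metres
    apart: the extra path to the far microphone is d*sin(theta), so the
    arrival-time difference is d*sin(theta)/c seconds."""
    return mic_spacing * math.sin(math.radians(azimuth_deg)) / SPEED_OF_SOUND


# A source 30 degrees off-centre with 15 cm spacing arrives roughly
# 219 microseconds earlier at the nearer microphone.
print(round(itd(30.0) * 1e6, 1))  # -> 218.7
```

Because the ITD varies only with sin(θ), the cue is ambiguous between front and back; this is one reason the system learns the mapping from delays to directions rather than inverting a fixed formula.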
The architecture of the localization system is shown in Figure 1. Sound is separated into different frequency bands and then converted into electrical signals at the cochleae. The signals then go through a network of coincidence detectors and delay elements that perform cross correlation as suggested in Jeffress's model.5 The cross-correlation results are sharpened by a soft winner-take-all (WTA) network6 before matrix multiplication is applied. This transforms the WTA result into a function that represents the probability of the source coming from each discrete direction. Lastly, the results from the different frequency bands are combined to produce a global estimate. The matrix multiplication is essentially a single-layer neural network, and by updating the weights in this network, the system can be trained to localize sound.
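The per-band processing chain described above can be sketched in a few lines. This is a hypothetical reconstruction, not the authors' implementation: the function name, the use of a softmax as the soft WTA, the lag range, and the temperature parameter are all assumptions made for illustration.

```python
import numpy as np


def localize_band(left, right, W, max_lag=16, temperature=0.1):
    """Hypothetical per-band pipeline sketch: cross-correlate the two
    cochlear channel outputs over a range of delays (Jeffress-style
    coincidence detection), sharpen the result with a soft
    winner-take-all (modelled here as a softmax), then apply a learned
    weight matrix W (shape: n_directions x n_lags) that maps
    correlation lags to a probability over discrete source directions."""
    lags = range(-max_lag, max_lag + 1)
    # Delay-and-multiply cross correlation, one value per candidate lag.
    xcorr = np.array([
        np.dot(left[max_lag + lag:len(left) - max_lag + lag],
               right[max_lag:len(right) - max_lag])
        for lag in lags
    ])
    # Soft WTA: exponentiate (shifted for numerical stability) and
    # normalize so the strongest lag dominates.
    wta = np.exp((xcorr - xcorr.max()) / temperature)
    wta /= wta.sum()
    # Single-layer network: matrix multiplication gives a per-direction score.
    p = W @ wta
    return p / p.sum()
```

In the full system this runs once per frequency band, and the per-band probability vectors are combined into a global estimate; training then amounts to adjusting W.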
We tested our sound-localization architecture in an audio-visual source localization experiment, the setup for which is shown in Figure 2. For each trial, the robot was turned to face a random direction and a stimulus was played from the loudspeaker. The direction of the source was computed using the algorithm described, and the robot then turned to face the estimated direction. A flashing light attached to the loudspeaker, easily picked up by the vision sensor on the front of the robot, allowed it to measure any localization error and adjust the weights according to a simple update rule based on gradient descent.7
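Since the network is a single linear layer, the update rule can be as simple as the delta rule. The sketch below is an assumed form of such an update (the article does not give the exact rule): the flashing light supplies the true direction as a one-hot target, and the squared error between target and output is reduced by gradient descent on the weights. The normalization step of the full pipeline is omitted here for clarity.

```python
import numpy as np


def update_weights(W, wta, p, target, lr=0.05):
    """Hypothetical delta-rule sketch: with output p = W @ wta and
    squared error E = 0.5 * ||target - p||^2, the gradient dE/dW is
    -(target - p) outer wta, so gradient descent adds
    lr * outer(err, wta) to the weights."""
    err = target - p                    # per-direction error signal
    return W + lr * np.outer(err, wta)  # one gradient-descent step
```

Repeated over trials, this pulls the output for each observed WTA pattern toward the visually confirmed direction, which is all the "training" the robot needs.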
As we can see from Figure 3, localization accuracy was initially poor, with the robot consistently underestimating the direction of the source. As training went on, however, we saw a gradual improvement, and the average error fell below 5° after 40 training epochs (where every training sample is applied once per epoch).
Though in this simple experiment we used a flashing light to let the robot identify the source visually, this is unrealistic for real applications. A more plausible method is to bind audio and visual sources based on their onset times, on the assumption that events that occur at the same time are likely to occur at the same place. We tested the effectiveness of this method in a second experiment, in which the system was asked to visually locate an audio-visual source (emitting light and sound in sync) among many flashing lights in the background. For each visual object detected, its onset was captured and compared with the onset of the audio stimulus, and the visual object with the best match was selected (see Figure 4). Despite the background activity, the correct object was selected 75% of the time: an encouraging result given the simplicity of our method.
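The onset-binding idea above can be sketched as a nearest-match rule. This is an illustrative assumption, not the authors' algorithm: the function name, the tolerance value, and the "closest onset wins" criterion are all hypothetical.

```python
def bind_by_onset(audio_onset, visual_onsets, tolerance=0.1):
    """Hypothetical binding rule: among the detected visual objects,
    select the one whose onset time (in seconds) is closest to the
    audio onset, on the assumption that simultaneous events share a
    common source. Returns the index of the best match, or None if no
    onset falls within `tolerance` seconds."""
    best = min(range(len(visual_onsets)),
               key=lambda i: abs(visual_onsets[i] - audio_onset))
    if abs(visual_onsets[best] - audio_onset) > tolerance:
        return None
    return best


# Three flashing lights detected; the one starting at t=1.02 s best
# matches an audio onset at t=1.00 s.
print(bind_by_onset(1.00, [0.40, 1.02, 1.75]))  # -> 1
```

A rule this simple will misbind when two visual events happen to start near the audio onset, which is consistent with the 75% success rate reported above against busy backgrounds.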
More information about this bio-inspired audio-visual source localization system can be found elsewhere.8 To the best of our knowledge, this is the first example of neuromorphic multi-modal sensor fusion. However, it is only a starting point, and many challenges remain. Future developments should include dealing with multiple sound sources, more robust methods of binding audio and visual objects, and faster learning. All of these matter when operating in a real-world environment, where many sources can emit sound simultaneously and training data may not be explicitly provided. Such improvements may ultimately enable researchers to develop a robot that, as animals do, learns sound localization simply by listening to and watching its own surroundings.
Tell us what to cover!
If you'd like to write an article or know of someone else who is doing relevant and interesting stuff, let us know. E-mail the editor and suggest the subject for the article and, if you're suggesting someone else's work, tell us their name, affiliation, and e-mail.