Welcome to The Neuromorphic Engineer

Can spike-based speech recognition systems outperform conventional approaches?

Ismail Uysal, Harsha Sathyendra, and John G. Harris

1 March 2007

Spiking systems prove robust to noise in a simplified experimental domain.

The field of automatic speech recognition (ASR) has advanced far enough in the past decade to produce numerous commercial applications, such as the speech-driven telephone customer service menus now deployed by many companies. Unfortunately, these and other state-of-the-art ASR systems still pale in comparison to human performance, particularly in the presence of noise. Researchers have long been aware of this discrepancy and have often turned to biology for clues to the robustness of the human auditory system. In fact, the most commonly employed features for ASR applications are still the Mel-frequency cepstral coefficients (MFCCs), which mimic the logarithmic distribution of channels across the frequency range of hearing, as observed in the cochlea.

Nonetheless, today's ASR systems are designed with a window-based mindset using hidden Markov models (HMMs) and bear little resemblance to neurobiological computation. Neurons in the brain use all-or-nothing action potentials to communicate timing information. These spike trains encode sensory inputs and carry information at all levels of processing throughout the brain. We believe that spike trains, rather than being artifacts of biology, provide a key to the remarkable noise robustness of the auditory system and can be exploited in man-made recognition systems.

Recently, we proposed a spike-based classification scheme for simple acoustic signals. It exploits the phase synchrony between the parallel streams of spike trains produced by the cochlea and follows this with a time-to-first-spike rank-order decoder for classification.¹ A more recent version of our system replaces the rank-order decoder with a spiking neural network for improved classification. Comparisons with a typical ASR engine show improved performance in the presence of noise. According to the results, spike firing times reveal a phase synchrony among tonotopically distributed auditory nerve fibers that varies with the spectral properties of the input signal. Other researchers have proposed spike-based ASR systems, but none has taken advantage of phase-synchrony coding. We found that the degree of such synchrony (DoS) constitutes a highly noise-robust feature set for classification because it varies little as the noise level changes.

Spike-based classification architecture

The proposed system is composed of three main blocks: speech-to-spike conversion, feature extraction via phase-synchrony coding, and classification via a liquid state machine (LSM). For speech-to-spike conversion, we use an up-to-date cochlear simulation employing an improved inner-hair-cell model with auditory nonlinearities such as adaptation and temporal dynamics.² Human empirical data are used for various cochlear parameters, such as the distribution of the channels across the frequency range of hearing.

Phase-synchrony coding starts from the inter-spike interval (ISI) histogram of each channel, defined as the number of intervals between successive spikes that fall within specified time bins. We define the DoS of a particular channel as the magnitude of the first non-zero peak in the spectrum of its ISI histogram. As shown in Figure 1, even with a very noisy vowel input signal, the fibers with a characteristic frequency (437 Hz) close to the first peak (426 Hz) in the vowel's log-magnitude spectral plot are still able to phase lock very close to that frequency. They also have a higher DoS than other channels, such as the one shown in the bottom plot, whose characteristic frequency (519 Hz) is further from the first formant peak.

Figure 1.

Log-magnitude spectral envelope for /uh/ and the corresponding degree of phase synchrony for two sets of hair cells centered at 437 Hz and 519 Hz (computed for a noisy utterance with 5 dB SNR).
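
The DoS definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the bin width, the 100 Hz lower search limit, the normalisation by the interval count, and the rule that the "first" peak is the lowest-frequency peak reaching half the height of the largest one are all choices made here for the example.

```python
import numpy as np


def isi_histogram(spike_times, bin_width=1e-4, max_interval=0.02):
    """Count inter-spike intervals (in seconds) in equal-width bins."""
    isis = np.diff(np.sort(np.asarray(spike_times, dtype=float)))
    n_bins = int(round(max_interval / bin_width))
    counts, _ = np.histogram(isis, bins=n_bins, range=(0.0, max_interval))
    return counts


def degree_of_synchrony(spike_times, bin_width=1e-4, max_interval=0.02,
                        f_min=100.0, rel_threshold=0.5, n_fft=4096):
    """Return (DoS, peak frequency in Hz) for one channel's spike times."""
    counts = isi_histogram(spike_times, bin_width, max_interval)
    total = counts.sum()
    if total == 0:
        return 0.0, 0.0
    # Magnitude spectrum of the histogram, zero-padded for a finer frequency
    # grid and divided by the interval count (assumption) so that DoS <= 1.
    mag = np.abs(np.fft.rfft(counts, n=n_fft)) / total
    freqs = np.fft.rfftfreq(n_fft, d=bin_width)
    keep = freqs >= f_min  # skip DC and its main lobe (assumed cutoff)
    mag, freqs = mag[keep], freqs[keep]
    # Interior local maxima of the magnitude spectrum.
    peaks = np.flatnonzero((mag[1:-1] > mag[:-2]) & (mag[1:-1] >= mag[2:])) + 1
    if peaks.size == 0:
        return 0.0, 0.0
    # "First" peak: lowest-frequency peak that is not a minor ripple.
    strong = peaks[mag[peaks] >= rel_threshold * mag[peaks].max()]
    first = strong[0]
    return float(mag[first]), float(freqs[first])


# Synthetic check with made-up spike trains (not data from the article).
rng = np.random.default_rng(0)
period = 1.0 / 437.0
# Phase-locked train: spikes sit on multiples of the period, with skipped
# cycles and 0.1 ms timing jitter.
cycles = np.cumsum(rng.integers(1, 5, size=4000))
locked = cycles * period + rng.normal(0.0, 1e-4, size=4000)
# Unsynchronised train: Poisson process with a 4 ms mean interval.
poisson = np.cumsum(rng.exponential(4e-3, size=4000))

dos_locked, f_locked = degree_of_synchrony(locked)
dos_poisson, _ = degree_of_synchrony(poisson)
print(f"locked:  DoS={dos_locked:.2f} at {f_locked:.0f} Hz")
print(f"poisson: DoS={dos_poisson:.2f}")
```

In this toy setting the phase-locked train yields a sharp spectral peak near its locking frequency, while the Poisson train has no comparable peak, which is the contrast the two channels in Figure 1 illustrate.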

Finally, for classification, the system employs an LSM with a randomly connected recurrent neural circuit.³ The idea is to map the input vector to a higher-dimensional space in which the distance between prospective classes is larger. For our system, the input vector, which consists of the degree of synchrony for each channel, is passed to the neural circuit as the membrane potentials of input neurons that make dynamic spiking synapses with the circuit using spike-timing-dependent plasticity. The state of the circuit is low-pass filtered and sampled, and a trainable readout function associates it with a target class (here, a vowel type). Figure 2 shows the overall system design, as well as some of the important system parameters.

Figure 2.

The overall spike-based classification system. The degree of synchrony is extracted from the spike trains generated in each individual cochlear channel. This feature set is then passed to an LSM trained with supervised learning for classification.
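
To make the classification stage more concrete, here is a heavily simplified, LSM-style sketch: a random recurrent circuit of leaky integrate-and-fire neurons, exponentially low-pass filtered spike trains as the circuit state, and a linear readout trained by ridge regression. It is not the system described above. The dynamic synapses and spike-timing-dependent plasticity are omitted, the DoS vector is applied as a constant input current rather than as input-neuron membrane potentials, and every name and parameter value (TinyLiquid, train_readout, classify, network size, time constants, the toy input vectors) is a hypothetical choice for illustration.

```python
import numpy as np


class TinyLiquid:
    """Random recurrent circuit of leaky integrate-and-fire neurons."""

    def __init__(self, n_inputs, n_neurons=100, seed=0):
        rng = np.random.default_rng(seed)
        # Assumed wiring: dense positive input weights and sparse recurrent
        # weights of mixed sign (10% connection probability).
        self.w_in = rng.uniform(0.0, 1.0, size=(n_neurons, n_inputs))
        mask = rng.random((n_neurons, n_neurons)) < 0.1
        self.w_rec = mask * rng.normal(0.0, 0.3, size=(n_neurons, n_neurons))
        np.fill_diagonal(self.w_rec, 0.0)

    def state(self, x, t_sim=0.2, dt=1e-3, tau_m=0.02, tau_s=0.03, v_th=1.0):
        """Drive the circuit with a constant input x; return the filtered state."""
        drive = self.w_in @ np.asarray(x, dtype=float)
        n = drive.size
        v = np.zeros(n)      # membrane potentials
        syn = np.zeros(n)    # recurrent synaptic currents
        trace = np.zeros(n)  # low-pass filtered spike trains
        decay = np.exp(-dt / tau_s)
        for _ in range(int(round(t_sim / dt))):
            v += dt / tau_m * (-v + drive + syn)
            fired = (v >= v_th).astype(float)
            v[fired > 0] = 0.0  # reset after an all-or-nothing spike
            syn = syn * decay + self.w_rec @ fired
            trace = trace * decay + fired
        return trace  # state sampled at the end of the simulation


def train_readout(states, labels, n_classes, ridge=1e-3):
    """Fit a linear readout on one-hot targets by ridge regression."""
    s = np.hstack([states, np.ones((len(states), 1))])  # bias column
    y = np.eye(n_classes)[labels]
    return np.linalg.solve(s.T @ s + ridge * np.eye(s.shape[1]), s.T @ y)


def classify(w_out, state):
    """Return the index of the class with the largest readout output."""
    return int(np.argmax(np.append(state, 1.0) @ w_out))


# Toy data: two made-up 8-channel DoS profiles with small perturbations.
rng = np.random.default_rng(1)
proto = np.array([[0.9, 0.8, 0.3, 0.1, 0.1, 0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.1, 0.1, 0.1, 0.3, 0.8, 0.9]])
labels = np.repeat([0, 1], 10)
inputs = np.clip(proto[labels] + rng.normal(0.0, 0.05, size=(20, 8)), 0.0, 1.0)

liquid = TinyLiquid(n_inputs=8)
states = np.array([liquid.state(x) for x in inputs])
w_out = train_readout(states, labels, n_classes=2)
predictions = [classify(w_out, s) for s in states]
```

Only the readout weights are trained here; the recurrent circuit stays fixed, which is the property that distinguishes this family of classifiers from fully trained recurrent networks.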

Results and discussion

We tested the algorithm on a noisy, multi-speaker, multi-gender vowel dataset and compared it to a typical speech recognition engine employing the well-known MFCCs and an HMM. The percent-correct results are shown in Table 1.

Table 1.

At high signal-to-noise ratio (SNR) values, both systems perform comparably well, but the proposed system using phase-synchrony coding outperforms the MFCC-HMM algorithm by 12% at 5 dB SNR. With regard to the question raised in the title, spike-based recognition, though applied here to a simplified domain, is clearly more noise robust than a conventional ASR system. This performance is mainly due to the ability of tonotopic neuron populations to maintain phase synchrony even in the presence of large amounts of noise.
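
For readers who want to set up a comparable noise condition, the SNR values quoted above follow the usual decibel definition, ten times the base-10 logarithm of the ratio of signal power to noise power. The sketch below mixes noise into a clean signal at a chosen SNR. The article does not state which noise type was used, so white Gaussian noise is an assumption, as are the function names and the synthetic 426 Hz test tone.

```python
import numpy as np


def add_noise_at_snr(signal, snr_db, rng=None):
    """Return signal plus white Gaussian noise scaled to the requested SNR."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    noise = rng.standard_normal(signal.shape)
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR(dB) = 10 * log10(signal_power / noise_power), solved for noise power.
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return signal + noise * np.sqrt(target_noise_power / noise_power)


def measured_snr_db(clean, noisy):
    """Compute the SNR in dB of a noisy signal given its clean version."""
    clean = np.asarray(clean, dtype=float)
    residual = np.asarray(noisy, dtype=float) - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))


# Example: a 0.5 s tone sampled at 16 kHz, degraded to 5 dB SNR.
rng = np.random.default_rng(0)
t = np.arange(0.0, 0.5, 1.0 / 16000.0)
clean = np.sin(2.0 * np.pi * 426.0 * t)
noisy = add_noise_at_snr(clean, snr_db=5.0, rng=rng)
print(f"measured SNR: {measured_snr_db(clean, noisy):.2f} dB")
```

Because the noise is rescaled using its own measured power, the resulting mixture hits the target SNR for that particular noise realisation rather than only on average.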

Future work involves extending these findings to more complex signals and multi-syllable words with the help of relational networks such as those observed in the cortex.

Authors

Ismail Uysal
Computational NeuroEngineering Laboratory, University of Florida

Harsha Sathyendra
Computational NeuroEngineering Laboratory, University of Florida

John G. Harris
Computational NeuroEngineering Laboratory, University of Florida

References

1. I. Uysal, H. Sathyendra, and J. G. Harris, A biologically plausible system approach for noise robust vowel recognition, IEEE Proc. of MWSCAS CD-ROM, 2006.
2. C. J. Sumner, E. A. Lopez-Poveda, L. P. O'Mard, and R. Meddis, Adaptation in a revised inner-hair cell model, J. Acoust. Soc. Am. 113 (2), pp. 893-901, 2003.
3. W. Maass, T. Natschlager, and H. Markram, Real-time computing without stable states: A new framework for neural computation based on perturbations, Neural Computation 14 (11), pp. 2, 2002.

DOI: 10.2417/1200703.0048
