Welcome to The Neuromorphic Engineer
Dealing with unexpected words
Insperata accidunt magis saepe quam quae speres ("things you do not expect happen more often than things you do expect"), warns Plautus (circa 200 BC). Most readers would agree with Plautus that surprising sensory input can be important, since it may signal a new danger or a new opportunity. A hypothesized cognitive process involved in the processing of such inputs is illustrated in Figure 1.
In machine recognition, low-probability items are unlikely to be recognized. For example, in automatic speech recognition (ASR), the linguistic message in the speech data X is coded as a sequence of speech sounds (phonemes) Q. Substrings of phonemes represent words, and sequences of words form phrases. A typical ASR system attempts to find the linguistic message in the phrase, relying heavily on prior knowledge encoded in a text-derived language model and a pronunciation lexicon. Unexpected lexical items (words) in the phrase are typically replaced by acoustically acceptable in-vocabulary items.1
Our laboratory is working on the identification and description of low-probability words as part of the large multinational DI-RAC project (Detection and Identification of Rare Audio-Visual Cues), recently awarded by the European Commission. The principles of our approach are briefly described here.
To emulate the cognitive process shown in Figure 1, a contemporary ASR system could provide the predictive information stream. We then need to estimate similar information without heavy use of prior knowledge. To estimate the context-constrained and context-unconstrained phoneme posterior probabilities, we have used a continuous digit recognizer based on a hybrid hidden-Markov-model/neural-network (HMM-NN) technique,1 shown schematically in Figure 2. First, the context-unconstrained phoneme probabilities are estimated. These are then used in the search for the most likely stochastic model of the input utterance. A by-product of this search is a set of context-constrained phoneme probabilities.2
The basic principles of deriving the context-unconstrained posterior probabilities of phonemes are illustrated in Figures 3 and 4. A feed-forward artificial neural network, trained on phoneme-labelled speech data, estimates the unconstrained posterior probability distribution pi(Q|X).3 Its input is a segment xi of the data X that carries the local information about the identity of the underlying phoneme at instant i; this segment is projected onto 448 time-spectral bases. As seen in the middle part of Figure 5, the estimate from the neural network can differ from the estimate in the context-constrained stream, since it does not depend on the constraints L.
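The mapping from a spectro-temporal segment to phoneme posteriors can be sketched as a small feed-forward network with a softmax output. The layer sizes other than the 448-dimensional input, the phoneme inventory size, and the (untrained, random) weights are illustrative assumptions, not details of our recognizer:

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 448   # time-spectral basis projections per segment (from the text)
N_HIDDEN = 500     # hypothetical hidden-layer size
N_PHONEMES = 29    # hypothetical phoneme inventory size

# Hypothetical, untrained weights; a real system learns these from
# phoneme-labelled speech data.
W1 = rng.normal(0.0, 0.01, (N_FEATURES, N_HIDDEN))
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(0.0, 0.01, (N_HIDDEN, N_PHONEMES))
b2 = np.zeros(N_PHONEMES)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def phoneme_posteriors(segment):
    """Map one 448-dim segment x_i to the unconstrained posteriors p_i(Q|X)."""
    hidden = np.tanh(segment @ W1 + b1)
    return softmax(hidden @ W2 + b2)

x_i = rng.normal(size=N_FEATURES)           # one local segment of the input X
p = phoneme_posteriors(x_i)                 # a proper distribution over phonemes
```

The softmax output guarantees that each frame's estimate is a valid probability distribution over the phoneme set, which is what makes the later comparison between the two streams meaningful.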
The context-unconstrained phoneme probabilities can then be used in a search for the most likely hidden Markov model (HMM) sequence that could have produced the given speech phrase. As a by-product, the HMM can also yield, for any instant i of the message, its estimates of the posterior probabilities of the hypothesized phonemes pi(Q|X,L), 'corrected' by a set of constraints L implied by the training-speech data, the model architecture, the pronunciation lexicon, and the applied language model.4 When it encounters an unknown item in the phoneme string (e.g. the word 'three' in Figure 5), it assumes the item is one of the well-known items. Note that these 'in-context' posterior probabilities, even when wrong, are estimated with high confidence.
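How an HMM yields such constrained per-frame posteriors can be sketched with the standard forward-backward recursions: the transition structure of the model stands in for the constraints L, and the per-frame observation scores play the role of the unconstrained stream. The three-state left-to-right model and all numbers below are toy illustrations, not the DI-RAC recognizer:

```python
import numpy as np

def forward_backward(trans, init, obs_lik):
    """Per-frame state posteriors gamma[t, q] ~ p_t(q | X, L) for an HMM.

    trans:   (S, S) transition matrix (encodes the lexical constraints L)
    init:    (S,)   initial state distribution
    obs_lik: (T, S) per-frame observation likelihoods (the acoustic stream)
    """
    T, S = obs_lik.shape
    alpha = np.zeros((T, S))
    beta = np.ones((T, S))
    alpha[0] = init * obs_lik[0]
    alpha[0] /= alpha[0].sum()                        # scale to avoid underflow
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * obs_lik[t]
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (obs_lik[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    # per-frame scaling constants cancel after renormalization
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Toy left-to-right model with three phoneme states (e.g. /w/-/ah/-/n/).
trans = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.6, 0.4],
                  [0.0, 0.0, 1.0]])
init = np.array([1.0, 0.0, 0.0])
obs_lik = np.random.default_rng(1).uniform(0.1, 1.0, (10, 3))
gamma = forward_backward(trans, init, obs_lik)        # rows sum to 1
```

The left-to-right transition matrix is exactly what forces the decoder to explain every frame with an in-vocabulary state sequence, which is why an out-of-vocabulary word still receives confident (but wrong) in-context posteriors.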
An example of a typical result4 is shown in Figure 5. As seen in the lower part of the figure, an inconsistency between the two information streams can indicate an unexpected out-of-vocabulary word.
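One plausible way to quantify such an inconsistency (the article does not prescribe a specific measure) is a frame-wise divergence between the two posterior streams, smoothed over a short window so that only sustained disagreement raises a flag. The KL divergence, the threshold, and the window length below are all illustrative assumptions:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """Frame-wise KL(p || q) between two posterior streams of shape (T, Q)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

def flag_unexpected(p_constrained, p_unconstrained, threshold=1.0, win=5):
    """Boolean mask of frames where the two streams disagree persistently.

    Sustained divergence between the in-context stream p_i(Q|X,L) and the
    out-of-context stream p_i(Q|X) is taken as evidence of an unexpected
    (out-of-vocabulary) item; threshold and window are illustrative.
    """
    d = kl_divergence(p_constrained, p_unconstrained)
    kernel = np.ones(win) / win
    smoothed = np.convolve(d, kernel, mode="same")   # moving average over frames
    return smoothed > threshold

# Toy demo: the streams agree for 20 frames, then diverge for 10.
rng = np.random.default_rng(2)
T, Q = 30, 5
p_c = rng.dirichlet(np.ones(Q) * 5, size=T)          # in-context stream
p_u = p_c.copy()                                     # out-of-context stream
p_u[20:] = rng.dirichlet(np.ones(Q) * 0.2, size=10)  # disagreement begins here
mask = flag_unexpected(p_c, p_u, threshold=0.5)
```

Smoothing before thresholding reflects the intuition that a genuine out-of-vocabulary word spans many frames, whereas a single-frame spike is more likely noise.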
Being able to identify which words are not in the lexicon of the recognizer, and to estimate their pronunciation, may allow these new words to be added to the pronunciation dictionary. This would lead to an ASR system that improves its performance as it is used over time, i.e. one that is able to learn. However, an inconsistency between the in-context and out-of-context probability streams need not indicate the presence of an unexpected lexical item; it could also indicate other inadequacies of the model. Further, the inconsistency might indicate corrupted input data if the in-context estimation, using the prior L, yields a more reliable estimate than the unconstrained out-of-context stream. Thus, a measure of confidence in the estimates from both streams would be desirable whenever corrupted input is a possibility.
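One simple proxy for such a confidence measure (an assumption on our part, not something the article specifies) is the entropy of each stream's per-frame posterior: a peaked distribution signals a confident estimate, a flat one signals uncertainty.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a posterior distribution.
    Low entropy = peaked, confident estimate; high entropy = uncertainty."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

confident = np.array([0.97, 0.01, 0.01, 0.01])   # one phoneme dominates
uncertain = np.array([0.25, 0.25, 0.25, 0.25])   # no phoneme preferred
```

A stream whose posteriors are high-entropy at the frames where the two streams disagree could then be discounted, so that corrupted acoustics are not mistaken for a new word.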