Speech is a succession of voiced sounds that originate from the vocal cords, interspersed with consonant sounds such as 's', which originates from the hissing of air between the teeth, and 't', which is produced by an explosive release of air pressure by the tongue. Speech waveforms are often very complicated, and tend to be roughly periodic over short (e.g. 10-millisecond) intervals.
Automatic speech recognition is usually a multistage process, in which the first stage is intended to yield a representation of speech that is simpler and less repetitive than the original acoustic waveform. The first stage typically performs various measurements on each successive 10-millisecond portion of the acoustic waveform. For example, in the Fourier spectrum for each such portion, the energy in various frequency bands spanning 200–6,000 hertz may be measured. Alternatively, linear predictive coding coefficients may be taken as measurements. Zero-crossing rates, glottal frequency, and total energy are further important examples of measurements.
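The first-stage measurements described above can be illustrated with a minimal sketch. It assumes the waveform is already available as a list of numeric samples at a known sampling rate; the function names, the particular band edges within 200–6,000 hertz, and the normalization of the spectral power are illustrative assumptions rather than details of any particular recognizer. Linear predictive coding and glottal frequency are not covered.

```python
import math

def frames(samples, rate_hz, frame_ms=10):
    """Split a waveform into successive non-overlapping frames of frame_ms."""
    n = int(rate_hz * frame_ms / 1000)
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]

def total_energy(frame):
    """Sum of squared sample values in the frame."""
    return sum(x * x for x in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def band_energies(frame, rate_hz, band_edges_hz):
    """Energy per frequency band, from a naive discrete Fourier transform."""
    n = len(frame)
    energies = [0.0] * (len(band_edges_hz) - 1)
    for k in range(1, n // 2 + 1):
        freq = k * rate_hz / n  # centre frequency of DFT bin k
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        power = (re * re + im * im) / n  # normalization chosen for illustration
        for b in range(len(energies)):
            if band_edges_hz[b] <= freq < band_edges_hz[b + 1]:
                energies[b] += power
    return energies

def measure(samples, rate_hz, band_edges_hz=(200, 1000, 2000, 4000, 6000)):
    """One measurement vector per 10 ms frame: energy, zero crossings, band energies."""
    return [[total_energy(f), zero_crossing_rate(f)]
            + band_energies(f, rate_hz, band_edges_hz)
            for f in frames(samples, rate_hz)]
```

Calling `measure(samples, 16000)` on a waveform sampled at 16 kHz would yield one six-element vector for every 10 milliseconds of signal. A practical system would use a fast Fourier transform rather than the naive transform shown here.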
In the simplest kind of automatic speech recognition system, speakers are required to leave silences before and after isolated words. These isolated-word recognition systems usually work by matching the sequence of measurements obtained from a spoken word against various sequences of measurements stored in memory. These stored sequences of measurements are known as speech templates, and there is at least one such template for each different word that the machine can recognize. A spoken word is recognized as being the same as the template word that it matches best. Preferably the template words are obtained from the same speaker whose speech is to be recognized. A template can be obtained by having this speaker pronounce the word several times and averaging, in some way, the resulting sequences of measurements. The process of obtaining a stored template from several utterances of one word can be regarded as a learning process.
A spoken word is generally a sequence of phonemes, and, for example, the third phoneme of a spoken word should be matched against the third phoneme of that word's stored template. The total duration of a spoken word may be different each time the word is uttered, and elongation or compression of the time scale may impair the alignment of phonemic data. For short words this problem can be mitigated by measuring the duration of a spoken word and then elongating or compressing the time scale to standardize the duration of the word before matching it with templates that have been similarly standardized. Speech recognition machines working on these principles became commercially available in the 1970s, and could learn to recognize about 32 words spoken quite carefully by a single speaker.
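A minimal sketch of isolated-word recognition with a standardized time scale might look as follows. It assumes each utterance is a list of measurement vectors (one per 10-millisecond portion), uses linear interpolation to standardize duration, averages frame by frame to form a template, and scores matches by summed squared difference. All of these choices, along with the function names and the standard length of 20 frames, are assumptions made for illustration; the text does not specify how averaging or matching was done in the machines of the period.

```python
def resample(seq, length):
    """Linearly stretch or compress a sequence of vectors to `length` frames."""
    if len(seq) == 1:
        return [list(seq[0]) for _ in range(length)]
    out = []
    for i in range(length):
        pos = i * (len(seq) - 1) / (length - 1)  # position on the original time scale
        lo = int(pos)
        hi = min(lo + 1, len(seq) - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(seq[lo], seq[hi])])
    return out

def make_template(utterances, length=20):
    """Average several time-normalized utterances of one word, frame by frame."""
    norm = [resample(u, length) for u in utterances]
    dims = len(norm[0][0])
    return [[sum(u[t][d] for u in norm) / len(norm) for d in range(dims)]
            for t in range(length)]

def distance(a, b):
    """Sum of squared differences between two equal-length sequences."""
    return sum((x - y) ** 2 for fa, fb in zip(a, b) for x, y in zip(fa, fb))

def recognize(utterance, templates, length=20):
    """Return the word whose template best matches the normalized utterance."""
    norm = resample(utterance, length)
    return min(templates, key=lambda word: distance(norm, templates[word]))
```

Here `templates` is a dictionary mapping each word to the result of `make_template` applied to several utterances of that word by the intended speaker. Because the whole word is stretched uniformly, this approach cannot compensate for a time scale that varies within the word.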
More sophisticated template-matching technology allows elongation or compression of a word's time scale to vary while the word is being spoken. Time-scale variations can be accommodated by dynamic time warping, which became popular in the 1980s because of the decreasing cost of computation, and because it yields more accurate recognition of larger vocabularies than can be recognized by less sophisticated matching techniques. Dynamic time-warping techniques have been developed to recognize whole sentences composed of words not separated by silences. Templates for whole sentences are composed of single-word templates, and the time-warping technique that time-aligns phoneme-like parts of a single word has been developed to time-align whole-word parts of a single sentence. After matching, it is easy to find which part of the spoken sentence corresponds to which part of a template sentence, and thus find when each spoken word begins and ends.
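The basic form of dynamic time warping can be sketched as below. This is the textbook recurrence with three permitted steps and a squared-difference local cost, assumed here for illustration; the systems described above may have used different step constraints, weights, or distance measures. The function returns both the total matching cost and the alignment path, which is what makes it possible to read off which part of the input corresponds to which part of the template.

```python
def dtw(a, b):
    """Align two sequences of measurement vectors.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    pairing frame i of `a` with frame j of `b`.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i frames of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1]))
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b frame repeated
                                 cost[i][j - 1],      # b advances, a frame repeated
                                 cost[i - 1][j - 1])  # both advance
    # Trace back from the end to recover the time alignment.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path
```

A word would be recognized by computing `dtw(utterance, template)` for every stored template and choosing the template with the lowest cost. The running time is proportional to the product of the two sequence lengths, which is consistent with the technique becoming practical only as computation became cheaper.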
Except for specialized applications, it is not practical to store or synthesize sentence templates, and radically different techniques are required for automating the work of a typist who types unrestricted text that is dictated without silences between successive words. Instead of attempting to recognize whole words directly by template matching, it is usual to attempt to classify successive subword portions, known as segments. A segment may, for instance, be a portion during which the results of measurements on the acoustic waveform do not change by more than a threshold amount. Alternatively, successive 10-millisecond portions of the utterance may be regarded as segments. Segments may be classified, sometimes erroneously, by means of classical pattern recognition techniques, which yield one or more plausible labels for each segment. A speech recognition machine contains a dictionary which, for each recognizable word, stores one or more combinations of segment labels for segments of that word. This stored lexical knowledge is generally not sufficient to cope with erroneous subdivision of speech into segments and erroneous classification of these segments. To bring further order out of this chaos it is usual to employ knowledge of syntax and semantics in addition to lexical knowledge.
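The segment-based approach can be illustrated with a deliberately simplified sketch. It assumes that a new segment starts whenever a frame differs from the first frame of the current segment by more than a threshold, that each segment is labelled by the nearest stored prototype (one simple classical pattern recognition rule), and that the dictionary is consulted by exact match. The function names, the single-label output, and the toy labels used in the usage note are all hypothetical. Exact lookup is precisely the kind of lexical matching that fails under segmentation and labelling errors, which is why syntactic and semantic knowledge is brought in.

```python
import math

def euclid(u, v):
    """Euclidean distance between two measurement vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def segment(frames, threshold):
    """Group consecutive frames that stay within `threshold` of the segment's first frame."""
    segments = []
    for f in frames:
        if segments and euclid(f, segments[-1][0]) <= threshold:
            segments[-1].append(f)
        else:
            segments.append([f])
    return segments

def classify(seg, prototypes):
    """Label a segment with the prototype nearest to its mean measurement vector."""
    dims = len(seg[0])
    mean = [sum(f[d] for f in seg) / len(seg) for d in range(dims)]
    return min(prototypes, key=lambda label: euclid(mean, prototypes[label]))

def lookup(labels, dictionary):
    """Return the words for which the observed label sequence is a stored combination."""
    return [word for word, combos in dictionary.items() if tuple(labels) in combos]
```

For example, with prototypes `{"s": [0.0], "a": [1.0]}` and a dictionary entry `{"sas": {("s", "a", "s")}}`, the label sequence produced by `classify` over the output of `segment` can be passed to `lookup` to retrieve candidate words.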
(Published 1987)
— Julian R. Ullman