The generation of synthetic speech signals in order to convey information to listeners, usually in response to a verbal or textual request from a user. This speech synthesis typically employs a computer program and requires access to stored portions of speech previously spoken by humans. The naturalness of the synthetic voice depends on several factors, including the vocabulary of words to pronounce, the amount of stored speech, and the complexity of the synthesis programs. The most basic voice response simply plays back appropriate short verbal responses, which are only copies of human speech signals stored using digital sampling technology. The most universal systems are capable of transforming any given text into comprehensible speech for a given language. These latter systems so far exist for only 20 or so of the world's major languages, and are flawed in that the speech they produce, while usually intelligible, sounds unnatural.
Voice response is also known as text-to-speech synthesis (TTS) because the task usually has as input a textual message (to be spoken by the machine). The text could be in tabular form (for example, reading aloud a set of numbers), or, more typically, formatted as normal sentences. Speech synthesizers are much more flexible and universal than their speech-recognition counterparts, for which human talkers must significantly constrain their verbal input to the machines in order to achieve accurate recognition. In TTS, a computer database usually determines the text to be synthetically spoken, following an automatic analysis of each user request. The user may pose the request in response to a menu of inquiries (for example, by an automated telephone dialogue, by pushing a sequence of handset keys, or by a series of brief verbal responses). Thus, the term “voice response” is used to describe the synthetic speech as an output to a user inquiry. The value of such a synthetic voice is the capability of efficiently receiving information from a computer without needing a computer screen or printer. Given the prevalence of telephones, as well as the difficulty of reading small computer screens on many portable computer devices, voice response is a convenient way to get data. See also Speech recognition.
The simplest approach to voice response is to digitally sample natural speech and output the samples later as needed. A common Nyquist sampling rate is 10,000 samples per second, which preserves sound frequencies up to almost 5 kHz, allowing quite natural speech. High-frequency energy in fricative sounds is severely attenuated (although less so than on telephone lines), but this usually has little impact on intelligibility. Straightforward sampling uses 12 bits per sample and therefore consumes memory at 120 kbits/s. Such high data rates are prohibitive except for applications with very small vocabularies. Even in cases with more limited bandwidth (for example, 8000 samples per second in telephone applications) and more advanced coding schemes, the straightforward playback approach is unacceptable for general TTS. Despite rapidly decreasing costs for computer memory, it will remain impossible to store all the necessary speech signals except for applications with very restricted vocabulary needs. See also Compact disk; Data compression; Information theory; Pulse modulation.
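The data-rate arithmetic above can be checked with a short sketch. The function names are illustrative only, and the one-hour figure is an added example rather than a number from the text; it assumes uncompressed, single-channel storage.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Bits per second for uncompressed, single-channel sampled speech."""
    return sample_rate_hz * bits_per_sample


def storage_bytes(sample_rate_hz, bits_per_sample, seconds):
    """Bytes needed to store `seconds` of speech at the given rate."""
    return pcm_bit_rate(sample_rate_hz, bits_per_sample) * seconds / 8


# 10,000 samples/s at 12 bits per sample, as in the text.
wideband_rate = pcm_bit_rate(10_000, 12)      # 120,000 bits/s = 120 kbits/s

# Telephone-style sampling at 8000 samples/s, keeping 12 bits per sample
# (an assumption; real telephone coders typically use fewer bits).
telephone_rate = pcm_bit_rate(8_000, 12)      # 96,000 bits/s

# Storage for one hour of recorded speech at the wideband rate.
one_hour = storage_bytes(10_000, 12, 3600)    # 54,000,000 bytes

print(wideband_rate, telephone_rate, one_hour)
```

At this rate a single hour of recorded speech already occupies about 54 megabytes, which illustrates why plain playback scales poorly beyond small vocabularies.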
A voice response system that minimizes memory needs generates synthetic speech from sequences of brief basic sounds, which gives it great flexibility. Since most languages have only 30–40 phonemes (distinct linguistic sounds), storing units of such size and number is trivial. However, the spectral features of these short concatenated sounds (lasting 50–200 ms) must be adjusted at their frequent boundaries to avoid severely discontinuous speech. Normal pronunciation of each phoneme in an utterance depends heavily on its phonetic context (for example, on neighboring phonemes, intonation, and speaking rate). The adjustment process and the need to calculate an appropriate intonation for each context lead to complicated synthesizers with correspondingly less natural output speech.
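As a rough illustration of why unit boundaries need adjustment, the sketch below joins stored units with a linear crossfade in the waveform domain. This is a deliberately simplified stand-in: the text describes adjusting spectral features, which real synthesizers do with considerably more elaborate methods. The function name, the `overlap` parameter, and the toy sample values are all assumptions made for the example.

```python
def crossfade_concat(units, overlap):
    """Join lists of samples, blending `overlap` samples at each boundary.

    The tail of the speech built so far fades out while the head of the
    next unit fades in, which avoids an abrupt jump at the join.
    """
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming unit, rising toward 1
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        out.extend(unit[n:])       # append the part of the unit not blended
    return out


# Two toy "units" with very different levels; a plain join would jump
# from 1.0 straight to 0.0 at the boundary.
joined = crossfade_concat([[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]], overlap=2)
print(joined)
```

With `overlap=0` the function reduces to plain concatenation, which makes the discontinuity at each boundary easy to compare against the blended result.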
Current synthesizers usually compromise between the two extremes of minimizing storage and minimizing complexity. One approach is to store thousands of speech units of varying size, which can be automatically extracted from natural speech. In contrast to automatic speech recognition, where segmentation of speech into pertinent units is very difficult, TTS training exploits prior knowledge of the text (the training speaker reads a furnished text).
Synthesizers that accept general text as input need a linguistic processor to convert the text into phonetic symbols in order to access the appropriate stored speech units. One task is to convert letters into phonemes. This may be as simple as a table look-up: a computer dictionary with an entry for each word in the chosen language, noting its pronunciation (including syllable stress), syntactic category, and possibly some semantic information. Many systems also have language-dependent rules, which examine the context of each letter in a word to determine how it is pronounced; for example, the letter [p] in English is pronounced /p/, except before the letter [h], where the pair is usually pronounced /f/ (for example, in “telephone”; however, [p] keeps its normal pronunciation in “cupholder”). English needs hundreds of such rules. TTS often employs these rules as a backup procedure to handle new words, foreign words, and typographical mistakes (that is, cases not in the dictionary). See also Phonetics.
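The dictionary-first, rules-as-backup arrangement can be sketched as follows. Everything here is a toy assumption: the one-entry lexicon, the informal phoneme symbols, and a single context rule for [p] before [h]. Letters not covered by that rule are simply passed through unchanged, which is a placeholder and not a real pronunciation rule; an actual English front end would need hundreds of rules and a full dictionary.

```python
# Hypothetical mini-lexicon: word -> informal phoneme symbols.
# "cupholder" is listed because the fallback rule below would get it wrong.
LEXICON = {
    "cupholder": ["k", "ah", "p", "h", "ow", "l", "d", "er"],
}


def letters_to_phonemes(word):
    """Backup rules: [p] before [h] becomes /f/; other letters pass through."""
    phonemes = []
    i = 0
    while i < len(word):
        if word[i] == "p" and i + 1 < len(word) and word[i + 1] == "h":
            phonemes.append("f")
            i += 2                     # consume both letters of the pair
        else:
            phonemes.append(word[i])   # placeholder, not a real rule
            i += 1
    return phonemes


def pronounce(word):
    """Look the word up in the dictionary first; fall back to the rules."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    return letters_to_phonemes(word)


print(pronounce("telephone"))   # rule path: the "ph" pair is mapped to "f"
print(pronounce("cupholder"))   # dictionary path: "p" and "h" stay separate
```

Checking the dictionary first lets known exceptions override the general rules, while the rules still produce some output for words the dictionary has never seen.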
The problem of determining an appropriate intonation for each input text continues to confound TTS. In simple voice response, the stored units are large (for example, phrases), and pitch and intensity are usually stored explicitly with the spectral parameters or implicitly in the signals of waveform synthesizers. However, when smaller units are concatenated, the synthetic speech sounds unnatural unless the intonation is adjusted for context. Intonation varies significantly among languages. Although automatic statistical methods show some promise, intonation analysis has mostly been manual.
Simple voice-response systems work equally well for all languages since they just play back previously stored speech units. For general TTS, however, major synthesizer components are highly language-dependent. The front end of TTS systems, dealing with letter-to-phoneme rules, the relationship between text and intonation, and different sets of phonemes, is language-dependent. The back end, representing simulation of the vocal tract via digital filters, is relatively invariant across languages. Even languages with sounds (for example, clicks) other than the usual pulmonic egressives require only simple modifications.
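To give a sense of what simulating the vocal tract with digital filters can look like, the sketch below passes a periodic impulse train (a crude stand-in for voiced excitation) through one second-order resonator, the kind of filter commonly used to model a single formant. It is a minimal illustration under assumed values: the 100-Hz pitch, the 500-Hz resonance, the 100-Hz bandwidth, and all names are chosen for the example, and a practical back end would combine several such filters with time-varying parameters.

```python
import math


def impulse_train(n_samples, pitch_hz, sample_rate_hz):
    """One unit impulse per pitch period, zeros elsewhere."""
    period = int(sample_rate_hz // pitch_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(signal, freq_hz, bandwidth_hz, sample_rate_hz):
    """Two-pole resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    theta = 2 * math.pi * freq_hz / sample_rate_hz
    r = math.exp(-math.pi * bandwidth_hz / sample_rate_hz)  # pole radius < 1
    a1 = 2 * r * math.cos(theta)
    a2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


SAMPLE_RATE = 10_000  # samples per second
source = impulse_train(1000, 100, SAMPLE_RATE)
voiced = resonator(source, 500, 100, SAMPLE_RATE)
print(len(voiced), max(abs(v) for v in voiced))
```

Nothing in the filter itself refers to a particular language; only the parameter values fed to it would change, which is consistent with the back end being relatively invariant across languages.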
Commercial synthesizers are widely available for about 10 languages. They often combine software, memory, and processing chips, and range from expensive systems providing close-to-natural speech to inexpensive personal computer programs. General digital signal processing chips are widely used for TTS. Current microprocessors can easily handle the speeds for synthesis, and indeed synthesizers exist entirely in software. Memory requirements can still be a concern, especially for some of the newer waveform concatenation systems. See also Microprocessor; Speech.