A set of audible sounds produced by disturbing the air through the integrated movements of certain groups of anatomical structures. Humans attach symbolic values to these sounds for communication. There are many approaches to the study of speech.
Speech production
The physiology of speech production may be described in terms of respiration, phonation, and articulation. These interacting processes are activated, coordinated, and monitored by acoustical and kinesthetic feedback through the nervous system.
Most of the speech sounds of the major languages of the world are formed during exhalation. Consequently, during speech the period of exhalation is generally much longer than that of inhalation. The aerodynamics of the breath stream influence the rate and mode of the vibration of the vocal folds. This involves interactions between the pressures initiated by thoracic movements and the position and tension of the vocal folds. See also Respiration.
The phonatory and articulatory mechanisms of speech may be regarded as an acoustical system whose properties are comparable to those of a tube of varying cross-sectional dimensions. At the lower end of the tube, or the vocal tract, is the larynx. It is situated directly above the trachea and is composed of a group of cartilages, tissues, and muscles. The upper end of the vocal tract may terminate at the lips, at the nose, or both. The length of the vocal tract averages 6.5 in. (16 cm) in men and may be increased by either pursing the lips or lowering the larynx.
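The tube analogy can be made concrete with a rough calculation. The sketch below treats the vocal tract as a uniform tube closed at the glottis and open at the lips, a quarter-wavelength resonator; this idealization, the function name `tube_resonances`, and the assumed sound speed of 35,000 cm/s are illustrative choices and are not taken from the text above. A real vocal tract varies in cross section, so its resonances depart from these values.

```python
# Assumed speed of sound in warm, humid air, in cm/s (illustrative value).
SPEED_OF_SOUND_CM_PER_S = 35_000.0


def tube_resonances(length_cm, count=3, speed_cm_per_s=SPEED_OF_SOUND_CM_PER_S):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.

    Uses the quarter-wavelength formula F_n = (2n - 1) * c / (4 * L).
    """
    if length_cm <= 0:
        raise ValueError("length_cm must be positive")
    return [
        (2 * n - 1) * speed_cm_per_s / (4.0 * length_cm)
        for n in range(1, count + 1)
    ]


if __name__ == "__main__":
    # Compare the average length cited above with a slightly lengthened tract.
    for length in (16.0, 17.5):
        formatted = ", ".join(f"{f:.0f}" for f in tube_resonances(length))
        print(f"{length} cm: {formatted} Hz")
```

Under this model, lengthening the tube lowers every resonance, which is consistent with the effect of pursing the lips or lowering the larynx.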
The larynx is the primary mechanism for phonation, that is, the generation of the glottal tone. The vocal folds consist of connective tissue and muscular fibers which attach anteriorly to the thyroid cartilage and posteriorly to the vocal processes of the arytenoid cartilages. The vibrating edge of the vocal folds measures about 0.92–1.08 in. (23–27 mm) in men and considerably less in women. The aperture between the vocal folds is known as the glottis. The tension and position of the vocal folds are adjusted by the intrinsic laryngeal muscles, primarily through movement of the two arytenoid cartilages. See also Larynx.
When the vocal folds are brought together and there is a balanced air pressure to drive them, they vibrate laterally in opposite directions. During phonation, the vocal folds do not transmit the major portion of the energy to the air. They control the energy by regulating the frequency and amount of air passing through the glottis. Their rate and mode of opening and closing are dependent upon the position and tension of the folds and the pressure and velocity of airflow. The tones are produced by the recurrent puffs of air passing through the glottis and striking into the supralaryngeal cavities.
Speech sounds produced during phonation are called voiced. Almost all of the vowel sounds of the major languages and some of the consonants are voiced. In English, voiced consonants may be illustrated by the initial and final sounds in the following words: “bathe,” “dog,” “man,” “jail.” The speech sounds produced when the vocal folds are apart and are not vibrating are called unvoiced; examples are the consonants in the words “hat,” “cap,” “sash,” “faith.” During whispering all the sounds are unvoiced.
The rate of vibration of the vocal folds is the fundamental frequency of the voice (F0). It correlates well with the perception of pitch. The frequency increases when the vocal folds are made taut. Relative differences in the fundamental frequency of the voice are utilized in all languages to signal some aspects of linguistic information.
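One common way to measure the fundamental frequency from a recorded waveform is to find the lag at which the signal best matches a delayed copy of itself. The following is a minimal sketch of that autocorrelation approach; the function name `estimate_f0`, the search range of 75 to 400 Hz, and the synthetic test signal are assumptions made for illustration, and practical pitch trackers add voicing detection and smoothing that are omitted here.

```python
import math


def estimate_f0(samples, sample_rate, f0_min=75.0, f0_max=400.0):
    """Estimate the fundamental frequency (Hz) from the autocorrelation peak."""
    min_lag = int(sample_rate / f0_max)  # shortest period considered
    max_lag = int(sample_rate / f0_min)  # longest period considered
    if len(samples) <= max_lag:
        raise ValueError("signal too short for the requested f0_min")

    # Remove any constant offset before correlating.
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]

    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        n = len(centered) - lag
        # Average product of the signal with itself delayed by `lag` samples.
        score = sum(centered[i] * centered[i + lag] for i in range(n)) / n
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


if __name__ == "__main__":
    rate = 8000
    # Synthetic "voiced" signal: a 125 Hz fundamental plus a weaker second harmonic.
    signal = [
        math.sin(2 * math.pi * 125 * t / rate)
        + 0.5 * math.sin(2 * math.pi * 250 * t / rate)
        for t in range(1024)
    ]
    print(f"estimated F0: {estimate_f0(signal, rate):.1f} Hz")
```

The lag of the strongest match corresponds to one period of vocal fold vibration, so dividing the sample rate by that lag gives the rate of vibration in hertz.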
Many languages of the world are known as tone languages, because they use the fundamental frequency of the voice to distinguish between words. Mandarin Chinese is a classic example of a tone language, with four distinct tones. Said with a falling fundamental frequency of the voice, ma means “to scold.” Said with a rising fundamental frequency, it means “hemp.” With a level fundamental frequency it means “mother,” and with a dipping fundamental frequency it means “horse.” In Chinese, changing a tone has the same kind of effect on the meaning of a word as changing a vowel or consonant in a language such as English.
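The way a fundamental-frequency contour selects among otherwise identical syllables can be sketched in a few lines. The example below labels a short sequence of F0 values as level, rising, falling, or dipping and looks up the corresponding meaning of ma given above. The function name `classify_contour`, the 10 Hz threshold, and the sample contours are hypothetical; real tone recognition must also normalize for speaker range and context.

```python
# Meanings of "ma" under each contour, as listed in the text above.
MA_MEANINGS = {
    "level": "mother",
    "rising": "hemp",
    "dipping": "horse",
    "falling": "to scold",
}


def classify_contour(f0_hz, threshold_hz=10.0):
    """Label an F0 contour as 'level', 'rising', 'falling', or 'dipping'.

    threshold_hz is an assumed minimum change that counts as movement.
    """
    if len(f0_hz) < 3:
        raise ValueError("need at least three F0 values")
    start, end, lowest = f0_hz[0], f0_hz[-1], min(f0_hz)
    # A dip: the contour drops well below both endpoints somewhere in between.
    if start - lowest > threshold_hz and end - lowest > threshold_hz:
        return "dipping"
    if end - start > threshold_hz:
        return "rising"
    if start - end > threshold_hz:
        return "falling"
    return "level"


if __name__ == "__main__":
    examples = {
        "contour A": [220, 221, 220, 219],
        "contour B": [180, 195, 215, 235],
        "contour C": [200, 170, 160, 205],
        "contour D": [240, 215, 185, 160],
    }
    for name, contour in examples.items():
        tone = classify_contour(contour)
        print(f"{name}: {tone} -> ma = {MA_MEANINGS[tone]!r}")
```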
The activity of the structures above and including the larynx in forming speech sound is known as articulation. It involves some of the muscles of the pharynx, palate, tongue, and face, as well as the muscles of mastication.
The primary types of speech sounds of the major languages may be classified as vowels, nasals, plosives, and fricatives. They may be described in terms of degree and place of constriction along the vocal tract. See also Phonetics.
The only source of excitation for vowels is at the glottis. During vowel production the vocal tract is relatively open and the air flows over the center of the tongue, causing a minimum of turbulence. The phonetic value of the vowel is determined by the resonances of the vocal tract, which are in turn determined by the shape and position of the tongue and lips.
The nasal cavities can be coupled onto the resonance system of the vocal tract by lowering the velum and permitting airflow through the nose. Vowels produced with the addition of nasal resonances are known as nasalized vowels. Nasalization may be used to distinguish meanings of words made up of otherwise identical sounds, such as bas and banc in French. If the oral passage is completely constricted and air flows only through the nose, the resulting sounds are nasal consonants. The three nasal consonants in “meaning” are formed with the constriction successively at the lips, the hard palate, and the soft palate.
Plosives are characterized by the complete interception of airflow at one or more places along the vocal tract. The places of constriction and the manner of the release are the primary determinants of the phonetic properties of the plosives. The words “par,” “bar,” “tar,” and “car” begin with plosives. When the interception is brief and the constriction is not necessarily complete, the sound is classified as a flap. By tensing the articulatory mechanism in proper relation to the airflow, it is possible to set the mechanism into vibrations which quasiperiodically intercept the airflow. These sounds are called trills.
Fricatives are produced by a partial constriction along the vocal tract which results in turbulence. Their properties are determined by the place or places of constriction and the shape of the modifying cavities. The fricatives in English may be illustrated by the initial and final consonants in the words “vase,” “this,” “faith,” “hash.”
The ability to produce meaningful speech is dependent in part upon the association areas of the brain. It is through them that the stimuli which enter the brain are interrelated. These areas are connected to motor areas of the brain which send fibers to the motor nuclei of the cranial nerves and hence to the muscles. Three neural pathways are directly concerned with speech production: the pyramidal tract, the extrapyramidal tract, and the cerebellar motor paths. It is the combined control of these pathways upon nerves arising in the medulla and ending in the muscles of the tongue, lips, and larynx which permits the production of speech. See also Nervous system (vertebrate).
Six of the 12 cranial nerves send motor fibers to the muscles that are involved in the production of speech. These nerves are the trigeminal, facial, glossopharyngeal, vagus, spinal accessory, and the hypoglossal. See also Psychoacoustics; Psycholinguistics.
Development
In the early stages of speech development the child's vocalizations are quite random. The control and voluntary production of speech are dependent upon physical maturation and learning.
It is possible to describe the development of speech in five stages. In the first stage the child makes cries in response to stimuli. These responses are not voluntary but are part of the total bodily expression. The second stage begins between the sixth and seventh week. The child is now aware of the sounds he or she is making and appears to enjoy this activity. During the third stage the child begins to repeat sounds heard coming from himself or herself. This is the first time that the child begins to link speech production to hearing. During the ninth or tenth month the child enters the fourth stage and begins to imitate without comprehension the sounds that others make. The last stage begins between the twelfth and eighteenth month, with the child intentionally employing conventional sound patterns in a meaningful way. The exact time at which each stage may occur varies greatly from child to child.
Speech technology
Speech technology has been developing within three areas. One has to do with identifying a speaker by analyzing a speech sample. Since the idea is analogous to that of identifying an individual by fingerprint analysis, the technique has been called voice print. However, fingerprints have two important advantages over voice prints: (1) they are based on extensive data that have accumulated over several decades of use internationally, whereas no comparable reference exists for voice prints; and (2) it is much easier to alter the characteristics of speech than of fingerprints. Consequently, this area has remained largely dormant. Most courts in the United States, for instance, do not admit voice prints as legal evidence.
The two other areas of speech technology, synthesis and recognition, have seen explosive growth. In many applications where a limited repertoire of speech is required, computer-synthesized speech is used instead of human speakers. A common technology currently used in speech synthesis involves an inventory of pitch-synchronized, prestored human speech. These prestored patterns are selected according to the particular requirements of the application and recombined with some overlap into the desired sentence by computer, almost in real time. The quality of synthesized speech for English is remarkably good, though it is limited at present to neutral, emotionless speech. Many other languages are being synthesized with varying degrees of success.
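The recombination step described above, joining prestored segments with some overlap, can be illustrated with a simple cross-fade. This sketch assumes each stored unit is a plain list of waveform samples and blends a fixed number of samples at every boundary. The function name `concatenate_with_overlap` and the linear fade are illustrative assumptions; the sketch does not align the segments to pitch periods, which a pitch-synchronized system would do before overlapping them.

```python
def concatenate_with_overlap(units, overlap):
    """Join waveform units, cross-fading `overlap` samples at each boundary."""
    if overlap < 0:
        raise ValueError("overlap must be non-negative")
    output = []
    for unit in units:
        if len(unit) < overlap:
            raise ValueError("every unit must be at least `overlap` samples long")
        if not output or overlap == 0:
            # First unit, or no blending requested: append as-is.
            output.extend(unit)
            continue
        # Blend the tail of what has been built so far with the head of the next unit.
        start = len(output) - overlap
        for i in range(overlap):
            fade_in = (i + 1) / (overlap + 1)  # weight of the incoming unit
            output[start + i] = output[start + i] * (1.0 - fade_in) + unit[i] * fade_in
        output.extend(unit[overlap:])
    return output


if __name__ == "__main__":
    # Two toy "recordings": silence followed by a constant-level segment.
    first = [0.0, 0.0, 0.0, 0.0]
    second = [3.0, 3.0, 3.0, 3.0]
    print(concatenate_with_overlap([first, second], overlap=2))
```

Because the two fade weights always sum to one, a boundary between segments of equal level stays at that level, which avoids the audible click that an abrupt splice can produce.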
The recognition of speech by computer is much more difficult than synthesis. Instead of just reproducing the acoustic wave, the computer must understand something of the semantic message that the speech wave contains, in order to recognize pieces of the wave as words in the language. Humans do this easily because they have a great deal of background knowledge about the world, because they are helped by contextual clues not in the speech wave, and because they are extensively trained in the use of speech. Nonetheless, given various constraints, some of the existing systems do remarkably well. These constraints include (1) stable acoustic conditions in which speech is produced, (2) a speaker trained by the system, (3) limited inventory of utterances, and (4) short utterances. The research here is strongly driven by the marketplace, since all sorts of applications can be imagined where spoken commands are required or highly useful. See also Speech disorders.
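The constraints listed above (a known speaker, a small inventory, short utterances) are the conditions under which simple template matching becomes workable. The sketch below is one classic illustration, not a method named in the text: it stores one reference pattern per word and picks the closest one using dynamic time warping, which tolerates differences in speaking rate. The function names, the one-dimensional feature sequences, and the toy templates are all hypothetical; practical recognizers compare sequences of spectral feature vectors and use statistical models.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of numbers."""
    if not a or not b:
        raise ValueError("sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] holds the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[len(a)][len(b)]


def recognize(utterance, templates):
    """Return the label of the stored template closest to the utterance."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))


if __name__ == "__main__":
    # Toy one-dimensional "feature" tracks standing in for two stored words.
    templates = {"yes": [1, 3, 5, 3, 1], "no": [5, 4, 3, 2, 1]}
    slow_yes = [1, 1, 3, 5, 5, 3, 1]  # same shape as "yes", spoken more slowly
    print(recognize(slow_yes, templates))
```

Restricting the inventory keeps the number of comparisons small, and training on a single speaker keeps the stored templates close to what that speaker will actually produce.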