A statistical language model assigns a probability $P(w_1,\ldots,w_m)$ to a sequence of m words by means of a probability distribution.
Language modeling is used in many natural language processing applications such as speech recognition, machine translation, part-of-speech tagging, parsing and information retrieval.
In speech recognition and in data compression, such a model tries to capture the properties of a language, and to predict the next word in a speech sequence.
When used in information retrieval, a language model is associated with a document in a collection. With query Q as input, retrieved documents are ranked based on the probability that the document's language model would generate the terms of the query, $P(Q \mid M_d)$.
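The ranking described above can be sketched as follows. This is only an illustration under stated assumptions: each document model is taken to be a unigram model with maximum-likelihood estimates and no smoothing, query terms are treated as independent, and the function names (`unigram_model`, `query_likelihood`, `rank`) and the toy documents are hypothetical, not taken from the cited work.

```python
from collections import Counter

def unigram_model(text):
    """Maximum-likelihood unigram model: P(term | M_d) = count / document length."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: c / total for term, c in counts.items()}

def query_likelihood(query, model):
    """P(Q | M_d) as a product of per-term probabilities (independence assumed)."""
    prob = 1.0
    for term in query.lower().split():
        # A term absent from the document gets probability 0 (no smoothing here).
        prob *= model.get(term, 0.0)
    return prob

def rank(query, docs):
    """Return document ids sorted by decreasing P(Q | M_d)."""
    scores = {doc_id: query_likelihood(query, unigram_model(text))
              for doc_id, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy collection (made up for illustration).
docs = {
    "d1": "the red house on the hill",
    "d2": "a red car and a red bike",
    "d3": "the house is old",
}
ranking = rank("red house", docs)
```

In this toy collection only `d1` contains both query terms, so it is ranked first; the other two documents score zero because a single missing term zeroes the product, which is one reason smoothing matters in practice.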
Estimating the probability of sequences from a corpus is difficult because phrases or sentences can be arbitrarily long, so many sequences are never observed during training of the language model (the data sparseness problem, which leads to overfitting). For that reason these models are often approximated using smoothed n-gram models.
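To make the effect of smoothing concrete, here is a minimal sketch of add-one (Laplace) smoothing applied to pairs of adjacent words. Add-one smoothing is chosen here only because it is simple; the text does not say which smoothing method is meant, and the function name `add_one_bigram` and the toy token list are assumptions for illustration.

```python
from collections import Counter

def add_one_bigram(tokens):
    """Return a function prob(word, prev) using add-one smoothed pair counts."""
    vocab = set(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))   # counts of adjacent word pairs
    context_counts = Counter(tokens[:-1])            # how often each word has a successor

    def prob(word, prev):
        # Add 1 to every pair count, and the vocabulary size to the denominator,
        # so that the probabilities over the vocabulary still sum to 1.
        return (pair_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))

    return prob

tokens = "the red house and the blue house".split()
prob = add_one_bigram(tokens)
seen = prob("red", "the")      # "the red" occurs in the data
unseen = prob("house", "the")  # "the house" never occurs, yet gets non-zero probability
```

Without smoothing, the unobserved pair would receive probability zero, and so would every sentence containing it.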
N-gram models
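In an n-gram model, the probability $P(w_1,\ldots,w_m)$ of observing the sentence $w_1,\ldots,w_m$ is approximated as

$$P(w_1,\ldots,w_m) = \prod_{i=1}^{m} P(w_i \mid w_1,\ldots,w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)},\ldots,w_{i-1})$$
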
Here, it is assumed that the probability of observing the i-th word $w_i$ in the context history of the preceding i-1 words can be approximated by the probability of observing it in the shortened context history of the preceding n-1 words (a Markov property of order n-1).
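The conditional probability can be calculated from n-gram frequency counts:

$$P(w_i \mid w_{i-(n-1)},\ldots,w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1},w_i)}{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1})}$$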
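The count-based estimate can be sketched in a few lines. This is an unsmoothed illustration on a made-up token list; the helper names `ngram_counts` and `conditional_probability` are hypothetical, and the context is assumed to contain at least one word.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def conditional_probability(word, context, tokens):
    """Estimate P(word | context) as count(context + word) / count(context).

    `context` is a non-empty sequence of the n-1 preceding words.
    """
    context = tuple(context)
    n = len(context) + 1
    numerator = ngram_counts(tokens, n)[context + (word,)]
    denominator = ngram_counts(tokens, n - 1)[context]
    # An unseen context gives no evidence; return 0.0 rather than divide by zero.
    return numerator / denominator if denominator else 0.0

corpus = "i saw the red house and i saw the blue house".split()
p_red_given_the = conditional_probability("red", ["the"], corpus)
p_the_given_i_saw = conditional_probability("the", ["i", "saw"], corpus)
```

Here "the" occurs twice and "the red" once, so the first estimate is 1/2; "i saw" is always followed by "the", so the second is 1.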
The terms bigram language model and trigram language model denote n-gram language models with n=2 and n=3, respectively.
Example
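In a bigram (n=2) language model, the probability of the sentence "I saw the red house" is approximated as

$$P(\text{I}, \text{saw}, \text{the}, \text{red}, \text{house}) \approx P(\text{I} \mid \langle s \rangle)\, P(\text{saw} \mid \text{I})\, P(\text{the} \mid \text{saw})\, P(\text{red} \mid \text{the})\, P(\text{house} \mid \text{red})$$
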
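whereas in a trigram (n=3) language model, the approximation is

$$P(\text{I}, \text{saw}, \text{the}, \text{red}, \text{house}) \approx P(\text{I} \mid \langle s \rangle, \langle s \rangle)\, P(\text{saw} \mid \langle s \rangle, \text{I})\, P(\text{the} \mid \text{I}, \text{saw})\, P(\text{red} \mid \text{saw}, \text{the})\, P(\text{house} \mid \text{the}, \text{red})$$
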
Note that the context of the first n-1 n-grams is filled with start-of-sentence markers, typically denoted <s>.
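The example can be reproduced with a short sketch that pads each sentence with n-1 start-of-sentence markers and multiplies the count-based conditional probabilities. The two training sentences are invented for illustration, no smoothing is applied, and the function names (`pad`, `train`, `sentence_probability`) are hypothetical.

```python
from collections import Counter

def pad(sentence, n):
    """Prefix a tokenised sentence with n-1 start-of-sentence markers."""
    return ["<s>"] * (n - 1) + sentence

def train(sentences, n):
    """Collect n-gram counts and the counts of their (n-1)-word contexts."""
    ngrams, contexts = Counter(), Counter()
    for sentence in sentences:
        tokens = pad(sentence, n)
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - (n - 1):i])
            ngrams[context + (tokens[i],)] += 1
            contexts[context] += 1
    return ngrams, contexts

def sentence_probability(sentence, n, ngrams, contexts):
    """Product over all words of count(context + word) / count(context)."""
    tokens = pad(sentence, n)
    prob = 1.0
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - (n - 1):i])
        if contexts[context] == 0:
            return 0.0  # unseen context: the unsmoothed estimate is zero
        prob *= ngrams[context + (tokens[i],)] / contexts[context]
    return prob

# Toy training data (made up for illustration).
training = [
    "I saw the red house".split(),
    "I saw the blue house".split(),
]
sentence = "I saw the red house".split()

bigram_p = sentence_probability(sentence, 2, *train(training, 2))
trigram_p = sentence_probability(sentence, 3, *train(training, 3))
```

With this tiny training set every factor is 1 except the one for "red", which follows its context in one of two cases, so both models assign the sentence probability 1/2.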




