In probability theory, a categorical distribution is a discrete probability distribution whose sample space is the set of 1-of-N encoded[nb 1][1] random vectors x of dimension n having the property that
$$\sum_{i=1}^{n} x_i = 1, \qquad x_i \in \{0, 1\},$$
i.e. exactly one element is 1 and the others are zero. The probability mass function f is:
$$f(\mathbf{x} \mid \mathbf{p}) = \prod_{i=1}^{n} p_i^{x_i},$$
where $p_i$ represents the probability of seeing element i and $\sum_{i=1}^{n} p_i = 1$. This is the formulation adopted by Bishop [1][nb 2].
The distribution can be transformed into a special case of the multinomial distribution in which the parameter n of the multinomial distribution is fixed at 1.
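As an illustration of the 1-of-N formulation, the following minimal sketch evaluates the probability mass function and draws one encoded vector. The function names `categorical_pmf` and `sample_one_hot` and the example probabilities are assumptions chosen for this example; they do not come from Bishop or any particular library.

```python
import random


def categorical_pmf(x, p):
    """Evaluate f(x | p) = prod_i p_i ** x_i for a 1-of-N encoded vector x."""
    if len(x) != len(p):
        raise ValueError("x and p must have the same dimension")
    if sorted(x) != [0] * (len(x) - 1) + [1]:
        raise ValueError("x must contain exactly one 1 and zeros elsewhere")
    prob = 1.0
    for x_i, p_i in zip(x, p):
        prob *= p_i ** x_i  # factors with x_i == 0 contribute 1
    return prob


def sample_one_hot(p, rng=random):
    """Draw one 1-of-N encoded vector with category probabilities p."""
    index = rng.choices(range(len(p)), weights=p, k=1)[0]
    return [1 if i == index else 0 for i in range(len(p))]


p = [0.2, 0.5, 0.3]                         # hypothetical probabilities, sum to 1
print(categorical_pmf([0, 1, 0], p))        # 0.5
print(sample_one_hot(p, random.Random(0)))  # one of the three one-hot vectors
```

Because exactly one exponent equals 1, the product simply selects the probability of the category that occurred.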
Properties
- The distribution is completely given by the probabilities associated with each number k: $p_k = P(X = x_k)$, k = 1,...,n, where $\sum_{k=1}^{n} p_k = 1$. The possible probabilities are exactly the standard (n − 1)-dimensional simplex; for n = 2 this reduces to the possible probabilities of the Bernoulli distribution being the 1-simplex, $p_1 + p_2 = 1$, $0 \le p_1, p_2 \le 1$.
- The distribution is a special case of a "multivariate Bernoulli distribution"[2] in which exactly one of the n 0-1 variables takes the value one.
- Let X be the realisation from a categorical distribution. Define the random vector Y as composed of the elements
  $$Y_i = I(X = i), \qquad i = 1, \ldots, n,$$
  where I is the indicator function. Then Y has a distribution which is a special case of the multinomial distribution with its number-of-trials parameter equal to 1. The sum of m independent and identically distributed such random vectors Y, constructed from a categorical distribution with parameter $\mathbf{p}$, is multinomially distributed with parameters m and $\mathbf{p}$.
- The sufficient statistic from m independent observations is the set of counts (or, equivalently, proportions) of observations in each category, where the total number of trials (= m) is fixed.
- The conjugate prior is the Dirichlet distribution.
- The indicator that an observation falls in category k, $x_k$, is Bernoulli distributed with parameter $p_k$.
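The relationship between 1-of-N draws, category counts and the multinomial distribution can be checked by simulation. The sketch below is illustrative only: the helper names, the seed, the probabilities and the number of draws are all assumptions made for the example. It sums the encoded vectors element-wise, which yields the count vector (the sufficient statistic), and compares the resulting proportions with the parameter vector, consistent with the mean of the encoded vector being p.

```python
import random


def sample_one_hot(p, rng):
    """Draw one 1-of-N encoded vector with category probabilities p."""
    index = rng.choices(range(len(p)), weights=p, k=1)[0]
    return [1 if i == index else 0 for i in range(len(p))]


def category_counts(draws):
    """Sum one-hot vectors element-wise; the result is the vector of counts."""
    return [sum(column) for column in zip(*draws)]


rng = random.Random(42)   # fixed seed so the run is repeatable
p = [0.2, 0.5, 0.3]       # hypothetical category probabilities
num_draws = 10_000        # number of independent observations

draws = [sample_one_hot(p, rng) for _ in range(num_draws)]
counts = category_counts(draws)               # one multinomial(num_draws, p) draw
proportions = [c / num_draws for c in counts]

print(counts)        # counts always sum to num_draws
print(proportions)   # close to p for a large number of draws
```

Each column of the stacked draws is a sequence of Bernoulli indicators for one category, so each individual count is binomially distributed.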
Chi-square statistic
Definition
The goodness of fit between a categorical data set of size k and a theoretical categorical distribution with n categories and probabilities $p_i$ can be measured by the following chi-square statistic:
$$X^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$
where
- $X^2$ = the test statistic; this asymptotically approaches a $\chi^2$ distribution, hence is denoted by X, not χ.
- $O_i$ = the observed frequency of category i;
- $E_i = k p_i$ = the model frequency of the theoretical distribution.
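The sampling distribution of this statistic is asymptotically a chi-square distribution (with n − 1 degrees of freedom when the probabilities $p_i$ are fully specified in advance), hence the test based on it is called a "chi-square test".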
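The statistic is straightforward to compute from a vector of observed counts. The following is a minimal sketch, with a hypothetical function name and made-up counts, that assumes every model probability is strictly positive so that no expected frequency is zero.

```python
def chi_square_statistic(observed, p):
    """Pearson statistic: sum over categories of (O_i - E_i)**2 / E_i.

    observed: counts O_i per category; p: model probabilities (all > 0).
    """
    if len(observed) != len(p):
        raise ValueError("observed and p must have the same length")
    k = sum(observed)                   # size of the data set
    expected = [k * p_i for p_i in p]   # model frequencies E_i = k * p_i
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [18, 55, 27]   # hypothetical counts, k = 100
p = [0.2, 0.5, 0.3]       # hypothetical model probabilities
print(chi_square_statistic(observed, p))   # approximately 1.0
```

Here the expected frequencies are 20, 50 and 30, so the three terms are 4/20, 25/50 and 9/30.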
Descriptive statistics
By comparing the value of the statistic with the chi-square distribution, one may, in descriptive statistics terms, assess how well the model fits the data:
- if $X^2$ is small (one would rarely get such a small value from a $\chi^2$ distribution), then the frequencies are closer to the theoretical values than would be expected by random trials;
- if $X^2$ is large (one would rarely get such a large value from a $\chi^2$ distribution), then the frequencies are further from the theoretical values than would be expected by random trials.
For example, if one were modeling a coin flip by a fair coin and the coin alternated head/tail/head/tail, then $X^2$ would be zero (after an even number of trials), which is unusually low: the data are more uniform than one would predict.
Conversely, if the coin always came up heads, then $X^2$ would be unusually large: the data are more concentrated than the model would predict.
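The two coin examples can be worked through numerically. The sketch below assumes sequences of 20 flips written as strings of "H" and "T"; the sequence length and helper names are illustrative choices, not part of the original text.

```python
def chi_square_statistic(observed, p):
    """Pearson statistic: sum over categories of (O_i - E_i)**2 / E_i."""
    k = sum(observed)
    expected = [k * p_i for p_i in p]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def head_tail_counts(flips):
    """Count heads and tails in a string such as 'HTHT'."""
    return [flips.count("H"), flips.count("T")]


fair_coin = [0.5, 0.5]
alternating = "HT" * 10   # 20 flips: head/tail/head/tail ...
all_heads = "H" * 20      # 20 flips, always heads

print(chi_square_statistic(head_tail_counts(alternating), fair_coin))  # 0.0
print(chi_square_statistic(head_tail_counts(all_heads), fair_coin))    # 20.0
```

Note that the statistic depends only on the counts, so any sequence with ten heads and ten tails gives the same value of zero, whatever the order of the flips.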
Frequentist statistics
In frequentist statistics terms, this statistic is widely used for statistical hypothesis testing in Pearson's chi-square test: one tests the null hypothesis that the data are samples from a given categorical distribution, often a uniform distribution, by computing the p-value of the observed value of $X^2$ and then rejecting or not rejecting the null hypothesis at a given level of statistical significance.
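A p-value for the test can be sketched using only the standard library. The example below makes several assumptions that are not in the text above: the null probabilities are fully specified, so n − 1 degrees of freedom are used; the die counts are invented; and the upper tail of the chi-square distribution is computed with the recurrence Q(a + 1, z) = Q(a, z) + z^a e^(−z) / Γ(a + 1) for the regularized upper incomplete gamma function, which is adequate for small integer degrees of freedom but is not a general-purpose implementation.

```python
import math


def chi_square_statistic(observed, p):
    """Pearson statistic: sum over categories of (O_i - E_i)**2 / E_i."""
    k = sum(observed)
    expected = [k * p_i for p_i in p]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_sf(x, df):
    """Upper tail P(chi2 > x) for integer degrees of freedom df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)               # Q(1, z) = exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))    # Q(1/2, z) = erfc(sqrt(z))
    while a < df / 2.0:                        # step a up to df / 2
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Hypothetical data: 60 rolls of a die, tested against a uniform null model.
observed = [5, 8, 9, 8, 10, 20]
null_p = [1 / 6] * 6
statistic = chi_square_statistic(observed, null_p)
p_value = chi_square_sf(statistic, df=len(observed) - 1)

significance_level = 0.05
print(statistic, p_value)
print("reject null" if p_value < significance_level else "do not reject null")
```

In practice a statistics library would normally be used for the tail probability; the hand-written version is shown only to keep the example self-contained.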
Bayesian statistics
In Bayesian statistics, one may interpret the value of the chi-square statistic, given different models, as measuring the likelihood of these models being correct. However, in practice one instead uses the Dirichlet distribution as conjugate prior: one begins with a given distribution on the space of possible $p_i$ (often uniform), and then updates it based on observations. That is, if the observed frequency of each outcome is $O_i$ and one begins with a uniform prior Dir(1,...,1), then the posterior distribution is Dir($O_1 + 1$,...,$O_n + 1$). Its mode is the maximum likelihood estimate of the probabilities, namely the observed relative frequencies $O_i / k$, and by integrating the posterior density one may give credible regions for the distribution.
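The conjugate update amounts to adding the observed counts to the prior parameters. The sketch below assumes the uniform prior is written as Dir(1,...,1), so that the posterior parameters are the counts plus one; the function names and the counts are hypothetical.

```python
def dirichlet_posterior(prior_alpha, counts):
    """Conjugate update: add the observed counts to the prior parameters."""
    if len(prior_alpha) != len(counts):
        raise ValueError("prior_alpha and counts must have the same length")
    return [a + c for a, c in zip(prior_alpha, counts)]


def dirichlet_mean(alpha):
    """Mean of Dir(alpha): alpha_i / sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


def dirichlet_mode(alpha):
    """Mode of Dir(alpha); needs every alpha_i >= 1 and at least one > 1."""
    denominator = sum(alpha) - len(alpha)
    if any(a < 1 for a in alpha) or denominator <= 0:
        raise ValueError("mode formula needs alpha_i >= 1, not all equal to 1")
    return [(a - 1) / denominator for a in alpha]


counts = [18, 55, 27]       # hypothetical observed counts, k = 100
uniform_prior = [1, 1, 1]   # Dir(1, 1, 1) is uniform on the simplex
posterior = dirichlet_posterior(uniform_prior, counts)

print(posterior)                   # [19, 56, 28]
print(dirichlet_mode(posterior))   # [0.18, 0.55, 0.27]
print(dirichlet_mean(posterior))   # pulled slightly towards 1/3 each
```

With a uniform prior the posterior mode coincides with the observed relative frequencies, whereas the posterior mean is shrunk slightly towards the uniform distribution.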
References
- [1] Bishop, C. (2006) Pattern Recognition and Machine Learning. Springer.
- [2] Johnson, N.L., Kotz, S., Balakrishnan, N. (1997) Discrete Multivariate Distributions. Wiley. ISBN 0-471-12844-9 (p. 105)