Ordinary least squares

 
Okun’s law in macroeconomics states that in an economy the GDP growth depends linearly on the changes in the unemployment rate. Here the ordinary least squares method is used to construct the regression line describing this law.

In statistics and econometrics, ordinary least squares (OLS) is a technique for estimating the unknown parameters in a linear regression model. This method minimizes the sum of squared distances between the observed responses in a set of data and the fitted responses from the regression model. The linear least squares computational technique provides simple expressions for the estimated parameters in an OLS analysis, and hence for associated statistical values such as the standard errors of the parameters. OLS can mathematically be shown to be an optimal estimator in certain situations, and is closely related to the generalized least squares (GLS) estimation approach that is optimal in a broader set of situations. OLS can be derived as a maximum likelihood estimator under the assumption that the data are normally distributed; however, the method has good statistical properties for a much broader class of distributions.

Linear model

Suppose we observe a collection of data \{y_i,x_i\}_{i=1}^n on n statistical units, in which the data for each unit includes a scalar response y and a vector of predictors x. In a linear regression model, the conditional mean of the response given the predictors is modeled as a linear function of the predictors

E(y|x) = \beta^\prime x,

where β′x is the dot product between the vectors β and x. A concrete statistical model that gives this form of conditional expectation involves adding errors to the conditional mean

y_i = x'_i\beta + \varepsilon_i, \,

where ε_i is an unobserved scalar random variable with expected value zero given x_i, representing the errors in the data, and β is a p×1 vector of unknown parameters. Generally an "intercept" or "constant term" is included in the set of regressors, for example by setting x_{i1} = 1 for all n units.

It is convenient to write this model in matrix notation as

 y = X\beta + \varepsilon, \,

where y and ε are n×1 vectors, and X is an n×p matrix called the design matrix, whose i-th row is x'_i.
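The matrix form can be made concrete with a small sketch. The numbers below are hypothetical toy data, not taken from the article; the only point is the shape of X (n×p, with a leading column of ones for the constant term) and how a candidate β maps to implied errors.

```python
import numpy as np

# Hypothetical toy data: n = 5 units, one real predictor plus an intercept.
x_raw = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix X (n x p): the first column of ones is the constant term x_i1 = 1.
X = np.column_stack([np.ones_like(x_raw), x_raw])
n, p = X.shape

# For any candidate beta (length p), X @ beta is the modeled conditional mean
# and y - X @ beta is the implied error vector (length n).
beta = np.array([0.0, 2.0])  # arbitrary illustrative value, not an estimate
eps = y - X @ beta
print(X.shape, eps)
```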

Two interpretations of this model are possible. In one interpretation, the regressors x_i are treated as random variables, sampled together with the y_i's from some population, as in an observational study. This approach is more natural when studying the asymptotic behavior of the estimators. In the other interpretation, the regressors X are treated as known constants set by a design, and y is sampled conditionally on the values of X, as in an experiment. For practical purposes, however, the two interpretations are indistinguishable, as they lead to exactly the same formulas.

Assumptions

  • The response variables are uncorrelated: corr(yi, yj) = 0 whenever i ≠ j. Since regression analysis is always carried out while conditioning on the predictor variables X, there is no need for the predictor variables to be uncorrelated, or representative of a population. When OLS is applied to linear time series data, panel data, cluster samples, hierarchical data, repeated measures data, longitudinal data, and other data with dependencies, correlations between the responses will often exist. Extensions of the OLS approach, including GLS, can be used in these situations.
  • Identifiability: random variables x_i have second moments, and matrix Q_{xx}=\operatorname{E}[x_ix'_i] is non-singular. This assumption is equivalent to saying that regressors are linearly independent from each other. Note that if the x_i's do not have second moments (that is, matrix Q_{xx} is infinite), then regular OLS estimators will be not only still consistent, but even superefficient.
  • Errors have conditionally zero mean: E[ε_i|x_i] = 0.
  • Homoscedasticity: errors have second moments and E[ε_i²|x_i] = σ². Here σ² is an additional nuisance parameter of the model, which we will also be estimating. Without this assumption the OLS estimator for β is still consistent, but no longer efficient even within the class of linear unbiased estimators. Also, if errors do not have second moments (that is, if σ² = ∞) then the OLS method cannot be applied. Other robust estimation techniques should be used in such a case.

Estimation

Suppose b is a “candidate” value for an estimate of parameter β. Then the expression y_i − x'_ib will be called the residual of the i-th observation. The sum of squared residuals of the model is a measure of how well the line x'b fits the data. Ideally we want this sum to be as small as possible:

 S(b) = \sum_{i=1}^n (y_i - x'_ib)^2 = (y-Xb)'(y-Xb)

The value of b which minimizes this expression will be called the least squares estimator for β, and it is given by formula [proof]:

 \hat\beta = (X'X)^{-1}X'y \quad = \bigg(\frac{1}{n}\sum_{i=1}^n x_ix'_i\bigg)^{\!-1} \!\!\cdot\, \frac{1}{n}\sum_{i=1}^n x_iy_i .
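As a minimal sketch of the closed-form estimator, the following computes \hat\beta on simulated data in three ways: from the normal equations, from the averaged form on the right-hand side of the formula, and with NumPy's generic least-squares solver as a reference. The data-generating values (`beta_true`, noise scale, seed) are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, -2.0, 0.5])          # assumed, for simulation only
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Normal equations: solve (X'X) b = X'y instead of forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Averaged form: (1/n sum x_i x_i')^{-1} (1/n sum x_i y_i).
Qxx_hat = (X.T @ X) / n
Qxy_hat = (X.T @ y) / n
beta_hat_avg = np.linalg.solve(Qxx_hat, Qxy_hat)

# Reference solution from NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

def S(b):
    """Sum of squared residuals S(b) = (y - Xb)'(y - Xb)."""
    r = y - X @ b
    return r @ r

print(beta_hat, S(beta_hat))
```

Solving the linear system is generally preferred numerically to inverting X'X; when X is badly conditioned, a QR- or SVD-based routine such as `np.linalg.lstsq` is the safer choice.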

It is easy to see that this estimator is unbiased [proof], and also linear in the dependent variable y. The Gauss–Markov theorem states that, provided the errors are homoscedastic and uncorrelated, this estimator is best (in the sense of having the smallest variance) in the class of all linear unbiased estimators. This “efficiency” result must be treated with caution, however: unless the errors have a normal distribution, other non-linear estimators exist which outperform the OLS estimator.

After we have estimated β, the vector of least squares residuals will be equal to

 \hat\varepsilon = y - X\hat\beta = \big(I-X(X'X)^{-1}X'\big)y = My,

where I is the identity matrix, and M is the projection matrix onto the space orthogonal to X. Using these residuals we can construct the least squares estimator for σ²:

 \hat\sigma^2 = \tfrac{1}{n}\;\hat\varepsilon'\hat\varepsilon = \tfrac{1}{n}\;y'My = \tfrac{1}{n}S(\hat\beta).

Note that this estimator is biased [proof]: its expected value differs from σ² by a factor of (n–p)/n. However, this bias should not be considered a deficiency of the estimator: as the number of observations n becomes large, the bias tends to zero. The reason we use the factor 1/n instead of 1/(n–p), which would have made the estimator unbiased, is that this is how the formula emerges from the MLE approach under the assumption of normality of the error term. In that case the mean squared error of this estimator is smaller than the MSE of its bias-corrected alternative.
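The factor (n–p)/n can be traced in a few lines. This is a sketch of the standard argument, assuming homoscedastic, uncorrelated errors so that E[εε'|X] = σ²I, and using the facts that M is symmetric and idempotent with MX = 0 and tr(M) = n − p:

```latex
\begin{aligned}
\hat\varepsilon &= My = M(X\beta + \varepsilon) = M\varepsilon, \\
\operatorname{E}[\hat\varepsilon'\hat\varepsilon \mid X]
  &= \operatorname{E}[\varepsilon' M \varepsilon \mid X]
   = \operatorname{tr}\big(M\,\operatorname{E}[\varepsilon\varepsilon' \mid X]\big)
   = \sigma^2 \operatorname{tr}(M)
   = \sigma^2 (n-p), \\
\operatorname{E}[\hat\sigma^2 \mid X] &= \tfrac{n-p}{n}\,\sigma^2 .
\end{aligned}
```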

It is common to assess the goodness-of-fit of the OLS regression by comparing how much the initial variation in the sample can be reduced by regressing onto X. Pearson's coefficient of determination R² is defined as a ratio of “explained” variance to the “total” variance of the dependent variable y:

R^2 = \frac{y'LPy}{y'Ly} = 1 - \frac{y'My}{y'Ly} = 1 - \frac{\sum (y_i-x'_i\hat\beta)^2}{\sum (y_i-\overline{y})^2}

where L = I_n − ιι'/n, and ι is an n-vector of ones; this projection matrix is equivalent to regression on a constant: it simply subtracts the mean from a variable. Note that in order for R² to be meaningful, the regressors X must contain a constant. In that case R² will be a number between 0 and 1, with values close to 1 indicating a good degree of fit.
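The three expressions for R² can be checked against each other numerically. The sketch below builds P, M and L explicitly on simulated data (the coefficients and seed are assumptions for illustration); forming n×n matrices is done here only to mirror the formula and would be wasteful for large n.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # includes a constant
y = X @ np.array([2.0, 1.5]) + rng.normal(size=n)

I = np.eye(n)
P = X @ np.linalg.solve(X.T @ X, X.T)   # projection onto the column space of X
M = I - P                               # projection onto its orthogonal complement
iota = np.ones((n, 1))
L = I - (iota @ iota.T) / n             # demeaning matrix

r2_explained = (y @ L @ P @ y) / (y @ L @ y)
r2_residual = 1.0 - (y @ M @ y) / (y @ L @ y)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
r2_sums = 1.0 - np.sum((y - X @ beta_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(r2_explained, r2_residual, r2_sums)
```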

Alternative derivations

In the previous section the least squares estimator \hat\beta has been derived as a value that minimizes the sum of squared residuals of the model. However it is also possible to obtain the same estimator from other approaches.

  • The least squares estimator is an M-estimator of ρ-type for ρ(r) = ½r².
The geometry of the ordinary least squares estimate.
  • The ordinary least squares estimates are closely related to the geometry of the data. If we view y as a vector in an n-dimensional Euclidean space, and similarly view each predictor variable (i.e. each column of the design matrix X) as a vector in the same space, then the least squares regression is equivalent to orthogonal projection of y onto the subspace spanned by the predictor variables (or equivalently, onto the column space of X). The projected vector \hat{y}=Py is called the predictor of y (here P = X(X'X)^{-1}X' is the projection matrix), and the OLS estimator \hat{\beta} will be equal to the coefficients of the decomposition of \hat{y} along the basis formed by the columns of X. Such a decomposition is unique since we have assumed that the columns of X are linearly independent.
  • The OLS estimator also arises as a maximum likelihood estimator under the assumption that error terms are i.i.d. and normally distributed [proof]. This normality assumption has historical importance, as it provided the basis for the early work in linear regression analysis by Yule and Pearson. Regression analysis in this context is now known as classical linear regression. From the properties of the MLE, we can infer that the OLS estimator is asymptotically efficient if the assumptions of classical linear regression are satisfied.
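The geometric interpretation in the list above can be verified numerically. This sketch, on arbitrary simulated data (an assumption; any full-column-rank X would do), checks that P is a symmetric idempotent matrix, that the residual vector is orthogonal to every column of X, and that X\hat\beta reproduces the projection \hat{y} = Py.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)          # P = X (X'X)^{-1} X'
y_hat = P @ y                                  # orthogonal projection of y
resid = y - y_hat                              # component orthogonal to col(X)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # coordinates of y_hat in the basis X

print(np.abs(X.T @ resid).max())               # numerically zero
```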

Large sample properties

The least squares estimators are point estimates of the linear regression model parameters β. However generally we also want to know how close those estimates might be to the true values of parameters. In other words, we want to construct the interval estimates.

Since we haven't made any assumption about the distribution of error term εi, it is impossible to infer the distribution of the estimators \hat\beta and \hat\sigma^2. Nevertheless, we can apply the law of large numbers and central limit theorem to derive their asymptotic properties as sample size n goes to infinity. Now of course in practice sample size doesn't go anywhere, however it is customary to pretend that n is “large enough” so that the true distribution of OLS estimator is close to its asymptotic limit, and the former may be approximately replaced by the latter.

We can show that under the model assumptions, least squares estimator for β is consistent (that is \hat\beta converges in probability to β) and asymptotically normal [proof]:

\sqrt{n}(\hat\beta - \beta)\ \xrightarrow{d}\ \mathcal{N}\big(0,\;\sigma^2Q_{xx}^{-1}\big)

Using this asymptotic distribution, approximate confidence intervals for the j-th component of the vector β can be constructed as

\beta_j \in \bigg[\ \hat\beta_j \pm q^{\mathcal{N}(0,1)}_{1-\alpha/2}\!\sqrt{\tfrac{1}{n}\hat\sigma^2\big[\hat{Q}_{xx}^{-1}\big]_{jj}}\ \bigg]   at 1–α confidence level,

where q denotes the quantile function of the standard normal distribution, \hat{Q}_{xx} = \tfrac{1}{n}X'X is the sample analogue of Q_{xx}, and [·]_{jj} is the j-th diagonal element of a matrix.
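A minimal sketch of these asymptotic intervals, on simulated data (the true coefficients, noise scale and seed are assumptions). The normal quantile comes from the standard library's `statistics.NormalDist`, and \hat{Q}_{xx} is taken to be X'X/n.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / n                  # the 1/n estimator from the text
Qxx_hat = X.T @ X / n

# Asymptotic standard errors: sqrt( (1/n) * sigma2_hat * [Qxx_hat^{-1}]_jj ).
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(Qxx_hat)) / n)

alpha = 0.05
q = NormalDist().inv_cdf(1 - alpha / 2)         # standard normal quantile
lower = beta_hat - q * se
upper = beta_hat + q * se
for j in range(len(beta_hat)):
    print(f"beta_{j}: [{lower[j]:.4f}, {upper[j]:.4f}]")
```

These intervals rely on n being "large enough"; in small samples the approximation may be poor.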

Similarly, the least squares estimator for σ² is also consistent and asymptotically normal (provided that the fourth moment of ε_i exists) with limiting distribution

\sqrt{n}(\hat\sigma^2-\sigma^2)\ \xrightarrow{d}\ \mathcal{N}\big(0,\;\operatorname{E}[\varepsilon_i^4]-\sigma^4\big)

These asymptotic distributions can be used for prediction, testing hypotheses, constructing other estimators, etc. As an example consider the problem of prediction. Suppose x_0 is some point within the domain of distribution of the regressors, and we want to know what the response variable would have been at that point. The mean response is the quantity y_0 = x'_0β, whereas the predicted response is \hat{y}_0=x'_0\hat\beta. Clearly the predicted response is a random variable; its distribution can be derived from the distribution of \hat\beta:

\sqrt{n}(\hat{y}_0 - y_0)\ \xrightarrow{d}\ \mathcal{N}\big(0,\;\sigma^2x'_0Q_{xx}^{-1}x_0\big),

which allows us to construct confidence intervals for mean response y0:

y_0\in\bigg[\ x_0'\hat\beta \pm q^{\mathcal{N}(0,1)}_{1-\alpha/2}\!\sqrt{\tfrac{1}{n}\hat\sigma^2x'_0\hat{Q}_{xx}^{-1}x_0}\ \bigg]   at 1–α confidence level.
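The interval for the mean response can be sketched the same way. The helper `mean_response_ci` below is a hypothetical name introduced for illustration, and the simulated data are assumptions. The example also shows that the interval widens as x_0 moves away from the bulk of the observed regressors.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(4)
n = 400
x = rng.uniform(0.0, 10.0, size=n)
X = np.column_stack([np.ones(n), x])
y = 3.0 + 0.7 * x + rng.normal(scale=1.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / n
Qxx_inv = np.linalg.inv(X.T @ X / n)

def mean_response_ci(x0, alpha=0.05):
    """Asymptotic (1 - alpha) interval for the mean response x0' beta."""
    q = NormalDist().inv_cdf(1 - alpha / 2)
    y0_hat = x0 @ beta_hat
    half_width = q * np.sqrt(sigma2_hat * (x0 @ Qxx_inv @ x0) / n)
    return y0_hat - half_width, y0_hat + half_width

lo_mid, hi_mid = mean_response_ci(np.array([1.0, 5.0]))    # near the centre of the data
lo_far, hi_far = mean_response_ci(np.array([1.0, 50.0]))   # far outside the observed range
print(hi_mid - lo_mid, hi_far - lo_far)
```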

Regression analysis

All the preceding discussion served to establish firm theoretical grounds for understanding the principles behind OLS regression, its uses and limitations. The formulas we derived can be used to compute the quantities of interest directly; in practice, however, this will rarely be needed, since most modern statistical and mathematical software packages have built-in tools for OLS regression. As such, the statistician's task becomes to analyze, correctly interpret, and draw the necessary inference from regression output.

In this section we will try to explore which elements are most commonly seen in the output of OLS programs. It must be noted that due to the legacy of “normal errors assumption” which used to pervade the statistical science some 20–30 years ago, certain parts of standard regression output are either meaningless or of dubious utility.

Scatterplot of the data; at first glance the relationship is quadratic although close to linear.

To illustrate the various elements of regression, we consider an example. The following data set gives average heights and weights for American women aged 30–39 (source: The World Almanac and Book of Facts, 1975).

 Height (m):  1.47 1.50 1.52 1.55 1.57 1.60 1.63 1.65 1.68 1.70 1.73 1.75 1.78 1.80 1.83
 Weight (kg):  52.21 53.12 54.48 55.84 57.20 58.57 59.93 61.29 63.11 64.47 66.28 68.10 69.92 72.19 74.46

The first step before running the regression is to look at a scatterplot of the data. Usually such a plot will suggest whether there is any relationship between the dependent variable and the regressors, and if so of what kind. It might also reveal potential problems with the data: presence of outliers, heteroscedasticity, and others. The picture suggests that the relationship is pronounced and can probably be expressed as a quadratic function. Note that the OLS technique can handle such “non-linear” relationships perfectly well: all we need to do is introduce another regressor, HEIGHT², and the regression becomes linear in the parameters:

w_i = \beta_1 + \beta_2 h_i + \beta_3 h_i^2 + \varepsilon_i.
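For readers who prefer to run the fit directly, here is a minimal NumPy sketch using the height and weight figures listed above. The design matrix has columns 1, h and h², so the model is linear in the parameters even though it is quadratic in height. The estimates can be compared with the package output reported in this section; small differences are possible depending on the solver and on rounding of the data.

```python
import numpy as np

height = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                   1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weight = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                   63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

# Regressors: constant, HEIGHT, HEIGHT^2.
X = np.column_stack([np.ones_like(height), height, height ** 2])

# The columns are strongly collinear, so use an SVD-based solver.
beta_hat, *_ = np.linalg.lstsq(X, weight, rcond=None)

resid = weight - X @ beta_hat
rss = resid @ resid                              # residual sum of squares
tss = np.sum((weight - weight.mean()) ** 2)      # total sum of squares
r2 = 1.0 - rss / tss

print("coefficients:", beta_hat)
print("RSS:", rss, "TSS:", tss, "R^2:", r2)
```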

Now open your favorite statistical package and run the regression. The output will look more or less like this:

Fitted regression
Method: Least Squares
Dependent variable: WEIGHT
Included observations: 15

Variable      Coefficient    Std. Error    t-statistic    p-value

C                128.8128       16.3083         7.8986     0.0000
HEIGHT          –143.1620       19.8332        –7.2183     0.0000
HEIGHT²           61.9603        6.0084        10.3122     0.0000

R²                     0.9989     S.E. of regression      0.2516
Adjusted R²            0.9987     Model sum-of-sq       692.61
Log-likelihood         1.0890     Residual sum-of-sq      0.7595
Durbin–Watson stat.    2.1013     Total sum-of-sq       693.37
Akaike criterion       0.2548     F-statistic          5471.2
Schwarz criterion      0.3964     p-value (F-stat)        0.0000

In this table:

  • Coefficient column gives the least squares estimates of parameters βj
  • Std. errors column shows standard errors of each coefficient: \hat\sigma_j=\big(\tfrac{1}{n}\hat\sigma^2[\hat{Q}_{xx}^{-1}]_{jj}\big)^{1/2}
  • t-statistic and p-value columns test, for each coefficient separately, whether it might be equal to zero. The t-statistic is calculated simply as t=\hat\beta_j/\hat\sigma_j. Under the normality assumption it has a Student's t distribution (hence the name), while without such an assumption t is asymptotically normal. Large absolute values of this statistic indicate that the hypothesis can be rejected and that the corresponding coefficient is not zero. The second column, p-value, lets you tell at a glance how large the t is. If the p-value is small (for example less than 0.05) then the corresponding coefficient is significantly different from zero; otherwise it is not significant and may potentially even be redundant.
  • R-squared is the coefficient of determination indicating goodness-of-fit of the regression. This statistic will be equal to one if the fit is perfect, and to zero when the regressors X have no explanatory power whatsoever. One problem with this quantity is that as you add more and more regressors, even irrelevant ones, R² never decreases.
  • Adjusted R-squared is a slightly modified version of R², designed to penalize for the excess number of regressors which do not add to the explanatory power of the regression. This statistic is always smaller than R², can decrease as you add new regressors, and can even be negative for poorly fitting models:
\overline{R}^2 = 1 - \tfrac{n-1}{n-p}(1-R^2)
  • Log-likelihood is calculated under the assumption that errors follow normal distribution. Even though the assumption is not very reasonable, this statistic may still find its use in conducting LR tests.
  • Durbin–Watson statistic tests whether there is any evidence of serial correlation between the residuals. As a rule of thumb, a value noticeably smaller than 2 is evidence of positive correlation. Note that there exist better tests than D-W for serial correlation / serial dependence.
  • Akaike information criterion and Schwarz criterion are both used for model selection. Generally, when comparing two alternative models, a smaller value of one of these criteria indicates a better model. However, such a comparison should be made with caution, since neither of these criteria has a firm theoretical foundation.
  • Standard error of regression is an estimate of σ, standard error of the error term.
  • Total sum of squares, model sum of squares, and residual sum of squares tell us how much of the initial variation in the sample was explained by the regression.
  • F-statistic tests the hypothesis that all coefficients (except the intercept) are equal to zero. This statistic has an F(p–1,n–p) distribution under the null hypothesis and the normality assumption, and its p-value is the probability, computed under the null hypothesis, of obtaining an F-statistic at least as large as the one observed. Note that when errors are not normal this statistic becomes invalid, and other tests such as the Wald test or LR test should be used.
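Several of the quantities described in the list above can be reproduced by hand. The sketch below uses the height and weight data from this section and makes two assumptions that should be kept in mind: the error variance is estimated with the degrees-of-freedom correction, s² = RSS/(n–p), which is what most packages report as the basis of the "S.E. of regression", and the p-values use the asymptotic normal approximation rather than the Student's t distribution, so they may differ slightly from package output.

```python
import numpy as np
from statistics import NormalDist

height = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                   1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weight = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                   63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])
X = np.column_stack([np.ones_like(height), height, height ** 2])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ weight
resid = weight - X @ beta_hat
rss = resid @ resid
tss = np.sum((weight - weight.mean()) ** 2)

s2 = rss / (n - p)                         # assumption: d.o.f.-corrected variance
se = np.sqrt(s2 * np.diag(XtX_inv))        # standard errors of the coefficients
t_stat = beta_hat / se
# Two-sided p-values from the normal approximation (asymptotic, not Student's t).
p_norm = np.array([2 * (1 - NormalDist().cdf(abs(t))) for t in t_stat])

r2 = 1 - rss / tss
adj_r2 = 1 - (n - 1) / (n - p) * (1 - r2)
f_stat = ((tss - rss) / (p - 1)) / (rss / (n - p))

print("se:", se)
print("t:", t_stat)
print("R2:", r2, "adj R2:", adj_r2, "F:", f_stat, "S.E. of regression:", np.sqrt(s2))
```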
Residuals plot

After you have done the estimation, it is important to check the model's assumptions. There should be no noticeable pattern in any of these plots:

  • Residuals against the explanatory variables in the model, as illustrated on the right. The residuals should have no relation to these variables (look for possible non-linear relations) and the spread of the residuals should be the same over the whole range.
  • Residuals against explanatory variables not in the model. Any relation of the residuals to these variables would suggest considering these variables for inclusion in the model.
  • Residuals against the fitted values, \hat{y}.
  • Residuals against the preceding residual.
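The plots listed above are best inspected visually, but a few numerical summaries can serve as a rough first check. This sketch, again on the height and weight data from this section, computes the residuals, their correlation with the preceding residual, and the Durbin–Watson statistic using its usual definition, the sum of squared successive differences divided by the sum of squared residuals. It is an illustration, not a substitute for the plots.

```python
import numpy as np

height = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                   1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weight = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                   63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])
X = np.column_stack([np.ones_like(height), height, height ** 2])

beta_hat, *_ = np.linalg.lstsq(X, weight, rcond=None)
fitted = X @ beta_hat
resid = weight - fitted

# With an intercept in the model, residuals sum to zero and are uncorrelated
# with the fitted values by construction, so a plot (not a correlation) is
# what can reveal curvature or changing spread.
corr_fitted = np.corrcoef(fitted, resid)[0, 1]

# Residual against the preceding residual (observations ordered by height).
corr_lag1 = np.corrcoef(resid[1:], resid[:-1])[0, 1]

# Durbin-Watson statistic: values near 2 suggest no first-order serial correlation.
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

print("lag-1 correlation:", corr_lag1, "Durbin-Watson:", dw)
```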

Lastly, you should try to explore alternative models for your data set. Try adding / removing some of the regressors. Consider adding non-linear transforms of one of the regressors: powers or logarithms. Generally you want to find a reasonable compromise between model's parsimony and overall goodness-of-fit.


Copyrights:

Wikipedia. This article is licensed under the Creative Commons Attribution/Share-Alike License. It uses material from the Wikipedia article "Ordinary least squares".