## Transcribed Text

Please read each question carefully and answer it completely.
Please copy/paste software output to the end of the assignment, in an Appendix, and use the information from your output to answer these questions. (I will only critically look at your output if there is an issue with your answers to the estimation questions.) Note that I have highlighted STATA commands on this assignment to help you quickly identify them.
#1. “Partialling Out” in multiple regression. You are given the following regression model (with two X-regressors) and data set: question1.dta.
(a) Estimate this regression model:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$
Write out your estimated model and interpret the two slope coefficients (be sure to include the notion of ceteris paribus in your explanation).
(b) Recall from the textbook and this week's video that you can also find the estimate $\hat{\beta}_1$ using the following building blocks:
(i) Estimate the model:
$x_1 = \gamma_0 + \gamma_1 x_2 + \tau$
Write out your estimated model.
Note that to run the next regression you will need to save your estimated residuals from this regression.
To do this in STATA through the menus, you will need to save these residuals right after you run the model. Using the menus, select: Statistics -> Postestimation -> Predictions -> Predictions and their SEs … This series of selections opens a menu where you name the variable in which you want to save the residuals (I call mine resid2) and then select “Residuals (equation level scores).” You should then see a column with the name you have chosen as part of your data set.
This is a case where the command is much more direct: `predict resid2, residuals`. The command is “predict” followed by the name of the variable where you want to save the residuals, followed by a comma and the word “residuals.” Note again that this command must be entered right after you run the regression.
(ii) Estimate the model:
$y = \beta_0 + \beta_1 \hat{\tau} + u$
Write out your estimated model.
(c) Show/explain how these results capture the original estimator $\hat{\beta}_1$. Use specific values from your output.
(d) Explain in words how you have demonstrated the notion of “partialling out” with this example.
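The sequence of regressions in parts (a) and (b) can be sketched as follows. This is only an illustration under stated assumptions: it uses Stata's built-in auto data rather than question1.dta, with price, weight, and mpg standing in for y, x_1, and x_2, and the scalar names b_full and b_partial are made up for this example.

```stata
* Illustration only: built-in auto data, not question1.dta
sysuse auto, clear

* Full model: price on weight and mpg; keep the slope on weight
regress price weight mpg
scalar b_full = _b[weight]

* Step (i): regress weight on mpg and save the residuals
regress weight mpg
predict resid2, residuals

* Step (ii): regress price on the saved residuals
regress price resid2
scalar b_partial = _b[resid2]

* The two slope estimates should agree
display "full model: " b_full "   partialled out: " b_partial
```

The slope estimates match, but the standard errors from the step (ii) regression generally differ from those in the full model, so only the coefficient should be compared.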
#2. Quadratic Models. As part of a research project on meat consumption and sustainability, a co-author and I hypothesized that meat consumption could have a “Kuznets relationship.” (see, for example: As part of this research we estimated the following cross-section model of 150 countries (for this question you can assume that the model is specified correctly, i.e. there are no important omitted X-regressors):
$\widehat{\text{meat}} = 16.712 + 4.205\,(\text{income}) - 0.052\,(\text{income})^2$
where “meat” is a country’s per capita meat consumption (in kgs) and “income” is a country’s per capita income (in $1,000s).
(a) Using this estimated model, and taking the first and second derivatives with respect to income, discuss the features of this quadratic: is it concave or convex? What is the “peak” meat consumption in this estimated model? To what income value does this maximum value correspond? Discuss how these results suggest or refute the proposed “Kuznets” shape.
(b) Using this model what is the predicted per capita meat consumption for a country with a per capita income of $25,475?
(c) If income were reported in $1s rather than $1,000s, how would the values of the estimators in this model change? Explain. Show that your results are correct by using your values for the estimators to predict meat consumption for a country with a per capita income of $25,475. Why would income be measured in $1,000s rather than $1s in this case? Note: you do not have to give rigorous mathematical proofs for this section; you can draw on the discussion of scaling and units of measurement on pp. 36-37.
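As a general reminder for the calculus in part (a) and the rescaling in part (c), the following sketch works with any fitted quadratic. The symbols $b_0$, $b_1$, $b_2$, $x$, and the scale factor $c$ are notation introduced here for illustration, not values from the assignment.

$$
\hat{y} = b_0 + b_1 x + b_2 x^2, \qquad
\frac{d\hat{y}}{dx} = b_1 + 2 b_2 x, \qquad
\frac{d^2\hat{y}}{dx^2} = 2 b_2
$$

Setting the first derivative to zero gives the turning point, which is a maximum when $b_2 < 0$ (concave) and a minimum when $b_2 > 0$ (convex):

$$
x^{*} = -\frac{b_1}{2 b_2}
$$

If the regressor is rescaled to $\tilde{x} = c\,x$, the same fitted values are obtained from

$$
\hat{y} = b_0 + \frac{b_1}{c}\,\tilde{x} + \frac{b_2}{c^{2}}\,\tilde{x}^{2}
$$

so the intercept is unchanged, the linear coefficient is divided by $c$, and the quadratic coefficient is divided by $c^2$.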
#3. Multicollinearity and the “Variance Inflation Factor.” On p. 98 in your textbook you are introduced to the idea of the “variance inflation factor,” which can be used to consider the presence of multicollinearity within a multiple regression model. We will return to the issue of multicollinearity when we begin to discuss inference (hypothesis testing) within the regression model, but for now you will work with the concept of the VIF.
You are given the data set: business.dta which comes from the “World Bank Doing Business” project. Note that the data you will be using is from 2009 and the variables are defined:
(a) Estimate the following model and write out your estimated model:
$\text{cost} = \beta_0 + \beta_1\,\text{documents} + \beta_2\,\text{days} + \beta_3\,\text{hiring} + \beta_4\,\text{firing} + u$
(b) What is the value for VIF that the textbook suggests is often used to suggest whether multicollinearity is a problem in the regression? Interpret what this high value for the VIF means.
(c) For each of the X-regressors in the model please fill in the following table. You can use the STATA menus/command.
To do this in STATA, you will again need to compute these after you run the regression (STATA will use the X-regressors from that regression).
Using the menus, select: Statistics -> Linear Models -> Regression Diagnostics -> Specification tests, etc. You will then see a menu that allows you to choose VIF. (Do not select the “uncentered” option.)
This is another case where the command may be easier: `estat vif`. If you use this exact command after running a regression, you will get a table of VIFs.
| Variable | VIF |
| --- | --- |
| documents | |
| days | |
| hiring | |
| firing | |
(d) Verify the VIF for “documents” by running the required regression to compute this on your own. Write out the estimated model after you have run the regression:
Report the important value from the output and show/verify the computation of the VIF.
(e) Another way to assess multicollinearity is to examine the correlations among the X-regressors themselves. In Stata, create a correlation matrix of the X variables. (You can find the command through Statistics -> Summaries … -> Summary, or use the command `correlate documents days hiring firing`.)
Are these results what you would expect based on the VIFs? Explain, please use specific correlations and VIFs in your explanation where possible. Which of these, VIFs or correlations, would you consider more useful in “diagnosing” multicollinearity? Explain. (Hint: consider the important idea that correlation, by definition, captures a bivariate relationship in your answer.)
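For parts (c) through (e), recall that the VIF for regressor $x_j$ is $VIF_j = 1/(1 - R_j^2)$, where $R_j^2$ is the R-squared from regressing $x_j$ on the other X-regressors. A minimal sketch of that check follows, assuming Stata's built-in auto data as a stand-in for business.dta (price, weight, mpg, and length are placeholder variables, not the assignment's).

```stata
* Illustration only: built-in auto data, not business.dta
sysuse auto, clear

* Main regression, then the VIF table reported by Stata
regress price weight mpg length
estat vif

* Auxiliary regression: weight on the other X-regressors
regress weight mpg length

* VIF computed by hand from the auxiliary R-squared
display "VIF(weight) = " 1/(1 - e(r2))

* Pairwise correlations among the X-regressors
correlate weight mpg length
```

The hand-computed value should match the entry for weight in the `estat vif` table.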
Software Output:
Please read each question carefully and answer it completely.
Comment on notation: your textbook refers to the natural log of variables as log(X), since most software programs use ln(X) to denote the natural log, I prefer to use ln(X).
Please copy/paste software output to the end of the assignment, in an Appendix, and use the information from your output to answer these questions. (I will only critically look at your output if there is an issue with your answers to the estimation questions.) Please make sure you construct a formal “four-step” hypothesis where required.
Note: to find exact critical values and p-values, consider using distribution commands in EXCEL (rather than just the tables in the textbook):
=t.dist( ) and =t.inv() and =f.dist() and =f.inv()
Please make sure you include both a critical value and p-value approach in all your hypothesis tests for this assignment. Consider using the formatted table I have provided to help structure your hypothesis tests. Note that the table does make use of “Equations” in Word.
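If you would rather stay in Stata than switch to Excel, the same critical values and p-values can be obtained with Stata's distribution functions. The sketch below uses made-up degrees of freedom and test statistics purely for illustration; none of these numbers come from the assignment.

```stata
* Hypothetical values for illustration only
scalar df = 100
scalar tstat = 2.1
scalar fstat = 3.5

* t distribution: two-sided 5% critical value and two-sided p-value
display "t critical value: " invttail(df, 0.025)
display "t p-value:        " 2*ttail(df, abs(tstat))

* F distribution with 2 numerator df: 5% critical value and p-value
display "F critical value: " invFtail(2, df, 0.05)
display "F p-value:        " Ftail(2, df, fstat)
```

Note that `invttail()` and `invFtail()` take an upper-tail probability, so a two-sided t-test at α = 0.05 uses 0.025.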
#1. Please look at the information for Question #9 [p. 143]. (Question begins, “In Problem 3 in Chapter 3, we estimated the equation [sleep-hat] …”) Use information from that question to answer (i.e. you do NOT have to answer the questions as written in the textbook):
(a) Conduct a four-step hypothesis test for the significance of the coefficient on age. Use α = 0.05.
(b) Construct a 95% confidence interval for the coefficient on age. Discuss how this interval relates to the hypothesis test you conducted in (a).
(c) Conduct a four-step hypothesis test for the problem in textbook question (ii). (“Dropping educ and age from the equation gives …”)
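As a reminder of the building blocks for these tests, the standard forms are sketched below. The notation is introduced here for illustration: $a_j$ is the hypothesized value, $c$ is the two-sided critical value, $q$ is the number of restrictions, $k$ is the number of regressors in the unrestricted model, and the subscripts $ur$ and $r$ denote the unrestricted and restricted models.

$$
t = \frac{\hat{\beta}_j - a_j}{\operatorname{se}(\hat{\beta}_j)}, \qquad
\hat{\beta}_j \pm c \cdot \operatorname{se}(\hat{\beta}_j), \quad c = t_{n-k-1,\;\alpha/2}
$$

$$
F = \frac{(R^2_{ur} - R^2_{r})/q}{(1 - R^2_{ur})/(n - k - 1)}
$$

A two-sided test at level $\alpha$ rejects $H_0\colon \beta_j = a_j$ exactly when $a_j$ lies outside the $(1-\alpha)$ confidence interval.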
#2. For this question you will use the data set: hprice1.dta. Please look at the information in Question C2 [p. 98] and Question C3 [p. 146]. (Both of these textbook questions are based on the data set HPRICE.) Use this information to answer the following questions:
(a) Write out the estimated equation for the model shown in C2 [p. 98].
(b) Please answer textbook question (vi) for C2 [p. 98]. Note that you will need information provided in (v).
(c) Please complete textbook questions (i) – (iii) for C3 [p. 146]. Note that you will need to create ln(price) in your data set to complete this question.
#3. For this question you will use the data set: discrim.dta. Please look at the information in Question C9 [p. 147]. Use this information to answer the following questions, note you will need to add logs (ln) for some variables to complete these questions:
(a) Write out the estimated equation for the model shown in C9 (i) [p. 147].
(b) Consider the coefficient on ln(income) in the model from C9 (i) [p. 147]. Comment on the sign, significance, and magnitude of this coefficient. Conduct a formal four-step hypothesis test of the null that this coefficient is ≤ 0.10.
(c) Write out your estimated model for the model described in (iii) C9 [p. 147].
(d) Summarize how the magnitude and significance of the coefficients have changed from the model you estimated in (a) to the model in (c). Explain why these changes have occurred using ideas from last chapter’s discussion on model selection (excluding/including regressors). Use specific quantitative evidence in your discussion where possible.
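A one-sided test of a null such as "coefficient ≤ 0.10" cannot be read directly from the regression table, because the reported t-statistic tests a null of zero. One way to compute it by hand is sketched below, assuming Stata's built-in auto data in place of discrim.dta; lnprice and lnweight are placeholder variables created for this example.

```stata
* Illustration only: built-in auto data, not discrim.dta
sysuse auto, clear
generate lnprice  = ln(price)
generate lnweight = ln(weight)

regress lnprice lnweight mpg

* H0: beta <= 0.10   versus   H1: beta > 0.10
scalar tstat = (_b[lnweight] - 0.10)/_se[lnweight]

* Upper-tail critical value and p-value with the residual df
display "t statistic:       " tstat
display "5% critical value: " invttail(e(df_r), 0.05)
display "one-sided p-value: " ttail(e(df_r), tstat)
```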
Software Output:
Please read each question carefully and answer it completely. Unless you are given a different value, please use α = 0.05 in your hypothesis tests.
Please copy/paste software output to the end of the assignment, in an Appendix, and use the information from your output to answer these questions.
IMPORTANT: I will only critically look at your Appendix output for actual grading purposes if there is an issue with your answers to the estimation questions, i.e. ONLY if I feel I may be able to give you some partial credit for what otherwise seem to be serious mistakes in STATA. I will NOT look at your Appendix output if your answer is incomplete so PLEASE do not expect me to hunt for answers in your Appendix output.
#1. F-test and LM test. Based on the data set [also used in HW #4] business.dta, please estimate the model
$\text{cost} = \beta_0 + \beta_1\,\text{documents} + \beta_2\,\text{days} + \beta_3\,\text{hiring} + \beta_4\,\text{firing} + u$
(i) Report your estimated model and explain which coefficients are/are not significant for α = 0.05. (You do not have to conduct a formal four-step hypothesis for these t-tests.)
(ii) Conduct a four-step F-test for the joint significance of the coefficients on hiring and firing, i.e. the null that both of these population coefficients are equal to 0.
(iii) Conduct a four-step LM test for the joint significance of the coefficients on hiring and firing, i.e. the null that both of these population coefficients are equal to 0.
(iv) Briefly compare/contrast the tests and the hypothesis testing results from (ii) and (iii).
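The mechanics of parts (ii) and (iii) can be sketched as follows. This is an illustration under stated assumptions: it uses Stata's built-in auto data rather than business.dta, with length and turn playing the role of the two regressors being tested, and u_tilde and LM are names made up for this example. The LM statistic is computed as $n \cdot R^2_u$ from the auxiliary regression and compared with a chi-square distribution with $q = 2$ degrees of freedom.

```stata
* Illustration only: built-in auto data, not business.dta
sysuse auto, clear

* F-test: estimate the unrestricted model, then test the two exclusions
regress price weight mpg length turn
test length turn

* LM test, step 1: estimate the restricted model and save its residuals
regress price weight mpg
predict u_tilde, residuals

* LM test, step 2: regress the residuals on ALL of the X-regressors
regress u_tilde weight mpg length turn

* LM test, step 3: LM = n * R-squared, chi-square with q = 2 df
scalar LM = e(N)*e(r2)
display "LM statistic:      " LM
display "5% critical value: " invchi2tail(2, 0.05)
display "p-value:           " chi2tail(2, LM)
```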
#2. A key theme for Chapters 4 and 5 is the issue of the normality of the population error terms. Note that although your textbook doesn’t discuss this notion in depth, researchers concerned about this issue may choose to test the normality of the estimated residuals. (Again, remember that this test on the sample-generated residuals cannot prove or disprove the normality of the population error terms; however, the results of this test may be used as evidence for potential concerns.)
(i) Consider the following quote from that entry: “Monte Carlo simulation has found that Shapiro–Wilk has the best power for a given significance.” Explain what this quote means in your own words; please use “Type I” and “Type II” errors in your discussion. Why is this property desirable for a hypothesis test?
(ii) Save (or re-estimate to get) the residuals from the model you estimated in #1 (i) above, and name this variable resid1. (Remember that you generated a column of residuals in an earlier homework assignment.) Using the Shapiro-Wilk test in STATA, conduct a four-step hypothesis test on these residuals; note that you can find details of the test through a link in the Wikipedia discussion. [You can run this in STATA using the command `swilk resid1`, or find it with the menus Statistics -> Summaries, tables, and tests -> Distributional tests …]
(iii) Do the results of this test disturb you? Why or why not?
(iv) Consider the specific issue of normality of the population error terms. Please fill in the following table with a summary of the concerns/results/issues related to normality under different conditions for “n.” Note that I have given you a table as format because I want your discussion in each block to be as concise as possible.
| | “small n” | “large n” |
| --- | --- | --- |
| Estimation | | |
| Inference | | |
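The steps in part (ii) can be sketched as follows, again assuming Stata's built-in auto data as a stand-in for business.dta (price, weight, and mpg are placeholders). The null hypothesis of the Shapiro-Wilk test is that the variable is normally distributed.

```stata
* Illustration only: built-in auto data, not business.dta
sysuse auto, clear

* Estimate the model and save the residuals right away
regress price weight mpg
predict resid1, residuals

* Shapiro-Wilk test; H0: resid1 is normally distributed
swilk resid1

* The test statistic and p-value are also stored for later use
display "W = " r(W) "   p-value = " r(p)
```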
#3. Consider the model described on p. 199, C3. Using that model, answer the following questions:
(i) Report the estimated model. Note that you will need to create both a variable for ln(wage) and the interaction term in order to estimate this model. Explain how to interpret the coefficient on the interaction term.
(ii) Conduct a four-step hypothesis test for the significance of the interaction term.
(iii) Please answer question C3 (iv) from the textbook [p. 221].
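Creating the logged dependent variable and the interaction term might look like the sketch below. It assumes Stata's built-in auto data rather than the textbook data set, and lnprice and wt_mpg are placeholder names invented for this example.

```stata
* Illustration only: built-in auto data, not the textbook data set
sysuse auto, clear

* Logged dependent variable and the interaction of two regressors
generate lnprice = ln(price)
generate wt_mpg  = weight*mpg

* Model with both main effects and the interaction
regress lnprice weight mpg wt_mpg

* Test of the interaction coefficient (equivalent to its t-test)
test wt_mpg
```

With an interaction in the model, the partial effect of weight on lnprice is `_b[weight] + _b[wt_mpg]*mpg`, so it depends on the value of mpg at which it is evaluated.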
Software Output:
