We will use these data to illustrate several concepts in multiple regression.
FEV = b0 + b1(height) + b2(age)
Here is the output.
Coefficients:
              Value Std. Error  t value Pr(>|t|)
(Intercept) -4.6105     0.2243 -20.5576   0.0000
         ht  0.1097     0.0047  23.2628   0.0000
        age  0.0543     0.0091   5.9609   0.0000

Residual standard error: 0.4197 on 651 degrees of freedom
Multiple R-Squared: 0.7664
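The output above comes from a statistics package. Purely as an illustrative sketch, a fit of this form can be reproduced with ordinary least squares in numpy; the data values below are made up, not the real FEV measurements:

```python
import numpy as np

# Made-up stand-ins for the real FEV data (illustration only).
ht = np.array([46.0, 50.0, 55.0, 60.0, 65.0, 70.0])   # height
age = np.array([5.0, 7.0, 9.0, 11.0, 14.0, 17.0])     # age in years
fev = np.array([1.1, 1.5, 2.1, 2.7, 3.4, 4.1])        # forced expiratory volume

# Design matrix with an intercept column: FEV = b0 + b1*ht + b2*age
X = np.column_stack([np.ones_like(ht), ht, age])
b, rss, rank, _ = np.linalg.lstsq(X, fev, rcond=None)

fitted = X @ b
residuals = fev - fitted
print(b)  # least-squares estimates (b0, b1, b2)
```

In R the same model would be fit with the formula interface, `lm(fev ~ ht + age)`.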
Look at a plot of the residuals versus fitted values.
Notice that there are patterns in the residual plot.
- There is a curve. This suggests that the relationship is nonlinear.
- The spread of the residuals increases as the fitted values increase.
This indicates heteroscedasticity (nonconstant spread).
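One way to see the increasing spread numerically, rather than from the plot, is to compare the residual spread in the lower and upper halves of the fitted values. A small sketch with hypothetical numbers:

```python
import numpy as np

# Hypothetical fitted values and residuals (illustration only).
fitted = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5])
residuals = np.array([0.05, -0.07, 0.10, -0.12, 0.20, -0.25, 0.35, -0.40])

# Compare residual spread in the lower and upper halves of the fitted values.
order = np.argsort(fitted)
half = len(order) // 2
low_spread = residuals[order[:half]].std()
high_spread = residuals[order[half:]].std()
print(low_spread, high_spread)  # larger spread at high fitted values hints at heteroscedasticity
```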
Fit #2
Now try transforming the response variable by taking logarithms.
This often helps when the spread increases with fitted values
(and sometimes gets rid of nonlinearity problems as well).
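A sketch of the transformation: the response is replaced by its logarithm before fitting, and fitted values can be mapped back to the original scale with exp(). Again the numbers are invented for illustration:

```python
import numpy as np

# Made-up stand-ins for the real FEV data (illustration only).
ht = np.array([46.0, 50.0, 55.0, 60.0, 65.0, 70.0])
age = np.array([5.0, 7.0, 9.0, 11.0, 14.0, 17.0])
fev = np.array([1.1, 1.5, 2.1, 2.7, 3.4, 4.1])

# Fit log(FEV) = b0 + b1*ht + b2*age
X = np.column_stack([np.ones_like(ht), ht, age])
b, _, _, _ = np.linalg.lstsq(X, np.log(fev), rcond=None)

# Fitted values back on the original FEV scale
fitted_fev = np.exp(X @ b)
```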
Here is the output of the fit.
Coefficients:
              Value Std. Error  t value Pr(>|t|)
(Intercept) -1.9711     0.0783 -25.1639   0.0000
         ht  0.0440     0.0016  26.7059   0.0000
        age  0.0198     0.0032   6.2305   0.0000

Residual standard error: 0.1466 on 651 degrees of freedom
Multiple R-Squared: 0.8071
Also, look at the residual plot.
Notice that there are no obvious patterns.
The residuals have similar spread for all fitted values
and there are no trends.
Fit #3
We could also see if we could do better by adding a quadratic term for age.
Here is a summary of the fit.
Coefficients:
              Value Std. Error  t value Pr(>|t|)
(Intercept) -1.9809     0.0801 -24.7360   0.0000
         ht  0.0435     0.0018  23.6722   0.0000
        age  0.0271     0.0128   2.1239   0.0341
   I(age^2) -0.0003     0.0005  -0.5915   0.5544

Residual standard error: 0.1467 on 650 degrees of freedom
Multiple R-Squared: 0.8072
Notice that the p-value for the quadratic term is very large.
This extra term did not add much to the quality of the fit.
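The quadratic term is simply one more column in the design matrix, mirroring the I(age^2) term in the output above. A sketch with made-up numbers:

```python
import numpy as np

# Made-up stand-ins for the real FEV data (illustration only).
ht = np.array([46.0, 50.0, 55.0, 60.0, 65.0, 70.0])
age = np.array([5.0, 7.0, 9.0, 11.0, 14.0, 17.0])
log_fev = np.log(np.array([1.1, 1.5, 2.1, 2.7, 3.4, 4.1]))

# log(FEV) = b0 + b1*ht + b2*age + b3*age^2
X = np.column_stack([np.ones_like(ht), ht, age, age**2])
b, _, rank, _ = np.linalg.lstsq(X, log_fev, rcond=None)
```

In an R formula the square must be wrapped as `I(age^2)` so that `^` is taken as arithmetic rather than formula syntax.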
Fit #4
We could also see if we could do better by adding an interaction term between age and height.
Here is a summary of the fit.
Coefficients:
              Value Std. Error  t value Pr(>|t|)
(Intercept) -1.9666     0.1878 -10.4733   0.0000
         ht  0.0439     0.0032  13.6514   0.0000
        age  0.0193     0.0207   0.9322   0.3516
     ht:age  0.0000     0.0003   0.0267   0.9787

Residual standard error: 0.1467 on 650 degrees of freedom
Multiple R-Squared: 0.8071
Notice that the interaction term does not add much value.
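An interaction term is likewise just one more column: the elementwise product of the two predictors. A sketch with the same made-up numbers:

```python
import numpy as np

# Made-up stand-ins for the real FEV data (illustration only).
ht = np.array([46.0, 50.0, 55.0, 60.0, 65.0, 70.0])
age = np.array([5.0, 7.0, 9.0, 11.0, 14.0, 17.0])
log_fev = np.log(np.array([1.1, 1.5, 2.1, 2.7, 3.4, 4.1]))

# log(FEV) = b0 + b1*ht + b2*age + b3*(ht*age)
X = np.column_stack([np.ones_like(ht), ht, age, ht * age])
b, _, rank, _ = np.linalg.lstsq(X, log_fev, rcond=None)
```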
Our best model was Fit #2.
Fit #5
We can also add categorical variables to the multiple regression.
The variable sex can be represented by a numerical indicator variable
with one value for male and another for female.
Smoking status can be coded the same way.
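A sketch of this coding in numpy; the 0/1 assignments below are arbitrary choices for illustration, since the actual coding used in the output is not shown:

```python
import numpy as np

# Made-up stand-ins for the real FEV data (illustration only).
ht = np.array([46.0, 50.0, 55.0, 60.0, 65.0, 70.0])
age = np.array([5.0, 7.0, 9.0, 11.0, 14.0, 17.0])
log_fev = np.log(np.array([1.1, 1.5, 2.1, 2.7, 3.4, 4.1]))

# Arbitrary 0/1 indicator coding (illustration only).
sex = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])    # e.g. 0 = female, 1 = male
smoke = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])  # e.g. 0 = non-smoker, 1 = smoker

X = np.column_stack([np.ones_like(ht), ht, age, smoke, sex])
b, _, _, _ = np.linalg.lstsq(X, log_fev, rcond=None)
```

With this coding, the coefficient on the indicator is the estimated shift in the response between the two groups, holding the other predictors fixed.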
Here is a summary of the fit.
Coefficients:
              Value Std. Error  t value Pr(>|t|)
(Intercept) -1.9524     0.0807 -24.1811   0.0000
         ht  0.0428     0.0017  25.4893   0.0000
        age  0.0234     0.0033   6.9845   0.0000
      smoke -0.0230     0.0105  -2.2031   0.0279
        sex  0.0147     0.0059   2.5020   0.0126

Residual standard error: 0.1455 on 649 degrees of freedom
Multiple R-Squared: 0.8106
All of these variables appear to be important
because they have small p-values.
Last modified: April 5, 2001
Bret Larget, larget@mathcs.duq.edu