Chapter 8 Inference in Linear Models

I encourage you to review the cor(), lm(), and coef() functions in the R Guide for Chapter 2.

8.1 Inferences Using the Least-Squares Coefficients

The 8.1 notes start by reviewing Chapter 2, modeling auditory response time (\(y\)) as a linear function of visual response time (\(x\)). Here are those calculations (along with a few Chapter 1 summary statistics we'll need later).

x = c(161, 203, 235, 176, 201, 188, 228, 211, 191, 178)
y = c(159, 206, 241, 163, 197, 193, 209, 189, 169, 201)
(model = lm(y ~ x))
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##      22.360        0.864
mean(x)
## [1] 197.2
sd(x)
## [1] 23.26
sd(y)
## [1] 24.62
cor(x, y)
## [1] 0.8159

We can see more detail on the model via summary():

summary(model)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.344 -11.117  -0.707   8.279  24.885 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   22.360     42.948    0.52    0.617   
## x              0.864      0.216    3.99    0.004 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.1 on 8 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.624 
## F-statistic: 15.9 on 1 and 8 DF,  p-value: 0.004

Notice under “Coefficients” that \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are shown, along with their estimated standard errors \(s_{\hat{\beta}_0}\) and \(s_{\hat{\beta}_1}\). The test statistic \(t\) and P-value for each of the tests “\(H_0: \beta_0 = 0\)” and “\(H_0: \beta_1 = 0\)” are also given. (We did the \(\beta_1\) test in the 8.1 notes.)
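
If you want to use these numbers in later calculations, the whole coefficient table is available as a matrix via coef(summary(model)). Here's a minimal sketch (the row and column names are exactly as printed above); coef.table is just an illustrative name:

coef.table = coef(summary(model))   # matrix with Estimate, Std. Error, t value, Pr(>|t|)
coef.table["x", "Estimate"]         # beta.1-hat
coef.table["x", "Std. Error"]       # its estimated standard error
coef.table["x", "Estimate"]/coef.table["x", "Std. Error"]  # should reproduce the t value for the beta.1 test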

Notice after “Residual standard error” that \(s\) is given as 15.1. We can access this number by saving the output of summary(), which is a list whose components are accessed with the $ operator and whose structure can be seen via str(). (You don't need to study the output of str() below.)

out = summary(model)
str(out)
## List of 11
##  $ call         : language lm(formula = y ~ x)
##  $ terms        :Classes 'terms', 'formula' length 3 y ~ x
##   .. ..- attr(*, "variables")= language list(y, x)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "y" "x"
##   .. .. .. ..$ : chr "x"
##   .. ..- attr(*, "term.labels")= chr "x"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: 0x417bfc8> 
##   .. ..- attr(*, "predvars")= language list(y, x)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
##  $ residuals    : Named num [1:10] -2.43 8.29 15.65 -11.39 1.02 ...
##   ..- attr(*, "names")= chr [1:10] "1" "2" "3" "4" ...
##  $ coefficients : num [1:2, 1:4] 22.36 0.864 42.948 0.216 0.521 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "(Intercept)" "x"
##   .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
##  $ aliased      : Named logi [1:2] FALSE FALSE
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
##  $ sigma        : num 15.1
##  $ df           : int [1:3] 2 8 2
##  $ r.squared    : num 0.666
##  $ adj.r.squared: num 0.624
##  $ fstatistic   : Named num [1:3] 15.9 1 8
##   ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
##  $ cov.unscaled : num [1:2, 1:2] 8.08912 -0.040513 -0.040513 0.000205
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "(Intercept)" "x"
##   .. ..$ : chr [1:2] "(Intercept)" "x"
##  - attr(*, "class")= chr "summary.lm"

It looks like our 15.1 is the component called $sigma (which might better have been named $s, since it's an estimate of \(\sigma\), not \(\sigma\) itself).

(s = out$sigma)
## [1] 15.1
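
As a check, \(s\) is just the square root of SSE divided by its degrees of freedom, \(n - 2\). Here's a quick sketch using the residuals and residual degrees of freedom stored in the model:

sqrt(sum(model$residuals^2)/model$df.residual)  # should reproduce out$sigma, about 15.1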

Use confint() to get confidence intervals on the coefficients:

confint(model, level = 0.95)
##                2.5 %  97.5 %
## (Intercept) -76.6779 121.398
## x             0.3647   1.363

Notice the interval for \(\beta_1\) that we found (under “e.g. (#6c)” in the notes).
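
As a check, we can rebuild the \(\beta_1\) interval by hand from the coefficient table saved in out. Here's a sketch (est and se are just illustrative names); it should reproduce 0.3647 to 1.363, up to rounding:

est = out$coefficients["x", "Estimate"]
se = out$coefficients["x", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = 8) * se  # estimate plus/minus (critical t) * (standard error)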

Inferences on the Mean Response

We can handle this without any new R. e.g. Here's the confidence interval from #6d in the notes:

(n = length(x))
## [1] 10
(s.y.hat = s * sqrt(1/n + (200 - mean(x))^2/((n - 1) * var(x))))  # standard error of the mean response at x = 200
## [1] 4.813
(beta.0 = coef(model)[1])
## (Intercept) 
##       22.36
(beta.1 = coef(model)[2])
##      x 
## 0.8638
(y.hat = beta.0 + beta.1 * 200)
## (Intercept) 
##       195.1
alpha = 1 - 0.95
(t = -qt(alpha/2, n - 2))  # critical t value with n - 2 = 8 degrees of freedom
## [1] 2.306
(margin = t * s.y.hat)
## [1] 11.1
(interval = c(y.hat - margin, y.hat + margin))
## (Intercept) (Intercept) 
##       184.0       206.2
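
R can also produce this interval directly with predict(), given a data frame of new x values. Here's a sketch for the mean response at x = 200:

predict(model, newdata = data.frame(x = 200), interval = "confidence", level = 0.95)
# the fit, lwr, and upr columns should match y.hat, 184.0, and 206.2 above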

8.2 Checking Assumptions

Residual plot

The fitted values \(\hat{y}_i\) and residuals \(e_i\) are components of the (list) output of lm() (use str() to see other components, if you wish). Here's the residual plot of the points {(\(\hat{y}_i\), \(e_i\))} for the visual/auditory reaction times in the 8.2 notes.

y.hat = model$fitted.values
e = model$residuals
plot(y.hat, e, main = "Residual plot", xlab = "Fitted value", ylab = "Residual")
abline(a = 0, b = 0)  # add horizontal line with intercept 0 and slope 0

[Plot: "Residual plot" of e versus y.hat, with a horizontal line at 0]

Transformations

We don't need any new R to handle transformations. Here's the Hudson Bay rivers example, #6c in the 8.2 notes.

x = c(94.24, 66.57, 59.79, 48.52, 40, 32.3, 31.2, 30.69, 26.65, 22.75, 21.2, 
    20.57, 19.77, 18.62, 17.96, 17.84, 16.06, 14.69, 11.63, 11.19, 11.08, 10.92, 
    9.94, 7.86, 6.92, 6.17, 4.88, 4.49, 3.98, 3.74, 3.25, 3.15, 2.76, 2.64, 
    2.59, 2.25, 2.23, 0.99, 0.84, 0.64, 0.52, 0.3)  # annual discharge
y = c(4110.3, 4961.7, 10275.5, 6616.9, 7459.5, 2784.4, 3266.7, 4368.7, 1328.5, 
    4437.6, 1983, 1320.1, 1735.7, 1944.1, 3420.2, 2655.3, 3470.3, 1561.6, 869.8, 
    936.8, 1315.7, 1727.1, 768.1, 483.3, 334.5, 1049.9, 485.1, 289.6, 551.8, 
    288.9, 295.2, 500.1, 611, 1311.5, 413.8, 263.2, 490.7, 204.2, 491.7, 74.2, 
    240.6, 56.6)  # peak flow
(model = lm(log(y) ~ log(x)))
## 
## Call:
## lm(formula = log(y) ~ log(x))
## 
## Coefficients:
## (Intercept)       log(x)  
##       5.258        0.799
plot(log(x), log(y), main = "Model", xlab = "ln(Annual discharge)", ylab = "ln(Peak flow)")
abline(a = coef(model)[1], b = coef(model)[2])  # add regression line

[Plot: "Model" — ln(Peak flow) versus ln(Annual discharge), with the fitted regression line]

plot(model$fitted.values, model$residuals, main = "Residual plot", xlab = "Fitted value", 
    ylab = "Residual")
abline(a = 0, b = 0)

[Plot: "Residual plot" for the log-log model — residuals versus fitted values, with a horizontal line at 0]
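
Once we're happy with the log-log fit, predictions on the original scale come from exponentiating the prediction on the log scale. Here's a sketch for an (illustrative) annual discharge of 50; log.y.hat is just an illustrative name:

log.y.hat = coef(model)[1] + coef(model)[2] * log(50)  # prediction of ln(peak flow) at x = 50
exp(log.y.hat)                                          # back-transform to the peak-flow scale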

8.3 Multiple Regression

We don't need much new R to handle 8.3. To get a model from lm() with y depending on several independent variables, separate those variables with “+”. e.g. For a model in which \(y\) depends linearly on \(x_1\), \(x_2\), and \(x_3\), we can use lm(y ~ x.1 + x.2 + x.3).

e.g. Here's example #16 from the 8.3 notes. Notice that it uses “polynomial regression,” in which we trick the regression computation into treating \(x^2\) as a second independent variable. A more typical example would set x.2 (below) from raw data, not from the square of x.
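
(An aside: once the x and y for this example are defined just below, R can also build the squared term inside the model formula itself using I(), so you don't have to create x.2 by hand. This sketch should give the same coefficients as the fit below.)

lm(y ~ x + I(x^2))  # I(x^2) squares x inside the formula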

x = c(1, 1.5, 2, 2.5, 3)
y = c(99, 96, 88, 76, 66)
x.1 = x
(x.2 = x^2)
## [1] 1.00 2.25 4.00 6.25 9.00
(model = lm(y ~ x.1 + x.2))
## 
## Call:
## lm(formula = y ~ x.1 + x.2)
## 
## Coefficients:
## (Intercept)          x.1          x.2  
##      101.40         3.37        -5.14
(y.hat = model$fitted.values)
##     1     2     3     4     5 
## 99.63 94.89 87.57 77.69 65.23
(e = model$residuals)
##       1       2       3       4       5 
## -0.6286  1.1143  0.4286 -1.6857  0.7714
summary(model)
## 
## Call:
## lm(formula = y ~ x.1 + x.2)
## 
## Residuals:
##      1      2      3      4      5 
## -0.629  1.114  0.429 -1.686  0.771 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   101.40       6.44   15.73    0.004 **
## x.1             3.37       7.01    0.48    0.678   
## x.2            -5.14       1.73   -2.97    0.097 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.62 on 2 degrees of freedom
## Multiple R-squared:  0.993,  Adjusted R-squared:  0.986 
## F-statistic:  145 on 2 and 2 DF,  p-value: 0.00685
(SSE = sum(e^2))
## [1] 5.257
n = length(y)
(SST = (n - 1) * var(y))
## [1] 768
(SSR = sum((y.hat - mean(y))^2))
## [1] 762.7
SST - SSE  # check that this equals SSR, above
## [1] 762.7
(R.squared = SSR/SST)
## [1] 0.9932
p = 2  # number of independent variables (x.1 and x.2)
(F = (SSR/p)/(SSE/(n - (p + 1))))
## [1] 145.1
(p.value = 1 - pf(F, p, n - (p + 1)))
## [1] 0.006845
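
These by-hand results can be checked against the components of the summary() list, which (as we saw in 8.1) stores them directly. A sketch:

out = summary(model)
out$r.squared   # should match R.squared above
out$fstatistic  # value, numdf, dendf; value should match F above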

Rounding problems

Note that the last line of the last summary() call, above, shows “F-statistic: 145 ...”, while the “(F = (SSR/p) / (SSE/(n-(p+1))))” line gives “145.1”. R made different decisions on rounding \(F\) in these two cases, and neither gives very many digits. We can suggest (but not control) the number of digits to print via the options() function:

options(digits = 15)

Now when we re-run the previous two lines, we see more digits of \(F\):

summary(model)
## 
## Call:
## lm(formula = y ~ x.1 + x.2)
## 
## Residuals:
##               1               2               3               4 
## -0.628571428571  1.114285714286  0.428571428571 -1.685714285714 
##               5 
##  0.771428571429 
## 
## Coefficients:
##                    Estimate      Std. Error  t value  Pr(>|t|)   
## (Intercept) 101.40000000000   6.44448823193 15.73438 0.0040149 **
## x.1           3.37142857143   7.00833323222  0.48106 0.6779608   
## x.2          -5.14285714286   1.73322867293 -2.96721 0.0972878 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.62128696676 on 2 degrees of freedom
## Multiple R-squared:  0.993154761905, Adjusted R-squared:  0.98630952381 
## F-statistic: 145.086956522 on 2 and 2 DF,  p-value: 0.00684523809524
(F = (SSR/p)/(SSE/(n - (p + 1))))
## [1] 145.086956521739
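
When you're done, you can go back to R's default of 7 significant digits:

options(digits = 7)  # restore the default number of printed digits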