Simulations of linear model fits

Table of Contents

1 Simulating linear model fits

In a model-fitting function, such as lm or glm, a large proportion of the execution time can be spent on decoding and checking the argument values, creating the model frame (function model.frame) and the model matrix (function model.matrix) and preparing the arguments for the numerical operations, which are performed in lm.fit or glm.fit.

This is fine when we are fitting a single model or performing model building. The time we should spend thinking about the model fits dwarfs the time spent fitting the model, in most cases.

However, when we are going to perform a simulation, we don't want to perform all these operations of setting up the model when we are fitting the same model to tens of thousands of simulated response vectors in a simulation study.

1.1 Fitting a matrix of responses

Fortunately there is a very fast, but little known, way of fitting a large number of simulated responses to the same model. If you read the documentation for the lm function you will find that the response part of the formula can be a matrix with n rows and N columns. That is, you can fit all the simulated responses in a single call to lm

set.seed(1234321)            # ensure a reproducible stream
N <- 10000                   # number of replications
n <- 22                      # size of the sample
dat <- data.frame(x=1:n)     # x values for a simple linear regression
(betaTrue <- c(0, 4) + rnorm(2, sd = 0.1))

> > > [1] 0.1236908 4.1561623
muTrue <- as.vector(model.matrix(~ x, dat) %*% betaTrue)
str(Ymat <- matrix(rnorm(n*N, mean=0, sd=0.3), nr = n) + muTrue)

 num [1:22, 1:10000] 4.83 8.53 12.91 16.4 20.98 ...
system.time(bigMod <- lm(Ymat ~ x, dat))
   user  system elapsed 
   0.16    0.01    0.17

As you can see, this is very fast – less than 1 second which is much, much faster than any kind of looping or apply function could achieve.

The usual extractors applied to such model fits produce vectors or matrices.

str(cc <- coef(bigMod))
 num [1:2, 1:10000] 0.318 4.148 0.184 4.146 0.12 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:2] "(Intercept)" "x"
  ..$ : NULL

To plot the distribution of the coefficient estimates we create a data frame of the transpose of this matrix.

coefFr <- as.data.frame(t(cc))
names(coefFr) <- c("Intercept", "slope")
library(ggplot2)
qplot(slope, data=coefFr, geom="density", xlab=expression(hat(beta[1])))

beta1_dens.png

qplot(Intercept, data=coefFr, geom="density", xlab=expression(hat(beta)[0]))

beta0_dens.png

In a scatter plot of a large number of points like this, we sometimes get a better impression of the density of the points by using "alpha blending". We set the parameter alpha to a value between 0 and 1 so the points are partially transparent. (There is a reason we give the value of alpha as a quoted string. Try without the quotes to see the difference.)

qplot(Intercept, slope, data=coefFr, alpha="0.2",
      xlab=expression(hat(beta)[0]),
      ylab=expression(hat(beta)[1]))

beta0_beta1.png

Author: Douglas Bates

Date: 2010-08-24 Tue

HTML generated by org-mode 7.01h in emacs 23