Simulations of linear model fits
Table of Contents
1 Simulating linear model fits
In a model-fitting function, such as lm
or glm
, a large proportion
of the execution time can be spent on decoding and checking the
argument values, creating the model frame (function model.frame
) and
the model matrix (function model.matrix
) and preparing the arguments
for the numerical operations, which are performed in lm.fit
or
glm.fit
.
This is fine when we are fitting a single model or performing model building. The time we should spend thinking about the model fits dwarfs the time spent fitting the model, in most cases.
However, when we are going to perform a simulation, we don't want to perform all these operations of setting up the model when we are fitting the same model to tens of thousands of simulated response vectors in a simulation study.
1.1 Fitting a matrix of responses
Fortunately there is a very fast, but little known, way of fitting a
large number of simulated responses to the same model. If you read
the documentation for the lm
function you will find that the
response part of the formula can be a matrix with n rows and N
columns. That is, you can fit all the simulated responses in a single
call to lm
set.seed(1234321) # ensure a reproducible stream N <- 10000 # number of replications n <- 22 # size of the sample dat <- data.frame(x=1:n) # x values for a simple linear regression (betaTrue <- c(0, 4) + rnorm(2, sd = 0.1))
> > > [1] 0.1236908 4.1561623
muTrue <- as.vector(model.matrix(~ x, dat) %*% betaTrue) str(Ymat <- matrix(rnorm(n*N, mean=0, sd=0.3), nr = n) + muTrue)
num [1:22, 1:10000] 4.83 8.53 12.91 16.4 20.98 ...
system.time(bigMod <- lm(Ymat ~ x, dat))
user system elapsed 0.16 0.01 0.17
As you can see, this is very fast – less than 1 second which is much, much faster than any kind of looping or apply function could achieve.
The usual extractors applied to such model fits produce vectors or matrices.
str(cc <- coef(bigMod))
num [1:2, 1:10000] 0.318 4.148 0.184 4.146 0.12 ... - attr(*, "dimnames")=List of 2 ..$ : chr [1:2] "(Intercept)" "x" ..$ : NULL
To plot the distribution of the coefficient estimates we create a data frame of the transpose of this matrix.
coefFr <- as.data.frame(t(cc)) names(coefFr) <- c("Intercept", "slope")
library(ggplot2) qplot(slope, data=coefFr, geom="density", xlab=expression(hat(beta[1])))
qplot(Intercept, data=coefFr, geom="density", xlab=expression(hat(beta)[0]))
In a scatter plot of a large number of points like this, we sometimes
get a better impression of the density of the points by using "alpha
blending". We set the parameter alpha
to a value between 0 and 1 so
the points are partially transparent. (There is a reason we give the value of
alpha
as a quoted string. Try without the quotes to see the difference.)
qplot(Intercept, slope, data=coefFr, alpha="0.2",
xlab=expression(hat(beta)[0]),
ylab=expression(hat(beta)[1]))
Date: 2010-08-24 Tue
HTML generated by org-mode 7.01h in emacs 23