Speed of Simulations
Table of Contents
1 Speed of simulations
Because simulations can take a long time, they provide good examples for comparing speed of R code and developing good programming techniques.
1.1 Loops versus apply versus …
We will compare different methods of simulating 100,000 realizations of the sample mean from samples of size 9 from an exponential distribution with rate, λ = 1.
N <- 100000 n <- 9
For reference, the method suggested in Simulation studies using R is
set.seed(123)
system.time(resRepl <- replicate(N, mean(rexp(n))))
user system elapsed 3.630 0.000 3.671
We see that on the machine I am using it takes a few seconds to produce the result, which is
str(resRepl)
num [1:100000] 0.704 0.584 1.505 1.083 1.489 ...
1.1.1 Simulating in a loop
Programmers who have used compiled languages like C, C++, Fortran, or
Pascal, and even byte-compiled languages like Java or C# can quickly
adapt to using for
loops in R. The syntax in R is a bit different
from those languages but the ideas of looping are familiar.
Part of the folklore regarding R programming is that you should avoid
writing for
loops but the situation is more subtle than that. You
should avoid writing for
loops carelessly.
If you are going to produce a vector or a matrix (or, in general, an array) and you preallocate the structure then use replacement operations, looping is not bad.
set.seed(123) resLoop1 <- numeric(N) # preallocate the result of correct size system.time(for (i in 1:N) resLoop1[i] <- mean(rexp(n)))
> user system elapsed 3.880 0.000 3.904
We should, of course, check that the results are as expected
all.equal(resRepl, resLoop1)
[1] TRUE
The thing not to do in a loop is to grow the structure. R allows you to assign a value beyond the end of a vector and it will automatically extend the size of the vector when this happens. However, this operation is not without cost.
set.seed(123) resLoop2 <- numeric() # empty numeric vector system.time(for (i in 1:N) resLoop2[i] <- mean(rexp(n)))
> user system elapsed 42.480 0.050 43.018
This produces the same results
all.equal(resLoop2, resRepl)
[1] TRUE
but takes much longer.
Another idiom used by some is to initialize the result to NULL
and concatenate
each freshly simulated result onto the existing results. Again, this
involves a considerable amount of recopying of results and is slow.
set.seed(123) resLoop3 <- NULL # empty list system.time(for (i in 1:N) resLoop3 <- c(resLoop3, mean(rexp(n))))
> user system elapsed 40.930 0.060 41.073
str(resLoop3)
num [1:100000] 0.704 0.584 1.505 1.083 1.489 ...
1.2 Using arrays
At one time it was important to take into consideration the overhead of calls to R functions and avoid putting too many functions inside a loop. That is less of an urgent consideration on modern computers. Although the interpreter overhead is still present on modern computers it is not as big an issue.
Nevertheless it is good to see how such a simulation could be done by
creating all the random numbers from the exponential distribution in
one call and then reducing each subsample. We arrange the n*N
values into an n× N matrix. (We could make it N× n but
there is a slight advantage in having subsamples in columns rather
than rows.)
set.seed(123)
str(MM <- matrix(rexp(n*N), nr = n))
num [1:9, 1:100000] 0.8435 0.5766 1.3291 0.0316 0.0562 ...
1.2.1 Reducing the array using the colMeans
function
At this point we can make use of a very fast, built-in function called
colMeans
to evaluate the means of the columns. So that timing
comparisons with other methods are fair, we regenerate the matrix
before taking the column means.
set.seed(123)
system.time(resColMeans <- colMeans(matrix(rexp(n*N), nr = n)))
user system elapsed 0.150 0.000 0.143
As we can see, this method is the clear winner for speed and we can check with
all.equal(resColMeans, resRepl)
[1] TRUE
to see that it does produce the same set of simulated values. (This,
by the way, is another reason to use put subsamples in the columns and
to use colMeans
to reduce the matrix. If the matrix had been
configured as N
× n
and reduced by rowMeans
then the
results would not correspond to those from replicate
.)
1.2.2 Reducing the array using apply
The colMeans
function is quite fast but only useful if you want the
sample means from each of the subsamples. If we wanted the sample
median instead of the sample mean, we would be stuck.
The apply
function is a more general way of reducing an array
because it takes the name of the R function to apply. Of course, this
means there will be repeated calls to the R function being applied,
with the overhead of those calls, but that is the price to be paid for
generality. The call is of the form
resApply <- apply(MM, 2, mean)
all.equal(resApply, resRepl)
[1] TRUE
The value 2 for the second argument indicates that the mean
function
should be applied to the 2nd dimension (i.e. the columns) of the
array.
To get a fair comparative timing we generate the random numbers within
the call to apply
set.seed(123)
system.time(resApply <- apply(matrix(rexp(n * N), nr = n), 2, mean))
user system elapsed 4.000 0.000 4.058
We can see that this method's speed is comparable to the replicate
method.
1.3 Using sapply
, lapply
and vapply
The apply
function is one of a family of functions that apply other
functions to the elements of some structure. The most general of
these is lapply
, which applies a function to the elements of a list
(or any other vector structure, including numeric vectors).
A characteristic of lapply
is that it always returns a list. The
sapply
function is like lapply
except that it tries to "simplify"
the list that is returned.
In our case we just want to repeat the operation of simulating n
random values and determining their mean and do that N
times so the
function we want to apply doesn't use its argument. Nevertheless, we
must give it an argument, even though we don't use it, so that the
sapply
function stays happy.
set.seed(123) system.time(resSapply <- sapply(1:N, function(i) mean(rexp(n))))
user system elapsed 3.580 0.000 3.665
An alternative is to unlist
the result of lapply
.
set.seed(123)
str(unlist(lapply(1:N, function(i) mean(rexp(n)))))
num [1:100000] 0.704 0.584 1.505 1.083 1.489 ...
for which the timing is
system.time(unlist(lapply(1:N, function(i) mean(rexp(n)))))
user system elapsed 3.430 0.000 3.428
A slight variant on sapply
is vapply
which can be used if you know
you will be creating a vector. It takes a third argument which is a
template vector that indicates the desired type of the response. It
doesn't need to be as large are the response, it just needs to be the
right type of vector (technically, I should say "mode" instead of
"type" but you get the idea). Typical values are 1
for a numeric
vector, 1L
for an integer vector, and the empty string, "", for
character strings but more complex structures can be used. See the
examples in the help page ?vapply
set.seed(123)
str(vapply(1:N, function(i) mean(rexp(n)), 1))
num [1:100000] 0.704 0.584 1.505 1.083 1.489 ...
system.time(vapply(1:N, function(i) mean(rexp(n)), 1))
user system elapsed 3.230 0.000 3.253
Notice that the timings for the sapply
and lapply
methods are
similar to those for replicate
. This is not a coincidence. The
replicate
function
replicate
function (n, expr, simplify = TRUE) sapply(integer(n), eval.parent(substitute(function(...) expr)), simplify = simplify) <environment: namespace:base>
ends up being a call to sapply
. There are a few subtleties in there
about setting up the counter vector and changing the expression into
an anonymous function but basically it comes down to a call to
sapply
.
We won't bother discussing how the expression is converted to an
anonymous function but it is interesting to consider why the number of
replications, what we call N
but is called n
here, is converted to
a vector of length n
by integer(n)
, whereas we used 1:N
. The
integer(n)
call creates an integer vector of length n
filled with
zeros, whereas 1:N
creates an integer vector of length N
counting
from 1
to N
— most of the time. Because we are not actually
using the values it doesn't matter what the contents of the vector are
as long as it has the correct length.
As for that "most of the time" comment, the exception is when n=0
.
The construction 1:0
produces a vector of length 2 whereas
integer(0)
produces a vector of length 0. It may seem bizarre to
consider what the result should be when you replicate evaluation of an
expression 0 times but the defensive programmer is always cautious of
the "edge cases". So the construction 1:n
is dangerous because it
doesn't behave as expected when n=0
. If you want to get an integer
vector containing the values from 1 up to n
with correct behaviour
for n=0
use seq_len(n)
2 Benchmarking the speed of the methods
We have used system.time
to assess the execution time of a single
evaluation of a sample by each of the proposed methods. There are
many different factors that can affect the overall execution time and
we really should replicate the timings to get a better handle on the
overall speed. The rbenchmark
package provides a versatile
function, called benchmark
, to replicate timings of expressions and
create a table of results.
First we load the package
library(rbenchmark)
(if this produces an error you may need to install the package first).
We should decide which methods we wish to compare and what size of
samples to use. The replicate
, sapply
, etc. methods have taken 3
to 4 seconds for 100,000 realizations on this computer. To have the
benchmark test run in a reasonable length of time we will cut the
number of realizations to 1000 and run 100 replications of each
method. As for the methods, we will eliminate the methods based on
loops without preallocation of the results, as they are clearly not
competitive.
To make identification easier, we create functions for each of the simulation methods
fRepl <- function(n, N) replicate(N, mean(rexp(n))) fLoopPre <- function(n, N) { ans <- numeric(N) for(i in seq_len(N)) ans[i] <- mean(rexp(n)) ans } fColMeans <- function(n, N) colMeans(matrix(rexp(n * N), nr=n)) fApply <- function(n, N) apply(matrix(rexp(n * N), nr=n), 2, mean) fSapply <- function(n, N) sapply(integer(N), function(...) mean(rexp(n))) fLapply <- function(n, N) unlist(lapply(integer(N), function(...) mean(rexp(n)))) fVapply <- function(n, N) vapply(integer(N), function(...) mean(rexp(n)), 1) N <- 1000
benchmark(fColMeans(n,N), fVapply(n, N), fRepl(n, N), fLoopPre(n, N), fApply(n, N), fSapply(n, N), fLapply(n, N), columns = c("test", "elapsed", "relative", "user.self", "sys.self"), order = "elapsed")
+ + + + + + + test elapsed relative user.self sys.self 1 fColMeans(n, N) 0.135 1.00000 0.13 0 2 fVapply(n, N) 3.458 25.61481 3.40 0 7 fLapply(n, N) 3.460 25.62963 3.44 0 6 fSapply(n, N) 3.619 26.80741 3.55 0 3 fRepl(n, N) 3.628 26.87407 3.57 0 5 fApply(n, N) 3.853 28.54074 3.80 0 4 fLoopPre(n, N) 3.876 28.71111 3.79 0
So we see that there is very little difference between the various
looping or "apply"ing methods and, on this machine, the colMeans
function is about 20 to 30 times faster than using a loop or an apply
function.
3 Summary
-
Using
replicate
for simulations is conceptually the simplest approach and is competitive with other methods based on looping or various functions from theapply
family. -
For the particular case of evaluating sample means, the
colMeans
function can be an order of magnitude faster. It is worth remembering for that one special, but frequent, case. -
Using
for
loops is not, by itself, a bad practice. But you should avoid "growing", within the loop, the object containing the result.
Date: 2010-08-25 Wed
HTML generated by org-mode 7.01h in emacs 23