Algorithms for the Mean and Variance

Two common summaries of data are the mean and the variance.
Mean     = x-bar = sum x_i / n

Variance = s^2   = sum (x_i - x-bar)^2 / (n-1)
As written, computation of the variance requires two passes through the data, one to sum the data and compute the mean, followed by a second pass to find the sum of the squared deviations from the mean and the variance.

The Desk Calculator Algorithm

It is more efficient to find an algorithm which requires just a single pass through the data. The desk calculator algorithm is one such choice. As data is entered, this algorithm keeps track of the sum of the data and the sum of the squares of the data. Straightforward algebra shows that the above expression for the variance is equivalent to this expression.
s^2 = ( sum (x_i)^2 - (sum x_i)^2 / n ) / (n-1)
However, this algorithm gives incorrect answers for ill-conditioned data. For example, try using the statistics functions on your calculator to find the standard deviation of the numbers 10,000,001, 10,000,003, and 10,000,005. Exact computation gives the answer as 2 (the same as the standard deviation of 1, 3, and 5), but your calculator probably returns the number 0. The desk calculator algorithm will have problems whenever the coefficient of variation (s / x-bar) is quite small.

Computers do not store the values of real numbers exactly. The value called the machine EPSILON is the smallest number for which 1 + EPSILON is greater than 1 on the computer. If the square of the coefficient of variation is smaller than the machine EPSILON, the desk calculator algorithm will fail.

The Method of Provisional Means

Here is a one-pass algorithm which updates the values of the mean and the sum of the squared deviations from that mean every time a new data value is read. Define
m_k = mean of first k numbers

s_k = sum (x_i - m_k)^2, where sum is over first k numbers
By simplifying the expression for m_k - m_{k-1} and s_k - s_{k-1}, we arrive at these difference equations.
m_k = m_{k-1} + (x_k - m_{k-1}) / k

s_k = s_{k-1} + (x_k - m_{k-1}) * (x_k - m_k)
(See Homework #5, Problem 1 for a related question.) An algorithm built upon these difference equations can handle many data sets for which the desk calculator algorithm fails at a price of a small increase in storage and computation.

The Subtract the First Number Algorithm

An alternative algorithm is to simply subtract the first number from each observation and use the desktop algorithm on this modified data. The mean will be too small, and needs to be corrected by adding the first number back on. The variance is unaffected by the subtraction of a constant from every value.
s^2 = ( sum (x_i - x_1)^2 - ( sum (x_i - x_1) )^2 / n ) / (n-1)
This method also avoids many numerical difficulties because the first number is often close in size to most of the numbers in the data set.


Last modified: March 19, 1997

Bret Larget, larget@mathcs.duq.edu