Analysis of Variance

Analysis of variance is the standard statistical technique for modeling a quantitative response variable with categorical explanatory variables. Selecting an appropriate model will usually involve a graphical exploration of data, fitting different models of various complexity, and choosing one on the combined basis of numerical summaries of the fit and a graphical examination of several diagnostic plots.

To introduce these ideas, we analyze the data from a student experiment to explore the nature of the relationship between a person's heart rate and the frequency at which that person stepped up and down on steps of various heights. The student experimenters tested the effects of two different step heights crossed with three different stepping speeds, for a total of six different treatment combinations. In addition, they used six different subject/experimenter combinations as blocks. They used a balanced incomplete block design, meaning that every pair of factor levels appears in an equal number of blocks. For this experiment, each of the six height/frequency combinations appears in five blocks, where a block is a student exerciser and a pair of experimenters.

Experiment Details

There were two different step heights: 5.75 inches (coded as low), and 11.5 inches (coded as high). There were three rates of stepping: 14 steps/min. (coded as slow), 21 steps/min. (coded as moderate), and 28 steps/min. (coded as fast). This resulted in six possible height/frequency combinations. Each subject performed the activity for three minutes. Subjects were kept on pace by the beat of an electric metronome. One experimenter counted the subject's pulse for 20 seconds before and after each trial. The subject always rested between trials until her or his heart rate returned to close to the beginning rate. Another experimenter kept track of the time spent stepping. Each subject was always measured and timed by the same pair of experimenters to reduce variability in the experiment. Each pair of experimenters was treated as a block.

Number of cases: 30
Variable Names:
  1. Order: the overall performance order of the trial
  2. Block: the subject and experimenters' block
  3. Height: low (5.75") or high (11.5")
  4. Frequency: the rate of stepping. slow (14 steps/min), moderate (21 steps/min), or high (28 steps/min)
  5. RestHR: the resting heart rate of the subject before a trial, in beats per minute
  6. HR: the final heart rate of the subject after a trial, in beats per minute
  7. Time: the performance order within a block
This data is found in slightly different form at the Data and Story Library, part of StatLib at Carnegie Mellon University.

Getting the Data

Copy the file ~/larget/496/step.dat to your course directory, and load the data into S-PLUS.
% cp ~/larget/496/step.dat ~/496/step.dat
% Splus -e
> step.frame <- read.table("step.dat",header=T)
Any variable that takes on character values will be interpreted as a factor, and any variable with only numerical values will be interpreted as quantitative by default. For the data we have read in, Time and Frequency should also be ordered factors. This S-PLUS code changes the status of both variables.
> step.frame$Frequency <- ordered(step.frame$Frequency,
+ levels = c("slow","moderate","fast"))
> step.frame$Time <- ordered(step.frame$Time)
For Frequency, you need to explicitly tell S-PLUS the proper order. For Time, S-PLUS uses the numerical order by default.

You can check that all variables are correctly classified by using the function sapply, which is similar to apply, and applies a specified function to all objects in a list, or in the case of data frames, to each column. The function is.factor returns T if a column is a factor or an ordered factor. The function is.ordered returns T if a column is an ordered factor.

> sapply(step.frame,is.factor)
> sapply(step.frame,is.ordered)
The response variable we are interested in is the difference between heart rate (HR) and resting heart rate (RestHR). Create a variable called DiffHR for this difference and add it to the data frame.
> attach(step.frame)
> step.frame$DiffHR <- HR - RestHR
> attach(step.frame)

Graphical Exploration of the Data

Now that the data is all properly stored in a data frame, we are ready to begin the analysis. Any analysis should begin with a graphical exploration of the data. One good summary plot is side-by-side boxplots by level for each factor. S-PLUS will do this for all factors with this command:
> plot.factor(DiffHR ~ Block + Time + Height + Frequency)
Each of these pictures gives an informal indication as to whether a given factor should be included in the model. Another useful summary plot is created with plot.design.
> plot.design(step.frame)

Fitting an ANOVA Model

The textbook gives examples in Chapter 10 on fitting many statistical models including analysis of variance. The same basic technique for model fitting is followed, using a different function to produce the fit.

Put your data into a data frame with each column correctly specified as a quantitative variable, factor, or ordered factor. Section 10.1 in the textbook gives more details on the syntax for doing these tasks in S-PLUS.

Fit a model using the appropriate function. Analysis of variance models are fit with the S-PLUS function aov. This example continues the data introduced above. To fit a model that predicts the difference in heart rate based on the group of students (Block), number in sequence of a group of experiments (Time), height of the step (Height), and frequency with which the exerciser steps (Frequency),

> fit <- aov(DiffHR ~ Block + Time + Height + Frequency, data=step.frame)
This creates an object named fit which contains all necessary information about the fit. The textbook gives more details on changing the formula statement to fit alternative models.

Use diagnostic tools to examine the quality of the fitted model. Other functions may then be used to extract information about the model. For example,

> summary(fit)
produces the ANOVA table. If you wish to extract the fitted coefficients,
> coef(fit)
does the job. If you want to work with the fitted values or residuals,
> fv <- fitted.values(fit)
> r <- residuals(fit)
will do the trick. Notice that the functions used above are identical to those used to examine the object created by fitting a regression model and, in fact, will work for the several other models available in S-PLUS.

The function plot applied to fit produces a few diagnostic plots. Again, you may exercise more control over what you wish to view by working with the residuals, fitted values, and data directly. For example,

> plot(fitted.values(fit),residuals(fit),xlab="Fitted Values",
+ ylab="Residuals")
> abline(0,0)
produces a plot of the residuals versus the fitted values.

Just as with regression, finding the appropriate model to describe the data requires fitting several models and examining the fits interactively. S-PLUS gives you the tools to do this. This course attempts to show you how to use these tools computationally. Courses in applied statistics are necessary for you to learn more about why these tools can be effective.

The book Modern Applied Statistics with S-Plus by Venables and Ripley and the book Statistical Models in S edited by Chambers and Hastie (of which I have given you an excerpt) include much more detailed information on practical model fitting using S-PLUS.


Last modified: May 2, 1997

Bret Larget, larget@mathcs.duq.edu