Diagnostic Check of Assumptions


Let's return to the simpler situation of a completely randomized design in one environment. The analyses presented in this chapter make certain assumptions. This subsection examines the assumptions, in order of importance, and makes suggestions about how to detect severe violations and adjust for these problems through transformations or other approaches. While there are formal tests, they tend to be rather weak, especially for samples of 20 or 50. It is usually enough to informally examine patterns in plots of the residuals from the model fit against the predicted values.

proc glm data=genes;   /* save predicted values and residuals */
   class geno;
   model trait = geno / ss3;
   output out=diag r=rtrait p=ptrait;
run;
proc plot data=diag;   /* check residual plot pattern */
   plot rtrait*ptrait=geno;
run;

The most important assumption is that the model is correct: that there really is only one gene, and no systematic treatment or environmental effect. If there are other important factors, they should be included in the analysis from the beginning. Much of this chapter is devoted to checking for multiple genes and adjusting for other factors believed to affect expression of the trait of interest.

The second most important assumption is that the trait measurements are independent. This can be examined by plotting trait measurements, or residuals from model fits, against run order or field position. For instance, look for trends in the measurements and residuals (overall, or by genotype), as follows:

proc plot data=diag;   /* check trait and residuals against run order */
   plot trait*order=geno;
   plot rtrait*order=geno;
run;
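
As a rough numerical companion to these plots, one could compute the lag-1 correlation of the residuals in run order. This is only a sketch, assuming the diag data set created earlier still carries the order variable; the data set names byorder and lagged and the variable prev are made up for illustration.

```sas
proc sort data=diag out=byorder;   /* put residuals in run order */
   by order;
run;
data lagged;                       /* pair each residual with the previous one */
   set byorder;
   prev = lag(rtrait);             /* missing for the first observation */
run;
proc corr data=lagged;             /* lag-1 correlation of residuals */
   var rtrait;
   with prev;
run;
```

A correlation far from zero hints that neighboring runs are not independent. With small samples, treat it as a prompt to look at the plots again rather than as a formal test.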

It is best to address independence concerns at the design phase of the experiment. For instance, use border rows between plots and take measurements sufficiently far apart in time and space. [These issues are considered again in the section on Experimental Design.] However, repeated measurements over time can carry important information about the agronomic processes under study. Repeated measurements, sometimes called longitudinal profiles, require extra care in data analysis beyond the scope of this chapter.

The third most important assumption is that the variance is constant across all trait measurements. This can be checked informally by plotting the residuals against predicted values, as shown at the beginning of this subsection. Look for increasing or decreasing spread in the residuals across the range of predicted values. Draw horizontal lines at zero (0) residual and plus or minus 2 standard deviations to help the eye.
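
One possible way to get these reference lines without computing the standard deviation by hand is to plot studentized residuals, which are already scaled to have roughly unit variance. The sketch below assumes the same genes data set and variables used earlier in this subsection; the names diag2 and strait are made up for illustration.

```sas
proc glm data=genes;               /* refit, saving studentized residuals */
   class geno;
   model trait = geno / ss3;
   output out=diag2 p=ptrait student=strait;
run;
proc plot data=diag2;              /* reference lines at 0 and +/- 2 */
   plot strait*ptrait=geno / vref=-2 0 2;
run;
proc means data=diag2 n mean std;  /* compare residual spread by genotype */
   class geno;
   var strait;
run;
```

Standard deviations that differ greatly among genotypes in the last table point to the same problem as a fan shape in the plot.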

Certain types of measurements, such as counts, weights and lengths, tend to have increasing spread as the mean increases. Typically, either a square root or log transformation can stabilize the variance for such measurements. Percents require a slightly different transform, since the variance of percents decreases at either end (near 0% and 100%).

variance stabilizing transformations

   measurement  transformation
   count        sqrt(count)
   weight       log10(weight)
   percent      arsin(sqrt(percent))
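
In SAS these transformations might be applied in a data step before fitting the model. This is a sketch only: the variable names count, weight and percent simply mirror the table and are assumed to exist in genes, percent is assumed to be recorded on a 0 to 100 scale (hence the division by 100), and the new names are made up for illustration.

```sas
data genes2;                            /* add transformed copies of each trait */
   set genes;
   rtcount  = sqrt(count);              /* counts: square root */
   lgweight = log10(weight);            /* weights: log base 10, needs weight > 0 */
   aspct    = arsin(sqrt(percent/100)); /* percents (0-100): arcsine square root */
run;
```

The transformed variable then takes the place of trait in the model statement, and the residual plot is redrawn to see whether the spread has evened out.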

Transformations cannot solve all problems. Some measurements do not seem to improve after any transformation. For instance, a trait having 30% of the measurements being zero (0) will still have 30% of its values being the same! An alternative approach would consider this as two separate traits, one which turns expression on and a second which determines the amount of expression. In this case, the first trait is qualitative, with the information being

   value     A   H   B
   zero     20  10   0
   nonzero   0  30  20
   total    20  40  20

This artificial example shows clear genetic control, with some dominance effect (H is not halfway between). Analysis of the quantitative trait would use only the 50 non-zero measurements. There appears to be no information about this trait in the A genotype.
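
The two-trait approach could be set up along the following lines. This is a sketch under stated assumptions: trait is taken to be non-negative, and the data set name split and the indicator nonzero are made up for illustration.

```sas
data split;                        /* qualitative trait: is there any expression? */
   set genes;
   if trait ne . then nonzero = (trait > 0);
run;
proc freq data=split;              /* genotype by zero/nonzero table */
   tables nonzero*geno / chisq;
run;
proc glm data=split;               /* quantitative trait: nonzero values only */
   where trait > 0;
   class geno;
   model trait = geno / ss3;
run;
```

With empty or nearly empty cells, as in the A and B columns here, the chi-square statistic is only a rough guide, and the table itself is usually more informative.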

The least important assumption is that the residual errors are normal, having a "bell-shaped" histogram. The important feature is that the residuals should be symmetric about zero, with most falling within two standard deviations of zero. Many scientists make the mistake of checking the shape of the histogram of the trait measurements. Instead, one should examine the histogram of the residuals after fitting the model. Alternatively, view the spread of the residuals plotted against the predicted values. Is there evidence that measurements on one side of the zero line are more spread out than those on the other?
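
A quick look at the shape of the residuals might be obtained as follows. This sketch assumes the diag data set with residual rtrait saved earlier in this subsection.

```sas
proc univariate data=diag plot normal;  /* stem-and-leaf, box and normal probability plots */
   var rtrait;
run;
```

The plots and the skewness value are the main things to read here. The normality test that the normal option adds suffers from the same weakness in small samples as the other formal tests.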

Transformations can help correct for a non-normal histogram of residuals. However, they can upset the more important assumption of equal variance. Further, transformations can alter the interpretation of model effects. For instance, a log transformation leads to a multiplicative, or relative, interpretation of effects.
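
To make the multiplicative interpretation concrete, one could fit the model on the log scale and read the genotype means there. This is a sketch, assuming trait is strictly positive; the names genes3 and ltrait are made up for illustration.

```sas
data genes3;                       /* log scale; needs trait > 0 */
   set genes;
   ltrait = log10(trait);
run;
proc glm data=genes3;
   class geno;
   model ltrait = geno / ss3;
   lsmeans geno;                   /* genotype means on the log10 scale */
run;
```

If two genotype means differ by d on the log10 scale, the back-transformed values (geometric means on the original scale) differ by a factor of 10**d. For example, a difference of 0.30 corresponds to roughly a two-fold change, since 10**0.30 is about 2.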


Brian Yandell
Sat May 20 19:25:47 CDT 1995