More Advanced Features in SAS

More Complicated Designs
Nested or Split Plot Designs
Multiple Responses (MANOVA and DA)
Repeated Measures
Multidimensional Preference
Categorical Data
Generalized Estimating Equation

Return to SAS Introduction or Information on SAS.

More Complicated Designs

Again, use proc glm if there is imbalance.

proc anova;		/* randomized complete block */
   class block trt;
   model y = block trt;

proc anova;		/* three-way anova with all interactions */
   class a b c;
   model y = a b c a*b a*c b*c a*b*c;

proc anova;		/* three-way anova--short hand */
   class a b c;
   model y = a | b | c;	/* use a vertical bar between factors a,b,c */

The nested design below uses the test phrase, which allows one to test the whole plot factor (a) against the first error (a*rep), while testing the split plot (or nested) factor (b) against the second error (the error part from fit). Here are three ways to do this:

proc anova;		/* nested anova for balanced design */
   class block var fert;
   model y = fert block(fert) var var*fert;
   test h = fert e = block(fert);

proc glm;		/* nested anova -- more general form */
   class block var fert;
   model y = fert block(fert) var var*fert;
   test h = fert e = block(fert);
   random block(fert) / test;
   lsmeans var var*fert / stderr pdiff;
   lsmeans fert / stderr pdiff e = block(fert);

The proc glm form almost works if there are missing data. However, the lsmeans for the whole-plot factor (fert) does not work properly; the standard errors of estimates are usually wrong for nested designs, and some combinations are deemed not-estimable when there is missing data. The random phrase with the test option uses the Satterthwaite approximation, which is somewhat better than the test phrase, as it matches expected mean squares.

The newer proc mixed (version 6.0.9) handles nested designs in a natural way, fixing many of the shortcomings of the proc glm approach. Here is the setup for the split plot used above.

proc mixed;
   class block var fert;
   model yield = fert var var*fert / outpm = blue;
   random block block*fert / solution;
   lsmeans fert | var;
   ods output SolutionR = random;
proc sort data = blue; by fert block;
proc plot data = blue; by fert block;
    plot Resid*Pred = var;

The preferred way to test fixed effects is with the anova tests that come naturally with proc mixed. However, inference for random effects should be done by comparing likelihood ratios with and without the variance component of interest. This involves running proc mixed twice. ThHere is a SAS macro called compmix that can assist in this process.

The above code includes a residual plot of the predicted (estimated fixed effects) against the residual (random effects). If you want to examine predictors of random effects (BLUPs) or separate residuals based on the size of experimental unit, here is some code that might help using proc mixed. If the equation is

    y = Xb + Zu + e

then Pred in dataset blue estimates Xb, Pred in dataset blup predicts Xb+Zu, and Resid in each of these is y-Pred. The dataset random has Estimate equal to the predictors of u. See the Syntax and Linear Models Theory for proc mixed for further details. Here is the code:

proc mixed;
   class block var fert;
   model yield = fert var var*fert / outpp = blup outpm = blue;
   random block block*fert / solution;
   ods output SolutionR = random;
/* Output data set random contains the random effects */
proc print data = random;
/* sort out whole plot and split plot residuals */
data blup; set blup; /* best linear unbiased predictors */
    blup = pred;
    spresid = resid; /* split plot residual */
    drop Pred L95 U95 Resid;
data blue; set blue; /* best linear unbiased estimators */
    blue = pred; /* estimate of fixed effects */
    wpresid = resid; 
    drop Pred L95 U95 Resid;
proc sort data = blup; by fert block var;
proc sort data = blue; by fert block var;
data both; merge blue blup; by fert block var;
    wpresid = wpresid - spresid; /* whole plot residual */
proc sort data = both; by fert block;
proc plot data = both; by fert block;
    plot spresid*wpresid = var;

Multiple Responses (MANOVA and DA)

It is possible to do the same analysis on multiple responses by stacking them up on the left side of the model. However, it may be wise to consider only one response until you are sure about the type of analysis you are doing -- save paper! This works for most procedures with a model phrase; here is a reg example:

proc reg;		/* fit multiple responses in regression */
   model y1 y2 y3 = x;

Sometimes multiple responses are correlated, and it is desireable to examine their relationship to each other adjusted by the experimental design. Alternatively, one may simply want to identify the "best" response, or linear combination of responses, which shows off treatment differences. Several related approaches and procedures are available. Note that these multivariate methods may be less powerfull than univariate analyses if responses are not correlated. Further, there is no guarantee that multivariate analysis leads to ready, intuitive interpretation.

Multivariate ANOVA (MANOVA)
Discriminant Analysis (DA)
Step-Down Analysis

Multivariate ANOVA, or MANOVA, includes several related methods for examining these questions. It is invoked by a manova phrase with either proc anova or proc glm.

proc glm;			/* multivariate analysis of variance */
   class a;
   model y1 y2 y3 = a;
   manova h = a / printe printh; /* see below for printe, printh */

The manova phrase results in several overall multivariate tests (Wilk's lambda, Pillai's trace, Hotelling-Lawley trace and Roy's greatest root) which, in different ways, reduce the responses on the left of the model to one dimension. The printe option produces the error covariance matrix (almost -- need to divide by degrees of freedom!) and the matrix of partial correlations among the responses. The printh option yields the hypothesis SS matrix for the factor (a).

Discriminant Analysis (DA) is a closely related technique which allows one to identify linear combinations of responses that do the "best" in discriminating among groups. The goal is to reduce "dimensionality" by either focusing on only a few responses (stepwise) or finding a few linear combinations (canonical). Often these two approaches are used in tandem.

Stepwise DA uses proc stepdisc to pick the most significant responses, in order, that explain group differences. It can also be performed using a sequence of analyses of covariance (ANCOVA) via proc glm and checking factors adjusted for covariates; however this produces a lot more output!.

proc stepdisc data=iris bsscp tsscp;	/* Stepwise DA */
   class species;
   var seplen sepwid petlen petwid;

proc glm;		/* Stepwise DA using PROC GLM (i.e. ANCOVA) */
   class species;
   model seplen sepwid petlen petwid = species / ss1;
proc glm;		/* check species adjusted for petlen */
   class species;
   model seplen sepwid petwid = petlen species / ss1;
proc glm;		/* check species adjusted for petlen+sepwid */
   class species;
   model petwid = petlen sepwid species / ss1;
proc glm;		/* check species adjusted for 3 covariates */
   class species;
   model seplen = petlen sepwid petwid species / ss1;

Canonical DA can use all multiple responses, or a subset determined by Stewise DA. The eigenvalues, in decreasing value, indicate the ratio of SS_H to SS_E for each linear combination of responses. Typically, one only examines the significant canonical variates, as assessed by an F-test. Canonical variate correlations (called Total Canonical Structure) are a good method to assess the "importance" of responses by noting how strongly they are correlated with canonical variates. The weights (called Raw Canonical Coefficients) are poor indicators of importance as they are greatly influenced by correlation among responses.

proc candisc out=can distance anova;	/* canonical DA */
   class species;
   var seplen sepwid petlen petwid;
proc plot data=can;               /* plot of first two canonical variates */
   plot can2*can1=species;

Step-Down Analysis assumes that the scientist has a predetermined idea of cause and effect to guide analysis of group differences. That is, groups may affect y1, which in turn affects y2 and so on. This can be analysed using a sequence of ANCOVA via proc glm and checking factors adjusted for covariates, much as in Stepwise DA above. The difference here is that the order of entry of responses is specified in advance by the scientist (based on knowledge from other experiments) rather than ordering based on data analysis.

proc glm;			/* Step-Down Analysis */
   class a;
   model y1 = a / ss1;
proc glm;
   class a;
   model y2 = y1 a / ss1;	/* check factor a adjusted for y1 */
proc glm;
   class a;
   model y3 = y1 y2 a / ss1;

Repeated Measures

Many experiments have repeated measurements on the same experimental unit (e.g., over time on humans, animals or plants). In these situations, there is some concern about correlations between successive measurements. Both proc anova and proc glm have repeated measures options, as shown below. However, they are correct only for balanced experiments.

proc anova;		/* repeated measures with polynomials */
   class trt;
   model t1 t2 t3 t4 = trt;
   means trt;
   repeated time 4 (1 2 3 4) polynomial / summary;

proc glm;		/* repeated measures with profile */
   class trt;
   model t1 t2 t3 t4 = trt;
   means trt;
   repeated time 4 (1 2 3 4) profile / summary;

The repeated phrase has several components: variable name (time), number of levels (4), values of levels (1,2,3,4), types of curves over time (polynomial or profile). The option summary insures that the polynomial or profile analyses are printed.

Several overall multivariate tests (Wilk's lambda, Pillai's trace, Hotelling-Lawley trace and Roy's greatest root) reduce, in different ways, the responses over time to one dimension. These tests are the same ones considered for the manova phrase discussed above.

Then new proc mixed (version 6.0.9) handles repeated measures in a very natural way. It is newer, more general and can handle missing data. However, its options and setup are different from the above. Best to check out (Technical Report P-229) for full details. Here are three ways to fit the compound symmetry model (with repeated measures over time on each unit):

proc mixed;		/* compound symmetry */
   class trt unit;
   model y = trt | time / solution;
   repeated / type = cs subject = unit;
   lsmean trt | time;

proc mixed;		/* compound symmetry */
   class trt unit;			/* note change in DDF */
   model y = trt | time / solution;
   random unit / solution;

proc mixed;		/* compound symmetry */
   class trt unit;			/* no ratio estimate */
   model y = trt | time;
   repeated intercept diag / subject = unit;

The solution option to the model (or random) phrase prints the solution for the fixed (or random) effects. The lsmeans phrase computes correct estimates and standard errors, but does not have the same options as found for proc glm. Note in addition the different form of the repeated phrase from that used in proc anova and proc glm. Notice also the difference between treating time as fixed (repeated phrase used) or random (random phrase used).

You can also fit autoregressive models of order p (type=ar(p)) or order n (type=toep). Now there are a wide variety of other options as well.

proc mixed;		/* auto-regression(1) */
   class trt unit;
   model y = trt | time;
   random intercept time / type = ar(1) subject = unit;

Here are two ways to fit a multivariate model with unstructured covariance matrix. The first treats effects over time as random, while the second treates time as fixed.

proc mixed;			/* unstructured (multivariate) */
   class trt unit;		/* random coefficients */
   model y = trt | time;
   random intercept time / type = un subject = unit;

proc mixed;			/* unstructured (multivariate) */
   class trt unit;		/* fixed coefficients */
   model y = trt | time;
   repeated / hlps hlm type = un subject = unit;

The multivariate statistic of Hotelling-Lawley-McKeon (hlm) or Hotelling-Lawley-Pillai-Samson (hlps) is availabe with the repeated phrase when type = un is specified. Both are superior to the F statistic for small samples; hlm appears to be slightly preferred Other MV test (Roy's greatest root, Wilk's lambda, Pillai trace) are apparently not available.

Multidimensional Preference

Multidimensional preference (or multidimensional unfolding, or MDPREF) is a principal components analysis which projects both rows and columns of a preference matrix onto a common "preference space" using a "biplot". This technique has strong connections to multidimensional scaling (MDS), which would consider projecting only the rows or the columns, and correspondence analysis, which explicitly measures preference in terms of counts of events. The procedure prinqual can be used for analysis (see example in SAS/STAT for details). Here is an abbreviated example that projects onto n=2 dimensions, replacing observations by transformed data and including PC scores and correlations. The _type_ variable distinguishes 'SCORE' from the row identifiers.

proc prinqual n=2 out=results replace standard scores correlations;
   id row;
   transform monotone(col1-col20);
proc sort data=results; by prin1;
proc print;
   var row _type_ prin1 prin2;

Procedures for categorical responses are inherently more complicated than linear models for continuous (roughly normal) responses. SAS procedures in this area have been evolving over the years. Initially there were freq and funcat. The categorical modelling procedure catmod replaced funcat in the late 1980s. The new proc genmod (version 6.0.9) offers a unified method to do generalized linear modesl (in the same fashion as GLIM and S).

Frequency Tables

Frequency tables can be produced with proc freq, along with the chi-square statistic and other test statistics. The catmod procedure operates similarly to reg or glm, allowing one to develop a model for each category (or cell) of the frequency table. It then computes a summary table like the analysis of variance table, but based on chi-square tests rather than F tests. Here are typical uses for frequency tables:

proc freq;			/* one entry for each subject */
   tables sex;			/* one-way table */

proc freq;			/* one entry for each cell */
   tables infilt*score;		/* two-way table */
   weight count;		/* cell weighted as "count" subjects */

Options, as for other procedure, can be combined. The first set request additional or reduced information by cell:

   tables infilt*score / expected;	/* expected values */
   tables infilt*score / deviation;	/* observed - expected */
   tables infilt*score / cellchi2;	/* contribution to chi-square */
   tables infilt*score / nocol;		/* no column percentage */
   tables infilt*score / norow;		/* no row percentage */
   tables infilt*score / nopercent;	/* no percentages */
   tables infilt*score / nofreq;	/* no frequencies */
   tables infilt*score / missprint;	/* show missing value freqs */
   tables infilt*score / sparse;	/* condense for sparse table */

The following options specify statistical tests:

   tables infilt*score / chisq;		/* chisquare homogeneity test */
   tables infilt*score / alpha=.05 chisq; /* chisquare at level .05 */
   tables infilt*score / cmh;		/* tests for association */
   tables infilt*score / exact;		/* Fisher's exact test */
   tables infilt*score / measures;	/* measures of association */
   tables infilt*score / all;		/* all of the above */

Categorical Modelling

Categorical modelling with proc catmodis tricky. Best to consult the SAS/STAT User's Guide. Here are two very simple examples. The first concerns fitting a straight line to the probability of infiltration regressed on score:

proc catmod;
   direct score;
   response means;
   model infilt = score;
   weight count;

The direct phrase states that score is a regression type predictor. The response phrases sets up the linear regression response.

The second example performs logistic regression of y (=1 for response, =0 for no response) on dose:

proc catmod;
   weight count;
   direct dose;
          response clogits;
   model y = dose / freq ml nogls;

It is possible to save predicted values using an out= phrase; however, the setup is rather different than for glm or reg. Again, it is best to consult the User's Guide and/or get consulting help.

Generalized Linear Models

The new proc genmod (version 6.0.9, TR P-243) offers a unified method to do generalized linear models (in the same fashion as GLIM and S). Anyone want to fill out details here?

Generalized Estimating Equation

Rezaul Karim and Scott Zeger wrote an IML macro to implement their generalized estimating equation approach to fitting GLIM type models for longitudinal data. This may be of interest to your users. It may give a syntax error message in the log file (all the results are correct). The relevant files are

gee.sas		test program
geemac.sas	GEE IML macro
gee1.dat	test data
gee.doc		documentation for Gee macro (including test output)

Contact Karim or Zeger if you want to obtain this. (It may be at StatLib)

Return to U WI Statistics Home Page

Last modified: Thu Apr 2 09:07:24 1998 by Brian Yandell Wed Mar 22 11:21:45 1995 by Stat Www (statwww@stat.wisc.edu)