SAS Data Steps

Data Input
Data Transformations
Data Manipulation
Random Numbers
Experimental Design Using SAS

Data Input

Data may be entered directly in your SAS program (myfile.sas) or read from a separate file. For direct input, here is an example:

data direct;
   input x y;
   cards;
1 17.5
3 20.5
;

Here is an example of data input from another file:

data pulse;
   infile '/p/stat/Data/MJ/pulse.dat' missover;
   input x y;

The names direct and pulse are arbitrary, but can be used later in your SAS program to identify this particular data set. Details of input phrases (use either infile or cards, but not both):

data a;				 create new data set named "a"
   input x y z;			 input 3 numbers at a time as variables x,y,z
   input trt $ x y;		 input treatment "trt" as a character string
				 and x,y as numbers. Note the dollar sign ($).
   infile 'blah.dat' missover;	 use file "blah.dat" for the data
				 "missover" sets the remaining variables to
				 missing when a line runs out of values,
				 rather than reading on into the next line
				 (infile must appear BEFORE the input phrase)
   infile 'blah.dat' firstobs=2; start reading at line 2 (skip the first line)
				 handy if line 1 documents the column names
   infile 'blah.dat' lrecl=2000; allow for really long records
   datalines;			 same as cards
   cards;			 read data from following lines
				 (must appear AFTER the input phrase)
;				 end of data entry for "cards" phrase
				 (good convention, but not required)
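
As a rough illustration of how these phrases fit together, the sketch below reads a file that has a header line, a character treatment code, and two numbers per line. The file name blah.dat is taken from the list above; the data set name field and the layout of the file are assumptions made only for this example.

```sas
data field;
   /* infile comes first; several options may share one infile phrase */
   infile 'blah.dat' firstobs=2 missover lrecl=2000;
   /* trt is read as a character string, x and y as numbers */
   input trt $ x y;
run;

proc print data=field;
run;
```

Printing the data set right after input is a cheap way to check that the header line was skipped and that short lines produced missing values rather than shifted ones.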

Data values must be separated by spaces (tabs can cause problems on some systems). With the missover option, all values for an observation must be on the same line. Missing data are represented by a period (.) as a placeholder. Missing values are also useful for prediction: add cases with the new x values and a missing response, and proc reg can report predicted values for them while leaving them out of the fit.
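
Here is a minimal sketch of the prediction idea, assuming a straight-line fit is reasonable. The first two data lines come from the direct example above; the remaining values are made up for illustration, and the names newx, pred and yhat are arbitrary.

```sas
data newx;
   input x y;
   datalines;
1 17.5
3 20.5
5 24.0
7 26.5
4 .
;
run;

proc reg data=newx;
   model y = x;               /* the case with missing y is not used in the fit */
   output out=pred p=yhat;    /* yhat holds predicted values, including at x=4 */
run;
quit;

proc print data=pred;
run;
```

The last case has a period in place of y, so it does not affect the estimates, but it still gets a predicted value in the output data set.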

Data Transformations

There is no need to transform your raw data outside of SAS. In fact, it is good practice to leave your data file alone once it is debugged. Transforms are usually done in a separate data paragraph after data input. In that paragraph, a set phrase names the data set created earlier. An example:

data logs; set direct;
   logy = log(y);

This creates a new data set logs from the set direct from data input above. The variable logy is created as the natural log of the variable y. Here are details of the first line and some transformations:

data a; set b;		create data set "a" using existing set "b"
   z = log(y);		create variable z as natural log of variable y
   z = log10(y);	log base 10
   z = sqrt(y);		square root
   z = x*y;		multiplication
   			(+ addition) (- subtraction) (/ division)
   z = y**2;		exponent: "y squared" or "y to the 2nd power"
   z = y**0.5;		"y to the 1/2 power" (same as sqrt(y))
   z = x**-2;		negative exponent: "1 over (x squared)"
   z = sin(x);		trigonometric sine function of x
			(also cos(x), tan(x), ...)
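
One caution worth sketching: the log of zero or of a negative number is undefined, and SAS sets the result to missing and writes a note in the log. The example below guards against this explicitly. The data set direct is recreated here so the sketch runs on its own; the third data line is invented to show the guard at work.

```sas
data direct;
   input x y;
   datalines;
1 17.5
3 20.5
5 0
;
run;

data logs; set direct;
   if y > 0 then logy = log(y);   /* natural log, defined only for y > 0 */
   else logy = .;                 /* otherwise leave logy missing */
run;

proc print data=logs;
run;
```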

Variance Stabilizing Transformations

data a; set b;
   z = sqrt(count);		/* counts (Poisson distribution) */
				/* variance proportional to mean */
   z = log(conc);		/* concentrations, weights (log normal) */
				/* SD proportional to mean */
				/* constant coefficient of variation (CV) */
   z = arsin(sqrt(prop));	/* proportions (0-1) */
   z = arsin(sqrt(pct/100));	/* percentages (0-100) */
				/* (Binomial distribution) */
				/* variance highest in the middle (near 0.5) */
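
A small simulation can make the idea concrete. The sketch below generates Poisson counts at three means and compares the variance of the raw counts with the variance of their square roots. The means (5, 20, 80), the number of replicates, the seed 12345, and the names sim, mu and rep are all arbitrary choices for illustration.

```sas
data sim;
   do mu = 5, 20, 80;                 /* three Poisson means */
      do rep = 1 to 1000;
         count = ranpoi(12345, mu);   /* Poisson count with mean mu */
         z = sqrt(count);             /* square root transform */
         output;
      end;
   end;
run;

proc means data=sim mean var;
   class mu;
   var count z;
run;
```

The variance of count should grow with mu, while the variance of z should stay roughly the same across the three groups.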

Data Manipulation

You can add or drop variables and/or observations from a dataset. For instance, if you only wanted to consider the data with x greater than 10, you could have:

data other; set big;			/* create other from big */
   if x > 10;				/* only use these cases */

Suppose the data set field has 3 treatments (control, wet, dry) and you want to delete the control group for some procedures:

data trtonly; set field;		/* create trtonly from field */
   if trt = 'control' then delete; 	/* delete control group */

Here is some more detail on the if phrase:

   g = 0;				/* g=0 for large x */
   if x < 10 then g = 1;		/* g=1 for small x */

   if y = 99 then y = .;		/* recode 99 as missing data */
   if y = . then y = 0;			/* recode missing data as 0 */

   if z < 10 or y > 10 then x = 5;	/* examples of union (or) */
   if z < 10 and y > 10 then x = 6;	/* and intersection (and) */

   if x <= 10;				/* keep only x at most 10 */
   if x >= 10;				/* keep only x at least 10 */
   if not (x = 10);			/* keep only if x is not 10 */
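
One point to keep in mind with comparisons: SAS treats a missing numeric value as smaller than any number, so a test such as x < 10 is true when x is missing. The sketch below handles the missing case first and uses else so that only one branch applies. The small data set big is made up for this example, and grouped is an arbitrary name.

```sas
data big;
   input x;
   datalines;
3
.
12
;
run;

data grouped; set big;
   if x = . then g = .;           /* keep missing x as missing g */
   else if x < 10 then g = 1;     /* g=1 for small x */
   else g = 0;                    /* g=0 for large x */
run;

proc print data=grouped;
run;
```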

You already saw how to add variables in transformations above. You can drop variables:

data a; set b;
   z = log(y);				/* create new variable z */
   drop y;				/* drop old variable y */

Usually dropping is NOT done because the cost of carrying the unused variables is very small (unless you have a lot of data!). However, this is sometimes useful if the data need to be presented in a different way. For instance,

data abc;
   input n0 n1 n2 n3 n4 n5;
   cards;
1.4 1.5 1.2 2.1 2.1 2.8
1.7 1.4 1.0 1.4 1.7 2.1
1.1 1.9 2.5 2.6 2.1 2.2
1.7 1.3 1.1 1.0 2.0 1.8
1.0 1.8 1.5 1.4 2.2 2.3
;
data resps; set abc;
   resp = n0; level = 0; output;
   resp = n1; level = 1; output;
   resp = n2; level = 2; output;
   resp = n3; level = 3; output;
   resp = n4; level = 4; output;
   resp = n5; level = 5; output;
   drop n0--n5;

Each output phrase writes a new observation using the current values of resp and level, so every line of abc becomes six observations in resps, one for each level.
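
The same rearrangement can be written more compactly with an array and a do loop, which may be easier to maintain when there are many columns. This is only a sketch of one alternative: the array name n and index i are choices made here, and only the first two data lines from above are repeated to keep it short.

```sas
data abc;
   input n0 n1 n2 n3 n4 n5;
   datalines;
1.4 1.5 1.2 2.1 2.1 2.8
1.7 1.4 1.0 1.4 1.7 2.1
;
run;

data resps; set abc;
   array n{6} n0-n5;        /* n{1} is n0, ..., n{6} is n5 */
   do i = 1 to 6;
      resp = n{i};
      level = i - 1;        /* levels run from 0 to 5 */
      output;               /* one new observation per level */
   end;
   drop n0-n5 i;
run;

proc print data=resps;
run;
```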

Random Numbers

Random numbers are available for a wide variety of distributions. These can also be used to generate experimental designs. It is best to use the functions with names beginning with ran -- the uniform function ranuni appears to be better behaved than the function uniform using standard tests. But remember, computer-generated random numbers are never truly random -- caution and some checking on your own are always a good idea. Random numbers can be generated in a data paragraph:

data a;
   do i=1 to 10;
      uni=ranuni(0);    /* an argument of 0 uses the clock as a seed */
		        /* otherwise, use a 5 to 7 digit odd number */
      output;
   end;

Note the use of a do loop, which is ended by an end; phrase. The output forces creation of a new case for each uniform number. Each case in set a will have the variables uni and i. Here are the random number generators:

   x = ranuni(seed);		/* uniform between 0 & 1 */
   x = a+(b-a)*ranuni(seed);	/* uniform between a & b */
   x = ranbin(seed,n,p);	/* binomial size n prob p */
   x = rancau(seed);		/* cauchy with loc 0 & scale 1 */
   x = a+b*rancau(seed);	/* cauchy with loc a & scale b */
   x = ranexp(seed);		/* exponential with scale 1 (mean 1) */
   x = ranexp(seed) / a;	/* exponential with rate a (mean 1/a) */
   x = a-b*log(ranexp(seed));	/* extreme value loc a & scale b */
   x = rangam(seed,a);		/* gamma with shape a */
   x = b*rangam(seed,a);	/* gamma with shape a & scale b */
   x = 2*rangam(seed,a);	/* chi-square with d.f. = 2*a */
   x = rannor(seed);		/* normal with mean 0 & SD 1 */
   x = a+b*rannor(seed);	/* normal with mean a & SD b */
   x = ranpoi(seed,a);		/* poisson with mean a */
   x = rantri(seed,a);		/* triangular on (0,1) with peak at a */
   x = rantbl(seed,p1,p2,p3);	/* random from (1,2,3) with probs */
				/* p1,p2,p3 */

The seed above is either 0 (use clock to randomly start sequence); positive (used as initial seed -- it should be odd and less than 2**31-1); or negative (use the clock to restart the sequence every time). The performance is untested for 0 or negative seed -- use at your own risk. The seed is only examined on the first encounter with a random number generator in your program, so you cannot change the process once you begin.
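
To illustrate the effect of a fixed positive seed, the sketch below generates the same five uniform numbers in two separate data paragraphs and then compares them. The seed 12345 and the names a1 and a2 are arbitrary choices for this example.

```sas
data a1;
   do i = 1 to 5;
      uni = ranuni(12345);   /* fixed positive seed */
      output;
   end;
run;

data a2;
   do i = 1 to 5;
      uni = ranuni(12345);   /* same seed, so the same sequence */
      output;
   end;
run;

proc compare base=a1 compare=a2;
run;
```

Because each data paragraph starts its sequence from the same seed, the two sets should match, which is useful when a design or simulation has to be reproduced later. With a seed of 0 the two sets would generally differ.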

Experimental Design Using SAS

Experimental designs can be laid out using SAS. Here is an example of a design with 4 treatments and 5 replicates per treatment. Suppose set b has identifiers called id. This assigns trt the values 1,2,3,4, each with 5 replicates.

data uniform;
   do i = 1 to 20;
      x = ranuni(0);
      output;
   end;
data a;
   merge b uniform;
proc sort; by x;
data c; set a;			/* _N_ = observation number */
   trt = ceil(_N_ / 5);		/* ceil = round up to next integer */
proc sort; by id;
proc print;
   var id trt;
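
For a version that runs on its own, the sketch below first builds a stand-in for set b with id values 1 to 20 (an assumption, since the real set b is not shown), repeats the steps above, and then tabulates trt as a check on the layout.

```sas
data b;                      /* hypothetical stand-in for set b */
   do id = 1 to 20;
      output;
   end;
run;

data uniform;
   do i = 1 to 20;
      x = ranuni(0);
      output;
   end;
run;

data a;
   merge b uniform;          /* pair each id with a random number */
run;

proc sort data=a; by x;      /* random order */
run;

data c; set a;
   trt = ceil(_N_ / 5);      /* first 5 get trt 1, next 5 get trt 2, ... */
run;

proc freq data=c;
   tables trt;               /* each treatment should appear 5 times */
run;
```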

Here is a randomized complete block design, with 3 blocks and 4 treatments per block. We assign the treatments 1,2,3,4 at random to the 4 sites within a block.

data a;
   do block = 1 to 3;
      do site = 1 to 4;
         x = ranuni(0);
         output;
      end;
   end;
proc sort; by block x;
data c; set a;
   trt = 1 + mod(_N_ - 1, 4);	/* mod = remainder of (_N_ - 1)/4 */
proc sort; by block site;
proc print;
   var block site trt;
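
An alternative sketch numbers the treatments with a by phrase and a counter that restarts in each block, which does not depend on every block having exactly 4 sites. This is one possible variation rather than the method used above; the proc freq step at the end is added only as a check.

```sas
data a;
   do block = 1 to 3;
      do site = 1 to 4;
         x = ranuni(0);
         output;
      end;
   end;
run;

proc sort data=a; by block x;     /* random order within each block */
run;

data c; set a;
   by block;
   if first.block then trt = 0;   /* restart the counter in each block */
   trt + 1;                       /* sum phrase: trt keeps its value across cases */
run;

proc freq data=c;
   tables block*trt / norow nocol nopercent;
run;
```

Each cell of the block by trt table should contain a count of 1, confirming that every treatment appears once in every block.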

Last modified: Tue Feb 6 14:12:35 1996 by Brian Yandell Tue Feb 14 11:09:50 1995 by Stat Www (statwww@stat.wisc.edu)