Data in S-PLUS

We will simply work through the examples in Chapter 2 in the textbook. Here is an outline of the skills you should obtain.


Additional Examples to Try

Here are additional examples beyond those in the textbook.

There are a few data files in my 496 directory. You can copy them to your directory by typing these commands.

% ls ~larget/496
basketball.dat  class-list numbers
% cp ~larget/496/basketball.dat .
% cp ~larget/496/class-list .
% cp ~larget/496/numbers .
The UNIX command cp copies the file named by its first argument to the name given by its second argument. If the second argument is a directory instead of a file, cp creates a copy with the same name as the original in that directory. Recall that . represents the current directory.

Files in your current directory are simply specified by name. You may also refer to files in other directories by giving a pathname.

The symbol ~ has a special meaning. The command

% ls ~
lists the contents of your home directory. The command
% ls ~larget/496
lists the contents of user larget's subdirectory 496. (Usually, you do not have permission to read other users' directories or files.)

Use emacs to look at the three files. They all are formatted in different ways. The file numbers is a simple matrix of numbers separated by white space.

This will read the data into one long vector called x.

> x <- scan("numbers")
This will read the data into a matrix called y.
> y <- matrix(scan("numbers"),10,10,byrow=T)
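In S-PLUS, matrices are stored by column, so by default matrix fills its result one column at a time. However, scan reads the data from the file row by row. Setting the argument byrow to T (true) makes matrix fill by rows, so the matrix matches the layout of the file.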
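
The difference between filling by column and filling by row is easiest to see on a small case. The following is only a sketch: it writes a hypothetical 2-by-3 file to a temporary location (the names f, v, m.col, and m.row are made up for illustration) and does not touch the file numbers.

```r
# Hypothetical 2-by-3 data file, written to a temporary file so the
# sketch is self-contained.
f <- tempfile()
cat("1 2 3\n4 5 6\n", file=f)

v <- scan(f)                        # one long vector, read row by row
m.col <- matrix(v, 2, 3)            # default: fill down the columns
m.row <- matrix(v, 2, 3, byrow=T)   # fill across the rows, as in the file

print(m.col)   # first row is 1 3 5, which does not match the file
print(m.row)   # first row is 1 2 3, which matches the file
```

Only m.row reproduces the layout of the file, which is why byrow=T is used when reading numbers.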

Look at the documentation for the function matrix. Notice that the arguments are listed in order, data, nrow, ncol, .... In the example, the arguments were listed in the same order. This isn't necessary if you name the arguments in the command line. For example

> y <- matrix(ncol=10,data=scan("numbers"),nrow=10,byrow=T)
produces the same result.
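
You can check this equivalence on a small case. This is a sketch only: the vector v is hypothetical data standing in for scan("numbers"), and a and b are illustrative names.

```r
v <- 1:6   # hypothetical data standing in for scan("numbers")

# Positional arguments, in the documented order: data, nrow, ncol
a <- matrix(v, 2, 3, byrow=T)

# The same call with every argument named, in a different order
b <- matrix(ncol=3, byrow=T, nrow=2, data=v)

print(all(a == b))   # true when the two matrices agree in every entry
```

Naming the arguments is mostly useful when you want to skip over an argument or cannot remember the order.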

The first row of the file class-list is a header that does not contain a name for the first column. The file holds a combination of numerical and non-numerical variables. S-PLUS uses an object class called a data frame to store matrix-shaped data that is not necessarily all numeric. In this case, the first column is an identifier for each individual, the second and third columns are character variables, the fourth is numeric, and the fifth is a factor with two levels. Notice that any number of spaces and/or tabs can separate the fields.

The function read.table reads data from a file into a data frame. If the first line contains one fewer field than the other lines, it is treated as a header of column names and the first column of values is treated as row names.

By default, all character variables are converted to factors. To override this, we set the argument as.is to T (true) for the variables first and last. We do, however, want the variable gender to be treated as a factor, so its entry is F. This command reads the file into a data frame called students.

> students <- read.table("class-list",as.is=c(T,T,T,F))
You can display this data frame by typing its name.
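
The row name rule can be tried without the class file. The sketch below writes a small hypothetical file whose header has one fewer field than the data lines. The names f and d, the column name score, and all of the values are made up for illustration and are not the contents of class-list.

```r
# Hypothetical file: the header names three columns,
# but each data line has four fields.
f <- tempfile()
cat("first last score\n",
    "s01 Ann Lee 91\n",
    "s02 Bob Ray 85\n", file=f, sep="")

d <- read.table(f)

print(dimnames(d)[[1]])   # row names come from the first column
print(names(d))           # column names come from the header
print(d)
```

Because the identifiers became row names, the data frame has three variables rather than four.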

The last file has both character and numeric variables. This time the data is formatted strictly into fixed-width columns. Some of the college names include spaces, so white space cannot be used to separate the fields and we need to do something different. Also, in this case, there are no row names provided. By default, S-PLUS will search for the first non-numeric column with unique names to use as row names. Since we don't want this, we will override it. There are also no column names. By default, S-PLUS uses V1, V2, .... We can add more meaningful names with the function names.

> x <- read.table("basketball.dat",sep=c(1,5,25,29,48),row.names=NULL)
> names(x) <- c("year","winner","wscore","loser","lscore")
The argument sep gives the first column for each field. (It can also be used if some single character, such as a tab, separates the fields instead of arbitrary "white space".) Setting row.names=NULL causes the row names to simply be the corresponding row number.
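
The effect of row.names=NULL and of names can be seen on a small case. This is a sketch under stated assumptions: the file is hypothetical, its fields are separated by white space rather than laid out in fixed columns (so sep is not needed), and the team names and scores are placeholders, not values from basketball.dat.

```r
# Hypothetical file with no header and no row names
f <- tempfile()
cat("1 TeamA 70 TeamB 65\n",
    "2 TeamC 81 TeamD 77\n", file=f, sep="")

g <- read.table(f, row.names=NULL)
print(names(g))   # default column names V1, V2, ...

# Replace the default names with meaningful ones
names(g) <- c("year","winner","wscore","loser","lscore")
print(g)          # row names are simply the row numbers
```

Once the names are assigned, a column can be extracted by name, for example g$wscore.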
Last modified: March 12, 1997

Bret Larget, larget@mathcs.duq.edu