Importing data into R

Summer Institute for Training in Biostatistics (SIBS)

Reading data from plain text files

R reads data frames from plain text files (containing data in a tabular form) using the function read.table(). See help(read.table) for details. Important options:
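As a minimal sketch, the call below uses a few arguments that commonly matter: header, sep, na.strings, and stringsAsFactors. The file and its contents are made up for illustration and written to a temporary location so that the example is self-contained.

```r
# Create a small TAB-separated file (made-up data, for illustration only)
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tage\tbmi",
             "1\t54\t27.1",
             "2\t.\t31.4",
             "3\t61\t."), tmp)

dat <- read.table(tmp,
                  header = TRUE,             # first line holds the column names
                  sep = "\t",                # fields are separated by TABs
                  na.strings = ".",          # a lone period means "missing"
                  stringsAsFactors = FALSE)  # keep text columns as character

str(dat)  # inspect the column types that read.table() chose
```

Checking the result with str() right after reading is a cheap way to catch a wrong separator or an unrecognised missing-value code.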

Things to keep in mind:

Another way of reading tabular data, sometimes faster, is to use scan(). Note that scan() returns a vector or a list rather than a data frame.
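A rough sketch of how scan() might be used for a two-column file: the what argument describes the type of each column, and the resulting list can be converted to a data frame. The file contents and the column names id and value are assumptions made for this example.

```r
# Write a small whitespace-separated file without a header (made-up data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("1 2.5", "2 3.5", "3 4.5"), tmp)

# 'what' gives one template per column; scan() returns a list of vectors
cols <- scan(tmp, what = list(id = integer(), value = numeric()), quiet = TRUE)

# Convert the list to a data frame if one is needed
df_scan <- as.data.frame(cols)
```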

Reading data from an Excel spreadsheet

Spreadsheets are often the most convenient way to enter and edit tabular data. To read data from Excel spreadsheets, the safest way is to save the data as a delimited text file first, and then read it using read.table().
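For instance, if a sheet has been saved from Excel as a comma-separated (CSV) file, it could be read roughly as follows. The file here is a stand-in created in a temporary location, and its column names are made up; read.csv() is a wrapper around read.table() with defaults suited to this format.

```r
# Stand-in for a CSV file exported from a spreadsheet (made-up contents)
csv_file <- tempfile(fileext = ".csv")
writeLines(c("subject,dose,response",
             "A,10,1.2",
             "B,,2.5",
             "C,30,3.4"), csv_file)

sheet <- read.table(csv_file,
                    header = TRUE,             # first row holds column names
                    sep = ",",                 # comma-delimited
                    na.strings = c("", "NA"))  # empty cells become NA
```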

Exporting data

The simplest way to export data so that it can be read by other software (such as Excel or SAS) is to write it to a file using write.table(). Its syntax is similar to that of read.table(). The following options are useful:

To explore what these options can achieve, you can print the output to the R console instead of writing it to a file (to do this, just omit the file argument):

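data(thuesen, package = "ISwR")                        # requires the ISwR package
write.table(thuesen)                                   # default: space separated, with row names
write.table(thuesen, row.names = FALSE)                # drop the row names
write.table(thuesen, row.names = FALSE, sep = "\t")    # TAB separated
write.table(thuesen, row.names = FALSE, sep = ",")     # comma separated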
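To actually write a file, supply the file argument. The sketch below uses a small made-up data frame (so that it runs without the ISwR package) and a temporary file, then reads the file back to check that the round trip preserves the data.

```r
# Made-up data frame, for illustration only
toy <- data.frame(id = 1:3, glucose = c(15.3, NA, 10.8))

out_file <- tempfile(fileext = ".csv")
write.table(toy, file = out_file,
            sep = ",",           # comma separated, readable by Excel
            row.names = FALSE,   # do not write the row names
            quote = FALSE)       # no quotes around strings

# Read the file back; missing values were written as NA (the default)
back <- read.table(out_file, header = TRUE, sep = ",")
```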

The R data editor

There's a spreadsheet-like data editor for the Windows GUI version of R, but it's not very sophisticated. To use it, do

fix(thuesen)

or, to leave thuesen unmodified and save the edited data to another variable,

thu2 <- edit(thuesen)

Data from "foreign" software

Most data analysis software has its own data format. The foreign package has tools to read a few of the most commonly encountered formats. As biostatisticians, we may expect to encounter data from SAS, typically in the XPORT format. Such files can be read using the read.xport() function.
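A minimal sketch of such a call follows. The file name study.xpt is hypothetical, so the read is guarded by a check that the file exists.

```r
library(foreign)  # provides read.xport()

xpt_file <- "study.xpt"  # hypothetical path to a SAS XPORT file

if (file.exists(xpt_file)) {
  sas_data <- read.xport(xpt_file)  # a data frame, or a list of data frames
                                    # if the file holds several datasets
  str(sas_data)
}
```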

R Data files

R has its own format for saving datasets (or any other R object, such as a function). Once you have read in data, you might consider saving the data set in this form and reading it in again the next time you work with it. This can be useful for two reasons:
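As a sketch of the mechanics, save() writes one or more objects to a file and load() restores them under their original names. The data frame and the temporary file below are made up for illustration.

```r
# Made-up data frame standing in for data that was read from a text file
dig_small <- data.frame(id = 1:3, weight = c(60.5, NA, 72.1))

rda_file <- tempfile(fileext = ".RData")
save(dig_small, file = rda_file)  # write the object in R's own format

rm(dig_small)                     # remove it from the workspace
load(rda_file)                    # restores an object named dig_small
```

Column types, factor levels, and missing values are stored as they are, so nothing has to be re-specified when the data is loaded again.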

Exercises

  1. Save a built-in R dataset to a file, open it in Excel, edit it, and read it back into R.
  2. The Digoxin study (6800 observations on 72 variables) is a trial to examine the safety and efficacy of Digoxin in treating patients with congestive heart failure in sinus rhythm. We have:

    • Documentation for the data
    • A subset of the data containing the first 200 observations; TAB separated, with no column headers, and missing values represented by a period (.)
    • For your convenience, a file containing descriptive names of the columns. Details on what these variables represent are given in the supporting documentation. These names can be read into R as a character vector using
      names <- scan("dig-vars.txt", what = character(0))
      
    1. Read the data (sub)set into R. This should be fairly easy, but keep in mind that you have to instruct R to interpret a period as a missing value (NA) (see ?read.table).
    2. Use the variable names supplied separately as names for columns in the data frame. This can be done either when reading the data, or in a separate step after the data is read (by default, the variable names will be V1, V2, ..., V72).
    3. Read the documentation to figure out what the variables are. Make a scatter plot of Body Mass Index vs Age. What other sensible plots can you think of?
    4. Several variables should be factors, but are read as numeric. How could you convert them to factors? What's an efficient way to do this, without writing lines like
      df$RACE <- factor(df$RACE)
      
      for every variable that should be a factor? Hint: for data frames (and lists) with a variable called RACE, say,
      df[["RACE"]]  ## is the same as df$RACE
      df[["RACE"]] <- factor(df[["RACE"]]) ## can be used for replacement
      ## This also works when the name is stored in a variable
      nm <- "RACE"
      df[[nm]]
      
    5. Use this fact in conjunction with a for loop.
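As one possible sketch of the loop idea, shown on a small made-up data frame rather than the Digoxin data: the columns to convert are listed by name in a character vector (the choice of RACE and SEX here is only an assumption for illustration), and [[ ]] is used for both extraction and replacement.

```r
# Made-up data frame; all columns start out numeric
df <- data.frame(AGE = c(54, 61, 47), RACE = c(1, 2, 1), SEX = c(1, 1, 2))

factor_vars <- c("RACE", "SEX")  # hypothetical choice of columns to convert

for (nm in factor_vars) {
  df[[nm]] <- factor(df[[nm]])   # replace each listed column by a factor
}

sapply(df, class)  # check which columns are now factors
```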
Last modified: Tue Jun 21 08:58:02 CDT 2005