Chapter 9 Factorial Experiments

9.1 One-Factor Experiments

Here are the data from the flower lengths example in the 9.1 notes.

bihai = c(47.12, 46.75, 46.81, 47.12, 46.67, 47.43, 46.44, 46.64, 48.07, 48.34, 
    48.15, 50.26, 50.12, 46.34, 46.94, 48.36)
red = c(41.9, 42.01, 41.93, 43.09, 41.47, 41.69, 39.78, 40.57, 39.63, 42.18, 
    40.66, 37.87, 39.16, 37.4, 38.2, 38.07, 38.1, 37.97, 38.79, 38.23, 38.87, 
    37.78, 38)
yellow = c(36.78, 37.02, 36.52, 36.11, 36.03, 35.45, 38.13, 37.1, 35.17, 36.82, 
    36.66, 35.68, 36.03, 34.57, 34.6)

R's ANOVA functions expect data in two columns. The first consists of all N=54 measurements combined:

(flower.lengths = c(bihai, red, yellow))

##  [1] 47.12 46.75 46.81 47.12 46.67 47.43 46.44 46.64 48.07 48.34 48.15
## [12] 50.26 50.12 46.34 46.94 48.36 41.90 42.01 41.93 43.09 41.47 41.69
## [23] 39.78 40.57 39.63 42.18 40.66 37.87 39.16 37.40 38.20 38.07 38.10
## [34] 37.97 38.79 38.23 38.87 37.78 38.00 36.78 37.02 36.52 36.11 36.03
## [45] 35.45 38.13 37.10 35.17 36.82 36.66 35.68 36.03 34.57 34.60

(N = length(flower.lengths))

## [1] 54

The second column consists of labels indicating from which population each measurement came. To make such a column, I'll use the function rep(x, times) to replicate a label many times. It makes a vector of x repeated times times.

(bihai.labels = rep(x = "bihai", times = length(bihai)))

##  [1] "bihai" "bihai" "bihai" "bihai" "bihai" "bihai" "bihai" "bihai"
##  [9] "bihai" "bihai" "bihai" "bihai" "bihai" "bihai" "bihai" "bihai"

red.labels = rep(x = "red", times = length(red))
yellow.labels = rep(x = "yellow", times = length(yellow))
(flower.labels = c(bihai.labels, red.labels, yellow.labels))

##  [1] "bihai"  "bihai"  "bihai"  "bihai"  "bihai"  "bihai"  "bihai" 
##  [8] "bihai"  "bihai"  "bihai"  "bihai"  "bihai"  "bihai"  "bihai" 
## [15] "bihai"  "bihai"  "red"    "red"    "red"    "red"    "red"   
## [22] "red"    "red"    "red"    "red"    "red"    "red"    "red"   
## [29] "red"    "red"    "red"    "red"    "red"    "red"    "red"   
## [36] "red"    "red"    "red"    "red"    "yellow" "yellow" "yellow"
## [43] "yellow" "yellow" "yellow" "yellow" "yellow" "yellow" "yellow"
## [50] "yellow" "yellow" "yellow" "yellow" "yellow"

R calls a categorical variable a “factor”; we can convert the vector of character string labels to a factor with the factor() function:

(flower.species = factor(flower.labels))

##  [1] bihai  bihai  bihai  bihai  bihai  bihai  bihai  bihai  bihai  bihai 
## [11] bihai  bihai  bihai  bihai  bihai  bihai  red    red    red    red   
## [21] red    red    red    red    red    red    red    red    red    red   
## [31] red    red    red    red    red    red    red    red    red    yellow
## [41] yellow yellow yellow yellow yellow yellow yellow yellow yellow yellow
## [51] yellow yellow yellow yellow
## Levels: bihai red yellow

(The difference between a vector of character strings and a factor is subtle; we'll not worry about it.)

The familiar function lm() that we used to fit linear regression models can also do one-way ANOVA. The function anova(), when called on the output of lm(), displays the ANOVA table.

model = lm(flower.lengths ~ flower.species)
anova(model)

## Analysis of Variance Table
## 
## Response: flower.lengths
##                Df Sum Sq Mean Sq F value Pr(>F)    
## flower.species  2   1083     542     259 <2e-16 ***
## Residuals      51    107       2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

These are the values we found in lecture. (Well, the “Total” line is missing, probably because it doesn't play a role in the F test.)

As before, options(digits=15) before the anova() call can cause more decimal places to be displayed.

Here's the quick 9.1 Review example we did at the start of the 9.2/9.3 notes.

data = c(c(0, 1, 2), c(1, 2, 3), c(5, 6, 7))
sample = factor(c(rep(x = "1", times = 3), rep(x = "2", times = 3), rep(x = "3", 
    times = 3)))
model = lm(data ~ sample)
anova(model)

## Analysis of Variance Table
## 
## Response: data
##           Df Sum Sq Mean Sq F value Pr(>F)   
## sample     2     42      21      21  0.002 **
## Residuals  6      6       1                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

9.3 Two-Factor Experiments

Here's the 9.3 example on the dependence, if any, of a tool lifetime's on feed rate and speed.

lifetime.row.1 = c(60.6, 57, 61.4, 59.7, 57.8, 59.4, 62.8, 58.2, 56.5, 52.3, 
    58.1, 53.9)
lifetime.row.2 = c(51.2, 53.1, 48.3, 51.6, 49.6, 48.1, 49.8, 51.1, 45.7, 48.6, 
    45, 49.2)
lifetime.row.3 = c(44.8, 46.7, 41.9, 51.3, 46.6, 41.4, 38.3, 37.9, 37.2, 32.8, 
    39.9, 35.9)
(lifetime = c(lifetime.row.1, lifetime.row.2, lifetime.row.3))

##  [1] 60.6 57.0 61.4 59.7 57.8 59.4 62.8 58.2 56.5 52.3 58.1 53.9 51.2 53.1
## [15] 48.3 51.6 49.6 48.1 49.8 51.1 45.7 48.6 45.0 49.2 44.8 46.7 41.9 51.3
## [29] 46.6 41.4 38.3 37.9 37.2 32.8 39.9 35.9

feed.rate.row.1 = rep(x = "light", times = 12)
feed.rate.row.2 = rep(x = "medium", times = 12)
feed.rate.row.3 = rep(x = "heavy", times = 12)
(feed.rate = factor(c(feed.rate.row.1, feed.rate.row.2, feed.rate.row.3)))

##  [1] light  light  light  light  light  light  light  light  light  light 
## [11] light  light  medium medium medium medium medium medium medium medium
## [21] medium medium medium medium heavy  heavy  heavy  heavy  heavy  heavy 
## [31] heavy  heavy  heavy  heavy  heavy  heavy 
## Levels: heavy light medium

speed.row.1 = c(rep(x = "slow", times = 4), rep(x = "medium", times = 4), rep(x = "fast", 
    times = 4))
speed.row.2 = speed.row.1  # copy row 1
speed.row.3 = speed.row.1  # copy row 1 again
(speed = factor(c(speed.row.1, speed.row.3, speed.row.3)))

##  [1] slow   slow   slow   slow   medium medium medium medium fast   fast  
## [11] fast   fast   slow   slow   slow   slow   medium medium medium medium
## [21] fast   fast   fast   fast   slow   slow   slow   slow   medium medium
## [31] medium medium fast   fast   fast   fast  
## Levels: fast medium slow

model = lm(lifetime ~ feed.rate + speed + feed.rate : speed)  # ':' indicates an interaction term
options(digits = 6)
anova(model)

## Analysis of Variance Table
## 
## Response: lifetime
##                 Df Sum Sq Mean Sq F value   Pr(>F)    
## feed.rate        2 1718.4   859.2 117.451 4.77e-14 ***
## speed            2  224.2   112.1  15.323 3.57e-05 ***
## feed.rate:speed  4   48.8    12.2   1.666    0.187    
## Residuals       27  197.5     7.3                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Data frames (optional)

Rather than keeping these data in several parallel vectors, it's probably easier to keep them in a “data frame,” which is R's equivalent of a spreadsheet. It's a table of several observations, one per row, of several variables, one per column. The function data.frame(...) creates a data frame from the vectors in “...”.

(d = data.frame(lifetime, feed.rate, speed))

##    lifetime feed.rate  speed
## 1      60.6     light   slow
## 2      57.0     light   slow
## 3      61.4     light   slow
## 4      59.7     light   slow
## 5      57.8     light medium
## 6      59.4     light medium
## 7      62.8     light medium
## 8      58.2     light medium
## 9      56.5     light   fast
## 10     52.3     light   fast
## 11     58.1     light   fast
## 12     53.9     light   fast
## 13     51.2    medium   slow
## 14     53.1    medium   slow
## 15     48.3    medium   slow
## 16     51.6    medium   slow
## 17     49.6    medium medium
## 18     48.1    medium medium
## 19     49.8    medium medium
## 20     51.1    medium medium
## 21     45.7    medium   fast
## 22     48.6    medium   fast
## 23     45.0    medium   fast
## 24     49.2    medium   fast
## 25     44.8     heavy   slow
## 26     46.7     heavy   slow
## 27     41.9     heavy   slow
## 28     51.3     heavy   slow
## 29     46.6     heavy medium
## 30     41.4     heavy medium
## 31     38.3     heavy medium
## 32     37.9     heavy medium
## 33     37.2     heavy   fast
## 34     32.8     heavy   fast
## 35     39.9     heavy   fast
## 36     35.9     heavy   fast

The calculations are almost the same as before, except that now we use the form lm(formula, data), where formula uses column names from the data frame data:

model = lm(formula = lifetime ~ feed.rate + speed + feed.rate : speed, data = d)
anova(model)

## Analysis of Variance Table
## 
## Response: lifetime
##                 Df Sum Sq Mean Sq F value   Pr(>F)    
## feed.rate        2 1718.4   859.2 117.451 4.77e-14 ***
## speed            2  224.2   112.1  15.323 3.57e-05 ***
## feed.rate:speed  4   48.8    12.2   1.666    0.187    
## Residuals       27  197.5     7.3                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can write a data frame x to a comma-separated values (“.csv”) file named file via write.csv(x, file). e.g. write.csv(x=d, file="lifetime.csv") would create the file “lifetime.csv”. Excel can read a “.csv” file.

We can read a “.csv” file (created by R, typed by hand, or created by Excel or other software) into a data frame via read.csv(file). e.g. d = read.csv("lifetime.csv") would create the data frame d from the file “lifetime.csv”.

R's built-in help (optional)

Running ?command opens the Help tab of RStudio with documentation of command (if it exists). e.g. To learn more about read.csv(), run ?read.csv in the console.

I should have mentioned this in the first chapter's R Guide! I didn't because R's help can be intimidating to a beginner, because it often says way more than he or she wants to know. So I should have mentioned it in the second chapter, with the warning that you need to learn to ignore parts of it to find what you need.

Using google to find help with R is usually successful. e.g. try googling “R read.csv” and checking the first few links returned.

9.5 2^p Factorial Experiments