Here are the data from the flower lengths example in the 9.1 notes.
bihai = c(47.12, 46.75, 46.81, 47.12, 46.67, 47.43, 46.44, 46.64, 48.07, 48.34,
48.15, 50.26, 50.12, 46.34, 46.94, 48.36)
red = c(41.9, 42.01, 41.93, 43.09, 41.47, 41.69, 39.78, 40.57, 39.63, 42.18,
40.66, 37.87, 39.16, 37.4, 38.2, 38.07, 38.1, 37.97, 38.79, 38.23, 38.87,
37.78, 38)
yellow = c(36.78, 37.02, 36.52, 36.11, 36.03, 35.45, 38.13, 37.1, 35.17, 36.82,
36.66, 35.68, 36.03, 34.57, 34.6)
R's ANOVA functions expect data in two columns. The first consists of all N=54 measurements combined:
(flower.lengths = c(bihai, red, yellow))
## [1] 47.12 46.75 46.81 47.12 46.67 47.43 46.44 46.64 48.07 48.34 48.15
## [12] 50.26 50.12 46.34 46.94 48.36 41.90 42.01 41.93 43.09 41.47 41.69
## [23] 39.78 40.57 39.63 42.18 40.66 37.87 39.16 37.40 38.20 38.07 38.10
## [34] 37.97 38.79 38.23 38.87 37.78 38.00 36.78 37.02 36.52 36.11 36.03
## [45] 35.45 38.13 37.10 35.17 36.82 36.66 35.68 36.03 34.57 34.60
(N = length(flower.lengths))
## [1] 54
The second column consists of labels indicating from which population
each measurement came. To make such a column, I'll use the function
rep(x, times) to replicate a label many times. It makes a vector of
x repeated times times.
(bihai.labels = rep(x = "bihai", times = length(bihai)))
## [1] "bihai" "bihai" "bihai" "bihai" "bihai" "bihai" "bihai" "bihai"
## [9] "bihai" "bihai" "bihai" "bihai" "bihai" "bihai" "bihai" "bihai"
red.labels = rep(x = "red", times = length(red))
yellow.labels = rep(x = "yellow", times = length(yellow))
(flower.labels = c(bihai.labels, red.labels, yellow.labels))
## [1] "bihai" "bihai" "bihai" "bihai" "bihai" "bihai" "bihai"
## [8] "bihai" "bihai" "bihai" "bihai" "bihai" "bihai" "bihai"
## [15] "bihai" "bihai" "red" "red" "red" "red" "red"
## [22] "red" "red" "red" "red" "red" "red" "red"
## [29] "red" "red" "red" "red" "red" "red" "red"
## [36] "red" "red" "red" "red" "yellow" "yellow" "yellow"
## [43] "yellow" "yellow" "yellow" "yellow" "yellow" "yellow" "yellow"
## [50] "yellow" "yellow" "yellow" "yellow" "yellow"
R calls a categorical variable a “factor”; we can convert the vector
of character string labels to a factor with the factor() function:
(flower.species = factor(flower.labels))
## [1] bihai bihai bihai bihai bihai bihai bihai bihai bihai bihai
## [11] bihai bihai bihai bihai bihai bihai red red red red
## [21] red red red red red red red red red red
## [31] red red red red red red red red red yellow
## [41] yellow yellow yellow yellow yellow yellow yellow yellow yellow yellow
## [51] yellow yellow yellow yellow
## Levels: bihai red yellow
(The difference between a vector of character strings and a factor is subtle; we'll not worry about it.)
The familiar function lm() that we used to fit linear regression
models can also do one-way ANOVA. The function anova(), when called
on the output of lm(), displays the ANOVA table.
model = lm(flower.lengths ~ flower.species)
anova(model)
## Analysis of Variance Table
##
## Response: flower.lengths
## Df Sum Sq Mean Sq F value Pr(>F)
## flower.species 2 1083 542 259 <2e-16 ***
## Residuals 51 107 2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
These are the values we found in lecture. (Well, the “Total” line is missing, probably because it doesn't play a role in the F test.)
As before, options(digits=15) before the anova() call can cause
more decimal places to be displayed.
Here's the quick 9.1 Review example we did at the start of the 9.2/9.3 notes.
data = c(c(0, 1, 2), c(1, 2, 3), c(5, 6, 7))
sample = factor(c(rep(x = "1", times = 3), rep(x = "2", times = 3), rep(x = "3",
times = 3)))
model = lm(data ~ sample)
anova(model)
## Analysis of Variance Table
##
## Response: data
## Df Sum Sq Mean Sq F value Pr(>F)
## sample 2 42 21 21 0.002 **
## Residuals 6 6 1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here's the 9.3 example on the dependence, if any, of a tool lifetime's on feed rate and speed.
lifetime.row.1 = c(60.6, 57, 61.4, 59.7, 57.8, 59.4, 62.8, 58.2, 56.5, 52.3,
58.1, 53.9)
lifetime.row.2 = c(51.2, 53.1, 48.3, 51.6, 49.6, 48.1, 49.8, 51.1, 45.7, 48.6,
45, 49.2)
lifetime.row.3 = c(44.8, 46.7, 41.9, 51.3, 46.6, 41.4, 38.3, 37.9, 37.2, 32.8,
39.9, 35.9)
(lifetime = c(lifetime.row.1, lifetime.row.2, lifetime.row.3))
## [1] 60.6 57.0 61.4 59.7 57.8 59.4 62.8 58.2 56.5 52.3 58.1 53.9 51.2 53.1
## [15] 48.3 51.6 49.6 48.1 49.8 51.1 45.7 48.6 45.0 49.2 44.8 46.7 41.9 51.3
## [29] 46.6 41.4 38.3 37.9 37.2 32.8 39.9 35.9
feed.rate.row.1 = rep(x = "light", times = 12)
feed.rate.row.2 = rep(x = "medium", times = 12)
feed.rate.row.3 = rep(x = "heavy", times = 12)
(feed.rate = factor(c(feed.rate.row.1, feed.rate.row.2, feed.rate.row.3)))
## [1] light light light light light light light light light light
## [11] light light medium medium medium medium medium medium medium medium
## [21] medium medium medium medium heavy heavy heavy heavy heavy heavy
## [31] heavy heavy heavy heavy heavy heavy
## Levels: heavy light medium
speed.row.1 = c(rep(x = "slow", times = 4), rep(x = "medium", times = 4), rep(x = "fast",
times = 4))
speed.row.2 = speed.row.1 # copy row 1
speed.row.3 = speed.row.1 # copy row 1 again
(speed = factor(c(speed.row.1, speed.row.3, speed.row.3)))
## [1] slow slow slow slow medium medium medium medium fast fast
## [11] fast fast slow slow slow slow medium medium medium medium
## [21] fast fast fast fast slow slow slow slow medium medium
## [31] medium medium fast fast fast fast
## Levels: fast medium slow
model = lm(lifetime ~ feed.rate + speed + feed.rate : speed) # ':' indicates an interaction term
options(digits = 6)
anova(model)
## Analysis of Variance Table
##
## Response: lifetime
## Df Sum Sq Mean Sq F value Pr(>F)
## feed.rate 2 1718.4 859.2 117.451 4.77e-14 ***
## speed 2 224.2 112.1 15.323 3.57e-05 ***
## feed.rate:speed 4 48.8 12.2 1.666 0.187
## Residuals 27 197.5 7.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Rather than keeping these data in several parallel vectors, it's
probably easier to keep them in a “data frame,” which is R's
equivalent of a spreadsheet. It's a table of several observations, one
per row, of several variables, one per column. The function
data.frame(...) creates a data frame from the vectors in “...”.
(d = data.frame(lifetime, feed.rate, speed))
## lifetime feed.rate speed
## 1 60.6 light slow
## 2 57.0 light slow
## 3 61.4 light slow
## 4 59.7 light slow
## 5 57.8 light medium
## 6 59.4 light medium
## 7 62.8 light medium
## 8 58.2 light medium
## 9 56.5 light fast
## 10 52.3 light fast
## 11 58.1 light fast
## 12 53.9 light fast
## 13 51.2 medium slow
## 14 53.1 medium slow
## 15 48.3 medium slow
## 16 51.6 medium slow
## 17 49.6 medium medium
## 18 48.1 medium medium
## 19 49.8 medium medium
## 20 51.1 medium medium
## 21 45.7 medium fast
## 22 48.6 medium fast
## 23 45.0 medium fast
## 24 49.2 medium fast
## 25 44.8 heavy slow
## 26 46.7 heavy slow
## 27 41.9 heavy slow
## 28 51.3 heavy slow
## 29 46.6 heavy medium
## 30 41.4 heavy medium
## 31 38.3 heavy medium
## 32 37.9 heavy medium
## 33 37.2 heavy fast
## 34 32.8 heavy fast
## 35 39.9 heavy fast
## 36 35.9 heavy fast
The calculations are almost the same as before, except that now we use
the form lm(formula, data), where formula uses column names from
the data frame data:
model = lm(formula = lifetime ~ feed.rate + speed + feed.rate : speed, data = d)
anova(model)
## Analysis of Variance Table
##
## Response: lifetime
## Df Sum Sq Mean Sq F value Pr(>F)
## feed.rate 2 1718.4 859.2 117.451 4.77e-14 ***
## speed 2 224.2 112.1 15.323 3.57e-05 ***
## feed.rate:speed 4 48.8 12.2 1.666 0.187
## Residuals 27 197.5 7.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can write a data frame x to a comma-separated values (“.csv”)
file named file via write.csv(x, file). e.g. write.csv(x=d,
file="lifetime.csv") would create the file “lifetime.csv”. Excel can
read a “.csv” file.
We can read a “.csv” file (created by R, typed by hand, or created by
Excel or other software) into a data frame via
read.csv(file). e.g. d = read.csv("lifetime.csv") would create the
data frame d from the file “lifetime.csv”.
Running ?command opens the Help tab of RStudio with documentation of
command (if it exists). e.g. To learn more about read.csv(), run
?read.csv in the console.
I should have mentioned this in the first chapter's R Guide! I didn't because R's help can be intimidating to a beginner, because it often says way more than he or she wants to know. So I should have mentioned it in the second chapter, with the warning that you need to learn to ignore parts of it to find what you need.
Using google to find help with R is usually successful. e.g. try googling “R read.csv” and checking the first few links returned.
… coming soon …