Chapter 4

4.1 The Binomial Distribution

Binomial probabilities

For X ~ Bin(n,p), the cumulative probability distribution function F(x) = P(X ≤ x) is given by pbinom(x, n, p). (This function subsitutes for Table A.1, pp. 515-520.) For example, For X ~ Bin(5, .3), F(2) = P(X ≤ 2) is given by

> pbinom(2, 5, 0.3)

[1] 0.8369

To find P(X = 2), use P(X = 2) = F(2) - F(1):

> pbinom(2, 5, 0.3) - pbinom(1, 5, 0.3)

[1] 0.3087

“n choose x”

The number of outcomes with x successes, “n choose x,” is given by choose(n, x). For example, here's “5 choose 3:”

> choose(5, 3)

[1] 10

which could also be calculated as:

> factorial(5)/(factorial(3) * factorial(5 - 3))

[1] 10

So to get P(X = 2), we could also have used

> choose(5, 2) * (0.3^2) * (1 - 0.3)^(5 - 2)

[1] 0.3087

(Note that ^, used above, is the exponent operator.)

4.2 The Poisson Distribution

For X ~ Poisson(lambda), the cumulative probability distribution function F(x) = P(X ≤ x) is given by ppois(x, lambda). For example, for X ~ Poisson(4), P(X = 5) = F(5) - F(4) is given by

> ppois(5, 4) - ppois(4, 4)

[1] 0.1563

(This confirms our solution to problem 4.2.5(a) in the lecture notes.)

4.5 The Exponential Distribution

For X ~ Exp(lambda), the cumulative probability distribution function F(x) = P(X ≤ x) is given by pexp(x, lambda). For example, for X ~ Exp(1/12), P(X > 15) = 1 - P(X \le; 15) is given by

> 1 - pexp(15, 1/12)

[1] 0.2865

(This confirms our solution to problem 4.5.4(a) in the lecture notes.)

4.6 Some Other Continuous Distributions

Uniform distribution

For X ~ U(a,b), the cumulative probability distribution function F(x) = P(X ≤ x) is given by punif(x, a, b). For example, for X ~ U(0,10), P(X ≤ 4) is given by

> punif(4, 0, 10)

[1] 0.4

4.3 The Normal Distribution

Cumulative probability function F(x) via `pnorm()`

For X ~ N(\( \mu \), \( \sigma^2 \)), the cumulative probability distribution function F(x) = P(X ≤ x) is given by pnorm(x, mu, sigma). For example, for X ~ N(518, 114² ), P(X ≤ 680) is given by

> pnorm(680, 518, 114)

[1] 0.9223

In the 4.3 lecture notes, Eleanor scored higher than this proportion of students. In class we found her standardized score, z, and then used the N(0,1) table, which corresponds to:

> z = (680 - 518)/114
> z

[1] 1.421

> pnorm(z, 0, 1)

[1] 0.9223

Inverse “quantile” function via `qnorm()`

To find the x such that F(x)=p for some cumulative probability (or left-tail-area) p, use x = qnorm(p, mu, sigma). (The “q” prefix refers to “quantile.” A quantile function x=quantile(p) is the inverse of the cumulative distribution function p=F(x).) e.g. The SAT score that's higher than 92.23% of scores is

> qnorm(0.9223, 518, 114)

[1] 680

Or, for IQ scores that are N(100, 15² ) (another example from the 4.3 notes), the score that's better than 25% of scores is

> qnorm(0.25, 100, 15)

[1] 89.88

Density f(x) via `dnorm()`

The the density function f(x) is dnorm(x, mu, sigma), which is useful for graphs. Let's draw the N(0,1) curve over the interval [-3,3]. To get a bunch of x coordinates, use seq(from, to, by), which gives a sequence of points from from, to to, using the increment by. The corresponding y coordinates for y=f(x) are given by dnorm():

> x = seq(-3, 3, 0.5)
> y = dnorm(x, 0, 1)
> plot(x, y, main = "N(0,1)")

plot of chunk dnorm1

Smooth it out by decreasing the increment from .5 to .01:

> x = seq(-4, 4, 0.01)
> y = dnorm(x, 0, 1)
> plot(x, y, main = "N(0,1)")

plot of chunk dnorm2

Random numbers from a normal distribution via `rnorm()`

Generate \( n \) random numbers from N(\( \mu \), \( \sigma^2 \)) via rnorm(n, mu, sigma). Here are 100 random numbers from N(0,1):

> x = rnorm(100, 0, 1)
> x

  [1]  0.06237  0.30731 -0.49006 -0.49438  0.29423  0.22194 -0.22657
  [8] -1.00618 -1.42367  0.86872 -0.34115  0.73883  1.34827 -1.31388
 [15] -0.98579 -0.42858 -2.01318 -0.50307  0.45606 -0.41144  0.70934
 [22]  0.73712  0.97525  1.04083 -0.69671 -1.29635  0.05943  0.97565
 [29] -0.10867 -1.15886  1.86362 -1.03775 -0.91508 -0.07312  1.32779
 [36]  0.96687  0.02575 -1.07823  0.55074  0.33252  0.35317  1.95310
 [43] -0.98310 -0.55296 -0.75479  0.38368  0.72563 -0.35472 -0.17587
 [50] -1.68392  2.49740  1.63988  1.04667  0.26688  1.74554 -0.64943
 [57]  0.94487  1.13604 -0.37853 -0.07760 -1.12952  0.69152  1.13060
 [64]  0.34569  0.02395 -0.15548  0.35961  0.25738 -0.80401 -0.28172
 [71]  1.02331  0.76891  0.14168  2.22751  0.71044 -1.21245  0.83570
 [78]  0.97904 -2.75361  0.63444  2.58448 -0.25352  0.23587  1.30658
 [85] -0.98416 -1.49539  1.47256 -0.63592 -0.94581  0.16478 -2.15813
 [92]  0.74268 -0.43424  0.39039 -0.18879 -0.42065  1.21480  0.89964
 [99]  2.20653  0.21548

As a review, use chapter 1 material to get numerical and graphical summaries of these random numbers.

> mean(x)  # should be near 0

[1] 0.1265

> sd(x)  # should be near 1

[1] 1.035

> median(x)  # should be near the mean since N(0,1) is symmetric

[1] 0.1901

> require("lattice")

Loading required package: lattice

> histogram(x, type = "count")  # should look roughly like N(0,1)

plot of chunk numeric.norm

> densityplot(x)  # should look roughly like N(0,1)

plot of chunk numeric.norm

Notice the symmetry and lack of outliers typical of a sample from a normal distribution.

To summarize, we used the four functions dnorm(), pnorm(), qnorm(), and rnorm(). The “norm” suffix refers to the normal distribution. The d, p, q, and r prefixes refer to “density,” “probability (cumulative),” “quantile,” and “random.” R has these four d, p, q, and r functions for each distribution we'll encounter.

4.4 The Lognormal Distribution

Cumulative probability function F(x) via `plnorm()`

For X ~ lognormal with parameters \( \mu \) and \( \sigma \) (which implies Y=ln(X) ~ N(\( \mu \), \( \sigma^2 \))), the cumulative probability distribution function F(x) = P(X ≤ x) is given by plnorm(x, mu, sigma). For example, for (body mass index) BMI ~ lognormal with parameters \( \mu \)=3.215 and \( \sigma \)=.157, P(BMI ≤ 22) is

> plnorm(22, 3.215, 0.157)

[1] 0.2149

(This confirms our answer to 4.4.3(d) in the lecture notes.)

Inverse “quantile” function via `qlnorm()`

To find the x such that F(x)=p for some cumulative probability (or left-tail-area) p, use x = qlnorm(p, mu, sigma). e.g. The 75^th percentile of BMI is

> qlnorm(0.75, 3.215, 0.157)

[1] 27.69

(This confirms our answer to 4.4.3(e) in the lecture notes.)

Density f(x) via `dlnorm()`

The the density function f(x) is dlnorm(x, mu, sigma).

Random numbers from a lognormal distribution via `rlnorm()`

Generate \( n \) random numbers from lognormal \( \mu \), \( \sigma \) via rlnorm(n, mu, sigma). Here are 100 random numbers from lognormal with \( \mu \)=0, \( \sigma \)=1:

> x = rlnorm(100, 0, 1)
> x

  [1]  0.78857  1.45044  2.65504  0.53037  0.79836  0.46961  0.22628
  [8]  8.01634  1.98707  0.20921 12.12551  2.30127  0.19010  0.07466
 [15]  0.32991  0.23414  2.42293  2.84230  0.35611  7.26090  0.39400
 [22]  0.90463  0.95094  0.86228  1.12455  2.57511  0.62375  0.34679
 [29]  1.72720  1.91696  0.48234  0.30698 14.46074  0.80875  0.64853
 [36]  0.60538  1.49890  3.04862  4.70927  1.48188  0.95413  1.97410
 [43]  0.76930  0.33888  7.48985  0.21759  2.07762  0.23269  0.68671
 [50]  0.81731  1.48795  0.34480  1.70008  1.25547  0.53518  1.07993
 [57]  0.63207  0.99650  0.84108  2.16330  0.64718  0.22541 15.54648
 [64]  0.36359  0.33697  1.21697  0.78312  0.16823  4.00243  5.16696
 [71] 11.35176  0.89242  0.76458  0.66930  2.80583  2.21881  0.36795
 [78]  0.54831  1.89356  0.66249  0.25847  0.57659  0.77518  0.43600
 [85]  1.17293  2.50871  1.92816  1.65632  0.71974  0.72707  0.27803
 [92]  0.79649  1.63543  0.77949  2.82106  0.16195  1.28883  0.22047
 [99]  0.46437  0.91832

Here are summaries:

> mean(x)

[1] 1.821

> median(x)

[1] 0.813

> densityplot(x)

plot of chunk numeric.lnorm

Notice that the mean is right of the median, as is typical of a skewed-right distribution, and that there are outliers, as is typical of a lognormal distribution.

4.7 Probability Plots

Draw a normal probability plot of a sample in vector x via qqmath(x). (The qq refers to a “quantile-quantile” plot, and the math refers to plotting data vs. a mathematical distribution.) Here's a plot like the one in the 4.7 lecture notes:

> x = c(11.6, 12.6, 12.7, 12.8, 13.1, 13.3, 13.6, 13.7, 13.8, 14.1, 14.3, 14.3, 
+     14.6, 14.8, 15.1, 15.2, 15.6, 15.6, 15.7, 15.8, 15.8, 15.9, 15.9, 16.1, 
+     16.2, 16.2, 16.3, 16.4, 16.5, 16.5, 16.5, 16.6, 17, 17.1, 17.3, 17.3, 17.4, 
+     17.4, 17.4, 17.6, 17.7, 18.1, 18.3, 18.3, 18.3, 18.5, 18.5, 18.8, 19.2, 
+     20.3)
> require("lattice")  # load the lattice graphics package
> qqmath(x, xlab = "Idealized normal weight", ylab = "Actual soap weight")

plot of chunk qqmath

Note that, compared to the 4.7 lecture notes,

the x and y axes are interchaged
the idealized normal weights are standardized (i.e. they're from N(0,1), not N(mean(x), sd(x)))
the reference line is missing

Eliminating these differences is possible but beyond what I want to cover here.

4.8 The Central Limit Theorem (optional R material)

The following R material is not required.

Let's illustrate the Central Limit Theorem (CLT) by taking many random samples from an exponential distribution and plotting the distribution of their sample means. First, here's a plot of X ~ exp(1). (Note that dexp(x, lambda) gives f(x) for X ~ exp(\( \lambda \)).)

> x = seq(0, 4, 0.01)
> y = dexp(x, 1)
> plot(x, y)

plot of chunk dexp

This doesn't look like a normal distribution! Here's a random sample of size 3 from the exp(1) distribution, and then its sample mean:

> x = rexp(3, 1)
> x

[1] 0.9553 1.5329 0.4827

> mean(x)

[1] 0.9903

The function call replicate(N, ...) repeats the calculation in ... N times, and returns a vector of the N results. .e.g.

> replicate(5, mean(rexp(3, 1)))  # means of 5 samples of size 3

[1] 1.6265 0.9479 0.4345 1.4256 1.6813

Here I'll make a vector of 1000 means of samples of size 3, and then plot the distribution of those means:

> means.sample.size.3 = replicate(1000, mean(rexp(3, 1)))
> densityplot(means.sample.size.3)

plot of chunk replicate2

This looks more normal than the plot of X ~ exp(1), but it's still skewed right. Increase the sample size to let the CLT do its thing:

> means.sample.size.100 = replicate(1000, mean(rexp(100, 1)))
> densityplot(means.sample.size.100)

plot of chunk replicate3