Plotting probability densities

Table of Contents

1 Plotting probability density functions

Functions to evaluate probability densities in R have names of the form d<dabb> where dabb is the abbreviated distribution name. For example, norm for the normal (or Gaussian) density, unif for the uniform density, exp for the exponential density. A more complete list of distributions and their abbreviations is given here.

One simple way of plotting a theoretical density function is to establish a range of x values, evaluate the density (or probability mass function) on these values and plot the result.

It is not always straightforward to decide what a reasonable range of x values would be. For example, if I want to plot the exponential density for the rate, λ=0.2, how far out on the right-hand tail should I go? One way to answer this is to find, say, the 0.995 quantile.

(xmax <- qexp(0.995, rate=0.2))
[1] 26.49159

and choose equally spaced values from 0, below which the density is zero, to xmax

xvals <- seq(0, xmax, length=100)

The plot can now be generated as

library(ggplot2)
qplot(xvals, dexp(xvals, rate=0.2), geom="line", ylab="f(x)", xlab="x")

exp_dens1.png

We can enhance this plot by adding a reference line at y=0 and changing the aspect ratio.

qplot(xvals, dexp(xvals, rate=0.2), geom="line", ylab="f(x)",
      xlab="x") + geom_hline(yintercept=0)

exp_dens2.png

When a density function like the exponential density has a lower bound – 0 in this case – it is a good idea to include a point just slightly below the lower bound

xvals <- c(-1e-5, xvals)

so that a vertical line gets drawn on the plot.

exp_dens3.png

2 Plotting multiple densities

Often we want to plot multiple theoretical densities on the same set of axes so that we see the effect of changing parameter values. Suppose, for example, that we want to compare the exponential densities for λ=0.5, λ=0.4, λ=0.3 and λ=0.2

If we know the properties of the exponential distribution we will realize that the current range of x values stored in xvals will cover the region of non-negligible density for each of these densities. If we didn't know that, we could redefine xvals based on an xmax of

xmax <- max(qexp(0.995, rate=c(0.2, 0.3, 0.4, 0.5)))

As explained in the ggplot2 book, we should create a data frame with the x and y values for all the points we wish to plot, but in two columns with a third column indicating the rate.

2.1 Using melt to change a data frame from wide to long

One way to create such a representation is to start with good old cut-and-paste

df <- data.frame(xvals,
                 dexp(xvals,rate=0.2),
                 dexp(xvals,rate=0.3),
                 dexp(xvals,rate=0.4),
                 dexp(xvals,rate=0.5))
names(df) <- c("x", "0.2", "0.3", "0.4", "0.5")
str(df)

+ + + > > 'data.frame':       101 obs. of  5 variables:
 $ x  : num  -0.00001 0 0.26759 0.53518 0.80278 ...
 $ 0.2: num  0 0.2 0.19 0.18 0.17 ...
 $ 0.3: num  0 0.3 0.277 0.256 0.236 ...
 $ 0.4: num  0 0.4 0.359 0.323 0.29 ...
 $ 0.5: num  0 0.5 0.437 0.383 0.335 ...

and then "melt" the data frame, preserving the x values.

str(fr <- melt(df, id="x"))
'data.frame': 404 obs. of  3 variables:
 $ x       : num  -0.00001 0 0.26759 0.53518 0.80278 ...
 $ variable: Factor w/ 4 levels "0.2","0.3","0.4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value   : num  0 0.2 0.19 0.18 0.17 ...

from which we produce the plot of multiple densities

qplot(x, value, data=fr, geom="line", ylab="f(x)",
      xlab="x", color=variable) + geom_hline(yintercept=0)

exp_dens4.png

It is slightly more informative if we change the name of the variable called "variable" to "rate".

names(fr) <- c("x", "rate", "value")

exp_dens5.png

The structure of the df data frame

str(df)
'data.frame': 101 obs. of  5 variables:
 $ x  : num  -0.00001 0 0.26759 0.53518 0.80278 ...
 $ 0.2: num  0 0.2 0.19 0.18 0.17 ...
 $ 0.3: num  0 0.3 0.277 0.256 0.236 ...
 $ 0.4: num  0 0.4 0.359 0.323 0.29 ...
 $ 0.5: num  0 0.5 0.437 0.383 0.335 ...
head(df)
           x       0.2       0.3       0.4       0.5
1 -0.0000100 0.0000000 0.0000000 0.0000000 0.0000000
2  0.0000000 0.2000000 0.3000000 0.4000000 0.5000000
3  0.2675918 0.1895777 0.2768581 0.3593971 0.4373843
4  0.5351836 0.1796985 0.2555013 0.3229156 0.3826100
5  0.8027754 0.1703342 0.2357920 0.2901373 0.3346953
6  1.0703671 0.1614578 0.2176030 0.2606863 0.2927809

is called the "wide" format. Each x value corresponds to multiple y values and it is the column positions or names that distinguish which response is which.

The structure of fr is the corresponding "long" format. All the responses are in one column and we distinguish the parameter values by with an auxillary column.

2.2 Using expand.grid to generate the long format directly

An alternative approach is to set up the "x" and "rate" columns of the data frame and evaluate the density on these two long vectors.

str(fr <- within(expand.grid(x=xvals, rate=c(0.2,0.3,0.4,0.5)),
                 value <- dexp(x, rate)))
'data.frame':   404 obs. of  3 variables:
 $ x    : num  -0.00001 0 0.26759 0.53518 0.80278 ...
 $ rate : num  0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 ...
 $ value: num  0 0.2 0.19 0.18 0.17 ...
 - attr(*, "out.attrs")=List of 2
  ..$ dim     : Named int  101 4
  .. ..- attr(*, "names")= chr  "x" "rate"
  ..$ dimnames:List of 2
  .. ..$ x   : chr  "x=-0.0000100" "x= 0.0000000" "x= 0.2675918" "x= 0.5351836" ...
  .. ..$ rate: chr  "rate=0.2" "rate=0.3" "rate=0.4" "rate=0.5"

The expand.grid function creates a data frame of two columns containing all possible combinations of the two vectors given as its arguments (although it is not restricted to just two arguments). We then evaluate the density on those combinations of values of x and rate.

We do need to make one small change in the call to qplot because rate is now a numeric variable and we want it to be treated as a factor in the plot.

qplot(x, value, data=fr, geom="line", ylab="f(x)",
      xlab="x", color=factor(rate)) + geom_hline(yintercept=0)

exp_dens6.png

Author: Douglas Bates

Date: 2010-08-25 Wed

HTML generated by org-mode 7.01h in emacs 23