STAT 303 Homework 3

We’ll grade your homework by

opening your “hw3.Rmd” file in RStudio
clicking “Knit HTML”
reading the HTML output
reading your “hw3.Rmd”

You should write R code anywhere you see an empty R code chunk. You should write English text anywhere you see “…”; please surround it with doubled asterisks (**...**) so that it will show up as boldface and be easy for us to find.

Include reasonable labels (titles, axis labels, legends, etc.) with each of your graphs.

Name: Kayley Seow

Email: kseow@wisc.edu

We’ll use data on housing values in suburbs of Boston. They are in an R package called “MASS.” (An R package is a collection of code, data, and documentation. “MASS” refers to the book “Modern Applied Statistics with S.” R developed from the earlier language, S.) The MASS package comes with the default R installation, so it’s already on your computer. However, it’s not loaded into your R session by default. So we’ll load it via the require() command (there’s nothing for you to do here):

require("MASS")

## Loading required package: MASS

Run ?Boston (outside this R Markdown document) to read the help page for the Boston data frame.

Convert the chas variable to a factor with labels “off” and “on” (referring to the Charles river).

b = MASS::Boston
b$chas = factor(b$chas, levels = c(0,1), labels = c("off", "on"))
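One way to confirm the recoding is to cross-tabulate the original 0/1 codes against the new labels. This is a minimal sketch, assuming the untouched column is still available as MASS::Boston$chas; the name chas_check is only illustrative.

```r
library(MASS)

b = MASS::Boston
b$chas = factor(b$chas, levels = c(0, 1), labels = c("off", "on"))

# Cross-tabulate the original 0/1 codes against the new labels.
# Every 0 should land in "off" and every 1 in "on".
chas_check = table(original = MASS::Boston$chas, recoded = b$chas)
print(chas_check)

# The factor should have exactly these two levels, in this order.
print(levels(b$chas))
```

Nonzero counts off the diagonal of this table would mean the labels were assigned to the wrong codes.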

How many rows are in the Boston data frame? How many columns?

cat(sep = "", "The number of columns is: ", ncol(b), ".\n")

## The number of columns is: 14.

cat(sep = "", "The number of rows is: ", nrow(b), ".\n")

## The number of rows is: 506.

There are 14 columns and 506 rows in the data frame.

What does a row represent? A row represents one suburb (neighborhood) in the Boston area, described by the 14 variables stored in the columns.

What does a column represent? A column holds the values of one variable (for example, crim or tax) across all of the suburbs.
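The row and column interpretation can be checked by pulling out one of each. This is a small sketch that assumes only the Boston data frame from MASS; the choice of row 1 and the tax column is arbitrary.

```r
library(MASS)
b = MASS::Boston

# dim() returns c(number of rows, number of columns) in one call.
print(dim(b))

# One row: all variables recorded for a single suburb.
print(b[1, ])

# One column: a single variable measured across every suburb.
print(head(b$tax))

# Storage type of each column.
print(sapply(b, class))
```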

Make a density plot (with rug) of tax rates.

plot(density(b$tax), main = "Density of tax rates", xlab = "Property-tax rate per $10,000")
rug(b$tax)

Describe the shape of the distribution of tax rates. The distribution is bimodal, with one peak near 300 and a second peak near 700.

Note that the distribution shape doesn’t make sense in light of the rug representation of the data. Make a histogram of the tax rates.

hist(b$tax, ylim = c(0, 140), xlim = c(100, 800), main = "Histogram of tax rates", xlab = "Property-tax rate per $10,000")

Why is the second peak of the density plot so large? In what way is the rug representation of the data inadequate? Write a line or two of code to figure it out, and then explain it.

maxvalue = max(b$tax)
maxtimes = sum(b$tax == maxvalue)
cat(sep = "", "To explain the peak around 700, we can look at the maximum value, ", maxvalue, ", which occurs ", maxtimes, " times in this sample. The rug does not show this because all of those values are drawn on top of one another.\n")

## To explain the peak around 700, we can look at the maximum value, 711, which occurs 5 times in this sample. The rug does not show this because all of those values are drawn on top of one another.
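Repeated values can also be counted for every tax rate at once, and a jittered rug spreads overlapping ticks apart so that ties become visible. This is a hedged sketch rather than part of the required answer; the name tax_counts and the jitter amount of 5 are illustrative choices.

```r
library(MASS)
b = MASS::Boston

# Count how many neighborhoods share each distinct tax value,
# then list the most frequently repeated values first.
tax_counts = sort(table(b$tax), decreasing = TRUE)
print(head(tax_counts))

# A plain rug draws tied values as a single tick. Jittering adds a
# little random noise so that ties appear as a thicker band.
plot(density(b$tax), main = "Density of tax rates", xlab = "tax")
rug(jitter(b$tax, amount = 5))
```

The first entries of tax_counts show which tax rates are shared by the most neighborhoods, which is the information the plain rug cannot convey.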

To explain the peak around 700, we can look at the maximum value, 711, which occurs 5 times in this sample. The rug does not show this because identical values are drawn on top of one another as a single tick, so the rug hides how many neighborhoods share each tax rate. More generally, a tall peak can come from many neighborhoods sharing the same tax rate, which a rug of single ticks cannot convey.

Make a barplot of “chas”.

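counts = table(b$chas)
barplot(counts, ylim = c(0, 500), main = "Neighborhoods on and off the Charles River", xlab = "chas", ylab = "Number of neighborhoods")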

How many neighborhoods are on the Charles river?

cat(sep = "", "The number of neighborhoods on the Charles River is: ", sum(b$chas == "on"), ".\n")

## The number of neighborhoods on the Charles River is: 35.

Make a single graph consisting of three plots:

a scatterplot of “nox” on the y-axis vs. “dis” on the x-axis
a (vertical) boxplot of “nox” left of the scatterplot’s y-axis
a (horizontal) boxplot of “dis” below the scatterplot’s x-axis

Hint: use layout() with a 4x4 matrix, using the top-right 3x3 corner for the scatterplot, leaving the bottom-left 1x1 corner blank, and using the other parts for the boxplots.

(An optional challenge, worth 0 extra credit points: remove the axis and plot border from each boxplot.)

# Build the 4x4 layout matrix row by row:
# 1 = left strip (rows 1-3), 2 = bottom strip (columns 2-4),
# 3 = top-right 3x3 block, 0 = blank bottom-left corner
m = rbind(c(1, 3, 3, 3),
          c(1, 3, 3, 3),
          c(1, 3, 3, 3),
          c(0, 2, 2, 2))
layout(m)

# Region 1: vertical boxplot of nox, left of the scatterplot's y-axis
boxplot(b$nox, ylab = "nox")

# Region 2: horizontal boxplot of dis, below the scatterplot's x-axis
boxplot(b$dis, horizontal = TRUE, xlab = "dis")

# Region 3: the scatterplot fills the top-right 3x3 block
plot(b$dis, b$nox, main = "nox vs. dis", xlab = "dis", ylab = "nox")
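For the optional challenge, one possible approach is to pass axes = FALSE and frame.plot = FALSE to boxplot(). This is a sketch under stated assumptions, not the required solution: the margin setting par(mar = c(4, 4, 1, 1)) is an illustrative choice to keep the small panels from running out of room.

```r
library(MASS)
b = MASS::Boston

# Same 4x4 layout: 1 = left strip, 2 = bottom strip, 3 = scatterplot,
# 0 = empty bottom-left corner.
m = rbind(c(1, 3, 3, 3),
          c(1, 3, 3, 3),
          c(1, 3, 3, 3),
          c(0, 2, 2, 2))
layout(m)

# Smaller margins than the defaults (illustrative values).
par(mar = c(4, 4, 1, 1))

# axes = FALSE drops the axis; frame.plot = FALSE drops the border.
boxplot(b$nox, axes = FALSE, frame.plot = FALSE)
boxplot(b$dis, horizontal = TRUE, axes = FALSE, frame.plot = FALSE)
plot(b$dis, b$nox, xlab = "dis", ylab = "nox")

# Return to a single-plot layout for later graphs.
layout(1)
```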

Look into the highest-crime neighborhood by making a single graph of one column of three rows:

Find the row number, r, of the neighborhood with the highest “crim”.
Make a density plot of “crim”. Include a rug to show the data.
Add a red circle at (x, y) = (max crime rate, 0) to make this maximum crime rate stand out.
Make a density plot with rug of “medv”, adding a red circle at (x, y) = (medv[r], 0) to see what medv corresponds to the highest crime rate.
Repeat the last step for “ptratio”.

# Stack the three plots in one column of three rows
par(mfrow = c(3, 1))

# Row number of the neighborhood with the highest crime rate
crimRowNumber <- which.max(b$crim)

# Maximum crime rate, used to position the red circle
maxCrimRate <- b$crim[crimRowNumber]

# Density plot with rug for crim
plot(density(b$crim), main = "Density of crim", xlab = "crim")
rug(x = b$crim)
points(x = maxCrimRate, y = 0, cex = 2, pch = 19, col = 'red')

# Density plot with rug for medv
plot(density(b$medv), main = "Density of medv", xlab = "medv")
rug(x = b$medv)
points(x = b[crimRowNumber, "medv"], y = 0, cex = 2, pch = 19, col = 'red')

# Density plot with rug for ptratio
plot(density(b$ptratio), main = "Density of ptratio", xlab = "ptratio")
rug(x = b$ptratio)
points(x = b[crimRowNumber, "ptratio"], y = 0, cex = 2, pch = 19, col = 'red')

What do you notice about the ptratio and medv for the highest-crime neighborhood? The median home value (medv) in the highest-crime neighborhood is below the median for the sample, and its pupil-teacher ratio (ptratio) is high. It would be interesting to see how the other variables relate to crime in this neighborhood as well.
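The comparison can be made concrete by computing where the highest-crime neighborhood falls within each distribution. This is a minimal sketch; the names medv_fraction and ptratio_fraction are illustrative, and no particular output is assumed.

```r
library(MASS)
b = MASS::Boston

# Row of the neighborhood with the highest crime rate.
r = which.max(b$crim)

# Fraction of neighborhoods with a value at or below the one in row r.
# A small fraction means "low relative to the sample"; a large one, "high".
medv_fraction = mean(b$medv <= b$medv[r])
ptratio_fraction = mean(b$ptratio <= b$ptratio[r])

cat("Row:", r, "\n")
cat("medv:", b$medv[r], "vs. median", median(b$medv), "\n")
cat("ptratio:", b$ptratio[r], "vs. median", median(b$ptratio), "\n")
cat("Fraction of medv values at or below it:", medv_fraction, "\n")
cat("Fraction of ptratio values at or below it:", ptratio_fraction, "\n")
```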