Hi Robin,

Ok great! Last night I finished running the same experiment again, but now on the entire dataset (including all categories, not just LangLit1 or a subset of it). In summary, (i) I detect significant differences in word usage between all pairs of decades, but (ii) I do not detect, in this experiment, any influence by Johnson on word usage. Below I give details about the experiment, present and analyze the results, and discuss some other questions that might be interesting to investigate.

What I did:

Here are specific notes about the experiment. (Some of this is repeated from previous emails, but I have written it more carefully this time.)

- I extracted the year from each document's filename in the txts directory. Some docs had more than one date, and I always took the one that appeared first in the filename. I did this mostly automatically, but I went through and cleaned up mistakes. I excluded documents with no date in the filename or with a date before 1700 or after 1799. After this, 85518 documents remained.
- I assigned each year to one of the decades 1700-1709, 1710-1719, etc.
- I did the following simple OCR cleanups on the text of each document (I'm curious what ones you and Daniel have been doing, too - I haven't really thought much about this issue yet):
  - replace " ' d" -> "'d", e.g., "reform 'd" -> "reform'd"
  - replace "& c" -> "&c"
  - replace "- " -> "", e.g., "Spi- rit" -> "Spirit"
  - replace "-" -> " ", e.g., "He boldly hiccups-but he cannot" -> "He boldly hiccups but he cannot"
  - remove all chars other than a-z, A-Z, 0-9, "&", and " "
  - lowercase everything
- I made a big list of all words (with counts) occurring more than once in the corpus. (There were 94084366 words that occurred only once in the corpus; the word "the" occurred 262543504 times.) I can give you this list if you're interested and don't already have it.
- I removed from each document all words occurring <100 or >5000000 times.
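Since you asked about cleanups, here is roughly what mine look like in code - a minimal sketch, where the function names and thresholds-as-defaults are my own framing, and the replacement order matters (the hyphen rejoin has to run before remaining hyphens become spaces):

```python
import re

def clean_text(text):
    """The simple OCR cleanups listed above, applied in order.

    Note: the final character filter drops apostrophes anyway,
    so "reform'd" ultimately becomes "reformd".
    """
    text = text.replace(" ' d", "'d")   # "reform ' d" -> "reform'd"
    text = text.replace("& c", "&c")
    text = text.replace("- ", "")       # rejoin hyphenated line breaks: "Spi- rit" -> "Spirit"
    text = text.replace("-", " ")       # "hiccups-but" -> "hiccups but"
    text = re.sub(r"[^a-zA-Z0-9& ]", "", text)  # keep only a-z, A-Z, 0-9, "&", " "
    return text.lower()

def restrict_vocab(counts, lo=100, hi=5_000_000):
    """Keep words whose corpus count is within [lo, hi], i.e. drop
    words occurring <100 or >5000000 times, as described above.
    counts: dict mapping word -> total count in the corpus."""
    return {w for w, c in counts.items() if lo <= c <= hi}
```

(One thing this sketch makes visible: since apostrophes are stripped by the character filter, the "'d" normalization mostly just prevents a stray standalone "d" token.)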
The goal was to exclude, on one hand, OCR and other noise, and on the other hand, stopwords, since both of these could hide true variation; in my first experiment on a small dataset, I did observe this masking before I did the thresholding. I chose the specific thresholds by eyeballing the big list of words+counts, but it would be nice to choose them by some automatic tuning procedure. After this thresholding, 662280 words remained in the vocabulary.
- I mapped each document to a bag-of-words count vector, averaged the vectors within each decade (averaging rather than summing, since the cosine is length-invariant anyway), and computed the cosine between the averaged vectors for each pair of decades, as in my previous experiments. The result is shown in the first table below, where the (i,j) entry is the cosine between decade i's averaged vector and decade j's averaged vector.
- I did a permutation test for significance as in my previous experiment: for each pair of decades, I (a) collected the documents occurring in one or the other decade, (b) randomly permuted the labels assigning the documents to the two decades, (c) averaged the count vectors according to the new decade assignments, and (d) computed the cosine between the averaged vectors. I did this 1000 times for each pair and counted the number of times the new cosine (under permuted labels) was less than the observed cosine (under the original labels). Call this number r. The fraction (r+1)/(1000+1) gives an estimate of the probability that a cosine under a random permutation of decade labels would be smaller than the observed cosine. Hopefully this number is smaller than, say, 0.05. I report this fraction, for each pair of decades, in the second table ("levels") below.
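For concreteness, the averaging + cosine + permutation-test steps can be sketched like this (assuming numpy, with each decade's documents as rows of a count matrix; all names here are mine, not from the actual script):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def permutation_level(docs_a, docs_b, n_perms=1000, seed=0):
    """Estimate the probability that a cosine under random relabeling
    is smaller than the observed cosine.

    docs_a, docs_b: 2-D arrays, one row per document's count vector.
    Returns (observed cosine, (r + 1) / (n_perms + 1)).
    """
    rng = np.random.default_rng(seed)
    observed = cosine(docs_a.mean(axis=0), docs_b.mean(axis=0))
    pooled = np.vstack([docs_a, docs_b])
    n_a = len(docs_a)
    r = 0
    for _ in range(n_perms):
        # Randomly reassign the pooled documents to the two decades.
        perm = rng.permutation(len(pooled))
        new_a = pooled[perm[:n_a]].mean(axis=0)
        new_b = pooled[perm[n_a:]].mean(axis=0)
        if cosine(new_a, new_b) < observed:
            r += 1
    return observed, (r + 1) / (n_perms + 1)
```

With r = 0 and 1000 permutations this bottoms out at 1/1001 ≈ 0.0010, which is why that value dominates the levels table below.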
Results:

cosines =

        1700s   1710s   1720s   1730s   1740s   1750s   1760s   1770s   1780s   1790s
1700s   1.0000  0.9892  0.9897  0.9812  0.9638  0.9626  0.9506  0.9399  0.8993  0.9317
1710s   0.9892  1.0000  0.9887  0.9861  0.9631  0.9599  0.9514  0.9387  0.8907  0.9331
1720s   0.9897  0.9887  1.0000  0.9921  0.9804  0.9794  0.9699  0.9606  0.9234  0.9510
1730s   0.9812  0.9861  0.9921  1.0000  0.9881  0.9849  0.9809  0.9722  0.9339  0.9644
1740s   0.9638  0.9631  0.9804  0.9881  1.0000  0.9931  0.9872  0.9874  0.9655  0.9748
1750s   0.9626  0.9599  0.9794  0.9849  0.9931  1.0000  0.9893  0.9885  0.9672  0.9784
1760s   0.9506  0.9514  0.9699  0.9809  0.9872  0.9893  1.0000  0.9920  0.9684  0.9856
1770s   0.9399  0.9387  0.9606  0.9722  0.9874  0.9885  0.9920  1.0000  0.9814  0.9894
1780s   0.8993  0.8907  0.9234  0.9339  0.9655  0.9672  0.9684  0.9814  1.0000  0.9676
1790s   0.9317  0.9331  0.9510  0.9644  0.9748  0.9784  0.9856  0.9894  0.9676  1.0000

levels =

        1700s   1710s   1720s   1730s   1740s   1750s   1760s   1770s   1780s   1790s
1700s   1.0000  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010
1710s   0.0010  1.0000  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010
1720s   0.0010  0.0010  1.0000  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010
1730s   0.0010  0.0010  0.0010  1.0000  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010
1740s   0.0010  0.0010  0.0010  0.0010  1.0000  0.0010  0.0010  0.0010  0.0010  0.0010
1750s   0.0010  0.0010  0.0010  0.0010  0.0050  1.0000  0.0010  0.0010  0.0010  0.0010
1760s   0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  1.0000  0.0010  0.0010  0.0010
1770s   0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  1.0000  0.0010  0.0010
1780s   0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  1.0000  0.0010
1790s   0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  1.0000

Discussion:

Compared to the results in my previous emails, the cosines here vary much more regularly across decades. This is a relief - I was confused by some of the previous results - but it is also what one would expect, since the corpus is much larger than in my previous experiments. As before, the cosines decrease mostly monotonically as the separation in time increases.
At first I thought that the levels reported above were incorrect, since all but one of the off-diagonal entries is 0.0010, which means that none of the permuted cosines was smaller than the observed cosine (r = 0, so (0+1)/(1000+1) ≈ 0.0010). However, I have double-checked the code both by hand on small examples and using a different (slower but more straightforward) script on a few real examples, and I am pretty sure that the levels are being computed correctly. This is very different from what happened on my smaller examples before. Together, the two tables say that, as measured by cosine similarity of count vectors, there are highly significant differences in word usage between every pair of decades in the 18th century. This is kind of cool - it's especially interesting that the differences are so significant across every pair of decades - but it is also what one would expect, so it is not exactly earth-shattering.

Further questions:

I had hoped that these tables would reveal significant differences only between the first and second halves of the century, but not within them, or something like that, so that one could conclude that Johnson influenced word usage substantially. But there are better ways to look at this question than what I did above. Two ideas:
(1) It would be interesting to quantify not just the amount of change but the rate of change. Hopefully the rate would increase post-Johnson.
(2) It would be interesting to quantify the variance in word usage pre-Johnson versus post-Johnson. Hopefully the variance is smaller post-Johnson. (This second experiment, especially, seems to relate more directly than the one I did to the kind of influence you told me Johnson is thought to have had on English usage.)
It would also be interesting to investigate in more detail the results I found in the tables above.
In particular:
(1) It would be interesting to look at different slices of the corpus using the same procedure, e.g., pairs of single years, or pairs of different kinds of documents.
(2) It would be good to look in detail at the experiment above for conceptual bugs in its setup. For example, perhaps the differences in cosine similarity among decades are due to some simple, silly variations; e.g., every document in the 1720s probably has at least one occurrence of 172x. (I do not think this particular example is likely a problem, but it would be good to be sure.)
(3) Perhaps it would be interesting to run more permutations for each pair of decades, to get more differentiation among the levels. The estimated levels are all at the floor of 1/(1000+1), but the true tail probabilities of the statistic surely differ from pair to pair, and more permutations would resolve them.

Finally, I am still trying to think of other interesting questions one can ask about this corpus. I am curious whether you have any recommendations of short books or survey papers about contemporary literary criticism (or other academic work) on 18th century literature. I am also curious if you have any thoughts about any of this, of course!

Nate