Hi Robin,

Ok great! Last night I finished running the same experiment again, but now on the entire dataset (including all categories, not just LangLit1 or a subset of it). In summary, (i) I detect significant differences in word usage between all pairs of decades, but (ii) I do not detect, in this experiment, any influence by Johnson on word usage. Below I give details about the experiment, present and analyze the results, and discuss some other questions that might be interesting to investigate.

What I did:

Here are specific notes about the experiment. (Some of this is repeated from previous emails, but I have written it more carefully this time.)

- I extracted the year from each document's filename in the txts directory. Some docs had more than one date, and I always took the one that appeared first in the filename. I did this mostly automatically, but I went through and cleaned up mistakes. I excluded documents with no date in the filename or with a date before 1700 or after 1799. After this, 85518 documents remained.
- I assigned each year to one of the decades 1700-1709, 1710-1719, etc.
- I did the following simple OCR cleanups on the text of each document (I'm curious what ones you and Daniel have been doing, too - I haven't really thought much about this issue yet):
  - replace " ' d" -> "'d", e.g., "reform 'd" -> "reform'd"
  - replace "& c" -> "&c"
  - replace "- " -> "", e.g., "Spi- rit" -> "Spirit"
  - replace "-" -> " ", e.g., "He boldly hiccups-but he cannot" -> "He boldly hiccups but he cannot"
  - remove all chars other than a-z, A-Z, 0-9, "&", and " "
  - lowercase everything
- I made a big list of all words (with counts) occurring more than once in the corpus. (There were 94084366 words that occurred only once in the corpus; the word "the" occurred 262543504 times.) I can give you this list if you're interested and don't already have it.
- I removed from each document all words occurring <100 or >5000000 times.
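Since you asked about cleanups, here is roughly what mine look like in code - a minimal sketch, where the function names and thresholds-as-defaults are my own framing, and the replacement order matters (the hyphen rejoin has to run before remaining hyphens become spaces):

```python
import re

def clean_text(text):
    """The simple OCR cleanups listed above, applied in order.

    Note: the final character filter drops apostrophes anyway,
    so "reform'd" ultimately becomes "reformd".
    """
    text = text.replace(" ' d", "'d")   # "reform ' d" -> "reform'd"
    text = text.replace("& c", "&c")
    text = text.replace("- ", "")       # rejoin hyphenated line breaks: "Spi- rit" -> "Spirit"
    text = text.replace("-", " ")       # "hiccups-but" -> "hiccups but"
    text = re.sub(r"[^a-zA-Z0-9& ]", "", text)  # keep only a-z, A-Z, 0-9, "&", " "
    return text.lower()

def restrict_vocab(counts, lo=100, hi=5_000_000):
    """Keep words whose corpus count is within [lo, hi], i.e. drop
    words occurring <100 or >5000000 times, as described above.
    counts: dict mapping word -> total count in the corpus."""
    return {w for w, c in counts.items() if lo <= c <= hi}
```

(One thing this sketch makes visible: since apostrophes are stripped by the character filter, the "'d" normalization mostly just prevents a stray standalone "d" token.)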
The goal was to exclude, on one hand, OCR and other noise, and on the other hand, stopwords, since both of these could hide true variation; in my first experiment on a small dataset, I did observe this masking before I did the thresholding. I chose the specific thresholds by eyeballing the big list of words+counts, but it would be nice to choose them by some automatic tuning procedure. After this thresholding, 662280 words remained in the vocabulary.
- I mapped each document to a bag-of-words count vector, averaged the vectors within each decade (averaging rather than summing, since the cosine is length-invariant anyway), and computed the cosine between the averaged vectors for each pair of decades, as in my previous experiments. The result is shown in the first table below, where the (i,j) entry is the cosine between decade i's averaged vector and decade j's averaged vector.
- I did a permutation test for significance as in my previous experiment: for each pair of decades, I (a) collected the documents occurring in one or the other decade, (b) randomly permuted the labels assigning the documents to the two decades, (c) averaged the count vectors according to the new decade assignments, and (d) computed the cosine between the averaged vectors. I did this 1000 times for each pair and counted the number of times the new cosine (under permuted labels) was less than the observed cosine (under the original labels). Call this number r. The fraction (r+1)/(1000+1) gives an estimate of the probability that a cosine under a random permutation of decade labels would be smaller than the observed cosine. Hopefully this number is smaller than, say, 0.05. I report this fraction, for each pair of decades, in the second table ("levels") below.
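For concreteness, the averaging + cosine + permutation-test steps can be sketched like this (assuming numpy, with each decade's documents as rows of a count matrix; all names here are mine, not from the actual script):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def permutation_level(docs_a, docs_b, n_perms=1000, seed=0):
    """Estimate the probability that a cosine under random relabeling
    is smaller than the observed cosine.

    docs_a, docs_b: 2-D arrays, one row per document's count vector.
    Returns (observed cosine, (r + 1) / (n_perms + 1)).
    """
    rng = np.random.default_rng(seed)
    observed = cosine(docs_a.mean(axis=0), docs_b.mean(axis=0))
    pooled = np.vstack([docs_a, docs_b])
    n_a = len(docs_a)
    r = 0
    for _ in range(n_perms):
        # Randomly reassign the pooled documents to the two decades.
        perm = rng.permutation(len(pooled))
        new_a = pooled[perm[:n_a]].mean(axis=0)
        new_b = pooled[perm[n_a:]].mean(axis=0)
        if cosine(new_a, new_b) < observed:
            r += 1
    return observed, (r + 1) / (n_perms + 1)
```

With r = 0 and 1000 permutations this bottoms out at 1/1001 ≈ 0.0010, which is why that value dominates the levels table below.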
Results:

cosines =

        1700s   1710s   1720s   1730s   1740s   1750s   1760s   1770s   1780s   1790s
1700s   1.0000  0.9892  0.9897  0.9812  0.9638  0.9626  0.9506  0.9399  0.8993  0.9317
1710s   0.9892  1.0000  0.9887  0.9861  0.9631  0.9599  0.9514  0.9387  0.8907  0.9331
1720s   0.9897  0.9887  1.0000  0.9921  0.9804  0.9794  0.9699  0.9606  0.9234  0.9510
1730s   0.9812  0.9861  0.9921  1.0000  0.9881  0.9849  0.9809  0.9722  0.9339  0.9644
1740s   0.9638  0.9631  0.9804  0.9881  1.0000  0.9931  0.9872  0.9874  0.9655  0.9748
1750s   0.9626  0.9599  0.9794  0.9849  0.9931  1.0000  0.9893  0.9885  0.9672  0.9784
1760s   0.9506  0.9514  0.9699  0.9809  0.9872  0.9893  1.0000  0.9920  0.9684  0.9856
1770s   0.9399  0.9387  0.9606  0.9722  0.9874  0.9885  0.9920  1.0000  0.9814  0.9894
1780s   0.8993  0.8907  0.9234  0.9339  0.9655  0.9672  0.9684  0.9814  1.0000  0.9676
1790s   0.9317  0.9331  0.9510  0.9644  0.9748  0.9784  0.9856  0.9894  0.9676  1.0000

levels =

        1700s   1710s   1720s   1730s   1740s   1750s   1760s   1770s   1780s   1790s
1700s   1.0000  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010
1710s   0.0010  1.0000  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010
1720s   0.0010  0.0010  1.0000  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010
1730s   0.0010  0.0010  0.0010  1.0000  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010
1740s   0.0010  0.0010  0.0010  0.0010  1.0000  0.0010  0.0010  0.0010  0.0010  0.0010
1750s   0.0010  0.0010  0.0010  0.0010  0.0050  1.0000  0.0010  0.0010  0.0010  0.0010
1760s   0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  1.0000  0.0010  0.0010  0.0010
1770s   0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  1.0000  0.0010  0.0010
1780s   0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  1.0000  0.0010
1790s   0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  0.0010  1.0000

Discussion:

Compared to the results in my previous emails, the cosines here vary much more regularly across decades. This is a relief - I was confused by some of the previous results - but it is also what one would expect, since the corpus is much larger than in my previous experiments. As before, the cosines decrease mostly monotonically as the separation in time increases.
At first I thought that the levels reported above were incorrect, since all but one of the off-diagonal entries is 0.0010, which means that none of the permuted cosines was smaller than the observed cosine (r = 0, so (0+1)/(1000+1) ≈ 0.0010). However, I have double-checked the code both by hand on small examples and using a different (slower but more straightforward) script on a few real examples, and I am pretty sure that the levels are being computed correctly. This is very different from what happened on my smaller examples before. Together, the two tables say that, as measured by cosine similarity of count vectors, there are highly significant differences in word usage between every pair of decades in the 18th century. This is kind of cool - it's especially interesting that the differences are so significant across every pair of decades - but it is also what one would expect, so it is not exactly earth-shattering.

Further questions:

I had hoped that these tables would reveal significant differences only between the first and second halves of the century, but not within them, or something like that, so that one could conclude that Johnson influenced word usage substantially. But there are better ways to look at this question than what I did above. Two ideas:
(1) It would be interesting to quantify not just the amount of change but the rate of change. Hopefully the rate would increase post-Johnson.
(2) It would be interesting to quantify the variance in word usage pre-Johnson versus post-Johnson. Hopefully the variance is smaller post-Johnson. (This second experiment, especially, seems to relate more directly than the one I did to the kind of influence you told me Johnson is thought to have had on English usage.)
It would also be interesting to investigate in more detail the results I found in the tables above.
In particular:
(1) It would be interesting to look at different slices of the corpus using the same procedure, e.g., pairs of single years, or pairs of different kinds of documents.
(2) It would be good to look in detail at the experiment above for conceptual bugs in its setup. For example, perhaps the differences in cosine similarity among decades are due to some simple, silly variations; e.g., every document in the 1720s probably has at least one occurrence of 172x. (I do not think this particular example is likely a problem, but it would be good to be sure.)
(3) Perhaps it would be interesting to run more permutations for each pair of decades, to get more differentiation among the levels. The estimated levels are all at the floor of 1/(1000+1), but the true tail probabilities of the statistic surely differ from pair to pair, and more permutations would resolve them.

Finally, I am still trying to think of other interesting questions one can ask about this corpus. I am curious whether you have any recommendations of short books or survey papers about contemporary literary criticism (or other academic work) on 18th century literature. I am also curious if you have any thoughts about any of this, of course!

Nate