Authors: Tim Cluff and Dan Wendorf
We have written a Python script that crawls the University of Wisconsin - Madison computer science department's webpages, looking at the frequency of 'scholarly' word use. Instead of reporting simply on the most frequent words, we report on words that appear in pages changed within the past seven days and that are determined to have a 'scholarly' nature. The results attempt to show which terms or concepts are popular in active pages. After filtering out the most common English words, the remaining words are subjected to a scoring mechanism, and the top one hundred scores are reported on our website. Each word in the top one hundred can be further examined to show which sites include that word.
This is similar to Google News, a website that attempts to detect trends in news reports.
We first use the standard Unix 'find' command to get a list of all user sites in the cs.wisc.edu domain. We do this by searching the computer science department's file system for all *.html and *.htm files in the /public/html directories of all users.
This search is multithreaded in an attempt to speed it up, and it looks only for pages that have changed within the past seven days. By looking only at the most recently changed pages, we limit our reports to current activity in the department, allowing new words to surface easily from week to week.
From this list, all files owned by the user "list" are removed, as they are merely logs of discussions created mostly by classes within the department and do not represent pages meant for any degree of publication.
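The actual search is performed by the htmlList bash script described later, but as a rough sketch, equivalent logic in Python might look like the following; the function name and the root path passed to it are illustrative only, not part of our scripts:

    import os
    import pwd
    import time

    SEVEN_DAYS = 7 * 24 * 60 * 60

    def recently_changed_pages(root):
        """Yield .html/.htm files under root that changed in the past seven
        days, skipping anything owned by the user 'list'."""
        now = time.time()
        for dirpath, dirnames, filenames in os.walk(root):
            if 'public/html' not in dirpath:
                continue                      # only yield from users' web directories
            for name in filenames:
                if not name.endswith(('.html', '.htm')):
                    continue
                path = os.path.join(dirpath, name)
                try:
                    st = os.stat(path)
                except OSError:
                    continue                  # unreadable or vanished file
                if now - st.st_mtime > SEVEN_DAYS:
                    continue                  # not changed within seven days
                try:
                    owner = pwd.getpwuid(st.st_uid).pw_name
                except KeyError:
                    owner = None
                if owner == 'list':
                    continue                  # skip mailing-list archives
                yield path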
Once the list of sites has been generated, a Python script breaks the list into 26 pieces and sends each piece to a separate computer for processing. Each computer then splits its list into 26 pieces and creates a thread for each piece. The threads examine the contents of the pages, counting word occurrences and recording on which pages each word can be found. When the threads have completed, the process merges each thread's list, then sends the merged list back to the original process, which in turn merges all 26 lists. This gives us the raw word count data to be analyzed.
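The distribution across machines is handled by the scripts themselves; the per-machine threading and merging amounts to roughly the following sketch (function and variable names here are illustrative, and HTML markup is not stripped, for brevity):

    import re
    import threading
    from collections import Counter, defaultdict

    WORD_RE = re.compile(r"[A-Za-z']+")

    def count_chunk(paths, counts, pages_for_word, lock):
        """Count word occurrences in one chunk of pages, record which pages
        each word appears on, then merge into the shared maps."""
        local_counts = Counter()
        local_pages = defaultdict(set)
        for path in paths:
            try:
                text = open(path, errors='ignore').read()
            except OSError:
                continue
            for word in WORD_RE.findall(text.lower()):
                local_counts[word] += 1
                local_pages[word].add(path)
        with lock:                              # merge this thread's results
            counts.update(local_counts)         # adds counts together
            for word, pages in local_pages.items():
                pages_for_word[word] |= pages

    def count_all(paths, n_chunks=26):
        """Split the page list into n_chunks pieces, count each piece in its
        own thread, and return the merged counts and page sets."""
        counts = Counter()
        pages_for_word = defaultdict(set)
        lock = threading.Lock()
        chunk = max(1, -(-len(paths) // n_chunks))   # ceiling division
        threads = [threading.Thread(target=count_chunk,
                                    args=(paths[i:i + chunk], counts,
                                          pages_for_word, lock))
                   for i in range(0, len(paths), chunk)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return counts, pages_for_word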
Our ranking system attempts to determine which of the most common words can be considered scholarly, then reports those words. It first removes the 250 most common English words (found in the file elimList), which, due to their frequency, cannot be considered scholarly. Next, it does a simple sort of all remaining words by raw frequency. Once the list is sorted, it uses the Google API to generate a Scholar Score for the most common words. This score is determined by a simple formula:
Using the first ten Google results for the current word, generate two numbers: 'e' is the number of results in the .edu top-level domain, and 'c' is the number of results in the .com top-level domain. The final score is 2^e - 1.3^c. This formula weights c and e such that, for example, a c value of 4 (1.3^4 ≈ 2.86) is more than twice as significant as a c value of 1 (1.3). c is weighted less heavily than e so that a word with e = 5 and c = 5 scores higher than a word in which both equal 2, since five .edu results are still significant.
This formula can give some words similar scores even when their e and c totals are fairly different. For example, when e and c are 3 and 7, respectively, the Scholar Score is approximately 1.73, and when the totals are 2 and 3, the score is approximately 1.80. Such similar scores imply that the two words are considered almost equally scholarly. In a similar vein, totals of e = 4 and c = 6 give a score of approximately 11.17. This outweighs a word with e = 3 and c = 0, which scores exactly 7. That is, a word whose top ten Google results are 40% .edu will always be more scholarly than a word whose top ten results are only 30% .edu, regardless of what percent are .com.
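Given the list of URLs returned for a word (the PyGoogle lookup itself is omitted here), the score can be computed as in the following sketch; tld and scholar_score are illustrative names, not functions from our scripts:

    from urllib.parse import urlparse

    def tld(url):
        """Return the top-level domain of a result URL ('' if unknown)."""
        host = urlparse(url).hostname or ''
        return host.rsplit('.', 1)[-1]

    def scholar_score(result_urls):
        """Scholar Score 2^e - 1.3^c, where e and c are the number of .edu
        and .com results among the first ten Google results for a word."""
        tlds = [tld(url) for url in result_urls[:10]]
        e = tlds.count('edu')
        c = tlds.count('com')
        return 2 ** e - 1.3 ** c

    # e = 3, c = 7 gives about 1.73; e = 2, c = 3 gives about 1.80
    print(scholar_score(['http://a.edu'] * 3 + ['http://b.com'] * 7))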
We apply the formula to the 500 most frequent words, then sort them by score. All but the 100 most scholarly words are discarded, and the 100 remaining words are sorted again by frequency of use to generate our top 100 scholarly words of the week.
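A condensed sketch of this selection step, assuming the merged counts are in a Counter and a score function wraps the Google lookup above (weekly_top_100 is an illustrative name):

    from collections import Counter

    def weekly_top_100(counts, score):
        """counts: Counter mapping word -> raw frequency; score: callable
        returning a word's Scholar Score. Returns the 100 most scholarly of
        the 500 most frequent words, ordered by raw frequency."""
        candidates = [word for word, _ in counts.most_common(500)]
        most_scholarly = sorted(candidates, key=score, reverse=True)[:100]
        return sorted(most_scholarly, key=lambda w: counts[w], reverse=True)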
The weekly top 100 can be found at http://www.cs.wisc.edu/~wendorf/top100.py. This website is generated by a Python script that parses the top 100 results, gives additional information on the project, and allows one to obtain additional information on each of the 100 words.
The additional information is a list of all websites that include that particular word, sorted by frequency. This allows one to see exactly which active pages and projects are dealing with the most popular words.
Our code is available for download at http://www.cs.wisc.edu/~wendorf/scholarCount.tar
Once our code is downloaded, a few steps must be taken to execute it properly. First, Python, PyGoogle, and a webserver capable of executing Python scripts must be installed. Our downloadable code already contains a copy of PyGoogle, though it may not be the newest version. The current implementation of our code runs on the UW - Madison computer science webserver, which runs in a Unix-like environment. This is likely the easiest sort of environment in which to get the code running.
Once the code is downloaded, run "./htmlList > addresses.txt"; htmlList is a bash script that finds all .html and .htm files in the user directories. If you are not running this on the UW - Madison computer science network, you may need to modify htmlList to find the files in your environment; otherwise, results may be incorrect.
Next, run "./parseAddresses", another bash script, which calls the Python scripts to compute word counts, sort the words, find their Scholar Scores, and generate the top 100 scholarly word list. This list is output to a file called resultsDict.
An optional script, "./splitParse", is also included and can be used instead of "./parseAddresses"; it breaks the address list into several segments before parsing them. This is usually unnecessary; however, it reduces the overhead incurred by the Python scripts and is important for large file lists. With more than 10,000 files to parse, the run time can increase to over a day if the list is not split into smaller segments. splitParse takes one parameter, the number of segments to split the file into. It is best if each of the evenly divided segments contains fewer than 3,000 or so files.
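For reference, a minimal Python 3 wrapper for these steps might look like the sketch below; it assumes a Unix-like shell and that the scripts are in the current directory, and running the commands directly by hand works just as well:

    import subprocess

    # Hypothetical wrapper that runs the pipeline end to end.
    subprocess.run('./htmlList > addresses.txt', shell=True, check=True)
    subprocess.run('./parseAddresses', check=True)
    # For large address lists, './splitParse 4' (here, four segments) could
    # be used in place of parseAddresses.
    # resultsDict is now ready for top100.py to display.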
Loading top100.py in a web browser will now display the top results as computed by generateTop100.py.
The code contains a subfolder "results", which holds test results of the scripts. If you would like to display the website without generating your own results, copy the contents of this folder into the folder that contains the Python scripts.