Due Thursday, July 22nd, at the start of class.
Write a Perl script to analyze a web server log file and report “referrers” for web pages.
Webmasters frequently want to know what pages users are looking at, and where they come from. This information is present in their logs, but not in an easy-to-read form.
To load a web page, the web browser contacts the server using the HTTP protocol. It sends the desired URL, along with other information. This information includes the page the user came from, the so-called “Referer” or referring URL. (Yes, “Referer” is misspelled. Trivia for the day: It is misspelled in the HTTP protocol specification.) Many web servers log this information.
Here are a few lines from a real access_log, modified slightly to anonymize the users:
97.100.0.0 - - [13/Apr/2009:03:11:23 -0700] "GET /gencon-indy-2009.cgi/group/ HTTP/1.1" 200 26852 "http://gencon.highprogrammer.com/gencon-indy-2009.cgi" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.8) Gecko/2009032608 Firefox/3.0.8"
96.236.0.0 - - [13/Apr/2009:04:57:26 -0700] "GET /gencon-indy-2009.cgi/group/ HTTP/1.1" 200 26852 "http://gencon.highprogrammer.com/gencon-indy-2009.cgi" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8 (.NET CLR 3.5.30729)"
168.137.0.0 - - [13/Apr/2009:05:37:34 -0700] "GET /gencon-indy-2009.cgi/type/SPA/All_events HTTP/1.1" 304 22 "-" "Mozilla/4.0 (compatible;)"
Assume that fields are separated with spaces. (The reality is obviously a bit more complex, but this is good enough for purposes of this assignment.) The fields are:
Your script will read the access_log file, analyze the log lines, and generate a report. The report will list all of the requested URLs. Under each requested URL will be a list of referring URLs, each with a count of how many times that referring URL was seen for the requested URL. The results can be ordered in any way, so long as the referring URLs are listed under the matching requested URL. For example, given the above three log lines, appropriate output might look like:
/gencon-indy-2009.cgi/group/
    2 "http://gencon.highprogrammer.com/gencon-indy-2009.cgi"
/gencon-indy-2009.cgi/type/SPA/All_events
    1 "-"
We can see that /gencon-indy-2009.cgi/group/ was requested twice, and both times the visitor came from "http://gencon.highprogrammer.com/gencon-indy-2009.cgi". /gencon-indy-2009.cgi/type/SPA/All_events was requested once, but the "-" indicates that we don't know where the visitor came from.
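The bookkeeping behind a report like this can be sketched with a hash of hashes. This is only an illustration under stated assumptions, not the required approach: the names `@requests` and `%referrer_counts` are hypothetical, and the (requested URL, referring URL) pairs are hard-coded from the three sample lines rather than read from a log file.

```perl
use strict;
use warnings;

# Hypothetical, hard-coded (requested URL, referring URL) pairs taken from
# the three sample lines above; a real script would read them from access_log.
my @requests = (
    [ '/gencon-indy-2009.cgi/group/', '"http://gencon.highprogrammer.com/gencon-indy-2009.cgi"' ],
    [ '/gencon-indy-2009.cgi/group/', '"http://gencon.highprogrammer.com/gencon-indy-2009.cgi"' ],
    [ '/gencon-indy-2009.cgi/type/SPA/All_events', '"-"' ],
);

# Outer key: requested URL. Inner key: referring URL. Value: times seen.
my %referrer_counts;
for my $pair (@requests) {
    my ($url, $referrer) = @$pair;
    $referrer_counts{$url}{$referrer}++;    # autovivifies the inner hash
}

# Print each requested URL followed by its referrers and their counts.
for my $url (sort keys %referrer_counts) {
    print "$url\n";
    for my $referrer (sort keys %{ $referrer_counts{$url} }) {
        print "    $referrer_counts{$url}{$referrer} $referrer\n";
    }
}
```

Sorting the keys is a choice made here only to keep the output stable from run to run; the assignment allows any order.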
Use Perl’s built-in split() function to extract the relevant fields from the access_log. This function takes a string, splits it into elements using another string, and returns the elements as a list. So:
my $log_line = <WEB_LOG_FILE>;
my @log_line_elements = split(" ", $log_line);
my $ip_address = $log_line_elements[0];
# etc.
Note that the most straightforward usage of split in this case will leave the referring URL in quotes. This is fine; there is no need to remove the quotes from around the referring URL.
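To see how split carves up a line, it can help to print every element next to its index. The sketch below assumes one of the sample log lines is held in a string rather than read from a filehandle; the loop is for exploration only and is not part of the required output.

```perl
use strict;
use warnings;

# One of the sample log lines from above, held in a string for illustration.
my $log_line = '168.137.0.0 - - [13/Apr/2009:05:37:34 -0700] "GET /gencon-indy-2009.cgi/type/SPA/All_events HTTP/1.1" 304 22 "-" "Mozilla/4.0 (compatible;)"';

# split(" ", ...) splits on runs of whitespace and skips leading whitespace.
my @log_line_elements = split(" ", $log_line);

# Print each element with its index to see where each field lands.
for my $i (0 .. $#log_line_elements) {
    print "$i: $log_line_elements[$i]\n";
}
```

Because the split is on spaces alone, the timestamp and the quoted request each span more than one element, and the referring URL keeps its surrounding quotes.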
Here is an example access_log for testing. This is real data, trimmed to keep the output relatively simple, and the IP addresses have been modified to protect the privacy of visitors to the web site. Using this input file, output might look like this:
% ./homework-06-script
/gencon-indy-2009.cgi/type/TCG/Magic_The_Gathering
    2 "http://www.teammeandeck.com/index.php?topic=31207.msg449498;boardseen"
    16 "http://www.themanadrain.com/index.php?topic=37686.0"
    2 "http://www.mtgthesource.com/forums/showthread.php?t=13019"
    7 "http://www.teammeandeck.com/index.php?topic=31207.0"
    2 "http://www.themanadrain.com/index.php?topic=37686.msg525960;boardseen"
    13 "-"
    3 "http://www.teammeandeck.com/index.php?topic=31207.0;topicseen"
/gencon-indy-2009.cgi/event/SEM0902117
    2 "http://www.rpgblog2.com/"
    2 "http://twitturls.com/"
    42 "-"
/gencon-indy-2009.cgi/event/TCG0905629
    4 "http://forums.mtgsalvation.com/showthread.php?p=3782810"
    22 "http://www.themanadrain.com/index.php?topic=37686.0"
    387 "http://forums.mtgsalvation.com/showthread.php?t=157666"
    26 "-"
    2 "http://www.themanadrain.com/index.php?topic=37686.msg524752"
/gencon-indy-2009.cgi/event/ZED0905139
    7 "-"
    2 "http://community.gencon.com/forums/p/20022/222834.aspx"
    15 "http://community.gencon.com/forums/t/20022.aspx"
/gencon-indy-2009.cgi/group/
    3 "http://forum.rpg.net/showthread.php?p=10286372"
    15 "http://forum.rpg.net/showthread.php?t=450129"
    58 "-"
    151 "http://gencon.highprogrammer.com/gencon-indy-2009.cgi"
/gencon-indy-2009.cgi/event/SEM0900278
    2 "http://community.gencon.com/forums/p/19852/221347.aspx"
    15 "-"
/gencon-indy-2009.cgi/type/SPA/All_events
    5 "http://blog.craftzine.com/2.html"
    2 "http://www.google.com/reader/view/?tab=cy"
    11 "http://www.google.com/reader/view/"
    3 "http://www.google.com/reader/view/?tab=my"
    88 "-"
    28 "http://blog.craftzine.com/"
    21 "http://blog.craftzine.com/archive/2009/04/gen_con_gets_crafty.html"
Given this report, we can see that /gencon-indy-2009.cgi/event/SEM0902117 was requested 46 times: twice the visitor came from www.rpgblog2.com, twice the visitor came from twitturls.com, and 42 times the referrer information was not available.
Do the work yourself, consulting reasonable reference materials as needed. Any reference material that gives you a complete or nearly complete solution to this problem or a similar one is not OK to use. Asking the instructors for help is OK; asking other students for help is not.
A printout of your output on a single sheet of paper. Be sure to put your own name in the initial comment block of the code. Identifying your work is important, or you may not receive appropriate credit.