CS 368-1 (2010 Summer) — Day 6 Homework

Due Thursday, July 22nd, at the start of class.

Description

Write a Perl script to analyze a web server log file and report “referrers” for web pages.

Webmasters frequently want to know which pages users are looking at and where those users came from. This information is present in the server logs, but not in an easy-to-read form.

Background Information

To load a web page, the web browser contacts the server using the HTTP protocol. It sends the desired URL, along with other information. This information includes the page the user came from, the so-called “Referer” or referring URL. (Yes, “Referer” is misspelled. Trivia for the day: It is misspelled in the HTTP protocol specification.) Many web servers log this information.

Here are a few lines from a real access_log, modified slightly to anonymize the users:

97.100.0.0 - - [13/Apr/2009:03:11:23 -0700] "GET /gencon-indy-2009.cgi/group/ HTTP/1.1" 200 26852 "http://gencon.highprogrammer.com/gencon-indy-2009.cgi" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.8) Gecko/2009032608 Firefox/3.0.8"
96.236.0.0 - - [13/Apr/2009:04:57:26 -0700] "GET /gencon-indy-2009.cgi/group/ HTTP/1.1" 200 26852 "http://gencon.highprogrammer.com/gencon-indy-2009.cgi" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8 (.NET CLR 3.5.30729)"
168.137.0.0 - - [13/Apr/2009:05:37:34 -0700] "GET /gencon-indy-2009.cgi/type/SPA/All_events HTTP/1.1" 304 22 "-" "Mozilla/4.0 (compatible;)"

Assume that fields are separated by spaces. (The reality is a bit more complex, but this is good enough for the purposes of this assignment.) The fields are:

  1. 97.100.0.0 - IP address of visitor
  2. - - Unused
  3. - - Unused
  4. [13/Apr/2009:03:11:23 - Date and time of visit
  5. -0700] - Time zone for the date
  6. "GET - Command (usually GET or POST)
  7. /gencon-indy-2009.cgi/group/ - Requested URL. The "http://gencon.highprogrammer.com" is not listed. It can be implied, since an access_log usually corresponds to a single web site.
  8. HTTP/1.1" - HTTP protocol version
  9. 200 - Return code. 200 is success.
  10. 26852 - Bytes returned
  11. "http://gencon.highprogrammer.com/gencon-indy-2009.cgi" - Referring URL. It will be "-" if the web browser didn't indicate a referring URL.
  12. "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.8) Gecko/2009032608 Firefox/3.0.8" - Web browser used. (This field contains spaces. We are ignoring this field for this exercise, so we can ignore this detail.)
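The field numbering above maps onto zero-based array indices once a line is split on spaces. The following is a minimal sketch, not a required approach; the helper name requested_and_referrer is hypothetical and the sample line is the third log line shown earlier.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the assignment): return the requested URL
# and the referring URL from one access_log line.
sub requested_and_referrer {
    my ($log_line) = @_;
    # split(" ", ...) skips leading whitespace and splits on runs of whitespace.
    my @fields = split(" ", $log_line);
    # Field 7 in the list above is index 6; field 11 is index 10.
    return ($fields[6], $fields[10]);
}

my $sample = '168.137.0.0 - - [13/Apr/2009:05:37:34 -0700] '
           . '"GET /gencon-indy-2009.cgi/type/SPA/All_events HTTP/1.1" '
           . '304 22 "-" "Mozilla/4.0 (compatible;)"';

my ($url, $referrer) = requested_and_referrer($sample);
print "$url came from $referrer\n";
```

Because the browser field (field 12) contains spaces, it ends up spread over several array elements, which is harmless here since only indices 6 and 10 are read.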

What to Do

Your script will read the access_log file, analyze the log lines, and generate a report. The report will list all of the requested URLs. Under each requested URL will be a list of referring URLs, each with a count of how many times it was seen for that requested URL. The results can be ordered in any way, so long as the referring URLs are listed under the matching requested URL. For example, given the above three log lines, appropriate output might look like:

/gencon-indy-2009.cgi/group/
    2 "http://gencon.highprogrammer.com/gencon-indy-2009.cgi"

/gencon-indy-2009.cgi/type/SPA/All_events
    1 "-"

We can see that /gencon-indy-2009.cgi/group/ was requested twice, and both times the visitor came from "http://gencon.highprogrammer.com/gencon-indy-2009.cgi". /gencon-indy-2009.cgi/type/SPA/All_events was requested once, but the "-" indicates that we don't know where the visitor came from.
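One possible way to hold these counts is a hash of hashes keyed first by requested URL and then by referring URL. The sketch below is only an illustration under that assumption; the helper names count_referrers and print_report are hypothetical, and the input pairs are typed in by hand rather than read from a file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: tally [requested URL, referring URL] pairs.
# Returns a reference to a hash of hashes: $count->{$url}{$referrer}.
sub count_referrers {
    my @pairs = @_;
    my %count;
    for my $pair (@pairs) {
        my ($url, $referrer) = @$pair;
        $count{$url}{$referrer}++;    # inner hash is created on first use
    }
    return \%count;
}

# Hypothetical helper: print each requested URL followed by its referrers.
sub print_report {
    my ($count) = @_;
    for my $url (sort keys %$count) {
        print "$url\n";
        for my $referrer (sort keys %{ $count->{$url} }) {
            print "    $count->{$url}{$referrer} $referrer\n";
        }
        print "\n";
    }
}

# The (requested URL, referring URL) pairs from the three sample log lines.
my $count = count_referrers(
    ['/gencon-indy-2009.cgi/group/',
     '"http://gencon.highprogrammer.com/gencon-indy-2009.cgi"'],
    ['/gencon-indy-2009.cgi/group/',
     '"http://gencon.highprogrammer.com/gencon-indy-2009.cgi"'],
    ['/gencon-indy-2009.cgi/type/SPA/All_events', '"-"'],
);
print_report($count);
```

Sorting the keys is a choice made here only to get repeatable output; the assignment allows any order.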

Use Perl’s built-in split() function to extract the relevant fields from each line of the access_log. This function takes a separator and a string, splits the string into elements at each occurrence of the separator, and returns the elements as a list. So:

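# WEB_LOG_FILE is assumed to be a filehandle already opened on the access_log.
my $log_line = <WEB_LOG_FILE>;
my @log_line_elements = split(" ", $log_line);
my $ip_address = $log_line_elements[0];   # field 1: IP address of visitor
# etc.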

Note that the most straightforward usage of split in this case will leave the referring URL in quotes. This is fine; there is no need to remove the quotes from around the referring URL.

Here is an example access_log for testing. This is real data, trimmed to keep the output relatively simple, and the IP addresses have been modified to protect the privacy of visitors to the web site. Using this input file, output might look like this:

% ./homework-06-script

/gencon-indy-2009.cgi/type/TCG/Magic_The_Gathering
    2 "http://www.teammeandeck.com/index.php?topic=31207.msg449498;boardseen"
    16 "http://www.themanadrain.com/index.php?topic=37686.0"
    2 "http://www.mtgthesource.com/forums/showthread.php?t=13019"
    7 "http://www.teammeandeck.com/index.php?topic=31207.0"
    2 "http://www.themanadrain.com/index.php?topic=37686.msg525960;boardseen"
    13 "-"
    3 "http://www.teammeandeck.com/index.php?topic=31207.0;topicseen"

/gencon-indy-2009.cgi/event/SEM0902117
    2 "http://www.rpgblog2.com/"
    2 "http://twitturls.com/"
    42 "-"

/gencon-indy-2009.cgi/event/TCG0905629
    4 "http://forums.mtgsalvation.com/showthread.php?p=3782810"
    22 "http://www.themanadrain.com/index.php?topic=37686.0"
    387 "http://forums.mtgsalvation.com/showthread.php?t=157666"
    26 "-"
    2 "http://www.themanadrain.com/index.php?topic=37686.msg524752"

/gencon-indy-2009.cgi/event/ZED0905139
    7 "-"
    2 "http://community.gencon.com/forums/p/20022/222834.aspx"
    15 "http://community.gencon.com/forums/t/20022.aspx"

/gencon-indy-2009.cgi/group/
    3 "http://forum.rpg.net/showthread.php?p=10286372"
    15 "http://forum.rpg.net/showthread.php?t=450129"
    58 "-"
    151 "http://gencon.highprogrammer.com/gencon-indy-2009.cgi"

/gencon-indy-2009.cgi/event/SEM0900278
    2 "http://community.gencon.com/forums/p/19852/221347.aspx"
    15 "-"

/gencon-indy-2009.cgi/type/SPA/All_events
    5 "http://blog.craftzine.com/2.html"
    2 "http://www.google.com/reader/view/?tab=cy"
    11 "http://www.google.com/reader/view/"
    3 "http://www.google.com/reader/view/?tab=my"
    88 "-"
    28 "http://blog.craftzine.com/"
    21 "http://blog.craftzine.com/archive/2009/04/gen_con_gets_crafty.html"

Given this report, we can see that /gencon-indy-2009.cgi/event/SEM0902117 was requested 46 times: twice the visitor came from www.rpgblog2.com, twice the visitor came from twitturls.com, and 42 times the referrer information was not available.
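The total of 46 is simply the sum of the per-referrer counts for that URL. As a small check, assuming the counts are held in a hash keyed by referring URL, the sum can be computed as follows; the helper name total_requests is hypothetical and the numbers are copied from the sample report above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

# Counts copied from the SEM0902117 entry in the sample report.
my %referrer_count = (
    '"http://www.rpgblog2.com/"' => 2,
    '"http://twitturls.com/"'    => 2,
    '"-"'                        => 42,
);

# Hypothetical helper: total requests for one URL, given its referrer counts.
# The leading 0 makes the result 0 rather than undef for an empty hash.
sub total_requests {
    my ($counts) = @_;
    return sum(0, values %$counts);
}

print total_requests(\%referrer_count), "\n";   # prints 46
```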

Reminders

Do the work yourself, consulting reasonable reference materials as needed; any reference material that gives you a complete or nearly complete solution to this problem or a similar one is not OK to use. Asking the instructors for help is OK; asking other students for help is not.

Hand In

A printout of your output on a single sheet of paper. Be sure to put your own name in the initial comment block of the code. Identifying your work is important, or you may not receive appropriate credit.