CS 368 (Summer 2009) — Day 5 Homework

Due Thursday, July 23, at the start of class.

Description

Write a Perl script to report "referrers" for web pages.

Webmasters frequently want to know which pages users are looking at and where those users came from. This information is present in the web server's logs, but not in an easy-to-read form.

Details

To load a web page, the web browser contacts the server using the HTTP protocol. It sends the desired URL, along with other information. This information includes the page the user came from, the so-called "Referer" or referring URL. (Yes, "Referer" is misspelled. It is misspelled in the HTTP protocol specification.) Many web servers log this information, often in a file named "access_log".

Here are a few lines from a real access_log, modified slightly to anonymize the users:

97.100.0.0 - - [13/Apr/2009:03:11:23 -0700] "GET /gencon-indy-2009.cgi/group/ HTTP/1.1" 200 26852 "http://gencon.highprogrammer.com/gencon-indy-2009.cgi" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.8) Gecko/2009032608 Firefox/3.0.8"
96.236.0.0 - - [13/Apr/2009:04:57:26 -0700] "GET /gencon-indy-2009.cgi/group/ HTTP/1.1" 200 26852 "http://gencon.highprogrammer.com/gencon-indy-2009.cgi" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8 (.NET CLR 3.5.30729)"
168.137.0.0 - - [13/Apr/2009:05:37:34 -0700] "GET /gencon-indy-2009.cgi/type/SPA/All_events HTTP/1.1" 304 22 "-" "Mozilla/4.0 (compatible;)"

Assume that fields are separated by spaces. (The reality is a bit more complex, but this is good enough for the purposes of this assignment.) The fields are:

  1. 97.100.0.0 - IP address of visitor
  2. - - Unused
  3. - - Unused
  4. [13/Apr/2009:03:11:23 - Date and time of visit
  5. -0700] - Time zone for the date
  6. "GET - Command (usually GET or POST)
  7. /gencon-indy-2009.cgi/group/ - URL requested. The "http://gencon.highprogrammer.com" is not listed. It can be implied, since an access_log usually corresponds to a single web site.
  8. HTTP/1.1" - HTTP protocol version
  9. 200 - Return code. 200 is success.
  10. 26852 - Bytes returned
  11. "http://gencon.highprogrammer.com/gencon-indy-2009.cgi" - Referring URL. It will be "-" if the web browser didn't indicate a referring URL.
  12. "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.8) Gecko/2009032608 Firefox/3.0.8" - Web browser used. (This field contains spaces. We are ignoring this field for this exercise, so we can ignore this detail.)
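
As a rough illustration of how the numbered fields above line up with positions in a Perl list, here is a minimal sketch that splits the third sample line on whitespace. The variable names are just for illustration. It relies on the assumption stated above that fields are separated by spaces.

```perl
#! /usr/bin/perl
use strict;
use warnings;

# The third sample line from above.
my $line = '168.137.0.0 - - [13/Apr/2009:05:37:34 -0700] "GET /gencon-indy-2009.cgi/type/SPA/All_events HTTP/1.1" 304 22 "-" "Mozilla/4.0 (compatible;)"';

# split with a " " pattern breaks the string on runs of whitespace.
my @fields = split(" ", $line);

# Field N in the numbered list is at index N - 1, because indexes start at 0.
print "Requested URL (field 7):  $fields[6]\n";
print "Referring URL (field 11): $fields[10]\n";
```

Because the web browser field contains spaces, it ends up spread across several list elements. That does not matter here, since it comes after the fields we care about.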

The user will run the script and pass in an access_log. You can accept the access_log as a command line argument or by redirecting STDIN, whichever you like. The program will read the access_log, then generate a report. The report will list all of the requested URLs. Under each requested URL will be a list of referring URLs, each with a count of how many times it was seen for that requested URL. The results can be ordered in any way, so long as the referring URLs are listed under the matching requested URL. For example, given the above three log lines, appropriate output might look like:

/gencon-indy-2009.cgi/group/
    2 "http://gencon.highprogrammer.com/gencon-indy-2009.cgi"

/gencon-indy-2009.cgi/type/SPA/All_events
    1 "-"

We can see that /gencon-indy-2009.cgi/group/ was requested twice, and both times the visitor came from "http://gencon.highprogrammer.com/gencon-indy-2009.cgi". /gencon-indy-2009.cgi/type/SPA/All_events was requested once, but the "-" indicates that we don't know where the visitor came from.
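
A report grouped like this suggests counting things under a two-level key. The following is only a sketch of that general idea on toy data, not a solution to the assignment. The helper name count_pairs and the "group item" input format are made up for illustration. It uses a hash whose values are themselves hashes.

```perl
#! /usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read "group item" lines from a filehandle and
# count how often each item appears under each group.
sub count_pairs {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        chomp $line;
        my ($group, $item) = split(" ", $line);
        next unless defined $item;    # skip blank or short lines
        $count{$group}{$item}++;      # inner hash is created on first use
    }
    return \%count;
}

# Toy data standing in for a real input file.
my $data = "fruit apple\nfruit apple\nfruit pear\nveg leek\n";
open(my $in, '<', \$data) or die "Cannot open in-memory file: $!";
my $count = count_pairs($in);
close($in);

# Print each group followed by its indented item counts.
for my $group (sort keys %$count) {
    print "$group\n";
    for my $item (sort keys %{ $count->{$group} }) {
        print "    $count->{$group}{$item} $item\n";
    }
    print "\n";
}
```

To read real input instead of the toy string, a loop of the form while (my $line = <>) reads from the files named on the command line, or from STDIN if no files are named.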

Perl's built-in split function may be useful for extracting the relevant fields from the access_log. It takes a pattern to split on and a string to cut up. It returns a list of the pieces of the original string, broken wherever the pattern was found. For example:

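#! /usr/bin/perl
use strict;
use warnings;
# A sample line with fields separated by colons.
my $original = "adesmet:X:3014:3014:Alan De Smet:/u/a/d/adesmet:/bin/tcsh";
# Split on each ":" and collect the pieces.
my(@fields) = split(":", $original);
# Print each piece with its index; indexes start at 0.
for(my $i = 0; $i < @fields; $i++) {
	print "$i. $fields[$i]\n";
}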

The output from this example is:

0. adesmet
1. X
2. 3014
3. 3014
4. Alan De Smet
5. /u/a/d/adesmet
6. /bin/tcsh

(As it happens, the above $original is a line from an /etc/passwd file, the file used on Linux and other Unix-like systems to identify users. Various files in /etc, including passwd, group, and services, are well suited to processing in a scripting language. This is a great convenience to administrators.)

Note that the most straightforward usage of split in this case will leave the referring URL in quotes. This is fine.

Here is an example access_log for testing. This is real data, trimmed to keep the output relatively simple, and the IP addresses have been modified to protect the privacy of visitors to the web site. Using this input file, output might look like this:

% ./homework-06-script < example_access_log

/gencon-indy-2009.cgi/type/TCG/Magic_The_Gathering
    2 "http://www.teammeandeck.com/index.php?topic=31207.msg449498;boardseen"
    16 "http://www.themanadrain.com/index.php?topic=37686.0"
    2 "http://www.mtgthesource.com/forums/showthread.php?t=13019"
    7 "http://www.teammeandeck.com/index.php?topic=31207.0"
    2 "http://www.themanadrain.com/index.php?topic=37686.msg525960;boardseen"
    13 "-"
    3 "http://www.teammeandeck.com/index.php?topic=31207.0;topicseen"

/gencon-indy-2009.cgi/event/SEM0902117
    2 "http://www.rpgblog2.com/"
    2 "http://twitturls.com/"
    42 "-"

/gencon-indy-2009.cgi/event/TCG0905629
    4 "http://forums.mtgsalvation.com/showthread.php?p=3782810"
    22 "http://www.themanadrain.com/index.php?topic=37686.0"
    387 "http://forums.mtgsalvation.com/showthread.php?t=157666"
    26 "-"
    2 "http://www.themanadrain.com/index.php?topic=37686.msg524752"

/gencon-indy-2009.cgi/event/ZED0905139
    7 "-"
    2 "http://community.gencon.com/forums/p/20022/222834.aspx"
    15 "http://community.gencon.com/forums/t/20022.aspx"

/gencon-indy-2009.cgi/group/
    3 "http://forum.rpg.net/showthread.php?p=10286372"
    15 "http://forum.rpg.net/showthread.php?t=450129"
    58 "-"
    151 "http://gencon.highprogrammer.com/gencon-indy-2009.cgi"

/gencon-indy-2009.cgi/event/SEM0900278
    2 "http://community.gencon.com/forums/p/19852/221347.aspx"
    15 "-"

/gencon-indy-2009.cgi/type/SPA/All_events
    5 "http://blog.craftzine.com/2.html"
    2 "http://www.google.com/reader/view/?tab=cy"
    11 "http://www.google.com/reader/view/"
    3 "http://www.google.com/reader/view/?tab=my"
    88 "-"
    28 "http://blog.craftzine.com/"
    21 "http://blog.craftzine.com/archive/2009/04/gen_con_gets_crafty.html"

Given this report, we can see that /gencon-indy-2009.cgi/event/SEM0902117 was requested 46 times. Twice the visitor came from www.rpgblog2.com, twice the visitor came from twitturls.com, and 42 times the referrer information was not available.
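
The total of 46 is just the sum of the counts listed under that URL. As a small sketch, assuming the counts are stored in a hash keyed by referring URL (the hash name is made up for illustration), the sum function from the standard List::Util module adds them up:

```perl
#! /usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

# Referrer counts for /gencon-indy-2009.cgi/event/SEM0902117,
# copied from the report above.
my %referrers = (
    '"http://www.rpgblog2.com/"' => 2,
    '"http://twitturls.com/"'    => 2,
    '"-"'                        => 42,
);

# Add up all of the counts to get the number of requests.
my $total = sum(values %referrers);
print "Total requests: $total\n";    # prints "Total requests: 46"
```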

Reminders

Do the work yourself, consulting reasonable reference materials as needed. Any reference material that gives you a complete or nearly complete solution to this problem or a similar one is not OK to use. Asking the instructors for help is OK; asking other students for help is not.

Hand In

A printout of your script on a single sheet of paper. At the top of the printout, please include “CS 368 Summer 2009”, your name, and “Homework 06, July 23, 2009”. Identifying your work is important, or you may not receive appropriate credit.