CS 368 (Summer 2009) — Day 11 Homework

Due Friday, July 31st, at the start of class.

Description

Download a webpage and identify the IP addresses from which images would be loaded.

Details

There are two parts to this problem. The first is to get a URL from the command line and fetch the webpage located at that URL. The second is to parse the resulting webpage, extract the image URLs, and look up the IP addresses for each unique domain name referenced by those URLs. Here is sample output from my solution:

% ./solution-11.pl http://www.cnn.com/

fetching 'http://www.cnn.com/' now....
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 99974    0 99974    0     0   219k      0 --:--:-- --:--:-- --:--:--  394k

i.cdn.turner.com: 199.93.46.126, 205.128.93.126, 206.33.36.126
i.l.cnn.net: 205.213.110.7, 205.213.110.14
i2.cdn.turner.com: 199.93.46.126, 206.33.36.126, 206.33.57.126

This task may sound very complex, but it’s really not too bad if you break it down into small steps.

To fetch a webpage, use a command-line utility like curl or wget that takes a URL and attempts to download the page. Most Linux systems include at least one, if not both; Mac OS X systems generally include curl. Read or at least skim the manpage for your chosen utility to get a feel for how it works. At the very least, make sure you know how to save the downloaded page to a filename that you select. For what it’s worth, I used the following curl command in my script:

curl --fail --location --output download.html URL

Your script should download the page to a file, allowing any standard output or error output to go to the screen. Which should you use, system() or backticks? Of course, you’ll want to check for errors afterward, because downloads can fail; but be reasonable here, no need to go crazy.
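As background for the system()-versus-backticks question, here is a minimal sketch of how the two behave. It uses harmless stand-in commands (sh and echo) rather than a real download, and the helper names run_and_check and capture_output are made up for illustration.

```perl
use strict;
use warnings;

# run_and_check: hypothetical helper. system() lets the command's own
# output go straight to the screen and leaves the wait status in $?.
sub run_and_check {
    my @command = @_;
    system(@command);
    # -1 means the command could not be started; the low 7 bits are
    # nonzero if it was killed by a signal.
    return -1 if $? == -1 or $? & 127;
    return $? >> 8;    # the command's exit code (0 means success)
}

# capture_output: hypothetical helper. Backticks (written here as qx{})
# collect the command's standard output and hand it back as a string.
sub capture_output {
    my ($command) = @_;
    my $output = qx{$command};
    return $output;
}

my $status = run_and_check('sh', '-c', 'exit 3');
print "exit code: $status\n";

my $text = capture_output('echo hello');
print "captured: $text";
```

The list form of system() used here passes each argument to the command as-is, which avoids shell quoting surprises when an argument (such as a URL) contains characters like & or ?.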

Next, you need to read the file and look for image tags. If you don’t know HTML, that’s OK, here is everything you need to know. An image tag has the following format:

<img src="image_url_here" ...>

There can be extra spaces just about anywhere, but you are free to ignore that here. Essentially, what you want is the stuff between the quotes of the src="…" part — but ONLY from img tags. Most webpages will have lots of these image tags, so find ALL of the image URLs that you can.
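One possible way to pull out the src values is a regular expression applied with the /g modifier. The sketch below is only an illustration under simplifying assumptions: the HTML is held in a single string, the src value is in double quotes, and the helper name image_urls and the sample HTML are invented for this example.

```perl
use strict;
use warnings;

# image_urls: hypothetical helper that returns every src="..." value
# found inside an <img ...> tag in the given HTML text.
sub image_urls {
    my ($html) = @_;
    # In list context, /g returns every captured group.
    # /i also accepts <IMG SRC=...>; [^>]* keeps the match inside one tag,
    # and \s requires whitespace directly before "src".
    my @urls = $html =~ /<img\b[^>]*?\ssrc="([^"]*)"/gi;
    return @urls;
}

# Made-up sample page: two image tags and one script tag to be ignored.
my $sample = <<'HTML';
<a href="http://example.com/"><img src="http://img.example.com/a.png" alt="A"></a>
<script src="http://example.com/app.js"></script>
<IMG class="logo" SRC="/local/logo.gif">
HTML

my @found = image_urls($sample);
print "$_\n" for @found;
```

Note that the script tag also has a src attribute; anchoring the pattern on <img is what keeps it out of the results.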

OK, so now you have a bunch of image URLs. We care about the ones that look like this:

http://domain.name.here/path/goes/here/filename.ext

For each image URL like this, extract the domain name, which is the text between the // and the first single /. However, you want each unique domain name listed only once, so think about how to do that.
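A hash is a common Perl tool for the "only once" requirement, because a key can appear in a hash only one time. The following sketch assumes the image URLs are already in a list; the helper name unique_domains and the sample URLs are hypothetical, and lowercasing the domain is an extra choice of mine, not something the assignment requires.

```perl
use strict;
use warnings;

# unique_domains: hypothetical helper. Takes a list of URLs and returns
# the distinct domain names in sorted order. URLs that do not start
# with http:// (for example, relative paths) are skipped.
sub unique_domains {
    my @urls = @_;
    my %seen;
    for my $url (@urls) {
        # Capture the text after "http://" up to, but not including,
        # the next "/".
        if ($url =~ m{^http://([^/]+)}i) {
            $seen{lc $1} = 1;    # storing the same key twice changes nothing
        }
    }
    my @domains = sort keys %seen;
    return @domains;
}

# Made-up sample URLs: two share a domain and one is a relative path.
my @urls = (
    'http://i.example.com/img/a.png',
    'http://i.example.com/img/b.png',
    '/local/c.gif',
    'http://j.example.net/d.jpg',
);
print "$_\n" for unique_domains(@urls);
```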

Almost there!

Now, call the host command-line utility to look up the IP addresses for each domain name. Because you simply need the output from the host command, which should you use, system() or backticks? The host command has potentially quite messy output, but you only care about one part, which is very regularly formatted. Here is the complete output from a call to host; the key parts are the lines containing “has address”.

% host images.apple.com
images.apple.com is an alias for images.apple.com.edgesuite.net.
images.apple.com.edgesuite.net is an alias for images.apple.com.edgesuite.net.globalredir.akadns.net.
images.apple.com.edgesuite.net.globalredir.akadns.net is an alias for a199.gi3.akamai.net.
a199.gi3.akamai.net has address 205.213.110.8
a199.gi3.akamai.net has address 205.213.110.14

Just look for and parse the IP addresses prefixed by “has address” as shown above.
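To illustrate the parsing step by itself, the sketch below applies a regular expression to the host output shown above, supplied as a plain string instead of by running host. The helper name addresses_from_host_output is hypothetical.

```perl
use strict;
use warnings;

# addresses_from_host_output: hypothetical helper that returns every
# dotted IPv4 address that follows the words "has address".
sub addresses_from_host_output {
    my ($output) = @_;
    # Four groups of 1-3 digits separated by dots; /g in list context
    # returns one captured address per matching line.
    my @addresses = $output =~ /has address (\d{1,3}(?:\.\d{1,3}){3})/g;
    return @addresses;
}

# The sample host output from this handout, pasted in as a string.
my $host_output = <<'END';
images.apple.com is an alias for images.apple.com.edgesuite.net.
images.apple.com.edgesuite.net is an alias for images.apple.com.edgesuite.net.globalredir.akadns.net.
images.apple.com.edgesuite.net.globalredir.akadns.net is an alias for a199.gi3.akamai.net.
a199.gi3.akamai.net has address 205.213.110.8
a199.gi3.akamai.net has address 205.213.110.14
END

print "$_\n" for addresses_from_host_output($host_output);
```

The "is an alias for" lines contain no "has address" text, so they simply produce no matches.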

Finally, your script should output each unique domain name, followed by any and all IP addresses from the host command. Sometimes, host will not return any IP addresses, which is OK (assuming your code is correct).
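The "domain: address, address" layout of the sample output can be produced with join. This is a small sketch only: format_line is a hypothetical helper, the first entry reuses values from the sample output above, and no-address.example is a made-up domain standing in for a lookup that returned nothing.

```perl
use strict;
use warnings;

# format_line: hypothetical helper that builds one output line from a
# domain name and a (possibly empty) list of IP addresses.
sub format_line {
    my ($domain, @addresses) = @_;
    return "$domain: " . join(', ', @addresses) . "\n";
}

# Hash of array references: each domain maps to its list of addresses.
my %addresses_for = (
    'i.l.cnn.net'        => ['205.213.110.7', '205.213.110.14'],
    'no-address.example' => [],    # host returned no addresses
);

for my $domain (sort keys %addresses_for) {
    print format_line($domain, @{ $addresses_for{$domain} });
}
```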

Reminders

Do the work yourself, consulting reasonable reference materials as needed; any reference material that gives you a complete or nearly complete solution to this problem or a similar one is not OK to use. Asking the instructors for help is OK; asking other students for help is not.

For what it’s worth, it has taken me far more lines of text to describe this problem than it took me lines of Perl (about 25, for the record) to code the solution. Of course, I know lots of Perl tricks, but still, the point is that this is fairly easy to code in Perl.

Hand In

A printout of your script on a single sheet of paper. At the top of the printout, please include “CS 368 Summer 2009”, your name, and “Homework 11, July 30, 2009”. Identifying your work is important, or you may not receive appropriate credit.