
CS 368-3 (2012 Summer) — Day 11 Homework

Due Thursday, July 26, at the start of class.

Goal

Download a webpage and identify the IP addresses from which images would be loaded.

Tasks

This script is most easily described in three parts (how you organize your actual script is up to you):

  1. Fetch (download) the HTML for a webpage located at a particular URL.
  2. Parse the downloaded HTML and extract image URLs and their domain names.
  3. Look up the IP address for each unique domain name being referenced by the image URLs.

Here is the command line (starting with %) and sample output from my solution:

% ./homework-11.pl http://www.cnn.com

fetching 'http://www.cnn.com' now....
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   99k  100   99k    0     0   232k      0 --:--:-- --:--:-- --:--:--  263k

Found 59 external images and 2 unique domains.

i.cdn.turner.com: 8.26.219.254, 8.26.221.126, 207.123.44.126
i2.cdn.turner.com: 207.123.44.126, 8.26.219.254, 8.26.221.126

This task may sound very complex, but it’s really not too bad if you break it down into small steps.

Part 1

The first main step is to download the HTML of the webpage at the given URL. You must accept the starting URL from the command line. Of the two command-line parsing methods discussed in class, which is most appropriate in this case?

To fetch a webpage, use a command-line utility like curl or wget that takes a URL and attempts to download the page. Most Linux systems include at least one, if not both; Mac OS X systems generally include curl. Read or at least skim the manpage for your chosen utility to get a feel for how it works. At the very least, make sure you know how to save the downloaded page to a filename that you select. For what it’s worth, I used the following curl command in my script:

curl --fail --location --output download.html URL

Your script should download the HTML directly to a file, letting any standard output or error output go to the screen. Which should you use, system() or backticks? Of course, you must check for errors after the download, because it might fail. Be reasonable here, though; there is no need to go crazy.
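As a rough sketch of this step (the subroutine name and error message are my own, and it assumes curl is installed), the download with error checking might look like this:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of the download step.  system() is the right choice here: we
# do not need to capture curl's output, we just want its progress
# report to go to the screen while the HTML is saved to a file.
sub fetch_page {
    my ($url, $file) = @_;
    my $status = system('curl', '--fail', '--location',
                        '--output', $file, $url);
    die "download of '$url' failed\n" if $status != 0;
}
```

Note that system() returns 0 on success, so anything nonzero means curl reported a problem (thanks to --fail, that includes HTTP errors like 404).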

Part 2

Next, you need to read the file and look for image tags. If you do not know HTML, that is OK; here is everything you need to know. An image tag has the following format:

<img src="image_url_here" ...>

There can be extra spaces just about anywhere, but you are free to ignore that here. Essentially, what you want is the stuff between the quotes of the src="…" part — but only from img tags. Most webpages will have lots of these image tags, so find all of the image URLs that you can.

Here is a hint about creating a regular expression to find and extract the image URLs. Assume that the “image_url_here” part of the img tag does not contain any double quotes itself. That is, you are looking for the minimal text between a pair of double quotes. What regular expression tricks did we cover already to help in this kind of situation?
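One way to apply that trick is a negated character class, [^"]*, which matches the minimal text between a pair of double quotes. Here is a sketch (the subroutine name is illustrative, not required):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: pull every src="..." value out of img tags.  The negated
# character class [^"]* stops at the next double quote, so we get
# exactly the text between the quotes.
sub extract_image_urls {
    my ($html) = @_;
    my @urls;
    # /g in a while loop finds every match; /i tolerates <IMG ...>.
    while ($html =~ /<img\s[^>]*src="([^"]*)"/gi) {
        push @urls, $1;
    }
    return @urls;
}
```

The [^>]* before src=" skips over other attributes (like alt="...") without running past the end of the tag.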

OK, so now you have a bunch of image URLs. We care about the ones that look like this — specifically, ones that start with http://:

http://domain.name.here/path/goes/here/filename.ext

For each image URL like this, extract the domain name, which is the text between the // and the next /. You can use a similar trick here to the one hinted at above for matching stuff between double quotes, except here you are matching stuff between / characters. And once you have them, you want to store each unique domain name only once, so think about how to do that.
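For instance (the subroutine and variable names here are my own), a hash keyed on domain name keeps each one only once:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: extract the domain from each http:// URL and remember each
# unique domain once, using hash keys for uniqueness.
sub unique_domains {
    my (@urls) = @_;
    my %seen;
    for my $url (@urls) {
        # [^/]+ matches the minimal text between // and the next /.
        $seen{$1} = 1 if $url =~ m{^http://([^/]+)};
    }
    return sort keys %seen;
}
```

Because hash keys are unique by definition, assigning the same domain twice has no extra effect; sorting the keys at the end just gives tidier output.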

Almost there!

Part 3

In this part, your script will find the IP address(es) corresponding to each domain name that it found in Part 2.

There is a handy command-line program called host for looking up IP addresses. You need the output from the host command, so which should you use, system() or backticks? The host command has potentially quite messy output, but you only care about one part, which is very regularly formatted. Here is the complete output from a call to host; the key parts are the “has address” lines.

% host images.apple.com
images.apple.com is an alias for images.apple.com.edgesuite.net.
images.apple.com.edgesuite.net is an alias for images.apple.com.edgesuite.net.globalredir.akadns.net.
images.apple.com.edgesuite.net.globalredir.akadns.net is an alias for a199.gi3.akamai.net.
a199.gi3.akamai.net has address 205.213.110.8
a199.gi3.akamai.net has address 205.213.110.14

Just find and extract the IP addresses prefixed by “has address”, as shown above.
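Since the tricky part is the pattern match rather than the command itself, one approach (the names here are illustrative) is to separate the parsing from the backtick call:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: given the text that `host $domain` printed, keep only the
# IP addresses on the "has address" lines.
sub parse_host_ips {
    my ($output) = @_;
    # In list context, /g returns every captured IP address.
    return $output =~ /has address (\d+\.\d+\.\d+\.\d+)/g;
}

# In the real script, the output would come from backticks, e.g.:
#   my @ips = parse_host_ips(`host $domain`);
```

Keeping the parsing in its own subroutine also makes it easy to unit-test against saved host output, without touching the network.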

Output

For output, your script should print the number of image URLs found, the number of unique domain names, and for each unique domain name, all IP addresses associated with that domain name (from the host command). Sometimes, host will not return any IP addresses, which is OK (assuming your code is correct).

See the sample output at the top of this page for one possible output format. Note that some of the output comes from commands that are run by the script.
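To get the comma-separated lines shown in the sample output, join() is handy. A sketch, assuming the results are stored in a hash of array references (that data structure is my choice, not a requirement):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: print the summary line and one "domain: ip, ip, ip" line per
# unique domain, in the style of the sample output.
sub print_report {
    my ($n_images, $ips_for) = @_;
    my $n_domains = scalar keys %$ips_for;
    print "Found $n_images external images and $n_domains unique domains.\n\n";
    for my $domain (sort keys %$ips_for) {
        print "$domain: ", join(', ', @{ $ips_for->{$domain} }), "\n";
    }
}
```

A domain with no IP addresses simply prints an empty list after the colon, which matches the “sometimes host returns nothing” case mentioned above.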

Final Note

This script takes many words to describe, but can be written fairly compactly. My solution easily fit onto a single page (much less than 50 lines of code). If you find that your solution is multiple pages long, think about ways to reorganize your solution to be simpler.

Testing

How will you test your script? Can you write some unit tests to try out the trickiest parts? For example: run the script on a few different webpages; try your image-URL regular expression on a small HTML file that you write by hand; and check that the script behaves sensibly when a download fails or a page contains no images.

Reminders

Do the work yourself, consulting reasonable reference materials as needed. Any resource that provides a complete solution or offers significant material assistance toward a solution is not OK to use. Asking the instructor for help is OK; asking other students for help is not. All standard UW policies concerning student conduct (esp. UWS 14) and information technology apply to this course and assignment.

Hand In

A printout of your code, ideally on a single sheet of paper. Be sure to put your own name in the initial comment block. Identifying your work is important; otherwise, you may not receive appropriate credit.