CS 368-1 (2011 Summer) — Day 10 Homework

Due Thursday, July 28, at the start of class.

Description

Download a webpage and identify the IP addresses from which images would be loaded.

Details

There are multiple phases to this problem. The first part is to fetch (download) the webpage located at a particular URL. The second part is to parse the resulting webpage and extract image URLs and their domain names. The last part is to look up the IP address for each unique domain name being referenced by the image URLs. Here is the command line (starting with %) and sample output from my solution:

% ./homework-10-solution.pl http://www.cnn.com/

fetching 'http://www.cnn.com/' now....
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 99974    0 99974    0     0   219k      0 --:--:-- --:--:-- --:--:--  394k

i.cdn.turner.com: 199.93.46.126, 205.128.93.126, 206.33.36.126
i.l.cnn.net: 205.213.110.7, 205.213.110.14
i2.cdn.turner.com: 199.93.46.126, 206.33.36.126, 206.33.57.126

This task may sound very complex, but it’s really not too bad if you break it down into small steps.

Phase I

To fetch a webpage, use a command-line utility like curl or wget that takes a URL and attempts to download the page. Most Linux systems include at least one, if not both; Mac OS X systems generally include curl. Read or at least skim the manpage for your chosen utility to get a feel for how it works. At the very least, make sure you know how to save the downloaded page to a filename that you select. For what it’s worth, I used the following curl command in my script:

curl --fail --location --output download.html URL

Your script should download the page to a file, allowing any standard output or error output to go to the screen. Which should you use, system() or backticks? Of course, you must check for errors after the download, because it might fail; but be reasonable here: there is no need to go crazy.
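As a reminder of how the two mechanisms behave (not a statement of which one to pick), here is a minimal sketch that runs a harmless echo command in place of a real download. The command and the variable names are illustrative assumptions, not part of the assignment.

```perl
use strict;
use warnings;

# system() runs a command and lets its output go straight to the screen.
# It returns the wait status: 0 means success, -1 means the command
# could not be started at all.
my $status = system('echo', 'printed directly by system');
die "could not run echo: $!\n" if $status == -1;
my $exit_code = $status >> 8;    # the command's own exit code
print "exit code: $exit_code\n";

# Backticks run a command and return its standard output as a string.
my $captured = `echo captured by backticks`;
chomp $captured;
print "captured: [$captured]\n";
```

Note that system() tells you whether the command succeeded, while backticks hand you the text the command printed; in both cases the exit status is also available in $? afterward.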

Phase II

Next, you need to read the file and look for image tags. If you do not know HTML, that is OK; here is everything you need to know. An image tag has the following format:

<img src="image_url_here" ...>

There can be extra spaces just about anywhere, but you are free to ignore that here. Essentially, what you want is the stuff between the quotes of the src="…" part — but only from img tags. Most webpages will have lots of these image tags, so find all of the image URLs that you can.

Here is a handy regular expression trick that might help. We can assume that the “image_url_here” part of the img tag contains no double quotes itself. Therefore, instead of just matching on /"(.*)"/, which can cause problems with certain img tags, we can use a more precise expression:

"([^"]+)"

Effectively, that says, match and save the stuff between double quotes. Of course, you must still provide the full regular expression that matches the img tag context, but the regular expression fragment above should help.
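To see why the more precise fragment matters, here is a small sketch that compares the two patterns on a made-up tag with two quoted attributes. The tag is deliberately not an img tag, and the variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

# A made-up tag with two quoted attributes (not an img tag).
my $tag = '<a href="page.html" title="Home">';

# Greedy: .* runs from the first double quote to the LAST one on the line.
my ($greedy) = $tag =~ /"(.*)"/;

# Negated character class: stops at the very next double quote.
my ($precise) = $tag =~ /"([^"]+)"/;

print "greedy:  $greedy\n";     # page.html" title="Home
print "precise: $precise\n";    # page.html
```

The greedy version swallows everything up to the final quote, which is the kind of problem that tags with several quoted attributes can cause.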

OK, so now you have a bunch of image URLs. We care about the ones that look like this:

http://domain.name.here/path/goes/here/filename.ext

For each image URL like this, extract the domain name, which is the text between the // and the first single /. You can use a similar trick here as the one shown above for matching stuff between double quotes, except here you are matching stuff between / characters. And once you have them, you want each unique domain name listed only once, so think about how to do that.
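One common Perl idiom for keeping each item only once relies on the fact that hash keys are unique. The sketch below uses a made-up list of words rather than domain names; it is one possible approach, not necessarily the one used in the reference solution.

```perl
use strict;
use warnings;

# A made-up list with duplicates.
my @words = ('apple', 'pear', 'apple', 'fig', 'pear');

# Storing each word as a hash key collapses the duplicates.
my %seen;
$seen{$_} = 1 for @words;

# The keys come back in no particular order, so sort them for stable output.
my @unique = sort keys %seen;
print "@unique\n";    # apple fig pear
```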

Almost there!

Phase III

Now, call the host command-line utility to look up the IP address for each domain name. Because you simply need the output from the host command, which should you use, system() or backticks? The host command has potentially quite messy output, but you only care about one part, which is very regularly formatted. Here is the complete output from a call to host; the key parts are the IP addresses at the end of the “has address” lines.

% host images.apple.com
images.apple.com is an alias for images.apple.com.edgesuite.net.
images.apple.com.edgesuite.net is an alias for images.apple.com.edgesuite.net.globalredir.akadns.net.
images.apple.com.edgesuite.net.globalredir.akadns.net is an alias for a199.gi3.akamai.net.
a199.gi3.akamai.net has address 205.213.110.8
a199.gi3.akamai.net has address 205.213.110.14

Just find and extract the IP addresses prefixed by “has address”, as shown above.
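As a rough sketch of that extraction step, the code below works on canned text copied from the sample output above instead of calling host. The variable names and the exact pattern are assumptions; a real script would get the text from running the command and may need a stricter pattern.

```perl
use strict;
use warnings;

# Canned text copied from the sample host output; a real script would
# obtain this text by running the host command instead.
my $output = <<'END';
images.apple.com is an alias for images.apple.com.edgesuite.net.
a199.gi3.akamai.net has address 205.213.110.8
a199.gi3.akamai.net has address 205.213.110.14
END

my @addresses;
for my $line (split /\n/, $output) {
    # Keep only lines of the form "NAME has address IP".
    if ($line =~ /has address\s+(\S+)/) {
        push @addresses, $1;
    }
}

print join(', ', @addresses), "\n";    # 205.213.110.8, 205.213.110.14
```

Lines that describe aliases simply fail to match and are skipped, so a lookup that returns no addresses leaves the list empty.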

Phase IV (Output)

For output, your script should print each unique domain name, followed by any and all IP addresses from the host command. Sometimes, host will not return any IP addresses, which is OK (assuming your code is correct).

Reminders

Do the work yourself, consulting reasonable reference materials as needed; any reference material that gives you a complete or nearly complete solution to this problem or a similar one is not OK to use. Asking the instructors for help is OK; asking other students for help is not.

Hand In

Hand in a printout of your code on a single sheet of paper (if possible). Be sure to put your own name in the initial comment block of the code. Identifying your work is important, or you may not receive appropriate credit.