CS 368-1 (2010 Summer) — Day 9 Homework

Due Tuesday, July 27th, at the start of class.

Description

Download a webpage and identify the IP addresses from which images would be loaded.

Details

There are two parts to this problem. The first part is to fetch (download) the webpage located at a particular URL. The second part is to parse the resulting webpage, extract image URLs, and look up the IP address for each unique domain name being referenced by the image URLs. Here is sample output from my solution:

% ./solution-11.pl http://www.cnn.com/

fetching 'http://www.cnn.com/' now....
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 99974    0 99974    0     0   219k      0 --:--:-- --:--:-- --:--:--  394k

i.cdn.turner.com: 199.93.46.126, 205.128.93.126, 206.33.36.126
i.l.cnn.net: 205.213.110.7, 205.213.110.14
i2.cdn.turner.com: 199.93.46.126, 206.33.36.126, 206.33.57.126

This task may sound very complex, but it’s really not too bad if you break it down into small steps.

To fetch a webpage, use a command-line utility like curl or wget that takes a URL and attempts to download the page. Most Linux systems include at least one, if not both; Mac OS X systems generally include curl. Read or at least skim the manpage for your chosen utility to get a feel for how it works. At the very least, make sure you know how to save the downloaded page to a filename that you select. For what it’s worth, I used the following curl command in my script:

curl --fail --location --output download.html URL

Your script should download the page to a file, allowing any standard output or error output to go to the screen. Which should you use, system() or backticks? Of course, you’ll want to check for errors afterward, because downloads can fail; but be reasonable here, no need to go crazy.
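Whichever mechanism runs the external command, Perl records the result in the special variable $?. The sketch below is one possible way to turn that into a simple exit code. The helper name run_command is hypothetical, and the demonstration runs a tiny Perl one-liner instead of a real download so that it works without a network connection.

```perl
use strict;
use warnings;

# Hypothetical helper: run an external command given as a list of
# arguments (so no shell quoting is involved) and return its exit code.
# Output from the command goes straight to the screen.
sub run_command {
    my @command = @_;
    system(@command);
    if ($? == -1) {
        warn "could not start '$command[0]': $!\n";
        return -1;
    }
    if ($? & 127) {
        warn "'$command[0]' died from signal " . ($? & 127) . "\n";
        return -1;
    }
    return $? >> 8;    # the command's own exit code; 0 means success
}

# Demonstration with the running Perl interpreter ($^X) as the command.
my $ok_code   = run_command($^X, '-e', 'exit 0');
my $fail_code = run_command($^X, '-e', 'exit 3');
print "first command returned $ok_code, second returned $fail_code\n";
```

A nonzero return value means the command reported a failure, which is usually all the error checking a small script needs.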

Next, you need to read the file and look for image tags. If you don’t know HTML, that’s OK; here is everything you need to know. An image tag has the following format:

<img src="image_url_here" ...>

There can be extra spaces just about anywhere, but you are free to ignore that here. Essentially, what you want is the stuff between the quotes of the src="…" part — but ONLY from img tags. Most webpages will have lots of these image tags, so find ALL of the image URLs that you can.

Here is a handy regular expression trick that might help. We can assume that the “image_url_here” part of the img tag contains no double quotes itself. Therefore, instead of just matching on /"(.*)"/, which is greedy and can run past the closing quote to a later double quote on the same line, we can use a more precise expression:

"([^"]+)"

Effectively, that says: match a double quote, then match and save one or more characters that are not double quotes, then match the closing double quote. Of course, you must still provide the full regular expression that matches the img tag context, but the regular expression fragment above should help.
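To see the fragment in context, here is a minimal sketch that applies it to a small string of made-up HTML. The sample HTML, the variable names, and the exact surrounding pattern are assumptions for illustration; real pages are messier and this pattern is not the only reasonable one.

```perl
use strict;
use warnings;

# Made-up HTML for illustration only.
my $html = <<'END_HTML';
<p>Hello</p>
<img src="http://img.example.com/a.png" alt="first">
<a href="http://www.example.com/"><img src="http://static.example.org/b.gif"></a>
<img src="/local/c.jpg">
END_HTML

# Match img tags only, capturing the text between the quotes of src="...".
# [^>]* skips other attributes without leaving the tag; the /g modifier in
# list context returns every capture in the string; /i ignores case.
my @image_urls = $html =~ /<img\b[^>]*\bsrc="([^"]+)"/gi;

print "$_\n" for @image_urls;
```

Note that the href of the link is not captured, because the pattern insists on starting inside an img tag.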

OK, so now you have a bunch of image URLs. We care about the ones that look like this:

http://domain.name.here/path/goes/here/filename.ext

For each image URL like this, extract the domain name, which is the text between the // and the first single /. You can use a similar trick here as the one shown above for matching stuff between double quotes, except here you are matching stuff between / characters. And once you have them, you want each unique domain name listed only once, so think about how to do that.
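One common way to keep each value only once in Perl is to use the values as hash keys. The sketch below combines that idea with a "stuff between slashes" pattern; the URLs and variable names are made up for illustration, and other approaches would work equally well.

```perl
use strict;
use warnings;

# Made-up image URLs for illustration only.
my @image_urls = (
    'http://img.example.com/a.png',
    'http://static.example.org/icons/b.gif',
    '/local/c.jpg',
    'http://img.example.com/photos/d.jpg',
);

# Hash keys are unique, so using each domain as a key collapses duplicates.
my %seen_domain;
for my $url (@image_urls) {
    # Capture everything after http:// up to (not including) the next /.
    if ($url =~ m{^http://([^/]+)/}) {
        $seen_domain{$1} = 1;
    }
}

my @domains = sort keys %seen_domain;
print "$_\n" for @domains;
```

Using m{...} as the delimiter avoids having to escape every / inside the pattern. URLs that do not start with http:// are simply skipped.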

Almost there!

Now, call the host command-line utility to look up the IP address for each domain name. Because you simply need the output from the host command, which should you use, system() or backticks? The host command has potentially quite messy output, but you only care about one part, which is very regularly formatted. Here is the complete output from a call to host; the key parts are the IP addresses at the end of the last two lines.

% host images.apple.com
images.apple.com is an alias for images.apple.com.edgesuite.net.
images.apple.com.edgesuite.net is an alias for images.apple.com.edgesuite.net.globalredir.akadns.net.
images.apple.com.edgesuite.net.globalredir.akadns.net is an alias for a199.gi3.akamai.net.
a199.gi3.akamai.net has address 205.213.110.8
a199.gi3.akamai.net has address 205.213.110.14

Just look for and parse the IP addresses prefixed by “has address” as shown above.
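As a rough illustration, the sketch below pulls the addresses out of a string that holds text in the same format as the host output above. The variable names are assumptions, and the text is hard-coded here (with some alias lines left out) so the example runs without calling host.

```perl
use strict;
use warnings;

# Captured text in the same format as the host output shown above.
my $host_output = <<'END_OUTPUT';
images.apple.com is an alias for images.apple.com.edgesuite.net.
a199.gi3.akamai.net has address 205.213.110.8
a199.gi3.akamai.net has address 205.213.110.14
END_OUTPUT

# Capture the dotted-quad address that follows "has address" on each line.
my @addresses = $host_output =~ /has address (\d+\.\d+\.\d+\.\d+)/g;

print join(', ', @addresses), "\n";
```

Lines without “has address”, such as the alias lines, never match and so are ignored automatically.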

Finally, your script should output each unique domain name, followed by any and all IP addresses from the host command. Sometimes, host will not return any IP addresses, which is OK (assuming your code is correct).
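One convenient way to hold these results is a hash that maps each domain name to a list of addresses. The sketch below assumes that layout and prints it in the style of the sample output; the helper name report_line and the domain no-such.example are hypothetical.

```perl
use strict;
use warnings;

# Domain names mapped to lists of addresses. The first two entries are
# taken from the sample output above; the last is a made-up domain with no
# addresses, to show the empty case.
my %addresses_for = (
    'i.l.cnn.net'       => ['205.213.110.7', '205.213.110.14'],
    'i2.cdn.turner.com' => ['199.93.46.126', '206.33.36.126', '206.33.57.126'],
    'no-such.example'   => [],
);

# Hypothetical helper: build one line of the report for a domain.
sub report_line {
    my ($domain, @ips) = @_;
    return "$domain: " . join(', ', @ips);
}

for my $domain (sort keys %addresses_for) {
    print report_line($domain, @{ $addresses_for{$domain} }), "\n";
}
```

Because join returns an empty string for an empty list, a domain with no addresses still gets its own line.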

Reminders

Do the work yourself, consulting reasonable reference materials as needed; any reference material that gives you a complete or nearly complete solution to this problem or a similar one is not OK to use. Asking the instructors for help is OK; asking other students for help is not.

Hand In

A printout of your output on a single sheet of paper. Be sure to put your own name in the initial comment block of the code. Identifying your work is important, or you may not receive appropriate credit.