Due Thursday, July 28, at the start of class.
Download a webpage and identify the IP addresses from which images would be loaded.
There are multiple phases to this problem. The first part is to fetch (download) the webpage located at a particular URL. The second part is to parse the resulting webpage and extract image URLs and their domain names. The last part is to look up the IP address for each unique domain name being referenced by the image URLs. Here is the command line (starting with %) and sample output from my solution:
% ./homework-10-solution.pl http://www.cnn.com/
fetching 'http://www.cnn.com/' now....
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 99974    0 99974    0     0   219k      0 --:--:-- --:--:-- --:--:--  394k
i.cdn.turner.com: 22.214.171.124, 126.96.36.199, 188.8.131.52
i.l.cnn.net: 184.108.40.206, 220.127.116.11
i2.cdn.turner.com: 18.104.22.168, 22.214.171.124, 126.96.36.199
This task may sound very complex, but it’s really not too bad if you break it down into small steps.
To fetch a webpage, use a command-line utility like curl or wget that takes a URL and attempts to download the page. Most Linux systems include at least one, if not both; Mac OS X systems generally include curl. Read or at least skim the manpage for your chosen utility to get a feel for how it works. At the very least, make sure you know how to save the downloaded page to a filename that you select. For what it’s worth, I used the following curl command in my solution:
curl --fail --location --output download.html URL
Your script should download the page to a file, allowing any standard output or error output to go to the screen. Which should you use, system() or backticks? Of course, you must check for errors after the download, because it might fail; but be reasonable here, no need to go crazy.
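As a rough illustration of the error check, here is a minimal sketch of running an external command and inspecting its exit status through Perl's $? variable. The run_command helper is a hypothetical name, not part of any required design, and the commands it runs here are stand-ins (the Perl interpreter itself, via $^X) rather than a real download.

```perl
use strict;
use warnings;

# Hypothetical helper: run an external command, letting its output go
# straight to the screen, and return 1 on success or 0 on failure.
sub run_command {
    my @command = @_;
    system(@command);
    if ($? == -1) {
        # The command could not be started at all.
        warn "could not start '$command[0]': $!\n";
        return 0;
    }
    if ($? != 0) {
        # The high byte of $? holds the exit status of the command.
        warn sprintf("'%s' failed (exit status %d)\n", $command[0], $? >> 8);
        return 0;
    }
    return 1;
}

# Stand-in commands: the running Perl interpreter exiting with a chosen status.
print run_command($^X, '-e', 'exit 0') ? "ok\n" : "failed\n";
print run_command($^X, '-e', 'exit 3') ? "ok\n" : "failed\n";
```

Passing the command as a list, rather than as one string, keeps the shell from reinterpreting characters such as & or ? that often appear in URLs.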
Next, you need to read the file and look for image tags. If you do not know HTML, that is OK; here is everything you need to know. An image tag has the following format:
<img src="image_url_here" ...>
There can be extra spaces just about anywhere, but you are free to ignore that here. Essentially, what you want is the stuff between the quotes of the src="…" part — but only from img tags. Most webpages will have lots of these image tags, so find all of the image URLs that you can.
Here is a handy regular expression trick that might help. We can assume that the “image_url_here” part of the img tag contains no double quotes itself. Therefore, instead of just matching on /"(.*)"/, which can cause problems with certain img tags, we can use a more precise expression:
/"([^"]*)"/
Effectively, that says, match and save the stuff between double quotes. Of course, you must still provide the full regular expression that matches the img tag context, but the regular expression fragment above should help.
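To see why /"(.*)"/ can cause problems, the sketch below compares it with a negated character class, which is one common way to say "anything except a double quote". The sample line is made up for illustration, and the patterns here match only the quoted part, not the surrounding img tag.

```perl
use strict;
use warnings;

# Hypothetical sample line with two quoted attributes.
my $line = '<img src="logo.png" alt="Site logo">';

# Greedy: .* runs all the way to the LAST double quote on the line.
my ($greedy) = $line =~ /"(.*)"/;

# Negated character class: stops at the NEXT double quote.
my ($precise) = $line =~ /"([^"]*)"/;

print "greedy:  $greedy\n";     # greedy:  logo.png" alt="Site logo
print "precise: $precise\n";    # precise: logo.png
```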
OK, so now you have a bunch of image URLs. We care about the ones that look like this:
http://domain_name_here/rest_of_url_here
For each image URL like this, extract the domain name, which is the text between the // and the first single /. You can use a similar trick here as the one shown above for matching stuff between double quotes, except here you are matching stuff between / characters. And once you have them, you want each unique domain name listed only once, so think about how to do that.
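One possible way to approach both parts is sketched below: a pattern that captures the text between // and the next /, and a hash whose keys collect each domain name only once. The URLs are invented placeholders, and a hash is only one of several reasonable ways to get uniqueness.

```perl
use strict;
use warnings;

# Hypothetical image URLs; example.com and example.org are placeholders.
my @urls = (
    'http://img.example.com/a/one.png',
    'http://img.example.com/b/two.gif',
    'http://static.example.org/three.jpg',
    '/relative/four.png',    # no domain name, so it is skipped
);

my %seen;    # domain name => number of times it appeared
for my $url (@urls) {
    # Capture everything after "//" up to, but not including, the next "/".
    if ($url =~ m{//([^/]+)/}) {
        $seen{$1}++;
    }
}

# Hash keys are unique by construction; sort them for stable output.
my @domains = sort keys %seen;
print "$_\n" for @domains;
```

Using m{...} instead of /.../ as the pattern delimiter avoids having to escape every slash inside the expression.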
Now, call the host command-line utility to look up the IP address for each domain name. Because you simply need the output from the host command, which should you use, system() or backticks? The host command has potentially quite messy output, but you only care about one part which is very regularly formatted. Here is the complete output from a call to host, with the key parts highlighted.
% host images.apple.com
images.apple.com is an alias for images.apple.com.edgesuite.net.
images.apple.com.edgesuite.net is an alias for images.apple.com.edgesuite.net.globalredir.akadns.net.
images.apple.com.edgesuite.net.globalredir.akadns.net is an alias for a199.gi3.akamai.net.
a199.gi3.akamai.net has address 188.8.131.52
a199.gi3.akamai.net has address 184.108.40.206
Just find and extract the IP addresses prefixed by “has address”, as shown above.
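The extraction step might look roughly like the sketch below, which works on lines of host-style output that are already in memory. The extract_addresses name is hypothetical, and the sample lines are copied from the output above rather than produced by a live lookup.

```perl
use strict;
use warnings;

# Hypothetical helper: return the addresses found on "has address" lines.
sub extract_addresses {
    my @lines = @_;
    my @addresses;
    for my $line (@lines) {
        # \S+ captures the run of non-space characters after the marker.
        if ($line =~ /has address (\S+)/) {
            push @addresses, $1;
        }
    }
    return @addresses;
}

# Sample lines copied from the host output shown above.
my @sample = (
    "images.apple.com is an alias for images.apple.com.edgesuite.net.\n",
    "a199.gi3.akamai.net has address 188.8.131.52\n",
    "a199.gi3.akamai.net has address 184.108.40.206\n",
);

my @ips = extract_addresses(@sample);
print join(', ', @ips), "\n";    # 188.8.131.52, 184.108.40.206
```

In list context, Perl's backticks return one element per line of command output, so captured output can be handed to a function like this without further splitting.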
For output, your script should print each unique domain name, followed by any and all IP addresses from the host command. Sometimes, host will not return any IP addresses, which is OK (assuming your code is correct).
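As a small sketch of the output format, assuming the "name: address, address" layout of the sample run near the top of this page, join() handles any number of addresses, including none. The format_line name and the example.com, example.org, and 192.0.2.x values are placeholders.

```perl
use strict;
use warnings;

# Hypothetical helper: the domain name, a colon, then the addresses (possibly none).
sub format_line {
    my ($domain, @addresses) = @_;
    return "$domain: " . join(', ', @addresses);
}

print format_line('img.example.com', '192.0.2.10', '192.0.2.11'), "\n";
print format_line('static.example.org'), "\n";    # no addresses found
```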
Do the work yourself, consulting reasonable reference materials as needed; any reference material that gives you a complete or nearly complete solution to this problem or a similar one is not OK to use. Asking the instructors for help is OK; asking other students for help is not.
Turn in a printout of your code on a single sheet of paper (if possible). Be sure to put your own name in the initial comment block of the code. Identifying your work is important, or you may not receive appropriate credit.