Due Thursday, July 26, at the start of class.
Download a webpage and identify the IP addresses from which images would be loaded.
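This script is most easily described in three parts (how you organize your actual script is up to you):
1. Download the HTML of the webpage at a given URL.
2. Find the image URLs in the HTML and extract the unique domain names.
3. Look up the IP address(es) for each domain name.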
Here is the command line (starting with %) and sample output from my solution:
% ./homework-11.pl http://www.cnn.com
fetching 'http://www.cnn.com' now....
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   99k  100   99k    0     0   232k      0 --:--:-- --:--:-- --:--:--  263k
Found 59 external images and 2 unique domains.
i.cdn.turner.com: 8.26.219.254, 8.26.221.126, 207.123.44.126
i2.cdn.turner.com: 207.123.44.126, 8.26.219.254, 8.26.221.126
This task may sound very complex, but it’s really not too bad if you break it down into small steps.
The first main step is to download the HTML of the webpage at the given URL. You must accept the starting URL from the command line. Of the two command-line parsing methods discussed in class, which is more appropriate in this case?
To fetch a webpage, use a command-line utility like curl or wget that takes a URL and attempts to download the page. Most Linux systems include at least one, if not both; Mac OS X systems generally include curl. Read or at least skim the manpage for your chosen utility to get a feel for how it works. At the very least, make sure you know how to save the downloaded page to a filename that you select. For what it’s worth, I used the following curl command in my script:
curl --fail --location --output download.html URL
Your script should download the HTML directly to a file, allowing any standard output or error output to go to the screen. Which should you use, system() or backticks? Of course, you must check for errors after the download, because it might fail; but be reasonable here, no need to go crazy.
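Whichever way you run the download command, Perl leaves the command's raw wait status in $?. The sketch below shows one way to decode that value. The name describe_status is a hypothetical helper chosen for illustration, and the sample values are fed in by hand rather than produced by a real download.

```perl
use strict;
use warnings;

# Turn the raw wait status that Perl stores in $? (after system() or
# backticks) into a short human-readable description.
# describe_status is a hypothetical helper name used for illustration.
sub describe_status {
    my ($status) = @_;
    return 'could not start the command' if $status == -1;
    return 'killed by signal ' . ($status & 127) if $status & 127;
    my $exit_code = $status >> 8;    # exit code lives in the high byte
    return $exit_code == 0 ? 'success' : "failed with exit code $exit_code";
}

# Hand-made sample values: 0 is a clean exit, and 22 << 8 is the raw status
# for exit code 22, which curl uses with --fail when the server reports an
# HTTP error.
print describe_status(0), "\n";
print describe_status(22 << 8), "\n";
```

In a real script you would pass $? itself to a helper like this right after running the command, and stop with a useful message if the result is anything other than success.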
Next, you need to read the file and look for image tags. If you do not know HTML, that is OK; here is everything you need to know. An image tag has the following format:
<img src="image_url_here" ...>
There can be extra spaces just about anywhere, but you are free to ignore that here. Essentially, what you want is the stuff between the quotes of the src="…" part — but only from img tags. Most webpages will have lots of these image tags, so find all of the image URLs that you can.
Here is a hint about creating a regular expression to find and extract the image URLs. Assume that the “image_url_here” part of the img tag does not contain any double quotes itself. That is, you are looking for the minimal text between a pair of double quotes. What regular expression tricks did we cover already to help in this kind of situation?
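As a reminder of how greedy and minimal matching differ, here is a small sketch on a made-up string (deliberately not an img tag). The sample text and variable names are assumptions for illustration only; adapting the idea to image tags is still up to you.

```perl
use strict;
use warnings;

# Made-up sample text with two quoted values on one line.
my $text = 'title="first" alt="second"';

# Greedy: .* grabs as much as it can, so the match runs to the LAST quote.
my ($greedy) = $text =~ /"(.*)"/;

# Minimal: .*? grabs as little as it can, so the match stops at the NEXT quote.
my ($minimal) = $text =~ /"(.*?)"/;

# Negated character class: "anything except a double quote" also stops early.
my ($negated) = $text =~ /"([^"]*)"/;

# With /g in list context, a pattern with one capture group returns every capture.
my @all = $text =~ /"([^"]*)"/g;

print "greedy:  $greedy\n";
print "minimal: $minimal\n";
print "negated: $negated\n";
print "all:     @all\n";
```

The greedy version swallows everything from the first quote to the last one on the line, which is rarely what you want when a line holds more than one quoted value.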
OK, so now you have a bunch of image URLs. We care about the ones that look like this, specifically ones that start with http://:
http://domain.name.here/path/goes/here/filename.ext
For each image URL like this, extract the domain name, which is the text between the first // and the first single /. You can use a similar trick here as the one hinted at above for matching stuff between double quotes, except here you are matching stuff between / characters. And once you have them, you want to store each unique domain name only once, so think about how to do that.
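One common Perl idiom for keeping each value only once is a hash keyed by the value itself. The sketch below is one possible approach, not the required one; unique_values is a hypothetical helper name and the sample names are made up.

```perl
use strict;
use warnings;

# Return the distinct values from a list, in first-seen order.
# unique_values is a hypothetical helper name used for illustration.
sub unique_values {
    my @values = @_;
    my %seen;
    # $seen{$_}++ is 0 (false) the first time a value appears, so grep
    # keeps only the first occurrence of each value.
    return grep { !$seen{$_}++ } @values;
}

# Made-up sample names; the first one appears twice.
my @names  = qw(a.example.com b.example.com a.example.com);
my @unique = unique_values(@names);
print scalar(@unique), " unique: @unique\n";
```

If you do not care about order, simply using the values as hash keys and calling keys at the end does the same job.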
Almost there!
In this part, your script will find the IP address(es) corresponding to each domain name that it found in Part 2.
There is a handy command-line program for looking up IP addresses; it is called host. You need the output from the host command, so which should you use, system() or backticks? The host command has potentially quite messy output, but you only care about one part, which is very regularly formatted. Here is the complete output from a call to host; the key parts are the lines that contain “has address”.
% host images.apple.com
images.apple.com is an alias for images.apple.com.edgesuite.net.
images.apple.com.edgesuite.net is an alias for images.apple.com.edgesuite.net.globalredir.akadns.net.
images.apple.com.edgesuite.net.globalredir.akadns.net is an alias for a199.gi3.akamai.net.
a199.gi3.akamai.net has address 205.213.110.8
a199.gi3.akamai.net has address 205.213.110.14
Just find and extract the IP addresses prefixed by “has address”, as shown above.
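To make the extraction step concrete, here is a minimal sketch that works on a saved copy of host output rather than calling host itself. The helper name extract_addresses is hypothetical, and the pattern assumes plain IPv4 addresses like the ones in the sample.

```perl
use strict;
use warnings;

# Pull every IPv4 address that follows "has address" out of a block of
# host output. Call it in list context to get all of the addresses.
# extract_addresses is a hypothetical helper name used for illustration.
sub extract_addresses {
    my ($output) = @_;
    return $output =~ /has address (\d{1,3}(?:\.\d{1,3}){3})/g;
}

# Sample text copied from the host output shown above, so this sketch runs
# without calling host or touching the network.
my $sample = <<'END';
images.apple.com is an alias for images.apple.com.edgesuite.net.
a199.gi3.akamai.net has address 205.213.110.8
a199.gi3.akamai.net has address 205.213.110.14
END

my @addresses = extract_addresses($sample);
print join(', ', @addresses), "\n";
```

When host reports no addresses at all, a helper like this simply returns an empty list, which your output code should handle gracefully.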
For output, your script should print the number of image URLs found, the number of unique domain names, and for each unique domain name, all IP addresses associated with that domain name (from the host command). Sometimes, host will not return any IP addresses, which is OK (assuming your code is correct).
See the sample output at the top of this page for one possible output format. Note that some of the output comes from commands that are run by the script.
This script takes many words to describe, but can be written fairly compactly. My solution easily fit onto a single page (much less than 50 lines of code). If you find that your solution is multiple pages long, think about ways to reorganize your solution to be simpler.
How will you test your script? Can you write some unit tests to try out the trickiest parts? Here are some general ideas for testing this script:
host correctly? How do you know?
Do the work yourself, consulting reasonable reference materials as needed. Any resource that provides a complete solution or offers significant material assistance toward a solution is not OK to use. Asking the instructor for help is OK; asking other students for help is not. All standard UW policies concerning student conduct (esp. UWS 14) and information technology apply to this course and assignment.
Turn in a printout of your code, ideally on a single sheet of paper. Be sure to put your own name in the initial comment block. Identifying your work is important, or you may not receive appropriate credit.