Assignment 2

Due Wednesday, June 29 at 9:00 AM

Be sure you are acquainted with my collaboration policy and my late policy, both found on the main course webpage. If you are working with a partner, be sure to include both of your names.


Introduction

XML (eXtensible Markup Language) is a language for describing data in a particular structured way. Many computer programs use XML-based file formats. One example you may be familiar with is iTunes, Apple's digital music player.

In this assignment you will use stacks to check whether an iTunes XML document is syntactically well-formed. This is essentially a fancy version of the parentheses-matching problem we looked at in class. In order to complete this assignment, you'll need these files: XMLChecker.java, playlist.xml, endcut.xml, startcut.xml, and mismatch.xml.


Acquaint Yourself with playlist.xml

If you view playlist.xml in the Firefox web browser, then you'll see a hierarchical, "tree-like" view of its contents; Firefox understands the structure of XML files. If you view endcut.xml, startcut.xml, and mismatch.xml in Firefox, then the program complains that they are not well-formed XML files; that's what this assignment is all about.

To understand what's going on, examine the playlist.xml file in a text editor. As you can see, an XML document is just a bunch of text, but sprinkled with nuggets enclosed in < and >, such as <dict>; these are called tags.

The first two lines in playlist.xml are special. The first line has an ?xml tag, which indicates that this is an XML file. This tag has two attributes, naming the intended XML version and text encoding. The second line has a !DOCTYPE tag, which has an attribute describing what kind of XML document we're looking at. In this assignment we're going to ignore these two special lines.

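After the first two lines, the file takes on a rigid structure. There are three kinds of tags, distinguished by whether they contain a slash / and where it appears. An opening tag such as <dict> has no slash, a closing tag such as </dict> has a slash at the start, and an empty tag such as <true/> has a slash at the end.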

After the first two lines, the structure of playlist.xml is a nest of opening and closing tags. The whole file (after the first two lines) is contained in a single <plist> </plist> pair. Inside this plist pair is a single <dict> </dict> pair. Inside that dict pair are many keys and other things.

The important point here is that all tags (after the first two lines) are properly nested. If a tag is opened inside a matching pair of opening/closing tags, then it is closed inside that same matching pair. Put another way, each closing tag matches the most-recently-opened-but-not-yet-closed tag. We say that the XML file is well-formed if the tags nest properly like this. In this assignment you will write a program to check whether a given XML file is well-formed.
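The nesting rule above can be sketched with a stack. This is only a minimal illustration, not the finished assignment: the class NestingSketch and the method isWellFormed are hypothetical names, and the sketch assumes tags arrive as strings without angle brackets (dict, /dict) and that there are no empty or special tags.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class NestingSketch {
    // Returns true if every closing tag matches the most recently opened,
    // not-yet-closed tag, and nothing is left open at the end.
    // Assumption: no empty tags (like true/) and no ?xml or !DOCTYPE entries.
    static boolean isWellFormed(List<String> tags) {
        Deque<String> open = new ArrayDeque<>();
        for (String tag : tags) {
            if (tag.startsWith("/")) {
                // Closing tag: it must match the tag on top of the stack.
                if (open.isEmpty() || !open.pop().equals(tag.substring(1))) {
                    return false;
                }
            } else {
                // Opening tag: remember it until its closing tag shows up.
                open.push(tag);
            }
        }
        return open.isEmpty();
    }

    public static void main(String[] args) {
        // Properly nested: prints true
        System.out.println(isWellFormed(Arrays.asList("plist", "dict", "/dict", "/plist")));
        // plist is closed while dict is still open: prints false
        System.out.println(isWellFormed(Arrays.asList("plist", "dict", "/plist", "/dict")));
    }
}
```

Note that a boolean answer is not enough for this assignment, since you also need to report what went wrong.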


Acquaint Yourself with the Code

The file XMLChecker.java already contains some code. Read through it and familiarize yourself with what it is doing. The first chunk of code is a check to make sure the user has passed a single XML file to be validated. Notice that the usage should look like this once everything has been compiled:

java XMLChecker filename.xml

The code then stores the contents of the file in a string; you should have seen code like this in previous courses. Then comes something you may not have seen before: our String of XML is parsed down to a list of tags using something called a "regular expression." Regular expressions are a pattern-matching notation that you will see frequently in your CS career. You are not required to understand them for this class. However, if you are interested, Wikipedia's article is actually pretty decent. You're also welcome to come ask me about it during office hours. Anyway, the important thing to know is that this code takes the text found between each pair of "<" and ">" characters and stores it in an ArrayList called tagList.
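If you are curious, here is a rough sketch of how a regular expression can pull tags out of a string. The pattern and the names TagExtractSketch and extractTags are my own assumptions for illustration; the code in XMLChecker.java may use a different pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagExtractSketch {
    // Collects the text between each "<" and the next ">", in document order.
    static List<String> extractTags(String xml) {
        List<String> tagList = new ArrayList<>();
        // "<", then a captured run of one or more characters that are not ">", then ">".
        Matcher m = Pattern.compile("<([^>]+)>").matcher(xml);
        while (m.find()) {
            tagList.add(m.group(1)); // group 1 is the part inside the angle brackets
        }
        return tagList;
    }

    public static void main(String[] args) {
        String xml = "<dict><key>Name</key><true/></dict>";
        System.out.println(extractTags(xml)); // prints [dict, key, /key, true/, /dict]
    }
}
```

The text between tags (Name in this example) is simply skipped, which is fine because only the tags matter for well-formedness.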

Finally, the code prints out the entire list of tags. That's not what we want it to do, though. Here's where you come in.


Write Your Well-Formedness Checker

Your assignment is to replace the line that prints out the tags with a bunch of code that uses a stack to check the well-formedness of the XML. You will be using basically the same algorithm we covered in class (Tuesday, 6/21). Essentially, each XML tag is a different kind of parenthesis, with opening (e.g. dict) and closing (e.g. /dict) variants. Remember that an empty tag (e.g. true/) opens and immediately closes itself; how do you treat this? (Although <true/> is the only empty tag in playlist.xml, your program should be able to handle any empty tag.)
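One way to get started is to decide what kind of tag each entry of tagList is before touching the stack. The sketch below is hypothetical (the names TagKindSketch, Kind, classify, and name are not part of the provided code), and it assumes that entries have no angle brackets and may carry attributes after the tag name.

```java
class TagKindSketch {
    enum Kind { SPECIAL, OPENING, CLOSING, EMPTY }

    // Classifies a tag given without its angle brackets.
    static Kind classify(String tag) {
        if (tag.startsWith("?") || tag.startsWith("!")) {
            return Kind.SPECIAL;   // e.g. the ?xml and !DOCTYPE lines, which we ignore
        } else if (tag.startsWith("/")) {
            return Kind.CLOSING;   // e.g. /dict
        } else if (tag.endsWith("/")) {
            return Kind.EMPTY;     // e.g. true/
        } else {
            return Kind.OPENING;   // e.g. dict
        }
    }

    // Returns the bare tag name: slashes and any attributes are dropped.
    static String name(String tag) {
        String t = tag.trim();
        if (t.startsWith("/")) {
            t = t.substring(1);
        }
        if (t.endsWith("/")) {
            t = t.substring(0, t.length() - 1);
        }
        // Keep only the text before the first whitespace (assumed to be the name).
        return t.trim().split("\\s+")[0];
    }

    public static void main(String[] args) {
        System.out.println(classify("dict") + " " + name("dict"));   // OPENING dict
        System.out.println(classify("/dict") + " " + name("/dict")); // CLOSING dict
        System.out.println(classify("true/") + " " + name("true/")); // EMPTY true
    }
}
```

With a helper like this, the question about empty tags becomes a question about what the EMPTY case should do to the stack; I leave that for you to work out.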

If the XML is well-formed, then your program should print out a happy message to that effect. If the XML is not well-formed, then there are several different ways in which it could have failed. You should print out a detailed error message to tell the user what went wrong. Part of your error message should be the contents of the stack at the time the error occurred. This helps the user figure out where the bad part of the XML file is. For example, here's what the program might do when handed a well-formed XML file:

ealexand$ java XMLChecker playlist.xml
The file playlist.xml correctly nests XML tags.

Here's what the program might do with an XML file that closes tags incorrectly:

ealexand$ java XMLChecker mismatch.xml
The file mismatch.xml does not correctly nest XML tags.
The tag /sminteger was found where the following tags were expected 
to be closed (starting with the inner-most tag):
integer
dict
dict
dict
plist

I have provided you with three damaged iTunes XML files, in addition to the undamaged playlist.xml. Make sure that your program works correctly on all of them. You may want to test your program on other iTunes XML files as well; I'm sure you can find someone who uses iTunes, to get more samples.
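As a hint for the error message, here is one possible way to list the stack contents starting with the inner-most tag. The names ErrorReportSketch and mismatchReport are hypothetical, and the wording simply mirrors the sample output above. The sketch assumes the stack is an ArrayDeque used with push and pop.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class ErrorReportSketch {
    // Builds a report for a closing tag that does not match the top of the stack.
    static String mismatchReport(String fileName, String badTag, Deque<String> open) {
        StringBuilder sb = new StringBuilder();
        sb.append("The file ").append(fileName).append(" does not correctly nest XML tags.\n");
        sb.append("The tag ").append(badTag).append(" was found where the following tags were expected\n");
        sb.append("to be closed (starting with the inner-most tag):\n");
        // An ArrayDeque filled with push() iterates from the top of the stack down,
        // so the most recently opened tag is listed first.
        for (String tag : open) {
            sb.append(tag).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Deque<String> open = new ArrayDeque<>();
        open.push("plist");
        open.push("dict");
        open.push("integer");
        System.out.print(mismatchReport("mismatch.xml", "/sminteger", open));
    }
}
```

If you use java.util.Stack instead, be aware that iterating over it goes from the bottom of the stack to the top, so you would need to reverse the order yourself.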


Turning in your work

To hand in your program, copy XMLChecker.java and any other files needed to create XMLChecker.class (not including the files with which I provided you) to the following directory:

~cs367-1/handin/login-name/P1

Use your actual CS login name (not your UW NetID!) in place of login-name. Be sure that a comment at the top of the XMLChecker.java file gives the name(s) of the author(s) of the code.

If you worked with a partner, then only one of you should turn in the program files as specified above. It doesn't matter which partner hands in the program files.

Do not copy any ".class" files, and do not create any subdirectories in your handin directory. Note that for this assignment, you should test your program thoroughly, but you do not need to hand in your test data.

You will be graded on program correctness, error checking, the usefulness of your error messages, and general style.


Note: this assignment has been adapted from a similar assignment given by Carleton College professor Josh Davis.