DTD Handling in the New Niagara

Introduction

DTDs and XML Schemas are an important part of the new Niagara; they are needed for many purposes. For this document I will use the term DTD as shorthand for both DTDs and XML Schemas, and will refer to each explicitly when it matters.

One of the most obvious uses for a DTD is document parsing. If a document conforms to a DTD, the DTD is needed to validate the document. DTDs are also needed for understanding documents. One example is understanding which IDs and IDREFs a document contains; without the DTD there is no way to tell what is an ID or an IDREF. The storage of a document can also vary depending upon the DTD. For example, a document may contain both ignorable and non-ignorable whitespace. The difference is that non-ignorable whitespace must be preserved exactly, while ignorable whitespace can be collapsed to a single whitespace character or eliminated entirely -- for example, between a payload word and an adjacent element.

Another important use for DTDs in query processing is to provide the optimizer with information that it can use to create a more efficient query plan; removing the need for duplicate elimination is one trivial example. Schema information can also indicate that a query doesn't match the data at all, or that it will run horribly -- joins on subtrees, for example. See Stratis or Sekar for better examples of how to use schema information when querying documents.

DTD and XML Schemas

Both DTDs and XML Schemas are widely used to provide schema information about XML data. DTDs are the old, accepted way of providing schema information. However, they are difficult to use because they lack the well-defined structure that XML itself has. Another problem is that DTDs cannot represent many important schema constraints.

XML Schemas were developed as a more expressive means of representing XML schema information. Instead of a rather ad-hoc representation, XML Schemas are themselves valid XML documents. This allows them to be manipulated and used easily, just as any other XML document, and it makes them easily extensible and self-descriptive.

DTD Storage in Niagara

DTDs and XML Schemas are just more documents which will be stored in the XML database. If a DTD is needed, it will be loaded from the database. If the DTD is required but not available in the database, a query will be run to load the document so that it may be accessed.

I propose that DTDs be translated to simple XML Schemas for storage in Niagara. This conversion allows a DTD to be queried and accessed the same as any other document in the system, instead of requiring a new and different mechanism just for DTDs. If the DTD itself is required, a DTD printing module will be used with the incremental evaluator to return the Schema to DTD form. One use of this is to print the DTD for viewing and comparison. Another is to feed the textual DTD to the parser for use in ensuring document compliance with the DTD.

A more complex version of this DTD printer could try to convert non-simple XML Schemas into a DTD for such use. While some elements of the schema would be lost, this would allow use of XML Schemas with components which do not support them. This, of course, is an optional extra and not a critical-path item.
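To make the proposed translation concrete, here is a minimal sketch of turning one DTD element declaration into a "simple" XML Schema fragment. This is only a toy, and everything in it is illustrative rather than part of any existing Niagara module: it handles just a sequence content model with the ?, +, and * occurrence markers, and a real parser module would emit parse events into the ingestion framework instead of building a tree.

    # Toy sketch: one <!ELEMENT name (child, child+, ...)> declaration to
    # a "simple" XML Schema fragment. Illustrative only; a real DTD parser
    # module would cover the full DTD grammar.
    import re
    import xml.etree.ElementTree as ET

    XS = "http://www.w3.org/2001/XMLSchema"
    OCCURS = {"": ("1", "1"), "?": ("0", "1"),
              "+": ("1", "unbounded"), "*": ("0", "unbounded")}

    def element_decl_to_schema(decl):
        m = re.match(r"<!ELEMENT\s+(\S+)\s+\(([^)]*)\)\s*>", decl)
        if not m:
            raise ValueError("unsupported declaration: " + decl)
        name, children = m.groups()
        elem = ET.Element("{%s}element" % XS, name=name)
        ctype = ET.SubElement(elem, "{%s}complexType" % XS)
        seq = ET.SubElement(ctype, "{%s}sequence" % XS)
        for child in (c.strip() for c in children.split(",")):
            base, quant = (child[:-1], child[-1]) if child[-1] in "?+*" \
                          else (child, "")
            lo, hi = OCCURS[quant]
            ET.SubElement(seq, "{%s}element" % XS, ref=base,
                          minOccurs=lo, maxOccurs=hi)
        return elem

    ET.register_namespace("xs", XS)
    print(ET.tostring(element_decl_to_schema("<!ELEMENT book (title, author+)>"),
                      encoding="unicode"))

The DTD "print" module is then just the inverse walk: visit the xs:element nodes of such a fragment and emit the corresponding textual declaration.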
Input and output of DTDs

XML Schemas do not require any special tools for input and output; they are normal XML documents and will be handled with all the normal tools.

We can use an existing DTD parser or create our own, depending upon our needs. The DTD parser will be another of our parser modules: it takes a DTD as input and generates the set of parse events for the resulting simple XML Schema. It fits into our document ingestion framework as any other parser does.

DTD output will be done by a special DTD printer which uses the incremental subtree evaluator. This printer is the reverse of the DTD parser, converting a simple XML Schema back into the corresponding textual DTD. As mentioned earlier, a more complex DTD printer could make DTDs from general XML Schemas.

Schema Creation

It may be possible to generate schema information for documents, or portions of documents, which otherwise lack it. Having this information available for optimizer use would help improve plan generation. There are a number of methods for this. One is noticing that a document conforms to an existing schema, or a portion of an existing schema; similarly, it can be noticed that parts of a document conform to a sub-schema. Another tactic is to analyze the document and generate a schema for it. All of this is research, but the new system easily provides a basis for this type of research.

Use of DTDs

There are three consumers of DTDs in the Niagara system. The first is the XML parser, which needs a DTD to determine whether a document is compliant. Typically this requires a textual DTD as input to the parser. If an XML parser supports XML Schemas, schema information can be provided as textual input to its scanner. If the parser accepts XML Schemas in DOM format, another output module can be built which prints a DOM tree from the database.

The next user of DTDs is the database itself. DTDs are just ordinary documents and may be queried as such: for ordinary retrieval, as part of a query, or for any other use to which a document can be put.

The last consumer of DTDs is the optimizer. As mentioned earlier, the optimizer can use schema information to great effect; it allows plan validation and optimizations that are not possible without schema information. There are two ways the optimizer can use DTD information. The simplest is for it to issue a query on the database! This is always a possibility for retrieving simple information and requires no special tools to be developed. For more complex analysis of DTDs, the optimizer will likely want some form of in-memory structure which it can traverse and query itself to discover and validate information for the query. Whatever is needed for this is not a problem -- the in-memory tree can be created via an ordinary query with a "print" module that builds a tree to the optimizer's liking. The wonder of the component-oriented Niagara system!
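As an illustration of what such a "print" module might hand the optimizer, the sketch below turns a simple schema fragment (like the one built earlier) into a small tree of plain records the optimizer could traverse. SchemaNode and the traversal are hypothetical stand-ins; the real module would be driven by the incremental evaluator rather than by ElementTree.

    # Hypothetical in-memory schema tree for the optimizer; not an
    # existing Niagara structure.
    from dataclasses import dataclass, field
    import xml.etree.ElementTree as ET

    XS = "{http://www.w3.org/2001/XMLSchema}"

    @dataclass
    class SchemaNode:
        name: str
        min_occurs: str = "1"
        max_occurs: str = "1"
        children: list = field(default_factory=list)

    def child_decls(elem):
        # Find xs:element declarations one structural level down,
        # looking through complexType/sequence wrappers.
        for child in elem:
            if child.tag == XS + "element":
                yield child
            else:
                yield from child_decls(child)

    def build_tree(decl):
        node = SchemaNode(name=decl.get("name") or decl.get("ref", ""),
                          min_occurs=decl.get("minOccurs", "1"),
                          max_occurs=decl.get("maxOccurs", "1"))
        node.children.extend(build_tree(c) for c in child_decls(decl))
        return node

With occurrence information available in such a tree, the optimizer can see, for example, that a path has a maxOccurs of "1" and drop a duplicate-elimination step, or that a path in the query simply does not exist in the schema.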
Implementation Notes

The DTD support needed by the system is quite minimal and doesn't involve a lot of work. DTD parsers are already available, so there is not a great deal of code to write to do most of this: a parser module to glue a DTD parser into Niagara and convert its output to a "simple" XML Schema, and a "print" module for the incremental evaluator to print the "simple" XML Schema in DTD form for those tools which need the textual DTD.

The users of DTDs, the optimizer in particular, will need work to understand and use schema information, but that is not the responsibility of the execution engine. The only thing the optimizer requires of the execution engine is a "print" module for the incremental evaluator which can "print" the appropriate in-memory data structure for the optimizer to use.

Complexity of Validation

Yuan has pointed out that some types of document validation can be quite expensive to perform in the parser. This is due to the complexity of some of the validations, which either have global consistency requirements or require complex validation of individual values. This type of validation can be done in the parser, but it slows processing considerably because of the resources it consumes. Yuan suggests that a better solution is to perform the more costly validation through a query or other database-style operation on the completely parsed document. For example, IDREF <-> ID and KEYREF <-> KEY matching and type-checking can easily be done with a query. This can be performed two ways in Niagara, depending upon how the document is stored and indexed. One way is to run a query on the multi-attribute IDREF index for validation. The other is to run an unnest query which dereferences the xREFs to ensure they point to nodes of the correct type.
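A minimal sketch of this post-parse style of check: gather every ID in the parsed document, then verify that every IDREF resolves to one. Which attributes are IDs and which are IDREFs would come from the DTD; the attribute names "id" and "idref" below are stand-ins.

    # ID <-> IDREF matching as a set-at-a-time pass over the parsed
    # document rather than per-event work in the parser.
    import xml.etree.ElementTree as ET

    def dangling_idrefs(doc, id_attr="id", idref_attr="idref"):
        ids = {e.get(id_attr) for e in doc.iter()
               if e.get(id_attr) is not None}
        return [e.get(idref_attr) for e in doc.iter()
                if e.get(idref_attr) is not None
                and e.get(idref_attr) not in ids]

    doc = ET.fromstring('<root><a id="x"/><b idref="x"/><c idref="y"/></root>')
    print(dangling_idrefs(doc))   # ['y'] -- no element carries id="y"

In Niagara itself the same check would be phrased against the ID and IDREF indexes as described above; the point is only that the work becomes a bulk database operation instead of parser-time bookkeeping.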
After discovering the above issue, it seems that we shouldn't think of document validation as an all-or-nothing proposition. Strictly speaking, that is the only way to be exact with regard to the standards of document compliance to a schema. However, there are cases I can think of where strict validation is not required by, desired by, or useful to the user. In these cases we should look at document validation not as an absolute truth, but as having a level of compliance.

If document validation is examined, it is apparent that it consists of several independent validations. I've broken these up into the following categories, with an arbitrary rating from least complex to most complex, where "complex" is some arbitrary qualification of processing cost and the difficulty of having the parser perform the validation:

    +------+-------------------------------+
    |level | validation                    |
    +------+-------------------------------+
    |0     | No info (no DTD)              |
    +------+-------------------------------+
    |1     | Document specifies a DTD      |
    +------+-------------------------------+
    |2     | DTD used to interpret the     |
    |      | basic document structure      |
    |      | correctly; important items    |
    |      | such as whitespace, IDs and   |
    |      | IDREFs, KEYs and KEYREFs      |
    +------+-------------------------------+
    |3     | Local structural constraints  |
    +------+-------------------------------+
    |4     | Global structural constraints |
    +------+-------------------------------+
    |5     | Value constraints             |
    +------+-------------------------------+
    |6     | Global constraints            |
    +------+-------------------------------+
    |7     | Completely valid              |
    +------+-------------------------------+

Levels 0-3 can easily be performed by the parser with little additional cost. Levels 4-6 can be done by the parser but require increasing resources and processing time. Level 1 may seem odd but isn't -- it covers a document which specifies a DTD, but for which we can't locate that DTD at parse time. Note levels 3 and 4: some structural constraints are more global in nature and have a correspondingly higher cost to validate. This cost may be greater than the classification suggests -- it could be comparable to level 6 -- but the split seems a natural break; the two levels started out as a single entry of structural constraints. Level 5, value constraints, is also odd. Some value constraints can be determined at parse time; for example, determining that a payload word is a valid integer. Other value constraints are more difficult and resemble SQL constraints on values -- unique values, for example.

I propose that we store a document validation level for each document which we store or know about. When a query is run, it can specify the minimum validation level it will accept, so that only documents which are valid enough are processed for the query. If the user cares less about validity, they can access a larger assortment of documents; if the user cares more about validity, they will see only the valid, or valid-enough, documents.

The query can also specify how much work the system should perform on its behalf. This would allow the system to take the time to raise the validation level of documents validated at a lower level, by executing validation queries or operators on documents below the specified level to determine what their validity actually is.

Another way of looking at this model is to specify how a document is valid, rather than some abstract level. This would allow one to specify document validity in an orthogonal fashion; you may not care that a document is valid, only that it says it conforms to a DTD. For example, a query may care only that a document has local structure and correct values, not that its IDREFs are consistent.
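A sketch of how stored validation levels might gate a query follows; Document, the catalog list, and the upgrade hook are all hypothetical, with levels as in the table above.

    # Hypothetical catalog of documents with stored validation levels.
    from dataclasses import dataclass

    @dataclass
    class Document:
        name: str
        validation_level: int   # 0..7, as in the table above

    catalog = [Document("a.xml", 7), Document("b.xml", 3),
               Document("c.xml", 1)]

    def candidates(catalog, min_level, upgrade=None):
        # Yield documents valid enough for the query. If the query allows
        # extra work, try to raise a document's level first, e.g. by
        # running validation queries or operators against it.
        for doc in catalog:
            if doc.validation_level < min_level and upgrade is not None:
                doc.validation_level = upgrade(doc)
            if doc.validation_level >= min_level:
                yield doc

    print([d.name for d in candidates(catalog, min_level=3)])
    # ['a.xml', 'b.xml'] -- c.xml is below the requested level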
More DTD Wackiness

Sekar pointed out some more considerations with document schema information that also affect things such as whitespace handling.

First, an XML document can have an xml:space attribute on any element. The attribute specifies how whitespace is to be handled for that element instance and all nested elements. It has two values, default and preserve (non-ignorable). Default says the node follows whatever whitespace specification has been made for the entire document, which is given at the root; that root specification says whether whitespace is ignorable or not. Preserve means that whitespace must not be ignored until a different (default) specification is encountered. Overall this scheme is odd, but it allows non-ignorable whitespace to exist correctly in odd documents, though it lacks a way to explicitly disable non-ignorable whitespace should that be desired.

Another issue is that a DTD can specify attributes and/or attribute values which must implicitly exist in the document. One form of this directive declares the attribute value as fixed, which means that the value of the attribute, if present on an element, must match the value given in the DTD. The other form, a plain default value, means that the attribute, with the value given in the DTD, must be added to each element which does not already contain that attribute. Some consideration should be given to handling this; the simplest approach is to just add the attribute to each element at parse time. However, that is rather inefficient; keeping track of document meta-information may be valuable if this issue becomes important. Difference queries can be used to determine which elements carry the default attribute value -- by subtracting the attributes actually found and indexed in the document. There are also issues with re-creating the original document without all the extra, implicitly added attributes which this scheme requires.

Conclusion

There seems to be a lot of work in this area which shows potential for research. However, beyond the immediate need to have DTD and/or XML Schema information on hand for the parser, none of this work is something to worry about in the short term.