DTD Handling in the New Niagara
Introduction
DTDs and XML Schemas are an important part of the new Niagara; they are needed for many purposes. In this document I will use the term DTD as shorthand for both DTDs and XML Schemas, and will refer to each explicitly when the distinction matters.
One of the most obvious uses for a DTD is document parsing: if a document claims to conform to a DTD, the DTD is needed to validate the document. DTDs are also needed for understanding documents. One example is knowing what IDs and IDREFs a document contains; without the DTD there is no way to tell whether an attribute is an ID or an IDREF. The storage of a document can also vary depending upon the DTD. For example, a document may contain both ignorable and non-ignorable whitespace; the difference is that non-ignorable whitespace must be preserved exactly, while ignorable whitespace can be collapsed to a single space or eliminated entirely, such as the whitespace separating a payload word from an adjacent element.
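For example, nothing in the document text itself marks an attribute as an ID; only an attribute-list declaration in the DTD does. The element and attribute names below are invented for illustration:

    <!ATTLIST chapter id  ID    #REQUIRED>
    <!ATTLIST seealso ref IDREF #REQUIRED>

The ID/IDREF typing lives entirely in these declarations, which is why the DTD must be on hand to interpret the document.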
Another important use for DTDs, in query processing, is to provide the optimizer with information it can use to create a more efficient query plan; removing the need for duplicate elimination is one trivial example. The DTD can also indicate that a query doesn't match the data at all, or that it will run horribly -- joins on subtrees, for example.
See Stratis or Sekar for better examples of how to use
schema information when querying documents.
DTDs and XML Schemas
Both DTDs and XML Schemas are widely used to provide schema information about XML data. DTDs are the older, accepted way of providing schema information. However, they are difficult to use, as they do not have a well-defined structure such as XML itself has. Another problem is that DTDs are not capable of representing many important schema constraints.
XML Schemas were developed as a more expressive means of representing XML schema information. Instead of a rather ad-hoc representation, XML Schemas are themselves valid XML documents. This allows them to be manipulated and used easily, just like any other XML document. It also makes them easily extensible and self-describing.
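As a small illustration, here is a DTD content model and a hand-simplified XML Schema rendering of the same constraint (assuming title and author are declared elsewhere; a real Schema would carry more detail):

    <!ELEMENT book (title, author+)>

    <xsd:element name="book">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element ref="title"/>
          <xsd:element ref="author" maxOccurs="unbounded"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>

The second form is wordier, but it is itself XML, so it can be stored, queried, and transformed with the same machinery as any other document.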
DTD Storage in Niagara
DTDs and XML Schemas are just another kind of document which will be stored in the XML database. If a DTD is needed, it will be loaded from the database. If the DTD is required but not available in the database, a query will be run to fetch the document so that it may be accessed.
I propose that DTDs be translated to simple XML Schemas for storage in Niagara. This conversion allows a DTD to be queried and accessed the same as any other document in the system, instead of requiring a new and different mechanism just for DTDs. If the textual DTD itself is required, a DTD printing module will be used with the incremental evaluator to return the Schema to DTD form. One use of this would be to print the DTD for viewing and comparison. Another will be to feed the textual DTD to the parser for use in ensuring document compliance with the DTD.
A more complex version of this DTD printer could try to convert non-simple XML Schemas into a DTD for such uses. While some elements of the schema would be lost, this would allow XML Schemas to be used with components which do not support them. This, of course, is an optional extra and not a critical-path item.
Input and Output of DTDs
XML Schemas do not require any special tools for input and output. They are normal XML documents and will be handled with all the normal tools.
We can use an existing DTD parser or create our own, depending upon our needs. The DTD parser will be another of our parser modules: it takes a DTD as input and generates the set of parse events for the resulting simple XML Schema. It fits into our document ingestion framework as any other parser does.
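Below is a minimal sketch of such a module, using SAX's standard DeclHandler extension to receive the DTD declarations. The output vocabulary (elementDecl, attributeDecl) is invented purely for illustration, and the sketch prints text where a real module would emit Niagara parse events:

    // Sketch: receive DTD declarations via SAX, emit them as elements
    // of a "simple" XML Schema document.
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;
    import org.xml.sax.ext.DeclHandler;

    public class DtdToSimpleSchema implements DeclHandler {
        public void elementDecl(String name, String model) {
            System.out.println("<elementDecl name=\"" + name
                               + "\" model=\"" + model + "\"/>");
        }
        public void attributeDecl(String eName, String aName, String type,
                                  String mode, String value) {
            System.out.println("<attributeDecl element=\"" + eName
                               + "\" name=\"" + aName + "\" type=\"" + type
                               + "\" mode=\"" + (mode == null ? "#DEFAULT" : mode)
                               + "\"/>");
        }
        public void internalEntityDecl(String name, String value) { }
        public void externalEntityDecl(String name, String pub, String sys) { }

        public static void main(String[] args) throws Exception {
            XMLReader r = SAXParserFactory.newInstance()
                                          .newSAXParser().getXMLReader();
            r.setProperty("http://xml.org/sax/properties/declaration-handler",
                          new DtdToSimpleSchema());
            // SAX reads the DTD while parsing a document that references it.
            r.parse(new InputSource(args[0]));
        }
    }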
DTD output will be done by a special DTD printer which uses the incremental subtree evaluator. This DTD printer is the reverse of the DTD parser: it converts a simple XML Schema back into the corresponding textual DTD. As mentioned earlier, a more complex DTD printer could generate DTDs from full XML Schemas.
Schema Creation
It may be possible to generate schema information for documents, or portions of documents, which otherwise lack schema information. Having this information available for optimizer use would help improve plan generation. There are a number of methods for this. One is noticing that a document conforms to an existing schema, or that parts of a document conform to a portion (sub-schema) of an existing schema. Another tactic is to analyze the document and generate a schema for it.
All of this is research, but the new system easily provides a basis for this type of research.
Use of DTDs
There are three consumers of DTDs in the Niagara system. The first consumer is the XML parser, which needs a DTD to determine if a document is compliant. Typically this will require a textual DTD as input to the parser. If an XML parser supports XML Schemas, schema information can be provided as textual input to the scanner. If the parser accepts XML Schemas in DOM format, another output module can be built which produces a DOM tree from the database.
The next user of DTDs is the database itself. DTDs are just ordinary documents and may be queried as such, whether for ordinary retrieval, as part of a query, or for any other use to which a document can be put.
The last consumer of DTDs is the optimizer. As mentioned earlier, the optimizer can use schema information to great effect; it allows plan validation and optimizations that are not possible without schema information. There are two ways in which the optimizer can use DTD information. The simplest is for it to issue a query on the database! This is always a possibility for retrieving simple information and requires no special tools to be developed. For more complex analysis of DTDs, the optimizer will likely want some form of in-memory structure which it can traverse and query itself to discover and validate information for the query. Whatever is needed for this is not a problem -- the in-memory tree can be created via an ordinary query whose "print" stage builds an in-memory tree to the optimizer's liking. The wonder of the component-oriented Niagara system!
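To make the in-memory idea concrete, here is a hypothetical shape for such a structure -- every name in it is invented. A matching "print" module would populate it from the stored schema document, and the optimizer would probe it while building a plan:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical optimizer-side summary of a schema.
    public class SchemaSummary {
        private final Map<String, String> contentModels = new HashMap<>();

        // Called by the (hypothetical) print module for each element
        // declaration it encounters in the stored schema.
        void addElement(String name, String model) {
            contentModels.put(name, model);
        }

        // True if 'child' occurs at most once in 'parent's content
        // model, so a plan step over it needs no duplicate elimination.
        // (Naive substring test, for illustration only.)
        boolean atMostOnce(String parent, String child) {
            String model = contentModels.get(parent);
            return model != null && !model.contains(child + "*")
                                 && !model.contains(child + "+");
        }
    }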
Implementation Notes
The DTD support needed by the system is quite minimal and doesn't involve a lot of work. DTD parsers are already available, so there is not a great deal of code to write to do most of this: a parser module to glue a DTD parser into Niagara and convert its output to a "simple" XML Schema, and a "print" module for the incremental evaluator to print the "simple" XML Schema in DTD form for those tools which need the textual DTD.
The users of DTDs, the optimizer in particular, will need work to understand and use schema information, but that is not the responsibility of the execution engine. The only thing the execution engine must provide to the optimizer is a "print" module for the incremental evaluator which can "print" the appropriate in-memory data structure for the optimizer to use.
Complexity of Validation
Yuan has pointed out that some types of document validation can be quite expensive to perform in the parser. This is due to the complexity of some of the validations, which either have global consistency requirements or require complex validation of individual values. This type of validation can be done in the parser; however, it slows processing quite a bit due to the resources required. Yuan suggests that a better solution is to perform the more costly validation through a query or other database-style operation on the completely parsed document. For example, IDREF <-> ID and KEYREF <-> KEY matching and type-checking can easily be done with a query. This can be performed two ways in Niagara, depending upon how the document is stored and/or indexed. One way is to run a query on the multi-attribute IDREF index for validation. The other method is to run an unnest query which dereferences xREFs to ensure they point to nodes of the correct type.
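As a minimal sketch of the database-style approach, with flat in-memory sets standing in for Niagara's indices, the IDREF <-> ID matching reduces to a membership check:

    import java.util.HashSet;
    import java.util.Set;

    public class IdRefCheck {
        // One pass collects the ID values, a second verifies that every
        // IDREF resolves; in Niagara the passes would be index scans or
        // an unnest query rather than in-memory sets.
        public static boolean allRefsResolve(Set<String> ids,
                                             Set<String> idrefs) {
            for (String ref : idrefs) {
                if (!ids.contains(ref)) {
                    System.err.println("dangling IDREF: " + ref);
                    return false;
                }
            }
            return true;
        }

        public static void main(String[] args) {
            Set<String> ids = new HashSet<>();
            Set<String> refs = new HashSet<>();
            ids.add("ch1");
            refs.add("ch1");
            refs.add("ch99");                              // no matching ID
            System.out.println(allRefsResolve(ids, refs)); // prints false
        }
    }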
Having discovered the above issue, it seems that we shouldn't think about document validation as an all-or-nothing proposition. Yes, to be strict, that is the only way to be exact with regard to the standards of document compliance to a schema. However, there are cases I can think of where strict validation is not required by, desired by, or useful to the user. In these cases we should look at document validation not as an absolute truth, but as having a level of compliance.
If document validation is examined, it is apparent that validation consists of several independent validations. I've broken these up into the following categories, with an arbitrary rating from least complex to most complex, where complexity is some arbitrary qualification of processing cost and of the difficulty of having the parser perform the validation:
+------+-------------------------------+
|level | validation                    |
+------+-------------------------------+
|0     | No info (no DTD)              |
+------+-------------------------------+
|1     | Document specifies a DTD      |
+------+-------------------------------+
|2     | DTD used to interpret the     |
|      | basic document structure      |
|      | correctly; important items    |
|      | such as whitespace, IDs and   |
|      | IDREFs, KEYs and KEYREFs      |
+------+-------------------------------+
|3     | Local structural constraints  |
+------+-------------------------------+
|4     | Global structural constraints |
+------+-------------------------------+
|5     | Value constraints             |
+------+-------------------------------+
|6     | Global constraints            |
+------+-------------------------------+
|7     | Completely valid              |
+------+-------------------------------+
Levels 0-3 can easily be performed by the parser with little additional cost. Levels 4-6 can be done by the parser but require increasing resources and processing time. Level 1 may seem odd but isn't -- it covers a document which specifies a DTD, but for which we can't locate the DTD at parse time. Note levels 3 and 4: some structural constraints are more global in nature and have a correspondingly higher cost to validate. This cost may be greater than classified here -- the cost of validating global structural constraints could be comparable to level 6 -- but it seems a natural break; levels 3 and 4 started out as a single entry of structural constraints. Level 5, value constraints, is also odd. Some value constraints are possible to determine at parse time, for example determining that a payload word is a valid integer. Other value constraints are more difficult and are similar to SQL constraints on values -- unique values, for example.
I propose that we store a document validation level for each document which we store or know about. When a query is run it can specify the minimum validation level it will accept for documents. This allows only documents which are valid enough to be processed for the query. If the user cares less about validity, they can access a larger assortment of documents. If the user cares more about validity, they will see only the valid, or valid-enough, documents.
The query can also specify how much work the system should perform on its behalf. This would allow the system to take the time to increase the validation level of documents validated at a lower level, by executing validation queries or operators on documents below the specified validation level to determine what their validity actually is.
Another way of looking at this model is to specify how a document is valid, rather than some abstract level. This would allow document validity to be specified in an orthogonal fashion; you may not care that a document is valid, only that it says it conforms to a DTD. For example, a query may care only that a document has correct local structure and correct values, not that its IDREFs are consistent.
More DTD Wackiness
Sekar pointed out some more considerations with document schema information that also affect things such as whitespace handling.
First, an XML document can have an xml:space attribute on any element. The attribute specifies how whitespace should be handled for this element instance and all nested elements. The attribute has two values, "default" and "preserve" (non-ignorable). Default says the node follows whatever whitespace specification has been made for the entire document, which is specified at the root; the default specification in the root says whether whitespace is ignorable or not. Preserve means that whitespace must not be ignored until a different (default) specification is met. Overall this scheme is odd, but it allows non-ignorable whitespace to exist correctly in odd documents, though it does lack a way to explicitly disable non-ignorable whitespace if that were desired.
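For illustration, the attribute nests and overrides like this (element names invented):

    <chapter xml:space="preserve">
      <verse>   this whitespace is significant   </verse>
      <note xml:space="default">back to the document-wide default</note>
    </chapter>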
Another issue is that a DTD can specify attributes and/or attribute values which implicitly exist in the document. Such a declaration can specify that the value of the attribute is fixed, which means that the value of the attribute, if present in an element, must match the value specified in the DTD. The other form, a default value declaration, means that the attribute with the value specified in the DTD must be treated as added to each element which does not contain that attribute. Some consideration should be given to handling this; the simplest approach is to just add the attribute to each element at parse time. However, that is rather inefficient; keeping track of document meta-information may be valuable if this issue becomes important. Difference queries can be used to determine the elements carrying the default attribute value, by subtracting the indexed attributes which were actually found in the document. There are also issues with re-creating the original document without all the "extra implied added-in" attributes which this mechanism requires.
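For reference, the two declaration forms look like this in a DTD (names invented):

    <!ATTLIST doc   version CDATA #FIXED "1.0">
    <!-- fixed: if present, version must be "1.0"; if absent, it is defaulted in -->

    <!ATTLIST table border CDATA "0">
    <!-- default: border="0" is added to any table element lacking one -->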
Conclusion
There seems to be a lot of work in this area which shows potential for research. However, beyond the immediate need of having DTD and/or XML Schema information on hand for the parser, none of this work is something to worry about in the short term.