DTD Handling in the New Niagara


Introduction

     DTDs and XML Schemas are an important part of the new
Niagara; they are needed for several purposes, from document
parsing to query optimization.  In this document I will use the
term DTD as shorthand for both DTDs and XML Schemas, and will
refer to each explicitly when the distinction matters.

     One of the most obvious uses for a DTD is document parsing.
If a document claims to conform to a DTD, that DTD is needed to
validate the document.  DTDs are also needed for understanding
documents.  One example is understanding what IDs and IDREFs a
document contains: without the DTD there is no way to tell which
attributes are IDs or IDREFs.  The storage of a document can
also vary depending upon the DTD; for example, a document may
contain both ignorable and non-ignorable whitespace.  The
difference is that non-ignorable whitespace must be preserved
exactly, while ignorable whitespace can be collapsed to a single
whitespace character or eliminated entirely -- whitespace
adjacent to a payload word or an element, for example.
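
     As a concrete illustration (the element and attribute names
here are made up), the ATTLIST declaration below is the only
thing that tells a processor that "id" is an ID and "target" is
an IDREF; without the DTD both look like ordinary string-valued
attributes:

# Hypothetical example document, embedded as a Python string.
DOC = """<?xml version="1.0"?>
<!DOCTYPE report [
  <!ELEMENT report (section+)>
  <!ELEMENT section (#PCDATA)>
  <!ATTLIST section id     ID    #REQUIRED
                    target IDREF #IMPLIED>
]>
<report>
  <section id="s1" target="s2">See the next section.</section>
  <section id="s2">Referenced above.</section>
</report>"""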

     Another important use for DTDs is in query processing,
where they provide the optimizer with information it can use to
create a more efficient query plan.  Removing the need for
duplicate elimination is one trivial example.  Another use is to
indicate that a query cannot match the data at all, or that it
will run very badly -- joins on subtrees, for example.
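
     As a rough sketch of the duplicate-elimination example (the
schema representation here is hypothetical), if every step of a
simple child-axis path is declared to occur at most once under
its parent, the path cannot produce duplicate results and the
optimizer may drop the duplicate-elimination operator:

def needs_dup_elim(path_steps, max_occurs):
    """path_steps: e.g. ["book", "title"].
    max_occurs: {(parent, child): n} derived from the schema."""
    parent = None
    for step in path_steps:
        # An unknown pair is treated conservatively as repeatable.
        if parent is not None and max_occurs.get((parent, step), 2) > 1:
            return True
        parent = step
    return False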

     See  Stratis or Sekar for better examples of how to use
schema information when querying documents.

DTDs and XML Schemas

     Both DTDs and XML Schemas are widely used to provide schema
information about XML data.  DTDs are the older, well-accepted
way of providing schema information.  However, they are
difficult to use because they do not have a well-defined
structure the way XML itself does.  Another problem is that DTDs
are not capable of representing many important schema
constraints.

     XML Schemas were developed as a more  expressive  means
of representing XML schema information.  Instead of a rather
ad-hoc representation, XML Schemas are themselves valid  XML
documents.  This allows them to be manipulated and used as
easily as any other XML document.  It also makes them easily
extensible and self-descriptive.

DTD Storage in Niagara

     DTDs and XML Schemas are just more documents to be stored
in the XML database.  If a DTD is needed, it will be loaded from
the database.  If the DTD is required but is not available in
the database, a query will be run to fetch and load the document
so that it may be accessed.

     I propose that DTDs be translated to simple XML Schemas for
storage in Niagara.  This conversion allows a DTD to be queried
and accessed the same as any other document in the system,
instead of requiring a new and different mechanism just for
DTDs.  If the textual DTD itself is required, a DTD printing
module will be used with the incremental evaluator to convert
the stored Schema back into DTD form.  One use of this would be
to print the DTD for viewing and comparison.  Another will be to
feed the textual DTD to the parser for use in ensuring document
compliance with the DTD.
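
     As a rough sketch of what the stored form might look like
(the element names are made up, and the exact Schema dialect
used here is an assumption), a declaration such as
<!ELEMENT book (title, author+)> could be stored as a fragment
like the one below, which is itself ordinary XML that Niagara
can index and query:

# Hypothetical "simple" XML Schema fragment, embedded as a Python string.
SIMPLE_SCHEMA = """\
<xsd:element name="book" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="title"/>
      <xsd:element name="author" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>"""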

     A more complex version of this DTD printer could try to
convert non-simple XML Schemas into a DTD for such uses.  While
some schema information would be lost, this would allow XML
Schemas to be used with components which do not support them.
This, of course, is an optional extra and not a critical path
item.

Input and Output of DTDs

     XML  Schemas do not require any special tools for input
and output.  They are normal XML documents and will be  han-
dled with all the normal tools.

     We can use an existing DTD parser or create our own,
depending upon our needs.  The DTD parser will be another of our
parser modules.  It will take a DTD as input and generate the
set of parse events for the resulting simple XML Schema.  It
fits into our document ingestion framework just as any other
parser does.
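
     A rough sketch of the idea follows; the event interface is
hypothetical and simply stands in for whatever parse events the
ingestion framework actually expects.  Each DTD element
declaration is turned into the events one would see when parsing
the equivalent simple-Schema fragment:

def emit_element_decl(emit, name, children):
    """Emit parse events for one <!ELEMENT ...> declaration.
    children: list of (child_name, max_occurs) pairs from the content model."""
    emit("start", "xsd:element", {"name": name})
    emit("start", "xsd:complexType", {})
    emit("start", "xsd:sequence", {})
    for child, max_occurs in children:
        emit("empty", "xsd:element",
             {"name": child, "maxOccurs": str(max_occurs)})
    emit("end", "xsd:sequence", {})
    emit("end", "xsd:complexType", {})
    emit("end", "xsd:element", {})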

     DTD output will be done by a special DTD printer which uses
the incremental subtree evaluator.  This DTD printer is the
reverse of the DTD parser: it converts a simple XML Schema back
into the corresponding textual DTD.  As mentioned earlier, a
more complex DTD printer could also produce DTDs from arbitrary
XML Schemas.

Schema Creation

     It  may  be possible to generate schema information for
documents or portions  of  documents  which  otherwise  lack
schema information.  Having this information available for
optimizer use would help improve plan generation.  There are a
number of methods for this.  One is noticing that a document, or
a portion of a document, conforms to an existing schema or
sub-schema.  Another tactic is to analyze the document and
generate a schema for it.

     All of this is research, but the new system easily pro-
vides a basis for this type of research.

Use of DTDs

     There are three consumers of DTDs in the Niagara system.
The first consumer is the XML parser, which needs a DTD to
determine if a document is compliant.  Typically this will
require a textual DTD as input to the parser.  If an XML parser
supports XML Schemas, schema information can be provided as
textual input to the scanner.  If the parser accepts XML Schemas
in DOM format, another output module can be made which will
print a DOM tree from the database.

     The next user of DTDs is the database itself.  DTDs are
just ordinary documents and may be used as such: for ordinary
retrieval, as input to a query, or for any other purpose a
document can serve.

     The last consumer of DTDs is the optimizer.  As mentioned
earlier, the optimizer can use schema information to great
effect.  It allows plan validation and optimizations that are
not possible without schema information.  There are two ways in
which the optimizer can use DTD information.  The simplest is
for it to issue a query on the database!  This is always a
possibility for retrieving simple information and requires no
special tools to be developed.  For more complex analysis of
DTDs the optimizer will likely want some form of in-memory
structure which it can traverse and query itself to discover and
validate information for the query.  Whatever is needed for this
is not a problem -- the in-memory tree can be created via an
ordinary query which prints an in-memory tree to the optimizer's
liking.  The wonder of the component-oriented Niagara system!
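
     A rough sketch of the second option follows (all names are
hypothetical): a "print" module walks the stored simple Schema
for a document and hands the optimizer a plain in-memory tree it
can traverse on its own:

def schema_tree(events):
    """Build a nested dict tree from (kind, tag, attrs) parse events,
    where kind is "start", "empty", or "end"."""
    root, stack = None, []
    for kind, tag, attrs in events:
        if kind in ("start", "empty"):
            node = {"tag": tag, "attrs": attrs, "children": []}
            if stack:
                stack[-1]["children"].append(node)
            else:
                root = node
            if kind == "start":
                stack.append(node)
        else:  # "end"
            stack.pop()
    return root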

Implementation Notes

     The DTD support needed by the system is quite minimal, and
does not involve a lot of work.  DTD parsers are already
available, so there is not a great deal of code to write: a
parser module to glue a DTD parser into Niagara and convert its
output into a "simple" XML Schema, and a "print" module for the
incremental evaluator to print the "simple" XML Schema in DTD
form for those tools which need the textual DTD.

     The users of DTDs, the optimizer in particular, will need
work to understand and use schema information, but that is not
the responsibility of the execution engine.  The only thing the
execution engine must provide for the optimizer is a "print"
module for the incremental evaluator which can "print" the
appropriate in-memory data structure for the optimizer to use.

Complexity of Validation

     Yuan has pointed out that some types of document validation
can be quite expensive to perform in the parser.  This is due to
the complexity of some of the validations, which either have
global consistency requirements or require complex validation of
individual values.  This type of validation can be done in the
parser; however, it slows processing considerably because of the
resources required.  Yuan suggests that a better solution is to
perform the more costly validation through a query or other
database-style operation on the completely parsed document.  For
example, IDREF <-> ID and KEYREF <-> KEY matching and
type-checking can easily be done with a query.  This can be
performed two ways in Niagara, depending upon how the document
is stored and/or indexed.  One way is to run a query on the
multi-attribute IDREF index for validation.  The other is to run
an unnest query which dereferences the IDREFs or KEYREFs to
ensure they point to the correct type.
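
     As a rough sketch of the bulk approach (the inputs here are
hypothetical stand-ins for what Niagara would actually pull from
its indexes or from an unnest query), IDREF <-> ID matching
reduces to one pass over the ID values and one pass over the
IDREF values:

def dangling_idrefs(ids, idrefs):
    """Return the IDREF values that do not resolve to any declared ID."""
    id_set = set(ids)                       # one pass over the ID values
    return [ref for ref in idrefs if ref not in id_set]

# An empty result means every IDREF resolves to some ID, e.g.
# dangling_idrefs(["s1", "s2"], ["s2", "s9"]) returns ["s9"].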

     Given the above, it seems that we shouldn't think of
document validation as an all-or-nothing proposition.  Yes, to
be strict, that is the only way to be exact with regard to the
standards of document compliance to a schema.  However, there
are cases I can think of where strict validation is not required
by, desired by, or useful to the user.  In these cases we should
look at document validation not as an absolute truth, but as a
level of compliance.

     If document validation is examined, it is apparent that it
consists of several independent validations.  I've broken these
up into the following categories, rated arbitrarily from least
complex to most complex, where complexity is a rough measure of
processing cost and of the difficulty of having the parser
perform the validation:
          +-------+--------------------------------+
          | level | validation                     |
          +-------+--------------------------------+
          |   0   | No info (no DTD)               |
          +-------+--------------------------------+
          |   1   | Document specifies a DTD       |
          +-------+--------------------------------+
          |   2   | DTD used to interpret the      |
          |       | basic document structure       |
          |       | correctly; important items     |
          |       | such as whitespace, IDs and    |
          |       | IDREFs, KEYs and KEYREFs       |
          +-------+--------------------------------+
          |   3   | Local structural constraints   |
          +-------+--------------------------------+
          |   4   | Global structural constraints  |
          +-------+--------------------------------+
          |   5   | Value constraints              |
          +-------+--------------------------------+
          |   6   | Global constraints             |
          +-------+--------------------------------+
          |   7   | Completely valid               |
          +-------+--------------------------------+
Levels 0-3 can easily be performed by the parser with little
additional cost.  Levels 4-6 can be done by the parser, but
require increasing resources and processing time.  Level 1 may
seem odd but isn't -- it covers a document which specifies a
DTD, but for which we can't locate the DTD at parse time.  Note
levels 3 and 4: some structural constraints are more global in
nature and have a correspondingly higher cost to validate.  That
cost may be greater than classified here; it could be comparable
to level 5.  Still, the split seems a natural break; levels 3
and 4 started out as a single entry for structural constraints.
Level 5, value constraints, is also odd.  Some value constraints
can be checked at parse time, for example determining that a
payload word is a valid integer.  Other value constraints are
more difficult and similar to SQL constraints on values --
unique values, for example.

     I propose that we store a document validation level for
each document which we store or know about.  When a query is
run, it can specify the minimum validation level it will accept
for documents, so that only documents which are valid enough are
processed for the query.  If the user cares less about validity,
they can access a larger assortment of documents.  If the user
cares more about validity, they will see only the valid, or
valid enough, documents.
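
     A rough sketch of the idea (the document names and levels
here are made up): each stored document carries the highest
level it has been validated to, and a query states the minimum
level it accepts:

# Hypothetical catalog of documents and their stored validation levels.
CATALOG = {"bib.xml": 7, "prices.xml": 2, "scraped.xml": 0}

def candidates(catalog, min_level):
    """Return the documents a query with this minimum level may touch."""
    return [doc for doc, level in catalog.items() if level >= min_level]

# candidates(CATALOG, 7) -> ["bib.xml"]
# candidates(CATALOG, 1) -> ["bib.xml", "prices.xml"]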

     The  query  can  also  specify how much work the system
should perform on behalf of the query.  This would allow the
system  to take the time to increase the validation level on
documents validated at a "lower level".  It would do this by
executing validation queries or operators on documents below the
specified validation level, to determine their actual validity.

     Another way of looking at this model is to specify how a
document is valid, rather than an abstract level.  This would
allow one to specify document validity in an orthogonal fashion;
you may not care that a document is valid, only that it says it
conforms to a DTD.  For example, a query may only care that a
document has local structure and correct values, not that the
IDREFs are consistent.

More DTD Wackiness

     Sekar pointed out some more considerations with document
schema information that also affect things such as whitespace
handling.

     First, an XML document can carry an xml:space attribute on
any element.  The attribute specifies how whitespace should be
handled for that element instance and all nested elements.  It
has two values, "default" and "preserve" (i.e. non-ignorable).
Default says the node follows whatever whitespace specification
has been made for the entire document, which is specified in the
root; that root-level specification says whether whitespace is
ignorable or not.  Preserve means that whitespace must not be
ignored until a different (default) specification is met.
Overall this scheme is odd, but it allows non-ignorable
whitespace to exist in odd documents correctly, though it does
lack a way to explicitly disable non-ignorable whitespace if
that were desired.
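
     A small illustration (the element names are made up):
whitespace inside the <code> element must be kept verbatim
because of xml:space="preserve", while whitespace elsewhere
follows the document-wide default set on the root:

# Hypothetical example document, embedded as a Python string.
SNIPPET = """<doc xml:space="default">
  <para>ignorable   whitespace   here</para>
  <code xml:space="preserve">    indented line
        must stay exactly as written</code>
</doc>"""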

     Another issue is that a DTD can specify attributes and/or
attribute values which must implicitly exist in the document.
Such a declaration can specify that the value of the attribute
is fixed (#FIXED), which means that the value of the attribute,
if present in an element, must match the value given in the DTD.
The other form, a plain default value declaration, means that
the attribute, with the value given in the DTD, must be added to
each element which does not contain that attribute.  Some
consideration should be given to handling this; the simplest
approach is to just add the attribute to each element at parse
time.  However, that is rather inefficient; keeping track of
document meta-information may be valuable if this issue becomes
important.  Difference queries can be used to determine which
elements take the default attribute value, by subtracting the
indexed attributes which were actually found in the document.
There are also issues around re-creating the original document
without all the "extra implied added-in" attributes this scheme
requires.
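
     A small illustration (the element and attribute names are
made up): "version" is declared #FIXED, so if it appears it must
be "1.0", and it is implied on every <order> that omits it;
"currency" has a plain default, so an <order> without it is
treated as if currency="USD" had been written:

# Hypothetical ATTLIST declaration, embedded as a Python string.
ATTLIST_EXAMPLE = """<!ATTLIST order
    version  CDATA #FIXED "1.0"
    currency CDATA "USD">"""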

Conclusion

     There seems to be a lot of work in this area with potential
for research.  However, beyond the immediate need to have DTD
and/or XML Schema information on hand for the parser, none of
this work is something to worry about in the short term.