XML Parsing
                            and
                        Parse Events


Introduction

     Niagara  will have a common XML parser interface.  This
will allow multiple consumers of  parse  data  and  multiple
parsers  with  only a N+M complexity, rather than a N*M com-
plexity.  By factoring the system this way  we  create  more
orthogonal components.

     The  cost  of components is defining interfaces so they
can exist as a non-monolithic block.  The interface is often
the  most important factor in designing a system.  With good
interfaces you can get by with poor  implementations,  since
the  implementation  can  be replaced with a better one only
when a better one is required.

     This vignette will discuss the issues  that  the  parse
events  interface  will  need  to address.  It will do so by
providing a number of sample data streams and a set of parse
events  which could be used to represent that data.  This is
not necessarily the way to do this, but  it  is  a  starting
point for further refinement.

     There are defintions of the terms and attributes at the
end of the document.

     The parser should provide all of the information needed
by  the  consumers  of  parse  events to know about document
structure.  This will eliminate the need for  the  receivers
to implement their own independent tools for doing something
the parser can easily provide.  Of course there will be par-
ticular  things  which  the  parser can't provide -- such as
building the ID<->IDREF information needed by the IM.

Whitespace

     It turns out that whitespace is a non-trivial issue  in
XML  parsing.   This  is because there are multiple kinds of
whitespace in a document.  To make  this  more  interesting,
the kind of whitespace is context dependent; it depends upon
what element or combination of elements the textual informa-
tion  is  enclosed  by!   There  are also whitespace scoping
rules in XML which make things more interesting, or tedious,
depending upon your viewpoint.

     The  first  kind of whitespace is ignoreable whitespace
which really  can't  be  ignored,  despite  the  name.   The









                             -2-


salient  point  is  that  you  need  not care about what the
actual whitespace is; only to know that  there  was  whites-
pace.  For example, some complicated whitespace of 10K char-
acters can be annotated as "whitespace", the  original  form
of the WS need not be recorded.

     The second kind of whitespace is non-ignoreable whites-
pace.  This kind of whitespace must be recorded  exactly  as
presented  in  the  source  document;  replacing  it  with a
generic whitespace is not allowed.

     There are some cases where  whitespace  is  ignoreable,
but  where  we  do  not  want to or must not ignore it!  For
example, storing a document that we would like  to  retrieve
an  exact version of, not just a semantically identical ver-
sion.  This is an important issue to some  users  and  docu-
ments, and is not an issue with other users or documents.  I
think the way to handle this is to specify, on  a  per-docu-
ment  basis,  if  ignoreable whitespace should be treated as
such, or as non-ignoreable whitespace.  This could be  indi-
cated in the catalog entry for the document.  Note that this
very exact whitespace implies that we  may  need  to  ignore
some  of  the  whitespace  during processing, or take it all
into consideration, or ignore all of it.   Fortunately  this
is not a big problem with our new design!

     A  discussion  brought up another concept of whitespace
handling; a query may want to treat  non-ignoreable  whites-
pace  as ignoreable whitespace.  This would allow a query to
match documents more freely  than  a  query  which  required
exact  whitespace semantics.  Essentially this kind of query
asks that whitespace rules be relaxed for its execution.

     XML allows control of whitespace in the schema specifi-
cation  for  a  document.  In addition, local XML whitespace
attributes in the document can modify whitespace handling to
an  extent.  The parser will need to be aware of the whites-
pace context and keep track of it.

Terminating Characters

     Niagara has a concept which add to  the  complexity  of
XML  document  parsing.   This  is the idea of a terminating
character.  This is a character which causes  a  non-whites-
pace  word  break  in a document.  The characters before the
terminating character are indexed as  their  own  word,  the
terminating character may be indexed, and the trailing char-
acters will be indexed as a new word.

     Terminating characters may be context  dependent  in  a
document,  and  may vary between documents.  What exactly to
do with terminating characters, and how best to utilize them
are  left  as  research.  However, the parser still needs to
handle them and interact correctly with whitespace.









                             -3-


Document Structure

     Finally we come to XML document structure.  This  turns
out  to  be the most simple thing to handle, as XML document
structure is regular.

     If we want to retrieve a document in  the  exact  order
that  it  was stored into the system, it may be necessary to
record extra information about the document  source  so  the
original  can  be  retrieved.   With the current niagara DM,
that means a mapping of attribute position to attribute term
number will be needed.

     However,  if exact whitespace representation is desired
in a document, complexity is considerably  increased.   This
is not just payload whitespace, but all whitespace in a doc-
ument.

Attributes

     Attributes are not in the payload of a document.   They
aren't  assigned  unique start numbers.  Instead, attributes
belong to the element which they are nested inside, and  are
assigned attribute#s relative to that node.  Attribute words
are numbered relative to the attribute  they  are  contained
within.

Items of Large Size

     The  intent  of the new Niagara is to be scaleable.  In
the context of parsing and storage, it means we treat every-
thing  as  it could be a huge object.  That is the case even
if we expect it to be small, or if we expect the common case
to  be small.  By using this approach we are assured in hav-
ing a system which will just work correctly in the  face  of
adversity  or unexpected inputs.  By doing this we eliminate
the source of many problems, and also eliminate the need  to
do  things  in multiple ways, which makes the system simpler
over all.

     The flip side of this is that we would like to make the
common  case,  where  items are small and easy to work with,
really fast.  This is allowed for throughout the new Niagara
design; small string values in tuples, but a fallback to the
Incremental Evaluator cache or Data Manager if the value  is
large.   The same issues ccurs during data storage and pars-
ing.  The common, short case will be  fast,  yet  everything
will  be  treated  as uniformly large to allow for the cases
when that occurs.

     Scalability is performed by allowing access to  sequen-
tial  chunks  of  a thing, and providing context about where
that chunk is located in the final result.










                             -4-


Which Whitespace to Consider

     If whitespace is  ordinary  whitespace,  the  following
rules apply.  Multiple whitespaces characters are reduced to
a single whitespace, which is not a character,  but  can  be
represented  as  one,  a  space  for example.  If whitespace
abuts document structure in elements, it is eliminated.   In
other words

                   this is   a test 

becomes

                   this is a test

This is consistent with document handling, does no harm, and
makes a regular storage model possible.  Let  us  call  this
canonical whitespace.

     If  non-ignoreable  whitespace  mode  is  enabled,  any
whitespace in a element payload will be  exactly  stored  as
represented in the source document.

Whitespace handling in attribute values may be different, as
whitespace could be more significant there.  As  an  initial
pass,  we  can  treat  attribute  whitespace  as  ignoreable
whitespace and apply the earlier rules to it.  There  is  no
way to query the whitespace, and if structure is desired, it
belongs in the document payload.  If schema  information  or
XML  standards  allow  more  exact  control of whitespace in
attributes, we may need to do something else.  If there is a
dataset  which  needs  significant  whitespace  handling  in
attributes, it will be easy to deal with it.

     Note that if a query contains whitespace that the  same
rules  can  be applied to the query to squish out whitespace
so that it matches the whitespace in a document.

What to Implement

     Obviously the parser and parse events need to correctly
implement  the parsing of document structure.  It also needs
to be able to parse payload and whitespace.

     What we aren't going  to  implement  for  now  is  100%
whitespace exactness.  Whitespace in attribute values and in
element payload will be the only whitespace the system cares
about.  This simplifies parsing and storage considerably, as
whitespace events can be ignored inside document  structure.

     Whitespace  interpretation  (exact,  non-exact) is con-
trolled by the evaluation function which prints stored  data
into  a  string  to be evaluated.  This is not an issue with
content which has ignoreable whitespace; it  is  already  in









                             -5-


the  canonical  form.   However,  for content which has non-
ignoreable whitespace, the evaluator can convert the whites-
pace  into  canonical  whitespace.   This  allows specifying
queries which don't care about  exact  whitespace,  and  can
explore a larger possible result realm.

ParseEvents

     I'm going to divide the discussion of parse events into
two arenas, which match the structure versus  content  split
which  I  imply  earlier.  However, before we get into that,
lets provide some common terminology for all parse events.

     There are two kinds of parse events.  The  first  group
has  to do with document structure, and reflect the presence
of elements and attributes.  The second kind of parse  event
has  to  do  with  document  payload, a misused term in this
paper.  Document payload only exists in  certain  points  of
document  structure.   The  two  different groups share some
attributes, but other attributes are unique to  each  group,
for example whitespace considerations and adjacency.

     Parse  events have a number of attributes which present
information about the document which is  needed  to  provide
the context  of that event within the structure of the docu-
ment.

Term           Definition
------------------------------------------------------------------
start          The numbering scheme which the IM uses  to  number
               payload  and  strcture  portions  of the document.
               Each element and significant word  is  assigned  a
               start  number.   This implies that both whitespace
               and stopwords, which act as whitespace to  a  cer-
               tain  extent,  need  not have unique start number.
               If a start number is needed,the  start  number  of
               the  preceding  event  could  be  repeated without
               harm.
end            The start number of an element's closing tag.
ele_id         The scheme the DM currently uses to store a  docu-
               ment;  an  ele_id is assigned to each XML element,
               and  to  each  DM-special  text   element.    Note
               match_start  below;  match_ele_id  isn't  required
               since this field can be used in both cases.
PW             Payload word number; all payloads words in a docu-
               ment are numbered from the first.
SW             Flag indicating a word is a stop-word
















                             -6-


isComplete     Parse  event  is  complete;  full  value  has been
               transfered.  If  the  data  takes  multiple  parse
               events  to transfer, isComplete will be present on
               the last event of that sequence.  For uniform han-
               dling  this  should  be true for all parse events,
               except those  which  require  >1  parse  event  to
               transfer the data.
offset=        Location  of  text  chunk  in result of this parse
               event.  Offset==0 in  the  first  event  of  large
               data,  and  grows uniformly based on the length of
               data presented so ar.
AN             Attribute number of this attribute in this element
AW             Attribute word number; all words in a attribute in
               a element are numbered from the first.
PE             Parse  Event number; unique number identifying the
               parse event.  This is a unique ID for  debug  pur-
               poses,  I don't think a event receiver should ever
               need to care about it.  It just says this  is  the
               n'th parse event received.
text           The  text associated with the parse event, or this
               chunk.  There is  a  length  associated  with  the
               text.
isAdjacent     This parse event is adjacent to the previous parse
               event.  Whitespace will always be isAdjacent.   If
               adjaceny  isn't specified, the event consumer does
               what it needs to do to represent things  properly;
               for  example  the  first  word  in a payload or an
               attribute won't have a preceeding whitespace.
nest           The nesting level in the document of a  particular
               element  or payload.  An element is at the nesting
               level of the payload of the  element  it  is  con-
               tained in.  Payload of an element is at a +1 nest-
               ing level.
match_start    The start number of the element which matches this
               element.  This is only used for end-element.
element_word   This  is  an  example of a attribute that we could
               add  if  we  choose  to.   This  illustrates   the
               attribute based nature of the interface.  The ele-
               ment word would be the cardinal number of  a  pay-
               load word in it's element.
isAttribute    Future  consideration  for common word and whites-
               pace model.  Indicates that the word or whitespace
               is  an  attribute  value  and not a payload value.
               This is not strictly necessary; consumers can dis-
               cover it from document structure easily.


     Any and all large values are handled by the offset= and
isComplete mechanism.  Briefly,  (offset==0  &&  isComplete)
means  that  an  entry is the common, shore case and the one
parse event has provided the complete value.  If  the  value
is  broken  into  several  chunks, the first chunk will have
(offset==0), and the last chunk will have (isComplete).  The
intermediate  chunks  of  the  value  will  have  positively









                             -7-


incrementing offset= attributes, as well as the  data  which
starts at that offset=.

     Every  parse  event will have a start number associated
with it.

     Some attributes of a parse event may not change between
successive  parse  events.  A trivial example of this is the
isComplete issue.  Another example is the start  number  may
not  change between some parse events.  This reflects a lack
of movement or change of context in the document.  For exam-
ple, the start number will remain constant until the end tag
of the element is seen, as the element counts as 1 piece  of
document  structure.   Another  example of this occur during
attribute handling; all attributes and attribute  words  and
the  end tag of an element will all have the start number of
the  element.   In  addition,  all  attribute  words  of  an
attribute will have the same attribute number.

     Adjaceny,  denoted through isAdjacent, indicates that a
payload element abuts the previous payload element.  Without
the adjacency event, the consumers are free to consider that
canonical whitespace exists between  the  payload  elements.
As mentioned earlier, all whitespace will be adjacent, since
any representation of whitespace as a parse event means that
the  whitespace  must be recorded exactly.  To simplify han-
dling of whitespace in consumers of parse events, the parser
could  or should mark payload following white space as adja-
cent.  Similarily, the adjacent property could be set on the
first payload in an attribute or element to signify that the
payload is abutted to the document structure;  or  in  other
words, to not have to keep state to avoid inserting a canon-
ical whitespace there.

Document Structure

     Document structure is easy to provide parse events for;
here they are:

+-------------------------------------------------------------+
|     Event           XML            Text        Text Content |
+-------------------------------------------------------------+
|Element                       -                -            |
|End-Element       |   -                -            |
|Attribute         def="        attribute name   def          |
|End-Attribute     "            -                             |
+-------------------------------------------------------------+

     The  structure  parse  events  will  have the following
attributes; isComplete is noted  to  indicate  which  things
could require incomplete handling.











                             -8-


+----------------------------------------------------------------------+
|                         Document Structure                           |
+----------------+-----------------------------------------------------+
|     Event      | start   text   attribute   match_start   isComplete |
+----------------+-----------------------------------------------------+
|Element         |   x      x         -            -            x      |
|End-Element-Tag |   x      -         -            -            -      |
|End-Element     |   x      -         -            x            -      |
|Attribute       |   x      x         x            -            x      |
|End-Attribute   |   x      -         x            -            -      |
+----------------+-----------------------------------------------------+


Document Payload

     As  I  mentioned earlier, document payload is a misused
term in this document.  It refers to both the text() payload
of an XML document, and to all non-structural content of the
document.  In this case, both to what we call payload  words
and  attribute  words.   It  also includes whitespace, which
only exists in document payload  (ignoring  exact  retrieval
issues).

     One  way  of dealing with this is to remove the differ-
ence between an attribute word and a payload word.  It would
then  be  the  responsbility  of the parse event consumer to
retrieve the desired attributes  from  the  parse  event  --
depending upon what it's state is.  That would make things a
bit more orthogonal, however the cost is it adds slight com-
plexity  to  some  components  (IM) which is trivial, and is
already required for whitespace handling.  The flip side  of
this  is  that  whitespace  should be labelled as payload or
attribute?  Ick.

     Another consideration are  terminating  characters;  it
may  desireable  to  either  issue a 'terminating character'
event, or add an attribute which indicates that a word is  a
terminating character.

     In  summary  of  the last paragraph, it doesn't matter;
i'll use either or both forms in the examples to see how  it
works.

+-------------------------------------------------------------------------+
|                            Document Payload                             |
+---------------+---------------------------------------------------------+
|    Event      | start   text   attrib   AW   SW   Adjacent   isComplete |
+---------------+---------------------------------------------------------+
|Payload Word   |   x      x       -      -    x       x           x      |
|Attribute Word |   x      x       x      x    x       x           x      |
|Whitespace     |   +      x       +      -    x       x           x      |
+---------------+---------------------------------------------------------+
If we posit the uniform word handling, it would look like:










                             -9-


+---------------------------------------------------------------------+
|                     Alternate Document Payload                      |
+-----------+---------------------------------------------------------+
|  Event    | start   text   attrib   AW   SW   Adjacent   isComplete |
+-----------+---------------------------------------------------------+
|Word       |   x      x       ?      ?    x       x           x      |
|Whitespace |   +      x       +      -    x       x           x      |
+-----------+---------------------------------------------------------+


Optimization


This  section  describe optimizations which are not required
for correctness.  The system as  it  stands  can  work  cor-
rectly;  optimizations  can  wait  for  profiling  and speed
needs.  These changes will not affect the system overall, as
they only affect the generator and receiver of parse events.

     In the interests of efficiency and the common  case  it
is  possible  to  streamline  the  isComplete protocol a bit
more.  For example, making the short and/or common case eas-
ier to detect.  However, this will work for now, is regular,
and isn't too bad.

     It is also possible to generate aggregate parse events.
This  depends  upon  the  individual consumer and the set of
consumers of parse events in an indiviudal  parse  scenario.
Payload  data  (words and attributes) received by the DM are
the main benefit of the  optimization.   However,  it  could
also be used with the IM as well.

     Worst  case  is  that  the  parse  events  are  used to
describe document structure,  and  the  new  common  payload
parser  is  used each by the DM and IM loading components to
parse payload.  The system is still operating off  a  common
parse, though.  I mention this here for completeness's sake.

     If the XML tag was a standalone element such as  
it  would  still generate two parse events with this schema.
It may be possible to streamline that a bit, but there is  a
problem  in that the element and close element actually need
to generate two start numbers, one for  the  begin  element,
and  another  start  number for the end element.  An elegant
solution for this is to use the start number of the  element
as  the end number.  This seems to work well, doesn't appear
to break any containment constraints, and reduces the number
of parse events for standalone element to one.

     Similarily,  tags without attributes could be optimized
into a single event.