Niagara Parse Streams


Background

     There  is a certain amount of redundancy in Niagara due
to its structure and the independent development efforts  of
its separate components.  For example, the Data Manager and
the Index Manager each have their own XML parser.  These are
both event-driven SAX parsers.

Actually  ... there is yet another XML parser in the system,
that which the File Scan  operator  uses,  which  is  a  DOM
parser.   However  the need for that separate parser will be
eliminated once the DM's XNodes will be  used  for  all  XML
data representation.

     What is the problem with two parsers, you ask?  Well,
there are several.  First off, there is a performance problem
-- a document must be parsed twice, once for the DM and once
for the IM, if it is to be stored in Niagara.  XML parsing
has a lot of overhead, and this cost is not reasonable for
real queries.

Secondly, there is the problem that the system as a whole
must maintain two XML parsers.  Outwardly this may not seem
like a big deal.  However, you must realize that the two
parsers must be in total agreement about the numbering of
items in an XML document.  This coupling makes you realize
that the parsers are not truly independent, but rather
duplicate the functionality of each other.  If they duplicate
it incorrectly, the system won't work.

Next, there is the issue of orthogonality.  The IM and the
DM have nothing to do with document parsing, and everything
to do with document storage.  Yet, considerable portions of
both revolve around document parsing.  By moving document
parsing to its own subsystem the complexity of these components
will be reduced.  In addition, it will be possible to have
multiple XML parsers and multiple parse consumers in the system
with only N+M complexity instead of NxM complexity.  An
example of this is "parsing" out of a document stored in the
DM so that it can be [re-]indexed.

Issues

     The XML parser is the repository  for  knowledge  about
numbering  of  XML  content.  At the very least there is the
Element-ID numbering currently used  by  the  DM.   However,
even this simplistic numbering scheme must take into account
the way documents are broken up in Niagara.  For example,
text between XML elements at a given level is stored as
Text nodes and the numbering must be adjusted to account for
this.
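The adjustment can be illustrated with a small sketch.  For a fragment like <a>hello<b/>world</a>, the runs "hello" and "world" are stored as Text nodes, so they consume IDs in document order just as elements do: <b> gets ID 3, not ID 2.  The ID scheme below is illustrative only, not the DM's actual one.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// A node as numbered in document order; both elements and text
// runs appear here.  All names are hypothetical.
struct NumberedNode {
    std::string label;  // element name or text content
    bool is_text;
    int id;
};

// Assign consecutive IDs to every node, element or text alike.
std::vector<NumberedNode> number_nodes(
        const std::vector<std::pair<std::string, bool>> &nodes) {
    std::vector<NumberedNode> out;
    int next_id = 1;
    for (const auto &n : nodes)
        out.push_back(NumberedNode{n.first, n.second, next_id++});
    return out;
}
```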

In addition to element numbering there is also the start,end
numbering used by the IM.  This numbering scheme is needed
because it reflects the structure of the document and allows
contains predicates to be evaluated against two postings
which have start,end numbers.  Knowledge of document
structure is required for contains queries.
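As a sketch of how the start,end numbers support a contains predicate: one element contains another exactly when its interval encloses the other's.  The posting layout below is hypothetical, not the IM's actual format.

```cpp
#include <cassert>

// Illustrative posting carrying the start,end numbers described
// above; the field names are hypothetical, not Niagara's types.
struct Posting {
    int start;
    int end;
};

// Element a contains element b exactly when a's interval
// encloses b's.
bool contains(const Posting &a, const Posting &b) {
    return a.start < b.start && b.end < a.end;
}
```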

With this new Niagara we are also introducing the concept of
payload  word  numbering.   This  allows  exact non-stopword
queries to be performed by the IM -- see the IM vignette for
details.

Stopwords  are another issue, as the parser may need to num-
ber stopwords in some cases and not  number  them  in  other
cases.   As  such,  the parser needs to know what a stopword
is.  It may be important to  have  a  configurable  stopword
list.   For  now  a  fixed stopword list is good enough, but
this is something important to consider for the future.
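A fixed stopword list could be as simple as the sketch below; a configurable version would load the words from elsewhere.  The class name and word list are purely illustrative.

```cpp
#include <cassert>
#include <set>
#include <string>

// Minimal sketch of a stopword list.  The default words stand in
// for the fixed list mentioned above; a configurable version
// would populate the set from a file or catalog instead.
class StopwordList {
public:
    StopwordList() : words_{"a", "an", "the", "and", "or", "of"} {}
    bool is_stopword(const std::string &word) const {
        return words_.count(word) != 0;
    }
private:
    std::set<std::string> words_;
};
```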

     In general, the parser will most  likely  provide  more
information  than  is  needed  by  any  one  entity which is
receiving its parse events.   Generically,  this  means  the
parser  will  provide  a  union  of  all  parse  information
required by each of its users.

Multiple Parsing

     All of the above goals are workable with  a  standalone
parsing module.  One parser in the system is better than two
and a big step forward from what we have now.  However,  one
of  the  goals is to have the parser provide parse events to
multiple consumers of those events at the same  time.   This
is  what will allow one parse to both load the document into
the DM, and to generate postings which are stored in the IM.
Since  this is a modular design, the multiple parsing module
could just be an ordinary parse event  receiver  which  then
duplicates  the  events  to  a  number  of other parse event
receivers.
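The fan-out described above might look like the following sketch.  The receiver interface and method names are assumptions for illustration, not Niagara's actual API; the point is that the duplicating module is itself just an ordinary parse event receiver.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical shape of the parse event receiver interface; the
// method names are illustrative, not Niagara's actual API.
class ParseEventReceiver {
public:
    virtual ~ParseEventReceiver() {}
    virtual void start_element(const std::string &name) = 0;
    virtual void text(const std::string &data) = 0;
    virtual void end_element(const std::string &name) = 0;
};

// The multiple-parsing module is itself just an ordinary receiver
// that forwards each event to every registered consumer (e.g. the
// DM loader and the IM posting generator).
class TeeReceiver : public ParseEventReceiver {
public:
    void add_consumer(ParseEventReceiver *consumer) {
        consumers_.push_back(consumer);
    }
    void start_element(const std::string &name) override {
        for (ParseEventReceiver *c : consumers_) c->start_element(name);
    }
    void text(const std::string &data) override {
        for (ParseEventReceiver *c : consumers_) c->text(data);
    }
    void end_element(const std::string &name) override {
        for (ParseEventReceiver *c : consumers_) c->end_element(name);
    }
private:
    std::vector<ParseEventReceiver *> consumers_;
};

// A trivial consumer used to demonstrate the fan-out.
class CountingReceiver : public ParseEventReceiver {
public:
    int events = 0;
    void start_element(const std::string &) override { ++events; }
    void text(const std::string &) override { ++events; }
    void end_element(const std::string &) override { ++events; }
};
```

Note that because events are forwarded by reference through plain virtual calls, no event data is copied or buffered on the way to the consumers.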

     It is currently assumed that the parse event interface
will be procedural in nature, rather than an actual event-
driven channel such as a stream.  This is specified for the
following reasons.  First is to minimize the amount of data
that is buffered.  If streams were used and the two consumers
have different consumption rates, then a backlog will accrue
at the slower consumer.  By using a synchronous interface
this buffering problem is eliminated.  Another issue to
consider is the need for incremental parsing.  If there is,
for example, a large string we don't want to have
to build it all in memory before passing it to the consumer.
An incremental approach where portions of the large data are
passed to the consumers reduces the impact of large data and
makes the system more scalable.  The third factor to consider
is the issue of copying data.  XML parsing is an expensive
operation, and copying the data to multiple consumers is that
much more expensive.

The bottom line is that the procedural interface is  simple,
highly  adaptable,  and  useful  from the beginning.  We may
decide to change it in the future, but it is a good building
block for now, and may be all that Niagara requires.

Implementation Notes and Goals

     One  goal  of the parsing system is to divorce the rest
of the system from dependence upon a particular XML  parser.
As such, the parse events interface should completely isolate
the consumers of parse events from the producer of the same.
This means that there should be no direct or indirect
coupling between the existing XML parser, Xerces-C, and  the
remainder  of  the  Niagara system.  In other words, the XML
parser is an implementation detail of the parse stream  gen-
erator and nothing more.

     Currently  the IM and the DM each drive their own load-
ing with a load(url) function.   They  know  about  creating
transactions  for  load, and then either commit or abort the
transaction.  In the new system this will  change  --  loads
will  happen  in  the  context  of  a load operator which is
responsible for connecting  a  document  source  to  an  XML
parser.   The  load  operation  will also be responsible for
collecting the receiver(s) of parse events and arranging for
the parse events to be sent to those event receiver(s).  The
IM and DM will know nothing about transaction management, and
will simply operate in the context of a transaction.
Transaction abort and commit will be
carried out by higher-level entities which can  make  policy
decisions.
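The control flow might be shaped like the sketch below.  Every name is hypothetical, and the parse step is a stand-in callback; the point is that the load operator owns the transaction and makes the commit/abort policy decision, while the IM and DM receivers only ever see parse events.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical transaction handle; illustrative only.
struct Transaction {
    bool committed = false;
    bool aborted = false;
    void commit() { committed = true; }
    void abort() { aborted = true; }
};

// Sketch of the load operator's control flow.  The parse callback
// stands in for connecting a document source to an XML parser and
// driving events to the registered receivers.
template <typename ParseFn>
bool load(Transaction &txn, ParseFn parse) {
    try {
        parse();       // send parse events to the IM/DM receivers
        txn.commit();  // policy decision lives here, not in IM/DM
        return true;
    } catch (const std::exception &) {
        txn.abort();
        return false;
    }
}
```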

     Currently, the IM and the DM are both entities with
somewhat unrelated halves.  The important part of both the IM
and the DM is the storage and retrieval of database
information.  However, both the IM and the DM have a large
fraction devoted to parsing XML into a format which the
storage engine can then store.  This increases the complexity
of the two systems quite a bit.  The bulk of this
functionality should move to the parsing system, leaving only
an IM- or DM-specific data interpreter to convert the parsed
data into the format needed by those subsystems.  Even better
would be to isolate the data conversion process in the IM and
DM from the actual DB components.  This would allow the DB
components to just store a "stream" of relevant objects,
whether postings or XNodes.  This will make the DB components
just components which deal with database access, and make the
system more
orthogonal.  Of course this is a long-range goal, but it
should be known up-front to guide development of the system.

     It is important that the parser be "scalable" in terms
of document size.  This scalability item means that the
memory needed for parsing should be independent of document
size -- if document size grows, parsing memory shouldn't grow
with it.  However, it is also important that the parser be
scalable at a smaller level -- for example for individual
items of a document, whether they be words or XML elements.
For example, a 100MB word in a document must not require the
100MB word to be formed in memory before it is passed to the
event receivers.  The event-driven nature of the parser will
tend to take care of large documents by breaking them up into
individual elements.  However, a further incremental approach
is required for handling individual large content.  For
example, a start_element function starts a new element of the
text, and a continue_element function incrementally adds text
to a started element until it is complete.  Of course, a
small element would be complete with just the call to
start_element, and would not require further calls.
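A consumer of that incremental interface might look like the sketch below.  This toy consumer simply concatenates the chunks; a real receiver would process each chunk as it arrives so that memory use stays independent of content size.  The class and method names beyond start_element and continue_element are illustrative.

```cpp
#include <cassert>
#include <string>

// Toy consumer for the incremental interface described above:
// large content arrives in chunks via continue_element rather
// than as one huge string.
class IncrementalConsumer {
public:
    void start_element(const std::string &first_chunk) {
        content_ = first_chunk;
    }
    void continue_element(const std::string &chunk) {
        content_ += chunk;
    }
    const std::string &content() const { return content_; }
private:
    std::string content_;
};
```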