Niagara Incremental Evaluator


Introduction

     This component is horribly misnamed.  It is used for incrementally traversing data stored in the DM.  Typically it is used to generate some sort of printable or stringable representation of that data.  That data is then used by some function to evaluate a predicate, to produce printed output, to build an in-memory data structure, or whatever.

     In its current incarnation, the incremental evaluator is a traverser which maintains some state and incrementally traverses a subtree of a document.  Most often it returns successive chunks of the document in a textual, aka stringable, representation.  This data can then be printed, evaluated in a predicate, or otherwise used.  However, generating textual material is not the sole function of the evaluator; it is just one thing that it can do.  The evaluator is really a traverser.  At each node it traverses, it invokes a user-defined module which does something with that node.  For example, along with the traverser comes the default module, which generates the stringable output mentioned here.  An example of an alternate module would be one that "prints" an in-memory data structure usable by some portion of the system -- for example, printing a DTD or schema information as an in-memory tree suitable for the optimizer to use.

     This is the incremental subtree evaluator.  What's a subtree?  Well, a subtree of a document could be just one node, a node name, the payload of a node, or indeed an entire subtree.  The incremental machinery needs to work in all of these scenarios -- for example, a single node that is very large, a tree with many nodes, or a large tree with large elements.

Structure and Operation

     Next, we need to determine how the incremental evaluator works.  For now we'll consider the case where only one subtree is involved; for example, we want to print a subtree as output, or evaluate a hash function on the subtree's printable XML representation.  The multiple-string case, string compare for example, will need to know about two incremental sources and how to run them, but there is _little_ additional complexity.

     The incremental evaluator will most likely be developed in 3 or 4 parts.  One part will be the new evaluator, which
takes care of most of the things mentioned here.  Another part will be changes to the DM that are needed to support an incremental evaluator.  The third part will be a set of incremental evaluator functions, string compare and printable XML for example, which use the system and demonstrate how to write other evaluation functions.  The fourth part will be the cache, which is not required for correctness, but is desired for performance.

     To start, we have an XNode which refers to the portion of the document of interest.  I mentioned in one of the meetings that an XNode can't point to an attribute and that attributes would need to be handled differently.  This was not correct: an XNode may refer to an attribute, and there is no need to handle attributes differently from other portions of an XML document.

     The next thing we'll have is a parameter to run the evaluator with: the block size, or chunk size.  This will be used to determine how much data is instantiated at a time.  It should be controllable for each instantiation of the evaluator.  This will allow us to run experiments to see which block sizes are effective, and it lets us control the overall resource use of an operator.
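
     To make that concrete, here is a minimal sketch of such an interface.  All names below (IncrEvaluator, next_chunk, the exact form of the XNode handle) are hypothetical illustrations, not existing Niagara code.

     // Minimal sketch -- hypothetical names, not existing Niagara code.
     #include <cstddef>

     struct XNode;                      // opaque handle into the DM (assumed)

     class IncrEvaluator {
     public:
         // chunk_size bounds how much data is instantiated per step; it
         // is a per-instantiation parameter, so block-size experiments
         // and per-operator resource control are easy.
         IncrEvaluator(const XNode *root, std::size_t chunk_size);

         // Instantiate up to chunk_size more bytes of the subtree's
         // output; returns false once the subtree is exhausted.
         bool next_chunk(const char *&chunk, std::size_t &len);
     };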

     There are several ways to instantiate XML data: full-blown XML (string()?), text(), and data().  I forget, and probably don't know, what all of the requirements are offhand.  At the least, examine the Quilt and XQuery papers to determine what is needed.  You should discuss these issues with some Niagara people, Sekar for certain and some others too, to see if there are any fine points here.  In addition, determine what facilities are needed for XMLQL, XPath and XSLT, and see what requirements they have for XML->text conversion.  This is important, for we are building a general XML query engine, NOT a Quilt/XQuery-specific engine.  It is likely that the evaluator will be useful for creating (dumping) XML documents, so that is another consideration.  Another is being able to produce a "regularized" version of what is stored in the DM, or an exact copy of the input document.  It seems that a general-purpose instantiator is capable of dealing with all these scenarios.
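
     One way to capture these variations is a per-instantiation output mode.  The enumeration below is only a sketch of the modes discussed above; the names are made up, and the real set must come out of the Quilt/XQuery/XPath/XSLT survey just described.

     // Hypothetical output modes -- the real list awaits the survey above.
     enum class InstantiateMode {
         FullXml,        // full-blown XML text, as for string()
         TextOnly,       // concatenated text content, as for text()
         TypedData,      // typed values, as for data()
         Regularized,    // normalized form of what the DM stores
         ExactCopy       // byte-for-byte copy of the input document
     };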

     It is likely that one can develop a model of what parts of the document need to be instantiated for each of these various "modes".  If this can be generalized, it would allow us to choose what the evaluator instantiates from the document -- for example, a set of flags saying what is instantiated: elements, text, attribute names, attribute values, nested elements, ordinary whitespace, ignorable whitespace, etc.  This would provide a building-block tool that is generally useful for a number of purposes, instead of a more restricted one.
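
     A sketch of such a flag set, again with hypothetical names:

     // Hypothetical flags controlling what the evaluator instantiates.
     enum : unsigned {
         INST_ELEMENTS     = 1u << 0,
         INST_TEXT         = 1u << 1,
         INST_ATTR_NAMES   = 1u << 2,
         INST_ATTR_VALUES  = 1u << 3,
         INST_NESTED       = 1u << 4,
         INST_WHITESPACE   = 1u << 5,  // ordinary whitespace
         INST_IGNORABLE_WS = 1u << 6   // ignorable whitespace
     };

     // e.g. a text()-style traversal wants text from nested elements only:
     const unsigned text_mode = INST_TEXT | INST_NESTED;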

     When instantiating the blocks, some sort of state (think of it as a seek offset) will need to exist, generated by the DM and the evaluator, to represent the current position in the stored document.  For example, if a large text element is being evaluated, evaluation may need to stop after the first X bytes, the block size, are processed.  If evaluation of that subtree must continue, the context will allow restarting from *the point* where it left off.  The context also provides structure-related info needed to traverse the stored document correctly.  Note that this context may be long-lived -- if we are using the cache, it will be necessary to continue a traversal at an arbitrary later time.
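
     As a sketch, such a context might look like the following; the fields are guesses at what "seek offset plus structure" requires, not a settled design.

     // Hypothetical resumable traversal context: the "seek offset" plus
     // enough structural state to continue at an arbitrary later time.
     #include <cstddef>
     #include <vector>

     struct XNode;

     struct TraversalContext {
         const XNode *current;              // node being instantiated
         std::size_t  byte_offset;          // position inside a large text node
         std::vector<const XNode *> path;   // ancestor chain, for structure
     };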

     Once the data is instantiated, the "evaluation function" will execute on the chunk of data.  The function will need an execution context so that it can retain state between chunks.  One way is to allow the function to control the incremental evaluator, as that allows the most flexibility for the "evaluation function" -- it requests the next chunk of data from the evaluator, is given a pointer to it, and may keep whatever state it wants while it uses the data.  Another way is to allow the evaluator to control evaluation, and to present the extracted data to an evaluation-function callback.  The callback is a class which maintains the state between callback invocations.
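
     The callback alternative would look something like this sketch (hypothetical names again):

     // Hypothetical push model: the evaluator drives, handing each chunk
     // to a callback object that keeps its own state across invocations.
     #include <cstddef>

     class EvalCallback {
     public:
         virtual ~EvalCallback() {}
         // Return false to stop the traversal early, e.g. once a
         // predicate has been proven or disproven.
         virtual bool consume(const char *chunk, std::size_t len) = 0;
     };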

     I recommend the "function controlling the incremental evaluator" approach.  This is the most general of the models, including the capability to build an efficient callback interface on top of it.  One thing to note with this model is that it is necessary to provide an interface which allows one or more subtrees to be incrementally evaluated at once.  There are two reasons.  One is that it allows the evaluator to build in-memory strings in parallel, if it is desired to do so; without such an interface, the evaluation function is forced to use the incremental evaluator serially, removing any hope of parallelism with simple evaluation functions.  This is no different from the callback model, as the callback only occurs once all strings in question have their next chunk evaluated.  The other reason is that it allows the evaluation of multiple subtrees to be tailored to one another's structure, if convenient.
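
     Here is a sketch of the pull model in use, with a string compare driving two evaluators at once.  It reuses the hypothetical IncrEvaluator interface from the earlier sketch and assumes next_chunk never reports success with an empty chunk.

     #include <cstddef>

     class IncrEvaluator {              // interface from the earlier sketch
     public:
         bool next_chunk(const char *&chunk, std::size_t &len);
     };

     // Pull-model string compare: the evaluation function drives both
     // evaluators and keeps its cursors as local state between chunks.
     int incr_compare(IncrEvaluator &a, IncrEvaluator &b) {
         const char *ca = 0, *cb = 0;
         std::size_t la = 0, lb = 0, ia = 0, ib = 0;
         bool ha = a.next_chunk(ca, la);
         bool hb = b.next_chunk(cb, lb);
         for (;;) {
             if (ha && ia == la) { ha = a.next_chunk(ca, la); ia = 0; continue; }
             if (hb && ib == lb) { hb = b.next_chunk(cb, lb); ib = 0; continue; }
             if (!ha || !hb)               // one (or both) sources exhausted
                 return ha == hb ? 0 : (ha ? 1 : -1);
             if (ca[ia] != cb[ib])         // first differing byte decides
                 return ca[ia] < cb[ib] ? -1 : 1;
             ++ia; ++ib;
         }
     }

     Note that the compare instantiates only as many chunks as it takes to find the first difference, which is exactly the benefit the incremental approach is after.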

Caching

     The next thing to consider is the partial evaluator cache.  Note that the cache is _not_ required for correctness; it is present for performance reasons.  The cache would hold already-evaluated "blocks" of subtrees.  How much of the subtree to hold is a good question.  For example, if we are probing with a subtree in a join, it would be very important to cache the subtree doing the probing; otherwise it would need to be reinstantiated with each probe against the hash tables.  Once
the initial probe is done, it is not as important to retain all the evaluated context of that subtree.  The most important part of the evaluated context to retain is the _start_, as that allows immediate (and fast) in-memory evaluation.  Each block in the cache should retain the context that it requires to perform further evaluation of the subtree.  For example, if one subtree needs to be evaluated more and more to prove or disprove the predicate, its entry would need to grow.  The results can stay in the cache, and when the next candidate needs to be evaluated against that subtree, more of the subtree may need to be evaluated to prove the predicate.  Of course, if the cache needs to reclaim some blocks, they will need to be deleted from the END of each cached subtree.  That is why the context must be saved: the subtree may need to be evaluated further.
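
     A sketch of what a cache entry might hold, under the same hypothetical names as before:

     // Hypothetical cache entry: the evaluated prefix of a subtree plus
     // the contexts needed to extend or re-trim it later.
     #include <string>
     #include <vector>

     struct TraversalContext;           // as sketched earlier

     struct CachedBlock {
         std::string       data;        // one evaluated chunk
         TraversalContext *resume;      // context at the end of this chunk,
                                        // so evaluation can continue from here
     };

     struct CacheEntry {
         std::vector<CachedBlock> blocks;  // prefix of the subtree, in order;
                                           // reclamation deletes from the END
     };

     Keeping a context per block is what lets reclamation trim blocks from the END of an entry while still allowing the traversal to resume from whatever prefix survives.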