Incremental Subtree Evaluator


Problem

     XML  query  processing often implies that a string com-
parison will need to be made on data.  Often  this  data  is
quite  large,  encompassing an entire subtree of a document.
However, there are other cases when the data can  be  large,
for example a large amount of text data  between two tags.

     When  comparing  such  data Niagara would make complete
in-memory copies of the data which needed  to  be  compared.
In the case of a join, both arguments would need to be built
in memory.  After the string(s) were constructed in  memory,
then  they  would  be  compared  as strings to determine the
value of the predicate.

     There are problems with this approach.  The most  obvi-
ous  is that it requires one or more potentially quite large
(entire document anyone) data sources  to  be  traversed  on
disk  and  built  into  memory.   This  requires  a  lot  of
resources in the form of memory use, disk I/O and CPU  time.
It  doesn't  scale at all with document size.  This approach
also forces the system to do far more work than is needed in
many  cases.   There  are always some cases where the entire
string representation will need to be examined to  determine
the  truth  of  the  predicate.  However they are many cases
where the predicate can be proved false or true much earlier
than  the  end of the string.  In these cases the system has
done a large amount of work to fully instantiate  these  in-
memory strings, only to throw them away mostly unused.

Solution

     The solution to this problem is the Incremental Subtree
Evaluator.  The incremental evaluator may still need to make
in-memory  copies  of the data.  However, the evaluator will
incrementally instantiate blocks of  each  document  subtree
and  then  compare  these blocks with the appropriate predi-
cate.  If the predicate can be proved with the current block
of  data,  no more data need be brought into memory, and the
value of the predicate can be returned.

     This works well with subtrees, which need to be  appro-
priately formatted from the XML database.  It also works for
large elements, which do not need to be completely instanti-
ated to be compared.  Indeed, with large elements it is pos-
sible that they can be compared directly on the pinned pages
which  they  reside  in,  and  need  not be copied into non-
buffer-pool memory at all.









                             -2-


Cache

     There is another issue involved  with  Subtree  Evalua-
tion, and that is the issue of cacheing.  It is expensive to
evaluate a subtree to compare it.  Once that work is done it
would  be  good to cache it, if at all possible, so that the
evaluation need not be performed again.  Instead of throwing
away  all  of the data which we have read from disk and con-
structed into a string, we can cache the result for  further
comparison.   This is more of an issue with XML that must be
reconstructed, rather than with linear data --  indeed large
chunks  of  data which need not be parsed are better left on
disk and let the buffer cache do its best.

     XNodes can be used as cache keys,  as  a  string  value
will  always  (however note on attributes) be represented by
an XNode.

     There is also the question about what to do when a sub-
tree  of a document is requested.  If that data is a subtree
of a cached subtree, can the already cached data be used?

     With early termination of predicates, individual  large
strings may grow incrementally with different comparisons of
cached data.  When evaluating a string that is  cached,  the
cached  data  may be used until its end is reached.  At that
point the evaluator will need to generate additional  string
for the values and add it to the cached value.  Context will
be needed to pickup the evaluation job from  where  it  left
off  previously  Another  concern  is  when  linear  data is
encountered.  If small there is  no  harm  in   caching  it.
However,  larger linear data should be left in place and the
corresponding chunk of the cached value should refer to  the
on-disk value.

     As  mentioned  earlier, there is a cost to reading from
the DM to build the string representation.   The  amount  of
memory that is available for the evaluator cache is limited.
It may be advantageous to write the  cached  versions  of  a
string  out  to  a  file  in the db when the cache grows too
large, rather  than  disarding  the  cached  data.   Reading
strings  from  a file for comparison can be less costly than
re-instantiating the string representation of the subtree.

     Much of this cacheing borders on  research,  and  is  a
good candidate to be left for the future.  At the same time,
however, this concerns are presented now so that they may be
taken  into  consideration when the non-cacheing incremental
evaluator is written.  With the concepts in place  from  the
start  the  evaluator  will  be  reusable, instead of a hin-
drance,  when  cacheing  is  added  to  the  system  to   be
researched.











                             -3-


Implementation Concerns

     Strings that are evaluated will always be subtrees of a
document, and will always be addressed via an XNode.
     Note that is true  except  for  the  attribute  problem
     which  is  outside  the  scope  of this work.  Briefly,
     individual attributes don't have  an  XNode,  they  are
     part of the element which contains them.  We are posit-
     ing for now that attributes are smaller chunks of  data
     and  this will not be an issue.  If they are large this
     issue will need to be re-evaluated.  Perhaps  adding  a
     ANode  (attribute  node)  type to the system would be a
     good solution, as the ANode (an extended  XNode)  could
     provide the ordinal of the attribute in it's containing
     XNode.

     Although it is labelled as  a  string  comparator,  the
incremental  evaluator  should  be  able to do anything that
would benefit from it's  existence.   That  means  that  the
string comparison portion of this system should be an inter-
changeable module at runtime.  For example, I might want  to
generate a hash value on the text of a subtree; all the work
of walking and instantiating the  subtree  is  done  by  the
evaluator.   The user then needs to provide a hash generator
which can be plugged into the evaluator.

     The evaluator should be configurable -- it should  pro-
vide mechanism, not policy.  The size of blocks that it con-
structs should be configurable on a per-evaluation basis.  A
mechanism  to allow early-return of the "predicate", depend-
ing upon what data it finds at runtime.  Either of a charac-
ter  oriented or block oriented data interface to the predi-
cate, depending upon what it requires.

     It should be able to provide some measure of statistics
with  each  evaluation,  and also across all evaluations For
example,  XNodes  traversed,  bytes   instantiated,   blocks
instantiated,  #blocks  and  total  size  of  data evaluated
directly out of the buffer  pool  are  an  example  of  some
things that might be useful.  It would be best if the evalu-
ator allowed statistics to be gathered and manipulated by  a
controlling  entity,  rather  than  being accumulated by the
package itself.  This allows for packaging the statistics at
different  granularity levels, which the evaluator need know
nothing about.