Niagara Incremental Evaluator
Introduction
This component is horribly misnamed. It is used for incrementally
traversing data stored in the DM. Typically it generates some sort of
printable or stringable representation of that data, which is then used
by some function to evaluate a predicate, to produce printed output, to
build an in-memory data structure, or whatever.
In its current incarnation, the incremental evaluator is a traverser
which maintains some state and incrementally traverses a subtree of a
document. Most often it returns successive chunks of the document in a
textual, aka stringable, representation. This data can then be printed,
evaluated in a predicate, or otherwise used. However, generating
textual material is not the sole function of the evaluator; it is just
one thing it can do. The evaluator is really a traverser. At each node
it reaches, it invokes a user-defined module which does something with
that node. For example, along with the traverser comes the default
module, which generates the stringable output mentioned here. An
example of an alternate module would be one that "prints" an in-memory
data structure usable by some portion of the system -- for example,
printing a DTD or schema information as an in-memory tree suitable for
the optimizer to use.
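To make the traverser/module split concrete, here is a minimal C++
sketch; EvalModule, atNode, and the rest are invented names, not actual
Niagara interfaces.

    struct XNode;  // node handle from the DM

    // Hypothetical module interface: the traverser walks the subtree
    // and hands every node it reaches to a user-supplied module.
    class EvalModule {
    public:
        virtual ~EvalModule() {}
        virtual void atNode(const XNode &node) = 0;  // "does something"
    };

    // The default module appends a stringable form of each node to an
    // output buffer; an alternate module could instead build an
    // in-memory structure, e.g. a DTD tree for the optimizer.
    class StringableModule : public EvalModule {
    public:
        void atNode(const XNode &node) { /* append node's text form */ }
    };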
This is the incremental subtree evaluator. What's a subtree? Well, a
subtree of a document could be just one node, a node name, the payload
of a node, or indeed an entire subtree. The incremental machinery needs
to work in all of these scenarios -- for example, a single node that is
very large, a tree that is large, or a large tree with large elements.
Structure and Operation
Next, we need to determine how the incremental evaluator works. For
now we'll consider the case where only one subtree is involved; for
example, we want to print a subtree as output, or evaluate a hash
function on the subtree's printable XML representation. The
multiple-string case, string compare for example, will need to know
about two incremental sources and how to run them, but there is
_little_ additional complexity.
The incremental evaluator will most likely be developed
in 3 or 4 parts. One part will be the new evaluator which
takes care of most of the things mentioned here. Another
part will be changes to the DM that are needed to support an
incremental evaluator. The third part will be a set of
incremental evaluator functions, string compare and printable XML for
example, which use the system and demonstrate
how to write other evaluation functions. The fourth part
will be the cache, which is not required for correctness,
but is desired for performance.
To start, we have an XNode which refers to the portion of the document
of interest. I mentioned in one of the meetings that an XNode can't
point to an attribute and attributes would need to be handled
differently. This was not correct; an XNode may refer to an attribute,
and there is no need to handle attributes differently from other
portions of an XML document.
The next thing we'll have is a parameter to run the evaluator with: the
block size or chunk size, which determines how much data is
instantiated at a time. It should be controllable for each
instantiation of the evaluator. This will allow us to run experiments
to see what effective block sizes are, and it lets us control the
overall resource use of an operator.
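A sketch of how that might look, with invented names; the point is only
that the chunk size is bound per instantiation.

    #include <cstddef>

    struct XNode;  // node handle from the DM

    // Hypothetical shape of the evaluator's setup: the chunk size is
    // fixed per instantiation, so each operator can bound its own
    // resource use and experiments can vary the value.
    class IncrEvaluator {
    public:
        IncrEvaluator(const XNode &root, std::size_t chunkSize);
        // ...traversal interface elided...
    };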
There are several ways to instantiate XML data: full-blown XML
(string()?), text(), and data(). I forget, and probably don't know,
what all of the requirements are offhand. At the least, examine the
Quilt and XQuery papers to determine what is needed. You should
discuss these issues with some Niagara people, Sekar for certain and
some others too, to see if there are any fine points to this. In
addition, determine what facilities are needed for XML-QL, XPath, and
XSLT, and see what requirements they have for XML->text conversion.
This is important, for we are building a general XML query engine, NOT
a Quilt/XQuery-specific engine. It is likely that the evaluator will
be useful for creating (dumping) XML documents, so that is another
consideration. Another is being able to produce a "regularized"
version of what is stored in the DM, or an exact copy of the input
document. It seems that a general-purpose instantiator is capable of
dealing with all these scenarios.
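As a strawman, the scenarios above might collapse into a mode selector
roughly like this; the mode list is a guess, to be settled by the
Quilt/XQuery/XML-QL survey just described.

    // Guessed instantiation modes.  string(), text(), and data() come
    // from the Quilt/XQuery drafts; the last two cover the
    // dumping/regularizing scenarios.
    enum OutputMode {
        MODE_XML,          // full-blown serialized XML (string()?)
        MODE_TEXT,         // text(): character content only
        MODE_DATA,         // data(): typed values
        MODE_EXACT_COPY,   // byte-for-byte copy of the input document
        MODE_REGULARIZED   // canonical form of what the DM stores
    };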
It is likely that one can develop a model of what parts of the document
need to be instantiated for each "mode". If this can be generalized,
it would allow us to choose what the evaluator instantiates from the
document -- for example, via a set of flags selecting what is
instantiated: elements, text, attribute names, attribute values, nested
elements, ordinary whitespace, ignorable whitespace, etc. This would
provide a building-block tool that is generally useful for a number of
purposes, instead of a more restricted thing.
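A sketch of such a flag set, with hypothetical names; each "mode" above
would then just be a particular combination of flags.

    // Hypothetical flags: one bit per class of document content the
    // evaluator can be asked to instantiate.
    enum InstFlags {
        INST_ELEMENTS     = 1 << 0,  // element tags
        INST_TEXT         = 1 << 1,  // character data
        INST_ATTR_NAMES   = 1 << 2,
        INST_ATTR_VALUES  = 1 << 3,
        INST_NESTED       = 1 << 4,  // descend into nested elements
        INST_WHITESPACE   = 1 << 5,  // ordinary whitespace
        INST_IGNORABLE_WS = 1 << 6   // ignorable whitespace
    };
    // e.g. a text()-like mode might be INST_TEXT | INST_NESTED.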
When instantiating the blocks, some sort of state (think of it as a
seek offset) will need to exist, generated by the DM and the evaluator,
to represent the current position in the DM document. For example, if
a large text element is being evaluated, evaluation may need to stop
after the first X bytes, the block size, are processed. If evaluation
of that subtree must continue, the context will allow restarting from
*the point* where it left off. The context also provides
structure-related info to traverse the stored document correctly. Note
that this context may be long-lived -- if we are using the cache, it
will be necessary to continue a traversal at an arbitrary later time.
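Something like the following could serve as that context; the fields
are invented, and the real thing depends on what handles and offsets
the DM can expose.

    #include <cstddef>
    #include <vector>

    struct XNode { /* stand-in for the DM's node handle */ };

    // Hypothetical traversal context: a resumable "seek offset" into
    // the stored document.  It may be long-lived when blocks sit in
    // the cache, so it cannot depend on transient evaluator state.
    struct EvalContext {
        std::vector<XNode> ancestors;  // path back to the subtree root,
                                       //   for re-entering the structure
        XNode current;                 // node being traversed
        std::size_t byteOffset;        // position inside a large text node
    };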
Once the data is instantiated, the "evaluation function" will execute
on the chunk of data. The function will need an execution context so
that it can retain state between chunks. One way is to allow the
function to control the incremental evaluator, as that allows the most
flexibility for the "evaluation function" -- it requests the next chunk
of data from the evaluator, is given a pointer to it, and may keep
whatever state it wants while it uses the data. Another way is to
allow the evaluator to control evaluation and present the extracted
data to an evaluation-function callback; the callback is a class which
maintains state between invocations. Both models are sketched below.
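    #include <cstddef>

    // The two control models, with invented names.
    //
    // Pull model: the evaluation function drives the evaluator, asking
    // for one chunk at a time and keeping whatever state it wants.
    class IncrEvaluator {
    public:
        virtual ~IncrEvaluator() {}
        // Returns a pointer to the next chunk and sets *len, or
        // returns NULL (with *len == 0) when the subtree is exhausted.
        virtual const char *nextChunk(std::size_t *len) = 0;
    };

    // Push model: the evaluator drives and presents each extracted
    // chunk to a callback object that holds state between invocations.
    class ChunkCallback {
    public:
        virtual ~ChunkCallback() {}
        virtual bool consume(const char *chunk, std::size_t len) = 0;
    };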
I recommend the "function controlling the incremental evaluator"
approach. This is the most general of the models, and it includes the
capability to build an efficient callback interface on top. One thing
to note with this model is that it is necessary to provide an interface
which allows one or more subtrees to be incrementally evaluated at
once. There are two reasons. One is that it allows the evaluator to
build in-memory strings in parallel if that is desired; without such an
interface, the evaluation function is forced to use the incremental
evaluator serially, removing any hope of parallelism with simple
evaluation functions. This is no different from the callback model, as
the callback only occurs once all strings in question have their next
chunk evaluated. The other reason is that it allows the evaluation of
multiple things to be tailored to each other's structure, if
convenient.
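To illustrate the recommended model, here is how a string compare might
drive two evaluators, using the hypothetical nextChunk() interface
sketched above. Note that the function stops as soon as the strings
diverge, without instantiating the rest of either subtree.

    #include <cstddef>

    // Minimal sketch: lexicographic compare of two incrementally
    // evaluated strings, tolerating chunks of unequal length.
    // (Assumes nextChunk() never returns an empty non-NULL chunk.)
    int incrStringCompare(IncrEvaluator &a, IncrEvaluator &b) {
        const char *pa = 0, *pb = 0;
        std::size_t la = 0, lb = 0, ia = 0, ib = 0;
        for (;;) {
            if (ia == la) { pa = a.nextChunk(&la); ia = 0; }  // refill a
            if (ib == lb) { pb = b.nextChunk(&lb); ib = 0; }  // refill b
            if (pa == 0 || pb == 0)            // at least one ended
                return (pa != 0) - (pb != 0);  // shorter string sorts first
            while (ia < la && ib < lb) {       // compare within the chunks
                if (pa[ia] != pb[ib])
                    return pa[ia] < pb[ib] ? -1 : 1;
                ++ia; ++ib;
            }
        }
    }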
Caching
The next thing to consider is the partial evaluator cache. Note that
the cache _is_not_ required for correctness; it is present for
performance reasons. The cache would hold evaluated "blocks" of
subtrees. How much of the subtree to hold is a good question. For
example, if we are probing with a subtree in a join, it would be very
important to cache the subtree which is doing the probing; otherwise it
would need to be reinstantiated with each probe against the hash
tables.
Once the initial probe is done, it is not as important to retain all
the evaluated context of that subtree. The most important part of the
evaluated context to retain is the _start_, as that allows immediate
(and fast) in-memory evaluation. Each block in the cache should retain
the context it requires to perform further evaluation of the subtree.
For example, if one subtree needs to be evaluated more and more to
prove or disprove the predicate, its cache entry would need to grow.
The results can stay in the cache, and when the next candidate needs to
be evaluated against that subtree, more of the subtree may need to be
evaluated to prove the predicate. Of course, if the cache needs to
reclaim some blocks, they will need to be deleted from the END of each
document. That is why the context must be saved: the subtree may need
to be evaluated further.
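To make the bookkeeping concrete, a cache entry might look roughly like
this; EvalContext is the saved traversal context sketched earlier, and
every name here is invented.

    #include <string>
    #include <vector>

    struct XNode;        // DM node handle (see above)
    struct EvalContext;  // saved traversal context (see above)

    // Hypothetical cache entry: the evaluated prefix of one subtree
    // plus the saved context needed to extend it.  Eviction trims
    // blocks from the END, so the context always matches the tail of
    // what remains and evaluation can resume from there.
    struct CacheEntry {
        const XNode *root;                // subtree this entry covers
        std::vector<std::string> blocks;  // evaluated chunks, in order
        EvalContext *resume;              // where evaluation left off
        bool complete;                    // whole subtree evaluated?
    };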