Niagara Incremental Evaluator

Introduction

This component is horribly misnamed. It is used for incrementally traversing data stored in the DM. Typically it is used to generate some sort of printable or stringable representation of that data. That data is then used by some function to evaluate a predicate, to generate a printable representation, to build an in-memory data structure, or whatever.

In its current incarnation, the incremental evaluator is a traverser which maintains some state and incrementally traverses a subtree of a document. Most often it returns successive chunks of the document in a textual, aka stringable, representation. This data can then be printed, evaluated in a predicate, or otherwise used. However, generating textual material is not the sole function of the evaluator; it is just one thing that it can do. The evaluator is really a traverser. Once it reaches a node, a user-defined module does something with it. For example, along with the traverser comes the default module, which generates the stringable output mentioned here. An example of an alternate module would be one that "prints" an in-memory data structure usable by some portion of the system -- for example, printing a DTD or schema information as an in-memory tree suitable for the optimizer to use.

This is the incremental subtree evaluator. What's a subtree? Well, a subtree of a document could be just one node, a node name, the payload of a node, or indeed an entire subtree. The incremental machinery needs to work in all of these scenarios -- for example, a single node that is very large, a tree that is large, or a large tree with large elements.

Structure and Operation

Next, we need to determine how the incremental evaluator works. For now we'll consider the case where only one subtree is involved; for example, we want to print a subtree as output, or evaluate a hash function on the subtree's printable XML representation. The multiple string case, string compare for example, will need to know about two incremental sources and how to run them, but there is _little_ additional complexity.

The incremental evaluator will most likely be developed in three or four parts. One part will be the new evaluator itself, which takes care of most of the things mentioned here. Another part will be the changes to the DM that are needed to support an incremental evaluator. The third part will be a set of incremental evaluator functions, string compare and printable XML for example, which use the system and demonstrate how to write other evaluation functions. The fourth part will be the cache, which is not required for correctness but is desired for performance.

To start, we have an XNode which refers to the portion of the document of interest. I mentioned in one of the meetings that an XNode can't point to an attribute and that attributes would need to be handled differently. This was not correct: an XNode may refer to an attribute, and there is no need to handle attributes differently from other portions of an XML document.

The next thing we'll have is a parameter to run the evaluator with: the block size or chunk size. This determines how much data is instantiated at a time. It should be controllable for each instantiation of the evaluator. This will allow us to run experiments to see what effective block sizes are, and it allows us to control the overall resource use of an operator.
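To make the pieces named so far concrete, here is a minimal sketch in C++ of a traverser with a pluggable per-node module and a chunk-size parameter. Everything in it is illustrative: the XNode stand-in is a toy, and TraversalModule, StringableModule, and IncrEvaluator are made-up names, not Niagara's actual classes.

    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    // Toy stand-in for the DM's XNode: a labeled tree with a text payload.
    struct XNode {
        std::string name;
        std::string text;
        std::vector<XNode> children;
    };

    // User-defined module invoked as the traverser reaches each node.
    class TraversalModule {
    public:
        virtual ~TraversalModule() = default;
        virtual void enter(const XNode &n) = 0;              // node reached
        virtual void leave(const XNode &n) = 0;              // subtree done
        virtual void payload(const std::string &chunk) = 0;  // <= chunk_size bytes
    };

    // Default module: generates the "stringable" textual output.
    class StringableModule : public TraversalModule {
    public:
        void enter(const XNode &n) override { std::cout << '<' << n.name << '>'; }
        void leave(const XNode &n) override { std::cout << "</" << n.name << '>'; }
        void payload(const std::string &c) override { std::cout << c; }
    };

    // The traverser: hands payloads to the module at most chunk_size bytes
    // at a time, so even a single very large text node is chunked.
    class IncrEvaluator {
    public:
        IncrEvaluator(const XNode &root, std::size_t chunk_size, TraversalModule &m)
            : root_(root), chunk_(chunk_size), mod_(m) {}

        void run() { walk(root_); }

    private:
        void walk(const XNode &n) {
            mod_.enter(n);
            for (std::size_t off = 0; off < n.text.size(); off += chunk_)
                mod_.payload(n.text.substr(off, chunk_));
            for (const XNode &c : n.children)
                walk(c);
            mod_.leave(n);
        }
        XNode root_;
        std::size_t chunk_;
        TraversalModule &mod_;
    };

    int main() {
        XNode doc{"book", "", {{"title", "An Incremental Evaluator", {}}}};
        StringableModule out;
        IncrEvaluator(doc, 8, out).run();  // 8-byte chunks, purely for illustration
        std::cout << '\n';
    }

An alternate module, per the above, would build an in-memory structure instead of printing; only the module changes, not the traverser.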
There are several ways to instantiate XML data: full-blown XML (string()?), text(), and data(). I forget, and probably don't know, what all of the requirements are offhand. At the least, examine the Quilt and XQuery papers to determine what is needed. You should discuss these issues with some Niagara people, Sekar for certain and some others too, to see if there are any fine points to this. In addition, determine what facilities are needed for XML-QL, XPath, and XSLT, and see what requirements they have for XML-to-text conversion. This is important, for we are building a general XML query engine, NOT a Quilt/XQuery-specific engine. It is likely that the evaluator will be useful for creating (dumping) XML documents, so that is another consideration. Another is being able to produce a "regularized" version of what is stored in the DM, or producing an exact copy of the input document. It seems that a general-purpose instantiator is capable of dealing with all of these scenarios.

It is likely that one can develop a model of what parts of the document need to be instantiated for each of the various "modes". If this can be generalized, it would allow us to choose what the evaluator instantiates from the document. For example, a set of flags saying what is instantiated -- elements, text, attribute names, attribute values, nested elements, ordinary whitespace, ignorable whitespace, etc. This would provide a building-block tool that is generally useful for a number of purposes, instead of a more restricted thing.

When instantiating the blocks, some sort of state (think of it as a seek offset) will need to exist, generated by the DM and the evaluator, to represent the current position in the DM document. For example, if a large text element is being evaluated, evaluation may need to stop after the first X bytes -- the block size -- are processed. If evaluation of that subtree must continue, the context will allow restarting from *the point* where it left off. The context also provides the structure-related information needed to traverse the stored document correctly. Note that this context may be long lived -- if we are using the cache, it will be necessary to continue a traversal at an arbitrary later time.

Once the data is instantiated, the "evaluation function" will execute on the chunk of data. The function will need an execution context so that it can retain state between chunks. One way is to allow the function to control the incremental evaluator, as that allows the most flexibility for the "evaluation function" -- it requests the next chunk of data from the evaluator, is given a pointer to it, and may keep whatever state it wants while it uses the data. Another way is to allow the evaluator to control evaluation, and to present the extracted data to an evaluation-function callback. The callback is a class which maintains the state between callback invocations.

I recommend the "function controlling the incremental evaluator" approach. This is the most general of the models, including the capability to build an efficient callback interface on top of it. One thing to note with this model is that it is necessary to provide an interface which allows one or more subtrees to be incrementally evaluated at once. There are two reasons. One is that it allows the evaluator to build in-memory strings in parallel, if it is desired to do so. Without such an interface, the evaluation function is forced to use the incremental evaluator serially, which removes any hope of parallelism with simple evaluation functions. This is no different from the callback model, as the callback only occurs once all strings in question have their next chunk evaluated. The other reason is that it allows evaluation of multiple things to be tailored to the structure of each other, if convenient. A sketch of this pull model follows.
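Here is an illustrative C++ sketch of the recommended pull model: the evaluation function drives the evaluator one chunk at a time, a Context records where the traversal stopped so it can be resumed arbitrarily later, and a hypothetical flag set controls what is instantiated. None of these names (PullEvaluator, Context, InstFlags, Piece) are Niagara's; the "document" is a toy flattened list standing in for the DM.

    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <optional>
    #include <string>
    #include <vector>

    // Hypothetical flags choosing what the evaluator instantiates.
    enum InstFlags : std::uint32_t {
        INST_ELEMENTS    = 1u << 0,
        INST_TEXT        = 1u << 1,
        INST_ATTR_NAMES  = 1u << 2,
        INST_ATTR_VALUES = 1u << 3,
    };

    // Toy stand-in for the stored document: a flattened run of typed pieces.
    struct Piece { std::uint32_t kind; std::string data; };

    // Resume state generated by the DM and the evaluator -- think of it as
    // a structured seek offset.  It may be long lived: with the cache, a
    // traversal can be continued at an arbitrary later time.
    struct Context {
        std::size_t piece = 0;  // which piece of the document
        std::size_t off   = 0;  // byte offset within that piece
    };

    class PullEvaluator {
    public:
        PullEvaluator(const std::vector<Piece> &doc,
                      std::uint32_t flags, std::size_t chunk)
            : doc_(doc), flags_(flags), chunk_(chunk) {}

        // The evaluation function drives the evaluator: it requests the
        // next chunk and keeps whatever state it likes between calls.
        std::optional<std::string> next(Context &ctx) const {
            while (ctx.piece < doc_.size()) {
                const Piece &p = doc_[ctx.piece];
                if (!(p.kind & flags_) || ctx.off >= p.data.size()) {
                    ++ctx.piece;            // skip filtered or finished piece
                    ctx.off = 0;
                    continue;
                }
                std::string out = p.data.substr(ctx.off, chunk_);
                ctx.off += out.size();
                return out;
            }
            return std::nullopt;            // subtree exhausted
        }

    private:
        const std::vector<Piece> &doc_;
        std::uint32_t flags_;
        std::size_t chunk_;
    };

    int main() {
        std::vector<Piece> doc = {
            {INST_ELEMENTS, "<title>"},
            {INST_TEXT,     "A fairly long text payload"},
            {INST_ELEMENTS, "</title>"},
        };
        // Two subtrees could be run in parallel simply by holding two
        // evaluators and two contexts and alternating next() calls.
        PullEvaluator ev(doc, INST_ELEMENTS | INST_TEXT, 6);
        Context ctx;
        while (auto chunk = ev.next(ctx))
            std::cout << '[' << *chunk << ']';
        std::cout << '\n';
    }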
Caching

The next thing to consider is the partial-evaluation cache. Note that the cache _is not_ required for correctness; it is present for performance reasons. The cache would hold evaluated "blocks" of subtrees which need to be evaluated. How much of the subtree to hold is a good question. For example, if we are probing with a subtree in a join, it would be very important to cache the subtree doing the probing; otherwise it would need to be reinstantiated with each probe against the hash tables. Once the initial probe is done, it is not as important to retain all the evaluated context of that subtree. The most important part of the evaluated context to retain is the _start_, as that allows immediate (and fast) in-memory evaluation.

Each block in the cache should retain the context that it requires to perform further evaluation of the subtree. For example, if one subtree needs to be evaluated more and more to prove or disprove the predicate, its cache entry would need to grow. The results can stay in the cache, and when the next candidate needs to be evaluated against that subtree, more of the subtree may need to be evaluated to prove the predicate. Of course, if the cache needs to reclaim some blocks, they will need to be deleted from the END of each document. That is why the context must be saved, in case the subtree needs to be evaluated further.
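As a minimal sketch of a cache entry along these lines -- with made-up names (CachedBlock, CacheEntry) and the illustrative Context from the pull-model sketch above -- each block carries the context reached at its end, and reclamation trims only from the end of the entry:

    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    // Resume state, as in the pull-model sketch above (illustrative).
    struct Context { std::size_t piece = 0, off = 0; };

    // Each cached block keeps the context reached at its end: what is
    // required to perform further evaluation of the subtree from there.
    struct CachedBlock {
        std::string data;
        Context     end;   // traversal position just after this block
    };

    struct CacheEntry {
        std::vector<CachedBlock> blocks;
        std::size_t bytes = 0;

        void append(std::string blk, Context end) {
            bytes += blk.size();
            blocks.push_back({std::move(blk), end});
        }

        // Where to restart the evaluator to extend the entry (or to
        // re-derive blocks that were trimmed away).
        Context resume_point() const {
            return blocks.empty() ? Context{} : blocks.back().end;
        }

        // Reclamation deletes blocks from the END of the entry only; the
        // start stays resident, since it is what makes repeated probes
        // immediate in-memory evaluations.
        void trim(std::size_t target_bytes) {
            while (bytes > target_bytes && !blocks.empty()) {
                bytes -= blocks.back().data.size();
                blocks.pop_back();
            }
        }
    };

    int main() {
        CacheEntry e;
        e.append("<name>", {0, 6});
        e.append("Acme C", {1, 6});
        e.append("orp</n", {2, 4});
        e.trim(12);  // evict from the end only
        std::cout << e.blocks.size() << " blocks kept, resume at piece "
                  << e.resume_point().piece << '\n';
    }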