Incremental Subtree Evaluator

Problem

XML query processing often implies that a string comparison will need to be made on data. Often this data is quite large, encompassing an entire subtree of a document. There are other cases where the data can also be large, for example a large amount of text data between two tags. When comparing such data, Niagara would make complete in-memory copies of the data to be compared. In the case of a join, both arguments would need to be built in memory. After the string(s) were constructed in memory, they would be compared as strings to determine the value of the predicate.

There are problems with this approach. The most obvious is that it requires one or more potentially quite large (an entire document, anyone?) data sources to be traversed on disk and built into memory. This requires a lot of resources in the form of memory use, disk I/O, and CPU time. It doesn't scale at all with document size.

This approach also forces the system to do far more work than is needed in many cases. There are always some cases where the entire string representation must be examined to determine the truth of the predicate. However, there are many cases where the predicate can be proved false or true well before the end of the string. In those cases the system has done a large amount of work to fully instantiate these in-memory strings, only to throw them away mostly unused.

Solution

The solution to this problem is the Incremental Subtree Evaluator. The incremental evaluator may still need to make in-memory copies of the data. However, the evaluator incrementally instantiates blocks of each document subtree and then compares these blocks with the appropriate predicate. If the predicate can be proved with the current block of data, no more data need be brought into memory, and the value of the predicate can be returned.

This works well with subtrees, which need to be appropriately formatted from the XML database. It also works for large elements, which do not need to be completely instantiated to be compared. Indeed, with large elements it is possible that they can be compared directly on the pinned pages in which they reside, and need not be copied into non-buffer-pool memory at all.
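As a rough illustration of this block-at-a-time approach, the following C++ sketch compares two subtrees for equality, instantiating only as many blocks as the predicate needs before terminating. SubtreeWalker, its nextBlock method, and the fixed block size are hypothetical stand-ins for the evaluator's interface to the data manager, not Niagara's actual API.

    // Minimal sketch of incremental comparison with early termination.
    // SubtreeWalker is a hypothetical stand-in for whatever walks an XNode's
    // subtree and serializes it one block at a time.
    #include <algorithm>
    #include <cstddef>
    #include <cstring>
    #include <string>
    #include <utility>

    class SubtreeWalker {
    public:
        SubtreeWalker(std::string serialized, std::size_t blockSize)
            : data_(std::move(serialized)), blockSize_(blockSize) {}

        // Return the next block of the string representation, or an empty
        // string once the subtree is exhausted.
        std::string nextBlock() {
            if (pos_ >= data_.size()) return {};
            std::size_t n = std::min(blockSize_, data_.size() - pos_);
            std::string block = data_.substr(pos_, n);
            pos_ += n;
            return block;
        }

    private:
        std::string data_;
        std::size_t blockSize_;
        std::size_t pos_ = 0;
    };

    // Equality predicate over two incrementally instantiated subtrees.
    bool incrementalEquals(SubtreeWalker& left, SubtreeWalker& right) {
        std::string lbuf, rbuf;
        for (;;) {
            // Refill whichever side has been fully consumed.
            if (lbuf.empty()) lbuf = left.nextBlock();
            if (rbuf.empty()) rbuf = right.nextBlock();

            if (lbuf.empty() && rbuf.empty()) return true;   // matched to the end
            if (lbuf.empty() || rbuf.empty()) return false;  // lengths differ

            std::size_t n = std::min(lbuf.size(), rbuf.size());
            // Early termination: a mismatch decides the predicate without
            // instantiating the rest of either subtree.
            if (std::memcmp(lbuf.data(), rbuf.data(), n) != 0) return false;
            lbuf.erase(0, n);
            rbuf.erase(0, n);
        }
    }

For an ordering predicate such as less-than, the memcmp result would be returned as soon as it is non-zero; the shape of the loop stays the same.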
Cache

There is another issue involved with subtree evaluation, and that is the issue of caching. It is expensive to evaluate a subtree in order to compare it. Once that work has been done it would be good to cache the result, if at all possible, so that the evaluation need not be performed again. Instead of throwing away all of the data which we have read from disk and constructed into a string, we can cache the result for further comparisons. This is more of an issue with XML that must be reconstructed than with linear data -- indeed, large chunks of data which need not be parsed are better left on disk, letting the buffer cache do its best. XNodes can be used as cache keys, since a string value will always be represented by an XNode (with the attribute caveat noted below).

There is also the question of what to do when a subtree of a document is requested. If that data is a subtree of a cached subtree, can the already cached data be used?

With early termination of predicates, individual large strings may grow incrementally across different comparisons of cached data. When evaluating a string that is cached, the cached data may be used until its end is reached. At that point the evaluator will need to generate additional string data for the values and append it to the cached value. Context will be needed to pick up the evaluation job from where it previously left off.

Another concern is when linear data is encountered. If it is small there is no harm in caching it. However, larger linear data should be left in place, and the corresponding chunk of the cached value should refer to the on-disk value. As mentioned earlier, there is a cost to reading from the DM to build the string representation.

The amount of memory available for the evaluator cache is limited. It may be advantageous to write the cached version of a string out to a file in the db when the cache grows too large, rather than discarding the cached data. Reading strings from a file for comparison can be less costly than re-instantiating the string representation of the subtree.

Much of this caching borders on research, and it is a good candidate to be left for the future. At the same time, these concerns are presented now so that they may be taken into consideration when the non-caching incremental evaluator is written. With the concepts in place from the start, the evaluator will be reusable, instead of a hindrance, when caching is added to the system to be researched.

Implementation Concerns

Strings that are evaluated will always be subtrees of a document, and will always be addressed via an XNode. That is true except for the attribute problem, which is outside the scope of this work. Briefly, individual attributes don't have an XNode; they are part of the element which contains them. We are positing for now that attributes are smaller chunks of data and that this will not be an issue. If they are large, this issue will need to be re-evaluated. Perhaps adding an ANode (attribute node) type to the system would be a good solution, as the ANode (an extended XNode) could provide the ordinal of the attribute in its containing XNode.

Although it is labelled as a string comparator, the incremental evaluator should be able to do anything that would benefit from its existence. That means the string comparison portion of this system should be a module that is interchangeable at runtime. For example, I might want to generate a hash value on the text of a subtree; all the work of walking and instantiating the subtree is done by the evaluator, and the user provides a hash generator which can be plugged into the evaluator.

The evaluator should be configurable -- it should provide mechanism, not policy. The size of the blocks it constructs should be configurable on a per-evaluation basis. There should be a mechanism to allow early return by the "predicate", depending upon what data it finds at runtime. The predicate should be offered either a character-oriented or a block-oriented data interface, depending upon what it requires.

The evaluator should be able to provide some measure of statistics with each evaluation, and also across all evaluations. For example, XNodes traversed, bytes instantiated, blocks instantiated, and the number and total size of blocks evaluated directly out of the buffer pool are some things that might be useful. It would be best if the evaluator allowed statistics to be gathered and manipulated by a controlling entity, rather than being accumulated by the package itself. This allows the statistics to be packaged at different granularity levels, which the evaluator need know nothing about.
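The following sketch shows one way the pluggable module, per-evaluation configuration, and caller-owned statistics could fit together. The names (BlockConsumer, EvalStats, EvalConfig, the readBlock callback) are illustrative assumptions rather than an existing Niagara interface, and only the block-oriented variant of the predicate interface is shown.

    // Sketch of a block-oriented, pluggable evaluation interface with
    // statistics owned by the controlling entity.  All names are hypothetical.
    #include <cstddef>
    #include <functional>
    #include <vector>

    // Result of feeding one block to the plugged-in module.
    enum class EvalState { NeedMore, True, False };

    // Any module (string comparator, hash generator, ...) that wants the
    // evaluator to walk and instantiate a subtree implements this interface.
    class BlockConsumer {
    public:
        virtual ~BlockConsumer() = default;
        // Called once per instantiated block; returning True or False lets
        // the evaluator stop early.
        virtual EvalState consume(const char* data, std::size_t len) = 0;
        // Called when the subtree is exhausted without an early decision.
        virtual bool finish() = 0;
    };

    // Per-evaluation statistics, filled in by the evaluator but owned and
    // aggregated by the controlling entity at whatever granularity it likes.
    struct EvalStats {
        std::size_t xnodesTraversed = 0;      // filled by the subtree walk (elided)
        std::size_t bytesInstantiated = 0;
        std::size_t blocksInstantiated = 0;
        std::size_t blocksFromBufferPool = 0; // filled by the subtree walk (elided)
    };

    struct EvalConfig {
        std::size_t blockSize = 4096;  // configurable per evaluation
    };

    // The evaluator walks the subtree rooted at an XNode (represented here by
    // the readBlock callback), builds blocks of at most config.blockSize
    // bytes, and pushes them to the consumer.
    bool evaluateSubtree(const std::function<std::size_t(char*, std::size_t)>& readBlock,
                         BlockConsumer& consumer,
                         const EvalConfig& config,
                         EvalStats& stats) {
        std::vector<char> buf(config.blockSize);
        for (;;) {
            std::size_t n = readBlock(buf.data(), buf.size());
            if (n == 0) return consumer.finish();
            ++stats.blocksInstantiated;
            stats.bytesInstantiated += n;
            EvalState s = consumer.consume(buf.data(), n);
            if (s == EvalState::True) return true;
            if (s == EvalState::False) return false;
        }
    }

A string comparator would typically return True or False from consume() as soon as a mismatch or an exhausted argument decides the predicate, while a hash generator would return NeedMore from every consume() call and produce its digest in finish(); in both cases the walking, block construction, and statistics stay in the evaluator.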