Incremental Subtree Evaluator
Problem
XML query processing often requires a string comparison to be
made on data. Often this data is quite large, encompassing an
entire subtree of a document. The data can also be large in other
cases, for example a large run of text data between two tags.
When comparing such data, Niagara would make complete in-memory
copies of the data to be compared. In the case of a join, both
arguments would need to be built in memory. Once the string(s)
were constructed in memory, they would be compared as strings to
determine the value of the predicate.
There are problems with this approach. The most obvious is that
it requires one or more potentially quite large (an entire
document, anyone?) data sources to be traversed on
disk and built into memory. This requires a lot of
resources in the form of memory use, disk I/O and CPU time.
It doesn't scale at all with document size. This approach
also forces the system to do far more work than is needed in
many cases. There are always some cases where the entire
string representation will need to be examined to determine
the truth of the predicate. However, there are many cases
where the predicate can be proved false or true much earlier
than the end of the string. In these cases the system has
done a large amount of work to fully instantiate these in-
memory strings, only to throw them away mostly unused.
Solution
The solution to this problem is the Incremental Subtree
Evaluator. The incremental evaluator may still need to make
in-memory copies of the data. However, the evaluator will
incrementally instantiate blocks of each document subtree
and then compare these blocks with the appropriate predi-
cate. If the predicate can be proved with the current block
of data, no more data need be brought into memory, and the
value of the predicate can be returned.
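The core loop might look something like the following sketch (in
C++). The names here -- SubtreeSource, Predicate, Outcome -- are
hypothetical stand-ins for whatever the real Niagara interfaces
turn out to be; the point is only to show blocks being
instantiated one at a time and the loop returning as soon as the
predicate is decided.

    #include <cstddef>

    // All names below are hypothetical; they stand in for the
    // real Niagara interfaces.

    enum class Outcome { True, False, NeedMoreData };

    // Produces the next block of the string representation of a
    // subtree.  Returns the number of bytes written, or 0 once
    // the subtree is exhausted.
    struct SubtreeSource {
        virtual std::size_t nextBlock(char* buf, std::size_t bufSize) = 0;
        virtual ~SubtreeSource() {}
    };

    // Consumes blocks and decides the predicate as early as it can.
    struct Predicate {
        virtual Outcome consume(const char* block, std::size_t len) = 0;
        virtual bool finish() = 0;   // no more data; commit to an answer
        virtual ~Predicate() {}
    };

    // Drive the source one block at a time, stopping as soon as
    // the predicate can be decided, so unused data is never
    // instantiated.
    bool evaluate(SubtreeSource& src, Predicate& pred, std::size_t blockSize)
    {
        char* block = new char[blockSize];
        bool result = false;
        for (;;) {
            std::size_t got = src.nextBlock(block, blockSize);
            if (got == 0) {                        // subtree exhausted
                result = pred.finish();
                break;
            }
            Outcome o = pred.consume(block, got);
            if (o != Outcome::NeedMoreData) {      // early termination
                result = (o == Outcome::True);
                break;
            }
        }
        delete [] block;
        return result;
    }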
This works well with subtrees, which need to be appro-
priately formatted from the XML database. It also works for
large elements, which do not need to be completely instanti-
ated to be compared. Indeed, with large elements it is pos-
sible that they can be compared directly on the pinned pages
which they reside in, and need not be copied into non-
buffer-pool memory at all.
Cache
There is another issue involved with Subtree Evalua-
tion, and that is the issue of cacheing. It is expensive to
evaluate a subtree to compare it. Once that work is done it
would be good to cache it, if at all possible, so that the
evaluation need not be performed again. Instead of throwing
away all of the data which we have read from disk and con-
structed into a string, we can cache the result for further
comparison. This is more of an issue with XML that must be
reconstructed than with linear data; indeed, large chunks of data
which need not be parsed are better left on disk, letting the
buffer cache do its best.
XNodes can be used as cache keys, as a string value will always
be represented by an XNode (but see the note on attributes under
Implementation Concerns).
There is also the question about what to do when a sub-
tree of a document is requested. If that data is a subtree
of a cached subtree, can the already cached data be used?
With early termination of predicates, individual large cached
strings may grow incrementally as different comparisons are made
against them. When evaluating a string that is cached, the cached
data may be used until its end is reached. At that point the
evaluator will need to generate additional string data and append
it to the cached value. Context will be needed to pick up the
evaluation job from where it left off previously. Another concern
is linear data. If it is small there is no harm in caching it.
However, larger linear data should be left in place, and the
corresponding chunk of the cached value should refer to the
on-disk value.
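One way the cache entries might be shaped is sketched below.
Every name here (XNodeId, DiskRef, ResumeContext, and so on) is
hypothetical; the sketch only illustrates that an entry is keyed
by XNode, may hold just a prefix of the value together with
resume context, and can point at on-disk linear data rather than
copying it.

    #include <cstddef>
    #include <map>
    #include <string>
    #include <vector>

    typedef unsigned long XNodeId;   // stand-in for however an XNode is named

    struct DiskRef {                 // placeholder for a DM page/offset reference
        unsigned long pageId;
        std::size_t   offset;
        std::size_t   length;
    };

    struct CacheChunk {
        bool        inMemory;        // true: use 'bytes'; false: use 'onDisk'
        std::string bytes;
        DiskRef     onDisk;
    };

    struct ResumeContext {           // where the subtree walk left off
        XNodeId     nextXNode;
        std::size_t offsetInNode;
    };

    struct SubtreeCacheEntry {
        std::vector<CacheChunk> chunks;   // prefix of the string value so far
        bool          complete;           // whole subtree has been generated
        ResumeContext resume;             // valid only when !complete
    };

    typedef std::map<XNodeId, SubtreeCacheEntry> SubtreeCache;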
As mentioned earlier, there is a cost to reading from
the DM to build the string representation. The amount of
memory that is available for the evaluator cache is limited.
It may be advantageous to write the cached versions of a
string out to a file in the db when the cache grows too
large, rather than discarding the cached data. Reading
strings from a file for comparison can be less costly than
re-instantiating the string representation of the subtree.
Much of this cacheing borders on research, and is a
good candidate to be left for the future. At the same time,
however, these concerns are presented now so that they may be
taken into consideration when the non-cacheing incremental
evaluator is written. With the concepts in place from the start,
the evaluator will be reusable, instead of a hindrance, when
cacheing is later added to the system for research.
Implementation Concerns
Strings that are evaluated will always be subtrees of a
document, and will always be addressed via an XNode.
Note that this is true except for the attribute problem,
which is outside the scope of this work. Briefly,
individual attributes don't have an XNode; they are part
of the element which contains them. We are positing for
now that attributes are smaller chunks of data and that
this will not be an issue. If they are large, this issue
will need to be re-evaluated. Perhaps adding an ANode
(attribute node) type to the system would be a good
solution, as the ANode (an extended XNode) could provide
the ordinal of the attribute in its containing XNode.
Although it is labelled as a string comparator, the
incremental evaluator should be able to do anything that
would benefit from its existence. That means that the
string comparison portion of this system should be an inter-
changeable module at runtime. For example, I might want to
generate a hash value on the text of a subtree; all the work
of walking and instantiating the subtree is done by the
evaluator. The user then needs to provide a hash generator
which can be plugged into the evaluator.
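For example, the pluggable piece might be a small block-consumer
interface, with a string predicate and a hash generator as two
interchangeable implementations. This generalizes the Predicate
sketch above and is again only a sketch under assumed names
(BlockConsumer, EqualsPredicate, Hasher); nothing here is
existing Niagara code.

    #include <cstddef>
    #include <cstring>
    #include <string>

    // The evaluator walks and instantiates the subtree; the
    // consumer decides what to do with the bytes.
    struct BlockConsumer {
        // Return true to keep receiving blocks, false to stop early.
        virtual bool consume(const char* block, std::size_t len) = 0;
        virtual ~BlockConsumer() {}
    };

    // Equality against a target string; stops as soon as a
    // mismatch or an overrun is seen.
    class EqualsPredicate : public BlockConsumer {
        std::string target;
        std::size_t pos;
        bool        failed;
    public:
        EqualsPredicate(const std::string& t) : target(t), pos(0), failed(false) {}
        bool consume(const char* block, std::size_t len) {
            if (pos + len > target.size() ||
                std::memcmp(target.data() + pos, block, len) != 0) {
                failed = true;
                return false;                 // mismatch proven; stop early
            }
            pos += len;
            return true;
        }
        // Meaningful once the evaluator reports the data exhausted.
        bool matched() const { return !failed && pos == target.size(); }
    };

    // A hash generator plugged into the same slot; it always
    // wants all of the data.
    class Hasher : public BlockConsumer {
        unsigned long h;
    public:
        Hasher() : h(5381) {}
        bool consume(const char* block, std::size_t len) {
            for (std::size_t i = 0; i < len; i++)
                h = h * 33 + (unsigned char)block[i];   // djb2-style hash
            return true;
        }
        unsigned long value() const { return h; }
    };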
The evaluator should be configurable -- it should provide
mechanism, not policy. The size of the blocks that it constructs
should be configurable on a per-evaluation basis. There should be
a mechanism to allow early return by the "predicate", depending
upon what data it finds at runtime, and either a character
oriented or a block oriented data interface to the predicate,
depending upon what the predicate requires.
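Concretely, those per-evaluation knobs might amount to something
like the struct below (hypothetical names; the exact set of
options would follow from whatever the predicate modules need).

    #include <cstddef>

    struct EvalConfig {
        std::size_t blockSize;          // bytes instantiated per step
        bool        characterOriented;  // feed the predicate a character at
                                        // a time instead of a block at a time
        bool        allowEarlyReturn;   // honor early answers from the predicate
    };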
It should be able to provide some measure of statistics with each
evaluation, and also across all evaluations. For example: XNodes
traversed, bytes instantiated, blocks instantiated, and the
number of blocks and total size of data evaluated directly out of
the buffer pool are some of the
things that might be useful. It would be best if the evalu-
ator allowed statistics to be gathered and manipulated by a
controlling entity, rather than being accumulated by the
package itself. This allows for packaging the statistics at
different granularity levels, which the evaluator need know
nothing about.
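A caller-owned statistics record, along the lines of the sketch
below, would fit that model: the evaluator only adds to whatever
record it is handed, and the controlling entity decides whether
that record is per-evaluation, per-query, or global. The field
names are illustrative, not a fixed list.

    #include <cstddef>

    struct EvalStats {
        std::size_t xnodesTraversed;
        std::size_t bytesInstantiated;
        std::size_t blocksInstantiated;
        std::size_t blocksFromBufferPool;   // evaluated directly on pinned pages
        std::size_t bytesFromBufferPool;
    };

    // The evaluator might simply accept an optional record to add to;
    // SubtreeSource and Predicate are the interfaces sketched earlier.
    struct SubtreeSource;
    struct Predicate;
    bool evaluate(SubtreeSource& src, Predicate& pred,
                  std::size_t blockSize, EvalStats* stats /* may be null */);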