Temporary XML representation and storage in the
new Niagara system.
Problems
Currently there is one non-scalable behavior in the Niagara
Data Manager: it requires a document to be completely parsed
into memory before it is stored on disk. The problem is, of
course, that the document may not fit in the memory
resources available.
Another issue is that Niagara should be able to query
documents that are not completely parsed. This has two
implications: it reduces the latency of query execution, and
it allows partial results to be produced. The flip side is
that the entire query execution may have to be thrown away
if the document does not parse correctly. However, that is
the nature of the beast with XML; you can't determine that a
query is correct until the document parses successfully.
Niagara also has no good way of producing query results.
Currently results are created as an in-memory DOM tree.
However, there is no way to store the resulting query data
to disk in the IM and the DM, so query results are limited
by available memory.
Yet another issue is that Niagara may want to parse and
manipulate documents that are not stored in the Data
Manager. Indeed, we may want to run Niagara without any
Data Manager at all! This means that we will need some
mechanism for referring to parsed documents that exist only
in memory. An example of this case is running Niagara as an
XML grep tool on random XML files. However, if we do have a
DM, perhaps we do want to store the document for future
access, so the parse cost need only be paid once.
Niagara should also be capable of querying streams of XML
data. However, a stream has no end at which parsing can
stop so that the data can be used. The stream may need to
be stored, or thrown away after the query is executed. I
mention this last because, once the above concerns are
addressed, streams fall out as a particular instance of the
same problems.
Current Implementation
The Data Manager is given the URL of a document to load. It
locks the URL so that only one thread will parse the
document. It interacts with the Index Manager to determine
the document ID of the document. The DM then starts its own
parser and parses the document into memory, into a set of
per-document tables called maps. One set of maps exists per
parsed document; they are not shared, as the on-disk
representation is. Once parsing is complete the content of
the in-memory maps is appended to the on-disk BTree which
holds the Data Manager database. As the keys are (doc-id,
element-id) pairs, there is no interaction between multiple
documents in the on-disk BTree map, except for increasing
the size of the database. The in-memory maps for that
document are then deleted, and the document is unlocked. At
this point the thread that loaded the document, as well as
any other threads requesting the document, can access it.
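The load sequence above can be sketched as follows. This is
a minimal illustration, not the actual Niagara code; all
names (load_document, assign_doc_id, and so on) are
hypothetical stand-ins for the real components.

```python
import threading

# Hypothetical stand-ins: one lock per URL so that only one
# thread parses a given document.
url_locks = {}
url_locks_guard = threading.Lock()

def load_document(url, assign_doc_id, parse_to_maps, btree_append):
    with url_locks_guard:
        lock = url_locks.setdefault(url, threading.Lock())
    with lock:                             # one parser per URL
        doc_id = assign_doc_id(url)        # via the Index Manager
        maps = parse_to_maps(url, doc_id)  # per-document in-memory maps
        btree_append(maps)                 # append to the on-disk BTree
        # The in-memory maps are deleted here; once the URL is
        # unlocked, any thread can access the on-disk document.
    return doc_id
```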
An XNode is the in-memory class that contains the handle to
an element of a document. Conceptually it is the (doc-id,
element-id) pair mentioned above. When an element of an XML
document is accessed, the referenced node is found by
examining the on-disk BTree maps, and the desired properties
are extracted for in-memory use.
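The handle-and-lookup relationship can be sketched like so
(the names and the dict standing in for the BTree are
illustrative assumptions, not the real implementation):

```python
from typing import NamedTuple, Optional

class XNodeHandle(NamedTuple):
    """Conceptual XNode handle: the (doc-id, element-id) pair."""
    doc_id: int
    element_id: int

# Stand-in for the on-disk BTree. Because doc_id is part of every
# key, entries from different documents never interact.
btree = {
    XNodeHandle(7, 0): {"tag": "book"},
    XNodeHandle(7, 1): {"tag": "title", "payload": "Niagara"},
}

def properties(handle: XNodeHandle) -> Optional[dict]:
    # Examine the BTree map and extract the element's properties.
    return btree.get(handle)
```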
Note that if a document is required by two queries, the
2nd query to request a particular document will pause until
the document is loaded by the first query.
Solution
The proposed solution to these problems is to introduce the
concept of temporary XNodes into the Data Manager. Users of
XNodes will not see any difference between these temporary
XNodes and normal on-disk XNodes; everything will be an
XNode.
When a document is parsed to be stored in the DM, it will be
parsed into temporary XNodes, which are then flushed to
disk. If a result is generated, it will be generated as
temporary XNodes, which may then be discarded or flushed to
disk as desired. If a document is parsed only for the
duration of a query, it will be parsed into temporary XNodes
which will be removed when the query execution is complete.
Note, however, that temporary XNodes may be flushed to disk
as needed to control the memory resources that Niagara uses.
Since temporary and on-disk XNodes share the same interface,
there will be no user-visible change at this transition.
Things will just work, and performance will degrade
gracefully instead of hitting a stone wall.
Proposed Implementation
Temporary XNodes will be implemented as an in-memory Object
Cache addressed by the (document-id, element-id) pair.
Whenever an XNode is accessed, a lookup will be done in the
Object Cache to determine whether the XNode in question is
stored there or on disk.
As a performance optimization, pre-filtering on document-id
may be done to reduce the overhead of determining whether an
XNode is in the cache. In that case, a document would be
specified as being either all on disk, all in memory, or
both in memory and on disk. As everything is done on a
per-document basis, this seems a natural way of handling the
issue.
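A sketch of the lookup path with the per-document
pre-filter. The state names and function signature are
assumptions made for illustration:

```python
from enum import Enum, auto

class DocState(Enum):
    ALL_ON_DISK = auto()
    ALL_IN_MEMORY = auto()
    BOTH = auto()

def fetch_xnode(doc_id, element_id, doc_state, cache, disk):
    key = (doc_id, element_id)
    state = doc_state.get(doc_id, DocState.ALL_ON_DISK)
    if state is DocState.ALL_ON_DISK:
        return disk.get(key)        # skip the cache probe entirely
    node = cache.get(key)           # temporary XNodes live here
    if node is None and state is DocState.BOTH:
        node = disk.get(key)        # partially flushed document
    return node
```

The pre-filter pays off for fully on-disk documents, which
never touch the Object Cache at all.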
When a document is parsed there is, essentially, a stack of
elements which are incomplete and will remain so until their
end tags are parsed successfully. Fortunately, this limits
the in-cache footprint of a document being parsed. Flushing
incomplete nodes will cause a performance degradation, as
the on-disk node will need to be updated. When an end tag
is parsed, that element is completed, and I/O can be
scheduled for the completed element and all elements nested
within it. Note that even if elements of a document are
forced to disk, performance will still be acceptable; large
documents will force everything to become I/O bound at some
point, and smaller documents will not have that problem.
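The stack behavior can be illustrated with a toy
event-driven parse. The event format and flush callback are
hypothetical, chosen only to show when an element becomes
eligible for I/O:

```python
def parse(events, flush):
    """events: a sequence of ("start", tag) / ("end", tag) pairs.
    flush(elem) is called as soon as an element's end tag is
    seen, i.e. when it becomes eligible for I/O scheduling."""
    stack = []   # open elements: the only ones pinned in memory
    for kind, tag in events:
        if kind == "start":
            stack.append({"tag": tag, "children": []})
        else:
            done = stack.pop()     # end tag completes the element
            if stack:
                stack[-1]["children"].append(done["tag"])
            flush(done)            # schedule I/O for it now
    # At any moment, only the elements on `stack` are incomplete.
```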
Other Implementation Details
One thing the above section says nothing about is the
problem of querying partially parsed documents. To a
certain extent, that is orthogonal to the problems above.
Implementing the above allows parsing of large results,
storing results, and handling ad-hoc documents. Indeed,
querying partially parsed documents may not be strictly
necessary, as the tuple-based execution scheme may meet most
needs of query execution, relying only upon the complete
document to produce results.
To handle querying of, and references to, incomplete
documents, a flag needs to be added to the XNode, or to some
lookup table for incomplete documents. The flag would
indicate whether an XNode is complete, and perhaps some
state on how complete it is. The states of completion are:
the element itself is being parsed; the element's start tag
is complete and some subset of its descendants has been
parsed; and the element's end tag has been found. A simple
implementation would block execution of a thread trying to
use an incomplete XNode. Once the XNode is complete, the
blocked thread could continue and would be allowed to access
the content of that node. However, this simple case may not
be useful enough, since access to the root itself would be
blocked until the document is complete. As mentioned above,
though, it may work well with the tuple-based execution
scheme, which will only refer to the content of a node if it
is requested. This mode is what would be used for
payload-only elements, unless some sort of locking is done
as the payload is streamed into the individual node, which
requires far more complexity.
With additional sophistication, one can refer to an
incomplete node. Once a descendant node has been parsed,
the scanner thread can descend into that node and examine
it. In that case, as a thread tries to access subsequent
descendants of a node, it keeps doing so and descends the
document tree. The blocking action would happen in a
fashion that snakes along, resembling the order in which the
document is parsed.
We discussed system-wide versus per-session object caches.
The advantage of the per-session object cache is that there
is no need for locking. The advantage of the system-wide
cache is that it permits sharing. There is nothing in this
scheme which prevents either or both kinds of caches from
existing at the same time. The issue with the system-wide
cache is that database locking would need to take effect and
we would have to be concerned with multiple accessors.
However, even in a single query we may have some of those
problems, for example multiple SEQL operators generating
references to parts of the same document in different parts
of the execution tree.
Note that references to documents and sub-documents are
tracked on a per-document basis, and this scheme doesn't try
to address garbage collection of unused subtrees of
documents. Once a temporary document is no longer
referenced by a query, it may be deleted from the object
cache. A different strategy could be used, but this is good
enough for now, and the users of the system will not see any
difference in use, just in performance. Something will,
however, need to create the appropriate references for
garbage collection.
In combination with the above discussion of accessing
partially parsed documents comes the capability for multiple
queries to share the common parse of a document -- through
the exact same mechanism. However, until further work is
done, this will only work while the head of the conceptual
document stream is still located in the object cache. So,
for example, two queries accessing the same document that
are started close together in time may both be able to use
the single parse of the document in low-latency mode to
process their queries. Indeed, the cache manager could be
notified of this situation and do its best to keep those
resources available. However, once data leaves the cache
and is written to disk, other queries wanting to access that
document must block until the transaction containing the
load has completed. This is due to database locks which
will not allow access to the on-disk data. This is not seen
as a real problem at present, since the current scheme fixes
many existing problems and provides for some possibility of
simultaneous query execution on a document, which the
current architecture does not. Work can be done to
investigate isolating document loads into their own
transactions, as we have talked about for the existing DM.
It may be possible to access such a document in the context
of the load transaction, which can access the on-disk data,
to keep things flowing.
One element that will be needed is an efficient way of
determining whether a node is parsed or not. This method
must avoid race conditions, have minimal storage impact, and
be space and time efficient. Fortunately, such a problem
can be solved in the context of Niagara. It is based upon
the element IDs in a document which is being parsed. As a
document is parsed from a linear form into a tree-like
representation, nodes are parsed to completion from the
bottom up. As such, there will always be a range or ranges
of element IDs which have been completely parsed, starting
at the shallowest leftmost leaf. As parsing continues, any
time the head of the parse completes an element, an
additional range may be added to the list of parsed ranges.
At the same time this range can be trivially coalesced with
adjacent ranges. When a thread desires to access a node of
an incomplete document, the list of completed ranges must be
checked for a match. If no match is found, the thread
blocks, awaiting completion of that element. Whenever a
completion occurs, the list of waiters can be examined, and
any satisfied waiters released. There will typically be few
waiters for incomplete nodes of a document, on the order of
the number of threads trying to access the document. Race
conditions will not be a problem due to the range-based
nature of this locking, as any missed waiters will be
trivially observed courtesy of range completion. With exact
matches, by contrast, a missed rendezvous would be possible.
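A sketch of the completed-range bookkeeping, single-threaded
for clarity (the real structure would sit behind the object
cache's latch, and the class and method names here are
hypothetical):

```python
import bisect

class CompletedRanges:
    """Tracks completely parsed element-ID ranges, coalescing
    adjacent ranges and releasing waiters as IDs complete."""

    def __init__(self):
        self.ranges = []    # sorted, disjoint (lo, hi) tuples
        self.waiters = []   # (element_id, wakeup callback) pairs

    def contains(self, eid):
        i = bisect.bisect_right(self.ranges, (eid, float("inf"))) - 1
        return i >= 0 and self.ranges[i][0] <= eid <= self.ranges[i][1]

    def complete(self, lo, hi):
        # Insert the newly completed range, coalescing with any
        # adjacent or overlapping ranges.
        merged, keep = (lo, hi), []
        for r in self.ranges:
            if r[1] + 1 < merged[0] or merged[1] + 1 < r[0]:
                keep.append(r)
            else:
                merged = (min(merged[0], r[0]), max(merged[1], r[1]))
        keep.append(merged)
        keep.sort()
        self.ranges = keep
        # Release waiters now covered by a range. A waiter cannot
        # be missed: membership is re-checked against the coalesced
        # ranges on every completion.
        still = []
        for eid, wakeup in self.waiters:
            if self.contains(eid):
                wakeup()
            else:
                still.append((eid, wakeup))
        self.waiters = still

    def wait_for(self, eid, wakeup):
        if self.contains(eid):
            wakeup()                       # already parsed
        else:
            self.waiters.append((eid, wakeup))
```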
Other methods of synchronization, such as an incomplete bit
and a wait list, may also be used. The wait list need only
be examined if it is non-empty. Due to the frequency of
parse completions and the low number of waiters, an exact
solution is not necessary. Atomic memory updates could be
used to advantage, through other range-based techniques, to
minimize the amount of locking and work required to check
the waiters list. You can spend more resources for a more
exact result (a node available just after it is parsed), or
fewer for a correct but slower-to-be-notified result.
As mentioned in the implementation section, cache management
is an important part of this proposal. It seems that even
individual simplistic strategies can do a good job, better
than what we are doing now. A mixture of simple strategies
may not take a lot of time to implement, may improve
performance, and still allows flexibility.
Requests on Partial XNodes
As mentioned earlier, the simplest strategy to use when
requesting information from incomplete XNodes is to block
until the XNode is complete. If the requested information
is not yet complete, that wait must occur. However, there
are certain types of requests which can be satisfied
correctly with an incomplete XNode. This requires more
complexity.
An element consists of a start tag with attributes, some
inline payload, a set of children (which can also include
payload), and an end tag. As one example of a partial
query, once the start tag is completely parsed, any requests
for attributes of that node may be completely satisfied,
even though the node itself isn't complete until the end tag
is parsed. Instead of having a single incomplete flag for
an XNode, we can introduce two incomplete flags to record
that state: one flag indicates that parsing of the start tag
of the XNode is incomplete; the other indicates that parsing
of the element as a whole (through its end tag) is
incomplete. If either flag is set the node as a whole is
incomplete, but once both are cleared the node is complete
and can satisfy any request.
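The two-flag scheme might look like this (a sketch with
hypothetical field and method names):

```python
from dataclasses import dataclass, field

@dataclass
class PartialXNode:
    start_tag_incomplete: bool = True   # attributes may still arrive
    element_incomplete: bool = True     # end tag not yet parsed
    attributes: dict = field(default_factory=dict)

    def complete(self) -> bool:
        # The node is complete only when both flags are cleared.
        return not (self.start_tag_incomplete or self.element_incomplete)

    def can_answer_attributes(self) -> bool:
        # Attribute requests need only the start tag to be parsed,
        # even if the element as a whole is still incomplete.
        return not self.start_tag_incomplete
```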
It is possible to allow finer-grained access to partial
XNodes. However, such access comes at a cost -- the need
for more complex synchronization on the XNode. This is not
a big problem; I just mention it up front to make certain
that it is understood. You gain the capability to query
incomplete nodes, but it does have a cost associated with
it. Note that I believe the best way to take care of this
synchronization is to ensure that temporary XNodes which are
incomplete remain resident in memory while a request is
pending. This allows the synchronization mechanism to be
part of the object cache, reduces the overhead of stored
data, and is at least a starting point. Note that database
locks are too coarse and not applicable for the kind of
locking we need to implement this.
The essence of further access to partial XNodes is that we
must classify each information request as to its nature.
Some requests can be satisfied from an incomplete element
without violating document properties. Consider access to
named attributes or children which are themselves complete:
the information is either present and immediately
returnable, or, if the name is not present (or is itself
incomplete), the query must block until it is. For example:
     A request to fetch a named attribute, such as @id,
     could fetch the id attribute if it exists and is
     complete. Similarly, a request for a named element,
     such as /foo, can be satisfied if the element is found
     and it is complete.
Other examples of this case are elements referenced by
position; either they are there and can be returned, or they
are not and you need to wait until the node is complete (or
complete enough) to return that information.
Other requests, which ask for arbitrary information, or for
information which cannot be known until the element is
complete, must block until it is. A few examples:
     The number of attributes is not known until the element
     start tag is parsed.
     The number of children is not known until the end tag
     is parsed.
     Requests to iterate through attributes or children
     cannot complete until all of the same are complete.
With these types of requests, the request must block until
the element, or the relevant portion of the element, is
complete.
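The classification might look like this in outline. The
helpers and the BLOCK sentinel are assumptions made for
illustration; each returns either the answer or an
indication that the caller must wait:

```python
BLOCK = object()   # sentinel: caller must wait for more parsing

def named_attribute(node, name):
    # Answerable early: if present, return it; if the start tag
    # is done, the attribute is definitively absent.
    if name in node["attributes"]:
        return node["attributes"][name]
    return None if node["start_tag_done"] else BLOCK

def attribute_count(node):
    # Not known until the whole start tag has been parsed.
    return len(node["attributes"]) if node["start_tag_done"] else BLOCK

def child_count(node):
    # Not known until the end tag has been parsed.
    return len(node["children"]) if node["end_tag_done"] else BLOCK
```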
Other Solution(s)
One technique discussed was to use a DOM parser and
representation for ad-hoc parsing. This would work only for
streams and ad-hoc document queries; it does nothing for the
parsing problem or for generating query results. The base
problem with this technique is that we would need two
in-memory representations of XML data: XNodes and DOM nodes.
There would be more components to maintain -- multiple
parsers, multiple in-memory representations, and other
duplicated effort. In addition, DOM parsers parse to
completion before returning the resulting document, which
precludes any processing while parsing. Another shortcoming
is that the data set must fit into memory, and there is no
way to "spill" it to disk if the footprint becomes
unexpectedly large.