Temporary XML representation and storage in the new Niagara system

Problems

Currently there is one non-scalable behavior in the Niagara Data Manager: it requires a document to be completely parsed into memory before it is stored on disk. The problem, of course, is that the document may not fit in the memory resources available.

Another issue is that Niagara should be able to issue queries on documents that are not completely parsed. This has two implications: the first is that it will reduce the latency of query execution and allow for partial results. The flip side is that the entire query execution may be thrown away if the document does not parse correctly. That, however, is the nature of the beast with XML; you cannot determine that a query is correct until the document parses successfully.

Niagara also has no good way of producing query results. Currently results are created as an in-memory DOM tree. However, there is no way to store the resulting query data to disk in the IM and the DM, so query results are limited by available memory.

Yet another issue is that Niagara may want to parse and manipulate documents that are not stored in the Data Manager. Indeed, we may want to run Niagara without any Data Manager at all! This means that we will need some mechanism for referring to documents that exist only as parsed data in memory. An example of this case is running Niagara as an XML grep tool on random XML files. However, if we do have a DM, perhaps we do want to store the document for future access so that the parse cost need be paid only once.

Finally, Niagara should be capable of querying streams of XML data. A stream has no end at which parsing stops and the data becomes usable. The stream may need to be stored, or thrown away after the query is executed. I mention this last because, once the above concerns are stated, streams fall out as a particular instance of the above problems.
Current Implementation

The Data Manager is given the URL of a document to load. It locks the URL so that only one thread will parse the document, and it interacts with the Index Manager to determine the document ID of the document. The DM then starts its own parser and parses the document into memory, into a set of per-document tables called maps. One set of maps exists per parsed document; they are not shared, as the on-disk representation is. Once parsing is complete, the contents of the in-memory maps are appended to the on-disk BTree which holds the Data Manager database. Because the keys are (doc-id, element-id) pairs, there is no interaction between multiple documents in the on-disk BTree map, other than increasing the size of the database. The in-memory maps for that document are then deleted, and the document is unlocked. At this point the thread that loaded the document, as well as any other threads requesting it, can access the document.

An XNode is the in-memory class that contains the handle to an element of a document. Conceptually it is the (doc-id, element-id) pair mentioned above. When an element of an XML document is accessed, the referenced node is found by examining the on-disk BTree maps, and the desired properties are extracted for in-memory use.

Note that if a document is required by two queries, the second query to request a particular document will pause until the document is loaded by the first query.

Solution

The proposed solution to these problems is to introduce the concept of temporary XNodes into the Data Manager. Users of XNodes will not see any difference between these temporary XNodes and normal on-disk XNodes; everything will be an XNode. When a document is parsed to be stored in the DM, it will be parsed into temporary XNodes, which are then flushed to disk. If a result is generated, it will be generated as temporary XNodes, which may then be discarded or flushed to disk as desired.
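As a sketch of what "everything is an XNode" means for a caller, the unified lookup could be structured as below. All class, member, and map names here are hypothetical stand-ins for the real DM interfaces, not the actual implementation:

```python
class XNode:
    """Handle to one element of a document: the (doc-id, element-id) pair."""
    def __init__(self, doc_id, element_id):
        self.doc_id = doc_id
        self.element_id = element_id

class DataManager:
    def __init__(self):
        self.temp_cache = {}  # (doc_id, element_id) -> properties of temporary XNodes
        self.btree = {}       # stand-in for the on-disk BTree keyed the same way

    def properties(self, xnode):
        key = (xnode.doc_id, xnode.element_id)
        # Temporary XNodes live in the in-memory cache until flushed to disk.
        if key in self.temp_cache:
            return self.temp_cache[key]
        # Otherwise fall back to the persistent on-disk representation.
        return self.btree[key]
```

The point of the sketch is that the caller never learns which branch was taken; a temporary node and a flushed node answer the same request the same way.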
If a document is parsed only for the duration of a query, it will be parsed into temporary XNodes which are removed when query execution is complete. Note, however, that temporary XNodes may be flushed to disk as needed to control the memory resources that Niagara uses. Since every node is an XNode, there will be no user-visible change from this transition; things will just work, and performance will degrade gracefully instead of hitting a stone wall.

Proposed Implementation

Temporary XNodes will be implemented as an in-memory Object Cache addressed by the (document-id, element-id) pair. Whenever an XNode is accessed, a lookup will be done in the Object Cache to determine whether the XNode in question is stored there or on disk.

As a performance step, pre-filtering on document-id may be done to reduce the overhead of determining whether an XNode is in the cache. In that case, a document would be specified as being either all on disk, all in memory, or both in memory and on disk. As everything is done on a per-document basis, this seems a natural way of handling the issue.

When a document is parsed there is, essentially, a stack of elements which are incomplete and will remain so until their end tags are parsed successfully. Fortunately, this limits the in-cache footprint of a parsed document. Flushing incomplete nodes will cause a performance degradation, as the on-disk node will need to be updated. When an end tag is parsed, that element is completed, and I/O can be scheduled for the completed element and all elements nested within it. Note that even if elements of a document are forced to disk, performance will still be acceptable; large documents will force everything to become I/O bound at some point. Smaller documents will not have that problem.

Other Implementation Details

One thing the above section says nothing about is the problem of querying partially parsed documents.
To a certain extent, that is an orthogonal problem to some of those above. Implementing the above allows parsing of large results, storing results, and handling ad-hoc documents. Indeed, querying partially parsed documents may not be strictly necessary, as the tuple-based execution scheme may meet most needs of query execution, relying upon the complete document only to produce results.

To handle querying of, and references to, incomplete documents, a flag needs to be added to the XNode, or to some lookup table for incomplete documents. The flag would indicate whether an XNode is complete, and perhaps some state on how complete it is. The states of completion are: the element itself is still being parsed; the element's start tag is complete and some subset of its descendants has been parsed; and the element's end tag has been found.

A simple implementation would block execution of a thread trying to use an incomplete XNode. Once the XNode is complete, the blocked thread could continue and would be allowed to access the content of that node. However, this simple case may not be useful enough, since access to the root itself would be blocked until the entire document is complete. Still, as mentioned above, it may work well with the tuple-based execution scheme, which will only refer to the content of a node if it is requested. This mode is what would be used for payload-only elements, unless some sort of locking is done as the payload is streamed into the individual node, which requires far more complexity.

With additional sophistication, one can refer to an incomplete node. Once a descendant node has been parsed, the scanner thread can descend into that node and examine it. In that case, as a thread tries to access subsequent descendants of a node, it keeps doing so and descends the document tree. The blocking action would snake along, resembling the order in which the document is parsed.

We discussed system-wide versus per-session object caches.
The advantage of the per-session object cache is that there is no need for locking; the advantage of the system-wide cache is that it allows sharing. Nothing in this scheme prevents either, or both, kinds of caches from existing at the same time. The issue with the system-wide cache is that database locking would need to take effect and we would have to be concerned with multiple accessors. However, even within a single query we may have some of those problems, for example multiple SEQL operators generating references to parts of the same document in different parts of the execution tree.

Note that references to documents and sub-documents are tracked on a per-document basis, and this scheme does not try to address garbage collection of unused subtrees of documents. Once a temporary document is no longer referenced by a query, it may be deleted from the object cache. A different strategy could be used, but this is good enough for now, and the users of the system will not see any difference in use, just in performance, although something will need to create the appropriate references for garbage collection.

In combination with the above information on accessing partially parsed documents comes the capability for multiple queries to share the common parse of a document -- through the exact same mechanism. However, until further work is done, this will only work while the head of the conceptual document stream is located entirely in the object cache. So, for example, two queries accessing the same document that are started close together in time may both be able to use the single parse of the document in low-latency mode to process their queries. Indeed, the cache manager could be notified of this situation and do its best to keep those resources available. However, once data leaves the cache and is written to disk, other queries wanting to access that document must block until the transaction containing the load has completed.
This is due to database locks, which will not allow access to the on-disk data. This is not seen as a real problem at present, since the current scheme fixes many existing problems and provides some possibility of simultaneous query execution on a document, which the current architecture does not. Work can be done to investigate isolating document loads into their own transactions, as we have talked about for the existing DM. It may be possible to access such a document in the context of the load transaction, which can access the on-disk data, to keep things flowing.

One element that will be needed is an efficient way of determining whether a node has been parsed. This method must avoid race conditions, have minimal storage impact, and be space and time efficient. Fortunately, such a problem can be solved in the context of Niagara. It is based upon element IDs in a document which is being parsed. As a document is parsed from a linear form into a tree-like representation, nodes are parsed to completion from the bottom up. As such, there will always be a range, or ranges, of element IDs which have been completely parsed, starting at the shallowest leftmost leaf. As parsing continues, any time the head of the parse completes an element, an additional range may be added to the list of parsed ranges. At the same time, this range can be trivially coalesced with adjacent ranges.

When a thread wants to access a node of an incomplete document, the list of completed ranges must be checked for a match. If no match is found, the thread blocks, awaiting completion of that element. Whenever a completion occurs, the list of waiters can be examined and any satisfied waiters released. There will typically be few waiters for the incomplete nodes of a document, on the order of the number of threads trying to access that document.
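The range bookkeeping described above can be sketched as follows. The class and method names are illustrative only, not actual Niagara interfaces, and waiter release is shown with plain callbacks rather than real thread wakeups:

```python
class ParsedRanges:
    """Completed-element-ID ranges for one document being parsed.
    Ranges are inclusive (lo, hi) pairs kept sorted and disjoint."""

    def __init__(self):
        self.ranges = []    # sorted, disjoint, inclusive ranges
        self.waiters = {}   # element_id -> callbacks to release

    def contains(self, eid):
        return any(lo <= eid <= hi for lo, hi in self.ranges)

    def complete(self, lo, hi):
        # Insert the newly completed range and coalesce it with any
        # adjacent or overlapping ranges in a single pass.
        merged = []
        for rlo, rhi in sorted(self.ranges + [(lo, hi)]):
            if merged and rlo <= merged[-1][1] + 1:
                merged[-1] = (merged[-1][0], max(merged[-1][1], rhi))
            else:
                merged.append((rlo, rhi))
        self.ranges = merged
        # Release every waiter whose element ID is now covered.
        for eid in list(self.waiters):
            if self.contains(eid):
                for wake in self.waiters.pop(eid):
                    wake()
```

Because a waiter is checked against whole ranges rather than a single expected ID, a waiter registered just before a completion is still found on the next `complete` call, which is the property the race-condition argument below relies on.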
Race conditions will not be a problem, due to the range-based nature of this locking: any missed waiters can be trivially observed courtesy of the range completion. With exact matches, by contrast, there is a chance of a missed rendezvous. Other methods of synchronization, such as an incomplete bit and a wait list, may also be used; the wait list need only be examined if it is non-empty. Due to the frequency of parse completions and the low number of waiters, an exact solution is not necessary. Atomic memory updates could be used to advantage to minimize the amount of locking and work required to check the waiters list through other range-based techniques. You can spend more resources for a more exact result (the node is available just after it is parsed), or fewer for a correct but more slowly notified result.

As mentioned in the implementation section, cache management is an important part of this proposal. It seems that individual simplistic strategies can do a good job, better than what we are doing now. A mixture of simple strategies may not take a lot of time to implement, yet would improve performance and still allow flexibility.

Requests on Partial XNodes

As mentioned earlier, the simplest strategy to use when requesting information from incomplete XNodes is to block until the XNode is complete. If the requested information is not yet complete, that wait must occur. However, there are certain types of requests which can be satisfied correctly with an incomplete XNode; this requires more complexity.

An element consists of a start tag with attributes, some inline payload, a set of children (which can also include payload), and an end tag. As one example of a partial query, once the start tag is completely parsed, any request for attributes of that node may be completely satisfied, even though the node itself is not complete until the end tag is parsed.
Instead of having a single incomplete flag for an XNode, we can introduce two incomplete flags to record that state: one flag indicates that parsing of the start tag of the XNode is incomplete; the other indicates that parsing of the element as a whole (through its end tag) is incomplete. If either flag is set, the node as a whole is incomplete, but once both are cleared the node is complete and can satisfy any request.

It is possible to allow finer-grained access to partial XNodes. However, such access comes at a cost -- the need for more complex synchronization on the XNode. This is not a big problem; I just mention it up front to make certain that it is understood. You gain the capability to query incomplete nodes, but it has a cost associated with it. Note that I believe the best way to handle this synchronization is to ensure that temporary XNodes which are incomplete must remain resident in memory while a request is pending. This allows the synchronization mechanism to be part of the object cache, reduces the overhead of stored data, and is at least a starting point. Note that database locks are too coarse and are not applicable to the kind of locking we need to implement this.

The essence of further access to partial XNodes is that we must classify each information request as to its nature. Some requests can be satisfied from an incomplete element without violating document properties: for access to named attributes or children which are themselves complete, the information is either present and immediately returnable, or, if the name is not present (or is itself incomplete), the query must block until it is. For example, a request to fetch a named attribute, such as @id, could fetch the id attribute if it exists and is complete. Similarly, a request for a named element, such as /foo, can be satisfied if the element is found and is complete.
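A minimal sketch of the two-flag scheme, with hypothetical names (the real XNode would block rather than return None, and would pack the flags more compactly):

```python
class PartialXNode:
    """Temporary XNode with two completeness flags: one for the
    start tag, one for the whole element (end tag parsed)."""

    def __init__(self):
        self.start_tag_incomplete = True
        self.element_incomplete = True
        self.attributes = {}

    @property
    def complete(self):
        # The node can satisfy any request only when both flags clear.
        return not (self.start_tag_incomplete or self.element_incomplete)

    def get_attribute(self, name):
        # Attribute requests need only the start tag, so an @id-style
        # lookup can run while the element itself is still incomplete.
        if self.start_tag_incomplete:
            return None  # a real implementation would block here
        return self.attributes.get(name)
```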
Other examples of this case are elements referenced by position: either they are there and can be returned, or they are not and you must wait until the node is complete (or complete enough) to return that information.

Other requests -- those that request arbitrary information, or information which cannot be known until the element is complete -- must block until it is. A few examples: the number of attributes is not known until the element's start tag is parsed; the number of children is not known until the end tag is parsed; and requests to iterate through attributes or children cannot complete until all of them are complete. With these types of requests, the request must block until the element, or the relevant portion of the element, is complete.

Other Solution(s)

One technique discussed was to use a DOM parser and representation for ad-hoc parsing. This would work only for streams and ad-hoc document queries; it does nothing for the parsing problem or for generating query results. The base problem with this technique is that we would need two in-memory representations of XML data: the XNodes and the DOM nodes. There would be more components to maintain -- multiple parsers, multiple in-memory representations, and other duplicated effort. In addition, DOM parsers parse to completion before returning the resulting document, which precludes any processing while parsing. Another shortcoming is that the data set must fit into memory, and there is no way to "spill" it to disk if the footprint becomes unexpectedly large.
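Returning to the request classification for partial XNodes above, the blocking rules can be sketched as a small table-driven check. The request names and grouping are hypothetical, chosen only to mirror the examples in the text:

```python
# Requests whose answers depend only on the start tag being parsed.
NEEDS_START_TAG = {"attribute_count", "iterate_attributes"}
# Requests whose answers are unknown until the end tag is parsed.
NEEDS_END_TAG = {"child_count", "iterate_children"}
# Named or positional lookups: returnable early if the target is
# present and complete; only then does presence need to be checked.
NAMED_LOOKUPS = {"named_attribute", "named_child", "child_by_position"}

def must_block(request, start_tag_parsed, end_tag_parsed):
    """Return True if this request must wait on the incomplete node."""
    if request in NEEDS_END_TAG:
        return not end_tag_parsed
    if request in NEEDS_START_TAG:
        return not start_tag_parsed
    # Named lookups do not block up front; a real implementation
    # would then check whether the named target exists and is
    # complete, blocking only if it is absent or still incomplete.
    return False
```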