Niagara Parse Streams

Background

There is a certain amount of redundancy in Niagara due to its structure and the independent development efforts of its separate components. For example, the Data Manager and the Index Manager each have their own XML parser; both are event-driven SAX parsers. Actually, there is yet another XML parser in the system -- the one the File Scan operator uses, which is a DOM parser. However, the need for that separate parser will be eliminated once the DM's XNodes are used for all XML data representation.

What is the problem with two parsers, you ask? Well, there are several. First, there is a performance problem: a document that will be stored in Niagara must be parsed twice, once for the DM and once for the IM. XML parsing has a lot of overhead, and this cost is not reasonable for real queries. Second, the system as a whole must maintain two XML parsers. Outwardly this may not seem a big deal, but the two parsers must be in total agreement about the numbering of items in an XML document. This coupling shows that the parsers are not truly independent; rather, they duplicate each other's functionality, and if they duplicate it incorrectly the system won't work. Next, there is the issue of orthogonality. The IM and the DM have nothing to do with document parsing, and everything to do with document storage; yet considerable portions of both revolve around document parsing. Moving document parsing into its own system will reduce the complexity of these components. In addition, it will be possible to have multiple XML parsers and multiple parse users in the system with only N+M complexity instead of NxM complexity. An example of this is "parsing" out of a document stored in the DM so that it can be [re-]indexed.

Issues

The XML parser is the repository for knowledge about the numbering of XML content.
At the very least there is the Element-ID numbering currently used by the DM. Even this simplistic numbering scheme must take into account the way documents are broken up in Niagara. For example, text between XML elements at a given level is stored as Text nodes, and the numbering must be adjusted to account for this. In addition to element numbering there is also the start, end numbering used by the IM. This numbering scheme is needed because it reflects the structure of the document and allows contains predicates to be evaluated against two postings which have start, end numbers; knowledge of document structure is required for contains queries. With this new Niagara we are also introducing the concept of payload word numbering, which allows exact non-stopword queries to be performed by the IM -- see the IM vignette for details. Stopwords are another issue: the parser may need to number stopwords in some cases and not number them in others, so the parser needs to know what a stopword is. It may be important to have a configurable stopword list. For now a fixed stopword list is good enough, but this is something important to consider for the future.

In general, the parser will most likely provide more information than is needed by any one entity receiving its parse events. Generically, this means the parser will provide the union of all parse information required by each of its users.

Multiple Parsing

All of the above goals are workable with a standalone parsing module. One parser in the system is better than two, and a big step forward from what we have now. However, one of the goals is to have the parser provide parse events to multiple consumers of those events at the same time. This is what will allow one parse to both load the document into the DM and generate postings which are stored in the IM.
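Returning to the numbering question raised under Issues, the IM's start, end scheme and its contains predicate can be sketched roughly as follows. This is an illustrative sketch only -- the names (number_document, contains) are hypothetical, and the real numbering rules must also account for Text nodes, stopwords, and payload words.

```python
# Rough sketch of start, end numbering and the contains predicate.
# Names are hypothetical, not Niagara's actual API.

def number_document(events):
    """events: a list of ('start', name), ('end', name), or
    ('text', data) parse events. Returns (name, start, end) triples
    drawn from a single running counter."""
    counter = 0
    stack, numbered = [], []
    for kind, value in events:
        counter += 1          # every event consumes a number
        if kind == "start":
            stack.append((value, counter))
        elif kind == "end":
            name, start = stack.pop()
            numbered.append((name, start, counter))
        # 'text' takes a number too, so positions stay consistent
    return numbered

def contains(outer, inner):
    """outer contains inner iff it opens before and closes after it."""
    return outer[1] < inner[1] and inner[2] < outer[2]

nodes = number_document([
    ("start", "book"), ("start", "title"), ("text", "XML"),
    ("end", "title"), ("end", "book"),
])
book = next(n for n in nodes if n[0] == "book")    # ("book", 1, 5)
title = next(n for n in nodes if n[0] == "title")  # ("title", 2, 4)
print(contains(book, title))   # True
print(contains(title, book))   # False
```

The point of the sketch is that a containment query reduces to comparing two pairs of numbers, which is why postings carrying start, end numbers suffice for contains predicates.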
Since this is a modular design, the multiple parsing module could just be an ordinary parse event receiver which duplicates the events to a number of other parse event receivers. It is currently assumed that the parse event interface will be procedural in nature, rather than an actual event-driven channel such as a stream. This is specified for the following reasons. First, to minimize the amount of data that is buffered: if streams were used and the two consumers had different consumption rates, a backlog would accrue to the slower consumer. By using a synchronous interface this buffering problem is eliminated. Another issue to consider is the need for incremental parsing. If there is, for example, a large string, we don't want to have to build it all in memory before passing it to the consumer. An incremental approach, where portions of the large data are passed to the consumers, reduces the impact of large data and makes the system more scalable. The third factor to consider is the issue of copying data. XML parsing is an expensive operation, and copying the data to multiple consumers is that much more expensive. The bottom line is that the procedural interface is simple, highly adaptable, and useful from the beginning. We may decide to change it in the future, but it is a good building block for now, and may be all that Niagara requires.

Implementation Notes and Goals

One goal of the parsing system is to divorce the rest of the system from dependence upon a particular XML parser. As such, the parse event interface should completely isolate the consumers of parse events from the producer of the same. This means that there should be no direct or indirect coupling between the existing XML parser, Xerces-C, and the remainder of the Niagara system. In other words, the XML parser is an implementation detail of the parse stream generator and nothing more.

Currently the IM and the DM each drive their own loading with a load(url) function.
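The duplicating receiver and synchronous procedural interface described in the Multiple Parsing section above might look like the following sketch. The class and method names (TeeReceiver, start_element, text, end_element) are illustrative assumptions, not Niagara's actual interface.

```python
# Sketch of the multiple parsing module as an ordinary parse event
# receiver that synchronously forwards each event to every registered
# consumer. All names here are hypothetical.

class TeeReceiver:
    def __init__(self, *receivers):
        self.receivers = receivers

    def start_element(self, name):
        for r in self.receivers:
            r.start_element(name)   # synchronous call: no stream, no backlog

    def text(self, data):
        for r in self.receivers:
            r.text(data)            # same object passed along, not copied

    def end_element(self, name):
        for r in self.receivers:
            r.end_element(name)

class Recorder:
    """Stand-in for a DM loader or an IM posting generator."""
    def __init__(self):
        self.events = []
    def start_element(self, name):
        self.events.append(("start", name))
    def text(self, data):
        self.events.append(("text", data))
    def end_element(self, name):
        self.events.append(("end", name))

dm, im = Recorder(), Recorder()
tee = TeeReceiver(dm, im)
tee.start_element("doc"); tee.text("x"); tee.end_element("doc")
print(dm.events == im.events)  # True: one parse feeds both consumers
```

Because every forwarding call returns before the parser produces the next event, no buffering accrues regardless of how the consumers' processing costs differ.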
The IM and DM know about creating transactions for load, and then either commit or abort the transaction. In the new system this will change: loads will happen in the context of a load operator which is responsible for connecting a document source to an XML parser. The load operator will also be responsible for collecting the receiver(s) of parse events and arranging for the parse events to be sent to those event receiver(s). The IM and DM will know nothing about transactions, merely operating in the context of one. Transaction abort and commit will be carried out by higher-level entities which can make policy decisions.

Currently, the IM and the DM are both entities with somewhat unrelated halves. The important part of both the IM and the DM is the storage and retrieval of database information. However, both the IM and the DM have a large fraction devoted to parsing XML into a format which the storage engine can then store, and this increases the complexity of the two systems quite a bit. Moving the bulk of this functionality to the parsing system leaves only an IM- or DM-specific data interpreter to convert the parsed data into the format needed by those subsystems. Even better would be to isolate the data conversion process in the IM and DM from the actual DB components. This would allow the DB components to just store a "stream" of relevant objects, whether postings or XNodes. This will make the DB components deal purely with database access, and make the system more orthogonal. Of course this is a long-range goal, but it should be known up-front to guide development of the system.

It is important that the parser be "scalable" in terms of document size. This scalability item means that parsing should be independent of document size -- if the document grows, the parser's memory requirements shouldn't grow with it.
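The load operator and transaction split described at the start of this section can be sketched as follows. Transaction, parse, and load_operator are hypothetical stand-ins for illustration, not Niagara's actual interfaces; the point is only that the receivers never see the transaction.

```python
# Sketch of a load operator that owns the transaction while the parse
# event receivers stay transaction-unaware. All names are hypothetical.

class Transaction:
    def __init__(self):
        self.state = "active"
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

def parse(document, receivers):
    # stand-in for a real XML parse driving the event receivers;
    # here each receiver is just a list collecting the document
    for r in receivers:
        r.append(document)

def load_operator(document, receivers):
    """Connects a document source to the parser, fans parse events out
    to the receivers, and makes the commit/abort policy decision."""
    txn = Transaction()
    try:
        parse(document, receivers)
        txn.commit()
    except Exception:
        txn.abort()
        raise
    return txn

dm_store, im_store = [], []
txn = load_operator("<doc/>", [dm_store, im_store])
print(txn.state)  # committed
```

Since the policy decision lives entirely in load_operator, a higher-level entity could substitute a different commit/abort strategy without touching the IM or DM.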
Document-size scalability is not the whole story: it is also important that the parser be scalable at a smaller level -- for individual items of a document, whether they are words or XML elements. For example, a 100MB word in a document must not require the 100MB word to be formed in memory before it is passed to the event receivers. The event-driven nature of the parser will tend to take care of large documents by breaking them up into individual elements. However, a further incremental approach is required for handling individual large content: for example, a start_element function to start a new element of the text, and a continue_element function which incrementally adds text to a started element until it is complete. Of course, a small element would be complete with just the call to start_element, and not require further calls.
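The incremental interface just described can be sketched as follows, using the start_element / continue_element names mentioned above; the chunk size and the delivery helper are illustrative assumptions. The consumer here tracks only the length of the content, showing that its memory use stays independent of content size.

```python
# Sketch of incremental delivery of large content: start_element opens
# a run of text, continue_element appends further chunks, so a huge
# word is never assembled in memory. Chunk size is an arbitrary choice.

CHUNK = 4096

class LengthOnlyReceiver:
    """A consumer that never buffers content, only tracks its length."""
    def __init__(self):
        self.length = 0
        self.started = False
    def start_element(self, chunk):
        self.started = True
        self.length += len(chunk)
    def continue_element(self, chunk):
        assert self.started
        self.length += len(chunk)

def deliver_text(receiver, source, chunk_size=CHUNK):
    """Feed characters from `source` in bounded chunks. A small value
    is complete after the start_element call alone."""
    buf, first = [], True
    for ch in source:
        buf.append(ch)
        if len(buf) == chunk_size:
            chunk, buf = "".join(buf), []
            if first:
                receiver.start_element(chunk)
                first = False
            else:
                receiver.continue_element(chunk)
    chunk = "".join(buf)
    if first:
        receiver.start_element(chunk)
    else:
        receiver.continue_element(chunk)

r = LengthOnlyReceiver()
deliver_text(r, ("x" for _ in range(100_000)))
print(r.length)  # 100000
```

Note that a two-character value makes exactly one start_element call and no continue_element calls, matching the small-element case described above.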