XML Parsing and Parse Events

Introduction

Niagara will have a common XML parser interface. This will allow multiple consumers of parse data and multiple parsers with only N+M complexity, rather than N*M complexity. By factoring the system this way we create more orthogonal components. The cost of components is defining the interfaces that let them exist as something other than a monolithic block.

The interface is often the most important factor in designing a system. With good interfaces you can get by with poor implementations, since an implementation can be replaced with a better one when, and only when, a better one is required.

This vignette discusses the issues that the parse events interface will need to address. It does so by providing a number of sample data streams and a set of parse events which could be used to represent that data. This is not necessarily the way to do this, but it is a starting point for further refinement. There are definitions of the terms and attributes at the end of the document.

The parser should provide all of the information that the consumers of parse events need to know about document structure. This will eliminate the need for the receivers to implement their own independent tools for doing something the parser can easily provide. Of course there will be particular things which the parser can't provide, such as building the ID<->IDREF information needed by the IM.

Whitespace

It turns out that whitespace is a non-trivial issue in XML parsing. This is because there are multiple kinds of whitespace in a document. To make this more interesting, the kind of whitespace is context dependent; it depends upon what element or combination of elements the textual information is enclosed by! There are also whitespace scoping rules in XML which make things more interesting, or tedious, depending upon your viewpoint.

The first kind of whitespace is ignorable whitespace, which really can't be ignored, despite the name.
The salient point is that you need not care what the actual whitespace is; you only need to know that there was whitespace. For example, some complicated whitespace of 10K characters can be annotated as "whitespace"; the original form of the whitespace need not be recorded.

The second kind of whitespace is non-ignorable whitespace. This kind of whitespace must be recorded exactly as presented in the source document; replacing it with a generic whitespace is not allowed.

There are some cases where whitespace is ignorable, but where we do not want to, or must not, ignore it! An example is storing a document that we would like to retrieve an exact version of, not just a semantically identical version. This is an important issue for some users and documents, and is not an issue for others. I think the way to handle this is to specify, on a per-document basis, whether ignorable whitespace should be treated as such, or as non-ignorable whitespace. This could be indicated in the catalog entry for the document.

Note that supporting this exact whitespace means that during processing we may need to ignore some of the whitespace, take all of it into consideration, or ignore all of it. Fortunately this is not a big problem with our new design!

A discussion brought up another concept of whitespace handling: a query may want to treat non-ignorable whitespace as ignorable whitespace. This would allow a query to match documents more freely than a query which required exact whitespace semantics. Essentially this kind of query asks that whitespace rules be relaxed for its execution.

XML allows control of whitespace in the schema specification for a document. In addition, local XML whitespace attributes in the document can modify whitespace handling to an extent. The parser will need to be aware of the whitespace context and keep track of it.

Terminating Characters

Niagara has a concept which adds to the complexity of XML document parsing.
This is the idea of a terminating character: a character which causes a non-whitespace word break in a document. The characters before the terminating character are indexed as their own word, the terminating character itself may be indexed, and the trailing characters are indexed as a new word.

Terminating characters may be context dependent in a document, and may vary between documents.

What exactly to do with terminating characters, and how best to utilize them, are left as research. However, the parser still needs to handle them and make them interact correctly with whitespace.

Document Structure

Finally we come to XML document structure. This turns out to be the simplest thing to handle, as XML document structure is regular.

If we want to retrieve a document in the exact order in which it was stored into the system, it may be necessary to record extra information about the document source so the original can be retrieved. With the current Niagara DM, that means a mapping of attribute position to attribute term number will be needed.

However, if exact whitespace representation is desired in a document, complexity is considerably increased. This covers not just payload whitespace, but all whitespace in a document.

Attributes

Attributes are not in the payload of a document. They aren't assigned unique start numbers. Instead, attributes belong to the element which they are nested inside, and are assigned attribute numbers relative to that node. Attribute words are numbered relative to the attribute they are contained within.

Items of Large Size

The intent of the new Niagara is to be scalable. In the context of parsing and storage, this means we treat everything as if it could be a huge object. That is the case even if we expect it to be small, or if we expect the common case to be small. By using this approach we are assured of having a system which will just work correctly in the face of adversity or unexpected inputs.
By doing this we eliminate the source of many problems, and also eliminate the need to do things in multiple ways, which makes the system simpler overall.

The flip side of this is that we would like to make the common case, where items are small and easy to work with, really fast. This is allowed for throughout the new Niagara design: small string values are held in tuples, with a fallback to the Incremental Evaluator cache or Data Manager if the value is large. The same issue occurs during data storage and parsing. The common, short case will be fast, yet everything will be treated as uniformly large to allow for the cases when that occurs.

Scalability is achieved by allowing access to sequential chunks of a thing, and providing context about where each chunk is located in the final result.

Which Whitespace to Consider

If whitespace is ordinary whitespace, the following rules apply. Multiple whitespace characters are reduced to a single whitespace, which is not a character, but can be represented as one (a space, for example). If whitespace abuts document structure in elements, it is eliminated. In other words, an element payload such as "  this   is a   test  " becomes "this is a test". This is consistent with document handling, does no harm, and makes a regular storage model possible. Let us call this canonical whitespace.

If non-ignorable whitespace mode is enabled, any whitespace in an element payload will be stored exactly as represented in the source document.

Whitespace handling in attribute values may be different, as whitespace could be more significant there. As an initial pass, we can treat attribute whitespace as ignorable whitespace and apply the earlier rules to it. There is no way to query the whitespace, and if structure is desired, it belongs in the document payload. If schema information or XML standards allow more exact control of whitespace in attributes, we may need to do something else.
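The canonical whitespace rules of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the `exact` flag are hypothetical, and "whitespace" is taken to mean the four XML 1.0 whitespace characters (space, tab, carriage return, line feed).

```python
import re

# The four characters XML 1.0 treats as whitespace.
XML_WS = " \t\r\n"
_WS_RUN = re.compile(r"[ \t\r\n]+")


def canonical_whitespace(text: str) -> str:
    """Drop whitespace that abuts structure, then collapse each run to one space."""
    return _WS_RUN.sub(" ", text.strip(XML_WS))


def store_value(text: str, exact: bool = False) -> str:
    """Hypothetical storage hook: keep text verbatim in exact
    (non-ignorable) mode, otherwise store the canonical form."""
    return text if exact else canonical_whitespace(text)


if __name__ == "__main__":
    print(repr(store_value("  this   is\ta \n test ")))
    print(repr(store_value("  kept  as is ", exact=True)))
```

Under the initial-pass rule above, the same `canonical_whitespace` call would be applied to attribute values as well as to element payload.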
If a dataset turns up which needs significant whitespace handling in attributes, it will be easy to deal with.

Note that if a query contains whitespace, the same rules can be applied to the query to squish out whitespace so that it matches the whitespace in a document.

What to Implement

Obviously the parser and parse events need to correctly implement the parsing of document structure. They also need to be able to parse payload and whitespace.

What we aren't going to implement for now is 100% whitespace exactness. Whitespace in attribute values and in element payload will be the only whitespace the system cares about. This simplifies parsing and storage considerably, as whitespace events can be ignored inside document structure.

Whitespace interpretation (exact, non-exact) is controlled by the evaluation function which prints stored data into a string to be evaluated. This is not an issue with content which has ignorable whitespace; it is already in the canonical form. However, for content which has non-ignorable whitespace, the evaluator can convert the whitespace into canonical whitespace. This allows specifying queries which don't care about exact whitespace, and which can explore a larger possible result realm.

ParseEvents

I'm going to divide the discussion of parse events into two arenas, which match the structure versus content split which I implied earlier. However, before we get into that, let's provide some common terminology for all parse events.

There are two kinds of parse events. The first kind has to do with document structure, and reflects the presence of elements and attributes. The second kind has to do with document payload, a misused term in this paper. Document payload only exists at certain points of document structure. The two groups share some attributes, but other attributes are unique to each group, for example whitespace considerations and adjacency.
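One way to picture the two groups is as a pair of record types over a shared base. The sketch below is an assumption, not the Niagara interface: the class names are hypothetical, and the field names are borrowed from the attribute definitions given later in this document (start, text, isComplete, match_start, isAdjacent, SW).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Group(Enum):
    STRUCTURE = "structure"
    PAYLOAD = "payload"


@dataclass
class ParseEvent:
    """Attributes shared by both groups of events."""
    start: int
    text: Optional[str] = None
    is_complete: bool = True


@dataclass
class StructureEvent(ParseEvent):
    """Elements and attributes; match_start is only used by End-Element."""
    kind: str = "Element"
    match_start: Optional[int] = None


@dataclass
class PayloadEvent(ParseEvent):
    """Words and whitespace; adjacency and stop-words only apply here."""
    kind: str = "Payload Word"
    is_adjacent: bool = False
    is_stop_word: bool = False


def group_of(event: ParseEvent) -> Group:
    """Classify an event by the group it belongs to."""
    if isinstance(event, StructureEvent):
        return Group.STRUCTURE
    if isinstance(event, PayloadEvent):
        return Group.PAYLOAD
    raise TypeError("not a structure or payload event")


if __name__ == "__main__":
    print(group_of(StructureEvent(start=1, text="abc")))
    print(group_of(PayloadEvent(start=2, text="hello", is_adjacent=True)))
```

Keeping the group-specific attributes off the base type means a consumer that only cares about structure never has to look at adjacency or stop-word flags.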
Parse events have a number of attributes which present the information about the document that is needed to provide the context of that event within the structure of the document.

Term          Definition
------------------------------------------------------------------
start         The numbering scheme which the IM uses to number
              payload and structure portions of the document. Each
              element and significant word is assigned a start
              number. This implies that both whitespace and
              stopwords, which act as whitespace to a certain
              extent, need not have unique start numbers. If a start
              number is needed, the start number of the preceding
              event could be repeated without harm.

end           The start number of an element's closing tag.

ele_id        The scheme the DM currently uses to store a document;
              an ele_id is assigned to each XML element, and to each
              DM-special text element. Note match_start below;
              match_ele_id isn't required since this field can be
              used in both cases.

PW            Payload word number; all payload words in a document
              are numbered from the first.

SW            Flag indicating a word is a stop-word.

isComplete    Parse event is complete; the full value has been
              transferred. If the data takes multiple parse events
              to transfer, isComplete will be present on the last
              event of that sequence. For uniform handling this
              should be true for all parse events, except those
              which require >1 parse event to transfer the data.

offset=       Location of this text chunk in the result of this
              parse event. offset==0 in the first event of large
              data, and grows uniformly based on the length of data
              presented so far.

AN            Attribute number of this attribute in this element.

AW            Attribute word number; all words in an attribute in an
              element are numbered from the first.

PE            Parse event number; a unique number identifying the
              parse event. This is a unique ID for debug purposes; I
              don't think an event receiver should ever need to care
              about it. It just says this is the n'th parse event
              received.

text          The text associated with the parse event, or this chunk.
              There is a length associated with the text.

isAdjacent    This parse event is adjacent to the previous parse
              event. Whitespace will always be isAdjacent. If
              adjacency isn't specified, the event consumer does
              what it needs to do to represent things properly; for
              example, the first word in a payload or an attribute
              won't have a preceding whitespace.

nest          The nesting level in the document of a particular
              element or payload. An element is at the nesting level
              of the payload of the element it is contained in.
              Payload of an element is at a +1 nesting level.

match_start   The start number of the element which matches this
              element. This is only used for end-element.

element_word  This is an example of an attribute that we could add
              if we choose to. It illustrates the attribute-based
              nature of the interface. The element word would be the
              cardinal number of a payload word in its element.

isAttribute   Future consideration for a common word and whitespace
              model. Indicates that the word or whitespace is an
              attribute value and not a payload value. This is not
              strictly necessary; consumers can discover it from
              document structure easily.

Any and all large values are handled by the offset= and isComplete mechanism. Briefly, (offset==0 && isComplete) means that an entry is the common, short case and the one parse event has provided the complete value. If the value is broken into several chunks, the first chunk will have (offset==0), and the last chunk will have (isComplete). The intermediate chunks of the value will have positively incrementing offset= attributes, as well as the data which starts at that offset=.

Every parse event will have a start number associated with it.

Some attributes of a parse event may not change between successive parse events. A trivial example of this is the isComplete issue. Another example is the start number, which may not change between some parse events. This reflects a lack of movement or change of context in the document.
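The offset= and isComplete mechanism described above can be sketched as a producer that splits a value into chunks and a consumer that reassembles them. This is a minimal illustration, not the Niagara implementation: the `Chunk` type, `split`, `reassemble`, and the fixed `chunk_size` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Chunk:
    """Hypothetical stand-in for the text-carrying part of a parse event."""
    text: str
    offset: int = 0
    is_complete: bool = True


def split(value: str, chunk_size: int) -> Iterator[Chunk]:
    """Producer side: emit one event for short values, several for large ones."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(value) <= chunk_size:
        # Common short case: offset == 0 and is_complete on a single event.
        yield Chunk(value)
        return
    for off in range(0, len(value), chunk_size):
        last = off + chunk_size >= len(value)
        yield Chunk(value[off:off + chunk_size], off, last)


def reassemble(chunks: Iterable[Chunk]) -> Iterator[str]:
    """Consumer side: join chunks until is_complete, checking each offset."""
    parts: List[str] = []
    length = 0
    pending = False
    for chunk in chunks:
        if chunk.offset != length:
            raise ValueError(f"expected offset {length}, got {chunk.offset}")
        parts.append(chunk.text)
        length += len(chunk.text)
        pending = not chunk.is_complete
        if chunk.is_complete:
            yield "".join(parts)
            parts, length = [], 0
    if pending:
        raise ValueError("stream ended in the middle of a value")


if __name__ == "__main__":
    for value in reassemble(split("a large attribute value", 8)):
        print(value)
```

A consumer written this way handles the short case and the large case through the same code path, which is the uniformity the isComplete definition asks for.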
The start number, for example, will remain constant until the end tag of the element is seen, as the element counts as one piece of document structure. Another example of this occurs during attribute handling: all attributes, all attribute words, and the end tag of an element will have the start number of the element. In addition, all attribute words of an attribute will have the same attribute number.

Adjacency, denoted through isAdjacent, indicates that a payload element abuts the previous payload element. Without the adjacency attribute, consumers are free to consider that canonical whitespace exists between the payload elements. As mentioned earlier, all whitespace will be adjacent, since any representation of whitespace as a parse event means that the whitespace must be recorded exactly. To simplify handling of whitespace in consumers of parse events, the parser could or should mark payload following whitespace as adjacent. Similarly, the adjacent property could be set on the first payload in an attribute or element to signify that the payload abuts the document structure; in other words, so that the consumer does not have to keep state to avoid inserting a canonical whitespace there.

Document Structure

Document structure is easy to provide parse events for; here they are:

+-------------------------------------------------------------+
| Event XML Text Text Content                                 |
+-------------------------------------------------------------+
|Element- - |
|End-Element |> - - |
|Attribute def=" attribute name def |
|End-Attribute " - |
+-------------------------------------------------------------+

The structure parse events will have the following attributes; isComplete is noted to indicate which things could require incomplete handling.
+----------------------------------------------------------------------+
|                          Document Structure                          |
+-----------------+----------------------------------------------------+
| Event           | start  text  attribute  match_start  isComplete    |
+-----------------+----------------------------------------------------+
| Element         |   x     x        -           -           x         |
| End-Element-Tag |   x     -        -           -           -         |
| End-Element     |   x     -        -           x           -         |
| Attribute       |   x     x        x           -           x         |
| End-Attribute   |   x     -        x           -           -         |
+-----------------+----------------------------------------------------+

Document Payload

As I mentioned earlier, document payload is a misused term in this document. It refers both to the text() payload of an XML document, and to all non-structural content of the document. In this case that covers both what we call payload words and attribute words. It also includes whitespace, which only exists in document payload (ignoring exact retrieval issues).

One way of dealing with this is to remove the difference between an attribute word and a payload word. It would then be the responsibility of the parse event consumer to retrieve the desired attributes from the parse event, depending upon what its state is. That would make things a bit more orthogonal. The cost is a slight added complexity in some components (the IM); that complexity is trivial and is already required for whitespace handling. The flip side is the question of whether whitespace should then be labelled as payload or attribute. Ick.

Another consideration is terminating characters; it may be desirable either to issue a 'terminating character' event, or to add an attribute which indicates that a word is a terminating character.

In summary of the last paragraph, it doesn't matter; I'll use either or both forms in the examples to see how it works.
+--------------------------------------------------------------------+
|                          Document Payload                          |
+----------------+---------------------------------------------------+
| Event          | start  text  attrib  AW  SW  Adjacent  isComplete |
+----------------+---------------------------------------------------+
| Payload Word   |   x     x      -     -   x      x          x      |
| Attribute Word |   x     x      x     x   x      x          x      |
| Whitespace     |   +     x      +     -   x      x          x      |
+----------------+---------------------------------------------------+

If we posit the uniform word handling, it would look like:

+--------------------------------------------------------------------+
|                     Alternate Document Payload                     |
+----------------+---------------------------------------------------+
| Event          | start  text  attrib  AW  SW  Adjacent  isComplete |
+----------------+---------------------------------------------------+
| Word           |   x     x      ?     ?   x      x          x      |
| Whitespace     |   +     x      +     -   x      x          x      |
+----------------+---------------------------------------------------+

Optimization

This section describes optimizations which are not required for correctness. The system as it stands can work correctly; optimizations can wait for profiling and speed needs. These changes will not affect the system overall, as they only affect the generator and receiver of parse events.

In the interests of efficiency and the common case, it is possible to streamline the isComplete protocol a bit more, for example by making the short and/or common case easier to detect. However, the current protocol will work for now, is regular, and isn't too bad.

It is also possible to generate aggregate parse events. This depends upon the individual consumer and the set of consumers of parse events in an individual parse scenario. The main beneficiary of this optimization is the payload data (words and attributes) received by the DM. However, it could be used with the IM as well.
The worst case is that the parse events are used to describe document structure, and the new common payload parser is used by each of the DM and IM loading components to parse payload. The system is still operating off a common parse, though. I mention this here for completeness's sake.

If the XML tag was a standalone (empty) element, it would still generate two parse events with this scheme. It may be possible to streamline that a bit, but there is a problem in that the element and close element actually need to generate two start numbers: one for the begin element, and another for the end element. An elegant solution for this is to use the start number of the element as the end number. This seems to work well, doesn't appear to break any containment constraints, and reduces the number of parse events for a standalone element to one.

Similarly, tags without attributes could be optimized into a single event.
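The standalone-element optimization can be sketched as a filter over an event stream. This is only an illustration under assumptions: the `Event` record and its `end` field are hypothetical, and a standalone element is assumed to appear as an Element event immediately followed by an End-Element event. A real parser would presumably emit the single event directly rather than merging after the fact.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Event:
    """Hypothetical minimal parse event; `end` is set only on merged events."""
    kind: str
    start: int
    text: Optional[str] = None
    end: Optional[int] = None


def collapse_standalone(events: Iterable[Event]) -> Iterator[Event]:
    """Merge Element + End-Element pairs with nothing in between into a
    single Element event whose end number is its own start number."""
    pending: Optional[Event] = None
    for ev in events:
        if pending is not None:
            if ev.kind == "End-Element":
                # Reuse the element's start number as the end number.
                yield Event("Element", pending.start, pending.text,
                            end=pending.start)
                pending = None
                continue
            yield pending  # not standalone: pass the Element through
            pending = None
        if ev.kind == "Element":
            pending = ev
        else:
            yield ev
    if pending is not None:
        yield pending


if __name__ == "__main__":
    stream = [
        Event("Element", 1, "a"),
        Event("Element", 2, "b"),
        Event("End-Element", 3),
        Event("Payload Word", 4, "hi"),
        Event("End-Element", 5),
    ]
    for ev in collapse_standalone(stream):
        print(ev)
```

Because the merged event has end equal to start, a containment test of the form start(outer) < start(inner) and end(inner) < end(outer) still holds for the elements around it.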