XML Parsing
and
Parse Events
Introduction
Niagara will have a common XML parser interface. This
will allow multiple consumers of parse data and multiple
parsers with only a N+M complexity, rather than a N*M com-
plexity. By factoring the system this way we create more
orthogonal components.
The cost of components is defining interfaces so they
can exist as a non-monolithic block. The interface is often
the most important factor in designing a system. With good
interfaces you can get by with poor implementations, since
the implementation can be replaced with a better one only
when a better one is required.
This vignette will discuss the issues that the parse
events interface will need to address. It will do so by
providing a number of sample data streams and a set of parse
events which could be used to represent that data. This is
not necessarily the way to do this, but it is a starting
point for further refinement.
There are defintions of the terms and attributes at the
end of the document.
The parser should provide all of the information needed
by the consumers of parse events to know about document
structure. This will eliminate the need for the receivers
to implement their own independent tools for doing something
the parser can easily provide. Of course there will be par-
ticular things which the parser can't provide -- such as
building the ID<->IDREF information needed by the IM.
Whitespace
It turns out that whitespace is a non-trivial issue in
XML parsing. This is because there are multiple kinds of
whitespace in a document. To make this more interesting,
the kind of whitespace is context dependent; it depends upon
what element or combination of elements the textual informa-
tion is enclosed by! There are also whitespace scoping
rules in XML which make things more interesting, or tedious,
depending upon your viewpoint.
The first kind of whitespace is ignoreable whitespace
which really can't be ignored, despite the name. The
-2-
salient point is that you need not care about what the
actual whitespace is; only to know that there was whites-
pace. For example, some complicated whitespace of 10K char-
acters can be annotated as "whitespace", the original form
of the WS need not be recorded.
The second kind of whitespace is non-ignoreable whites-
pace. This kind of whitespace must be recorded exactly as
presented in the source document; replacing it with a
generic whitespace is not allowed.
There are some cases where whitespace is ignoreable,
but where we do not want to or must not ignore it! For
example, storing a document that we would like to retrieve
an exact version of, not just a semantically identical ver-
sion. This is an important issue to some users and docu-
ments, and is not an issue with other users or documents. I
think the way to handle this is to specify, on a per-docu-
ment basis, if ignoreable whitespace should be treated as
such, or as non-ignoreable whitespace. This could be indi-
cated in the catalog entry for the document. Note that this
very exact whitespace implies that we may need to ignore
some of the whitespace during processing, or take it all
into consideration, or ignore all of it. Fortunately this
is not a big problem with our new design!
A discussion brought up another concept of whitespace
handling; a query may want to treat non-ignoreable whites-
pace as ignoreable whitespace. This would allow a query to
match documents more freely than a query which required
exact whitespace semantics. Essentially this kind of query
asks that whitespace rules be relaxed for its execution.
XML allows control of whitespace in the schema specifi-
cation for a document. In addition, local XML whitespace
attributes in the document can modify whitespace handling to
an extent. The parser will need to be aware of the whites-
pace context and keep track of it.
Terminating Characters
Niagara has a concept which add to the complexity of
XML document parsing. This is the idea of a terminating
character. This is a character which causes a non-whites-
pace word break in a document. The characters before the
terminating character are indexed as their own word, the
terminating character may be indexed, and the trailing char-
acters will be indexed as a new word.
Terminating characters may be context dependent in a
document, and may vary between documents. What exactly to
do with terminating characters, and how best to utilize them
are left as research. However, the parser still needs to
handle them and interact correctly with whitespace.
-3-
Document Structure
Finally we come to XML document structure. This turns
out to be the most simple thing to handle, as XML document
structure is regular.
If we want to retrieve a document in the exact order
that it was stored into the system, it may be necessary to
record extra information about the document source so the
original can be retrieved. With the current niagara DM,
that means a mapping of attribute position to attribute term
number will be needed.
However, if exact whitespace representation is desired
in a document, complexity is considerably increased. This
is not just payload whitespace, but all whitespace in a doc-
ument.
Attributes
Attributes are not in the payload of a document. They
aren't assigned unique start numbers. Instead, attributes
belong to the element which they are nested inside, and are
assigned attribute#s relative to that node. Attribute words
are numbered relative to the attribute they are contained
within.
Items of Large Size
The intent of the new Niagara is to be scaleable. In
the context of parsing and storage, it means we treat every-
thing as it could be a huge object. That is the case even
if we expect it to be small, or if we expect the common case
to be small. By using this approach we are assured in hav-
ing a system which will just work correctly in the face of
adversity or unexpected inputs. By doing this we eliminate
the source of many problems, and also eliminate the need to
do things in multiple ways, which makes the system simpler
over all.
The flip side of this is that we would like to make the
common case, where items are small and easy to work with,
really fast. This is allowed for throughout the new Niagara
design; small string values in tuples, but a fallback to the
Incremental Evaluator cache or Data Manager if the value is
large. The same issues ccurs during data storage and pars-
ing. The common, short case will be fast, yet everything
will be treated as uniformly large to allow for the cases
when that occurs.
Scalability is performed by allowing access to sequen-
tial chunks of a thing, and providing context about where
that chunk is located in the final result.
-4-
Which Whitespace to Consider
If whitespace is ordinary whitespace, the following
rules apply. Multiple whitespaces characters are reduced to
a single whitespace, which is not a character, but can be
represented as one, a space for example. If whitespace
abuts document structure in elements, it is eliminated. In
other words
this is a test
becomes
this is a test
This is consistent with document handling, does no harm, and
makes a regular storage model possible. Let us call this
canonical whitespace.
If non-ignoreable whitespace mode is enabled, any
whitespace in a element payload will be exactly stored as
represented in the source document.
Whitespace handling in attribute values may be different, as
whitespace could be more significant there. As an initial
pass, we can treat attribute whitespace as ignoreable
whitespace and apply the earlier rules to it. There is no
way to query the whitespace, and if structure is desired, it
belongs in the document payload. If schema information or
XML standards allow more exact control of whitespace in
attributes, we may need to do something else. If there is a
dataset which needs significant whitespace handling in
attributes, it will be easy to deal with it.
Note that if a query contains whitespace that the same
rules can be applied to the query to squish out whitespace
so that it matches the whitespace in a document.
What to Implement
Obviously the parser and parse events need to correctly
implement the parsing of document structure. It also needs
to be able to parse payload and whitespace.
What we aren't going to implement for now is 100%
whitespace exactness. Whitespace in attribute values and in
element payload will be the only whitespace the system cares
about. This simplifies parsing and storage considerably, as
whitespace events can be ignored inside document structure.
Whitespace interpretation (exact, non-exact) is con-
trolled by the evaluation function which prints stored data
into a string to be evaluated. This is not an issue with
content which has ignoreable whitespace; it is already in
-5-
the canonical form. However, for content which has non-
ignoreable whitespace, the evaluator can convert the whites-
pace into canonical whitespace. This allows specifying
queries which don't care about exact whitespace, and can
explore a larger possible result realm.
ParseEvents
I'm going to divide the discussion of parse events into
two arenas, which match the structure versus content split
which I imply earlier. However, before we get into that,
lets provide some common terminology for all parse events.
There are two kinds of parse events. The first group
has to do with document structure, and reflect the presence
of elements and attributes. The second kind of parse event
has to do with document payload, a misused term in this
paper. Document payload only exists in certain points of
document structure. The two different groups share some
attributes, but other attributes are unique to each group,
for example whitespace considerations and adjacency.
Parse events have a number of attributes which present
information about the document which is needed to provide
the context of that event within the structure of the docu-
ment.
Term Definition
------------------------------------------------------------------
start The numbering scheme which the IM uses to number
payload and strcture portions of the document.
Each element and significant word is assigned a
start number. This implies that both whitespace
and stopwords, which act as whitespace to a cer-
tain extent, need not have unique start number.
If a start number is needed,the start number of
the preceding event could be repeated without
harm.
end The start number of an element's closing tag.
ele_id The scheme the DM currently uses to store a docu-
ment; an ele_id is assigned to each XML element,
and to each DM-special text element. Note
match_start below; match_ele_id isn't required
since this field can be used in both cases.
PW Payload word number; all payloads words in a docu-
ment are numbered from the first.
SW Flag indicating a word is a stop-word
-6-
isComplete Parse event is complete; full value has been
transfered. If the data takes multiple parse
events to transfer, isComplete will be present on
the last event of that sequence. For uniform han-
dling this should be true for all parse events,
except those which require >1 parse event to
transfer the data.
offset= Location of text chunk in result of this parse
event. Offset==0 in the first event of large
data, and grows uniformly based on the length of
data presented so ar.
AN Attribute number of this attribute in this element
AW Attribute word number; all words in a attribute in
a element are numbered from the first.
PE Parse Event number; unique number identifying the
parse event. This is a unique ID for debug pur-
poses, I don't think a event receiver should ever
need to care about it. It just says this is the
n'th parse event received.
text The text associated with the parse event, or this
chunk. There is a length associated with the
text.
isAdjacent This parse event is adjacent to the previous parse
event. Whitespace will always be isAdjacent. If
adjaceny isn't specified, the event consumer does
what it needs to do to represent things properly;
for example the first word in a payload or an
attribute won't have a preceeding whitespace.
nest The nesting level in the document of a particular
element or payload. An element is at the nesting
level of the payload of the element it is con-
tained in. Payload of an element is at a +1 nest-
ing level.
match_start The start number of the element which matches this
element. This is only used for end-element.
element_word This is an example of a attribute that we could
add if we choose to. This illustrates the
attribute based nature of the interface. The ele-
ment word would be the cardinal number of a pay-
load word in it's element.
isAttribute Future consideration for common word and whites-
pace model. Indicates that the word or whitespace
is an attribute value and not a payload value.
This is not strictly necessary; consumers can dis-
cover it from document structure easily.
Any and all large values are handled by the offset= and
isComplete mechanism. Briefly, (offset==0 && isComplete)
means that an entry is the common, shore case and the one
parse event has provided the complete value. If the value
is broken into several chunks, the first chunk will have
(offset==0), and the last chunk will have (isComplete). The
intermediate chunks of the value will have positively
-7-
incrementing offset= attributes, as well as the data which
starts at that offset=.
Every parse event will have a start number associated
with it.
Some attributes of a parse event may not change between
successive parse events. A trivial example of this is the
isComplete issue. Another example is the start number may
not change between some parse events. This reflects a lack
of movement or change of context in the document. For exam-
ple, the start number will remain constant until the end tag
of the element is seen, as the element counts as 1 piece of
document structure. Another example of this occur during
attribute handling; all attributes and attribute words and
the end tag of an element will all have the start number of
the element. In addition, all attribute words of an
attribute will have the same attribute number.
Adjaceny, denoted through isAdjacent, indicates that a
payload element abuts the previous payload element. Without
the adjacency event, the consumers are free to consider that
canonical whitespace exists between the payload elements.
As mentioned earlier, all whitespace will be adjacent, since
any representation of whitespace as a parse event means that
the whitespace must be recorded exactly. To simplify han-
dling of whitespace in consumers of parse events, the parser
could or should mark payload following white space as adja-
cent. Similarily, the adjacent property could be set on the
first payload in an attribute or element to signify that the
payload is abutted to the document structure; or in other
words, to not have to keep state to avoid inserting a canon-
ical whitespace there.
Document Structure
Document structure is easy to provide parse events for;
here they are:
+-------------------------------------------------------------+
| Event XML Text Text Content |
+-------------------------------------------------------------+
|Element - - |
|End-Element |> - - |
|Attribute def=" attribute name def |
|End-Attribute " - |
+-------------------------------------------------------------+
The structure parse events will have the following
attributes; isComplete is noted to indicate which things
could require incomplete handling.
-8-
+----------------------------------------------------------------------+
| Document Structure |
+----------------+-----------------------------------------------------+
| Event | start text attribute match_start isComplete |
+----------------+-----------------------------------------------------+
|Element | x x - - x |
|End-Element-Tag | x - - - - |
|End-Element | x - - x - |
|Attribute | x x x - x |
|End-Attribute | x - x - - |
+----------------+-----------------------------------------------------+
Document Payload
As I mentioned earlier, document payload is a misused
term in this document. It refers to both the text() payload
of an XML document, and to all non-structural content of the
document. In this case, both to what we call payload words
and attribute words. It also includes whitespace, which
only exists in document payload (ignoring exact retrieval
issues).
One way of dealing with this is to remove the differ-
ence between an attribute word and a payload word. It would
then be the responsbility of the parse event consumer to
retrieve the desired attributes from the parse event --
depending upon what it's state is. That would make things a
bit more orthogonal, however the cost is it adds slight com-
plexity to some components (IM) which is trivial, and is
already required for whitespace handling. The flip side of
this is that whitespace should be labelled as payload or
attribute? Ick.
Another consideration are terminating characters; it
may desireable to either issue a 'terminating character'
event, or add an attribute which indicates that a word is a
terminating character.
In summary of the last paragraph, it doesn't matter;
i'll use either or both forms in the examples to see how it
works.
+-------------------------------------------------------------------------+
| Document Payload |
+---------------+---------------------------------------------------------+
| Event | start text attrib AW SW Adjacent isComplete |
+---------------+---------------------------------------------------------+
|Payload Word | x x - - x x x |
|Attribute Word | x x x x x x x |
|Whitespace | + x + - x x x |
+---------------+---------------------------------------------------------+
If we posit the uniform word handling, it would look like:
-9-
+---------------------------------------------------------------------+
| Alternate Document Payload |
+-----------+---------------------------------------------------------+
| Event | start text attrib AW SW Adjacent isComplete |
+-----------+---------------------------------------------------------+
|Word | x x ? ? x x x |
|Whitespace | + x + - x x x |
+-----------+---------------------------------------------------------+
Optimization
This section describe optimizations which are not required
for correctness. The system as it stands can work cor-
rectly; optimizations can wait for profiling and speed
needs. These changes will not affect the system overall, as
they only affect the generator and receiver of parse events.
In the interests of efficiency and the common case it
is possible to streamline the isComplete protocol a bit
more. For example, making the short and/or common case eas-
ier to detect. However, this will work for now, is regular,
and isn't too bad.
It is also possible to generate aggregate parse events.
This depends upon the individual consumer and the set of
consumers of parse events in an indiviudal parse scenario.
Payload data (words and attributes) received by the DM are
the main benefit of the optimization. However, it could
also be used with the IM as well.
Worst case is that the parse events are used to
describe document structure, and the new common payload
parser is used each by the DM and IM loading components to
parse payload. The system is still operating off a common
parse, though. I mention this here for completeness's sake.
If the XML tag was a standalone element such as
it would still generate two parse events with this schema.
It may be possible to streamline that a bit, but there is a
problem in that the element and close element actually need
to generate two start numbers, one for the begin element,
and another start number for the end element. An elegant
solution for this is to use the start number of the element
as the end number. This seems to work well, doesn't appear
to break any containment constraints, and reduces the number
of parse events for standalone element to one.
Similarily, tags without attributes could be optimized
into a single event.