Example Parse Events
Introduction
This document illustrates Parse Events with a few
examples that convey the flavor of the system.
Except in examples that explicitly illustrate large
content handling, I will leave out the isComplete and offset=
information. For the simple case they are always (true, 0).
The examples below use PayloadWord and AttributeWord
events. To use the new uniform payload model, the two
different XxxWord events are translated to Word events. No
other changes occur in the consumers, as they already need
to know which attributes to extract.
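A minimal sketch of that translation, in Python purely for
illustration. The ParseEvent class, its field names, and the
to_uniform function are assumptions made for this sketch and
are not part of the system described here. The defaults
mirror the simple case, where isComplete and offset are
always (true, 0).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ParseEvent:
    # Hypothetical event record; field names are illustrative only.
    type: str
    start: Optional[int] = None
    attributes: dict = field(default_factory=dict)
    value: Optional[str] = None
    is_complete: bool = True  # simple case: always True
    offset: int = 0           # simple case: always 0


# The two XxxWord event types that collapse into the uniform Word type.
WORD_TYPES = {"PayloadWord", "AttributeWord"}


def to_uniform(event: ParseEvent) -> ParseEvent:
    """Return a Word event for any XxxWord event; pass others through."""
    if event.type not in WORD_TYPES:
        return event
    return ParseEvent("Word", event.start, dict(event.attributes),
                      event.value, event.is_complete, event.offset)


payload = ParseEvent("PayloadWord", 1, {"PW": 0, "SW": 0, "ADJ": 1}, "Unbroken")
word = to_uniform(payload)
print(word.type, word.attributes, (word.is_complete, word.offset))
```

The attributes are copied unchanged, which matches the remark
that consumers already know which attributes to extract.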
Simple Example
At the very least, an XML document consists of an element:
+------------------------+
|XML     <tag>  </tag>   |
+------------------------+
|start   0      1        |
|ele_id  0      0        |
+------------------------+
This would generate three parse events:
+---+-------+---------------+------------------------+-------+
|PE | start | type          | attributes             | value |
+---+-------+---------------+------------------------+-------+
|0  | 0     | Element       | ele_id=0               | "tag" |
+---+-------+---------------+------------------------+-------+
|1  | 0     | EndElementTag | ele_id=0               | -     |
+---+-------+---------------+------------------------+-------+
|2  | 1     | EndElement    | ele_id=0 match_start=0 | -     |
+---+-------+---------------+------------------------+-------+
If the XML tag were a standalone element such as <tag/>, it
would still generate the same three parse events with this
schema. It may be possible to streamline that a bit, but
there is a problem: the element and close element actually
need two start numbers, one for the begin element and
another for the end element. An elegant solution is to use
the start number of the element as the end number. This
seems to work well, doesn't appear to break any containment
constraints, and reduces the number of parse events for a
standalone element to one. Don't worry about this for now --
it is an optimization which can be done later.
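To make the numbering idea concrete, here is a small Python
sketch. The function names and the strict-containment test
are assumptions for illustration only; the actual containment
rules of the system are not spelled out in this document.

```python
def allocate_span(next_start, standalone):
    """Return (begin, end, next_free) for an element with no content."""
    begin = next_start
    if standalone:
        # Optimization: the end number reuses the element's start number.
        return begin, begin, begin + 1
    # Ordinary case: the close element gets its own start number.
    return begin, begin + 1, begin + 2


def contains(outer, inner):
    """True if the (begin, end) span `inner` lies strictly inside `outer`."""
    return outer[0] < inner[0] and inner[1] < outer[1]


print(allocate_span(0, standalone=False))
print(allocate_span(0, standalone=True))
print(contains((0, 2), (1, 1)))
```

Under this assumed rule a standalone child with span (1, 1)
still nests inside a parent with span (0, 2), which is the
sense in which the shortcut does not disturb containment.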
Simple with ordinary text
Add some text to the above example:
+------------------------------------------------+
|XML     <tag>  Unbroken  Evangelist  </tag>     |
+------------------------------------------------+
|start   0      1         2           3          |
|ele_id  0      -         -           0          |
|PW      -      0         1           -          |
|SW      -      0         0           -          |
+------------------------------------------------+
There will be 5 parse events for this sample:
+---+-------+---------------+------------------------+--------------+
|PE | start | type          | attributes             | value        |
+---+-------+---------------+------------------------+--------------+
|0  | 0     | Element       | ele_id=0               | "tag"        |
+---+-------+---------------+------------------------+--------------+
|1  | 0     | EndElementTag | ele_id=0               | -            |
+---+-------+---------------+------------------------+--------------+
|2  | 1     | PayloadWord   | PW=0 SW=0 ADJ=1        | "Unbroken"   |
+---+-------+---------------+------------------------+--------------+
|3  | 2     | PayloadWord   | PW=1 SW=0              | "Evangelist" |
+---+-------+---------------+------------------------+--------------+
|4  | 3     | EndElement    | match_start=0 ele_id=0 | -            |
+---+-------+---------------+------------------------+--------------+
Note that this example uses the isAdjacent hint on the first
word of the payload so consumers don't need to worry about
differentiating canonical whitespace.
Non-Ignorable Whitespace
This uses the same data set as the previous example,
except that here the whitespace between the two payload
words is non-ignorable.
+---------------------------------------------------------+
|XML     <tag>  Unbroken  "\n\t"  Evangelist  </tag>      |
+---------------------------------------------------------+
|start   0      1         -       2           3           |
|ele_id  0      -         -       -           0           |
|PW      -      0         -       1           -           |
|SW      -      0         -       0           -           |
+---------------------------------------------------------+
When parsing the document it is essential to provide
parse events for the non-ignorable whitespace. This
whitespace must remain with the document, as it is part of
the document. However, the whitespace is not indexed.
Non-ignorable whitespace can exist anywhere in the context
of elements and payload words: between an element and a
payload word, between payload words, or between elements.
That implies that an element could contain nothing but
non-ignorable whitespace.
There will be 6 parse events for this sample:
+---+-------+---------------+------------------------+--------------+
|PE | start | type          | attributes             | value        |
+---+-------+---------------+------------------------+--------------+
|0  | 0     | Element       | ele_id=0               | "tag"        |
+---+-------+---------------+------------------------+--------------+
|1  | 0     | EndElementTag | ele_id=0               | -            |
+---+-------+---------------+------------------------+--------------+
|2  | 1     | PayloadWord   | PW=0 SW=0 ADJ=1        | "Unbroken"   |
+---+-------+---------------+------------------------+--------------+
|3  | -     | WhiteSpace    | ADJ=1                  | "\n\t"       |
+---+-------+---------------+------------------------+--------------+
|4  | 2     | PayloadWord   | PW=1 SW=0 ADJ=1        | "Evangelist" |
+---+-------+---------------+------------------------+--------------+
|5  | 3     | EndElement    | match_start=0 ele_id=0 | -            |
+---+-------+---------------+------------------------+--------------+
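A sketch of how a consumer might treat these events, in
Python for illustration only. The dict layout and function
names are assumptions. It also assumes that ADJ=1 means "no
canonical whitespace precedes this item" and that a missing
ADJ means one canonical space does; that reading is inferred
from the examples here, not stated by the system.

```python
# The six events from the table above, as plain dicts (illustrative layout).
EVENTS = [
    {"type": "Element", "value": "tag"},
    {"type": "EndElementTag"},
    {"type": "PayloadWord", "ADJ": 1, "value": "Unbroken"},
    {"type": "WhiteSpace", "ADJ": 1, "value": "\n\t"},
    {"type": "PayloadWord", "ADJ": 1, "value": "Evangelist"},
    {"type": "EndElement"},
]


def rebuild_text(events):
    """Rebuild the payload text, keeping non-ignorable whitespace."""
    parts = []
    for ev in events:
        if ev["type"] not in ("PayloadWord", "WhiteSpace"):
            continue
        if not ev.get("ADJ"):
            parts.append(" ")  # assumed: one canonical space was skipped
        parts.append(ev["value"])
    return "".join(parts)


def index_terms(events):
    """Only payload words are indexed; whitespace events are not."""
    return [ev["value"] for ev in events if ev["type"] == "PayloadWord"]


print(repr(rebuild_text(EVENTS)))
print(index_terms(EVENTS))
```

The whitespace event stays with the rebuilt document text but
never reaches the list of index terms.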
Ordinary Whitespace and Terminating Characters
Note that there is also the issue of normal whitespace,
and of word delimiters such as $. In those cases the
parser should have an additional list of words or symbols
which cause word breaks -- even if the symbols are adjacent.
Because of this it is important to generate
IgnoreableWhiteSpace events to separate items as needed.
Another way of doing this would be to add a flag which
indicates that a symbol was adjacent to the previous symbol,
with no whitespace separating them. Just to clarify, the XML
document in the following example is:
Unbroken $Evangelist
+----------------------------------------------------+
|XML     <tag>  Unbroken  $  Evangelist  </tag>      |
+----------------------------------------------------+
|start   0      1         2  2           3           |
|ele_id  0      -         -  -           0           |
|PW      -      0         1  1           -           |
+----------------------------------------------------+
The parse events for this are:
+---+-------+---------------+------------------------+--------------+
|PE | start | type          | attributes             | value        |
+---+-------+---------------+------------------------+--------------+
|0  | 0     | Element       | ele_id=0               | "tag"        |
+---+-------+---------------+------------------------+--------------+
|1  | 0     | EndElementTag | ele_id=0               | -            |
+---+-------+---------------+------------------------+--------------+
|2  | 1     | PayloadWord   | PW=0 SW=0 ADJ=1        | "Unbroken"   |
+---+-------+---------------+------------------------+--------------+
|3  | 2     | PayloadWord   | PW=1 SW=1              | "$"          |
+---+-------+---------------+------------------------+--------------+
|4  | 2     | PayloadWord   | PW=1 SW=0 ADJ=1        | "Evangelist" |
+---+-------+---------------+------------------------+--------------+
|5  | 3     | EndElement    | match_start=0 ele_id=0 | -            |
+---+-------+---------------+------------------------+--------------+
Symbols which serve as word delimiters are currently
hardwired into the source. In the future the delimiters
should be configurable, and should also be
context-sensitive. For example, an element may want to
disable dot or at as delimiter symbols. The question of
delimiter symbols and how they are treated is open for
research, or perhaps examination of traditional IR
techniques. For now we just want to note this so the system
isn't limited by it in future work.
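As an illustration of delimiter-driven word breaks with an
adjacency flag, here is a small Python sketch. The tokenize
function and its default delimiter set are hypothetical; this
is not the hardwired logic in the source, only one way the
behavior described above could look. Passing a different set
per call stands in for the configurable, context-sensitive
delimiters wished for above.

```python
def tokenize(text, delimiters=frozenset("$")):
    """Split text into (token, adjacent) pairs.

    `adjacent` is True when no whitespace separates the token from
    whatever came before it (the ADJ=1 hint in the tables).
    """
    tokens = []
    word = ""
    word_adj = True      # adjacency flag of the word being built
    prev_space = False   # True if the previous character was whitespace
    for ch in text:
        if ch.isspace():
            if word:
                tokens.append((word, word_adj))
                word = ""
            prev_space = True
        elif ch in delimiters:
            # A delimiter breaks the word even with no whitespace around it.
            if word:
                tokens.append((word, word_adj))
                word = ""
            tokens.append((ch, not prev_space))
            prev_space = False
        else:
            if not word:
                word_adj = not prev_space
            word += ch
            prev_space = False
    if word:
        tokens.append((word, word_adj))
    return tokens


print(tokenize("Unbroken $Evangelist"))
# An element that disables dot as a delimiter passes its own set:
print(tokenize("a.b", delimiters=frozenset()))
```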
Add Stopwords
This example adds some stopwords to the previous case.
+-------------------------------------------------------------+
|XML     <tag>  The  Unbroken  are  Evangelical  </tag>       |
+-------------------------------------------------------------+
|start   0      1    2         3    4            5            |
|ele_id  0      -    -         -    -            0            |
|PW      -      0    1         2    3            -            |
|SW      -      1    0         1    0            -            |
+-------------------------------------------------------------+
This example would consist of 7 parse events:
+---+-------+---------------+------------------------+---------------+
|PE | start | type          | attributes             | value         |
+---+-------+---------------+------------------------+---------------+
|0  | 0     | Element       | ele_id=0               | "tag"         |
+---+-------+---------------+------------------------+---------------+
|1  | 0     | EndElementTag | ele_id=0               | -             |
+---+-------+---------------+------------------------+---------------+
|2  | 1     | PayloadWord   | PW=0 SW=1 ADJ=1        | "The"         |
+---+-------+---------------+------------------------+---------------+
|3  | 2     | PayloadWord   | PW=1 SW=0              | "Unbroken"    |
+---+-------+---------------+------------------------+---------------+
|4  | 3     | PayloadWord   | PW=2 SW=1              | "are"         |
+---+-------+---------------+------------------------+---------------+
|5  | 4     | PayloadWord   | PW=3 SW=0              | "Evangelical" |
+---+-------+---------------+------------------------+---------------+
|6  | 5     | EndElement    | ele_id=0 match_start=0 | -             |
+---+-------+---------------+------------------------+---------------+
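A short Python sketch of how the PW and SW attributes in this
example could be assigned. The stopword list and the
payload_events function are made up for illustration; the
real stopword source is not described in this document.

```python
STOPWORDS = {"the", "are"}  # hypothetical stopword list


def payload_events(words, first_start=1, stopwords=STOPWORDS):
    """Build PayloadWord events; stopwords keep their PW and start numbers."""
    events = []
    for pw, word in enumerate(words):
        events.append({
            "start": first_start + pw,
            "type": "PayloadWord",
            "PW": pw,
            "SW": int(word.lower() in stopwords),  # 1 marks a stopword
            "value": word,
        })
    return events


for ev in payload_events(["The", "Unbroken", "are", "Evangelical"]):
    print(ev)
```

Note that a stopword is only flagged, not dropped: it still
consumes a start number and a PW number, so word positions
stay stable for consumers that ignore stopwords.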
Long Payload Words
This case illustrates how long payload words are
incrementally parsed.
+-------------------------------------------------------------+
|XML     <tag>  The_undead_confuse_the_evangelical  </tag>    |
+-------------------------------------------------------------+
|start   0      1                                   2         |
|ele_id  0      -                                   0         |
|PW      -      0                                   -         |
|SW      -      ?                                   -         |
+-------------------------------------------------------------+
The ? in the stopword entry reflects some uncertainty as to
what should happen here. One approach would be for the
parser to decide that this blob was a stopword and shouldn't
be indexed. On the other hand, there is a good argument
that the IM should decide which, or how much, of xLOBs are
indexed. Perhaps the parser should provide hints about the
type of the LOB so the IM can use the info to make an
informed decision.
I'll use some arbitrary boundaries to break up the text
into reasonable parse events -- 6 of them in this case:
+---+-------+---------------+-----------------------------------+----------------+
|PE | start | type          | attributes                        | value          |
+---+-------+---------------+-----------------------------------+----------------+
|0  | 0     | Element       | ele_id=0 isComplete               | "tag"          |
+---+-------+---------------+-----------------------------------+----------------+
|1  | 0     | EndElementTag | ele_id=0 isComplete               | -              |
+---+-------+---------------+-----------------------------------+----------------+
|2  | 1     | PayloadWord   | offset=0                          | "The_undead_"  |
+---+-------+---------------+-----------------------------------+----------------+
|3  | 1     | PayloadWord   | offset=11                         | "confuse_the_" |
+---+-------+---------------+-----------------------------------+----------------+
|4  | 1     | PayloadWord   | offset=23 isComplete              | "evangelical"  |
+---+-------+---------------+-----------------------------------+----------------+
|5  | 2     | EndElement    | ele_id=0 match_start=0 isComplete | -              |
+---+-------+---------------+-----------------------------------+----------------+
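On the consumer side, the chunks can be stitched back
together using offset and isComplete. The following Python
sketch assumes each chunk arrives as an (offset, is_complete,
value) tuple and that offsets count characters from the start
of the word, as the numbers in the table suggest; the
reassemble name is hypothetical.

```python
def reassemble(chunks):
    """Join the chunks of one long payload word.

    Each chunk is (offset, is_complete, value). The offset must equal
    the number of characters received so far.
    """
    buf = ""
    for offset, is_complete, value in chunks:
        if offset != len(buf):
            raise ValueError(f"expected offset {len(buf)}, got {offset}")
        buf += value
        if is_complete:
            return buf
    raise ValueError("payload ended without an isComplete chunk")


chunks = [
    (0, False, "The_undead_"),
    (11, False, "confuse_the_"),
    (23, True, "evangelical"),
]
print(reassemble(chunks))
```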
Long element names
The previous example illustrates what happens when a
long payload word is encountered. This example will show
what happens when a long element name is parsed; a similar
sequence would happen for a long attribute name.
+------------------------------------------------------+
|XML     <this-is-a-long-tag>  </this-is-a-long-tag>   |
+------------------------------------------------------+
|start   0                     1                       |
|ele_id  0                     0                       |
|PW      -                     -                       |
|SW      -                     -                       |
+------------------------------------------------------+
I'll break this up into 5 parse events, similar to the text
example.
+---+-------+---------------+--------------------------------+------------+
|PE | start | type          | attributes                     | value      |
+---+-------+---------------+--------------------------------+------------+
|0  | 0     | Element       | ele_id=0 offset=0              | "this-is-" |
+---+-------+---------------+--------------------------------+------------+
|1  | 0     | Element       | ele_id=0 offset=8              | "a-long-"  |
+---+-------+---------------+--------------------------------+------------+
|2  | 0     | Element       | ele_id=0? offset=15 isComplete | "tag"      |
+---+-------+---------------+--------------------------------+------------+
|3  | 0     | EndElementTag | ele_id=0                       | -          |
+---+-------+---------------+--------------------------------+------------+
|4  | 1     | EndElement    | ele_id=0 match_start=0         | -          |
+---+-------+---------------+--------------------------------+------------+
Note the ele_id=0? attribute. This information doesn't
change, and there is no reason to transmit it again and
again. Of course, if a data structure is provided, the
producer need not change the constant fields and the
consumer can ignore those fields in the continuation.
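On the producer side, splitting a long name into
continuation events might look like the Python sketch below.
The function name, the dict layout, and the fixed chunk size
are assumptions; the boundaries in the table above are
arbitrary, so the pieces produced here differ from it. The
constant fields (start, ele_id) are simply left unchanged in
every chunk.

```python
def chunk_element_name(name, ele_id, start, size):
    """Split a long element name into Element continuation events."""
    events = []
    for offset in range(0, len(name), size):
        events.append({
            "type": "Element",
            "start": start,       # constant across continuations
            "ele_id": ele_id,     # constant across continuations
            "offset": offset,
            "isComplete": offset + size >= len(name),  # True on last chunk
            "value": name[offset:offset + size],
        })
    return events


for ev in chunk_element_name("this-is-a-long-tag", ele_id=0, start=0, size=8):
    print(ev["offset"], ev["isComplete"], ev["value"])
```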
Simple Attributes
This example illustrates how simple attributes are
parsed:
+------------------------------------------------------------------------------+
|XML |
+------------------------------------------------------------------------------+
|start 0 0 0 0 0 0 1 |
|ele_id 0 - - - - 0 0 |
|attrib - 1 2 2 2 - - |
|AW - - - 0 1 - - |
+------------------------------------------------------------------------------+
Here is a sample set of parse events for these attributes:
+---+-------+---------------+------------------------+--------------+
|PE | start | type          | attributes             | value        |
+---+-------+---------------+------------------------+--------------+
|0  | 0     | Element       | ele_id=0               | "tag"        |
+---+-------+---------------+------------------------+--------------+
|1  | 0     | Attribute     | attrib=0               | "Demonic"    |
+---+-------+---------------+------------------------+--------------+
|2  | 0     | EndAttribute  | attrib=0 count=0       | -            |
+---+-------+---------------+------------------------+--------------+
|3  | 0     | Attribute     | attrib=1               | "Religion"   |
+---+-------+---------------+------------------------+--------------+
|4  | 0     | AttributeWord | attrib=1 AW=0          | "vampiric"   |
+---+-------+---------------+------------------------+--------------+
|5  | 0     | AttributeWord | attrib=1 AW=1          | "necromancy" |
+---+-------+---------------+------------------------+--------------+
|6  | 0     | EndAttribute  | attrib=1 count=2       | -            |
+---+-------+---------------+------------------------+--------------+
|7  | 0     | EndElementTag | ele_id=0               | -            |
+---+-------+---------------+------------------------+--------------+
|8  | 1     | EndElement    | ele_id=0 match_start=0 | -            |
+---+-------+---------------+------------------------+--------------+
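A consumer could gather these events into a per-element
attribute map and use count as a sanity check. The Python
sketch below is illustrative only; the dict layout and the
collect_attributes name are assumptions, and it treats count
as the number of AttributeWord events for that attribute,
which is what the table suggests.

```python
def collect_attributes(events):
    """Map each attribute name to its list of words, checking counts."""
    names = {}   # attrib id -> attribute name
    attrs = {}   # attribute name -> list of words
    for ev in events:
        kind = ev["type"]
        if kind == "Attribute":
            names[ev["attrib"]] = ev["value"]
            attrs[ev["value"]] = []
        elif kind == "AttributeWord":
            attrs[names[ev["attrib"]]].append(ev["value"])
        elif kind == "EndAttribute":
            name = names[ev["attrib"]]
            if len(attrs[name]) != ev["count"]:
                raise ValueError(f"word count mismatch for {name!r}")
    return attrs


events = [
    {"type": "Element", "ele_id": 0, "value": "tag"},
    {"type": "Attribute", "attrib": 0, "value": "Demonic"},
    {"type": "EndAttribute", "attrib": 0, "count": 0},
    {"type": "Attribute", "attrib": 1, "value": "Religion"},
    {"type": "AttributeWord", "attrib": 1, "AW": 0, "value": "vampiric"},
    {"type": "AttributeWord", "attrib": 1, "AW": 1, "value": "necromancy"},
    {"type": "EndAttribute", "attrib": 1, "count": 2},
    {"type": "EndElementTag", "ele_id": 0},
]
print(collect_attributes(events))
```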
Nested Elements
A straightforward example of nested elements.
+---------------------------------+
|XML     <a>  <b>  </b>  </a>     |
+---------------------------------+
|start   0    1    2     3        |
|ele_id  0    1    1     0        |
+---------------------------------+
The parse events for this would be:
+---+-------+---------------+------------------------+-------+
|PE | start | type          | attributes             | value |
+---+-------+---------------+------------------------+-------+
|0  | 0     | Element       | ele_id=0               | "a"   |
+---+-------+---------------+------------------------+-------+
|1  | 0     | EndElementTag | ele_id=0               | -     |
+---+-------+---------------+------------------------+-------+
|2  | 1     | Element       | ele_id=1               | "b"   |
+---+-------+---------------+------------------------+-------+
|3  | 1     | EndElementTag | ele_id=1               | -     |
+---+-------+---------------+------------------------+-------+
|4  | 2     | EndElement    | match_start=1 ele_id=1 | -     |
+---+-------+---------------+------------------------+-------+
|5  | 3     | EndElement    | match_start=0 ele_id=0 | -     |
+---+-------+---------------+------------------------+-------+
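The match_start attribute lets a consumer pair each
EndElement with its opening Element. A stack-based check in
Python could look like the sketch below. The check_nesting
name and dict layout are assumptions, and the sketch assumes
element names arrive complete (one Element event per
element, no continuations).

```python
def check_nesting(events):
    """Return True if every EndElement closes the innermost open element."""
    stack = []  # (ele_id, start) of the elements currently open
    for ev in events:
        if ev["type"] == "Element":
            stack.append((ev["ele_id"], ev["start"]))
        elif ev["type"] == "EndElement":
            expected = (ev["ele_id"], ev["match_start"])
            if not stack or stack.pop() != expected:
                return False
    return not stack  # nothing may remain open at the end


events = [
    {"type": "Element", "start": 0, "ele_id": 0, "value": "a"},
    {"type": "EndElementTag", "start": 0, "ele_id": 0},
    {"type": "Element", "start": 1, "ele_id": 1, "value": "b"},
    {"type": "EndElementTag", "start": 1, "ele_id": 1},
    {"type": "EndElement", "start": 2, "ele_id": 1, "match_start": 1},
    {"type": "EndElement", "start": 3, "ele_id": 0, "match_start": 0},
]
print(check_nesting(events))
```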
Larger Example
This presents a larger example which illustrates most
or all of the cases mentioned above. Note that <text> is
pre-formatted text which has non-ignorable whitespace; of
course we must have a DTD to know that.
Dragon Bone
This
is
wacky
+---+-------+---------------+------------------------+----------+
|PE | start | type          | attributes             | value    |
+---+-------+---------------+------------------------+----------+
|0  | 0     | Element       | ele_id=0               | "root"   |
+---+-------+---------------+------------------------+----------+
|1  | 0     | EndElementTag | ele_id=0               | -        |
+---+-------+---------------+------------------------+----------+
|2  | 1     | Element       | ele_id=1               | "users"  |
+---+-------+---------------+------------------------+----------+
|3  | 1     | EndElementTag | ele_id=1               | -        |
+---+-------+---------------+------------------------+----------+
|4  | 2     | Element       | ele_id=2               | "user"   |
+---+-------+---------------+------------------------+----------+
|5  | 2     | Attribute     | attrib=0               | "id"     |
+---+-------+---------------+------------------------+----------+
|6  | 2     | AttributeWord | attrib=0 AW=0 ADJ=1    | "smog"   |
+---+-------+---------------+------------------------+----------+
|7  | 2     | AttributeWord | attrib=0 AW=1          | "fog"    |
+---+-------+---------------+------------------------+----------+
|8  | 2     | EndAttribute  | attrib=0 count=2       | -        |
+---+-------+---------------+------------------------+----------+
|9  | 2     | EndElementTag | ele_id=2               | -        |
+---+-------+---------------+------------------------+----------+
|10 | 3     | PayloadWord   | SW=0 PW=0 ADJ=1        | "Dragon" |
+---+-------+---------------+------------------------+----------+
|11 | 4     | PayloadWord   | SW=0 PW=1              | "Bone"   |
+---+-------+---------------+------------------------+----------+
|12 | 5     | EndElement    | ele_id=2 match_start=2 | -        |
+---+-------+---------------+------------------------+----------+
|13 | 6     | EndElement    | ele_id=1 match_start=1 | -        |
+---+-------+---------------+------------------------+----------+
|14 | 7     | Element       | ele_id=4               | "text"   |
+---+-------+---------------+------------------------+----------+
|15 | 7     | EndElementTag | ele_id=4               | -        |
+---+-------+---------------+------------------------+----------+
|16 | 8     | PayloadWord   | SW=0 PW=2 ADJ=1        | "This"   |
+---+-------+---------------+------------------------+----------+
|17 | -     | WhiteSpace    | ADJ=1                  | "\n"     |
+---+-------+---------------+------------------------+----------+
|18 | 9     | PayloadWord   | SW=1 PW=3 ADJ=1        | "is"     |
+---+-------+---------------+------------------------+----------+
|19 | -     | WhiteSpace    | ADJ=1                  | "\n\t"   |
+---+-------+---------------+------------------------+----------+
|20 | 10    | PayloadWord   | SW=0 PW=4 ADJ=1        | "wacky"  |
+---+-------+---------------+------------------------+----------+
|21 | 11    | EndElement    | ele_id=4 match_start=7 | -        |
+---+-------+---------------+------------------------+----------+
|22 | 12    | EndElement    | ele_id=0 match_start=0 | -        |
+---+-------+---------------+------------------------+----------+