Example Parse Events


Introduction

     This document illustrates Parse Events with a few
examples that convey the flavor of the system.

     Except in examples that explicitly illustrate large
content handling, I will leave out the isComplete and offset=
information. In the simple case they are always (true, 0).

     The examples below use PayloadWord and AttributeWord
events. To use the new uniform payload model, the two
different XxxWord events are translated to Word events. No
other changes occur in the consumers, as they already need
to know which attributes to extract.
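
     The translation to the uniform payload model can be
sketched as follows. This is a minimal illustration only: the
ParseEvent class, its field names, and the to_uniform function
are hypothetical names chosen to mirror the table columns used
in the examples, not the system's actual interface.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ParseEvent:
    # Hypothetical record; fields mirror the columns of the example tables.
    type: str
    start: Optional[int] = None
    attributes: dict = field(default_factory=dict)
    value: Optional[str] = None


def to_uniform(event: ParseEvent) -> ParseEvent:
    """Map PayloadWord/AttributeWord to Word; return other events unchanged."""
    if event.type in ("PayloadWord", "AttributeWord"):
        # Attributes are copied as-is; consumers already know which to read.
        return ParseEvent("Word", event.start, dict(event.attributes), event.value)
    return event


pw = ParseEvent("PayloadWord", 1, {"PW": 0, "SW": 0}, "Unbroken")
aw = ParseEvent("AttributeWord", 0, {"attrib": 1, "AW": 0}, "vampiric")
print(to_uniform(pw).type, to_uniform(aw).type)
```

Because the attributes pass through untouched, a consumer can
still tell payload words from attribute words by checking for
PW or AW.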


Simple Example

     At  the  very least an XML document consists of an ele-
ment:

                 +------------------------+
                 |XML      <tag>   </tag> |
                 +------------------------+
                 |start      0       1    |
                 |ele_id     0       0    |
                 +------------------------+
This would generate three parse events:

+---+-------+---------------+------------------------+-------+
|PE | start |     type      |       attributes       | value |
+---+-------+---------------+------------------------+-------+
+---+-------+---------------+------------------------+-------+
|0  | 0     | Element       | ele_id=0               | "tag" |
+---+-------+---------------+------------------------+-------+
|1  | 0     | EndElementTag | ele_id=0               | -     |
+---+-------+---------------+------------------------+-------+
|2  | 1     | EndElement    | ele_id=0 match_start=0 | -     |
+---+-------+---------------+------------------------+-------+
If the XML tag were a standalone element such as <tag/>, it
would still generate three parse events with this schema. It
may be possible to streamline that a bit, but there is a
problem: the element and the close element actually need two
start numbers, one for the begin element and another for the
end element. An elegant solution is to use the start number
of the element as the end number as well. This seems to work
well, doesn't appear to break any containment constraints,
and reduces the number of parse events for a standalone
element by one. Don't worry about this for now; it is an
optimization which can be done later.


Simple with ordinary text

     Add some text to the above example:

     +------------------------------------------------+
     |XML      <tag>   Unbroken   Evangelist   </tag> |
     +------------------------------------------------+
     |start      0        1           2          3    |
     |ele_id     0        -           -          0    |
     |PW         -        0           1          -    |
     |SW         -        0           0          -    |
     +------------------------------------------------+
There will be 5 parse events for this sample:

+---+-------+---------------+------------------------+--------------+
|PE | start |     type      |       attributes       |    value     |
+---+-------+---------------+------------------------+--------------+
+---+-------+---------------+------------------------+--------------+
|0  | 0     | Element       | ele_id=0               | "tag"        |
+---+-------+---------------+------------------------+--------------+
|1  | 0     | EndElementTag | ele_id=0               | -            |
+---+-------+---------------+------------------------+--------------+
|2  | 1     | PayloadWord   | PW=0 SW=0 ADJ=1        | "Unbroken"   |
+---+-------+---------------+------------------------+--------------+
|3  | 2     | PayloadWord   | PW=1 SW=0              | "Evangelist" |
+---+-------+---------------+------------------------+--------------+
|4  | 3     | EndElement    | match_start=0 ele_id=0 | -            |
+---+-------+---------------+------------------------+--------------+
Note that this example uses the isAdjacent hint on the first
word of the payload so consumers don't need to  worry  about
differentiating canonical whitespace.


Non-Ignorable Whitespace

     This uses the same data set as the previous example,
except that here the whitespace between the two payload
words is non-ignorable.

 +---------------------------------------------------------+
 |XML      <tag>   Unbroken   "\n\t"   Evangelist   </tag> |
 +---------------------------------------------------------+
 |start      0        1         -          2          3    |
 |ele_id     0        -         -          -          0    |
 |PW         -        0         -          1          -    |
 |SW         -        0         -          0          -    |
 +---------------------------------------------------------+

     When parsing the document it is essential to provide
parse events for the non-ignorable whitespace. This
whitespace must remain with the document, as it is part of
the document. However, the whitespace is not indexed.
Non-ignorable whitespace can exist anywhere in the context
of elements and payload words: between an element and a
payload word, between payload words, or between elements.
That implies that an element could contain nothing but
non-ignorable whitespace.

     There will be 6 parse events for this sample:

+---+-------+---------------+------------------------+--------------+
|PE | start |     type      |       attributes       |    value     |
+---+-------+---------------+------------------------+--------------+
+---+-------+---------------+------------------------+--------------+
|0  | 0     | Element       | ele_id=0               | "tag"        |
+---+-------+---------------+------------------------+--------------+
|1  | 0     | EndElementTag | ele_id=0               | -            |
+---+-------+---------------+------------------------+--------------+
|2  | 1     | PayloadWord   | PW=0 SW=0 ADJ=1        | "Unbroken"   |
+---+-------+---------------+------------------------+--------------+
|3  | -     | WhiteSpace    | ADJ=1                  | "\n\t"       |
+---+-------+---------------+------------------------+--------------+
|4  | 2     | PayloadWord   | PW=1 SW=0 ADJ=1        | "Evangelist" |
+---+-------+---------------+------------------------+--------------+
|5  | 3     | EndElement    | match_start=0 ele_id=0 | -            |
+---+-------+---------------+------------------------+--------------+
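
     A consumer that needs the original payload text back can
rebuild it from these events. The sketch below assumes that
ADJ=1 means "no whitespace separates this item from whatever
precedes it" and that a missing ADJ stands for one canonical
space; that reading of the hint is an assumption, and the
rebuild_payload function and the tuple layout are hypothetical.

```python
def rebuild_payload(events):
    """Rebuild payload text from (type, attributes, value) tuples."""
    parts = []
    for etype, attrs, value in events:
        if etype not in ("PayloadWord", "WhiteSpace"):
            continue  # element events carry no payload text
        if parts and not attrs.get("ADJ"):
            parts.append(" ")  # canonical whitespace before a non-adjacent item
        parts.append(value)
    return "".join(parts)


events = [
    ("Element", {"ele_id": 0}, "tag"),
    ("EndElementTag", {"ele_id": 0}, None),
    ("PayloadWord", {"PW": 0, "SW": 0, "ADJ": 1}, "Unbroken"),
    ("WhiteSpace", {"ADJ": 1}, "\n\t"),
    ("PayloadWord", {"PW": 1, "SW": 0, "ADJ": 1}, "Evangelist"),
    ("EndElement", {"match_start": 0, "ele_id": 0}, None),
]
print(repr(rebuild_payload(events)))  # 'Unbroken\n\tEvangelist'
```

With the events of the previous example (no WhiteSpace event
and no ADJ on the second word) the same function yields the
two words separated by a single space.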


Ordinary Whitespace and Terminating Characters

     There is also the issue of normal whitespace and of
word delimiters such as $. For those cases the parser should
have an additional list of words or symbols which cause word
breaks, even if the symbols are adjacent. Because of this it
is important to generate IgnoreableWhiteSpace events to
separate items as needed. Another way of doing this would be
to add a flag which indicates that a symbol is adjacent to
the previous symbol, with no whitespace separating them.
Just to clarify, the XML document in the following example
is:
     <tag>Unbroken $Evangelist</tag>

   +----------------------------------------------------+
   |XML      <tag>   Unbroken   $   Evangelist   </tag> |
   +----------------------------------------------------+
   |start    0       1          2   2            3      |
   |ele_id   0       -          -   -            0      |
   |PW       -       0          1   1            -      |
   |SW       -       0          1   0            -      |
   +----------------------------------------------------+
The parse events for this are:

+---+-------+---------------+------------------------+--------------+
|PE | start |     type      |       attributes       |    value     |
+---+-------+---------------+------------------------+--------------+
+---+-------+---------------+------------------------+--------------+
|0  | 0     | Element       | ele_id=0               | "tag"        |
+---+-------+---------------+------------------------+--------------+
|1  | 0     | EndElementTag | ele_id=0               | -            |
+---+-------+---------------+------------------------+--------------+
|2  | 1     | PayloadWord   | PW=0 SW=0 ADJ=1        | "Unbroken"   |
+---+-------+---------------+------------------------+--------------+
|3  | 2     | PayloadWord   | PW=1 SW=1              | "$"          |
+---+-------+---------------+------------------------+--------------+
|4  | 2     | PayloadWord   | PW=1 SW=0 ADJ=1        | "Evangelist" |
+---+-------+---------------+------------------------+--------------+
|5  | 3     | EndElement    | match_start=0 ele_id=0 | -            |
+---+-------+---------------+------------------------+--------------+
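
     The word-breaking rule with an adjacency flag can be
sketched as below. The DELIMITERS set and the split_words
function are hypothetical; the real delimiter list is
hardwired in the source and is not reproduced here. The flag
is True when no whitespace separates an item from what
precedes it, which is an assumed reading of ADJ.

```python
DELIMITERS = {"$"}  # hypothetical delimiter set, for illustration only


def split_words(text, delimiters=DELIMITERS):
    """Return (word, adjacent) pairs, breaking on whitespace and delimiters."""
    words = []       # collected (word, adjacent) pairs
    current = ""     # word being accumulated
    adjacent = True  # stays True until whitespace precedes the next item
    for ch in text:
        if ch.isspace():
            if current:
                words.append((current, adjacent))
                current = ""
            adjacent = False
        elif ch in delimiters:
            if current:
                words.append((current, adjacent))
                current = ""
                adjacent = True  # the delimiter touches the word before it
            words.append((ch, adjacent))
            adjacent = True      # whatever follows touches the delimiter
        else:
            current += ch
    if current:
        words.append((current, adjacent))
    return words


print(split_words("Unbroken $Evangelist"))
```

For the sample text this gives Unbroken (adjacent), $ (not
adjacent), and Evangelist (adjacent), matching the ADJ column
in the table above.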

     Symbols which serve as word delimiters are currently
hardwired into the source. In the future the delimiters
should be configurable, and should also be context-sensitive.
For example, an element may want to disable dot or at as
delimiter symbols. The question of delimiter symbols and how
they are treated is open for research, or perhaps for
examination of traditional IR techniques. For now we just
want to note this so that future work isn't limited by it.


Add Stopwords

     This example adds some stopwords to the previous  case.

+-------------------------------------------------------------+
|XML      <tag>   The   Unbroken   are   Evangelical   </tag> |
+-------------------------------------------------------------+
|start    0       1     2          3     4             5      |
|ele_id   0       -     -          -     -             0      |
|PW       -       0     1          2     3             -      |
|SW       -       1     0          1     0             -      |
+-------------------------------------------------------------+
This example would consist of 7 parse events:

+---+-------+---------------+------------------------+---------------+
|PE | start |     type      |       attributes       |     value     |
+---+-------+---------------+------------------------+---------------+
+---+-------+---------------+------------------------+---------------+
|0  | 0     | Element       | ele_id=0               | "tag"         |
+---+-------+---------------+------------------------+---------------+
|1  | 0     | EndElementTag | ele_id=0               | -             |
+---+-------+---------------+------------------------+---------------+
|2  | 1     | PayloadWord   | PW=0 SW=1 ADJ=1        | "The"         |
+---+-------+---------------+------------------------+---------------+
|3  | 2     | PayloadWord   | PW=1 SW=0              | "Unbroken"    |
+---+-------+---------------+------------------------+---------------+
|4  | 3     | PayloadWord   | PW=2 SW=1              | "are"         |
+---+-------+---------------+------------------------+---------------+
|5  | 4     | PayloadWord   | PW=3 SW=0              | "Evangelical" |
+---+-------+---------------+------------------------+---------------+
|6  | 5     | EndElement    | ele_id=0 match_start=0 | -             |
+---+-------+---------------+------------------------+---------------+
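
     The PW and SW numbering in this example can be sketched
as follows. The STOPWORDS set and the payload_word_events
function are hypothetical; the sketch only assumes, as the
table shows, that stopwords still receive their own start and
PW numbers and are merely flagged with SW=1.

```python
STOPWORDS = {"the", "are"}  # hypothetical stopword list for this example


def payload_word_events(words, first_start, stopwords=STOPWORDS):
    """Build (start, type, attributes, value) tuples for payload words."""
    events = []
    for pw, word in enumerate(words):
        attrs = {"PW": pw, "SW": int(word.lower() in stopwords)}
        if pw == 0:
            attrs["ADJ"] = 1  # first word follows the start tag directly
        events.append((first_start + pw, "PayloadWord", attrs, word))
    return events


for event in payload_word_events(["The", "Unbroken", "are", "Evangelical"], 1):
    print(event)
```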


Long Payload Words

     This case illustrates how long payload words are incre-
mentally parsed.

+-------------------------------------------------------------+
|XML      <tag>   The_undead_confuse_the_evangelical   </tag> |
+-------------------------------------------------------------+
|start    0       1                                    2      |
|ele_id   0       -                                    0      |
|PW       -       0                                    -      |
|SW       -       ?                                    -      |
+-------------------------------------------------------------+
The ? in the stopword entry reflects some uncertainty as to
what should happen here. One approach would be for the
parser to decide that this blob is a stopword and shouldn't
be indexed. On the other hand, there is a good argument
that the IM should decide which xLOBs, or how much of each,
are indexed. Perhaps the parser should provide hints about
the type of the LOB so the IM can use the information to
make an informed decision.

     I'll use some arbitrary boundaries to break up the text
into reasonable parse events, 6 of them in this case:

+---+-------+---------------+-----------------------------------+----------------+
|PE | start |     type      |            attributes             |     value      |
+---+-------+---------------+-----------------------------------+----------------+
+---+-------+---------------+-----------------------------------+----------------+
|0  | 0     | Element       | ele_id=0 isComplete               | "tag"          |
+---+-------+---------------+-----------------------------------+----------------+
|1  | 0     | EndElementTag | ele_id=0 isComplete               | -              |
+---+-------+---------------+-----------------------------------+----------------+
|2  | 1     | PayloadWord   | offset=0                          | "The_undead_"  |
+---+-------+---------------+-----------------------------------+----------------+
|3  | 1     | PayloadWord   | offset=11                         | "confuse_the_" |
+---+-------+---------------+-----------------------------------+----------------+
|4  | 1     | PayloadWord   | offset=23 isComplete              | "evangelical"  |
+---+-------+---------------+-----------------------------------+----------------+
|5  | 2     | EndElement    | ele_id=0 match_start=0 isComplete | -              |
+---+-------+---------------+-----------------------------------+----------------+
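
     Incremental delivery of a long payload word can be
sketched as below. The boundaries in the table are arbitrary;
this sketch uses a fixed maximum piece length instead, which
is an assumption, and chunk_events and reassemble are
hypothetical helper names.

```python
def chunk_events(value, start, max_len=12):
    """Split one long payload word into continuation events."""
    events = []
    for offset in range(0, len(value), max_len):
        piece = value[offset:offset + max_len]
        attrs = {"offset": offset}
        if offset + len(piece) == len(value):
            attrs["isComplete"] = True  # last piece of the word
        events.append((start, "PayloadWord", attrs, piece))
    return events


def reassemble(events):
    """Concatenate the pieces, checking that offsets are contiguous."""
    text = ""
    for _start, _type, attrs, piece in events:
        if attrs["offset"] != len(text):
            raise ValueError("gap or overlap in continuation events")
        text += piece
    return text


word = "The_undead_confuse_the_evangelical"
pieces = chunk_events(word, start=1)
print([e[2]["offset"] for e in pieces])  # [0, 12, 24]
print(reassemble(pieces) == word)        # True
```

All pieces share the same start number, since they belong to
one payload word; only the offset advances.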


Long element names

     The previous example illustrates what  happens  when  a
long  payload  word  is encountered.  This example will show
what happens when a long element name is parsed;  a  similar
sequence would happen for a long attribute name.

  +------------------------------------------------------+
  |XML      <this-is-a-long-tag>   </this-is-a-long-tag> |
  +------------------------------------------------------+
  |start    0                      1                     |
  |ele_id   0                      0                     |
  |PW       -                      -                     |
  |SW       -                      -                     |
  +------------------------------------------------------+
I'll break this up into 5 parse events, similar to the text
example.

+---+-------+---------------+--------------------------------+------------+
|PE | start |     type      |           attributes           |   value    |
+---+-------+---------------+--------------------------------+------------+
+---+-------+---------------+--------------------------------+------------+
|0  | 0     | Element       | ele_id=0 offset=0              | "this-is-" |
+---+-------+---------------+--------------------------------+------------+
|1  | 0     | Element       | ele_id=0 offset=8              | "a-long-"  |
+---+-------+---------------+--------------------------------+------------+
|2  | 0     | Element       | ele_id=0? offset=15 isComplete | "tag"      |
+---+-------+---------------+--------------------------------+------------+
|3  | 0     | EndElementTag | ele_id=0                       | -          |
+---+-------+---------------+--------------------------------+------------+
|4  | 1     | EndElement    | ele_id=0 match_start=0         | -          |
+---+-------+---------------+--------------------------------+------------+
Note the ele_id=0? attribute. This information doesn't
change, and there is no reason to transmit it again and
again. Of course, if a data structure is provided, the
producer need not change the constant fields and the
consumer can ignore those fields in the continuation.
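
     One way a consumer might handle such continuations is
sketched below. It assumes that an event carrying an offset is
a piece of a larger item, that events without an offset are
complete (the simple case), and that constant fields such as
ele_id are taken from the first piece only. The function name
and the dict layout are hypothetical.

```python
def merge_continuations(events):
    """Merge runs of partial events into single complete events."""
    merged = []
    pending = None  # item currently being assembled, if any
    for ev in events:
        attrs = ev["attrs"]
        if "offset" not in attrs:
            merged.append(ev)  # simple case: already complete
            continue
        if pending is None:
            # First piece: keep its constant fields, drop the bookkeeping.
            constant = {k: v for k, v in attrs.items()
                        if k not in ("offset", "isComplete")}
            pending = {"type": ev["type"], "attrs": constant, "value": ""}
        pending["value"] += ev["value"]  # later pieces only add text
        if attrs.get("isComplete"):
            merged.append(pending)
            pending = None
    return merged


events = [
    {"type": "Element", "attrs": {"ele_id": 0, "offset": 0}, "value": "this-is-"},
    {"type": "Element", "attrs": {"ele_id": 0, "offset": 8}, "value": "a-long-"},
    {"type": "Element", "attrs": {"offset": 15, "isComplete": True}, "value": "tag"},
    {"type": "EndElementTag", "attrs": {"ele_id": 0}, "value": None},
    {"type": "EndElement", "attrs": {"ele_id": 0, "match_start": 0}, "value": None},
]
print(merge_continuations(events)[0])
```

The last piece in the usage example omits ele_id on purpose,
to show that the merged event does not depend on it being
retransmitted.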


Simple Attributes

     This example  illustrates  how  simple  attributes  are
parsed:

+------------------------------------------------------------------------------+
|XML      <tag   Demonic=""   Religion=   "vampiric   necromancy"   >   </tag> |
+------------------------------------------------------------------------------+
|start    0      0            0           0           0             0   1      |
|ele_id   0      -            -           -           -             0   0      |
|attrib   -      0            1           1           1             -   -      |
|AW       -      -            -           0           1             -   -      |
+------------------------------------------------------------------------------+

Here is a sample set of parse events for these attributes:

+---+-------+---------------+------------------------+--------------+
|PE | start |     type      |       attributes       |    value     |
+---+-------+---------------+------------------------+--------------+
+---+-------+---------------+------------------------+--------------+
|0  | 0     | Element       | ele_id=0               | "tag"        |
+---+-------+---------------+------------------------+--------------+
|1  | 0     | Attribute     | attrib=0               | "Demonic"    |
+---+-------+---------------+------------------------+--------------+
|2  | 0     | EndAttribute  | attrib=0 count=0       | -            |
+---+-------+---------------+------------------------+--------------+
|3  | 0     | Attribute     | attrib=1               | "Religion"   |
+---+-------+---------------+------------------------+--------------+
|4  | 0     | AttributeWord | attrib=1 AW=0          | "vampiric"   |
+---+-------+---------------+------------------------+--------------+
|5  | 0     | AttributeWord | attrib=1 AW=1          | "necromancy" |
+---+-------+---------------+------------------------+--------------+
|6  | 0     | EndAttribute  | attrib=1 count=2       | -            |
+---+-------+---------------+------------------------+--------------+
|7  | 0     | EndElementTag | ele_id=0               | -            |
+---+-------+---------------+------------------------+--------------+
|8  | 1     | EndElement    | ele_id=0 match_start=0 | -            |
+---+-------+---------------+------------------------+--------------+
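
     The attribute sequence above can be sketched as a small
generator. The attribute_events function and the tuple layout
are hypothetical, and splitting attribute values on whitespace
is an assumption; the ADJ hint is left out, as in the table.

```python
def attribute_events(start, attributes):
    """Yield events for a list of (name, value) attribute pairs."""
    for attrib, (name, value) in enumerate(attributes):
        yield (start, "Attribute", {"attrib": attrib}, name)
        words = value.split()  # an empty value yields no AttributeWord events
        for aw, word in enumerate(words):
            yield (start, "AttributeWord", {"attrib": attrib, "AW": aw}, word)
        # count records how many words the attribute value contained
        yield (start, "EndAttribute", {"attrib": attrib, "count": len(words)}, None)


attrs = [("Demonic", ""), ("Religion", "vampiric necromancy")]
for event in attribute_events(0, attrs):
    print(event)
```

All attribute events share the start number of their element,
which is why the start column stays at 0 until EndElement.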


Nested Elements

     A straightforward example of nested elements.

             +---------------------------------+
             |XML      <a>   <b>   </b>   </a> |
             +---------------------------------+
             |start    0     1     2      3    |
             |ele_id   0     1     1      0    |
             +---------------------------------+
The parse events for this would be:

+---+-------+---------------+------------------------+-------+
|PE | start |     type      |       attributes       | value |
+---+-------+---------------+------------------------+-------+
+---+-------+---------------+------------------------+-------+
|0  | 0     | Element       | ele_id=0               | "a"   |
+---+-------+---------------+------------------------+-------+
|1  | 0     | EndElementTag | ele_id=0               | -     |
+---+-------+---------------+------------------------+-------+
|2  | 1     | Element       | ele_id=1               | "b"   |
+---+-------+---------------+------------------------+-------+
|3  | 1     | EndElementTag | ele_id=1               | -     |
+---+-------+---------------+------------------------+-------+
|4  | 2     | EndElement    | match_start=1 ele_id=1 | -     |
+---+-------+---------------+------------------------+-------+
|5  | 3     | EndElement    | match_start=0 ele_id=0 | -     |
+---+-------+---------------+------------------------+-------+
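
     The numbering of start and match_start for nested
elements can be sketched with a small recursive walk. This
uses Python's xml.etree.ElementTree only to obtain the element
tree; it handles elements alone (no payload or attributes),
and element_events is a hypothetical name, not the parser's
actual entry point.

```python
import xml.etree.ElementTree as ET


def element_events(xml_text):
    """Emit Element/EndElementTag/EndElement events for elements only."""
    events = []
    counter = {"start": 0, "ele_id": 0}

    def visit(node):
        start, ele_id = counter["start"], counter["ele_id"]
        counter["start"] += 1
        counter["ele_id"] += 1
        events.append((start, "Element", {"ele_id": ele_id}, node.tag))
        events.append((start, "EndElementTag", {"ele_id": ele_id}, None))
        for child in node:
            visit(child)
        end = counter["start"]  # the end tag gets its own start number
        counter["start"] += 1
        events.append(
            (end, "EndElement", {"match_start": start, "ele_id": ele_id}, None))

    visit(ET.fromstring(xml_text))
    return events


for event in element_events("<a><b></b></a>"):
    print(event)
```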


Larger Example

     This presents a larger example which illustrates most
or all of the cases mentioned above. Note that <text> is
pre-formatted text which has non-ignorable whitespace; of
course we must have a DTD to know that.

      <root>
           <users>
                <user id="smog fog">Dragon Bone</user>
           </users>
           <text>This
      is
           wacky</text>
      </root>


+---+-------+---------------+------------------------+----------+
|PE | start |     type      |       attributes       |  value   |
+---+-------+---------------+------------------------+----------+
+---+-------+---------------+------------------------+----------+
|0  | 0     | Element       | ele_id=0               | "root"   |
+---+-------+---------------+------------------------+----------+
|1  | 0     | EndElementTag | ele_id=0               | -        |
+---+-------+---------------+------------------------+----------+
|2  | 1     | Element       | ele_id=1               | "users"  |
+---+-------+---------------+------------------------+----------+
|3  | 1     | EndElementTag | ele_id=1               | -        |
+---+-------+---------------+------------------------+----------+
|4  | 2     | Element       | ele_id=2               | "user"   |
+---+-------+---------------+------------------------+----------+
|5  | 2     | Attribute     | attrib=0               | "id"     |
+---+-------+---------------+------------------------+----------+
|6  | 2     | AttributeWord | attrib=0 AW=0 ADJ=1    | "smog"   |
+---+-------+---------------+------------------------+----------+
|7  | 2     | AttributeWord | attrib=0 AW=1          | "fog"    |
+---+-------+---------------+------------------------+----------+
|8  | 2     | EndAttribute  | attrib=0 count=2       | -        |
+---+-------+---------------+------------------------+----------+
|9  | 2     | EndElementTag | ele_id=2               | -        |
+---+-------+---------------+------------------------+----------+
|10 | 3     | PayloadWord   | SW=0 PW=0 ADJ=1        | "Dragon" |
+---+-------+---------------+------------------------+----------+
|11 | 4     | PayloadWord   | SW=0 PW=1              | "Bone"   |
+---+-------+---------------+------------------------+----------+
|12 | 5     | EndElement    | ele_id=2 match_start=2 | -        |
+---+-------+---------------+------------------------+----------+
|13 | 6     | EndElement    | ele_id=1 match_start=1 | -        |
+---+-------+---------------+------------------------+----------+
|14 | 7     | Element       | ele_id=4               | "text"   |
+---+-------+---------------+------------------------+----------+
|15 | 7     | EndElementTag | ele_id=4               | -        |
+---+-------+---------------+------------------------+----------+
|16 | 8     | PayloadWord   | SW=0 PW=2 ADJ=1        | "This"   |
+---+-------+---------------+------------------------+----------+
|17 | -     | WhiteSpace    | ADJ=1                  | "\n"     |
+---+-------+---------------+------------------------+----------+
|18 | 9     | PayloadWord   | SW=1 PW=3 ADJ=1        | "is"     |
+---+-------+---------------+------------------------+----------+
|19 | -     | WhiteSpace    | ADJ=1                  | "\n\t"   |
+---+-------+---------------+------------------------+----------+
|20 | 10    | PayloadWord   | SW=0 PW=4 ADJ=1        | "wacky"  |
+---+-------+---------------+------------------------+----------+
|21 | 11    | EndElement    | ele_id=4 match_start=7 | -        |
+---+-------+---------------+------------------------+----------+
|22 | 12    | EndElement    | ele_id=0 match_start=0 | -        |
+---+-------+---------------+------------------------+----------+
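
     A consumer can use match_start to check that the element
events nest properly. The sketch below is a minimal stack
check under the assumption that every Element event opens a
new element (continuation pieces of long names are not
handled); check_nesting and the tuple layout are hypothetical.

```python
def check_nesting(events):
    """Return True if every EndElement matches the most recent open Element."""
    open_elements = []  # stack of (start, ele_id) for elements still open
    for start, etype, attrs in events:
        if etype == "Element":
            open_elements.append((start, attrs["ele_id"]))
        elif etype == "EndElement":
            if not open_elements:
                return False  # end without a matching begin
            if (attrs["match_start"], attrs["ele_id"]) != open_elements.pop():
                return False  # closes something other than the innermost element
    return not open_elements  # everything opened must have been closed


# Element and EndElement events taken from the table above.
larger = [
    (0, "Element", {"ele_id": 0}),
    (1, "Element", {"ele_id": 1}),
    (2, "Element", {"ele_id": 2}),
    (5, "EndElement", {"ele_id": 2, "match_start": 2}),
    (6, "EndElement", {"ele_id": 1, "match_start": 1}),
    (7, "Element", {"ele_id": 4}),
    (11, "EndElement", {"ele_id": 4, "match_start": 7}),
    (12, "EndElement", {"ele_id": 0, "match_start": 0}),
]
print(check_nesting(larger))  # True
```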