Unix Internals










                     Niagara Numbering
                            and
                       Posting Format


Data Manager

     Niagara has two numbering schemes, one used by the Data
Manager and the other used by the Index Manager.  Both  num-
bering  schemes  have  a  common document_id or doc_id which
identifies the document in question.

The DM uses one additional number, the element_id or  ele_id
to  uniquely identify each XML element in the document.  The
element IDs are assigned in prefix traversal, with the  root
element  have id 1.  This corresponds to the order that ele-
ment start tags are seen in the document parse.

                 +-------------------------+
                 |      Data Manager       |
                 +-------------------------+
                 +------------+------------+
                 |document_id | element_id |
                 +------------+------------+
For example, element_ids are assigned as follows in this XML
document.

     
         Unix Internals
         Vahallia
     


                 +------------------------+
                 |element      element_id |
                 +------------------------+
                 |           1      |
                 |          2      |
                 |<author>         3      |
                 |<lastname>       4      |
                 +------------------------+

     Note  that  the  DM  numbering is slightly more complex
than mentioned above.  If a node contains only payload  text
and  no  descendants,  or  only  contains descendants and no
text, the numbering is as mentioned above.  If  a  node  has
both  payload text and descendants, the payload text between
each descendant is treated as a  built-in  text  descendant.
Using a more complex example to illustrate this case:











                             -2-


      <em>
           This is a test of the
           <strong>emergency broadcast system</strong>
           This is only a test
      </em>

The translated internal version of this is

      <em>
          <text>This is a test of the</text>
          <strong>emergency broadcast system</strong>
          <text>This is only a test</text>
      </em>


                       +-------------+
                       |<em>       1 |
                       |<text>     2 |
                       |<strong>   3 |
                       |<text>     4 |
                       +-------------+

Index Manager

     The  IM assigns IDs to more portions of a document than
the DM, which only numbers the elements of a document.   The
IM numbering also reflects document structure and is used to
analyze it when querying.  The IM numbers each payload  word
and  XML  element  discovered  in  the document in the order
which they are parsed.  This number is known  as  the  start
number.   The  IM  also  has the nesting level, which is the
depth of the item from the root of the document.   For  ele-
ments  which are stored in the IM, an end number is assigned
to each element.  The end number for an element is the  ele-
ment  ID  of the closing tag of that element.  The start and
end numbers allow document structure to be  analyzed.   Data
stored  in  the IM for elements also contains the above men-
tioned DM element_id of that element, which allows  IM  ele-
ments to be mapped to DM elements.

      +----------------------------------------------+
      |                 IM Elements                  |
      +----------------------------------------------+
      +------------+------------+-------+-----+------+
      |document_id | element_id | start | end | nest |
      +------------+------------+-------+-----+------+

Payload  words which are indexed in the IM do not require an
end number, as the word begins and ends at the same  number.














                             -3-


         +----------------------------------------+
         |           IM Payload Words             |
         +----------------------------------------+
         +------------+------------+-------+------+
         |document_id | element_id | start | nest |
         +------------+------------+-------+------+

     Using  the  earlier  XML example again, I'll illustrate
how start numbers are assigned.

     <book id="12345">
         <title>Unix Internals
         Vahallia
     


                   +--------------------+
                   |   start numbers    |
                   +--------------------+
                   |item          start |
                   +--------------------+
                   |        1     |
                   |id=           2     |
                   |"12345"       3     |
                   |       4     |
                   |Unix          5     |
                   |Internals     6     |
                   |      7     |
                   |      8     |
                   |    9     |
                   |Vahallia      10    |
                   |   11    |
                   |     12    |
                   |       13    |
                   +--------------------+
The element postings corresponding to these numbers would be

       +---------------------------------------------+
       |              Element Postings               |
       +---------------------------------------------+
       |element      element_id   start   end   nest |
       +---------------------------------------------+
       |       1            1       13    0    |
       |      2            4       7     1    |
       |<author>     3            8       12    1    |
       |<lastname>   4            9       11    2    |
       +---------------------------------------------+
The payload word postings for this would be















                             -4-


          +--------------------------------------+
          |        Payload Word Postings         |
          |word        element_id   start   nest |
          +--------------------------------------+
          |Unix        2            5       2    |
          |Internals   2            6       2    |
          |Vahallia    4            10      3    |
          +--------------------------------------+

Problems and Considerations

     The IM and the DM are independent entities, each gener-
ating their own numbers.  However,  the  element_id  numbers
generated  by both parsers must match exactly.  This creates
an implicit dependency between the two which must  be  care-
fully maintained for the system to work correctly.

     As the IM and the DM use different numbering schemes to
represent elements, there is a storage price  which  the  IM
must pay to keep both sets of IDs stored.  This happens fort
both payload word and element postings, as queries on  words
must  also  map to the element they are contained in for the
DM to locate them.  While this overhead is small, it is mul-
tiplied  by the number of postings and increases the storage
requirement of elements in the IM by a factor  of  25%,  and
that of words by 33%.

     The  only  mapping  that exists is the one way start to
element_id mapping stored in the IM's postings.  There is no
general  purpose  map  available  to  convert element_ids to
start numbers, and  vice-versa.  This prevents Niagara  from
using data found in the DM to generate efficient IM queries,
as there is no start, end context  available  to  limit  the
search.

     It is difficult to perform IM searches on document pay-
load.  At  best  we  can  provide  non-exact  query  results
against  payload,  versus the exact results that is provided
for a structural query.   This  occurs  when  elements  come
between  payload  words.  The elements have no impact on the
payload context, but cause distance calculations  for  adja-
ceny approximations to go through the roof.  For example

This is a <adj>critically</adj> <em level="3">important<em> issue.



















                             -5-


                    +-------------------+
                    |start numbering    |
                    +-------------------+
                    |word         start |
                    +-------------------+
                    |This         1     |
                    |is           2     |
                    |a            3     |
                    |critically   5     |
                    |important    10    |
                    |issue        12    |
                    +-------------------+
If  we  want  to find "critically important issue", they are
actually adjacent in the document, but their  start  numbers
are  far  apart  because  of  intervening elements.  As this
example shows, elements with attributes will  increase  that
distance  further.   The IM can only provide inexact payload
matches which must be filtered at a higher level with access
to document in the DM.

Solutions

     We  will eliminate the duplication in numbering efforts
between the Data Manager and the Index Manager.  Instead the
IM start number will be used to identify elements in the DM.
This will allow a storage saving in the  numerous  IM  post-
ings.   It also provides a one-to-one mapping between the DM
and the IM elements, so that one doesn't  loose  information
when extracting query results from the DM.

     The  number-generating dependency will be eliminated by
using a common parser that will generate start  numbers  for
both  the  IM  and the DM.  This will remove that cross-sub-
system coupling.  For details, check out the Parsing Streams
vignette.

     For  payload  words  we  introduce a seperate numbering
scheme, where there is a payload word number that is  incre-
mented  only as new payload words are discovered.  Using the
above example

                   +---------------------+
                   |payload numbering    |
                   +---------------------+
                   |word         payload |
                   +---------------------+
                   |This         1       |
                   |is           2       |
                   |a            3       |
                   |critically   4       |
                   |important    5       |
                   |issue        6       |
                   +---------------------+
With this payload  word  numbering  the  IM  provides  exact









                             -6-


matches  on  payload  queries that do not contain stopwords,
instead of only approximate results which must be  filtered.

Another advantage conveyed by this scheme is that it closely
follows traditional IR practices in numbering textual  docu-
ment content.  This allows us to use much of the research of
the IR field for IM queries.   With  the  current  numbering
scheme,  both  stopwords  and  elements  can be interleaving
between payload words.  Due to  this  interleaving  of  ele-
ments,  it  is  not  possible to use traditional IR stopword
frequency techniques to provide fuzzy distances for  inexact
matches  in  distance expressions.  As a result Niagara must
use a much larger fuzzy distance as a metric for nearness of
words,  which  corresponds  to  a larger space which must be
searched by actual retrieval.

Notes

     Attributes are now indexed  --  see  the  vignette  for
details.   That  allows  comparable  exact  structural query
result for structural attribute queries, as  well  as  exact
results for many attribute word queries.  Notice, that since
attribute  words  are  numbered  as  payload,  that  queries
involving  stopwords  can  be  based upon text stopword fre-
quency.



































</pre>
</html>