Indexing Attributes in Niagara


Current

     In  general,  Niagara  does  not  index  attributes nor
attribute values.

What Niagara currently does is to index attribute tags, with
an  element-name+attribute-name  value.  This can be thought
of as a multi-attribute index which has only one value asso-
ciated  with  it.   The  only  attribute  value that Niagara
indexes are IDREFs.  These is also  a  derived  index  which
maps IDREFs to their target IDs.

Goals and Problems

     Firstly,  Niagara  should  be  able  to index attribute
tags, as well as attribute values, in  a  general  way.   It
should  be  possible  to  perform general purpose queries on
attribute related values, instead of being restricted to  an
arbitrary subset of pre-computed things.  For example, it is
only possible to  perform  element+attribute  index  lookups
now, where the element must be known.  It is not possible to
find all occurences of an attribute  without  an  exhaustive
search.   A  more  general  attribute  handling scheme would
allow for general purpose queries against attributes.

     There is also the problem that attributes exist outside
of  the  payload  and  natural order of a XML document.  For
example in the fragment

  some  important  text

the payload some is immediately followed by  important,  not
this  is  a  test.   Some  searches  will  only want to find
attribute tags, others  attribute  values,  and  others  yet
tag+value.   In  addition,  some queries will want to search
both the attribute space and the payload space, while  other
queries  will  want  to search only the payload space or the
attribute space.

     The IM does not store all  the  information  about  the
document  that is required to build these multiple-attribute
indices.

Solutions

     We have developed  a  general  indexing  mechanism  for
attributes  and attribute values.  This mechanism allows for
all types of attribute search, not just  the  limited  types









                             -2-


available  now.   The  current  element+attribute  and idref
indices  are  actually  implicit  multi-attribute   indices.
These  implicit  indices  will be removed, to be replaced by
ordinary multi-attribute  indices  which  produce  the  same
result.   However,  as all the info needed to generate these
indices will now be available in the IM, there is no need to
compute  those  indexes  in memory as a document is indexed.
The  results  can  be  generated  through  ordinary  queries
against  the  IM data.  This allows these indexes to be cre-
ated on arbitrarily large documents,  without  requiring  an
arbitrarily large amount of memory to compute the result.

     This  mechanism  also  allows  for querying payload and
attribute text seperatedly from each other. to use the IM to
it's  full  effectiveness.   This  is  especially  important
because  we  can  generate  correct  structural  queries  on
attributes.  It is also possible to generate accurate super-
set structural queries on attribute values.  It also  solves
the  problems  of  attribute ordering which we have examined
from several directions and have not developed a solution to
before.

     The  flip side of this solution is that you need to run
both payload and attribute queries, and merge their results,
if both payload and attribute must be searched.  However, we
think this is an excellent tradeoff to having the capability
to  do  exact  structural queries on all data in the IM.  As
the difficult ordering problem is also solved, this is good.

Implementation

     The   current   idref  and  element+attribute  implicit
derived indices will be removed, at least in  concept.   The
idref index may remain as generated by the loader, though it
not need be.

     With that complete, we are left with text  and  element
indices.   We  will now rename the text index to be the pay-
load word index -- it indexes words which appear in the pay-
load of an XML document.  At this point we will add two more
indexes.  The first is the  attribute  index,  which  is  an
index of attribute tags.  Postings for this index are of the
form

           +------------------------------------+
           |             Attribute              |
           +------------------------------------+
           +-------+--------+-------+-----------+
           |doc_id | ele_id | start | attribute |
           +-------+--------+-------+-----------+

     A second index for attribute words will also be added











                             -3-


 +---------------------------------------------------------+
 |                     Attribute Word                      |
 +---------------------------------------------------------+
 +-------+------------+-------+-----------+----------------+
 |doc_id | element_id | start | attribute | attribute_word |
 +-------+------------+-------+-----------+----------------+

     Doc_id, ele_id, and start are the expected fields  com-
mon to most entries.

     Attribute  tags  and  values  have  their own numbering
scheme, outside of the numbering of the payload of the docu-
ment.   All  attribute  numbers are relevant only within the
element those those  attributes  and  attribute  words  were
found in.  For example, with a element containing attributes
of ...



we  would  assign  the  following  values.   Note  that  the
attribute  is  assigned the same element_id and start number
as the element they are part of, since they have no position
in the payload of the document

           +-------------------------------------+
           |text      attribute   attribute word |
           +-------------------------------------+
           |attr1         1                      |
           |8675309       1             1        |
           |attr2         2                      |
           |Niagara       2             1        |
           |QE            2             3        |
           |attr3         3                      |
           |foo           3             1        |
           |bar           3             2        |
           +-------------------------------------+

Note  that  I  treated  is  as a stopword to illustrate that
stopwords, though not indexed, still need to count as  words
for determining distance.

     Attribute is a locally numbered identifier; the numbers
are local to each element  which  contains  attributes.   As
such,  the  same  attribute value (0, 1, 2) will appear many
times.  The numbers will be assigned in simple first to last
order of occurence in the source element.  As attributes are
not ordered, these numbers have no meaning beyond  identify-
ing multiple attributes in an element.  However there is one
ordering implication these scheme has: it allows  the  docu-
ment to be reconstructed in the same form as it was input to
the system.

Attribute word is another identifier which is  locally  num-
bered.   It  is  the  position  of  this  word  within  this









                             -4-


attribute.

Alternate Attribute Indexes

     An alternate scheme for storing attributes with differ-
ent  tradeoffs  is presented here.  The earlier scheme seems
to have a number of advantages for query processing which is
why it is presented in preference.  For example, it is triv-
ial to locate unknown attributes with a desired content with
the attribute numbering scheme; With this scheme you need to
do a join to determine that information.  However,  the  two
schemes  are  more or less equivalent and either can express
the same content and provide for identical  queries  on  it.
It  is  just  a  question  of  how  resource intensive those
queries are.

     Explain the alternate attribute numbering scheme.

    +--------------------------------------------------+
    |                    Attribute                     |
    +--------------------------------------------------+
    +-------+--------+-------+-------------+-----------+
    |doc_id | ele_id | start | local_start | local_end |
    +-------+--------+-------+-------------+-----------+


          +--------------------------------------+
          |           Attribute Word             |
          +--------------------------------------+
          +-------+--------+-------+-------------+
          |doc_id | ele_id | start | local_start |
          +-------+--------+-------+-------------+

     Local start is the position in the enclosing element at
which this attribute or attribute word starts.  Local end is
the  position  in  the  enclosing  element  at  which   this
attribute  ends.  Actually it is a position corresponding to
the END of the attribute value, not the  last  word  of  the
attribute  value.   This  is  done  so that certain kinds of
queries work as expected with start,end numbers not  includ-
ing  the  things  that  they  contain.  It also ensures that
there is a break between subsequent attributes.   For  exam-
ple, with a element containing attributes of ...



we  would  assign  the  following  values.   Note  that  the
attribute is assigned the same element_id and  start  number
as the element they are part of, since they have no position
in the payload of the document













                             -5-


            +----------------------------------+
            |text      local_start   local_end |
            +----------------------------------+
            |attr1          1            3     |
            |8675309        2                  |
            |attr2          4            8     |
            |Niagara        5                  |
            |QE             7                  |
            |attr3          9           12     |
            |foo           10                  |
            |bar           11                  |
            +----------------------------------+

Note that the local start and end numbers act similarily  to
the global start and end numbers to "enclose" the content of
the attribute.

Notes

     After writing this document I noticed some things about
the  two  above  representations.   The  first  is  that the
attribute representation doesn't allow you to easily  deter-
mine  if  a  desired  word  is  anchored  at  the end of the
attribute, such as a '$' query in  grep.   To  provide  that
functionality  the attribute tuple needs to be extended with
an extra entry, the number of words in the attribute:

  +------------------------------------------------------+
  |                      Attribute                       |
  +------------------------------------------------------+
  +-------+--------+-------+-----------+-----------------+
  |doc_id | ele_id | start | attribute | attribute_words |
  +-------+--------+-------+-----------+-----------------+

     Of course, this matters only if '$' tagged text queries
are  important,  but  it  is  something  that can be payload
queried in the payload part of the query.  The  two  formats
trade space versus query efficiency.  I still like the pref-
ered format, but note that the alternate  does  seem  to  be
orthogonal with the rest of the system, as well as requiring
less storage.  I believe the best way to  determine  between
the two formats is to write a bunch of queries and see which
format works better in the query engine.

     XXXX move this somewhere else, the new IM document
We are proposing as part of our changes (see  another  docu-
ment)  to  eliminate element_id and use the start numbers as
element IDs as well.