Multi Attribute Indices


Current

     Niagara  also  has  the  concept  of  a multi-attribute
index, A multi-attribute index is an index which has a  com-
plex key, such as a path expression.  A path expression such
as a/b has two components, a and b.   The  multi  aspect  of
this  index  comes  from the existence of multiple postings,
one corresponding to each element of  the  path  expression.
However, they can be used as general purpose join indices as
long as a mapping between a join condition  and  the  multi-
attribute result may be generated.

Instead of a term being a key to the index, the path expres-
sion, essentially a cached query result, is the key  to  the
index.   These  indices  are  derived  indices, which can be
rebuilt from the document, or from the normally indexed por-
tion of a document.  Currently there are two types of multi-
attribute indices in Niagara.

The first type are the implicit MA indices used for the cur-
rent idref and element+attribute indices.  These indices are
built in-memory when a document is indexed, and then written
to  disk.  As such, they can only be built when the document
is indexed, and the information to build them is lost after-
ward.

The  second  type of multi-attribute indices are those built
explicitly to record path expressions.  For example a/b/c is
a  multi-attribute  path  consisting of 3 path elements, and
there would be a resulting multi-attribute posting  consist-
ing of the locations of matching 'a/b/c' postings.

Goals and Problems

     Multi-attribute indices are a powerful tool in Niagara.
They allow the computation of  expensive  results  to  occur
once  and  then  to  be stored -- but the result can be used
again and again.  However,  their  management  is  currently
rather  ad-hoc.   As mentioned earlier, some multi-attribute
indices are implicit -- they  are  only  built  at  document
index  time,  as a side-effect of indexing.  Note that it is
currently not possible to generate  these  implicit  indices
after  the  fact,  through  a  query,  since the information
needed to generate the index is lost after the  document  is
indexed.

     Multiple  attribute  indices have a key which conceptu-
ally consists of a number of independent fields, from  which









                             -2-


the  results are derived.  The result of the multi-attribute
index can also have multiple fields corresponding  to  those
elements  in  the  key.   Note,  however, that it may not be
desirable or possible to save all of the  results  generated
by the query.  For example, the element+attribute index cur-
rently in use has two key values, but only one result  value
-- the element the attribute is stored in.  As another exam-
ple, say a index is generated for the path expression a/b/c.
It  may  only  be desirable to save 1 "column" of results, c
from this query -- if that is all that  is  used  in  actual
queries.   With  other  path indices it may be desireable to
save only some other subset of the results  ..  maybe  a,  c
because  B  is  never  used,  it  is just cared about in the
structure that generated the result.  However, some  queries
will  undoubtedly  want  to keep all the information and the
full a, b, c path.

     There is also a  problem  of  interpreting  the  multi-
attribute results, as the type of the multi-attribute result
is not known.  In addition to path elements it could consist
of a combination of other elements.

Solutions

     To  generalize  the  use  of multi-attribute indices in
Niagara, we'll create catalog information  for  each  multi-
attribute  index.  At a high level, the catalog will contain
information about what data is available in the index, which
fields  from  the  path  are stored.  The other info it will
contain is the type of each posting stored.

     For example, say an index was on the  path  a/b/c.   If
     all the information was stored the catalog would say it
     each posting consists of three sub-postings, and  would
     have the size of the posting.  The catalog catalog will
     also note that each  of  those  sub-postings  contained
     information about an element, aka type information.

     If  we  only  wanted  to  know about the c paths in the
     above, the catalog would say that  there  is  only  one
     piece  of information in the posting.  That info corre-
     sponds to the last element of the path statement.  It's
     type is reference-to-element.

     As  arbitrary path expressions or query expressions may
     be used as the key of a multiple-attribute index, it is
     possible  that the index may contain different types of
     postings in its multiple attributes.  For  example,  we
     could generate an index on the expression as a/b/c con-
     tains "xyzzy".  The posting could consist of 4  fields,
     consisting  of 3 elements postinsg and one payload text
     posting.  As another example, try a/b[@="xyz"]/c.  This
     could  contain  two elements and an attribute, or three
     elements and an attribute, or some other combination of









                             -3-


     result postings.

     The  catalog  eliminates  implicit  knowledge  of  what
multi-attribute indices contain.  It replaces  the  implicit
knowledge with explicit information about what is contained.
Explicit knowledge is flexible and changeable and it  elimi-
nates  lots  of  problems  in the process.  The system is no
longer inhibited by the implicit knowledge, as anything  can
be done by putting info about it in the catalog.

     Implicit  multi-attribute  indices  will go away, to be
replaced by generic multi-attribute  indices  for  the  same
data.

Implementation

     At  this  point  in  time we do not have implementation
details for this portion of the system.  Providing this gen-
eral  concept  early in the game will guide the develeopment
of the system enough to allow us to  handle  those  problems
when  we  get  to  them.   In  the  interim  we can move the
implicit knowledge of multi-attribute queries outside of the
IM  layer,  and let that knowledge reside in the SEQL opera-
tors.  Or, we could provide an interface that  makes  every-
thing  look  right,  but  not  allow  for  anything but path
matches result queries until later.

     There are also questions  about  having  the  optimizer
generate  hints  about what multi-attribute indices would be
useful.

     Note that the catalog info  about  the  multi-attribute
indices is similar to the catalog info needed for the multi-
attribute SEQL streams which will replace the  current  sin-
gle-attribute SEQL streams.