Multi Attribute Indices
Current
Niagara also has the concept of a multi-attribute
index, A multi-attribute index is an index which has a com-
plex key, such as a path expression. A path expression such
as a/b has two components, a and b. The multi aspect of
this index comes from the existence of multiple postings,
one corresponding to each element of the path expression.
However, they can be used as general purpose join indices as
long as a mapping between a join condition and the multi-
attribute result may be generated.
Instead of a term being a key to the index, the path expres-
sion, essentially a cached query result, is the key to the
index. These indices are derived indices, which can be
rebuilt from the document, or from the normally indexed por-
tion of a document. Currently there are two types of multi-
attribute indices in Niagara.
The first type are the implicit MA indices used for the cur-
rent idref and element+attribute indices. These indices are
built in-memory when a document is indexed, and then written
to disk. As such, they can only be built when the document
is indexed, and the information to build them is lost after-
ward.
The second type of multi-attribute indices are those built
explicitly to record path expressions. For example a/b/c is
a multi-attribute path consisting of 3 path elements, and
there would be a resulting multi-attribute posting consist-
ing of the locations of matching 'a/b/c' postings.
Goals and Problems
Multi-attribute indices are a powerful tool in Niagara.
They allow the computation of expensive results to occur
once and then to be stored -- but the result can be used
again and again. However, their management is currently
rather ad-hoc. As mentioned earlier, some multi-attribute
indices are implicit -- they are only built at document
index time, as a side-effect of indexing. Note that it is
currently not possible to generate these implicit indices
after the fact, through a query, since the information
needed to generate the index is lost after the document is
indexed.
Multiple attribute indices have a key which conceptu-
ally consists of a number of independent fields, from which
-2-
the results are derived. The result of the multi-attribute
index can also have multiple fields corresponding to those
elements in the key. Note, however, that it may not be
desirable or possible to save all of the results generated
by the query. For example, the element+attribute index cur-
rently in use has two key values, but only one result value
-- the element the attribute is stored in. As another exam-
ple, say a index is generated for the path expression a/b/c.
It may only be desirable to save 1 "column" of results, c
from this query -- if that is all that is used in actual
queries. With other path indices it may be desireable to
save only some other subset of the results .. maybe a, c
because B is never used, it is just cared about in the
structure that generated the result. However, some queries
will undoubtedly want to keep all the information and the
full a, b, c path.
There is also a problem of interpreting the multi-
attribute results, as the type of the multi-attribute result
is not known. In addition to path elements it could consist
of a combination of other elements.
Solutions
To generalize the use of multi-attribute indices in
Niagara, we'll create catalog information for each multi-
attribute index. At a high level, the catalog will contain
information about what data is available in the index, which
fields from the path are stored. The other info it will
contain is the type of each posting stored.
For example, say an index was on the path a/b/c. If
all the information was stored the catalog would say it
each posting consists of three sub-postings, and would
have the size of the posting. The catalog catalog will
also note that each of those sub-postings contained
information about an element, aka type information.
If we only wanted to know about the c paths in the
above, the catalog would say that there is only one
piece of information in the posting. That info corre-
sponds to the last element of the path statement. It's
type is reference-to-element.
As arbitrary path expressions or query expressions may
be used as the key of a multiple-attribute index, it is
possible that the index may contain different types of
postings in its multiple attributes. For example, we
could generate an index on the expression as a/b/c con-
tains "xyzzy". The posting could consist of 4 fields,
consisting of 3 elements postinsg and one payload text
posting. As another example, try a/b[@="xyz"]/c. This
could contain two elements and an attribute, or three
elements and an attribute, or some other combination of
-3-
result postings.
The catalog eliminates implicit knowledge of what
multi-attribute indices contain. It replaces the implicit
knowledge with explicit information about what is contained.
Explicit knowledge is flexible and changeable and it elimi-
nates lots of problems in the process. The system is no
longer inhibited by the implicit knowledge, as anything can
be done by putting info about it in the catalog.
Implicit multi-attribute indices will go away, to be
replaced by generic multi-attribute indices for the same
data.
Implementation
At this point in time we do not have implementation
details for this portion of the system. Providing this gen-
eral concept early in the game will guide the develeopment
of the system enough to allow us to handle those problems
when we get to them. In the interim we can move the
implicit knowledge of multi-attribute queries outside of the
IM layer, and let that knowledge reside in the SEQL opera-
tors. Or, we could provide an interface that makes every-
thing look right, but not allow for anything but path
matches result queries until later.
There are also questions about having the optimizer
generate hints about what multi-attribute indices would be
useful.
Note that the catalog info about the multi-attribute
indices is similar to the catalog info needed for the multi-
attribute SEQL streams which will replace the current sin-
gle-attribute SEQL streams.