Indexing Attributes in Niagara
Current
In general, Niagara does not index attributes nor
attribute values.
What Niagara currently does is to index attribute tags, with
an element-name+attribute-name value. This can be thought
of as a multi-attribute index which has only one value asso-
ciated with it. The only attribute value that Niagara
indexes are IDREFs. These is also a derived index which
maps IDREFs to their target IDs.
Goals and Problems
Firstly, Niagara should be able to index attribute
tags, as well as attribute values, in a general way. It
should be possible to perform general purpose queries on
attribute related values, instead of being restricted to an
arbitrary subset of pre-computed things. For example, it is
only possible to perform element+attribute index lookups
now, where the element must be known. It is not possible to
find all occurences of an attribute without an exhaustive
search. A more general attribute handling scheme would
allow for general purpose queries against attributes.
There is also the problem that attributes exist outside
of the payload and natural order of a XML document. For
example in the fragment
some important text
the payload some is immediately followed by important, not
this is a test. Some searches will only want to find
attribute tags, others attribute values, and others yet
tag+value. In addition, some queries will want to search
both the attribute space and the payload space, while other
queries will want to search only the payload space or the
attribute space.
The IM does not store all the information about the
document that is required to build these multiple-attribute
indices.
Solutions
We have developed a general indexing mechanism for
attributes and attribute values. This mechanism allows for
all types of attribute search, not just the limited types
-2-
available now. The current element+attribute and idref
indices are actually implicit multi-attribute indices.
These implicit indices will be removed, to be replaced by
ordinary multi-attribute indices which produce the same
result. However, as all the info needed to generate these
indices will now be available in the IM, there is no need to
compute those indexes in memory as a document is indexed.
The results can be generated through ordinary queries
against the IM data. This allows these indexes to be cre-
ated on arbitrarily large documents, without requiring an
arbitrarily large amount of memory to compute the result.
This mechanism also allows for querying payload and
attribute text seperatedly from each other. to use the IM to
it's full effectiveness. This is especially important
because we can generate correct structural queries on
attributes. It is also possible to generate accurate super-
set structural queries on attribute values. It also solves
the problems of attribute ordering which we have examined
from several directions and have not developed a solution to
before.
The flip side of this solution is that you need to run
both payload and attribute queries, and merge their results,
if both payload and attribute must be searched. However, we
think this is an excellent tradeoff to having the capability
to do exact structural queries on all data in the IM. As
the difficult ordering problem is also solved, this is good.
Implementation
The current idref and element+attribute implicit
derived indices will be removed, at least in concept. The
idref index may remain as generated by the loader, though it
not need be.
With that complete, we are left with text and element
indices. We will now rename the text index to be the pay-
load word index -- it indexes words which appear in the pay-
load of an XML document. At this point we will add two more
indexes. The first is the attribute index, which is an
index of attribute tags. Postings for this index are of the
form
+------------------------------------+
| Attribute |
+------------------------------------+
+-------+--------+-------+-----------+
|doc_id | ele_id | start | attribute |
+-------+--------+-------+-----------+
A second index for attribute words will also be added
-3-
+---------------------------------------------------------+
| Attribute Word |
+---------------------------------------------------------+
+-------+------------+-------+-----------+----------------+
|doc_id | element_id | start | attribute | attribute_word |
+-------+------------+-------+-----------+----------------+
Doc_id, ele_id, and start are the expected fields com-
mon to most entries.
Attribute tags and values have their own numbering
scheme, outside of the numbering of the payload of the docu-
ment. All attribute numbers are relevant only within the
element those those attributes and attribute words were
found in. For example, with a element containing attributes
of ...
we would assign the following values. Note that the
attribute is assigned the same element_id and start number
as the element they are part of, since they have no position
in the payload of the document
+-------------------------------------+
|text attribute attribute word |
+-------------------------------------+
|attr1 1 |
|8675309 1 1 |
|attr2 2 |
|Niagara 2 1 |
|QE 2 3 |
|attr3 3 |
|foo 3 1 |
|bar 3 2 |
+-------------------------------------+
Note that I treated is as a stopword to illustrate that
stopwords, though not indexed, still need to count as words
for determining distance.
Attribute is a locally numbered identifier; the numbers
are local to each element which contains attributes. As
such, the same attribute value (0, 1, 2) will appear many
times. The numbers will be assigned in simple first to last
order of occurence in the source element. As attributes are
not ordered, these numbers have no meaning beyond identify-
ing multiple attributes in an element. However there is one
ordering implication these scheme has: it allows the docu-
ment to be reconstructed in the same form as it was input to
the system.
Attribute word is another identifier which is locally num-
bered. It is the position of this word within this
-4-
attribute.
Alternate Attribute Indexes
An alternate scheme for storing attributes with differ-
ent tradeoffs is presented here. The earlier scheme seems
to have a number of advantages for query processing which is
why it is presented in preference. For example, it is triv-
ial to locate unknown attributes with a desired content with
the attribute numbering scheme; With this scheme you need to
do a join to determine that information. However, the two
schemes are more or less equivalent and either can express
the same content and provide for identical queries on it.
It is just a question of how resource intensive those
queries are.
Explain the alternate attribute numbering scheme.
+--------------------------------------------------+
| Attribute |
+--------------------------------------------------+
+-------+--------+-------+-------------+-----------+
|doc_id | ele_id | start | local_start | local_end |
+-------+--------+-------+-------------+-----------+
+--------------------------------------+
| Attribute Word |
+--------------------------------------+
+-------+--------+-------+-------------+
|doc_id | ele_id | start | local_start |
+-------+--------+-------+-------------+
Local start is the position in the enclosing element at
which this attribute or attribute word starts. Local end is
the position in the enclosing element at which this
attribute ends. Actually it is a position corresponding to
the END of the attribute value, not the last word of the
attribute value. This is done so that certain kinds of
queries work as expected with start,end numbers not includ-
ing the things that they contain. It also ensures that
there is a break between subsequent attributes. For exam-
ple, with a element containing attributes of ...
we would assign the following values. Note that the
attribute is assigned the same element_id and start number
as the element they are part of, since they have no position
in the payload of the document
-5-
+----------------------------------+
|text local_start local_end |
+----------------------------------+
|attr1 1 3 |
|8675309 2 |
|attr2 4 8 |
|Niagara 5 |
|QE 7 |
|attr3 9 12 |
|foo 10 |
|bar 11 |
+----------------------------------+
Note that the local start and end numbers act similarily to
the global start and end numbers to "enclose" the content of
the attribute.
Notes
After writing this document I noticed some things about
the two above representations. The first is that the
attribute representation doesn't allow you to easily deter-
mine if a desired word is anchored at the end of the
attribute, such as a '$' query in grep. To provide that
functionality the attribute tuple needs to be extended with
an extra entry, the number of words in the attribute:
+------------------------------------------------------+
| Attribute |
+------------------------------------------------------+
+-------+--------+-------+-----------+-----------------+
|doc_id | ele_id | start | attribute | attribute_words |
+-------+--------+-------+-----------+-----------------+
Of course, this matters only if '$' tagged text queries
are important, but it is something that can be payload
queried in the payload part of the query. The two formats
trade space versus query efficiency. I still like the pref-
ered format, but note that the alternate does seem to be
orthogonal with the rest of the system, as well as requiring
less storage. I believe the best way to determine between
the two formats is to write a bunch of queries and see which
format works better in the query engine.
XXXX move this somewhere else, the new IM document
We are proposing as part of our changes (see another docu-
ment) to eliminate element_id and use the start numbers as
element IDs as well.