Indexing Attributes in Niagara Current In general, Niagara does not index attributes nor attribute values. What Niagara currently does is to index attribute tags, with an element-name+attribute-name value. This can be thought of as a multi-attribute index which has only one value asso- ciated with it. The only attribute value that Niagara indexes are IDREFs. These is also a derived index which maps IDREFs to their target IDs. Goals and Problems Firstly, Niagara should be able to index attribute tags, as well as attribute values, in a general way. It should be possible to perform general purpose queries on attribute related values, instead of being restricted to an arbitrary subset of pre-computed things. For example, it is only possible to perform element+attribute index lookups now, where the element must be known. It is not possible to find all occurences of an attribute without an exhaustive search. A more general attribute handling scheme would allow for general purpose queries against attributes. There is also the problem that attributes exist outside of the payload and natural order of a XML document. For example in the fragment someimportant text the payload some is immediately followed by important, not this is a test. Some searches will only want to find attribute tags, others attribute values, and others yet tag+value. In addition, some queries will want to search both the attribute space and the payload space, while other queries will want to search only the payload space or the attribute space. The IM does not store all the information about the document that is required to build these multiple-attribute indices. Solutions We have developed a general indexing mechanism for attributes and attribute values. This mechanism allows for all types of attribute search, not just the limited types -2- available now. The current element+attribute and idref indices are actually implicit multi-attribute indices. These implicit indices will be removed, to be replaced by ordinary multi-attribute indices which produce the same result. However, as all the info needed to generate these indices will now be available in the IM, there is no need to compute those indexes in memory as a document is indexed. The results can be generated through ordinary queries against the IM data. This allows these indexes to be cre- ated on arbitrarily large documents, without requiring an arbitrarily large amount of memory to compute the result. This mechanism also allows for querying payload and attribute text seperatedly from each other. to use the IM to it's full effectiveness. This is especially important because we can generate correct structural queries on attributes. It is also possible to generate accurate super- set structural queries on attribute values. It also solves the problems of attribute ordering which we have examined from several directions and have not developed a solution to before. The flip side of this solution is that you need to run both payload and attribute queries, and merge their results, if both payload and attribute must be searched. However, we think this is an excellent tradeoff to having the capability to do exact structural queries on all data in the IM. As the difficult ordering problem is also solved, this is good. Implementation The current idref and element+attribute implicit derived indices will be removed, at least in concept. The idref index may remain as generated by the loader, though it not need be. With that complete, we are left with text and element indices. We will now rename the text index to be the pay- load word index -- it indexes words which appear in the pay- load of an XML document. At this point we will add two more indexes. The first is the attribute index, which is an index of attribute tags. Postings for this index are of the form +------------------------------------+ | Attribute | +------------------------------------+ +-------+--------+-------+-----------+ |doc_id | ele_id | start | attribute | +-------+--------+-------+-----------+ A second index for attribute words will also be added -3- +---------------------------------------------------------+ | Attribute Word | +---------------------------------------------------------+ +-------+------------+-------+-----------+----------------+ |doc_id | element_id | start | attribute | attribute_word | +-------+------------+-------+-----------+----------------+ Doc_id, ele_id, and start are the expected fields com- mon to most entries. Attribute tags and values have their own numbering scheme, outside of the numbering of the payload of the docu- ment. All attribute numbers are relevant only within the element those those attributes and attribute words were found in. For example, with a element containing attributes of ...we would assign the following values. Note that the attribute is assigned the same element_id and start number as the element they are part of, since they have no position in the payload of the document +-------------------------------------+ |text attribute attribute word | +-------------------------------------+ |attr1 1 | |8675309 1 1 | |attr2 2 | |Niagara 2 1 | |QE 2 3 | |attr3 3 | |foo 3 1 | |bar 3 2 | +-------------------------------------+ Note that I treated is as a stopword to illustrate that stopwords, though not indexed, still need to count as words for determining distance. Attribute is a locally numbered identifier; the numbers are local to each element which contains attributes. As such, the same attribute value (0, 1, 2) will appear many times. The numbers will be assigned in simple first to last order of occurence in the source element. As attributes are not ordered, these numbers have no meaning beyond identify- ing multiple attributes in an element. However there is one ordering implication these scheme has: it allows the docu- ment to be reconstructed in the same form as it was input to the system. Attribute word is another identifier which is locally num- bered. It is the position of this word within this -4- attribute. Alternate Attribute Indexes An alternate scheme for storing attributes with differ- ent tradeoffs is presented here. The earlier scheme seems to have a number of advantages for query processing which is why it is presented in preference. For example, it is triv- ial to locate unknown attributes with a desired content with the attribute numbering scheme; With this scheme you need to do a join to determine that information. However, the two schemes are more or less equivalent and either can express the same content and provide for identical queries on it. It is just a question of how resource intensive those queries are. Explain the alternate attribute numbering scheme. +--------------------------------------------------+ | Attribute | +--------------------------------------------------+ +-------+--------+-------+-------------+-----------+ |doc_id | ele_id | start | local_start | local_end | +-------+--------+-------+-------------+-----------+ +--------------------------------------+ | Attribute Word | +--------------------------------------+ +-------+--------+-------+-------------+ |doc_id | ele_id | start | local_start | +-------+--------+-------+-------------+ Local start is the position in the enclosing element at which this attribute or attribute word starts. Local end is the position in the enclosing element at which this attribute ends. Actually it is a position corresponding to the END of the attribute value, not the last word of the attribute value. This is done so that certain kinds of queries work as expected with start,end numbers not includ- ing the things that they contain. It also ensures that there is a break between subsequent attributes. For exam- ple, with a element containing attributes of ... we would assign the following values. Note that the attribute is assigned the same element_id and start number as the element they are part of, since they have no position in the payload of the document -5- +----------------------------------+ |text local_start local_end | +----------------------------------+ |attr1 1 3 | |8675309 2 | |attr2 4 8 | |Niagara 5 | |QE 7 | |attr3 9 12 | |foo 10 | |bar 11 | +----------------------------------+ Note that the local start and end numbers act similarily to the global start and end numbers to "enclose" the content of the attribute. Notes After writing this document I noticed some things about the two above representations. The first is that the attribute representation doesn't allow you to easily deter- mine if a desired word is anchored at the end of the attribute, such as a '$' query in grep. To provide that functionality the attribute tuple needs to be extended with an extra entry, the number of words in the attribute: +------------------------------------------------------+ | Attribute | +------------------------------------------------------+ +-------+--------+-------+-----------+-----------------+ |doc_id | ele_id | start | attribute | attribute_words | +-------+--------+-------+-----------+-----------------+ Of course, this matters only if '$' tagged text queries are important, but it is something that can be payload queried in the payload part of the query. The two formats trade space versus query efficiency. I still like the pref- ered format, but note that the alternate does seem to be orthogonal with the rest of the system, as well as requiring less storage. I believe the best way to determine between the two formats is to write a bunch of queries and see which format works better in the query engine. XXXX move this somewhere else, the new IM document We are proposing as part of our changes (see another docu- ment) to eliminate element_id and use the start numbers as element IDs as well.