Niagara Numbering
and
Posting Format
Data Manager
Niagara has two numbering schemes, one used by the Data
Manager and the other used by the Index Manager. Both num-
bering schemes have a common document_id or doc_id which
identifies the document in question.
The DM uses one additional number, the element_id or ele_id
to uniquely identify each XML element in the document. The
element IDs are assigned in prefix traversal, with the root
element have id 1. This corresponds to the order that ele-
ment start tags are seen in the document parse.
+-------------------------+
| Data Manager |
+-------------------------+
+------------+------------+
|document_id | element_id |
+------------+------------+
For example, element_ids are assigned as follows in this XML
document.
Unix Internals
Vahallia
+------------------------+
|element element_id |
+------------------------+
| 1 |
| 2 |
| 3 |
| 4 |
+------------------------+
Note that the DM numbering is slightly more complex
than mentioned above. If a node contains only payload text
and no descendants, or only contains descendants and no
text, the numbering is as mentioned above. If a node has
both payload text and descendants, the payload text between
each descendant is treated as a built-in text descendant.
Using a more complex example to illustrate this case:
-2-
This is a test of the
emergency broadcast system
This is only a test
The translated internal version of this is
This is a test of the
emergency broadcast system
This is only a test
+-------------+
| 1 |
| 2 |
| 3 |
| 4 |
+-------------+
Index Manager
The IM assigns IDs to more portions of a document than
the DM, which only numbers the elements of a document. The
IM numbering also reflects document structure and is used to
analyze it when querying. The IM numbers each payload word
and XML element discovered in the document in the order
which they are parsed. This number is known as the start
number. The IM also has the nesting level, which is the
depth of the item from the root of the document. For ele-
ments which are stored in the IM, an end number is assigned
to each element. The end number for an element is the ele-
ment ID of the closing tag of that element. The start and
end numbers allow document structure to be analyzed. Data
stored in the IM for elements also contains the above men-
tioned DM element_id of that element, which allows IM ele-
ments to be mapped to DM elements.
+----------------------------------------------+
| IM Elements |
+----------------------------------------------+
+------------+------------+-------+-----+------+
|document_id | element_id | start | end | nest |
+------------+------------+-------+-----+------+
Payload words which are indexed in the IM do not require an
end number, as the word begins and ends at the same number.
-3-
+----------------------------------------+
| IM Payload Words |
+----------------------------------------+
+------------+------------+-------+------+
|document_id | element_id | start | nest |
+------------+------------+-------+------+
Using the earlier XML example again, I'll illustrate
how start numbers are assigned.
Unix Internals
Vahallia
+--------------------+
| start numbers |
+--------------------+
|item start |
+--------------------+
| 1 |
|id= 2 |
|"12345" 3 |
| 4 |
|Unix 5 |
|Internals 6 |
| 7 |
| 8 |
| 9 |
|Vahallia 10 |
| 11 |
| 12 |
| 13 |
+--------------------+
The element postings corresponding to these numbers would be
+---------------------------------------------+
| Element Postings |
+---------------------------------------------+
|element element_id start end nest |
+---------------------------------------------+
| 1 1 13 0 |
| 2 4 7 1 |
| 3 8 12 1 |
| 4 9 11 2 |
+---------------------------------------------+
The payload word postings for this would be
-4-
+--------------------------------------+
| Payload Word Postings |
|word element_id start nest |
+--------------------------------------+
|Unix 2 5 2 |
|Internals 2 6 2 |
|Vahallia 4 10 3 |
+--------------------------------------+
Problems and Considerations
The IM and the DM are independent entities, each gener-
ating their own numbers. However, the element_id numbers
generated by both parsers must match exactly. This creates
an implicit dependency between the two which must be care-
fully maintained for the system to work correctly.
As the IM and the DM use different numbering schemes to
represent elements, there is a storage price which the IM
must pay to keep both sets of IDs stored. This happens fort
both payload word and element postings, as queries on words
must also map to the element they are contained in for the
DM to locate them. While this overhead is small, it is mul-
tiplied by the number of postings and increases the storage
requirement of elements in the IM by a factor of 25%, and
that of words by 33%.
The only mapping that exists is the one way start to
element_id mapping stored in the IM's postings. There is no
general purpose map available to convert element_ids to
start numbers, and vice-versa. This prevents Niagara from
using data found in the DM to generate efficient IM queries,
as there is no start, end context available to limit the
search.
It is difficult to perform IM searches on document pay-
load. At best we can provide non-exact query results
against payload, versus the exact results that is provided
for a structural query. This occurs when elements come
between payload words. The elements have no impact on the
payload context, but cause distance calculations for adja-
ceny approximations to go through the roof. For example
This is a critically important issue.
-5-
+-------------------+
|start numbering |
+-------------------+
|word start |
+-------------------+
|This 1 |
|is 2 |
|a 3 |
|critically 5 |
|important 10 |
|issue 12 |
+-------------------+
If we want to find "critically important issue", they are
actually adjacent in the document, but their start numbers
are far apart because of intervening elements. As this
example shows, elements with attributes will increase that
distance further. The IM can only provide inexact payload
matches which must be filtered at a higher level with access
to document in the DM.
Solutions
We will eliminate the duplication in numbering efforts
between the Data Manager and the Index Manager. Instead the
IM start number will be used to identify elements in the DM.
This will allow a storage saving in the numerous IM post-
ings. It also provides a one-to-one mapping between the DM
and the IM elements, so that one doesn't loose information
when extracting query results from the DM.
The number-generating dependency will be eliminated by
using a common parser that will generate start numbers for
both the IM and the DM. This will remove that cross-sub-
system coupling. For details, check out the Parsing Streams
vignette.
For payload words we introduce a seperate numbering
scheme, where there is a payload word number that is incre-
mented only as new payload words are discovered. Using the
above example
+---------------------+
|payload numbering |
+---------------------+
|word payload |
+---------------------+
|This 1 |
|is 2 |
|a 3 |
|critically 4 |
|important 5 |
|issue 6 |
+---------------------+
With this payload word numbering the IM provides exact
-6-
matches on payload queries that do not contain stopwords,
instead of only approximate results which must be filtered.
Another advantage conveyed by this scheme is that it closely
follows traditional IR practices in numbering textual docu-
ment content. This allows us to use much of the research of
the IR field for IM queries. With the current numbering
scheme, both stopwords and elements can be interleaving
between payload words. Due to this interleaving of ele-
ments, it is not possible to use traditional IR stopword
frequency techniques to provide fuzzy distances for inexact
matches in distance expressions. As a result Niagara must
use a much larger fuzzy distance as a metric for nearness of
words, which corresponds to a larger space which must be
searched by actual retrieval.
Notes
Attributes are now indexed -- see the vignette for
details. That allows comparable exact structural query
result for structural attribute queries, as well as exact
results for many attribute word queries. Notice, that since
attribute words are numbered as payload, that queries
involving stopwords can be based upon text stopword fre-
quency.