New Index Manager Structure
Background
The current Index Manager consists of a number of Lexicons.
A Lexicon exists for each type of Posting, or index, that the
IM maintains.  The types of content indexed by individual
Lexicons are Text, Elements, Element+Attribute, IDrefs, DTDs,
and multi-attribute results.  Each Lexicon logically contains
a set of LexiconEntries.  Each LexiconEntry corresponds to a
term (word, element, etc.) stored in that particular Lexicon
and is identified by the term it represents.  The LexiconEntry
contains meta-information about the postings for its term, and
the postings themselves.  The postings contained in the
LexiconEntries can conceptually vary, based on the information
that needs to be stored in them.  A superset of all the
information they can contain is:
+----------------------------------------------+
| Posting                                      |
| document_id  element_id  start  end  nest    |
+----------------------------------------------+
Word postings, for example, don't use the end information --
and don't store it.
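As a concrete (and hedged) sketch, the posting variants might
be declared roughly as follows; the field names follow the
table above, while the types and struct names are assumptions
for illustration only:

    #include <cstdint>

    // Superset posting, mirroring the fields in the table above.
    // The integer widths are assumed, not the IM's actual choices.
    struct Posting {
        uint32_t document_id;   // document containing the term
        uint32_t element_id;    // element containing the occurrence
        uint32_t start;         // start position
        uint32_t end;           // end position (unused by word postings)
        uint16_t nest;          // nesting depth
    };

    // A word posting can simply omit the unused 'end' field.
    struct WordPosting {
        uint32_t document_id;
        uint32_t element_id;
        uint32_t start;
        uint16_t nest;
    };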
The Lexicons and their entries are stored on disk, one BTree
file per Lexicon.  A separate table contains IM meta-
information, including meta-information for each Lexicon; it
is read at system startup and written at shutdown.  The key of
a Lexicon's BTree is the term that identifies a LexiconEntry
of that Lexicon.  The value portion of the BTree entry
contains some LexiconEntry meta-information, followed by the
list of postings that record the occurrences of that term.
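As a hedged illustration of that layout (the record shape and
names below are assumptions; the actual marshalled format is
defined by the IM):

    #include <cstdint>
    #include <string>
    #include <vector>

    // One BTree entry of a Lexicon, as it conceptually appears on disk.
    // Key:   the term (word, element name, ...).
    // Value: LexiconEntry meta-information followed by the posting list.
    struct LexiconEntryMeta {
        uint32_t posting_count;   // number of postings that follow
        uint32_t posting_bytes;   // size of the marshalled posting list
    };

    struct OnDiskLexiconEntry {
        std::string          key;       // the term itself
        LexiconEntryMeta     meta;      // per-term meta-information
        std::vector<uint8_t> postings;  // marshalled postings, in on-disk order
    };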
In addition to the simple indexes, such as word or element,
the IM is also capable of storing indexes with multi-attribute
(MA) results.  These multi-attribute indexes are used as path
indexes.  The posting for a multi-attribute index contains
multiple postings, hence the name.  For example, given a path
of a/b, the posting for the MA result would consist of two
element postings: the first posting of the set holds the
location of a, and the second the location of b.
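A hedged sketch of an MA posting; the nesting of one element
posting per path step follows the description above, while the
types are illustrative only:

    #include <vector>

    // Element posting as sketched earlier.
    struct ElementPosting {
        unsigned document_id, element_id, start, end, nest;
    };

    // A multi-attribute posting bundles one element posting per step
    // of the indexed path.  For the path a/b, postings[0] locates the
    // 'a' element and postings[1] locates the 'b' element beneath it.
    struct MultiAttributePosting {
        std::vector<ElementPosting> postings;
    };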
In the current system there are large lookup tables, the
Lexicons, in which the key and meta-information about each term
are memory resident for the duration of the session.  The
per-term information is the LexiconEntry.  The postings
associated with each term, however, are not memory resident --
they are rolled in from disk when they are accessed and are
referenced by the individual LexiconEntries.  This caching
requires that postings be read out of the buffer pool and
unmarshalled into an in-memory format, which is then copied to
a user buffer and used for query processing.  The postings in
the LexiconEntry need some sort of replacement policy (or they
don't have one yet, which is part of the problem) to recover
the space they consume when more space is required.
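The read path described above could be sketched roughly as
below; readFromBufferPool and unmarshal are placeholder names
for the two expensive steps, not the IM's actual interfaces:

    #include <cstdint>
    #include <vector>

    struct Posting { unsigned document_id, element_id, start, end, nest; };

    struct LexiconEntry {
        uint32_t posting_count;                 // memory-resident meta-information
        std::vector<Posting> cached_postings;   // rolled in from disk on demand
    };

    // Placeholders for the buffer-pool read and the format conversion.
    std::vector<uint8_t> readFromBufferPool(const LexiconEntry& e);
    std::vector<Posting> unmarshal(const std::vector<uint8_t>& raw);

    // Copy one term's postings into a caller-supplied buffer.
    void fetchPostings(LexiconEntry& e, std::vector<Posting>& user_buffer) {
        if (e.cached_postings.empty()) {
            // Roll in from disk and convert to the in-memory format.
            e.cached_postings = unmarshal(readFromBufferPool(e));
        }
        user_buffer = e.cached_postings;   // second copy, into the user buffer
    }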
In addition, this in-memory posting buffer is used to hold
postings written to the IM that are scheduled to be written to
the underlying disk.  When enough postings have accumulated,
they are marshalled to the on-disk format and written out.
Note that sorting is needed somewhere in this path to ensure
the correct order of the on-disk data.  The two forms of data
(on disk and in memory) require expensive copying and data
conversion operations.
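The write side might look like the following sketch; marshal
and writeBlob are placeholder names, and the sort step is the
one noted above as being needed somewhere in this path:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Posting { unsigned document_id, element_id, start, end, nest; };

    std::vector<uint8_t> marshal(const std::vector<Posting>& p);  // in-memory -> on-disk
    void writeBlob(const std::vector<uint8_t>& bytes);            // append to the term's BLOB

    // Flush the write buffer for one term once enough postings accumulate.
    void flush(std::vector<Posting>& buffered) {
        // Sort so the on-disk data ends up in the required order
        // (assumed here: by document, then element, then start).
        std::sort(buffered.begin(), buffered.end(),
                  [](const Posting& a, const Posting& b) {
                      if (a.document_id != b.document_id)
                          return a.document_id < b.document_id;
                      if (a.element_id != b.element_id)
                          return a.element_id < b.element_id;
                      return a.start < b.start;
                  });
        writeBlob(marshal(buffered));   // expensive conversion plus disk I/O
        buffered.clear();
    }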
All postings for all documents of a given term are stored in
a single large BLOB.  Because of this, it is necessary to read
all preceding data in the BLOB before the actual data of
interest, such as a document or a subtree of a document, is
located.  These BLOBs can also grow quite large, and
manipulating them is expensive in both disk I/O and CPU time.
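To make that cost concrete, here is a hedged sketch of
locating one document's postings inside such a BLOB; decodeNext
is a placeholder for whatever decoding the IM actually does:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Posting { unsigned document_id, element_id, start, end, nest; };

    // Decode one posting starting at 'offset' and advance the offset.
    Posting decodeNext(const std::vector<uint8_t>& blob, size_t& offset);

    // Every posting in front of the wanted document must be read and
    // decoded before the document's own postings are reached.
    std::vector<Posting> postingsForDocument(const std::vector<uint8_t>& blob,
                                             unsigned doc) {
        std::vector<Posting> result;
        size_t offset = 0;
        while (offset < blob.size()) {
            Posting p = decodeNext(blob, offset);
            if (p.document_id == doc)
                result.push_back(p);
            else if (p.document_id > doc)
                break;   // postings are in document order; the rest can be skipped
        }
        return result;
    }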
New System
The new system's intent is to simplify how IM data is handled,
allow specific data to be accessed quickly, eliminate
in-memory copies of the data, and make the IM scalable with
respect to the amount of memory it uses.
Data conversion will be eliminated by supporting different
posting formats that are known by the system -- instead of
using a uniform posting format and shoe-horning all the
postings into it.  Multi-attribute postings already violate
the uniform posting format and require expensive data
conversion to use.
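As a hedged sketch of what system-known posting formats could
look like (the variant names and the use of a tagged union are
assumptions, not the chosen design):

    #include <variant>
    #include <vector>

    // Each index type keeps postings in its own natural format instead
    // of forcing everything into one superset record.
    struct WordPosting      { unsigned document_id, element_id, start, nest; };
    struct ElementPosting   { unsigned document_id, element_id, start, end, nest; };
    struct MultiAttrPosting { std::vector<ElementPosting> steps; };  // one per path step

    // The IM knows which format a given Lexicon uses, so readers and
    // writers can operate on the native layout without conversion.
    using AnyPosting = std::variant<WordPosting, ElementPosting, MultiAttrPosting>;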
The complex (and not implemented) cache management will be
eliminated by removing the general-purpose cache and replacing
it with something else.  The replacement cache is a
performance enhancement only, which will buffer written data
to minimize format changes and disk I/Os -- the system would
work without it.  It is not used in the context of normal
query processing.
Indexing the data via (term-id, doc-id) allows immediate
access to the data in a particular document for in-document
queries.  This storage change also alleviates the large
BLOB problem mentioned earlier by breaking the existing BLOBs
up into smaller ones.  The ability to use multiple BTrees to
store the postings also allows the data to be split among
devices, which allows better disk access and storage
utilization.  Ultimately we may partition the data into
multiple BTrees in a different fashion, rather than the
all-entries-for-a-term-in-one-BTree layout assumed here.  That
work, however, requires that even simple queries be capable of
parallel execution.  Though intended as the system grows, that
is not something we will require immediately.
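Returning to the (term-id, doc-id) indexing mentioned above, a
hedged sketch of such a composite key; the field widths and
comparison order are assumptions for illustration:

    #include <cstdint>
    #include <tuple>

    // Composite BTree key for the new layout.  Postings are broken up
    // per (term, document), so an in-document query can seek directly
    // to its document instead of scanning the term's whole BLOB.
    struct PostingKey {
        uint32_t term_id;
        uint32_t doc_id;

        bool operator<(const PostingKey& other) const {
            return std::tie(term_id, doc_id)
                 < std::tie(other.term_id, other.doc_id);
        }
    };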