Niagara Parse Streams
Background
There is a certain amount of redundancy in Niagara due
to its structure and the independent development efforts of
its separate components. For example, the Data Manager (DM)
and the Index Manager (IM) each have their own XML parser.
Both are event-driven SAX parsers.
Actually, there is yet another XML parser in the system:
the DOM parser used by the File Scan operator. However, the
need for that separate parser will be eliminated once the
DM's XNodes are used for all XML data representation.
What is the problem with two parsers, you ask? Well,
there are several. First off, there is a performance problem
-- a document must be parsed twice, once for the DM and once
for the IM, if it is to be stored in Niagara. XML parsing
has a lot of overhead, and this cost is not reasonable for
real queries.
Secondly, there is the problem that the system as a whole
must maintain two XML parsers. Outwardly this may not seem
like a big deal. However, the two parsers must be in total
agreement about the numbering of items in an XML document.
This coupling makes it clear that the parsers are not truly
independent, but rather duplicate each other's functionality.
If they duplicate it incorrectly, the system won't work.
Next, there is the issue of orthogonality. The IM and the
DM have nothing to do with document parsing, and everything
to do with document storage. Yet considerable portions of
both revolve around document parsing. Moving document
parsing into its own subsystem will reduce the complexity of
these components. In addition, it will be possible to have
multiple XML parsers and multiple parse consumers in the
system with only N+M complexity instead of NxM complexity.
An example of this is "parsing" out of a document stored in
the DM so that it can be [re-]indexed.
Issues
The XML parser is the repository for knowledge about
numbering of XML content. At the very least there is the
Element-ID numbering currently used by the DM. However,
even this simplistic numbering scheme must take into account
the way documents are broken up in Niagara. For example,
text between XML elements at a given level is stored as
Text nodes, and the numbering must be adjusted to account
for this.
In addition to element numbering there is also the start,
end numbering used by the IM. This numbering scheme is
needed as it reflects the structure of the document and
allows contains predicates to be evaluated against two
postings which have start, end numbers. Knowledge of document
structure is required for contains queries.
With this new Niagara we are also introducing the concept of
payload word numbering. This allows exact non-stopword
queries to be performed by the IM -- see the IM vignette for
details.
Stopwords are another issue, as the parser may need to number
stopwords in some cases and not number them in other
cases. As such, the parser needs to know what a stopword
is. It may be important to have a configurable stopword
list. For now a fixed stopword list is good enough, but
this is something important to consider for the future.
In general, the parser will most likely provide more
information than is needed by any one entity which is
receiving its parse events. Generically, this means the
parser will provide a union of all parse information
required by each of its users.
Multiple Parsing
All of the above goals are workable with a standalone
parsing module. One parser in the system is better than two
and a big step forward from what we have now. However, one
of the goals is to have the parser provide parse events to
multiple consumers of those events at the same time. This
is what will allow one parse to both load the document into
the DM, and to generate postings which are stored in the IM.
Since this is a modular design, the multiple parsing module
could just be an ordinary parse event receiver which then
duplicates the events to a number of other parse event
receivers.
It is currently assumed that the parse event interface
will be procedural in nature, rather than an actual event-
driven channel such as a stream. This is specified for the
following reasons. The first is to minimize the amount of
data that is buffered. If streams were used and the two
consumers had different consumption rates, a backlog would
accrue at the slower consumer. By using a synchronous
interface this buffering problem is eliminated. Another
issue to consider is the need for incremental parsing. If
there is, for example, a large string we don't want to have
to build it all in memory before passing it to the consumer.
An incremental approach where portions of the large data are
passed to the consumers reduces the impact of large data and
makes the system more scalable. The third factor to
consider is the issue of copying data. XML parsing is
already an expensive operation, and copying the data to
multiple consumers makes it that much more expensive.
The bottom line is that the procedural interface is simple,
highly adaptable, and useful from the beginning. We may
decide to change it in the future, but it is a good building
block for now, and may be all that Niagara requires.
Implementation Notes and Goals
One goal of the parsing system is to divorce the rest
of the system from dependence upon a particular XML parser.
As such, the parse event interface should completely
isolate the consumers of parse events from the producer of
the same. This means that there should be no direct or indirect
coupling between the existing XML parser, Xerces-C, and the
remainder of the Niagara system. In other words, the XML
parser is an implementation detail of the parse stream
generator and nothing more.
Currently the IM and the DM each drive their own
loading with a load(url) function. They know about creating
transactions for load, and then either commit or abort the
transaction. In the new system this will change -- loads
will happen in the context of a load operator which is
responsible for connecting a document source to an XML
parser. The load operator will also be responsible for
collecting the receiver(s) of parse events and arranging for
the parse events to be sent to those event receiver(s). The
IM and DM will know nothing about transactions, though they
will operate in the context of one. Transaction abort and commit will be
carried out by higher-level entities which can make policy
decisions.
Currently, the IM and the DM are both entities with
somewhat unrelated halves. The important part of both the IM
and the DM is the storage and retrieval of database infor-
mation. However, both the IM and the DM also have a large
fraction devoted to parsing XML into a format which the
storage engine can then store. This increases the complexity
of the two systems quite a bit. The bulk of this functionality
should move to the parsing system, leaving only an IM- or
DM-specific data interpreter to convert the parsed data into
the format needed by those subsystems. Even better would be
to isolate the data conversion process in the IM and DM from
the actual DB components. This would allow the DB components
to just store a "stream" of relevant objects, whether postings
or XNodes. That would make the DB components purely components
which deal with database access, and make the system more
orthogonal. Of course this is a long-range goal, but it
should be known up-front to guide development of the system.
It is important that the parser be "scalable" in terms
of document size. This scalability item means that the
parser's memory use should be independent of document size
-- if the document grows, the parser's memory footprint
shouldn't grow with it. However, it is
also important that the parser be scalable on a smaller
level -- for example, on individual items of a document,
whether words or XML elements. For instance, a 100MB
word in a document must not require the 100MB word to be
formed in memory before it is passed to the event receivers.
The event driven nature of the parser will tend to take care
of the large documents by breaking them up into individual
elements. However, a further incremental approach is
required for handling individual large content. For exam-
ple, a start_element function to start a new element of the
text, and a continue_element function which incrementally
adds text to a started element until it is complete. Of
course, a small element would have be complete with just the
call to start_element, and not require further calls.