SEQL Streams


Current Implementation

     There are no operators, or streams to connect them, in the
SEQL portion of a Niagara query.  The things called SEQL operators
are not real operators, but rather nodes in a SEQL expression
evaluation tree.  The SEQL query expression is buried as a predicate
of the Index Scan operator.  When executed, the index scan presents
the SEQL predicate to the SEQL expression evaluator, which then
evaluates the operators in the SEQL tree.

     The SEQL query is evaluated completely in main memory by a
collection of almost, but not quite, identical merge-sort-like
algorithms.  For each of its inputs -- the left and the right side
of the operator -- a SEQL operator generates an Index Manager
retrieval request for the stored postings and presents it to the IM.
In return, the IM reads the postings from disk and constructs an
in-memory representation of the postings requested by the SEQL
operator.  Note that this brings both complete sets of postings into
memory before any query processing is done.  After the in-memory
postings are returned to the SEQL operator, the operator's
sort-merge-like code runs to generate another in-memory list of
postings -- the result of the operator.
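
     As an illustration of the current style of evaluation, the
following sketch shows a sort-merge-like pass for a single 'A
contains B' operator over two fully staged posting lists.  The
Posting layout, the containment test, and all names here are
assumptions made for this sketch, not the actual IM data structures;
the real code is more involved.

     // Sketch only: both posting lists are assumed to be completely
     // in memory (as the IM stages them today) and sorted by doc_id.
     #include <vector>

     struct Posting {
         int doc_id;   // document the element occurs in
         int start;    // start position of the element
         int end;      // end position of the element
     };

     // Mirrors the behaviour described above and criticised later in
     // this note: only left-side (A) postings are returned, and each
     // A is abandoned as soon as one contained B is found.
     std::vector<Posting> containsMerge(const std::vector<Posting>& left,
                                        const std::vector<Posting>& right)
     {
         std::vector<Posting> result;
         std::size_t r = 0;
         for (const Posting& a : left) {
             // skip right-hand postings from earlier documents
             while (r < right.size() && right[r].doc_id < a.doc_id)
                 ++r;
             // scan this document's B postings for one contained in a
             for (std::size_t i = r;
                  i < right.size() && right[i].doc_id == a.doc_id; ++i) {
                 if (right[i].start >= a.start && right[i].end <= a.end) {
                     result.push_back(a);  // keep A only, first match only
                     break;
                 }
             }
         }
         return result;
     }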

     At this point, just one operator of the SEQL query has been
evaluated, and an in-memory result has been generated.  This is fed
as one input to the next SEQL operator.  The other input will either
be staged from disk into memory by the IM, or already be in memory
as the result of a prior SEQL operator's execution.

     When the evaluation of the SEQL operator tree is complete, the
resulting in-memory set of postings is returned to the Index Scan
operator that started execution of the SEQL evaluation tree.  At
this point the index scan converts the postings returned by the SEQL
operator into XNodes.  Those XNodes are then placed onto the QE
stream, the output of the index scan operator, as single-element
tuples.  From this point on, query execution is done entirely at the
QE level.
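
     A minimal sketch of that last step, reusing the Posting struct
from the sketch above; the XNode and tuple shapes here are stand-ins
for the real QE classes, which carry more information than this.

     #include <vector>

     // Hypothetical stand-ins for the QE's XNode and one-field tuple.
     struct XNode { int doc_id; int start; int end; };
     using Tuple = std::vector<XNode>;      // here always one element

     // Index Scan converts each posting of the final SEQL result into
     // an XNode and ships it as a single-element tuple on its output.
     std::vector<Tuple> toQeTuples(const std::vector<Posting>& seqlResult)
     {
         std::vector<Tuple> qeStream;
         for (const Posting& p : seqlResult)
             qeStream.push_back(Tuple{XNode{p.doc_id, p.start, p.end}});
         return qeStream;
     }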

Notes

     Note that the optimizer tries to push down as much work
as possible from the select operators in  the  query  engine
part  of  the  tree into the SEQL expression.  This tends to
create  complex  SEQL  queries.   However,  a  corresponding
select may remain in the QE part of the query due to the non-exact
nature of non-structural SEQL queries.

     XXX maybe move this section to QE notes
Note that if an "in document" query is performed, the File Scan
operator is used.  That operator replaces the complete index scan +
SEQL operator combination, and just produces a single XNode on its
output.  That XNode is the document root, and all query processing
is done by the unnest operators in the QE plan.

     SEQL results can be thought of as a stream, although they are
not implemented as such.  SEQL streams convey single-field tuples,
each corresponding to an individual posting.  A SEQL operator only
returns posting tuples from its left-hand side.

Problems

     Memory use: SEQL queries can only be executed entirely in
memory.  This can give them quite a large memory footprint.  The
memory footprint grows with the complexity of the SEQL operator
trees and with the number of independent SEQL operator trees.

     High latency: All postings must be completely read into memory
before evaluation of the SEQL operators may commence.  This adds
considerable latency to a query.  At the very least, the total
latency for a query will be the sum of the disk I/O time needed for
the deepest SEQL tree.  It is also not possible to deliver partial
results efficiently, for the same reason -- no results are generated
until all the SEQL queries are complete.

     Loss of information: The Index Manager and the SEQL operators
are actually capable of answering structural queries fully correctly
at a far lower cost than using the QE unnest operator.  Two things
prevent this.  The first is that only postings from the left side of
the SEQL operator are returned.  This is a real problem, since you
need both the left- and the right-hand side postings to make a
complete reference to the queried object.  For example, for 'A
contains B', you want both the A and the Bs which that A contains to
be returned.  With that information, unnest becomes superfluous in
most queries which use the IM.

The other problem in this area is that the SEQL operators give up
execution as soon as they find a potential match.  For example, if
'A contains B' is run and one B is found, the operator gives up on
that A and returns it in the result list.  What it should do is
continue on to find all instances of B contained in that A.  Of
course, doing so is useless with the current setup, since the unnest
will exhaustively search for all matches anyway.  However, since we
have already paid
the cost of examining the IM information, we should use it
to the greatest possible benefit.

Solutions

     SEQL  operators will become full-fledged operators with
their own thread and context.

     SEQL operators will be connected by SEQL streams.  The tuples
on the SEQL streams may have multiple fields.  Multi-field tuples
allow all the exact structural matches found in the IM to be
returned directly to the Query Engine, and also provide the
locations of possible non-exact payload matches.  By doing this, the
work that the unnest operator has to do is greatly reduced.
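
     A sketch of the difference this makes for the 'A contains B'
example: with multi-field tuples, every (A, B) pair can be shipped
to the Query Engine, rather than each A alone with its Bs discarded.
The names and tuple layout are again illustrative only.

     #include <vector>

     struct Posting { int doc_id; int start; int end; };  // as before
     struct SeqlTuple { std::vector<Posting> fields; };   // e.g. {A, B}

     // Enumerates every contained B per A and keeps both sides, so
     // the exact structural match reaches the QE without being
     // rediscovered by unnest.
     std::vector<SeqlTuple> containsAllPairs(
         const std::vector<Posting>& left,
         const std::vector<Posting>& right)
     {
         std::vector<SeqlTuple> out;
         std::size_t r = 0;
         for (const Posting& a : left) {
             while (r < right.size() && right[r].doc_id < a.doc_id)
                 ++r;                      // skip earlier documents
             for (std::size_t i = r;
                  i < right.size() && right[i].doc_id == a.doc_id; ++i)
                 if (right[i].start >= a.start && right[i].end <= a.end)
                     out.push_back(SeqlTuple{{a, right[i]}});  // A and B
         }
         return out;
     }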

     These changes will allow the entire query tree to run "live",
instead of blocking while the SEQL query runs.  This change will
also allow partial execution, partial results, and early results to
be seen by the receiver of query results.  The memory utilization of
the SEQL code will be greatly reduced as well.
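
     The sketch below shows the shape of such a full-fledged
operator: a bounded blocking stream connecting operators, and an
operator body running in its own thread that emits each result tuple
as soon as it is produced.  The real SEQL streams and operator
framework will be richer (error handling, scheduling, tuple types);
all names here are illustrative.

     #include <condition_variable>
     #include <cstddef>
     #include <mutex>
     #include <optional>
     #include <queue>
     #include <thread>

     // A bounded blocking queue standing in for a SEQL stream.
     template <typename T>
     class Stream {
     public:
         explicit Stream(std::size_t cap = 64) : cap_(cap) {}

         void put(T t) {                 // blocks while the consumer lags
             std::unique_lock<std::mutex> lk(m_);
             notFull_.wait(lk, [&] { return q_.size() < cap_ || closed_; });
             if (!closed_) { q_.push(std::move(t)); notEmpty_.notify_one(); }
         }
         std::optional<T> get() {        // empty optional == end of stream
             std::unique_lock<std::mutex> lk(m_);
             notEmpty_.wait(lk, [&] { return !q_.empty() || closed_; });
             if (q_.empty()) return std::nullopt;
             T t = std::move(q_.front()); q_.pop();
             notFull_.notify_one();
             return t;
         }
         void close() {                  // signal end of stream
             std::lock_guard<std::mutex> lk(m_);
             closed_ = true;
             notEmpty_.notify_all(); notFull_.notify_all();
         }
     private:
         std::mutex m_;
         std::condition_variable notEmpty_, notFull_;
         std::queue<T> q_;
         std::size_t cap_;
         bool closed_ = false;
     };

     // An operator with its own thread: it consumes input tuples as
     // they arrive and emits results immediately, so downstream
     // operators (and the user) see early, partial results.
     template <typename In, typename Out, typename Fn>
     std::thread spawnOperator(Stream<In>& in, Stream<Out>& out, Fn body) {
         return std::thread([&in, &out, body] {
             while (auto t = in.get())
                 body(*t, out);          // body may call out.put(...)
             out.close();                // propagate end of stream
         });
     }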

Implementation Details

     There was a big concern that we should try to optimize the
contents of the tuples carried on the SEQL streams.  This is because
of the redundant information in those tuples, such as the
document-ID, which can often apply to many subsequent tuples.  This
optimization can only happen at runtime and is dependent upon the
actual data queried.  It is possible to implement such prefix
compression of tuple contents by making the streams aware of the
tuple content.  The streams could decide when compression of the
tuples would be advantageous, and would provide an encompassing
document-ID prefix which would apply to all following tuples until
the prefix was changed.  With the prefix available, the document-ID
field could be stripped from all subsequent tuples, with the
resulting space savings and effective increase in useful payload
bandwidth.
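
     A sketch of what such prefix compression might look like on the
wire, reusing the Posting struct from the earlier sketches.  The
encoding and names are assumptions made for illustration; as noted
below, this feature is not being implemented at this time.

     #include <vector>

     struct Posting { int doc_id; int start; int end; };  // as before

     // A stream entry either carries a new document-ID prefix or a
     // posting payload with its document-ID stripped.
     struct Entry {
         bool isPrefix;     // true: doc_id holds a new prefix
         int  doc_id;       // valid only when isPrefix is true
         int  start, end;   // valid only when isPrefix is false
     };

     // Producer side: emit a prefix entry whenever the document
     // changes, then strip the document-ID from the following tuples.
     std::vector<Entry> compress(const std::vector<Posting>& in) {
         std::vector<Entry> out;
         int current = -1;
         for (const Posting& p : in) {
             if (p.doc_id != current) {
                 current = p.doc_id;
                 out.push_back({true, p.doc_id, 0, 0});
             }
             out.push_back({false, 0, p.start, p.end});
         }
         return out;
     }

     // Consumer side: re-attach the prevailing prefix to rebuild the
     // complete postings.
     std::vector<Posting> decompress(const std::vector<Entry>& in) {
         std::vector<Posting> out;
         int current = -1;
         for (const Entry& e : in) {
             if (e.isPrefix) current = e.doc_id;
             else out.push_back({current, e.start, e.end});
         }
         return out;
     }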

While it is possible to implement this, we are not going to
implement this feature at this time.  It would introduce additional
complexity and would quite likely increase the overhead of tuple
processing beyond that of simply copying the data.  If the situation
warrants it in the future, we have a plan to deal with it, and the
changes need not have any effect on the system as a whole, only on
the implementation of streams and their tuple payload.

     Using streams of tuples allows us to use the many relational
query operators to great effect as SEQL operators.  Instead of being
forced to use sort-merge style operators, our horizon can be
expanded to take advantage of the
great body of relational query-processing experience to process SEQL
queries efficiently.