SEQL Streams Current Implementation There are no operators or streams to connect them int the SEQL portion of a Niagara query. The things called SEQL operators are not real operators, but rather operators in a SEQL expression evaluation tree. The SEQL query expression is buried as a predicate to the Index Scan oeprator. When executed the index scan presents the SEQL predicate to the SEQL expression evaluator. At that point, the SEQL expres- sion evaluator evaluates the operators in the SEQL tree. The SEQL query is evaluated completly in memory by a bunch of almost but not quite identical merge-sort-like algorithms. All of the SEQL query evaluation is done in main memory. The SEQL operator generates a Index Manager retrieval of stored postings, and presents it to the IM for retrieval. It does this for both the left and right sides of each SQL operator. In return, the IM reads postings from disk and constructs an in-memory representation of the post- ings requested by the SEQL operator. Note that this brings both complete set of postings into memory before any query processing is done. After the in-memory postings are returned to the SEQL operator, the SEQL operator sort-merge- like code runs to generate another in-memory list of post- ings -- the result of the operator. At this point, just one operator the of the SEQL query has been evaluated, and an in-memory result has been gener- ated. This is feed as one input to the next SEQL operator. The other input will be staged from disk into memory again by the IM, or be already in memory as a result of prior SEQL operator execution. When the evaluation of the SEQL operator tree is com- plete, the resulting in-memory set of postings is returned to the Index Scan operator which started execution of the SEQL evaluation tree. At this point index scan converts the postings returned by the SEQL operator into XNodes. Those XNodes are then placed onto the QE stream, the output of the index operator, as single-element tuples. From now one query execution is completely done at the QE level. Notes Note that the optimizer tries to push down as much work as possible from the select operators in the query engine part of the tree into the SEQL expression. This tends to create complex SEQL queries. However, a corresponding -2- select may remain in the QE part of the query to due to the non-exact nature of non-structural SEQL queries. XXX maybe move this section to QE notes Note that if a "in document" query is performed, that the File Scan operator is used. That operator replaces the com- plete index scan + SEQL operator, and just produces a single XNode on its output. That XNode is the document root, and all query processing is done by the unnest operators in the QE plan. SEQL results can be thought of as a stream, although they are not implmented as such. SEQL streams convey single field tuples corresponding to the individual posting. The SEQL operators only return posting tuples from the left side of any SEQL operator. Problems Memory use: SEQL queries can only be executed in mem- ory. This can give them quite a large memory footprint. The memory footprint grows with the complexity of SEQL oper- ator trees, and the number of independent SEQL operator trees. High Latency: All postings must be completely read into memory before evaluation of the SEQL operators may commence. This produces a lot of latency during a query. At the very least, the total latency for a query will be the sum of the disk I/Os needed for the deepest SEQL tree. It is also not possible to do efficient partial results for the same reason -- no results are generated until all the SEQL queries are complete. Loss of information: The Index Manager and the SEQL Operators are actually capable of returning fully correct structural queries at a far lower cost than using the QE unnest operator. Two thinsg prevent this. The first is that only postings from the left side of the SEQL operator are returned. This is a real problem since you need both the left and the right hand side posting to make a complete reference to the queries object. For example, for 'A con- tains B', you want both the A, and the Bs which that A con- tains to be returned. With that info, unnest becomes super- flous in most queries which use the IM. The other problem in this area is that the SEQL operators give up execution as soon as they find a potential match. For example, if one 'A contains B' is run, and one B is found, the operator gives up on A and returns it in the result list. What it should do is continue to find all instances of B contained in A. Of course, this is useless with the current setup, since the unnest will exhaustively search for all matches. However, since we have already paid -3- the cost of examinging the IM information, we should use it to the greatest possible benefit. Solutions SEQL operators will become full-fledged operators with their own thread and context. SEQL operators will be connected by SEQL streams. The tuples on the SEQL streams may have multiple fields. These will allow all the exact structural matches found in the IM to be returned directly to the Query Engine. It will also provide the locations of possible non-exact payload matches. By doing this the work that the unnest operator has to do is greatly reduced. These changes will allow the entire query tree to run "live", instead of blocking occuring while the SEQL query runs. This change wil also allow for partial execution, partial results, and early results to be seen by the receiver of query results. The memory utilization of the SEQL code will be greatly reduced as well. Implementation Details There was a big concern that we should try to optimize the contents of the tuples contained in the SEQL streams. This is because of the redundant information carried in those tuples, such as the document-ID, which can often apply to many subsequent tuples. This optimization can only hap- pen on runtime and is dependent upon the actual data queried. It is possible to implement such prefix compres- sion of tuple contents by making the streams aware of the tuple content. The streams could decide when compression of the tuples would be advantageous, and would provide a encom- passing document-ID prefix which would apply to all follow- ing tuples until the prefix was changed. With the prefix available, the document-ID field could be stripped from all subsequent tuples with the resulting space savings and effective bandwidth increase of useful payload. While it is possible to implement this, we are not going to implement this feature at this time. It will introduce additional complexity and would be quite likely to increase the overhead of tuple processing beyond that of simply copy- ing the data. If the situation in the future warrants it, we have a plan to deal with it, and the changes need not have any effect on the system as a whole, only with the implementation of streams and their tuple payload. Using streams of tuples allows us to use to great effect the many relational query operators as SEQL opera- tors. Instead of being forced to use sort-merge style oper- ators, our horizon can be exanded to take advantage of the -4- great quantity of relational experience to process SEQL queries efficiently.