Grouping
                            and
                        Aggregation


Background

     Grouping  and Aggregation have historically been things
which are conceptually easy to think of ... yet the  result-
ing  implementations  are  a  mess.   In  relational systems
grouping and aggregation go hand-in-hand and their implemen-
tations  are interdependent upon each other.  The difficulty
arises in getting the aggregates  correct,  and  needing  to
process  a  potentially  large  number  of groups.  With XML
query processing, all of the above is true, though the situ-
ation  is far worse -- because XML and XQuery require struc-
tural aggregation in addition to simple  numerical  aggrega-
tion.  With structural aggregation you must store the entire
content of each group, as well as having  many  groups.   In
other  words  you  group  a forest of XML document fragments
which must be stored, instead of just aggregated into a sim-
ple aggregate.

     The  current  Niagara  implementation does not have any
grouping or aggregation capabilities, so there is nothing to
compare with in the old system.

     We  have  identified  three modes in which grouping and
aggregation can be used in an XML query system.  These three
are  by no means a limit on what our system can do, they are
just the modes we have identified as common.

1    Traditional numerical aggregation.  This form of group-
     ing  produces  a  relatively  simple result of the many
     inputs to each grop.  The complexity  here  comes  from
     the need for aggregate functions to compute the result.

2    Structural aggregation to  a  single  result  document.
     This  form of aggregation creates a document which rep-
     resents the content of the group.  This form is  useful
     if  no more query processing will be performed upon the
     aggregated data, or if the data must be  treated  as  a
     whole.   Further query processing on this data requires
     use of unnest.

3    Structural aggregation into  tuple  streams.   In  this
     form of structural aggregation the tuples being grouped
     remain as individual tuples.  This facilitates  further
     query  processing  on the grouped data.  Existing tech-
     niques of document_id order processing can be  used  to
     process the groups in group_id order.









                             -2-


     With  the addition of grouping comes the problem of how
to extract the resulting aggregates, numeric or  structural,
from  each  group.   This area is a prime target for careful
implementation, as ordering via group is easy to do at  this
stage,  and  inexpensive  compared  to  performing  the same
ordering at a later point.

     Another point to note is  that  structural  aggregation
can  produce  a  lot of structure that can be pruned quite a
bit through further query processing -- but on  that  group.
To  reduce  the  overhead  of  structural  aggregation it is
important to have the ability to  filter  tuples  which  are
placed  in  that  group.  This can also reduce the cost of a
structural grouping operator considerably, as it  no  longer
has  to  store a full group -- only the entries that will be
desired later.  You can think of this as a simple push of  a
predicate through the output of a grouping operator into its
input.

Grouping Operator Algorithms

     There are two important cases of input ordering to con-
sider   when   choosing  grouping  operator  --  ordered  or
unordered.  If the input is ordered by document_id ordering,
it  is  possible  to use a relatively trivial merge grouping
operator.  This operator  merely  notes  the  current  docu-
ment_id  and stops forming the group when the incoming docu-
ment_id changes.  Of course this works with other keys  too.
There  is no need to store tuple aggregate results; they can
be streamed to the next operator -- with the proper group ID
tag  included.   If  document aggregation is being done, the
resulting document can be passed  upstream  once  its  input
group  is complete.  Only one numeric aggregate function and
data context need exist at a time if  numerical  aggregation
is performed.

     If  the  group  is performed on an unordered attribute,
the cost increases considerably.  The typical  solution  for
unordered aggregation is a hash based grouping function.  If
a numeric aggregation is performed, the stored data for each
group  will  be  relatively small, consisting of the context
needed to form the numeric aggregate or aggregates which are
computed.   For  tuple  level structural aggregation a tuple
buffer will be needed to store the tuples  for  each  group,
and  spill them to disk if needed due to memory constraints.
Document level structural aggregation does requires  similar
resources  to  that  needed  by  numerical aggregation -- at
least at the  grouping  level.   Note  that  many  temporary
XNodes  will be consumed by the DM as the documents are con-
structed, and that may cause the DM to spill temporary docu-
ments to disk.












                             -3-


Aggregate Operators

     In addition to using aggregate functions with grouping,
XML query processing often needs stand-alone aggregate oper-
ators.  Our aggregate processing system eliminates duplicate
aggregate procesing in grouping and aggregate operators.  An
aggregate operator will exist, just as the grouping operator
exists.  The aggregate operator will be given  an  aggregate
function  to process, same as the grouping operator is given
to process aggregation.

Aggregate Functions

     The heart of our common aggregate  and  grouping  func-
tionality  are  the  aggregate  functions.   Just as you can
build predicate expressions for  predicate  evaluation,  you
can build aggregate expressions for aggregate evaluation.

         +-----------------------------------------+
         |          Aggregate Functions            |
         +-----------------------------------------+
         |Numeric      min                         |
         |             max                         |
         |             count                       |
         |             average                     |
         |             sum                         |
         |             histogram                   |
         +-----------------------------------------+
         |Ranking      Range                       |
         +-----------------------------------------+
         |Structural   Construct document fragment |
         |             Save Tuples                 |
         |             Tuple Stream                |
         +-----------------------------------------+

Conceptual Implementation

     Remember  that  this  is a conceptual level description
which is intended to show how aggregation and  grouping  fit
into  the  Niagara framework as just another piece.   Actual
implementation will need to use traditional relational tech-
niques, or discover new techniques, to be practicable.

     Conceptually,  group-by  can  be thought of as a switch
which routes input tuples to an output stream based upon the
attribute(s)  which the fIgroup-by is performed upon.  Those
streams are then sent to aggregate operators  which  perform
whatever aggregate is needed upon the data.  The stream out-
put of those operators is then  collected  by  a  conceptual
merge or order which combines all the individual groups back
into one output stream.

     That mentioned above is  the  conceptual  model.   Note
that  it  may actually be directly applicable for some query









                             -4-


execution plans; for example where only  a  few  groups  can
possibly exist.

o    Each  of  those  resulting  streams  may be left in raw
     tuple form for further processing; for example group-by
     outputs the tuples from the individual streams in a new
     order of group-id.  This is a parallel to the  existing
     document_id ordering used for multiple documents in the
     rest of the system, and will have similar semantics  in
     query  processing.   If  groups  do  not have a natural
     ordering the optional stream feedback mechanism can not
     be used for forward seeks, but only to skip to the next
     document.  Note that tuple  results  will  need  to  be
     buffered  until  the grouping operation can complete --
     perhaps even buffered to disk.

o    The ranking aggregates need only store the  appropriate
     number of tuples for that group.

o    The  numerical  aggregates  we have been discussing fit
     well into  this  conceptual  model  as  well;  however,
     instead  of  storing the tuples, the tuples are used to
     evaluate the aggregate functions  immediately,  storing
     only  whatever  context  is  required for computing the
     numerical result.

o    The document construction aggregates  will  work  simi-
     larly  to the numerical aggregates; they produce a sin-
     gle output tuple.  However that tuple will  contain  an
     XNode  which   corresponds to the constructed temporary
     document.

Real Implementation

     While simple, the conceptual  implementation  would  be
difficult  to  implement.  Instead, we will now translate it
into a form that is implementable in a  reasonable  fashion.
To  do so I'll introduce the concept of aggregate functions.
Aggregate functions take a tuple as an input.  The  function
returns  information  about  the aggregate it is processing.
The function may also produce an output tuple.

              +------------------------------+
              |      Aggregate Status        |
              +------------------------------+
              |OK                            |
              |OK, pass on this tuple        |
              |OK, pass on a new tuple       |
              |Skip remaining input          |
              |Skip to doc_id (for feedback) |
              |Error                         |
              +------------------------------+











                             -5-


     The aggregate functions will run in the context of  the
grouping  operator.   The  grouping  operator will determine
which group the tuple belongs to  based  upon  the  grouping
attribute.   The  grouping  operator  will then retrieve the
context for the aggregate function for that group.  At  this
point  the  aggregate function will be given the input tuple
and it's implicit context  and  start  executing.   It  will
update the context with the aggregate value, perhaps pass on
a tuple, do whatever it needs  to  do.   Multiple  aggregate
functions can be executed by a aggregate aggregate function;
for example to  compute  multiple  aggregates  on  the  same
group.  Then the execution continues with the next tuple.

     When  the grouping operation is complete, or a group is
complete,  the  aggregate   function's   context   will   be
retrieved.   The  aggregate will then be able to execute and
end-of-aggregate method.  This end-of-aggregate  allows  the
AF  to produce final results, specify what kind of output it
produces (for example  a  tuple  buffer  will  produce  many
tuples).   There  may be both a result retrieval and end-of-
aggregate interface to allow seperation of  those  functions
for some grouping use.