Efficient Evaluation of Aggregates on Bulk Types

A new method for eeciently evaluating queries with aggregate functions is presented. More specii-cally, we introduce a class of aggregate queries where traditional query evaluation strategies in general require O(n 2) time and space in the size of the (at most two) input relations. For this class of aggregate queries our approach needs at most O(n log n) time and linear space. Further, our approach deals not only with relations but with general bulk types like sets, bags, and lists.


Introduction
Many queries involve the application of functions like count, sum, avg 15].Among these queries, two di erent classes can be detected: scalar aggregates and aggregate functions 10].Scalar aggregates return a single number from an input relation.Examples thereof are count, sum, avg etc. Aggregate functions, on the other hand, return a set of values for a given relation.Aggregate functions typically involve grouping.Then, a scalar aggregate is applied to each group.As an example consider the query retrieving the number of employees for each department.In this paper we are concerned with the e cient evaluation of aggregate functions.
At the source level of an SQL-like query language, aggregate functions are often expressed as nested queries (Type JA of Kim's classi cation 14]).The original proposal to evaluate these queries is to perform a nested loop 1].In order to eliminate the inherent ine ciency of this approach, Kim suggests a more e cient method 14]: perform a join between the relations of the outer and inner block, group the result, and then apply the scalar aggregate to each group.Clearly, if the join is an equi-join, this can be e ciently implemented.Subsequently, several bugs in this approach have been detected and corrected 9,11,13].The main point is to use an outer join instead of a regular join.Nevertheless, the main idea | join, group, aggregate | still remains the same and is even today the only known improvement over nested loops 7,16,17].Let us call this approach A-G-J.Lately, it was proposed to exchange grouping and aggregate operations with joins 19,18,20,5,6].This work is orthogonal to our's in that our method can be used to implement the grouping and aggregate operations used there.
Consider the performance of this approach if the join is a non-equi join, e.g., a -join, or 6 =-join.
Then, if the two relations involved contain n and m tuples each, the size of the output in general is O(n m).Hence, the approach needs O(n m) time and space.If both relations are the same, which is frequently the case, the A-G-J approach needs O(n 2 ) time and space.
The goal of the paper is to improve this to O(n log n + m log n) and O(n logn).At the core of our approach is a special parameterized abstract datatype which we call -table for a comparison operator (e.g. 2 f=; ; 6 =g).A -table allows a linear space representation of all stages of intermediate results of aggregate functions.Thereby, it avoids the explicit representation of the (intermediate) result of the join which is the crucial point of ine ciency of the A-G-J approach 1 .Further, we show that -tables can be implemented and that the result of the aggregate function can be extracted e ciently.This e ciency is achieved by exploiting a simple observation on aggregates: they are decomposable and reversible (cf.Sec. 2).
The paper is outlined as follows.Section 2 gives some basic notations and the de nitions for decomposable and reversible aggregates.Section 3 introduces the abstract datatype -table.It contains a short motivating introduction, the de nition of -tables together with an outline of their e cient implementation, and some example applications.Section 4 concludes the paper.

Preliminaries
Since we do not want to restrict our approach to the relational case, we consider aggregates on bulk type instances.Typical bulk types are sets, lists, and bags, denoted by set, list, and bag, respectively.They all contain a number of elements of some type.To abstract from this type, we introduce type variables, denoted by .To indicate that the elements of a bulk type are restricted to type , we write bulk( ).We use the upper case letters X, Y , and Z, to denote bulk type instances.The elements thereof are denoted by their lower case equivalents x, y, and z.A singleton bulk type instance containing solely the element x is denoted by bulk(x).
For any given bulk type bulk, we assume the existence of ; and +.For the above bulk types set, list, and bag, these symbols denote the empty set and disjoint set union, if bulk = set, the emtpy list and the append operation, if bulk = list, and the empty bag and bag union, if bulk = bag.Operations corresponding to + can be de ned for all bulk types (see, e.g., 2]: their bulk constructor C can be used to de ne +).For all bulk types, we denote membership by the symbol 2. Further, we assume the existence of an iterator foreach which iterates over the elements of a bulk instance where duplicates are iterated over as many times as they are present and bulk type orders are adhered to.This roughly corresponds to the apply-to-all operator 2] but foreach will be used at a lower level of abstraction as a means for expressing query evaluation plans.At the algebraic level, we use the operator instead.For a function f : 1 ! 2 , the signature of f is f : bulk( 1 ) ! bulk( 2 ), and its semantics is de ned with f (;) = ;, f (bulk(x)) = bulk(f(x)) and f (x+y) = f (x)+ f (y). is sometimes called map operator.
For convenience, we introduce two abbreviations concerning .If is a tuple type and X is an instance of a bulk type bulk( ), we abbreviate x:x a:f] (X) by a:f (X) to denote the extension of a tuple by an additional attribute a containing the application of f to the original tuple.As usual, denotes tuple concatenation.We further abbreviate x: x:x] (X) by X x].This expression builds a bulk of tuples of a single attribute x whose values are the original element x of a bulk type X.Hence, any bulk( ) instance is turned into a bulk( x : ]) instance.The main motivation behind these de nitions is that it is quite convenient to deal with bulks of tuples only 7,12].The last operator needed on bulk types is selection.
For a function p : !Bool, it is denoted by p and has the signature p : bulk( ) ! bulk( ).The de nition is given by p (;) = ;, p (x + y) = p (x) + p (y), and p (bulk(x)) = bulk(x) for a singleton bulk(x), if p(x) and = ; otherwise.
After these preliminaries, we can now concentrate on aggregates.In order to keep the de nition of aggregates as general as possible, we use the special type symbol N to denote the codomain of an aggregate.Typically, N is a numeral data type such as integer or oat.Within the paper, we further need N 0 being tuples of numeral data types.That is, N 0 has a signature of kind a 1 : 1 ; : : :; a n : n ] where each of the n is a numeral.A scalar aggregate f is a function ) for all X, Y , and Z with Z = X + Y .This condition assures that f(Z) can be computed on arbitrary subsets (-lists, -bags) of Z independently and the (partial) results can be joined to yield the correct (total) result.
A decomposable scalar aggregate f : bulk( ) ! N is called reversible if for there exists a function ?1 : N 0 ; N 0 !N 0 with f(X) = ( ?1 ( (Z); (Y ))) for all X, Y , and Z with Z = X + Y .This condition assures that we can compute f(X) for a subset (-list, -bag) X of Z by \subtracting" its aggregated complement Y from the \total" ( (Z)) by using ?1 .
The fact that scalar aggregates can be decomposable and reversible is the basic observation uppon which the e cient evaluation of aggregate functions builds.
3 The Abstract Data Type -Table 3.

Motivating Introduction
Our goal is to apply our technique to queries whose translation into the algebra contains subexpressions of the form a:f( p( py ( px (Y y])))) ( qx (X x])) where 1. p is a predicate involving x and y, 2. p x is a predicate involving x only, 3. p y is a predicate involving y only, 4. q x is a predicate involving x only, 5. f is a decomposable/reversible scalar aggregate, and 6. a is a new attribute unequal to x and y.Let us call the class of these queries .
Since the treatment of the additional selections px , py and qx is rather trivial and we do not care about the actual construction of the tuples from the elements in X and Y , we further restrict our discussion to expressions of the form ( ) a:f( p(Y )) (X): where X and Y are bulk type instances containing tuples.Nevertheless, the extensions of the algorithms below to the general case are straight forward.
Before we proceed, let us consider two example queries contained in the class .The rst query retrieves all managers together with the number of employees earning more than the manager.The translation into the algebra yields a:count( m:salary e:salary ( e:Emp])) (Mgr m]): This directly matches the form (*).
The second query demonstrates the special case X = Y and | at a rst sight | looks more complex than the rst one.The query retrieves all students together with the percentage of students better than the student.We measure better by a higher gpa.The translation of the second query into the algebra In order to further simplify the subsequent discussion and not to overwhelm the reader with the technical subtleties of the approach, we further restrict the predicate p of (*) to the form x:a y:b: for 2 f ; 6 =g.Other comparison operators can be treated similarily.More complex predicates, involving (implicit) conjunctions like x:a ?c 1 y:b x:a + c 2 for constants c i , can easily be treated by expanding the case discussed below by exploring reversibility as exempli ed below for 6 =.
The introduction of -tables is easier to understand when their usage is clear already.Hence, we illustrate their usage without giving their de nition but simply explain verbally the functionality and complexity of the evaluation with and without -tables.For this, let us consider yet another simple example query: Retrieve all employees together with the number of employees earning less (to cheer them up).If all employees are tuples and are contained in a set X, and s denotes the salary of each employee, then translation into the algebra yields a:count( y:s x:s (X y])) (X x]): The additional information | the result of count | will be contained in the additional attribute a which is added during the evaluation of .
Let us rst evaluate the costs imposed by the A-G-J approach.If the number of employees is n, and their salaries are equally distributed, then the size of the -join to be computed for the A-G-J evaluation is n n=2.Hence, the time and space complexity of this approach is O(n 2 ).
Opposedly, we will apply the following general procedure to evaluate algebraic expressions of the form (*) in the case X = Y : (1) tt := new -table ; (2) foreach (x in X) (3) tt.insadd(x); (4) tt.sort(); (5) tt.eval(); (1) creates a new -table tt which takes constant time.( 2) and (3) insert all elements (x) into tt.Each insertion takes constant time.(4) performs some kind of sorting and processing of duplicates.This is the most expensive step taking O(n log n) time.(5) then performs the last step of evaluating the aggregate.Whereas and of, e.g., count have been applied during insadd already, is applied during eval.This step takes O(n).Since the space consumed by the -table is linear in n, the analysis of this evaluation yields the time complexity O(n log n) and space requirements of O(n) which clearly outperforms the corresponding A-G-J approach.
In case X 6 = Y , the general evaluation procedure of (*) is (1) tt := new -table ; (2) foreach (x in X) (3) tt.ins(x); (4) tt.sort(); (5) foreach (y in Y) (6) tt.add(y); (7) tt.eval(); The only di erence is that the information of Y is added to the -table in a distinct step (4) and (5).The complexity of this algorithm will be O(n) for inserting the n elements of X, O(n logn) for sorting, O(m log n) for adding the m elements of Y , and O(n) for eval.Hence, we reach the promised O(n logn + m log n).

-Tables
Let us start by giving the signature of a -table which depends on several parameters: 1. the comparison operator , 2. the types bulk x ( x ) and bulk y ( y ) of X and Y , respectively, and 3. the signature of the component of the aggregate f to be computed.The -table has the signature list n : int; x : x ; f : N 0 ] if : N 0 ; N 0 !N 0 .The attribute n counts the number of occurrences of the element x in X. Special care has to be taken not to destroy the original order of bulk x if there is one.Nevertheless, we omit this technical issue in the present extended abstract.Hence, we just assume that the elements x of X are directly contained in the attribute x of the -table.The third attribute, f, contains the result of applying and to some subset of X.This subset is determined by the predicate p = y:b x:a.More speci cally, the value of f in the entry containing x 2 X is the same as applying to the set of all y with y:b x:a.
Since a -table will always be of type list, it is necessary, to convert the -table during eval into the bulk type bulk y but we will skip this trivial step since it also depends on operators occurring outside (*), e.g., projections, other conversions like sort, unique etc.
We implement -tables as arrays.Hence, linear space will su ce and sorting can be done in place.Further, insertion via binary search takes O(log n) if the -table contains n entries.Typical extensions of a -table are shown in Fig 2 .An alternative is to implement a -table as a balanced binary tree or B ? tree but note that this consumes O(n logn) space.Also, note that although these data structures useful for implementing -tables have been known, the way we use these data structures to evaluate aggregate queries more e ciently is new.insert builds a trivial entry from the element x to be inserted and appends it to the -table.The entry appended to the -table is list n : 1; x : x; f : (;)]: sort After inserting all elements, the method sort is applied.This method does two things.First, it sorts the elements in the -table according to the value of x:a.Then, it eliminates duplicate entries and remembers the number of occurrences in the additional attribute n of the entries of a -table.Thereby, it accumulates the f entries using .
add For adding an element y to the -table add performs the following.Using binary search, add searches for the element n : n; x : x; f : f] within the -table with the smallest value x:a such that y:b x:a.This element is then replaced by n : n; x : x; f : (f; (bulky))]: insadd For insadd(x), we append n : 1; x : x; f : (bulk(x))] to the -table.
eval is de ned as follows: eval() f N 0 a = (;); foreach (t in self) replace t = n : n; x : x; f : f] by n : n; x : x; f : a = (a; f)]; return self; g The local variable a accumulates all of the f values for each entry in the -table up to the current element.Then, each f value is set to this accumulated value.
Let us demonstrate the methods by giving their results when applied to the two bags X and Y of tuples with a single attribute a and b, respectively.X and Y are shown in Fig 1 .The algebraic expression is a:count( y:a x:a (Y )) (X) For the scalar aggregate count, we have the regular scalar aggregate count, is + and is identity.Fig. 2 shows the results after insert, sort, add, and eval in a left to right order.

6 =-Tables
For a scalar aggregate f with decomposition , , and where : N 0 ; N 0 !N 0 , a 6 =-table has the following signature: list( n : int; x : x; f : N 0 ]): Hence, it is the same as for -tables.The only di erence will be the value of f.Let x:a 6 = y:b be the predicate of (*).Then, f will collect the aggregate of all the y values with y:b = x:a.The computation of the correct aggregate value for a given x can then easily be computed by \subtracting" the value of f from the total aggregate of all y 2 Y .The total aggregate will be computed during the add or insadd phase depending on X 6 = Y or X = Y .For holding the total aggregate, we assume the 6 =-table to have an additional local variable a.Now, we are ready to specify the methods of the 6 =-table : create during the creation, the local variable a is initialized with (;).insert(x) appends the tuple n : 1; x : x; f : (;)] to the -table .sort proceeds as before.add(y) looks for an entry n : n; x : x; f : f] with y:b = x:a and replaces it by n : n; x : x; f : (f; (bulk(y)))].Further, it performs a := (a; (bulk(y))).addins(x) appends the tuple n : 1; x : x; f : (bulk(x))] and assigns a := (a; (bulk(x))).
As an example let us compute a:avg( y:b6 =x:a(Y )) (X) for X and Y of Fig. 1. (For the de nition of avg see page 2.) The expression computes for each x 2 X, the average value of y:b of all y 2 Y with y:b 6 = x:a.The resulting states of the 6 =-table after insert and sort are given in the upper row of Fig. 3, those after add and eval in the bottom row.

Remarks on -Tables
We would like to make some remarks.First, note that -tables can be used to implicitly represent the result of any -join in linear space.This can be achieved by chosing appropriate , , and operators with a non-numeral in their signature, i.e., N 0 is some bulk type.More speci cally, if and are chosen to be identity and is chosen to be +, -tables implicitly represent the result of a -join in linear space.Hence, they can also be used to optimize queries involving non-equi joins.
Second, note that the methods add and eval give a possible implementation of the generalized aggregate function G-aggr 9].Remember that G-aggr is a combination of (unary) grouping and the computation of a scalar aggregate.
Third, -tables can be used to e ciently implement the binary grouping operation introduced in 8] in order to enable the unnesting of queries which cannot be unnested otherwise.
Fourth, the technique introduced here can be seen as an extension of two techniques proposed in Section 5.3 of 4] and 3].In the context of distributed relational databases, the techniques proposed there are the following: (1) Grouping followed by aggregation is distributed over union, if each group is contained in one fragment.(2) For queries containing a scalar aggregate but no grouping, the answer to the query is computed by distributing the aggregate over the partitions of the queried relation, if the scalar aggregate to be computed is decomposable.
We introduced the abstract datatype -table which allows to evaluate queries of class involving aggregates in O(n logn + m log n) or O(n logn) time and linear space, where the traditional approach needs O(n m) or O(n 2 ) time and space.A syntactic characterization of the class based on the result of translating the query into the algebra was given.This seems to be insu cient since the range of applications of -tables is not limited to this class.Hence, further research should investigate the total range of applicability of -tables and come up with a useful characterization of the corresponding queries.By useful we mean that an optimization can easily detect the applicability of -tables.
A rst step in investigating the applicability is implied by the following observation.The queries capturable by the algebraic expressions treated in this paper are typically those requiring one level of nesting within some SQL-like query language.Future research will explore -tables for more levels of nesting.
Last not least, since -tables are implicit representations of -joins, their usefulness for e ciently evaluating successive non-equi joins seems worth to be investigated.
yields s:s;b:b=Stud:card ( b:count( b:gpa>s:gpa(Stud b])) (Stud s])) where Stud.card denotes the cardinality of all students.If not materialized, it can be computed during the processing of the expression b:count( b:gpa>s:gpa(Stud b])) (Stud s]) which is of the form (*).

Figure 1 :
Figure 1: The Extensions of two bags of tuples with a single attribute

Figure 2 :
Figure 2: Di erent States of a -Table