Database Programming Languages (DBPL-5) An Algebraic Framework for Physical OODB Design

Physical design for object-oriented databases is still in its infancy. Implementation decisions often intrude into the conceptual design (such as inverse links and object decomposition). Furthermore, query optimizers do not always take full advantage of physical design information. This paper proposes a formal framework for physical database design that automates the query translation process. In this framework, the physical database design is specified in a declarative manner. This specification is used for generating an efficient query transformer that translates logical queries into programs that manipulate the physical database. Alternative access paths to physical data are captured as simple rewrite rules that are used for generating alternative plans for a query. We present a method for translating these directives into a form that facilitates an automated translation of logical queries and updates. The program translation, as well as the elimination of the intermediate logical structures in the resulting program, is based on a formal model.


Introduction
One important advantage that commercial database systems offer is data independence, whereby abstract objects and the operations upon them can be significantly decoupled from their implementations. In a relational database system, for example, a database designer may choose the implementation of a database table from a number of possible structures (such as a B-tree or a hash table) as well as attach secondary indices to the table. These implementation decisions will not affect how queries are expressed in the database language but only how they are compiled and optimized. Furthermore, some systems provide a restructuring mechanism to change the implementation of parts of the database or to modify the database schema itself without losing any stored data.
Physical design for object-oriented databases is more difficult than for relational systems because the complexity of object-oriented database (OODB) data models results in a larger number of implementation choices. The database designer may consider clustering versus normalization for various nested collections in the database, create inverse links, attach secondary indices, materialize functions and views, partition large objects, etc. [15,14,23,4]. It is highly desirable to have these choices isolated from the conceptual model itself, leaving the application programmer to worry only about what data to retrieve, not how to retrieve the data. Achieving the same degree of data independence in an OODB system as in a relational database system is a major challenge for object-oriented databases.
This paper presents a framework for specifying the physical design in a declarative language, called the physical design language. It consists of a small, but extensible, repertoire of commands (called physical design directives) for specifying the implementation techniques for various parts of a database. For example, one command may indicate that a specific nested collection be normalized (flattened out) into two collections. The query translator uses these commands to translate queries against the conceptual database into queries against the physical database. If normalization was chosen for a nested collection, then a logical query that manipulates this nested collection may be translated into a query that joins the two normalized collections. The physical design language described in this paper captures most of the recent proposals for OODB physical designs, including clustering, horizontal and vertical partitioning, normalization, join indices, and multiple access paths via secondary indices. Expressing a physical design as a set of independent directives simplifies the physical design process.
The query translation process in our framework consists of several stages. First, the database administrator specifies the conceptual database schema. The main concern of this person is to write a functionally correct specification that satisfies all the design requirements. Then, the database implementor specifies the physical design in such a way that the performance of the resulting system is acceptable for the needs of this application. This person is also responsible for tuning the database to cope with new performance requirements. Finally, the application programmer submits a logical query against the database without any knowledge of the physical design. The query translator translates the query into a physical plan that reflects the physical design and ideally runs faster than any other equivalent plan. The query evaluator executes this plan and returns the result to the application programmer.
Query translation in our framework is purely algebraic and can be easily validated for correctness. In our framework, the physical database design has an internal schema that specifies the structure of the internal database state, an abstraction function [11] that maps the internal schema into the conceptual schema, and a set of constraints that capture the alternative access paths (such as secondary indices, materialized functions and views). The abstraction function is a logical view of the physical database. This function always exists, since otherwise there would be some semantic information lost when the conceptual database is mapped into the physical storage. Given the conceptual schema of an OODB and a set of physical design directives, we have an automated method for generating the physical schema, the abstraction function, and the plan transformers (this is the optimizer generation component in Figure 1). This method is the focus of the paper. It is expressed in rule form, requiring only one rule per physical design directive, and allows extensions to more complex physical design methods.
Our physical design framework requires that both conceptual and physical data structures, as well as the operations upon them, be defined in the same language. The language used in this paper is called the monoid comprehension calculus [9,10] because it is based on monoids and monoid comprehensions. Logical collection types, such as sets, lists, and bags, as well as physical data types, such as B-trees and hash tables, can be captured as monoids.
Logical queries are equivalent to queries against the conceptual database built from the internal database via the abstraction function. That is, any logical query can be transformed to a program that manipulates the physical database if we replace all references to the conceptual database state in the query with the logical view of the physical database state. The query translation process in our framework consists of substituting R(DB) for all occurrences of db in a logical query and normalizing the resulting program, where db is the conceptual database state, DB is the physical database state, and R is the abstraction function (this is the composition component in Figure 1).
We give a normalization algorithm that removes all the unnecessary intermediate logical structures, in such a way that the resulting normalized program does not actually materialize any part of the conceptual database. The resulting program (the physical plan in Figure 1) is thus a query that directly manipulates the physical database. That is, if the abstraction function is expressed in the monoid calculus, then any query in the monoid calculus that manipulates the conceptual database can be efficiently translated into a query that manipulates only the physical database. Even though the abstraction function builds the entire conceptual database from the physical database, no part of this construction will actually take place if we normalize the resulting query. The normalization algorithm is purely algebraic, simple, and efficient.
Access path selection is achieved by substituting C_i(DB) for DB in the derived physical plan, where C_i is a plan transformer, and then normalizing the resulting program (this step is the plan generation component in Figure 1). This phase can be combined with the application of commutativity and associativity rules for monoid comprehensions. There is no need to use a rewrite system for these transformations, since we only use three types of rules: an application of a plan transformer, associativity, and commutativity. In fact, an optimizer based on dynamic programming, such as the one for System R [19], would be sufficient for our purpose. In that case, the costing component in Figure 1 could be combined with the plan generation component.
In addition to query translation, in this paper we report an automated method for translating database updates against the conceptual database state into updates against the physical database.
The contributions of this paper are twofold. First, we present a declarative language for specifying physical design directives for an OODB management system that captures many recent proposals for OODB physical design. Second, we present a method for translating these directives into a form that facilitates an automated translation of logical queries and updates. The program translation as well as the elimination of the intermediate logical structures in the resulting program is based on a formal model.

Background
Queries in our framework are transformed into physical plans by a number of refinement steps. Thus, they need to be compiled into an algebraic form that captures both logical and physical operators. More importantly, the algebraic forms derived after query translation need to be normalized so that no intermediate logical structures are constructed during their evaluation. In this section we give a brief overview of the monoid comprehension calculus, which fulfills these two requirements. For a complete formal description of the calculus, which includes advanced data structures such as vectors, matrices, and object identity, the reader is referred to our previous work [9, 10].

The Monoid Comprehension Calculus
A data type T in our calculus is expressed as a monoid M = (T, zero, unit, merge), where the function merge, of type T × T → T, is associative with left and right identity zero, and the unit function maps an element into a singleton value of type T. If in addition merge is commutative (resp. idempotent, i.e., ∀x: merge(x, x) = x), then the monoid is commutative (resp. idempotent). For example, (set(α), {}, λx.{x}, ∪) is a commutative and idempotent monoid, while (int, 0, λx.x, +) is a commutative monoid. When necessary to distinguish the components of a particular monoid M, we qualify them as zero_M, unit_M, and merge_M. We have two types of monoids: collection and primitive monoids. Collection monoids capture bulk data types, while primitive monoids capture primitive types, such as integers and booleans. Table 1 presents some examples of collection and primitive monoids. The C/I column indicates whether the monoid is commutative or idempotent. The monoids list, bag, and set capture the well-known collection types for linear lists, multisets, and sets [7] (where ++ is list append and ⊎ is the additive union for bags). The monoid sorted[f] is parameterized by a function f whose range is associated with a partial order ≤. The merge function of this monoid merges two sorted lists into a sorted list: if x appears before y in a sorted[f] list, then f(x) ≤ f(y).
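As a concrete illustration (a sketch of our own, not the paper's implementation), a monoid can be modeled as a (zero, unit, merge) triple; folding a sequence through unit and merge then mirrors how a collection or primitive monoid accumulates values:

```python
# Hypothetical encoding of two monoids from Table 1 as (zero, unit, merge).

def set_monoid():
    # (set, {}, lambda x: {x}, union) -- commutative and idempotent
    return (frozenset(), lambda x: frozenset([x]), lambda a, b: a | b)

def sum_monoid():
    # (int, 0, identity, +) -- commutative but not idempotent
    return (0, lambda x: x, lambda a, b: a + b)

def fold(monoid, items):
    """Reduce a sequence of elements through unit and merge."""
    zero, unit, merge = monoid
    acc = zero
    for x in items:
        acc = merge(acc, unit(x))
    return acc

print(fold(set_monoid(), [1, 2, 2, 3]))      # duplicates collapse
print(fold(sum_monoid(), [1, 2, 2, 3]))      # 8
```

Idempotence is what makes the duplicate 2 disappear in the set fold but not in the sum fold.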
In our treatment of queries we will consider only monoid types as valid types. A monoid type has one of the following forms:

  class_name (a reference to a class)
  T (T is a primitive type, such as int and bool)
  T(type) (T is a collection type constructor, such as set, bag, and list)
  ⟨ A_1: t_1, ..., A_n: t_n ⟩ (a record type)

where type and t_1, ..., t_n are monoid types and T is a monoid. That is, collection types can be freely nested.
A monoid comprehension over the monoid M takes the form M{ e | r̄ }. Expression e is called the head of the comprehension. Each term r_i in the qualifier sequence r̄ = r_1, ..., r_n, n ≥ 0, is called a qualifier, and is either a generator of the form v ← e′, where v is a variable and e′ is an expression, or a filter p, where p is a predicate. The scope of the variable v in M{ e | r̄_1, v ← e′, r̄_2 } is limited to the rest of the comprehension, r̄_2, and to the head of the comprehension, e. As in most modern programming languages, the scope is textual and we have the typical scoping rules for name conflicts: e.g., the scope of the left v in M{ e | r̄_1, v ← e′, r̄_2, v ← e″, r̄_3 } is r̄_2 and e″, while the scope of the right v is r̄_3 and e.
For example, the join of two sets x and y, join(f, p)(x, y), is

  set{ f(a, b) | a ← x, b ← y, p(a, b) }

where p is the join predicate and the function f constructs an output set element from two elements of x and y. A monoid comprehension is defined by the following reduction rules (a formal definition based on monoid homomorphisms is presented elsewhere [10]):

  M{ e | } → unit_M(e)                                              (1)
  M{ e | false, r̄ } → zero_M                                        (2)
  M{ e | true, r̄ } → M{ e | r̄ }                                     (3)
  M{ e | v ← zero_N, r̄ } → zero_M                                   (4)
  M{ e | v ← unit_N(e′), r̄ } → let v = e′ in M{ e | r̄ }             (5)
  M{ e | v ← merge_N(e_1, e_2), r̄ } →
      merge_M( M{ e | v ← e_1, r̄ }, M{ e | v ← e_2, r̄ } )           (6)

Rules 2 and 3 reduce a comprehension in which the leftmost qualifier is a filter, while Rules 4-6 reduce a comprehension in which the leftmost qualifier is a generator.
This definition of a comprehension provides an equational theory that allows us to prove the soundness of various transformations, including the translation of comprehensions into efficient joins.
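The reduction rules above can be played out directly in code. The sketch below (hypothetical names, our own encoding) evaluates a comprehension qualifier by qualifier: an empty qualifier list applies unit, a filter either keeps or discards the rest, and a generator merges one sub-result per element of its domain:

```python
# A comprehension M{ head | q1, ..., qn } evaluated by the reduction rules.
# A monoid is a (zero, unit, merge) triple; qualifiers are tagged tuples.

def comprehend(monoid, head, quals, env=None):
    zero, unit, merge = monoid
    env = env or {}
    if not quals:                                  # M{ e | } = unit_M(e)
        return unit(head(env))
    q, rest = quals[0], quals[1:]
    if q[0] == 'filter':                           # Rules 2 and 3
        return comprehend(monoid, head, rest, env) if q[1](env) else zero
    _, var, dom = q                                # Rules 4-6: iterate domain,
    acc = zero                                     # merging the sub-results
    for x in dom(env):
        acc = merge(acc, comprehend(monoid, head, rest, {**env, var: x}))
    return acc

sum_m = (0, lambda x: x, lambda a, b: a + b)
# sum{ a | a <- [1, 2, 3], a >= 2 }
r = comprehend(sum_m,
               lambda env: env['a'],
               [('gen', 'a', lambda env: [1, 2, 3]),
                ('filter', lambda env: env['a'] >= 2)])
print(r)  # 5
```

This is the naive interpretation; the point of the later sections is precisely to replace it by efficient physical algorithms.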
The monoid comprehension is the only form of bulk manipulation of collection types supported in our calculus, but monoid comprehensions are very expressive. In fact, a small subset of these forms, namely the monoid comprehensions from sets to sets, captures precisely the nested relational algebra (since they are equivalent to the set monad comprehensions [6]). For example, the nesting operator for nested relations groups the elements of a set by one component, and the unnesting operator flattens a nested set with a comprehension of the form set{ ⟨ s, p ⟩ | s ← x, p ← s.P }. The last comprehension is an example of a dependent join, in which the value of the second collection, s.P, depends on the value of s, an element of the first relation x. Dependent joins are a convenient way of traversing nested collections.
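In Python list-comprehension form (an analogy with invented data, not the calculus itself), a dependent join over a nested collection looks like this:

```python
# Dependent join: the domain of the inner generator, s['P'], depends on
# the variable s bound by the outer generator.
x = [{'A': 1, 'P': [10, 11]}, {'A': 2, 'P': [12]}]

unnested = [(s['A'], p) for s in x for p in s['P']]
print(unnested)  # [(1, 10), (1, 11), (2, 12)]
```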
But monoid comprehensions go beyond the nested relational algebra to capture operations over multiple collection types, such as the join of a list with a bag that returns a set, plus predicates and aggregates. For example,

  set{ (x, y) | x ← [1, 2], y ← {{3, 4, 3}} } = { (1,3), (1,4), (2,3), (2,4) }

Another example is sum{ a | a ← [1, 2, 3], a ≥ 2 }, which returns 5, the sum of all list elements greater than or equal to 2. Comprehensions can also capture physical algorithms, such as the merge join of x, an instance of a sorted[f] monoid, with y, an instance of a sorted[g] monoid (f and g are not necessarily the same). That is, such a comprehension behaves exactly like a merge join: it receives two sorted lists as input and generates a sorted list as output. Even though the naive interpretation of this program derived from the comprehension definition (Rules 1 through 6) is quadratic, we will see later that there are effective ways of assigning specialized execution algorithms to these programs. In that case, the program will be a real merge join. This assignment of efficient execution algorithms is possible by examining the types of the generator domains in a comprehension.
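A sketch of the merge-join reading of such a comprehension (helper names invented; distinct join keys assumed for brevity): both inputs arrive sorted, so a single linear scan replaces the quadratic nested-loop interpretation.

```python
def merge_join(xs, ys, f, g, out):
    """Join xs (sorted by f) with ys (sorted by g) on f(x) == g(y)."""
    i = j = 0
    res = []
    while i < len(xs) and j < len(ys):
        if f(xs[i]) < g(ys[j]):          # advance the cursor with the
            i += 1                       # smaller key ...
        elif f(xs[i]) > g(ys[j]):
            j += 1
        else:                            # ... and emit on equality
            res.append(out(xs[i], ys[j]))
            i += 1
            j += 1
    return res

pairs = merge_join([(1, 'a'), (2, 'b'), (3, 'c')],
                   [(2, 'x'), (4, 'z')],
                   lambda x: x[0], lambda y: y[0],
                   lambda x, y: (x[1], y[1]))
print(pairs)  # [('b', 'x')]
```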
The calculus has a semantic well-formedness requirement: a comprehension must be over an idempotent or commutative monoid if any of its generators range over idempotent or commutative monoids. For example, list{ x | x ← {1, 2} } is not a valid monoid comprehension, since it maps a set (which is both commutative and idempotent) to a list (which is neither commutative nor idempotent), while sum{ x | x ← {{1, 2}} } is valid (since both bag and sum are commutative). This requirement can easily be checked at compile time [9].
We will use the following convention to represent variable bindings in a comprehension:

  M{ e | r̄, x ≡ u, s̄ } = M{ e[u/x] | r̄, s̄[u/x] }

where e[u/x] is the expression e with u substituted for all free occurrences of x (i.e., e[u/x] is equivalent to let x = u in e). A term of the form x ≡ u is called a binding, since it binds the variable x to the expression u.

Program Normalization
The monoid calculus can be put into a canonical form by an efficient rewrite algorithm, called the normalization algorithm (described in detail elsewhere [10]). The evaluation of these canonical forms generally produces fewer intermediate data structures than the initial unnormalized programs, and in many cases normalization improves program performance. The normalization algorithm will be used as a prephase to our query evaluator, since canonical forms are a convenient program representation that facilitates program transformation. The physical design framework described in Section 3 uses this algorithm to eliminate the value coercions introduced when mapping logical queries into physical programs. The normalization algorithm is a pattern-based rewriting algorithm. One example of a rewrite rule that this algorithm uses unnests nested comprehensions (i.e., comprehensions that contain a generator whose domain is another comprehension):

  M{ e | r̄, v ← N{ e′ | t̄ }, s̄ } → M{ e | r̄, t̄, v ≡ e′, s̄ }    (8)

Rules 7 and 8 are the most complex rules of the normalization algorithm. The other rules include trivial reductions, such as reducing a projection over a tuple construction to the corresponding tuple component. Rule 8 may require some variable renaming to avoid name conflicts.
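Rule 8 can be mimicked with ordinary Python set comprehensions (an analogy only, with invented data): unnesting replaces the generator over the inner comprehension by the inner qualifiers plus a binding of v to the inner head.

```python
ws = [1, 2, 2]

# set{ v + 1 | v <- set{ w * 10 | w <- ws } }  -- nested form
nested = {v + 1 for v in {w * 10 for w in ws}}

# set{ v + 1 | w <- ws, v ≡ w * 10 }  -- after Rule 8, v is just a binding
flat = {(w * 10) + 1 for w in ws}

print(nested == flat)  # True
```

The inner set is never materialized in the flattened form, which is exactly the effect the normalization algorithm aims for.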

A path is either a name (the identifier of a bound variable, the identifier of a persistent variable, or the name of a class extent) or an expression path′.name (where name is an attribute name of a record and path′ is a path). If the generator domains in a comprehension (i.e., the expressions e in v ← e) do not contain any non-commutative merges (such as list append), then these domains can be normalized into paths [10]. In the next section we will use the following shorthand: a path expression (as defined in [12]) is an expression of the form db.pth_1.pth_2. ... .pth_{n+1}, where each pth_i is a path and db is the conceptual database state, and whose interpretation in our calculus is

  set{ v_n.pth_{n+1} | v_1 ← db.pth_1, v_2 ← v_1.pth_2, ..., v_n ← v_{n-1}.pth_n }

In addition to the normalization rules, there are other important program transformations that exploit the commutativity properties of monoids. In particular, if M is a commutative monoid, then we have the following join commutativity rule, which holds only when the term e_2 does not depend on v_1:

  M{ e | r̄, v_1 ← e_1, v_2 ← e_2, s̄ } → M{ e | r̄, v_2 ← e_2, v_1 ← e_1, s̄ }

The following transformation, which is valid for any monoid M, pushes a selection before a join if pred does not depend on v:

  M{ e | r̄, v ← e_1, pred, s̄ } → M{ e | r̄, pred, v ← e_1, s̄ }
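The selection-pushdown rule has a direct analogue in Python comprehensions (an illustration with invented data, not the algorithm itself): moving a filter before a generator it does not depend on preserves the result while shrinking the intermediate work.

```python
xs = [1, 2, 3, 4]
ys = ['a', 'b']

# M{ (x, y) | x <- xs, y <- ys, even(x) }  -- filter after the join
after = [(x, y) for x in xs for y in ys if x % 2 == 0]

# M{ (x, y) | x <- xs, even(x), y <- ys }  -- filter pushed before the join
before = [(x, y) for x in xs if x % 2 == 0 for y in ys]

print(after == before)  # True
```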

Physical Design
In this section we show how to translate queries against the conceptual database into queries against the physical database in a way that reflects a user-specified physical design. The translation process is described through examples that illustrate the basic idea. The physical design language is presented in Section 4 while the rules for generating the query translator from a physical design are presented in Section 5. In the first example we normalize a nested relation. We intentionally kept this example simple so that one can easily express the abstraction function and the plan transformers by simply observing the conceptual and the physical schema. These observations will help us understand how these programs are generated automatically by the optimizer-generation component of our translator. We use these programs to translate a logical query into a physical plan and to derive alternative plans. The second example is more complex.
It is based on a conceptual OODB schema with a complex physical design. The purpose of this example is to support our claim that the same theory can be easily scaled up to capture more complex designs.

Example 1: Mapping Nested Relations into Flat Relations
Consider the following NF² conceptual database schema:

  db: set(⟨ A: int, B: set(⟨ C: int, D: int ⟩), E: int ⟩)

Suppose that we want to implement this schema using flat table structures. The standard approach is to normalize the nested collection into two tables T1 and T2: table T1 holds the outer set while table T2 holds the union of all the inner sets. Then, whenever a query manipulates the initial nested collection, this nested collection is reconstructed via an implicit join. Furthermore, suppose that we want to implement the outer set as a B-tree indexed by A and to add a secondary index (also implemented as a B-tree) indexed by E. Using our physical design language (described in detail in Section 4), this specification is expressed by the following physical design directives:

  { btree( db, A ),        (1)
    normalize( db.B ),     (2)
    secondary( db, E ) }   (3)

Directive (1) indicates that the outer set is implemented as a B-tree indexed by A. Directive (2) indicates that the nested set (reached by the path expression db.B) is normalized. Directive (3) indicates that a secondary index is attached to the outer set. One possible internal (physical) schema that captures this design is the following:

  DB: ⟨ T1: sorted[A](⟨ A: int, E: int ⟩),
        T2: sorted[#](⟨ #: TID, C: int, D: int ⟩),
        T3: sorted[E](⟨ #: TID, E: int ⟩) ⟩

The # attributes in T2 and T3 hold tuple identifiers (we write @x for the TID of a tuple x). Sequence T1 is implemented as a sequence sorted by A, that is, ∀x, y ∈ T1: @x ≤ @y ⇒ x.A ≤ y.A. A similar equation holds for the secondary index T3. Sequence T2 is indexed by the # attribute, that is, ∀x, y ∈ T2: @x ≤ @y ⇒ x.# ≤ y.#. If x ∈ T2 is a child of y ∈ T1, then x.# = @y.
The inner set of the conceptual database is implemented as a sorted[#] sequence so that the join between T1 and T2 over the join predicate x.# = @y, which reconstructs the nested set, can be performed as a merge join. Similarly, for each x ∈ T1 there is a y ∈ T3 such that y.# = @x and y.E = x.E.
Let R be the abstraction function that maps the physical schema DBtype to the conceptual schema dbtype. That is, if db of type dbtype is the database state as a user sees it and DB of type DBtype is the actual database state as it is stored on disk, then db = R(DB). For our example, we have:

  R(DB) = set{ ⟨ A = a.A,
                 B = set{ ⟨ C = b.C, D = b.D ⟩ | b ← DB.T2, b.# = @a },
                 E = a.E ⟩
             | a ← DB.T1 }

In addition, there is a relationship between the table T1 and its secondary index T3. This relationship can be captured by the function C (a plan transformer), which represents a referential integrity constraint on the physical schema: table T1 can also be retrieved by joining T1 with T3. That is, if a tuple b of the secondary index T3 is located (e.g., by providing the value b.E), then the associated tuple a of T1 is located by the equijoin @a = b.#. The tuple identifier @ of the resulting tuples in T1 is set to @a, so that the tuples in T1 have the same tuple identifiers as those generated by the comprehension. That is, the TID @ is handled as a record attribute, even though it does not occupy any physical space. This function makes the tuple identifiers of all the records in C(DB) equal to the tuple identifiers generated by the expression in the definition of C.
An abstract query is a function f over the conceptual database db. When we substitute R(DB) for db in such a query and normalize, the initial dependent join, which was over a nested collection, is flattened into a 1NF join. Notice that DB.T1 is sorted by both the @ and A attributes, while DB.T2 is sorted by @ and #. That is, the derived program has the functionality of a sort-merge join, since the join predicate is b.# = @a. This functionality can be deduced directly from the types of the comprehension generators. In contrast to most query optimization approaches, the programs derived in our framework are guaranteed to be correct, since our framework uses transformations that are purely algebraic and meaning preserving.
The alternative access path that uses the secondary index T3 can be derived from the equation F′(DB) = F(C(DB)). The resulting program is an alternative plan for evaluating the initial logical query: a 3-way sort-merge join that corresponds to the alternative access path associated with the secondary index T3. Both programs F′(DB) and F(DB) should be considered by the query optimizer for costing. If there were many integrity constraints because of multiple access paths, then an optimization step would consist of selecting one of the plan transformers C, substituting C(DB) for DB in the current program, and normalizing the resulting program. The optimization process consists of exploring all the alternative programs generated by applying this optimization step multiple times, as well as of using the commutativity and associativity properties of monoids.
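The whole of Example 1 can be mocked up in a few lines of Python (the data and field names below are invented for illustration; the `tid` field plays the role of @, and `#` is the parent-TID attribute):

```python
# Physical state: T1 sorted by A, T2 sorted by the parent TID '#',
# T3 a secondary index on E.
DB = {
    'T1': [{'tid': 1, 'A': 10, 'E': 7}, {'tid': 2, 'A': 20, 'E': 9}],
    'T2': [{'#': 1, 'C': 1, 'D': 2}, {'#': 1, 'C': 3, 'D': 4},
           {'#': 2, 'C': 5, 'D': 6}],
    'T3': [{'#': 1, 'E': 7}, {'#': 2, 'E': 9}],
}

def R(DB):
    """Abstraction function: rebuild the conceptual nested relation."""
    return [{'A': a['A'],
             'B': [{'C': b['C'], 'D': b['D']}
                   for b in DB['T2'] if b['#'] == a['tid']],
             'E': a['E']}
            for a in DB['T1']]

def C(DB):
    """Plan transformer: retrieve T1 through the secondary index T3."""
    t1 = [a for b in DB['T3'] for a in DB['T1'] if a['tid'] == b['#']]
    return {**DB, 'T1': t1}

# Logical query set{ b.C | a <- db, b <- a.B, a.E > 8 } against R(DB) ...
logical = [b['C'] for a in R(DB) for b in a['B'] if a['E'] > 8]

# ... and its normalized physical plan: a flat join of T1 and T2 that
# never materializes the nested relation.
physical = [b['C'] for a in DB['T1'] for b in DB['T2']
            if b['#'] == a['tid'] and a['E'] > 8]

print(logical, physical, R(C(DB)) == R(DB))  # [5] [5] True
```

The last check illustrates why C is a valid plan transformer in this toy setting: detouring through the index leaves the logical view unchanged.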

Example 2: OODB Physical Design
The example presented here translates an OODB query into a physical plan that reflects an OODB physical design. The conceptual database schema defines the classes city and hotel, where an extent name denotes the collection of all instances of a class. The database schema db associated with this specification is the aggregation of all class extents along with a number of persistent variables. To keep our examples short, though, we will assume that there are no persistent variables. In that case, db has the type:

  ⟨ hotels: set(hotel), cities: set(city) ⟩

As we mentioned earlier, physical design in our framework consists of a set of physical design directives specified by the database implementor. In order to reduce the number of required physical directives, we assume a default implementation for the database. The physical design directives are then commands that change these defaults.
In the default implementation, objects from two different classes are not clustered together. That is, the hotels extent will be stored in a different storage collection than the cities extent, while each cities.hotels bag will be a bag of OIDs that reference hotels. But the database implementor can cluster cities and hotels together by stating the right physical directive. The default implementation for a nested collection, such as hotels.rooms, is the direct storage model [23]: all hierarchical object structures are stored in preorder form. For example, hotels and hotels.rooms are clustered together, with the rooms of a hotel stored adjacent to the hotel.
The following is an example of the physical design directives specified by the database implementor during the physical design of the previous OODB example:

  { btree( cities, name ),         (1)
    btree( hotels, name ),         (2)
    secondary( hotels, address ),  (3)
    normalize( cities.hotels ),    (4)
    join_index( hotels.rooms ) }   (5)

Directives (1) and (2) indicate that both cities and hotels will be implemented as B-trees indexed by name. Directive (3) indicates that a secondary index on attribute address will be attached to hotels. Directive (4) indicates that cities.hotels will be normalized; the conceptual nested collection is reconstructed by a join. Directive (5) requests a binary join index for hotels.rooms. This directive implies that hotels.rooms is normalized and that there is an additional index for accelerating the join between the normalized tables. According to these physical design directives, the physical schema DB for our OODB example contains the normalized tables hotels_rooms and cities_hotels along with the join index hotels_rooms_JI (it is generated automatically by a program described in Section 5). That is, the set of rooms in a hotel b is reconstructed by joining the normalized table hotels_rooms with the join index hotels_rooms_JI. The set of all hotel references cities.hotels in a city a is reconstructed by joining the normalized table cities_hotels with the hotels extent.
The secondary index gives rise to a generated plan transformer C. Observe that the translated query below is purely in terms of physical storage structures and has no nested comprehensions; hence it does not reconstruct any of the structures of the conceptual database. The resulting program is still a dependent join, since c is derived from a.places_to_visit. But the collection DB.cities.places_to_visit is not normalized; therefore, all places_to_visit are clustered together with the cities. Hence, when a city a is retrieved, all places_to_visit in a are retrieved as well.
If we use the secondary index secondary( hotels, address ), the previous program becomes

  set{ x.name | a ← C(DB).cities, c ← a.places_to_visit,
                b ← C(DB).cities_hotels, x ← C(DB).hotels,
                @x = b.INFO, b.# = @a,
                a.name = "Portland", x.name = c.name }

which, when normalized by Rules 8 and 7, becomes

  set{ y.name | a ← DB.cities, c ← a.places_to_visit,
                b ← DB.cities_hotels, y ← DB.hotels,
                z ← DB.hotels_address, z.# = @y,
                @y = b.INFO, b.# = @a,
                a.name = "Portland", y.name = c.name }

Physical Design Specification
The following is the detailed description of the physical design directives. This description is by no means a complete list. It can easily be extended to incorporate new physical design techniques, new storage structures, and new physical algorithms. Such extensions are easy to incorporate because, as we will see next, each design technique can be expressed in a declarative way, in the form of a rule that is independent of the other rules. We have been experimenting with vertical partitioning of collections, hierarchical join indices [23], implementation of OIDs with surrogates, materialized functions and views, and denormalization [17] (where two collections that are not nested together are stored as a nested collection), but we decided not to include them here to simplify the exposition of the translation algorithms.

The Optimizer Generator
The following algorithms generate the physical schema, the abstraction function, and the semantic constraints from the conceptual schema and the physical design directives. To keep the algorithms simple, we assume that the physical design directives have been checked for semantic correctness and for possible conflicts before they are fed to these algorithms (e.g., all path expressions in the directives are valid within the conceptual database schema). The condition of a rule checks whether a specific directive exists in the set of physical design directives. Only the first rule whose head matches the current type and whose condition matches one of the directives is executed. The matched directive is not used again. For example, the rule that checks for a partition directive can be used only once for each directive, hence allowing multiple horizontal partitions of the same collection.

The conceptual database db is mapped into the physical schema T[[dbtype]](db), which is a record, since dbtype is also a record. The resulting record is extended with record attributes that contain the normalized collections. The primitive monoid pick in the third rule ranges over tuple identifiers. Its zero value is null, its unit function is the identity function, and its merge function satisfies merge_pick(null, x) = x and merge_pick(x, y) = x otherwise. For example, pick{ @x | x ← DB.hotels, @x = h } dereferences a hotel from the class extent DB.hotels using the TID h. If there is no such hotel, then it returns null. If there is more than one hotel (which never happens, since TIDs are unique), then it returns the first one.
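A tiny sketch of pick (names and data invented): merge keeps the leftmost non-null value, so folding a filtered extent yields the unique match, or null when there is none.

```python
def merge_pick(x, y):
    # merge_pick(null, x) = x; otherwise merge_pick(x, y) = x
    return y if x is None else x

def pick(items):
    acc = None                     # zero_pick is null (None here)
    for v in items:                # unit is the identity function
        acc = merge_pick(acc, v)
    return acc

hotels = [{'tid': 1, 'name': 'Benson'}, {'tid': 2, 'name': 'Hilton'}]

# pick{ x | x <- hotels, @x = h } with h = 2
print(pick(x for x in hotels if x['tid'] == 2))  # {'tid': 2, 'name': 'Hilton'}
print(pick(x for x in hotels if x['tid'] == 9))  # None
```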
The f(y) = x.KEY predicate in the next-to-last rule in Figure 3, which checks for a partition, is redundant because of the way the partition was constructed. But if there were a generator v ← e in a comprehension, where e is partitioned by f, and a predicate f(v) = constant, then the pair would be translated into x ← e, y ← x.PARTITION, f(y) = x.KEY, f(y) = constant, which implies x.KEY = constant. That way, only the partition with the specified KEY is retrieved. A secondary index can also be attached to a nested collection: for example, we can then access any place to visit by providing its name only, without having to go through the cities extent.

Translation of Updates
In this section we are concerned with the translation of user-level database updates over the conceptual database into updates over the internal database. For example, if there were a secondary index attached to a table, then, when we insert an item into this table, we would like the secondary index to be updated as well.
Database updates can be captured by extending the definition of monoid comprehensions with the following comprehension qualifiers: qualifier path := u destructively replaces the value stored at path with u; qualifier path += u merges the singleton u with the collection at path; and qualifier path -= u deletes all elements equal to u from the collection reached by path.
For example, if the abstract database db is of type set(int), then some{ true | a ← db, a > 10, a += 1 } increments every database element greater than 10 by one. It returns true if at least one update is performed. A more complex example over the OODB schema introduced earlier is: some{ true | c ← db.cities, c.name = "Portland", h ← c.hotels, h.name = "Benson", r ← h.rooms, r.beds = 1, r.price += 100 } which increases the price of a single room in Portland's Benson hotel by $100.
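The first example can be interpreted over an ordinary set. This is only a toy reading of the qualifier semantics, modeling the in-place update by rebuilding the set, not the paper's translation algorithm.

```python
# Toy interpretation of some{ true | a <- db, a > 10, a += 1 } over a
# database of type set(int): increment every element greater than 10 and
# report whether at least one update was performed.

def increment_large(db):
    """Return (updated database, whether any update fired)."""
    updated = {a + 1 if a > 10 else a for a in db}
    some = any(a > 10 for a in db)   # the comprehension's 'some' result
    return updated, some

db = {3, 11, 20}
new_db, changed = increment_large(db)
```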
If database updates modify primitive values only, then the query translation process described in Section 3 is sufficient for update translation too (since a conceptual path that reaches a primitive value is always translated into a physical path, while a conceptual path that reaches a collection may be translated into a complex comprehension). For example, if we substitute R(DB) for db in the last comprehension and normalize, we get: some{ true | c ← db.cities, c.name = "Portland", c.hotels += ⟨ name = "Hilton", address = "Park Ave", facilities = {}, rooms = { ⟨ beds = 1, price = 100 ⟩, ⟨ beds = 2, price = 150 ⟩ } ⟩ } (Here ppath denotes the logical path expression that corresponds to path; e.g., if path = s.price then ppath = db.hotels.rooms.price, and type is the type of ppath.) The algorithm is given in Figure 4 and uses a number of support functions. Database deletions can be handled in the same way as insertions (by substituting -= for +=). Updates of the form path := e, where path is a collection, can be translated into: some{ true | x ← path, path -= x, y ← e, path += y }
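The last rewrite can be simulated directly. This sketch (with an invented rooms collection) models path := e over a collection as delete-everything followed by insert-everything, exactly as the translated comprehension prescribes.

```python
# Sketch of some{ true | x <- path, path -= x, y <- e, path += y }:
# assignment to a collection-valued path is simulated by removing every
# current element and then inserting the elements of e.

def assign_collection(path, e):
    """Model path := e for a collection-valued path, in place."""
    for x in list(path):   # path -= x, for every current element
        path.remove(x)
    for y in e:            # path += y, for every element of e
        path.append(y)
    return True            # 'some' reports that updates were performed

rooms = [{"beds": 1, "price": 100}]
assign_collection(rooms, [{"beds": 2, "price": 150}])
```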

Related Work
Our framework is based on monoid homomorphisms, which were first introduced as an effective way to capture database queries by V. Tannen and P. Buneman [5,7,6]. Their form of monoid homomorphism (also called structural recursion over the union presentation, SRU) is more expressive than our calculus. Operations of the SRU form, though, require the validation of the associativity, commutativity, and idempotence properties of the monoid associated with the output of the operation. These properties are hard for a compiler to check [7], which makes the SRU operation impractical. They were the first to recognize that there are special cases where these conditions are automatically satisfied, such as the ext(f)(A) operation. In our view, SRU is too expressive, since inconsistent programs cannot always be detected in that form. To our knowledge, there is no normalization algorithm for SRU forms in general (i.e., SRU forms cannot be put in canonical form). On the other hand, ext(f) is not expressive enough, since it does not capture operations that involve different collection types, and it cannot express predicates and aggregates. We believe that our monoid comprehension calculus is the most expressive subset of SRU in which inconsistencies can always be detected at compile time and, more importantly, in which all programs can be put in canonical form.
Monad comprehensions were first introduced by P. Wadler [24] as a generalization of list comprehensions (which already existed in some functional languages). Monoid comprehensions are related to monad comprehensions, but they are considerably more expressive. In particular, monoid comprehensions can mix inputs from different collection types and may return output of yet another type. This mixing of types is not possible for monad comprehensions, since they restrict the inputs and the output of a comprehension to be of the same type. Monad comprehensions were first proposed as a convenient and practical database language by P. Trinder [21,20], who also presented many algebraic transformations over these forms as well as methods for converting comprehensions into joins. The monad comprehension syntax was also adopted by P. Buneman and V. Tannen [8] as an alternative syntax to monoid homomorphisms. The comprehension syntax was used for capturing operations that involve collections of the same type, while structural recursion was used for expressing the rest of the operations (such as converting one collection type to another, predicates, and