A Graphical Yet Formalized Framework for Specifying View Systems

A graphical formalized language is proposed for specifying systems of views over database schemas. The language is based on the notion of arrow (mapping) between data schemas and is suitable for any data model for which schema mappings are defined. In particular, the constructs of query, query language, view and view integration can be consistently expressed in this arrow formalism and specified accordingly. This gives rise to a general graph-based framework for specifying complex view systems. The basic constructions of the language, and the entire framework as well, can be considered as specializations of very general constructs developed in mathematical category theory.


Introduction
The notion of view is central to database (DB) technology. Views make it possible to provide each application with its own presentation of data and to isolate it from details and changes of DB schemas that are inessential for it. The practical importance of views is commonly recognized, and thus a complex information system appears as a system of views over the DB, views over views, views over views over views and so on. In addition, these views can be related to one another, e.g., one view can be the intersection of several other views, or the merge of several views, or both. In other words, the view system is itself a complex data structure subject to integrity constraints and carrying certain operations. Hence, there is a practical need for languages that specify view systems in a compact and comprehensible but precise way.
The problem becomes especially pressing in the context of modern information systems, which are essentially heterogeneous and distributed over a DB network. As a result, the specification language in question should work even when views in the system are defined in different data models, i.e., we need a language for specifying heterogeneous view systems.
The importance of the issue is well recognized. However, while almost everything is clear in the case of the relational data model, a proper notion of view against an OODB is still debatable. The situation is even worse in the context of heterogeneity: in fact, no sufficiently general notion of view has been proposed, and each time the construct is handled in an ad hoc way (if at all).
The goal of the present paper is to describe a specification language which makes it possible to specify heterogeneous view systems in a comprehensible yet formalizable way. Namely, we show that a view against a DB schema $S$ is an arrow $v : S_V \to \bar{S}$ which denotes a mapping from the view schema $S_V$ into some augmentation $\bar{S}$ of $S$ with derived items, $\bar{S} \supset S$. Of course, it is presupposed that for each given data model the notions of schema, derived schema and schema mapping are accurately defined, so that quite concrete metadata are hidden behind a view arrow as above. On the other hand, this allows one to use the same arrow pattern for specifying view systems in different data models.
Advances in Databases and Information Systems, ADBIS'97
Thus, a view system in a given data model is described by a directed graph whose nodes are data schemas and whose arrows are schema mappings. In addition, constraints imposed on the system can be specified by declaring the corresponding properties for some diagrams in the view graph. So, a view system is normally specified by a sketch ([11]) rather than merely a graph, that is, by a graph in which some diagrams are labeled by predicate markers taken from some predefined signature. Different data models give rise to different marker signatures, but this variety is only superficial: sketches in different signatures are nevertheless sketches, and they can be compared and integrated by special methods. Moreover, signatures are also specified by sketches, and signature integration is reduced to the sketch integration described in [8]. In any case, in many practically interesting cases signature integration is not difficult.
The view-systems-as-sketches framework is polymorphic in its nature, that is, it is suitable for any data model: one must only define what schemas and their mappings are. In particular, the framework can be applied to the sketch data model itself, that is, both data and metadata may be specified by sketches. In the present paper sketches are essential on the metalevel and, simultaneously, are used on the level of data modeling to exemplify and demonstrate the essence of the approach.
One can see that the proposed framework is based on arrows and needs the corresponding arrow machinery (and even special arrow thinking). Such machinery (and kind of thinking) was developed in mathematical category theory and has proven extremely effective in studying various systems of logic, in particular, logics employed in computer science and artificial intelligence. In a sense, the present paper can be considered as stating one more application of category theory in computer science. What we wish to stress in this respect is that abstract categorical constructs turned out to be unexpectedly close to the real software problem we discuss.
The rest of the paper is organized as follows. In section 2 a brief outline of sketches is presented to make the paper self-contained. In section 3 the arrow setting for views is described. In section 4 a general graph-based framework for specifying federal database environments is presented.
Acknowledgement. We are indebted to Olga Tkacheva for carefully typesetting in TeX (with Paul Taylor's "Diagrams") the numerous sketches in this paper.

Sketches vs ER-diagrams: an example
The sketch data model was introduced in [11]. Recall that a sketch is a directed multigraph in which some diagrams are labeled with special markers. The intended semantics is to interpret nodes as sets and arrows as functions (both evolving in time), while a marker is interpreted as a certain (constant) constraint imposed on the diagram of sets and functions whose schema is labeled by the marker. In other words, nodes denote classes and value domains, arrows denote references and attributes, and markers denote integrity-checking procedures (a collection of diagram markers is presented in Table 1). In this way one obtains a variety of sketch data models by choosing one or another signature of markers.
Let us consider an example of semantic modeling via sketches. Suppose the user is interested in information about men, married couples and married women of some town for, say, the last 50 years. Here "married women" means women who are or have been married during the last 50 years. A rough view of the universe is described by the ER-diagram at the top of Fig. 1 in a conventional ER-notation. The semantics of nodes and attributes is hopefully clear from their names; MDate and DDate are the dates of marriage and divorce (the latter is optional). The domain of the optional attribute Character is a set consisting of three values a, c, q. The class Woman is of weak entity type since users are assumed to be interested only in women who are or have been married.
The sketch specifying the situation is depicted in the lower part of the figure (see Table 1 for the meaning of the marked diagrams). Note the arc on the triple of arrows (husb, wife, mdate). It means that Married-objects are identified by their attributes husb, wife, mdate; the latter is necessary because a married couple can get divorced and then get married again. Note also the monic markers on the key attribute pin and the cover marker on the arrow wife. Rectangle nodes are abstract classes whose extensions should be stored in the DB: this is additionally indicated by small dots filling in the rectangles. Oval nodes are predefined value domains whose semantics is known a priori to the DBMS. In the sketch approach, Int and {a,c,q} are markers (in our precise sense) hung on the corresponding nodes, that is, constraints imposed on their intended semantic interpretations. For example, if a node is marked by Int, its intended semantics is the predefined set of integers.
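The reading of markers as integrity-checking procedures can be illustrated by a minimal Python sketch. It assumes that arrows are stored as dictionaries from objects to values; the function names (is_monic, is_separating, is_cover) and the sample data are hypothetical and are not taken from Table 1.

```python
# Hypothetical extension of a fragment of the sketch on Fig. 1.
# Each arrow is a total function stored as a dict: object -> value.
husb = {"m1": "John", "m2": "John", "m3": "Pete"}
wife = {"m1": "Ann", "m2": "Ann", "m3": "Mary"}
mdate = {"m1": 1970, "m2": 1985, "m3": 1980}
pin = {"John": 101, "Pete": 102}


def is_monic(f):
    """Monic marker: the function is injective (a key attribute)."""
    return len(set(f.values())) == len(f)


def is_separating(*fs):
    """Arc marker on arrows with a common source: objects are
    identified by the tuple of their values under these arrows."""
    tuples = [tuple(f[x] for f in fs) for x in fs[0]]
    return len(set(tuples)) == len(tuples)


def is_cover(f, target):
    """Cover marker: every element of the target is a value of f."""
    return set(target) <= set(f.values())
```

With these data the triple (husb, wife, mdate) is separating while the pair (husb, wife) is not, because the couple John and Ann married twice.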

Derived information via diagram operations
Consider again the simple universe described in Fig. 1. An important constraint that should be added to the schema is that any currently married couple is uniquely identified by its husband, and by its wife as well. In other words, the subset of the relation 'Married' for which the attribute 'ddate' is undefined is of the one-one relationship type. To express such a constraint, one should extend the original sketch with derived items as shown in Fig. 2: in order to distinguish basic items from derived ones, the former are filled in with small dots intended to recall that their extensions are to be stored. Of course, derived arrows should also be specially distinguished, and we agree to denote derived items by hanging various superscripts on their names (such as *, • etc.).
Certainly, the derived items of the sketch we consider can be obtained by an evident Select-From-Where SQL query. We prefer, however, to specify this query as a composition of elementary diagram operations on sets and functions: Null, CoImage, Composition, presented at the top of Table 2 (as for the Null operation, we assume that each domain contains a single distinguished null value). The denotational semantics of the operations is described in the corresponding column of the table.
In general, a diagram operation $Q$ is specified by a sketch denoting its input data, $S^{in}_Q$, and a sketch denoting the output data, $S^{out}_Q$. The body of the operation is then a procedure $P_Q$ which calculates an extension of $S^{out}_Q$ from a given extension of $S^{in}_Q$. Thus, the derived items of the sketch $S$ in Fig. 2 can be presented as follows:
[Displayed expressions lost in extraction; only the fragment "fn X1 ... Xn" survives.]
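The derivation of the currently married couples can be sketched in Python as follows. Since Table 2 did not survive in this text, the semantics used here is an assumption: Null picks out the null value of a domain, CoImage takes the preimage of a subset along a function, and Composition composes functions. All names and data are hypothetical.

```python
NULL = None  # assumed: a single distinguished null value in each domain


def null_part(domain):
    """Null: the subset of a domain consisting of its null value."""
    return {x for x in domain if x is NULL}


def coimage(f, subset):
    """CoImage: objects of the source whose f-value lies in `subset`."""
    return {x for x, y in f.items() if y in subset}


def compose(f, g):
    """Composition: apply f first, then g."""
    return {x: g[y] for x, y in f.items()}


# Hypothetical extension of the class Married.
ddate = {"m1": 1980, "m2": NULL, "m3": NULL}
husb = {"m1": "John", "m2": "John", "m3": "Pete"}

# Couples whose divorce date is undefined are currently married.
curr_married = coimage(ddate, null_part(set(ddate.values())))
cm = {x: x for x in curr_married}  # inclusion of CurrMarried into Married
husb_cur = compose(cm, husb)       # derived arrow: husband of a current couple
```

The one-one constraint discussed above then amounts to requiring that the derived function husb_cur (and its wife counterpart) be injective.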

Composition of diagram operations (queries)
Queries can be composed by passing a part of the output data of a query (or a set of queries) to the input of another query operation. The way of passing is specified by the corresponding mappings of sketches. On the other hand, the composition can be presented as a stepwise augmentation of the initial schema with derived items. In fact, we have already used this procedure in building the sketch in Fig. 2. A somewhat more involved example is presented in Fig. 4(a), where we omit the fragment specifying the derivation of the items CurrMarried, cm* described in Fig. 2.

[Table 2: column headings Output sketch, Semantics, Notation; first operation Null; table body not recoverable.]
Given Table 2 and the conventions adopted above, the graphical image in Fig. 4(a) specifies a system of queries against the basic sketch in an unambiguous way. For example, the marker CoIm^$ labeling the square diagram around it specifies that the extent of the node Man^$ consists of those objects of the class Man for which the value of the attribute income is greater than $10^6$. Similarly, the extension of the node Man^! is the intersection of the extensions of Man^$ and Man (note that the CoImage operation applied to two ISA-arrows turns into the ordinary intersection of sets). The node Man is produced by the operation Image applied to the arrow husb•, which is derived, in its turn, by the composed query presented in Fig. 2. It follows from the semantics of these procedures that, e.g., the extension of the class Man^! consists of those happy Man-objects which are married (at the present time) and rich (with income no less than $10^6$). Certainly, such a query can be expressed by standard SQL means, but the discussion of which of the languages is better is not relevant for our present considerations. All that we wish to demonstrate in this respect is how standard SQL queries can be expressed by diagram operations over sketches.
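The stepwise augmentation just described can be mimicked by a short Python sketch. The data and the variable names (man_rich, man_married, man_happy) are hypothetical stand-ins for the derived nodes of Fig. 4(a), and the threshold uses the "greater than" reading of the income condition.

```python
# Hypothetical extensions of basic and previously derived items.
income = {"John": 2_000_000, "Pete": 500_000, "Sam": 3_000_000}
husb_cur = {"m2": "John", "m3": "Pete"}  # husbands of current couples


def image(f):
    """Image: the set of values actually taken by a function."""
    return set(f.values())


# Step 1: CoImage along income of the subdomain of values above 10^6.
man_rich = {m for m, value in income.items() if value > 10**6}
# Step 2: Image of the derived arrow husb_cur.
man_married = image(husb_cur)
# Step 3: CoImage of two ISA-arrows, i.e. ordinary intersection.
man_happy = man_rich & man_married
```

Each step only adds a derived item on top of items already present, which is the sense in which query composition is a stepwise augmentation of the schema.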

Query languages: an abstract formulation

2.4.1
The crucial observation is that semantic extensions of view schemas can also be described by arrows. If one thinks of a schema as a graph endowed with special marking, an extension of the schema is a mapping sending nodes to sets and arrows to functions (built over some predefined universe (world) of data objects, $W$) in such a way that the intended semantics of the markers is respected. One can consider the collection of $W$-sets and $W$-functions as a graph whose nodes are sets and whose arrows are functions. In addition, given a marker (predicate) $m$, those diagrams of $W$-sets and functions that possess the property denoted by $m$ can be thought of as marked by $m$. In this way the universe can be converted into a (monstrous) schema $W$ of the same type as the semantic schemas we consider, and then an extension of $S$ is nothing but a schema morphism (mapping) $e : S \to W$. The set of all possible semantic extensions of $S$ will be denoted by Ext(S).
As was said above, a query $q$ is specified by its schema $S_q$ together with a mapping as shown in Fig. 3(a). One can add the (derived) items of the schema $S_q$ to the initial schema so that the query $q$ is depicted by the diagram in Fig. 3(b). Thus, a query appears as an operation while the semantic schema $W$ (in effect, the database in question) is a domain carrying this operation. This pattern is well known for the relational data model and can be generalized to semantic graph-based data models as well. ♦

2.4.2
In many considerations it is convenient to introduce a monstrous closure of a given schema $S$, denoted $\mathrm{der}_{QL} S$, which contains all possible derived items produced by $QL$-queries. Clearly, $\mathrm{der}_{QL} S \supset S$, since any reasonable query language should contain trivial queries returning basic data without any operations on them. Certainly, for particular questions one needs only finite parts of $\mathrm{der}_{QL} S$; however, the concept of the full closure can be very useful. In addition, any semantic schema (database) $W$ should be endowed with a mapping $\mu_W : \mathrm{der}_{QL} W \to W$ corresponding to computing the extents of derived items. Indeed, if an item $X$ belongs to $\mathrm{der}_{QL} S \setminus S$, then it is a derived item and hence presents (a part of) a query specification against the schema $S$. When $S = W \in$ Sem, the mapping $\mu_W$ is available, and then the item $\mu_W(X)$ is the result of computing the query $X$ over the database $W$. In fact, the mapping $\mu_W$ makes the schema $W$ closed under operations from $QL$.
So, in the abstract framework we suggest, a query mechanism is specified by a closure operator $\mathrm{der}_{QL}$ over the collection of all possible schemas, Schema, such that databases can be considered as $\mathrm{der}_{QL}$-closed objects of Schema. This is visualized by the diagram in Fig. 3(c), where the arrows in the bold brackets denote instances of the schemas $S$ and $\mathrm{der}_{QL} S$ respectively. ♦
The considerations above can be formulated in abstract categorical terms by saying that the collection of all schemas is a category Schema and $\mathrm{der}_{QL}$ is a monad over Schema, while databases are algebras of this monad.

Definition.
An abstract data model is a triple M = (Schema, der, Sem) with Schema a category, der a monad (closure operator) over it, and Sem a collection of der-algebras (der-closed schemas). Then a database instance over $S$ is an arrow $e : S \to W$ with $W \in$ Sem, which can be extended in a unique way to an arrow $\bar{e} : \mathrm{der}_{QL} S \to W$ such that the restriction of $\bar{e}$ to $S$ equals $e$.
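The unique extension of an instance to the closure can be illustrated by a toy Python sketch. It assumes, purely for illustration, that basic items are strings, that derived items are nested tuples of the form (operation, argument, ...), and that the query language consists of two hypothetical operations, "image" and "meet"; none of this encoding comes from the paper.

```python
def extend(e, ops):
    """Return the extension of e to derived items: basic items are
    looked up in e, derived items are evaluated recursively."""
    def e_bar(item):
        if isinstance(item, tuple):
            op, *args = item
            return ops[op](*(e_bar(a) for a in args))
        return e[item]
    return e_bar


# Hypothetical query language with two operations.
ops = {
    "image": lambda f: frozenset(f.values()),
    "meet": lambda a, b: a & b,
}

# Hypothetical instance e of a schema with basic items husb and Rich.
e = {
    "husb": {"m1": "John", "m2": "Pete"},
    "Rich": frozenset({"John", "Sam"}),
}
e_bar = extend(e, ops)
rich_husbands = e_bar(("meet", ("image", "husb"), "Rich"))
```

On basic items e_bar agrees with e, and on derived items its value is forced by the operations, which is the uniqueness stated in the definition.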

Getting definition.
There has been considerable debate in the database theory literature about the general notion of view against a semantic or logical database schema, in particular in the object-oriented framework. Despite the large body of work done in the area ([1,20,19,16], to mention only a few recent publications), the notion of view has not been precisely formulated (as we will show), and even the terminology is still somewhat confusing. Indeed, in the majority of works (like those just mentioned) the term view is used to denote derived items (as a rule, classes or relations), according to the pattern "define view name as query specification". In some works, however ([22,13]), a view of a schema is a subschema of the schema. An evident integrated formulation is to consider a view of a schema $S$ as a subschema of some augmentation $\bar{S}$ of $S$ with derived items, i.e., $S_V \subset \bar{S} \supset S$. This is almost a good definition, but it forces one to consider the name space of the view schema $S_V$ as a subspace of that of the schema $\bar{S}$. However, for the view user the schema $S_V$ should be totally autonomous, with its own name space.
Thus, a more correct approach is to define a view over $S$ as a pair $V = (S_V, v)$ with $S_V$ a schema (of the same kind as $S$) and $v$ a mapping (schema morphism) $v : S_V \to \bar{S}$ sending items of $S_V$ into those of $\bar{S}$ (one may think of the items as names of nodes (classes) and arrows (attributes or references)).
An essential advantage of this definition is that the mapping $v$ is not bound to be one-one, that is, it can well be the case that two different items $N$, $M$ in $S_V$ are glued into one item $K$ in $\bar{S}$, $v(N) = v(M) = K$. The presence of two different names in $S_V$ could then seem a trick that just misleads the view user, but note that the database schema $S$ may evolve, and later the item $K$ may diverge into two different items $K_1$, $K_2$ such that, according to the new view semantics, the view morphism $v$ has to map $N$ to $K_1$ and $M$ to $K_2$, $v(N) = K_1 \neq K_2 = v(M)$. Or, vice versa, initially there were two different nodes $K_1$, $K_2$ with $v(N) = K_1$, $v(M) = K_2$, but then $S$ has evolved to a state in which $K_1$ should be merged with $K_2$. Thus, by suitably changing the view mapping $v$ but without changing the view schema $S_V$, the DBA can keep an application built over the view consistent with the evolving database, so that there is no need to rebuild the application. ♦
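The evolution scenario can be made concrete with a small Python sketch. Schema items are plain strings, a view mapping is a dictionary from view items to base items, and the helper view_extension is a hypothetical name; the data are invented for illustration.

```python
def view_extension(v, e):
    """Extension of the view schema: look up each view item's image."""
    return {item: e[target] for item, target in v.items()}


# Before evolution: two view items N, M are glued into one base item K.
e_before = {"K": {"a", "b"}}
v_before = {"N": "K", "M": "K"}

# After evolution: K has diverged into K1 and K2; only the mapping changes.
e_after = {"K1": {"a"}, "K2": {"b"}}
v_after = {"N": "K1", "M": "K2"}

view_before = view_extension(v_before, e_before)
view_after = view_extension(v_after, e_after)
```

The set of names seen by the application, N and M, is the same before and after the change, which is why the application need not be rebuilt.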

Examples.
Let us again consider a simple conceptual schema, the sketch $S$ in Fig. 1. Two simple views $V_1$, $V_2$ on this sketch are presented on the left of Fig. 4(b). Each of the views is specified by the view schema and a mapping which is set by the corresponding table; an augmentation of $S$ required to set up these views is specified in Fig. 4(a). From these specifications it is seen, for example, that the extension of the node PHusband in the view $V_1$ consists of those S.Man-objects whose income is more than $10^6$ and which, simultaneously, are not present in the relation S.CurrMarried*. Indeed, the markers Im and Dif^~ in Fig. 4(a) denote the operations of taking the image of a function and set difference respectively. Then the class Man^~ consists of those Man-objects which do not occur in CurrMarried*-pairs. Further, the marker CoIm^? defines the class Man^? as the intersection of Man^~ and Man^$. The latter class is defined by the marker CoIm^$, that is, it consists of Man-objects with income greater than $10^6$ (so the extension of the node Man^? consists of rich unmarried men). According to the $V_1$-view mapping, the extension of the node $S_{V_1}$.PHusband is as described above.
Similarly, the extension of the node Marr in the view $V_2$ consists of those S.CurrMarried*-pairs $(w, h)$ for which w.character = a and h.income ≥ $10^6$.
Some delicate points associated with our definition of view are demonstrated in the right half of Fig. 4(b). A view mapping $v$ must be a schema morphism. That is, for the sketch data model, $v$ is not merely a graph morphism but a sketch morphism: if a diagram $D$ in $S_V$ is labeled by a marker $M$, then its image $v(D)$ in $\bar{S}$ must also be labeled by $M$. For example, in the specification $V_3$ in Fig. 4(b) the corresponding mapping is not a sketch morphism, since the pair ($S_{V_3}$.husb, $S_{V_3}$.wife) is marked by an arc while its image, the pair (S.husb, S.wife) in the sketch $S$, is not labeled (the triple [S.husb, S.wife, S.mdate] is separating, but this does not imply that the pair is). Semantically, this means that according to the view schema $S_{V_3}$ any Married-object is identified by the pair of objects (wife, husb), whereas the data which the view extracts from the $S$-extension do not necessarily satisfy this condition. Hence, $V_3$ is not a view on $S$ in the technical sense.
The view $V_4$ is a somewhat artificial example of a view whose view mapping is not one-one (as for the arrow $id_{Person}$, we assume that any node in a sketch has an identity arrow into itself, which is not depicted). ♦
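The marker-preservation condition that rules out $V_3$ can be checked mechanically. The Python sketch below assumes a simplified encoding in which a marked diagram is a pair (marker name, tuple of arrow names); it checks only marker preservation, not the graph-morphism part, and the function name is hypothetical.

```python
def preserves_markers(v, view_markers, base_markers):
    """True if every marked diagram of the view schema is sent by v
    to a diagram of the base sketch carrying the same marker."""
    return all(
        (name, tuple(v[arrow] for arrow in diagram)) in base_markers
        for name, diagram in view_markers
    )


# Markers of the base sketch S (a fragment).
base_markers = {
    ("separating", ("husb", "wife", "mdate")),
    ("cover", ("wife",)),
}

# V3 claims that the pair (husb, wife) alone identifies Married-objects.
v3 = {"husb": "husb", "wife": "wife"}
v3_markers = {("separating", ("husb", "wife"))}

# A weaker view schema that only keeps the cover marker on wife.
ok_markers = {("cover", ("wife",))}
```

Here preserves_markers(v3, v3_markers, base_markers) is false, matching the conclusion that $V_3$ is not a view on $S$, while the weaker marking passes.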

View metasketch.
The system of views we have just considered can be presented by the graph in Fig. 4(c). Moreover, this graph carries a sketch structure, since view mappings and their diagrams are subject to certain constraints. For example, the arc with brackets hung on the arrows $v_1$, $v_2$ reflects the constraint that the views $V_1$, $V_2$ are disjoint. Also, the views $V_1$, $V_2$ are marked as one-one (see Table 1). One can see that specifying the sketch structure in the graph of views is important for capturing data semantics. ♦

Semantics.
If $\bar{S}$ is an augmentation of $S$ with derived items, any $S$-extension $e : S \to W$ can be extended to an extension of the augmented schema, $\bar{e} : \bar{S} \to W$, in a unique way as follows. If an item $N$ from $\bar{S}$ actually occurs in $S$, $N \in S$, then $\bar{e}(N) = e(N)$; however, if $N \in \bar{S} \setminus S$, then $N$ occurs in the schema of some query $q$ against $S$, and so $\bar{e}(N)$ is the answer to $q$ extractable from the set of data $\{e(M) \mid M \in S\}$. Now, given a view $v : S_V \to \bar{S}$ over a schema $S$ and an extension $e$ of $S$, the corresponding extension $e_V$ of the view schema is the composition of two arrows: $e_V = \bar{e} \circ v$ (the symbol $\circ$ denotes the operation of arrow composition). Moreover, if $v_1 : S_{V_1} \to \bar{S}_V$ is a view over $S_V$, then the extension $e_{V_1}$ of $S_{V_1}$ is the composition $e_{V_1} = \bar{e} \circ \bar{v} \circ v_1$, where $\bar{v} : \bar{S}_V \to \bar{S}$ and $\bar{e} : \bar{S} \to W$ are the corresponding augmented mappings. ♦
Thus, a system of views over a given schema is specified by a graph of view schemas and view mappings, and the well-known mechanism of calculating view extensions is specified by arrow composition. In this way the view architecture can be described in a graph-based algebraic arrow framework.
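Computing view extensions by arrow composition can be sketched in Python with dictionary-encoded arrows. The sketch assumes that the augmented extension is already computed, and that the second-level view maps only into basic items of the first view schema, so that no further augmentation is needed; item names such as Candidate are hypothetical.

```python
def compose(*arrows):
    """Compose dict-encoded arrows in diagrammatic order:
    the first arrow is applied first."""
    first, *rest = arrows
    result = dict(first)
    for g in rest:
        result = {x: g[y] for x, y in result.items()}
    return result


# Augmented extension: basic item Man and derived item Man? (rich unmarried).
e_bar = {"Man": {"John", "Pete", "Sam"}, "Man?": {"Sam"}}

v = {"PHusband": "Man?"}        # view V over S
v1 = {"Candidate": "PHusband"}  # view V1 over V

e_V = compose(v, e_bar)         # extension of the view schema
e_V1 = compose(v1, v, e_bar)    # extension of the view over the view
```

A view over a view is thus evaluated by one more composition step, with no special mechanism beyond the one already used for a single view.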

DB architecture schemas in the arrow formalism
4.1
In the framework developed above, a centralized DB system with a system of views over it can be specified by the diagram presented in Fig. 5 (recall that the arrows in the bold brackets denote instances of the schemas $S$ and $\mathrm{der}_{QL} S$ respectively). Note that the graphical image in the figure is not merely a picture convenient for heuristic discussion but also a precise formalized specification where each item (node or arrow) has its own semantic extent to be supported by the DBMS. One can think of the schema in Fig. 5 as displayed on the monitor screen (of course, only a finite part $\bar{S} \subset \mathrm{der}_{QL} S$ can and actually needs to be shown), where clicking an item displays the semantic content denoted by the item.
One can think of the node $\mathrm{der}_{QL} S$ as a sufficiently large augmentation of $S$ with derived items such that all view schemas $S_i$ can be mapped into it (an example of such an augmentation is presented in Fig. 4(a)). Arrows $v_i$ are coupled with schema mappings stored in the view catalog (one can think of these mappings as tables similar to those presented in Fig. 4(b)). Arrows $v_i$ can be thought of as consisting of lists of procedures translating queries against view schemas into queries against the global central schema. All these data are metadata which form the syntactical side of the view system supported by the DBMS.

Figure 5: Meta-schema of a centralized DB with a system of views. $S_i$ and $QL_i$ are the schema and the query language of the $i$th view $v_i : S_i \to \mathrm{der}_{QL_i} S_i$, $i = 1, \ldots, n$.
On the semantic side, the arrow QL consists of the list of procedures $P_q$ (see section 2.4) calculating QL-queries. Let $e$ be an extent of $S$ (a database instance) and $q_i$ a query against $S_i$. First, $q_i$ is translated into a query $q = v_i(q_i)$ against $S$, and then the procedure $P_q$ returns the answer $e_q$, that is, a partially defined mapping $\mathrm{der}_{QL} S \to W$. It is then restructured into an answer to $q_i$ according to the rules coupled with the translation $v_i$.
The architecture and the view mechanism we have just described are well known, and our formal treatment may seem a superfluous detail. However, the benefits are in the clear separation of syntax and semantics, and in a compact, transparent yet formalizable notation which gives rise to an easy-to-manipulate specification of the view system. In addition, the arrow presentation of the view mechanism makes it easier to manage some of its subtle components by reducing them to compositions of arrows. ♦

4.2
The same framework allows one to specify formally the general architecture of federal DB systems, described in detail in [21,18] and shown in Fig. 6. The result is presented by the meta-schema (actually, a sketch!) depicted in Fig. 7: two federal superviews are shown, $a$ and $b$, with integrated schemas $S^I_a$, $S^I_b$ respectively, together with a system of external views against the $a$-superview (a similar system for $b$ is not shown).
Correspondence information schemas $S^{CI}_a$, $S^{CI}_b$ specify data about the correspondence between component schemas. Federal superview schemas $S^I_a$, $S^I_b$ are obtained by integration of component schemas with the correspondence information schemas (an automated procedure of sketch integration in the presence of inter-schema conflicts was proposed in [6,8]). The arcs hung on the arrows coming into $S^I_a$, $S^I_b$ denote the cover constraint (Table 1). They show that each item of the integrated schema can be found either in one of the component schemas or in the correspondence schema.
In the explanation column on the right of Fig. 7, "data model" means a triple M = (Schema, der, Sem) as above, and ordinary arrows in the diagrams are morphisms in the corresponding category Schema. In contrast, curly arrows in the "translation" rows are mappings between data models, $M_i \rightsquigarrow M$ (cf. the data model transformations studied by Kalinichenko in [15]). Conceptually, they are similar to the so-called institution morphisms studied in institution theory; see [2] for references. A precise categorical description of data model mappings is beyond the scope of the present paper, and we restrict ourselves to a rough outline. ♦
Let $M_1$, $M_2$ be two data models. A mapping $F : M_1 \to M_2$ between them consists of two components, $F = (\varphi, \delta)$. The first one maps data schemas and their morphisms in the model $M_1$ into those of $M_2$: if $S \in \mathrm{Schema}_1$ then $\varphi S \in \mathrm{Schema}_2$, and if $h : S \to S'$ is a morphism in $\mathrm{Schema}_1$ then $\varphi h : \varphi S \to \varphi S'$ is a morphism in $\mathrm{Schema}_2$. In addition, if $W \in \mathrm{Sem}_1$ then $\varphi W \in \mathrm{Sem}_2$.
The second component of $F$, $\delta$, is intended to explicate the translation of $M_1$-queries into $M_2$-queries. That is, informally, if $q$ is an $M_1$-query against a schema $S \in \mathrm{Schema}_1$, then $\delta q$ is an $M_2$-query against $\varphi S \in \mathrm{Schema}_2$. Since this holds for any query, the entire $\mathrm{der}_1$-closure of $S$ should be mappable into the $\mathrm{der}_2$-closure of $\varphi S$. More formally, $\delta$ is a mapping which sends any schema $S \in \mathrm{Schema}_1$ to a $\mathrm{Schema}_2$-morphism $\delta S : \varphi(\mathrm{der}_1 S) \to \mathrm{der}_2(\varphi S)$. ♦
As a justification for the abstract nonsense we have just displayed, one can observe the nice similarity of Fig. 6 and Fig. 7; the latter, however, is a precise specification while the former is an informal picture.

Conclusion
On the technical side, the main result of the paper consists in abstract polymorphic definitions of data model (section 2.4) and view (section 3.1). They allow one to specify the architecture of view systems in a way that has the advantages of being (i) graph-based, (ii) precisely formalized, and (iii) polymorphic, that is, independent of any specific data model but capable of capturing a wide class of them. The last property makes it possible to manage specifications of heterogeneous view systems.
Speaking in a broader context, the paper initiates building a data-model-independent framework for DB theory. It seems to be a new research direction, though important initial steps were made in a series of works by Kalinichenko ([14,15]). The cause of the novelty is not that the issue was considered useless; on the contrary, its importance has been recognized for at least a few years, especially in the CoopIS area ([12]). One can guess that the actual cause is rooted in the unfamiliarity of the community with appropriate specification tools. The present paper hopefully shows that the arrow machinery (and the style of thinking underlying it) developed in mathematical category theory can be extremely suitable for stating the specification foundations of the problem. This gives rise to the difficult task of incorporating the arrow machinery into IS design methodologies and techniques and, in a broader context, of inculcating the arrow style of thinking in the area of software engineering (cf. [7]).
Of course, arrow thinking cannot be universally good for the entire field of information technologies. It is clear that in many situations string-based specifications are preferable, and often the common practice of using graphical schemas as a high-level informal language is quite adequate, so that there is no need to employ graphical images as logical schemas (see the discussion in [17]). Actually, it is a question of experience and cognitive science research when, where and to what extent the employment of graphical schemas as formal logical specifications is effective and convenient for a human.

Figure 2: Sketches with operation markers: Specifying a query to a sketch = Extending it with derived items

Figure 4: Specifying views over a sketch = Setting sketch morphisms into its augmentations with derived items

Figure 6: General architecture of a federal DB system (from [18])

Figure 7: Meta-schema of a federal DB

Table 2: A collection of diagram operation markers (queries)