Programming Constructs for Unstructured Data

We investigate languages for querying and transforming unstructured data, by which we mean languages that can be used without knowledge of the structure (schema) of the database. There are two reasons for wanting to do this. First, some data models have emerged in which the schema is either completely absent or only provides weak constraints on the data. Second, it is sometimes convenient, for the purposes of browsing, to query the database without reference to the schema. For example, one may want to "grep" all character strings in the database, or one might want to find the information associated with a certain field name no matter where it occurs in the database. This paper introduces a labeled tree model of data and investigates various programming structures for querying and transforming such data. In particular, it considers various restrictions of structural recursion that give rise to well-defined queries even when the input data contains cycles. It also discusses issues of observable equivalence of such structures.


Introduction
We investigate languages for querying and transforming unstructured data, by which we mean languages that can be used without knowledge of the structure (schema) of the database. There are two reasons for wanting to do this. First, some data models have emerged in which the schema is either completely absent or only provides weak constraints on the data. Second, it is sometimes convenient, for the purposes of browsing, to query the database without reference to the schema. For example, one may want to "grep" all character strings in the database, or one might want to find the information associated with a certain field name no matter where it occurs in the database.
The idea of using labeled trees for this purpose has been suggested by two data models. ACeDB (A C. elegans Database) [7] is a database system popular with biologists. It has a schema, but this only places very weak constraints on the database, since any field in the deeply nested records that are common in ACeDB can be null. Recently Tsimmis [5] has been proposed as a data model for heterogeneous data integration. In Tsimmis there is no schema. The "type" is interpreted by the user from labels in the structure, which is quite flexible. In particular, a Tsimmis structure may be interpreted as a record or as a set. There is an analogy here with the dynamic type system of Lisp, whose one basic data structure, the s-expression, may be used to represent lists, trees, association lists, lambda terms, etc.
The approach we shall take is to extend structural recursion to labeled trees. This poses some interesting problems. First, it is no longer "flat" structural recursion, so that the usual syntactic forms and optimizations for collection types such as lists, bags and sets may not be relevant. Second, we shall want to examine the possibility that the values we are manipulating may be cyclic. It is common in ACeDB, and generally in object-oriented databases, for objects to refer to each other, allowing the possibility of arbitrarily "deep" queries. Of course, such cyclic structures are usually constructed through the use of a reference/pointer type; however, query languages are insensitive to these object identities and perform automatic dereferencing. We therefore want to understand what programs are well defined when we are allowed to make unbounded searches in the database.
This work is preliminary and serves only to describe languages that may be useful for unstructured data. While we believe that there are sound principles for justifying this choice of languages, they are at present mostly "articles of faith".
The paper is organized as follows. After specifying the data structure of interest, we first develop a variant of nested relational algebra which gives us the ability to construct queries to a fixed depth. Next we extend the idea of structural recursion to perform queries that can reach data at arbitrary depths in the tree. Finally we examine restrictions of this language that work on cyclic data.
By a database "query" we usually understand a program that produces a relatively simple output from a complex input, the database. In what follows we are interested in producing data structures that may be as complex as the input. This is the problem of transforming databases, which is of paramount importance in heterogeneous database systems.
Figure 1: Labeled tree representation of a bibliographic database, bib.

A labeled tree data type
As is common in this area we shall take a bibliographic database as a running example. The diagram in Figure 1 shows an edge-labeled tree. At the top level we see four edges labeled doc, indicating a set of documents. The first such document displayed is a tree with two distinct labels, topic and book, indicating a record. The labels on the edges are drawn from some collection of basic types. For the sake of consistency with the systems mentioned above, we shall consider the type $\mathit{label}$ to be the discriminated union of a number of basic types: character strings such as "Math", "Wheels"; numbers such as 4, 3.1415; and symbols such as 'doc, 'article, using the Lisp notation for quoting (the quote mark is not shown on the symbol edges in Figure 1). In general, symbols are used to mark internal edges, and other constants such as strings and numbers are used at the leaves, but this is not demanded by our model.
Having fixed a data type $\mathit{label}$, we can now define the type of a labeled tree to be a set of pairs, each consisting of a label and another tree. Using $\mathcal{P}_{fin}(S)$ for the finite subsets of $S$, we can describe a labeled tree type by the recursive type equation
$$\mathit{tree} = \mathcal{P}_{fin}(\mathit{label} \times \mathit{tree})$$
Before proceeding further we should remark that there are some differences between this type and the models used in Tsimmis and ACeDB. Tsimmis [5] attaches values of base types such as num and string to the terminal nodes of the tree, and the edges are labeled only with symbols. Tsimmis also has object identities associated with the internal nodes. The transformation from Tsimmis is straightforward: we represent terminal nodes by terminal edges, and we may introduce object identities by simply adding a new object-identity base type. ACeDB is much closer to our presentation in that numbers, strings, etc. may be attached to non-terminal edges. It also allows one to build cyclic structures, which we shall discuss later. The transformation from ACeDB is obtained essentially by transferring label information from the schema to the data, and a similar technique may be used to represent other databases as trees.
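We now describe constructors for the type $\mathit{tree}$. Trees are sets, so we have $\emptyset$ for the empty set and $e_1 \cup e_2$ to construct the union of sets $e_1$ and $e_2$. In addition we have the expression $\{a \Rightarrow e\}$ to describe a singleton set consisting of a tree formed by attaching the edge labeled with $a$ to the root of the tree $e$. The types of these constructors are as follows:
$$\emptyset : \mathit{tree} \qquad \{\cdot \Rightarrow \cdot\} : \mathit{label} \times \mathit{tree} \to \mathit{tree} \qquad \cdot \cup \cdot : \mathit{tree} \times \mathit{tree} \to \mathit{tree}$$
We shall also make use of the following abbreviations for constructing trees: $\{a_1 \Rightarrow e_1, a_2 \Rightarrow e_2, \ldots, a_n \Rightarrow e_n\}$ for $\{a_1 \Rightarrow e_1\} \cup \{a_2 \Rightarrow e_2\} \cup \ldots \cup \{a_n \Rightarrow e_n\}$. Also $a \Rightarrow \emptyset$, appearing within $\{\ldots\}$, may be abbreviated to $a$. Thus $\{1, 2, 3\}$ is an abbreviation for the "flat" tree $\{1 \Rightarrow \emptyset, 2 \Rightarrow \emptyset, 3 \Rightarrow \emptyset\}$.
As a more elaborate example, the tree depicted in Figure 1 can be built with the following syntax:
$$\begin{aligned}
\mathit{bib} = \{\ & \text{'doc} \Rightarrow \{\text{'topic} \Rightarrow \{\text{"Genetics"}\},\\
& \qquad\quad \text{'book} \Rightarrow \{\text{'title} \Rightarrow \{\text{"Cartoon Guide to Genetics"}\},\ \text{'authors} \Rightarrow \{\text{"Gonick"}, \text{"Wheels"}\}\}\},\\
& \text{'doc} \Rightarrow \{\text{'topic} \Rightarrow \{\text{"Math"}\},\\
& \qquad\quad \text{'book} \Rightarrow \{\text{'title} \Rightarrow \{\text{"Cartoon Guide to Statistics"}\},\ \text{'authors} \Rightarrow \{\text{"Gonick"}\}\}\},\\
& \text{'doc} \Rightarrow \{\text{'topic} \Rightarrow \{\text{"Genetics"}, \text{"Database"}\},\\
& \qquad\quad \text{'article} \Rightarrow \{\text{'title} \Rightarrow \{\text{"FlyBase-the Drosophila database"}\},\ \text{'authors} \Rightarrow \{\text{"The FlyBase Consortium"}\}\}\}\ \}
\end{aligned}$$

Nested relational algebra on trees
The previous section gave a syntax for constructing trees. We now extend this to the syntax of a programming language for trees. To our types $\mathit{label}$ and $\mathit{tree}$ we add a boolean type $\mathit{bool}$ with the usual boolean connectives. We also add:
An equality test $a = b$ on labels. Equality is of type $\mathit{label} \times \mathit{label} \to \mathit{bool}$.
An emptiness test $\mathit{null}(t)$ on trees, with $\mathit{null} : \mathit{tree} \to \mathit{bool}$.
A conditional $\text{if } b \text{ then } e_1 \text{ else } e_2$ in which $b$ is a boolean expression and $e_1, e_2$ denote trees.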
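The constructors and tests introduced so far can be modelled quite directly in a general-purpose language. The following is a minimal sketch, not part of the formal development: it assumes trees are represented as Python frozensets of (label, subtree) pairs, and it does not distinguish symbols from strings (so 'doc is written "doc"). The names EMPTY, single, union, leaves and null are ours.

```python
# Hypothetical model: a tree is a frozenset of (label, subtree) pairs,
# mirroring tree = P_fin(label x tree). Labels are plain Python values.
EMPTY = frozenset()                       # the empty tree


def single(a, t=EMPTY):
    """{a => t}: one edge labeled a whose end carries the tree t."""
    return frozenset({(a, t)})


def union(*trees):
    """t1 U t2 U ...: trees are sets, so union is plain set union."""
    return frozenset().union(*trees)


def leaves(*labels):
    """{a1, ..., an}, short for {a1 => {}, ..., an => {}}."""
    return union(*(single(a) for a in labels))


def null(t):
    """Emptiness test on trees."""
    return len(t) == 0


# The bib tree as written out in the text (three doc edges).
bib = union(
    single("doc", union(
        single("topic", leaves("Genetics")),
        single("book", union(
            single("title", leaves("Cartoon Guide to Genetics")),
            single("authors", leaves("Gonick", "Wheels")),
        )),
    )),
    single("doc", union(
        single("topic", leaves("Math")),
        single("book", union(
            single("title", leaves("Cartoon Guide to Statistics")),
            single("authors", leaves("Gonick")),
        )),
    )),
    single("doc", union(
        single("topic", leaves("Genetics", "Database")),
        single("article", union(
            single("title", leaves("FlyBase-the Drosophila database")),
            single("authors", leaves("The FlyBase Consortium")),
        )),
    )),
)
```

Because the representation is a set, duplicate edges collapse automatically and the order in which edges are written is irrelevant, which matches the set-based reading of trees.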
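Since trees are necessarily sets, we first consider structural recursion on sets as the basic programming paradigm and, following [3], use the restricted form of structural recursion given by functions $h$ of the form
$$h(\emptyset) = \emptyset \qquad h(\{a \Rightarrow t\}) = f(a, t) \qquad h(t_1 \cup t_2) = h(t_1) \cup h(t_2)$$
In this, the meaning of the function $h$ of type $\mathit{tree} \to \mathit{tree}$ is determined by the function $f : \mathit{label} \times \mathit{tree} \to \mathit{tree}$. Note that this is a mathematical definition, which suggests an implementation. The syntax we will use for $h(S)$ when the function $f$ is defined by $f(a, t) = e$ is $\mathit{ext}(\lambda(a, t).e)(S)$. An example of this form of definition is a selection function:
$$\mathit{sel}(b)(\emptyset) = \emptyset \qquad \mathit{sel}(b)(\{a \Rightarrow t\}) = \text{if } a = b \text{ then } \{a \Rightarrow t\} \text{ else } \emptyset \qquad \mathit{sel}(b)(t_1 \cup t_2) = \mathit{sel}(b)(t_1) \cup \mathit{sel}(b)(t_2)$$
In our syntax, $\mathit{sel}(b)(S)$ can be written as $\mathit{ext}(\lambda(a, t).\text{if } a = b \text{ then } \{a \Rightarrow t\} \text{ else } \emptyset)(S)$. Its effect is to discard from $S$ all edges that are not labeled with $b$, together with their subtrees. For example, $\mathit{sel}(1)(\{1 \Rightarrow \{10\}, 2 \Rightarrow \{20\}\}) = \{1 \Rightarrow \{10\}\}$.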
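Another useful function is projection, defined as $\mathit{proj}(b)(S) \stackrel{def}{=} \mathit{ext}(\lambda(a, t).\text{if } a = b \text{ then } t \text{ else } \emptyset)(S)$. This function takes the union of the trees at the ends of $b$ edges and discards the others. Note how it differs from $\mathit{sel}(b)$: $\mathit{proj}(1)(\{1 \Rightarrow \{10\}, 2 \Rightarrow \{20\}\}) = \{10\}$.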
A flattening function defined as $\mathit{flat}(S) \stackrel{def}{=} \mathit{ext}(\lambda(a, t).t)(S)$ will also be useful. This function removes one level of edges out of the root and takes the union of the subtrees at their ends: $\mathit{flat}(\{1 \Rightarrow \{10\}, 2 \Rightarrow \{20\}\}) = \{10, 20\}$.
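The ext form and the three functions derived from it (selection, projection, flattening) can be sketched as follows. This is an illustrative model under our own assumptions: trees are frozensets of (label, subtree) pairs, and the Python names ext, sel, proj, flat and single are ours.

```python
EMPTY = frozenset()


def single(a, t=EMPTY):
    """{a => t}; with the default subtree this is the leaf abbreviation {a}."""
    return frozenset({(a, t)})


def ext(f):
    """ext(f): h(S) is the union of f(a, t) over the top-level edges (a, t) of S."""
    def h(S):
        return frozenset().union(*(f(a, t) for (a, t) in S))
    return h


def sel(b):
    """Keep only the edges labeled b, together with their subtrees."""
    return ext(lambda a, t: single(a, t) if a == b else EMPTY)


def proj(b):
    """Union of the subtrees found at the ends of the b-edges."""
    return ext(lambda a, t: t if a == b else EMPTY)


# Remove one level of edges and take the union of the subtrees below them.
flat = ext(lambda a, t: t)

# The tree {1 => {10}, 2 => {20}} used in the examples of the text.
S = single(1, single(10)) | single(2, single(20))
```

On S, these reproduce the three results stated in the text: sel(1)(S) is {1 => {10}}, proj(1)(S) is {10}, and flat(S) is {10, 20}.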
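To summarize the language at this point, we assume we have an infinite collection of typed variables for labels (ranged over by $a$) and for trees (ranged over by $t$). In addition we have a set of constants of type $\mathit{label}$ as described above. An expression in the language is built up from the variables and constants with the following rules:
$$\frac{}{\emptyset : \mathit{tree}} \qquad \frac{l : \mathit{label} \quad e : \mathit{tree}}{\{l \Rightarrow e\} : \mathit{tree}} \qquad \frac{e_1 : \mathit{tree} \quad e_2 : \mathit{tree}}{e_1 \cup e_2 : \mathit{tree}} \qquad \frac{e : \mathit{tree} \quad e' : \mathit{tree}}{\mathit{ext}(\lambda(a, t).e')\,e : \mathit{tree}}$$
$$\frac{e : \mathit{tree}}{\mathit{null}\ e : \mathit{bool}} \qquad \frac{l_1 : \mathit{label} \quad l_2 : \mathit{label}}{l_1 = l_2 : \mathit{bool}} \qquad \frac{b : \mathit{bool} \quad e_1 : \mathit{tree} \quad e_2 : \mathit{tree}}{\text{if } b \text{ then } e_1 \text{ else } e_2 : \mathit{tree}}$$
We also assume the boolean constants and operations with the obvious typing rules, together with other appropriate operations on labels. We shall call this language EXT.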
Nested Relational Algebra. We now have a language equivalent in expressive power to the nested relational algebra, since it includes all the operations described in [3]. Note that although we have not introduced an explicit pairing operation into EXT, it can be simulated with the operators already available. If we fix two labels, 1 and 2, the pair $e_1, e_2$ can be expressed as $\{1 \Rightarrow e_1, 2 \Rightarrow e_2\}$ and the projection operations are $\mathit{proj}(1)$ and $\mathit{proj}(2)$. In particular, we may simulate a (flat) relational database by constructing for each tuple $(v_1, \ldots, v_n)$ in relation $R(A_1, \ldots, A_n)$ a tree $\{\text{'}R \Rightarrow \{\text{'}A_1 \Rightarrow \{v_1\}, \ldots, \text{'}A_n \Rightarrow \{v_n\}\}\}$ and taking the union of all such trees.
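The simulation of pairs and of flat relations described above can be illustrated with a short sketch. It assumes, as an illustration only, that trees are frozensets of (label, subtree) pairs; the helper names pair and relation are hypothetical and do not appear in the paper.

```python
EMPTY = frozenset()


def single(a, t=EMPTY):
    """{a => t}; with the default subtree this is the leaf abbreviation {a}."""
    return frozenset({(a, t)})


def proj(b, S):
    """proj(b)(S): union of the subtrees found at the ends of the b-edges."""
    return frozenset().union(*(t for (a, t) in S if a == b))


def pair(e1, e2):
    """Simulated pair {1 => e1, 2 => e2}, using the fixed labels 1 and 2."""
    return single(1, e1) | single(2, e2)


def relation(name, attrs, rows):
    """Encode a flat relation: one {name => {A1 => {v1}, ...}} edge per tuple."""
    out = EMPTY
    for row in rows:
        tup = frozenset().union(
            *(single(attr, single(v)) for attr, v in zip(attrs, row)))
        out |= single(name, tup)
    return out


e1, e2 = single("x"), single("y")
r = relation("R", ["A", "B"], [(1, 2), (3, 4)])
```

Here proj(1, pair(e1, e2)) recovers e1 and proj(2, pair(e1, e2)) recovers e2, while proj("A", proj("R", r)) collects the A-column of the encoded relation.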
To illustrate the types of queries and transformations that we can perform with EXT we give some examples in the spirit of [5]. To simplify their presentation, we will use the following abbreviations: $e.a$ for $\mathit{proj}(a)(e)$, $e{\uparrow}a$ for $\mathit{sel}(a)(e)$, and $a \text{ in } e$ for $\neg\mathit{null}(\mathit{sel}(a)(e))$.
Example 1: Find the titles of all books on Genetics.
The last example does not return a subtree of the original tree, and illustrates how the result can restructure information. Such restructuring cannot be performed in OEM, the language of [5]. It should also be observed that the queries in these examples assume a particular structure on the trees, i.e. that the labels of interest appear at predetermined depths. In the next section, we will see how to specify queries which operate on trees in which labels can appear at arbitrary depths.
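To make the flavour of such fixed-depth queries concrete, the sketch below phrases the query of Example 1 (titles of all books on Genetics) over the bib tree. The particular formulation is our own assumption and should not be read as the expression intended in the text; it relies on 'doc edges sitting at the top level with 'topic and 'book directly beneath them. Trees are modelled as frozensets of (label, subtree) pairs, symbols are written as plain strings, and the names T, has and genetics_book_titles are hypothetical.

```python
EMPTY = frozenset()


def T(*items):
    """Build a tree; each item is a leaf label or a (label, subtree) pair."""
    return frozenset(i if isinstance(i, tuple) else (i, EMPTY) for i in items)


def ext(f):
    def h(S):
        return frozenset().union(*(f(a, t) for (a, t) in S))
    return h


def sel(b, e):      # the abbreviation e↑b
    return ext(lambda a, t: T((a, t)) if a == b else EMPTY)(e)


def proj(b, e):     # the abbreviation e.b
    return ext(lambda a, t: t if a == b else EMPTY)(e)


def has(a, e):      # the abbreviation "a in e", i.e. not null(sel(a)(e))
    return len(sel(a, e)) > 0


bib = T(
    ("doc", T(
        ("topic", T("Genetics")),
        ("book", T(
            ("title", T("Cartoon Guide to Genetics")),
            ("authors", T("Gonick", "Wheels")),
        )),
    )),
    ("doc", T(
        ("topic", T("Math")),
        ("book", T(
            ("title", T("Cartoon Guide to Statistics")),
            ("authors", T("Gonick")),
        )),
    )),
    ("doc", T(
        ("topic", T("Genetics", "Database")),
        ("article", T(
            ("title", T("FlyBase-the Drosophila database")),
            ("authors", T("The FlyBase Consortium")),
        )),
    )),
)


def genetics_book_titles(db):
    """For each doc whose topic contains "Genetics", return doc.book.title."""
    return ext(lambda a, t:
               proj("title", proj("book", t))
               if a == "doc" and has("Genetics", proj("topic", t))
               else EMPTY)(db)
```

The third document is on Genetics but is an article rather than a book, so it contributes nothing to the result.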

Structural recursion on trees
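We now consider a form of structural recursion that one would naturally associate with trees:
$$h(\emptyset) = \emptyset \qquad h(\{a \Rightarrow t\}) = f(a, h(t)) \qquad h(t_1 \cup t_2) = h(t_1) \cup h(t_2)$$
The only difference between this and our previous form of structural recursion is that $h$ acts recursively on the subtrees of a tree. As before we will use the syntax $\mathit{text}(\lambda(a, r).e)(S)$ for $h(S)$ when the function $f$ is defined by $f(a, r) = e$. For example, to change each 'topic label to a 'subject label we may use the function $\mathit{change\_lab}$ defined by
$$\begin{aligned}
\mathit{change\_lab}(\emptyset) &= \emptyset\\
\mathit{change\_lab}(\{a \Rightarrow t\}) &= \text{if } a = \text{'topic then } \{\text{'subject} \Rightarrow \mathit{change\_lab}(t)\} \text{ else } \{a \Rightarrow \mathit{change\_lab}(t)\}\\
\mathit{change\_lab}(t_1 \cup t_2) &= \mathit{change\_lab}(t_1) \cup \mathit{change\_lab}(t_2)
\end{aligned}$$
This will change labels at any depth in the tree. It is expressed using $\mathit{text}(\,)$ as
$$\mathit{change\_lab}(S) \stackrel{def}{=} \mathit{text}(\lambda(a, r).\text{if } a = \text{'topic then } \{\text{'subject} \Rightarrow r\} \text{ else } \{a \Rightarrow r\})(S)$$
We may also write a selection function that operates over the whole tree. $\mathit{tsel}(p)(S)$ selects only those edges in $S$ that satisfy $p$; the other edges are lost and their subtrees become inaccessible. It is defined by
$$\mathit{tsel}(p)(S) \stackrel{def}{=} \mathit{text}(\lambda(a, r).\text{if } p(a) \text{ then } \{a \Rightarrow r\} \text{ else } \emptyset)(S)$$
Applying this to the bib structure with the predicate $p(x) = \neg(x = \text{'topic})$ will result in the topic labels and the associated strings being removed from the tree. We may also build a "flat" tree of all the edges in a tree with
$$\mathit{flat\_trees}(S) \stackrel{def}{=} \mathit{text}(\lambda(a, r).\{a\} \cup r)(S)$$
Then $\mathit{flat\_trees}(\mathit{bib})$ results in $\{\text{'doc}, \text{'topic}, \text{"Genetics"}, \text{'book}, \ldots\}$. With such a transformation and the use of the discriminating function for strings, we can easily find all the strings in the database.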
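The recursive form text and the three functions built from it in this section can be sketched as below. As before this is only an illustrative model: trees are frozensets of (label, subtree) pairs, symbols are written as plain strings, and the helper T is our own shorthand for building trees.

```python
EMPTY = frozenset()


def T(*items):
    """Build a tree; each item is a leaf label or a (label, subtree) pair."""
    return frozenset(i if isinstance(i, tuple) else (i, EMPTY) for i in items)


def text(f):
    """text(f): like ext, but f receives the result r = h(t) of the recursive call."""
    def h(S):
        return frozenset().union(*(f(a, h(t)) for (a, t) in S))
    return h


# Rename every topic edge to subject, at any depth.
change_lab = text(lambda a, r: T(("subject", r)) if a == "topic" else T((a, r)))


def tsel(p):
    """Keep only edges whose label satisfies p; subtrees of dropped edges are lost."""
    return text(lambda a, r: T((a, r)) if p(a) else EMPTY)


# Collect every edge label of the tree as a flat set of leaves.
flat_trees = text(lambda a, r: T(a) | r)

S = T(("doc", T(("topic", T("Math")),
                ("book", T(("title", T("Stats")))))))
```

For instance, tsel(lambda x: x != "topic")(S) removes the topic edge together with the string "Math" beneath it, as described for the bib structure.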
A more interesting example is to find a tree containing the set of all paths from the root of the tree. We represent a path by a list, or "vertical" tree, so that the path consisting of the sequence of labels 'doc, 'book, 'title is $\{\text{'doc} \Rightarrow \{\text{'book} \Rightarrow \{\text{'title}\}\}\}$. We can obtain the set of all paths with
$$\mathit{all\_paths}(S) \stackrel{def}{=} \mathit{text}(\lambda(a, r).\{a\} \cup \mathit{ext}(\lambda(a_1, r_1).\{a \Rightarrow \{a_1 \Rightarrow r_1\}\})(r))(S)$$
In this expression, $r$ is bound recursively to all the paths of the subtree below the edge $a$. The set of paths we want includes the single edge $a$ together with the paths that are formed by tacking $a$ onto the beginning of each of the paths in $r$, which is done with an application of $\mathit{ext}(\,)$. The result of this query will be
$$\{\text{'doc},\ \text{'doc} \Rightarrow \{\text{'topic}\},\ \text{'doc} \Rightarrow \{\text{'topic} \Rightarrow \{\text{"Genetics"}\}\},\ \text{'doc} \Rightarrow \{\text{'book}\},\ \text{'doc} \Rightarrow \{\text{'book} \Rightarrow \{\text{'title}\}\},\ \ldots\}$$
As a final example of the use of $\mathit{text}(\,)$, consider the expression
$$\mathit{blow\_up}(S) \stackrel{def}{=} \mathit{text}(\lambda(a, r).\{a \Rightarrow r\} \cup r)(S)$$
When $S$ is the linear tree $\{1 \Rightarrow \{2 \Rightarrow \{3\}\}\}$, $\mathit{blow\_up}(S)$ produces $\{1 \Rightarrow \{2 \Rightarrow \{3\}, 3\}, 2 \Rightarrow \{3\}, 3\}$. When $S$ is a linear tree with $n$ edges it produces a tree with $2^n - 1$ edges. We shall see later that this apparent growth does not imply that we are performing transformations that are outside PTIME.
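Both all_paths and blow_up can be checked on small inputs with a sketch along the following lines. The representation (frozensets of (label, subtree) pairs) and the helpers T, chain and edges are our own assumptions, introduced only for illustration.

```python
EMPTY = frozenset()


def T(*items):
    """Build a tree; each item is a leaf label or a (label, subtree) pair."""
    return frozenset(i if isinstance(i, tuple) else (i, EMPTY) for i in items)


def ext(f):
    def h(S):
        return frozenset().union(*(f(a, t) for (a, t) in S))
    return h


def text(f):
    def h(S):
        return frozenset().union(*(f(a, h(t)) for (a, t) in S))
    return h


# {a} together with a tacked onto the front of every path in r.
all_paths = text(lambda a, r: T(a) | ext(lambda a1, r1: T((a, T((a1, r1)))))(r))

# Keep the edge a above r, and also lift r itself to the current level.
blow_up = text(lambda a, r: T((a, r)) | r)


def chain(n):
    """The linear tree {1 => {2 => ... {n} ...}} with n edges."""
    t = EMPTY
    for a in range(n, 0, -1):
        t = T((a, t))
    return t


def edges(t):
    """Number of edges in the (unshared) tree t."""
    return sum(1 + edges(s) for (_, s) in t)
```

On chain(3) the sketch yields the three paths 1, 1.2 and 1.2.3, and blow_up returns the tree quoted in the text; counting edges for small n agrees with the $2^n - 1$ figure.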
The preceding examples show that a number of queries that are expressed with "path variables" [5] can be computed in EXT together with $\mathit{text}(\,)$. We will refer to this extended version of EXT as TEXT. However, there is one extremely useful query which presents some difficulty: it is that of computing the union of a tree with all of its subtrees. A refinement of this problem is to obtain all the edges that satisfy a certain property together with their subtrees. For example, we might want all the books in the bib structure. In order to build such a tree we need a more general form of recursion
$$h(\emptyset) = \emptyset \qquad h(\{a \Rightarrow t\}) = f(a, t, h(t)) \qquad h(t_1 \cup t_2) = h(t_1) \cup h(t_2)$$
in which the function $f$ is now a three-place function taking the edge label $a$, the input subtree $t$, and the result of recursively computing $h$ on that subtree. We use the form $\mathit{text}'(\lambda(a, t, r).e)(S)$ for this form, where the function $f$ is defined by $f(a, t, r) = e$. The union of a tree $S$ with all of its subtrees is now given by:
$$\mathit{all\_trees}(S) \stackrel{def}{=} \mathit{text}'(\lambda(a, t, r).\{a \Rightarrow t\} \cup r)(S)$$
As a further example, to find all the books in the bib database together with their subtrees we can write
$$\mathit{text}'(\lambda(a, t, r).(\text{if } a = \text{'book then } \{a \Rightarrow t\} \text{ else } \emptyset) \cup r)(\mathit{bib})$$
Note that if we have one 'book tree below (or inside) another, this function will extract both of them.
$$\begin{aligned}
& X_1 \cup X_2 \ \text{ where}\\
& X_1 = \{\text{doc} \Rightarrow \{\text{topic} \Rightarrow \{\text{"Genetics"}\},\ \text{book} \Rightarrow \{\text{title} \Rightarrow \{\text{"Cartoon Guide to Genetics"}\},\ \text{authors} \Rightarrow Y_1 \cup Y_2\}\}\}\\
& X_2 = \{\text{doc} \Rightarrow \{\text{topic} \Rightarrow \{\text{"Math"}\},\ \text{book} \Rightarrow \{\text{title} \Rightarrow \{\text{"Cartoon Guide to Statistics"}\},\ \text{authors} \Rightarrow Y_1\}\}\}\\
& Y_1 = \{\text{"Gonick"},\ \text{papers} \Rightarrow X_1 \cup X_2\}\\
& Y_2 = \{\text{"Wheels"},\ \text{papers} \Rightarrow X_1\}
\end{aligned}$$
Figure 2: A specification of a cyclic structure
It is possible to implement $\mathit{text}'(\,)$ in TEXT, so that this form of structural recursion presents nothing new. However, the implementation involves the use of projection, which can cause problems when we consider operations on cyclic structures. The direct use of $\mathit{text}'(\,)$ is more closely related to a form of structural recursion that we shall now examine.
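The three-argument recursion text' discussed in this section can be sketched directly. The code is a hypothetical model (trees as frozensets of (label, subtree) pairs; the Python name text2 stands in for text', and T is our own tree-building shorthand), shown with the two queries all_trees and "all books".

```python
EMPTY = frozenset()


def T(*items):
    """Build a tree; each item is a leaf label or a (label, subtree) pair."""
    return frozenset(i if isinstance(i, tuple) else (i, EMPTY) for i in items)


def text2(f):
    """text'(f): f sees the label a, the original subtree t, and r = h(t)."""
    def h(S):
        return frozenset().union(*(f(a, t, h(t)) for (a, t) in S))
    return h


# Union of a tree with all of its subtrees.
all_trees = text2(lambda a, t, r: T((a, t)) | r)

# All book edges, with their original subtrees, found at any depth.
all_books = text2(lambda a, t, r: (T((a, t)) if a == "book" else EMPTY) | r)

chain3 = T((1, T((2, T(3)))))
inner = T(("title", T("B")))
outer = T(("title", T("A")), ("book", inner))
nested = T(("doc", T(("book", outer))))
```

With nested, one book edge sits inside another, and all_books(nested) returns both, matching the remark in the text.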

Cyclic structures
The languages EXT and TEXT operate on labeled trees. Surprisingly, EXT and an important fragment of TEXT, which we call VEXT, can be extended naturally from trees to cyclic structures. This is due to the fact that the queries in EXT and VEXT can be computed by independently processing each edge of a cyclic structure, without needing to chase every path in the structure. Syntactically, we describe cyclic structures with the aid of variables and equations defining these variables. Semantically, cyclic structures are rooted, labeled graphs.
Consider the syntactic specification of a cyclic structure given in Figure 2. It uses the variables $X_1, X_2, Y_1, Y_2$ and four equations defining them. We recover a tree from such a specification by textually substituting, in $X_1 \cup X_2$, each variable with the right-hand side of the equation defining it, and by repeating this process until all variables are eliminated, thus unfolding a labeled tree. For example, consider the same example from Figure 2, but with the definitions of $Y_1, Y_2$ changed to $Y_1 = \{\text{"Gonick"}\}$ and $Y_2 = \{\text{"Wheels"}\}$. Then by unfolding we get a subtree of the tree of Figure 1.
In general the unfolded tree may be infinite. Note that we may have different syntactic specifications denoting the same tree: these should be regarded as equal. Also, the (infinite) labeled trees that are meanings of syntactic specifications of cyclic structures are exactly the rational ones, i.e. those for which the set of subtrees is finite.
Formally, a syntactic specification of a cyclic structure is:
$$e \ \text{ where } \ X_1 = e_1, \ \ldots, \ X_k = e_k$$
Here $e, e_1, \ldots, e_k$ are labeled trees with markers $X_1, \ldots, X_k$. More precisely, they are expressions built up using the three constructors $\emptyset$, $\{\cdot \Rightarrow \cdot\}$, and $\cup$, and which may have the variables $X_1, \ldots, X_k$ on their leaves. The type $\mathit{tree}_{X_1, \ldots, X_k}$ of labeled trees with markers $X_1, \ldots, X_k$ is defined by:
$$\mathit{tree}_{X_1, \ldots, X_k} = \mathcal{P}_{fin}(\mathit{label} \times \mathit{tree}_{X_1, \ldots, X_k} \cup \{X_1, \ldots, X_k\})$$
We write $\{\text{'name} \Rightarrow \text{"Joe Doe"}\} \cup X_1 \cup X_2$ instead of the official $\{\text{'name} \Rightarrow \text{"Joe Doe"}, X_1, X_2\}$.
The semantics of cyclic structures is given by rooted, labeled graphs $G = (V, E, r, l)$. Each such graph has a distinguished vertex $r \in V$ called the root, and the edges are labeled with elements from $\mathit{label} \cup \{\varepsilon\}$, i.e. $l : E \to \mathit{label} \cup \{\varepsilon\}$, where $\varepsilon$ is a special symbol not occurring in $\mathit{label}$. For example, the cyclic structure of Figure 2 will be interpreted as the graph given in Figure 3. Notice how we use $\varepsilon$-edges to connect an occurrence of some variable $X_i$ with its definition. It is on these graphs that we can now define equality. Namely, we say that two graphs $G = (V, E, r, l)$ and $G' = (V', E', r', l')$ are bisimilar iff there exists a binary relation $\sim\ \subseteq V \times V'$ such that (1) $r \sim r'$, and (2) if $v \sim v'$ then for any label $a \in \mathit{label}$, there exists a path $v \xrightarrow{\varepsilon} \cdots \xrightarrow{\varepsilon} \xrightarrow{a} w$ in $G$ iff there exists some path $v' \xrightarrow{\varepsilon} \cdots \xrightarrow{\varepsilon} \xrightarrow{a} w'$ in $G'$ and $w \sim w'$. We state here without proof that two graphs are bisimilar iff their (potentially infinite) unfolded labeled trees are equal. This gives us an effective procedure for deciding whether two syntactic specifications are equal: (1) convert the two specifications to rooted graphs, and (2) test whether the two graphs are bisimilar. Note that testing for bisimilarity is a PTIME problem. By contrast, testing for graph isomorphism is believed to be outside of PTIME. See [1] for a discussion of the relevance of bisimulation in query languages with object identities.
Now we will extend our languages to cyclic structures. First notice that all operations in EXT can be extended straightforwardly to cyclic structures. This is obvious in the case of $\emptyset$, $\{\cdot \Rightarrow \cdot\}$, and $\cup$. To apply $\mathit{null}$ or $\mathit{ext}(f)$ to some syntactic specification $t$, we first have to expose the topmost set in $t$. For this we convert $t$ into a rooted, labeled graph $G$, and then restructure it into a graph $G'$ bisimilar to $G$, in which no $\varepsilon$-edge leaves the root (this is always possible). Next we convert $G'$ back to a syntactic specification $t'$, and on $t'$ we apply $\mathit{null}$ or $\mathit{ext}(f)$.
But $\mathit{text}$ cannot be extended to cyclic structures. Indeed, consider the query $\mathit{all\_paths}$ of section 4. On some infinite rational tree as input, i.e. a cyclic structure, it will return as output an infinite non-rational tree, i.e. a non-cyclic structure. Fortunately there exists a natural restriction of $\mathit{text}$, pointed out to us by Val Tannen [6], which allows us to define most of the queries of section 4, and which generalizes naturally to cyclic structures. We call this restriction $\mathit{vext}$.
To define $\mathit{vext}$, we start by discussing another primitive operation, substitution. Let $X$ be some variable. We add a new primitive to the ones already mentioned, namely the substitution $\mathit{subst}_X : \mathit{tree}_X \times \mathit{tree}_X \to \mathit{tree}_X$. $\mathit{subst}_X(s, t)$ will simply replace every occurrence of $X$ in $s$ with $t$. Formally:
$$\begin{aligned}
\mathit{subst}_X(X, t) &\stackrel{def}{=} t\\
\mathit{subst}_X(\{a \Rightarrow s\}, t) &\stackrel{def}{=} \{a \Rightarrow \mathit{subst}_X(s, t)\}\\
\mathit{subst}_X(t_1 \cup t_2, t) &\stackrel{def}{=} \mathit{subst}_X(t_1, t) \cup \mathit{subst}_X(t_2, t)
\end{aligned}$$
Notice that on "lists", i.e. trees in which each interior node has exactly one son, like $\{a_1 \Rightarrow \{a_2 \Rightarrow \{a_3 \Rightarrow X\}\}\}$, $\mathit{subst}_X$ becomes append.
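The substitution primitive can be illustrated on finite trees with markers. In the sketch below, which rests on our own assumptions, a tree with markers is a frozenset whose elements are either (label, subtree) pairs or marker objects; the class Var and the helper lst are hypothetical, and markers other than the one being substituted are simply left in place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A marker (variable) such as X occurring at a leaf of a tree."""
    name: str


EMPTY = frozenset()


def subst(x, s, t):
    """subst_x(s, t): replace every occurrence of the marker x in s by t."""
    out = EMPTY
    for item in s:
        if item == x:
            out |= t                          # subst_X(X, t) = t
        elif isinstance(item, Var):
            out |= frozenset({item})          # assumption: other markers are kept
        else:
            a, sub = item
            out |= frozenset({(a, subst(x, sub, t))})
    return out


def lst(labels, tail):
    """The 'list' {a1 => {a2 => ... {an => tail}}} over the given labels."""
    t = tail
    for a in reversed(labels):
        t = frozenset({(a, t)})
    return t


X = Var("X")
front = lst(["a1", "a2", "a3"], frozenset({X}))   # {a1 => {a2 => {a3 => X}}}
back = lst(["b1", "b2"], EMPTY)                   # {b1 => {b2}}
```

Substituting back for X in front yields the list a1, a2, a3, b1, b2, which is the append behaviour noted in the text.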
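Finally we are ready to introduce the $\mathit{vext}$ construct. Namely, for any function $f_X : \mathit{label} \to \mathit{tree}_X$, $h = \mathit{vext}_X(f_X)$ is defined by:
$$h(\emptyset) \stackrel{def}{=} \emptyset \qquad h(\{a \Rightarrow s\}) \stackrel{def}{=} \mathit{subst}_X(f_X(a), h(s)) \qquad h(t_1 \cup t_2) \stackrel{def}{=} h(t_1) \cup h(t_2)$$
Again, notice that on "lists" $\mathit{vext}_X$ is simply $\mathit{ext}$ on lists. We now informally describe how $\mathit{vext}_X(f_X)$ acts on some syntactic specification of a cyclic structure $t = (e \text{ where } X_1 = e_1, \ldots, X_k = e_k)$. The syntactic specification $t' = \mathit{vext}_X(f_X)(t)$ is obtained from $t$ by processing every subexpression of the form $\{a \Rightarrow e'\}$ of $e, e_1, \ldots, e_k$ as follows: we first fetch a fresh variable $Y$, next replace the subexpression $\{a \Rightarrow e'\}$ with $f_Y(a)$, and finally add the equation $Y = e'$. Intuitively there is a high degree of parallelism in the computation of $\mathit{vext}_X$, which can be visualized even better on rooted, labeled graphs. Namely, here $\mathit{vext}_X(f_X)(S)$ is computed by independently replacing each edge $v \xrightarrow{a} w$ with the tree $f_X(a)$ having $v$ as the root, and by drawing $\varepsilon$-edges from the leaves $X$ of $f_X(a)$ to $w$; the $\varepsilon$-edges in the original graph are left untouched.
We call VEXT the language obtained by extending EXT with the $\mathit{vext}$ construct. Obviously the following relationships between languages hold:
$$\text{EXT} \subseteq \text{VEXT} \subseteq \text{TEXT}$$
Here are some of the queries from section 4 expressed with $\mathit{vext}_X$:
$$\begin{aligned}
\mathit{change\_lab}(S) &\stackrel{def}{=} \mathit{vext}_X(\lambda a.\text{if } a = \text{'topic then } \{\text{'subject} \Rightarrow X\} \text{ else } \{a \Rightarrow X\})(S)\\
\mathit{tsel}(p)(S) &\stackrel{def}{=} \mathit{vext}_X(\lambda a.\text{if } p(a) \text{ then } \{a \Rightarrow X\} \text{ else } \emptyset)(S)\\
\mathit{flat\_trees}(S) &\stackrel{def}{=} \mathit{vext}_X(\lambda a.\{a\} \cup X)(S)\\
\mathit{blow\_up}(S) &\stackrel{def}{=} \mathit{vext}_X(\lambda a.\{a \Rightarrow X\} \cup X)(S)
\end{aligned}$$
Notice how sharing allows us to avoid combinatorial explosion of the data structure resulting from the function $\mathit{blow\_up}$. For example, on the input tree $S = \{a_1 \Rightarrow \{a_2 \Rightarrow \{\ldots \{a_n \Rightarrow \emptyset\} \ldots\}\}\}$, $\mathit{blow\_up}(S)$ will return:
$$X_0 \ \text{ where } \ X_0 = \{a_1 \Rightarrow X_1\} \cup X_1, \quad X_1 = \{a_2 \Rightarrow X_2\} \cup X_2, \quad \ldots, \quad X_{n-1} = \{a_n \Rightarrow X_n\} \cup X_n, \quad X_n = \emptyset$$
When we unfold this compact syntactic specification, we obtain a tree with $2^n$ nodes. In general, we have:
Proposition 5.1 VEXT is in PTIME.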
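The defining equations of vext can be tried out on finite trees with a small sketch. This covers only the finite, marker-free input case; the treatment of cyclic structures through equations or graphs described above is not modelled here. Trees with markers are frozensets of (label, subtree) pairs and marker objects, and the names Var, edge and hole are our own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A marker such as X that may occur at the leaves of f(a)."""
    name: str


EMPTY = frozenset()
X = Var("X")
hole = frozenset({X})                     # the tree consisting of the marker X alone


def edge(a, t=EMPTY):
    """{a => t}; with the default subtree this is the leaf abbreviation {a}."""
    return frozenset({(a, t)})


def subst(x, s, t):
    """Replace every occurrence of the marker x in s by the tree t."""
    out = EMPTY
    for item in s:
        if item == x:
            out |= t
        elif isinstance(item, Var):
            out |= frozenset({item})      # assumption: other markers are kept
        else:
            a, sub = item
            out |= edge(a, subst(x, sub, t))
    return out


def vext(x, f):
    """vext_x(f): h({a => s}) = subst_x(f(a), h(s)); f depends on the label only."""
    def h(S):
        return frozenset().union(*(subst(x, f(a), h(s)) for (a, s) in S))
    return h


change_lab = vext(X, lambda a: edge("subject", hole) if a == "topic" else edge(a, hole))
blow_up = vext(X, lambda a: edge(a, hole) | hole)

S = edge(1, edge(2, edge(3)))
```

On the linear tree S, blow_up gives the same answer as the text-based definition of section 4. Note that this finite model copies subtrees rather than sharing them, so it does not exhibit the compact representation that the equational form provides.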
Unlike $\mathit{text}$, however, $\mathit{vext}_X$ does not seem to be able to express all trees. However, we can express all trees with a slight generalization of $\mathit{vext}_X$, from one variable $X$ to an arbitrary number of variables. First we define $\mathit{subst}_{X_1, \ldots, X_n}(t, t_1, \ldots, t_n)$ to substitute simultaneously $X_1$ with $t_1$, $\ldots$, $X_n$ with $t_n$ in $t$. Then $\mathit{vext}_{X_1, \ldots, X_n}(f_1, \ldots, f_n)$ is a construct which allows us to define simultaneously $n$ functions $h_1, \ldots, h_n$ by iteration on the cyclic structure (we omit the subscripts $X_1, \ldots, X_n$):
$$\begin{aligned}
h_1(\emptyset) &\stackrel{def}{=} \emptyset & \ldots && h_n(\emptyset) &\stackrel{def}{=} \emptyset\\
h_1(\{a \Rightarrow s\}) &\stackrel{def}{=} \mathit{subst}(f_1(a), h_1(s), \ldots, h_n(s)) & \ldots && h_n(\{a \Rightarrow s\}) &\stackrel{def}{=} \mathit{subst}(f_n(a), h_1(s), \ldots, h_n(s))\\
h_1(t_1 \cup t_2) &\stackrel{def}{=} h_1(t_1) \cup h_1(t_2) & \ldots && h_n(t_1 \cup t_2) &\stackrel{def}{=} h_n(t_1) \cup h_n(t_2)
\end{aligned}$$
Then we can compute $\mathit{all\_paths}$ using $\mathit{vext}_{X_1, X_2}$. More interestingly, for any regular expression $R$ on labels, we can write in VEXT an expression $\mathit{select}_R(S)$ which, on a given tree $S$, returns the set of all subtrees which can be reached from the root of $S$ using a path in $R$. For example, when $R = (a(bc)^*)^*$, $\mathit{select}_R(S)$ will return the set of all subtrees in $S$ which can be reached from the root by a path of the form $abc \ldots bc\,abc \ldots bc \ldots abc \ldots bc$. We invite the reader to check that, for any regular expression $R$, $\mathit{select}_R(S)$ can be written using $\mathit{vext}_{X_1, \ldots, X_n}$. It suffices to take $n$ as the number of states in the deterministic automaton accepting $R$.

Conclusions
We believe that the forms of structural recursion on trees described in this paper offer good prospects for the development of powerful query languages for unstructured data. However, to demonstrate this, considerable additional work is needed. First, we need to substantiate the claim that the graph/bisimulation model provides an effective semantics for the language developed in section 5. Second, there appear to be some optimization