Finite Query Languages for Sequence Databases

This paper develops a query language for sequence databases, such as genome databases and text databases. Unlike relational data, queries over sequential data can easily produce infinite answer sets, since the universe of sequences is infinite, even for a finite alphabet. The challenge is to develop query languages that are both highly expressive and finite. This paper develops such a language. It is a subset of a recently developed logic called Sequence Datalog [19]. Sequence Datalog distinguishes syntactically between subsequence extraction and sequence construction. Extraction creates sequences of bounded length, and leads to safe recursion, while construction can create sequences of arbitrary length, and leads to unsafe recursion. In this paper, we develop syntactic restrictions for Sequence Datalog that allow sequence construction but preserve finiteness. The main idea is to use safe recursion to control and limit unsafe recursion. The main results are the definition of a finite form of recursion, called domain bounded recursion, and a characterization of its complexity and expressive power. Although finite, the resulting class of programs is highly expressive, since its data complexity is complete for the elementary functions.


Introduction
It is widely accepted that relational databases do not provide enough support for many of today's advanced applications. In some cases, object-oriented databases [4] are the right solution. However, in other cases, such as genome databases [12] and text databases [14], there is still a need for more flexibility in data representation and manipulation. In these applications, much of the data has an inherently sequential structure. This has several implications for database management systems. First, a DBMS should provide a sequence type; that is, it should be able to manipulate sequences of unbounded length over a fixed alphabet. Second, the query languages provided to the user must have powerful primitives and operators for analyzing and restructuring sequences.
Sequences represent a particularly interesting domain for query languages. In contrast to sets, computations over sequences can easily become infinite, even when the underlying alphabet is finite. This is because repetitions of symbols are allowed, so that the number of possible sequences over any finite alphabet is infinite. The researcher thus faces an interesting challenge: on the one hand, the language should provide powerful primitives for restructuring sequences; on the other hand, the expressive power of the language should be carefully limited, to avoid infinite computations.
In [19], we developed a logic called Sequence Datalog for querying sequence databases. Two safe subsets of the logic were defined, based on a new computational model called Generalized Sequence Transducers. These machines are a simple yet powerful device for computing sequence mappings. In [19], we showed how networks of these machines could be expressed in Sequence Datalog. Moreover, any Sequence Datalog program constructed in this way is guaranteed to be safe and finite. In this paper, we take a different approach: instead of computational definitions, we develop syntactic restrictions that guarantee finiteness and safety. This provides an alternate view of finite computations in the logic. The main idea is to use structural recursion (which is guaranteed to terminate) to limit the construction of new sequences. The first result is a syntactically defined class of Sequence Datalog programs that guarantees finiteness and safety. We call these programs domain bounded programs. The second result is a characterization of their complexity and expressive power. We prove that domain bounded programs can express any sequence mapping with hyper-exponential time complexity. Thus, although finite, these programs are still highly expressive.

Overview of Sequence Datalog
Sequence Datalog is an extension of Datalog for manipulating sequences. It uses a simple data model that extends the relational model by allowing tuples of sequences in relations, instead of just tuples of constant symbols. This section provides an informal overview of the syntax and semantics of Sequence Datalog. A formal development can be found in [19].
To manipulate sequences, Sequence Datalog has two interpreted function symbols for constructing complex terms, one for concatenating sequences and one for extracting subsequences. Intuitively, if X and Y are sequences, and I and J are integers, then the term $X \bullet Y$ denotes the concatenation of X and Y, and the term $X[I:J]$ denotes the subsequence of X extending from position I to position J. To be more precise, the language of terms uses three countable, disjoint sets: a set of constant symbols, a, b, c, ..., called the alphabet and denoted $\Sigma$; a set of variables, R, S, T, ..., called sequence variables and denoted $V_\Sigma$; and another set of variables, I, J, K, ..., called index variables and denoted $V_I$. A constant sequence (or sequence, for short) is an element of $\Sigma^*$. From these sets, we construct two kinds of term as follows:
-Index terms are built from integers, index variables, and the special symbol end, by combining them recursively using the binary connectives + and -. Thus, if N and M are index variables, then 3, $N + 3$, $N - M$, $end - 5$ and $end - 5 + M$ are all index terms.
-Sequence terms are built from constant sequences, sequence variables and index terms, by combining them recursively into indexed terms and constructive terms, as follows:
  -If s is a sequence variable and $n_1, n_2$ are index terms, then $s[n_1:n_2]$ is an indexed sequence term. $n_1$ and $n_2$ are called the indexes of s. As a shorthand, each sequence term of the form $s[n_i:n_i]$ is written $s[n_i]$.
  -If $s_1, s_2$ are sequence terms, then $s_1 \bullet s_2$ is a constructive sequence term.
Thus, if $S_1$ and $S_2$ are sequence variables, and N is an index variable, then $S_1[4]$, $S_1[1:N]$, and $ccgt \bullet S_1[1:end-1] \bullet S_2$ are all sequence terms.
The semantics of terms is formalized in [19]. Constructive sequence terms have a semantics of concatenation; e.g., $abc \bullet def = abcdef$. Indexed sequence terms have a semantics of subsequence extraction; e.g., $abcdef[2:5] = bcde$ and $abcdef[4:end] = def$. However, there are some subtleties when the index terms take on "fringe" values, as illustrated by the following examples, where $\epsilon$ denotes the empty sequence:

As in most logics, the language of formulas for Sequence Datalog includes a countable set of predicate symbols, p, q, r, ..., where each predicate symbol has an associated arity. If p is a predicate symbol of arity n, and $s_1, \ldots, s_n$ are sequence terms, then $p(s_1, \ldots, s_n)$ is an atom. Moreover, if $s_1$ and $s_2$ are sequence terms, then $s_1 = s_2$ and $s_1 \neq s_2$ are also atoms. From atoms, we build rules, facts and clauses in the usual way [17]. The head and body of a clause, $\gamma$, are denoted HEAD($\gamma$) and BODY($\gamma$), respectively. A clause that contains a constructive term in its head is called a constructive clause. A Sequence Datalog program is a set of Sequence Datalog rules in which constructive terms may appear in rule heads, but not in rule bodies.
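The two term constructors can be mimicked on ordinary strings. The sketch below is only an illustration: the helper names `concat` and `extract` are hypothetical, positions are 1-based and inclusive as in the text, and the treatment of out-of-range indexes (returning `None` for "undefined") is an assumption, since the fringe cases are formalized in [19] rather than here.

```python
from typing import Optional

def concat(x: str, y: str) -> str:
    """Constructive term: the concatenation of x and y."""
    return x + y

def extract(x: str, i: int, j: int) -> Optional[str]:
    """Indexed term x[i:j]: 1-based, inclusive subsequence extraction.

    Assumption: the term is treated as undefined (None) unless
    1 <= i <= j + 1 <= len(x) + 1; the case i == j + 1 yields the empty sequence.
    """
    if not (1 <= i <= j + 1 <= len(x) + 1):
        return None
    return x[i - 1:j]

END = len("abcdef")                  # 'end' denotes the last position of the sequence
print(concat("abc", "def"))          # abcdef
print(extract("abcdef", 2, 5))       # bcde
print(extract("abcdef", 4, END))     # def
```

The first three printed values reproduce the examples given in the text; the `None` convention is just one way to model an undefined term.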
We say that a variable, X, is guarded in a clause if X occurs in the body of the clause as an argument of some predicate. Otherwise, we say that X is unguarded. For example, X is guarded in $p(X[1]) \leftarrow q(X)$, whereas it is unguarded in $p(X) \leftarrow q(X[1])$. Because of the active domain semantics, variables in Sequence Datalog clauses need not be guarded.
The semantics of rules is formalized in [19] in terms of least fixpoints. As in classical logic programming [23,3,17], each Sequence Datalog program, P, has an associated operator, $T_P$, that maps databases to databases. Each application of $T_P$ may create new atoms, which may contain new sequences. The operator $T_P$ is monotonic and continuous and has a least fixpoint [19]. This paper develops conditions under which the least fixpoint is finite. In such cases, we say that P has a finite semantics.

Example 1.1 [Indexed Terms]
The following rule extracts all prefixes of sequences in relation R:

For each sequence, X, in R, this rule says that a prefix of X is any subsequence starting with the first element and ending with the N-th element, so long as N is no greater than the length of X.

□
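As an illustration of what the prefix rule of Example 1.1 computes, the sketch below enumerates X[1:N] for every N from 1 up to the length of X. It is a plain Python analogue, not Sequence Datalog; the function name `prefixes` and the sample relation are made up for the example. Whether the rule also yields the empty prefix depends on the fringe-case semantics of [19], so the sketch starts at N = 1.

```python
def prefixes(relation: set[str]) -> set[str]:
    """Collect X[1:N] for each X in the relation and each valid N >= 1."""
    result = set()
    for x in relation:
        for n in range(1, len(x) + 1):   # N may not exceed the length of X
            result.add(x[:n])            # X[1:N] in 1-based, inclusive terms
    return result

R = {"acgt", "ac"}
print(sorted(prefixes(R)))   # ['a', 'ac', 'acg', 'acgt']
```

Every result is a subsequence of some sequence in R, which is why a rule of this kind cannot enlarge the extended active domain.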
The universe of sequences over the alphabet, $\Sigma$, is infinite. Thus, to keep the semantics of programs finite, we do not evaluate rules over the entire universe, $\Sigma^*$. Instead, we introduce a new active domain for sequence databases, called the extended active domain. This domain contains all the sequences occurring in the database, plus all their subsequences. Substitutions range over this domain when rules are evaluated. The extended active domain is not fixed during query evaluation. Instead, whenever a new sequence is created (by the concatenation operator, $\bullet$), the new sequence, together with its subsequences, is added to the extended active domain.
The fixpoint theory of Sequence Datalog provides a declarative semantics for this apparently procedural notion. In the fixpoint theory, the extended active domain of the least fixpoint is larger than the extended active domain of the database. For the database, the domain consists of the sequences in the database and all their subsequences. For the least fixpoint, the domain consists of the sequences in the database and any new sequences created during rule evaluation, and all their subsequences.

Example 1.2 [Constructive Terms]
The following rule constructs all possible concatenations of sequences in relation R:

$answer(X \bullet Y) \leftarrow R(X), R(Y).$

This rule takes any pair of sequences, X and Y, in relation R, concatenates them, and stores the result in answer, thereby adding new sequences to the extended active domain. The concatenated sequences (and their subsequences) form the extended active domain of the least fixpoint.

□
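To make the growth of the extended active domain in Example 1.2 concrete, here is a small Python sketch. The helper names `subsequences` and `extended_domain` and the sample relation are assumptions made for illustration; the code applies the concatenation rule once and compares the domain before and after.

```python
def subsequences(s: str) -> set[str]:
    """All contiguous subsequences of s, including the empty sequence."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def extended_domain(sequences: set[str]) -> set[str]:
    """The given sequences plus all of their contiguous subsequences."""
    domain = set()
    for s in sequences:
        domain |= subsequences(s)
    return domain

R = {"ab", "c"}
answer = {x + y for x in R for y in R}   # one application of the concatenation rule
before = extended_domain(R)
after = extended_domain(R | answer)
print(sorted(answer))                    # ['abab', 'abc', 'cab', 'cc']
print(sorted(after - before))            # sequences that are new to the domain
```

Because the rule is not recursive, one application already reaches the fixpoint here; the domain grows, but only once.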
Compared to Datalog with function symbols, or Prolog, two differences are apparent. The first is that Sequence Datalog has no uninterpreted function symbols, so it is not possible to build arbitrarily nested structures. On the other hand, Sequence Datalog has a richer syntax than the [Head|Tail] list constructor of Prolog. This richer syntax is motivated by a natural distinction between two types of recursion, one safe and the other unsafe. Recursion through construction of new sequences is inherently unsafe since it can create longer sequences, which can make the active domain grow indefinitely. On the other hand, structural recursion over existing sequences is inherently safe, since it only creates shorter sequences, so that growth in the active domain is bounded. In fact, it is bounded by the set of all subsequences of the active domain, which we call the extended active domain. Typically, languages for list manipulation do not discriminate between these two types of recursion. Sequence Datalog does: constructive recursion is performed using constructive terms, of the form $X \bullet Y$, while structural recursion is performed using indexed terms, of the form $X[n_1:n_2]$.

Controlling Constructive Recursion
This section illustrates how sequences are manipulated in Sequence Datalog. The examples develop the idea that constructive recursion (which is unsafe) can be limited and controlled by structural recursion (which is always safe). This is the main idea of this paper, and the basis for the syntactic restrictions developed in Section 4.

Example 1.3 [Pattern Matching]
Suppose we are interested in sequences of the form $a^n b^n c^n$ in relation R. The query answer(X) retrieves all such sequences, where the predicate answer is defined by the following rules:

The formula answer(X) is true iff X is a sequence in R and it is possible to split X in three parts such that $abc_n$ is true. Predicate $abc_n$ is true for every triple of sequences of the form $(a^n, b^n, c^n)$ in the extended active domain of the database. □

In Example 1.3, the semantics of the rulebase is trivially finite for every database, since the rules contain no constructive terms. Thus, any sequence in the least fixpoint is a subsequence of a sequence in the database. In contrast, the next two examples restructure the sequences in the database, producing new sequences longer than any in the database. Example 1.4 does this in a straightforward way, but has an infinite semantics. Example 1.5 solves the same problem, but with a finite semantics.

Example 1.4 [Infinite Semantics]
Suppose R is a unary relation containing a set of sequences. For each sequence X in R, we want the sequence obtained by repeating each symbol in X twice. For example, given the sequence abcd, we want the sequence aabbccdd. We call these sequences echo sequences. The easiest way to define echo sequences is with the following program:

The first rule retrieves every sequence in relation R and its echo, by invoking the predicate echo(X, Y). The last two rules specify what an echo sequence is. For every sequence, X, in the extended active domain, these rules generate its echo sequence, Y. Starting with $X = \epsilon$ and $Y = \epsilon$, they recursively concatenate single characters onto X while concatenating two copies of the same character onto Y. As new sequences are generated, they are added to the active domain, which expands indefinitely.

□
The program in Example 1.4 has an infinite semantics over every database that contains a non-empty sequence.
This is because the rules defining echo(X, Y) recursively generate longer and longer sequences without bound. For example, suppose the input database contains only one tuple, {R(aa)}. Its extended active domain consists of the sequences $\epsilon$, a, aa. The table below shows how the inferred facts and the extended domain both grow during a bottom-up computation of the least fixpoint. Each row in the table is the result of one additional application of the $T_P$ operator. In each row, the inferred facts contain one more echo entry, and the extended active domain contains one more sequence, consisting entirely of a's. The least fixpoint of the $T_P$ operator is therefore infinite, and its extended active domain is the set of all sequences made of a's. Note that the query answer consists of a single atom, answer(aa, aaaa), which is computed during the fourth step. Thus, although the least fixpoint is infinite, the query answer is not. The next example expresses the query in such a way that both the answer and the least fixpoint are finite.

Example 1.5
In this program, the sequences in relation R act as input for the third rule, which defines the predicate echo(X, Y). This rule recursively scans each input sequence, X, while constructing an output sequence, Y. For each character in the input sequence, two copies of the character are appended to the output sequence. The rule computes the echo of every prefix of every sequence in R. The first rule then retrieves the echoes of the sequences in R.

□
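The behaviour described for Example 1.5 can be imitated with a small bottom-up loop. This is a hedged Python sketch rather than the program itself: the function names `echo_facts` and `answer` are hypothetical, and the loop simply follows the prose description (scan each input sequence one character at a time, appending two copies of that character to the output).

```python
def echo_facts(relation: set[str]) -> set[tuple[str, str]]:
    """Bottom-up computation of echo(X, Y) for every prefix X of a sequence in R."""
    facts = {("", "")}                   # base case: the echo of the empty sequence
    changed = True
    while changed:                       # iterate until no new fact can be derived
        changed = False
        for x in relation:
            for prefix, out in list(facts):
                n = len(prefix)
                # Structural recursion: extend only while prefix is a proper prefix of x.
                if n < len(x) and x.startswith(prefix):
                    new = (x[:n + 1], out + x[n] * 2)   # constructive step
                    if new not in facts:
                        facts.add(new)
                        changed = True
    return facts

def answer(relation: set[str]) -> set[tuple[str, str]]:
    """Keep only the echoes of the sequences that are actually in R."""
    return {(x, y) for (x, y) in echo_facts(relation) if x in relation}

print(sorted(echo_facts({"aa"})))   # [('', ''), ('a', 'aa'), ('aa', 'aaaa')]
print(answer({"abcd"}))             # {('abcd', 'aabbccdd')}
```

The loop terminates because every derived first argument is a prefix of an input sequence, so only finitely many facts can ever be added.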
Like Example 1.4, the program in Example 1.5 involves constructive recursion. However, in Example 1.5, the least fixpoint is finite. This is because constructive recursion does not go on indefinitely, but terminates as soon as the input sequences have been scanned. In essence, growing terms of upwardly bounded length are used to guarantee termination: these terms "grow" at each recursive evaluation of the rule, and recursion stops when the upper bound has been reached. In this way, structural recursion over the first argument controls and limits constructive recursion over the second argument. The bottom-up computation is similar to the one in the table above, except that no more echo facts are inferred after the fourth step, at which point the least fixpoint is reached and the computation stops.

Preliminary Definitions
This section introduces technical definitions used in the rest of the paper, including sequence database, sequence query and sequence function.
Let $\Sigma$ be a countable set of symbols, called the alphabet. $\Sigma^*$ denotes the set of all possible sequences over $\Sigma$, including the empty sequence, $\epsilon$. $\sigma_1 \sigma_2$ denotes the concatenation of two sequences, $\sigma_1, \sigma_2 \in \Sigma^*$. LEN($\sigma$) denotes the length of sequence $\sigma$, and $\sigma(i)$ denotes its i-th element. With an abuse of notation, we blur the distinction between elements of the alphabet and sequences of length one.
We say that a sequence, $\sigma'$, of length k is a contiguous subsequence of sequence $\sigma$ if for some integer, $i \geq 0$, $\sigma'(j) = \sigma(i + j)$ for $j = 1, \ldots, k$. Note that for each sequence of length k over $\Sigma$, there are at most $\frac{k(k+1)}{2} + 1$ different contiguous subsequences (including the empty sequence). For example, the contiguous subsequences of the sequence abc are: $\epsilon$, a, b, c, ab, bc, abc. We now describe an extension of the relational model, in the spirit of [13,15]. The model allows for tuples containing sequences of elements, instead of just constant symbols. A relation of arity k over $\Sigma$ is a finite subset of the k-fold cartesian product of $\Sigma^*$ with itself. A database over $\Sigma$ is a finite set of relations over $\Sigma$. We assign a distinct predicate symbol, r, of appropriate arity to each relation in a database.
A sequence query is a partial mapping from the set of databases over $\Sigma$ to itself. Given a sequence query, Q, and a database, DB, the result of evaluating Q over DB is denoted Q(DB). Similarly, a sequence function [11] is a partial mapping from $\Sigma^*$ to itself. A sequence function is computable if it is partial recursive. Usually, a notion of genericity [10] is introduced for queries. The notion can be extended to sequence queries in a natural way. We say that a sequence query Q is computable [10] if it is generic and partial recursive.
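The bound on the number of contiguous subsequences is easy to check by brute force. The sketch below is illustrative only; the names `contiguous_subsequences` and `bound` are hypothetical. It shows that the bound k(k+1)/2 + 1 is met when all non-empty subsequences are distinct and is not met when symbols repeat.

```python
def contiguous_subsequences(s: str) -> set[str]:
    """All contiguous subsequences of s, including the empty sequence."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def bound(k: int) -> int:
    """Upper bound k(k+1)/2 + 1 for a sequence of length k."""
    return k * (k + 1) // 2 + 1

for s in ["abc", "aaa", "acgt"]:
    subs = contiguous_subsequences(s)
    print(s, len(subs), bound(len(s)))
# abc 7 7
# aaa 4 7
# acgt 11 11
```

For abc the seven subsequences are exactly those listed in the text; for aaa only four distinct subsequences exist, so the bound is strict.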
Database Programming Languages, 1995

Sequence functions can be thought of as queries from a database, $\{input(\sigma_{in})\}$, containing a single sequence tuple, to a database, $\{output(\sigma_{out})\}$, containing a single sequence tuple. Expressibility results formulated in terms of sequence functions are especially meaningful for sequence query languages, since they provide a clear characterization of the power of the language to manipulate sequences. A sequence query language cannot express complex queries over sequence databases if it cannot express complex sequence functions. In short, function expressibility is necessary for query expressibility.
This paper addresses the complexity of sequence functions, and the data complexity [25] of sequence queries. Given a sequence function, f, the complexity of f is defined in the usual way, as the complexity of computing $f(\sigma)$, measured with respect to the length of the sequence $\sigma$. Given a sequence query, Q, a database, DB, and a suitable encoding of DB as a Turing machine tape, the data complexity of Q is the complexity of computing an encoding of Q(DB), measured with respect to the size of DB. A query language, L, is complete in the complexity class C if: (i) each query expressible in L has complexity in C; (ii) there is a query, Q, expressible in L such that computing Q(DB) is a complete problem for the complexity class C.
This paper also addresses the ability of query languages to express sequence functions. A query language, L, is said to express a complexity class, C, of sequence functions if: (i) each sequence function expressible in L has complexity in C and conversely, (ii) each sequence function with complexity in C can be expressed in L.

The Finiteness Problem
As discussed in the previous sections, computations over sequences may become infinite even when the underlying alphabet is finite. We are interested in studying finite programs, that is, programs that have a finite semantics over every input database.

Definition 1 [Finite Programs] A program is finite if it has a finite semantics (i.e. a finite least fixpoint) over every database.
As is typical of powerful logics, the finiteness property for Sequence Datalog programs is in general undecidable [19]. Thus, our aim is to develop subsets of the logic that are finite. In [19], we took what might be called a "semantic" approach, defining finite subsets in terms of abstract computing devices, called generalized sequence transducers. In this paper, we take a syntactic approach, defining finite subsets in terms of syntactic restrictions.
We first note that a necessary condition for infiniteness is the generation of sequences of unbounded length, as in Example 1.4. To do this, programs must use recursion through construction. That is, newly computed sequences must be used recursively to construct more new sequences. This kind of computation is closely related to a particular form of constructive rule, which we call recursive constructive rules. In such rules, the predicate in the head depends on itself. To formalize this concept, we introduce the notion of a predicate dependency graph of a Sequence Datalog program. This notion, and several others, are closely related:
-Predicate p is a constructive predicate in program P if P contains a constructive rule for p, that is, a rule with a constructive term (a term containing $\bullet$) in its head. Note that constructive predicates cause new sequences to be added to the domain during query evaluation.
-Predicate p depends on predicate q in program P if P contains a rule in which p is the predicate symbol in the head and q is a predicate symbol in the body. If the rule is constructive, then p depends constructively on q.
-The predicate dependency graph, $PDG_P$, of program P is a directed graph representing the binary relation "depends on" over the predicates of P. An edge (p, q) in this graph is a constructive edge if p depends constructively on q.
-Predicate p is recursive with respect to construction in program P if the predicate dependency graph for P contains a cycle passing through p with a constructive edge.
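The definitions above translate directly into a small graph check. The following Python sketch is an illustration under stated assumptions: each rule is abstracted to a triple (head predicate, body predicates, constructive flag), the function names are hypothetical, and the two sample programs are invented for the sketch (they are not the programs of Example 3.1).

```python
from collections import defaultdict

# A rule is summarized as (head predicate, body predicates, is_constructive).
Rule = tuple[str, list[str], bool]

def dependency_edges(rules: list[Rule]) -> set[tuple[str, str, bool]]:
    """Edges (p, q, constructive) of the predicate dependency graph."""
    return {(head, q, cons) for head, body, cons in rules for q in body}

def reachable(edges: set[tuple[str, str, bool]], start: str) -> set[str]:
    """Predicates reachable from start by following one or more edges."""
    succ = defaultdict(set)
    for p, q, _ in edges:
        succ[p].add(q)
    seen, stack = set(), [start]
    while stack:
        for q in succ[stack.pop()]:
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return seen

def constructive_cycle_heads(rules: list[Rule]) -> set[str]:
    """Predicates p with a constructive edge (p, q) that lies on a cycle."""
    edges = dependency_edges(rules)
    return {p for p, q, cons in edges if cons and p in reachable(edges, q)}

# Hypothetical programs: both are recursive, only the second through construction.
prog_a = [("p", ["q"], False), ("q", ["p"], False), ("r", ["p"], True)]
prog_b = [("p", ["q"], True), ("q", ["p"], False)]
print(constructive_cycle_heads(prog_a))   # set()
print(constructive_cycle_heads(prog_b))   # {'p'}
```

A program whose result is the empty set has no cycle containing a constructive edge, so none of its predicates is recursive with respect to construction.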

Example 3.1 Consider the following two Sequence Datalog programs, $P_1$ and $P_2$: q(X): q(X) r(X): r(X Y) p(X); t ( X;Y ): Both programs are recursive, so their predicate dependency graphs both have cycles. The graph of $P_2$ has a constructive cycle, while the graph of $P_1$ does not. Thus, $P_2$ is recursive with respect to construction, while $P_1$ is not.

□
The simplest way to enforce finiteness in the presence of constructive rules is to disallow recursion through construction. This means forbidding programs whose predicate dependency graph contains cycles with constructive edges. Intuitively, this means we can find a stratification of the program with respect to constructive rules. In this case, the least fixpoint of the program is finite for every database, since there is no way to construct new sequences of unbounded length. We have the following result about the resulting language, which we call Stratified Sequence Datalog.

Theorem 1 Stratified Sequence Datalog is data complete for PTIME.
Although Stratified Sequence Datalog is complete for PTIME, it has a very limited ability to restructure sequences. Many natural, simple and low-complexity restructurings, such as reversing a sequence or computing its complement, require constructive recursion, and cannot be expressed in Stratified Sequence Datalog. Intuitively, reversing a sequence of length n requires n concatenations (to concatenate the n characters together in reverse order). Likewise for complementation. For these operations, the number of concatenations depends on the database. In contrast, in Stratified Sequence Datalog, the number of concatenations is fixed for each program and is independent of the database.
To increase expressiveness while preserving finiteness, the next section develops syntactically restricted forms of recursion.

Domain Bounded Recursion
In some patterns of recursion, the length of newly constructed sequences is bounded above; that is, the recursive construction of new sequences proceeds up to a certain length and then stops, so that the semantics is finite. This section develops one of these patterns, which we call domain bounded recursion. In this form of recursion, the length of constructed sequences is bounded above by the size of the active domain of the database, that is, by the sum of the lengths of all sequences in the database. Recursion therefore stops after a finite amount of time, depending on the size of the domain.
In previous examples of finite programs, the length of a computed sequence depended only on the length of a single sequence in the database (as when computing the echo of a sequence, in Example 1.5). In the next example, the length of a computed sequence depends on two database sequences.

Example 4.1 [Shuffle]
We define a shuffle of two sequences, $\sigma_1$ and $\sigma_2$, to be any sequence obtained by interleaving the symbols of $\sigma_1$ with those of $\sigma_2$. For example, the shuffles of ab and 12 are ab12, a1b2, a12b, 12ab, 1a2b, 1ab2.
The program below computes all the shuffles of pairs of sequences from the unary predicate R. It defines a predicate shuffle(X, Y, Z) to be true iff Z is a shuffle of X and Y, where X and Y are sequences in R. The length of Z is thus the sum of the lengths of X and Y. For this reason, recursion stops after a finite amount of time.

In this program, pairs of sequences from relation R act as inputs to the last two rules, which define shuffle(X, Y, Z). These rules recursively scan the input sequences X and Y while constructing an output sequence, Z. Starting with the empty sequence, these rules nondeterministically scan a symbol in X or a symbol in Y, and then append this symbol to the growing shuffle sequence, Z.

□
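The scanning process described in Example 4.1 can be sketched in Python as follows. This is an illustration, not the Sequence Datalog program: the function name `shuffles` is hypothetical, the nondeterministic choice is simulated by exploring both branches, and only complete shuffles (both inputs fully scanned) are returned.

```python
def shuffles(x: str, y: str) -> set[str]:
    """All interleavings of x and y that preserve the order within each input."""
    results = set()
    # Each state is (symbols of x consumed, symbols of y consumed, output so far).
    frontier = [(0, 0, "")]
    while frontier:
        i, j, z = frontier.pop()
        if i == len(x) and j == len(y):
            results.add(z)               # both inputs fully scanned: recursion stops
            continue
        if i < len(x):
            frontier.append((i + 1, j, z + x[i]))   # scan the next symbol of x
        if j < len(y):
            frontier.append((i, j + 1, z + y[j]))   # scan the next symbol of y
    return results

print(sorted(shuffles("ab", "12")))
# ['12ab', '1a2b', '1ab2', 'a12b', 'a1b2', 'ab12']
```

Every output has length len(x) + len(y), which is the bound that makes the recursion finite.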
Before defining Domain Bounded Recursion, we need to develop some preliminary notions.

Reasoning about Length
To determine if a program is finite, we need to reason about what it does. In particular, we need to reason about the lengths of any new sequences created by the program. If these lengths can be bounded, then the program is finite. This section develops a simple formalism for comparing the "lengths" of two sequence terms. The idea is that terms can be compared on the basis of their possible instantiations. For instance, if X is a sequence variable, then we would say that the term $X \bullet abc$ is "longer" than the term X. This means that any instantiation of the one term is longer than the corresponding instantiation of the other term. This idea will allow us to reason about programs, and to develop conditions under which they are guaranteed to be finite. As a first step, we develop the notion of the symbolic length of a sequence term. This is an arithmetic expression in which symbols and numbers can appear. For example, if X is a sequence variable, then its symbolic length is the symbol $L_X$. Likewise, if $X \bullet Y$ is a sequence term, then its symbolic length is $L_X + L_Y$. The symbolic length of a constant sequence (e.g., actg) is its actual length (e.g., 4). Such expressions allow us to reason about the lengths of partially specified sequences. The reasoning is tractable because we are dealing with just a tiny subset of arithmetic.
To reason about sequence terms such as $X[N:M]$, we need to reason about the index terms N and M. We therefore introduce the notion of the symbolic value of an index term. Like symbolic lengths, symbolic values are arithmetic expressions in which numbers and symbols can appear. For example, if N is an index variable, then its symbolic value is the symbol $V_N$. In general, the symbolic value of an index term depends on the sequence term in which it is embedded. For example, in the sequence term $X[N:end]$, the index term end represents the last character in the sequence X. Thus, in the term $acgt[2:end]$, the symbolic value of end is 4, while in the term $actgactg[2:end]$, its symbolic value is 8. The following definition makes these ideas precise.

Definition 2 [Symbolic Length and Value]
The symbolic length of a sequence term, s, is an arithmetic expression, denoted $L(s)$. The symbolic value of an index term, n, in the context of s is also an arithmetic expression, denoted $V(n; s)$. These expressions are built from integers, two binary connectives (+ and -), and a collection of symbols. They are constructed in a mutually recursive fashion as follows:

-Symbolic Lengths:
  -If s is a constant sequence in $\Sigma^*$, then $L(s)$ is the length of s.
  -If S is a sequence variable, then $L(S)$ is the symbol $L_S$.
  -$L(s_1 \bullet s_2)$ is the expression $L(s_1) + L(s_2)$.
  -$L(s[n_1:n_2])$ is the expression $V(n_2; s) - V(n_1; s) + 1$.

-Symbolic Values:
  -If n is an integer, then $V(n; s) = n$.
  -If N is an index variable in $V_I$, then $V(N; s)$ is the symbol $V_N$.
  -$V(end; s) = L(s)$; e.g., $V(end; X) = L_X$ and $V(end; atcgatcg) = 8$.
  -$V(n_1 \pm n_2; s)$ is the expression $V(n_1; s) \pm V(n_2; s)$; e.g., $V(N + 3 \ldots$

Here are some sequence terms and their symbolic lengths:

Symbolic lengths can be manipulated in a variety of ways. For instance, we can add and subtract two symbolic lengths to obtain another symbolic length. In some situations, we can also evaluate a symbolic length to obtain an integer. For example, if a symbolic length contains only integers and no symbols, then it can be evaluated in the normal way. Even if a symbolic length contains symbols, these symbols may cancel out, so the expression can be evaluated; e.g., the value of $L_X + 4 - L_X - 2$ is 2. This gives two well-defined situations in which symbolic lengths can be evaluated. In fact, these are the only such situations. After all, if a symbol does not cancel itself out, then a symbolic length cannot be evaluated; e.g., the expression $L_X - L_Y + 3$ cannot be evaluated. This idea gives us a mechanism with which to compare two symbolic lengths.

Definition 3 [Comparability]
Two sequence terms $s_1$, $s_2$ are comparable if the symbolic expression $L(s_1) - L(s_2)$ can be evaluated to yield an integer, k. If $k > 0$ then $s_1$ is longer than $s_2$. If $k = 0$ then $s_1$ is the same length as $s_2$.
On the other hand, the terms $s_5 = S[1:N]$ and $s_6 = S[1:M]$ are incomparable. To see this, note that $L(s_5) - L(s_6)$ reduces to $V_N - V_M$, which cannot be evaluated. Similarly, the terms $s_7 = X \bullet Y$ and $s_8 = Y$ are incomparable. In this case, $L(s_7) - L(s_8)$ reduces to $L_X$, which cannot be evaluated. □
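Symbolic lengths and the comparability test can be prototyped as arithmetic on linear expressions. The sketch below is a minimal illustration under several assumptions: terms are encoded as nested tuples (an encoding invented here), expressions are dictionaries from symbols to integer coefficients, and all function names are hypothetical. It is not meant as the paper's procedure, only as one way to mechanize Definitions 2 and 3.

```python
CONST = "1"   # key under which the integer part of an expression is stored

def const(n):
    return {CONST: n} if n else {}

def add(e1, e2, sign=1):
    """Return e1 + sign * e2, dropping symbols whose coefficients cancel."""
    out = dict(e1)
    for k, c in e2.items():
        out[k] = out.get(k, 0) + sign * c
    return {k: c for k, c in out.items() if c != 0}

def value(n, s):
    """Symbolic value V(n; s). n is an int, 'end', ('ivar', N), or (op, n1, n2)."""
    if isinstance(n, int):
        return const(n)
    if n == "end":
        return length(s)
    if n[0] == "ivar":
        return {"V_" + n[1]: 1}
    op, n1, n2 = n
    return add(value(n1, s), value(n2, s), 1 if op == "+" else -1)

def length(s):
    """Symbolic length L(s). s is ('const', w), ('var', X), ('cat', s1, s2) or ('idx', base, n1, n2)."""
    kind = s[0]
    if kind == "const":
        return const(len(s[1]))
    if kind == "var":
        return {"L_" + s[1]: 1}
    if kind == "cat":
        return add(length(s[1]), length(s[2]))
    _, base, n1, n2 = s                       # base[n1:n2]
    return add(add(value(n2, base), value(n1, base), -1), const(1))

def compare(s1, s2):
    """Return the integer L(s1) - L(s2), or None if the terms are incomparable."""
    diff = add(length(s1), length(s2), -1)
    if any(k != CONST for k in diff):
        return None
    return diff.get(CONST, 0)

X, Y, S, N, M = ("var", "X"), ("var", "Y"), ("var", "S"), ("ivar", "N"), ("ivar", "M")
print(compare(("cat", X, ("const", "abc")), X))                 # 3
print(compare(("idx", S, 1, N), ("idx", S, 1, M)))              # None
print(compare(("cat", X, Y), Y))                                # None
print(compare(("idx", X, 1, ("+", N, 1)), ("idx", X, 1, N)))    # 1
```

The two `None` results correspond to the incomparable pairs discussed in the text, where the difference reduces to $V_N - V_M$ and to $L_X$ respectively.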

Constrained Variables
Another notion that we need is constrained variables. Intuitively, we need to infer when a variable ranges over a fixed domain that does not grow during query evaluation [24]. For example, in the rule $p(X[1:3]) \leftarrow q(X)$, the variable X is constrained, since it is forced to range over sequences in relation q. However, in the rule $p(X) \leftarrow q(X[1:3])$, variable X is not constrained. To see this, suppose the database contains the fact q(abc). Then the indexed term $X[1:3]$ can take on the value abc, which means that X can be any sequence that has abc as a prefix. Thus, X can range over an infinite domain, including sequences of unbounded length. As another example, consider the following rule:

In this case, X is a constrained variable, and so is Y. First, X is constrained to range over the sequences in relation q; and then Y is constrained to range over subsequences of X.
These ideas motivate the following definition. In this definition, and throughout this paper, we use the notation (p, i) to refer to the i-th attribute (or argument) of predicate p.

The ideas of constrained variables and symbolic length lead to a simple formulation of structural recursion in Sequence Datalog. For example, consider the following rules, where q is a base predicate:

$p(X, \epsilon) \leftarrow q(X).$
$p(X, X[1:N+1]) \leftarrow q(X), p(X, X[1:N]).$

Here, p(X, Y) is true iff X is a sequence in q and Y is a prefix of X. To see this, note that if X is a sequence in q, then $p(X, \epsilon)$ is true, by the first rule. Then, using the second rule, $p(X, X[1:1])$ is true, then $p(X, X[1:2])$ is true, then $p(X, X[1:3])$, and so on up to p(X, X). After this, $X[1:N+1]$ is undefined, so recursion stops. The rules thus scan each sequence in q from beginning to end, which is a canonical example of structural recursion. There are two points to observe here. (i) In both rules, variable X is constrained by the predicate q. (ii) The second argument of p grows with each bottom-up application of the rules. The notion of "growth" can be made precise by comparing the symbolic lengths of terms in the head and body of a rule using Definition 3. In this case, $X[1:N+1]$ is longer than $X[1:N]$.
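The bottom-up evaluation of the two prefix rules can be traced with a short Python sketch. It is an illustration only: the function name `prefix_facts` is hypothetical, and the sketch assumes, as the walk-through in the text does, that the body term X[1:N] can denote the empty sequence and that X[1:N+1] is undefined once N + 1 exceeds the length of X.

```python
def prefix_facts(q: set[str]) -> set[tuple[str, str]]:
    """Least fixpoint of the two rules defining p over the base relation q."""
    facts = {(x, "") for x in q}                 # first rule: p(X, empty) for each X in q
    while True:
        new = set()
        for x, y in facts:
            n = len(y)                           # y plays the role of X[1:N] with N = n
            # Second rule: the head term X[1:N+1] is defined only if N + 1 <= len(X).
            if x in q and x.startswith(y) and n + 1 <= len(x):
                new.add((x, x[:n + 1]))
        if new <= facts:                         # nothing new was derived: fixpoint reached
            return facts
        facts |= new

print(sorted(prefix_facts({"abc"})))
# [('abc', ''), ('abc', 'a'), ('abc', 'ab'), ('abc', 'abc')]
```

The second argument grows by one symbol per round and can never exceed the length of X, which is exactly the behaviour the growth condition is meant to capture.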
The following definition generalizes this idea.
Definition 5 [Growing Attributes] Suppose predicate p occurs in the head of a rule and occurs once in the rule body. For each k, attribute (p, k) grows in the rule if the sequence term in attribute (p, k) in the head is longer than the sequence term in attribute (p, k) in the body. In addition, attribute (p, k) does not shrink in the rule if the sequence term in attribute (p, k) in the head is longer than or the same length as the sequence term in attribute (p, k) in the body.
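As a rough illustration of Definition 5, the comparison of head and body terms can be sketched in Python for one very restricted case. The sketch assumes a simplified stand-in for symbolic length (Definition 3 is not reproduced here): a term is either a plain variable X or a prefix term X[1:N+c], and two prefix terms over the same X and N are compared by their offsets c. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A plain sequence variable, e.g. X."""
    name: str


@dataclass(frozen=True)
class Prefix:
    """A prefix index term X[1 : N + offset]."""
    seq_var: str
    idx_var: str
    offset: int


def grows(head, body):
    """True if the head term is strictly longer than the body term.

    Simplifying assumption: only prefix terms over the same sequence
    variable and the same index variable are comparable.
    """
    if isinstance(head, Prefix) and isinstance(body, Prefix):
        same_base = (head.seq_var, head.idx_var) == (body.seq_var, body.idx_var)
        return same_base and head.offset > body.offset
    return False


def does_not_shrink(head, body):
    """True if the head term is longer than, or identical to, the body term."""
    return head == body or grows(head, body)


if __name__ == "__main__":
    # Rule: p(X, X[1:N+1]) ← q(X), p(X, X[1:N]).
    head_args = (Var("X"), Prefix("X", "N", 1))
    body_args = (Var("X"), Prefix("X", "N", 0))
    for k, (h, b) in enumerate(zip(head_args, body_args), start=1):
        print(k, grows(h, b), does_not_shrink(h, b))
```

For the prefix rule, attribute (p, 1) does not shrink and attribute (p, 2) grows, which matches the informal discussion of that rule.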

Domain Bounded Recursion: Definition
We have now developed the concepts needed to define Domain Bounded Recursion. The idea is to allow recursion through construction, but in a controlled and limited way. The result is a class of Sequence Datalog programs that we call domain bounded programs. This subset is defined in terms of four restrictions on Sequence Datalog programs. The first two restrictions are not strictly necessary, since they can be generalized without much difficulty; however, they simplify the theoretical development. The last two restrictions are the heart of domain bounded recursion. They are based on the ideas of constrained variables and symbolic length, developed above. The restrictions all apply to recursive constructive rules, that is, to recursive rules that have a constructive term in the head.
The first restriction we impose is that in recursive constructive rules, the head must have exactly one constructive argument. A constructive argument is an argument that contains a constructive term. Thus, the first rule below is allowed, but the second rule is not:

    p(X, X • Y) ← q(X), p(X, Y).
    p(X • Y, X • Y) ← q(X), p(X, Y).

The second restriction we impose on Sequence Datalog is that recursive constructive rules be linear. Recall that a rule is linear iff the predicate in the head is mutually recursive with the predicate of at most one atom in the body [6].
Actually, we require more than mere linearity, since we disallow mutual recursion through construction. Thus, the predicate symbol in the head of a recursive constructive rule must also occur in the body of the rule. We call this restricted linear recursion through construction. For example, if q is a base predicate, then the following rule is restricted linear:

    p(X, X • Y) ← q(X), p(X, Y).

The rulebases in Examples 4.1 and 1.5, defining the predicates shuffle and echo, are both restricted linear. This property of a program can easily be checked in polynomial time (polynomial in the number of rules).
Note that mutual recursion and non-linear recursion are still allowed. However, they are not allowed in constructive rules. We thus have all the power of classical Datalog at our disposal (since Datalog is a subset of Sequence Datalog). Moreover, abolishing mutual recursion through construction does not limit our expressive power, since mutual recursion can always be reduced to non-mutual recursion.

The third restriction we impose is based on an idea we call constructive variables. Consider a recursive constructive rule. As above, suppose the rule is restricted linear and has exactly one constructive term in the head. Suppose this term is in argument (p, k). Because the rule is restricted linear, predicate p also occurs in the body, exactly once. Argument (p, k) in the body contains at most one sequence variable, Z (since constructive terms are not allowed in rule bodies). If Z also occurs in argument (p, k) in the head, then we call Z the constructive variable of the rule. For example, variable Z is the constructive variable of predicate shuffle in Example 4.1. In other words, Z is passed from argument (p, k) in the body to argument (p, k) in the head, and in the process, some other sequence is appended to it. This property allows the rule to recursively construct new sequences, which can lead to the unbounded generation of new sequences. It is this kind of behavior that we want to limit and control using structural recursion.

Definition 6 [Domain Bounded Program] A Sequence Datalog program is domain bounded if every recursive constructive rule in the program satisfies all the following conditions:
1. the head has exactly one constructive argument;
2. the rule is restricted linear;
3. except for the constructive variable (if any), every sequence variable is constrained;
4. there is some non-constructive argument that grows in the rule, and every other non-constructive argument does not shrink in the rule.
As mentioned earlier, only items 3 and 4 in this definition are strictly necessary. The others serve to simplify the development. (A more general treatment will soon be available in [18].)

Finally, in rule γ2, the first argument grows and the second does not shrink; and in rule γ3, the second argument grows and the first does not shrink. The program is therefore domain bounded.

5 Complexity and Expressibility
In this section we prove that domain bounded programs are finite. We shall actually prove a stronger result, showing that, although finite, domain bounded recursion is highly expressive. In fact, domain bounded programs can generate sequences of exponential length, as the following example shows.

We now prove complexity and expressibility results for domain bounded programs, and show that they capture exactly the class of elementary sequence functions [20], that is, the class of sequence functions with hyper-exponential time complexity.

Definition 7 [Elementary Functions]
The class of elementary sequence functions, E, is defined in terms of the hyper-exponential functions, $\mathit{hyp}_i(n)$. These latter functions are defined recursively as follows:

    $\mathit{hyp}_1(n) = n$
    $\mathit{hyp}_{i+1}(n) = 2^{\mathit{hyp}_i(n)}$, for $i \geq 1$

$\mathit{hyp}_i$ is called the hyper-exponential function of level i. The set of elementary sequence functions is the set of sequence functions that have hyper-exponential time complexity, that is, the set of sequence functions:

The following theorem characterizes the expressibility of domain bounded programs in terms of these functions.

Database Programming Languages, 1995
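To make the growth rate of the hyper-exponential functions in Definition 7 concrete, the recursion can be evaluated directly. The Python sketch below is only an illustration; the function name `hyp` and its argument order are assumptions, and it takes level 1 as the base case with each further level adding one exponentiation.

```python
def hyp(i, n):
    """Hyper-exponential function of level i.

    hyp(1, n) = n and hyp(i + 1, n) = 2 ** hyp(i, n) for i >= 1,
    so level i is a tower of i - 1 twos topped by n.
    """
    if i < 1:
        raise ValueError("level i must be at least 1")
    value = n
    for _ in range(i - 1):
        value = 2 ** value      # one more exponentiation per level
    return value


if __name__ == "__main__":
    for level in range(1, 5):
        print(level, hyp(level, 2))
```

Even for n = 2 the values climb quickly (2, 4, 16, 65536 for levels 1 to 4), which gives a feel for the amount of time available to a computation bounded by a fixed level of the hierarchy.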
Recall from Section 2 that a language L is said to express a class of sequence functions, C, if (i) each sequence function expressible in L has complexity in C, and conversely, (ii) each sequence function with complexity in C can be expressed in L.

Proof: (Sketch) To prove the lower expressibility bound, we show that given any Turing machine that runs in hyper-exponential time, its computations can be encoded as a domain bounded program. The program generates a counter of hyper-exponential length using a technique similar to that in Example 5.1. This establishes a lower complexity bound, which leads directly to a lower expressibility bound, since we are dealing with sequence functions, and not more-general sequence queries.
To prove the upper expressibility bound, we show that domain bounded programs can compute at most a hyper-exponential number of sequences, and that the maximum length of these sequences is at most hyper-exponential in the size of the initial database. Thus, the domain size of the least fixpoint is hyper-exponential in the domain size of the database. The least fixpoint can therefore be computed in hyper-exponential time. □

Example 4.3 [Continued Shuffling] Consider the predicate shuffle in Example 4.1. There are two recursive constructive rules in this program, γ2 and γ3. Both are restricted linear and both have one constructive argument in the head. For both rules, Z is the constructive variable, and all other sequence variables (X and Y) are constrained by predicate R.

Example 5.1 [Long Sequences] The following program is domain bounded:

    doubling(ε, 1) ← true.
    doubling(X[1:N+1], Y • Y) ← input(X), doubling(X[1:N], Y).

Given a sequence of length n in predicate input, the predicate doubling computes a sequence of 1's of length $2^n$ by doubling the length of a unit sequence n times.
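The behavior of the doubling program in Example 5.1 can be checked with a small bottom-up simulation. The Python sketch below is an illustration only: the function name `doubling` and the encoding of facts as pairs of strings are assumptions, and index terms are taken to be 1-based and inclusive, with X[1:0] denoting the empty sequence.

```python
def doubling(inputs):
    """Naive bottom-up evaluation of
         doubling(ε, 1) ← true.
         doubling(X[1:N+1], Y • Y) ← input(X), doubling(X[1:N], Y).
    over a finite set of input sequences.
    """
    facts = {("", "1")}                         # first rule
    changed = True
    while changed:
        changed = False
        for x in inputs:                        # X is constrained by input
            for n in range(len(x)):             # X[1:N+1] is defined only if N + 1 <= |X|
                for prefix, y in list(facts):
                    if prefix != x[:n]:         # body needs doubling(X[1:N], Y)
                        continue
                    new_fact = (x[:n + 1], y + y)   # Y • Y doubles the length
                    if new_fact not in facts:
                        facts.add(new_fact)
                        changed = True
    return facts


if __name__ == "__main__":
    result = dict(doubling({"abcd"}))
    print(len(result["abcd"]))                  # length of the sequence of 1's
```

The first argument scans the input one symbol at a time (structural recursion), and each step doubles the second argument, so an input of length n bounds the construction at n doublings.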

Theorem 2 (Expressibility of Domain Bounded Programs) Domain bounded programs express the class E of elementary sequence functions.
The following program is another way of expressing the query in Example 1.4.

Let γ be a rule, let S be a sequence variable, and let p be a predicate occurring in the body of γ. We say that S is constrained by attribute (p, i) in γ if at least one of the following holds:

- variable S is the i-th argument of some occurrence of predicate p in the body of γ;
- the body of γ contains an equality atom of the form S = S1[N1:N2], where S1 is constrained by (p, i) in γ.
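The two cases of this definition translate into a short recursive check. The Python sketch below is a minimal illustration under assumed encodings: body atoms are given as (predicate, arguments) pairs with arguments written as strings, and each equality atom S = S1[N1:N2] is reduced to the pair (S, S1). The function name `constrained_by` and the sample rules are hypothetical.

```python
def constrained_by(var, pred, i, atoms, equalities, seen=frozenset()):
    """Check whether sequence variable `var` is constrained by attribute (pred, i).

    atoms:      list of (predicate_name, argument_tuple) for the rule body
    equalities: list of (S, S1) pairs, one per body atom S = S1[N1:N2]
    seen:       variables already visited, to guard against cyclic equalities
    """
    if var in seen:
        return False
    # Case 1: var is the i-th argument of some occurrence of pred in the body.
    if any(name == pred and len(args) >= i and args[i - 1] == var
           for name, args in atoms):
        return True
    # Case 2: var = S1[N1:N2] for some S1 that is itself constrained by (pred, i).
    seen = seen | {var}
    return any(lhs == var and constrained_by(src, pred, i, atoms, equalities, seen)
               for lhs, src in equalities)


if __name__ == "__main__":
    # Hypothetical rule body: q(X), Y = X[N1:N2]
    atoms = [("q", ("X",))]
    equalities = [("Y", "X")]
    print(constrained_by("X", "q", 1, atoms, equalities))
    print(constrained_by("Y", "q", 1, atoms, equalities))
    # Rule body q(X[1:3]): the argument is an index term, not the variable X itself.
    print(constrained_by("X", "q", 1, [("q", ("X[1:3]",))], []))
```

In the first body both X and Y are constrained by (q, 1), whereas in the body q(X[1:3]) the variable X is not, in line with the earlier discussion of the rule p(X) ← q(X[1:3]).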