Advances in Databases and Information Systems 1997

In this paper, an optimal method for evaluating linear recursive datalog queries is proposed. The method is based on the concepts of so-called heritage appearance function and heritage selection function. By computing such functions in topological order, a counting-like strategy can be implemented, which requires only linear time for non-cyclic data.


Introduction
Deductive databases generalize relational databases by including not only base predicates (or relations), but also derived predicates (or views). A derived predicate is defined by means of one or more deductive rules.
Many strategies for processing deductive rules, especially recursively defined rules, have been proposed (see [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][17][18][19][20][21][22][23]). In this paper, we confine ourselves to the counting method [4,20] for linear recursion and try to improve its performance in the case of non-cyclic data. This method seeks to perform a compile-time transformation of the database, based on the query form, into an equivalent form which enables a bottom-up computation to focus on relevant tuples. As with the magic set method [3,4], the transformed programs consist of two rule sets: counting rules and modified rules. Thus, the computation can be done in a two-phase approach. In the first phase, we produce a counting set by evaluating the counting rules. In the second phase, we produce all answers by evaluating the modified rules, with the counting set being used to restrict the computation. According to the graphic analysis performed in [17,18], the worst-case time complexity of this method is $O(n \cdot e)$, better than magic sets. Here, we introduce two new concepts, the heritage appearance function and the heritage selection function, and transform many algebraic operations into simple computations of such functions (i.e., some boolean operations) in topological order. In this way, high efficiency can be obtained not only due to the simplicity of boolean operations, but also due to the elimination of much redundancy by exploiting the properties of binary sequences (strings of 1's and 0's).
The paper is organized as follows. In the next section, we introduce the concepts of linear recursive queries, query graph and query dependency graph, and describe the counting method in a graphical formalism. In Section 3, we define the heritage appearance function and heritage selection function, and present an optimal evaluation algorithm for canonical strongly linear recursions (CSLRs), which reduces the cost of both the first and second phase of the counting method to $O(e)$. In Section 4, we discuss more complicated linear recursion, i.e., non-interdependent linear recursion (NILR) and interdependent linear recursion (ILR). In Sections 5 and 6, we prove the correctness of the refined algorithm and compare its time complexity with that of other well-known strategies. Section 7 is a short conclusion.

Basic Definitions
The language of a deductive database consists of the variables, constants, and predicate names in the database. We adopt some informal notational conventions for them. Variables will normally be denoted by the letters u, v, x, y and z (possibly subscripted). Constants will normally be denoted by the letters a, b and c (possibly subscripted). Predicate names will normally be denoted by the letters p, q, r, and s (possibly subscripted). In the absence of function symbols, a term is either a constant or a variable. Occasionally, it will be convenient not to apply these conventions rigorously. In such a case, possible confusion will be avoided by the context. An atom is an n-ary predicate, $p(t_1, t_2, \dots, t_n)$, $n \ge 0$, where $p$ is a predicate name and $t_1, t_2, \dots, t_n$ are terms. A literal is an atom or the negation of an atom. A positive literal is just an atom. A negative literal is the negation of an atom. A rule is a first-order formula of the form $q \leftarrow p_1, p_2, \dots, p_m$, $m \ge 0$.
$q$ is called the head and the conjunction $p_1, p_2, \dots, p_m$ is called the body of the rule. Each $p_i$ is a body literal. When $m = 0$, the rule is of the form $q \leftarrow$ and is known as a unit clause.
An atom $p(t_1, t_2, \dots, t_n)$, $n \ge 0$, is ground when all of its terms $t_1, t_2, \dots, t_n$ are constants. A ground rule is one in which each atom in the rule is ground. A fact is a ground unit clause. The definition of a predicate $p$ is the set of rules which have $p$ as the head predicate. A base predicate is defined solely by facts. The set of facts in the database is also known as the extensional database. A rule that is not a fact is known as a derivation rule. A derived predicate is a predicate which is defined solely by derivation rules. A derived (base) literal is one whose predicate is derived (base). The set of derivation rules is also known as the intensional database or program.
In addition, in this paper, we use the term graph to refer to a directed graph, since we do not discuss undirected ones at all. We assume that a graph $G$ is specified as follows: for each node $v_i$ in the graph, there is a set of successors $adj(v_i) = \{v_j \mid (v_i, v_j) \text{ is an edge of } G\}$. Without loss of generality, we assume that $G$ has no self-loops, i.e., for all nodes $v_i$, $v_i \notin adj(v_i)$. For an edge $(v_i, v_j)$, node $v_i$ is called the source or tail and node $v_j$ is called the destination or head of the edge. We denote the transitive closure of a graph $G$ by [...]. As will be seen, our algorithms are based on depth-first traversal of graphs, so we now review some relevant definitions. Depth-first traversal induces a spanning tree on a graph based on the order in which nodes are visited. If we assume that the main routine in depth-first traversal is visit($v_i$) for a node $v_i$, then there is an edge $(v_i, v_j)$ in the spanning tree if there is a call to visit($v_j$) during the execution of the call visit($v_i$).
An edge $(v_i, v_j)$ in the graph $G$ is called a tree edge if it belongs to the spanning tree. An edge $(v_i, v_j)$ in the graph $G$ but not in the spanning tree is called a forward edge, a back edge, or a cross edge if, in the spanning tree, $v_j$ is a descendant of $v_i$, $v_j$ is an ancestor of $v_i$, or $v_j$ is not related to $v_i$ by an ancestor-descendant relationship, respectively. For every strong component, its node $r$ on which visit($r$) is first called is the root of the strong component.
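The edge classification reviewed above can be illustrated with a small sketch. The Python function below is not taken from the paper: the name `classify_edges`, the adjacency-dictionary input, and the use of discovery and finish timestamps are assumptions made purely for illustration.

```python
def classify_edges(adj, roots):
    """Label every edge met by a depth-first traversal as tree/forward/back/cross."""
    disc, fin, kind = {}, {}, {}   # discovery time, finish time, edge label
    clock = 0

    def visit(u):
        nonlocal clock
        clock += 1
        disc[u] = clock
        for v in adj.get(u, ()):
            if v not in disc:              # v is first reached through (u, v)
                kind[(u, v)] = "tree"
                visit(v)
            elif v not in fin:             # v is still open: an ancestor of u
                kind[(u, v)] = "back"
            elif disc[u] < disc[v]:        # v was finished below u: a descendant
                kind[(u, v)] = "forward"
            else:                          # no ancestor-descendant relationship
                kind[(u, v)] = "cross"
        clock += 1
        fin[u] = clock

    for r in roots:
        if r not in disc:
            visit(r)
    return kind


# Hypothetical example graph (not one of the paper's figures).
example = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["a"]}
print(classify_edges(example, ["a"]))
```

The recursion mirrors the visit routine of the text; for very deep graphs an explicit stack would be needed instead.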

Query Graph and Query Dependency Graph
We distinguish among three kinds of linear recursion: canonical strongly linear recursion (CSLR), non-interdependent linear recursion (NILR) and interdependent linear recursion (ILR). A CSLR is a program that contains only one linear recursive rule besides the non-recursive rules. For example, the following program is a typical CSLR.

    (1)  rp(x, y) :- flat(x, y).
         rp(x, y) :- up(x, z), rp(z, w), down(w, y).

In contrast, an NILR may contain arbitrarily many recursive predicates, but no interdependency occurs. That is, no recursive predicate in an NILR appears in the body of any rule defining another. As an example, consider the following abstract program:

    q(x, y) :- s_1(x, y),
    q(x, y) :- s_2(x, y),
    s_1(x, y) :- r_1(x, y),
    s_1(x, y) :- p_1(x, z), s_1(z, w), q_1(w, y),
    s_2(x, y) :- r_2(x, y),
    s_2(x, y) :- p_2(x, z), s_2(z, w), q_2(w, y),

in which the recursive predicates $s_1$ and $s_2$ are non-interdependent. The ILR is the most complicated linear recursion, where some recursive predicates may be interdependent. That is, one recursive predicate may appear in the body of a rule defining another. For example, in the following program, recursive predicate $q_1$ is dependent on $q_2$ while $q_2$ itself depends on both $q_3$ and $q_4$.

    q_1(x, y) :- p_1(x, z), r_1(z, y),
    q_1(x, y) :- p_1(x, z), q_1(z, w), q_2(w, u), q_3(u, y),
    q_1(x, y) :- p_1(x, z), q_1(z, w), q_2(w, u), q_3(u, y),
    q_2(x, y) :- p_2(x, z), q_3(z, w), q_2(w, u), q_4(u, y),
    q_3(x, y) :- r_3(x, y),
    q_3(x, y) :- p_3(x, z), r_3(z, w), q_3(w, y),
    q_4(x, y) :- p_4(x, y),
    q_4(x, y) :- p_4(x, z), q_4(z, y).
In order to investigate the behavior of a CSLR evaluation, we associate a directed graph (called the query graph) with each query against a CSLR program. A query graph basically consists of three parts: the up-part (UP), the flat-part (FP), and the down-part (DP). The UP is that relation part which is reachable from the constants in the query. The FP is that part which can be reached using the non-recursive rule, and the DP can be reached using the recursive rule. For example, if the up, flat, and down predicates in the above CSLR program are defined as

    up = {(a_1, a_2), (a_1, a_3), (a_1, a_4), (a_1, a_5), (a_2, a_3), (a_2, a_4), (a_2, a_5), (a_3, a_4), (a_3, a_5), (a_4, a_5)},
    flat = {(a_5, b_2)},
    down = {(b_5, b_4), (b_4, b_3), (b_3, b_2), (b_2, b_1)},

then the query graph representing the query ?- rp($a_1$, y) will be as shown in Figure 1, where the edges going up represent tuples in UP, the broken edges represent tuples in FP, and the edges going down represent tuples in DP.

We will use the notation $G_u = (N_u, E_u)$, $G_f = (N_f, E_f)$, and $G_d = (N_d, E_d)$ to denote the subgraphs induced from UP, FP, and DP, respectively. In Figure 1, node $a_1 \in N_u$ represents the constant in the query rp($a_1$, y) and we call such a node the source node. Given a subgraph $G$ (i.e., $G$ can be $G_u$, $G_f$, or $G_d$) and a subset $X$ of $G$, we denote by $adj_G(X)$ ($adj^{-1}_G(X)$) the set of all nodes $v_j$ such that the edge $(v_i, v_j)$ ($(v_j, v_i)$) is in $G$ and $v_i$ is in $X$. In other words, $adj_G(X)$ is the set of all nodes that are adjacent to some node in $X$, whereas $adj^{-1}_G(X)$ is the set of all nodes having at least one adjacent node in $X$. It is easy to see that these two sets of nodes can be computed using the rules stated above, which can also be expressed as relational algebra expressions.
As an example, consider the above query graph. Say that $X = \{a_4, a_5\}$. Then $adj_{G_u}(X) = \{a_5\}$, $adj^{-1}_{G_u}(X) = \{a_1, a_2, a_3, a_4\}$, and $adj_{G_f}(X) = \{b_5\}$.
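The two adjacency operators can be sketched over a relation stored as a set of tuples. The function names `adj` and `adj_inv` and the string node names below are illustrative assumptions; only the up relation of the running example is used.

```python
def adj(edges, X):
    """Nodes w such that some edge (v, w) has its tail v in X."""
    return {w for (v, w) in edges if v in X}


def adj_inv(edges, X):
    """Nodes v such that some edge (v, w) has its head w in X."""
    return {v for (v, w) in edges if w in X}


# The up relation of the running example: (a_i, a_j) for all 1 <= i < j <= 5.
up = {(f"a{i}", f"a{j}") for i in range(1, 6) for j in range(i + 1, 6)}
X = {"a4", "a5"}
print(sorted(adj(up, X)))      # successors of X in G_u
print(sorted(adj_inv(up, X)))  # predecessors of X in G_u
```

Each call is a single scan of the relation, which corresponds to one join and projection in relational algebra.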
In order to specify the program decomposition strategy with respect to NILRs and ILRs, we define a graph, called the query dependency graph, for each linear recursive program as follows. The nodes of the graph represent the recursive predicates appearing in the program. An edge $p \to q$ connects $p$ and $q$ iff $p$ appears in the body of some rule defining $q$, and it is labelled with an adornment that is expected for $p$ in terms of the current query submitted to the system. An adornment for an m-ary predicate $p(t_1, t_2, \dots, t_m)$ is a string of length $m$ made up of the letters b and f, where b stands for bound and f stands for free. We obtain an adornment for a predicate as follows. During a computation, each argument $t_i$, $1 \le i \le m$, of the literal $p(t_1, t_2, \dots, t_m)$ is expected to be bound or free, depending on the information flow. If $t_i$ is expected to be bound (free), it acquires a b (f) annotation, and so the length of the adornment string is $m$. For example, for the ILR program shown above, we have a query dependency graph as shown in Figure 2 if the query against this program is of the form ?- $q_1(c, y)$. Obviously, the query dependency graph for an NILR is just a node set, containing no edges at all, and the query dependency graph for an ILR is a graph containing no cycles (otherwise, the program is non-linear recursive). In fact, a query dependency graph corresponds exactly to a decomposition of the corresponding program.
That is, each node $p$ in the graph can be thought of as a CSLR defining $p$ with the other recursive predicates (appearing in it) handled as non-recursive ones. These CSLRs are connected to each other with the adornment-labelled edges. More importantly, in terms of such a graph, each CSLR can be compiled (as discussed in Section 3) independently but only once. Of course, there may be many different methods for partitioning a program. For our purpose, however, decomposing a program into a set of CSLRs is desired so that the optimization ideas proposed in Section 3 can be directly employed for a decomposed program.

Counting Method
Since we are going to refer frequently to the counting method, it is convenient to describe it in a graphical formalism [18]. Figure 3 represents the basic idea of the counting method (essentially, it corresponds to the implementation of the supplementary counting method [4]). It works as follows. Let $U_i$ ($i \ge 0$) contain all nodes $v$ in $G_u$ that have distance $i$ from the source node $s$. In the first phase, the method computes the sets $U_i$ (notice that, in general, such sets are not disjoint; together they are called the counting set, and this process corresponds to the evaluation of the counting rules). Suppose that $U_g$ contains the nodes with the greatest distance (thus $g$) from $s$. In the second phase, we start by computing the set $D_g$ of all nodes in $G_f$ that are adjacent to some node of $U_g$ in $G_f$. Then we compute $D_{g-1}$ as the set of all nodes that are adjacent to some node of $U_{g-1}$ in $G_f$ or adjacent to some node of $D_g$ in $G_d$. We continue until we compute $D_0$, which contains all the answers (answer nodes) of the query (this process corresponds to the evaluation of the modified rules). If the graph $G_u$ is cyclic, this version of the counting method is not safe.
    $U_0 := \{s\}$; $i := 0$;
    while $U_i \neq \emptyset$ do begin ...

Figure 3. Counting method

From the description of the algorithm, we can see that in the case of acyclic data the first loop is performed $O(|N_u|)$ times and every iteration has cost $O(|E_u|)$. Therefore, the total cost of the first phase is $O(|N_u| \cdot |E_u|)$. Similarly, the second phase has a total cost of $O(|N_u| \cdot |E_d|)$. Hence, the cost of the counting method for acyclic queries is $O(n \cdot e)$, where $n$ and $e$ denote the number of nodes and edges, respectively, in the graph representing the input relations.
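As a rough illustration of the two phases just described, the following Python sketch computes the sets $U_i$ level by level and then the sets $D_i$ from $D_g$ down to $D_0$. It is a reading of the prose description, not a transcription of Figure 3; the function names and the sample relations are hypothetical, and the data is assumed to be acyclic (on cyclic data the first loop would not terminate).

```python
def adj(edges, X):
    """Nodes w such that some edge (v, w) has its tail v in X."""
    return {w for (v, w) in edges if v in X}


def counting_method(up, flat, down, source):
    # First phase: U[i] holds the nodes at distance i from the source in G_u.
    U = [{source}]
    while U[-1]:
        U.append(adj(up, U[-1]))
    U.pop()  # drop the final empty level

    # Second phase: D_i = adj_{G_f}(U_i) united with adj_{G_d}(D_{i+1}),
    # evaluated for i = g, g-1, ..., 0 (with D_{g+1} taken to be empty).
    D = set()
    for level in reversed(U):
        D = adj(flat, level) | adj(down, D)
    return D  # D_0, the answer set


# Hypothetical acyclic data, chosen only for illustration.
up = {("a1", "a2"), ("a1", "a3"), ("a2", "a3")}
flat = {("a3", "b3")}
down = {("b3", "b2"), ("b2", "b1")}
print(sorted(counting_method(up, flat, down, "a1")))
```

Every level rescans the relations, which is where the $O(n \cdot e)$ bound of the text comes from.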
The purpose of this paper is to reduce the time complexity of both the first and second phases using a new method based on the notions of heritage appearance function and heritage selection function. We discuss these two functions and the corresponding computation methods in Section 3.

Optimal Algorithm for CSLRs
In this section, we describe our efficient algorithm for CSLRs. First, in Subsection 3.1, we discuss the optimization for CSLRs without cyclic data in detail. Then, in Subsection 3.2, we briefly sketch a method for CSLRs with cyclic data.

CSLRs without Cyclic Data
Similar to the counting method, our algorithm works in a two-phase manner. In the first phase, we compute the appearance function sequences for each node of $G_u$ in $O(e)$ time. Then, in the second phase, we extract the answers in terms of such appearance function sequences in some way. As will be seen later, the time requirement of the second phase is also bounded by $O(e)$.

Speeding up the Counting Method by Computing Heritage Functions in Topological Order

The First Phase of the Algorithm
Here we describe the first phase of the counting method. First, we present the concept of appearance function, which was introduced in [1] to describe the possible distances of a node $v$ in $G_u$ from a source node $s$.
Definition 1 The appearance function $A_{v,s}(i)$ of a node $v$ with respect to a source node $s$ is a binary valued function. For all integers $i \ge 0$: $A_{v,s}(i) = 1$ if there is a path of length $i$ from $s$ to $v$ in $G_u$, and $A_{v,s}(i) = 0$ otherwise. Then the sequence $A_{v,s} = A_{v,s}(0), A_{v,s}(1), A_{v,s}(2), \dots$ corresponds to the different appearances of $v$ with different distances from $s$. For example, for the query graph shown in Figure 1, we have $A_{a_5,a_1} = 01111$, $A_{a_4,a_1} = 01110$, $A_{a_3,a_1} = 01100$, $A_{a_2,a_1} = 01000$, $A_{a_1,a_1} = 10000$.
An observation shows that in the case of non-cyclic data, if the height of $G_u$ is $h$ (we define the height of an acyclic graph to be the number of nodes on the longest path in the graph), then the length of each $A_{v,s}$ is bounded by $h$, and $h \le |N_u|$. Thus, the number of 1's appearing in all appearance function sequences with respect to $G_u$ (denoted as $N_{1\text{-bit}}$) is bounded by $h \cdot |N_u|$. If we can find a method to generate each "1" only once, the time complexity of the first phase will be reduced to $O(h \cdot |N_u|)$. In fact, we can find an algorithm which can generate all such sequences in linear time.
In order to generate each "1" only once, we introduce another concept, the so-called heritage appearance function for the nodes of $G_u$, which can be defined as follows.

Definition 2
The heritage appearance function $HA^{v}_{v',s}(i)$ of a node $v'$ with respect to a node $v$ (and a source node $s$) is a binary valued function: $HA^{v}_{v',s}(i) = 1$ if $i \ge 1$, $(v, v') \in E_u$ and $A_{v,s}(i-1) = 1$, and $HA^{v}_{v',s}(i) = 0$ otherwise.

This process can be summarised as follows. (In the following algorithm, the operation ShiftRight($A_{v,s}$) shifts $A_{v,s}$ right 1 bit, filling the emptied position with 0.)

    procedure first-phase
    begin
      (1) generate $G_u$;
      (2) find a topological order for $G_u$; let $\{v_1, v_2, \dots, v_n\}$ be the corresponding topological order and $v_1 = s$ be the source node;
          $A_{v_1,s}$ := "10...0";
          for i = 2 to n do
            let $v_{i_1}, \dots, v_{i_l} \in \{v_1, v_2, \dots, v_{i-1}\}$ be the direct precedents of $v_i$
            for j = 1 to l do $HA^{v_{i_j}}_{v_i,s}$ := ShiftRight($A_{v_{i_j},s}$);
            $A_{v_i,s}$ := $HA^{v_{i_1}}_{v_i,s} \vee \dots \vee HA^{v_{i_l}}_{v_i,s}$;
    end

Obviously, the time complexity of the above algorithm is $O(|E_u|)$. On the one hand, $G_u$ can be generated in $O(|E_u|)$ time and the topological order for it can also be found in $O(|E_u|)$ time (see [16]). On the other hand, the cost of generating an $A_{v,s}$ is bounded by $O(d_v)$, where $d_v$ is the indegree of $v$, and so the total cost of generating all appearance function sequences is $O(\sum_{v \in N_u} d_v) = O(|E_u|)$. In practice, the entire time spent on the shifting operations and the OR operations for a node can be taken to be $O(1)$. Therefore, the total cost of generating all appearance function sequences should be $O(|N_u|)$.
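The first phase can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: each sequence is stored as a Python integer whose bit $i$ stands for distance $i$, so the paper's ShiftRight on a left-to-right string corresponds to `<< 1` here; the function names, the helper `to_bits`, and the use of the standard-library `graphlib.TopologicalSorter` are choices made for the sketch.

```python
from graphlib import TopologicalSorter


def first_phase(up_edges, source):
    """Return {node: appearance sequence as an int} for all nodes of G_u."""
    succ = {}
    for u, v in up_edges:
        succ.setdefault(u, []).append(v)

    # (1) Generate G_u: the nodes reachable from the source.
    reachable, stack = {source}, [source]
    while stack:
        u = stack.pop()
        for v in succ.get(u, ()):
            if v not in reachable:
                reachable.add(v)
                stack.append(v)
    preds = {v: [] for v in reachable}
    for u, v in up_edges:
        if u in reachable:
            preds[v].append(u)

    # (2) Process nodes in topological order (raises CycleError on cyclic data).
    A = {}
    for v in TopologicalSorter(preds).static_order():
        if v == source:
            A[v] = 1                      # "10...0": distance 0 only
        else:
            seq = 0
            for p in preds[v]:
                seq |= A[p] << 1          # heritage of p: shift by one, then OR
            A[v] = seq
    return A


def to_bits(seq, width):
    """Render a sequence with position 0 leftmost, as in the text."""
    return "".join("1" if (seq >> i) & 1 else "0" for i in range(width))


# The up relation of the running example: (a_i, a_j) for all 1 <= i < j <= 5.
up = {(f"a{i}", f"a{j}") for i in range(1, 6) for j in range(i + 1, 6)}
A = first_phase(up, "a1")
for node in sorted(A):
    print(node, to_bits(A[node], 5))
```

Each edge contributes exactly one shift and one OR, which matches the $O(|E_u|)$ count of operations given above.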

The Second Phase of the Algorithm
In terms of the appearance function sequences, the answers can be extracted by determining the distances for the nodes of $G_d$ (the graph induced from the down-part DP). First, we define the heritage selection functions.
Then we discuss how such functions can be used to find the correct answers in linear time.
In [1], the concept of a selection function is introduced to provide a termination condition for the counting method in the case of cyclic data. Based on this notion, we propose a new concept: the heritage selection function.

In terms of the definition of the appearance functions, we know that if $v \in N_u$, $w \in N_d$, $s$ is the source node, and $(v, e) \in E_f$, then $w$ is in the answer set iff $A_{v,s}(i) = A_{w,e}(i) = 1$ for some $i$. Since the flat relation is many-to-many in general, we define the following function to represent the union of all the appearance functions of those nodes $v \in N_u$ such that $(v, e) \in E_f$.

Definition 3

Definition 5
The generalized selection function $GH_e(i)$ is a binary valued function. For all integers $i \ge 0$: $GH_e(i) = 1$ if $S_e(i) = 1$ or $H^{e_1}_{e}(i) = 1$ for some $e_1$ such that $(e_1, e) \in E_d$, and $GH_e(i) = 0$ otherwise.
In terms of the above definitions, we have the following proposition.
Proposition 2 If $(e, e')$ is an edge in $E_d$ and $GH_e(k) = 1$ for some $k \ge 1$, then we have $GH_{e'}(k-1) = 1$.
Proof. It follows directly from the definitions of $S_e(k)$ and $GH_e(k)$.
From the above discussion, we know that $GH_e$ indicates the distances of all those nodes from $e$ which belong to the answer set if they are in $G_d$. For each $e \in N_f \cap N_d$, we can produce its $S_e$ by evaluating the (modified) non-recursive rule with the corresponding appearance function sequences being propagated (see Definition 3). Obviously, this can be done in linear time (see [1] for a more detailed description). Similar to the treatment of the first phase, we then generate a graph corresponding to $G_d$ and compute all generalized selection function sequences in a topological order, thereby generating the answers dynamically in terms of the generalized selection function sequences computed so far. In the following algorithm, the operation ShiftLeft($GH_e$) returns a binary sequence obtained by shifting $GH_e$ left 1 bit and filling the emptied position with 0.
    procedure second-phase
    begin
      (1) generate $S_e$ for each $e \in N_f \cap N_d$;
      (2) generate $G_d$;
      (3) find a topological order for $G_d$; let $\{e_1, e_2, \dots, e_n\}$ be the corresponding topological order
          for i = 1 to n do
            let $e_{i_1}, \dots, e_{i_l} \in \{e_1, e_2, \dots, e_{i-1}\}$ be the direct precedents of $e_i$
            for j = 1 to l do ...
            (*) ... insert $e_i$ into the answer set;
    end

Note the statement marked with *. By executing this statement, each node is checked as to whether it belongs to the answer set in terms of its generalized selection function sequence. That is, if the 0th position of the generalized selection function sequence of a node $e$ is "1", then $e$ should be in the answer set (and then $e$ is called an answer node). This is because in this case either $(s, e)$ is in $E_f$, where $s$ is the source node, or there is a path of length $2i + 1$, $i \ge 0$, from the source node $s$ to $e$ such that the first $i$ edges are in $E_u$, the $(i+1)$th edge is in $E_f$, and the last $i$ edges are in $E_d$ (see Proposition 4 given in the next section).

Example 1 Continue with our running program. Suppose that the facts in the database can be represented as a graph shown in Figure 4. Given the query ?- rp($a_1$, y), the algorithm first-phase first generates $G_u$, which corresponds to the up part of the graph shown in Figure 4. Then the topological order for it can be found in linear time: $a_1 \to a_2 \to \dots \to a_{n-1} \to a_n$.
For this example, the algorithm second-phase will produce only one selection function sequence: $S_{b_n}$ = "011 ... 1".
The selection function sequence for each $S_{b_j}$ ($j = 1, \dots, n-1$) can be taken to be "00 ... 0". Then, second-phase will generate a graph which corresponds to the down part of the graph shown in Figure 4 and find a topological order for it. In this order, the generalized selection function sequences will be generated and in terms of them the answer set will be produced dynamically: $\{b_{n-1}, b_{n-2}, \dots, b_1\}$.
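The second phase, as described above, can be sketched along the same lines. The sketch assumes that appearance sequences are given as integers whose bit $i$ stands for distance $i$ (so the paper's ShiftLeft corresponds to `>> 1`), that $GH_e$ is the OR of $S_e$ with the shifted sequences of the direct precedents of $e$, and that the down relation is acyclic. The function name and the test data, which mimic Example 1 for $n = 5$, are illustrative.

```python
from graphlib import TopologicalSorter


def second_phase(A, flat_edges, down_edges):
    """Return the answer set given appearance sequences A (ints, bit i = distance i)."""
    # (1) Selection sequences: S_e is the OR of A_v over all flat edges (v, e).
    S = {}
    for v, e in flat_edges:
        if v in A:
            S[e] = S.get(e, 0) | A[v]

    # (2) Build the predecessor lists of the down graph.
    preds = {e: [] for e in S}
    for e, e2 in down_edges:
        preds.setdefault(e, [])
        preds.setdefault(e2, []).append(e)

    # (3) Propagate generalized selection sequences in topological order.
    GH, answers = {}, set()
    for e in TopologicalSorter(preds).static_order():
        seq = S.get(e, 0)
        for p in preds[e]:
            seq |= GH[p] >> 1        # heritage from p: drop position 0, shift the rest
        GH[e] = seq
        if seq & 1:                  # position 0 set: e is an answer node
            answers.add(e)
    return answers


# Data in the spirit of Example 1 with n = 5: S_{b5} = "01111".
A = {"a5": 0b11110}
flat = {("a5", "b5")}
down = {("b5", "b4"), ("b4", "b3"), ("b3", "b2"), ("b2", "b1")}
print(sorted(second_phase(A, flat, down)))
```

Each edge of the down graph is touched once, and every membership test is a single bit check.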

About Cyclic Data
The evaluation of recursive queries in the case of cyclic data is intrinsically a difficult problem. There has been considerable effort directed toward this issue in recent years and several efficient methods have been proposed, which improve the performance to $O(n \cdot e)$. Here, we try to combine the technique discussed in Section 3 with some ideas developed in previous research and show a new method which requires only $O(e + n^2)$ time to produce all answers to a query.
In terms of the method proposed in [1], the appearance function sequence of a node appearing in cycles can be described with two subsequences: a transient sequence and a steady sequence. That is, for some node $v$ its $A_{v,s}$ will be represented as a regular expression of the form $\alpha\beta^{*}$ over the alphabet $\{0, 1\}$, where $\alpha$ stands for the transient part and $\beta$ stands for the steady part. For example, the appearance function sequence of the node $v_4$ in the graph shown in Figure 5(a) is of the form 000(100)*. Then $\alpha$ = "000" and $\beta$ = "100". When a node appears in more than one cycle, it becomes harder to determine $\alpha$ and $\beta$ without doing duplicate work. For example, the appearance function sequence of $v$ in the graph shown in Figure 5(b) is 00101(1)*, which can be obtained by performing a depth-first traversal of the graph or by a simple computation using the formula given in [1]. An important property of such a sequence is that if the number of nodes of a graph (possibly containing cycles) is $n$, then the length of $\alpha$, denoted $|\alpha|$, is less than $n^2$. The proof of this property can be found in [1], and a similar proof can be found in [22]. Based on this property, a method is presented in [13] which requires $O(n \cdot e)$ time to compute $\alpha$ and $\beta$.
To reduce the time complexity, we associate each node with a mark bit, which is initialized and changed as follows. First, we determine the topological order for $G_u$ by ignoring its back edges. Then, we compute the appearance function sequences along the topological order iteratively. At the beginning, we set each mark bit to "0". (The appearance function sequence for each node is also assigned "0".) Whenever a node is encountered during a scan of the topological order, we calculate the new appearance function sequence for it in terms of the equation given in Definition 1. If the value is the same as the old one or the length of the new appearance function sequence is larger than $n^2$, we set its mark bit to "0", otherwise to "1". Therefore, when we compute a new appearance function sequence for a node, we do not consider those precedents of it whose mark bit is "0", because they make no contribution to the new sequence value for the current node. Obviously, much work will be saved in this way. In addition, it provides a necessary and sufficient condition for termination: all the mark bits become "0".
In general, the idea described above cannot be directly employed to handle more complex linear recursive programs. However, we can always partition the rules of a program into several CSLRs so that the optimization idea discussed in the previous section can be directly exploited. In the following, we address this problem in detail.
First, we consider NILRs, which can simply be decomposed. For example, to evaluate a query like ?- $q(c, y)$ against the NILR program given in 2.2, we compute ?- $s_1(c, y)$ and ?- $s_2(c, y)$ separately, which are derived from the original query. Obviously, the technique developed in 3.1 can directly be utilized for each ?- $s_i(c, y)$ ($i = 1, 2$). Accordingly, we have the following algorithm.

    procedure evaluation-for-NILR(query)
    begin
      for each NILR $q_i$ involved in the evaluation of query do
        call first-phase;
        call second-phase;
    end

In terms of the query dependency graph, we know that a recursive algorithm should be implemented to evaluate an ILR. Let $q_1, \dots, q_i, \dots, q_m$ be the recursive predicates involved. We can associate each $q_i$ with a set of query graphs of the form $G^i = G^i_u \cup G^i_f \cup G^i_d$, each for a query of the form ?- $q_i(c_i, y)$, where $c_i$ represents a constant appearing in the query generated with respect to $q_i$ during an evaluation and can be dynamically determined by the constant propagation. Then, for each query graph, the technique developed for CSLRs can be utilized, since each $q_i(c_i, y)$ is evaluated essentially against a canonical strongly linear recursive program.
To this end, we change first-phase and second-phase a bit so that recursive calls can be implemented. In fact, the corresponding procedures have the following form: first-phase-ILR(c, q), second-phase-ILR(q). Their formal descriptions are given below.

    procedure first-phase-ILR
    begin
      {generate $G_u$; at each moment (during the generation of $G_u$) a recursive predicate $q_i$ is encountered, do the following operations:
         construct ?- $q_i(c_i, y)$, where $c_i$ represents a constant determined by the constant propagation;
         call first-phase-ILR($c_i$, $q_i$);
         call second-phase-ILR($q_i$);}
      find a topological order for $G_u$; let $\{v_1, v_2, \dots, v_n\}$ be the corresponding topological order and $v_1 = s$ be the source node;
      $A_{v_1,s}$ := "10...0";
      for i = 2 to n do
        let $v_{i_1}, \dots, v_{i_l} \in \{v_1, v_2, \dots, v_{i-1}\}$ be the direct precedents of $v_i$
        for j = 1 to l do $HA^{v_{i_j}}_{v_i,s}$ := ShiftRight($A_{v_{i_j},s}$);
        ...

... $\sum_{(i,j) \in A} indegree(i) \cdot outdegree(j) = O(e_p \cdot e_q)$, where $A$ denotes the set of answer tuples. The graph shown in Figure 6 helps to clarify this result.

Figure 6: Illustration for time complexity analysis (graph for "p", an answer tuple (i, j) in A, graph for "q")

From this graph, we see that crossing an answer tuple, say $(i, j)$, each edge incident to $j$ will be visited $indegree(i)$ times by the magic set method. Since the number of answer tuples is bounded by $n_p \cdot n_q$, the cost of the magic set method is $O(e_p \cdot e_q)$ [17,18]. In recent years, there has been considerable effort directed toward the extension of the counting method for dealing with cyclic relations, such as the level-cycle merging method proposed in [14,22], the synchronized counting method [1] and the method proposed by Haddad and Naughton [13]. All those methods try to reduce the time complexity to $O(n \cdot e)$ in the case of cyclic data. But no progress has been made in the direction of decreasing the time complexity of the counting method itself. In addition, many experiments have been done [5], and they show that QSQR, a well-known top-down strategy [21], has the same time complexity as the magic set method. At an abstract level, the expansion phase of QSQR can be viewed as two processes: a constant propagation process and a variable instantiation process. The former corresponds to the traversal of the graph for "p". The latter corresponds to the traversal of the graphs for "r" and "q". Therefore, the analysis for the magic set method applies to QSQR.

Conclusion
In this paper, two new concepts, the heritage appearance function and the heritage selection function, have been introduced and an efficient algorithm for evaluating recursive queries has been developed. Based on the computation of such functions in topological order, this algorithm reduces the cost of the counting method significantly and can be used to handle non-cyclic relations. The algorithm is efficient not only due to the simplicity of boolean operations, but also due to the elimination of much redundancy by exploiting the properties of binary sequences. In the case of non-cyclic data, the algorithm requires only $O(n + e)$ time, where $n$ and $e$ denote the number of nodes and edges, respectively, in the graph representing the input relations.

Figure 2: Illustration of query dependency graph

... for each node $e \in G_d$ to select those nodes which belong to the answer set and are reachable from $e$. In this way, any repeated access to an edge in $G_d$ can be replaced by a simple boolean computation. In what follows, we first give the definition of the selection function. Then we define the notion of the heritage selection function and derive a refined algorithm based on the computation of the function for each node of $G_d$ in a topological order.

Figure 5: Cyclic data

The sequence $HA^{v}_{v',s} = HA^{v}_{v',s}(0), HA^{v}_{v',s}(1), HA^{v}_{v',s}(2), \dots$ can be easily computed by shifting $A_{v,s}$ right 1 bit and filling the emptied position with 0. Further, we will use $HA_{v,s}$ to denote the sequence $HA^{v_1}_{v,s} \vee HA^{v_2}_{v,s} \vee \dots \vee HA^{v_j}_{v,s}$, where $v_1, v_2, \dots, v_j$ are the direct precedents of $v$ in $G_u$ and $\vee$ is used to denote the OR operation on the corresponding positions of two binary sequences. This definition hints at an efficient computation method, in which all appearance function sequences can be generated in linear time. First, we have the following proposition.

Proposition 1 Let $s$ be the source node of a $G_u$. Then, for all nodes $v$ of $G_u$ except $s$, we have $A_{v,s} = HA^{v_1}_{v,s} \vee HA^{v_2}_{v,s} \vee \dots \vee HA^{v_j}_{v,s}$, where $v_1, v_2, \dots, v_j$ are the direct precedents of $v$ in $G_u$.
Proof. It follows directly from the definitions of $A_{v,s}$ and $HA_{v,s}$.

Based on the above proposition, we propose an algorithm which works in a two-step manner and can generate all appearance function sequences for the nodes of $G_u$ in linear time. In the first step, we produce a directed graph corresponding to $G_u$. In the second step, we first find a topological order (for $G_u$) with the property that all precedents of a node $n_i$ are before $n_i$ in the order. Then we compute the appearance function sequence for each node in such an order as follows. At the beginning, the appearance function sequence of the source node, $A_{s,s}$, is initialized to "100 ... 0". Then, the sequence of the second node in the topological order can be obtained by shifting $A_{s,s}$ right 1 bit, filling the emptied position with 0.
The sequence of the $i$th node can be computed using the equation given in Proposition 1. Since the appearance function sequences of all direct precedents of a node have already been generated before the node is encountered, we can evaluate the appearance function sequence for each node of $G_u$ in a topological order without difficulty. More importantly, in this way, each $A_{v,s}$ can be generated by performing only $d_v$ shifting operations and $d_v$ logical OR operations, where $d_v$ represents the indegree of $v$. That is, to compute the appearance function sequence for a node $v$, we need only to shift each $A_{v_i,s}$ ($i = 1, \dots, d_v$) right 1 bit and then to OR them together, where $v_i$ ($i = 1, \dots, d_v$) are the direct precedents of $v$.
Definition 3 The selection function $S_e(i)$ of a node $e$ in $G_d$ is a binary valued function: $S_e(i) = 1$ if $A_{v,s}(i) = 1$ for some $v \in N_u$ such that $(v, e) \in E_f$, and $S_e(i) = 0$ otherwise. Then the sequence $S_e = S_e(0), S_e(1), S_e(2), \dots$ indicates the distances from $e$ with the property that if a node in $G_d$ possesses one of such distances from $e$, it should be in the answer set. This property is called the selection property. For example, if we have $(u, e), (v, e) \in E_f$ and $A_{u,s} = 01010$, $A_{v,s} = 00101$, where $s$ represents the source node in $G_u$, then $S_e = A_{u,s} \vee A_{v,s} = 01111$, which indicates that the nodes (in $G_d$) at distance one, two, three or four from $e$ are answer nodes.

From the definition of the selection function, one can rewrite condition (1) as follows: a node $w \in N_d$ is in the answer set iff, for some $e \in N_f \cap N_d$, $S_e(i) = A_{w,e}(i) = 1$ for some $i$.

Now consider a sequence $S'_e$ obtained by shifting $S_e$ left one bit and filling the emptied position with 0. Such a sequence has an important meaning: if there exists a node $v$ such that $(e, v) \in E_d$, then $S'_e$ indicates the distances of some nodes from $v$ which belong to the answer set if they are in $G_d$. This property, together with the fact that the answer nodes can be checked dynamically in terms of selection function sequences, provides a possibility to optimize the evaluation of the second phase.

Definition 4 The heritage selection function $H^{e}_{e'}(i)$ of a node $e'$ (in $G_d$) with respect to a node $e$ is a binary valued function. For all integers $i \ge 0$: $H^{e}_{e'}(i) = 1$ if $S_e(i+1) = 1$ or $H^{e_1}_{e}(i+1) = 1$ for some $e_1$ such that $(e_1, e) \in E_d$, and $H^{e}_{e'}(i) = 0$ otherwise. Then the sequence $H^{e}_{e'} = H^{e}_{e'}(0), H^{e}_{e'}(1), H^{e}_{e'}(2), \dots$ indicates the distances from $e'$ which have the selection property. We will use $H_e$ to denote the sequence $H^{e_1}_{e} \vee H^{e_2}_{e} \vee \dots \vee H^{e_l}_{e}$, where $e_1, e_2, \dots, e_l$ are the direct precedents of $e$ in $G_d$. Intuitively, $H_e$ can be used to propagate control information to extract answer subsets correctly.