A Note on the Relationships Between Logic Programs and Neural Networks

Several recent publications have exhibited relationships between the theories of logic programming and of neural networks. We consider a general approach to representing normal logic programs via feedforward neural networks. We show that the immediate consequence operator associated with each logic program, which can be understood as implicitly determining its declarative semantics, can be approximated arbitrarily well by 3-layer feedforward neural networks in a certain measure-theoretic sense. If this operator is continuous in a topology known as the atomic topology, then the approximation is uniform.


Introduction
Logic programs and neural networks are two important paradigms in Artificial Intelligence. Their abilities, and our theoretical understanding of them, however, seem to be rather complementary. Logic programs are highly recursive and well understood from the point of view of declarative semantics. Neural networks can be trained but as yet lack a declarative reading. Recent publications, for example [4,13,14], suggest studying the relationships between the two paradigms with the long-term aim of merging them in such a way that the advantages of both can be combined.
The results we wish to discuss draw heavily on the work of Hölldobler, Kalinke and Störr [13,14], which we will in part generalize. It will be convenient to briefly review their approach and their results. For undefined notions, see Section 2.
In [13], a strong relationship between propositional logic programs and 3-layer feedforward and recurrent networks was established. For each such program $P$, a 3-layer feedforward network can be constructed which computes the single-step or immediate consequence operator $T_P$ associated with $P$. To this end, each atom in $P$ is represented by one or more units in the network. If the program is such that the iterates of $T_P$, for any initial value, converge to a unique fixed point of $T_P$ (which can be understood to be the declarative semantics of $P$), then the network can be cast into a recurrent network which settles down into a unique stable state corresponding to the fixed point. Conversely, for each 3-layer network a propositional logic program $P$ can be constructed such that the corresponding operator $T_P$ is computed by the network.
The first named author acknowledges financial support under grant SC/98/621 from Enterprise Ireland.
In [14], an attempt was made to obtain similar results for logic programs which are not propositional, that is, for programs which allow variables. The main obstacle which has to be overcome in this case is that the Herbrand base, which is the domain on which the logic programs operate, is in general infinite; it is therefore not possible to represent each atom by one or more units in the network. The solution suggested in [14] uses a general result due to Funahashi [8], see Theorem 3.1, which states that every continuous function on a compact subset of the real numbers can be uniformly approximated by certain types of 3-layer neural networks. By casting the $T_P$-operator into such a function, approximating the single-step operator is shown to be possible.
In order to obtain a continuous real-valued function from $T_P$, metrics were employed in [14]. For acyclic logic programs, a complete metric can be obtained which renders the single-step operator a contraction, see also [19]. By identifying the single-step operator with a mapping on the reals, a contractive, and therefore continuous, real-valued function is obtained which represents the single-step operator. This function can in turn be approximated by neural networks due to the result of Funahashi mentioned above. For certain kinds of acyclic programs, the resulting network can then again be cast into a recurrent network which settles down into a unique stable state corresponding to the unique fixed point of the operator.
In this paper, we will investigate a more general approach to representing the single-step operator for (non-propositional) normal logic programs by neural networks.
After some preliminaries in Section 2, we will, in Section 3, use a result from [18] which characterizes continuity of the single-step operator in the atomic topology, and apply the approximation theorem of Funahashi in order to approximate single-step operators by neural networks.
In Section 4, we will show that for any given normal logic program, its associated single-step operator can be realized as a Borel-measurable real-valued function. An approximation theorem due to Hornik, Stinchcombe and White [15], see Theorem 4.1, can then be applied to show that the single-step operator of any normal logic program can be approximated arbitrarily well by neural networks in a metric $\rho_\mu$ defined in measure-theoretic terms in Section 4.
Finally, in Section 5, we briefly discuss our results and indicate the extensions of them which we are currently pursuing.

Preliminaries
We assume that the reader is familiar with basic topological and measure-theoretic notions as well as basic notions from logic programming and the theory of neural networks. However, we will review some of the concepts which are important in the sequel.

Logic Programs
A normal logic program is a finite set of clauses (in the sense of first order logic) of the form $\forall (A \leftarrow B_1 \wedge \dots \wedge B_k \wedge \neg C_1 \wedge \dots \wedge \neg C_l)$, usually written as $A \leftarrow B_1, \dots, B_k, \neg C_1, \dots, \neg C_l$, where $A$, all the $B_i$ and all the $C_j$ are atoms over some underlying first order language. As usual, we refer to a negated atom, for example $\neg C$, as a literal. We call $A$ the head of the clause, and $B_1, \dots, B_k, \neg C_1, \dots, \neg C_l$ the body of the clause. Our notation and terminology generally follow that of [16], which is also our reference for background in logic programming.
We will work over Herbrand interpretations only. Thus, $B_P$ denotes the Herbrand base for a given program $P$, that is, the set of all ground atoms over the underlying language of $P$. As usual, Herbrand interpretations for $P$ can be identified with subsets of $B_P$. In turn, this means that the set $I_P$ of all Herbrand interpretations for $P$ can be identified with the power set of $B_P$, and this identification will normally be made. The set of all ground instances of clauses in $P$ will be denoted by $\mathrm{ground}(P)$.
Note that $B_P$ is countably infinite if there is at least one function symbol occurring in $P$. In the sequel, we will impose the mild condition that this is indeed the case. In fact, if no function symbols occur in $P$, then $B_P$ is finite. Thus, $P$ can be thought of as a propositional program, and this case was handled in [13], as already noted in the introduction.
A major tool for analysing logic programs (and for assigning a denotational semantics to a program) is the single-step or immediate consequence operator $T_P : I_P \to I_P$ associated with a given program $P$. This operator is defined as follows: for $I \in I_P$, we set $T_P(I)$ to be the set of all $A \in B_P$ for which there exists a clause $A \leftarrow B_1, \dots, B_k, \neg C_1, \dots, \neg C_l$ in $\mathrm{ground}(P)$ such that $I \models B_1 \wedge \dots \wedge B_k \wedge \neg C_1 \wedge \dots \wedge \neg C_l$ (so, in the classical two-valued logic we are using here we have, for all $i, j$, that $B_i \in I$ and $C_j \notin I$). It is well-known that an interpretation $I \in I_P$ is a model of $P$ if and only if $T_P(I) \subseteq I$, that is, if and only if $I$ is a prefixed point of $T_P$. In particular, the fixed points of $T_P$ are called the supported models of $P$, and they play an especially important rôle in the theory. In many ways a program $P$ can essentially be represented by its associated single-step operator, and we will take up this point again in the discussion in Section 5.
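The definition above is straightforward to compute for a finite set of ground clauses. The following sketch (not part of the paper; the clause encoding and the example program are our own illustration) applies $T_P$ to an interpretation given as a set of atoms:

```python
def tp(clauses, interp):
    """Single-step operator T_P for a finite set of ground clauses.

    Each clause is a triple (head, pos, neg): the head atom A, the list of
    positive body atoms B_i, and the list of negated body atoms C_j.
    interp is a Herbrand interpretation, given as a set of atoms.
    """
    return {head
            for head, pos, neg in clauses
            if all(b in interp for b in pos)        # every B_i in I
            and not any(c in interp for c in neg)}  # every C_j not in I

# Example program:  p <- q, not r.   q <- not r.
program = [("p", ["q"], ["r"]), ("q", [], ["r"])]
```

Here $\{p, q\}$ is a fixed point of $T_P$ and hence a supported model of the example program. For programs with function symbols, $\mathrm{ground}(P)$ is infinite and this direct computation no longer applies, which is precisely the obstacle discussed in the introduction.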
We now proceed to define the atomic topology $Q$ on $I_P$, which will play a major rôle in the development. The atomic topology was introduced in [18] as a generalization of the query topology [3], but coincides with it in the context of Herbrand preinterpretations. Note also that the collection of basic open sets is countable.

The Cantor Topology
As mentioned above, $I_P$ can be identified with the power set of $B_P$. It can therefore also be identified with the set $2^{B_P}$ of all functions from $B_P$ to $\{0, 1\}$ (or to any other two-point space). Using this latter identification, the topology $Q$ becomes a topology on the function space $2^{B_P}$, and is exactly the product topology (of point-wise convergence) on $2^{B_P}$ if the two-point space is endowed with the discrete topology, see [18] for details.
If we interpret $I_P$ as the set of all functions from $B_P$ to $\{0, 2\}$, so that we now take the two-point space to be $\{0, 2\}$, we can identify $I_P$ with the set of all those real numbers in the unit interval $[0, 1]$ which can be written in ternary form without using the digit 1; in other words, we can identify $I_P$ with the Cantor set. The product topology mentioned above then coincides with the subspace topology inherited from the natural topology on the real numbers, and the resulting space is called the Cantor space $C$. Thus, the Cantor space $C$ is homeomorphic to the topological space $(I_P, Q)$, and in the following $\iota : I_P \to C$ will denote a homeomorphism between $I_P$ and $C$, see [18] for more details. It is well-known that the Cantor space is a compact subset of $\mathbb{R}$, see [21], and we can define $l(x) = \max\{y \in C : y \le x\}$ and $u(x) = \min\{y \in C : y \ge x\}$ for each $x \in [0, 1] \setminus C$.
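The points $l(x)$ and $u(x)$ can be computed from the ternary expansion of $x$: the first digit equal to 1 locates the removed middle-third gap containing $x$, and the endpoints of that gap are exactly $l(x)$ and $u(x)$. A small sketch (our own illustration, with a fixed scanning depth and the usual floating-point caveats):

```python
def cantor_neighbours(x, depth=30):
    """For x in [0, 1] outside the Cantor set, return (l(x), u(x)), the
    endpoints of the removed middle-third gap containing x."""
    prefix, scale = 0.0, 1.0
    for _ in range(depth):
        scale /= 3.0
        x *= 3.0
        d = int(x)   # next ternary digit of x
        x -= d
        if d == 1:   # x lies in the gap (prefix + scale, prefix + 2*scale)
            return prefix + scale, prefix + 2.0 * scale
        prefix += d * scale
    raise ValueError("no digit 1 found: x appears to lie in the Cantor set")
```

For example, every $x \in (1/3, 2/3)$ is sent to the pair $(1/3, 2/3)$.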
We refer the reader to [21] for background concerning elementary topology.

Neural Networks
A 3-layer feedforward network (or single hidden layer feedforward network) consists of an input layer, a hidden layer, and an output layer. Each layer consists of finitely many computational units. There are connections from units in the input layer to units in the hidden layer, and from units in the hidden layer to units in the output layer. The input-output relationship of each unit is represented by inputs $x_i$, output $y$, connection weights $w_i$, threshold $\theta$, and a function $\sigma$ as follows: $y = \sigma\left(\sum_i w_i x_i - \theta\right)$. The function $\sigma$, which we will call the squashing function of the network, is usually non-constant, bounded and monotone increasing, and sometimes also assumed to be continuous. We will specify the requirements on $\sigma$ that we assume in each case.
4th Irish Workshop on Formal Methods (IWFM-00), EWIC, British Computer Society, 2000.
We assume throughout that the input-output relationships of the units in the input and output layers are linear. The output function of a network as described above is then obtained as a mapping $f : \mathbb{R}^r \to \mathbb{R}$ with $f(x_1, \dots, x_r) = \sum_j c_j\, \sigma\left(\sum_{i=1}^r w_{ji} x_i - \theta_j\right)$, where $r$ is the number of units in the input layer and the constants $c_j$ correspond to the weights from the hidden to the output layer.
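As a concrete reading of this formula, the following sketch (our own, with a sigmoid chosen as squashing function) evaluates the input-output mapping of such a network:

```python
import math

def sigmoid(t):
    """A standard squashing function: non-constant, bounded, increasing."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def net_output(x, hidden, c, sigma=sigmoid):
    """f(x_1, ..., x_r) = sum_j c_j * sigma(sum_i w_ji x_i - theta_j).

    hidden: list of (w_j, theta_j) pairs, one per hidden unit, where w_j is
    the list of weights from the r input units; c: hidden-to-output weights.
    """
    return sum(cj * sigma(sum(wi * xi for wi, xi in zip(w, x)) - theta)
               for cj, (w, theta) in zip(c, hidden))
```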

Measurable Functions
A collection $M$ of subsets of a set $X$ is called a $\sigma$-algebra if $X \in M$, if $X \setminus A \in M$ whenever $A \in M$, and if, whenever $(A_n)$ is a sequence of sets in $M$, the union $\bigcup A_n \in M$. The pair $(X, M)$ is called a measurable space. A function $f : X \to X$ is said to be measurable with respect to a $\sigma$-algebra $M$ if $f^{-1}(A) \in M$ for each $A \in M$. If $M$ is a collection of subsets of a set $X$, then the smallest $\sigma$-algebra $\sigma(M)$ containing $M$ is called the $\sigma$-algebra generated by $M$. In this case, a function $f : X \to X$ is measurable with respect to $\sigma(M)$ if $f^{-1}(A) \in \sigma(M)$ for each $A \in M$. If $B$ is the subbase of a topology $T$, and $B$ is countable, then $\sigma(B) = \sigma(T)$. If $B$ is a subbase of the natural topology on $\mathbb{R}$, then $\sigma(B)$ is called the Borel $\sigma$-algebra on $\mathbb{R}$, and a function which is measurable with respect to this $\sigma$-algebra is called Borel-measurable. A measure on $(\mathbb{R}, \sigma(B))$ is called a Borel measure.
We refer the reader to [2] for background concerning elementary measure theory.

Approximating Continuous Single-Step Operators by Neural Networks
Under certain conditions, given in Theorem 3.2, the single-step operator associated with a logic program is continuous in the atomic topology. By identifying the space of all interpretations with the Cantor space, a continuous function on the reals is obtained which can be approximated by 3-layer feedforward networks. We investigate this next.
The following Theorem can be found in [8, Theorem 2].

Theorem
Suppose that $\sigma$ is non-constant, bounded, monotone increasing and continuous. Let $K \subseteq \mathbb{R}^n$ be compact, let $f : K \to \mathbb{R}$ be a continuous mapping and let $\varepsilon > 0$. Then there exists a 3-layer feedforward network with squashing function $\sigma$ whose input-output mapping $\bar{f} : K \to \mathbb{R}$ satisfies $\max_{x \in K} d(f(x), \bar{f}(x)) < \varepsilon$, where $d$ is a metric which induces the natural topology on $\mathbb{R}$.
In other words, each continuous function $f : K \to \mathbb{R}$ can be uniformly approximated by input-output functions of 3-layer networks.
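The flavour of the theorem can be seen in the classical staircase construction: a continuous $f$ on $[0, 1]$ is close to a step function with many small steps, and each step can be realized by one steep sigmoid unit. The sketch below (our own illustration of this idea, not Funahashi's proof; the bias term, grid size and steepness are arbitrary choices) builds such a network directly:

```python
import math

def sigmoid(t):
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def make_net(f, n_hidden=200, steepness=2000.0):
    """Sum-of-sigmoids approximant of a continuous f on [0, 1]: hidden unit
    j realizes the jump of a staircase approximation at grid point j."""
    ts = [j / n_hidden for j in range(n_hidden + 1)]
    jumps = [(f(ts[j]) - f(ts[j - 1]), (ts[j - 1] + ts[j]) / 2)
             for j in range(1, n_hidden + 1)]
    base = f(0.0)  # bias term; itself realizable by one saturated unit
    def net(x):
        return base + sum(c * sigmoid(steepness * (x - m)) for c, m in jumps)
    return net
```

With $f(x) = \sin 3x$, for instance, the resulting 200-unit network stays within a few hundredths of $f$ on all of $[0, 1]$; refining the grid (and increasing the steepness) drives the uniform error towards 0.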
We already know that the Cantor space $C$ is a compact subset of the real line and that the topology $C$ inherits as a subspace of $\mathbb{R}$ coincides with the Cantor topology on $C$. Also, the Cantor space $C$ is homeomorphic to $I_P$ endowed with the atomic topology $Q$. Hence, if the $T_P$-operator is continuous in $Q$, we can identify it with a mapping $\iota(T_P) : C \to C : x \mapsto \iota(T_P(\iota^{-1}(x)))$ which is continuous in the subspace topology of $C$ in $\mathbb{R}$.
The following characterization of continuity in Q was given in [18, Theorem 9].

Theorem
Let $P$ be a normal logic program. Then $T_P$ is continuous in $Q$ if and only if, for each $I \in I_P$ and for each $A \in B_P$ with $A \notin T_P(I)$, either there is no clause in $P$ with head $A$ or there is a finite set $S(I, A) = \{A_1, \dots, A_k, B_1, \dots, B_{k'}\}$ of elements of $B_P$ with the following properties:
(i) $A_i \in I$ for each $i = 1, \dots, k$, and $B_j \notin I$ for each $j = 1, \dots, k'$.
(ii) Given any clause $C$ with head $A$, at least one $\neg A_i$ or at least one $B_j$ occurs in the body of $C$.
As a corollary, one obtains that programs without local variables have continuous single-step operators. It can also be shown that acyclic programs have continuous single-step operators. For the slightly larger classes of acceptable [1] and locally hierarchical [5] programs, the single-step operator is in general not continuous, see [11,19].
We can now present our first main theorem.

Theorem
Let $P$ be a normal logic program. If, for each $I \in I_P$ and for each $A \in B_P$ with $A \notin T_P(I)$, either there is no clause in $P$ with head $A$ or there is a finite set $S(I, A) = \{A_1, \dots, A_k, B_1, \dots, B_{k'}\}$ of elements of $B_P$ satisfying properties (i) and (ii) of Theorem 3.2, then $T_P$ (more precisely, $\iota(T_P)$) can be uniformly approximated by input-output mappings of 3-layer feedforward networks.
In particular, this holds for the operator $T_P$ if $P$ is acyclic or does not contain any local variables.

Proof: Under the conditions stated in the theorem, the single-step operator $T_P$ is continuous in the atomic topology. Using a homeomorphism $\iota : I_P \to C$, the resulting function $\iota(T_P)$ is continuous on the Cantor space $C$, which is a compact subset of $\mathbb{R}$. Applying Theorem 3.1, $\iota(T_P)$ can be uniformly approximated by input-output functions of 3-layer feedforward networks.

Approximating the Single-Step Operator by Neural Networks
By Theorem 3.1, continuous functions can be uniformly approximated by input-output functions of 3-layer feedforward networks. It is also possible to approximate each measurable function on $\mathbb{R}$, but in a much weaker sense. We will investigate this in the present section.
The following was given in [15, Theorem 2.4].

Theorem
Suppose that $\sigma$ is a monotone increasing function from $\mathbb{R}$ onto $(0, 1)$. Let $f : \mathbb{R}^r \to \mathbb{R}$ be a Borel-measurable function and let $\mu$ be a probability Borel measure on $\mathbb{R}^r$. Then, given any $\varepsilon > 0$, there exists a 3-layer feedforward network with squashing function $\sigma$ whose input-output function $\bar{f} : \mathbb{R}^r \to \mathbb{R}$ satisfies $\rho_\mu(f, \bar{f}) = \inf\{\delta > 0 : \mu\{x : |f(x) - \bar{f}(x)| > \delta\} < \delta\} < \varepsilon$. In other words, the class of functions computed by 3-layer feedforward neural networks is dense in the set of all Borel-measurable functions $f : \mathbb{R}^r \to \mathbb{R}$ relative to the metric $\rho_\mu$ defined in Theorem 4.1.
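The metric $\rho_\mu$ can be estimated numerically: with samples drawn from $\mu$, the quantity $\mu\{x : |f(x) - \bar{f}(x)| > \delta\}$ becomes an empirical fraction, and since the set of feasible $\delta$ is an up-set, the infimum can be located by bisection. A sketch (our own illustration; the sample-based measure and the tolerance are our choices):

```python
def rho(f, g, sample, tol=1e-6):
    """Estimate rho_mu(f, g) = inf{d > 0 : mu{x : |f(x)-g(x)| > d} < d},
    with mu replaced by the empirical measure of `sample`."""
    diffs = [abs(f(x) - g(x)) for x in sample]
    n = len(diffs)
    def frac(d):                      # empirical mu{x : |f(x)-g(x)| > d}
        return sum(v > d for v in diffs) / n
    lo, hi = 0.0, max(diffs) + 1.0    # frac(hi) = 0 < hi, so hi is feasible
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if frac(mid) < mid:           # feasible d form an up-set: go left
            hi = mid
        else:
            lo = mid
    return hi
```

If $f = g$ the estimate collapses to 0; if $|f - g| \equiv c$ with $c < 1$, it returns $c$, matching the intuition that $\rho_\mu$ mixes the size of the error with the measure of the set on which it occurs.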
We have already noted that the operator $T_P$ is not in general continuous in the topology $Q$, nor is it in general continuous in the Scott topology on $I_P$ (it does satisfy this latter property when $P$ is definite, that is, when $P$ contains no negative literals). We proceed to show next that the single-step operator has the pleasing property that it is Borel-measurable for arbitrary programs, and therefore that it can always be extended to a measurable function on $\mathbb{R}$.

Proposition
Let $P$ be a normal logic program and let $T_P$ be its associated single-step operator. Then $T_P$ is measurable on $(I_P, \sigma(G)) = (I_P, \sigma(Q))$.

Proof: We need to show that for each subbasic set $G(L)$, we have $T_P^{-1}(G(L)) \in \sigma(G)$. First, let $L = A$ be an atom. If $A$ is not the head of any clause in $\mathrm{ground}(P)$, then $T_P^{-1}(G(A)) = \emptyset \in \sigma(G)$. If $A$ is the head of a clause in $\mathrm{ground}(P)$, then there are at most countably many clauses $A \leftarrow A_{i1}, \dots, A_{ik_i}, \neg B_{i1}, \dots, \neg B_{il_i}$ in $\mathrm{ground}(P)$ with head $A$, and we obtain $T_P^{-1}(G(A)) = \bigcup_i G(A_{i1}, \dots, A_{ik_i}, \neg B_{i1}, \dots, \neg B_{il_i})$, which is indeed in $\sigma(G)$.
Now suppose that $L = \neg A$ is a negative literal. If $A$ is not the head of any clause in $\mathrm{ground}(P)$, then $T_P^{-1}(G(\neg A)) = I_P \in \sigma(G)$. So assume that $A$ is the head of some clause in $\mathrm{ground}(P)$. If there is a unit clause with head $A$, then $T_P^{-1}(G(\neg A)) = \emptyset \in \sigma(G)$. So assume that none of the clauses in $\mathrm{ground}(P)$ with head $A$ is a unit clause. Then there are at most countably many clauses $A \leftarrow A_{i1}, \dots, A_{ik_i}, \neg B_{i1}, \dots, \neg B_{il_i}$ in $\mathrm{ground}(P)$ with head $A$. We then obtain $T_P^{-1}(G(\neg A)) = \bigcap_i \left(G(\neg A_{i1}) \cup \dots \cup G(\neg A_{ik_i}) \cup G(B_{i1}) \cup \dots \cup G(B_{il_i})\right)$, which is indeed in $\sigma(G)$.
By means of Proposition 4.2, we can now view the operator $T_P$ as a measurable function $\iota(T_P)$ on $C$ by identifying $I_P$ with $C$ via the homeomorphism $\iota$. Since $C$ is measurable as a subset of the real line, this operator can be extended to a measurable function on $\mathbb{R}$, and we can now state our second main theorem.

Theorem
Given any normal logic program $P$, the associated operator $T_P$ (more precisely, $\iota(T_P)$) can be approximated in the manner of Theorem 4.1 by input-output mappings of 3-layer feedforward networks.
In fact, we are able to strengthen this result a bit by giving an explicit extension of $T_P$ to the real line. We define a sequence $(T_n)$ of measurable functions on $\mathbb{R}$ as follows (where $l(x)$ and $u(x)$ are as defined earlier):

$$T_1(x) = \begin{cases} \iota(T_P)(l(x)) + \dfrac{\iota(T_P)(u(x)) - \iota(T_P)(l(x))}{u(x) - l(x)}\,(x - l(x)) & \text{if } x \in [3^{-1}, 2 \cdot 3^{-1}] \\ 0 & \text{otherwise} \end{cases}$$

$$T_i(x) = \begin{cases} \iota(T_P)(l(x)) + \dfrac{\iota(T_P)(u(x)) - \iota(T_P)(l(x))}{u(x) - l(x)}\,(x - l(x)) & \text{if } x \in \bigcup_{k=1}^{2 \cdot 3^{i-2}} \left[(2k-1)3^{-i},\, 2k \cdot 3^{-i}\right] \\ 0 & \text{otherwise} \end{cases}$$

for $i \ge 2$. We define the function $T : \mathbb{R} \to \mathbb{R}$ by $T(x) = \sup_i T_i(x)$ and obtain $T(x) = \iota(T_P)(x)$ for all $x \in C$ and $T(\iota(I)) = \iota(T_P(I))$ for all $I \in I_P$. Since all the functions $T_i$, for $i \ge 1$, are piecewise linear and therefore measurable, the function $T$ is also measurable. Intuitively, $T$ is obtained by a kind of linear interpolation.
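Computationally, the passage from $\iota(T_P)$ to $T$ is just linear interpolation across the gap containing $x$; the following sketch (our own; $g$ stands for any function defined on the Cantor set, playing the role of $\iota(T_P)$) makes this explicit:

```python
def extend_linearly(g, x, depth=48):
    """Value at x of the extension of g: C -> R obtained by linear
    interpolation across the removed middle-third gaps of [0, 1]."""
    prefix, scale, y = 0.0, 1.0, x
    for _ in range(depth):
        scale /= 3.0
        y *= 3.0
        d = int(y)       # next ternary digit of x
        y -= d
        if d == 1:       # x is in the gap (prefix+scale, prefix+2*scale)
            l, u = prefix + scale, prefix + 2.0 * scale
            return g(l) + (g(u) - g(l)) / (u - l) * (x - l)
        prefix += d * scale
    return g(x)          # no digit 1 found: x is (numerically) in C
```

Interpolating the identity on $C$ returns the identity on $[0, 1]$, as one sanity check.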
If $i : B_P \to \mathbb{N}$ is a bijective mapping, then we can obtain a homeomorphism $\iota : I_P \to C$ from $i$ as follows: we identify $I \in I_P$ with $x \in C$ where $x$, written in ternary form, has 2 as its $i(A)$-th digit (after the decimal point) if $A \in I$, and 0 as its $i(A)$-th digit if $A \notin I$. If $I \in I_P$ is finite or cofinite, then the sequence of digits of $\iota(I)$ in ternary form is eventually constant 0 (if $I$ is finite) or eventually constant 2 (if $I$ is cofinite). Thus, each such interpretation is the endpoint of a linear piece of one of the functions $T_i$, and therefore of $T$.
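For a concrete choice of bijection (our own: atom $A_k$ is simply the index $k$, occupying ternary digit $k + 1$), the homeomorphism $\iota$ can be sketched as follows for finite interpretations:

```python
def iota(interp, n_digits=24):
    """Map an interpretation (a set of atom indices under a fixed bijection
    i: B_P -> N) to a real in the Cantor set: the (k+1)-th ternary digit of
    iota(I) is 2 if atom k is in I, and 0 otherwise.  Truncation at n_digits
    is exact for finite interpretations within the first n_digits atoms."""
    return sum(2.0 * 3.0 ** -(k + 1) for k in interp if k < n_digits)
```

For instance, $\iota(\emptyset) = 0$ and $\iota(\{A_0\}) = 2/3$; the ternary expansion of each value uses only the digits 0 and 2, so it indeed lands in the Cantor set $C$.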

Corollary
Given any normal logic program $P$, its single-step operator $T_P$ (more precisely, $\iota(T_P)$) can be approximated by input-output mappings of 3-layer feedforward networks in the following sense: for every $\varepsilon > 0$ and for every $I \in I_P$ which is either finite or cofinite, there exist a 3-layer feedforward network with input-output function $\bar{f}$ and $x \in [0, 1]$ with $|x - \iota(I)| < \varepsilon$ such that $|\iota(T_P(I)) - \bar{f}(x)| < \varepsilon$.
Proof: We use a homeomorphism $\iota$ which is obtained from a bijective mapping $i : B_P \to \mathbb{N}$ as in the paragraph preceding the Corollary. We can assume that the measure $\mu$ from Theorem 4.1 has the property that $\mu([x, x + \delta]) \ge \delta$ for each $\delta > 0$ and each $x \in [0, 1 - \delta]$. Let $\varepsilon > 0$ and let $I \in I_P$ be finite or cofinite. Then, by the construction of $T$, there exists an interval $[\iota(I), \iota(I) + \delta]$ with $\delta < \frac{\varepsilon}{2}$ (or, analogously, $[\iota(I) - \delta, \iota(I)]$) such that $T$ is linear on $[\iota(I), \iota(I) + \delta]$ and $|T(\iota(I)) - T(x)| < \frac{\varepsilon}{2}$ for all $x \in [\iota(I), \iota(I) + \delta]$. By Theorem 4.1 and the previous paragraph, there exists a 3-layer feedforward network with input-output function $\bar{f}$ such that $\rho_\mu(T, \bar{f}) < \delta$, that is, $\mu\{x : |T(x) - \bar{f}(x)| > \delta\} < \delta$. By our condition on $\mu$, there is $x \in [\iota(I), \iota(I) + \delta]$ with $|T(x) - \bar{f}(x)| \le \delta < \frac{\varepsilon}{2}$. We can conclude that $|\iota(T_P(I)) - \bar{f}(x)| = |T(\iota(I)) - \bar{f}(x)| \le |T(\iota(I)) - T(x)| + |T(x) - \bar{f}(x)| < \varepsilon$, as required.
It would be of interest to strengthen this approximation for sets other than the finite and cofinite elements of I P , although it is interesting to note that the finite interpretations correspond to compact elements in the sense of domain theory, see [20].

Conclusion
There are two aspects to this work. On the one hand, one can consider the problem of approximating the operator $T_P$, associated with a logic program $P$, by means of input-output functions of multi-layer neural networks, as we have done here. In detail, this involves relating properties of the network to classes of programs for which the approximation is possible. It also involves the consideration of which mathematical notions of approximation are useful and appropriate. Here we have discussed two well-known ones: uniform approximation on compacta, and a notion of approximation closely related to convergence in measure. Both these strands need further investigation, and this paper is an account of our work to date, which is at an early stage of development. The other direction, which we have not discussed here except in passing, is to view logic programs as fundamental and to view the approximation process as a means of giving semantics to neural networks based on the declarative semantics of logic programs. There is considerable point in doing this, in that the semantics of logic programming is well understood whilst that of neural networks is not, but this is something to be taken up elsewhere.
At the detailed mathematical level, the mapping $P \mapsto T_P$ is not injective. So, although the single-step operator can basically be used to represent a program semantically, different programs may have the same single-step operator. This fine tuning is lost by our representation of logic programs by neural networks. However, passing to classes of programs with the same single-step operator is something that is often done in the literature on semantics, and is in fact exactly the notion of subsumption equivalence due to Maher, see [17]. Moreover, there exist uncountably many homeomorphisms $\iota : I_P \to C$; for example, every bijective mapping from $B_P$ to $\mathbb{N}$ gives rise to such a homeomorphism, as observed in the paragraph preceding Corollary 4.4. So there is a lot of flexibility in the choice of $\iota$, and therefore in how one embeds $I_P$ in $\mathbb{R}$. The homeomorphism used in [14] employed the quaternary number system.
In [14], as mentioned in the introduction, the neural network obtained by applying the approximation theorem of Funahashi was cast into a recurrent network which settled down in a unique stable state corresponding to the unique fixed point of the single-step operator of the underlying program $P$. Strong assumptions had to be placed on $P$ to make this possible: $P$ was required to be acyclic with an injective level mapping. Acyclicity of the program yields the existence of a complete metric on $I_P$ with respect to which its single-step operator is a contraction. For larger classes of programs such metrics are as yet unknown, and there are indications that they do not exist. However, generalized metric structures on $I_P$ can render $T_P$ a contraction, and these matters are currently under investigation by the authors, see e.g. [9,11,12,19], using various methods including those of many-valued logic.
One direction for further work, then, is to strengthen the results on approximating the single-step operator of a program with a neural network. The second is to examine this process in terms of semantics, from both the point of view of logic programming and the point of view of neural networks. These are both ongoing projects of the authors, but we thank the referee for highlighting them and their importance.
Let $P$ be a normal logic program. The set $G = \{G(A) : A \in B_P\} \cup \{G(\neg A) : A \in B_P\}$, where $G(L) = \{I \in I_P : I \models L\}$ for each literal $L$, is the subbase of a topology on $I_P$ called the atomic topology $Q$. The basic open sets of $Q$ are of the form $G(A_1) \cap \dots \cap G(A_k) \cap G(\neg B_1) \cap \dots \cap G(\neg B_l)$, of course, being derived from the subbasic open sets in the usual way, and we will denote such an open set by $G(A_1, \dots, A_k, \neg B_1, \dots, \neg B_l)$.