Document Retrieval by Relevance Terminological Logics

Information Retrieval (IR) is presented as the task of retrieving the documents that are relevant to a given query. In the context of a Terminological Logic (TL) based approach to IR, this amounts to embodying a notion of relevance in the logical implication relation of the chosen TL. Among the many possible readings of the term "relevance", the one captured by relevance logic, and in particular by first-order tautological entailment, can be viewed as a promising source of inspiration, to the end of incorporating a logic-based form of relevance in the inference mechanism of TLs.
 
The aim of this paper is to present a Relevant Terminological Logic, while maintaining the desired "relevance" flavour of relevance logics, which could be considered as a base towards a suitable document description logic for IR purposes.


Introduction
Terminological Logics (TLs, for short), also called Description Logics or Concept Logics, have gained wide popularity because of the breadth of their possible applications. These logics allow both the representation of concepts / terms / classes, structured according to a partial order, and the statement that individuals are instances of concepts. Typically, concepts are built out of two types of symbols, primitive concepts and primitive roles (called attributes in databases). These primitive symbols can be combined, using appropriate operators, to obtain more complex concepts and roles, whose semantics is given in Tarski style: an interpretation maps concepts into subsets of the domain and roles into binary relations over the domain. For example, given the concepts Document, InformationRetrieval, TerminologicalLogics and Italian, the roles dealswith and author, and the operators for conjunction of concepts ($\sqcap$), existential ($\exists$) and universal ($\forall$) quantification over roles, the concept (or document description)

$$\mathit{Document} \sqcap \exists \mathit{dealswith}.\mathit{InformationRetrieval} \sqcap \exists \mathit{dealswith}.\mathit{TerminologicalLogics} \sqcap \forall \mathit{author}.\mathit{Italian}$$

is a complex concept which denotes the set of documents dealing with Information Retrieval and with TLs, and whose authors are all Italian.
Assertions are built up using individuals, concepts and roles. For example, given the individuals Umberto and doc211, and the complex concept above, the assertion

$$(\mathit{Document} \sqcap \exists \mathit{dealswith}.\mathit{InformationRetrieval} \sqcap \exists \mathit{dealswith}.\mathit{TerminologicalLogics} \sqcap \forall \mathit{author}.\mathit{Italian})(\mathit{doc211}) \qquad (1)$$

states that "doc211 is a document dealing with Information Retrieval and TLs, and whose authors are all Italian"; whereas the assertion

$$\mathit{author}(\mathit{doc211}, \mathit{Umberto}) \qquad (2)$$

states that "Umberto is an author of document doc211".

This work has been carried out in the context of the project FERMI 8134 - "Formalization and Experimentation in the Retrieval of Multimedia Information", funded by the European Community under the ESPRIT Basic Research scheme.
Recently, a TL called MIRTL has been proposed as a model for a logic-based approach to Information Retrieval (IR, for short) [9, 18]. In a logic-based approach to IR, an IR system can be described as in Figure 1 (note that our aim is to describe only a possible base schema and not a detailed one).
There are two main parts in such a system: the Document Base (DocB) and the Document Knowledge Base (DocKB).

DocDB : For a document of the DocB, by using indexing techniques, pattern recognition techniques, natural language processing, manual processing, etc., the system gets logical descriptions of the content and layout of this document. The set of all the descriptions of the documents in DocB forms the DocDB. Therefore, DocDB is the set of logical descriptions of the documents in the DocB.

DomDB : The DomB describes knowledge about the application domain. Typically, it contains three subsets: the Domain Knowledge (DomK), the Dictionary Base (DicB) and the Synonymous Base (SynB).

DomK : The DomK contains knowledge about a specific application domain of the system, e.g. the documents are all about laws, or are invoices, or are about music styles, or are computer science technical reports, or are medical reports, etc., and thus includes specific information about the documents' content and layout.

DicB and SynB : The DicB and SynB contain term and synonym definitions which are specific to the application domain, as well as general term definitions.
The other modules handle the interaction of a user with the system. In particular, the User Query Module (UQM) enables a user to formulate a query Q by means of a language, e.g. a formal query language such as SQL, or natural language, or natural language with images and sound (a multimedia document), or whatever is needed. Note that a query could also be a (short) multimedia document. In this case the goal of the system could be: retrieve those documents whose content is relevant to the content of the query document.
The query Q is subsequently "formalized" into a formula of the underlying logic. The Logic-based Query Module (LQM) is responsible for query solving, and thus queries the DocKB about those documents which are "relevant" to the given query, getting a set of unique document identifiers (the IDs of the relevant documents), called the answer set. This set is sent to the Answer and User Relevance Feedback Module (AURFM). This module interacts with the DocB and, by means of the answer set, returns in some form information about the set of documents identified by the answer set, e.g. the documents themselves, an explanation about the retrieval of each document (i.e. why the system has retrieved a document), etc. Finally, user relevance feedback could be achieved in the following way:

1. the user selects, from the information about the retrieved documents, a subset of documents, pieces of documents, explanations, etc., that they consider relevant to their purposes;
2. this set is then "formalized" into a formula of the underlying logic, in a similar way as for the query formulated by the user, yielding a relevance feedback formula, which is sent to the LQM. The LQM can now combine the user query with the relevance feedback formula, yielding a new query which can be seen as a "more precise reformulation" of the original user query;
3. finally, a new answer set is retrieved.

Typically, IR is presented as the task of retrieving the documents that are relevant to a given information need (query). The task of IR can thus be described in logical terms as the extraction, from a DocKB, of those documents $d$ that, given a query $q$, make the formula $d \rightarrow q$ valid, where $\rightarrow$ is the logical implication relation of the adopted logic. For example, in TLs, the task of IR is formalized as: given a set of assertions $\Sigma$ (the DocKB) and a query concept $Q$, what is the set of individuals (document IDs) $d$ such that $\Sigma$ logically implies $Q(d)$ (in symbols, $\Sigma \models Q(d)$)?
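The retrieval task just described can be written compactly. The following is only a notational sketch: the name $\mathrm{Ans}$ is an assumption introduced here for illustration, and $\Sigma$ stands for the set of assertions that makes up the DocKB.

$$\mathrm{Ans}(\Sigma, Q) = \{\, d \mid \Sigma \models Q(d) \,\}$$

Under this reading, changing the implication relation $\models$ changes which documents land in the answer set, which is why the choice of entailment relation is the central design decision in the rest of the paper.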
In the context of a TL-based approach to IR, this amounts to embodying a notion of relevance in the logical implication relation of the chosen TL. It is clear that the notion of logical implication alone is not adequate for IR purposes, as, for example, probabilities are missing. This means that TLs need to be extended with other features. A good candidate seems to be a TL-based logic which incorporates:

Probabilistic extension : Probabilistic versions of TLs [18,22] could be investigated as a means of making explicit various sources of uncertainty, such as uncertainty related to domain knowledge and uncertainty related to automatic document representation, which is typical in IR;

Concrete domain : Incorporating data types of the underlying concrete domain [2], such as "string", "integer", "link" (link to the position of a keyword in a document, link to another related document, etc.), etc.;

Rule language : Including rules for the DocKB module, as, for example, in [7];

Closed World Assumption, Closed Domain Reasoning, etc. : Closed world reasoning and closed domain reasoning seem to be suitable for IR purposes, as they are close to usual database reasoning [6,15,16].
Therefore, in order to achieve a notion of relevance implication in such an extended TL, as a first step, a notion of relevance must be built into the inference engine of the "pure" TL (the TL without any extensions). Among the many possible readings of the term "relevance", the one captured by relevance logic [1], and in particular by first-order tautological entailment, can be chosen as a promising source of inspiration to the end of incorporating a logic-based form of relevance in the inference mechanism of TLs. The underlying tenet of the criticism that relevance logicians level at classical logic is that relevance of a premise to a conclusion is essential for asserting the implication between the premise and the conclusion. As a consequence of this criticism, the key concern underlying relevance logics is that of formalizing in a logic-based way a more suitable form of relevance than material implication.
In semantical terms, this amounts to adopting a four-valued semantics [3,8] for TLs, thus obtaining Relevant Terminological Logics, where assertions can be not only true or false, but also neither true nor false (a state of affairs which is known as unknown), and also both true and false (a state of affairs which is known as contradiction).
Logics of this kind have already been used in Knowledge Representation and Reasoning in order to avoid the so-called paradoxes of logical implication, like $a \wedge \neg a \models b$ and $b \models a \vee \neg a$ for all $b$, when reasoning on concepts and individuals. These logics have also been proven to have a generally better computational behaviour than their two-valued analogues [8,11]. Unfortunately, we have observed that the adoption of the by now classical four-valued semantics [11] results in too drastic a loss of inferential capabilities for IR.
The aim of our work is to present a less restrictive four-valued semantics for TLs, while maintaining the desired "relevance" flavour of relevance logics, which could be considered as a suitable core, towards a TL based extended logic for IR purposes.

In particular, we will present a four-valued variant of $\mathcal{ALC}$ (a powerful concept language including conjunction, disjunction, and negation of concepts, together with existential and universal quantification on roles), which we will call $\mathcal{ALC}_4$, and which can be considered a good representative of the $\mathcal{AL}$-languages.
We will also show that the defined entailment relation captures a close (structural) relationship between a DocKB and a query. Therefore, the defined entailment relation could arguably be a good theoretical and practical basis for a logic-based approach to IR. The rest of this paper is organized as follows. In the next section we will give the syntax and semantics of $\mathcal{ALC}_4$.
In Section 3 we will discuss our semantics and we will show, by means of examples, how it differs from standard two-valued semantics and from existing four-valued semantics, and why it is suitable for IR, whereas Section 4 concludes.
2 The four-valued concept language $\mathcal{ALC}_4$

In this section we present the syntax and semantics of $\mathcal{ALC}_4$ (see footnote 1). For a more general presentation with respect to the operators allowed in TLs, see [10]. For a more extensive discussion on four-valued TLs, see [11].

Syntax
We assume two disjoint alphabets of symbols, called primitive concepts and primitive roles. The letter $A$ will always denote a primitive concept and the letter $R$ will denote a primitive role. The concepts (denoted below by $C$ and $D$) and the roles (which in $\mathcal{ALC}_4$ are always primitive) of the language $\mathcal{ALC}_4$ are formed out of primitive concepts and roles according to the following syntax rule:

$$
\begin{array}{rcll}
C, D & \longrightarrow & A \mid & \text{(primitive concept)} \\
 & & C \sqcap D \mid & \text{(concept conjunction)} \\
 & & C \sqcup D \mid & \text{(concept disjunction)} \\
 & & \neg C \mid & \text{(concept negation)} \\
 & & \forall R.C \mid & \text{(universal quantification)} \\
 & & \exists R.C & \text{(existential quantification)}
\end{array}
$$

Furthermore, we assume an alphabet of symbols called individuals, disjoint from the alphabets of primitive concepts and primitive roles. Individuals will be denoted below by $a$ and $b$. An assertion is an expression of type $C(a)$ (meaning that $a$ is an instance of $C$), where $a$ is an individual and $C$ is an $\mathcal{ALC}_4$ concept, or an expression of type $R(a, b)$ (meaning that $a$ is related to $b$ by means of $R$), where $a$ and $b$ are individuals and $R$ is an $\mathcal{ALC}_4$ role. Finally, an $\mathcal{ALC}_4$ knowledge base is a finite set of assertions.
Note that in the following, we use parentheses whenever we need to disambiguate concept expressions. For example, we will write $(\forall R.C) \sqcap D$ to mean that the concept $D$ is not in the scope of $\forall R$.

Semantics
The formal semantics of the logic $\mathcal{ALC}_4$ is four-valued. The four truth values are the elements of $2^{\{t,f\}}$, the powerset of $\{t, f\}$, i.e. $\{t, f\}$, $\{\}$, $\{t\}$ and $\{f\}$. These values are best understood as epistemic states of a reasoning system about some proposition. Under this view, if the truth value of a proposition contains $t$, then the system has evidence to the effect, or believes, that the proposition is true. Similarly, if the truth value of a proposition contains $f$, then the system believes that the proposition is false. The truth value $\{\}$ corresponds to lack of knowledge, and the truth value $\{t, f\}$ corresponds to inconsistent knowledge.
In four-valued semantics it is possible to have inconsistent knowledge about some proposition without being totally inconsistent. This property, which is shared by other relevance logics, is touted as one of the advantages of relevance logics, especially when modelling states of knowledge. The interpretation function can best be understood as an extension function of two separate two-valued extensions, the positive extension and the negative extension, defined next.

Definition 2.2 Let $\mathcal{I}$ be an interpretation. The positive extension of a concept $C$, written $C^{\mathcal{I}}_{+}$, is the set of domain elements that are known to belong to the concept, and is defined as $\{d \in \Delta^{\mathcal{I}} : t \in C^{\mathcal{I}}(d)\}$. The negative extension of a concept $C$, written $C^{\mathcal{I}}_{-}$, is the set of domain elements that are known not to belong to the concept, and is defined as $\{d \in \Delta^{\mathcal{I}} : f \in C^{\mathcal{I}}(d)\}$. The positive and negative extensions of roles are defined similarly.
Note that, unlike in standard semantics, these two sets need not be complements of each other (see footnote 3). Domain elements that are members of neither set are not known to belong to the concept and are not known not to belong to the concept. This is a perfectly reasonable state for a system that is not a perfect reasoner or does not have complete information. Domain elements that are members of both sets can be thought of as inconsistent with respect to that concept, in that there is evidence to indicate that they are in the extension of the concept and, at the same time, not in the extension of the concept. This is a slightly harder state to rationalize but can be considered a possibility in the light of inconsistent information.
The extensions of concepts and roles have to meet certain restrictions, designed so that the formal semantics respects the informal meaning of concepts and roles. For example, the positive extension of the concept $A \sqcap B$ must be the intersection of the positive extensions of $A$ and $B$, and its negative extension must be the union of their negative extensions, thus formalizing the intuitive notion of conjunction in the context of the four-valued semantics.
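The restriction on conjunction can be checked on a small example. The following is a minimal sketch, assuming a hypothetical four-element domain and a dictionary encoding of concept interpretations (both are illustrative choices, not part of the paper): truth values are modelled as subsets of {t, f}, and the positive and negative extensions are read off from them.

```python
T, F = "t", "f"

# The four truth values are the subsets of {t, f}.
UNKNOWN = frozenset()
TRUE = frozenset({T})
FALSE = frozenset({F})
CONTRADICTION = frozenset({T, F})


def positive_extension(concept):
    """Domain elements d with t in C^I(d)."""
    return {d for d, value in concept.items() if T in value}


def negative_extension(concept):
    """Domain elements d with f in C^I(d)."""
    return {d for d, value in concept.items() if F in value}


def conj(a, b):
    """Pointwise four-valued conjunction of two concept interpretations."""
    result = {}
    for d in a:
        value = set()
        if T in a[d] and T in b[d]:
            value.add(T)
        if F in a[d] or F in b[d]:
            value.add(F)
        result[d] = frozenset(value)
    return result


# Hypothetical toy interpretation of two primitive concepts A and B.
A = {"d1": TRUE, "d2": CONTRADICTION, "d3": UNKNOWN, "d4": FALSE}
B = {"d1": TRUE, "d2": TRUE, "d3": TRUE, "d4": UNKNOWN}
A_and_B = conj(A, B)
```

In this toy interpretation the positive and negative extensions of A overlap on d2 and both miss d3, which illustrates why the two extensions need not be complements of each other.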

About modus ponens on roles
The following entailment relationships are readily verified:

$$\Sigma_1 \mid\!\approx (\mathit{Document} \sqcap \exists \mathit{dealswith}.(\mathit{Relevance} \sqcap \mathit{LogicBased}) \sqcap \exists \mathit{type}.(\mathit{Article} \sqcup \mathit{Book} \sqcup \mathit{TechnicalReport}))(\mathit{doc324}) \qquad (11)$$

as seen above, doc324 is a document dealing with a logic-based approach to relevance, whose type is article, book or technical report;

$$\Sigma_1 \mid\!\approx (\mathit{MultimediaDoc} \sqcap \exists \mathit{author}.\mathit{Italian})(\mathit{doc2}) \qquad (12)$$

i.e. from the fact that Umberto is the author of doc211 and that all authors of doc211 are Italian, it can be inferred that Umberto is Italian, and hence that doc2 is a multimedia document with an Italian author;

$$\Sigma_1 \mid\!\approx (\exists \mathit{dealswith}.\mathit{Pisa} \sqcap \exists \mathit{doctype}.\mathit{Movie})(\mathit{vt2}) \qquad (13)$$

i.e. vt2 is retrieved as a response to the query "retrieve all movies dealing with Pisa". In fact, vt2 is a component of doc2 of type movie. All components of doc2 deal with Pisa. Therefore, vt2 deals with Pisa.
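We claim that all three inferences are reasonable for IR purposes; in particular, note that (12) and (13) are not trivial inferences. All three inferences rest on the following key observation about our semantics, which in this respect differs significantly from all other approaches: we allow modus ponens on roles (MPR, for short), i.e. for all concepts $C$ and $D$, for any role $R$, and for all individuals $a, b$,

$$\{(\forall R.C)(a),\ R(a, b)\} \mid\!\approx C(b) \quad \text{and} \quad \{(\forall R.C)(a),\ (\exists R.D)(a)\} \mid\!\approx (\exists R.(C \sqcap D))(a) \qquad (14)$$

This kind of inference is not allowed by other four-valued TLs, for example the one in [11]. The key difference lies in the semantics of the $\forall$ operator. Patel-Schneider's condition in this case reads

$$t \in (\forall R.C)^{\mathcal{I}}(d) \ \text{iff} \ \forall e \in \Delta^{\mathcal{I}},\ f \in R^{\mathcal{I}}(d, e) \ \text{or} \ t \in C^{\mathcal{I}}(e)$$

$$f \in (\forall R.C)^{\mathcal{I}}(d) \ \text{iff} \ \exists e \in \Delta^{\mathcal{I}},\ t \in R^{\mathcal{I}}(d, e) \ \text{and} \ f \in C^{\mathcal{I}}(e)$$

It is easy to see that, according to this semantics, there exists an interpretation $\mathcal{I}$ which satisfies $\Sigma_1$ and such that both $t \in \mathit{author}^{\mathcal{I}}(\mathit{doc211}^{\mathcal{I}}, \mathit{Umberto}^{\mathcal{I}})$ and $f \in \mathit{author}^{\mathcal{I}}(\mathit{doc211}^{\mathcal{I}}, \mathit{Umberto}^{\mathcal{I}})$, without $t \in \mathit{Italian}^{\mathcal{I}}(\mathit{Umberto}^{\mathcal{I}})$ holding. Thus $\Sigma_1$ does not entail $\mathit{Italian}(\mathit{Umberto})$, and hence $\Sigma_1 \mid\!\not\approx (\mathit{MultimediaDoc} \sqcap \exists \mathit{author}.\mathit{Italian})(\mathit{doc2})$.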
We claim that MPR is very useful for IR and, in general, for real problems, and therefore we provided it in our framework.
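The difference between the two truth conditions for the universal quantifier can be made concrete on a two-element interpretation. The sketch below is illustrative only: the function names, the dictionary encoding, and the toy interpretation (which is not the knowledge base of the running example) are assumptions introduced here. It compares the condition used in this paper ("t in R(d, e) implies t in C(e)") with the condition attributed above to Patel-Schneider ("f in R(d, e) or t in C(e)").

```python
T, F = "t", "f"
EMPTY = frozenset()


def forall_true_here(role, concept, d, domain):
    """t in (forall R.C)(d) under the condition used in this paper:
    every e with t in R(d, e) also has t in C(e)."""
    return all(
        T in concept.get(e, EMPTY)
        for e in domain
        if T in role.get((d, e), EMPTY)
    )


def forall_true_ps(role, concept, d, domain):
    """t in (forall R.C)(d) under the other condition:
    for every e, f in R(d, e) or t in C(e)."""
    return all(
        F in role.get((d, e), EMPTY) or T in concept.get(e, EMPTY)
        for e in domain
    )


# Hypothetical toy interpretation: the author role between doc211 and
# Umberto is both true and false, and nothing is known about Italian.
domain = {"doc211", "Umberto"}
author = {
    ("doc211", "Umberto"): frozenset({T, F}),
    ("doc211", "doc211"): frozenset({F}),
}
italian = {}

holds_ps = forall_true_ps(author, italian, "doc211", domain)
holds_here = forall_true_here(author, italian, "doc211", domain)
```

Here `holds_ps` is true although Umberto is a known author and is not known to be Italian, so this interpretation is a counter-model to MPR under that condition. Under the condition used in this paper `holds_here` is false, so the same interpretation does not satisfy the premise and cannot serve as a counter-model.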

About paradoxes of logical implication
In the following we will show what kind of inferences are not captured in our framework.The first two examples are about the so-called "paradoxes of logical implication" when reasoning on concepts and individuals.
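First, note that the knowledge base $\Sigma_1$ has a "local inconsistency" in classical terms about Umberto's nationality, without being totally inconsistent. In fact, we have

$$\Sigma_1 \mid\!\approx (\mathit{Italian} \sqcap \neg \mathit{Italian})(\mathit{Umberto}) \qquad (15)$$

and thus

$$\Sigma_1 \mid\!\approx (\mathit{MultimediaDoc} \sqcap \exists \mathit{author}.\mathit{Italian})(\mathit{doc2}) \qquad (16)$$

$$\Sigma_1 \mid\!\approx (\mathit{MultimediaDoc} \sqcap \exists \mathit{author}.\neg \mathit{Italian})(\mathit{doc2}) \qquad (17)$$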

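This means that doc2 is retrieved in both cases, since in $\Sigma_1$ there is evidence that doc2 is an instance of both of the queries $\mathit{MultimediaDoc} \sqcap \exists \mathit{author}.\mathit{Italian}$ and $\mathit{MultimediaDoc} \sqcap \exists \mathit{author}.\neg \mathit{Italian}$. On the other hand, in the document base $\Sigma_1$ there is nothing about vt2's authors. Therefore, as we would expect,

$$\Sigma_1 \mid\!\not\approx ((\exists \mathit{doctype}.\mathit{Text}) \sqcap (\forall \mathit{author}.\{\mathit{Umberto}\}))(\mathit{vt2}) \qquad (18)$$

In two-valued semantics, since $\Sigma_1$ is inconsistent,

$$\Sigma_1 \models_2 ((\exists \mathit{doctype}.\mathit{Text}) \sqcap (\forall \mathit{author}.\{\mathit{Umberto}\}))(\mathit{vt2}) \qquad (19)$$

Clearly, this last kind of inference is not acceptable in IR: there is nothing in $\Sigma_1$ about vt2 which is relevant to the query $(\exists \mathit{doctype}.\mathit{Text}) \sqcap (\forall \mathit{author}.\{\mathit{Umberto}\})$.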
This example shows, in a simple way, one of the advantages of a four-valued semantics: inconsistent knowledge bases (from a two-valued semantics point of view) do not entail everything.
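Dually, concepts whose extensions are always the entire domain of an interpretation are not necessarily entailed by every knowledge base. In fact,

$$\Sigma_1 \mid\!\approx (\exists \mathit{doctype}.(\mathit{Movie} \sqcup \mathit{Text}))(\mathit{ut2}) \qquad (20)$$

and

$$\Sigma_1 \mid\!\not\approx (\forall \mathit{doctype}.(\mathit{Movie} \sqcup \neg \mathit{Movie}))(\mathit{ut3}) \qquad (21)$$

both of which we feel are correct. In fact, (20) follows directly from $\Sigma_1$, whereas (21) holds since there is an interpretation $\mathcal{I}$ such that $e \in \Delta^{\mathcal{I}}$, $t \in \mathit{doctype}^{\mathcal{I}}(\mathit{ut3}, e)$ and $\mathit{Movie}^{\mathcal{I}}(e) = \emptyset$. This is a state of affairs which models the fact that in $\Sigma_1$ there is no evidence about ut3's document type, whatever it could be.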
With respect to two-valued semantics, on the other hand, we have

$$\Sigma_1 \models_2 (\forall \mathit{doctype}.(\mathit{Movie} \sqcup \neg \mathit{Movie}))(\mathit{ut3}) \qquad (22)$$

In our opinion, not drawing this last kind of inference is important for IR purposes, since we want relevance of the premise to the conclusion.
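The two "paradoxes" can be replayed in a propositional setting by brute force. The following is a simplified sketch, not the entailment procedure of the paper: it assumes a propositional fragment with atoms, negation, conjunction and disjunction, a hypothetical tuple encoding of formulas, and it takes "premise entails conclusion" to mean that every valuation giving the premise a value containing t also gives the conclusion a value containing t.

```python
from itertools import product

T, F = "t", "f"
FOUR = [frozenset(), frozenset({T}), frozenset({F}), frozenset({T, F})]
TWO = [frozenset({T}), frozenset({F})]


def mk(has_t, has_f):
    """Build a truth value from two booleans."""
    return frozenset(([T] if has_t else []) + ([F] if has_f else []))


def ev(phi, val):
    """Evaluate a formula given as ("atom", p), ("not", x),
    ("and", x, y) or ("or", x, y) under the valuation val."""
    op = phi[0]
    if op == "atom":
        return val[phi[1]]
    if op == "not":
        v = ev(phi[1], val)
        return mk(F in v, T in v)
    a, b = ev(phi[1], val), ev(phi[2], val)
    if op == "and":
        return mk(T in a and T in b, F in a or F in b)
    if op == "or":
        return mk(T in a or T in b, F in a and F in b)
    raise ValueError(f"unknown operator: {op}")


def entails(premise, conclusion, atoms, values):
    """True iff every valuation over `values` that makes the premise
    true (t in its value) also makes the conclusion true."""
    for combo in product(values, repeat=len(atoms)):
        val = dict(zip(atoms, combo))
        if T in ev(premise, val) and T not in ev(conclusion, val):
            return False
    return True


a, b = ("atom", "a"), ("atom", "b")
contradiction = ("and", a, ("not", a))
tautology = ("or", a, ("not", a))
atoms = ["a", "b"]
```

With these definitions, `entails(contradiction, b, atoms, TWO)` and `entails(b, tautology, atoms, TWO)` both hold, while the same calls with `FOUR` fail: the valuation with a both true and false and b unknown refutes the first, and the valuation with b true and a unknown refutes the second.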

About reasoning by cases
Finally, reasoning by cases does not work within our semantics. Consider the following knowledge base:

$$\Sigma_4 = \{\ldots,\ (\exists \mathit{dealswith}.\mathit{Galileo})(\mathit{doc74})\}$$

The meaning of $\Sigma_4$ is: doc74 is a document, dealing with Galileo, with two components, t741, which is an Italian text, and t742. t743 is not an Italian text. Moreover, is a translation of t741 and t743 is a translation of t742.

Figure 1: A logic-based IR system

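Definition 2.1
An interpretation $\mathcal{I} = (\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}})$ consists of a non-empty set $\Delta^{\mathcal{I}}$ (the domain of $\mathcal{I}$) and a function $\cdot^{\mathcal{I}}$ (the interpretation function of $\mathcal{I}$) such that
1. $\cdot^{\mathcal{I}}$ maps every concept into a function from $\Delta^{\mathcal{I}}$ to $2^{\{t,f\}}$;
2. $\cdot^{\mathcal{I}}$ maps every role into a function from $\Delta^{\mathcal{I}} \times \Delta^{\mathcal{I}}$ to $2^{\{t,f\}}$;
3. $\cdot^{\mathcal{I}}$ maps every individual into $\Delta^{\mathcal{I}}$;
4. $a^{\mathcal{I}} \neq b^{\mathcal{I}}$, if $a \neq b$ (see footnote 2).

1 Although we restrict our attention to a four-valued variant of $\mathcal{ALC}$, our framework can be applied to other languages as well.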

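Definition 2.3
Let $\mathcal{I} = (\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}})$ be an interpretation. The interpretation function $\cdot^{\mathcal{I}}$ has to meet the following equations for concepts: for each $d \in \Delta^{\mathcal{I}}$,

$$
\begin{array}{lcl}
t \in (C \sqcap D)^{\mathcal{I}}(d) & \text{iff} & t \in C^{\mathcal{I}}(d) \text{ and } t \in D^{\mathcal{I}}(d) \\
f \in (C \sqcap D)^{\mathcal{I}}(d) & \text{iff} & f \in C^{\mathcal{I}}(d) \text{ or } f \in D^{\mathcal{I}}(d) \\
t \in (C \sqcup D)^{\mathcal{I}}(d) & \text{iff} & t \in C^{\mathcal{I}}(d) \text{ or } t \in D^{\mathcal{I}}(d) \\
f \in (C \sqcup D)^{\mathcal{I}}(d) & \text{iff} & f \in C^{\mathcal{I}}(d) \text{ and } f \in D^{\mathcal{I}}(d) \\
t \in (\neg C)^{\mathcal{I}}(d) & \text{iff} & f \in C^{\mathcal{I}}(d) \\
f \in (\neg C)^{\mathcal{I}}(d) & \text{iff} & t \in C^{\mathcal{I}}(d) \\
t \in (\forall R.C)^{\mathcal{I}}(d) & \text{iff} & \forall e \in \Delta^{\mathcal{I}},\ t \in R^{\mathcal{I}}(d, e) \text{ implies } t \in C^{\mathcal{I}}(e) \\
f \in (\forall R.C)^{\mathcal{I}}(d) & \text{iff} & \exists e \in \Delta^{\mathcal{I}},\ t \in R^{\mathcal{I}}(d, e) \text{ and } f \in C^{\mathcal{I}}(e) \\
t \in (\exists R.C)^{\mathcal{I}}(d) & \text{iff} & \exists e \in \Delta^{\mathcal{I}},\ t \in R^{\mathcal{I}}(d, e) \text{ and } t \in C^{\mathcal{I}}(e) \\
f \in (\exists R.C)^{\mathcal{I}}(d) & \text{iff} & \forall e \in \Delta^{\mathcal{I}},\ t \in R^{\mathcal{I}}(d, e) \text{ implies } f \in C^{\mathcal{I}}(e)
\end{array}
$$

Observe that, according to the intuitive meaning of the qualified existential operator $\exists$, $(\exists R.C)^{\mathcal{I}}_{+} = (\neg \forall R.\neg C)^{\mathcal{I}}_{+}$ and $(\exists R.C)^{\mathcal{I}}_{-} = (\neg \forall R.\neg C)^{\mathcal{I}}_{-}$. The notion of subsumption between two concepts is defined in terms of their positive extensions as follows.

Definition 2.4
Given two $\mathcal{ALC}_4$-concepts $C$ and $D$:
1. $C$ subsumes $D$ (written $D \sqsubseteq C$) iff $D^{\mathcal{I}}_{+} \subseteq C^{\mathcal{I}}_{+}$, for every interpretation $\mathcal{I}$;
2. $C$ is equivalent to $D$ (written $C \equiv D$) iff $C^{\mathcal{I}}_{+} = D^{\mathcal{I}}_{+}$, for every interpretation $\mathcal{I}$.

With respect to assertions, we have the following definitions.

2 This restriction on individuals, called the unique name assumption, ensures that different individuals denote different elements of the domain.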
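The semantic equations for concepts can be turned into a small evaluator over a finite interpretation. The sketch below is a hedged illustration, not an implementation taken from the paper: the tuple encoding of concepts, the dictionary layout of the interpretation, and the helper names `mk` and `ev` are all assumptions. Missing entries are treated as the empty truth value (unknown).

```python
T, F = "t", "f"
EMPTY = frozenset()


def mk(has_t, has_f):
    """Build a four-valued truth value from two booleans."""
    return frozenset(([T] if has_t else []) + ([F] if has_f else []))


def ev(c, d, interp):
    """Four-valued value of concept c at domain element d.

    Hypothetical encoding: ("prim", A), ("not", C), ("and", C, D),
    ("or", C, D), ("all", R, C), ("some", R, C).
    """
    op = c[0]
    if op == "prim":
        return interp["concepts"].get(c[1], {}).get(d, EMPTY)
    if op == "not":
        v = ev(c[1], d, interp)
        return mk(F in v, T in v)
    if op == "and":
        a, b = ev(c[1], d, interp), ev(c[2], d, interp)
        return mk(T in a and T in b, F in a or F in b)
    if op == "or":
        a, b = ev(c[1], d, interp), ev(c[2], d, interp)
        return mk(T in a or T in b, F in a and F in b)
    if op in ("all", "some"):
        role = interp["roles"].get(c[1], {})
        # Values of C at every e with t in R(d, e).
        vals = [
            ev(c[2], e, interp)
            for e in interp["domain"]
            if T in role.get((d, e), EMPTY)
        ]
        if op == "all":
            return mk(all(T in v for v in vals), any(F in v for v in vals))
        return mk(any(T in v for v in vals), all(F in v for v in vals))
    raise ValueError(f"unknown operator: {op}")


# Hypothetical toy interpretation with one role and one primitive concept.
interp = {
    "domain": {"d", "e1", "e2"},
    "concepts": {"Italian": {"e1": mk(True, False), "e2": mk(True, True)}},
    "roles": {"author": {("d", "e1"): mk(True, False),
                         ("d", "e2"): mk(True, True)}},
}
some = ("some", "author", ("prim", "Italian"))
dual = ("not", ("all", "author", ("not", ("prim", "Italian"))))
```

On this toy interpretation `ev(some, x, interp)` and `ev(dual, x, interp)` agree for every domain element x, which is one concrete instance of the duality between the existential and universal quantifiers observed above.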

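3 In two-valued standard semantics, we have: $C^{\mathcal{I}}_{+} \cap C^{\mathcal{I}}_{-} = \emptyset$ and $C^{\mathcal{I}}_{+} \cup C^{\mathcal{I}}_{-} = \Delta^{\mathcal{I}}$, or equivalently $C^{\mathcal{I}}_{-} = \Delta^{\mathcal{I}} \setminus C^{\mathcal{I}}_{+}$.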