A Logical Relational Approach for Information Retrieval Indexing

In a relational indexing approach (see e.g. Farradane’s work), information is carried by a fixed set of relationship types over an underlying set of terms. The idea is that the essence of the meaning of information is encapsulated in the relationships between terms. The importance of relationships is now widely recognized within many fields such as relational databases and knowledge representation formalisms. These fields have substantially improved our understanding of relationships and the problems involved in trying to formalize them. However, although those relationships can be correctly represented by almost all the well-known formalisms in such fields, they are not exploited as much as the objects by concrete operations. In information retrieval, previous attempts at managing relationships have mainly addressed structural aspects, and exclude the manipulation of index expressions by relational operations. This paper suggests a prime use of the relation properties through a logical framework, in a way that it can improve the effectiveness of the matching operation.

An information retrieval system is still viewed as a system that selects documents on the basis of a matching operation between a document representation d and a query representation q. If the matching operation deems a document sufficiently similar to the query, then the document is assumed likely to be relevant and is returned to the user. According to the logical model of information retrieval, the task of the system can be described as the extraction, from the document base, of those documents d that, given a query q, make the formula $d \rightarrow q$ valid, where d and q are formulae of the underlying logic, and "$\rightarrow$" denotes the logical consequence decision formalized by the logic in question. Such a logic-based approach originates in the work of Cooper [5], who provided a formal definition of relevance in terms of logical consequences. However, this approach to information retrieval and to relevance, as well as the classical approaches [6,7], is commonly called the topical approach [8,9], in which only the representations of the document and of the query matter. As mentioned by Saracevic [10], it has been known for a long time that topicality is not the only criterion of a user's relevance judgment. A number of other factors also affect the relevance result, such as the user's knowledge, the expected use of information, the application domain and so on. Van Rijsbergen [1] pointed out this problem when he proposed the use of a non-classical logic for information retrieval.
A general part of information retrieval systems is the index language, which is the language used to represent the documents in the collection and the request of the seeker. The creation of the internal representation of a document is a prominent function of any information retrieval system. It is often referred to as the indexing process. The outcome of this process is a set of index expressions that supposedly summarize the information content of a document. The index expressions can be keywords, parse trees, semantic structures and so on. Central to our approach in this paper is the assumption that the output of the indexing process can considerably influence the effectiveness of the information retrieval system, especially when this output consists of semantic structures (conceptual graphs, parse trees, infons, etc.). Indeed, as mentioned by [11], the most intricate or carefully designed retrieval algorithm cannot compensate for inappropriately represented documents. It is then evident that the index language must be much more expressive and richly structured than the keyword-based languages usually adopted by classical information retrieval systems. As a consequence, the index expressions should have a complex internal structure that we must handle with the utmost care: the more complex the index expressions are in their structures and contents, the more the underlying semantics of this complexity has to be made explicit. We claim here that the knowledge we have about the resulting index expressions can improve the matching process.
Recently, some authors have used knowledge representation formalisms to analyse the information content of documents and information needs, and to evaluate the relevance of a document to a query. Good examples of such formalisms, tried out within the RIME [12] and MIRTL [13] projects, are based on a formalism derived from the notion of Conceptual Dependency and on Terminological Logics, respectively. Usually, these formalisms allow for a number of term-forming operators by means of which one may build semantic structures starting from a basic repertory of simple terms and relationships (see figure 1, in which bears-on and has-for-value are relations, while OPACITY, LUNG and TISSULAR are terms). Hence, a semantic structure is an expression of the form $E_i$ Relation $E_j$, where $E_i$ and $E_j$ are simple or complex terms, and Relation is a relationship holding between the terms. As a consequence, one can exploit, in addition to the thesaurus of terms (synonymy, related terms, etc.), the properties of the relationships, such as symmetry and transitivity, to derive more knowledge. This can be done through a logical framework following the ideas of Nie [14]. He showed that the logical model suggested by van Rijsbergen [1] seems to be a very promising approach to represent the impact of the system's knowledge on the matching operation.
Hence, various kinds of inference rules can be determined, and they form the derivation system which drives the plausible inference process between the documents and the queries. This derivation system will be defined over the language of index expressions.

[Figure 1: a semantic structure built with relations such as has-for-value, compared with the keywords Opacity, Lung, Tissular]

The contributions made by this paper are twofold. Firstly, some useful relational operations in information retrieval are given. These operations allow us to handle the relationships that may occur in the representations of the documents and the queries. Secondly, we can generate more, and hopefully more precise, derivations and as a consequence more relevant answers to the user. The remainder of the paper is organized as follows. In Section 2 we recall the classical indexing approaches used in information retrieval. In Section 3 we show how the semantical properties of index expressions can be used to generate more derivation rules. We define rules that take into account the relationships involved in these index expressions. In Section 4 we show how the relationships can be organized in real information retrieval systems, and mention some particular problems concerning the application of such an approach to information retrieval. Further directions of investigation are described in Section 5.

Current logical approaches
One can define a relation as a set of tuples that represents a relationship among objects in the universe of discourse. Each tuple is a finite, ordered sequence of objects [15]. Tuples are in the universe of discourse, and can be represented as individual objects. Usually, in the literature the words relation and relationship refer to the same thing. In CycL [15], relations are called relationships and in LOOM [16], relations are called relations. Usually, relations are denoted by predicates. The fact that a particular tuple is an element of a relation is denoted by relation-name($arg_1, arg_2, \ldots, arg_n$), where the $arg_i$ are the objects in the tuple. In the case of binary relations, the fact can be read as "$arg_1$ is relation-name $arg_2$" or "a relation-name of $arg_1$ is $arg_2$".
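The definition above can be illustrated with a minimal Python sketch, assuming a relation is stored simply as a set of tuples. The function names `holds` and `arity`, and the sample facts, are hypothetical and serve only as illustration.

```python
def holds(relation, *args):
    """Return True if the tuple (arg_1, ..., arg_n) is an element of the relation."""
    return tuple(args) in relation


def arity(relation):
    """Return the common length of the tuples of a relation (0 if it is empty)."""
    lengths = {len(t) for t in relation}
    if len(lengths) > 1:
        raise ValueError("all tuples of a relation must have the same length")
    return lengths.pop() if lengths else 0


# Hypothetical facts, used only for illustration.
married_to = {("John", "Mary"), ("Mary", "John")}   # a binary relation
between = {("Reading", "London", "Bristol")}        # a ternary relation

print(holds(married_to, "John", "Mary"))  # "John is Married-To Mary"
print(arity(between))
```

Storing tuples in a set makes the membership test direct, which mirrors the reading of relation-name($arg_1, \ldots, arg_n$) as "this tuple belongs to the relation".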
BCS IR Colloquium, 1997

One of the first indexing strategies used in information retrieval was the extraction of keywords or topics from documents and the valuation of their importance according to their frequency of appearance. It has often been noted that this approach neglects the important relations between keywords [9]. Sometimes, those relationships are even removed from the indexing process by an initial selection of words that removes the common words in the document (like "is", "on", "of", etc.). This is mainly done out of practical considerations. Such a classical approach can benefit from the availability of efficient algorithms that automatically extract the keywords from the document. As a result, the relationships in which the keywords were involved are no longer taken into account. However, these relationships could be used to provide contextual information for driving the matching process. In fact, in the real world, relationships between objects play an important role. In the field of relational databases [17] or information systems engineering (e.g. Merise [18]), the importance of relations has already been recognized.
In the well-known information retrieval CACM test collection, we have noted a high occurrence frequency for some relations. After analysis, we converted this collection into a homogeneous representation including only the title and the abstract fields. The number N of words in the collection was then computed. We have developed a tool that allows one to search for sentences containing a sequence of words. It is possible to include parameters in these sentences. Hence, by the query "the #parameter on #parameter" we mean that we are looking for all sentences in the collection beginning with the word "the" followed by any string, followed by the relation "on" and ending with any string. Some examples of these sentences are shown in figure 2, with their respective probability of occurrence in the collection, according to the number N.

In information retrieval the representation of the relationships between keywords takes its root in the work of Farradane described in [19,20]. Farradane introduced the idea that much of the meaning in information objects is denoted in the relationships between terms. For example, "John drinks a cup" exhibits a functional dependence relationship type between John and cup. Hence, whereas in classical information retrieval approaches such a sentence will be indexed by the two keywords John and cup, Farradane put forward the idea that the relationship drink must also be represented in the final index of the sentence. In fact, in this special example, the indexing of the sentence by the keywords John and cup can even be called erroneous, as obviously John drinks the content of the cup (which can be wine, coke, water, etc.) and not the cup itself. The representation of the relationship drink avoids such an ambiguity, as we all know that we do not drink cups but only the liquid they contain.
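The parameterised sentence search described earlier in this section (for example "the #parameter on #parameter") can be sketched with regular expressions. This is only an illustration under stated assumptions, not the tool used on the CACM collection: the function names, the treatment of `#parameter` as "any non-empty string", case-insensitive matching, and the sample sentences are all hypothetical.

```python
import re


def compile_pattern(query):
    """Turn a query such as 'the #parameter on #parameter' into a regex.

    Each '#parameter' matches any non-empty string; every other token must
    match as a whole word. The sentence must begin with the first token.
    """
    parts = []
    for token in query.split():
        if token == "#parameter":
            parts.append(r"(.+?)")
        else:
            parts.append(r"\b" + re.escape(token) + r"\b")
    body = r"\s+".join(parts)
    # Optional final punctuation is allowed before the end of the sentence.
    return re.compile(r"^\s*" + body + r"\s*[.!?]?\s*$", re.IGNORECASE)


def match_sentences(query, sentences):
    """Return the sentences that match the parameterised query."""
    pattern = compile_pattern(query)
    return [s for s in sentences if pattern.match(s)]


# Hypothetical sentences, not taken from the CACM collection.
SENTENCES = [
    "The effect of noise on speech recognition.",
    "A note on the design of compilers.",
    "The design of compilers.",
]

print(match_sentences("the #parameter on #parameter", SENTENCES))
```

The captured groups give the two strings linked by the relation "on", which is the raw material a relational index would keep and a keyword index would discard.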
A parallel was drawn with the conceptual model from the database world, where relationship types between entities play an important role; the index description of a document consisting solely of keywords would be like an entity relationship model without relationship types. In Farradane's work, information was carried by a fixed set of relationship types over an underlying set of terms. This conception bears a close resemblance to a large class of knowledge formalisms such as terminological logics [21], conceptual graphs [22] and situation theory [3]. A terminological logic is a subset of first order logic with equality that contains only unary relations, representing sets of objects in the domain (referred to as concepts), and binary relations (called roles) linking together the objects of the domain. A conceptual graph is a bipartite, connected, finite and oriented graph of concepts and conceptual relations. In the graphs, concept nodes represent entities, attributes, states and events, and relation nodes show how the concepts are interconnected [22]. Finally, the situation theory formalism introduces the important notion of infon [23,3]. An infon is a structure that represents the information that a relation R holds or does not hold between a particular set of objects. The use of those three formalisms in information retrieval was partially motivated by the fact that they all allow more complex index terms to be represented, and they offer enough capabilities to represent objects and relationships between objects [13,24,25,26,27]. Recent studies [28,9] about the impact of structured documents on both indexing and retrieving have shown the need to represent some aggregative relationships that satisfy a set of properties and constraints.
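As a rough illustration of the notion of infon recalled above, the following Python sketch models an infon as an immutable record made of a relation name, a tuple of arguments and a polarity (1 when the relation holds, 0 when it does not). The class name, the field names and the textual rendering are assumptions made for this example; the pairing of bears-on with OPACITY and LUNG is also hypothetical, since only the names of the relations and terms are given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Infon:
    relation: str
    args: tuple
    polarity: int = 1  # 1: the relation holds; 0: it does not hold

    def __post_init__(self):
        if self.polarity not in (0, 1):
            raise ValueError("polarity must be 0 or 1")

    def negated(self):
        """Return the infon stating the opposite polarity."""
        return Infon(self.relation, self.args, 1 - self.polarity)

    def __str__(self):
        items = ", ".join((self.relation, *map(str, self.args)))
        return f"<<{items}; {self.polarity}>>"


# Hypothetical index expression built from the terms of figure 1.
infon = Infon("bears-on", ("OPACITY", "LUNG"))
print(infon)
print(infon.negated())
```

Because the record is frozen and its arguments are a tuple, infons are hashable and can be collected in sets, which is convenient when a document is described by a set of index expressions.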
However, although the relationships between terms can be represented by almost all current knowledge representation formalisms, they are not exploited as much as the objects by the primitive operations given by those formalisms. Moreover, no behavior can explicitly be associated with relationship types, and there is no relation-based reasoning that would allow one to explore the implicit knowledge that can be of use in the matching process. For example, it has been shown in [29] that although terminological logics are able to represent the relationships needed to index complex and structured documents, it is far from clear how to describe their characteristics. For instance, it is not possible to specify mathematical properties of roles (symmetry, transitivity, ...) without considering all possible derivable facts as a part of the knowledge base. Let us take a simple example: assume that Aggregate($os_i$, $os_j$) and Aggregate($os_j$, $os_k$) are two assertions (facts) in the terminological knowledge base. Also, assume that the Aggregate role is transitive. In such a case, there is no possibility in terminological logics to infer Aggregate($os_i$, $os_k$) from the knowledge base. Hence, a document indexed by Aggregate($os_i$, $os_j$) and Aggregate($os_j$, $os_k$) will not be retrieved by a query containing Aggregate($os_i$, $os_k$). In order to avoid this problem, we must explicitly add this derivable fact to the knowledge base. However, doing this would be very expensive, especially in cases where the document base contains several thousand structural objects. In order to avoid practical problems like this, the introduction of some relational operations can help to describe the behavior of relationships. Hence, we aim in the following sections to operate directly on the relationship types that can occur in semantical index expressions.
We will propose a technique that can infer new information by analyzing the properties of the relations, such as symmetry, transitivity, semantical behavior, arrangement with other relationships, and so on. This will be done through a general logical framework for studying relationships and their semantical properties. This framework captures the semantical information of the relations for information retrieval purposes by specifying their properties by way of inference rules.
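The Aggregate example above can be made concrete with a small Python sketch. It assumes that facts are stored as (relation name, argument pair) tuples and that the set of transitive relations is declared separately; the function names and the fixpoint loop are illustrative choices, not the mechanism of any particular terminological logic.

```python
def derive_transitive(facts, relation):
    """Return all argument pairs derivable for a transitive binary relation."""
    pairs = {args for (name, args) in facts if name == relation}
    changed = True
    while changed:
        changed = False
        for (a, b) in list(pairs):
            for (c, d) in list(pairs):
                if b == c and (a, d) not in pairs:
                    pairs.add((a, d))
                    changed = True
    return pairs


def matches(document_facts, query_fact, transitive_relations):
    """Exact match first; otherwise use transitivity if it is declared."""
    name, args = query_fact
    if query_fact in document_facts:
        return True
    if name in transitive_relations:
        return args in derive_transitive(document_facts, name)
    return False


document = {("Aggregate", ("os_i", "os_j")), ("Aggregate", ("os_j", "os_k"))}
query = ("Aggregate", ("os_i", "os_k"))

print(matches(document, query, transitive_relations=set()))          # no property declared
print(matches(document, query, transitive_relations={"Aggregate"}))  # transitivity declared
```

The derived pair is computed when the query is matched, so the knowledge base itself does not have to store every derivable fact.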

A logical relational approach
Suppose one wants to build a logical system for a given application. First of all, one must specify which language L is to be used for defining the notion of well formed formulae in this system. This will allow us to construct the axioms and rules capturing the system. In other words, it is not enough to know the behavior of this logical system, as one must also know how it is presented. Here, we are interested in a logical framework for information retrieval.
Hence, if one considers deduction as the retrieval operation, the relevance relationship between document and query can be established in terms of the axioms and rules belonging to the logical system. As a result, the index language we use for the description of documents and queries is the one specifying the syntax of a well formed formula in the above system. Following Farradane's idea, the index expressions must be able to represent the following form: $E_i$ Relation $E_j$, where $E_i$ and $E_j$ are simple or complex descriptors, and Relation is a relation holding between the descriptors. We choose situation theory [30,23] as the language of the derivation system.

The semantical content of each document depends greatly on the relationships between objects and the index expressions it contains, and these relationships are the means of describing how the objects are combined. It therefore seems worthwhile to capture the behavior and the properties of these relationships through a set of inference rules in order to generate more knowledge about the content of each document. Such relationship properties will be defined by the set of axioms and rules contained in the derivation system.

By definition we assume that the aboutness inference $\Box\!\!\leadsto$ is reflexive, which seems to be an inherent property of aboutness in many IR models [3]; i.e., $\forall e \in L(O,R),\ e \mathrel{\Box\!\!\leadsto} e$. Using this property, one may infer that the operator $\oplus$ is idempotent. The empty index expression $\varepsilon$ is a neutral element for $\oplus$: $\forall e \in L,\ e \oplus \varepsilon \mathrel{\Box\!\!\leadsto} e$.

A query can be seen as a request for information. It can therefore also be represented as an aggregation of expressions from the language $L(O,R)$. What remains now is the question of how to derive the query from the document description in a way that will improve the effectiveness of the matching operation.
Given the definitions above, we propose to operate directly on relations, that is, to exploit their semantical properties and the information they bear implicitly. Hence, in addition to the semantic networks that establish the relationships that objects have to other objects (i.e., the thesaurus), usually used in classical information retrieval approaches, for each relation we give its properties, its behavior and its arrangement with other relations. All this information will be captured through a set of derivation rules.

The Mathematical Properties
General mathematical properties can be used to make explicit the relations that are implicitly contained in the system's knowledge. Such relations can be used to improve the matching process between the document and the query. For instance, suppose that in an image we have three objects $o_1$, $o_2$ and $o_3$, and that they are connected by "cover" spatial relationships such that $o_1$ covers $o_2$ and $o_2$ covers $o_3$; then we can add a new "cover" relationship between $o_1$ and $o_3$. Indeed, in the raster mode, we say that an object $o_1$ covers another object $o_2$ if, and only if, all the pixels of the object $o_1$ are included in the pixels of the object $o_2$, i.e. $\mathrm{pixel}(o_1) \subseteq \mathrm{pixel}(o_2)$. This relation is transitive, as mentioned by [32].
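The raster-mode definition of "cover" given above can be checked on a toy example. The sketch below assumes that each object is described by the set of (row, column) coordinates of its pixels; the function name and the three pixel sets are hypothetical.

```python
def covers(pixels_a, pixels_b):
    """cover(o_a, o_b) in the sense used above: pixel(o_a) is included in pixel(o_b)."""
    return pixels_a <= pixels_b


# Hypothetical objects, each given by its set of (row, column) pixels.
o1 = {(0, 0)}
o2 = {(0, 0), (0, 1)}
o3 = {(0, 0), (0, 1), (1, 1)}

print(covers(o1, o2), covers(o2, o3))  # the two indexed facts
print(covers(o1, o3))                  # the fact obtained by transitivity
```

Since set inclusion is transitive, the derived fact always agrees with what a direct comparison of the pixel sets would give, which is why the transitivity rule is safe for this relation.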
It should be noted that these mathematical properties are easier to formalize when they are associated with binary relations. However, although it is theoretically possible to split a relation into binary relations [22], we choose here to handle both binary and N-ary relations. Indeed, due to some practical considerations, one may need some N-ary relations. For instance in [29], it has been shown that in order to cope with the complexity of structured multimedia documents, some ternary relations are needed. To represent such a relation with binary relations only, we need two primitive binary relations, which are more difficult to manage: we must check that all the properties of the N-ary relation are faithfully represented. In fact, one can easily see that the exclusive use of binary relations requires the creation of a large number of relations, which makes them hard to deal with and their treatment rather complex. Using N-ary relations, the number of relations will decrease and a better computational behavior of the underlying operational system can be expected. On the other hand, only the N-ary relations that are of interest in information retrieval will be considered in the framework. As their semantical properties are known, it is not hard to include them as derivation rules.
The mathematical properties of relations can be defined as follows:

Sequentiality
$$\frac{\langle\langle R, e_1, e_2, \ldots, e_k; 1 \rangle\rangle \qquad \langle\langle R, e_k, e_{k+1}, \ldots, e_{k+m}; 1 \rangle\rangle}{\langle\langle R, \ldots e^{N-1} \ldots, e_{k+m}; 1 \rangle\rangle}$$

In case R is an N-ary relation, $e^{N-1}$ denotes a combination of $N-1$ expressions chosen from the set $e_1, \ldots, e_{k+m-1}$. We can have $N-1$ possible derivations, according to which combination of expressions we take.
For example, in the case of ternary relations, two derivations are valid if the relation R satisfies the above rule. For instance, the relation "Between", which is a ternary relation, satisfies both derivation rules. This property is a general case of transitivity.
The permutation rule expresses the claim that the order of linking some objects in the relation does not have any influence on the derivation process.

Permutation
$$\frac{\langle\langle R, e_1, e_2, \ldots, e_i, \ldots, e_j, \ldots, e_k; 1 \rangle\rangle}{\langle\langle R, e_1, e_2, \ldots, e_j, \ldots, e_i, \ldots, e_k; 1 \rangle\rangle}$$

Consider for instance the relation Between. If we assume that $\langle\langle \mathrm{Between}, X, Y, Z; 1 \rangle\rangle$ means "X is between Y and Z", then the document "Reading is between London and Bristol" is about the query "Reading is between Bristol and London", as the Between relation is symmetric. We have the following instantiation of the "Permutation" rule:

Example
$$\frac{\langle\langle \mathrm{Between}, \mathrm{Reading}, \mathrm{London}, \mathrm{Bristol}; 1 \rangle\rangle}{\langle\langle \mathrm{Between}, \mathrm{Reading}, \mathrm{Bristol}, \mathrm{London}; 1 \rangle\rangle}$$

In the case of binary relations, this property corresponds to the classical notion of Symmetry. For instance, the relation Married-To is symmetric. The symmetry property is primarily intended to increase the number of the aboutness theorems [3].
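The Permutation rule above can be sketched in Python as a swap of two argument positions. In this sketch an infon is a plain (relation, arguments, polarity) tuple, and the table `PERMUTABLE`, which states for each relation which pairs of positions may be exchanged, is an assumption introduced for the example.

```python
def swap(infon, i, j):
    """Return the infon obtained by exchanging the arguments at positions i and j."""
    relation, args, polarity = infon
    new_args = list(args)
    new_args[i], new_args[j] = new_args[j], new_args[i]
    return (relation, tuple(new_args), polarity)


# Assumed declarations: which argument positions (0-based) may be permuted.
PERMUTABLE = {
    "Between": [(1, 2)],     # "X is between Y and Z": Y and Z may be exchanged
    "Married-To": [(0, 1)],  # binary case: classical symmetry
}


def about_by_permutation(document_infon, query_infon):
    """True if the query equals the document infon, possibly after one allowed swap."""
    if document_infon == query_infon:
        return True
    swaps = PERMUTABLE.get(document_infon[0], [])
    return any(swap(document_infon, i, j) == query_infon for (i, j) in swaps)


doc = ("Between", ("Reading", "London", "Bristol"), 1)
query = ("Between", ("Reading", "Bristol", "London"), 1)
print(about_by_permutation(doc, query))
```

Declaring the permutable positions per relation keeps the rule from being applied where it is not valid, for example exchanging X and Y in Between.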
Relations can have the behavior of a function. In the work of Chiaramella et al. [28], it has been shown that in order to provide a solid formal basis for expressing the retrieval of structured documents, some specific functions must be introduced. They can satisfy one of the following logical constraints:

there is no $i_1, \ldots, i_n, e_1, e_2$ such that $R(i_1, \ldots, i_n, e_1) \wedge R(i_1, \ldots, i_n, e_2)$

there is no $e, i_1, \ldots, i_n, j_1, \ldots, j_m$ such that $R(e, i_1, \ldots, i_n) \wedge R(e, j_1, \ldots, j_m)$

In our logical framework, the first constraint can be expressed as a derivation rule. We can then have the following example:

Example
$$\frac{\langle\langle \mathrm{Part\text{-}Of}, o_1, o_2; 1 \rangle\rangle \qquad \langle\langle \mathrm{Contained\text{-}In}, o_2, o_3; 1 \rangle\rangle}{\langle\langle \mathrm{Contained\text{-}In}, o_1, o_3; 1 \rangle\rangle}$$

Links with other relations
The definition of links between relations enables one to deduce new information about the indexing expressions, using relation-based reasoning. Hence, relations can be handled as well as concepts.
Like synonymy in the case of keywords or concepts (considered as unary relations), one may express that two relations are the same. The "Alias" rule is a way to specify that two relations have the same extension and hence that they are logically equivalent. As for terms (keywords, concepts, etc.), in information retrieval this rule may be useful to avoid the omission of some relevant answers when the name of the relation mentioned by the user is not the same as the one used in the indexing process.

A link between relationships that can be interesting for information retrieval is introduced by the Inversion rule. This rule expresses the fact that the relations $R_1$ and $R_2$ are equivalent when their arguments are swapped. Relationships of this kind are often used in knowledge representation formalisms, especially in the case of binary relations. For instance, in many terminological logics a special operator inv is introduced. This operator, applied to a given role (binary relation), produces an inversion of its arguments. This derivation rule allows one, for instance, to derive that an image in which a tree is on the left of a house is relevant to a query looking for all images representing a house on the right of a tree.

A relation $R_1$ can be a sub-relation of another relation $R_2$. Intuitively, if these relations are viewed as sets of tuples, $R_1$ is a subset of $R_2$. In other words, every tuple of $R_1$ is also a tuple of $R_2$, i.e., if $R_1$ holds for some objects $o_1, o_2, \ldots, o_n$, then the relation $R_2$ holds for the same arguments. The Subrelation rule expresses this fact. Note that a relation and its sub-relation must have the same arity N.

The next rule denotes the Simultaneity property. Let n and m be the respective arities of $R_1$ and $R_2$. Let $i_1, i_2, \ldots, i_{n-1}$ and $j_1, j_2, \ldots, j_{m-1}$ be sets of indexing expressions. The Simultaneity property states that each indexing expression e taking part in the relation $R_1$ by $R_1(e, i_1, i_2, \ldots, i_{n-1})$ necessarily takes part in the relation $R_2$ such that $R_2(e, j_1, j_2, \ldots, j_{m-1})$, and vice versa. In our framework, this property can be expressed as follows:

Simultaneity
$$\frac{\langle\langle R_1, e, i_1, \ldots, i_{n-1}; 1 \rangle\rangle}{\langle\langle R_2, e, j_1, \ldots, j_{m-1}; 1 \rangle\rangle}$$

For instance, if we assume that $\langle\langle \mathrm{Parents}, x, y, z; 1 \rangle\rangle$ means "the parents of x are y and z, with y being the mother and z being the father", then a document about $\langle\langle \mathrm{Parents}, x, y, z; 1 \rangle\rangle$ is also about $\langle\langle \mathrm{Have\text{-}As\text{-}Father}, x, z; 1 \rangle\rangle$.
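The Alias, Inversion and Subrelation rules described above can be sketched as a small forward-chaining procedure. This is a minimal illustration under stated assumptions: infons are (relation, arguments, polarity) tuples, the three declaration tables are hypothetical (only the Loc/Place and left/right pairs echo examples from the text, and the alias name On-The-Left-Of is invented), and only positive infons are expanded.

```python
# Assumed declarations, for illustration only.
ALIAS = {("Left-Of", "On-The-Left-Of")}  # same extension, usable in both directions
INVERSE = {("Left-Of", "Right-Of")}      # binary relations with swapped arguments
SUBRELATION = {("Loc", "Place")}         # every Loc tuple is also a Place tuple


def derive_once(infon):
    """Apply each rule once to a single positive infon and return the new infons."""
    relation, args, polarity = infon
    derived = set()
    if polarity != 1:
        return derived
    for a, b in ALIAS:
        if relation == a:
            derived.add((b, args, 1))
        if relation == b:
            derived.add((a, args, 1))
    for a, b in INVERSE:
        if len(args) == 2 and relation == a:
            derived.add((b, args[::-1], 1))
        if len(args) == 2 and relation == b:
            derived.add((a, args[::-1], 1))
    for sub, sup in SUBRELATION:
        if relation == sub:
            derived.add((sup, args, 1))
    return derived


def closure(infons):
    """Repeat derive_once until no new infon appears."""
    known = set(infons)
    frontier = set(infons)
    while frontier:
        new = set()
        for infon in frontier:
            new |= derive_once(infon) - known
        known |= new
        frontier = new
    return known


image = {("Left-Of", ("tree", "house"), 1)}
print(("Right-Of", ("house", "tree"), 1) in closure(image))
```

The closure terminates because only finitely many relation names and argument orders can be produced from a finite set of infons; whether such derived infons should be stored or computed on demand is left open here.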
Another rule claims that for some pairs of relations $R_1$ and $R_2$ having the same arity N, there is no set of objects $o_1, o_2, \ldots, o_n$ that can be linked together at the same time by $R_1$ and $R_2$. In terms of our framework, we write:

Preclusion
$$\frac{\langle\langle R_1, e_1, e_2, \ldots, e_n; 1 \rangle\rangle}{\langle\langle R_2, e_1, e_2, \ldots, e_n; 0 \rangle\rangle}$$

In the case of binary relations, if a document is about an expression such as $\langle\langle \mathrm{With}, \mathrm{TRAIN}, \mathrm{RESERVATION}; 1 \rangle\rangle$, then this document is not about a query looking for trains without reservation, i.e. $\langle\langle \mathrm{Without}, \mathrm{TRAIN}, \mathrm{RESERVATION}; 1 \rangle\rangle$. Here, the relations "With" and "Without" preclude each other. This property is interesting for information retrieval as it may be used to determine non-aboutness [3].
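The Preclusion rule and its use for non-aboutness can be sketched as follows. The sketch assumes infons are (relation, arguments, polarity) tuples and that mutually precluding relations are declared as unordered pairs in a hypothetical table `PRECLUDES`; a query is judged "not about" when its negation is derivable from the document.

```python
# Assumed declaration: unordered pairs of relations that preclude each other.
PRECLUDES = {frozenset({"With", "Without"})}


def precluded(infon):
    """Return the negative infons derived from a positive infon by Preclusion."""
    relation, args, polarity = infon
    derived = set()
    if polarity == 1:
        for pair in PRECLUDES:
            if relation in pair:
                for other in pair - {relation}:
                    derived.add((other, args, 0))
    return derived


def not_about(document_infons, query_infon):
    """True if the negation of the query infon is derivable from the document."""
    relation, args, polarity = query_infon
    negated_query = (relation, args, 1 - polarity)
    derived = set(document_infons)
    for infon in document_infons:
        derived |= precluded(infon)
    return negated_query in derived


document = {("With", ("TRAIN", "RESERVATION"), 1)}
print(not_about(document, ("Without", ("TRAIN", "RESERVATION"), 1)))
print(not_about(document, ("With", ("TRAIN", "RESERVATION"), 1)))
```

Note that `not_about` returning False does not by itself establish aboutness; it only says that no preclusion was found.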

A theoretical study of relations
In the previous section, we have presented a general logical mechanism of relation-based reasoning. It is based on a selection of rules that possess some important properties, according to real cases we have encountered in today's information retrieval systems [32,29]. Detailed investigations of what an interesting relation for information retrieval consists of are needed in order to develop an operational system. Indeed, it remains clear that only a specific set of relations is useful or needs to be characterized in the case of information retrieval. For instance, as mentioned by Palmer [33], verbs do not convey useful properties. On the other hand, we already know that one needs a specific class of aggregative relations in order to handle structured multimedia documents [28]. For example, one can represent the information that a book $b_1$ is composed of two chapters $c_1$ and $c_2$ by the relations Aggregate($b_1$, $c_1$) and Aggregate($b_1$, $c_2$), with Aggregate an aggregative relation. Another class of useful relations in information retrieval is presented in the work of Mechkour described in [32]. This work proposes a model for image retrieval and describes a class of spatial relationships that may occur in an image. Twelve spatial relations were proposed in this model, such as far, near, touch, in, etc.
Our logical mechanism can be integrated into all the knowledge formalisms that do not allow for relation-based reasoning. Hence, this will resolve their limitations in cases where the query refers to some common properties of the relations. The result is a hybrid model that combines the original characteristics of the chosen formalism and our relation-based framework, in such a way that it can handle relations as well as concepts. For instance, assume that a document d is represented by the full nodes in figure 4. This document will be retrieved by the query q only if the properties of the relations Aggregate and Cover are applied. As introduced by Sowa […

[Figure 4: A relation-based reasoning example]

Some specific classes of relations can be defined. These classes will serve as a basis to express the properties of relations, their mutual relationships and their links with other relations, in order to make their behavior explicit.
Indeed, in addition to the fact that it is easier to determine those properties at the level of a single class, it is usually the case that only relations of a common class can be arranged or linked together. A study of the different kinds of relationships used in the information retrieval literature shows that mainly prepositions and verbs are used as relations in the indexing expressions. For example, in Palmer's work [33], an index expression consists of a noun phrase, called a conceptual group, where the relationships are essentially French prepositions. Bruza's approach adopts the same principle with English prepositions [34].
As a consequence, a classification of this particular kind of relation can be of interest for our approach. The relation-based mechanism will rely on such a classification to determine for each relation its properties and behavior. This will be done within each specified class on the relations it contains. Depending on the information contained in the corpus and the nature of the documents (textual, multimedia, structured, etc.), other classes may be added and analyzed, such as the class of aggregative relations in the case of structured documents.
It should be noted that it is difficult to use prepositions correctly. Most of them have several different functions; for instance the dictionary lists eighteen main uses of the preposition at [35], though probably only a few of them are of interest in information retrieval. In order to differentiate these kinds of relations and to specify exactly which sense of the relation at we use, one may specify a signature for each relation. The signature of an n-ary relation R is a sequence of n object types $\langle t_1, t_2, \ldots, t_n \rangle$ that specifies the types of the objects that can be linked by this relation.
Here a hierarchy (lattice) of objects is needed. To be valid, each argument (object) of the tuple must be a specialization (a sub-class) of the corresponding type in the signature, according to the hierarchy of objects. For example, the relation Loc linking an Object to a Place may have $\langle \top, \mathrm{PLACE} \rangle$ as a signature, where $\top$ denotes the universal object denoting all the individuals of the domain. We write $t_1 \leq t_2$ to mention the fact that the type $t_1$ is a specialization of the type $t_2$. Using the signature, one is able to express the similarity between sentences of the type "the adventures of Alice" and "Alice's adventure". This cannot be deduced by Palmer's [33] and Bruza's [34] approaches. Hence, for each relation, one may specify the following parts:
- The label of the corresponding relation, such as Loc for the Locality relation.
- The class of the relation, such as EXACT for the relation Loc, considered as a special case of the at preposition in figure 4.
- The signature of the relation, like $\langle \top, \mathrm{PLACE} \rangle$ for the relation Loc.
- An informal comment about the use of the relation, such as "A locality is a relation linking an Object to a Place".
- The mathematical properties of the relation; for example, the Loc relation may satisfy the Left Exclusivity rule.
- Its arrangement with other relations, that is, its links with other relations; for instance, the Loc relation is a Subrelation of the relation Place (see figure 4); here we use the Subrelation rule.

[Table 1: Prepositions]

… information is not added but rejected. Depending on the context reasoning approach and the established arrangement between relations, this problem can be resolved. For instance, as the relation $\langle\langle \mathrm{Father\text{-}Of}, a, b; 1 \rangle\rangle$ precludes the relation $\langle\langle \mathrm{Child\text{-}Of}, a, b; 1 \rangle\rangle$, it will be impossible in the case of monotonic reasoning to add information like $\langle\langle \mathrm{Child\text{-}Of}, \mathrm{John}, \mathrm{Jack}; 1 \rangle\rangle$ if $\langle\langle \mathrm{Father\text{-}Of}, \mathrm{John}, \mathrm{Jack}; 1 \rangle\rangle$ already exists in the system's knowledge about the document. Some other questions can be discussed, such as the influence of the order of application of the rules on the matching process, or whether all the information derivable from the indexing expression should be added to the system's knowledge about the document. For instance, an automatic application of the Inversion or Alias rules may produce redundant information.
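The description of a relation proposed in this section (label, class, signature, comment, properties, links) can be sketched as a record with a signature check. Everything in the sketch below is an assumption made for illustration: the type names, the use of a single-parent hierarchy instead of a full lattice, the name TOP for the universal object, and the class and field names.

```python
from dataclasses import dataclass

# Hypothetical type hierarchy (child -> parent); TOP stands for the universal object.
PARENT = {"PLACE": "TOP", "CITY": "PLACE", "PERSON": "TOP"}


def is_specialization(t, ancestor):
    """True if type t equals ancestor or lies below it in the hierarchy."""
    while t is not None:
        if t == ancestor:
            return True
        t = PARENT.get(t)
    return False


@dataclass
class RelationSpec:
    label: str
    relation_class: str
    signature: tuple
    comment: str = ""
    properties: tuple = ()
    links: tuple = ()

    def accepts(self, arg_types):
        """Check the types of a tuple of arguments against the signature."""
        return len(arg_types) == len(self.signature) and all(
            is_specialization(t, s) for t, s in zip(arg_types, self.signature)
        )


loc = RelationSpec(
    label="Loc",
    relation_class="EXACT",
    signature=("TOP", "PLACE"),
    comment="A locality is a relation linking an Object to a Place",
    properties=("Left Exclusivity",),
    links=(("Subrelation", "Place"),),
)

print(loc.accepts(("PERSON", "CITY")))   # a person located in a city
print(loc.accepts(("CITY", "PERSON")))   # second argument is not a place
```

A lattice with multiple parents would need a graph search instead of the simple upward walk used here, but the validity test on each argument would keep the same shape.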

Conclusion
In information retrieval some indexing processes combine information items by bringing them into a relationship, in order to describe the information content of a document more precisely. In this paper we have proposed a logical framework for studying such relationships and their impact on the matching process. The framework captures some relationship features and properties. Within this framework, rules have been outlined that enable the generation of new implicit information contained in the document. The work presented here does not refer to a particular indexing language: our relational indexing approach does not prescribe the way in which the index terms are obtained. The indexing terms could be any element of any complex language, such as conceptual graphs, terminological logics and so on. The construction of the framework is work in progress. It constitutes a first step in the understanding of relations and their influence on the matching process in information retrieval. Our interest is focused on the theoretical aspects, refining the accuracy of the framework and evaluating the effectiveness of the relational approach on both recall and precision. Although the expressive power of our framework has been demonstrated, we have not yet paid attention to uncertainty aspects. This is an important focus for future work. For example, the claim that the index expression $\langle\langle \mathrm{Talking\text{-}To}, \mathrm{John}, \mathrm{Mary}; 1 \rangle\rangle$ is about $\langle\langle \mathrm{Listening\text{-}To}, \mathrm{Mary}, \mathrm{John}; 1 \rangle\rangle$ is very often true but not always. It depends on the context, for instance whether they are in the classroom or in a meeting. From an implementation point of view, we are investigating this framework within a conceptual graph environment [32].