Eraser Lattices for Documents and Sets of Documents

Automatic schemes for natural language analysis based on counting word co-occurrences have been very successful in capturing meaning, for example by automatically grouping words that refer to similar concepts, or documents about similar topics. In this work, a more general framework is proposed to represent documents and measurements geometrically, in a way directly related to the representation of measurement in Quantum Theory.


INTRODUCTION
Co-occurrence of terms has been successfully used for automatically extracting semantic information from text documents (see [3], [4]). In this work, a different approach is proposed, based on transformations that act on documents in a way that is analogous to how projectors act on vectors. These transformations, called Selective Erasers, are defined in section 2. The underlying assumption behind this work is that suitably defined order relations between these measurement transformations are able to capture the semantic content of the text.

SELECTIVE ERASERS (SE) AND THEIR INCLUSION RELATIONS
An SE is defined as a transformation that finds the occurrences of a certain low-level feature in the document, preserves their surroundings, and erases the rest. This general definition is not very useful, because it does not specify what kind of low-level feature can be preserved together with its surroundings, nor how the surroundings to be preserved are defined. A more usable definition of an SE is given in [2] for the particular case of term occurrences (as low-level features) in text documents: an SE is a transformation $E(t, w)$ which erases every token that does not fall within a window of $w$ positions around an occurrence of term $t$ in a text document. These Erasers act as transformations on documents, producing a modified document with some erased tokens, much as projectors act on vectors or other operators. The concept was first introduced in [1], where some properties of Erasers are shown, in particular those they share with measurements as described in Quantum Theory. They can also be shown to include well-known measurements such as occurrences and co-occurrences of terms and n-grams.
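The action of $E(t, w)$ on a tokenised document can be illustrated with a minimal Python sketch. The function names, the use of `None` as the mark for an erased token, and the convention that a window of width `w` covers the positions at distance at most `w` from an occurrence are all assumptions of this sketch, not definitions taken from [1] or [2].

```python
ERASED = None  # assumed marker for an erased token


def preserved_positions(tokens, term, width):
    """Indices lying within `width` positions of some occurrence of `term`."""
    keep = set()
    for i, tok in enumerate(tokens):
        if tok == term:
            lo = max(0, i - width)
            hi = min(len(tokens) - 1, i + width)
            keep.update(range(lo, hi + 1))
    return keep


def erase(tokens, term, width):
    """Apply the eraser: keep tokens inside a window, erase all others."""
    keep = preserved_positions(tokens, term, width)
    return [tok if i in keep else ERASED for i, tok in enumerate(tokens)]


doc = "the cat sat on the mat near the cat".split()
print(erase(doc, "cat", 1))
```

Applying the same eraser twice leaves the result unchanged in this sketch, which is the property that makes the comparison with projectors natural.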
A very important characteristic is that some erasers will include others: there will be pairs of Erasers such that what one preserves is contained in what the other preserves. Each eraser preserves small "windows" of text; when the windows of eraser A contain those of eraser B, we can say that eraser A includes eraser B for the text considered. The structure of these relations has been discussed in [2]. The formal condition for an inclusion relation between erasers (which will be denoted $E(t_1, w_1) \geq_D E(t_2, w_2)$ when it holds on document $D$) would then be:
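In operational terms, the condition says that every position preserved by the included eraser is also preserved by the including one. A minimal Python sketch of that check follows; the function names and the window convention (positions at distance at most `w` from an occurrence) are assumptions of the sketch rather than the formal statement in [2].

```python
def preserved_positions(tokens, term, width):
    """Indices lying within `width` positions of some occurrence of `term`."""
    keep = set()
    for i, tok in enumerate(tokens):
        if tok == term:
            lo = max(0, i - width)
            hi = min(len(tokens) - 1, i + width)
            keep.update(range(lo, hi + 1))
    return keep


def includes(tokens, t1, w1, t2, w2):
    """True if E(t1, w1) includes E(t2, w2) on the document `tokens`.

    Note: if t2 never occurs, E(t2, w2) preserves nothing and is
    trivially included in any eraser.
    """
    inner = preserved_positions(tokens, t2, w2)
    outer = preserved_positions(tokens, t1, w1)
    return inner <= outer  # subset test on preserved positions


doc = "a b c d e".split()
print(includes(doc, "c", 2, "d", 1))  # windows of "c" cover those of "d"
print(includes(doc, "c", 1, "d", 1))  # position 4 is not covered
```

The relation depends on the document: the same pair of erasers may be ordered on one text and unordered on another.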

FROM ERASERS TO PROJECTORS
Equation (1) defines a relation that is analogous to the inclusion relation between projectors on subspaces of a vector space. Replacing SEs with projectors, and documents with vectors, the relations hold in the same way. The problem of representing Erasers and documents can be addressed through the following ansatz: for a certain term $t$, the family of Erasers centred on it, $E(t, w)$, would be accurately represented by a set of commuting projectors with rank $f(w)$, where $f$ is a monotonic function. This way, the relations $E(t, w) \leq E(t, w + \delta)$ are guaranteed for any positive integer $\delta$. The correspondence would be:

Two projectors of the same rank corresponding to different central terms can be converted into each other by a unitary transformation, just as a term swap would convert the corresponding SEs into each other:
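As a sketch only, the ansatz and the unitary conversion could be written as follows. The notation is assumed here and not taken from the source: $\Pi_t^{(w)}$ stands for the projector representing $E(t, w)$, and $U_{t_1 \to t_2}$ for the unitary associated with swapping term $t_1$ for term $t_2$. The second line uses the standard characterisation of inclusion between projectors, $PQ = QP = P$.

```latex
\begin{align*}
E(t, w) \;&\longmapsto\; \Pi_t^{(w)}, \qquad \operatorname{rank} \Pi_t^{(w)} = f(w), \\
\Pi_t^{(w)} \, \Pi_t^{(w+\delta)} \;&=\; \Pi_t^{(w+\delta)} \, \Pi_t^{(w)} \;=\; \Pi_t^{(w)}, \\
\Pi_{t_2}^{(w)} \;&=\; U_{t_1 \to t_2} \, \Pi_{t_1}^{(w)} \, U_{t_1 \to t_2}^{\dagger}.
\end{align*}
```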

RANK OF PROJECTORS
A topic can be thought of as the set of documents about it, and can therefore be represented by inclusion relations. Suppose that for a document $D_1$ it is the case that $E(A, w_1) \geq_{D_1} E(B, w_2)$, and for a document $D_2$, dealing with the same topic, it holds that $E(A, w_3) \geq_{D_2} E(B, w_4)$. The relation that holds for both documents would therefore be descriptive of the topic:

$E(A, \max(w_1, w_3)) \geq_{\{D_1, D_2\}} E(B, \min(w_2, w_4))$ (4)

The increase in the width difference necessary to produce inclusion relations is crucial to determine the geometric representation of the Erasers. In the example of (4), the difference in width increases from $(w_1 - w_2)$ or $(w_3 - w_4)$ to $(\max(w_1, w_3) - \min(w_2, w_4))$. Empirical evaluations, shown in Figure 1, suggest that this width can increase linearly with the number of documents considered.
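The growth of the required width as documents are added can be sketched in Python. The sketch fixes the width of the included eraser and, for each document, finds the smallest width of the including eraser that produces the inclusion; the width needed for a set of documents is then the running maximum. The function names, the window convention (distance at most `w`), and the choice to skip documents where the inclusion cannot be produced are assumptions of this sketch, not the procedure used for the empirical evaluation.

```python
def occurrences(tokens, term):
    """Positions at which `term` occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def min_including_width(tokens, a, b, w_b):
    """Smallest w_a such that E(a, w_a) includes E(b, w_b) on `tokens`.

    Returns 0 if b does not occur (nothing to include) and None if b
    occurs but a does not (no width can produce the inclusion).
    """
    pos_a, pos_b = occurrences(tokens, a), occurrences(tokens, b)
    if not pos_b:
        return 0
    if not pos_a:
        return None
    needed = 0
    for j in pos_b:
        # every position in the window around b must be reached from some a
        for p in range(max(0, j - w_b), min(len(tokens) - 1, j + w_b) + 1):
            needed = max(needed, min(abs(p - i) for i in pos_a))
    return needed


def width_vs_collection_size(docs, a, b, w_b):
    """Running maximum of the required width as documents are added."""
    curve, current = [], 0
    for tokens in docs:
        w = min_including_width(tokens, a, b, w_b)
        if w is None:
            continue  # assumed policy: skip documents with b but without a
        current = max(current, w)
        curve.append(current)
    return curve


docs = ["A B".split(), "x A y B z".split(), "B q q q q A".split()]
print(width_vs_collection_size(docs, "A", "B", 1))
```

By construction the curve never decreases; how fast it grows depends on the collection.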
To draw a vector analogy, the width of an SE can be considered analogous to the rank of a projector. The join of two projectors will always include both of them, so join projectors can be related in this analogy to an including SE. Let us set a finite threshold $1 - \epsilon$ for the overlap required to consider a unit vector as lying in a subspace. A projector $\Pi$ can be considered as the join of $R$ disjoint 1-dimensional subspaces, where $R$ is its trace. A random rank-1 projector will only increase the rank if it is not included in any of these, and since these non-inclusions are independent events, the probability of increasing the rank is the product of $R = \mathrm{Tr}(\Pi)$ identical terms.

The curve showing the expected increase of rank with random vectors in a space of dimension 100, shown in Figure 1, suggests that the dimension required to represent erasers as projectors is beyond 40. The point where the curve starts showing a negative curvature, like that of the rank curves for projectors, will probably only be approached with bigger collections or more frequent terms. A closer study of this kind of curve could indicate the number of dimensions required to represent sets of Selective Erasers as projectors on subspaces of a Hilbert space.
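The independence argument can be turned into a small computation of the expected rank curve. If $p$ denotes the probability that a random unit vector passes the overlap threshold for one fixed 1-dimensional subspace, the rank grows from $R$ to $R + 1$ with probability $(1 - p)^R$. In the Python sketch below, `p_inside` plays the role of $p$ and is a free parameter assumed for illustration; its dependence on $\epsilon$ and on the dimension of the space is not modelled, and the cap imposed by a finite dimension is ignored.

```python
def expected_rank_curve(n_steps, p_inside):
    """Expected rank after joining 1..n_steps random rank-1 projectors.

    p_inside: assumed probability that a random unit vector counts as
    lying in one fixed 1-dimensional subspace.
    """
    dist = {0: 1.0}  # dist[r] = probability that the current rank is r
    curve = []
    for _ in range(n_steps):
        new_dist = {}
        for r, prob in dist.items():
            # rank grows only if the vector is in none of the r subspaces
            grow = (1.0 - p_inside) ** r
            new_dist[r + 1] = new_dist.get(r + 1, 0.0) + prob * grow
            new_dist[r] = new_dist.get(r, 0.0) + prob * (1.0 - grow)
        dist = new_dist
        curve.append(sum(r * prob for r, prob in dist.items()))
    return curve


for p in (0.0, 0.01, 0.05):
    print(p, [round(x, 2) for x in expected_rank_curve(40, p)[9::10]])
```

With `p_inside = 0` every new vector increases the rank and the curve is the straight line $R = n$; for any value strictly between 0 and 1 the successive increments shrink, which gives the negative curvature discussed above.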

FIGURE 1: Measurement of widths required to produce inclusion relations, on approximately 2000 documents from TREC-1 that were assessed as relevant to 50 different topics. The linear increase of width suggests that a direct proportionality between width and rank of the corresponding projector should not be assumed. In the figure on the right, the average rank of joins made with random rank-1 projectors is shown. The different curves represent different threshold criteria for considering a vector as lying in a subspace (threshold for the inner product). The less tight the threshold, the closer the curve gets to a straight line.