Perfect Encoding: a Signature Method for Text Retrieval

A new methodology is introduced, where blocks of text are replaced by a compressed, fully reversible, signature pattern. Full reversibility implies zero information loss, thus the new method is termed Perfect Encoding. The method's analytical model is produced and, where applicable, contrasted with the current practice in signature file organizations. Analysis results indicate that it is a potential candidate for information retrieval implementations. In particular, perfect encoding has the potential to develop into an alternative or complementary scheme to inverted or signature file based systems.


Introduction
Free text indexing methodologies, like the inverted file and signature file approaches, enjoy applicability in the modern Information Retrieval (IR) environment [6,4]. The inverted file approach is characterized by its efficiency in text retrieval operations, whereas the Signature File (SF) involves a simple structure and requires significantly less storage overhead. The Superimposed Coding Signature File (SC-SF) is the most widely used signature file variation. SC-SF is applied in indexing objects for a variety of text-based applications [13,7,11].
SC-SF considers the textbase to consist of a number of logical blocks, each block involving a constant, pre-specified number (D) of distinct, non-common words. An F-bit signature or descriptor, consisting of m ones (1s) set in the [1...F] range, is associated with each word in the logical block. For each block of text, its D word signature patterns are bit-ORed and produce an F-bit block signature pattern. Block signature patterns are used as an intermediary, compressed representation for text indexing purposes.
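The construction just described can be illustrated with a minimal sketch. The way in which a word is mapped on its m bit positions is not specified above; the SHA-256 based choice below, as well as the function names, are assumptions made purely for illustration.

```python
import hashlib

def word_signature(word: str, F: int, m: int) -> int:
    """Return an F-bit integer with exactly m bits set, derived from `word`.

    Assumption: bit positions come from repeated SHA-256 hashing; any
    deterministic word-to-positions mapping would serve the same purpose.
    """
    positions = set()
    counter = 0
    while len(positions) < m:
        digest = hashlib.sha256(f"{word}:{counter}".encode()).digest()
        positions.add(int.from_bytes(digest[:8], "big") % F)
        counter += 1
    sig = 0
    for p in positions:
        sig |= 1 << p
    return sig

def block_signature(words, F: int, m: int) -> int:
    """Bit-OR (superimpose) the signatures of all words in a logical block."""
    sig = 0
    for w in words:
        sig |= word_signature(w, F, m)
    return sig

def may_contain(block_sig: int, word: str, F: int, m: int) -> bool:
    """True when every 1 of the word signature is also set in the block signature.

    A True answer may be a false drop; a False answer is always correct.
    """
    ws = word_signature(word, F, m)
    return block_sig & ws == ws
```

A word which is present in the block always qualifies, whereas an absent word may still qualify by coincidence, which is the source of the information loss discussed in the sequel.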
The SC-SF intermediary representation utilizes nearly 10% of the storage utilized by the corresponding text base. This is a significant improvement over the inverted file environment, which calls for a storage overhead of 100% or larger [3,10]. In this respect, SC-SF is a compromise between inverted file and full-text scanning methods. Its structure is highly modular and allows for efficient query processing in parallel architectures [8,14,9].
A desirable development would be to combine the speed and the exactness of the inverted file organization with the low storage overhead, modularity and simplicity of the signature file. The current study is an effort in the direction of improving the signature file organization by achieving a higher text compression rate as well as by eliminating the information loss involved.
Proceedings of the International Workshop on Advances in Databases and Information Systems (ADBIS'96). Moscow, September 10-13, 1996. Moscow: MEPhI, 1996.
Figure 1 shows a simplified (F=12, D=2, m=4) configuration. A two word (documents, collection) logical block of text is considered, together with its 12 bit long block signature pattern. When four single word queries (documents, collection, book and text) are processed against the intermediary binary representation of text, the latter is successful in predicting the presence of documents and collection: every single 1 in each word signature pattern has the corresponding bit position in the block signature pattern set to 1, too. The scheme is also successful in predicting the absence of the word file: bit position number 10 is set to 1 in the word signature pattern but registers a 0 in the block signature.
In the case of the single word query book, the SC-SF configuration in Figure 1 is seen to introduce information loss. The word is erroneously taken to be present in the block, because the 1s of its signature coincide with 1s set by one or the other of the two words present. Searching for the word book is said to result in a False Match or a False Drop, which is indicative of the information loss introduced by the SC-SF text compression stage. The rate at which false matches occur is reflected by the False Drop Probability (FDP) metric [4]:
FDP = Prob{signature qualifies where block does not}
Along with the storage utilized by the signature file and the method's query processing efficiency, FDP is an important performance-related parameter for SC-SF. An increase in storage utilization decreases FDP and vice versa.
SC-SF performs optimally when F, m and D assume values which make Relation (1) hold true [3]:
$F \ln 2 = m D$ (1)
When this happens, half of the binary positions in the average block signature pattern register a "1" value, the other half register a "0" value, and FDP is given by Relation (2) [12]:
$FDP = (1/2)^m$ (2)
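A short numerical sketch may help in relating the design parameters. It assumes that Relation (1) is the usual optimality condition F ln 2 = mD and that Relation (2) is FDP = 2^(-m); the function names are hypothetical. The fraction of 1s is computed under the assumption that word signatures are chosen independently and uniformly at random.

```python
import math

def optimal_m(F: int, D: int) -> float:
    """Number of 1s per word signature satisfying F ln 2 = m D (assumed Relation (1))."""
    return F * math.log(2) / D

def false_drop_probability(m: float) -> float:
    """FDP of a half-full block signature (assumed Relation (2))."""
    return 0.5 ** m

def expected_ones_fraction(F: int, D: int, m: int) -> float:
    """Expected fraction of bit positions set to 1 after superimposing
    D random word signatures, each with m ones out of F positions."""
    return 1.0 - (1.0 - m / F) ** D

# The (F=600, D=60) configuration discussed in the text:
m_opt = optimal_m(600, 60)                      # close to 7
fraction = expected_ones_fraction(600, 60, 7)   # close to one half
fdp = false_drop_probability(7)                 # 1/128
```

For this configuration the optimal m is slightly below 7 and the block signature is very nearly half-full, in line with the discussion above.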

SC-SF Vocabulary
Under the SC-SF configuration, Relation (1) holds true and the scheme appears to be able to "accommodate" a potential vocabulary of up to $V_0 = \binom{F}{m}$ distinct word signature patterns. The latter may safely be considered to be infinitely large in practice.
Let the average SC-SF block signature pattern involve x ones and F−x zeroes. Figure 2 shows the distribution of 1s for a typical (F=600, m=7, D=60) SC-SF configuration [2]. Analysis shows x to be very near the F/2 value [10]. It is noted that F/2 is usually much larger than m. This implies a very large number of potential distinct word signatures "present" in the average block signature pattern: $D_0 = \binom{F/2}{m}$.
In practice, the number D of text words placed in each SC-SF block is much smaller than the number of detected word signatures D0, i.e. D << D0. As a consequence, SC-SF introduces information loss in the form of false matches/drops. For example, for the SC-SF configuration in Figure 2, D=60 whereas $D_0 = \binom{300}{7}$.
Let w0 be a word, randomly chosen from the V0 terms in the potential vocabulary. Relation (3) calculates the probability Pw0 for w0 to have its 1s match with 1s in the average block signature pattern:
$P_{w_0} = D_0 / V_0$ (3)
As stated previously, the average SC-SF block signature pattern is one which is half-full with 1s. Each one of the 1s present in the word's signature pattern thus has a probability of 0.5 to match a 1 in the block signature. Pw0 is thus given by:
$P_{w_0} = (1/2)^m$ (4)
D.L. Lee, Y.M. Kim and G. Patel [10] note that the number of blocks containing the query term (true drops) is a parameter which may safely be ignored in single-level SC-SF configurations. This being the case in the current study, and given that D << D0, Relations (2) and (4) suggest that Pw0 is the well known false drop probability metric (FDP).
Let us assume that the actual vocabulary, hereafter referred to as the practical vocabulary, encountered in real textbases is of size V. The practical vocabulary will be a subset of the potential vocabulary defined by the $\binom{F}{m}$ patterns, so V ≤ V0. Assuming a uniform distribution of the V real words over the V0 potential words as well as of the D terms amongst the D0 patterns in the average block signature:
$V / V_0 = D / D_0$ (5)
Relation (5) suggests that when D << D0 then V << V0. Thus, one may safely assume that SC-SF deals with a practical vocabulary which is finite rather than infinite in size.
Combining Relations (1), (3), (4) and (5):
$V = D \cdot 2^m$ (6)
Relation (6) allows for the calculation of V given the values for D and m. An SC-SF configuration often used in real life applications is one where F=1000, D=100 and m=7. Utilizing Relation (6), the size of the corresponding practical vocabulary is of the order of 13,000. This value is much smaller than the potential word vocabulary for the configuration in question, i.e. $\binom{F}{m} = \binom{1000}{7} \approx 1.94 \times 10^{17}$. However, a 13,000 word vocabulary suffices for most practical applications: the number of attributes is usually less than a few hundred in a formatted database [4] and a vocabulary of 10,000 distinct terms is reported to correspond to a textbase of about 100,000 words [1].
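The figures quoted for the F=1000, D=100, m=7 configuration can be reproduced with a few lines of code. The sketch assumes that the potential vocabulary is the binomial coefficient C(F, m), that D0 = C(F/2, m), and that the practical vocabulary follows from V/V0 = D/D0; the function names are hypothetical. Both the closed form D·2^m and the value obtained from the exact ratio of binomial coefficients are shown.

```python
from math import comb

def potential_vocabulary(F: int, m: int) -> int:
    """V0: number of distinct F-bit word signatures with m ones."""
    return comb(F, m)

def patterns_in_half_full_block(F: int, m: int) -> int:
    """D0: word signatures 'present' in a block signature holding F/2 ones."""
    return comb(F // 2, m)

def practical_vocabulary(D: int, m: int) -> int:
    """V = D * 2^m, using the approximation D0/V0 = 2^(-m)."""
    return D * 2 ** m

def exact_practical_vocabulary(F: int, D: int, m: int) -> float:
    """V = D * V0 / D0, using the exact ratio of binomial coefficients."""
    return D * potential_vocabulary(F, m) / patterns_in_half_full_block(F, m)

v0 = potential_vocabulary(1000, 7)                    # about 1.94e17
v_closed = practical_vocabulary(100, 7)               # 12800
v_exact = exact_practical_vocabulary(1000, 100, 7)    # slightly above 13000
```

Either way the practical vocabulary is of the order of 13,000 words, many orders of magnitude below the potential vocabulary.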

Perfect Encoding
A new methodology is introduced, termed Perfect Encoding (PE), whereby blocks of text are replaced by a compressed, signature-like pattern. The method's analytical model is established in the sequel and, where applicable, contrasted with the current practice. Unlike the classical superimposed coding approach (SC-SF), PE establishes a 100% accurate, fully reversible intermediary representation of the textual information. Text words are assumed to be taken from a vocabulary of large but finite size V (e.g. V = 30,000-100,000 words). Each word is mapped on a unique integer w_i in the [1...V] range, e.g. by using a perfect hash function [5]. In accordance with the analysis presented in section 2, the finite vocabulary assumption should not be considered as a drawback for PE.
Similarly to SC-SF, each PE logical block of text involves a fixed, predetermined number (D) of distinct, non-common words drawn from the vocabulary in question. The block is said to comprise a message, labelled by a unique message number M, where M ∈ [1, 2, ..., Mmax]. Two messages differ for as long as there exists at least one word which belongs to only one of the corresponding blocks. The maximum number of messages encountered for a given (V, D) configuration equals the maximum message number Mmax. Thus, the latter equals the number of ways in which D items may be drawn from a population of V:
$M_{max} = \binom{V}{D}$ (7)
The minimum space required to encode each one of the Mmax messages is F bits, such that:
$F = \lceil \log_2 M_{max} \rceil$ (8)
Section 4, which follows, shows perfect encoding to fully utilize the encoding capacity of the F bits long block signature pattern. Assuming a typical F=1000, D=100 configuration for PE, the size of the vocabulary supported is calculated to be of the order of 38,900 words. The latter is nearly 3 times larger than the one calculated for the corresponding SC-SF organization in section 2.
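The vocabulary size quoted for the F=1000, D=100 configuration can be checked numerically. The sketch below assumes that the number of messages is the binomial coefficient C(V, D) and that F is the smallest number of bits able to number all of them; the function names and the binary search are illustrative choices rather than part of the method.

```python
from math import comb

def pe_signature_bits(V: int, D: int) -> int:
    """Smallest F such that 2**F >= C(V, D), i.e. ceil(log2 C(V, D))."""
    return (comb(V, D) - 1).bit_length()

def max_vocabulary(F: int, D: int) -> int:
    """Largest V whose C(V, D) messages still fit in an F-bit signature."""
    limit = 1 << F
    lo, hi = D, 2 * D
    # Grow the upper bound until it no longer fits.
    while comb(hi, D) <= limit:
        lo, hi = hi, hi * 2
    # Invariant: comb(lo, D) <= limit < comb(hi, D)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if comb(mid, D) <= limit:
            lo = mid
        else:
            hi = mid
    return lo

v_supported = max_vocabulary(1000, 100)   # of the order of 38,900
```

Note that `pe_signature_bits(V, V)` evaluates to zero: when the block holds the whole vocabulary there is a single possible message and nothing needs to be encoded.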

PE Signature Creation and Query Processing
Given the vocabulary size V and the blocking factor D, Relation (7) calculates the total number of distinct messages produced. In addition, each word is hashed on a unique word number w_k ∈ [1...V], where k=1,...,D marks the position of the word in the block of text considered.
To construct the PE block signature pattern, the block is mapped on a message number M which is calculated by considering the D word numbers present. More specifically, the D word numbers are first arranged in descending order: w_D < w_{D−1} < ... < w_1. The subscripts are rearranged so that w_1 and w_D correspond to the largest and the smallest word numbers present, respectively. The w_D w_{D−1} w_{D−2} ... w_1 sequence of word numbers is taken to temporarily identify the corresponding message. Temporary message identifiers of this type are sorted in descending lexicographic order. Once sorted, the message with the "largest" temporary identifier is assigned message code number M=0, followed by message code number M=1, etc. The message which corresponds to the "smallest" identifier is assigned message code number $M = \binom{V}{D} - 1$. Each message code number M is used to uniquely identify the message in question. The scheme produces the most compact form of block (message) signature.
For the PE method to efficiently construct the block signatures, message number M must be computable by means of an expression which involves the block's w_1, w_2, ..., w_D numbers. Relation (9) is the expression used for the message encoding scheme in question:
$M = \sum_{k=1}^{D} \binom{V - w_k}{k}$ (9)
For example, considering the seventh row in Table 2, one has w_4=4, w_3=6, w_2=8 and w_1=9. Relation (9) calculates the corresponding M (message code number) value as follows:
$M = \binom{5}{4} + \binom{3}{3} + \binom{1}{2} + \binom{0}{1} = 5 + 1 + 0 + 0 = 6$
Following the construction of the PE signature file, information is retrieved by considering a single word with word number q and by asking whether q is present in message M. The pseudocode in Figure 3 processes q against M and an exists=true outcome implies that q is one of the D words present in M. Appendix I refers to the principle behind the just introduced PE block signature construction algorithm.
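The message numbering described above may be sketched as follows. The encoding assumes that the message number is the sum of the binomial terms C(V − w_k, k), which reproduces the ordering described in the text (largest identifier first, numbering from 0). The decoding and membership routines are a straightforward greedy unranking and are not the query processing algorithm of Figure 3; all function names are hypothetical.

```python
from math import comb

def encode(word_numbers, V: int) -> int:
    """Message number of a block: sum of C(V - w_k, k), with w_1 the
    largest word number and w_D the smallest."""
    ws = sorted(set(word_numbers), reverse=True)   # w_1 > w_2 > ... > w_D
    return sum(comb(V - w, k) for k, w in enumerate(ws, start=1))

def decode(M: int, V: int, D: int) -> list:
    """Recover the D word numbers (ascending) from message number M."""
    words = []
    w = 1
    for k in range(D, 0, -1):            # smallest word number first
        # Advance w until the term C(V - w, k) fits in what is left of M.
        while comb(V - w, k) > M:
            w += 1
        M -= comb(V - w, k)
        words.append(w)
        w += 1
    return words

def contains(M: int, q: int, V: int, D: int) -> bool:
    """Exact membership test: decode the block and look for q."""
    return q in decode(M, V, D)

# The V=9, D=4 example: block {4, 6, 8, 9}
M = encode([4, 6, 8, 9], V=9)            # 6
block = decode(M, V=9, D=4)              # [4, 6, 8, 9]
```

Since decoding recovers the block exactly, a query never produces a false drop, at the price of more arithmetic per block than a bitwise signature comparison.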
The kc term in Figure 3 is calculated recursively by an expression in which U(x) symbolizes the floor of log2 x, i.e. $U(x) = \lfloor \log_2 x \rfloor$. At this point, it should be noted that the present study does not emphasize establishing the most efficient query processing algorithm for PE. Rather, it aims at outlining a new framework which achieves the highest possible compression of information without introducing information loss. In this respect, the relative performance of various signature file variations may be checked against that of PE. The PE framework may also be used as a guide for creating other information encoding/decoding methodologies of the signature file type, which are less efficient than PE in storage utilization but faster.
exists = false; current = $\binom{V-q}{D}$;
while (current < M) and (current < $\binom{V}{D}$) {
  a = 0;
  while (a < q) and (current < M) {
    c = 1;
    repeat until (c = 2^a) or (M < current) {
      calculate the kc number;
Figure 3: Query processing pseudocode for PE

Comparison: PE vs. SC-SF
In the lines which follow, PE is considered next to SC-SF with regard to the storage overhead introduced as well as the efficiency in processing user queries. An advantage of perfect encoding is that it introduces zero information loss. As a consequence, no full text scanning operation needs to be conducted during the query processing stage. Apart from the CPU overhead saved, this fact also implies additional storage savings, as the textbase does not need to be present at the end user's site in a network based realization.

Storage Overhead
Let us refer to a message by the term Perfect Encoding Block Signature (PE signature) to differentiate it from the classical block signature, which is termed SC-SF Block Signature (SC-SF signature). Let also S be the size, in bits, of the signature file for both the PE and SC-SF configurations. The signature file, together with the file of pointers marking the beginning of each logical block of text, comprises the total storage overhead. Let S_Ovh stand for the percentage increase in storage introduced by the intermediary representation. Symbolizing by F the size of the block signature, by l the size (in bits) of the average word and by N the number of words in the document, S_Ovh is given by Relation (10), which suggests that the F/D ratio may be interpreted as a quality measuring factor: for as long as F/D remains fixed, S_Ovh remains constant.
Perfect encoding starts differing from the classical signature file when one considers the dependency of S_Ovh on D. For the SC-SF method, F, D and m comprise a set of design parameters. Furthermore, Relation (1) has been found to optimize S_Ovh and the false drop probability rate for SC-SF. Thus for SC-SF:
$S = (N/D) F = N m / \ln 2$ (11)
The right hand side of Relation (11) suggests that in the case of SC-SF, S is independent of D.
In the case of PE, full utilization is made of the encoding capability inherent to the F bits long block signature pattern:
$F = \log_2 M_{max}$ (12)
Combining Relations (7) and (12):
$S = (N/D) \log_2 \binom{V}{D}$ (13)
As derived in Appendix II, a D → D+1 change results in a negative variation for S, namely ΔS < 0. In other words, the larger the block size D becomes, the smaller the storage overhead introduced by PE.
From Relation (1) for SC-SF and Relations (7) and (12) for PE, F is plotted against D in Figure 4. The text base environment considered in this example consists of N=1000 words, taken from a vocabulary of size V=400. Two curves are plotted for the SC-SF scheme, one which involves word signatures with m=1 and a second with m=7. The two SC-SF configurations are labelled SC-SF(m=1) and SC-SF(m=7), respectively. It is noted that the two SC-SF lines in Figure 4 do not extend past the F=V=400 value. When F=V, it is simpler and more efficient to map each word on a separate bit position in the block signature. It is simpler because the number of bit positions in the block signature pattern equals the number of words in the vocabulary: one may thus establish a simple one-to-one correspondence between words in the vocabulary and numbers in the [1,...,F] range. This special case is considered in section 6 and results in a simple information encoding/decoding organization labelled Exactly Reversible Signature File (erSF).
The PE structure in Figure 4 is seen to utilize less and less storage (F) as D approaches the vocabulary size (V). As expected, SC-SF(m=7) utilizes more storage than SC-SF(m=1). The latter achieves a text compression rate higher than that of the PE structure up to nearly D=250. However, both the SC-SF(m=1) and SC-SF(m=7) schemes introduce information loss (i.e. false drops), whereas PE does not. As expected, the PE storage overhead drops to zero when D=V. When the block size equals that of the vocabulary, each block is sure to contain all the words: a fact which needs zero bits to be encoded.
In the classical SC-SF configuration, Relation (11) suggests that S depends only on m and remains constant as D varies. The PE curve was produced by utilizing Relation (13). In accordance with the analysis, S decreases monotonically as D increases. Section 6 comments on the erSF curve shown in Figure 5.
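The behaviour of the curves discussed above can be explored numerically. The sketch assumes F = mD / ln 2 for SC-SF (Relation (1) taken as the usual optimality condition), F = log2 C(V, D) for PE, F = V for erSF, and a signature file of N/D blocks; pointer files are ignored and all function names are hypothetical.

```python
import math
from math import comb

def scsf_signature_bits(D: int, m: int) -> float:
    """SC-SF block signature size under the assumed F ln 2 = m D."""
    return m * D / math.log(2)

def pe_signature_bits(V: int, D: int) -> float:
    """PE block signature size: log2 of the number of possible messages."""
    return math.log2(comb(V, D))

def ersf_signature_bits(V: int) -> float:
    """erSF block signature size: one bit per vocabulary word."""
    return float(V)

def file_size(N: int, D: int, bits_per_block: float) -> float:
    """Signature file size in bits for N words split into blocks of D words."""
    return N / D * bits_per_block

V, N = 400, 1000
for D in (50, 100, 200, 250, 270, 400):
    print(D,
          round(file_size(N, D, scsf_signature_bits(D, 1))),
          round(file_size(N, D, pe_signature_bits(V, D))),
          round(file_size(N, D, ersf_signature_bits(V))))
```

Under these assumptions the SC-SF(m=1) file size stays constant in D, the PE file size shrinks as D grows and reaches zero at D=V, and the PE and erSF values are closest around D=V/2.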

Query Processing Efficiency
In the case of the classical SC-SF, the complexity of the query processing algorithm is linear in m, the number of bit positions set to 1 in the average word signature pattern. Moreover, one needs to take into consideration the extra processing overhead associated with full text scanning, which is necessary in order to resolve the false drops present in the SC-SF output.
Perfect encoding involves no information loss but its query processing algorithm (Figure 3) is less efficient than that of the SC-SF structure. It is for this reason that perfect encoding is presented here as a framework for signature file organization rather than as a methodology directly applicable in real life. Section 5.1 presents a case where this approach pays off: erSF involves zero information loss and a simple signature file structure, leading to query processing performance comparable to that of SC-SF.

Exactly Reversible Signature File
It is obvious that PE falls behind SC-SF in query processing efficiency. However, the exactly reversible signature file variation introduced in section 5.1 is indicative of the way PE may be considered as a framework for improving the performance of signature file organizations. The erSF structure efficiently encodes/decodes textual blocks into block signature patterns in a way which introduces zero information loss.
More specifically, it is more efficient to replace SC-SF by erSF for D values at which the SC-SF block signature size F is greater than or equal to the size V of the vocabulary used. Let D_{F=V} be the value of D for which F=V. For the curves in Figure 4, D_{F=V}=278 for SC-SF(m=1), and D_{F=V}=40 for SC-SF(m=7). Thus, the erSF structure encodes text with a direct word-to-bit position mapping mechanism when D=D_{F=V}. The storage overhead associated with the lookup table for the word-to-bit position mapping may be avoided by utilizing a perfect hash function.
The erSF storage utilization curve is plotted in Figure 5. Perfect encoding is seen to achieve a higher text compression rate when compared to erSF. However, it is worth noting that when D approaches the V/2 value, the two curves converge to each other, with PE always being better (lower S values) than erSF. Consequently, the erSF structure achieves a text compression rate which is nearly as good as that of the PE scheme when D ≈ V/2. In contrast to the SC-SF structure, erSF involves zero information loss. This fact, together with the simplicity of its structure which implies efficient query processing, makes erSF a good choice for the D ≈ V/2 region. A scheme which would approach perfect encoding in other regions of D values, much like erSF does in the D ≈ V/2 region, is a desirable objective and a subject of further research.
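The direct word-to-bit position mapping behind erSF can be sketched in a few lines. Word numbers are assumed to lie in the [1...V] range, as produced by a perfect hash function; the function names are hypothetical.

```python
def ersf_signature(word_numbers) -> int:
    """V-bit block signature: bit (w - 1) is set for every word number w."""
    sig = 0
    for w in word_numbers:
        sig |= 1 << (w - 1)
    return sig

def ersf_contains(sig: int, q: int) -> bool:
    """Exact membership test: inspect the single bit assigned to q."""
    return (sig >> (q - 1)) & 1 == 1

def ersf_decode(sig: int) -> list:
    """Recover the full list of word numbers present in the block."""
    return [i + 1 for i in range(sig.bit_length()) if (sig >> i) & 1]

sig = ersf_signature([4, 6, 8, 9])
present = ersf_contains(sig, 8)     # True
absent = ersf_contains(sig, 5)      # False
block = ersf_decode(sig)            # [4, 6, 8, 9]
```

A query inspects a single bit position and never yields a false drop, which accounts for the simplicity and the query processing efficiency attributed to erSF above.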

Conclusion
A new approach to information encoding and retrieval is introduced. Perfect Encoding (PE) is characterized by the following:
• A finite sized vocabulary of words is considered, this being the case in most practical applications.
• The method does better than the classical Superimposed Coding Signature File (SC-SF) by (a) involving a fully reversible signature pattern, (b) achieving a higher degree of information compression and (c) supporting a larger vocabulary of distinct terms in practice.
• The scheme comprises a framework for measuring the performance of signature file based information encoding structures. For D ≈ V/2, a simple PE variation is introduced which is shown to achieve better performance than that of SC-SF.
Provided that its efficiency at the query processing stage is improved, perfect encoding has the potential to evolve into an alternative or complementary scheme next to the currently used inverted or signature file based information retrieval realizations.
Thus, we have: Since k varies in the [1,...,D] range it follows that: Combining the last two equations, one gets: Thus, for the PE environment, the larger the block size D becomes, the smaller the introduced storage overhead.

Figure 4: The V=400, N=1000 environment: block signature size dependency on block size for PE and SC-SF

Figure 5: The V=400, N=1000 environment: signature file size dependency on block size for PE, erSF and SC-SF

Table 1 considers a number of SC-SF variations whereby F=60 and lists the values of V calculated by Relation (6) next to $V_0 = \binom{F}{m}$. Without compromising on the efficiency of SC-SF, it is seen that as D becomes larger, the size of the real word vocabulary becomes smaller. Clearly, V0, an infinitely large number for most practical applications, should not always be considered as the vocabulary size supported by SC-SF.

Table 1: SC-SF vocabulary size dependency on D and m when F=60

Table 2 lists PE message code numbers assigned to blocks in the case of a V=9, D=4 environment.