Structured text retrieval by means of affordances and genre

This paper offers a proposal for some preliminary research on the retrieval of structured text, such as extensible mark-up language (XML). We believe that capturing the way in which a reader perceives the meaning of documents, especially genres of text, may have implications for information retrieval (IR) and in particular, for cognitive IR and relevance. Previous research on ‘shallow’ features of structured text has shown that categorization by form is possible. Gibson’s theory of ‘affordances’ and genre offer the reader the meaning and purpose - through structure - of a text, before the reader has even begun to read it, and should therefore provide a good basis for the ‘deep’ skimming and categorization of texts. We believe that Gibson’s ‘affordances’ will aid the user to locate, examine and utilize shallow or deep features of genres and retrieve relevant output. Our proposal puts forward two hypotheses, with a list of research questions to test them, and culminates in experiments involving the studies of human categorization behaviour when viewing the structures of emails and web documents. Finally, we will examine the effectiveness of adding structural layout cues to a Yahoo discussion forum (currently only a bag-of-words), which is rich in structure, but only searchable through a Boolean search engine.


INTRODUCTION
IR overlaps with numerous other fields of research: artificial intelligence (AI), human computer interaction (HCI) and natural language processing (NLP), to name but a few. This paper, however, focuses on overlaps of visual perception, genre and AI, merging and utilizing these for one particular goal: structured information text retrieval through skimming and categorization. The difficulties inherent in IR are many and various, but the over-riding problem has been how to find and display results with the highest relevance to users' needs, whether these are in the form of video, text, audio files or images. As a result, searching has been perceived as the most notable aspect of IR and has, up to now, been given the most recognition. Research involving the retrieval of structured text or document retrieval, however, is currently progressing at a rapid pace [1,2]. As the quantity and size of XML or extensible hypertext mark-up language (XHTML) document collections continue to expand (as web and digital libraries, for example), the need for IR systems which exploit structured text, as opposed to traditional bag-of-words (B-O-W) based IR systems, is also increasing. Structured textual documents are normally composed of several layers or sections which together form genres of text preserved, in particular, in XML/XHTML. A genre in this context is the set of structures, layout and style of writing (or, as Dewdney et al. [3] state, "conventions") which show the user the document's purpose and form through its structure, regardless of the topical nature of the writing. Since XML retains genre information, such documents should be explored for genre.
Genres have been debated for thousands of years; see, most notably, Plato and his "Theory of Forms" [4]. Genre has been cited in relation to the European Romantic movement of the 18th and 19th centuries, the Russian Formalists [5] and, arguably most famously, by Bakhtin [6] in his essays on 'Speech Genres'. Genres have been used to categorize many things, such as music (prog-rock and punk), literary works (tragedy, comedy, etc.) and, as Yates et al. [7] explained, organizational communication: "In structurational terms, genres are social institutions that are produced, reproduced, or modified when human agents draw on genre rules to engage in organizational communication".
Our current work [8], which originally focused on the shallow features of genre, has now been extended to analyse the identification processes and actions employed by readers when searching for a relevant text, especially when skimming. Following Gibson's theory of 'affordances' [9], we examine the 'deep' features of genre, which are used by readers to determine the meaning and purpose of texts. Genre could hold great potential benefits for organisations, both financially and administratively, by allowing automatic and rapid information retrieval without the need for manual organization and sorting. In particular, the sorting and filtering of emails would benefit large organisations by improving their operational capability and reducing time-consuming tasks. Section 2 describes the background to this research: XML, text categorization, genre, deep and shallow parsing of genre features, visual perception and affordances. Section 3 lists our hypotheses, whilst Section 4 outlines proposed experiments and issues for discussion. The paper closes with conclusions drawn from our research.

XML
XML and its offshoot XHTML are becoming dominant formats for managing structured text, especially on the world wide web (WWW), and are in widespread use in many fields, varying from digital libraries and electronic publishing to the so-called 'semantic web', for example, Wikipedia (XHTML). Although this research is mainly concerned with whole document retrieval, there may also be benefits for partial section retrieval. The form or structural information contained within the interface of specific XML, or maintained in the tag information, needs to be exploited. "The increase in storage of digital documents in XML format has brought about the explosion in the development of systems to store, access and exploit the logical structure of such corpuses." (Lalmas et al. [10]). In this project, we investigate the usefulness of form features for the retrieval of structured documents. During earlier experiments, in press [8], useful indicators were identified which demonstrate the effectiveness of using genre for the purpose of text categorization, also known as text classification. We have found that genres can be discriminated by utilizing the shallow features of tags and grammar [11], that is, the structural information (or form) within the XML/XHTML documents.
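As an illustration only, the idea of a shallow form feature drawn from tag information can be sketched in a few lines of Python. The sketch below is not the feature set used in our experiments [8,11]; the function name tag_profile and the sample memo document are hypothetical, and the sketch assumes that the relative frequency of each element tag is one plausible structural feature that ignores the words of the text entirely.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_profile(xml_text):
    """Return the relative frequency of each element tag in an XML document.

    Hypothetical helper: it only illustrates a 'shallow' structural feature
    and makes no use of the textual content of the document.
    """
    root = ET.fromstring(xml_text)
    # Strip any "{namespace}" prefix so that XHTML tags read as plain names.
    counts = Counter(el.tag.rsplit("}", 1)[-1] for el in root.iter())
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


# Invented sample document; the tag names are assumptions for illustration.
sample = "<memo><to>A</to><from>B</from><body><p>Hi</p><p>Bye</p></body></memo>"
profile = tag_profile(sample)
print(profile)
```

Two documents of the same genre would be expected to yield similar profiles even when their topics differ, which is the intuition behind discriminating by form rather than by keywords.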

Text categorization
The categorisation of documents is normally implemented by labelling and classification: "Text categorization (TC - also known as text classification, or topic spotting) is the task of automatically sorting a set of documents into categories (or classes, or topics) from a predefined set." (Sebastiani [12]). Both genre classification and text classification group collections of documents into smaller groups which are usually pre-labeled; genre classification discriminates by the form (and sometimes content, style, function, etc.) of the documents, whilst text classification traditionally discriminates by topic using such features as keywords. The applications of text categorization are numerous: automatic essay grading, spam filtering and, of course, genre filtering. Most categorization of digital media is based around the topical aspect of a document; research into genre therefore needs to diversify. Many authors, such as Rauber et al. [13], Benno Stein and zu Eissen [14], and Sebastiani [12], have recognized the value of genre in conventional libraries and information searches. Given the increase in diversity of document genres in digital libraries, there is a pressing need for improvements in the organization and management of documents. Most retrieval is based on structure and content in collections, and not solely on structure, but this needs to change, as genre types provide excellent features for distinguishing among types of documents.

Genre
Broadly speaking, genre research can be divided into two main schools of thought [15,16]: the North American School and the Sydney School. The former views genre as a socio-historical, rhetorically-oriented concept, with the emphasis placed on how texts function in social and interactional contexts [7], while the Sydney School is based on an applied linguistic approach, with the focus on formal textual features. For more information, the reader is directed to Breuer's [17] article. Today, a search for 'genre' in dictionaries usually reveals definitions such as: the classification of movies or literature as, for example, 'westerns', 'short stories' or 'detective novels'. Although, of course, this definition has some relevance, the term 'genre' embodies a much wider range of contexts. Watt [18], for example, refers to the opportunities offered by the "socially constructed communicative behaviours called genres to improve the efficiency of communal activities" and Yates et al. [7], in their pioneering work on the concept of genre, suggest that "Genres (e.g., the memo, the proposal, and the meeting) are typified communicative actions characterized by similar substance and form and taken in response to recurrent situations", which can be used to identify types of organizational communication. Current research on genre is based around three areas of interest [14]: literary theories of genre (including kinds of literature [6]), automatic genre identification, and genre and the WWW [19,20]. Although our research is mostly focused on automatic genre identification and genre and the WWW, the literary genres with the shallow and deep approaches explained in Sections 2.4 and 2.5 are also applicable to genre identification and genre and the WWW. A search engine using the structural information implicit in most documents would assist the user to retrieve the correct type of information and articles, with high performance and effective filtering of the results. This is the key to genre: it is a way of exploring the standardized types of communication that emerge within a community of practice [21]. There are many diverse types of genres (academic and scientific articles, biographies, news articles, memos, newsletters) and we have noted that new and emergent genres evolve gradually on email exchanges and social networking sites, for example, Wikipedia. Standard approaches to genre classification are often inadequate because the standard interpretation of user needs only takes into account the need to discriminate by topic, not by genre. When a user inputs a query into a search engine, this usually results in a wide range of documents of different topics being returned which need to be sifted through. For example, if the search query 'Tom Clancy Biography' is entered, the results returned contain interviews, web chats, biographies, book reviews and so on. We employ a genre-based approach, incorporating the deep and shallow parsing of genre features, to examine the textual features of documents, to categorize the purposes for which the text has been written and to identify the form elements which are common to various genres. We argue that essential improvements in performance can be achieved by using structural information to filter and reduce the cognitive load.

Shallow parsing - genre retrieval
Numerous shallow feature parsing experiments have been carried out with regard to XML/XHTML retrieval [22,23], but not on structured text for XML and XHTML genre research. Many papers have, however, been written on web genre research techniques [24,14,25,26] which are applicable to most digital collections. There are several underlying concepts that are persistent in genre definitions: the style, form, content and functionality of the document. Web genres incorporate the style, form and content of the document, which are orthogonal and not related to the topic or classification of many genres. These three concepts are probably applied most consistently within WWW IR circles. The conceptual features of style, format, content and functionality can be used with other digital document formats, especially XML. Campbell & Toms [27] suggest that the conceptual features consist of a grouping of unique facets or levels, i.e. function, form and interface. The function is representative of the meanings of the words contained within the documents, the form refers to the layout or appearance and, finally, the interface is the way in which the document is read or used. By looking at these conceptual features, many pieces of genre classification work can be seen to fit with the concepts they describe (Table 1). Taking style, form, content and function into consideration, there are hundreds of features that can be measured, and the normal practice is to group them as feature sets. There is some debate regarding the overlapping of the sets to which some features can be assigned, but this overlap does serve a genuine purpose. It enables the classification to be tested against each feature set, for example, style versus form. Some documents do, of course, provide visual markers that allow the reader to conceptualize the format. Documents contain distinguishable features, such as patterns, which allow the reader to identify the purpose and content of a document (Figure 1). A good account of the debate is found in Luštrek [31], who defines these features and concepts. We are proposing a new approach or set of features, but this still belongs to the form (or structural) concept.

[Figure 1. Call for Papers]
Although shallow parsing methods for automatic genre retrieval are important to the project, our intention is also to extend the deep parsing methods to test whether genre could be useful for skimming texts.

Deep parsing - skimming for genre
Skimming is a reading technique which is utilized to identify the main purpose of a text. It is performed at a speed several times faster than conventional reading and is normally used when a reader has a large amount of text to read and does not need to understand every word, for example, when a student has to perform a literature search. This technique dovetails neatly with the theory of affordances and genre. Watt [32] states: "These [genres] are there to reduce cognitive load - there is no need for a person to read the whole text, the genre provides these filtering cues in its structure." The importance of genre for skimming text has yet to be realised in IR. Skimming has previously been used in the natural language processing (NLP) framework, for example, by DeJong [33] and Mauldin [34]. DeJong developed a natural language parser named fast reading and understanding memory program (FRUMP) and Mauldin extended this parser (McFRUMP) in his flexible expert retrieval of relevant English text (FERRET) system. The skimming systems detailed above all operate within narrow domains. Our research targets broader domains in IR, using emails, technologically structured documents and also documents structured through social consensus. Watt's [35] Open Book and Sentinel applications (web and email respectively) were inspired by DeJong's FRUMP (Predictor/Substantiator). We contend that documents acting as affordances (Section 2.6) can also represent form and meaning (or purpose) to the reader by utilizing the genre rules, structure and patterns.

Visual perception and affordances
Many approaches have been developed for the study of visual perception, for example, Gestalt Theory, according to which "an image tends to be perceived according to the organization of the elements within it, rather than according to the nature of the individual elements themselves" [36]. The laws of closure, proximity and similarity are important to this theory, but a full description is beyond the scope of this review. Another example is provided by Marr, who identified different stages of perception from those identified by Gibson and, most importantly, put forward the theory that the final stage in perception was recognition and not action [36]. The bottom-up approach and notion of affordance was first introduced by J. J. Gibson [9] to explain his theories of visual perception. He proposed an alternative direct perception framework [37], which cuts against the grain of traditional belief, and his theory of 'affordances', in particular, is still very influential in the field of ecological psychology. Gibson [9] coined the term "affordance" and then explained how an affordance offers an 'animal' a particular actionable possibility which is independent of the animal's ability to visually perceive any possibility. He claimed that we perceive the affordance properties of the environment in a direct and immediate way. He discounted the theories contained within physics and moved towards the ideas of ecological psychology.
To explain affordances, we will briefly examine Gibson's cornerstone work. He defined affordances in the context of visual perception by explaining how animals visualise at the same levels of mediums, surfaces and atoms and perceive what the combinations of the three offer: "…the affordances of the environment are what it offers the animal, what it provides or furnishes, either for good or ill." [9] According to Gibson, affordances have the following attributes:
1. Affordances provided by the environment are what it offers, what it provides, what it furnishes, and what it invites.
2. The "values" and "meanings" of things in the environment can be directly perceived. The "values" and "meanings" are external to the perceiver.
3. Affordances are relative to animals. They can only be measured in ecology, not in physics.
4. An affordance is an invariant.
5. Affordances are holistic. What we perceive when we look at objects are their affordances, not their dimensions and properties.
6. An affordance implies the complementarity of the perceiver and the environment. It is neither an objective property nor a subjective property, and at the same time it is both. It cuts across the dichotomy of subjective-objective. Affordances only make sense from a systemic point of view.
Examples of affordances are given by Zhang and Patel [38], who have attempted to compile a taxonomy in which they categorize possible affordances. The difficulties inherent in compiling such a work are obvious: take, for example, a bunch of grapes, which affords very different things to different types of animals. Grapes can afford food, pleasure and sustenance to a human, or poison and death to canines.
Unfortunately, the examples given by Zhang and Patel regarding cognitive and perceptive affordances are a little confused. The two perceptual examples, "stove-dials and hot-plates" and "men's and women's toilet signs", perhaps belong in a different category, under semiotics. We argue that a genre 'affords' its meaning and purpose, in that the reader perceives the invariants in the ambient optic array, and we intend to locate, examine and hopefully utilize the invariant cues (or layout of texts), such as whitespace, for the purposes of structured text retrieval.
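To make the notion of invariant layout cues more concrete, the following Python sketch computes a small whitespace 'signature' for a plain-text document. It is offered only as an assumption about how such cues might be measured, not as a method we have tested: the function name layout_signature, the four measures it returns and the sample email are all hypothetical.

```python
def layout_signature(text):
    """Summarise whitespace layout cues of a text, ignoring what the words say.

    Hypothetical measures, chosen for illustration only:
      blank_ratio      - share of lines that are blank
      indented_ratio   - share of non-blank lines starting with a space or tab
      mean_line_length - mean length of non-blank lines, in characters
      blocks           - number of runs of consecutive non-blank lines
    """
    lines = text.splitlines()
    if not lines:
        return {"blank_ratio": 0.0, "indented_ratio": 0.0,
                "mean_line_length": 0.0, "blocks": 0}
    blank = [not line.strip() for line in lines]
    non_blank = [line for line in lines if line.strip()]
    indented = sum(1 for line in non_blank if line[0] in " \t")
    # A block starts at a non-blank line that opens the text or follows a blank line.
    blocks = sum(1 for i, is_blank in enumerate(blank)
                 if not is_blank and (i == 0 or blank[i - 1]))
    n = len(non_blank)
    return {
        "blank_ratio": sum(blank) / len(lines),
        "indented_ratio": indented / n if n else 0.0,
        "mean_line_length": sum(len(line) for line in non_blank) / n if n else 0.0,
        "blocks": blocks,
    }


# Invented sample email: salutation, indented body and sign-off form three blocks.
email = "Dear Ann,\n\n    Thanks for the draft.\n    See notes below.\n\nRegards,\nBob\n"
signature = layout_signature(email)
print(signature)
```

A signature of this kind deliberately discards the wording of the text; whether readers actually rely on comparable cues when skimming is the question the eye-tracking experiments are intended to address.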

HYPOTHESES
During this first year of research into structured text retrieval, two hypotheses have emerged. The most important of these are:

PROPOSED EXPERIMENTS
We have demonstrated, in previous experiments [11,39], that parsing shallow form features of XML technology in retrieval (text categorization) is effective, and we now wish to ascertain whether the previous XML genre experimental features [8] will transfer to other technically or socially structured document collections, such as email and WWW media, in particular XHTML. A full description of our previous experiments and the classification accuracies can be found in earlier work [40]. The plan is to explore genre features which are maintained and exploitable in the explicit XML using other highly structured corpuses, such as Wikipedia.
We also intend to examine the possibility of creating or extending skimming models, such as those found in Mauldin's FERRET [34], Salton [41], DeJong [33] and Schank et al. [42].
Although affordances in ecological psychology, genre and IR are relatively popular and well-known areas of research, combining these areas constitutes a new approach. Our new approach will also include eye-tracking experiments to obtain an accurate understanding of how humans view the invariant layout cues, such as white-space patterns or other formatting features which constitute genres, and to test whether genres act as 'affordances'. This eye-tracking technology, which can record eye movements, should allow us to test our hypothesis that when we see a text, we perceive the invariants which indicate to us the purpose of the text. In this context, the genres act as affordances which allow a reader to make a decision regarding the relevance of the text and then add meaning later, which is a main tenet in Gibson's affordances, i.e. perception is followed by action. This work has been inspired by the previous genre research carried out by Watt [18], and Campbell and Toms [27].
Two experiments have been planned so far: the first will test several genres of emails and the invariant features which are used to make decisions during the categorisation process. The second will examine an arguably evolving set of genres which are inherent in the Wikipedia collection. Finally, the structure which occurs naturally in social networks will be investigated, for example, in a community of practice, such as a Yahoo discussion room and Wikipedia. At present, the Yahoo forums offer very poor facilities for users to search through the archives of messages. One possible approach we intend to employ is that of testing retrieval on the Yahoo discussion forum corpus, in which approximately 75,000 messages have already been crawled. These messages are rich in natural language, but the question is whether retrieval is improved by exploiting genre patterns and rules which normally emerge through social consensus in a community of practice and have a distinctive purpose and form.

1. Affordances provided by the environment are what it offers, what it provides, what it furnishes, and what it invites.

TABLE 1. Concept examples
Functionality: number of links in a web page; number of e-mail links [29].
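The two functionality features named in Table 1, the number of links in a web page and the number of e-mail links, could be measured along the following lines. This is a minimal Python sketch using the standard-library HTML parser, not the procedure used in [29]; the class name LinkCounter, the function count_links and the sample page are hypothetical, and the sketch assumes that an e-mail link is an anchor whose href begins with "mailto:" and that such a link also counts towards the total number of links.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Count anchors that carry an href, and those that point at e-mail addresses."""

    def __init__(self):
        super().__init__()
        self.links = 0
        self.email_links = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            # Named anchors such as <a name="top"> are not links.
            return
        self.links += 1
        # Assumption: an e-mail link is an href that starts with "mailto:".
        if href.strip().lower().startswith("mailto:"):
            self.email_links += 1


def count_links(html):
    """Return (number of links, number of e-mail links) for an HTML string."""
    parser = LinkCounter()
    parser.feed(html)
    parser.close()
    return parser.links, parser.email_links


# Invented sample page for illustration.
page = ('<p><a href="http://example.org">Home</a> '
        '<a href="mailto:editor@example.org">Mail</a> '
        '<a name="top">Top</a></p>')
print(count_links(page))
```

Counts of this kind are independent of the topic of the page, which is why they are grouped under functionality rather than content.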