Focused Retrieval Using Topical Language and Structure

We investigate focused retrieval techniques that deal with the increasing amount of structure on the web. Our approach is to combine multiple representations of web information in a common framework based on statistical language models. In this framework, it will be possible to derive a topical language model of the actual language use on web pages about a certain topic (such as arts, business, entertainment, or education) using the unigrams and bigrams taken from the plain text of the web pages. Similarly, it will be possible to derive models of the structure of web pages to distinguish between blogs, FAQs, personal web pages, etc. Structural characteristics of a web page include, among others, tagname statistics and parent-child tags. We will build a multiple-level language model to exploit the information contained in the topical language and structure models. The .GOV2 corpus will be used as a test collection on which queries will be run on different topical categories and on web pages with different structures. We plan to develop so-called parsimonious models to derive a compact representation and to handle dependencies between representations of the data.

retrieval as well, for instance smoothing using the geometric mean and backing-off by Ponte and Croft [1998], and Dirichlet smoothing and absolute discounting by Zhai and Lafferty [2001]. In a sense, all smoothing approaches combine two representations of the data (a document model and a collection model) into a new representation. We plan to use similar approaches to combine many more document representations into a single representation.
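To make the combination of a document model and a collection model concrete, here is a minimal sketch of Dirichlet smoothing in the style of Zhai and Lafferty [2001]. The function names and the default value of mu are our own illustrative choices, not part of the proposal:

```python
import math
from collections import Counter

def dirichlet_score(query_terms, doc_tokens, coll_tf, coll_len, mu=2000.0):
    """Query log-likelihood under a Dirichlet-smoothed document model:
    p(t|d) = (tf(t,d) + mu * p(t|C)) / (|d| + mu),
    i.e. the document model backed off to the collection model."""
    tf = Counter(doc_tokens)
    score = 0.0
    for t in query_terms:
        # collection (background) model, floored to avoid log(0)
        p_coll = max(coll_tf.get(t, 0) / coll_len, 1e-10)
        score += math.log((tf[t] + mu * p_coll) / (len(doc_tokens) + mu))
    return score
```

A document that contains a query term always outscores an otherwise identical document that does not, while unseen terms are still assigned nonzero probability via the collection model.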

Description of proposed research
Large-scale, general-purpose web search engines (most notably Google) have been quite successful in keeping up with the size of the web by adding new sources of information to traditional full-text indexes, for instance anchor texts and hyperlink structure. However, the web is growing out of reach of these techniques, and companies like Google have started to offer specialized web search services focusing on, for instance, internet shopping (Froogle) and scientific documents (Google Scholar, inspired by CiteSeer). Such services use lay-out analysis techniques and simple information extraction techniques that exploit web conventions (extracting, e.g., product names, prices, and author names) to come up with domain-dependent structured document representations that are far more complex than the full-text/anchor-text/link-structure representations. As such, those services provide very accurate and focused search.
However, specialized search engines only provide focused search if the user is first able to find the specialized search engine of his/her choice, if one that caters for the user's problem exists at all. Whereas specialized search engines are part of the answer to the increasing size of the web (they identify more structured information and provide more focused search), they also introduce new information overload problems. We believe we can have the best of both worlds: very focused and accurate search without the need for the user to preselect the domain beforehand.
Our main research problem is the following:
• Can we provide accurate and focused search by adding structure and combining multiple, complex representations within a common information retrieval modeling framework using generic web retrieval models?
There is an increasing amount of structure in documents on the web. Like the specialized search engines mentioned above, we intend to develop structured document representations, where the structure is derived from the document text, the document structure, the URL, the hyperlink structure, time, geographic location, text classification, manually assigned metadata, and more. All these sources of evidence can provide crucial retrieval cues. We will start by focusing on structure derived from the document text and the document structure. Hence, our first research question is:
• Can we use topical language and topical structure similarity to improve retrieval results?

Topical language similarity
We will investigate the particular language use for specific topics and types of web pages and build localized models for them. For example, we can consider the type of language used on educational web pages or on home pages. We can anchor the topical content types on Internet directories as provided by the Open Directory, Yahoo!, or Google, or on on-line encyclopedias such as Wikipedia [Rode, 2004]. Particular language usage provides a layer of topical language models that can be incorporated in the language modeling framework.
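As an illustration, such a topical language model can be sketched as a maximum-likelihood unigram model over the pages filed under one category, interpolated with a general background model when scoring. The function names and the interpolation weight are our own assumptions, not part of the proposal:

```python
import math
from collections import Counter

def topical_model(category_docs):
    """Maximum-likelihood unigram model over all pages filed under one
    topical category (e.g. one Open Directory branch)."""
    counts = Counter()
    for tokens in category_docs:
        counts.update(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def topical_score(query_terms, topic_lm, background_lm, lam=0.5):
    """Interpolate the topical model with a background model and score
    the query, so unseen terms do not zero out the likelihood."""
    return sum(math.log(lam * topic_lm.get(t, 0.0)
                        + (1 - lam) * background_lm.get(t, 1e-9))
               for t in query_terms)
```

For an 'Education' model built from educational pages, a query term like 'course' would then score higher than an off-topic term such as 'price'.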

Topical structure similarity
Whereas a typical search engine will disregard almost all of the document markup and focus primarily on the textual content presented to the user, we expect that there are valuable retrieval cues in structural aspects of web pages. This is highly related to the emergence of web conventions, in which specific types of web content have adopted a similar look-and-feel. As such, there is great structural similarity between, for example, home pages of people, on-line product pages, FAQs, blogs, etc. Abstracting the structure of pages provides another layer of structural language models that can be incorporated in the modeling framework.
Besides using the information that we can find in the document collection, we also want to use the information that can be provided by the user, implicitly or explicitly. A bottleneck for providing more focused retrieval is the shallowness on the client side, i.e., users who provide no more than a few keywords to express their complex information needs. We want to use implicit and explicit feedback information to improve and to group the retrieval results. Therefore, our second research question is:
• Can we use topical language and structure similarity in combination with implicit and explicit feedback to enhance the user's search experience?
Implicit information about the user's geographical location might for instance come from his/her IP number, or implicit user preferences might be derived from click-through information. Explicit feedback can be obtained in two ways. It can be in the form of a suggestion that is relevant to the given query, e.g., "Do you want to focus on documents from around 11 September?" or "Are you looking for a person's home page?". Using this feedback, a new ranked list of results will be provided. The second option is to cluster the retrieved results into the predefined topical and structural categories.

Outline of Models
Our modeling framework will be based on so-called statistical language models for information retrieval [Hiemstra, 2001]. The basic retrieval model has been successfully combined with models of non-content information that use, for instance, link information and URL type in web retrieval [Hiemstra and Kraaij, 2005; Kamps et al., 2004; Kamps, 2005; Kraaij et al., 2002]. Our approach is to go beyond the "document as a bag-of-words" models by bringing more and more sources of evidence into the models and to extend existing language modeling approaches in several ways.

Relevance models
Our approach to implicit feedback is related to relevance models [Lavrenko and Croft, 2003], in which the set of initially retrieved documents is included in the model as a layer between the document and the collection model. For example, we believe that there is great potential in the combination of metadata (either in the form of, e.g., the Yahoo! directory, or metadata deduced from previous queries) with derived representations. Using such metadata, we will be able to derive topical and structural models, i.e., models of the language typically used in documents on a certain topic. A model built from documents that were assessed as relevant to a query that is similar to the new query can be used to provide more targeted search. Topical language and structure models can also be used to classify (initially retrieved) documents using text categorization techniques. This, in turn, provides crucial information for adjusting smoothing parameters, for result clustering, or for asking follow-up questions like "Do you want to focus on homepages?"

Parsimonious language models
Usually, each language model representation is defined independently from the others, and they are combined later on. However, independent definitions of language modeling components lead to large, redundant combined representations. To model structurally complex document representations, we need a method that rewards parsimony. Parsimonious models [Sparck-Jones et al., 2003] explicitly address the relation between several representations of the document. As such, they effectively model for each partial representation what it adds to the document representation as a whole, thus avoiding redundancy.
It has recently been shown that parsimonious models improve retrieval performance in both text search [Hiemstra et al., 2004] and image search [Westerveld and De Vries, 2004]. Techniques from parsimonious language models allow us to define those representations in a layered fashion by explicitly addressing dependencies between document representations. As an additional but important bonus, avoiding redundancy results in significantly smaller models. This leads to a reduction of storage space and a reduction of query processing time, two key requirements for effective large-scale web retrieval.
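A parsimonious document model can be estimated with an EM-style iteration in the spirit of Hiemstra et al. [2004]: probability mass that the collection model already explains is taken away from the document model, and negligible terms are pruned. The sketch below is our own illustration; the parameter values are arbitrary:

```python
from collections import Counter

def parsimonious_model(doc_tokens, p_coll, lam=0.1, iters=20, threshold=1e-4):
    """Estimate a parsimonious document model: keep only the probability
    mass that the collection model does not already explain."""
    tf = Counter(doc_tokens)
    total = sum(tf.values())
    p_doc = {t: c / total for t, c in tf.items()}   # MLE initialization
    for _ in range(iters):
        # E-step: expected term counts attributed to the document model
        e = {t: tf[t] * lam * p_doc[t]
                / (lam * p_doc[t] + (1 - lam) * p_coll.get(t, 1e-12))
             for t in p_doc}
        z = sum(e.values())
        # M-step: renormalize, pruning terms with negligible mass
        p_doc = {t: v / z for t, v in e.items() if v / z > threshold}
    return p_doc
```

Common words that the collection model explains well (e.g. 'the') are driven toward zero and eventually pruned, which is what makes the resulting models significantly smaller than their MLE counterparts.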

Research methodology and proposed experiments
We will start by investigating whether we can exploit topical language and structural similarity of web pages in a certain topic category or with a certain structure to improve retrieval results. We expect topical language models to work best for topical categories; for example, if the topic category is 'Education', words like 'school' and 'course' are likely to occur. We also expect structural similarity to be a good indicator of the structural type of a web page; for example, a home page may contain a photo or a logo and some links to other pages. Or perhaps it is the combination of topical language and structural similarity that produces the best results.
BCS IRSG Symposium: Future Directions in Information Access (FDIA 2007)
We will also investigate whether it is possible to recognize the structural type of a web page by using a language model that takes plain text as input. For instance, words like 'welcome' and 'home page' could indicate that the structural type of the page is 'Homepage'. Similarly, it could be possible to recognize a topical category by certain structural characteristics.
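Recognizing a structural type from plain text amounts to scoring the page under each type's language model and picking the best-scoring type. A minimal sketch, with hypothetical model names and an assumed interpolation weight:

```python
import math

def classify_page(page_tokens, type_models, background, lam=0.5):
    """Score a page's plain text under each structural-type language
    model (e.g. 'homepage' vs. 'faq') and return the best-scoring type."""
    def loglik(model):
        return sum(math.log(lam * model.get(t, 0.0)
                            + (1 - lam) * background.get(t, 1e-9))
                   for t in page_tokens)
    return max(type_models, key=lambda name: loglik(type_models[name]))
```

A page containing 'welcome' and 'home' would score highest under a 'homepage' model trained on known home pages, matching the intuition described above.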
Up to this moment, we have analyzed the types of documents in the .GOV2 corpus and the queries used in previous TREC Terabyte tracks to specify a set of topic categories and a set of structural types of web pages. For practical reasons, the .GOV2 corpus will be used as the initial test collection for our experiments. To be able to create our models, we need queries that fall within our topic categories and that find pages of different structural types. We have defined 13 topic categories, including some combinations of categories; examples of topic categories are arts, business, entertainment, and education. We will reuse as many queries as possible from previous Terabyte tracks and the Million Query track. For each topic category we can find 1 to 12 queries that have been assessed in previous Terabyte tracks, and a query produces on average 200 relevant results. In total, our test set contains 8986 relevant web pages. However, for most of our topic categories, the grouping of the relevant results does not cover the full range of topics that fall within the category.
Besides topical categories, we have also started looking at different structural types of web pages; examples of structural types (not restricted to the .GOV2 corpus) are home pages, product pages, FAQs, and blogs. Unfortunately, the .GOV2 corpus does not represent most of these structural types. From past TREC results, the only type that we can retrieve and analyze is home pages. Besides reusing queries, we will therefore also create and assess a number of queries ourselves that cover each topic category and each structural type of web page that is represented in the .GOV2 corpus. Another approach to finding documents representative of a topic category, from which to build a topical language model, could be to take documents from an Internet directory such as Yahoo! or from Wikipedia. Using the results from the reused and our own queries, we can create topical language models using unigrams and bigrams taken from the plain text of the web pages. The structure of a web page is a bit more difficult to capture in a model; we can represent it by structural characteristics such as tagname statistics, parent-child tags, tree similarity, and abstract semantic labels for HTML tags. Once we have created a topical or a structural model, we can use it to score either the query or the retrieved documents. The model with the highest score can then be used as the extra layer in the language model to obtain a better ranking. We can use leave-one-out evaluation to determine whether our model is an improvement over the standard language model.

Discussion
Our research is still in an early stage, and we have only started to explore possible approaches to building more informed information retrieval models. Hence, we are open to any advice or comments on the current approach, or on alternatives and extensions of the approach. Specifically, there are a number of open issues that we would like to get feedback on:

• The standard TREC test collection does not suit all our purposes. Are there other sources of training and test data that could be relevant for our project?

• We plan to develop topics covering various topic categories and structural types. What would be a good way to create topics, and how can we structure the assessment process?

• We are very interested in the user's point of view. Are there related user studies in the literature? Can we use some of the research on implicit feedback?

Acknowledgements
This research is done in the framework of the EfFoRT (Effective Focused Retrieval Techniques) project. Other members of the research team include Dr.ir. Jaap Kamps, Dr.ir. Djoerd Hiemstra, and MSc. Rongmei Li. This research is supported by the Netherlands Organization for Scientific Research (NWO, grant # 612-066-513).

Structural characteristics of a web page:
• use tagname statistics (like unigrams)
• use parent-child tags (like bigrams)
• use unlabeled graph and tree similarity
• categorize HTML tags into more abstract semantic labels (image, section, paragraph, table)
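The first two characteristics above can be sketched with Python's standard-library HTML parser; the class name is our own, and the stack-based handling of unmatched end tags is a simplification that merely tolerates sloppy HTML:

```python
from collections import Counter
from html.parser import HTMLParser

class TagStats(HTMLParser):
    """Collect tagname unigrams and parent-child tag bigrams from HTML,
    analogous to term unigrams and bigrams over plain text."""
    def __init__(self):
        super().__init__()
        self.stack = []            # currently open tags (ancestors)
        self.unigrams = Counter()  # tagname counts
        self.bigrams = Counter()   # (parent, child) tag counts

    def handle_starttag(self, tag, attrs):
        self.unigrams[tag] += 1
        if self.stack:
            self.bigrams[(self.stack[-1], tag)] += 1
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop back to the matching open tag (tolerates sloppy HTML)
            while self.stack and self.stack.pop() != tag:
                pass
```

The resulting counters can then be normalized into "structural language models" over tags, just like the term-based models over plain text.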