We investigate focused retrieval techniques that deal with the increasing amount of structure on the web. Our approach is to combine multiple representations of web information in a common framework based on statistical language models. In this framework, it will be possible to derive a topical language model of the actual language use on web pages about a certain topic (such as arts, business, entertainment, or education) using the unigrams and bigrams taken from the plain text of the web pages. Similarly, it will be possible to derive models of the structure of web pages to distinguish between blogs, FAQs, personal web pages, etc. Structural characteristics of a web page include, among others, tag-name statistics and parent-child tag pairs. We will build a multi-level language model to exploit the information contained in the topical language models and the structure models. The .GOV2 corpus will be used as a test collection, on which queries will be run against different topical categories and against web pages with different structures. We plan to develop so-called parsimonious models to derive a compact representation and to handle dependencies between representations of the data.
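To make the representations named above concrete, the following is a minimal sketch of how unigram, bigram, tag-name, and parent-child tag statistics might be collected from a single page and turned into maximum-likelihood models. It is an illustration under stated assumptions, not the method of the paper: the names `PageFeatures`, `relative_frequencies`, and `page_models` are hypothetical, the tokenizer is a simple lowercase alphanumeric split, and the multi-level combination and parsimonious estimation are not covered.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Void elements never get a closing tag, so they are not kept on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PageFeatures(HTMLParser):
    """Hypothetical helper: collects text, tag names, and parent-child tag pairs."""

    def __init__(self):
        super().__init__()
        self.stack = []                 # currently open elements
        self.tags = Counter()           # tag-name counts
        self.parent_child = Counter()   # (parent, child) tag-pair counts
        self.text = []                  # plain-text fragments

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if self.stack:
            self.parent_child[(self.stack[-1], tag)] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Skip script and style bodies, which are not page text.
        if self.stack and self.stack[-1] in ("script", "style"):
            return
        self.text.append(data)


def relative_frequencies(counts):
    """Maximum-likelihood estimate: each count divided by the total."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()} if total else {}


def page_models(html):
    """Return one relative-frequency model per representation of the page."""
    parser = PageFeatures()
    parser.feed(html)
    parser.close()
    # Assumed tokenizer: lowercase runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", " ".join(parser.text).lower())
    return {
        "unigram": relative_frequencies(Counter(tokens)),
        "bigram": relative_frequencies(Counter(zip(tokens, tokens[1:]))),
        "tag": relative_frequencies(parser.tags),
        "parent_child": relative_frequencies(parser.parent_child),
    }
```

Calling `page_models` on the HTML of each page in a topical or structural category and summing the underlying counts would give category-level models. Note that in this sketch bigrams can span element boundaries, because the text fragments are joined before tokenization.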
Author and article information
Contributors
A.M. Kaptein
Conference
Publication date: August 2007
Publication date (Print): August 2007
Pages: 1-6
Affiliations
Archives and Information Studies, University of Amsterdam
Turfdraagsterpad 9, 1012 XT Amsterdam, The Netherlands