Focused Retrieval Using Topical Language and Structure

We investigate focused retrieval techniques that deal with the increasing amount of structure on the web. Our approach is to combine multiple representations of web information in a common framework based on statistical language models. In this framework, it will be possible to derive a topical language model of the actual language use on web pages about a given topic (such as arts, business, entertainment, or education), using the unigrams and bigrams taken from the plain text of the web pages. Similarly, it will be possible to derive models of the structure of web pages to distinguish between page types such as blogs, FAQs, and personal web pages. Structural characteristics of a web page include, among others, tag name statistics and parent-child tag pairs. We will build a multi-level language model to exploit the information contained in the topical language and structure models. The .GOV2 corpus will be used as a test collection, on which queries will be run against different topical categories and against web pages with different structures. We plan to develop so-called parsimonious models to derive a compact representation and to handle dependencies between representations of the data.
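To make the two kinds of evidence named in the abstract concrete, here is a minimal sketch of how the raw counts might be collected from a single page: unigrams and bigrams from the plain text, plus tag name statistics and parent-child tag pairs from the markup. It is an illustration only, not the author's implementation. The names `PageFeatures` and `extract_features`, the simple regular-expression tokenizer, and the choice of Python's standard `html.parser` are all assumptions made for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tags that never take a closing tag; they are counted but not kept open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PageFeatures(HTMLParser):
    """Hypothetical helper: gathers text tokens and structural tag counts."""

    def __init__(self):
        super().__init__()
        self.stack = []                 # tags that are currently open
        self.tag_counts = Counter()     # tag name statistics
        self.parent_child = Counter()   # (parent tag, child tag) pairs
        self.tokens = []                # lowercased tokens from text nodes

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if self.stack:
            self.parent_child[(self.stack[-1], tag)] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; unmatched end tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Assumed tokenizer: runs of ASCII letters and digits.
        self.tokens.extend(re.findall(r"[a-z0-9]+", data.lower()))


def extract_features(html):
    """Return (unigrams, bigrams, tag_counts, parent_child) for one page."""
    parser = PageFeatures()
    parser.feed(html)
    parser.close()
    unigrams = Counter(parser.tokens)
    # Simplification: bigrams may span element boundaries.
    bigrams = Counter(zip(parser.tokens, parser.tokens[1:]))
    return unigrams, bigrams, parser.tag_counts, parser.parent_child


if __name__ == "__main__":
    page = ("<html><body><h1>Frequently asked questions</h1>"
            "<ul><li>What is focused retrieval?</li>"
            "<li>What is a language model?</li></ul></body></html>")
    uni, bi, tags, pairs = extract_features(page)
    print(uni.most_common(3))
    print(pairs.most_common(3))
```

Counts like these would only be the input to the topical and structural models. Smoothing, the parsimonious estimation step, and the combination into a multi-level model are outside the scope of this sketch.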

Author and article information

Contributors

A.M. Kaptein

Conference

Publication date: August 2007

Publication date (Print): August 2007

Pages: 1-6

Affiliations

Archives and Information Studies, University of Amsterdam

Turfdraagsterpad 9, 1012 XT Amsterdam, The Netherlands

Article

DOI: 10.14236/ewic/FDIA2007.9

SO-VID: a2524cab-2205-4ca5-946d-c3313d9d88b5

License:

This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

Conference name: BCS IRSG Symposium: Future Directions in Information Access 2007

Conference acronym: FDIA

Conference location: Glasgow

Conference date: 28-29 August 2007

Conference sponsor: Electronic Workshops in Computing (eWiC)

Conference theme: Future Directions in Information Access

Product

1477-9358 BCS Learning & Development

Self URI (article page): https://www.scienceopen.com/hosted-document?doi=10.14236/ewic/FDIA2007.9

Self URI (journal page): https://ewic.bcs.org/
