An Architecture for Efficient Document Clustering and Retrieval on a Dynamic Collection of Newspaper Texts

Clustering of related or similar objects has long been regarded as a potentially useful aid to users navigating an information space such as a document collection. When documents are about the same or similar topics, this is often a good indicator that they will be relevant to the same queries, and this relationship can be exploited during retrieval. Many clustering algorithms and techniques have been developed and implemented since the earliest days of computational information retrieval, but as document collections have grown, these techniques have not been scaled to large collections because of their computational overhead. In this paper we describe a technique for clustering a collection of documents, such as a collection of online newspapers, which uses a number of short-cuts to make the process computationally feasible for large collections. Furthermore, our design is extensible in that it caters for a dynamic collection of documents which is periodically, perhaps nightly, updated, amended or subject to deletions. An implementation of the clustering on an archive of the Irish Times newspaper is reported here.

Author and article information

Contributors

Alan F. Smeaton

Conference

Publication date: March 1998

Publication date (Print): March 1998

Pages: 1-9

Affiliations

School of Computer Applications

Dublin City University

Glasnevin, Dublin 9, IRELAND

Article

DOI: 10.14236/ewic/IRSG1998.10

SO-VID: c0183a59-f03a-413f-842a-00fa95bd9a35

License:

This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

Conference name: 20th Annual BCS-IRSG Colloquium on IR

Conference acronym: IRSG

Conference number: 20

Conference location: Autrans, France

Conference date: 25-27 March 1998

Conference sponsor: Electronic Workshops in Computing (eWiC)

Conference theme: BCS-IRSG Colloquium on IR

Product

1477-9358 BCS Learning & Development

Self URI (article page): https://www.scienceopen.com/hosted-document?doi=10.14236/ewic/IRSG1998.10

Self URI (journal page): https://ewic.bcs.org/

Abstract