Topic-focused Web crawling aims to harness the potential of the Internet reliably and efficiently by producing topic-specific indexes of pages within the Web. Previous work has focused on supplying suitably general descriptions of topics to generate large general indexes. In this paper we propose a method that uses lexical profiling of a corpus consisting of the hierarchical structures in existing Web directories to specify finer-grained topics from smaller sets of training examples, while using the seemingly redundant information in related topics to make the process of gathering pages more efficient. We also suggest a link scoring formula that combines content, context and page lexical similarities to a given topic to prioritise links for crawling. Initial experiments with the Open Directory Project show that the prioritised crawl provides significantly more pages than a breadth-first crawler, and that the number of relevant pages grows at a much higher rate. Following the links most likely to lead to target pages keeps the crawler close to the target subject, which reduces “unproductive” periods.
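The prioritised crawl described above can be illustrated with a minimal sketch. The abstract does not give the actual scoring formula, so everything below is an assumption made for illustration: the three similarities are read as anchor text ("content"), text surrounding the link ("context") and the text of the linking page ("page"); each is computed as a cosine similarity over bag-of-words profiles; and they are combined as a weighted sum with arbitrary weights. The names `profile`, `cosine`, `link_score` and `Frontier` are hypothetical.

```python
import heapq
import math
from collections import Counter


def profile(text: str) -> Counter:
    """Build a simple bag-of-words lexical profile (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency profiles."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def link_score(topic, anchor_text, context_text, page_text,
               weights=(0.4, 0.3, 0.3)):
    """Weighted sum of three similarities to the topic.

    The weights are illustrative placeholders, not values from the paper.
    """
    sims = (
        cosine(topic, profile(anchor_text)),   # "content": the link's own text
        cosine(topic, profile(context_text)),  # "context": text around the link
        cosine(topic, profile(page_text)),     # "page": the linking page
    )
    return sum(w * s for w, s in zip(weights, sims))


class Frontier:
    """Best-first crawl frontier: highest-scoring unseen link is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._count = 0  # tie-breaker so equal scores pop in insertion order

    def push(self, url: str, score: float) -> None:
        if url in self._seen:
            return  # each URL is queued at most once
        self._seen.add(url)
        # heapq is a min-heap, so negate the score to pop the maximum first
        heapq.heappush(self._heap, (-score, self._count, url))
        self._count += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


# Usage: two links found on the same page, scored against a toy topic profile.
topic = profile("jazz saxophone jazz improvisation")
page = "music jazz"
frontier = Frontier()
frontier.push("http://example.org/cars",
              link_score(topic, "used cars", "cheap cars here", page))
frontier.push("http://example.org/jazz",
              link_score(topic, "jazz saxophone", "learn jazz improvisation", page))
next_url = frontier.pop()  # the jazz link is popped first
```

A breadth-first crawler corresponds to replacing the score with the insertion order, which is why the two strategies can be compared on the same frontier abstraction.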
Author and article information
Contributors
Mark Greenwood
Goran Nenadic
Conference
Publication date: October 2008
Publication date (Print): October 2008
Pages: 1-8
Affiliations
[0001] School of Computer Science, University of Manchester, Manchester, UK