
      Is Open Access

      Lexical Profiling of Existing Web Directories to Support Fine-grained Topic-Focused Web Crawling

      proceedings-article
      BCS-IRSG Workshop on Corpus Profiling (IRSG)
      Workshop on Corpus Profiling
      18 October 2008
      Topic Focused Web Crawling, World Wide Web, Information Retrieval, Lexical Profiling

            Abstract

            Topic-focused Web crawling aims to harness the potential of the Internet reliably and efficiently, producing topic-specific indexes of pages within the Web. Previous work has focused on supplying suitably general descriptions of topics to generate large general indexes. In this paper we propose a method that applies lexical profiling to a corpus drawn from the hierarchical structures of existing Web directories, so that finer-grained topics can be specified from smaller sets of training examples, while the seemingly redundant information in related topics is used to make the process of gathering pages more efficient. We also suggest a link scoring formula that combines content, context and page lexical similarities to a given topic to prioritise the links for crawling. Initial experiments with the Open Directory Project show that the prioritised crawl provides significantly more pages than the breadth-first crawler, and that the number of relevant pages increases at a much higher rate. Keeping the crawler close to the target subject, by following the links most likely to lead to target pages, reduces “unproductive” periods.
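
The abstract names a link scoring formula but does not state it. As a rough illustration only, the sketch below assumes a weighted sum of three cosine similarities (link text, surrounding context, and linking page, each compared with a topic profile) and uses the score to order a crawl frontier. The function names, the weights, and the reading of "content, context and page" are assumptions for illustration, not the authors' method.

```python
import heapq
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_score(topic_profile, link_text, context_text, page_text,
               weights=(0.4, 0.3, 0.3)):
    """Hypothetical score: weighted sum of three similarities to the topic.

    The weights are arbitrary placeholders, not values from the paper.
    """
    w_link, w_context, w_page = weights
    return (w_link * cosine_similarity(topic_profile, link_text)
            + w_context * cosine_similarity(topic_profile, context_text)
            + w_page * cosine_similarity(topic_profile, page_text))


class Frontier:
    """Crawl frontier that always yields the highest-scoring unseen URL."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url: str, score: float) -> None:
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so negate the score for max-first order.
            heapq.heappush(self._heap, (-score, url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[1]

    def __len__(self) -> int:
        return len(self._heap)
```

A breadth-first crawler would replace the heap with a FIFO queue; the prioritised variant differs only in the order in which queued links are visited.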


            Author and article information

            Contributors
            Conference
            October 2008, pages 1-8
            Affiliations
            [0001]School of Computer Science, University of Manchester, Manchester, UK
            Article
            10.14236/ewic/IRSG2008.4
            5de94cbd-121c-42c0-ab24-159b6ff444ee
            © Mark Greenwood et al. Published by BCS Learning and Development Ltd. BCS-IRSG Workshop on Corpus Profiling

            This work is licensed under a Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

            BCS-IRSG Workshop on Corpus Profiling (IRSG)
            London, 18 October 2008
            Electronic Workshops in Computing (eWiC)

            1477-9358 BCS Learning & Development

            Self URI (article page): https://www.scienceopen.com/hosted-document?doi=10.14236/ewic/IRSG2008.4
            Self URI (journal page): https://ewic.bcs.org/
            Categories
            Electronic Workshops in Computing

            Applied computer science, Computer science, Security & Cryptology, Graphics & Multimedia design, General computer science, Human-computer interaction
