      Is Open Access

      Advanced Information Retrieval from Web Pages

      proceedings-article
      BCS IRSG Symposium: Future Directions in Information Access 2007 (FDIA)
      Future Directions in Information Access
      28-29 August 2007
      web information retrieval, information extraction from web

            Abstract

            This work proposes a lightweight, web-based algorithm that runs at near real-time speed. It retrieves the main parts (menu, main text, header, and footer) of a randomly selected web page, relying entirely on CSS, JavaScript, frames, layers, images, etc. for retrieval. The proposal also discusses the shortcomings of well-known modern algorithms for content retrieval from web pages. The algorithm is useful for improving existing engines for searching, content matching, summarization, web graph calculation, and similar tasks. It is also practical as a data provider for classification and data mining. Experimental results from a PHP implementation of the algorithm showed near real-time speed, an error rate of 20-25% in the multipurpose mode, and an error rate below 1% in the specific mode.
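
            The abstract does not describe how the algorithm works internally, so the following is only an illustrative sketch of the general task it names (separating a menu from the main text of a page), not the author's method. It assumes a simple link-density heuristic over top-level `<div>` blocks; the names `BlockCollector` and `label_blocks` and the `menu_threshold` value are hypothetical, and header and footer detection are deliberately left out.

```python
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collect total text length and link-text length per top-level <div>."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # one dict per top-level <div>
        self.div_depth = 0    # current nesting depth of <div> elements
        self.link_depth = 0   # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.div_depth == 0:
                self.blocks.append({"text": 0, "link_text": 0})
            self.div_depth += 1
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        if self.div_depth > 0 and n:
            block = self.blocks[-1]
            block["text"] += n
            if self.link_depth > 0:
                block["link_text"] += n


def label_blocks(html, menu_threshold=0.6):
    """Label each top-level <div> as 'menu', 'main text', or 'other'.

    Assumption (not from the paper): a block whose text is mostly link
    text is a menu, and the non-menu block with the most plain text is
    the main text.
    """
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    blocks = parser.blocks

    labels = []
    for b in blocks:
        density = b["link_text"] / b["text"] if b["text"] else 0.0
        labels.append("menu" if density >= menu_threshold else "other")

    # Pick the non-menu block with the largest amount of non-link text.
    candidates = [i for i, lab in enumerate(labels)
                  if lab == "other" and blocks[i]["text"] > 0]
    if candidates:
        best = max(candidates,
                   key=lambda i: blocks[i]["text"] - blocks[i]["link_text"])
        labels[best] = "main text"
    return labels
```

            A heuristic of this kind is cheap enough to run per request, which fits the near real-time goal stated in the abstract, but it says nothing about the error rates reported there.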


            Author and article information

            Contributors
            Conference
            August 2007
            Pages: 1-6
            Affiliations
            Tallinn University of Technology, Ehitajate tee 5, 19086 Tallinn, Estonia
            Article
            10.14236/ewic/FDIA2007.12
            © A. Vedeshin. Published by BCS Learning and Development Ltd. BCS IRSG Symposium: Future Directions in Information Access 2007, Glasgow

            This work is licensed under a Creative Commons Attribution 4.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

            Electronic Workshops in Computing (eWiC)

            1477-9358 BCS Learning & Development

            Self URI (article page): https://www.scienceopen.com/hosted-document?doi=10.14236/ewic/FDIA2007.12
            Self URI (journal page): https://ewic.bcs.org/
            Categories
            Electronic Workshops in Computing

            Applied computer science, Computer science, Security & Cryptology, Graphics & Multimedia design, General computer science, Human-computer interaction
