This work proposes a lightweight, web-based algorithm that runs at near real-time speed. It retrieves the main parts (menu, main text, header, and footer) of a randomly selected web page, making full use of CSS, JavaScript, frames, layers, images, and similar page features for retrieval. The shortcomings of well-known modern algorithms for content retrieval from web pages are also discussed. The algorithm can improve existing engines for search, content matching, summarization, web graph calculation, and similar tasks. It is also practical as a data provider for classification and data mining. Experimental results for a PHP implementation of the algorithm showed near real-time speed, a 20-25% error rate in the multipurpose mode, and an error rate below 1% in the specific mode.
Content
Author and article information
Contributors
A. Vedeshin
Conference
Publication date: August 2007
Publication date (Print): August 2007
Pages: 1-6
Affiliations
Tallinn University of Technology, Ehitajate tee 5, 19086 Tallinn, Estonia