Author Neiva Lopes Figueiredo, L. ♦ Almeida Ferreira, A. ♦ Tavares de Assis, G.
Source IEEE Xplore Digital Library
Content type Text
Publisher Institute of Electrical and Electronics Engineers, Inc. (IEEE)
File Format PDF
Copyright Year ©2014
Language English
Subject Domain (in DDC) Computer science, information & general works ♦ Data processing & computer science
Subject Keyword path expression ♦ Visualization ♦ Accuracy ♦ Web pages ♦ main data region ♦ rendering information ♦ Rendering (computer graphics) ♦ HTML ♦ wrapper ♦ Browsers ♦ Data mining ♦ visual information
Abstract Extracting data from web pages is an important task for several applications, such as comparison shopping and data mining. Much of that data is provided by search result pages, in which each result, called a search result record, represents a record from a database. One of the most important steps in extracting such records is identifying, among the different data regions of a page, the one that contains the records to be extracted. An incorrect identification of this region may lead to an incorrect extraction of the search result records. In this paper, we propose a simple but efficient method that generates a path expression to select the main data region of a given page, based on the rendering area information of its elements. The generated path expression may be used by wrappers for extracting the search result records and their data units, reducing the wrappers' complexity and increasing their accuracy. Experimental results using web pages from several domains show that the method is highly effective.
Description Author affiliation: Dept. de Comput., Univ. Fed. de Ouro Preto, Ouro Preto, Brazil (Neiva Lopes Figueiredo, L.; Almeida Ferreira, A.; Tavares de Assis, G.)
Educational Role Student ♦ Teacher
Age Range above 22 years
Educational Use Research ♦ Reading
Education Level UG and PG
Learning Resource Type Article
Publisher Date 2014-10-22
Publisher Place Brazil
Rights Holder Institute of Electrical and Electronics Engineers, Inc. (IEEE)
e-ISBN 9781479969531
Size 467.50 kB
Page Count 9
Starting Page 24
Ending Page 32
