Thumbnail
Access Restriction
Subscribed

Author Gottlob, Georg ♦ Koch, Christoph
Source ACM Digital Library
Content type Text
Publisher Association for Computing Machinery (ACM)
File Format PDF
Copyright Year ©2004
Language English
Subject Domain (in DDC) Computer science, information & general works ♦ Data processing & computer science
Subject Keyword Complexity ♦ HTML ♦ MSO ♦ Expressiveness ♦ Information extraction ♦ Monadic datalog ♦ Regular tree languages ♦ Web wrapping
Abstract Research on information extraction from Web pages (wrapping) has seen much activity recently (particularly systems implementations), but little work has been done on formally studying the expressiveness of the formalisms proposed or on the theoretical foundations of wrapping. In this paper, we first study monadic datalog over trees as a wrapping language. We show that this simple language is equivalent to monadic second order logic (MSO) in its ability to specify wrappers. We believe that MSO has the right expressiveness required for Web information extraction and propose MSO as a yardstick for evaluating and comparing wrappers. Along the way, several other results on the complexity of query evaluation and query containment for monadic datalog over trees are established, and a simple normal form for this language is presented. Using the above results, we subsequently study the kernel fragment $Elog^{™}$ of the Elog wrapping language used in the Lixto system (a visual wrapper generator). Curiously, $Elog^{™}$ exactly captures MSO, yet is easier to use. Indeed, programs in this language can be entirely visually specified.
ISSN 00045411
Age Range 18 to 22 years ♦ above 22 year
Educational Use Research
Education Level UG and PG
Learning Resource Type Article
Publisher Date 2004-01-01
Publisher Place New York
e-ISSN 1557735X
Journal Journal of the ACM (JACM)
Volume Number 51
Issue Number 1
Page Count 40
Starting Page 74
Ending Page 113


Open content in new tab

   Open content in new tab
Source: ACM Digital Library