Access Restriction Open

Source CiteSeerX
Content type Text
File Format PDF
Subject Domain (in DDC) Computer science, information & general works ♦ Data processing & computer science
Subject Keyword Semi-structured Data Using Landmark ♦ Text Region ♦ Host Language ♦ Simple Implementation ♦ XML/SGML Document ♦ Main Operator ♦ Wide Variety ♦ Unique Textual Landmark ♦ Landmark Search Operator ♦ Text File ♦ Iterator Class ♦ Web Page ♦ XML/SGML Tag Pair
Abstract This paper introduces landmark search operators for extracting data from poorly formatted Web pages, plain text files, and XML/SGML documents lacking grammars. The emphasis is on ease of use and on a fast, simple implementation that can be readily ported to a wide variety of host languages. There are two main operators: one uses unique textual landmarks to divide text regions into smaller regions suitable for further search, and the other searches for XML/SGML tag pairs and returns the matches as regions. An iterator class allows a search to be carried out repeatedly.
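The two operators named in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the names Region, between, and tag_pairs are hypothetical, the landmark operator is assumed to match literal strings, and the tag-pair operator is assumed to handle only non-nested pairs. A generator stands in for the iterator class.

```python
import re
from typing import Iterator, NamedTuple, Optional


class Region(NamedTuple):
    """A half-open slice [start, end) of a source string (hypothetical type)."""
    text: str
    start: int
    end: int

    def value(self) -> str:
        return self.text[self.start:self.end]


def between(region: Region, before: str, after: str) -> Optional[Region]:
    """Return the sub-region between two literal landmarks, or None if either is missing."""
    i = region.text.find(before, region.start, region.end)
    if i == -1:
        return None
    i += len(before)
    j = region.text.find(after, i, region.end)
    if j == -1:
        return None
    return Region(region.text, i, j)


def tag_pairs(region: Region, tag: str) -> Iterator[Region]:
    """Yield the content of each <tag ...>...</tag> pair inside the region.

    Assumption: pairs are not nested; no grammar or DTD is consulted.
    """
    name = re.escape(tag)
    pattern = re.compile(
        r"<%s(?:\s[^>]*)?>(.*?)</%s\s*>" % (name, name),
        re.DOTALL | re.IGNORECASE,
    )
    # pos/endpos confine the search to the region without copying the text.
    for m in pattern.finditer(region.text, region.start, region.end):
        yield Region(region.text, m.start(1), m.end(1))


# Narrow the page to one table using landmarks, then iterate over its cells.
page = "junk <b>Prices</b> <table><td>apple</td><td> 1.20</td></table> footer"
whole = Region(page, 0, len(page))
table = between(whole, "<table>", "</table>")
cells = [r.value().strip() for r in tag_pairs(table, "td")]
```

Because each result is itself a region over the same text, the output of one search can be passed directly to the next, which is the chaining the abstract describes as dividing regions into smaller regions suitable for further search.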
Educational Role Student ♦ Teacher
Age Range Above 22 years
Educational Use Research
Education Level UG and PG ♦ Career/Technical Study
Learning Resource Type Article