### Document Spanners: A Formal Approach to Information ExtractionDocument Spanners: A Formal Approach to Information Extraction

Access Restriction
Subscribed

 Author Fagin, Ronald ♦ Kimelfeld, Benny ♦ Reiss, Frederick ♦ Vansummeren, Stijn Source ACM Digital Library Content type Text Publisher Association for Computing Machinery (ACM) File Format PDF Copyright Year ©2015 Language English
 Subject Domain (in DDC) Computer science, information & general works ♦ Data processing & computer science Subject Keyword Information extraction ♦ Document spanners ♦ Finite-state automata ♦ Regular expressions Abstract An intrinsic part of information extraction is the creation and manipulation of relations extracted from text. In this article, we develop a foundational framework where the central construct is what we call a document spanner (or just $\textit{spanner}$ for short). A spanner maps an input string into a relation over the spans (intervals specified by bounding indices) of the string. The focus of this article is on the representation of spanners. Conceptually, there are two kinds of such representations. Spanners defined in a $\textit{primitive}$ representation extract relations directly from the input string; those defined in an $\textit{algebra}$ apply algebraic operations to the primitively represented spanners. This framework is driven by SystemT, an IBM commercial product for text analysis, where the primitive representation is that of regular expressions with capture variables. We define additional types of primitive spanner representations by means of two kinds of automata that assign spans to variables. We prove that the first kind has the same expressive power as regular expressions with capture variables; the second kind expresses precisely the algebra of the $\textit{regular}$ spanners—the closure of the first kind under standard relational operators. The $\textit{core}$ spanners extend the regular ones by string-equality selection (an extension used in SystemT). We give some fundamental results on the expressiveness of regular and core spanners. As an example, we prove that regular spanners are closed under difference (and complement), but core spanners are not. Finally, we establish connections with related notions in the literature. ISSN 00045411 Age Range 18 to 22 years ♦ above 22 year Educational Use Research Education Level UG and PG Learning Resource Type Article Publisher Date 2015-05-06 Publisher Place New York e-ISSN 1557735X Journal Journal of the ACM (JACM) Volume Number 62 Issue Number 2 Page Count 51 Starting Page 1 Ending Page 51

#### Open content in new tab

Source: ACM Digital Library