Access Restriction

Author Navarro, Gonzalo ♦ Puglisi, Simon J. ♦ Valenzuela, Daniel
Source ACM Digital Library
Content type Text
Publisher Association for Computing Machinery (ACM)
File Format PDF
Copyright Year ©2014
Language English
Subject Domain (in DDC) Computer science, information & general works ♦ Computer programming, programs & data
Subject Keyword Document retrieval ♦ Compressed text databases ♦ Top-k retrieval ♦ Wavelet trees
Abstract Given a collection of documents and a query pattern, document retrieval is the problem of obtaining documents that are relevant to the query. The collection is available beforehand so that a data structure, called an index, can be built on it to speed up queries. While initially restricted to natural language text collections, document retrieval problems arise nowadays in applications like bioinformatics, multimedia databases, and web mining. This requires a more general setup where text and pattern can be general sequences of symbols, and the classical inverted indexes developed for words cannot be applied. While linear-space time-optimal solutions have been developed for most interesting queries in this general case, space usage is a serious problem in practice. In this article, we develop compact data structures that solve various important document retrieval problems on general text collections. More specifically, we provide practical solutions for listing the documents where a query pattern appears, together with its frequency in each document, and for listing $\textit{k}$ documents where a query pattern appears most frequently. Some of our techniques build on existing theoretical proposals, while others are new. In particular, we introduce a novel grammar-based compressed bitmap representation that may be of independent interest when dealing with repetitive sequences. Ours are the first practical indexes that use less space when the text collection is compressible. Our experimental results show that, in various real-life text collections, our data structures are significantly smaller than the most space-efficient previous solutions, using up to half the space without noticeably increasing the query time. Overall, document listing can be carried out in 10 to 40 milliseconds for patterns that appear 100 to 10,000 times in the collection, whereas $top-\textit{k}$ retrieval is carried out in $\textit{k}$ to 10 $\textit{k}$ milliseconds.
ISSN 10846654
Age Range 18 to 22 years ♦ above 22 year
Educational Use Research
Education Level UG and PG
Learning Resource Type Article
Publisher Date 2015-01-07
Publisher Place New York
e-ISSN 10846654
Journal Journal of Experimental Algorithmics (JEA)
Volume Number 19
Page Count 46
Starting Page 1
Ending Page 46

Open content in new tab

   Open content in new tab
Source: ACM Digital Library