### General Document Retrieval in Compact SpaceGeneral Document Retrieval in Compact Space

Access Restriction
Subscribed

 Author Navarro, Gonzalo ♦ Puglisi, Simon J. ♦ Valenzuela, Daniel Source ACM Digital Library Content type Text Publisher Association for Computing Machinery (ACM) File Format PDF Copyright Year ©2014 Language English
 Subject Domain (in DDC) Computer science, information & general works ♦ Computer programming, programs & data Subject Keyword Document retrieval ♦ Compressed text databases ♦ Top-k retrieval ♦ Wavelet trees Abstract Given a collection of documents and a query pattern, document retrieval is the problem of obtaining documents that are relevant to the query. The collection is available beforehand so that a data structure, called an index, can be built on it to speed up queries. While initially restricted to natural language text collections, document retrieval problems arise nowadays in applications like bioinformatics, multimedia databases, and web mining. This requires a more general setup where text and pattern can be general sequences of symbols, and the classical inverted indexes developed for words cannot be applied. While linear-space time-optimal solutions have been developed for most interesting queries in this general case, space usage is a serious problem in practice. In this article, we develop compact data structures that solve various important document retrieval problems on general text collections. More specifically, we provide practical solutions for listing the documents where a query pattern appears, together with its frequency in each document, and for listing $\textit{k}$ documents where a query pattern appears most frequently. Some of our techniques build on existing theoretical proposals, while others are new. In particular, we introduce a novel grammar-based compressed bitmap representation that may be of independent interest when dealing with repetitive sequences. Ours are the first practical indexes that use less space when the text collection is compressible. Our experimental results show that, in various real-life text collections, our data structures are significantly smaller than the most space-efficient previous solutions, using up to half the space without noticeably increasing the query time. Overall, document listing can be carried out in 10 to 40 milliseconds for patterns that appear 100 to 10,000 times in the collection, whereas $top-\textit{k}$ retrieval is carried out in $\textit{k}$ to 10 $\textit{k}$ milliseconds. ISSN 10846654 Age Range 18 to 22 years ♦ above 22 year Educational Use Research Education Level UG and PG Learning Resource Type Article Publisher Date 2015-01-07 Publisher Place New York e-ISSN 10846654 Journal Journal of Experimental Algorithmics (JEA) Volume Number 19 Page Count 46 Starting Page 1 Ending Page 46

#### Open content in new tab

Source: ACM Digital Library