Thumbnail
Access Restriction
Subscribed

Author Arroyuelo, Diego ♦ Navarro, Gonzalo
Source ACM Digital Library
Content type Text
Publisher Association for Computing Machinery (ACM)
File Format PDF
Copyright Year ©2011
Language English
Subject Domain (in DDC) Computer science, information & general works ♦ Computer programming, programs & data
Subject Keyword Lempel-Ziv compression ♦ Compressed data structures ♦ Indexed text search
Abstract Given a text $\textit{T}[1¨\textit{n}]$ over an alphabet of size σ, the full-text search problem consists in locating the $\textit{occ}$ occurrences of a given pattern $\textit{P}[1¨\textit{m}]$ in $\textit{T}.$ Compressed full-text self-indices are space-efficient representations of the text that provide direct access to and indexed search on it. The LZ-index of Navarro is a compressed full-text self-index based on the LZ78 compression algorithm. This index requires about 5 times the size of the compressed text (in theory, $4nH_{k}(T)+o(nlogσ)$ bits of space, where $H_{k}(T)$ is the $\textit{k}-th$ order empirical entropy of $\textit{T}).$ In practice, the average locating complexity of the LZ-index is $\textit{O}(σ$ $\textit{m}$ $log_{σ}$ $\textit{n}$ + $\textit{occ}$ $σ^{m}/2),$ where $\textit{occ}$ is the number of occurrences of $\textit{P}.$ It can extract text substrings of length ℓ in $\textit{O}(ℓ)$ time. This index outperforms competing schemes both to locate short patterns and to extract text snippets. However, the LZ-index can be up to 4 times larger than the smallest existing indices (which use $nH_{k}(T)+o(nlogσ)$ bits in theory), and it does not offer space/time tuning options. This limits its applicability. In this article, we study practical ways to reduce the space of the LZ-index. We obtain new LZ-index variants that require $2(1+&epsis;)nH_{k}(T)$ + $\textit{o}(\textit{n}logσ)$ bits of space, for any 0<&epsis; <1. They have an average locating time of $\textit{O}(1/&epsis;(\textit{m}log$ $\textit{n}$ + $\textit{occ}$ $σ^{m/2})),$ while extracting takes $\textit{O}(ℓ)$ time. We perform extensive experimentation and conclude that our schemes are able to reduce the space of the original LZ-index by a factor of 2/3, that is, around 3 times the compressed text size. Our schemes are able to extract about 1 to 2 MB of the text per second, being twice as fast as the most competitive alternatives. Pattern occurrences are located at a rate of up to 1 to 4 million per second. This constitutes the best space/time trade-off when indices are allowed to use 4 times the size of the compressed text or more.
ISSN 10846654
Age Range 18 to 22 years ♦ above 22 year
Educational Use Research
Education Level UG and PG
Learning Resource Type Article
Publisher Date 2010-12-01
Publisher Place New York
e-ISSN 10846654
Journal Journal of Experimental Algorithmics (JEA)
Volume Number 15
Page Count 54
Starting Page 1.1
Ending Page 1.54


Open content in new tab

   Open content in new tab
Source: ACM Digital Library