Thumbnail
Access Restriction
Subscribed

Author Myers, Gene
Source ACM Digital Library
Content type Text
Publisher Association for Computing Machinery (ACM)
File Format PDF
Copyright Year ©1999
Language English
Subject Domain (in DDC) Computer science, information & general works ♦ Data processing & computer science
Subject Keyword Approximate string search ♦ Bit-parallelism ♦ Sequence comparison
Abstract The approximate string matching problem is to find all locations at which a query of $length\textit{m}$ matches a substring of a text of length $\textit{n}$ with $\textit{k}-or-fewer$ differences. Simple and practical bit-vector algorithms have been designed for this problem, most notably the one used in $\textit{agrep}.$ These algorithms compute a bit representation of the current state-set of the $\textit{k}-difference$ automaton for the query, and asymptotically run in either $\textit{O}(\textit{nm/w})$ or $\textit{O}(\textit{nm}$ log $&sgr;/\textit{w})$ time where $\textit{w}$ is the word size of the machine (e.g., 32 or 64 in practice), and &sgr; is the size of the pattern alphabet. Here we present an algorithm of comparable simplicity that requires only $\textit{O}(\textit{nm/w)}$ time by virtue of computing a bit representation of the $\textit{relocatable}$ dynamic programming matrix for the problem. Thus, the algorithm's performance is independent of $\textit{k},$ and it is found to be more efficient than the previous results for many choices of $\textit{k}$ and $small\textit{m}.$ Moreover, because the algorithm is not dependent on $\textit{k},$ it can be used to rapidly compute blocks of the dynamic programming matrix as in the 4-Russians algorithm of Wu et al.(1996). This gives rise to an $\textit{O(kn/w)}$ expected-time algorithm for the case where $\textit{m}$ may be arbitrarily large. In practice this new algorithm, that computes a region of the dynamic progr amming (d.p.) matrx $\textit{w}$ entries at a time using the basic algorithm as a subroutine is significantly faster than our previous 4-Russians algorithm, that computes the same region 4 or 5 entries at a time using table lookup. This performance improvement yields a code that is either superior or competitive with $\textit{all}$ existing algorithms except for some filtration algorithms that are superior when $\textit{k/m}$ is sufficiently small.
ISSN 00045411
Age Range 18 to 22 years ♦ above 22 year
Educational Use Research
Education Level UG and PG
Learning Resource Type Article
Publisher Date 1999-05-01
Publisher Place New York
e-ISSN 1557735X
Journal Journal of the ACM (JACM)
Volume Number 46
Issue Number 3
Page Count 21
Starting Page 395
Ending Page 415


Open content in new tab

   Open content in new tab
Source: ACM Digital Library