### A fast bit-vector algorithm for approximate string matching based on dynamic programmingA fast bit-vector algorithm for approximate string matching based on dynamic programming

Access Restriction
Subscribed

 Author Myers, Gene Source ACM Digital Library Content type Text Publisher Association for Computing Machinery (ACM) File Format PDF Copyright Year ©1999 Language English
 Subject Domain (in DDC) Computer science, information & general works ♦ Data processing & computer science Subject Keyword Approximate string search ♦ Bit-parallelism ♦ Sequence comparison Abstract The approximate string matching problem is to find all locations at which a query of $length\textit{m}$ matches a substring of a text of length $\textit{n}$ with $\textit{k}-or-fewer$ differences. Simple and practical bit-vector algorithms have been designed for this problem, most notably the one used in $\textit{agrep}.$ These algorithms compute a bit representation of the current state-set of the $\textit{k}-difference$ automaton for the query, and asymptotically run in either $\textit{O}(\textit{nm/w})$ or $\textit{O}(\textit{nm}$ log $&sgr;/\textit{w})$ time where $\textit{w}$ is the word size of the machine (e.g., 32 or 64 in practice), and &sgr; is the size of the pattern alphabet. Here we present an algorithm of comparable simplicity that requires only $\textit{O}(\textit{nm/w)}$ time by virtue of computing a bit representation of the $\textit{relocatable}$ dynamic programming matrix for the problem. Thus, the algorithm's performance is independent of $\textit{k},$ and it is found to be more efficient than the previous results for many choices of $\textit{k}$ and $small\textit{m}.$ Moreover, because the algorithm is not dependent on $\textit{k},$ it can be used to rapidly compute blocks of the dynamic programming matrix as in the 4-Russians algorithm of Wu et al.(1996). This gives rise to an $\textit{O(kn/w)}$ expected-time algorithm for the case where $\textit{m}$ may be arbitrarily large. In practice this new algorithm, that computes a region of the dynamic progr amming (d.p.) matrx $\textit{w}$ entries at a time using the basic algorithm as a subroutine is significantly faster than our previous 4-Russians algorithm, that computes the same region 4 or 5 entries at a time using table lookup. This performance improvement yields a code that is either superior or competitive with $\textit{all}$ existing algorithms except for some filtration algorithms that are superior when $\textit{k/m}$ is sufficiently small. ISSN 00045411 Age Range 18 to 22 years ♦ above 22 year Educational Use Research Education Level UG and PG Learning Resource Type Article Publisher Date 1999-05-01 Publisher Place New York e-ISSN 1557735X Journal Journal of the ACM (JACM) Volume Number 46 Issue Number 3 Page Count 21 Starting Page 395 Ending Page 415

#### Open content in new tab

Source: ACM Digital Library