Saved in:
| Main Authors: | , , |
|---|---|
| Format: | Preprint |
| Published: |
2022
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2204.08331 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866929244566716416 |
|---|---|
| author | Dehghani, Hossein Mhaskar, Neerja Smyth, W. F. |
| author_facet | Dehghani, Hossein Mhaskar, Neerja Smyth, W. F. |
| contents | In this paper we describe two simple, fast, space-efficient algorithms for finding all matches of an indeterminate pattern $p = p[1..m]$ in an indeterminate string $x = x[1..n]$, where both $p$ and $x$ are defined on a "small" ordered alphabet $Σ$ $-$ say, $σ= |Σ| \le 9$. Both algorithms depend on a preprocessing phase that replaces $Σ$ by an integer alphabet $Σ_I$ of size $σ_I = σ$ which (reversibly, in time linear in string length) maps both $x$ and $p$ into equivalent regular strings $y$ and $q$, respectively, on $Σ_I$, whose maximum (indeterminate) letter can be expressed in a 32-bit word (for $σ\le 4$, thus for DNA sequences, an 8-bit representation suffices). We first describe an efficient version KMP Indet of the venerable Knuth-Morris-Pratt algorithm to find all occurrences of $q$ in $y$ (that is, of $p$ in $x$), but, whenever necessary, using the prefix array, rather than the border array, to control shifts of the transformed pattern $q$ along the transformed string $y$. We go on to describe a similar efficient version BM Indet of the Boyer- Moore algorithm that turns out to execute significantly faster than KMP Indet over a wide range of test cases. A noteworthy feature is that both algorithms require very little additional space: $Θ(m)$ words. We conjecture that a similar approach may yield practical and efficient indeterminate equivalents to other well-known pattern-matching algorithms, in particular the several variants of Boyer-Moore. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2204_08331 |
| institution | arXiv |
| publishDate | 2022 |
| record_format | arxiv |
| spellingShingle | Practical KMP/BM Style Pattern-Matching on Indeterminate Strings Dehghani, Hossein Mhaskar, Neerja Smyth, W. F. Data Structures and Algorithms In this paper we describe two simple, fast, space-efficient algorithms for finding all matches of an indeterminate pattern $p = p[1..m]$ in an indeterminate string $x = x[1..n]$, where both $p$ and $x$ are defined on a "small" ordered alphabet $Σ$ $-$ say, $σ= |Σ| \le 9$. Both algorithms depend on a preprocessing phase that replaces $Σ$ by an integer alphabet $Σ_I$ of size $σ_I = σ$ which (reversibly, in time linear in string length) maps both $x$ and $p$ into equivalent regular strings $y$ and $q$, respectively, on $Σ_I$, whose maximum (indeterminate) letter can be expressed in a 32-bit word (for $σ\le 4$, thus for DNA sequences, an 8-bit representation suffices). We first describe an efficient version KMP Indet of the venerable Knuth-Morris-Pratt algorithm to find all occurrences of $q$ in $y$ (that is, of $p$ in $x$), but, whenever necessary, using the prefix array, rather than the border array, to control shifts of the transformed pattern $q$ along the transformed string $y$. We go on to describe a similar efficient version BM Indet of the Boyer- Moore algorithm that turns out to execute significantly faster than KMP Indet over a wide range of test cases. A noteworthy feature is that both algorithms require very little additional space: $Θ(m)$ words. We conjecture that a similar approach may yield practical and efficient indeterminate equivalents to other well-known pattern-matching algorithms, in particular the several variants of Boyer-Moore. |
| title | Practical KMP/BM Style Pattern-Matching on Indeterminate Strings |
| topic | Data Structures and Algorithms |
| url | https://arxiv.org/abs/2204.08331 |