Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Dehghani, Hossein, Mhaskar, Neerja, Smyth, W. F.
Format:	Preprint
Published:	2022
Subjects:	Data Structures and Algorithms
Online Access:	https://arxiv.org/abs/2204.08331
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866929244566716416
author	Dehghani, Hossein Mhaskar, Neerja Smyth, W. F.
author_facet	Dehghani, Hossein Mhaskar, Neerja Smyth, W. F.
contents	In this paper we describe two simple, fast, space-efficient algorithms for finding all matches of an indeterminate pattern $p = p[1..m]$ in an indeterminate string $x = x[1..n]$, where both $p$ and $x$ are defined on a "small" ordered alphabet $Σ$ $-$ say, $σ= \|Σ\| \le 9$. Both algorithms depend on a preprocessing phase that replaces $Σ$ by an integer alphabet $Σ_I$ of size $σ_I = σ$ which (reversibly, in time linear in string length) maps both $x$ and $p$ into equivalent regular strings $y$ and $q$, respectively, on $Σ_I$, whose maximum (indeterminate) letter can be expressed in a 32-bit word (for $σ\le 4$, thus for DNA sequences, an 8-bit representation suffices). We first describe an efficient version KMP Indet of the venerable Knuth-Morris-Pratt algorithm to find all occurrences of $q$ in $y$ (that is, of $p$ in $x$), but, whenever necessary, using the prefix array, rather than the border array, to control shifts of the transformed pattern $q$ along the transformed string $y$. We go on to describe a similar efficient version BM Indet of the Boyer- Moore algorithm that turns out to execute significantly faster than KMP Indet over a wide range of test cases. A noteworthy feature is that both algorithms require very little additional space: $Θ(m)$ words. We conjecture that a similar approach may yield practical and efficient indeterminate equivalents to other well-known pattern-matching algorithms, in particular the several variants of Boyer-Moore.
format	Preprint
id	arxiv_https___arxiv_org_abs_2204_08331
institution	arXiv
publishDate	2022
record_format	arxiv
spellingShingle	Practical KMP/BM Style Pattern-Matching on Indeterminate Strings Dehghani, Hossein Mhaskar, Neerja Smyth, W. F. Data Structures and Algorithms In this paper we describe two simple, fast, space-efficient algorithms for finding all matches of an indeterminate pattern $p = p[1..m]$ in an indeterminate string $x = x[1..n]$, where both $p$ and $x$ are defined on a "small" ordered alphabet $Σ$ $-$ say, $σ= \|Σ\| \le 9$. Both algorithms depend on a preprocessing phase that replaces $Σ$ by an integer alphabet $Σ_I$ of size $σ_I = σ$ which (reversibly, in time linear in string length) maps both $x$ and $p$ into equivalent regular strings $y$ and $q$, respectively, on $Σ_I$, whose maximum (indeterminate) letter can be expressed in a 32-bit word (for $σ\le 4$, thus for DNA sequences, an 8-bit representation suffices). We first describe an efficient version KMP Indet of the venerable Knuth-Morris-Pratt algorithm to find all occurrences of $q$ in $y$ (that is, of $p$ in $x$), but, whenever necessary, using the prefix array, rather than the border array, to control shifts of the transformed pattern $q$ along the transformed string $y$. We go on to describe a similar efficient version BM Indet of the Boyer- Moore algorithm that turns out to execute significantly faster than KMP Indet over a wide range of test cases. A noteworthy feature is that both algorithms require very little additional space: $Θ(m)$ words. We conjecture that a similar approach may yield practical and efficient indeterminate equivalents to other well-known pattern-matching algorithms, in particular the several variants of Boyer-Moore.
title	Practical KMP/BM Style Pattern-Matching on Indeterminate Strings
topic	Data Structures and Algorithms
url	https://arxiv.org/abs/2204.08331

Similar Items