Sublinear Approximate String-Matching and Biological Applications

Chang, W. I., Lawler, E. L. (October 1994) Sublinear Approximate String-Matching and Biological Applications. Algorithmica, 12 (4-5). pp. 327-344. ISSN 0178-4617

Abstract

Given a text string of length n and a pattern string of length m over a b-letter alphabet, the k-differences approximate string matching problem asks for all locations in the text where the pattern occurs with at most k differences (substitutions, insertions, deletions). We treat k not as a constant but as a fraction of m (not necessarily constant-fraction). Previous algorithms require at least O(kn) time (or exponential space). We give an algorithm that is sublinear time O((n/m) k log_b m) when the text is random and k is bounded by the threshold m/(log_b m + O(1)). In particular, when k = o(m/log_b m) the expected running time is o(n). In the worst case our algorithm is O(kn), but is still an improvement in that it is practical and uses O(m) space compared with O(n) or O(m^2). We define three problems motivated by molecular biology and describe efficient algorithms based on our techniques: (1) approximate substring matching, (2) approximate-overlap detection, and (3) approximate codon matching. Respectively, applications to biology are local similarity search, sequence assembly, and DNA-protein matching.
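
To make the problem statement concrete, the following is a minimal sketch of the classical column-by-column dynamic program for k-differences matching. It is not the sublinear algorithm of the paper; it is only a baseline that runs in O(mn) time and O(m) space and reports every text position where a match with at most k differences ends. The function name `k_difference_matches` and its parameters are illustrative assumptions, not taken from the paper.

```python
def k_difference_matches(text, pattern, k):
    """Return 1-based end positions in `text` where some substring
    matches `pattern` with at most k differences (substitutions,
    insertions, deletions).

    Baseline dynamic program, NOT the paper's sublinear method.
    Hypothetical helper written for illustration only.
    """
    m = len(pattern)
    # col[i] = fewest differences between pattern[:i] and any
    # substring of the text ending at the current text position.
    # Before reading any text, pattern[:i] costs i deletions.
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]  # value of the previous column at row i-1
        # col[0] stays 0 because a match may start at any text position.
        for i in range(1, m + 1):
            old = col[i]  # previous column, same row
            cost = 0 if pattern[i - 1] == ch else 1
            col[i] = min(
                prev_diag + cost,  # match or substitution
                old + 1,           # extra text character
                col[i - 1] + 1,    # missing pattern character
            )
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(k_difference_matches("abcdefg", "cde", 0))  # [5]
    print(k_difference_matches("abcdefg", "cde", 1))  # [4, 5, 6]
```

With k = 0 only the exact occurrence ending at position 5 is reported; with k = 1 the positions 4 and 6 also qualify, since "cd" and "cdef" each differ from "cde" by one edit. Only a single column of m + 1 integers is kept, which is what gives this baseline O(m) space.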

Item Type: Paper
Uncontrolled Keywords: PATTERN MATCHING EDIT DISTANCE SUFFIX TREE LOWEST COMMON ANCESTOR CHERNOFF BOUND COMPUTATIONAL BIOLOGY DATA-COMPRESSION ALGORITHM SEARCH
Subjects: bioinformatics > computational biology
CSHL Authors:
Communities: CSHL labs
Depositing User: Matt Covey
Date: October 1994
Date Deposited: 27 Aug 2015 16:10
Last Modified: 27 Aug 2015 16:10
URI: https://repository.cshl.edu/id/eprint/31388
