Chang, W. I., Lawler, E. L.
(October 1994)
*Sublinear Approximate String-Matching and Biological Applications.*
Algorithmica, 12 (4-5).
pp. 327-344.
ISSN 0178-4617

## Abstract

Given a text string of length n and a pattern string of length m over a b-letter alphabet, the k-differences approximate string matching problem asks for all locations in the text where the pattern occurs with at most k differences (substitutions, insertions, deletions). We treat k not as a constant but as a fraction of m (not necessarily constant-fraction). Previous algorithms require at least O(kn) time (or exponential space). We give an algorithm that is sublinear time O((n/m) k log_b m) when the text is random and k is bounded by the threshold m/(log_b m + O(1)). In particular, when k = o(m/log_b m) the expected running time is o(n). In the worst case our algorithm is O(kn), but is still an improvement in that it is practical and uses O(m) space compared with O(n) or O(m^2). We define three problems motivated by molecular biology and describe efficient algorithms based on our techniques: (1) approximate substring matching, (2) approximate-overlap detection, and (3) approximate codon matching. Respectively, applications to biology are local similarity search, sequence assembly, and DNA-protein matching.
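
To make the problem statement concrete, here is a minimal sketch of the textbook dynamic-programming baseline for k-differences matching (often attributed to Sellers). It is not the sublinear algorithm of the paper: it runs in O(mn) time, though it does keep only one column of O(m) entries. The function name `k_differences_end_positions` and the choice to report 1-based end positions are assumptions made for illustration.

```python
def k_differences_end_positions(pattern, text, k):
    """Return 1-based positions j in text such that some substring of text
    ending at j is within k differences (substitutions, insertions,
    deletions) of pattern. Illustrative O(mn) baseline, not the paper's method.
    """
    m = len(pattern)
    # col[i] = fewest differences between pattern[:i] and any substring
    # of the text ending at the current text position.
    col = list(range(m + 1))  # before reading any text: i deletions
    hits = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]  # value for (i - 1, j - 1)
        col[0] = 0          # a match may start anywhere in the text
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new_val = min(
                prev_diag + cost,  # match or substitution
                col[i] + 1,        # extra text character
                col[i - 1] + 1,    # missing pattern character
            )
            prev_diag = col[i]
            col[i] = new_val
        if col[m] <= k:
            hits.append(j)
    return hits


if __name__ == "__main__":
    print(k_differences_end_positions("abc", "xxabcxx", 1))
```

With the pattern `abc`, the text `xxabcxx` and k = 1, the sketch reports end positions 4, 5 and 6, corresponding to the substrings `ab`, `abc` and `abcx`.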

Item Type: | Paper |
---|---|
Uncontrolled Keywords: | PATTERN MATCHING EDIT DISTANCE SUFFIX TREE LOWEST COMMON ANCESTOR CHERNOFF BOUND COMPUTATIONAL BIOLOGY DATA-COMPRESSION ALGORITHM SEARCH |
Subjects: | bioinformatics > computational biology |
CSHL Authors: | |
Communities: | CSHL labs |
Depositing User: | Matt Covey |
Date: | October 1994 |
Date Deposited: | 27 Aug 2015 16:10 |
Last Modified: | 27 Aug 2015 16:10 |
URI: | https://repository.cshl.edu/id/eprint/31388 |
