Phylogenetic estimation of context-dependent substitution rates by maximum likelihood

Siepel, A., Haussler, D. (2004) Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol, 21 (3). pp. 468-88. ISSN 0737-4038 (Print)

URL: http://www.ncbi.nlm.nih.gov/pubmed/14660683
DOI: 10.1093/molbev/msh039

Abstract

Nucleotide substitution in both coding and noncoding regions is context-dependent, in the sense that substitution rates depend on the identity of neighboring bases. Context-dependent substitution has been modeled in the case of two sequences and an unrooted phylogenetic tree, but it has only been accommodated in limited ways with more general phylogenies. In this article, extensions are presented to standard phylogenetic models that allow for better handling of context-dependent substitution, yet still permit exact inference at reasonable computational cost. The new models improve goodness of fit substantially for both coding and noncoding data. Considering context dependence leads to much larger improvements than does using a richer substitution model or allowing for rate variation across sites, under the assumption of site independence. The observed improvements appear to derive from three separate properties of the models: their explicit characterization of context-dependent substitution within N-tuples of adjacent sites, their ability to accommodate overlapping N-tuples, and their rich parameterization of the substitution process. Parameter estimation is accomplished using an expectation maximization algorithm, with a quasi-Newton algorithm for the maximization step; this approach is shown to be preferable to ordinary Newton methods for parameter-rich models. Overlapping tuples are efficiently handled by assuming Markov dependence of the observed bases at each site on those at the N - 1 preceding sites, and the required conditional probabilities are computed with an extension of Felsenstein's algorithm. Estimated substitution rates based on a data set of about 160,000 noncoding sites in mammalian genomes indicate a pronounced CpG effect, but they also suggest a complex overall pattern of context-dependent substitution, comprising a variety of subtle effects. Estimates based on about 3 million sites in coding regions demonstrate that amino acid substitution rates can be learned at the nucleotide level, and suggest that context effects across codon boundaries are significant.
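
For readers unfamiliar with the algorithm the abstract builds on, the following is a minimal sketch of the standard, site-independent form of Felsenstein's pruning algorithm. It is not the extension described in the article. Everything in it is an assumption made for illustration: the Jukes-Cantor substitution model, the nested-tuple tree encoding, and the names `jc_prob`, `partial_likelihoods`, and `site_likelihood`. The article's models are much richer and operate on N-tuples of adjacent sites rather than on single bases.

```python
import math

BASES = "ACGT"


def jc_prob(a, b, t):
    """Jukes-Cantor probability that base a is replaced by base b
    along a branch of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def partial_likelihoods(node, column):
    """Return {base: P(observed leaves below node | node carries base)}.

    Assumed tree encoding (hypothetical, for this sketch only):
    a leaf is a species name (str); an internal node is a tuple of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        # Leaf: probability 1 for the observed base, 0 for the others.
        return {b: 1.0 if column[node] == b else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        child_l = partial_likelihoods(child, column)
        for a in BASES:
            # Sum over the unknown base at the child end of the branch.
            result[a] *= sum(jc_prob(a, b, t) * child_l[b] for b in BASES)
    return result


def site_likelihood(tree, column):
    """Likelihood of one alignment column, with a uniform base
    distribution at the root (the Jukes-Cantor equilibrium)."""
    root_l = partial_likelihoods(tree, column)
    return sum(0.25 * root_l[b] for b in BASES)


# Hypothetical three-species tree and one alignment column.
tree = (((("human", 0.1), ("mouse", 0.3)), 0.05), ("dog", 0.2))
column = {"human": "C", "mouse": "T", "dog": "C"}
print(site_likelihood(tree, column))
```

The recursion visits each node once and sums over the unobserved bases at internal nodes, which is why exact likelihoods stay cheap to compute. Under the site-independence assumption, the likelihood of a whole alignment would be the product of these per-column values.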

Item Type: Paper
Uncontrolled Keywords: Algorithms; Amino Acid Substitution; Computer Simulation; *Likelihood Functions; Markov Chains; *Models, Genetic; *Phylogeny; Sequence Alignment
Subjects: bioinformatics > genomics and proteomics > alignment > sequence alignment
bioinformatics > genomics and proteomics > genetics & nucleic acid processing > DNA, RNA structure, function, modification > CpG islands
bioinformatics > computational biology > statistical analysis
Communities: CSHL labs > Siepel lab
Depositing User: Matt Covey
Date Deposited: 14 Jan 2015 14:18
Last Modified: 14 Jan 2015 14:18
URI: http://repository.cshl.edu/id/eprint/31101