A multiple alignment workflow shows the effect of repeat masking and parameter tuning on alignment in plants

Wu, Yaoyao, Johnson, Lynn, Song, Baoxing, Romay, Cinta, Stitzer, Michelle, Siepel, Adam, Buckler, Edward, Scheben, Armin (April 2022) A multiple alignment workflow shows the effect of repeat masking and parameter tuning on alignment in plants. The Plant Genome. e20204. ISSN 1940-3372

Available under License Creative Commons Attribution.

URL: https://www.ncbi.nlm.nih.gov/pubmed/35416423
DOI: 10.1002/tpg2.20204

Abstract

Alignments of multiple genomes are a cornerstone of comparative genomics, but generating these alignments remains technically challenging and often impractical. We developed the msa_pipeline workflow (https://bitbucket.org/bucklerlab/msa_pipeline) to allow practical and sensitive multiple alignment of diverged plant genomes and calculation of conservation scores with minimal user input. As high repeat content and genomic divergence are substantial challenges in plant genome alignment, we also explored the effect of different masking approaches and parameters of the LAST aligner using genome assemblies of 33 grass species. Compared with conventional masking with RepeatMasker, a masking approach based on k-mers (nucleotide sequences of length k) increased the alignment rate of coding sequence and noncoding functional regions by 25% and 14%, respectively. We further found that default alignment parameters generally perform well, but parameter tuning can increase the alignment rate for noncoding functional regions by over 52% compared with default LAST settings. Finally, by increasing alignment sensitivity from the default baseline, parameter tuning can increase the number of noncoding sites that can be scored for conservation by over 76%. Overall, tuning of masking and alignment parameters can generate optimized multiple alignments to drive biological discovery in plants.
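
As a rough illustration of the k-mer masking idea mentioned in the abstract, the sketch below counts every k-mer in a sequence and soft-masks (lowercases) each base covered by a k-mer that occurs at least a threshold number of times. This is not the msa_pipeline implementation: the function names `count_kmers` and `soft_mask_by_kmer`, the parameter `min_count`, and the choice to count k-mers within a single sequence are all assumptions made for the example.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping k-mer in seq (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def soft_mask_by_kmer(seq, k, min_count):
    """Lowercase every base covered by a k-mer seen >= min_count times.

    Hypothetical helper for illustration only; k and min_count are
    assumed tuning knobs, not parameters taken from the paper.
    """
    upper = seq.upper()
    counts = count_kmers(upper, k)
    masked = [False] * len(upper)
    for i in range(len(upper) - k + 1):
        if counts[upper[i:i + k]] >= min_count:
            # Mark all k positions spanned by this frequent k-mer.
            for j in range(i, i + k):
                masked[j] = True
    return "".join(
        base.lower() if is_masked else base
        for base, is_masked in zip(upper, masked)
    )


if __name__ == "__main__":
    # The 4-mer ACGT occurs three times, so the tandem repeat is masked.
    print(soft_mask_by_kmer("ACGTACGTACGTTTGCA", k=4, min_count=3))
```

Soft-masking by case keeps the original bases in the sequence, so a downstream aligner can still be configured to treat lowercase regions differently rather than losing them outright.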

Item Type: Paper
Subjects: bioinformatics
bioinformatics > genomics and proteomics > genetics & nucleic acid processing
bioinformatics > genomics and proteomics
bioinformatics > computational biology
bioinformatics > genomics and proteomics > genetics & nucleic acid processing > genomes
organism description > plant
CSHL Authors:
Communities: CSHL labs > Siepel lab
SWORD Depositor: CSHL Elements
Depositing User: CSHL Elements
Date: 13 April 2022
Date Deposited: 20 Apr 2022 15:02
Last Modified: 17 Jan 2024 19:15
URI: https://repository.cshl.edu/id/eprint/40591
