A multiple genome alignment workflow shows the impact of repeat masking and parameter tuning on alignment of functional regions in plants

Wu, Yaoyao, Johnson, Lynn, Song, Baoxing, Romay, Cinta, Stitzer, Michelle, Siepel, Adam, Buckler, Edward, Scheben, Armin (June 2021) A multiple genome alignment workflow shows the impact of repeat masking and parameter tuning on alignment of functional regions in plants. BioRxiv. (Unpublished)

Available under License Creative Commons Attribution.

Abstract

Alignments of multiple genomes are a cornerstone of comparative genomics, but generating these alignments remains technically challenging and often impractical. We developed the msa_pipeline workflow (https://bitbucket.org/bucklerlab/msa_pipeline) based on the LAST aligner to allow practical and sensitive multiple alignment of diverged plant genomes with minimal user input. Our workflow requires only a set of genomes in FASTA format as input. The workflow outputs multiple alignments in MAF format, and includes utilities to help calculate genome-wide conservation scores. As high repeat content and genomic divergence are substantial challenges in plant genome alignment, we also explored the impact of different masking approaches and alignment parameters using genome assemblies of 33 grass species. Compared to conventional masking with RepeatMasker, a k-mer masking approach increased the alignment rate of CDS and non-coding functional regions by 25% and 14%, respectively. We further found that default alignment parameters generally perform well, but parameter tuning can increase the alignment rate for non-coding functional regions by over 52% compared to default LAST settings. Finally, by increasing alignment sensitivity from the default baseline, parameter tuning can increase the number of non-coding sites that can be scored for conservation by over 76%.

Item Type: Paper
Subjects: bioinformatics > genomics and proteomics > analysis and processing
bioinformatics > genomics and proteomics > alignment > sequence alignment
organism description > plant
CSHL Authors:
Communities: CSHL labs > Siepel lab
SWORD Depositor: CSHL Elements
Depositing User: CSHL Elements
Date: 2 June 2021
Date Deposited: 26 May 2022 15:07
Last Modified: 26 May 2022 15:07
URI: https://repository.cshl.edu/id/eprint/40635
