Semi-automated assembly of high-quality diploid human reference genomes

Jarvis, Erich D, Formenti, Giulio, Rhie, Arang, Guarracino, Andrea, Yang, Chentao, Wood, Jonathan, Tracey, Alan, Thibaud-Nissen, Francoise, Vollger, Mitchell R, Porubsky, David, Cheng, Haoyu, Asri, Mobin, Logsdon, Glennis A, Carnevali, Paolo, Chaisson, Mark JP, Chin, Chen-Shan, Cody, Sarah, Collins, Joanna, Ebert, Peter, Escalona, Merly, Fedrigo, Olivier, Fulton, Robert S, Fulton, Lucinda L, Garg, Shilpa, Gerton, Jennifer L, Ghurye, Jay, Granat, Anastasiya, Green, Richard E, Harvey, William, Hasenfeld, Patrick, Hastie, Alex, Haukness, Marina, Jaeger, Erich B, Jain, Miten, Kirsche, Melanie, Kolmogorov, Mikhail, Korbel, Jan O, Koren, Sergey, Korlach, Jonas, Lee, Joyce, Li, Daofeng, Lindsay, Tina, Lucas, Julian, Luo, Feng, Marschall, Tobias, Mitchell, Matthew W, McDaniel, Jennifer, Nie, Fan, Olsen, Hugh E, Olson, Nathan D, Pesout, Trevor, Potapova, Tamara, Puiu, Daniela, Regier, Allison, Ruan, Jue, Salzberg, Steven L, Sanders, Ashley D, Schatz, Michael C, Schmitt, Anthony, Schneider, Valerie A, Selvaraj, Siddarth, Shafin, Kishwar, Shumate, Alaina, Stitziel, Nathan O, Stober, Catherine, Torrance, James, Wagner, Justin, Wang, Jianxin, Wenger, Aaron, Xiao, Chuanle, Zimin, Aleksey V, Zhang, Guojie, Wang, Ting, Li, Heng, Garrison, Erik, Haussler, David, Hall, Ira, Zook, Justin M, Eichler, Evan E, Phillippy, Adam M, Paten, Benedict, Howe, Kerstin, Miga, Karen H, Human Pangenome Reference Consortium (November 2022) Semi-automated assembly of high-quality diploid human reference genomes. Nature, 611 (7936). pp. 519-531. ISSN 0028-0836

Semi-automated assembly of high-quality diploid human reference genomes.pdf - Published Version
Available under License Creative Commons Attribution.

URL: https://www.ncbi.nlm.nih.gov/pubmed/36261518
DOI: 10.1038/s41586-022-05325-5

Abstract

The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society [1,2]. However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals [3,4]. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome [5]. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity [6]. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yields the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent-child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.

Item Type: Paper
Subjects: bioinformatics > genomics and proteomics > analysis and processing
bioinformatics
bioinformatics > genomics and proteomics > genetics & nucleic acid processing > DNA, RNA structure, function, modification
bioinformatics > genomics and proteomics > genetics & nucleic acid processing
bioinformatics > genomics and proteomics
organism description > animal
bioinformatics > genomics and proteomics > genetics & nucleic acid processing > DNA, RNA structure, function, modification > chromosome
bioinformatics > genomics and proteomics > genetics & nucleic acid processing > DNA, RNA structure, function, modification > chromosomes, structure and function > chromosome
bioinformatics > genomics and proteomics > genetics & nucleic acid processing > DNA, RNA structure, function, modification > chromosomes, structure and function
bioinformatics > genomics and proteomics > genetics & nucleic acid processing > DNA, RNA structure, function, modification > single nucleotide polymorphism > haplotype
organism description > animal > mammal > primates > hominids
organism description > animal > mammal > primates > hominids > human
organism description > animal > mammal
organism description > animal > mammal > primates
bioinformatics > genomics and proteomics > analysis and processing > reference assembly
bioinformatics > genomics and proteomics > genetics & nucleic acid processing > DNA, RNA structure, function, modification > single nucleotide polymorphism
Communities: CSHL labs > Schatz lab
SWORD Depositor: CSHL Elements
Depositing User: CSHL Elements
Date: November 2022
Date Deposited: 02 Oct 2023 18:50
Last Modified: 11 Jan 2024 20:39
PMCID: PMC9668749
URI: https://repository.cshl.edu/id/eprint/41106
