SeqHBase: a big data toolset for family based sequencing data analysis

He, M., Person, T. N., Hebbring, S. J., Heinzen, E., Ye, Z., Schrodi, S. J., McPherson, E. W., Lin, S. M., Peissig, P. L., Brilliant, M. H., O'Rawe, J., Robison, R. J., Lyon, G. J., Wang, K. (January 2015) SeqHBase: a big data toolset for family based sequencing data analysis. Journal of Medical Genetics, 52 (4). pp. 282-288. ISSN 0022-2593

URL: http://www.ncbi.nlm.nih.gov/pubmed/25587064
DOI: 10.1136/jmedgenet-2014-102907

Abstract

BACKGROUND: Whole-genome sequencing (WGS) and whole-exome sequencing (WES) technologies are increasingly used to identify disease-contributing mutations in human genomic studies. It can be a significant challenge to process such data, especially when a large family or cohort is sequenced. Our objective was to develop a big data toolset to efficiently manipulate genome-wide variants, functional annotations and coverage, together with conducting family based sequencing data analysis.

METHODS: Hadoop is a framework for reliable, scalable, distributed processing of large data sets using MapReduce programming models. Based on Hadoop and HBase, we developed SeqHBase, a big data-based toolset for analysing family based sequencing data to detect de novo, inherited homozygous, or compound heterozygous mutations that may contribute to disease manifestations. SeqHBase takes as input BAM files (for coverage at every site), variant call format (VCF) files (for variant calls) and functional annotations (for variant prioritisation).

RESULTS: We applied SeqHBase to a 5-member nuclear family and a 10-member 3-generation family with WGS data, as well as a 4-member nuclear family with WES data. Analysis times were almost linearly scalable with number of data nodes. With 20 data nodes, SeqHBase took about 5 secs to analyse WES familial data and approximately 1 min to analyse WGS familial data.

CONCLUSIONS: These results demonstrate SeqHBase's high efficiency and scalability, which is necessary as WGS and WES are rapidly becoming standard methods to study the genetics of familial disorders.
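The three inheritance patterns named in the abstract (de novo, inherited homozygous, compound heterozygous) can be illustrated with a minimal single-machine sketch for one child and two parents. This is not SeqHBase's implementation and does not use Hadoop or HBase. The function names, the record layout and the sample variants are hypothetical, and the sketch assumes biallelic, unphased VCF-style genotype strings with no missing calls. It also omits the coverage check that the BAM input would support, so a parent's "0/0" is taken at face value here.

```python
from collections import defaultdict

# VCF-style unphased genotypes for a biallelic site (assumption of this sketch).
HOM_REF, HET, HOM_ALT = "0/0", "0/1", "1/1"


def is_de_novo(child, father, mother):
    """Child is heterozygous while both parents are homozygous reference."""
    return child == HET and father == HOM_REF and mother == HOM_REF


def is_inherited_homozygous(child, father, mother):
    """Child is homozygous alternate and both parents are heterozygous carriers."""
    return child == HOM_ALT and father == HET and mother == HET


def compound_het_pairs(variants):
    """Per gene, pair each paternally inherited het variant with each maternal one."""
    by_gene = defaultdict(lambda: {"paternal": [], "maternal": []})
    for v in variants:
        if v["child"] != HET:
            continue
        if v["father"] == HET and v["mother"] == HOM_REF:
            by_gene[v["gene"]]["paternal"].append(v["id"])
        elif v["mother"] == HET and v["father"] == HOM_REF:
            by_gene[v["gene"]]["maternal"].append(v["id"])
    return {
        gene: [(p, m) for p in origins["paternal"] for m in origins["maternal"]]
        for gene, origins in by_gene.items()
        if origins["paternal"] and origins["maternal"]
    }


# Hypothetical trio calls; identifiers and gene names are placeholders.
trio_calls = [
    {"id": "var1", "gene": "GENE_A", "child": "0/1", "father": "0/0", "mother": "0/0"},
    {"id": "var2", "gene": "GENE_B", "child": "1/1", "father": "0/1", "mother": "0/1"},
    {"id": "var3", "gene": "GENE_C", "child": "0/1", "father": "0/1", "mother": "0/0"},
    {"id": "var4", "gene": "GENE_C", "child": "0/1", "father": "0/0", "mother": "0/1"},
]

de_novo = [v["id"] for v in trio_calls
           if is_de_novo(v["child"], v["father"], v["mother"])]
homozygous = [v["id"] for v in trio_calls
              if is_inherited_homozygous(v["child"], v["father"], v["mother"])]
compound = compound_het_pairs(trio_calls)

print(de_novo, homozygous, compound)
```

In this sketch a gene is reported as compound heterozygous only when the child carries at least one heterozygous variant traceable to each parent, which stands in for phasing when only unphased trio genotypes are available.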

Item Type: Paper
Subjects: bioinformatics
Investigative techniques and equipment > whole exome sequencing
Investigative techniques and equipment > assays > whole exome sequencing
Investigative techniques and equipment > assays > whole genome sequencing
Communities: CSHL labs > Lyon lab
Stanley Institute for Cognitive Genomics
Depositing User: Matt Covey
Date: 13 January 2015
Date Deposited: 30 Jan 2015 20:51
Last Modified: 06 Nov 2015 20:43
PMCID: PMC4382803
URI: https://repository.cshl.edu/id/eprint/31168
