Biophysical models of transcriptional regulation from sequence data

Kinney, J. B. (2008) Biophysical models of transcriptional regulation from sequence data. PhD thesis, Princeton University.



In the post-genomics era, DNA sequence itself is becoming a medium by which to probe biological phenomena. With the advent of microarray technology, and ultra-high-throughput sequencing more recently, large sequence data sets are becoming standard products of day-to-day research. Yet as software for analyzing such data proliferates, a fundamental understanding of how DNA sequence should be used to gain biological insight is missing from the literature. The focus of this thesis is on developing tools for characterizing the biophysical interactions underlying transcriptional regulation: the ability of cells to control which genes they transcribe to mRNA, and thus express as protein. We begin by presenting basic principles for the analysis of DNA sequence data, specifically data for which each sequence σ is accompanied by a (perhaps very noisy) measurement z of biophysical functionality. A salient feature of experiments which produce such σz data is the difficulty of characterizing experimental noise a priori. We overcome this obstacle by introducing error-model-averaged (EMA) likelihood, which allows biophysical models of arbitrary functional form to be rigorously fit to σz data. EMA likelihood is closely related to mutual information, but its probabilistic interpretation provides some advantages. We demonstrate EMA likelihood's utility on previously published microarray data, using Metropolis Monte Carlo sampling to infer models for the DNA-binding energy of transcription factor proteins. The ability to properly analyze σz data leads us to propose a new experimental assay, called Sort-Seq. This technique uses ultra-high-throughput sequencing to probe the protein-DNA and protein-protein interactions underlying transcriptional regulation at specific genomic loci. We present data from a proof-of-principle Sort-Seq experiment probing the lacZ promoter of E. coli, data we use to characterize the sequence-dependent binding energy of the transcription factor CRP.
We then discuss what one can, in principle, infer from large Sort-Seq data sets. We show that, with enough σz data probing the binding of multiple proteins per sequence, one should be able to infer both protein-DNA and protein-protein interaction energies in absolute thermal units. We conclude that, with the advent of ultra-high-throughput sequencing, DNA sequence itself might provide a very sensitive means by which to probe in vivo biophysics.

Item Type: Thesis (PhD)
Subjects: physics > biophysics
bioinformatics > genomics and proteomics > genetics & nucleic acid processing > DNA, RNA structure, function, modification > transcription
Investigative techniques and equipment > assays > whole genome sequencing
CSHL Authors:
Communities: CSHL labs > Kinney lab
Depositing User: Matt Covey
Date: 2008
Date Deposited: 30 Apr 2015 19:18
Last Modified: 30 Apr 2015 19:18
