Discriminative prediction of mammalian enhancers from DNA sequence

Dongwon Lee, Rachel Karchin and Michael A. Beer

Accurately predicting regulatory sequences and enhancers in entire genomes is an important but difficult problem, especially in large vertebrate genomes. With the advent of ChIP-seq technology, experimental detection of genome-wide EP300/CREBBP bound regions provides a powerful platform to develop predictive tools for regulatory sequences and to study their sequence properties. Here, we develop a support vector machine (SVM) framework which can accurately identify EP300-bound enhancers using only genomic sequence and an unbiased set of general sequence features. Moreover, we find that the predictive sequence features identified by the SVM classifier reveal biologically relevant sequence elements enriched in the enhancers, but we also identify other features that are significantly depleted in enhancers. The predictive sequence features are evolutionarily conserved and spatially clustered, providing further support of their functional significance. Although our SVM is trained on experimental data, we also predict novel enhancers and show that these putative enhancers are significantly enriched in both ChIP-seq signal and DNase I hypersensitivity signal in the mouse brain and are located near relevant genes. Finally, we present results of comparisons between other EP300/CREBBP data sets using our SVM and uncover sequence elements enriched and/or depleted in the different classes of enhancers. Many of these sequence features play a role in specifying tissue-specific or developmental-stage-specific enhancer activity, but our results indicate that some features operate in a general or tissue-independent manner. In addition to providing a high confidence list of enhancer targets for subsequent experimental investigation, these results contribute to our understanding of the general sequence structure of vertebrate enhancers.
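To make "general sequence features" concrete, here is a minimal sketch of one possible feature representation: normalized counts of all overlapping k-mers in a sequence. This is an assumption for illustration only. The abstract above does not state which features the SVM uses, and the names kmer_counts and feature_vector and the default k=6 are hypothetical. A vector like this could serve as input to any SVM implementation.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k=6):
    """Count overlapping k-mers, skipping windows that contain non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def feature_vector(seq, k=6):
    """Return k-mer frequencies in fixed lexicographic order (length 4**k)."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values()) or 1  # avoid division by zero on short input
    return [counts["".join(p)] / total for p in product("ACGT", repeat=k)]
```

Every sequence maps to a vector of the same length regardless of its own length, which is what a kernel-based classifier needs.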

Programs

Download our Python scripts. (Note: The scripts were developed using the SHOGUN Toolbox, which must be installed before they can be run.)

Training Data Sets

All sequence data sets in FASTA format: sequence_files.tar.gz
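After unpacking the archive, the FASTA files can be read with a few lines of standard-library Python. The sketch below is a generic FASTA reader, not part of the distributed scripts. The function name read_fasta is hypothetical, and the file names inside the archive are not listed on this page.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                # A new record starts; emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)
```

Sequences that span several lines are joined into a single string, so each record comes back as one (header, sequence) pair.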

Genome-Wide Forebrain p300 Enhancer Prediction

UCSC Genome Browser custom track in bigWig format: fb_pred.bw

Or, simply copy and paste the following track line:
track type=bigWig name="SVM prediction" description="SVM Forebrain Enhancer Scores" visibility=full color=200,100,0 altColor=0,100,200 priority=20 autoScale=off viewLimits=-3:3 yLineMark=1.0 yLineOnOff=on bigDataUrl=http://www.beerlab.org/p300enhancer/downloads/fb_pred.bw

If you have any questions, please contact Dongwon Lee at dwlee AT jhu DOT edu.