Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features

Mahmoud Ghandi*, Dongwon Lee*, Morteza Mohammad-Noori, Michael A. Beer

* These authors contributed equally to the work

Oligomers of length k, or k-mers, are convenient and widely used features for modeling the properties and functions of DNA and protein sequences. However, k-mers suffer from an inherent limitation: if the parameter k is increased to resolve longer features, the probability of observing any specific k-mer becomes very small, and k-mer counts approach a binary variable, with most k-mers absent and a few present once. Thus, any statistical learning approach using k-mers as features becomes susceptible to noisy training set k-mer frequencies once k becomes large. To address this problem, we introduce alternative feature sets using gapped k-mers, a new classifier, gkm-SVM, and a general method for robust estimation of k-mer frequencies. To make the method practical for large-scale, genome-wide applications, we develop an efficient tree data structure for computing the kernel matrix. We show that compared to our original kmer-SVM and alternative approaches, our gkm-SVM predicts functional genomic regulatory elements and tissue-specific enhancers with significantly improved accuracy, increasing the precision by up to a factor of two. We then show that gkm-SVM consistently outperforms kmer-SVM on human ENCODE ChIP-seq datasets, and further demonstrate the general utility of our method using a Naive-Bayes classifier. Although developed for regulatory sequence analysis, these methods can be applied to any sequence classification problem.
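To illustrate why gapped features are less sparse than contiguous k-mers, here is a minimal Python sketch. It is not the gkm-SVM implementation (which uses a tree data structure); it is a brute-force toy that assumes a word length l with k informative positions, writes each masked position as '-', and compares two sequences with a normalized dot product. The function names and the '-' symbol are choices made for this example only.

```python
from collections import Counter
from itertools import combinations
import math


def gapped_kmer_counts(seq, l, k):
    """Count gapped k-mers: for every window of length l, keep k positions
    and mask the remaining l - k positions with '-'."""
    counts = Counter()
    masks = [set(m) for m in combinations(range(l), k)]
    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        for keep in masks:
            feature = "".join(c if i in keep else "-" for i, c in enumerate(window))
            counts[feature] += 1
    return counts


def normalized_kernel(a, b):
    """Cosine similarity between two feature-count vectors."""
    dot = sum(v * b[key] for key, v in a.items() if key in b)
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Two toy sequences that differ at a single position.
s1 = "ACGTACGTAC"
s2 = "ACGTTCGTAC"

# k == l means no gaps, i.e. ordinary contiguous 6-mers.
ungapped = normalized_kernel(gapped_kmer_counts(s1, 6, 6), gapped_kmer_counts(s2, 6, 6))
# Allowing 2 masked positions lets windows with one mismatch share features.
gapped = normalized_kernel(gapped_kmer_counts(s1, 6, 4), gapped_kmer_counts(s2, 6, 4))

print(f"ungapped 6-mer similarity: {ungapped:.3f}")
print(f"gapped (l=6, k=4) similarity: {gapped:.3f}")
```

In this toy case the single substitution changes every contiguous 6-mer that the two sequences could have shared, while the gapped features that mask the substituted position still match. Enumerating all masks grows combinatorially with l, which is the cost the tree-based kernel computation described above is meant to avoid.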

New Software:

gkmSVM-R:

We have released a new gkm-SVM R package (Ghandi et al., Bioinformatics 2016), which includes a new, faster kernel implementation, negative sequence set generation, genomic sequence extraction, and variant prediction (deltaSVM). This R package is recommended for training gkm-SVM on up to ~20,000 training sequences.

Tutorial and installation notes can be found here: gkmSVM tutorial

Sample input files: CTCF_GM12878_hg38_top5k.bed (use 'Save link as...'), nr10mers.fa, ref.fa, alt.fa

Software package for Linux or Mac: gkmSVM-R
Windows users should use the CRAN library.
C++ source (if you prefer not to use R): gkmsvm-2.0.tar.gz

LS-GKM:

If you have larger sequence sets (~50,000-100,000 sequences), we recommend our large-scale software, which does not pre-compute the kernel matrix: LS-GKM
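A rough back-of-the-envelope calculation shows why avoiding a pre-computed kernel matrix matters at this scale. The sketch below assumes a dense n x n matrix of 8-byte floating-point entries; the storage layout actually used by either package is not stated here, so treat the numbers as an order-of-magnitude illustration only. The function name is hypothetical.

```python
def kernel_matrix_gib(n_sequences, bytes_per_entry=8, symmetric=False):
    """Approximate memory (GiB) for a pre-computed kernel matrix.

    Assumes dense storage; with symmetric=True only the upper triangle
    (including the diagonal) is counted.
    """
    if symmetric:
        entries = n_sequences * (n_sequences + 1) // 2
    else:
        entries = n_sequences ** 2
    return entries * bytes_per_entry / 2**30


for n in (20_000, 50_000, 100_000):
    print(f"{n:>7} sequences: ~{kernel_matrix_gib(n):.1f} GiB dense, "
          f"~{kernel_matrix_gib(n, symmetric=True):.1f} GiB upper triangle")
```

Because the entry count grows quadratically, a fivefold increase in sequences costs roughly twenty-five times the memory under these assumptions.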

Citation

If you use gkm-SVM, please cite as:
Ghandi M, Lee D, Mohammad-Noori M, Beer MA. 2014. Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features. PLoS Comput Biol 10: e1003711.
--and--
Ghandi M, Mohammad-Noori M, Ghareghani N, Lee D, Garraway L, Beer MA. 2016. gkmSVM: an R package for gapped-kmer SVM. Bioinformatics 32: 2205-2207.

Programs

Previous version: gkmsvm-1.3.tar.gz

Notes:

5/28/15: gkmsvm v1.3 (gkmsvm-1.3.tar.gz) is released. A compile error on Mac is fixed. The default value for the number of mismatches (-d option) is set to 3.
1/11/15: gkmsvm v1.2 (gkmsvm-1.2.tar.gz) is released. A buffer overflow bug in gkmsvm_classify is fixed.
9/15/14: gkmsvm v1.1 (gkmsvm-1.1.tar.gz) is released. A simple tutorial for predictive k-mer/PWM analysis (please see the README) and associated scripts have been added. A minor bug in gkmsvm_classify has also been fixed.
8/8/14: If you downloaded this package (gkmsvm.tar.gz) before 8/8/2014, please download and install it again. There was an error in the gkmsvm_train program in the original release.

Training Data Sets

All sequence data sets in FASTA format: sequences_gkmsvm.tar.gz (~325 MB)

If you have any questions, please contact Mike Beer at mbeer AT jhu DOT edu.