gkmsvm README file
==================

  gkm-SVM provides an implementation of a new SVM kernel method that uses
  gapped k-mers as features for DNA sequences, along with related scripts.
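
  To give a rough idea of the feature space, the sketch below enumerates
  gapped k-mer features of a DNA sequence: every length-l window contributes
  one pattern per choice of k informative positions, with the remaining l-k
  positions treated as gaps. This is an illustration only (the function name
  and the '.' gap notation are ours); the actual gkm-SVM kernel is computed
  far more efficiently without materializing these features.

```python
from itertools import combinations

def gapped_kmer_counts(seq, l, k):
    """Count gapped k-mer patterns: for each length-l window, keep k
    informative positions and mark the other l-k positions as gaps ('.')."""
    counts = {}
    for i in range(len(seq) - l + 1):
        lmer = seq[i:i + l]
        for pos in combinations(range(l), k):
            pattern = "".join(c if j in pos else "." for j, c in enumerate(lmer))
            counts[pattern] = counts.get(pattern, 0) + 1
    return counts

feats = gapped_kmer_counts("ACGTACGTACGT", l=6, k=4)
```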

  If you use gkm-SVM, we kindly ask that you cite our paper:

  Ghandi M*, Lee D*, Mohammad-Noori M, and Beer MA. Enhanced Regulatory Sequence
  Prediction Using Gapped k-mer Features. Submitted.
  * Co-first authors


Installation
============

  Download gkmsvm.tar.gz and type:

  tar -xzvf gkmsvm.tar.gz
  cd gkmsvm
  make

  If successful, you should be able to find the following executables in the
  current directory:

  gkmsvm_kernel
  gkmsvm_train
  gkmsvm_classify


Tutorial
========

  This tutorial walks through the basic gkm-SVM workflow step by step.
  Please refer to the help message of each program for more detailed
  information; you can access it by running the program without any
  arguments.
  

  1) Making a kernel matrix
  First of all, we must calculate the full kernel matrix before training SVM
  classifiers. In this tutorial, we use test/test_positives.fa as the
  positive set and test/test_negatives.fa as the negative set. Note that we
  specify the '-d' option for more efficient computation; otherwise, we use
  the default settings.

  Type:
  $ ./gkmsvm_kernel -d 3 test/test_positives.fa test/test_negatives.fa test_kernel.out

  Now we have the kernel matrix file, test_kernel.out. For comparison, 
  we have also made the same file available in the test directory,
  test/test_kernel.out. These two files should be identical.
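
  To confirm that the two files match byte for byte, you can compare
  checksums; a minimal Python sketch (the helper name is ours):

```python
import hashlib

def same_contents(path_a, path_b):
    """Return True if two files are byte-for-byte identical (via SHA-256)."""
    def digest(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
        return h.digest()
    return digest(path_a) == digest(path_b)

# e.g. same_contents("test_kernel.out", "test/test_kernel.out")
```

  On the command line, `cmp test_kernel.out test/test_kernel.out` does the
  same job.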


  2) Training the SVM
  We can now train an SVM classifier using the kernel matrix generated above.
  We provide two different ways to do so. The first is gkmsvm_train, our
  implementation of an iterative SVM training method. It takes four
  arguments: the kernel file, the positive sequence file, the negative
  sequence file, and the output prefix.

  Type:
  $ ./gkmsvm_train test_kernel.out test/test_positives.fa test/test_negatives.fa test_svmtrain

  It will generate two files, test_svmtrain_svalpha.out and
  test_svmtrain_svseq.fa, which will then be used for classification/scoring
  of any sequences as described below.

  Alternatively, we also provide a Python script, scripts/cksvm_train.py,
  for SVM training using Shogun. To run this script, the Shogun Toolbox
  (http://shogun-toolbox.org/) must be installed with python-modular enabled.
  The script supports Shogun versions 0.9.2 through 1.1.0. It takes the same
  arguments as gkmsvm_train. In addition to the SVM training results, it
  also generates cross-validation (CV) results.

  Type:
  $ python scripts/cksvm_train.py test_kernel.out test/test_positives.fa test/test_negatives.fa test_svmtrain2

  Now, you should have three files: test_svmtrain2_svalpha.out,
  test_svmtrain2_svseq.fa, and test_svmtrain2_cvpred.out. By default, the
  script performs 5-fold CV. The CV output format is simple:
  
  [sequenceid] [SVM score] [label] [CV-set]
  ...
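
  Each line can be parsed with a straightforward split; a minimal Python
  sketch (the function name is ours, and we assume the label and CV-set
  columns are integers):

```python
def parse_cvpred(lines):
    """Parse CV prediction lines: [sequenceid] [SVM score] [label] [CV-set]."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        seq_id, score, label, cvset = fields
        records.append((seq_id, float(score), int(label), int(cvset)))
    return records

# hypothetical example lines in the format shown above
recs = parse_cvpred(["seq1 1.234 1 0", "seq2 -0.567 -1 3"])
```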

  3) Plotting ROC and Precision-Recall (PR) curves
  You can also plot ROC and PR curves using the R script scripts/rocprcurve.R.
  To run this script, you need R and the ROCR package.

  Type:
  $ Rscript scripts/rocprcurve.R test_svmtrain2_cvpred.out test_svmtrain2_rocpr.pdf
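
  If you just need a number rather than a plot, the area under the ROC curve
  can also be computed directly from the scores and labels in the CV
  predictions. The sketch below uses the rank-sum identity (AUC equals the
  probability that a random positive outscores a random negative, with ties
  counted as 1/2); it is a quadratic-time illustration, not a replacement
  for ROCR:

```python
def roc_auc(scores, labels):
    """AUC via the Mann-Whitney identity: the fraction of
    (positive, negative) pairs where the positive scores higher,
    counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y > 0]
    neg = [s for s, y in zip(scores, labels) if y <= 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```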

  4) Classification using the SVM
  gkmsvm_classify can be used to score any set of sequences. Here, we will
  score the positive and the negative training sequences. Note that the same
  parameters used with gkmsvm_kernel must always be specified for correct
  classification; in our case, '-d 3' must be set.

  Type:
  $ ./gkmsvm_classify -d 3 test/test_positives.fa test_svmtrain_svseq.fa test_svmtrain_svalpha.out test_svmclassify_pos.out
  $ ./gkmsvm_classify -d 3 test/test_negatives.fa test_svmtrain_svseq.fa test_svmtrain_svalpha.out test_svmclassify_neg.out

  $ ./gkmsvm_classify -d 3 test/test_positives.fa test_svmtrain2_svseq.fa test_svmtrain2_svalpha.out test_svmclassify2_pos.out
  $ ./gkmsvm_classify -d 3 test/test_negatives.fa test_svmtrain2_svseq.fa test_svmtrain2_svalpha.out test_svmclassify2_neg.out

  5) Predictive k-mer/PWM analysis
  You can further study predictive k-mers and generate de novo PWMs as
  discussed in our paper. First, generate all possible non-redundant k-mers
  using the Python script scripts/nrkmers.py. Then, score them with the
  gkmsvm_classify program. Finally, build de novo PWMs with another script,
  scripts/svmw_emalign.py. Note that the following example does not generate
  any meaningful PWMs, since the test set is too small. Also, the length of
  the k-mers to be scored should match the parameter L used in training.

  Type:
  $ python scripts/nrkmers.py 10 nr10mers.fa
  $ ./gkmsvm_classify -d 3 nr10mers.fa test_svmtrain_svseq.fa test_svmtrain_svalpha.out nr10mers_test_scores.txt
  $ python scripts/svmw_emalign.py -n 1 -c 3 nr10mers_test_scores.txt 14 testpwm
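
  The idea behind the non-redundant k-mer set can be sketched as follows:
  since a k-mer and its reverse complement get the same score on
  double-stranded DNA, it suffices to keep one representative per
  reverse-complement pair. This is a simplified illustration of the concept;
  scripts/nrkmers.py may differ in details such as ordering and output
  format.

```python
from itertools import product

_COMP = str.maketrans("ACGT", "TGCA")

def revcomp(kmer):
    """Reverse complement of a DNA string."""
    return kmer.translate(_COMP)[::-1]

def nonredundant_kmers(k):
    """All k-mers over ACGT, keeping the lexicographically smaller of each
    k-mer / reverse-complement pair (palindromes appear once)."""
    seen = set()
    out = []
    for kmer in ("".join(p) for p in product("ACGT", repeat=k)):
        canon = min(kmer, revcomp(kmer))
        if canon not in seen:
            seen.add(canon)
            out.append(canon)
    return out
```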


  Done!


Distribution
============

README    -- This file
LICENSE   -- GPL v3 License file
Makefile  -- Makefile for compilation
src/      -- The source code files
scripts/  -- Python and R scripts for SVM training and analysis
test/     -- Sequence files for testing and test result files
