MotifSpec

A discriminative motif finder that uses dynamic search spaces

Download MotifSpec

MotifSpec source code

As an alternative to the tarfile, you may use git clone to download the source:

# git clone https://github.com/rakarnik/motifspec.git

Building MotifSpec

MotifSpec is written in C++. We have compiled it successfully on recent versions of Linux (GCC 4.1.2) and Mac OS X (GCC 4.2.1), and it should compile on any Unix system with a reasonably recent GCC. Once you have unpacked the source, compile it by typing:

# make

After the code has compiled, the MotifSpec binary (motifspec) is placed in the "bin" directory. You can copy it anywhere on your system and run it from there.

Running MotifSpec

MotifSpec has worker processes that perform the actual motif search and an archive process that collects non-redundant motif results. You can run any number of worker processes, parallelizing the motif search, since each random restart is independent. The connection between the worker processes and the archive process is the "-o" option, which must be the same for all processes within a run. You can thus have multiple MotifSpec runs within the same directory, as long as you keep the "-o" option different between the runs.

To start a worker process to run on the NRSF ChIP-seq data:

# bin/motifspec -s seq/nrsf.fa -su seq/nrsf.pos -o nrsf -worker 1 -numcols 15 -seed 1 >& nrsf.1.out & 

To start the archive process:

# bin/motifspec -s seq/nrsf.fa -su seq/nrsf.pos -o nrsf -numcols 15 -seed 1 >& nrsf.out & 
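The two invocations above can be generated from a script when several workers are needed. The following is a minimal Python sketch, not part of MotifSpec: the helper names (motifspec_cmd, run_commands) are hypothetical, it uses only the flags documented on this page, and it assumes that each worker should receive its own seed (here, its worker ID), which this page does not state.

```python
def motifspec_cmd(seqfile, posfile, out, numcols=15, seed=1, worker=None,
                  binary="bin/motifspec"):
    """Build the argument list for one MotifSpec process.

    worker=None builds the archive command (no -worker flag);
    an integer builds a worker command with that ID.
    """
    cmd = [binary, "-s", seqfile, "-su", posfile, "-o", out]
    if worker is not None:
        cmd += ["-worker", str(worker)]
    cmd += ["-numcols", str(numcols), "-seed", str(seed)]
    return cmd


def run_commands(seqfile, posfile, out, n_workers, numcols=15):
    """Return (worker command lists, archive command list) for one run.

    All commands share the same -o prefix, as a run requires.
    Assumption: worker i is seeded with i; adjust if a shared seed is wanted.
    """
    workers = [
        motifspec_cmd(seqfile, posfile, out, numcols, seed=i, worker=i)
        for i in range(1, n_workers + 1)
    ]
    archive = motifspec_cmd(seqfile, posfile, out, numcols, seed=1)
    return workers, archive


if __name__ == "__main__":
    workers, archive = run_commands("seq/nrsf.fa", "seq/nrsf.pos", "nrsf", 3)
    for cmd in workers + [archive]:
        print(" ".join(cmd))
```

Each returned list can be passed to subprocess.Popen, with stdout and stderr redirected to a per-process log file to mirror the ">& nrsf.1.out" redirection shown above.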

Results will be collected in nrsf.ms and information on the motifs can be summarized by:

# perl scripts/msproc.pl nrsf.ms 

This should eventually produce:

ace                    mot    tot    ssp  gtseq   hits    score cons                                   seqc  sspc     iter
nrsf.ms                  1   7251   2417   1442   1445  3516.41 TCAGCACCATGGACAG                     0.9790  0.70      1.2
nrsf.ms                  2   7251   2417    965    335   731.53 RGRRARRRRRRRRRR                      0.9550  0.70      1.1
nrsf.ms                  3   7251   2417    501    143   370.55 AARAAAAAAAAAAAAA                     0.9880  0.70      1.5
nrsf.ms                  4   7251   2417     55     37   161.95 ACCYTG--AARKG-Y                      0.9750  0.70      1.3
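For downstream filtering, the summary table can be read into a list of records. This is a minimal Python sketch with a hypothetical helper name (parse_msproc_table). It assumes the table is whitespace-delimited with the header row shown above and that no field contains spaces; the meaning of each column is not defined on this page, so values are kept as strings apart from "score".

```python
def parse_msproc_table(text):
    """Parse msproc.pl tabular output into a list of dicts keyed by header name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != len(header):
            # Skip lines that do not match the header layout.
            continue
        row = dict(zip(header, fields))
        row["score"] = float(row["score"])
        rows.append(row)
    return rows


if __name__ == "__main__":
    sample = (
        "ace mot tot ssp gtseq hits score cons seqc sspc iter\n"
        "nrsf.ms 1 7251 2417 1442 1445 3516.41 TCAGCACCATGGACAG 0.9790 0.70 1.2\n"
    )
    for row in parse_msproc_table(sample):
        print(row["mot"], row["cons"], row["score"])
```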

Options

Mandatory options

-s <seqfile>
The input file containing sequences in FASTA format (positive and negative sets combined)
-o <out>
The output prefix. This argument must be the same for worker and archive processes within a run.
-worker <i>
Sets the ID number of the worker process. Omit this argument for the archive process.

You must also specify one (and only one) of the following three options:

-su <sufile>
File containing IDs of sequences that constitute the fixed positive set (one ID per line)
-sc <scfile>
File containing binding scores (tab-delimited, one ID and binding score per line)
-ex <exfile>
File containing expression values (tab-delimited, one ID and multiple expression values per line)
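The rules above (-s and -o are required, and exactly one of -su, -sc, -ex must be given) can be checked before a run is launched. This is a minimal Python sketch of such a pre-flight check. The function name check_mandatory is hypothetical, and the check looks only at which flags are present, not at the files they name.

```python
MODE_FLAGS = ("-su", "-sc", "-ex")


def check_mandatory(argv):
    """Validate a MotifSpec argument list against the mandatory-option rules.

    Returns the single mode flag in use; raises ValueError otherwise.
    """
    for required in ("-s", "-o"):
        if required not in argv:
            raise ValueError("missing mandatory option " + required)
    present = [flag for flag in MODE_FLAGS if flag in argv]
    if len(present) != 1:
        raise ValueError(
            "exactly one of -su, -sc, -ex is required, found: "
            + (", ".join(present) or "none")
        )
    return present[0]


if __name__ == "__main__":
    args = ["-s", "seq/nrsf.fa", "-su", "seq/nrsf.pos", "-o", "nrsf", "-worker", "1"]
    print(check_mandatory(args))
```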

Search parameters

-numcols <k>
The number of columns in the PWM motif model (default 10)
-order <o>
Order of the background Markov model (default 3)
-simcut <s>
Similarity cutoff (CompareACE-like) for motifs to be considered redundant (default 0.9)
-minpass <m>
The number of iterations that must occur without improvement before a motif search is terminated (default 100)
-seed <s>
Sets the random seed for reproducibility of runs (default uses system time)

Output

MotifSpec creates a single output file "<out>.adj.ace", where out is the output prefix option (see above). This file is in a format very similar to the AlignACE output format. It first lists the IDs of input sequences, followed by a list of motifs found. For each motif, the output format is as follows:
Motif <motidx>
<site>	<seqidx>	<seqpos>	<strand>
<site>	<seqidx>	<seqpos>	<strand>
<site>	<seqidx>	<seqpos>	<strand>
.
.
.
.
<site>	<seqidx>	<seqpos>	<strand>
*** **** *
Score: <score>
Sequences above sequence threshold: <s2>
Size of search space: <s1>
Sequence cutoff: <seqcut>
Expression cutoff: <exprcut>
Score cutoff: <sccut>
Iteration found:  <workerid>.<restartnum>
Dejavu: <dj>

where the output parts are:

motidx
The index of the motif within the output file
site
Sequence of the motif hit
seqidx
The index of the sequence within the list of input sequences
seqpos
The position of the motif hit within the sequence
strand
The strand on which the motif hit occurs (1 = Watson, 0 = Crick)
*** **** *
Informative columns (* if informative, <space> if not)
score
Score of the motif
s1
Size of the search space or positive set
s2
Number of sequences that contain the motif, across positive and negative sets
seqcut
Dynamically learned sequence threshold
sccut
Dynamically learned binding score threshold (only valid in score mode)
exprcut
Dynamically learned expression correlation (cluster width) threshold (only valid in expression mode)
workerid
ID of the worker process that found this motif
restartnum
Number of the random restart that found this motif
dj
Number of times a similar motif was found
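The per-motif layout described above can be read with a short parser. This is a minimal Python sketch with a hypothetical function name (parse_motif_block). It assumes one motif block has already been split out of the file, that site lines have exactly four tab-separated fields, that the informative-columns line contains only "*" and spaces, and that every later line has the form "Label: value". Real output may differ in details not covered on this page.

```python
def parse_motif_block(lines):
    """Parse one motif block, given as a list of lines starting with 'Motif <motidx>'."""
    motif = {
        "index": int(lines[0].split()[1]),
        "sites": [],       # tuples of (site, seqidx, seqpos, strand)
        "columns": None,   # informative-columns line, e.g. '*** **** *'
        "fields": {},      # trailing 'Label: value' lines, values kept as strings
    }
    for line in lines[1:]:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        if motif["columns"] is None:
            if set(line) <= {"*", " "}:
                # Keep internal and trailing spaces: they mark uninformative columns.
                motif["columns"] = line
                continue
            parts = line.split("\t")
            if len(parts) == 4:
                site, seqidx, seqpos, strand = parts
                motif["sites"].append((site, int(seqidx), int(seqpos), int(strand)))
        else:
            key, sep, value = line.partition(":")
            if sep:
                motif["fields"][key.strip()] = value.strip()
    return motif


if __name__ == "__main__":
    # Made-up block for illustration only; not real MotifSpec output.
    block = [
        "Motif 1",
        "ACGT\t0\t12\t1",
        "ACGA\t3\t40\t0",
        "** *",
        "Score: 12.5",
        "Iteration found:  1.4",
        "Dejavu: 2",
    ]
    print(parse_motif_block(block))
```

The "Iteration found" value can then be split on "." to recover the worker ID and restart number.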

Utility scripts

msproc.pl
Process MotifSpec output file to get tabular list of motifs
extract_ms.pl
Extract list of sequences having a motif

Referencing MotifSpec

If you use MotifSpec in your work, please reference our paper:
Karnik, R, and Beer, MA. Identification of predictive cis-regulatory elements using a discriminative objective function and dynamic search spaces. (submitted)

Support

Please contact Mike Beer at mbeer AT jhu DOT edu with any questions you may have.