Installing the Software
  1. Unpack the tar.gz package:
    $ tar -xzf scrm2.v2.2.tar.gz
    
  2. Build the executables by changing into the program directory and typing "make":
    $ cd scrm2.v2.2
    $ make
    g++ -Wall -O3 -static -c scrm2.cpp
    g++ -Wall -O3 -static -c Option.cpp
    g++ -Wall -O3 -static -c grid.cpp
    g++ -Wall -O3 -static -c genomic.cpp
    g++ -Wall -O3 -static -c fasta.cpp
    g++ -Wall -O3 -static -c Timer.cpp
    g++ -Wall -O3 -static -c CRM.cpp
    g++ -Wall -O3 -static -c Distance.cpp
    g++ -Wall -O3 -static -c IOExample.cpp
    g++ -Wall -O3 -static -c metric.cpp
    g++ -Wall -O3 -static -c io.cpp
    g++ -Wall -O3 -static -c custom.cpp
    g++ -Wall -O3 -static -o scrm2 scrm2.o Option.o genomic.o grid.o fasta.o Timer.o CRM.o  Distance.o IOExample.o metric.o io.o custom.o
    g++ -Wall -O3 -static -o fasta2markov fasta2markov.cpp fasta.o Option.o Timer.o genomic.o grid.o
    g++ -Wall -O3 -static -o pwm2fasta pwm2fasta.cpp  Option.o
    g++ -Wall -O3 -static -o revcomp revcomp.cpp Option.o genomic.o fasta.o
    $
    
  3. The main binary is called "scrm2".
Running the Software

Run the "scrm2" binary created by the installation process above.

Program Input

Basic usage is: scrm2 [options] positive-FASTA-file negative-FASTA-file.

The input to the program is a pair of files containing sequence data: a set of positive example sequences and a set of negative example sequences. Both files are in FASTA format.
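
For example (the sequence names and bases here are illustrative):

		> Sequence0
		ACTACTACAATCGACTAGCATCGATCACTAACAGCATCA...
		> Sequence1
		ACTGACTATTTACGACGATCAGCATAGCATACGACAACA...
		...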

Each sequence's unique header is important because it is used to identify the sequence in other, optional input files (sequence weights and prior binding site distributions).
Program Options

Basic usage is: scrm2 [options] positive-FASTA-file negative-FASTA-file.

Options are associated with a keyword and often a single letter. Keywords are preceded by two dashes, single letters by one. For instance, the PRNG seed is set with --seed 12345 or -S 12345 (one could also use --seed=12345, -S=12345, or -S12345).

For help, run scrm2 --help. This will list all the options with their keywords and default values, and give a brief description of each.

There are many options. What follows is a list of the ones you'll want to consider.

-C, --C: Max. binding sites
    Usually set to --C=3. That means the final solution will have one, two, or three binding sites, but the program will never consider four.

-D, --D: Max. motifs per binding site
    Each binding site may have multiple motifs; if any of them is present, the binding site occurs in the sequence. This can represent variability in a binding site motif, but each motif is learned separately. This is optional, and some people prefer to set --D=1 and have each binding site represented by a single motif.

-G, --G: Max. negated binding sites
    Set --G=0 if you don't want to consider negated motifs.

-Z, --timeout: Max. number of considered models
    The algorithm will stop after considering a total of this many model structures. This includes learning single-motif models, possibly based on candidate motifs.

-m, --mm: Variable-order Markov chain background distribution file
    The algorithm will work better with a good model of the background distribution (promoter DNA). It uses a Markov-chain model for this (same as MEME). One can be created from FASTA sequence data using the fasta2markov program.

-b, --bg: Background distribution file
    An arbitrary background distribution per sequence, in a FASTA-like format: the i-th number is the LOG-SCALE likelihood of the subsequence up to, but not including, position i (so there are L+1 numbers for a sequence of length L; see the sketch after this list). Example:

    >seq1
    -0.202 -0.599 ...
    >seq2
    -0.334 -1.914 ...
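
To make the --bg format concrete, the following minimal sketch (not part of scrm2; the sequence name and per-base probabilities are made up) writes one entry by accumulating log-scale prefix likelihoods:

    // Sketch: write one --bg entry from per-base background probabilities.
    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        // Hypothetical background probability of each base of "seq1":
        std::vector<double> p;
        p.push_back(0.25); p.push_back(0.30); p.push_back(0.20); p.push_back(0.25);

        std::printf(">seq1\n");
        double loglik = 0.0;            // log-likelihood of the empty prefix (i = 0)
        std::printf("%g", loglik);
        for (std::size_t i = 0; i < p.size(); ++i) {
            loglik += std::log(p[i]);   // prefix now includes base i
            std::printf(" %g", loglik);
        }
        std::printf("\n");              // L+1 numbers for a sequence of length L
        return 0;
    }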

The following are options you may or may not want to consider.

-N, --folds: Cross-validation folds
    To run 10-fold cross-validation, set this to 10, and set the -F, --fold option appropriately.

-F, --fold: Which fold?
    For N-fold cross-validation, run the program N times, setting -F=0, -F=1, ..., -F=N-1 (see the example commands after this list).

    This is (stratified) N-fold cross-validation. The algorithm will set aside every Nth positive and negative example sequence for testing. Specifically, if all P positive examples are numbered 0 through P-1, and all G negative examples are numbered P through P+G-1, then an example numbered i is held aside for testing if i modulo N = F.

-w, --minw: Minimum motif width
    The algorithm will try various CRM model structures by adding new motifs to promising learned models. When it adds a new motif, it is randomly initialized and given a random width of at least this many base pairs.

-x, --maxw: Maximum motif width
    Self-explanatory; see -w, --minw above.

-W, --weights: Sequence weights input file
    If you are more confident about some sequences than others, you can assign a weight (between zero and one) to each sequence. This is interpreted as the probability that it contains the CRM. The input file looks like FASTA, but just contains each sequence's weight. For instance:

Input FASTA file:

		> Sequence0
		ACTACTACAATCGACTAGCATCGATCACTAACAGCATCA...
		> Sequence1
		ACTGACTATTTACGACGATCAGCATAGCATACGACAACA...
		...
		

Weights file:

		> Sequence0
		1.0
		> Sequence1
		0.80
		...
		
-V, --prior: Binding site location prior input file
    If you have an idea of where on each sequence binding is more likely to occur (say, via comparative genomics), you can indicate this with a number for each location (individual base) of each sequence, indicating the relative probability of a binding site occurring there. For instance:

Input FASTA file:

		> Sequence0
		ACTACTACAATCGACTAGCATCGATCACTAACAGCATCA...
		> Sequence1
		ACTGACTATTTACGACGATCAGCATAGCATACGACAACA...
		...
		

Prior file:

		> Sequence0
		0.0023 0.0123 0.0199 ...
		> Sequence1
		0.0001 0.0022 0.0037 ...
		...
		
    There are as many numbers for each sequence as base pairs in that sequence.
-r, --ratio1: Search ratio (|m|=1)
    To speed up convergence, the algorithm may consider only the most likely positions for each binding site. For instance, given a motif, it may be 99% probable that if the motif occurs, it occurs in one of a few positions. Setting --ratio1=0.99 will consider only these positions and greatly reduce the necessary amount of computation. This option applies to CRM models of size 1 (one binding site). (Note: this affects convergence time; you can also set the number of EM iterations run during learning.)

-R, --ratio2: Search ratio (|m|>1)
    Same as --ratio1, but for CRM models with two or more binding sites.

-B, --binw: Distance histogram bin width
    The distance distribution is a (smoothed) histogram over possible distances. Smoothing takes time, so for large sequences (kilobases) it is recommended that you "bin" together groups of this many consecutive base pairs.

-T, --tunerat: Tune set size
    To avoid overfitting (or if you have too much data), you can hold aside this ratio of the training set for evaluation only.
--save: Save learned CRM to file
    The file format for this is not meant to be human readable, although it contains comments. It is meant to be used in subsequent calls to the program with the --load option. Note: the CRM file format for the --load and --save options may change in future versions.

    See also: the --load and --base options.

--load: Use a previously-learned CRM model
    Instead of learning a model, read one from file and evaluate it. Note: the train/tune/test set divisions remain the same.

--base: Supply a CRM submodel
    If the --base option is supplied with a filename, the program will read the submodel, and all explored structures will contain this submodel (i.e., all the binding sites of the submodel).

    See also: the --badjust option.

-j, --badjust: Learn CRM submodel parameters
    If the program is given a CRM submodel to always use (via the --base option), the parameters of this submodel can be adjusted by the learning algorithm by setting this boolean flag.

    For instance, with [-j|--badjust] set, the program will consider the base submodel structure with the parameters as given, add a new binding site with randomized parameters, then learn all model parameters.

    Without this option set (the default), the program will learn the parameters of any additional binding site(s), but not of the base submodel.

    See also: the --base option.
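
For example, 10-fold cross-validation amounts to ten runs along these lines (the output file names are illustrative):

    $ scrm2 --folds=10 --fold=0 positive.fa negative.fa > fold0.out
    $ scrm2 --folds=10 --fold=1 positive.fa negative.fa > fold1.out
    ...
    $ scrm2 --folds=10 --fold=9 positive.fa negative.fa > fold9.out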

Program Output

The program writes both to standard error and standard output.

Standard Error

The following is example summary data written to standard error (--verbose=1):

[llama $] scrm2 positive.fa negative.fa -b fly.markov -N10 -F0 -T0.25 -I1 -Z4 -C2 -D2 -G1

Structured CRM 2.1
Inupt file(s):  "positive.fa" "negative.fa"
Parameter settings:  X="" b="fly.markov" c="" W="" V="" load="" save="" fileprec=10 
                     f="" C=2 D=2 G=1 d=unset o=unset s=unset m=unset w=20 x=20 e=10
                     E=10 I=1 r=0.99 R=0.1 u=0.001 k=0.1 B=1 Z=4 N=10 F=0 T=0.25 S=-1 v=1

Seeding PRNG with 1147986514
Read order 5 Markov file "fly.markov". (0.023 seconds)
TRAINSET:  135 sequences (67+ 68- )
TUNESET:   45 sequences (23+ 22- )
TESTSET:   20 sequences (10+ 10- )
1       (TmhybCGGATATCCGbdnnh)  F1 = 0.914398 > 0 (9.59 seconds)
2       (AwGCGCGGATATCCGndnnh | nnnCGGATATCCGndnnhnw)   F1 = 0.966999 > 0.914398 (9.59 seconds)
3       (kymrATGGATATCCGndnnh) & (nnnnACTTnmGGGACTThnn) F1 = 0.9999 > 0.966999 (42.9 seconds)
4       (nnnnnCGGATATCCGndnnh) & (~wkrATsyGvTACTAGGvvGw)        F1 = 0.956673 (50.3 seconds)
...

Each learned model is summarized (if --verbose=1) as:

  1. The number of the model learned
  2. A model summary using "&" for 'and', "|" for 'or', and "~" for 'not'.
  3. The evaluation score (here, it's F1)
  4. "> [old-score]" if the new score is higher
  5. The user time taken to train.
NOTE: More information is given if --verbose≥2; that output is not described here, and the setting is mostly for debugging.

The following is an example CRM model written to standard error:

Binding Site #1 (template 0.000998004)
  Motif #1 (motif preference 1) Consensus GGkAbAGrCwswCGCknwyy  MLE GGGATAGACTCTCGCGTATT
     |    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
   A |  0.27 0.00 0.00 0.72 0.00 0.72 0.00 0.54 0.18 0.45 0.00 0.36 0.18 0.00 0.00 0.00 0.18 0.45 0.00 0.09
   C |  0.09 0.00 0.00 0.09 0.36 0.00 0.27 0.00 0.72 0.09 0.45 0.09 0.72 0.00 0.90 0.09 0.18 0.18 0.36 0.45
   G |  0.63 0.99 0.54 0.09 0.27 0.18 0.72 0.45 0.00 0.00 0.36 0.00 0.09 0.99 0.09 0.54 0.27 0.09 0.09 0.00
   T |  0.00 0.00 0.45 0.09 0.36 0.09 0.00 0.00 0.09 0.45 0.18 0.54 0.00 0.00 0.00 0.36 0.36 0.27 0.54 0.45
Binding Site #2 (template 0.00751102)
  Motif #1 (motif preference 0.5) Consensus ryTCmbCCdCTTCkwGGAdr  MLE GTTCAGCCGCTTCGTGGATG
     |    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
   A |  0.36 0.09 0.00 0.09 0.54 0.00 0.00 0.00 0.36 0.00 0.00 0.00 0.00 0.00 0.36 0.00 0.27 0.54 0.27 0.36
   C |  0.00 0.36 0.27 0.63 0.27 0.36 0.90 0.90 0.00 0.99 0.18 0.00 0.72 0.18 0.00 0.00 0.00 0.18 0.00 0.00
   G |  0.63 0.00 0.09 0.27 0.00 0.36 0.09 0.00 0.36 0.00 0.00 0.00 0.09 0.54 0.18 0.81 0.63 0.18 0.27 0.45
   T |  0.00 0.54 0.63 0.00 0.18 0.27 0.00 0.09 0.27 0.00 0.81 0.99 0.18 0.27 0.45 0.18 0.09 0.09 0.45 0.18
  Motif #2 (motif preference 0.5) Consensus TykTTGmTCArsmGCkwCrT  MLE TCGTTGATCAGCAGCGACGT
     |    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
   A |  0.18 0.18 0.09 0.18 0.27 0.00 0.45 0.00 0.27 0.72 0.27 0.09 0.45 0.00 0.00 0.00 0.54 0.18 0.36 0.18
   C |  0.18 0.54 0.18 0.09 0.00 0.00 0.45 0.18 0.63 0.00 0.00 0.54 0.36 0.09 0.63 0.18 0.00 0.72 0.18 0.00
   G |  0.09 0.00 0.45 0.00 0.00 0.99 0.00 0.18 0.09 0.00 0.54 0.36 0.18 0.90 0.27 0.45 0.00 0.00 0.45 0.00
   T |  0.54 0.27 0.27 0.72 0.72 0.00 0.09 0.63 0.00 0.27 0.18 0.00 0.00 0.00 0.09 0.36 0.45 0.09 0.00 0.81
Negated Binding Site #1 (template 0.929835)
  Motif #1 (motif preference 1) Consensus TTnnGGGACTThnnnnnnnn  MLE TTACGGGACTTAGAACAACA
     |    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
   A |  0.00 0.00 0.30 0.26 0.02 0.00 0.02 0.98 0.00 0.01 0.00 0.35 0.28 0.30 0.29 0.23 0.33 0.30 0.19 0.31
   C |  0.03 0.00 0.19 0.33 0.01 0.00 0.03 0.00 0.96 0.01 0.00 0.23 0.17 0.22 0.24 0.31 0.20 0.23 0.32 0.17
   G |  0.00 0.00 0.25 0.19 0.95 0.98 0.93 0.01 0.00 0.00 0.00 0.17 0.31 0.19 0.20 0.21 0.20 0.16 0.16 0.21
   T |  0.95 0.98 0.24 0.21 0.01 0.00 0.01 0.00 0.03 0.97 0.99 0.23 0.21 0.26 0.24 0.23 0.24 0.29 0.31 0.28

Binding Site #1 downstream of #2:  0.710477
Binding Site #1 downstream of #3:  0.500481
Binding Site #2 downstream of #3:  0.500681
Distance #1 <-> END as Distance[0..501) peak ~= 237, mid ~= 230
Distance #1 <-> #2 as Distance[0..501) peak ~= 136, mid ~= 173
Distance #1 <-> #3 as Distance[0..501) peak ~= 128, mid ~= 151
Distance #2 <-> END as Distance[0..501) peak ~= 394, mid ~= 308
Distance #2 <-> #3 as Distance[0..501) peak ~= 0, mid ~= 91
Distance #3 <-> END as Distance[0..501) peak ~= 0, mid ~= 100
This has the same structure as shown in Figure 1.

Standard Output

Standard output gets evaluation data, one record per sequence, one line per record, with the following fields (separated by white space):

  1. Sequence ID
  2. The set that the sequence is in ("train", "tune", or "test")
  3. The sequence label. This is its weight, or "0" for negative and "1" for positive.
  4. The predicted label. The confidence that this is a positive example. This is the calculated probability that the sequence contains the learned CRM model (not negated sites), when the alternative is that it was generated entirely by the background model (or the CRM with its negated sites).
  5. The most likely binding site locations. There is one column for each binding site, and each entry looks like X@Y, meaning binding site #X starts at position Y (counting from 1 at the left end of the sequence, even though the internal representation considers the sequence to start at index zero). It is very important to note that, since this algorithm is probabilistic, every motif might occur in every location, with varying degrees of likelihood. These locations are merely the single most likely combination, and say nothing about other (possibly almost equally likely) locations. The probability that the sequence contains the CRM is based on all possible locations, not just the most likely one.

Example output:

acCer1_sgdGene_YBR181C	train	1	0.987643	2@172	1@201		
acCer1_sgdGene_YBR188C	train	1	0.98993 	2@295	1@365		
	...
acCer1_sgdGene_YPR103W	train	1	0.814915	2@250	1@315		
acCer1_sgdGene_YJR072C	train	0	2.76453e-05	2@3 	1@76		
acCer1_sgdGene_YGL213C	train	0	0.000704786	2@65	1@298		
acCer1_sgdGene_YOR190W	train	0	0.0746866	2@12	1@93		
	...
acCer1_sgdGene_YGR057C	train	0	0.00324076	2@133	1@201		
acCer1_sgdGene_YBR085W	tune	1	0.85638 	2@201	1@426		
acCer1_sgdGene_YBR191W	tune	1	0.998627	2@141	1@193		
	...
acCer1_sgdGene_YPR132W	tune	1	0.999275	2@82	1@115		
acCer1_sgdGene_YGR250C	tune	0	0.00647939	2@187	1@411		
acCer1_sgdGene_YCR107W	tune	0	0.00274947	2@27	1@272		
	...
acCer1_sgdGene_YMR016C	tune	0	0.000158975	2@204	1@383		
acCer1_sgdGene_YBL087C	test	1	0.977379	2@148	1@427		
acCer1_sgdGene_YDR471W	test	1	0.923808	2@143	1@341		
acCer1_sgdGene_YFL034W	test	1	0.991551	2@191	1@304		
acCer1_sgdGene_YGL030W	test	1	0.997261	2@168	1@320		
acCer1_sgdGene_YHR141C	test	1	0.98651 	2@133	1@474		
acCer1_sgdGene_YJL190C	test	1	0.927052	2@225	1@254		
acCer1_sgdGene_YLR048W	test	1	0.995488	2@70	1@144		
acCer1_sgdGene_YLR448W	test	1	0.99195 	2@61	1@246		
acCer1_sgdGene_YNL302C	test	1	0.999094	2@108	1@177		
acCer1_sgdGene_YOR095C	test	1	0.000222165	2@106	1@209		
acCer1_sgdGene_YPR102C	test	1	0.953526	2@21	1@49		
acCer1_sgdGene_YGL066W	test	0	0.00928651	2@139	1@182		
acCer1_sgdGene_YGR287C	test	0	4.81675e-05	2@9 	1@32		
acCer1_sgdGene_YDR071C	test	0	0.647564 	2@113	1@423		
acCer1_sgdGene_YBR141C	test	0	0.00608682	2@270	1@372		
acCer1_sgdGene_YER069W	test	0	0.049907	2@283	1@487		
acCer1_sgdGene_YMR144W	test	0	0.699687	2@206	1@322		
acCer1_sgdGene_YOR162C	test	0	0.40865 	2@21	1@47		
acCer1_sgdGene_YJL140W	test	0	0.00721103	2@12	1@38		
acCer1_sgdGene_YIL115C	test	0	0.0228935	2@275	1@365		
acCer1_sgdGene_YIL067C	test	0	0.014081	2@100	1@220		
Model Structure
Fig. 1: Our CRM model. a: Model structure. b: Pairwise parameters.

An example CRM model structure is shown in Figure 1. On the left (a), we see an example model structure: a conjunction of binding sites, each a disjunction of motifs. Binding sites also have a strand preference (a single number representing the probability that binding occurs on the template strand). Binding sites may be negated.

In this particular case, there are three binding sites. The first has a single motif ("A"), the second has two motifs (either "B1" or "B2" will satisfy this binding site), and the third has a single motif ("C"), but is negated. This means that a positive sequence must contain the first two binding sites, but not the third. (A negative sequence would contain all three, or fail to contain one of the first two.) In the standard-error model summaries described above, this structure would appear as something like (A) & (B1 | B2) & (~C).

On the right in Figure 1 (b), we see pairwise preferences: For each pair of binding sites,

  1. An order preference (a single number representing the probability that one binding site occurs upstream of another).
  2. A probability distribution over the distance between two adjacent binding sites. Also, the distance between a binding site and the end of a sequence (e.g., the estimated transcription start site).
Algorithm

Basic algorithm:

until stopping criteria:
    Consider a model structure
    Learn parameters for said structure
    Evaluate model
return best model

And a slightly more detailed algorithm:

Q ← { }  (empty queue)
for each candidate motif, c:
    Create a single-motif model m with c
    Learn parameters of m
    Q ← Q ∪ m
for init iterations:  (some user-defined number)
    Create a random single-motif model, m
    Learn parameters of m
    Q ← Q ∪ m
until stopping criteria:  (e.g., a user-defined maximum number of models considered)
    m ← best model from Q
    m' ← m ∪ a new random single-motif binding site;  Learn parameters of m';  Q ← Q ∪ m'
    m' ← m with a new random motif added to its last binding site;  Learn parameters of m';  Q ← Q ∪ m'
    m' ← m ∪ a new random single-motif negated binding site;  Learn parameters of m';  Q ← Q ∪ m'
return best model considered.
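
The following compilable toy shows only the shape of this loop; every name is hypothetical, and random numbers stand in for the scores that EM parameter learning would produce:

    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    // Toy stand-ins for a CRM model and for EM-based parameter learning.
    struct Model { double score; int num_sites; };

    bool worse(const Model &a, const Model &b) { return a.score < b.score; }

    Model learn(Model m) {                  // the real learner runs EM here
        m.score = (double)std::rand() / RAND_MAX;
        return m;
    }

    int main() {
        std::srand(12345);
        std::vector<Model> Q;
        for (int i = 0; i < 5; ++i) {       // init iterations: random 1-motif models
            Model m = { 0.0, 1 };
            Q.push_back(learn(m));
        }
        for (int considered = 0; considered < 20; ++considered) {  // -Z-style cap
            Model best = *std::max_element(Q.begin(), Q.end(), worse);
            best.num_sites += 1;            // grow the best model by one site
            Q.push_back(learn(best));       // (motif and negated-site moves omitted)
        }
        Model top = *std::max_element(Q.begin(), Q.end(), worse);
        return top.num_sites;               // stand-in for "return best model"
    }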

Customizing the Learning Task
There are a few ways in which you can change details of the algorithm by writing or editing source code.
Customizing the Scoring Metric

The file metric.cpp has a function, double metric(const vector<pair<probability, probability> >&), which is given a set of (class, prediction) pairs and returns a "score" (typically between 0 and 1; higher is better). "class" refers to the true class of the example (1.0 is positive, 0.0 is negative, but these can be "soft"--between 0 and 1). "prediction" refers to the prediction made by the learning algorithm (which will be "soft"--the probability of an example being generated by a positive path through the corresponding HMM). (Note that probability is a type, defined in probability.h.)

You may create any metric you like, or call one of the existing ones, such as metric_cmargin (classification margin), metric_Fx (F1), or metric_acc (plain old accuracy).

Note that you should also update the function (also in metric.cpp), std::string metric(), which returns the "name" of the metric (which will be printed to standard error during the learning process).
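
For instance, a minimal sketch of a replacement metric implementing plain accuracy might look like the following (here probability is assumed, for illustration, to behave like a double; in the real metric.cpp you would use the type from probability.h):

    #include <string>
    #include <utility>
    #include <vector>
    using namespace std;

    typedef double probability;  // illustrative stand-in for probability.h

    // Plain classification accuracy with a 0.5 threshold; soft classes
    // and predictions are rounded to the nearer label.
    double metric(const vector<pair<probability, probability> > &data) {
        if (data.empty()) return 0.0;
        size_t correct = 0;
        for (size_t i = 0; i < data.size(); ++i) {
            bool truth = data[i].first  >= 0.5;  // true class of the example
            bool pred  = data[i].second >= 0.5;  // the model's prediction
            if (truth == pred) ++correct;
        }
        return double(correct) / double(data.size());
    }

    // The metric's name, printed to standard error during learning.
    string metric() { return "accuracy"; }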

Altering or Constraining the model(s)

The file custom.cpp is designed to make it easy to customize the learning task by explicitly constraining (or otherwise altering) the structure and/or parameters of CRM models, or by constraining other probabilities, like the probability distribution over possible orders of binding sites.

These are the "custom" functions:

  1. void constrain_preconditions(CRM &crm, vector<Example*> &training_data):

    This is called before the first iteration of the EM-based parameter-learning algorithm (see our paper, Noto & Craven, ECCB 2006, and Krogh et al., ICPR 1994). You can put any code here to constrain or otherwise alter a CRM model.

  2. void constrain_per_iteration(CRM &crm, vector<Example*> &training_data):

    This is called after each E-M iteration.

  3. void constrain_finalize(CRM &crm, vector<Example*> &training_data):

    This is called after all E-M iterations.

  4. void constrain_order(list<pair<vector<uint>, probability> > &order, const CRM&):

    This is called whenever the CRM learner needs to know the probability of a given ordering among binding sites. It's based on normalizing the product of all pairwise orderings, but you can change the answer before it is given to the learning algorithm.

    Typical use would be to force certain binding sites to be adjacent by zeroing out inconsistent orderings, then renormalizing (see the sketch at the end of this section).

The three versions of constrain_... might, of course, all do the same thing.
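
As one example of use, here is a minimal sketch of constrain_order that forces one binding site to occur immediately before another, as described above. The site indices, and the assumption that probability behaves like a double, are illustrative; in the real custom.cpp you would include the scrm2 headers instead of the typedefs below:

    #include <cstddef>
    #include <list>
    #include <utility>
    #include <vector>

    typedef unsigned int uint;   // assumed aliases for this sketch; the
    typedef double probability;  // real types live in the scrm2 headers
    class CRM;                   // opaque here

    // Zero out any ordering in which site 0 is not immediately followed
    // by site 1, then renormalize the remaining orderings to sum to one.
    void constrain_order(std::list<std::pair<std::vector<uint>, probability> > &order,
                         const CRM &) {
        typedef std::list<std::pair<std::vector<uint>, probability> >::iterator Iter;
        probability total = 0.0;
        for (Iter it = order.begin(); it != order.end(); ++it) {
            const std::vector<uint> &o = it->first;
            bool adjacent = false;
            for (std::size_t i = 0; i + 1 < o.size(); ++i)
                if (o[i] == 0 && o[i + 1] == 1) { adjacent = true; break; }
            if (!adjacent) it->second = 0.0;  // inconsistent ordering: drop it
            total += it->second;
        }
        if (total > 0.0)                      // renormalize the survivors
            for (Iter it = order.begin(); it != order.end(); ++it)
                it->second /= total;
    }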