Contents

  1. Installation and Running
  2. Introduction
  3. File Formats
  4. Programs in this Directory
  5. Example Usage
  6. Reading scrm1 Output


Installation and Running

  1. Unzip the tarball package
    $ tar -xzf scrm1v1.0.tar.gz
    
  2. Change directory and make the binaries
    $ cd scrm1v1.0
    $ make
    g++ -O3 -static -c scrm1.cpp
    g++ -O3 -static -c Option.cpp
    g++ -O3 -static -c fasta.cpp
    g++ -O3 -static -c Tree.cpp
    g++ -O3 -static -c Branch.cpp
    g++ -O3 -static -c Hypothesis.cpp
    g++ -O3 -static -o scrm1 scrm1.o Option.o fasta.o Tree.o Branch.o Hypothesis.o
    g++ -O3 -static -c markov.cpp
    g++ -O3 -static -c dna.cpp
    g++ -O3 -static -o pssm2mif pssm2mif.cpp Option.o fasta.o markov.o dna.o
    g++ -O3 -static -o fasta2markov fasta2markov.cpp markov.o fasta.o Option.o
    $
    
  3. The program binary is "scrm1". Use "scrm1 --help" for program options.


Introduction

The basic process:

  1. Sequence Data --> Motif Finder --> Motifs
  2. Sequence Data + Motifs --> Motif Locator --> Motif Instances
  3. Motif Instances + Motifs --> CRM Finder (i.e. scrm1) --> CRM Output

The formats we use are:

We use MEME as a motif finder (T. L. Bailey and C. Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", ISMB 1994) and provide meme2pssm.py to convert the output of the MEME program to a "PSSM" file (see file formats).

File Formats

PSSM file (example): A Position Specific Scoring Matrix (PSSM) file. Each row contains a consensus sequence, E-value, and the matrix values in row-major order (for each site, then for each character in alphabetical order {A,C,G,T}).

CONSENSUS   E-VALUE   Site-1-A Site-1-C Site-1-G Site-1-T Site-2-A ... Site-N-T
CONSENSUS   E-VALUE   Site-1-A Site-1-C Site-1-G Site-1-T Site-2-A ... Site-N-T
...
AYrskGGG    0.03491   0.93 0.01 0.04 0.02 0.07 0.45 ... 0.00

MIF file (example): A Motif Instance File (MIF). Each row represents the presence of a motif in a sequence.

SEQUENCE_ID   MOTIF_ID  POSITION   STRAND   (COMMENTS...)
SEQUENCE_ID   MOTIF_ID  POSITION   STRAND   (COMMENTS...)
...
7             2         -245       1       
7             3         -199       2       
POSITION is a negative number, measured in base pairs from the RIGHT side of the input sequence. STRAND is 1 for template strand, 2 for transcribed (reverse complement) strand.

HIDDEN MARKOV MODEL STATISTICS file (example): These are letter frequencies following a finite set of previous sequence letters (the length of which is the Markov order).

# comments at any time begin a line with '#'
# 
# format is LETTER-PATTERN  FREQUENCY

A    0.2
C    0.3
G    0.3
T    0.2

AA   0.04
AC   0.06
....

These files can be created with the program fasta2markov or with a program that comes with MEME, if you're using that.

Programs in this Directory



Example

This section shows typical use via an example.

Files in sample-data:

Note: motif.pssm was created by running MEME on some of the positive instances and some "background" sequences from yeast: meme2pssm.py < MEME.output > motif.pssm

Here, we run the pssm2mif program (pssm2mif --help for help) to create a set of motif instances ("MIF" file):

[yertle $] cd sample-data/
[yertle $] 
[yertle $] cat positive.fa negative.fa | pssm2mif -m motif.pssm -s 100 -b ./markov > MIF

Read 126 sequences (stdin).
Read 40 PSSMs (motif.pssm).
Read 5-th order markov model (./markov).
Using forward background model as reverse complement.
pssm2mif:  processed 126 sequences and found 7384 motif instances.

Here is the result. The input sequence is the result of cat positive.fa negative.fa. The first sequence in positive.fa will be sequence ID 0, etc. The first motif in motif.pssm will be motif ID 0.

Now, we run the scrm1 program (scrm1 --help for help) to find and test a CRM. We'll do one fold (fold number zero) of 10 fold cross-validation:


[yertle $] 
[yertle $] scrm1 -f MIF -m motif.fa -P26 -S126 -N10 -F0

------------------------
Noto and Craven, CRM 1.0
------------------------
Sequences:  126
Positive Examples:  26
Motifs:  40 (./motif.fa)
Motif Instances:  7384 (./MIF)
Max. branches:  3
Max. motifs:  3
Max. negated motifs:  1
Beam width:  10
Required (minimum) recall:  0
Use strand constraints?  YES
Use order constraints?  YES
Use distance constraints?  YES
Chi squared value for evaluating model distintion during search:  0
Learn parameters?  YES
Tune CFV folds:  10
Chi Squared value for parameter choosing:  6.25139
N = 10, f = 0 (One fold of cross-fold validation)


Learned CRM aspect space:
 Max. binding sites = 2
 Max. motifs / binding site = 1
 Allow distance constraints = No
 Allow order constraints = No
 Allow strand constraints = No
 Max. motifs / negated region = 1

-------------
Binding site (0) on :
ATACAyACmCACmmACAyAC:  > ATACAyACmCACmmACAyAC (E-value: 3.7e-27)
-------------
Binding site (1) on :
AsAAAAAkAAA:  > AsAAAAAkAAA (E-value: 140.0)
-------------
0, 1 in either order
0, 1 any distance apart
-------------
Negated regions downstream:
AArArrrArwwrhrArrwrAAAAvdAAAA:  > AArArrrArwwrhrArrwrAAAAvdAAAA (E-value: 2.3e-51)

Trainset Results:  TP=23 FP=9 TN=81 FN=0 P=0.71875 R=1 S=0.836364

Testset Results:  TP=3 FP=2 TN=8 FN=0 P=0.6 R=1 S=0.75
3 2 8 0  <-- this is stdout, TEST set results (True positives, FP, TN, FN)


The -P 26 means there are 26 Positive sequences, and -S 126 means 126 sequences total (you can run fasta-count < positive.fa to count them).

Reading a CRM

In verbose mode, durning training, the program will print a short version of a CRM solution, along with traning set statistics:

[(3 18 23)tpl,(2 5)] tss = 211.5  neg. down = {11} (0 < 1) 
Each binding site is numbered, starting with zero. This (above) means:

After learning, the program will list the learned CRM aspects, which define the final search space, and a more descriptive version of the final CRM, including the consensus motif sequences:

-------------
Binding site (0) on tpl:
CCAyyCATssATCmrTCsAT:  > FASTA description of motif 3
GTGTGTGyGTGTGyGTGTG:  > FASTA description of motif 18
TGTGTGTGTGTGTGTrTGTk:  > FASTA description of motif 23
-------------
Binding site (1):
TGCTCAATGAA:  > FASTA description of motif 2
TwTykbsbTykbwTymTyhy:  > FASTA description of motif 5 
-------------
0 > Upstream of > 1
-------------
CRM within 211.5 bp of TSS.
-------------
Negated regions downstream:
AAAAAAAAAAmAAAAmwAAA:  > FASTA description of motif 11