Installation and Running
Introduction
File Formats
Programs in this Directory
Example Usage
Reading scrm1 Output

Installation and Running

Unzip the tarball package
```
$ tar -xzf scrm1v1.0.tar.gz
```

Change directory and make the binaries

$ cd scrm1v1.0
$ make
g++ -O3 -static -c scrm1.cpp
g++ -O3 -static -c Option.cpp
g++ -O3 -static -c fasta.cpp
g++ -O3 -static -c Tree.cpp
g++ -O3 -static -c Branch.cpp
g++ -O3 -static -c Hypothesis.cpp
g++ -O3 -static -o scrm1 scrm1.o Option.o fasta.o Tree.o Branch.o Hypothesis.o
g++ -O3 -static -c markov.cpp
g++ -O3 -static -c dna.cpp
g++ -O3 -static -o pssm2mif pssm2mif.cpp Option.o fasta.o markov.o dna.o
g++ -O3 -static -o fasta2markov fasta2markov.cpp markov.o fasta.o Option.o
$

The program binary is "scrm1". Use "scrm1 --help" for program options.

Introduction

The basic process:

Sequence Data --> Motif Finder --> Motifs
Sequence Data + Motifs --> Motif Locator --> Motif Instances
Motif Instances + Motifs --> CRM Finder (i.e. scrm1) --> CRM Output

The formats we use are:

Sequence Data: FASTA
Motifs: "PSSM" file (data matrix, see file formats), and FASTA (consensus listing)
Motif Instances: "MIF" file (see file formats)

We use MEME as a motif finder (T. L. Bailey and C. Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", ISMB 1994) and provide meme2pssm.py to convert the output of the MEME program to a "PSSM" file (see file formats).

File Formats

PSSM file (example): A Position Specific Scoring Matrix (PSSM) file. Each row contains a consensus sequence, E-value, and the matrix values in row-major order (for each site, then for each character in alphabetical order {A,C,G,T}).

CONSENSUS   E-VALUE   Site-1-A Site-1-C Site-1-G Site-1-T Site-2-A ... Site-N-T
CONSENSUS   E-VALUE   Site-1-A Site-1-C Site-1-G Site-1-T Site-2-A ... Site-N-T
...
AYrskGGG    0.03491   0.93 0.01 0.04 0.02 0.07 0.45 ... 0.00

MIF file (example): A Motif Instance File (MIF). Each row represents the presence of a motif in a sequence.

SEQUENCE_ID   MOTIF_ID  POSITION   STRAND   (COMMENTS...)
SEQUENCE_ID   MOTIF_ID  POSITION   STRAND   (COMMENTS...)
...
7             2         -245       1       
7             3         -199       2

POSITION is a negative number, measured in base pairs from the RIGHT side of the input sequence. STRAND is 1 for template strand, 2 for transcribed (reverse complement) strand.

HIDDEN MARKOV MODEL STATISTICS file (example): These are letter frequencies following a finite set of previous sequence letters (the length of which is the Markov order).

# comments at any time begin a line with '#'
# 
# format is LETTER-PATTERN  FREQUENCY

A    0.2
C    0.3
G    0.3
T    0.2

AA   0.04
AC   0.06
....

These files can be created with the program fasta2markov or with a program that comes with MEME, if you're using that.

Programs in this Directory

cfv: Divide data files into TRAIN and TEST files for cross-fold validation:
fet: Compute Fisher's exact test p-value.
scrm1: Main program, find a CRM (and test a testing set)
f1.py: Compute an F1 score.
fasta2markov: Calculate Markov model frequencies from sequence data
fasta2pssm.py: Create a PSSM file from consensus sequences (equal weight to all possible nucleotides)
fasta2slim: Convert FASTA to/from a one-sequence-per-line format (for sorting)
fasta-count: Count sequences in a FASTA format file
fasta-format: Reformat FASTA files to a one-line-header, one-line-sequence format (and optionally number the sequences)
fasta-limit.py: Make the sequences shorter by cropping away the LEFT and keeping the RIGHT
fasta-redup: Remove duplicate sequences
fasta-short: Remove short sequences
meme2pssm.py: Translate text MEME output into a PSSM file
pr.py: Compute precision and recall from true/false negatives/positives
pssm2fasta.py: Convert a PSSM matrix file to a FASTA consensus file (which is needed by scrm1 to list individual motifs)
pssm2mif: Find and report the locations of all motifs (from a PSSM file) that occur in sequence data (from a FASTA file)
pssm2pssm.py: Add pseudocounts to PSSM matrix entries
revcomp: Compute the reverse-complement of IUPAC DNA sequence data

Example

This section shows typical use via an example.

Files in sample-data:

regulon: list of genes in the set
markov: 5th order Markov model of the 500 bp upstream/promoter sequences in yeast.
positive.fa: Positive sequences (500 bp upstream/promoter sequences from the list)
negative.fa: Negative sequences (500 bp upstream/promoter sequences selected randomly from yeast)
motif.pssm: PSSM file of 40 motifs
motif.fa: FASTA format version of the motifs

Note: motif.pssm was created by running MEME on some of the positive instances and some "background" sequences from yeast: meme2pssm.py < MEME.output > motif.pssm

Here, we run the pssm2mif program (pssm2mif --help for help) to create a set of motif instances ("MIF" file):

[yertle $] cd sample-data/
[yertle $] 
[yertle $] cat positive.fa negative.fa | pssm2mif -m motif.pssm -s 100 -b ./markov > MIF

Read 126 sequences (stdin).
Read 40 PSSMs (motif.pssm).
Read 5-th order markov model (./markov).
Using forward background model as reverse complement.
pssm2mif:  processed 126 sequences and found 7384 motif instances.

Here is the result. The input sequence is the result of cat positive.fa negative.fa. The first sequence in positive.fa will be sequence ID 0, etc. The first motif in motif.pssm will be motif ID 0.

Now, we run the scrm1 program (scrm1 --help for help) to find and test a CRM. We'll do one fold (fold number zero) of 10 fold cross-validation:


[yertle $] 
[yertle $] scrm1 -f MIF -m motif.fa -P26 -S126 -N10 -F0

------------------------
Noto and Craven, CRM 1.0
------------------------
Sequences:  126
Positive Examples:  26
Motifs:  40 (./motif.fa)
Motif Instances:  7384 (./MIF)
Max. branches:  3
Max. motifs:  3
Max. negated motifs:  1
Beam width:  10
Required (minimum) recall:  0
Use strand constraints?  YES
Use order constraints?  YES
Use distance constraints?  YES
Chi squared value for evaluating model distintion during search:  0
Learn parameters?  YES
Tune CFV folds:  10
Chi Squared value for parameter choosing:  6.25139
N = 10, f = 0 (One fold of cross-fold validation)


Learned CRM aspect space:
 Max. binding sites = 2
 Max. motifs / binding site = 1
 Allow distance constraints = No
 Allow order constraints = No
 Allow strand constraints = No
 Max. motifs / negated region = 1

-------------
Binding site (0) on :
ATACAyACmCACmmACAyAC:  > ATACAyACmCACmmACAyAC (E-value: 3.7e-27)
-------------
Binding site (1) on :
AsAAAAAkAAA:  > AsAAAAAkAAA (E-value: 140.0)
-------------
0, 1 in either order
0, 1 any distance apart
-------------
Negated regions downstream:
AArArrrArwwrhrArrwrAAAAvdAAAA:  > AArArrrArwwrhrArrwrAAAAvdAAAA (E-value: 2.3e-51)

Trainset Results:  TP=23 FP=9 TN=81 FN=0 P=0.71875 R=1 S=0.836364

Testset Results:  TP=3 FP=2 TN=8 FN=0 P=0.6 R=1 S=0.75
3 2 8 0  <-- this is stdout, TEST set results (True positives, FP, TN, FN)

The -P 26 means there are 26 Positive sequences, and -S 126 means 126 sequences total (you can run fasta-count < positive.fa to count them).

Reading a CRM

In verbose mode, durning training, the program will print a short version of a CRM solution, along with traning set statistics:

[(3 18 23)tpl,(2 5)] tss = 211.5  neg. down = {11} (0 < 1)

Each binding site is numbered, starting with zero. This (above) means:

Binding site 0 consists of motifs 3, 18, 23
Binding site 1 consists of motifs 2, 5
Motifs for binding site 0 must appear on the template (tpl) strand.
The distance between the CRM and the [estimated] transcription start site (tss) must be less than 211.5 base pairs.
Motif 11 is negated (a repressor binding site) downstream of the CRM (a set of a single motif).
Binding site 0 must be downstream (0 < 1) of binding site 1.

After learning, the program will list the learned CRM aspects, which define the final search space, and a more descriptive version of the final CRM, including the consensus motif sequences: