scrm1
Output
$ tar -xzf scrm1v1.0.tar.gz
$ cd scrm1v1.0 $ make g++ -O3 -static -c scrm1.cpp g++ -O3 -static -c Option.cpp g++ -O3 -static -c fasta.cpp g++ -O3 -static -c Tree.cpp g++ -O3 -static -c Branch.cpp g++ -O3 -static -c Hypothesis.cpp g++ -O3 -static -o scrm1 scrm1.o Option.o fasta.o Tree.o Branch.o Hypothesis.o g++ -O3 -static -c markov.cpp g++ -O3 -static -c dna.cpp g++ -O3 -static -o pssm2mif pssm2mif.cpp Option.o fasta.o markov.o dna.o g++ -O3 -static -o fasta2markov fasta2markov.cpp markov.o fasta.o Option.o $
The basic process:
scrm1
) --> CRM Output
The formats we use are:
We use MEME
as a motif finder
(T. L. Bailey and C. Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", ISMB 1994)
and provide meme2pssm.py
to convert the output of the MEME
program
to a "PSSM" file (see file formats).
PSSM file (example): A Position Specific Scoring Matrix (PSSM) file. Each row contains a consensus sequence, E-value, and the matrix values in row-major order (for each site, then for each character in alphabetical order {A,C,G,T}).
CONSENSUS E-VALUE Site-1-A Site-1-C Site-1-G Site-1-T Site-2-A ... Site-N-T CONSENSUS E-VALUE Site-1-A Site-1-C Site-1-G Site-1-T Site-2-A ... Site-N-T ... AYrskGGG 0.03491 0.93 0.01 0.04 0.02 0.07 0.45 ... 0.00
MIF file (example): A Motif Instance File (MIF). Each row represents the presence of a motif in a sequence.
SEQUENCE_ID MOTIF_ID POSITION STRAND (COMMENTS...) SEQUENCE_ID MOTIF_ID POSITION STRAND (COMMENTS...) ... 7 2 -245 1 7 3 -199 2POSITION is a negative number, measured in base pairs from the RIGHT side of the input sequence. STRAND is 1 for template strand, 2 for transcribed (reverse complement) strand.
HIDDEN MARKOV MODEL STATISTICS file (example): These are letter frequencies following a finite set of previous sequence letters (the length of which is the Markov order).
# comments at any time begin a line with '#' # # format is LETTER-PATTERN FREQUENCY A 0.2 C 0.3 G 0.3 T 0.2 AA 0.04 AC 0.06 ....
These files can be created with the program
fasta2markov
or with a program that comes with MEME, if you're using that.
MEME
output into a PSSM file
scrm1
to list individual motifs)
This section shows typical use via an example.
Files in sample-data:
Note: motif.pssm was created by running
MEME
on some of the positive instances and
some "background"
sequences from yeast:
meme2pssm.py < MEME.output > motif.pssm
Here, we run the pssm2mif
program (pssm2mif --help
for help)
to create a set of
motif instances ("MIF" file):
[yertle $] cd sample-data/ [yertle $] [yertle $] cat positive.fa negative.fa | pssm2mif -m motif.pssm -s 100 -b ./markov > MIF Read 126 sequences (stdin). Read 40 PSSMs (motif.pssm). Read 5-th order markov model (./markov). Using forward background model as reverse complement. pssm2mif: processed 126 sequences and found 7384 motif instances.
Here is the result.
The input sequence is the result
of cat positive.fa negative.fa
.
The first sequence in positive.fa will be sequence ID 0, etc.
The first motif in motif.pssm will be motif ID 0.
Now, we run the scrm1
program (scrm1 --help
for help)
to find and test a CRM. We'll do one fold (fold number zero) of 10 fold cross-validation:
[yertle $] [yertle $] scrm1 -f MIF -m motif.fa -P26 -S126 -N10 -F0 ------------------------ Noto and Craven, CRM 1.0 ------------------------ Sequences: 126 Positive Examples: 26 Motifs: 40 (./motif.fa) Motif Instances: 7384 (./MIF) Max. branches: 3 Max. motifs: 3 Max. negated motifs: 1 Beam width: 10 Required (minimum) recall: 0 Use strand constraints? YES Use order constraints? YES Use distance constraints? YES Chi squared value for evaluating model distintion during search: 0 Learn parameters? YES Tune CFV folds: 10 Chi Squared value for parameter choosing: 6.25139 N = 10, f = 0 (One fold of cross-fold validation) Learned CRM aspect space: Max. binding sites = 2 Max. motifs / binding site = 1 Allow distance constraints = No Allow order constraints = No Allow strand constraints = No Max. motifs / negated region = 1 ------------- Binding site (0) on : ATACAyACmCACmmACAyAC: > ATACAyACmCACmmACAyAC (E-value: 3.7e-27) ------------- Binding site (1) on : AsAAAAAkAAA: > AsAAAAAkAAA (E-value: 140.0) ------------- 0, 1 in either order 0, 1 any distance apart ------------- Negated regions downstream: AArArrrArwwrhrArrwrAAAAvdAAAA: > AArArrrArwwrhrArrwrAAAAvdAAAA (E-value: 2.3e-51) Trainset Results: TP=23 FP=9 TN=81 FN=0 P=0.71875 R=1 S=0.836364 Testset Results: TP=3 FP=2 TN=8 FN=0 P=0.6 R=1 S=0.75 3 2 8 0 <-- this is stdout, TEST set results (True positives, FP, TN, FN)
The -P 26 means there are 26 Positive sequences, and -S 126 means 126
sequences total (you can run fasta-count < positive.fa
to count them).
In verbose mode, durning training, the program will print a short version of a CRM solution, along with traning set statistics:
[(3 18 23)tpl,(2 5)] tss = 211.5 neg. down = {11} (0 < 1)Each binding site is numbered, starting with zero. This (above) means:
After learning, the program will list the learned CRM aspects, which define the final search space, and a more descriptive version of the final CRM, including the consensus motif sequences:
------------- Binding site (0) on tpl: CCAyyCATssATCmrTCsAT: > FASTA description of motif 3 GTGTGTGyGTGTGyGTGTG: > FASTA description of motif 18 TGTGTGTGTGTGTGTrTGTk: > FASTA description of motif 23 ------------- Binding site (1): TGCTCAATGAA: > FASTA description of motif 2 TwTykbsbTykbwTymTyhy: > FASTA description of motif 5 ------------- 0 > Upstream of > 1 ------------- CRM within 211.5 bp of TSS. ------------- Negated regions downstream: AAAAAAAAAAmAAAAmwAAA: > FASTA description of motif 11