scrm1 Output
$ tar -xzf scrm1v1.0.tar.gz
$ cd scrm1v1.0 $ make g++ -O3 -static -c scrm1.cpp g++ -O3 -static -c Option.cpp g++ -O3 -static -c fasta.cpp g++ -O3 -static -c Tree.cpp g++ -O3 -static -c Branch.cpp g++ -O3 -static -c Hypothesis.cpp g++ -O3 -static -o scrm1 scrm1.o Option.o fasta.o Tree.o Branch.o Hypothesis.o g++ -O3 -static -c markov.cpp g++ -O3 -static -c dna.cpp g++ -O3 -static -o pssm2mif pssm2mif.cpp Option.o fasta.o markov.o dna.o g++ -O3 -static -o fasta2markov fasta2markov.cpp markov.o fasta.o Option.o $
The basic process:
scrm1) --> CRM Output
The formats we use are:
We use MEME as a motif finder
(T. L. Bailey and C. Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", ISMB 1994)
and provide meme2pssm.py to convert the output of the MEME program
to a "PSSM" file (see file formats).
PSSM file (example): A Position Specific Scoring Matrix (PSSM) file. Each row contains a consensus sequence, E-value, and the matrix values in row-major order (for each site, then for each character in alphabetical order {A,C,G,T}).
CONSENSUS E-VALUE Site-1-A Site-1-C Site-1-G Site-1-T Site-2-A ... Site-N-T CONSENSUS E-VALUE Site-1-A Site-1-C Site-1-G Site-1-T Site-2-A ... Site-N-T ... AYrskGGG 0.03491 0.93 0.01 0.04 0.02 0.07 0.45 ... 0.00
MIF file (example): A Motif Instance File (MIF). Each row represents the presence of a motif in a sequence.
SEQUENCE_ID MOTIF_ID POSITION STRAND (COMMENTS...) SEQUENCE_ID MOTIF_ID POSITION STRAND (COMMENTS...) ... 7 2 -245 1 7 3 -199 2POSITION is a negative number, measured in base pairs from the RIGHT side of the input sequence. STRAND is 1 for template strand, 2 for transcribed (reverse complement) strand.
HIDDEN MARKOV MODEL STATISTICS file (example): These are letter frequencies following a finite set of previous sequence letters (the length of which is the Markov order).
# comments at any time begin a line with '#' # # format is LETTER-PATTERN FREQUENCY A 0.2 C 0.3 G 0.3 T 0.2 AA 0.04 AC 0.06 ....
These files can be created with the program
fasta2markov
or with a program that comes with MEME, if you're using that.
MEME output into a PSSM file
scrm1 to list individual motifs)
This section shows typical use via an example.
Files in sample-data:
Note: motif.pssm was created by running
MEME on some of the positive instances and
some "background"
sequences from yeast:
meme2pssm.py < MEME.output > motif.pssm
Here, we run the pssm2mif program (pssm2mif --help for help)
to create a set of
motif instances ("MIF" file):
[yertle $] cd sample-data/ [yertle $] [yertle $] cat positive.fa negative.fa | pssm2mif -m motif.pssm -s 100 -b ./markov > MIF Read 126 sequences (stdin). Read 40 PSSMs (motif.pssm). Read 5-th order markov model (./markov). Using forward background model as reverse complement. pssm2mif: processed 126 sequences and found 7384 motif instances.
Here is the result.
The input sequence is the result
of cat positive.fa negative.fa.
The first sequence in positive.fa will be sequence ID 0, etc.
The first motif in motif.pssm will be motif ID 0.
Now, we run the scrm1 program (scrm1 --help for help)
to find and test a CRM. We'll do one fold (fold number zero) of 10 fold cross-validation:
[yertle $] [yertle $] scrm1 -f MIF -m motif.fa -P26 -S126 -N10 -F0 ------------------------ Noto and Craven, CRM 1.0 ------------------------ Sequences: 126 Positive Examples: 26 Motifs: 40 (./motif.fa) Motif Instances: 7384 (./MIF) Max. branches: 3 Max. motifs: 3 Max. negated motifs: 1 Beam width: 10 Required (minimum) recall: 0 Use strand constraints? YES Use order constraints? YES Use distance constraints? YES Chi squared value for evaluating model distintion during search: 0 Learn parameters? YES Tune CFV folds: 10 Chi Squared value for parameter choosing: 6.25139 N = 10, f = 0 (One fold of cross-fold validation) Learned CRM aspect space: Max. binding sites = 2 Max. motifs / binding site = 1 Allow distance constraints = No Allow order constraints = No Allow strand constraints = No Max. motifs / negated region = 1 ------------- Binding site (0) on : ATACAyACmCACmmACAyAC: > ATACAyACmCACmmACAyAC (E-value: 3.7e-27) ------------- Binding site (1) on : AsAAAAAkAAA: > AsAAAAAkAAA (E-value: 140.0) ------------- 0, 1 in either order 0, 1 any distance apart ------------- Negated regions downstream: AArArrrArwwrhrArrwrAAAAvdAAAA: > AArArrrArwwrhrArrwrAAAAvdAAAA (E-value: 2.3e-51) Trainset Results: TP=23 FP=9 TN=81 FN=0 P=0.71875 R=1 S=0.836364 Testset Results: TP=3 FP=2 TN=8 FN=0 P=0.6 R=1 S=0.75 3 2 8 0 <-- this is stdout, TEST set results (True positives, FP, TN, FN)
The -P 26 means there are 26 Positive sequences, and -S 126 means 126
sequences total (you can run fasta-count < positive.fa to count them).
In verbose mode, durning training, the program will print a short version of a CRM solution, along with traning set statistics:
[(3 18 23)tpl,(2 5)] tss = 211.5 neg. down = {11} (0 < 1)
Each binding site is numbered, starting with zero.
This (above) means:
After learning, the program will list the learned CRM aspects, which define the final search space, and a more descriptive version of the final CRM, including the consensus motif sequences:
------------- Binding site (0) on tpl: CCAyyCATssATCmrTCsAT: > FASTA description of motif 3 GTGTGTGyGTGTGyGTGTG: > FASTA description of motif 18 TGTGTGTGTGTGTGTrTGTk: > FASTA description of motif 23 ------------- Binding site (1): TGCTCAATGAA: > FASTA description of motif 2 TwTykbsbTykbwTymTyhy: > FASTA description of motif 5 ------------- 0 > Upstream of > 1 ------------- CRM within 211.5 bp of TSS. ------------- Negated regions downstream: AAAAAAAAAAmAAAAmwAAA: > FASTA description of motif 11