ilya shlyakhter <ilya239@gmail.com>
PseudoPipeline
Iain MacCallum <iainm@broad.mit.edu>	 Thu, Dec 20, 2007 at 4:27 PM
To: Ilya Shlyakhter <ilya@broad.mit.edu>


PseudoUnipather Pipeline

BuildColiForPseudo
BuildColiForPseudo N=15 POSTFIX=_iainm

Collects Solexa reads from the pipeline and calculates the trusted bases in those reads. Writes a single fastb file and a matching vecbitvector marking which bases in the reads are 'trusted'. Operates on a fixed set of lanes; the parameter N gives the number of lanes from that set to use.

PARAMETERS:
POSTFIX - added to the directory name
N - number of lanes to use

INPUT FILES: (in the Solexa pipeline directory, one set per flowcell and lane combination)
<flowcell>.<lane>.fastb
<flowcell>.<lane>.intensities
<flowcell>.<lane>.metrics

OUTPUT FILES: (in working directory)
reads.fastb
reads.trusted

ReadsToPaths
ReadsToPaths DATA=projects/ALLPATHS/E.coli.MG1655.Broad1 RUN=MadDash/full15_iainm K=20 USE_QUALITY_SCORES=False GENOME_SIZE=4600K

Convert reads to paths in k-mer space.

PARAMETERS:
USE_QUALITY_SCORES=False
GENOME_SIZE - the genome size
K=20

INPUT FILES:
reads.fastb
reads.qualb

OUTPUT FILES:
reads.paths.k*
reads.paths_rc.k*
reads.pathsdb.k*
reads.paths.mult.k*
ReadsToPaths.log
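
As a rough illustration of what "paths in k-mer space" means, the sketch below numbers each distinct k-mer and rewrites every read as intervals of consecutive k-mer ids. The numbering scheme (order of first appearance) and the function names are assumptions made for this example; they are not how ReadsToPaths itself assigns k-mer numbers, and reverse complements are ignored.

```python
def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def reads_to_paths(reads, k):
    """Rewrite each read as a list of (first_id, last_id) k-mer id intervals.

    Ids are handed out in order of first appearance (an assumption made
    here for illustration only).
    """
    ids = {}
    paths = []
    for read in reads:
        path = []
        for kmer in kmers(read, k):
            kid = ids.setdefault(kmer, len(ids))
            if path and path[-1][1] + 1 == kid:
                path[-1] = (path[-1][0], kid)   # extend the current interval
            else:
                path.append((kid, kid))         # start a new interval
        paths.append(path)
    return paths, ids


# Two overlapping toy reads; they share the k-mer GTAC.
paths, ids = reads_to_paths(["ACGTAC", "GTACGG"], k=4)
```

Each read collapses to a single interval here because its k-mers received consecutive ids; a read that revisits an earlier k-mer would need several intervals.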

MarkTrusted
MarkTrusted DATA=projects/ALLPATHS/E.coli.MG1655.Broad1 RUN=MadDash/full15_iainm K=20 MIN_SUPPORT=4

Identify the <trusted kmers> among the kmers in the reads, that is, the kmers we believe to be genomic. (So trusted kmers approximate the same set of kmers as <strong kmers>.) For each kmer, we look at its <occurrences>; if each base in the kmer is supported by at least MIN_SUPPORT <trusted bases> across those occurrences, we call the kmer trusted.

PARAMETERS:
K=20
MIN_SUPPORT=4

INPUT FILES:
reads.paths.k*
reads.paths_rc.k*
reads.pathsdb.k*
reads.trusted

OUTPUT FILES:
kmers.trusted
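
The support test described above can be sketched as follows. This is a minimal illustration on plain strings and boolean lists, not the MarkTrusted implementation: the function name and data layout are assumptions, only the forward strand is considered, and the real tool works from the paths files rather than from raw sequence.

```python
from collections import defaultdict


def trusted_kmers(reads, trusted, k, min_support):
    """Return the k-mers whose every base is supported by at least
    min_support trusted bases across the k-mer's occurrences.

    reads   -- list of base strings
    trusted -- list of lists of booleans, one flag per read base
    """
    support = defaultdict(lambda: [0] * k)   # k-mer -> per-position counts
    for seq, flags in zip(reads, trusted):
        for start in range(len(seq) - k + 1):
            counts = support[seq[start:start + k]]
            for offset in range(k):
                if flags[start + offset]:
                    counts[offset] += 1
    return {kmer for kmer, counts in support.items()
            if min(counts) >= min_support}


# Toy data: the last base of the second read is untrusted.
reads = ["ACGT", "ACGT", "ACGA"]
flags = [[True] * 4, [True, True, True, False], [True] * 4]
result = trusted_kmers(reads, flags, k=3, min_support=2)
```

In this toy input only ACG is trusted: CGT occurs twice, but its final base has a single trusted occurrence, which is below the threshold of 2.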

UnipathsFromTrustedKmers
UnipathsFromTrustedKmers DATA=projects/ALLPATHS/E.coli.MG1655.Broad1 RUN=MadDash/full15_iainm K=20 TRUSTED=kmers.trusted

Given a sorted vec of trusted kmer ids, compute the unipaths from the subsections of the reads that are trusted.

PARAMETERS:
K=20
TRUSTED=kmers.trusted

INPUT FILES:
reads.paths.k*
reads.paths_rc.k*
kmers.trusted

OUTPUT FILES:
reads.trusted_unipaths.k*
reads.trusted_unipathsdb.k*

CACHED FILES:
reads.trusted_paths.k*
reads.trusted_paths_rc.k*
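
For readers unfamiliar with unipaths, the sketch below builds them as maximal unbranched chains in the k-mer overlap graph of a set of trusted k-mers. It is an illustration under simplifying assumptions: it works on base strings rather than on k-mer path intervals, it handles the forward strand only, and the function names are made up for this example.

```python
def unipaths(kmer_set):
    """Return the maximal unbranched k-mer chains, each as a base string."""
    bases = "ACGT"

    def successors(kmer):
        return [kmer[1:] + b for b in bases if kmer[1:] + b in kmer_set]

    def predecessors(kmer):
        return [b + kmer[:-1] for b in bases if b + kmer[:-1] in kmer_set]

    def sole_next(kmer):
        """Successor of kmer if the edge to it is unbranched, else None."""
        nxt = successors(kmer)
        if len(nxt) == 1 and len(predecessors(nxt[0])) == 1:
            return nxt[0]
        return None

    def sole_prev(kmer):
        """Predecessor of kmer if the edge from it is unbranched, else None."""
        prev = predecessors(kmer)
        if len(prev) == 1 and len(successors(prev[0])) == 1:
            return prev[0]
        return None

    seen, result = set(), []
    for kmer in sorted(kmer_set):
        if kmer in seen:
            continue
        # Walk left to the start of the chain (stop if we loop back around).
        start = kmer
        prev = sole_prev(start)
        while prev is not None and prev != kmer:
            start = prev
            prev = sole_prev(start)
        # Walk right from the start, appending one base per k-mer.
        chain, cur = start, start
        seen.add(cur)
        nxt = sole_next(cur)
        while nxt is not None and nxt not in seen:
            chain += nxt[-1]
            seen.add(nxt)
            cur = nxt
            nxt = sole_next(cur)
        result.append(chain)
    return sorted(result)


# ACG branches to CGT and CGA, so the graph splits into three unipaths.
example = unipaths({"ACG", "CGT", "GTT", "CGA"})
```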

CorrectUnipaths
CorrectUnipaths DATA=projects/ALLPATHS/E.coli.MG1655.Broad1 RUN=MadDash/full15_iainm K=20

For each unipath, find the reads anchored to it by a kmer, lay them on the unipath, and count the trusted read bases lying over each unipath base. If the majority vote differs from the unipath base, change it.

PARAMETERS:
K=20

INPUT FILES:
reads.fastb
reads.paths.k*
reads.paths_rc.k*
reads.pathsdb.k*
reads.trusted_unipaths.k*
reads.trusted

OUTPUT FILES:
reads.trusted_unibases.corrected.k*.fastb
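
The voting step can be sketched as below. This assumes the read placements are already known (in the real tool they come from the kmer anchors) and interprets "majority" as a strict majority of the trusted votes at a position; both the data layout and that interpretation are assumptions made for this example.

```python
from collections import Counter


def correct_unipath(unipath, placements):
    """Apply a per-base majority vote of trusted read bases to one unipath.

    placements -- list of (offset, read_bases, trusted_flags), where offset
                  is the unipath position of the read's first base
    """
    votes = [Counter() for _ in unipath]
    for offset, bases, flags in placements:
        for i, (base, ok) in enumerate(zip(bases, flags)):
            pos = offset + i
            if ok and 0 <= pos < len(unipath):
                votes[pos][base] += 1
    corrected = []
    for base, tally in zip(unipath, votes):
        if tally:
            winner, count = tally.most_common(1)[0]
            # Change the base only on a strict majority of the trusted votes.
            if winner != base and 2 * count > sum(tally.values()):
                base = winner
        corrected.append(base)
    return "".join(corrected)


# Two reads say C at position 2, one says G; the unipath base G is outvoted.
placements = [
    (0, "ACCT", [True, True, True, True]),
    (1, "CCTA", [True, True, True, True]),
    (2, "GTA", [True, False, True]),
]
fixed = correct_unipath("ACGTA", placements)
```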

MakeLookupTable
MakeLookupTable SOURCE=/wga/dev/WGAdata/projects/ALLPATHS/E.coli.MG1655.Broad1/MadDash/full15_iainm/reads.trusted_unibases.corrected.k20.fastb OUT_HEAD=/wga/dev/WGAdata/projects/ALLPATHS/E.coli.MG1655.Broad1/MadDash/full15_iainm/reads.trusted_unibases.corrected.k20 LOOKUP_ONLY=True

Build lookup table for corrected trusted unibases fastb.

PARAMETERS:
SOURCE=<fullpath>/reads.trusted_unibases.corrected.k20.fastb
OUT_HEAD=<fullpath>/reads.trusted_unibases.corrected.k20
LOOKUP_ONLY=True

INPUT FILES:
reads.trusted_unibases.corrected.k*.fastb

OUTPUT FILES:
reads.trusted_unibases.corrected.k*.lookup
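
Unlike the earlier stages, this one takes full paths rather than DATA and RUN. Judging from the example commands, <fullpath> is the data root followed by DATA and RUN. The sketch below assembles the arguments on that assumption; the helper names are hypothetical and not part of the pipeline.

```python
import posixpath

# Data root as it appears in the example command above.
WGA_ROOT = "/wga/dev/WGAdata"


def run_file(data, run, name, root=WGA_ROOT):
    """Full path of a file in the run directory (assumed root/DATA/RUN)."""
    return posixpath.join(root, data, run, name)


def make_lookup_table_args(data, run, k):
    """Argument list for MakeLookupTable on the corrected trusted unibases."""
    head = run_file(data, run, f"reads.trusted_unibases.corrected.k{k}")
    return [
        "MakeLookupTable",
        f"SOURCE={head}.fastb",
        f"OUT_HEAD={head}",
        "LOOKUP_ONLY=True",
    ]


args = make_lookup_table_args(
    data="projects/ALLPATHS/E.coli.MG1655.Broad1",
    run="MadDash/full15_iainm",
    k=20,
)
```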

ImperfectLookupTable
ImperfectLookupTable SEQS=/wga/dev/WGAdata/projects/ALLPATHS/E.coli.MG1655.Broad1/MadDash/full15_iainm/reads.fastb K=12 L=/wga/dev/WGAdata/projects/ALLPATHS/E.coli.MG1655.Broad1/MadDash/full15_iainm/reads.trusted_unibases.corrected.k20.lookup FWRC=False

Find alignments of the reads to the corrected trusted unibases, using the lookup table built by MakeLookupTable.

PARAMETERS:
SEQS=<fullpath>/reads.fastb
L=<fullpath>/reads.trusted_unibases.corrected.k20.lookup
K=12
FWRC=False

INPUT FILES:
reads.fastb
reads.trusted_unibases.corrected.k*.lookup

OUTPUT FILES:
reads.minAlignErrors.txt
reads.ilt.qltout

GcBiasCurve
GcBiasCurve DATA=projects/ALLPATHS/E.coli.MG1655.Broad1 RUN=MadDash/full15_iainm K=20 EXCLUDE_PAIRED=False TRUNC=27

Estimate and store the <GC bias> curve for a given K value. For each possible GC content, the curve gives the positive or negative bias in reading genome regions with that GC content: do such regions get relatively more reads than average, or relatively fewer?

PARAMETERS:
K=20
EXCLUDE_PAIRED=False
TRUNC=27

INPUT FILES:
reads.fastb

OUTPUT FILES:
reads.fastb.unpaired.freq.k*
reads.gc_bias.k*
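
One way to picture such a curve is sketched below: count how often each k-mer occurs in the reads, group the k-mers by GC count, and compare each group's mean frequency with the overall mean. This is only a guess at the general idea, not the estimator GcBiasCurve actually uses; it assumes most k-mers occur once in the genome, and the function name is made up for this example.

```python
from collections import Counter, defaultdict


def gc_bias_curve(reads, k):
    """Map GC count (0..k) to mean k-mer frequency relative to the overall mean.

    Values above 1 mean k-mers with that GC count are read more often than
    average; values below 1 mean they are read less often.
    """
    freq = Counter(seq[i:i + k] for seq in reads
                   for i in range(len(seq) - k + 1))
    if not freq:
        return {}
    by_gc = defaultdict(list)
    for kmer, n in freq.items():
        by_gc[kmer.count("G") + kmer.count("C")].append(n)
    overall = sum(freq.values()) / len(freq)
    return {gc: (sum(ns) / len(ns)) / overall
            for gc, ns in sorted(by_gc.items())}


# Toy input: the AT-only k-mer is seen three times, the GC-rich one once.
curve = gc_bias_curve(["AAT", "AAT", "AAT", "GCA"], k=3)
```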

UnibaseCopyNumber
UnibaseCopyNumber DATA=projects/ALLPATHS/E.coli.MG1655.Broad1 RUN=MadDash/full15_iainm UNIBASES=trusted_unibases.corrected UNIPATHS=trusted_unipaths ALIGNS=reads.ilt.qltout K=20 BIAS_CURVE=reads.gc_bias ERR_RATE=0.01

Determines copy number of each unibase using read alignments.

PARAMETERS:
UNIBASES=trusted_unibases.corrected
UNIPATHS=trusted_unipaths
ALIGNS=reads.ilt.qltout
K=20
BIAS_CURVE=reads.gc_bias
ERR_RATE=0.01

INPUT FILES:
reads.fastb
reads.gc_bias.k*
reads.trusted_unibases.corrected.k*.fastb
reads.trusted_unipaths.k*.fastb
reads.trusted_unipaths.k*
reads.trusted_unipathsdb.k*
reads.ilt.qltout

OUTPUT FILES:
reads.trusted_unibases.corrected.preducted_count.k*

CACHED FILES:
reads.ilt.qltout.query_is_aligned
reads.ilt.qltout.num_aligns_per_target
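
The general idea of turning alignment counts into a copy number can be sketched as follows: compare the number of reads aligned to a unibase with the number expected if the unibase occurred once in the genome. This is a simplified illustration, not the model UnibaseCopyNumber uses; in particular it does not model ERR_RATE, it treats the GC bias as a single multiplier, and all names and numbers are hypothetical.

```python
def predicted_copy_number(num_aligns, unibase_len, read_len,
                          starts_per_base, bias=1.0):
    """Estimate copy number from the count of reads aligned to a unibase.

    starts_per_base -- average read starts per genome position
                       (number of reads / genome size)
    bias            -- GC bias multiplier for this unibase (1.0 = none)
    """
    positions = unibase_len - read_len + 1   # places a whole read can start
    if positions <= 0:
        return None                          # unibase shorter than a read
    expected_at_cn1 = positions * starts_per_base * bias
    return max(0, round(num_aligns / expected_at_cn1))


# Hypothetical numbers: 1000 start positions, 0.5 read starts per base,
# so about 500 reads are expected at copy number 1.
cn = predicted_copy_number(num_aligns=1010, unibase_len=1026,
                           read_len=27, starts_per_base=0.5)
```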

PseudoUnipather
PseudoUnipather DATA=projects/ALLPATHS/E.coli.MG1655.Broad1 RUN=MadDash/full15_iainm K=20 SUBDIR=iainm.12dec2007 PREDICTED_CN=True LINKAGE_CN=True KPATH_CONSENSUS=True MIN_KMERS_CN1=30 LINE_LINK_TIMEOUTS="{1,10}" TRUSTED_FIRST=True

INPUT FILES:
reads.fastb
reads.paths.k*
reads.paths_rc.k*
reads.pathsdb.k*
kmers.trusted
reads.trusted_unipaths.k*
reads.trusted_unipathsdb.k*
reads.trusted_unibases.corrected.preducted_count.k* (optional)
reads.trusted
<Plus alignments of paired reads to unibases - hardcoded and currently outside pipeline>

OUTPUT FILES:
reads.trusted_pseudounibases.k*.fastb
reads.trusted_pseudounibases.k*.summary
reads.trusted_pseudounibases.k*.superb
reads.trusted_pseudounibases.k*.superb_index

CACHED FILES:
reads.trusted_pseudounibases.befLinkLines