PseudoPipeline
Iain MacCallum, Thu, Dec 20, 2007 at 4:27 PM
To: Ilya Shlyakhter

PseudoUnipather Pipeline


BuildColiForPseudo

  BuildColiForPseudo N=15 POSTFIX=_iainm

Collects together Solexa reads from the pipeline and calculates the trusted bases in those reads. Writes a single fastb file and a matching vecbitvector marking the bases in the reads that are 'trusted'. Operates on a fixed set of lanes. The parameter N selects the subset of those lanes to use, where N is the number of lanes.

PARAMETERS:
  POSTFIX - added to the directory name
  N - number of lanes to use

INPUT FILES: (in solexa pipeline directory, for each flow cell and lane combination)
  ..fastb
  ..intensities
  ..metrics

OUTPUT FILES: (in working directory)
  reads.fastb
  reads.trusted


ReadsToPaths

  ReadsToPaths DATA=projects/ALLPATHS/E.coli.MG1655.Broad1 RUN=MadDash/full15_iainm K=20 USE_QUALITY_SCORES=False GENOME_SIZE=4600K

Converts reads to paths in k-mer space.

PARAMETERS:
  USE_QUALITY_SCORES=False
  GENOME_SIZE - the genome size
  K=20

INPUT FILES:
  reads.fastb
  reads.qualb

OUTPUT FILES:
  reads.paths.k*
  reads.paths_rc.k*
  reads.pathsdb.k*
  reads.paths.mult.k*
  ReadsToPaths.log


MarkTrusted

  MarkTrusted DATA=projects/ALLPATHS/E.coli.MG1655.Broad1 RUN=MadDash/full15_iainm K=20 MIN_SUPPORT=4

Identifies, among the kmers in the reads, the trusted kmers -- that is, the kmers we believe to be genomic. (So, trusted kmers approximate the same quality of kmers as [...].) For each kmer, we look at its occurrences in the reads, and if each base in the kmer is supported by at least MIN_SUPPORT in the occurrences, then we call the kmer trusted.

PARAMETERS:
  K=20
  MIN_SUPPORT=4

INPUT FILES:
  reads.paths.k*
  reads.paths_rc.k*
  reads.pathsdb.k*
  reads.trusted

OUTPUT FILES:
  kmers.trusted


UnipathsFromTrustedKmers

  UnipathsFromTrustedKmers DATA=projects/ALLPATHS/E.coli.MG1655.Broad1 RUN=MadDash/full15_iainm K=20 TRUSTED=kmers.trusted

Given a sorted vec of trusted kmer ids, computes the unipaths from the subsections of the reads that are trusted.
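
One way to picture "the subsections of the reads that are trusted" is as maximal runs of consecutive trusted kmers within a read. The sketch below illustrates that reading only; it assumes kmers are plain strings held in a Python set, whereas the actual module works on kmer ids in path space. The function name trusted_segments is hypothetical.

```python
def trusted_segments(read, k, trusted_kmers):
    """Return the maximal substrings of `read` in which every k-mer is trusted.

    Assumption: `trusted_kmers` is a set of k-mer strings (the real pipeline
    uses a sorted vec of kmer ids instead).
    """
    segments = []
    start = None  # index of the first k-mer in the current trusted run
    n_kmers = len(read) - k + 1
    for i in range(max(n_kmers, 0)):
        if read[i:i + k] in trusted_kmers:
            if start is None:
                start = i
        elif start is not None:
            # The run covered k-mers start..i-1, i.e. bases start..i+k-2.
            segments.append(read[start:i + k - 1])
            start = None
    if start is not None:
        # The run extends to the last k-mer, so it ends at the read's last base.
        segments.append(read[start:])
    return segments
```

For example, with K=3 a read whose middle kmers are untrusted yields two separate segments, one on each side of the untrusted stretch.
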

PARAMETERS:
  K=20
  TRUSTED=kmers.trusted

INPUT FILES:
  reads.paths.k*
  reads.paths_rc.k*
  kmers.trusted

OUTPUT FILES:
  reads.trusted_unipaths.k*
  reads.trusted_unipathsdb.k*

CACHED FILES:
  reads.trusted_paths.k*
  reads.trusted_paths_rc.k*


CorrectUnipaths

  CorrectUnipaths DATA=projects/ALLPATHS/E.coli.MG1655.Broad1 RUN=MadDash/full15_iainm K=20

For each unipath, finds the reads that are anchored to it by a kmer, lays them on the unipath, and counts the trusted read bases lying over each unipath base. If the majority vote is different from the unipath base, the base is changed.

PARAMETERS:
  K=20

INPUT FILES:
  reads.fastb
  reads.paths.k*
  reads.paths_rc.k*
  reads.pathsdb.k*
  reads.trusted_unipaths.k*
  reads.trusted

OUTPUT FILES:
  reads.trusted_unibases.corrected.k*.fastb


MakeLookupTable

  MakeLookupTable SOURCE=/wga/dev/WGAdata/projects/ALLPATHS/E.coli.MG1655.Broad1/MadDash/full15_iainm/reads.trusted_unibases.corrected.k20.fastb OUT_HEAD=/wga/dev/WGAdata/projects/ALLPATHS/E.coli.MG1655.Broad1/MadDash/full15_iainm/reads.trusted_unibases.corrected.k20 LOOKUP_ONLY=True

Builds a lookup table for the corrected trusted unibases fastb.

PARAMETERS:
  SOURCE=/reads.trusted_unibases.corrected.k20.fastb
  OUT_HEAD=/reads.trusted_unibases.corrected.k20
  LOOKUP_ONLY=True

INPUT FILES:
  reads.trusted_unibases.corrected.k*.fastb

OUTPUT FILES:
  reads.trusted_unibases.corrected.k*.lookup


ImperfectLookupTable

  ImperfectLookupTable SEQS=/wga/dev/WGAdata/projects/ALLPATHS/E.coli.MG1655.Broad1/MadDash/full15_iainm/reads.fastb K=12 L=/wga/dev/WGAdata/projects/ALLPATHS/E.coli.MG1655.Broad1/MadDash/full15_iainm/reads.trusted_unibases.corrected.k20.lookup FWRC=False

Finds alignments of the reads to the corrected trusted unibases fastb.
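
The general idea of aligning reads through a kmer lookup table can be sketched as follows. This is a simplified illustration, not the actual MakeLookupTable or ImperfectLookupTable code: it assumes ungapped, forward-only placements and reports only the smallest mismatch count per read. The names build_lookup and min_align_errors are hypothetical, and the on-disk .lookup format is not modelled.

```python
from collections import defaultdict


def build_lookup(targets, k):
    """Map each k-mer to the (target index, position) pairs where it occurs."""
    table = defaultdict(list)
    for t, seq in enumerate(targets):
        for pos in range(len(seq) - k + 1):
            table[seq[pos:pos + k]].append((t, pos))
    return table


def min_align_errors(read, targets, table, k):
    """Smallest mismatch count over ungapped, forward-only placements of
    `read` that share at least one k-mer with a target.

    Returns None when no k-mer of the read occurs in the table.
    """
    best = None
    seen = set()  # (target, offset) placements already scored
    for i in range(len(read) - k + 1):
        for t, pos in table.get(read[i:i + k], ()):
            offset = pos - i  # where the read would start on the target
            if (t, offset) in seen:
                continue
            seen.add((t, offset))
            target = targets[t]
            if offset < 0 or offset + len(read) > len(target):
                continue  # the read would hang off an end of the target
            window = target[offset:offset + len(read)]
            errors = sum(a != b for a, b in zip(read, window))
            if best is None or errors < best:
                best = errors
    return best
```

A smaller seed size than the unibase K (here K=12 against K=20 unibases) lets a read with a few errors still share at least one exact kmer with its target.
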

PARAMETERS:
  SEQS=/reads.fastb
  L=/reads.trusted_unibases.corrected.k20.lookup
  K=12
  FWRC=False

INPUT FILES:
  reads.fastb
  reads.trusted_unibases.corrected.k*.lookup

OUTPUT FILES:
  reads.minAlignErrors.txt
  reads.ilt.qltout


GcBiasCurve

  GcBiasCurve DATA=projects/ALLPATHS/E.coli.MG1655.Broad1 RUN=MadDash/full15_iainm K=20 EXCLUDE_PAIRED=False TRUNC=27

Estimates and stores the GC bias curve for a given K value. This shows, for each possible GC content, the positive or negative bias for reading genome regions with that GC content: do such genome regions get relatively more reads than average, or relatively fewer reads than average?

PARAMETERS:
  K=20
  EXCLUDE_PAIRED=False
  TRUNC=27

INPUT FILES:
  reads.fastb

OUTPUT FILES:
  reads.fastb.unpaired.freq.k*
  reads.gc_bias.k*


UnibaseCopyNumber

  UnibaseCopyNumber DATA=projects/ALLPATHS/E.coli.MG1655.Broad1 RUN=MadDash/full15_iainm UNIBASES=trusted_unibases.corrected UNIPATHS=trusted_unipaths ALIGNS=reads.ilt.qltout K=20 BIAS_CURVE=reads.gc_bias ERR_RATE=0.01

Determines the copy number of each unibase using read alignments.
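
As a rough illustration of how read alignments and a GC bias factor can combine into a copy number, the sketch below divides the observed number of aligned reads by the number expected for a single-copy unibase. This is an assumed simplification, not the module's actual method: it ignores ERR_RATE and read-length edge effects, and the names estimate_copy_number, reads_per_base and gc_bias are hypothetical.

```python
def estimate_copy_number(n_aligned, unibase_len, reads_per_base, gc_bias=1.0):
    """Estimate copy number as observed / expected single-copy read count.

    n_aligned      -- reads aligned to the unibase
    unibase_len    -- unibase length in bases
    reads_per_base -- assumed average read starts per genome base at copy number 1
    gc_bias        -- assumed relative coverage for this unibase's GC content
                      (1.0 means average coverage)
    """
    expected_single_copy = unibase_len * reads_per_base * gc_bias
    if expected_single_copy <= 0:
        raise ValueError("expected single-copy read count must be positive")
    # A unibase that exists at all is taken to have copy number at least 1.
    return max(1, round(n_aligned / expected_single_copy))
```

With reads_per_base=0.5, a 1000-base unibase is expected to collect about 500 reads per copy, so roughly 1500 aligned reads would suggest three copies.
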

PARAMETERS:
  UNIBASES=trusted_unibases.corrected
  UNIPATHS=trusted_unipaths
  ALIGNS=reads.ilt.qltout
  K=20
  BIAS_CURVE=reads.gc_bias
  ERR_RATE=0.01

INPUT FILES:
  reads.fastb
  reads.gc_bias.k*
  reads.trusted_unibases.corrected.k*.fastb
  reads.trusted_unipaths.k*.fastb
  reads.trusted_unipaths.k*
  reads.trusted_unipathsdb.k*
  reads.ilt.qltout

OUTPUT FILES:
  reads.trusted_unibases.corrected.preducted_count.k*

CACHED FILES:
  reads.ilt.qltout.query_is_aligned
  reads.ilt.qltout.num_aligns_per_target


PseudoUnipather

  PseudoUnipather DATA=projects/ALLPATHS/E.coli.MG1655.Broad1 RUN=MadDash/full15_iainm K=20 SUBDIR=iainm.12dec2007 PREDICTED_CN=True LINKAGE_CN=True KPATH_CONSENSUS=True MIN_KMERS_CN1=30 LINE_LINK_TIMEOUTS="{1,10}" TRUSTED_FIRST=True

INPUT FILES:
  reads.fastb
  reads.paths.k*
  reads.paths_rc.k*
  reads.pathsdb.k*
  kmers.trusted
  reads.trusted_unipaths.k*
  reads.trusted_unipathsdb.k*
  reads.trusted_unibases.corrected.preducted_count.k*
  (optional)
  reads.trusted

OUTPUT FILES:
  reads.trusted_pseudounibases.k*.fastb
  reads.trusted_pseudounibases.k*.summary
  reads.trusted_pseudounibases.k*.superb
  reads.trusted_pseudounibases.k*.superb_index

CACHED FILES:
  reads.trusted_pseudounibases.befLinkLines
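
The file lists above imply an ordering between the stages. The sketch below is a small consistency check, not part of the pipeline: it walks the stages in the order given and reports any input file that no earlier stage lists as an output. File names are copied from the lists in this message. The Solexa pipeline inputs of BuildColiForPseudo are treated as external and left out, and the helper name unproduced_inputs is hypothetical.

```python
# File names copied from the INPUT FILES / OUTPUT FILES lists in this message.
PATHS = ["reads.paths.k*", "reads.paths_rc.k*", "reads.pathsdb.k*"]
UNIPATHS = ["reads.trusted_unipaths.k*", "reads.trusted_unipathsdb.k*"]
UNIBASES = "reads.trusted_unibases.corrected.k*.fastb"
LOOKUP = "reads.trusted_unibases.corrected.k*.lookup"
COUNT = "reads.trusted_unibases.corrected.preducted_count.k*"

# (stage, inputs, outputs), in the order the stages are described.
STAGES = [
    ("BuildColiForPseudo", [], ["reads.fastb", "reads.trusted"]),
    ("ReadsToPaths", ["reads.fastb", "reads.qualb"],
     PATHS + ["reads.paths.mult.k*", "ReadsToPaths.log"]),
    ("MarkTrusted", PATHS + ["reads.trusted"], ["kmers.trusted"]),
    ("UnipathsFromTrustedKmers", PATHS[:2] + ["kmers.trusted"], UNIPATHS),
    ("CorrectUnipaths",
     ["reads.fastb"] + PATHS + [UNIPATHS[0], "reads.trusted"], [UNIBASES]),
    ("MakeLookupTable", [UNIBASES], [LOOKUP]),
    ("ImperfectLookupTable", ["reads.fastb", LOOKUP],
     ["reads.minAlignErrors.txt", "reads.ilt.qltout"]),
    ("GcBiasCurve", ["reads.fastb"],
     ["reads.fastb.unpaired.freq.k*", "reads.gc_bias.k*"]),
    ("UnibaseCopyNumber",
     ["reads.fastb", "reads.gc_bias.k*", UNIBASES,
      "reads.trusted_unipaths.k*.fastb"] + UNIPATHS + ["reads.ilt.qltout"],
     [COUNT]),
    ("PseudoUnipather",
     ["reads.fastb"] + PATHS + ["kmers.trusted"] + UNIPATHS
     + [COUNT, "reads.trusted"],
     ["reads.trusted_pseudounibases.k*.fastb"]),
]


def unproduced_inputs(stages):
    """Return {stage: [inputs that no earlier stage lists as an output]}."""
    produced = set()
    missing = {}
    for name, inputs, outputs in stages:
        absent = [f for f in inputs if f not in produced]
        if absent:
            missing[name] = absent
        produced.update(outputs)
    return missing


if __name__ == "__main__":
    for stage, files in unproduced_inputs(STAGES).items():
        print(stage, "needs", ", ".join(files))
```

Run against the lists as written, the check flags reads.qualb for ReadsToPaths (presumably unused here, since USE_QUALITY_SCORES=False) and reads.trusted_unipaths.k*.fastb for UnibaseCopyNumber, which no stage above lists as an output.
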