File: Terms and Concepts Terms and concepts that are frequently used in the Arachne/ALLPATHS documentation. Some terms have more than one meaning -- then all meanings are given in turn. Only terms used in more than one place should be defined here; local terms should be defined where they are used. ////////////////////////////////////////////////////////////////////// Group: Unclassified terms ////////////////////////////////////////////////////////////////////// Term: base vector A vector of DNA bases: a string of A/G/T/C's. Can be represented using class. Term: read set A set of reads. Can be represented as a . Term: genome unipath A constructed directly from a known reference genome. By definition, aligns perfectly to all its . See also . Term: read unipath A constructed from reads. Its alignments to the may not be perfect. Read unipaths are our best attempt to approximate the genome unipaths. When the term "unipaths" is referred to without qualification, usually read unipaths are meant. Term: unique unipath A which occurs exactly once, that is, has 1. From such unipaths, we can seed . It is a big error if we call a unipath whose actual copy number is greater than one, a unique unipath; we then end up pasting together unrelated genome regions. Term: unibase Take a unipath, convert it to base sequence, and do edits by aligning reads to this sequence -- you get a unibase. Note that a unibase no longer has the strict mathematical properties of a unipath -- for example, a given may occur in two different unipaths. Term: N50 A statistical measure of how contigous (unbroken) a set of segments (, scaffolds, ungapped scaffolds, , ...) is. It is a segment size, such that half of the is composed of segments of at least this size. Note that this is an intrinsic measure that depends only on the set of segments and does not consider how well these segments align to the . See also . Term: contig A sequence of bases which we believe occurs in the genome. So, , and are examples of contigs; but there can also be longer contigs, obtained by concatenating these elements and/or applying edits, or obtaining a sequence that we believe is genomic by some other method.. A contig *never has gaps*. consists of contigs, properly oriented towards each other and with gap length/deviation information among the member contigs. Term: scaffold A set of , properly oriented relative to each other and with gap length/deviation information among the contigs. An consists of scaffolds. Term: assembly A group of . Has no information on how the scaffolds are oriented relative to each other, or even whether they are from the same DNA molecule. Correspond to . Term: kmer graph A directed graph where each is a node and there is an edge from K1 to K2 if there is a where K2 follows K1 and they're both . A subgraph of the de Bruijn graph of all possible kmers of the given length. Term: unipath graph A directed graph, derived from the , where each node is a . Term: unipath A in the that starts at a kmer and follows a path in the kmer graph until it hits a branch. Has useful mathematical properties. For example, each is in exactly one unipath. Unipaths are determined by . Term: genome part One of the in a genome. Note that the order of the genome parts relative to each other is unknown or may be undefined -- for example, each part may be a separate chromosome. Term: genome part id Identifier of a . Also referred to as "genome id" for short. Term: genome pos A particular base on a particular . Term: reference A reference genome, a known genome for which we have reads. Obviously we don't have that in production runs, but can use the reference for sanity-checking of results and for developing heuristics. Term: truth Refers to knowing the actual exact sequence of the genome that was sequenced, and the exact location of each read on that genome (and therefore the error-free version of each read). Only happens when we're using . Note that knowledge of the truth is not the same as knowledge of the reference. Sometimes we get reads from a known genome (that has been sequenced by other means), but we don't know where on the genome each read originated. In that case we know the , but not the truth. We have even less knowledge when doing or , where we don't know the exact genome from which our reads come, but know the genome of another individual of the same species (resequencing) or the genome or a related species (assisted assembly). Finally, we may be doing complete 'de novo' sequencing with no knowledge at all of the genome from which our reads come. Term: reverse complement Reverse complement of a string of bases. Term: kmer space Representation of a in terms of its sequence of , rather than its sequence of bases. A particular kmer space -- the mapping of each kmer to an integer -- is created by . Term: sequence space The ordinary representation of a base vector as a string of bases. Term: genomic kmer A that actually occurs in the genome. Term: strong kmer A kmer that we believe is a . Term: occurrence An occurrence of a kmer in a read; also referred to as a "kmer instance". A given kmer may have a number of occurrences in various reads (potentially even several occurrences in the same read). Note that this may be the occurrence of the reverse complement of the kmer in the read (meaning the actual kmer as written occurs on the complementary strand). Term: forward occurrence The of a kmer in a read, such that the canonical kmer occurs in the read. Term: backward occurrence The of a kmer in a read, such that the canonical kmer occurs on the complementary strand to the strand from which the read comes. Term: kmer strength How strongly we believe that a kmer is a . Term: adj A kmer adjacency; in other words, a (k+1)-mer. Term: mutmer An alignment with no indels; the only mismatches between the two aligned sequences are in the form of base substitutions. (A perfect alignment, with no indels but no mutations either, is a special case of a mutmer). A mutmer is often a short alignment, since long alignments often have indels. Term: mux A , a construction used for . Ask about this. Term: placement A point on the reference to which a read or a unipath aligns. Differs from a full alignment in that only the start of the alignment is represented -- any substitutions, and especially any indels, are not captured by a placement and must be recomputed if you need them. Term: perfect read A read that aligns perfectly to at least one place in the reference genome. This definition doesn't necessarily mean that this read had no errors: it could have an error but still align correctly to some place in the genome, that was very similar to the place from which the read actually came. This definition also ignores . Term: passing read A read that passes a binary quality metric for reads. Term: correctible read A read that we are able to correct, as opposed to just throwing it out. That is, a read for which we can find a reasonable hypothesis of what the read originally looked like, so that the proposed picture of how the read mutated from its original sequence into the read sequence is reasonably probable, and the proposed original sequence looks reasonably genomic (i.e. we believe that it occurs in the genome). The current implementation of the above, in , is to see if a few mutations can make all kmers in the read . Term: gap This can refer to one of several things. It can mean: - a gap in an alignment (such gap is also called an indel) - a gap in a - a gap in a Topic: intensity For each read at each read position, for each base possibility, Solexa gives an intensity for that base possibility: how strongly does the raw data indicate that the base at that position is an A? a G? a T? a C? For each of these possibilities, a structure gives a value; we have one such structure for each position in each read in an <.intensities> file. Topic: quality score Handling of quality scores, which tell for each base in the read how confident we are that the base given in the read has been read correctly. Generally, read quality declines (and error probability increases) towards the end of a read. There are several general approaches to representing error probabilities: - only considering read positions -- have a fixed probability of error at a given read position, or of errors at a particular combination of positions - also considering read contents -- for individual reads, together with the reads we get the quality scores for each base. - using the improved per-read quality scores developed by and . Term: assisted assembly Assembling an unkonwn genome from low-coverage data, with the help of a known related genome. Term: paired reads A pair of reads with roughly known distance between them -- they are read from two ends of the same fragment, and we know the distribution of fragment lengths in a . Term: read pileup An abnormally high number of reads aligning to the same point on the reference -- usually caused by a highly repetitive region of the genome. Term: protected read A heuristic mechanism in error correction, where we identify some reads as being "good" and prevent them from being edited. Term: read bias The fact that reads to not fall uniformly randomly on the genome; rather, some parts of the genome have systematically more reads falling on them, and some systematically fewer. This is in addition to the natural random variation in the number of reads falling on the various places of the genome. One major read bias is the . See also Term: read start bias The fact that read starts are not distributed randomly on the genome. This can result, for example, from uneven of the genome into . Where reads start can affect what linking information we have between various genome parts: if there are no reads crossing the boundary between two particular genome bases, that limits our ability to link together . See also: , . Term: gc bias A that is strongly tied to the gc content of the read. Note that the gc content of parts of the that are not read may influence the gc bias of the read portion of the fragment. See . Term: trusted base A base that we are pretty sure is correct, according to a definition by . For each base of each read we compute whether that base is trusted or not, and put this info into a <.trusted> file. Term: ALLPATHS run One analysis of an ALLPATHS project, stored in a under a . Term: Solexa run One reading of a , on a particular date. There is normally exactly one run of each flowcell. For there are two runs, one to read one end of the and the other to read the other end. These can be viewed as one interrupted run. Each run corresponds to a directory under the . Term: run 1) ; 2) . ////////////////////////////////////////////////////////////////////// Group: General sequencing terms Terms used in the field of DNA sequencing generally, without reference to a particular sequencing technology. Term: fragment One fragment of the sheared genome, which we attempt to read. Each fragment is amplified by PCR and gives rise to a . Term: insert See . (The term comes from older DNA sequencing technologies where the fragment to be sequenced was inserted into a plasmid; for we're not using plasmids anymore, so fragment is the more accurate term.) Term: library A collection of of a particular size (plus or minus a bit). ///////////////////////////////////////////////////////////////////////// Group: Solexa terms Terms specific to the sequencing platform. Term: cycle The reading of one base position during a ; this reads the next base in each , adding one base to each . Term: flowcell Contains several . Term: lane A lane on the glass slide (a ), with on the walls. Term: cluster A cluster of copies of a fragment -- a spot on the wall of one lane of a Solexa ; corresponds to one read. Also known as "polony" - polymer colony. Term: polony See . Group: Sanger sequencing terms Term: trace The raw data of a Sanger read. Group: Software engineering terms Term: Type concept A set of requirements on a type; a specification of what members the type must have and what their semantics must be. Same thing as an STL concept, or a concept in . We call them _type concepts_ rather than just concept to leave room for the normal meaning of the word "concept", to denote some aspect of the problem domain. Section: Miscellaneous Abbreviations: Varous abbreviations FLM - first local minimum of a ; most kmers with frequencies less than this are erroneous (non-genomic), while most kmers with frequencies greater than this are genomic (or so we hope); see . FC - A Solexa RC - See rc - See WGA - Whole Genome Assembly Synonyms: Various synonyms reference genome - See genome position - See genome base - See genome id - See k-mer - See K-mer - See basevector - See sequence - See pileup - See strength - See strong - See GC bias - See base quality - See trusted - See NaturalDocs - See www.naturaldocs.org Natural Docs - See www.naturaldocs.org doxygen - See www.doxygen.org Doxygen - See www.doxygen.org ConceptC++ - See http://www.generic-programming.org/languages/conceptcpp kmer instance - See quality control - See Scaffold - See finished sequence - See genome contig - See