This manual and the software it describes are copyright © 2001-2004 The Broad Institute/Massachusetts Institute of Technology.
This product includes software developed by the Apache Software Foundation (http://www.apache.org).
This product includes software developed by the University of California, Berkeley and its contributors.
As input, Arachne requires the base calls and associated quality scores of each read (as produced by most base-calling software, such as PHRED), as well as ancillary information about each read (in a standard format described herein).
As output, Arachne produces a list of supercontigs ("scaffolds") -- each of which consists of an ordered list of contigs, all forward-oriented, and estimates for the gaps between them within the supercontig. Base calls and quality scores are provided for each contig, along with the approximate locations of the reads used to build it. Arachne also produces a summary and brief analysis of the assembly.
Many of Arachne's algorithms are described in "ARACHNE: A Whole-Genome Shotgun Assembler", Genome Research, January 2002, and "Whole-Genome Sequence Assembly for Mammalian Genomes: ARACHNE 2", Genome Research, January 2003.
We explain here the assumptions Arachne makes about your sequence reads.
These reads come from the entire genome of an organism or a cloned fragment thereof (but not both simultaneously), which we call the target. The target should be from a single haplotype: Arachne does not support the assembly of polymorphic data at this time.
Regardless of the target, Arachne assumes most sequence reads have been obtained randomly from it, either as single reads ("unpaired production reads") or as clone-end pairs ("paired production reads"). The bulk of reads provided to Arachne should be pairs from the latter category, because this pairing information is needed to assemble correctly.
To use Arachne, make a sensible division of your sequence reads into libraries (to use the term loosely), preferably placing similar reads together. Thus, there may be several libraries of paired production reads, each having different insert length statistics (mean and standard deviation, which must be the same for all reads in a single library). These library statistics must be provided to Arachne via the ancillary data files or the configuration file, as will be described later.
Arachne does have limited support for transposon-based read pairs, but does not handle other types of finishing reads at this time. If such reads are presented to the program as unpaired production reads, the performance may be acceptable, but Arachne will treat the reads as though they were obtained randomly from the genome.
Arachne is available as compiled binaries for a single platform: Compaq Alpha hardware, running Tru64 Unix, operating system version 5.1. This exact platform is required. The Arachne source code, while unsupported, is available at ftp://ftp.broad.mit.edu/pub/wga/Arachne/Arachne_src.tar.gz.
Here is the procedure for installing Arachne from the compiled binaries:
Pick a location on your system for the Arachne binaries, then get and unpack ftp://ftp.broad.mit.edu/pub/wga/Arachne/Arachne_bin.tar.gz into that location.
Pick a location on your system where your data will go, then get and unpack ftp://ftp.broad.mit.edu/pub/wga/Arachne/Arachne_data.tar.gz into that location. All users of Arachne must set the environment variable ARACHNE_PRE to the full path of Arachne_data.
In the main Arachne data directory, the file vector/contigs.fasta includes a broad selection of vector sequences. Be sure to add your sequencing vectors to this file if they are not already in it.
Go to the Arachne binary directory and type
Assemble DATA=mouse_example RUN=run
Wait until it finishes. A report about the project should be produced as mouse_example/run/assembly.ps, inside the main Arachne data directory. View this file to confirm that the assembly yielded one supercontig, consisting of 52 contigs, one of which is misassembled. Verify that mouse_example/run/assembly.log ends with a message regarding normal termination.
This does not guarantee that Arachne is installed correctly, but it should reveal any major problems.
Note. If upon typing Assemble, you got an error message about "command not found", then it is probably because you need to add "." to your Unix path. You can see your path by typing "set | grep path".
Note. If upon typing Assemble, you got an error message from the loader, it may be because you do not have version 5.1 of the operating system. Type "uname -r" to find out.
Preparing your data for assembly
Notes: DATA is simply a partial path to a subdirectory of the main Arachne data directory, so nesting is allowed. For example, if in the above example we were to specify that DATA=projects/yeast, it would refer to /seq/Arachne_stuff/projects/yeast. Also, soft linking is allowed and probably necessary for very large sequencing projects. Finally, the RUN directory is automatically created by Arachne if it did not exist beforehand.
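The path rules above can be sketched in a few lines of Python. This is only an illustration: the function names data_dir and run_dir are hypothetical and not part of Arachne, and /seq/Arachne_stuff is the example main data directory used in the text. The sketch assumes, as in the mouse_example run, that the RUN directory sits inside the DATA directory.

```python
import os
import posixpath


def data_dir(data, arachne_pre=None):
    """Resolve a DATA value to a full path under the main Arachne data directory.

    If arachne_pre is not given, it is read from the ARACHNE_PRE environment variable.
    """
    if arachne_pre is None:
        arachne_pre = os.environ["ARACHNE_PRE"]
    # DATA is a partial path, so nested values such as projects/yeast are allowed.
    return posixpath.join(arachne_pre, data)


def run_dir(data, run, arachne_pre=None):
    """Resolve a RUN value to a full path inside the DATA directory (assumed layout)."""
    return posixpath.join(data_dir(data, arachne_pre), run)


print(data_dir("projects/yeast", "/seq/Arachne_stuff"))
print(run_dir("mouse_example", "run", "/seq/Arachne_stuff"))
```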
We use only a subset of the fields specified in the Trace Archive XML definition:
In addition, we have a field that is not part of the Trace Archive Format and therefore must be set using the configuration file (see above for a brief description).
Important: Every read must appear in exactly three places in the Arachne input files: in a read sequence file, in a read quality score file, and in an XML ancillary file. Read identities are defined by read names, and read names are determined as follows. For read sequence and read quality score files, we take the rightmost white-space-free string on a ">" line. For example, ">gnl|ti|3 G10P69425RH3.T0" would yield the read name "G10P69425RH3.T0". For XML ancillary files, read names are defined by the TRACE_NAME field.
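As a rough illustration of the read-naming rule, the following Python sketch extracts read names from ">" lines and reports reads that do not appear in all three inputs. The function names are hypothetical, and stripping the leading ">" before splitting is an assumption about headers that contain no white space; Arachne's own checks may differ.

```python
def read_name_from_header(header_line):
    """Return the rightmost white-space-free string on a '>' line."""
    if not header_line.startswith(">"):
        raise ValueError("not a '>' line: %r" % header_line)
    # Assumption: the leading '>' is not part of the read name.
    tokens = header_line[1:].split()
    if not tokens:
        raise ValueError("header line has no read name")
    return tokens[-1]


def incomplete_reads(sequence_names, quality_names, xml_names):
    """Return, sorted, the read names missing from at least one of the three inputs."""
    everything = set(sequence_names) | set(quality_names) | set(xml_names)
    complete = set(sequence_names) & set(quality_names) & set(xml_names)
    return sorted(everything - complete)


print(read_name_from_header(">gnl|ti|3 G10P69425RH3.T0"))
```

A pre-flight check of this kind can be run over the sequence, quality, and XML files before starting an assembly, so that reads with missing information are noticed early.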
The configuration file (reads_config.xml) allows you to correct and augment the information presented to Arachne via the XML files.
For example, if any of the required fields are missing from your XML ancillary files, you do not need to modify the XML file itself before running Arachne. You can simply write a configuration file that will provide the missing information to Arachne.
Also, the configuration file allows for an easy way to set parameters that are common to a group of reads. For instance, below we demonstrate how to set insert size and insert size standard deviation for a particular library.
Note that you must use the configuration file to set the type field. If you try to include the type field in the XML file, Arachne will fail because the XML file will not conform to the Trace Archive Format specification (see above).
Here we give a somewhat informal explanation of how the configuration files are constructed. However, as an XML file, the configuration file has a formal "document type" definition, which can be found in the file dtds/configuration.dtd (in the main Arachne data directory).
Begin the configuration file with the following text:
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "configuration.dtd">
<configuration>
and end it with:
</configuration>
The types of constructs that can be put in between are comments, macros, and most importantly, rules.
Comments are in the standard XML format, for example:
<!-- ******** Some contaminated reads, to be tossed ******** -->
Macros facilitate abbreviation, pointless or otherwise, for example:
<macro name="gh">gringlehopper</macro>
would change every subsequent occurrence of the string $gh to the string gringlehopper. Any text could have been used in place of gringlehopper.
Rules require more explanation, because they have nontrivial syntax. For example,
<rule>
  <name> exclude probable human reads </name>
  <match>
    <match_field>plate_id</match_field>
    <literal>G10P6007</literal>
  </match>
  <match>
    <match_field>plate_id</match_field>
    <regex>^G10P613[01]$</regex>
  </match>
  <action><remove /></action>
</rule>
would cause all reads having plate_id G10P6007, G10P6130, or G10P6131 to be ignored by Arachne.
More generally, a rule is defined by three fields: <name>, <match>, and <action>.
A given rule can have more than one <match> tag. If there is more than one, then the rule is applied to reads that match any of those <match> tags.
Each <match> tag contains the names of the fields it checks (in <match_field> tags) and the values it expects in those fields (in <literal> or <regex> tags). If all of the specified fields match the expected values, the <match> is made and the rule's action is applied.
A <match> tag requires one or more <match_field><literal> or <match_field><regex> pairs.
Since it is a regular expression, be aware that a regex like <regex>0</regex> will match not only zero, but any string that contains a zero, such as "500" or "asdf0qwerty". If your intention is to match exactly some value, use the <literal> tag or use the start- and end-of-line markers ("^" and "$") to restrict your regular expression to match exactly the entire string in that field, e.g. <literal>0</literal> or <regex>^0$</regex>.
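The matching behavior described above can be modeled with a short Python sketch. It is not Arachne's implementation: Python's re module stands in for whatever regular expression engine Arachne uses, the tuple representation of a <match> tag is invented for illustration, and treating a missing field as an empty string is an assumption.

```python
import re


def condition_holds(condition, read):
    """Check one (field, kind, pattern) condition; kind is 'literal' or 'regex'."""
    field, kind, pattern = condition
    value = read.get(field, "")  # assumption: a missing field behaves like ""
    if kind == "literal":
        return value == pattern  # a literal must equal the whole field value
    # An unanchored regex matches anywhere inside the field value.
    return re.search(pattern, value) is not None


def match_holds(match, read):
    """A <match> is made only if all of its field conditions hold."""
    return all(condition_holds(c, read) for c in match)


def rule_applies(matches, read):
    """A rule applies if any one of its <match> tags is made."""
    return any(match_holds(m, read) for m in matches)


# The two <match> tags of the "exclude probable human reads" rule.
exclude_human = [
    [("plate_id", "literal", "G10P6007")],
    [("plate_id", "regex", r"^G10P613[01]$")],
]

print(rule_applies(exclude_human, {"plate_id": "G10P6131"}))
print(condition_holds(("insert_size", "regex", "0"), {"insert_size": "500"}))
print(condition_holds(("insert_size", "regex", "^0$"), {"insert_size": "500"}))
```

The last two calls show the difference between an unanchored regex, which matches "500" because it contains a zero, and the anchored form, which does not.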
The <action> tag requires one of the following sub-tags. Only one type of sub-tag is allowed for a single <action> tag, though multiple <set> tags are allowed in a single <action> tag.
A <set> tag has the form <set> <set_field> ... </set_field> <value> ... </value> </set>.
Values of other fields associated with the matching read may be referred to in <value> tags by prepending the name of the field with an "@". Also, integer arithmetic evaluation will occur when setting numeric fields such as insert_size and insert_stdev. For example,
<set>
  <set_field> insert_stdev </set_field>
  <value>@insert_size/10</value>
</set>
will set insert_stdev to 10% of insert_size.
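To make the "@" substitution and integer arithmetic concrete, here is a minimal Python sketch of how such a <value> could be evaluated. It is a model under stated assumptions, not Arachne's evaluator: the function name evaluate_value is hypothetical, the set of supported operators (+, -, *, /) is assumed, and "/" is taken to be integer division.

```python
import ast
import operator
import re

# Assumed operator set; "/" is mapped to integer (floor) division.
_OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.floordiv,
}


def evaluate_value(expression, read_fields):
    """Replace @field references with the read's values, then do integer arithmetic."""
    substituted = re.sub(
        r"@(\w+)", lambda m: str(read_fields[m.group(1)]), expression
    )
    tree = ast.parse(substituted.strip(), mode="eval")

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPERATORS:
            return _OPERATORS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression: %r" % expression)

    return walk(tree)


print(evaluate_value("@insert_size/10", {"insert_size": 4000}))
```

Because the arithmetic is on integers, an insert_size of 4005 would also give an insert_stdev of 400 under this model.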
Here is another example of a rule that sets the insert statistics for all reads whose names begin with G20:
<rule>
  <name> set insert stats for G20 reads </name>
  <match>
    <match_field>trace_name</match_field>
    <regex>^G20</regex>
  </match>
  <action>
    <set>
      <set_field>insert_size</set_field>
      <value>4000</value>
    </set>
    <set>
      <set_field>insert_stdev</set_field>
      <value>400</value>
    </set>
  </action>
</rule>
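When a project has many libraries, rules of this shape can be generated rather than typed by hand. The sketch below uses Python's standard xml.etree.ElementTree module to build one such rule; the helper name insert_stats_rule is hypothetical, and the tag names are the ones shown in the examples in this section. The result should still be checked against dtds/configuration.dtd.

```python
import xml.etree.ElementTree as ET


def insert_stats_rule(rule_name, trace_name_regex, insert_size, insert_stdev):
    """Build a <rule> element that sets insert statistics for matching reads."""
    rule = ET.Element("rule")
    ET.SubElement(rule, "name").text = rule_name

    # One <match> tag with a single <match_field>/<regex> pair.
    match = ET.SubElement(rule, "match")
    ET.SubElement(match, "match_field").text = "trace_name"
    ET.SubElement(match, "regex").text = trace_name_regex

    # One <action> tag holding two <set> tags.
    action = ET.SubElement(rule, "action")
    for field, value in (("insert_size", insert_size), ("insert_stdev", insert_stdev)):
        set_tag = ET.SubElement(action, "set")
        ET.SubElement(set_tag, "set_field").text = field
        ET.SubElement(set_tag, "value").text = str(value)
    return rule


rule = insert_stats_rule("set insert stats for G20 reads", "^G20", 4000, 400)
print(ET.tostring(rule, encoding="unicode"))
```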
Finally, we give an example that shows how to designate every read as being a paired production read:
<rule>
  <name> all reads are paired production reads </name>
  <match>
    <match_field>trace_name</match_field>
    <regex>.</regex>
  </match>
  <action>
    <set>
      <set_field>type</set_field>
      <value>paired_production</value>
    </set>
  </action>
</rule>
Additionally, one may provide an exclusion file, a list of read names to be excluded from the assembly. This file should be named "reads.to_exclude" and should be located in the DATA directory. The reads in this file will be excluded prior to the application of any rules.
An Arachne assembly must be initiated from the Arachne binary directory, by invoking the main Arachne executable, Assemble, as follows:
Assemble DATA=the_project_directory RUN=the_results_directory
where "the_project_directory" is the name of the data directory and "the_results_directory" is the name of the run directory.
Note. Before running Assemble on your own data, be sure to test your installation by running it on mouse_example, as per the instructions given earlier.
Note. Some experimentation is needed to determine how much memory and disk space are needed for any given Arachne assembly. Running out of either will have unpredictable consequences.
Note. Simultaneous Assemble processes can share the same data directory, but not the same run directory.
Assemble accepts a number of additional command-line arguments, all optional:
A pair of these parameters affects the "positive breaking" algorithm, where Arachne attempts to find evidence that indicates that two supercontigs should be broken and one piece from each joined together instead. This evidence takes the form of long links from the middle of one supercontig to the middle of another.
Arachne requires that there be a minimum number of these links and that the links be spread over some minimum distance in each supercontig. These parameters are specified on the Assemble command line as min_cluster_size_to_break and min_cluster_spread_to_break, with default values of 5 and 50000, respectively. Both can be any positive (non-zero) value, though we recommend that min_cluster_spread_to_break be a significant fraction of the long links' estimated insert size.
Note that this code was designed for whole genome shotgun assemblies, and may not be applicable to smaller assemblies, such as BACs.
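For scripted runs, the command line can be assembled and validated before Assemble is invoked. The Python sketch below only builds the argument list and does not run anything. The helper name assemble_command is hypothetical, and the KEY=value form for the two breaking parameters is an assumption modeled on the DATA= and RUN= arguments.

```python
def assemble_command(data, run,
                     min_cluster_size_to_break=5,
                     min_cluster_spread_to_break=50000):
    """Return an argument list for Assemble, using the documented default values."""
    options = (
        ("min_cluster_size_to_break", min_cluster_size_to_break),
        ("min_cluster_spread_to_break", min_cluster_spread_to_break),
    )
    for name, value in options:
        # Both parameters must be positive (non-zero).
        if value <= 0:
            raise ValueError("%s must be positive, got %r" % (name, value))
    command = ["Assemble", "DATA=%s" % data, "RUN=%s" % run]
    # Assumption: optional parameters use the same KEY=value form as DATA and RUN.
    command.extend("%s=%d" % (name, value) for name, value in options)
    return command


print(" ".join(assemble_command("mouse_example", "run")))
```

The resulting list could be passed to subprocess.run with the Arachne binary directory as the working directory, since an assembly must be initiated from there.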
The first of these sub-options is patch_gaps_loops1, which affects how inclusive the algorithm is in selecting reads that might patch a gap. The higher the value of this parameter, the larger this set of reads will be and the better chance you have of collecting a set of reads that will successfully patch a gap. However, using a higher value also increases the possibility of performing an incorrect patch, and the runtime and memory usage of the gap-patching modules will increase. Conversely, by decreasing the value of this parameter, you decrease the chances of successfully patching gaps, but you will reduce runtime and memory usage. The default value of patch_gaps_loops1 is 5, but any non-negative integer is valid.
The second of these sub-options is patch_gaps_max_deviance, which affects how closely a possible patch must correlate to the linking information in that region. If a prospective patch of a gap would stretch the links over that gap too much, the patch is abandoned. The lower this value, the stricter the correspondence must be. The default value of patch_gaps_max_deviance is 4.0, but any non-negative floating point number is valid.
Note that this code may produce assemblies with multiply-placed reads, i.e. reads which are placed into more than one contig, though not twice in the same supercontig. Arachne attempts to resolve as many of these as it can, but in some cases it is not clear which location is better, and both placements are left untouched.
The output of the assembly consists of the following files, found in the RUN directory:
This tab-delimited file has the following fields, one row per contig, ordered by supercontig id and the ordinal number of the contig in the supercontig:
Type | Meaning |
Integer | Id of the supercontig containing this contig |
Integer | Length of the supercontig containing this contig (including estimated gap sizes) |
Integer | Number of contigs in the supercontig containing this contig |
Integer | Ordinal number of this contig in the supercontig |
Integer | Id of this contig |
Integer | Length of this contig |
Integer | Estimated length of gap before this contig (zero if first contig in supercontig) |
Integer | Estimated length of gap after this contig (zero if last contig in supercontig) |
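A file with this layout is straightforward to load for downstream analysis. The following Python sketch parses rows into named tuples; the field names are made up for readability (the file itself has no header described here), and the two sample rows are invented data, not Arachne output.

```python
from collections import namedtuple

# Hypothetical field names, in the column order given in the table above.
ContigRow = namedtuple("ContigRow", [
    "supercontig_id", "supercontig_length", "contigs_in_supercontig",
    "ordinal", "contig_id", "contig_length", "gap_before", "gap_after",
])


def parse_contig_rows(lines):
    """Parse tab-delimited contig rows; every field is an integer."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != len(ContigRow._fields):
            raise ValueError("expected 8 fields, got %d: %r" % (len(fields), line))
        rows.append(ContigRow(*(int(f) for f in fields)))
    return rows


# Invented sample: one supercontig made of two contigs separated by a 500-base gap.
sample = [
    "0\t2500\t2\t0\t0\t1000\t0\t500\n",
    "0\t2500\t2\t1\t1\t1000\t500\t0\n",
]
for row in parse_contig_rows(sample):
    print(row.contig_id, row.contig_length, row.gap_after)
```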
This tab-delimited file has the following fields, one row per placed read, ordered by the id of the contig containing the read and the approximate coordinate of the first base of the trimmed read in the contig:
Type | Meaning |
String | Name of read |
String | Status of read |
Integer | Untrimmed read length |
Integer | Coordinate of first base of trimmed read in untrimmed read (zero-based) |
Integer | Length of trimmed read in untrimmed read |
Integer | Id of contig containing read |
Integer | Length of contig containing read |
Integer | Approximate coordinate of first base of trimmed read in contig (zero-based) |
Integer | Approximate coordinate of last base of trimmed read in contig (zero-based) |
'+' or '-' | Strand (orientation of read on contig) |
String | Name of this read's partner (empty if unpaired) |
String | Status of this read's partner |
Integer | Id of the contig containing this read's partner (empty if unpaired or partner unplaced) |
Integer | Observed insert size (empty if unpaired, partner unplaced, or partner in different supercontig) |
Integer | Given insert size (empty if unpaired) |
Integer | Given insert size standard deviation (empty if unpaired) |
Float | Observed insert size deviation measure (empty if observed insert size is empty) |
The status of the read is a set of characters used to flag conditions of note. Currently that field will either be empty or contain one or more of the following one-letter codes: M, S, and T.
If a read is multiply placed, its status will include M, and no observed insert size or deviation measure will be given for that pairing.
If a read's partner is multiply placed, the partner's status will include M, and no contig will be given for the partner, and no observed insert size or deviation measure will be given for that pairing.
If a read and its partner are on the same supercontig and have the same orientation, the status of both will include S, and no observed insert size or deviation measure will be given for that pairing.
If a read is a transposon, its status will include T, and the observed insert size will be the observed separation of the transposon read and its partner.
Note that the observed insert size may include estimated gap sizes between contigs unless the read and its partner are located in the same contig.
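Because several of these fields may be empty, a parser has to treat empty columns explicitly. The Python sketch below is one way to do that, under some assumptions: the field names are hypothetical, empty numeric fields are mapped to None, text fields are kept as (possibly empty) strings, and the sample row is invented.

```python
from collections import namedtuple

# (hypothetical field name, converter), in the column order of the table above.
FIELDS = [
    ("read_name", str), ("status", str), ("untrimmed_length", int),
    ("trim_start", int), ("trimmed_length", int), ("contig_id", int),
    ("contig_length", int), ("start_on_contig", int), ("stop_on_contig", int),
    ("strand", str), ("partner_name", str), ("partner_status", str),
    ("partner_contig_id", int), ("observed_insert_size", int),
    ("given_insert_size", int), ("given_insert_stdev", int),
    ("insert_deviation", float),
]

ReadPlacement = namedtuple("ReadPlacement", [name for name, _ in FIELDS])


def parse_read_row(line):
    """Parse one tab-delimited row; empty numeric fields become None."""
    raw = line.rstrip("\n").split("\t")
    if len(raw) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(raw)))
    values = []
    for text, (_, convert) in zip(raw, FIELDS):
        if convert is str:
            values.append(text)  # keep text fields, even when empty
        else:
            values.append(convert(text) if text else None)
    return ReadPlacement(*values)


def is_multiply_placed(row):
    """True if the read's status includes the M code."""
    return "M" in row.status


# Invented sample: an unpaired, multiply placed read (all partner fields empty).
sample_row = "\t".join(
    ["readA", "M", "650", "20", "600", "3", "15000", "100", "699", "-"] + [""] * 7
)
row = parse_read_row(sample_row)
print(row.read_name, row.strand, is_multiply_placed(row), row.observed_insert_size)
```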
The observed insert size deviation measure field contains the result of the calculation:
This gives you a signed measure of the observed insert size relative to the given insert size.
This tab-delimited file contains the names of the reads that were not placed in the assembly and why.
Each row contains a read name and a keyword indicating the reason the read was not placed in the assembly. The values of that field can be:
Value | Meaning |
deliberate | excluded by configuration file |
low_quality | nothing left after quality-based trimming |
vector_or_host | matches vector or bacterial host sequence |
mitochondrial | matches mitochondrial sequence |
other_contaminant | matches sequence in DATA/contaminants.fasta |
same_name | had the same name as some other read |
no_metainfo | had no metainformation in the XML files |
chimera | suspected of being chimeric |
unplaced | no problem with read, but not placed in contig |
other | some other reason |
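A quick tally of these keywords gives a useful overview of why reads were dropped. The Python sketch below counts reasons from the two-column rows described above; the function name tally_unplaced is hypothetical and the sample rows are invented.

```python
from collections import Counter

# The keywords listed in the table above.
KNOWN_REASONS = {
    "deliberate", "low_quality", "vector_or_host", "mitochondrial",
    "other_contaminant", "same_name", "no_metainfo", "chimera",
    "unplaced", "other",
}


def tally_unplaced(lines):
    """Count unplaced reads by reason from tab-delimited (name, reason) rows."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, reason = line.split("\t")
        if reason not in KNOWN_REASONS:
            raise ValueError("unknown reason %r for read %r" % (reason, name))
        counts[reason] += 1
    return counts


# Invented sample rows.
sample = ["readA\tlow_quality\n", "readB\tchimera\n", "readC\tlow_quality\n"]
for reason, count in tally_unplaced(sample).most_common():
    print(reason, count)
```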
A typical use of CreateAceFile would be
CreateAceFile DATA=... RUN=... ACEDIR=... AceFile=ace_file_name Type=Index Index='[1-3,5]'
where DATA and RUN are set in the same manner as with Assemble, except that "/work" should be appended to the RUN parameter. This command would produce four .ace files in the ACEDIR directory (where ACEDIR is a subdirectory of DATA): ace_file_name.1, ace_file_name.2, ace_file_name.3, and ace_file_name.5, corresponding to supercontigs 1, 2, 3, and 5.
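The mapping from an Index value to output file names can be illustrated with a short Python sketch. The function names are hypothetical, and the grammar handled here (comma-separated numbers and inclusive ranges inside square brackets) is inferred from the single example above; CreateAceFile may accept other forms.

```python
def expand_index(spec):
    """Expand an Index value such as '[1-3,5]' into a list of supercontig ids."""
    body = spec.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("expected a bracketed list, got %r" % spec)
    ids = []
    for part in body[1:-1].split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-")
            ids.extend(range(int(low), int(high) + 1))  # ranges are inclusive
        else:
            ids.append(int(part))
    return ids


def ace_file_names(ace_file, spec):
    """Return the .ace file names expected for the given Index value."""
    return ["%s.%d" % (ace_file, i) for i in expand_index(spec)]


print(ace_file_names("ace_file_name", "[1-3,5]"))
```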
If "Type=Index" is changed to "Type=All" and "Index=..." is omitted, then ace files will be generated for all supercontigs, in multiple files as in the example. Alternatively, ace files for the n largest supercontigs can be produced by using "Type=Top TopN=n", where n is a positive integer. In all cases, an additional argument of the form "Cutoff=k" will cause the omission of ace files for supercontigs whose constituent contigs are all shorter than k bases.
If ONE_FILE=True is used, CreateAceFile will place all the contigs in the assembly in one ace file.
We would like to hear from you! You may contact us at wga@broad.mit.edu.
If you experience difficulty while running Arachne, please send us a description of the problem encountered along with the assembly.log file from the relevant RUN directory. You may find it helpful to look at the list of Frequently Asked Questions.