This manual and the software it describes are copyright © 2001-2004 The Broad Institute/Massachusetts Institute of Technology.

This product includes software developed by the Apache Software Foundation (http://www.apache.org).

This product includes software developed by the University of California, Berkeley and its contributors.




Arachne 2.0 Manual

Overview

Arachne is a tool for assembling genome sequence from whole-genome shotgun reads, mostly in forward-reverse pairs obtained by sequencing clone ends.

As input, Arachne requires the base calls and associated quality scores of each read (as produced by most base-calling software, such as PHRED), as well as ancillary information about each read (in a standard format described herein).

As output, Arachne produces a list of supercontigs ("scaffolds") -- each of which consists of an ordered list of contigs, all forward-oriented, and estimates for the gaps between them within the supercontig. Base calls and quality scores are provided for each contig, along with the approximate locations of the reads used to build it. Arachne also produces a summary and brief analysis of the assembly.

Many of Arachne's algorithms are described in "ARACHNE: A Whole-Genome Shotgun Assembler", Genome Research, January 2002, and "Whole-Genome Sequence Assembly for Mammalian Genomes: ARACHNE 2", Genome Research, January 2003.

About your data

We explain here the assumptions Arachne makes about your sequence reads.

These reads come from the entire genome of an organism or a cloned fragment thereof (but not both simultaneously), which we call the target. The target should be from a single haplotype: Arachne does not support the assembly of polymorphic data at this time.

Regardless of the target, Arachne assumes most sequence reads have been obtained randomly from it, either as single reads ("unpaired production reads") or as clone-end pairs ("paired production reads"). The bulk of reads provided to Arachne should be pairs from the latter category, because this pairing information is needed to assemble correctly.

To use Arachne, make a sensible division of your sequence reads into libraries (to use the term loosely), preferably placing similar reads together. Thus, there may be several libraries of paired production reads, each having different insert length statistics (mean and standard deviation, which must be the same for all reads in a single library). These library statistics must be provided to Arachne via the ancillary data files or the configuration file, as will be described later.

Arachne does have limited support for transposon-based read pairs, but does not handle other types of finishing reads at this time. If such reads are presented to the program as unpaired production reads, the performance may be acceptable, but Arachne will treat the reads as though they were obtained randomly from the genome.

Installation

Arachne is available as compiled binaries for a single platform: Compaq Alpha hardware, running Tru64 Unix, operating system version 5.1. This exact platform is required.

The Arachne source code, while unsupported, is available at ftp://ftp.broad.mit.edu/pub/wga/Arachne/Arachne_src.tar.gz.

Here is the procedure for installing Arachne from the compiled binaries:

Step 1: Install prerequisite software

Step 2: Create the Arachne binary directory

Pick a location on your system for the Arachne binaries, then get and unpack ftp://ftp.broad.mit.edu/pub/wga/Arachne/Arachne_bin.tar.gz into that location.

Step 3: Create the main Arachne data directory

Pick a location on your system where your data will go, then get and unpack ftp://ftp.broad.mit.edu/pub/wga/Arachne/Arachne_data.tar.gz into that location. All users of Arachne must set the environment variable ARACHNE_PRE to the full path of Arachne_data.
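As a quick sanity check of Step 3, the following Python sketch tests whether ARACHNE_PRE is set to a full path naming an existing directory. The helper name arachne_pre_problem is hypothetical and is not part of Arachne; only the variable name ARACHNE_PRE comes from this manual.

```python
import os


def arachne_pre_problem(environ=None):
    """Return None if ARACHNE_PRE looks usable, else a short description.

    Hypothetical helper, not part of Arachne.
    """
    environ = os.environ if environ is None else environ
    path = environ.get("ARACHNE_PRE")
    if not path:
        return "ARACHNE_PRE is not set"
    if not os.path.isabs(path):
        return "ARACHNE_PRE is not a full path"
    if not os.path.isdir(path):
        return "ARACHNE_PRE does not name an existing directory"
    return None


if __name__ == "__main__":
    print(arachne_pre_problem() or "ARACHNE_PRE looks fine")
```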

Step 4: Supplement the vector file

In the main Arachne data directory, the file vector/contigs.fasta includes a broad selection of vector sequences. Be sure to add your sequencing vectors to this file if they are not already in it.

Step 5: Test Arachne on a small mouse project

Go to the Arachne binary directory and type

Assemble DATA=mouse_example RUN=run

Wait until it finishes. A report about the project should be produced as mouse_example/run/assembly.ps, inside the main Arachne data directory. View this file to confirm that the assembly yielded one supercontig, consisting of 52 contigs, one of which is misassembled. Verify that mouse_example/run/assembly.log ends with a message regarding normal termination.

This does not guarantee that Arachne is installed correctly, but it should reveal any major problems.

Note. If typing Assemble produces a "command not found" error, you probably need to add "." to your Unix path. You can see your path by typing "set | grep path".

Note. If typing Assemble produces an error message from the loader, you may not have version 5.1 of the operating system. Type "uname -r" to find out.

Preparing your data for assembly

Data and run directories

For each sequencing project to be assembled, create a subdirectory (referred to hereafter as DATA) of the main Arachne data directory, that contains the source data for the project. Inside DATA, create a subdirectory for each particular assembly (referred to as RUN), into which assembly output files are to be placed. We use DATA and RUN relative to their parents. For example, if the main Arachne data directory is /seq/Arachne_stuff, and DATA=sequoia and RUN=May1, then DATA really refers to /seq/Arachne_stuff/sequoia, and RUN really refers to /seq/Arachne_stuff/sequoia/May1.

Notes: DATA is simply a partial path to a subdirectory of the main Arachne data directory, so nesting is allowed. For example, if we instead specified DATA=projects/yeast in the example above, it would refer to /seq/Arachne_stuff/projects/yeast. Also, soft linking is allowed and is probably necessary for very large sequencing projects. Finally, Arachne creates the RUN directory automatically if it does not already exist.
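The way DATA and RUN resolve to full paths can be sketched in a few lines of Python. This is only an illustration of the convention described above; the function name resolve_dirs is hypothetical, and Arachne performs this resolution itself.

```python
import posixpath


def resolve_dirs(arachne_pre, data, run):
    """Return (full DATA path, full RUN path).

    arachne_pre is the main Arachne data directory; data is relative to
    it, and run is relative to data.
    """
    data_dir = posixpath.join(arachne_pre, data)
    run_dir = posixpath.join(data_dir, run)
    return data_dir, run_dir


# The example from the text, then a nested DATA value.
print(resolve_dirs("/seq/Arachne_stuff", "sequoia", "May1"))
print(resolve_dirs("/seq/Arachne_stuff", "projects/yeast", "May1"))
```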

Data files

The data files provided to Arachne as input reside in the DATA directory:

XML ancillary files

The files DATA/traceinfo/*xml* contain ancillary data about the reads in the Trace Archive XML format (http://www.ncbi.nlm.nih.gov/Traces/TraceArchiveRFC.html). As described in the next section, this ancillary data may be modified and supplemented with the aid of the configuration file.

We use only a subset of the fields specified in the Trace Archive XML definition:

In addition, we have a field, type, that is not part of the Trace Archive Format and therefore must be set using the configuration file (see above for a brief description).

Important: Every read must appear in exactly three places in the Arachne input files: in a read sequence file, in a read quality score file, and in an XML ancillary file. Read identities are defined by read names, and read names are determined as follows. For read sequence and read quality score files, we take the rightmost white-space-free string on a ">" line. For example, ">gnl|ti|3 G10P69425RH3.T0" would yield the read name "G10P69425RH3.T0". For XML ancillary files, read names are defined by the TRACE_NAME field.
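The read-name rule above can be checked before an assembly with a short Python sketch. It is not part of Arachne, and all function names are hypothetical. It assumes the ancillary XML stores the name in an element called trace_name (compared case-insensitively), and it reports only names missing from one of the three inputs; it does not detect a name that appears twice in the same input.

```python
import xml.etree.ElementTree as ET


def read_name_from_header(line):
    """Read name = rightmost whitespace-free string on a '>' line."""
    if not line.startswith(">"):
        raise ValueError("not a '>' header line: %r" % line)
    fields = line[1:].split()
    if not fields:
        raise ValueError("empty header line")
    return fields[-1]


def names_in_fasta_like(lines):
    """Collect read names from the '>' lines of a sequence or quality file."""
    return {read_name_from_header(l) for l in lines if l.startswith(">")}


def names_in_xml(xml_text):
    """Collect TRACE_NAME values; ignoring tag case is an assumption."""
    root = ET.fromstring(xml_text)
    return {el.text.strip() for el in root.iter()
            if el.tag.lower() == "trace_name" and el.text}


def inconsistent_reads(seq_lines, qual_lines, xml_text):
    """Return names that are missing from at least one of the three inputs."""
    seq = names_in_fasta_like(seq_lines)
    qual = names_in_fasta_like(qual_lines)
    anc = names_in_xml(xml_text)
    return (seq | qual | anc) - (seq & qual & anc)


print(read_name_from_header(">gnl|ti|3 G10P69425RH3.T0"))
```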

Configuration file

The configuration file (reads_config.xml) allows you to correct and augment the information presented to Arachne via the XML files.

For example, if any of the required fields are missing from your XML ancillary files, you do not need to modify the XML file itself before running Arachne. You can simply write a configuration file that will provide the missing information to Arachne.

Also, the configuration file allows for an easy way to set parameters that are common to a group of reads. For instance, below we demonstrate how to set insert size and insert size standard deviation for a particular library.

Note that you must use the configuration file to set the type field. If you try to include the type field in the XML file, Arachne will fail because the XML file will not conform to the Trace Archive Format specification (see above).

Here we give a somewhat informal explanation of how the configuration files are constructed. However, as an XML file, the configuration file has a formal "document type" definition, which can be found in the file dtds/configuration.dtd (in the main Arachne data directory).

Begin the configuration file with the following text:

<?xml version="1.0"?> 
<!DOCTYPE configuration SYSTEM "configuration.dtd"> 
<configuration> 

and end it with:

</configuration> 

The types of constructs that can be put in between are comments, macros, and most importantly, rules.

Comments are in the standard XML format, for example:

	<!-- ******** Some contaminated reads, to be tossed ******** --> 

Macros facilitate abbreviation, pointless or otherwise, for example:

	<macro name="gh">gringlehopper</macro> 
would change every subsequent occurrence of the string $gh to the string gringlehopper. Any text could have been used in place of gringlehopper.

Rules require more explanation, because they have nontrivial syntax. For example,

      <rule>
         <name> exclude probable human reads </name>
         <match>
            <match_field>plate_id</match_field>
            <literal>G10P6007</literal>
         </match>
         <match>
            <match_field>plate_id</match_field>
            <regex>^G10P613[01]$</regex>
         </match>
         <action><remove /></action>
      </rule>

would cause all reads having plate_id G10P6007, G10P6130, or G10P6131 to be ignored by Arachne.

More generally, a rule is defined by three fields: <name>, <match>, and <action>, as in the example above.

A given rule can have more than one <match> tag. If there is more than one, then the rule is applied to reads that match any of those <match> tags.

Each <match> tag contains the names of the fields it checks (in <match_field> tags) and the values it expects in those fields (in <literal> or <regex> tags). If all of the specified fields match the expected values, the <match> is made and the rule's action is applied.

A <match> tag requires one or more <match_field><literal> or <match_field><regex> pairs.
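The matching logic just described (all conditions inside one <match> must hold, and any one <match> is enough) can be sketched in Python. This is an illustration only, not Arachne's implementation. It assumes that a <literal> requires exact string equality and that a <regex> may match anywhere in the value unless anchored, which is consistent with the ^ and $ anchors used in the examples in this manual. The function names are hypothetical.

```python
import re


def condition_holds(read, field, kind, pattern):
    """Test one <match_field> paired with a <literal> or <regex>."""
    value = read.get(field, "")
    if kind == "literal":
        return value == pattern  # assumption: exact string equality
    if kind == "regex":
        return re.search(pattern, value) is not None  # assumption: unanchored
    raise ValueError("unknown condition kind: %r" % kind)


def rule_applies(read, matches):
    """matches: list of <match> tags, each a list of (field, kind, pattern).

    A <match> is made when all of its conditions hold; the rule applies
    when any one <match> is made.
    """
    return any(all(condition_holds(read, *cond) for cond in match)
               for match in matches)


# The "exclude probable human reads" rule from the example above.
exclude_human = [
    [("plate_id", "literal", "G10P6007")],
    [("plate_id", "regex", r"^G10P613[01]$")],
]

for plate in ["G10P6007", "G10P6130", "G10P6131", "G10P6132"]:
    print(plate, rule_applies({"plate_id": plate}, exclude_human))
```

The first three plate ids satisfy the rule; G10P6132 does not, because the character class [01] admits only a final 0 or 1.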

The <action> tag requires one of the following sub-tags. Only one type of sub-tag is allowed for a single <action> tag, though multiple <set> tags are allowed in a single <action> tag.

Values of other fields associated with the matching read may be referred to in <value> tags by prepending the name of the field with an "@". Also, integer arithmetic evaluation will occur when setting numeric fields such as insert_size and insert_stdev. For example,

      <set>
         <set_field> insert_stdev </set_field>
         <value>@insert_size/10</value>
      </set>

will set insert_stdev to 10% of insert_size.
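To make the @ substitution and integer arithmetic concrete, here is a small Python sketch of one way such a <value> could be evaluated. Arachne's own evaluator is not documented here, so the details are assumptions: the sketch supports only +, -, * and /, and it treats / as integer division that discards the remainder. The function names are hypothetical.

```python
import ast
import operator
import re

# Assumption: "/" means integer division, since the manual speaks of
# integer arithmetic.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.floordiv,
}


def _eval(node):
    """Evaluate a parsed expression made of integers and + - * / only."""
    if isinstance(node, ast.Expression):
        return _eval(node.body)
    if (isinstance(node, ast.Constant) and isinstance(node.value, int)
            and not isinstance(node.value, bool)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")


def expand_value(value, read):
    """Replace each @field with the read's value, then do the arithmetic."""
    substituted = re.sub(r"@(\w+)", lambda m: str(read[m.group(1)]), value)
    return _eval(ast.parse(substituted.strip(), mode="eval"))


print(expand_value("@insert_size/10", {"insert_size": 4000}))
```

Under these assumptions a read with insert_size 4000 gets insert_stdev 400, and a read with insert_size 3505 would get 350 rather than 350.5.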

Here is another example of a rule that sets the insert statistics for all reads whose names begin with G20:

      <rule> 
         <name> set insert stats for G20 reads </name> 
         <match>
            <match_field>trace_name</match_field>
            <regex>^G20</regex>
         </match> 
         <action> 
            <set>
               <set_field>insert_size</set_field>
               <value>4000</value>
            </set> 
            <set>
               <set_field>insert_stdev</set_field>
               <value>400</value>
            </set> 
         </action> 
      </rule> 
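When a project has several libraries, rules of this shape can be generated rather than typed by hand. The Python sketch below builds one such rule per library and wraps them in the configuration header shown earlier. It is a convenience script, not part of Arachne; the function names are hypothetical, and the prefix is assumed to contain no regular-expression metacharacters.

```python
import xml.etree.ElementTree as ET


def insert_stats_rule(prefix, insert_size, insert_stdev):
    """Build a <rule> setting insert statistics for reads named prefix*."""
    rule = ET.Element("rule")
    ET.SubElement(rule, "name").text = "set insert stats for %s reads" % prefix
    match = ET.SubElement(rule, "match")
    ET.SubElement(match, "match_field").text = "trace_name"
    ET.SubElement(match, "regex").text = "^" + prefix
    action = ET.SubElement(rule, "action")
    for field, value in (("insert_size", insert_size),
                         ("insert_stdev", insert_stdev)):
        setter = ET.SubElement(action, "set")
        ET.SubElement(setter, "set_field").text = field
        ET.SubElement(setter, "value").text = str(value)
    return rule


def build_configuration(libraries):
    """libraries maps a read-name prefix to (insert_size, insert_stdev)."""
    header = ('<?xml version="1.0"?>\n'
              '<!DOCTYPE configuration SYSTEM "configuration.dtd">\n')
    root = ET.Element("configuration")
    for prefix, (size, stdev) in libraries.items():
        root.append(insert_stats_rule(prefix, size, stdev))
    return header + ET.tostring(root, encoding="unicode") + "\n"


# G20 with 4000/400 is the library from the example above.
print(build_configuration({"G20": (4000, 400)}))
```

The output is written on one line without indentation; it is equivalent XML, though less readable than the hand-written example.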

Finally, we give an example that shows how to designate every read as being a paired production read:

      <rule> 
         <name> all reads are paired production reads </name> 
         <match>
            <match_field>trace_name</match_field>
            <regex>.</regex>
         </match> 
         <action> 
            <set>
               <set_field>type</set_field>
               <value>paired_production</value>
            </set> 
         </action> 
      </rule> 

Rules are applied in the order in which they appear in the configuration file. Interactions between the rules are possible, and consequently, the order in which the rules appear may matter.

Additionally, one may provide an exclusion file, a list of read names to be excluded from the assembly. This file should be named "reads.to_exclude" and should be located in the DATA directory. The reads in this file will be excluded prior to the application of any rules.

Running Arachne

An Arachne assembly must be initiated from the Arachne binary directory, by invoking the main Arachne executable, Assemble, as follows:

Assemble DATA=the_project_directory RUN=the_results_directory

where "the_project_directory" is the name of the data directory and "the_results_directory" is the name of the run directory.

Note. Before running Assemble on your own data, be sure to test your installation by running it on mouse_example, as per the instructions given earlier.

Note. Some experimentation is needed to determine how much memory and disk space are needed for any given Arachne assembly. Running out of either will have unpredictable consequences.

Note. Simultaneous Assemble processes can share the same data directory, but not the same run directory.
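For scripted runs, the invocation above can be wrapped in a few lines of Python. This is a sketch, not a supported interface: the function names are hypothetical, and the only arguments used are DATA and RUN as described above, plus ACE=True, which is described under "Generating ace files". The program is started with the Arachne binary directory as its working directory, as this section requires.

```python
import subprocess


def assemble_command(data, run, ace=False):
    """Return the argument list for an Assemble invocation."""
    argv = ["./Assemble", "DATA=%s" % data, "RUN=%s" % run]
    if ace:
        argv.append("ACE=True")
    return argv


def run_assemble(bin_dir, data, run, ace=False):
    """Run Assemble from the Arachne binary directory; return its exit code."""
    return subprocess.call(assemble_command(data, run, ace), cwd=bin_dir)


print(" ".join(assemble_command("mouse_example", "run")))
```

Because simultaneous Assemble processes must not share a run directory, a wrapper like this should be given a distinct RUN value for each concurrent job.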

Assemble accepts a number of additional command-line arguments, all optional:

Output

The output of the assembly consists of the following files, found in the RUN directory:

The RUN directory also contains a subdirectory "work", in which large numbers of internal assembly files reside.

Generating ace files

Ace files are the main input files for Consed, a tool for viewing assemblies that graphically shows the aligned reads on a contig-by-contig basis. To get ace files from an Arachne assembly, either specify ACE=True on the Assemble command line or invoke the tool CreateAceFile manually from the Arachne binary directory. Either approach provides enough data to run Consed, although Consed offers more functionality if the .scf and .phd files that PHRED produces are available. The ace files created by Arachne have been tested only with Consed releases 7.52 and 12.

A typical use of CreateAceFile would be

  CreateAceFile  DATA=... RUN=... ACEDIR=... AceFile=ace_file_name  Type=Index  Index='[1-3,5]'

where DATA and RUN are set in the same manner as with Assemble, except that "/work" should be appended to the RUN parameter. This command would produce four .ace files in the ACEDIR directory (where ACEDIR is a subdirectory of DATA): ace_file_name.1, ace_file_name.2, ace_file_name.3, and ace_file_name.5, corresponding to supercontigs 1, 2, 3, and 5.
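The mapping from an Index value to output file names can be sketched in Python as follows. This mirrors only the example above; the full syntax that CreateAceFile accepts for Index is not specified here, so the parser handles just comma-separated integers and low-high ranges inside brackets. The function names are hypothetical.

```python
def expand_index(index):
    """Expand an Index value such as '[1-3,5]' into a list of integers."""
    body = index.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("expected a bracketed list: %r" % index)
    ids = []
    for part in body[1:-1].split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-")
            ids.extend(range(int(low), int(high) + 1))
        else:
            ids.append(int(part))
    return ids


def ace_file_names(ace_file_name, index):
    """Names of the ace files produced, one per selected supercontig."""
    return ["%s.%d" % (ace_file_name, i) for i in expand_index(index)]


print(ace_file_names("ace_file_name", "[1-3,5]"))
```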

If "Type=Index" is changed to "Type=All" and "Index=..." is omitted, then ace files will be generated for all supercontigs, in multiple files as in the example. Alternatively, ace files for the n largest supercontigs can be produced by using "Type=Top TopN=n", where n is a positive integer. In all cases, an additional argument of the form "Cutoff=k" will cause the omission of ace files for supercontigs whose constituent contigs are all shorter than k bases.
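The two selection rules can be illustrated with a short Python sketch. It is not CreateAceFile's code. It assumes that the size of a supercontig is the sum of its contig lengths, which this manual does not state, and it does not model how Cutoff and TopN interact. The function names and the contig lengths in the example are hypothetical.

```python
def passes_cutoff(contig_lengths, k):
    """False when every contig in the supercontig is shorter than k bases."""
    return any(length >= k for length in contig_lengths)


def top_n(supercontigs, n):
    """Ids of the n largest supercontigs.

    supercontigs maps an id to the list of its contig lengths. Size is
    taken to be the sum of contig lengths, which is an assumption.
    """
    ranked = sorted(supercontigs, key=lambda s: sum(supercontigs[s]),
                    reverse=True)
    return ranked[:n]


# Hypothetical contig lengths, for illustration only.
example = {1: [500, 800], 2: [12000, 300], 3: [4000]}
print(top_n(example, 2))
print([s for s in example if passes_cutoff(example[s], 1000)])
```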

If ONE_FILE=True is used, CreateAceFile will place all the contigs in the assembly in one ace file.

Contacting us

We would like to hear from you! You may contact us at wga@broad.mit.edu.

If you experience difficulty while running Arachne, please send us a description of the problem encountered along with the assembly.log file from the relevant RUN directory. You may find it helpful to look at the list of Frequently Asked Questions.