Carolyn F. Allex (1999).
Computational Methods for Fast and Accurate DNA Fragment Assembly.
PhD thesis, Department of Computer Sciences, University of Wisconsin-Madison.
(Also appears as UW Technical Report CS-TR-99-1406)
This publication is available in PDF and available in postscript.
Abstract:
As advances in technology result in the production of increasing amounts of DNA sequencing data in decreasing amounts of time, developing computational methods that allow data analysis to keep pace is imperative. In this dissertation, I present methods that improve the speed and accuracy of DNA fragment assembly. One critical characteristic of automatic methods for fragment assembly is that they must be accurate. Currently, to ensure accurate sequences, the data that underlies questionable base calls must be examined by human editors so that the correct base call can be determined. This manual process is both error-prone and time-consuming. Automatic methods that yield high accuracy and few questionable calls can reduce errors and lessen the need for manual inspections. In my work, I developed a method, Trace-Evidence, that automatically produces highly accurate consensus sequences, even with few aligned sequences. Most assembly programs analyze only base calls when determining a consensus sequence. The key to the high accuracy is that I incorporate morphological information about the underlying ABI trace data. This is accomplished through a new representation of traces, Trace-Class, that characterizes the height and shape of traces. The new representation not only yields high accuracy when used in consensus-calling methods, but also produces improved results when used in removing poor-quality data, and when used as inputs for neural networks for consensus determination. The need for fast processing is becoming more important as the size of sequencing projects increases. Almost all existing fragment assembly programs perform pairwise comparisons of reads, resulting in execution times proportional to n2, where n is the number of reads. I describe a new algorithm for fragment layout, SLIC, that runs in time proportional to n. SLIC relies on subsequences of bases that occur in overlapping regions of fragment reads. Subsequences that are common to two or more fragment reads are aligned to determine the overall layout of reads. The work I present provides improvements to currently available computational methods for DNA sequencing that can serve as a foundation for further study in developing better solutions to problems in fragment assembly.
Return to the publications of the Univ. of Wisconsin Machine Learning Research Group.
Computer Sciences Department
College of Letters and Science
University of Wisconsin - Madison
INFORMATION
~ PEOPLE
~ GRADS
~ UNDERGRADS
~ RESEARCH
~ RESOURCES
5355a Computer Sciences and Statistics ~ 1210 West Dayton Street, Madison,
WI 53706
cs@cs.wisc.edu ~ voice: 608-262-1204 ~
fax: 608-262-9777