Research Interests

My research interests center on using machine learning and probabilistic inference to solve challenging problems, with an emphasis on tasks in molecular biology. These problems, ranging from interpretation of medical images to modeling gene expression profiles as they change over time, are well suited to probabilistic inference because they involve both (a) disambiguating many sources of noisy observations and (b) incorporating often large sets of background domain knowledge. In particular, I have focused on determining a protein's three-dimensional structure, which provides both key insights into function and targets for drug design. Many proteins and protein complexes of biomedical importance elude traditional structure determination attempts; however, for many of these proteins it is possible to collect sparse experimental data. My work has shown that machine learning methods -- in particular, approximate probabilistic inference -- can significantly help in structure determination from sparse experimental data. By developing novel inference methods that make use of the wealth of background knowledge about protein structures available from biology and chemistry, I aim to further improve structure determination from sparse experimental data.
Research Projects

Increasing the radius of convergence of molecular replacement by density- and energy-guided optimization

The crystallographic phase problem refers to the fact that when X-ray diffraction data are collected, additional information -- the "phases" -- is needed to construct a map of the protein's electron density. Molecular replacement (MR) is a method in which a previously solved protein structure (the "template") is used to fill in this missing experimental information for a target protein. The method generally works when template and target are reasonably similar; when they share less than 30% sequence identity, however, molecular replacement often fails. I show that the crystallographic phase problem can be solved using distant evolutionary relationships by combining algorithms for protein structure modelling with those developed for crystallographic structure determination. Integrating Rosetta structure modelling with Autobuild chain tracing yielded high-resolution structures for 8 of 13 X-ray diffraction data sets that could not be solved in the laboratories of expert crystallographers and that remained unsolved after an extensive array of alternative approaches had been applied. The method shows a 50% success rate in cases where templates with 16-30% sequence identity and 70%+ coverage are available. Source code has been included in the Rosetta and Phenix software packages.
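The sequence identity and coverage figures quoted above can be made concrete with a small sketch. It is an illustration only and not part of the Rosetta or Phenix code: the function names, the example alignment, and the exact definitions are assumptions (identity is taken here as matches over aligned, non-gap columns, and coverage as the fraction of target residues aligned to a template residue; other conventions exist).

```python
def identity_and_coverage(aligned_target, aligned_template, gap="-"):
    """Return (percent identity, percent coverage) for a pairwise alignment.

    Assumed definitions (hypothetical, for illustration):
      identity = matching columns / columns where neither sequence has a gap
      coverage = columns where neither sequence has a gap / target length
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")

    # Number of real residues in the target (gaps excluded).
    target_len = sum(1 for c in aligned_target if c != gap)

    # Columns in which both sequences contribute a residue.
    pairs = [(a, b) for a, b in zip(aligned_target, aligned_template)
             if a != gap and b != gap]
    if not pairs or target_len == 0:
        return 0.0, 0.0

    matches = sum(1 for a, b in pairs if a == b)
    identity = 100.0 * matches / len(pairs)
    coverage = 100.0 * len(pairs) / target_len
    return identity, coverage


def within_reported_range(identity, coverage):
    """True if a template falls in the 16-30% identity, 70%+ coverage range."""
    return 16.0 <= identity <= 30.0 and coverage >= 70.0


# Hypothetical toy alignment: 9 aligned columns, 6 of them identical.
ident, cov = identity_and_coverage("MKTAYIAKQR", "MKSA-LAKHR")
print(f"identity = {ident:.1f}%, coverage = {cov:.1f}%")
print("in reported range:", within_reported_range(ident, cov))
```

The toy template is far more similar to its target than the difficult cases described above, so it falls outside the 16-30% range.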

Inferring protein backbones with Markov field models

I have been investigating automatically tracing protein structures de novo into electron density maps. Analogous to a three-dimensional picture of a protein, the electron density map is produced as the final result of X-ray crystallographic experiments. Tracing protein backbones manually in these complex 3D images is time-consuming and labor-intensive. My work employs probabilistic inference to determine the most likely trace given an electron density map. I model the protein using a pairwise Markov random field (MRF), which defines the joint probability of a set of random variables on an undirected graph. In this case, graph vertices represent amino acids in the protein, with associated random variables describing the 3D location and orientation of each amino acid. Associated with each amino acid is a probability of finding that amino acid in a particular location given the map. Determining the most likely backbone trace given an electron density map amounts to inferring the marginal distribution of each amino acid's position. Belief propagation (BP) is a technique that performs approximate inference in loopy graphs; however, it does not scale to proteins, which may contain thousands of residues. To make BP tractable in these types of graphs, I have developed AggBP, which approximates a subset of the outgoing messages at a node with a single message, making the method computationally feasible for large proteins. On a variety of maps, my method produces a more accurate backbone trace than two other commonly used methods. The algorithms developed have been included in the software suite ACMI.
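To illustrate what inferring per-residue marginals by message passing looks like, here is a minimal sketch of sum-product belief propagation on a toy pairwise MRF. It is not AggBP or the ACMI implementation. It makes several simplifying assumptions: the graph is a short chain (where BP is exact, unlike the loopy graphs described above), each node has a handful of discrete states standing in for candidate positions, and the potentials are random numbers. All function names are hypothetical.

```python
import itertools
import numpy as np


def chain_bp_marginals(unary, pairwise):
    """Sum-product BP on a chain-structured pairwise MRF.

    unary[i][s]       : potential for node i being in state s
    pairwise[i][s, t] : potential linking node i (state s) to node i+1 (state t)
    Returns a list of normalized marginal distributions, one per node.
    """
    unary = [np.asarray(u, dtype=float) for u in unary]
    n = len(unary)
    fwd = [np.ones_like(u) for u in unary]  # messages arriving from the left
    bwd = [np.ones_like(u) for u in unary]  # messages arriving from the right

    for i in range(1, n):
        msg = (unary[i - 1] * fwd[i - 1]) @ pairwise[i - 1]
        fwd[i] = msg / msg.sum()  # normalize for numerical stability
    for i in range(n - 2, -1, -1):
        msg = pairwise[i] @ (unary[i + 1] * bwd[i + 1])
        bwd[i] = msg / msg.sum()

    beliefs = [unary[i] * fwd[i] * bwd[i] for i in range(n)]
    return [b / b.sum() for b in beliefs]


def brute_force_marginals(unary, pairwise):
    """Exact marginals by enumerating every joint state (tiny models only)."""
    n = len(unary)
    sizes = [len(u) for u in unary]
    marg = [np.zeros(k) for k in sizes]
    for states in itertools.product(*(range(k) for k in sizes)):
        w = 1.0
        for i, s in enumerate(states):
            w *= unary[i][s]
        for i in range(n - 1):
            w *= pairwise[i][states[i], states[i + 1]]
        for i, s in enumerate(states):
            marg[i][s] += w
    return [m / m.sum() for m in marg]


# Toy model: 4 "residues", 3 candidate "positions" each, random potentials.
rng = np.random.default_rng(0)
unary = list(rng.random((4, 3)) + 0.1)
pairwise = [rng.random((3, 3)) + 0.1 for _ in range(3)]

bp = chain_bp_marginals(unary, pairwise)
exact = brute_force_marginals(unary, pairwise)
print(all(np.allclose(a, b) for a, b in zip(bp, exact)))
```

The brute-force check enumerates every joint state, which grows exponentially with the number of nodes. Message passing avoids that cost, and on graphs with loops it becomes an approximation.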

Determining protein structures from sparse experimental data

The structure of many biomedically important proteins eludes traditional structure determination methods. However, for many of these proteins it may be possible to collect sparse experimental data. While these sources of data may not be enough to uniquely determine the structure of a protein, they do contain enough information to guide conformational search in structure prediction methods. By providing sparse experimental information to structure prediction methods, I am able to generate significantly better models using less conformational sampling. In particular, I showed that my method can infer high-resolution structural details from cryo-electron microscopy (cryoEM) data, a technique in which individual images of tens of thousands of protein molecules are reconstructed into a three-dimensional "envelope". I have also shown that other sources of weak data -- including those from NMR experiments and small-angle X-ray scattering -- may be used in a similar manner.
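The idea that weak data can guide a search without determining the answer on its own can be sketched with a one-dimensional toy problem. Everything here is an assumption made for illustration and is not the method used in this work: the "conformation" is a single number, the model energy is a double well with equally good minima at -1 and +1, the "native" state is taken to be +1, and the sparse data enter as a harmonic restraint toward a noisy observation at 0.8 with a hypothetical weight of 1.0.

```python
import numpy as np


def model_energy_grad(x):
    # Gradient of the toy double-well energy E(x) = (x**2 - 1)**2.
    return 4.0 * x * (x**2 - 1.0)


def restraint_grad(x, x_obs, weight):
    # Gradient of the harmonic data term weight * (x - x_obs)**2.
    return 2.0 * weight * (x - x_obs)


def minimize(starts, x_obs=None, weight=0.0, lr=0.01, steps=5000):
    """Plain gradient descent from many starting points at once."""
    x = np.array(starts, dtype=float)
    for _ in range(steps):
        grad = model_energy_grad(x)
        if x_obs is not None:
            grad = grad + restraint_grad(x, x_obs, weight)
        x -= lr * grad
    return x


starts = np.linspace(-2.0, 2.0, 20)  # 20 evenly spaced starting points
native = 1.0                         # assumed "correct" answer in this toy

free = minimize(starts)                            # model energy only
guided = minimize(starts, x_obs=0.8, weight=1.0)   # model energy + data term

hits_free = int(np.sum(np.abs(free - native) < 0.2))
hits_guided = int(np.sum(np.abs(guided - native) < 0.2))
print(f"runs ending near the native state: {hits_free} unguided, {hits_guided} guided")
```

In this toy, the observation alone would place the answer at 0.8 and the energy alone cannot distinguish the two wells, but their sum has a single minimum close to the native state, so every run reaches it.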

Selected publications