Before receiving my Ph.D. in Summer 2010, I worked with Professor Jerry Zhu on several research projects in statistical machine learning and natural language processing. Related publications can be found here.
My primary line of research, and the topic of my dissertation, is semi-supervised learning (SSL)---learning from both labeled and unlabeled data to build better predictors than labeled data alone can provide.
One of several open problems I hope to one day address is how to perform "safe SSL": a practitioner should be able to apply SSL to a new domain or task, without expert knowledge of which SSL model's assumptions are most appropriate, and be guaranteed to do no worse than supervised learning. Our 2009 NAACL SSL-NLP workshop paper explores using cross-validation for SSL model selection, and we are currently researching theoretically justified solutions to this open problem.

One step toward "safe SSL" is developing more robust methods that can handle complex data sets. For example, current graph-based SSL methods may have difficulty learning from data supported on multiple intersecting manifolds. Our AISTATS 2009 paper proposes a novel graph-construction method based on local geometry that can detect changes in dimensionality, orientation, and density. We also describe a cluster-then-label technique using constrained spectral clustering on this graph. On a related note, I have explored an approach that performs automatic graph selection while remaining robust to the absence of any true underlying manifold structure.
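To make the graph-based setting concrete, here is a minimal sketch of label propagation on a k-nearest-neighbor graph---a standard baseline in this family, not the specific method from the papers above. The function names and toy data are my own illustration:

```python
import numpy as np

def knn_graph(X, k=3):
    """Symmetrized k-nearest-neighbor graph with RBF edge weights."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    n = len(X)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]  # skip self at position 0
        W[i, nbrs] = np.exp(-d2[i, nbrs])
    return np.maximum(W, W.T)              # symmetrize

def label_propagation(W, y, n_iter=200):
    """Propagate binary labels over the graph, clamping labeled points.
    y holds class 0/1 for labeled points and -1 for unlabeled points."""
    n = len(y)
    idx = np.flatnonzero(y >= 0)           # indices of labeled points
    F = np.zeros((n, 2))
    F[idx, y[idx]] = 1.0
    P = W / W.sum(axis=1, keepdims=True)   # row-normalized transition matrix
    for _ in range(n_iter):
        F = P @ F                          # diffuse label mass along edges
        F[idx] = 0.0
        F[idx, y[idx]] = 1.0               # re-clamp the labeled points
    return F.argmax(axis=1)
```

With two well-separated clusters and one labeled point in each, the unlabeled points inherit the label of their own cluster, which is exactly the manifold/cluster assumption that can fail on the intersecting-manifold data discussed above.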
My dissertation covers the above research, as well as several other novel directions in SSL explored in the last few years. We have explored online SSL with kernels (ECML 2008), binary and multi-class classification in the presence of dissimilarity constraints (AISTATS 2007), and semi-supervised regression using order preferences (AAAI 2007). We have also applied graph-based SSL to sentiment analysis (NAACL 2006 TextGraphs workshop).
I am also actively involved in a project called Text-to-Picture Synthesis, whose goal is to automatically produce a set of images that captures the main meaning of an arbitrary piece of text. Our current approach combines several natural language processing and computer vision techniques to select semantic entities in the text that should appear in the picture and to find appropriate images to represent each of these concepts. See this page for more details.
Non-topical Text Classification
Another area of interest is non-topical text classification. Instead of trying to predict the topic of some text, the goal is to predict some other attribute that may require more background knowledge or less obvious features. Examples include sentiment classification (positive vs. negative opinion) and wish recognition (distinguishing statements of objective fact from expressions of a personal desire for some outcome). See our NAACL-TextGraphs 2006 and NAACL 2009 papers for details on our statistical machine learning approaches to these difficult problems.
Other NLP and IR Research
In recent years, I have also been involved in several other projects related to natural language processing and information retrieval. On the NLP front, I have worked on trying to recover ordered bigram probabilities from unordered bag-of-words representations (ACL 2008). This has implications for privacy in text indexing, as we have found some evidence that bag-of-words indexes can be converted back into ordered text documents.

My previous work also includes diversity-based ranking: ranking documents, people, and other items so that the highest-ranked items are simultaneously non-redundant with each other and relevant to some query (or simply "central" in a PageRank sense). This work arose in the context of information retrieval for bioinformatics: given some gene or protein, the task is to return passages of text that discuss a diverse set of aspects of that gene or protein (e.g., its different functions). We developed a general-purpose ranking algorithm, GRASSHOPPER, based on Markov chain random walks with absorbing states, which encourages diversity in the final ranked list. See the HLT-NAACL 2007 and TREC 2006 papers.
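The absorbing-random-walk idea behind GRASSHOPPER can be sketched as follows. This is a simplified illustration assuming a uniform prior (the published algorithm also supports a user-supplied prior distribution), with function names of my own choosing:

```python
import numpy as np

def diversity_rank(W, lam=0.9, n_iter=200):
    """Sketch of absorbing-random-walk diversity ranking.
    W: nonnegative item-similarity matrix. Returns indices in ranked order."""
    n = len(W)
    P = W / W.sum(axis=1, keepdims=True)
    P = lam * P + (1.0 - lam) / n            # teleporting random walk
    pi = np.ones(n) / n
    for _ in range(n_iter):                  # stationary distribution
        pi = pi @ P                          # via power iteration
    ranked = [int(np.argmax(pi))]            # first pick: most central item
    remaining = [i for i in range(n) if i != ranked[0]]
    while remaining:
        # already-ranked items become absorbing states; score the rest by
        # the expected number of visits before absorption
        Q = P[np.ix_(remaining, remaining)]
        N = np.linalg.inv(np.eye(len(remaining)) - Q)  # fundamental matrix
        visits = N.sum(axis=0) / len(remaining)
        nxt = remaining[int(np.argmax(visits))]
        ranked.append(nxt)
        remaining.remove(nxt)
    return ranked
```

On a toy similarity matrix where items 0 and 1 are near-duplicates and item 2 is distinct, the walk picks the central item 0 first, then jumps to the dissimilar item 2 before the redundant item 1: absorption near item 0 drains the visit counts of everything similar to it.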
In addition to these projects at UW-Madison, I spent two summers doing industry internships. First, at Google in 2007, I worked on n-best list reranking for machine translation, using large-scale, distributed supervised and semi-supervised machine learning. Then, at Microsoft Research in 2008, I worked on query intent classification (e.g., given a short text query, predict whether the user intends to buy a commercial product). Such a classifier can be used to rerank search results or to present a custom interface for different types of queries. To overcome the need for manually labeled training queries, I helped develop a method to extract large amounts of automatically labeled training data from search-engine query and toolbar logs. Our KDD 2009 paper shows that an intent classifier trained on this data achieves large accuracy gains at much lower cost than classifiers trained on expensive manually labeled sets.
During my first year at UW-Madison, I had an independent study with Professor Michael Ferris in which I experimented with implementing various knowledge-based support vector machine formulations in Matlab. This was based on recent work by Olvi Mangasarian and Jude Shavlik in the bioinformatics domain (e.g., Wisconsin Breast Cancer Database).
As an undergraduate at Amherst College, I spent a year researching genetic algorithms. This work culminated in a senior honors thesis in which I developed a genetic algorithm to find near-optimal schedules for a soccer league. Specifically, I worked with data and prior knowledge acquired from the New England Small College Athletic Conference (NESCAC) to build a real-world system capable of creating schedules that optimize various criteria (e.g., minimizing the total distance traveled by all teams and ensuring that big rivalry games fall on key dates). The work was a success: the schedules produced were deemed desirable by league administrators.
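The general flavor of such a genetic algorithm can be sketched in a few lines. This toy version evolves permutations under an arbitrary cost function; it is not the actual thesis system, and the schedule encoding and scheduling constraints used there are not reproduced here:

```python
import random

def evolve(cost, genome_len, pop_size=60, n_gen=200, mut_rate=0.2, seed=0):
    """Minimal genetic algorithm over permutations.
    cost: function mapping a permutation (tuple) to a penalty to minimize."""
    rng = random.Random(seed)
    pop = [rng.sample(range(genome_len), genome_len) for _ in range(pop_size)]
    for _ in range(n_gen):
        pop.sort(key=lambda g: cost(tuple(g)))
        survivors = pop[:pop_size // 2]             # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, genome_len)      # order crossover:
            child = a[:cut] + [x for x in b if x not in a[:cut]]
            if rng.random() < mut_rate:             # swap mutation
                i, j = rng.sample(range(genome_len), 2)
                child[i], child[j] = child[j], child[i]
            children.append(child)
        pop = survivors + children                  # elitism: keep the best half
    return min(pop, key=lambda g: cost(tuple(g)))
```

In the real system, the genome encoded a season schedule and the cost function combined the competing criteria (travel distance, rivalry dates, and so on) into a single penalty; the crossover and mutation operators also had to respect scheduling constraints.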