Advanced Natural Language Processing

2006 Projects

Project Proposal/Description:

David Andrzejewski

Modeling Protein-Protein Interactions in Biomedical Abstracts with Latent Dirichlet Allocation

A major goal in biomedical text processing is the automatic extraction of protein interaction information. We can approach this task with a model based on the 'topic' concept, where each topic corresponds to a different multinomial distribution over our vocabulary. Sentences in biomedical abstracts can then be generated either by the 'interaction' topic, if they contain or discuss interacting proteins, or by the 'background' topic otherwise. This model structure can be represented with Latent Dirichlet Allocation (LDA). Some model development has already been done outside of this class; this project will consist of further model development and refinement, derivation of inference equations and algorithms, and experimental testing on a dataset of Escherichia coli abstracts and known pairs of interacting proteins obtained from the Database of Interacting Proteins (DIP) at UCLA.
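The two-topic idea can be illustrated with a much simpler model than LDA. The sketch below is an assumption-laden simplification, not the proposed model: it estimates one smoothed multinomial per topic from sentences whose topic is already known, then scores a new sentence by its posterior probability under the 'interaction' topic. The function names, the smoothing constant alpha, and the uniform prior are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def train_topic(sentences, vocab, alpha=1.0):
    """Estimate an add-alpha smoothed multinomial over vocab from tokenized sentences."""
    counts = Counter(w for s in sentences for w in s if w in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def log_likelihood(sentence, topic):
    """Log-probability of the in-vocabulary words of a sentence under one topic."""
    return sum(math.log(topic[w]) for w in sentence if w in topic)


def posterior_interaction(sentence, interaction, background, prior=0.5):
    """P(topic = interaction | sentence) for a two-topic mixture."""
    li = math.log(prior) + log_likelihood(sentence, interaction)
    lb = math.log(1.0 - prior) + log_likelihood(sentence, background)
    m = max(li, lb)  # subtract the max before exponentiating, for stability
    return math.exp(li - m) / (math.exp(li - m) + math.exp(lb - m))
```

In the full LDA setting the topic assignments are latent and must be inferred, which is the part of the project this sketch leaves out.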

D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of
Machine Learning Research, 3:993-1022, January 2003.

Himani Apte

Text summarization is a classic problem in natural language processing, with seminal work dating back to the 1950s. With the huge amount of information now available electronically, there is an increasing demand for automatic text summarization systems. Many new approaches have been developed to adapt summaries to user needs and to corpus characteristics. Based on content, a system may generate a 'generalized summary' of a document or a 'query-specific summary'. A summary may be a collection of sentences carefully picked from the document, or it may be formed by synthesizing new sentences that represent the information in the documents. Sentence extraction methods normally work by scoring each sentence as a candidate for the summary and then selecting the highest-scoring subset of sentences. Such statistical techniques usually employ features such as keywords, sentence length, and indicative phrases. The linguistic approach to summarization extracts phrases and lexical chains from the documents and fuses them together with generative tools to produce a summary. In the past few years, various techniques have been developed for multi-document and multi-lingual summarization. One of the open problems in the text summarization domain is the evaluation of summaries. A commonly employed approach is to compare the automatically generated summaries to manually generated ones.
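As a rough illustration of sentence extraction, the sketch below scores each sentence by the average document frequency of its non-stopword tokens and keeps the k highest-scoring sentences in their original order. The stopword list, the scoring formula, and the function names are assumptions made for this example; real systems combine more features (sentence length, indicative phrases, position).

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "it", "i", "is", "was"}


def tokenize(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, k=2):
    """Return the k highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    freq = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        if not tokens:
            return 0.0
        # Average keyword frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in tokens) / len(tokens)

    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```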

So far I have reviewed the literature in the text summarization field. I plan to apply a modified extractive summarization technique to build a domain-specific summarization system. Due to the lack of standardized evaluation methodologies in this field, it is not yet clear how to evaluate such a system. For training data, one possible approach is to obtain relevant documents from the Internet (e.g. customer reviews). Standardized data sets with manually created summaries are also available from certain conferences (e.g. the Document Understanding Conference); however, prior permission must be obtained to use this data.

Ye Chen

Hierarchical Topic Models for Image Categorization


Image classification is an essential part of digital image analysis. Because it is a fundamental and challenging task, much research effort has been devoted to it. With state-of-the-art classification techniques such as SVMs, ever-improving classification performance has been reported [1,2]. From the image processing perspective, different approaches have been tried, ranging from simple representations such as bag-of-patches and bag-of-features [6] and the pyramid match kernel [4, 5] to more sophisticated methods such as the constellation model [3].

However, current approaches to image categorization have the following limitations:
(1) They are unable to leverage less expensive unlabeled data, and the available labeled datasets are usually small.
(2) They pay little attention to knowledge that is not data-driven, such as domain-specific or general knowledge.
(3) It is unclear whether performance improvements come from better classification methods or from a better understanding/representation of image data.

Both generative and discriminative classification paradigms have been tried in image categorization tasks. Although discriminative methods such as SVMs currently outperform generative models empirically, the better intuition offered by the latter is still appealing. One framework for supervised learning, called the topic model, was proposed in [7,8] and has since shown interesting results and potential in areas such as text clustering and modeling NIPS abstracts. A similar approach was also employed in unsupervised learning [9]. The framework has also been used in image categorization tasks [6].


In this project, we propose a hierarchical topic model for image categorization. In this context, one assumption underlying this approach is that there exists a topic hierarchy from image data. For example, the image class "apple" shares some latent topics with the class "orange", and these topics also have image manifestations (table as background, size, shape, or something we don't know intuitively).

A simple representation of the structure of the topic hierarchy would be a tree, with leaf nodes denoting image classes fed with training images and internal nodes symbolizing latent classes. Each image can be viewed as a "bag of patches" or a feature vector by some image processing methods.

From a generative model point of view, each patch of an image is generated as follows:
(1) Generate the class, ~multinomial(P(c));
(2) Generate a single latent class from the path starting from the class and up to the tree root, ~multinomial(P(a|c));
(3) Generate a patch from the latent class, ~multinomial(P(w|a)).
The patches of an image are generated position by position.
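The generative steps above can be sketched as a small sampler. This is only an illustration under stated assumptions: the tree is given as a hypothetical child-to-parent dictionary, the class is drawn once per image (one reading of step 1), and the distributions P(c), P(a|c), and P(w|a) are passed in as plain dictionaries.

```python
import random


def path_to_root(node, parent):
    """Return the nodes from a class leaf up to the tree root, leaf first."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def sample(dist, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]


def generate_image(p_c, p_a_given_c, p_w_given_a, n_patches, rng):
    """Sample a class, then one (latent class, patch) pair per patch position."""
    c = sample(p_c, rng)                      # step (1): class ~ P(c)
    patches = []
    for _ in range(n_patches):
        a = sample(p_a_given_c[c], rng)       # step (2): latent class on the path of c
        patches.append(sample(p_w_given_a[a], rng))  # step (3): patch ~ P(w|a)
    return c, patches
```

Here P(a|c) is expected to put mass only on the nodes returned by path_to_root for class c; the sampler does not check this.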

This is the starting version of our generative hierarchical topic model, which makes two assumptions:
(1) The model is essentially a mixture of multinomials;
(2) Each class uses the same mixture proportions to generate all patch positions belonging to that class.

The model parameters will be learned by an EM algorithm, which can be outlined as follows:
(1) Initialize parameters as discrete uniform distributions satisfying the multinomial constraint (sum(P)=1 for each distribution). This initialization gives up random restarts. The next version would initialize parameters from a Dirichlet random generator.
(2) Iterate until convergence:
(a) E-Step: Compute the expected Q_k(ij) from the parameters estimated in the last iteration. Here k denotes a latent class and ij is patch position j in image d_i;
(b) M-Step: Re-estimate the parameters P(c), P(a|c), and P(w|a) to maximize the expected likelihood.
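The EM outline above might look roughly like the following sketch. It assumes labeled training images given as (class, list of patch words) pairs, a hypothetical child-to-parent dictionary for the tree, and a fixed number of iterations instead of a convergence test. P(c) is read off the label counts because the classes are observed; the small constant eps is an added assumption to keep every P(w|a) strictly positive.

```python
from collections import defaultdict


def path_to_root(node, parent):
    """Return [node, parent(node), ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def fit_em(images, parent, vocab, n_iter=25, eps=1e-9):
    """images: list of (class_label, [patch_word, ...]); every word must be in vocab."""
    classes = sorted({c for c, _ in images})
    paths = {c: path_to_root(c, parent) for c in classes}
    nodes = sorted({a for p in paths.values() for a in p})

    p_c = {c: sum(1 for lab, _ in images if lab == c) / len(images) for c in classes}
    # Uniform initialization, as in step (1) of the outline.
    p_a_c = {c: {a: 1.0 / len(paths[c]) for a in paths[c]} for c in classes}
    p_w_a = {a: {w: 1.0 / len(vocab) for w in vocab} for a in nodes}

    for _ in range(n_iter):
        a_counts = {c: defaultdict(float) for c in classes}
        w_counts = {a: defaultdict(float) for a in nodes}
        # E-step: Q(a | c, w) is proportional to P(a|c) * P(w|a) over the path of c.
        for c, patches in images:
            for w in patches:
                joint = {a: p_a_c[c][a] * p_w_a[a][w] for a in paths[c]}
                z = sum(joint.values())
                for a, v in joint.items():
                    q = v / z
                    a_counts[c][a] += q
                    w_counts[a][w] += q
        # M-step: renormalize the expected counts.
        for c in classes:
            total = sum(a_counts[c].values())
            if total > 0:
                p_a_c[c] = {a: a_counts[c][a] / total for a in paths[c]}
        for a in nodes:
            total = sum(w_counts[a].values()) + eps * len(vocab)
            p_w_a[a] = {w: (w_counts[a][w] + eps) / total for w in vocab}
    return p_c, p_a_c, p_w_a
```

Even with uniform initialization the tree breaks the symmetry: an internal node collects expected counts from all classes beneath it, so patch words shared across classes tend to migrate toward shared ancestors.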

This work addresses several problems in current image categorization efforts:
(1) Lack of labeled image training data. Since image classification tasks involve a large number of parameters, it would be valuable for a model to leverage information from neighboring data. For example, a latent class that generates "apple" images can learn something from the "orange" images.
(2) Incorporating prior knowledge in image categorization. The hierarchical latent model is a way to achieve this. The hierarchy can be obtained from external knowledge, such as WordNet.


The learned model will be empirically compared with conventional flat supervised classification (to support advantage 1) and with models built on arbitrary knowledge topologies (to support advantage 2). As for the data, we will use the Caltech101 image data made available by Computational Vision at Caltech. The prior knowledge to be incorporated will be extracted from WordNet as a relevant condensed tree topology.

1. J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. "Local Features and Kernels for Classification of Texture and Object Categories: An In-Depth Study." International Journal of Computer Vision, submitted, 2005.
2. S. Lazebnik, C. Schmid, and J. Ponce. "A Maximum Entropy Framework for Part-Based Texture and Object Recognition." Proceedings of the IEEE International Conference on Computer Vision, Beijing, China, October 2005, vol. 1, pp. 832-838.
3. L. Fei-Fei, R. Fergus, and P. Perona. "Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories." CVPR 2004, Workshop on Generative-Model Based Vision, 2004.
4. K. Grauman and T. Darrell. "The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features." International Conference on Computer Vision (ICCV), 2005.
5. S. Lazebnik, C. Schmid, and J. Ponce. "Semi-Local Affine Parts for Object Recognition." Proceedings of the British Machine Vision Conference, Kingston, UK, September 2004, vol. 2, pp. 959-968.
6. L. Fei-Fei and P. Perona. "A Bayesian Hierarchical Model for Learning Natural Scene Categories." CVPR 2005.
7. T. Hofmann. "Probabilistic Latent Semantic Indexing." Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999.
8. D. Blei, A. Ng, and M. Jordan. "Latent Dirichlet Allocation." Journal of Machine Learning Research, 3:993-1022, January 2003.
9. T. Hofmann. "The Cluster-Abstraction Model: Unsupervised Learning of Topic Hierarchies from Text Data." Proceedings of IJCAI'99.

Pradheep Elango

1) Using Learning Techniques for Anaphora Resolution and Relation Extraction

Problem: Conditional Random Fields and HMMs give good results for Noun Coreference and Relation Extraction. I intend to explore semi-supervised learning techniques that would learn to resolve multiple coreference to entities as well as learn relations between entities. Further, active learning could be used to obtain more labeled data.

Related Work: A maximum entropy approach to extract relations using lexical, syntactic and semantic features is presented in [2]. Several conditional models are presented in [1] that apply to coreference resolution. A simple language model that looks at a fixed size window of n words to define context is used in [3] for modeling entities.

Datasets: A co-reference annotated corpus is available at It contains a listing of annotated technical manuals. Further, a small percentage of the "7 sectors" dataset, tagged with classes, is publicly available. There are also a few open source software packages that perform basic tasks such as tokenization, sentence extraction, and rudimentary named entity extraction.

Issues: Evaluation will be difficult because many of the previous papers have used commercial software and commercial data sets that are not available to us. However, there are some other data sets/tools that are freely available, which have been used in slightly older papers.

[1] Andrew McCallum. Conditional Models of Identity Uncertainty with Application to Noun Coreference
[2] Nanda Kambhatla, "Combining Lexical, Syntactic, and Semantic Features with Maximum Entropy Models for Extracting Relations"
[3] Hema Raghavan, James Allan and Andrew McCallum. An Exploration of Entity Models, Collective Classification and Relation Description. KDD '04.

2) Disambiguating People and other Entities

Goal: The problem is to automatically identify the appearances of different people or entities in a given discourse. For example, consider the WWW. When we issue a query containing a person's name, we may get pages that do not refer to the person we are looking for. The goal is to automatically distinguish multiple occurrences of the same person from occurrences of other persons.

Method: In [1], the authors use a couple of clustering algorithms and leverage the link structure of the social network formed by related people to solve the problem. I propose to associate other entities that co-occur with the person or event we are looking for, and to perform a co-occurrence analysis. The same entity or similar entities are likely to occur in the same context. An approach such as Latent Semantic Indexing will be useful for co-occurrence analysis. A tagged data set containing 1085 webpages is available at the webpages of the authors of [1].

[1] Ron Bekkerman and Andrew McCallum, "Disambiguating Web Appearance of People in a Social Network",

Mohamed Eldawy

My goal is to do a survey of classification methods. I will implement different classification techniques, and test them to find out which ones run faster, and which ones produce more accurate results for various sizes of training data. The classifiers will be built in a generic way such that they can be easily used for different applications.

The classifiers I will implement are:
1. Generative Classifiers
    a. Naive Bayes Classification: Both multivariate and multinomial models will be implemented
2. Discriminative Classifiers
    a. Nearest Neighbor Classifier: User will be able to write their own similarity function, and change the value of k
    b. Logistic Regression
    c. Support Vector Machines: Tricks to deal with nonlinearity will also be implemented.

The classifiers will be implemented in C++ and will follow the philosophy and design concepts of the STL. I will test the classifiers on the problem of recognizing handwritten digits on a grid of fixed size. The user will be able to specify the training set and then ask the classifiers to classify a new symbol. I will test the classifiers on both a scarce training set and an extensive one. For each case, the accuracy and running time of each classifier will be recorded, and all classifiers will be compared.
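To make the generic nearest neighbor design concrete, here is a minimal sketch of a k-nearest-neighbor classifier that accepts a user-supplied similarity function and a configurable k. It is written in Python for brevity, although the project itself targets C++; the function names and the negative-Hamming similarity for binary grids are assumptions made for this example.

```python
from collections import Counter


def neg_hamming(a, b):
    """Similarity for equal-length binary grids: higher means fewer differing cells."""
    return -sum(1 for x, y in zip(a, b) if x != y)


def knn_classify(train, query, k=3, similarity=neg_hamming):
    """train: list of (grid, label) pairs; returns the majority label of the k most similar grids."""
    neighbors = sorted(train, key=lambda ex: similarity(ex[0], query), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Passing the similarity as a parameter mirrors the STL habit of parameterizing algorithms by a comparison functor.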

Brian Eriksson

Noun-Verb Pair Based Text Classification

A human reader extracts the context of a sentence from its subject (noun) and its action (verb). Two text corpora can be considered similar if they contain many sentences that share the same noun-verb pairs. The LINK software suite will be used to perform a standard tree parse of each sentence and find its nouns and associated verbs. WordNet will then be used to find synonyms of the verbs to increase the possibility of a match. Classification using this technique will be compared against the bigram model and against a combined technique that uses weighted noun-verb matches together with the bigram model.
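One way to turn shared noun-verb pairs into a similarity score is the Jaccard overlap of the two pair sets. The sketch below assumes the pairs have already been extracted by a parser, and it uses a hypothetical verb-to-canonical-verb dictionary as a stand-in for the WordNet synonym lookup; neither the parser nor WordNet is called here, and Jaccard overlap is an illustrative choice rather than the measure the project commits to.

```python
def normalize(pairs, verb_synonyms):
    """Map each verb to its canonical synonym (if any) and drop duplicate pairs."""
    return {(noun, verb_synonyms.get(verb, verb)) for noun, verb in pairs}


def pair_similarity(pairs_a, pairs_b, verb_synonyms=None):
    """Jaccard overlap between the noun-verb pair sets of two texts."""
    synonyms = verb_synonyms or {}
    a = normalize(pairs_a, synonyms)
    b = normalize(pairs_b, synonyms)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```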

Wes Evans

After more investigation, the intoxicated speech recognition idea has been scrapped. However, after extensive reading I have found several ideas in the area of speech synthesis, or text-to-speech. The first idea is text normalization: nonstandard tokens should be converted to words. For example, a period can occur in a sentence as part of an abbreviation or as the end of the sentence, and abbreviated words should be spoken in full form by the speech synthesizer. As another example, a string of numbers could be a phone number, date, dollar amount, etc., and the synthesizer should speak the numbers differently depending on their type. Machine learning techniques can be used in this conversion from tokens to words [1]. The second idea is homograph disambiguation. A homograph is a word that can be pronounced differently depending on the context and usage. For example, "read" has two different pronunciations depending on the part of speech. Using only part of speech to determine homograph pronunciation is not sufficient for words such as "wind" and "bass", where context must be taken into consideration. A speech synthesizer needs to know the correct pronunciation of a word depending on the context and part of speech [2]. The third idea is grapheme-to-phoneme conversion, or letter-to-sound rules. This process converts words into the phones that are spoken by the synthesizer. A dictionary or lexicon is often used for this process, but not every word is in the lexicon. Proper nouns, foreign words, and slang are the most common words not found. For these words, rules must be used to convert the word to a sequence of phones. The rules are generated using machine learning techniques such as decision trees and expectation-maximization [3]. Whichever of these projects I choose, I plan to integrate it into Festival, an open source speech synthesizer [4].
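The text normalization step can be pictured with a small rule-based sketch. Note that this swaps in hand-written regular expressions where the proposal envisions learned classifiers, and the token categories, patterns, and function names are assumptions made for illustration; it does not use Festival.

```python
import re

# Ordered (type, pattern) rules; the first matching rule wins.
TOKEN_RULES = [
    ("phone", re.compile(r"^\d{3}-\d{4}$")),
    ("date", re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$")),
    ("money", re.compile(r"^\$\d+(\.\d{2})?$")),
    ("number", re.compile(r"^\d+$")),
]

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def token_type(token):
    """Classify a token so that the synthesizer can pick a reading style."""
    for name, pattern in TOKEN_RULES:
        if pattern.match(token):
            return name
    return "word"


def speak_digits(token):
    """Read a token digit by digit, as is usual for phone numbers."""
    return " ".join(DIGIT_WORDS[int(ch)] for ch in token if ch.isdigit())
```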

[1] Sproat, R., Black, A., Chen, S., Kumar, S., Ostendorf, M., and Richards, C. 2001. Normalization of Non-standard Words, Computer Speech and Language, 15(3):287-333.
[2] Yarowsky, D. "Homograph Disambiguation in Speech Synthesis." In J. van Santen, R. Sproat, J. Olive and J. Hirschberg (eds.), Progress in Speech Synthesis. Springer-Verlag, pp. 159-175.
[3] Black, A., Lenzo, K. and Pagel, V. (1998) Issues in Building General Letter to Sound Rules, 3rd ESCA Workshop on Speech Synthesis, pp. 77-80, Jenolan Caves, Australia.

Jurgen Van Gael

Cross Language News Clustering

News clustering is done at various portal sites such as The clusters at these sites are single-language clusters and thus miss out on a possibly valuable feature. My project will address this issue: it will use data currently available (from and allow for users to find articles in several different languages on the same news event.

I propose 4 distinct parts of the project:
1. Gather useful data from
2. Find at least one model that clusters articles across different languages
3. Implement the suggested models
4. Analyze the performance of the different models
Extra [5. Apply the technology]

1. Has been partly implemented and seems to gather a large amount of data on a daily basis.
2. Will be the main part of the project and involve studying known techniques from the literature.
3. In order to make user benchmarking possible we will probably implement a web based interface for the project.
4. Since we do not have any existing examples of cross language clusters available, it seems that we will have to resort to asking people to evaluate the project.
5. Depending on time, we might implement features such as one that reports the relative importance of news events across different regions in the world and possibly other ideas we come across along the way.

Andrew Goldberg

For my project, I plan to investigate semi-supervised learning approaches to sentiment classification/analysis. This task is like traditional text classification, except instead of predicting the topic of a particular piece of text, the goal is to predict whether the text conveys a positive or negative opinion. Practical applications of this include automatic summarization of product reviews, interpreting free text survey responses, or evaluating the opinions of documents retrieved by a search engine. This is a difficult task--while topical classification can easily rely on certain keywords that indicate the topic at hand, opinions can be much more subtly expressed. In movie reviews (the domain I plan to investigate most), for example, authors tend to use sarcasm or rhetorical questions which could use words that carry the opposite sentiment as what they are really intending. Movie reviews also tend to exhibit other difficult to learn phenomena such as "thwarted expectations" (Pang et al, 2002), in which the reviewer deliberately sets up a contrast between what they expected from a movie and what they actually thought of it. This has a tendency to introduce many words or phrases with sentiment opposite to that of the review as a whole. Thus, making sentiment classification decisions based purely on word occurrence/frequency could be quite difficult. A more complex model might be necessary, which takes into account patterns of positive and negative word usage throughout the course of the review. In addition to predicting positive or negative, or 0 through 4 stars, a common goal in this area is to predict a real number rating (say from 0 to 1), in which case the task is one of regression rather than categorization. 
Most work in this area has used supervised learning approaches (among others: Pang et al., 2002; Pang and Lee, 2004; Pang and Lee, 2005) that require all training examples to have explicit labels (usually in the form of a number of stars for a movie or product review). However, this is one of many areas where it would be extremely useful to classify and make predictions without a large labeled training corpus. Suppose we have a huge collection of movie reviews without explicit labels. We should be able to exploit similarities between unlabeled and labeled examples to build a classifier that is more accurate than one based on the labeled examples alone. This is the idea behind semi-supervised learning (SSL). A key component of graph-based SSL methods is determining the similarity between examples. With this knowledge, we can try to build a classifier that assigns similar labels/ratings to similar examples. For example, imagine we have review A with a known 4 star rating. Review B is similar to A in terms of its features, but it does not have a known label. Now suppose there's a new test example, review C. It is very similar to B, but not that similar to A. A classifier trained on A alone might misclassify C, but an SSL approach would be able to predict C more accurately based on the fact that it's similar to B, which in turn is similar to A. This amounts to assuming that the decision surface over the space of examples is smooth. I plan to study how effectively semi-supervised learning can be applied to the movie review dataset found at, comparing my results with the supervised approaches of Pang and Lee (2005). In that work, the authors tried to exploit similarities between reviews (based on the percentage of positive sentences each contained) to enforce the constraint that reviews similar in content should be predicted to have similar ratings, but they only looked for similar reviews among the labeled training examples.
I plan to explore new methods of comparing two movie reviews (or any opinionated text) and try out several theoretical graph-based semi-supervised learning approaches that exploit similarities between the test examples and all other examples (both among the training and test examples).
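The A/B/C example can be worked through with a simple label propagation sketch, one common graph-based SSL technique (not necessarily the method this project will settle on). Each unlabeled review repeatedly takes the similarity-weighted average of its neighbors' ratings while labeled reviews stay fixed. The similarity weights, the extra labeled review D, and the function name are hypothetical values chosen for illustration.

```python
def propagate(weights, labels, n_iter=200):
    """weights: {node: {neighbor: similarity}}, assumed symmetric.
    labels: {node: known rating}. Returns a rating for every node."""
    known_mean = sum(labels.values()) / len(labels)
    # Labeled nodes keep their rating; unlabeled nodes start at the labeled mean.
    f = {n: labels.get(n, known_mean) for n in weights}
    for _ in range(n_iter):
        for n in weights:
            if n in labels:
                continue
            total = sum(weights[n].values())
            f[n] = sum(w * f[m] for m, w in weights[n].items()) / total
    return f


# A is a labeled 4-star review, D a labeled 1-star review; B and C are unlabeled.
# C is close to B and only weakly tied to A and D.
weights = {
    "A": {"B": 0.9, "C": 0.1},
    "B": {"A": 0.9, "C": 0.9, "D": 0.1},
    "C": {"A": 0.1, "B": 0.9, "D": 0.1},
    "D": {"B": 0.1, "C": 0.1},
}
ratings = propagate(weights, {"A": 4.0, "D": 1.0})
```

In this toy graph C's direct links to A and D are equally weak, so its predicted rating is pulled toward A's almost entirely through B, which is the chain of similarity the paragraph describes.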

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan, Thumbs up? Sentiment Classification using Machine Learning Techniques, Proceedings of EMNLP 2002.
Bo Pang and Lillian Lee, A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, Proceedings of ACL 2004.
Bo Pang and Lillian Lee, Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales, Proceedings of ACL 2005.

Apirak Hoonlor

In question-answering applications, the answer to a question is usually extracted using prespecified text patterns. For example, for the question "When was [person] born?", a typical answer would appear as "Abraham Lincoln was born in 1809." or "Gandhi (1869-1948)". Thus, one can develop regular expressions that detect these patterns in text. In this case we would need "[Name] was born in [Birthdate]" and "[Name] ([Birthdate]-". These phrase patterns are referred to as surface patterns. There are many possible surface patterns for any given question, so it would be nearly impossible to find such patterns across the web manually and write the corresponding regular expressions by hand.
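The two surface patterns mentioned above can be written as regular expressions, for instance as in the sketch below. The name pattern (one or more capitalized words) and the four-digit year are simplifying assumptions; the helper name find_birth_years is hypothetical.

```python
import re

NAME = r"(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
YEAR = r"(?P<year>\d{4})"

# "[Name] was born in [Birthdate]" and "[Name] ([Birthdate]-"
SURFACE_PATTERNS = [
    re.compile(NAME + r" was born in " + YEAR),
    re.compile(NAME + r" \(" + YEAR + r"-"),
]


def find_birth_years(text):
    """Return (name, year) pairs matched by any surface pattern, grouped by pattern."""
    results = []
    for pattern in SURFACE_PATTERNS:
        for match in pattern.finditer(text):
            results.append((match.group("name"), int(match.group("year"))))
    return results
```

Learning such patterns automatically, rather than writing them by hand as here, is the point of the cited work.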

I would like to apply an unsupervised learning setting to this problem. However, I am still searching the literature to find out the current state of unsupervised learning methods in QA applications.

Deepak Ravichandran and Eduard Hovy, Learning Surface Text Patterns for a Question Answering System. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 41-47.

Karthik Jayaraman

1) Disambiguation for cross-language IR

The higher-level problem is to retrieve relevant documents in a target language when the query is given in a source language. The chief difficulty is that words have multiple senses. The usual ways of dealing with this are a) direct translation of the query into the target language, and b) translation of the documents from the target language to the source language, followed by monolingual retrieval. I propose to address this issue by using relevant documents in the source language to perform sense disambiguation on the query; once the sense is known, the keywords are automatically translated into the target language and IR is performed in the target language. This would require query-relevant corpora in the two languages (they need not be parallel) and a bilingual dictionary.

2) Identifying and characterizing the protagonist in a book

Given a book, the aim is first to find the chief character in the book. This can be done using Named Entity Recognition: the frequencies of named entities give a reliable estimate of the protagonist's name. The second step is to find the adjectives in the text that refer to him/her, and the final step is to categorize the character of the protagonist into predefined classes based on those adjectives. The first step can be performed automatically. The final step requires some user tagging of adjectives into a few classes. The referring adjectives can probably be obtained using some POS analysis.

Saisuresh Krishnakumaran

1) Metaphor Identification

Similes and metaphors are figures of speech that are prevalent in plays, dramas, etc. A simile is a figure of speech in which two essentially unlike things are compared, often in a phrase introduced by 'like' or 'as'. For example, "He fought like a lion in the battle" describes the fighting ability of the person. Similes in text documents can be identified by looking for occurrences of the words 'like' or 'as' in the sentences. A metaphor is also a figure of speech, in which an expression is used to refer to something that it does not literally denote in order to suggest a similarity. But unlike a simile, a metaphor has no explicit words that identify it. This makes the problem of identifying metaphors in documents interesting. Identifying a metaphor gives a better understanding of the sentences and is useful for correctly translating documents from one language to another. An example of a metaphor is the sentence "He was a lion in the battle". This sentence attributes the bravery (a quality) of a lion to the person. Metaphor identification will involve learning the qualities first and then using the context of the sentence and the parts of speech to identify the quality that is being referred to.

2) Novel Type Classifier

Given a novel, the system should classify it as a 'tragedy', 'comedy', etc. A related problem is document summarization. Automatic summarization is the creation of a shortened version of a text by a computer program. The summary should contain the most important points of the original text, which together form the plot. The novel classifier should be aware of the plot of the story in order to correctly identify the type. At first thought, it looks as though the words used in the novel can be an indicator of the type. But identifying the meaning, emotions, and tone that are useful for type classification could be hard. I'm interested in studying how effective statistical modeling techniques would be in this case. It would involve identifying the key features that help in understanding the overall plot and then using them for classification.

Eric Lantz

Natural language generation (NLG) is the process of taking some structured representation of information and producing natural language. Generally speaking, there are three types of sentence realization: "canned" text, template based, and agent based. Canned text is the most basic, where predetermined output is triggered by predetermined input. Templates are forms or various other part-of-speech mechanisms into which information from the user is inserted. Agent-based systems try to form some sort of representation of the user's intentions, as presented in a scenario such as a dialog.

In this project I would like to create a system that would produce a paraphrased summary or comment on a set of news articles on a particular topic. This is similar to a summarization task, except that the system will be formulating its own sentences. I feel that a template-based approach to the task will be best, given the one-way nature of the problem and the time constraints involved. In addition, I would like to try using some probabilistic preprocessing steps, such as using word frequencies to determine words relevant to the topic. I am looking to find a separate template-based generator that I can use for developing the templates (YAG is the most promising). I hope that this is enough structure to produce a system that will generate understandable and interesting output.

Channarukul S. (1999) YAG: A Template-Based Natural Language Generator for Real Time Systems. Master's Thesis, Department of Electrical Engineering and Computer Science, University of Wisconsin-Milwaukee.

Feng Liu

Human Motion Recognition

Human motion recognition has been a hot topic in areas such as surveillance and human-machine interaction. Most research efforts focus on video-based motion recognition. The recognition can usually be divided into two processes: extracting motion information from videos, and recognizing motion using the extracted information. In this project, we focus on recognizing motion. That is, we use existing 3D human motion data as the training and testing datasets. Our method can also be applied to motion information extracted from videos. In addition, our method can be used in motion data retrieval and example-based motion synthesis.
The basic idea of this project is to train motion classifiers using an existing motion capture data set, and to use another data set to examine the performance of the classifiers. To train good motion classifiers, several issues must be considered:

1. Find a machine learning algorithm that is suitable for multivariate temporal signals.
2. Extract the important features from a motion.
3. Since motion is a high-dimensional long signal, efficiency is an important issue.

[1] Aphrodite Galata and Neil Johnson and David Hogg. Learning variable-length Markov models of behavior, Computer Vision and Image Understanding, vol. 81(3): 398--413.
[2] Liu, G., Zhang, J., Wang, W., and McMillan, L. 2005. A system for analyzing and indexing human-motion databases. SIGMOD '05. ACM Press, New York, NY, 924-926.
[3] Duong, T., Bui, H.; Phung, D., Venkatesh, S. Activity recognition and abnormality detection with the switching hidden semi-Markov model. CVPR 2005, pp. 20-25.
[4] Ben-Arie, J., Pandit, P., Rajaram, S. View-based human activity recognition by indexing and sequencing. CVPR 2001, pp. 78-83.
[5] Hongeng, S., Bremond, F., Nevatia, R. Representation and optimal recognition of human activities, CVPR 2000, pp. 818 - 825.
[6] Yacoob, Y. and Black, M. Parameterized modeling and recognition of activities, ICCV 1998, pp. 120 - 127.
[7] Nguyen, N., Phung, D., Venkatesh, S. and Bui, H. Learning and detecting activities from movement trajectories using the hierarchical hidden Markov model. CVPR 2005, pp. 955 - 960.
[8] Niu, F. and Abdel-Mottaleb, M. HMM-Based Segmentation and Recognition of Human Activities from Video Sequences, ICME 2005, pp. 804 - 807.

Jian Liu

Soil information retrieval from multi-temporal remote sensing images

This project will deal with soil mapping from remote sensing images using statistical machine learning methods.

Background

Soil is a very important natural resource, and detailed soil information is necessary for land use and environmental modeling applications. Traditional soil survey and soil mapping have been largely based on field observation and manual delineation, but remote sensing offers the potential to detect soil information without intensive field trips and at a much larger scale. Its basis is that soil surface reflectance as detected by satellite sensors (provided the effect of vegetation cover and surface roughness is minimized) differs with moisture content.

Previous Approaches and Motivation

A few studies have used remote sensing images to detect soil information ([1][2][3]). They are usually based on a single band (typically a microwave band, which is most sensitive to soil moisture content) and on a single snapshot (one image). This project instead focuses on the temporal change of soil moisture content in response to weather, rather than on a static snapshot. The change in moisture content should be particularly prominent in the few days after a rainfall event, when soils move from saturation to a stable state at different rates depending on their texture and structure. Therefore, a series of remote sensing images from a satellite sensor with high temporal frequency (e.g. once a day) will be used.

Data and Study Area

Two sensors aboard the Earth Observing System (EOS) satellite Aqua, which provides daily temporal frequency, will be used: MODIS (Moderate Resolution Imaging Spectroradiometer) and AMSR-E (Advanced Microwave Scanning Radiometer). MODIS has 36 bands ranging from optical to thermal, but only relevant bands will be selected. AMSR-E provides microwave coverage, which is most sensitive to soil moisture. To minimize the effect of land cover, a prairie area will be chosen, and a series of images will be gathered in late fall after a rainfall event.

Method

Soil mapping thus becomes a classification problem. Each pixel has a feature vector of reflectance values from several bands and from a series of time snapshots. The pixel values across the feature dimensions constitute the "signature" of the pixel, which identifies it in the feature space. The task is to classify the pixels in this feature space. Various statistical learning methods are possible. One currently under consideration is density estimation using Gaussian mixture models (GMMs).

There are two possible approaches to training the GMMs: supervised learning and unsupervised learning. Unsupervised learning doesn't require training data (which would be expensive to gather); however, because the clustering relies only on a distance metric, without guidance from human knowledge, the results can be arbitrary and useless. Supervised training, by contrast, can be well guided, but a large amount of training data is not available. Existing survey maps could be used as training data, but their accuracy is subject to investigation. The feasibility of semi-supervised approaches ([4]) can be explored in this project, to make full use of the limited number of labeled points and the vast number of unlabeled points. Gaussian mixture models can be constructed from the training set using the Expectation Maximization algorithm. Other statistical methods that can be investigated include neural networks and support vector machines.
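As a rough illustration of the unsupervised variant, the following Python sketch fits a Gaussian mixture to per-pixel signatures with EM. It is only a minimal sketch under several assumptions that are not part of the proposal: diagonal covariances, a farthest-first initialisation of the means, and the hypothetical function names fit_gmm and assign. Each pixel is a plain list of reflectance values (bands by time snapshots, flattened).

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def gaussian_diag(x, mean, var):
    """Density of a diagonal-covariance Gaussian at feature vector x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p -= 0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def fit_gmm(pixels, k, iters=50, min_var=1e-4):
    """Fit a k-component diagonal GMM to pixel signatures with EM."""
    n, dim = len(pixels), len(pixels[0])
    # Assumption: farthest-first initialisation of the component means.
    means = [list(pixels[0])]
    while len(means) < k:
        far = max(pixels, key=lambda p: min(sq_dist(p, m) for m in means))
        means.append(list(far))
    # Every component starts with the overall per-dimension variance.
    centre = [sum(p[d] for p in pixels) / n for d in range(dim)]
    spread = [max(min_var, sum((p[d] - centre[d]) ** 2 for p in pixels) / n)
              for d in range(dim)]
    variances = [list(spread) for _ in range(k)]
    weights = [1.0 / k] * k

    for _ in range(iters):
        # E-step: responsibility of each component for each pixel.
        resp = []
        for x in pixels:
            joint = [weights[j] * gaussian_diag(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint) or 1e-300  # guard against underflow to zero
            resp.append([p / total for p in joint])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0.0:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = [sum(r[j] * x[d] for r, x in zip(resp, pixels)) / nj
                        for d in range(dim)]
            variances[j] = [
                max(min_var,
                    sum(r[j] * (x[d] - means[j][d]) ** 2
                        for r, x in zip(resp, pixels)) / nj)
                for d in range(dim)
            ]
    return weights, means, variances


def assign(x, weights, means, variances):
    """Index of the mixture component with the highest posterior for x."""
    scores = [w * gaussian_diag(x, m, v)
              for w, m, v in zip(weights, means, variances)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Calling fit_gmm(pixels, k) and then assign(...) on every pixel yields a cluster index per pixel. The indices carry no soil-class meaning by themselves, which is the weakness of the unsupervised route noted above; a supervised or semi-supervised variant would tie components to labeled survey points.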

Expected Result

The expected result of this study is to find a feasible method for large scale soil mapping with remote sensing images, in low relief and vegetation-sparse areas.

[1] Muller, E. and H. Decamps. 2000. Modeling soil moisture-reflectance. In Remote Sensing of Environment. 76, pp. 173-180.
[2] Wigneron, J.P., J.C. Calvet, T. Pellarin, A.A. Van de Griend, M. Berger and P. Ferrazzoli. 2002. Retrieving near-surface soil moisture from microwave radiometric observations: current status and future plans. In Remote Sensing of Environment. 85, pp. 489-505.
[3] Zribi, M., S.L. Hegarat-Mascle, C. Ottle, B. Kammoun and C. Guerin. 2003. Surface soil moisture estimation from the synergistic use of the (multi-incidence and multi-resolution) active microwave ERS wind scatterometer and SAR data. In Remote Sensing of Environment. 86, pp. 30-41.
[4] Zhu, X.J. 2005. Semi-Supervised learning literature survey. at:

Xiao-yu Liu

Gene regulation inference from biological literature

A large amount of literature is published daily in research journals, and researchers can read only a small portion of the literature in their field. Successful findings sometimes rely heavily on the quantity and quality of the literature a researcher has accessed. Many discoveries about gene regulatory networks were made from knowledge gained from the literature. However, extracting biological information faces several difficulties: 1) biological literature is written in natural language, so its content can be understood only if researchers read papers one by one; 2) a correlation found in the literature can only be verified by researchers who read the paper; 3) the number of papers that need to be read quickly exceeds the practical capacity of researchers. Automated information extraction from biological publications is therefore highly desirable. I propose to use PubMed abstracts to infer proteins that regulate a given set of genes. Two machine learning approaches will be used: Support Vector Machines (SVM) and Bond Energy Algorithm (BEA) partitioning. Each was previously described in a separate paper and was found to be superior to other common methods. The algorithms will be implemented and tested on well-characterized gene sets, and their performance will be compared.

Ramanathan Palaniappan

Text classification has long been at the heart of NLP, and research on it continues with undiminished enthusiasm. Here, I list a couple of approaches to text classification and feature reduction.

One particular area of interest is "cross-language text classification". The problem statement is as follows: given labeled data in a resource-rich language L1 and unlabeled data in another language L2 (assuming we have access to a comparable corpus), the task is to classify documents in language L2. A few approaches in the literature use machine translation to achieve this, but without bilingual lexicons this becomes a difficult task. One approach would be to compare the documents in both languages in a "latent semantic space". A first step would be to cluster the documents in L2 using a clustering algorithm like pLSA. Once this is done, we can develop a language model for each cluster. The same can be done for documents in L1 (clustering is trivial since we know the labels). The clusters can then be projected onto a latent semantic space and, with the help of cosine similarity, we can identify the labels for documents in L2. Note that this initial assignment might not be very accurate, and hence we need to refine it with an EM algorithm. An advantage of this approach is that it avoids translating the documents. This approach belongs to the class of semi-supervised learning algorithms.

Text classification can be highly painful if the number of features (word types) that you must process is large. Many of these features may play no role in classification at all. Techniques in the literature use metrics like odds ratio and information gain to reduce the number of features. Here, I envisage a "latent semantic analysis approach" to feature reduction. Instead of the conventional term-document matrix, define a "term-class" matrix where a cell indicates the likelihood of a word appearing in a given class. We can project this onto a semantic space using SVD. With the help of similarity metrics, we can identify whether a word type is important to a class or not. Insignificant word types can then be filtered out.

An alternative approach to the above problem is Maximum Entropy feature reduction. Define a feature associated with every word as the sum of its likelihood estimates for each class, weighted by the corresponding class priors. From this we can estimate p(w). If p(w) is above a defined threshold, we keep the word; otherwise we discard it. Both approaches let us recover a term-document representation of the data. Feature reduction methods in the literature have been biased towards specific classifiers. It will be interesting to compare various classifiers trained on data subjected to both forms of feature reduction.
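The scoring and thresholding step described here can be sketched in a few lines of Python. This covers only the prior-weighted sum p(w) = sum over classes c of p(w|c) p(c), with p(w|c) estimated by relative frequency; it is not a Maximum Entropy estimator, and the names word_scores and select_features as well as the input format (a list of (label, tokens) pairs) are assumptions made for illustration.

```python
from collections import Counter


def word_scores(docs):
    """Return p(w) = sum_c p(w | c) * p(c) for every word type.

    docs is a list of (label, tokens) pairs. p(c) is the fraction of
    documents with label c; p(w | c) is the relative frequency of w
    among all tokens of class c (no smoothing in this sketch).
    """
    class_docs = Counter(label for label, _ in docs)
    class_words = {}
    for label, tokens in docs:
        class_words.setdefault(label, Counter()).update(tokens)

    scores = Counter()
    for label, counts in class_words.items():
        prior = class_docs[label] / len(docs)
        total = sum(counts.values())
        for word, count in counts.items():
            scores[word] += prior * count / total
    return dict(scores)


def select_features(docs, threshold):
    """Keep the word types whose score reaches the threshold."""
    return {w for w, s in word_scores(docs).items() if s >= threshold}
```

Because the retained features are still ordinary word types, the filtered documents can be fed to any classifier that expects a term-document matrix. One property of the score as written is that it grows with raw word frequency, so the threshold would need to be chosen with that in mind.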

Brian Pellin

I was thinking about a project where I would compare traditional information retrieval methods to language model based methods. It seems that the current state of the art uses methods based on traditional IR, because the LM-based methods do not scale to Google-sized data sets. I think it would be interesting to see how LM methods could be adapted to perform better on large datasets, and additionally to see how classification accuracy compares between the different methods. Clearly this is a large problem space, so it will probably have to be further restricted to comparing a couple of specific methods in order to fit a class-sized project.
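To make the comparison concrete, here is a small Python sketch of one representative scorer from each family. The proposal does not name specific methods, so both choices are assumptions: TF-IDF stands in for traditional IR, and query likelihood with Jelinek-Mercer smoothing stands in for the language model approach. Documents and queries are plain lists of tokens, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_score(query, doc, docs):
    """Sum of tf * idf over query terms, with idf = log(N / df)."""
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)
        if df:
            score += tf[term] * math.log(len(docs) / df)
    return score


def query_likelihood(query, doc, docs, lam=0.5):
    """log P(query | doc) under a smoothed unigram document model.

    Jelinek-Mercer smoothing mixes the document model with the
    collection model: p = (1 - lam) * p_doc + lam * p_collection.
    """
    tf = Counter(doc)
    collection = Counter(t for d in docs for t in d)
    collection_len = sum(collection.values())
    log_p = 0.0
    for term in query:
        p_doc = tf[term] / len(doc) if doc else 0.0
        p_coll = collection[term] / collection_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        log_p += math.log(p)
    return log_p


def rank(query, docs, scorer):
    """Document indices sorted from best to worst under the scorer."""
    return sorted(range(len(docs)),
                  key=lambda i: scorer(query, docs[i], docs),
                  reverse=True)
```

Both scorers recompute collection statistics on every call, which keeps the sketch short but is exactly the kind of cost that would have to be precomputed or indexed before either method could be tried on a large data set.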

Nathan Rosenblum

To be announced...

Ameet Soni

I would like to explore the use of Support Vector Machines on a Natural Language Processing task. Which task remains open-ended, but one idea is to build a classifier that takes speech or text as input, converts it to features, and then classifies what language that speech or text is in. This is useful for creating an automatic translator that does not need the user to specify the language first. Phone systems could automatically determine the user's language by classifying their speech, or a browser could automatically translate web pages into the user's language. An interesting subtask is to apply this to proper names such as locations, persons' names, and other proper nouns. This could be used to automatically translate text based on a user's name or location.

I was also thinking about doing something like this in the area of information extraction. For example, a classifier could be used to pull out key biological relationships in a document, although I think this may be a saturated area and SVMs may not be the optimal algorithm.

Lidan Wang

Given a set of constraints over data (such as which pairs of instances belong to the same or different clusters), people have used Hidden Markov Random Fields (HMRFs) to provide a framework for incorporating supervision into prototype-based clustering. I wonder if it is possible to use the EM algorithm for this semi-supervised training instead. The missing information here would be some of the labels, since both labeled and unlabeled data are available. An open question is how I should incorporate the constraints into the EM algorithm. I will look into some EM and semi-supervised learning papers for this.
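A minimal Python sketch of the "labels as missing data" idea follows. It makes simplifying assumptions that go beyond the proposal: one-dimensional data, one Gaussian per class, and at least one labeled point per class. Labeled points keep fixed one-hot responsibilities while unlabeled points receive soft ones in the E-step. Pairwise constraints are not handled here, since how to incorporate them is the open question; the function name semi_supervised_em is hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def semi_supervised_em(labeled, unlabeled, k, iters=30, min_var=1e-3):
    """EM for a k-class 1-D Gaussian model with partially observed labels.

    labeled is a list of (x, class_index) pairs and unlabeled a list of x.
    Assumes every class has at least one labeled point.
    """
    # Initialise the parameters from the labeled points alone.
    weights, means, variances = [], [], []
    for j in range(k):
        xs = [x for x, y in labeled if y == j]
        m = sum(xs) / len(xs)
        means.append(m)
        variances.append(max(min_var, sum((x - m) ** 2 for x in xs) / len(xs)))
        weights.append(len(xs) / len(labeled))

    # Observed labels act as fixed one-hot responsibilities.
    fixed = [[1.0 if y == j else 0.0 for j in range(k)] for _, y in labeled]
    points = [x for x, _ in labeled] + list(unlabeled)

    for _ in range(iters):
        # E-step: soft class memberships for the unlabeled points only.
        soft = []
        for x in unlabeled:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint) or 1e-300  # guard against underflow to zero
            soft.append([p / total for p in joint])
        resp = fixed + soft
        # M-step: weighted maximum-likelihood updates over all points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(points)
            means[j] = sum(r[j] * x for r, x in zip(resp, points)) / nj
            variances[j] = max(
                min_var,
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, points)) / nj,
            )
    return weights, means, variances
```

One possible starting point for the constraints would be to modify the E-step so that the responsibilities of constrained pairs are coupled, but that is left open here, as in the proposal.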