Advanced Natural Language Processing
Modeling Protein-Protein Interactions in
Biomedical Abstracts with Latent Dirichlet Allocation
A major goal in biomedical text processing is the automatic extraction of
protein interaction information. We can approach this task with a model based on
the 'topic' concept - where each topic corresponds to a different multinomial
distribution over our vocabulary. Sentences in biomedical abstracts can then be
generated by either the 'interaction' topic if they contain or discuss
interacting proteins or the 'background' topic otherwise. This model structure
can be represented with Latent Dirichlet Allocation (LDA). Some model
development has already been done outside of this class - this project will
consist of further model development and refinement, inference equation and
algorithm derivation, and experimental testing on a dataset of Escherichia coli
abstracts and known pairs of interacting proteins obtained from the Database of
Interacting Proteins (DIP) at UCLA.
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of
Machine Learning Research, 3:993-1022, January 2003.
Text summarization is a classic problem in natural language processing.
Seminal work in this domain was done as early as the 1950s. In recent
times, with the huge amount of information available electronically,
there is an increasing demand for automatic text summarization systems.
A lot of new approaches have been developed in order to adapt summaries
to user needs and to corpus characteristics. Based on the content, these
can differ in generating a 'generalized summary' of a document as
against a 'query-specific summary'. A summary may be a collection of
sentences carefully picked from the document or can be formed by
synthesizing new sentences representing the information in the
documents. Sentence extraction methods for summarization normally work
by scoring each sentence as a candidate to be part of summary, and then
selecting the highest scoring subset of sentences. Such statistical
techniques usually employ various features such as keywords, sentence
length, and indicative phrases. The linguistic approach to summarization
extracts phrases and lexical chains from the documents and fuses them
together with generative tools to produce a summary. In the past few
years, various techniques have been developed for multi-document and
multi-lingual summarization. One of the open problems in the text
summarization domain is the evaluation of summaries. A commonly employed
metric is to compare the automatically generated summaries to manually
Until now I have reviewed literature in the text summarization field. I
plan to apply a modified extractive summarization technique to generate
a domain-specific summary system. Due to the lack of standardization of
evaluation methodologies in this field, it is not clear at present how
to evaluate such a modified summarization system. With respect to
collecting training data, one possible approach is to obtain relevant
documents from the Internet (e.g. customer reviews). Also, standardized
data sets are available for certain conferences (e.g. Document
Understanding Conference) along with manually created summaries; however,
prior permission must be obtained to use this data.
Hierarchical Topic Models for Image Categorization
Image classification is an essential part of digital image analysis. It is a
fundamental and challenging task, and much research effort has been devoted to
it. With state-of-the-art classification techniques such as SVM,
ever-improving classification performance has been reported [1,2]. From the
perspective of image processing, different approaches have been tried, from
simple ones such as bag-of-patches, bag-of-features , and the pyramid match kernel [4, 5], to
more sophisticated methods such as the constellation model .
However, current approaches to image categorization have the following limitations:
(1) They are unable to leverage less expensive unlabeled data, and the available labeled
datasets are usually small.
(2) They pay little attention to non-data-driven knowledge, such as domain-specific
knowledge or general knowledge.
(3) It is unclear whether performance improvements come from better
classification methods or from better understanding/representation of image data.
Both generative and discriminative classification paradigms have been tried in
image categorization tasks. Although discriminative methods such as
SVM currently outperform generative models empirically, the better intuition of the
latter is still appealing. One framework for supervised learning called the topic
model was proposed [7,8], and has since shown interesting results and
potential in areas such as text clustering and the analysis of NIPS abstracts. A similar
approach was also employed in unsupervised learning . The framework has also
been used in image categorization tasks .
In this project, we propose a hierarchical topic model for image categorization.
In this context, one assumption underlying this approach is that there exists a
topic hierarchy from image data. For example, the image class "apple" shares
some latent topics with the class "orange", and these topics also have image
manifestations (table as background, size, shape, or something we don't know
A simple representation of the structure of the topic hierarchy would be a tree,
with leaf nodes denoting image classes fed with training images and internal
nodes symbolizing latent classes. Each image can be viewed as a "bag of patches"
or a feature vector by some image processing methods.
From a generative model point of view, each patch of an image is generated as follows:
(1) Generate the class, ~multinomial(P(c));
(2) Generate a single latent class from the path starting from the class and up
to the tree root, ~multinomial(P(a|c));
(3) Generate a patch from the latent class, ~multinomial(P(w|a)).
The patches of an image are generated position by position.
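The three generative steps above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the toy tree (root, fruit, apple, orange), the patch codewords, and all probability values are made up for the example, and the class is drawn once per image, which is one possible reading of step (1).

```python
import random

# Hypothetical toy hierarchy: leaves are image classes, internal nodes are latent classes.
PARENT = {"apple": "fruit", "orange": "fruit", "fruit": "root", "root": None}

def path_to_root(c):
    """Return the nodes on the path from class c up to the tree root."""
    path = []
    node = c
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

# Assumed parameters; each distribution sums to 1.
P_c = {"apple": 0.5, "orange": 0.5}
P_a_given_c = {
    "apple": {"apple": 0.6, "fruit": 0.3, "root": 0.1},
    "orange": {"orange": 0.6, "fruit": 0.3, "root": 0.1},
}
P_w_given_a = {
    "apple": {"red": 0.8, "round": 0.1, "table": 0.1},
    "orange": {"peel": 0.8, "round": 0.1, "table": 0.1},
    "fruit": {"round": 0.9, "table": 0.1},
    "root": {"table": 1.0},
}

def draw(dist, rng):
    """Draw one outcome from a dict mapping outcome -> probability."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

def generate_image(n_patches, rng):
    """Generate one image as a class label plus a bag of patches."""
    c = draw(P_c, rng)                      # step (1): class
    patches = []
    for _ in range(n_patches):              # position by position
        a = draw(P_a_given_c[c], rng)       # step (2): latent class on the path of c
        w = draw(P_w_given_a[a], rng)       # step (3): patch from the latent class
        patches.append(w)
    return c, patches
```

Calling `generate_image(5, random.Random(0))` returns a class label and five patch codewords, each drawn from a node on the path between that class and the root.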
This is the starting version of our generative hierarchical topic model, which
makes two assumptions:
(1) The model is essentially a mixture of multinomials;
(2) Each class has the same mixture proportions to generate the patch positions
belonging to that class.
The model parameters will be learned by an EM algorithm, which can be outlined as follows:
(1) Initialize parameters as discrete uniform distributions, but satisfying the
multinomial constraint (sum(P)=1 for each distribution). This initialization
sacrifices random restart mechanisms. The next version would initialize
parameters from some Dirichlet random distribution generator.
(2) Iterate until convergence:
(a) E-Step: Compute the expectation Q_k(ij) using the parameters estimated in
the previous iteration. Here k denotes a latent class and ij is the
patch position j in d_i;
(b) M-Step: Re-estimate the parameters P(c), P(a|c), and P(w|a) by maximization.
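A minimal Python sketch of this EM outline is given below. It is a reading of the outline and not the project's actual derivation. The function name, the data layout (a list of (class, patches) pairs), and the toy tree in the usage note are assumptions for illustration. Because the class of each training image is observed, P(c) is estimated here in closed form as the fraction of patches per class, which is also an assumption.

```python
from collections import defaultdict

def em_hierarchical(images, parent, n_iter=20):
    """images: list of (class_label, list_of_patch_words).
    parent: dict mapping each tree node to its parent (None for the root)."""
    def path(c):
        nodes = []
        while c is not None:
            nodes.append(c)
            c = parent[c]
        return nodes

    classes = sorted({c for c, _ in images})
    vocab = sorted({w for _, ws in images for w in ws})
    nodes = sorted({a for c in classes for a in path(c)})

    # Step (1): uniform initialization; every distribution sums to 1.
    p_a_c = {c: {a: 1.0 / len(path(c)) for a in path(c)} for c in classes}
    p_w_a = {a: {w: 1.0 / len(vocab) for w in vocab} for a in nodes}

    # Assumed closed-form estimate of P(c): fraction of patches in each class.
    total = sum(len(ws) for _, ws in images)
    p_c = {c: sum(len(ws) for ci, ws in images if ci == c) / total for c in classes}

    for _ in range(n_iter):
        count_a_c = {c: defaultdict(float) for c in classes}
        count_w_a = {a: defaultdict(float) for a in nodes}
        for c, ws in images:
            for w in ws:
                # E-step: posterior over the latent classes on the path of c.
                scores = {a: p_a_c[c][a] * p_w_a[a][w] for a in path(c)}
                z = sum(scores.values())
                for a, s in scores.items():
                    q = s / z
                    count_a_c[c][a] += q
                    count_w_a[a][w] += q
        # M-step: renormalize the expected counts.
        for c in classes:
            z = sum(count_a_c[c].values())
            p_a_c[c] = {a: count_a_c[c][a] / z for a in path(c)}
        for a in nodes:
            z = sum(count_w_a[a].values())
            p_w_a[a] = {w: count_w_a[a][w] / z for w in vocab}
    return p_c, p_a_c, p_w_a
```

For example, with `parent = {"apple": "fruit", "orange": "fruit", "fruit": "root", "root": None}` and a few labeled toy images, the leaf node for "apple" never receives probability mass for words that occur only in "orange" images, while the shared internal nodes pool evidence from both classes.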
This work is to address several problems in the current image categorization
(1) Lack of labeled image training data. Since image classification tasks
involve a large number of parameters, it would be valuable for a model to leverage
information from neighboring data. For example, a generating latent class for
"apple" learns something from the "orange" images.
(2) Incorporate prior knowledge in image categorization. The hierarchical latent
model is a way to achieve this. It can be obtained from external knowledge, such
The learned model shall be empirically compared with conventional flat
supervised classification (to support advantage 1) and with models built on an
arbitrary knowledge topology (to support advantage 2). As for the data, we will
use the Caltech101 image data made available by Computational Vision at Caltech.
The prior knowledge to be incorporated will be extracted from WordNet as a
relevant condensed tree topology.
1. J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. "Local Features and
Kernels for Classification of Texture and Object Categories: An In-Depth Study."
International Journal of Computer Vision, submitted, 2005.
2. S. Lazebnik, C. Schmid, and J. Ponce. "A Maximum Entropy Framework for
Part-Based Texture and Object Recognition."
Proceedings of the IEEE International Conference on Computer Vision, Beijing,
China, October 2005, vol. 1, pp. 832-838.
3. L. Fei-Fei, R. Fergus, and P. Perona. "Learning Generative Visual Models from
Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object
Categories." CVPR 2004, Workshop on Generative-Model Based Vision, 2004.
4. K. Grauman and T. Darrell. "The Pyramid Match Kernel: Discriminative
Classification with Sets of Image Features." International Conference on Computer Vision
5. S. Lazebnik, C. Schmid, and J. Ponce. "Semi-Local Affine Parts for Object
Recognition." Proceedings of the British Machine Vision Conference, Kingston,
UK, September 2004, vol. 2, pp. 959-968.
6. L. Fei-Fei and P. Perona. "A Bayesian Hierarchical Model for Learning Natural
Scene Categories." CVPR 2005.
7. T. Hofmann. "Probabilistic latent semantic indexing." Proceedings of the
Twenty-Second Annual International SIGIR Conference, 1999.
8. D. Blei, A. Ng, and M. Jordan. "Latent Dirichlet allocation." Journal of
Machine Learning Research, 3:993-1022, January 2003.
9. T. Hofmann, "The Cluster-Abstraction Model: Unsupervised Learning of Topic
Hierarchies from Text Data."
Proc. of the IJCAI'99.
1) Using Learning Techniques for Anaphora Resolution and Relation Extraction
Problem: Conditional Random Fields and HMMs give good results for Noun
Coreference and Relation Extraction. I intend to explore semi-supervised
learning techniques that would learn to resolve multiple coreferences to
entities as well as learn relations between entities. Further, active
learning could be used to obtain more labeled data.
Related Work: A maximum entropy approach to extract relations using
lexical, syntactic and semantic features is presented in . Several
conditional models are presented in  that apply to coreference
resolution. A simple language model that looks at a fixed size window of
n words to define context is used in  for modeling entities.
Datasets: A co-reference annotated corpus is available at
http://clg.wlv.ac.uk/resources/corefann.php. It contains a listing of
annotated technical manuals. Further, a small portion of the "7
sectors" dataset, tagged with classes, is publicly available. There
are also a few open source software packages available that perform some
basic tasks like tokenization, sentence extraction, and rudimentary
named entity extraction.
Issues: Evaluation will be difficult because many of the previous papers
have used commercial software and commercial data sets that are not
available to us. However, there are some other data sets/tools that are
freely available, which have been used in slightly older papers.
 Andrew McCallum. Conditional Models of Identity Uncertainty with
Application to Noun Coreference
 Nanda Kambhatla, "Combining Lexical, Syntactic, and Semantic
Features with Maximum Entropy Models for Extracting Relations"
 Hema Raghavan, James Allan and Andrew McCallum. An Exploration of
Entity Models, Collective Classification
and Relation Description. KDD '04.
2) Disambiguating People and other Entities
Goal: The problem is to automatically identify the appearance of
different people or entities in a given discourse. For example, consider
the WWW. When we issue a query containing a name looking for some
person, it is possible that we get pages that do not refer to the person
we are looking for. The goal is to automatically distinguish multiple
occurrences of the same person from those of other persons.
Method: In , the authors used a couple of clustering algorithms
and leveraged the link structure of the social network formed by
different related people to solve the problem. I propose to associate
other entities that occur with the appearance of a person or the event
we are looking for, and do a co-occurrence analysis. The same entity or
similar entities are likely to occur in the same context. An approach
such as Latent Semantic Indexing will be useful for co-occurrence
analysis. A tagged data set containing 1085 webpages is available at the
webpages of the authors of .
 Ron Bekkerman and Andrew McCallum, "Disambiguating Web Appearance of
People in a Social Network",
My goal is to do a survey of classification methods. I will implement different
classification techniques, and test them to find out which ones run faster, and
which ones produce more accurate results for various sizes of training data. The
classifiers will be built in a generic way such that they can be easily used for
The classifiers I will implement are:
1. Generative Classifiers
a. Naive Bayes Classification: Both multivariate and
multinomial models will be implemented
2. Discriminative Classifiers
a. Nearest Neighbor Classifier: User will be able to write
their own similarity function, and change the value of k
b. Logistic Regression
c. Support Vector Machines: Tricks to deal with nonlinearity
will also be implemented.
The classifiers will be implemented in C++, and they will be based on the same
philosophy and design concepts of STL.
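As a rough illustration of the nearest neighbor classifier with a user-supplied similarity function and an adjustable k, here is a short sketch. It is written in Python for brevity, although the project itself targets C++. The function names and the toy 3x3 grids are assumptions made for the example.

```python
from collections import Counter

def knn_classify(train, query, similarity, k=3):
    """train: list of (features, label) pairs.
    Returns the majority label among the k examples most similar to query."""
    ranked = sorted(train, key=lambda ex: similarity(ex[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def overlap_similarity(a, b):
    """Example similarity: number of cells on which two equally sized binary grids agree."""
    return sum(1 for x, y in zip(a, b) if x == y)

# Toy 3x3 grids flattened row by row (hypothetical data).
train = [
    ([0, 1, 0, 0, 1, 0, 0, 1, 0], "1"),
    ([0, 1, 0, 0, 1, 0, 0, 1, 1], "1"),
    ([1, 1, 1, 1, 0, 1, 1, 1, 1], "0"),
    ([1, 1, 1, 1, 0, 1, 1, 1, 0], "0"),
]
query = [1, 1, 0, 0, 1, 0, 0, 1, 0]
predicted = knn_classify(train, query, overlap_similarity, k=3)
```

Passing the similarity as an ordinary function argument mirrors the idea of letting the user plug in their own measure, much as STL algorithms accept a comparison functor.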
I will test the classifiers on the problem of recognizing handwritten digits on
a grid. The size of the grid will be fixed. The user will be
able to specify the training set, and then ask the classifiers to classify a new
symbol. I will test the classifiers on a scarce training set and on an extensive
one. For each case, the accuracy and running time of each classifier will be
recorded and a comparison will be made among all classifiers.
Noun-Verb Pair Based Text Classification
A human extracts context from a sentence using the subject (noun)
and the action (verb) of the sentence. Two text corpora can be
considered similar if they contain many sentences that share the same
noun-verb pairs. Using the LINK software suite to perform a standard
tree parsing of a sentence, the nouns and associated verbs of a sentence
will be found. WordNet will then be used to find synonyms of the verbs
to increase the possibility of a match. Classification using this
technique will be compared against the bigram model and a combination
technique using weighted noun-verb matches and the bigram model.
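The matching idea can be sketched as follows, assuming the (noun, verb) pairs have already been extracted by a parser. The synonym table is a hypothetical stand-in for a WordNet lookup, and Jaccard overlap is one possible way to score shared pairs, not necessarily the measure the project will use.

```python
def normalize_pairs(pairs, verb_synonyms):
    """Map each verb to a canonical synonym so that near-equivalent verbs match."""
    return {(noun, verb_synonyms.get(verb, verb)) for noun, verb in pairs}

def pair_similarity(pairs_a, pairs_b, verb_synonyms):
    """Jaccard overlap between two collections of (noun, verb) pairs."""
    a = normalize_pairs(pairs_a, verb_synonyms)
    b = normalize_pairs(pairs_b, verb_synonyms)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical pre-extracted pairs and synonym table.
synonyms = {"purchase": "buy"}
doc_a = [("customer", "buy"), ("dog", "bark")]
doc_b = [("customer", "purchase"), ("cat", "sleep")]
score = pair_similarity(doc_a, doc_b, synonyms)
```

Here the two texts share one pair after "purchase" is mapped to "buy", out of three distinct pairs overall.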
After more investigation the intoxicated speech recognition idea has been
scrapped. However, after extensive reading I have found several ideas in the
area of speech synthesis or text-to-speech. The first idea is text
normalization. Nonstandard tokens should be converted to words. For example, a
period can occur in sentences from abbreviated words or for the end of the
sentence. Abbreviated words should be stated in full form by the speech
synthesizer. As another example, a string of numbers could be a phone number, date,
dollar amount, etc. The synthesizer should state the numbers differently
depending on their type. Machine learning techniques
can be used in this conversion from tokens to words . The second idea is
homograph disambiguation. A homograph is a word that can be pronounced
differently depending on the context and usage. For example, "read" has two
different pronunciations depending on the part of speech. Using only part of
speech to determine homograph pronunciation is not perfect for words such as
"wind" and "bass", where context must be taken into consideration. A speech
synthesizer needs to know the correct pronunciation of a word depending on the
context and part of speech . The third idea is grapheme-to-phoneme
conversion or letter-to-sound rules. This process converts the words into phones
that are stated by the synthesizer. A dictionary or lexicon is often used for
this process, but not every word is in the lexicon. Proper nouns, foreign words,
and slang are the most common words not found. For these words, rules must be
used to convert the word to a set of phones. The rules are generated using machine
learning techniques such as decision trees and
expectation-maximization . With any of these projects, I plan to integrate
their usage into Festival, an open source speech synthesizer .
 Sproat, R., Black, A., Chen, S., Kumar, S., Ostendorf, M., and Richards,
C. 2001. Normalization of Non-standard Words, Computer Speech and Language.
 Yarowsky, D. "Homograph Disambiguation in Speech Synthesis." In J. van
Santen, R. Sproat, J. Olive and J. Hirschberg (eds.), Progress in Speech
Synthesis. Springer-Verlag, pp. 159-175
 Black, A., Lenzo, K. and Pagel, V. (1998) Issues in Building General
Letter to Sound Rules, 3rd ESCA Workshop on Speech Synthesis, pp. 77-80,
Jenolan Caves, Australia.
Jurgen Van Gael
Cross Language News Clustering
News clustering is done at various portal sites such as news.google.com. The
clusters at these sites are single-language clusters and thus miss out on a
possibly valuable feature. My project will address this issue: it will use data
currently available (from news.google.com) and allow users to find articles
in several different languages on the same news event.
I propose 4 distinct parts of the project:
1. Gather useful data from news.google.com
2. Find at least one model that clusters articles across different languages
3. Implement the suggested models
4. Analyze the performance of the different models
Extra [5. Apply the technology]
1. Has been partly implemented and seems to gather a large amount of data on a
2. Will be the main part of the project and involve studying known techniques
from the literature.
3. In order to make user benchmarking possible we will probably implement a web
based interface for the project.
4. Since we do not have any existing examples of cross language clusters
available, it seems that we will have to resort to asking people to evaluate the
5. Depending on time, we might implement features such as one that reports the
relative importance of news events across different regions in the world and
possibly other ideas we come across along the way.
For my project, I plan to investigate semi-supervised learning
approaches to sentiment classification/analysis. This task is like
traditional text classification, except instead of predicting the
topic of a particular piece of text, the goal is to predict whether
the text conveys a positive or negative opinion. Practical
applications of this include automatic summarization of product
reviews, interpreting free text survey responses, or evaluating the
opinions of documents retrieved by a search engine. This is a
difficult task--while topical classification can easily rely on
certain keywords that indicate the topic at hand, opinions can be
much more subtly expressed. In movie reviews (the domain I plan to
investigate most), for example, authors tend to use sarcasm or
rhetorical questions, which may contain words that carry the opposite
sentiment to what they really intend. Movie reviews also tend
to exhibit other difficult-to-learn phenomena such as "thwarted
expectations" (Pang et al., 2002), in which the reviewer deliberately
sets up a contrast between what they expected from a movie and what
they actually thought of it. This has a tendency to introduce many
words or phrases with sentiment opposite to that of the review as a
whole. Thus, making sentiment classification decisions based purely
on word occurrence/frequency could be quite difficult. A more complex
model might be necessary, which takes into account patterns of
positive and negative word usage throughout the course of the review.
In addition to predicting positive or negative, or 0 through 4 stars,
a common goal in this area is to predict a real number rating (say
from 0 to 1), in which case the task is one of regression rather than classification.
Most work in this area has used supervised learning approaches (among
others: Pang et al., 2002; Pang and Lee, 2004; Pang and Lee, 2005)
that require all training examples to have explicit labels (usually
in the form of a number of stars for a movie or product review).
However, this is one of many areas where it would be extremely useful
to classify and make predictions without a large labeled training
corpus. Suppose we have a huge collection of movie reviews without
explicit labels. We should be able to exploit similarities between
unlabeled and labeled examples to build a classifier that is more
accurate than one based on the labeled examples alone. This is the
idea behind semi-supervised learning (SSL). A key component of graph-
based SSL methods is determining the similarity between examples.
With this knowledge, we can try to build a classifier such that it
assigns similar labels/ratings to similar examples. For example,
imagine we have review A with a known 4 star rating. Review B is
similar to A in terms of its features, but it does not have a known
label. Now suppose there's a new test example review C. It is very
similar to B, but not that similar to A. A classifier trained on A
alone might misclassify this, but an SSL approach would be able to
predict C more accurately based on the fact that it's similar to B,
which in turn is similar to A. This is the same as saying that the
decision surface over the space of examples is smooth.
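The A, B, C example can be made concrete with a small label propagation sketch. This is one simple graph-based scheme, chosen for illustration and not necessarily the method the project will adopt. The similarity weights, and the extra labeled 1-star review D, are assumptions added so that the outcome is not trivial.

```python
def propagate_labels(weights, labeled, n_iter=100):
    """weights: dict mapping frozenset({u, v}) -> similarity between reviews u and v.
    labeled: dict mapping a review to its known rating.
    Unlabeled reviews repeatedly take the similarity-weighted mean of their neighbors."""
    nodes = sorted({n for edge in weights for n in edge})
    mean = sum(labeled.values()) / len(labeled)
    f = {n: labeled.get(n, mean) for n in nodes}
    for _ in range(n_iter):
        new_f = {}
        for n in nodes:
            if n in labeled:
                new_f[n] = labeled[n]      # known ratings stay clamped
                continue
            num = den = 0.0
            for edge, w in weights.items():
                if n in edge:
                    (other,) = edge - {n}
                    num += w * f[other]
                    den += w
            new_f[n] = num / den if den > 0 else f[n]
        f = new_f
    return f

# Hypothetical similarities: C is close to B, B is close to A, C is far from A.
weights = {
    frozenset({"A", "B"}): 0.9,
    frozenset({"B", "C"}): 0.9,
    frozenset({"A", "C"}): 0.1,
    frozenset({"D", "B"}): 0.1,
    frozenset({"D", "C"}): 0.2,
}
ratings = propagate_labels(weights, {"A": 4.0, "D": 1.0})
```

In this toy graph the predicted rating for C ends up much closer to A's 4 stars than to D's 1 star, even though C is only weakly similar to A directly, because the information flows through B.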
I plan to study how effectively semi-supervised learning can be
applied to the movie review dataset found at
http://www.cs.cornell.edu/people/pabo/movie-review-data/, comparing my
results with the supervised approaches of Pang and Lee (2005). In
this work, the authors tried to exploit similarities between reviews
(based on the percentage of positive sentences each contained) to
enforce the constraint that reviews similar in content should wind up
being predicted to have similar ratings, but they only looked for
similar reviews among the labeled training examples. I plan to
explore new methods of comparing two movie reviews (or any
opinionated text) and try out several theoretical graph-based semi-
supervised learning approaches that exploit similarities between the
test examples and all other examples (both among the training and
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan, Thumbs up?
Sentiment Classification using Machine Learning Techniques,
Proceedings of EMNLP 2002.
Bo Pang and Lillian Lee, A Sentimental Education: Sentiment Analysis
Using Subjectivity Summarization Based on Minimum Cuts, Proceedings
of ACL 2004.
Bo Pang and Lillian Lee, Seeing stars: Exploiting class relationships
for sentiment categorization with respect to rating scales,
Proceedings of ACL 2005.
In question-answering applications, the answer to a question is usually extracted
using prespecified text patterns. For example, for the question "When was [person]
born?", typical answers would be "Abraham Lincoln was born in 1809." or
"Gandhi (1869-1948)". Thus, one can develop regular expressions that
detect these patterns in text. In this case we would need
"[Name] was born in [Birthdate]" and "[Name] ([Birthdate]-". These phrase patterns
are referred to as surface patterns. There are many possible surface patterns
for any given question. Thus, it would be nearly impossible to find all such
patterns manually; the regular expressions have to be learned from the web.
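The two hand-written surface patterns mentioned above could look roughly like this as Python regular expressions. The name pattern (capitalized words) and the four-digit year are simplifying assumptions made for the example; learned patterns would be far more varied.

```python
import re

# Simplified name pattern: one or more capitalized words (an assumption).
NAME = r"(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"

# Surface patterns for the question "When was [person] born?".
BIRTH_PATTERNS = [
    re.compile(NAME + r" was born in (?P<year>\d{4})"),   # "[Name] was born in [Birthdate]"
    re.compile(NAME + r" \((?P<year>\d{4})-"),            # "[Name] ([Birthdate]-"
]

def find_birth_years(text):
    """Return (name, year) pairs matched by any of the surface patterns."""
    results = []
    for pattern in BIRTH_PATTERNS:
        for m in pattern.finditer(text):
            results.append((m.group("name"), int(m.group("year"))))
    return results

answers = find_birth_years(
    "Abraham Lincoln was born in 1809. Gandhi (1869-1948) led the movement."
)
```

Each pattern is tried over the whole text, so the result lists matches pattern by pattern, not in reading order.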
I would like to apply an unsupervised learning setting to this problem, although
I am still searching the literature to find out the current state
of unsupervised learning methods in QA applications.
Deepak Ravichandran and Eduard Hovy, Learning Surface Text Patterns for a Question
Answering System. Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics (ACL), Philadelphia, July 2002, pp. 41-47.
1) Disambiguation for cross-language IR
The higher level problem is to be able to retrieve relevant documents
from a target language when the query is given in a source
language. The chief problem here is that of words having multiple
senses. The usual ways of dealing with this are a) direct translation
of the query to the target language, and b) translation of documents from
the target to the source language followed by monolingual retrieval. The
way I propose to address this issue is by using relevant documents in
the source language to perform sense disambiguation on the query and,
once the sense is known, to automatically translate the
keywords into the target language and then perform IR in the target
language. This would require query-relevant corpora in two languages
(need not be parallel) and a bi-lingual dictionary.
2) Identifying and characterizing the protagonist in a book
Given a book, the aim is to first find out the chief character in the
book. This can be done using Named Entity Recognition. Then, using
the frequencies of named entities, a reliable estimate of the protagonist's
name can be obtained. The second step would be to find the
adjectives in the text that refer to him/her and the final step would
be to categorize the character of the protagonist into some predefined
classes based on the adjectives. The first step can be performed
automatically. The final step requires some user-tagging of adjectives
into a few classes. The referring adjectives can probably be obtained
using some POS analysis.
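The frequency-based first step and the adjective-based final step can be sketched as below, assuming the named entity mentions and the adjectives referring to the protagonist have already been extracted. The character names, adjectives, and class labels are hypothetical examples.

```python
from collections import Counter

def find_protagonist(entity_mentions):
    """Return the most frequently mentioned named entity."""
    counts = Counter(entity_mentions)
    return counts.most_common(1)[0][0]

def characterize(adjectives, adjective_classes):
    """Pick the class whose user-tagged adjectives occur most often.
    Adjectives without a user tag are ignored."""
    votes = Counter(adjective_classes[a] for a in adjectives if a in adjective_classes)
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical NER output and user-tagged adjective classes.
mentions = ["Alice", "Bob", "Alice", "Carol", "Alice"]
tagged = {"kind": "good", "brave": "heroic", "cruel": "evil"}
hero = find_protagonist(mentions)
character = characterize(["kind", "brave", "cruel", "brave", "tall"], tagged)
```

A real system would also need to merge different mentions of the same character (for example a first name and a full name) before counting.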
1) Metaphor Identification
Similes and metaphors are figures of speech that are prevalent in plays, dramas, etc. A simile is a figure of speech in which two essentially unlike things are compared, often in a phrase introduced by "like" or "as". For example, "He fought like a lion in the battle" describes the fighting ability of the person. Identifying similes in text documents can be done by identifying the occurrence of the words "like" or "as" in the sentences. A metaphor is also a figure of speech, in which an expression is used to refer to something that it does not literally denote in order to suggest a similarity. But unlike a simile, a metaphor is not marked by any explicit words. This makes identifying metaphors in documents an interesting problem. Identifying a metaphor gives a better understanding of the sentence and is useful for correctly translating documents from one language to another. An example of a metaphor is the sentence "He was a lion in the battle". This sentence attributes the bravery (quality) of a lion to the person. Metaphor identification will involve learning the qualities first and then using the context of the sentence and the parts of speech to identify the correct quality that is referred to.
2) Novel Type Classifier
Given a novel, the system should classify it as a "tragedy", "comedy", etc. A related problem is document summarization. Automatic summarization is the creation of a shortened version of a text by a computer program. The summary should contain the most important points of the original text, which together form the plot. The novel classifier should be aware of the plot of the story in order to correctly identify the type. At first thought, it looks as though the words used in the novel can be an indicator of the type. But identifying the meaning, emotions, and tone that are useful for type classification could be hard. I'm interested in studying how effective statistical modeling techniques would be in this case. It would involve identifying the key features that help in understanding the overall plot and then using them for classification.
Natural language generation (NLG) is the process of taking some structured
representation of information and producing natural language. Generally speaking,
there are three types of sentence realization: "canned" text, template, and
agent based. Canned text is the most basic, where predetermined output can be
triggered by predetermined input. Templates are forms or various other
part-of-speech mechanisms into which information from the user is inserted.
Agent based systems try to form some sort of representation of the user's
intentions, as presented in a scenario such as a dialog.
In this project I would like to create a system that would produce a paraphrased
summary or comment on a set of news articles on a particular topic. This is
similar to a summarization task, except the system will be formulating its own
sentences. I feel that a template-based approach to the task will be best, given
the one-way nature of the problem and the time-constraints involved. In
addition, I would like to try using some probabilistic preprocessing steps, such
as using word frequencies to determine words relevant to the topic. I am looking
to find a separate template-based generator that I can use for developing the
templates (YAG http://tigger.cs.uwm.edu/~nlkrrg/yag/yaghome.html is the most
promising). I hope that this is enough structure to produce a
system that will generate understandable and interesting output.
Channarukul S. (1999) YAG: A Template-Based Natural Language Generator for Real
Time Systems. Master's Thesis, Department of Electrical Engineering and Computer
Science, University of Wisconsin-Milwaukee.
Human Motion Recognition
Human motion recognition has been a hot topic in areas such as surveillance
and human-machine interaction. Most research efforts focus on video-based
motion recognition. The recognition can usually be divided into two processes:
extracting motion information from videos and recognizing motion using the
extracted information. In this project, we focus on recognizing motion. That
is to say, we use existing 3D human motion data as the training dataset and
testing dataset. Our method can also be well applied to motion information
extracted from videos. Besides, our method can be used in motion data
retrieval and example based motion synthesis.
The basic idea of this project is to train motion classifiers using an existing
motion capture data set, and use another data set to examine the performance
of the classifiers. To train good motion classifiers, several issues must be
1. Find a machine learning algorithm that is suitable for multivariate
2. Extract the important features from a motion.
3. Since motion is a high-dimensional long signal, efficiency is an
 Aphrodite Galata and Neil Johnson and David Hogg. Learning
variable-length Markov models of behavior, Computer Vision and Image
Understanding, vol. 81(3): 398--413.
 Liu, G., Zhang, J., Wang, W., and McMillan, L. 2005. A system for
analyzing and indexing human-motion databases. SIGMOD '05. ACM Press, New
York, NY, 924-926.
 Duong, T., Bui, H., Phung, D., Venkatesh, S. Activity recognition and
abnormality detection with the switching hidden semi-Markov model. CVPR
2005, pp. 20-25.
 Ben-Arie, J., Pandit, P., Rajaram, S. View-based human activity
recognition by indexing and sequencing. CVPR 2001, pp. 78-83.
 Hongeng, S., Bremond, F., Nevatia, R. Representation and optimal
recognition of human activities, CVPR 2000, pp. 818 - 825.
 Yacoob, Y. and Black, M. Parameterized modeling and recognition of
activities, ICCV 1998, pp. 120 - 127.
 Nguyen, N., Phung, D., Venkatesh, S. and Bui, H. Learning and detecting
activities from movement trajectories using the hierarchical hidden Markov
model. CVPR 2005, pp. 955 - 960.
 Niu, F. and Abdel-Mottaleb, M. HMM-Based Segmentation and Recognition of
Human Activities from Video Sequences, ICME 2005, pp. 804 - 807.
Soil information retrieval from multi-temporal remote sensing images
(This project will deal with soil mapping from remote sensing images using
statistical machine learning methods).
Soil is a very important natural resource, and detailed soil information is
necessary for land use and environmental modeling applications. Traditional
soil survey and soil mapping have been largely based on field observation and
manual delineation, but remote sensing offers the potential to detect soil
information without intensive field trips and at a much larger scale. Its basis
is that soil surface reflectance (given that the effect of vegetation cover and
surface roughness is minimized), as detected by satellite sensors, varies
with moisture content.
Previous Approaches and Motivation
A few studies have proposed using remote sensing images to detect soil
information (), and they are usually based on a single band (usually
a microwave band, which is most sensitive to soil moisture content) and on a
single snapshot (one image). This project is instead motivated by the temporal change
of soil moisture content (in response to weather changes) rather than a static
snapshot. The change in moisture content will be particularly prominent in the
few days after a rainfall event, when soils respond differently as they go from saturation
to a stable state, depending on soil texture and structure. Therefore, a series
of remote sensing images from a high temporal frequency (e.g. once a day)
satellite sensor will be chosen.
Data and Study Area
Two sensors, MODIS (Moderate Resolution Imaging Spectroradiometer) and
AMSR-E (Advanced Microwave Scanning Radiometer), both aboard the Earth Observing
System (EOS) satellite Aqua (daily temporal frequency), will be used. MODIS has 36
bands ranging from optical to thermal, but only relevant bands will be selected.
AMSR-E provides microwave coverage, which is most sensitive to soil moisture.
To minimize the effect of land cover, a prairie area will be chosen, and a series
of images will be gathered in late fall after a rainfall event.
Soil mapping thus becomes a classification problem. Each pixel has a feature
vector of reflectance values from several bands and from a series of time snapshots.
The pixel values across the feature dimensions constitute the "signature" of the pixel,
which locates it in the feature space. The task is to classify the pixels in this
feature space. Various statistical learning methods are possible. One currently
under consideration is density estimation using Gaussian mixture models (GMMs). There are
two possible approaches to training the GMMs: supervised and unsupervised learning.
Unsupervised learning does not require training data (which would be expensive to gather);
however, because the clustering relies only on a distance metric, without guidance from
human knowledge, the resulting clusters may not correspond to meaningful soil classes.
Supervised training, by contrast, can be well guided, but a large amount of training data
is not available. Existing survey maps could be used as training data, but their accuracy
would need to be investigated. The feasibility of semi-supervised approaches (Zhu 2005)
can be explored in this project, to make full use of the limited number of labeled points
and the vast number of unlabeled points. GMMs can be fitted with the
Expectation Maximization (EM) algorithm given the training set. Other statistical methods
that can be investigated include neural networks and support vector machines.
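As a rough illustration of the semi-supervised idea, the sketch below fits one diagonal-covariance Gaussian per soil class with EM, keeping the class of labeled pixels fixed and re-estimating class memberships only for unlabeled pixels. It is a minimal sketch under several assumptions that are not part of the proposal: one Gaussian component per class, diagonal covariances, at least one labeled pixel per class, and made-up two-dimensional "signatures". The function names are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def posterior(x, weights, means, variances):
    """Class membership probabilities for one pixel (log-sum-exp for stability)."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]

def fit_semi_supervised_gmm(points, labels, n_classes, n_iter=30, min_var=1e-3):
    """EM with one Gaussian per class; labels[i] is a class index or None.

    Assumes every class has at least one labeled point, so no class is empty.
    """
    dim = len(points[0])
    # Labeled points get fixed one-hot memberships; unlabeled start uniform.
    resp = [[1.0 if lab == k else 0.0 for k in range(n_classes)]
            if lab is not None else [1.0 / n_classes] * n_classes
            for lab in labels]
    for _ in range(n_iter):
        # M-step: weighted mixing proportions, means and variances per class.
        weights, means, variances = [], [], []
        for k in range(n_classes):
            nk = sum(r[k] for r in resp)
            mean = [sum(r[k] * p[d] for r, p in zip(resp, points)) / nk
                    for d in range(dim)]
            var = [max(min_var,
                       sum(r[k] * (p[d] - mean[d]) ** 2
                           for r, p in zip(resp, points)) / nk)
                   for d in range(dim)]
            weights.append(nk / len(points))
            means.append(mean)
            variances.append(var)
        # E-step: update memberships of unlabeled points only.
        resp = [r if lab is not None else posterior(p, weights, means, variances)
                for r, p, lab in zip(resp, points, labels)]
    return weights, means, variances

# Toy signatures (invented values): two labeled pixels, four unlabeled ones.
pixels = [(0.10, 0.10), (0.80, 0.90),
          (0.12, 0.08), (0.09, 0.11), (0.82, 0.88), (0.79, 0.91)]
labels = [0, 1, None, None, None, None]
weights, means, variances = fit_semi_supervised_gmm(pixels, labels, n_classes=2)
predicted = [max(range(2), key=lambda k: posterior(p, weights, means, variances)[k])
             for p in pixels]
```

In a real study the feature vectors would hold one value per band and acquisition date, and several components per class or full covariances might be needed.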
The expected result of this study is to find a feasible method for large scale soil
mapping with remote sensing images, in low relief and vegetation-sparse areas.
 Muller, E. and H. Decamps. 2000. Modeling soil moisture-reflectance. In Remote Sensing
of Environment. 76, pp. 173-180.
 Wigneron, J.P., J.C. Calvet, T. Pellarin, A.A. Van de Griend, M. Berger and P. Ferrazzoli.
2002. Retrieving near-surface soil moisture from microwave radiometric observations: current
status and future plans. In Remote Sensing of Environment. 85, pp. 489-505.
 Zribi, M., S.L. Hegarat-Mascle, C. Ottle, B. Kammoun and C. Guerin. 2003. Surface soil
moisture estimation from the synergistic use of the (multi-incidence and multi-resolution)
active microwave ERS wind scatterometer and SAR data. In Remote Sensing of Environment. 86,
 Zhu, X.J. 2005. Semi-Supervised learning literature survey. at:
Gene regulation inference from biological literature
A large number of papers are published daily in research
journals. Researchers can read only a small portion of the literature in
their field. Successful findings sometimes rely heavily on the quantity
and quality of the literature a researcher has accessed. Many
discoveries about gene regulatory networks have come from knowledge gained
from the literature. However, extracting biological information faces
several difficulties: 1) biological literature is written in natural
language, so its content can be understood only if a researcher reads
papers one by one; 2) a correlation found in the literature can only be
verified by researchers who read the paper; 3) the number of papers
that need to be read quickly exceeds the practical capacity of researchers.
Thus automated information extraction from biological publications is
I propose to use PubMed abstracts to infer the proteins that regulate a
given set of genes. Two machine learning approaches will be used:
Support Vector Machine (SVM) and Bond Energy Algorithm (BEA)
partitioning. Each was previously described in a separate paper and was
found to be superior to other common methods. Both algorithms will be
implemented and tested on well-characterized gene sets, and their
performance will be compared.
Text classification has been at the heart of NLP and research in text
classification has been going on with undeterred enthusiasm. Here, I list
a couple of approaches to text classification and feature reduction.
One particular area of interest is "cross language text classification".
The problem statement is as follows: Given labeled data in a resource-rich
language L1, and unlabeled data in another language L2 (assume we have
access to a comparable corpus), the task is to classify documents in
language L2. A few approaches in the literature use machine translation
to achieve this. But, with the lack of bilingual lexicons, this
becomes a difficult task. One approach would be to compare the documents in
both languages in a "latent semantic space". A first step would be to
cluster the documents in L2 using a clustering algorithm like pLSA. Once
this is achieved, we can develop a language model for each cluster. A
similar task can be done for documents in L1 (clustering is trivial since
we know the labels). The clusters can then be projected onto a latent
semantic space and with the help of the cosine similarity, we can identify
the labels for documents in L2. Note that this initial assignment might
not be very accurate, and hence we need to refine it with an EM algorithm. An
advantage of this approach is that it avoids translating the documents.
This approach belongs to the class of semi-supervised learning algorithms.
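The label-assignment step described above can be sketched as follows. This is only an illustration under a strong assumption: the L1 class vectors and L2 cluster vectors are taken to be already projected into one shared latent space (how that space is built from the comparable corpus is left out). The function names and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def assign_labels(l2_clusters, l1_classes):
    """Give each L2 cluster the label of the most similar L1 class vector."""
    return {cluster_id: max(l1_classes,
                            key=lambda name: cosine(vec, l1_classes[name]))
            for cluster_id, vec in l2_clusters.items()}

# Invented latent-space vectors, for illustration only.
l1_classes = {"sports": [0.9, 0.1, 0.0], "politics": [0.1, 0.8, 0.2]}
l2_clusters = {0: [0.2, 0.7, 0.1], 1: [0.8, 0.2, 0.1]}
initial_labels = assign_labels(l2_clusters, l1_classes)
```

These initial labels would then serve as the starting point for the EM refinement.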
Text classification can be highly painful if the number of features (word
types) that you must process is large. Many of these features may play no
role in classification at all. Techniques in the literature use metrics
such as odds ratio and information gain to reduce the number of features. Here,
I envisage a "latent semantic analysis approach" to feature reduction.
Instead of the conventional term-document matrix, define a "term-class"
matrix where a cell indicates the likelihood of a word appearing in a
given class. We can project this onto a semantic space using SVD. With the
help of similarity metrics, we can identify whether a "word type is
important to a class or not". Insignificant word types can then be
This outlines an alternative approach to the above problem. We can use
a Maximum Entropy approach to feature reduction. Define a feature
associated with every word as the sum of its likelihood estimates for each
class weighted by the corresponding class priors. Then, we can estimate
p(w). If p(w) is above a defined threshold, we keep the word; otherwise we
discard it. Both these approaches let us recover a term-document
representation for the data. Feature reduction methods in the literature
have been biased towards specific classifiers. It will be interesting to
compare various classifiers trained on data subjected to both forms of
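The prior-weighted sum described in this approach, p(w) = sum over classes c of P(w|c) P(c), can be computed directly from labeled documents. The sketch below does only that and then applies the threshold; it does not attempt any Maximum Entropy estimation. It assumes maximum-likelihood estimates without smoothing, and the function names and toy documents are hypothetical.

```python
from collections import Counter

def word_marginals(docs):
    """docs is a list of (class_label, tokens) pairs. Returns {word: p(w)}."""
    docs_per_class = Counter(label for label, _ in docs)
    counts_per_class = {}
    for label, tokens in docs:
        counts_per_class.setdefault(label, Counter()).update(tokens)
    p_w = Counter()
    for label, counts in counts_per_class.items():
        prior = docs_per_class[label] / len(docs)   # P(c), from document counts
        total = sum(counts.values())                # tokens observed in class c
        for word, n in counts.items():
            p_w[word] += prior * n / total          # P(c) * P(w|c)
    return dict(p_w)

def select_features(docs, threshold):
    """Keep the word types whose p(w) reaches the threshold."""
    return {w for w, p in word_marginals(docs).items() if p >= threshold}

# Invented toy corpus, for illustration only.
docs = [("gene", ["protein", "binds", "the"]),
        ("gene", ["the", "protein"]),
        ("sport", ["the", "match", "ends", "the"])]
kept = select_features(docs, threshold=0.1)
```

Note that a pure frequency threshold like this keeps very common words such as "the", so in practice it would likely be combined with a stop list or a class-sensitive metric.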
I was thinking about a project where I would compare traditional
information retrieval methods to language model based methods. It seems
that currently the state of the art is using methods based on
traditional IR methods, because the LM based methods do not scale to
Google-sized data sets. I think it would be interesting to see how LM
methods could be adapted to perform better on large datasets, and
additionally to see how classification accuracy compares between the
different methods. Clearly this is a large problem space, so it will
probably have to be further restricted to comparing a couple of specific
methods in order to fit a class-sized project.
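To make the comparison concrete, here is a minimal sketch of one representative from each family: a plain TF-IDF score and a query-likelihood language model with Dirichlet smoothing. The choice of these two scoring functions, the smoothing parameter mu, and the function names are assumptions for illustration, not methods fixed by the proposal.

```python
import math
from collections import Counter

def tfidf_score(query, doc, docs):
    """Sum of tf * log(N / df) over the query terms (a basic TF-IDF variant)."""
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)
        if df:
            score += tf[term] * math.log(len(docs) / df)
    return score

def query_likelihood(query, doc, docs, mu=10.0):
    """log P(query | doc) under a unigram model with Dirichlet smoothing."""
    tf = Counter(doc)
    collection = Counter(t for d in docs for t in d)
    collection_len = sum(collection.values())
    score = 0.0
    for term in query:
        p_coll = collection[term] / collection_len
        p = (tf[term] + mu * p_coll) / (len(doc) + mu)
        if p == 0.0:            # term never seen anywhere in the collection
            return float("-inf")
        score += math.log(p)
    return score
```

Both functions recompute collection statistics on every call, which keeps the sketch short; precomputing document frequencies and collection counts would be the first step toward the scaling question raised above.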
To be announced...
I would like to explore the use of Support Vector Machines on a Natural
Language Processing task. The task remains open-ended, but one idea
is to build a classifier that can take speech/text as input, convert
it to features, and then classify what language that speech/text is in.
This is useful for creating an automatic translator that does not need the
user to specify the language first. Phone systems could automatically
determine the user's language by classifying their speech. Another possibility
is a browser that automatically translates web pages into the user's
language. An interesting subtask is to apply this to proper names
such as locations, person names, and other proper nouns. This can be
used to automatically translate text based on a user's name or location.
I also was thinking about doing something like this in the area of
information extraction. Using a classifier to pull out key biological
relationships in a document for example, although I think this may be a
saturated area and SVMs may not be the optimal algorithm.
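For the language identification idea above, the "convert it to features" step for text input might look like the sketch below, which uses character n-gram relative frequencies. This feature choice is an assumption (the proposal does not name one), the function names are hypothetical, and the SVM itself is not shown: the resulting vectors would be passed to whichever SVM implementation is chosen.

```python
from collections import Counter

def char_ngram_features(text, n=3):
    """Relative frequencies of character n-grams, with one padding space per side."""
    padded = " " + text.lower() + " "
    grams = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    total = sum(grams.values())
    return {gram: count / total for gram, count in grams.items()}

def to_vector(features, vocabulary):
    """Fixed-length vector over a chosen n-gram vocabulary (missing n-grams are 0)."""
    return [features.get(gram, 0.0) for gram in vocabulary]

# Example: bigram features for a single word.
features = char_ngram_features("the", n=2)
vector = to_vector(features, ["th", "zz"])
```

Character n-grams need no tokenizer or dictionary, which suits short inputs such as proper names.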
Given a set of constraints over data (such as which pairs of instances belong
to the same or different clusters), people have used Hidden Markov Random Fields (HMRFs)
to provide a framework for incorporating supervision into prototype-based
clustering. I wonder if it is possible to use the EM algorithm in this
semi-supervised training instead. The missing information here would be some of
the labels, since there are both labeled and unlabeled data available. A
possible question is how I should incorporate the constraints into the EM
algorithm. I will look into some EM and semi-supervised learning papers for