Hidayath Ansari, Chaitanya Gokhale |
Positive-only Semi-supervised Classification |
In this report we discuss the task of binary classification
on a test set, given a training set consisting of a
large number of unlabeled examples and a handful of
examples belonging to one class. The task is part of the
UCSD Data Mining Contest 2008.
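The abstract does not describe the classifier used; a common baseline for positive-only (PU) learning is to rank unlabeled examples by similarity to the positive class. A minimal sketch of that idea, with hypothetical feature vectors:

```python
import math

def centroid(points):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(xs) / n for xs in zip(*points)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def score_unlabeled(positives, unlabeled):
    """Score each unlabeled example by similarity to the positive centroid;
    high-scoring examples are candidate positives."""
    c = centroid(positives)
    return [cosine(c, x) for x in unlabeled]

pos = [[1.0, 0.0], [0.9, 0.1]]            # the handful of labeled positives
unl = [[1.0, 0.05], [0.0, 1.0]]           # unlabeled pool
scores = score_unlabeled(pos, unl)
```

A threshold on these scores (or a classifier retrained on the highest- and lowest-scoring unlabeled examples) then yields the binary decision.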
|
|
Xiaoyong Chai |
Clustering Regular Expressions for Efficient Matching |
An information extraction system typically contains
hundreds of thousands of regular expressions to be matched
against text documents. Performing large-scale matching
efficiently is thus a challenging problem. In this project, I
attack the problem by clustering regular expressions, as a
way to reduce the number of document scans. A simple
heuristic-based iterative clustering algorithm is proposed.
Experiments with a real-world dataset show the effectiveness
of the clustering algorithm.
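The clustering heuristic itself is not specified in the abstract; one simple way to group regular expressions so that a document scan can be shared is to key each pattern on a literal substring it requires, and skip a whole cluster when that literal is absent. A crude sketch under that assumption:

```python
import re
from collections import defaultdict

def required_literal(pattern):
    """First run of plain alphanumeric characters in the pattern.
    A crude heuristic stand-in for a required substring; a real system
    would derive a guaranteed-necessary literal from the regex syntax."""
    runs = re.findall(r"[A-Za-z0-9]+", pattern)
    return runs[0] if runs else ""

def cluster_by_literal(patterns):
    """Group patterns sharing the same leading literal; a document that
    lacks the literal can skip every pattern in the cluster."""
    clusters = defaultdict(list)
    for p in patterns:
        clusters[required_literal(p)].append(p)
    return dict(clusters)

pats = [r"error: \d+", r"error: timeout", r"user=\w+"]
clusters = cluster_by_literal(pats)
```

With such clusters, one cheap substring test per cluster replaces many full regex matches per document.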
|
|
Nathanael Fillmore |
A* Romantic Poetry Generation |
Poetry publication in the United States is a multi-hundred
dollar industry. Yet current methods of production are
inefficient: they've hardly changed since before the Industrial
Revolution. In this paper we present novel methods
for training a computer to generate poetry using a corpus.
(In all seriousness, it is interesting to see how well we can
make the computer create meaning and form when we remove
the constraints on content and ordering present in machine
translation and typical natural language generation.)
Previous attempts at using computers to automatically
generate poetry tend to rely on hand-coded rules. For example,
(Gervas 2001) uses a rule-based system to generate
Spanish poetry. The rules were manually created by reviewing
academic literature on poetry. (Manurung, Ritchie, and
Thompson 2000) and (Manurung 2003) use stochastic hill-climbing
search to create poems. But evaluation and mutation
of candidates rely on a hand-crafted grammar and lexicon.
(Levy 2001) proposes a similar evolutionary algorithm, but
again using a hand-crafted lexicon, conceptual knowledge
base, and grammar. Other examples, going back at least
to the 1970s, use hand-crafted template poems and fill in
the blanks to create new poems. (See §2.3.2 in (Manurung
2003) for an overview.)
On the other hand, several techniques we present here are
similar to corpus-based approaches used in machine translation.
These are referenced below.
|
|
Archit Gupta, Min Qiu |
Inferring Malware Relationships using Topic Models |
The diversity, sophistication, and availability of malicious
software (malware) pose enormous challenges for
securing networks and end hosts from attacks. From the
security community's point of view, it is imperative
to understand how malware characteristics evolve over
time, and the actual relationships between malware, for
informed defense.
To this end, we analyze metadata describing malware compiled
over a period of 19 years. We apply the
Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan
2003) technique to uncover the latent semantic space
(topics) in the malware metadata. The weight vectors
of these topics represent a dimension-reduced feature
space for each malware document. We design
a two-phase clustering algorithm with timestamps on
feature vectors to establish the similarity and relationships
among different malware. We augment the bag-of-words
vocabulary with domain-specific frequent phrases as word
types for better topic modeling. The results
so far show relationship graphs that represent the most
"likely" edges between two malware.
|
|
Larry A. Hendrix |
Modeling tRNA using a Stochastic Context-Free Grammar |
Stochastic context-free grammars (SCFGs) are becoming
increasingly useful in biological sequence analysis tasks. RNA
secondary structure problems are a natural application of these
probabilistic models. This paper presents an application of a
SCFG to model the class of RNA sequences called transfer RNA
(tRNA). The model is applied to a set of 100 known tRNA
sequences (positive test set) and a set of 100 non-tRNA sequences
(negative test set). The probabilistic model is then analyzed by
comparing the sum of the negative log likelihood (NLL) over each
test set, where NLL is the negative log of the probability of each
sequence s given the hypothesized grammar G, -log(Prob(s | G)). I
expect the hypothesized grammar to be more likely to produce
sequences from the positive test set of known tRNA.
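The comparison described above reduces to summing -log Prob(s | G) over each test set and checking which sum is lower. A small sketch with hypothetical per-sequence probabilities (the real values would come from the SCFG's inside algorithm):

```python
import math

def total_nll(probs):
    """Sum of -log P(s | G) over a set of sequence probabilities."""
    return sum(-math.log(p) for p in probs)

# hypothetical probabilities assigned by the grammar G
positive = [1e-20, 1e-22]   # known tRNA sequences
negative = [1e-40, 1e-45]   # non-tRNA sequences

pos_nll = total_nll(positive)
neg_nll = total_nll(negative)
# a lower total NLL on the positive set supports the hypothesis
```

Because log probabilities of long sequences underflow, real implementations accumulate log values throughout rather than exponentiating.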
|
|
Lijie Heng |
Using Information Extraction to Build a CS User Search
System |
The goal of this project is to build a CS user search system that allows
queries on CS users. Although all CS users are already categorized
into faculty, staff, graduate students, and undergraduates, it is much more
convenient to have a query system that immediately returns the profile of a CS
user, including his or her relations with other people, when that
user is queried. To build such a system, information extraction techniques
are used to extract useful information from each user's homepage in
the cs.wisc.edu dataset. With our system, retrieving all the information
about a current CS user and his relations to other people, by querying on a
small piece of information known about him, is much easier and
faster than using the CS department web pages.
|
|
Shijin Huang |
TF.IDF-Based Expert Finding in Enterprise Corpora |
Expert finding is an important component in enterprise
knowledge management that helps identify the right persons
to consult with when such a need presents itself. The
Traditional database-based approaches involve manual
steps that inherently limit their ability to adapt to the
fast-changing business world. This project implements a
TF.IDF-based expert finding system that can
automatically keep expertise information up to date and use
it to recommend experts. Evaluation results on a real-world
data set from Epic show that the system can make
practically good recommendations, and suggest that the
system can be useful in enterprise settings for finding experts.
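The abstract does not give the scoring formula; a straightforward reading of a TF.IDF-based expert finder is to score each person by the summed TF.IDF weight of the query terms over documents associated with that person. A minimal sketch under that assumption, with a hypothetical toy corpus:

```python
import math
from collections import Counter

def tfidf_scores(query_terms, person_docs):
    """person_docs: {person: list of token lists}. Score each person by
    summed TF.IDF of the query terms over their documents."""
    all_docs = [doc for docs in person_docs.values() for doc in docs]
    n = len(all_docs)
    df = Counter()                       # document frequency per term
    for doc in all_docs:
        for t in set(doc):
            df[t] += 1
    scores = {}
    for person, docs in person_docs.items():
        tf = Counter(t for doc in docs for t in doc)   # term frequency
        scores[person] = sum(
            tf[t] * math.log(n / df[t]) for t in query_terms if df[t]
        )
    return scores

docs = {
    "alice": [["database", "index", "btree"], ["database", "sql"]],
    "bob": [["network", "routing"]],
}
scores = tfidf_scores(["database"], docs)
```

Ranking people by these scores yields the expert recommendation list, and recomputing the counts as documents change is what keeps the expertise information up to date automatically.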
|
|
Yancan Huang |
Study on Domain Adaptation for Sentiment Analysis |
Domain Adaptation for Sentiment Analysis is a typical
machine learning problem, and there has been
much related research on this topic. In this paper, we take a
novel Domain Adaptation algorithm [1] as our study case. We
have implemented this algorithm and conducted
experiments with it on several datasets. We
compare its performance with existing
Domain Adaptation approaches. Finally, we evaluate this
approach and analyze its efficiency and accuracy.
|
|
Steve Jackson |
Detecting Poetry from Prosody Patterns |
Poetry is a subtle art form with a rich history. In general,
the question "is this text poetic?" is a subjective judgement.
However, in some cases it may be possible to give an objective
measure of how "poetic" a text is by comparing its
patterns of sound with the patterns of known poems. To that
end, we attempt to devise sound-based text features that can
be used to distinguish poetry and prose.
|
|
Samuel Javner |
Word Sense Disambiguation
Using Semantic Similarity Measures |
The hypothesis that words occurring in the same context have
similar meanings is fairly intuitive, but it was not always obvious.
This insight is especially useful to the task of Word Sense
Disambiguation (WSD), determining the intended meaning of an
ambiguous word given its context. There are various approaches to
WSD, both supervised and unsupervised. I explore a variety of
methods for WSD, in particular, unsupervised knowledge-based
WSD using measures of semantic relatedness. A word sense is
chosen by determining which word sense is most related to its
immediate context.
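One classic knowledge-based method of the kind described above is the Lesk gloss-overlap algorithm: choose the sense whose dictionary gloss shares the most words with the ambiguous word's context. This sketch (with hypothetical glosses, and simple word overlap standing in for the semantic-relatedness measures explored in the project) illustrates the idea:

```python
def lesk(context_words, sense_glosses):
    """Pick the sense whose gloss shares the most words with the context
    (simplified Lesk; ties broken arbitrarily)."""
    context = set(context_words)
    def overlap(item):
        sense, gloss = item
        return len(context & set(gloss.split()))
    return max(sense_glosses.items(), key=overlap)[0]

glosses = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "sloping land beside a body of water",
}
sense = lesk("i deposited money at the bank".split(), glosses)
```

Replacing raw word overlap with a graded relatedness score between gloss words and context words gives the knowledge-based variants the abstract refers to.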
|
|
Chamond Liu |
Classifying Painting Styles |
This project explores the feasibility of using support vector
machines (SVM) or stepwise logistic regression to
distinguish thumbnails of two styles of paintings. I initially
targeted impressionist and cubist paintings, hypothesizing
that features representing edges, texture, saturation, and
intensity would be relevant. For training and test examples I
used first a small corpus of impressionist and cubist
paintings, then expanded the number of impressionist
examples, and finally used a large corpus combining the
expanded impressionists with a large number of neoclassical
paintings. Ten-fold cross validation shows that stepwise
logistic regression is markedly superior to SVM, with mean
accuracies of 90.7%, 96.3%, and 76.1% for the 3 corpora,
respectively. Moreover, stepwise logistic regression also
yields an assessment of feature quality, showing, for
example, that intensity is promising but hue has no value in
distinguishing impressionists from cubists.
|
|
Jie Liu |
Breast Cancer Identification from Structured and Free Text
Mammographic Findings with kFOIL |
In our project, we adopted the kFOIL algorithm to identify breast cancer from
mammography findings represented as NMD features and NLP
features. We found that NLP features did not improve classification
performance significantly. In addition, our classification accuracy peaked
at 82.6% when the top 20 NLP features were used, compared with the doctors'
88.5% prediction accuracy.
|
|
Mayank Maheshwari |
Predicting stock returns using classification of annual
financial reports |
Short-term stock price movements or stock returns can
be predicted with some accuracy using annual financial
reports of companies. In this project, an SVM classifier
is trained to predict stock returns as positive ("up")
or negative ("down") by analyzing annual reports relative
to the stock's volatility measure (beta) and
the change in index value. The prediction is made over
a short window (event study return) of 2 days (t, t+1)
to gauge the market's reaction to the report. Accuracies
obtained average 57.1%, with a maximum
of 76.47%.
|
|
Sarah Matz |
Analysis and Clustering |
In this paper, we explore methods to classify documents by sentiment
(positive or negative) using clustering techniques. We find that clustering
using bag-of-words (BOW) feature vectors does not detect sentiment. In one
case it clusters solely by document length, while in other cases the
properties defining the clusters are unknown. When some of the data is
labeled with the true sentiment, this becomes a semi-supervised learning
problem. Under this set-up, we find some indications of clustering by
sentiment, but not to a large extent.
|
|
Pratap Ramamurthy |
BIE - Badger Index Estimator |
In this report we describe BIE, a search engine index
size estimator. We use a technique called capture-recapture,
which is used in ecology to measure the
population of animals in the wild. We require just two
sets of samples to get a reasonably accurate estimate. In
this report we compare the topical index size of three
search engines: Google, Yahoo and Live.
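The standard two-sample capture-recapture estimator is Lincoln-Petersen: with n1 items in the first sample, n2 in the second, and m appearing in both, the population is estimated as n1*n2/m. The report's exact estimator may differ, but the core calculation looks like this:

```python
def lincoln_petersen(n1, n2, m):
    """Lincoln-Petersen population estimate from two samples:
    n1 items captured first, n2 captured second, m seen in both."""
    if m == 0:
        raise ValueError("no recaptures; the estimate is unbounded")
    return n1 * n2 / m

# hypothetical numbers: two samples of 1000 pages drawn from an index,
# with 50 pages appearing in both samples
est = lincoln_petersen(1000, 1000, 50)  # -> 20000.0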
|
|
Farzad Rastegar |
Study of Evolution Using Pair Hidden Markov Models |
In this study, we seek to estimate phylogenies from DNA
sequence data. To compute the distance between sequences,
we work on the details of the EM algorithm for a specific
hidden Markov model called Pair HMM (PHMM) where
parameters of the model are tied to a hidden random
variable that represents the time since the two sequences
have diverged. A PHMM is a mechanism utilized for
pairwise sequence alignments. The EM algorithm allows for
more accurate sequence alignment and gives a very useful
distance function between sequences. Eventually, we utilize
the distance function to reconstruct the phylogenetic tree of
homologous sequences.
|
|
Tristan Ravitch |
RegExplainer: Explaining Regular Expressions in Natural
Language |
The goal of this project is to provide natural language
descriptions of the strings matched by a given regular
expression. This involves several steps: (1) translation
from a hierarchical representation with well-defined semantics
to semi-natural language with a slot-filling approach,
(2) grammatical smoothing, and (3) summarization.
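Step (1), the slot-filling translation, can be pictured as a recursive walk over the regular expression's hierarchical representation, filling an English template at each node. The AST shape below is a hypothetical stand-in, not the project's actual representation:

```python
def describe(node):
    """Render a tiny regex AST as semi-natural language via slot filling."""
    kind = node[0]
    if kind == "lit":                     # literal text
        return "the text '%s'" % node[1]
    if kind == "star":                    # zero-or-more repetition
        return "zero or more repetitions of %s" % describe(node[1])
    if kind == "alt":                     # alternation
        return " or ".join(describe(c) for c in node[1])
    if kind == "seq":                     # concatenation
        return ", followed by ".join(describe(c) for c in node[1])
    raise ValueError("unknown node kind: %s" % kind)

ast = ("seq", [("lit", "ab"), ("star", ("lit", "c"))])
text = describe(ast)
```

Output of this kind is what the later grammatical-smoothing and summarization steps would then clean up.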
|
|
Joel Scherpelz |
Online Novelty Detection for Network Data Streams |
Automated analysis of network data streams is a difficult
but important problem in computer science. Novelty detection
becomes difficult when the domain is an unbounded
stream. The large volume of data in combination with an
unstable underlying distribution renders most existing algorithms
useless. A number of single pass clustering algorithms
have been developed and this paper describes a method
for extracting novel event types from the output of
such a clustering algorithm. By maintaining a fixed size
population of clusters we can watch the evolution and creation
of clusters. By paying close attention to cluster lifecycles
we can extract information about changes in the underlying
distribution.
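The specific single-pass clustering algorithm is not named in the abstract; the fixed-size-population idea it describes can be sketched as follows, with cluster creation treated as the novel event and the stalest cluster evicted when the population is full (1-D points for brevity):

```python
def stream_clusters(points, radius, max_clusters):
    """Single-pass clustering over a stream: assign each point to the
    nearest centroid within `radius`, else open a new cluster (a novel
    event), evicting the least recently updated cluster when full."""
    clusters = []   # each cluster: [centroid, count, last_seen]
    novel = []      # points that opened a new cluster
    for t, x in enumerate(points):
        best = None
        for c in clusters:
            if abs(c[0] - x) <= radius and (
                best is None or abs(c[0] - x) < abs(best[0] - x)
            ):
                best = c
        if best is None:
            if len(clusters) >= max_clusters:
                clusters.remove(min(clusters, key=lambda c: c[2]))
            clusters.append([x, 1, t])
            novel.append(x)
        else:
            best[0] = (best[0] * best[1] + x) / (best[1] + 1)
            best[1] += 1
            best[2] = t
    return novel

novel = stream_clusters([0.0, 0.1, 5.0, 5.1, 0.05], radius=1.0,
                        max_clusters=10)
```

Watching when clusters are born, updated, and evicted is the "lifecycle" signal the paper uses to track changes in the underlying distribution.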
|
|
Brandon M. Smith |
Multi-View 3D Scene Reconstruction |
The goal of multi-view (or multi-camera) 3D scene
reconstruction is to infer the three-dimensional geometry of
a scene using several images captured from different
viewpoints. This is a generalization of two-view stereo 3D
scene reconstruction. Most techniques rely heavily on
machine learning. For example, Markov Random Fields
(MRFs) can be used to model spatial interactions between
multiple views of a scene. Belief propagation can be used to
solve such a model in a relatively fast, approximate way [9].
Another popular technique relies on graph cuts to obtain a
solution with (approximately) lowest energy [6].
This project focuses on exploring belief propagation and
graph cuts to solve the multi-view 3D scene reconstruction
problem. Specifically, a 5x5 camera array [15] is used for
experimentation. Results are presented based on an
implementation of the graph cuts method.
|
|
Sriram Subramanian |
Spoken Document Retrieval |
A Spoken Document Retrieval system allows text search on
audio (speech) content. This involves preprocessing
the audio files and retrieving text using a speech-to-text
engine. Sphinx ASR (v4.0) is used here along with the Wall
Street Journal acoustic model, and the language model is
produced using the ground-truth transcripts. A TF-IDF-based
search technique is employed, and the results are compared
against the ground-truth transcripts.
|
|
Yoh Suzuki |
Clustering Traffic: Analysis of Images from a Time-Lapse
Camera |
Information about the kinds of cars people drive can be
useful to many organizations (e.g., car manufacturers can use
the information to understand consumer demand for certain
products). This work is the beginning of the analysis of
image data to extract meaningful information about the
traffic on the street the camera overlooks. We demonstrate
a simple way to separate foreground from background,
identify cars, and condense dozens of gigabytes of image
data into a meaningful feature vector representation, which
is used to cluster cars into groups of similar colors.
Improvements and further analyses to be made in future
work are suggested.
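The abstract does not say how foreground is separated from background; one common simple approach for a fixed camera is a per-pixel median over many frames, which recovers the static background because moving cars are transient at any given pixel. A toy 1-D sketch of that idea (not necessarily the method used here):

```python
import statistics

def background(frames):
    """Per-pixel median over a stack of frames approximates the static
    background when moving objects are transient."""
    return [statistics.median(px) for px in zip(*frames)]

def foreground_mask(frame, bg, threshold):
    """Pixels differing from the background by more than `threshold`."""
    return [abs(p - b) > threshold for p, b in zip(frame, bg)]

# toy 1-D "images": a bright car (value 9) passes over a dark road (value 1)
frames = [[1, 1, 1], [1, 9, 1], [1, 1, 1], [1, 1, 9], [1, 1, 1]]
bg = background(frames)
mask = foreground_mask([1, 9, 1], bg, threshold=3)
```

Connected regions of the resulting mask would then be the car blobs whose color statistics feed the clustering step.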
|
|
Zhuo Tao |
Some Methods for Word Sense Disambiguation |
Word sense disambiguation (WSD) is one of the major tasks
in natural language processing: identifying the intended
meaning of an ambiguous word in a certain context.
|
|