CS 769: Advanced Natural Language Processing Final Project summaries

Hidayath Ansari, Chaitanya Gokhale
Positive-only Semi-supervised Classification
In this report we discuss the task of binary classification on a test set, given a training set consisting of a large number of unlabeled examples and a handful of examples belonging to one class. The task is part of the UCSD Data Mining Contest 2008.

Xiaoyong Chai
Clustering Regular Expressions for Efficient Matching
An information extraction system typically contains hundreds of thousands of regular expressions to be matched against text documents. Performing large-scale matching efficiently is thus a challenging problem. In this project, I attack the problem by clustering regular expressions, as a way to reduce the number of document scans. A simple heuristic-based iterative clustering algorithm is proposed. Experiments with a real-world dataset show the effectiveness of the clustering algorithm.

Nathanael Fillmore
A* Romantic Poetry Generation
Poetry publication in the United States is a multi-hundred dollar industry. Yet current methods of production are inefficient: they've hardly changed since before the Industrial Revolution. In this paper we present novel methods for training a computer to generate poetry using a corpus. (In all seriousness, it is interesting to see how well we can make the computer create meaning and form when we remove the constraints on content and ordering present in machine translation and typical natural language generation.) Previous attempts at using computers to automatically generate poetry tend to rely on hand-coded rules. For example, (Gervas 2001) uses a rule-based system to generate Spanish poetry. The rules were manually created by reviewing academic literature on poetry. (Manurung, Ritchie, and Thompson 2000) and (Manurung 2003) use stochastic hill-climbing search to create poems, but evaluation and mutation of candidates rely on a hand-crafted grammar and lexicon. (Levy 2001) proposes a similar evolutionary algorithm, but again using a hand-crafted lexicon, conceptual knowledge base, and grammar. Other examples, going back at least to the 1970s, use hand-crafted template poems and fill in the blanks to create new poems. (See §2.3.2 in (Manurung 2003) for an overview.) On the other hand, several techniques we present here are similar to corpus-based approaches used in machine translation. These are referenced below.

Archit Gupta, Min Qiu
Inferring Malware Relationships using Topics Model
The diversity, sophistication and availability of malicious software (malware) pose enormous challenges for securing networks and end hosts from attacks. From the security community's point of view, it is imperative to understand how malware characteristics evolve over time and what the actual relationships between malware are, so that defenses can be informed. To this end, we analyze metadata describing malware compiled over a period of 19 years. We apply the Latent Dirichlet Allocation (LDA) (D. Blei and Jordan 2003) technique to uncover the latent semantic space (topics) in the malware metadata. The weight vectors of these topics represent a dimension-reduced feature space for each malware document. We design a two-phase clustering algorithm with timestamps on feature vectors to establish the similarity and relationship among different malware. We augment the bag-of-words vocabulary with domain-specific frequent phrases as word types for better topic modeling. The results so far show relationship graphs that represent the most "likely" edges between two malware.

Larry A. Hendrix
Modeling tRNA using a Stochastic Context-Free Grammar
Stochastic context-free grammars (SCFGs) are becoming increasingly useful in biological sequence analysis tasks. RNA secondary structure problems are a natural application of these probabilistic models. This paper presents an application of an SCFG to model a class of RNA sequences called terminators (tRNA). The model is applied to a set of 100 known tRNA sequences (positive test set) and a set of 100 non-tRNA sequences (negative test set). This probabilistic model is then analyzed by comparing the sum of the negative log likelihood (NLL) for each test set. NLL is the negative log of the probability of each sequence s given the hypothesized grammar G, -log Prob(s | G). I expect the hypothesized grammar to be more likely to produce sequences from the positive test set of known tRNA.
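
The comparison described above can be sketched as follows. This is a minimal illustration, not the project's code: it assumes the per-sequence probabilities under the grammar have already been computed (for example, by an inside-algorithm parser, which is not shown), and the numbers and the name total_nll are hypothetical.

```python
import math


def total_nll(sequence_probs):
    """Sum the negative log likelihood over a set of sequences.

    Each entry is the (assumed, precomputed) probability of one
    sequence under the hypothesized grammar.
    """
    return sum(-math.log(p) for p in sequence_probs)


# Hypothetical probabilities, for illustration only.
positive_probs = [1e-3, 5e-4, 2e-3]   # sequences the grammar should fit
negative_probs = [1e-6, 3e-7, 8e-6]   # sequences it should fit poorly

print("positive set NLL:", total_nll(positive_probs))
print("negative set NLL:", total_nll(negative_probs))
```

A lower total NLL means the grammar assigns higher probability to that set, so the expectation stated above corresponds to the positive set having the smaller sum.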

Lijie Heng
Using Information Extraction to Build a CS User Search System
The goal of this project is to build a CS user search system that supports queries on CS users. Although all CS users are already categorized into faculty, staff, graduates and undergraduates, it is much more convenient to have a query system that immediately returns the profile of a queried user, including his or her relations with other people. To build such a system, information extraction techniques are used to extract useful information from each user's homepage, obtained from the cs.wisc.edu dataset. Compared with browsing the CS department web pages, our system makes it much easier and faster to retrieve all the information about a current CS user and that user's relations to other people, starting from a small piece of known information about the user.

Shijin Huang
TF.IDF-Based Expert Finding in Enterprise Corpora
Expert finding is an important component in enterprise knowledge management that helps identify the right persons to consult with when such a need presents itself. The traditional database-based approaches have some manual steps which inherently limit their ability to adapt to the fast-changing business world. This project implements a TF.IDF-based expert finding system which can automatically keep expertise information up-to-date and use it to recommend experts. The evaluation results using a real-world data set from Epic show that the system can make practically good recommendations and suggest that the system can be useful in enterprise settings to find experts.
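
A TF.IDF-based expert ranking could look roughly like the sketch below. The summary does not give the system's actual weighting or data model, so everything here is an assumption: each person is represented by the pooled text of documents associated with them, a query term's weight is its frequency in that profile times log(N / df), and the names build_profiles and rank_experts and the sample data are hypothetical.

```python
import math
from collections import Counter


def build_profiles(docs_by_person):
    """Map each person to term counts pooled over their documents."""
    return {
        person: Counter(word for doc in docs for word in doc.lower().split())
        for person, docs in docs_by_person.items()
    }


def rank_experts(query, profiles):
    """Rank people by the summed TF * IDF of the query terms."""
    n_people = len(profiles)
    scores = {}
    for person, term_counts in profiles.items():
        score = 0.0
        for term in query.lower().split():
            # Document frequency: number of profiles containing the term.
            df = sum(1 for counts in profiles.values() if term in counts)
            if df == 0:
                continue  # term unseen anywhere; contributes nothing
            score += term_counts[term] * math.log(n_people / df)
        scores[person] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical data, for illustration only.
profiles = build_profiles({
    "alice": ["billing interface error", "billing report schedule"],
    "bob": ["lab interface results"],
    "carol": ["scheduling templates", "report printing"],
})
print(rank_experts("billing interface", profiles))
```

Because profiles are rebuilt from whatever documents are currently associated with each person, re-running build_profiles on fresh data is one simple way to keep expertise information current without manual steps.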

Yancan Huang
Study on Domain Adaptation for Sentiment Analysis
Domain Adaptation for Sentiment Analysis is a typical Machine Learning problem, and there has been much related research on this topic. In this paper, we take a novel Domain Adaptation algorithm[1] as our case study. We have implemented this algorithm and conducted many experiments with it on several datasets. We compare the performance of this algorithm with existing Domain Adaptation approaches. Finally, we evaluate this approach and analyze its efficiency and accuracy.

Steve Jackson
Detecting Poetry from Prosody Patterns
Poetry is a subtle art form with a rich history. In general, the question "is this text poetic?" is a subjective judgement. However, in some cases it may be possible to give an objective measure of how "poetic" a text is by comparing its patterns of sound with the patterns of known poems. To that end, we attempt to devise sound-based text features that can be used to distinguish poetry and prose.

Samuel Javner
Word Sense Disambiguation Using Semantic Similarity Measures
The hypothesis that words occurring in the same context have similar meanings is fairly intuitive, but it was not always obvious. This insight is especially useful to the task of Word Sense Disambiguation (WSD), determining the intended meaning of an ambiguous word given its context. There are various approaches to WSD, both supervised and unsupervised. I explore a variety of methods for WSD, in particular, unsupervised knowledge-based WSD using measures of semantic relatedness. A word sense is chosen by determining which word sense is most related to its immediate context.
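
The last step above, picking the sense most related to the immediate context, can be sketched with a deliberately simple relatedness measure. This is an assumption-laden illustration: it uses word overlap between the context and a sense's gloss (a simplified Lesk-style score), whereas the project explores other semantic relatedness measures; the glosses, sense labels, and the name choose_sense are hypothetical.

```python
def choose_sense(context_words, sense_glosses):
    """Return the sense whose gloss shares the most words with the context."""
    context = {word.lower() for word in context_words}

    def relatedness(sense):
        gloss_words = set(sense_glosses[sense].lower().split())
        return len(context & gloss_words)

    return max(sense_glosses, key=relatedness)


# Hypothetical glosses for two senses of "bank".
glosses = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "sloping land beside a body of water such as a river",
}
context = "he sat on the bank of the river and watched the water".split()
print(choose_sense(context, glosses))
```

Swapping the body of relatedness for a different measure changes the method without changing the selection logic.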

Chamond Liu
Classifying Painting Styles
This project explores the feasibility of using support vector machines (SVM) or stepwise logistic regression to distinguish thumbnails of two styles of paintings. I initially targeted impressionist and cubist paintings, hypothesizing that features representing edges, texture, saturation, and intensity would be relevant. For training and test examples I used first a small corpus of impressionist and cubist paintings, then expanded the number of impressionist examples, and finally used a large corpus combining the expanded impressionists with a large number of neoclassical paintings. Ten-fold cross-validation shows that stepwise logistic regression is markedly superior to SVM, with mean accuracies of 90.7%, 96.3%, and 76.1% for the 3 corpora, respectively. Moreover, the stepwise logistic regression also yields an assessment of feature quality, showing, for example, that intensity is promising but hue has no value in distinguishing impressionists from cubists.

Jie Liu
Breast Cancer Identification from Structured and Free Text Mammographic Findings with kFOIL
In our project, we adopted the kFOIL algorithm to identify breast cancer from mammography findings, which were represented as NMD features and NLP features. We found that NLP features did not improve the classification performance significantly. In addition, our classification accuracy peaked at 82.6% when the top 20 NLP features were used, compared with the doctor’s 88.5% prediction accuracy.

Mayank Maheshwari
Predicting stock returns using classification of annual financial reports
Short-term stock price movements or stock returns can be predicted with some accuracy using annual financial reports of companies. In this project, an SVM classifier is trained to predict stock returns as positive ("up") or negative ("down") by analyzing annual reports relative to the volatility measure of the stock (beta) and the change in index value. The prediction is done over a short window (event study return) of 2 days (t, t+1) to gauge the market reaction to the report. The accuracies obtained are on the order of 57.1% on average, with a maximum of 76.47%.

Sarah Matz
Analysis and Clustering
In this paper, we explore methods to classify documents by sentiment (positive or negative) using clustering techniques. We find that clustering using bag-of-words (BOW) feature vectors does not detect sentiment. In one case it clusters solely by document length, while in other cases the properties defining the clusters are unknown. When some of the data is labeled with the true sentiment, this becomes a semi-supervised learning problem. Under this setup, we find some indications of clustering by sentiment, but not to a large extent.

Pratap Ramamurthy
BIE - Badger Index Estimator
In this report we describe BIE, a search engine index size estimator. We use a technique called Capture-Recapture, which is used in Ecology to measure the population of animals in the wild. We require just two sets of samples to get a reasonably accurate estimate. In this report we compare the topical index size of three search engines: Google, Yahoo and Live.
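
To illustrate how two samples can yield a population estimate, here is a minimal sketch of the classic two-sample Lincoln-Petersen estimator. The summary does not say which estimator BIE uses, so this choice is an assumption, and the document IDs and the name lincoln_petersen are hypothetical.

```python
def lincoln_petersen(sample_a, sample_b):
    """Estimate population size as |A| * |B| / |A and B overlap|.

    Assumes both samples are drawn independently and uniformly from
    the same population, which real search engine samples only
    approximate.
    """
    a, b = set(sample_a), set(sample_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("samples share no items; estimate is undefined")
    return len(a) * len(b) / overlap


# Hypothetical document IDs sampled from one index on two occasions.
first_sample = ["d1", "d2", "d3", "d4", "d5", "d6"]
second_sample = ["d4", "d5", "d6", "d7", "d8"]
print(lincoln_petersen(first_sample, second_sample))  # 6 * 5 / 3 = 10.0
```

The intuition is that the fraction of the second sample already seen in the first (3 of 5 here) approximates the fraction of the whole population that the first sample covers.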

Farzad Rastegar
Study of Evolution Using Pair Hidden Markov Models
In this study, we seek to estimate phylogenies from DNA sequence data. To compute the distance between sequences, we work on the details of the EM algorithm for a specific hidden Markov model called Pair HMM (PHMM) where parameters of the model are tied to a hidden random variable that represents the time since the two sequences have diverged. A PHMM is a mechanism utilized for pairwise sequence alignments. The EM algorithm allows for more accurate sequence alignment and gives a very useful distance function between sequences. Eventually, we utilize the distance function to reconstruct the phylogenetic tree of homologous sequences.

Tristan Ravitch
RegExplainer: Explaining Regular Expressions in Natural Language
The goal of this project is to provide natural language descriptions of the strings matched by a given regular expression. This involves several steps: (1) translation from a hierarchical representation with well-defined semantics to semi-natural language with a slot-filling approach, (2) grammatical smoothing, and (3) summarization.

Joel Scherpelz
Online Novelty Detection for Network Data Streams
Automated analysis of network data streams is a difficult but important problem in computer science. Novelty detection becomes difficult when the domain is an unbounded stream. The large volume of data in combination with an unstable underlying distribution renders most existing algorithms useless. A number of single-pass clustering algorithms have been developed, and this paper describes a method for extracting novel event types from the output of such a clustering algorithm. By maintaining a fixed-size population of clusters we can watch the evolution and creation of clusters. By paying close attention to cluster lifecycles we can extract information about changes in the underlying distribution.

Brandon M. Smith
Multi-View 3D Scene Reconstruction
The goal of multi-view (or multi-camera) 3D scene reconstruction is to infer the three-dimensional geometry of a scene using several images captured from different viewpoints. This is a generalization of two-view stereo 3D scene reconstruction. Most techniques rely heavily on machine learning. For example, Markov Random Fields (MRFs) can be used to model spatial interactions between multiple views of a scene. Belief propagation can be used to solve such a model in a relatively fast, approximate way [9]. Another popular technique relies on graph cuts to obtain a solution with (approximately) lowest energy [6]. This project focuses on exploring belief propagation and graph cuts to solve the multi-view 3D scene reconstruction problem. Specifically, a 5x5 camera array [15] is used for experimentation. Results are presented based on an implementation of the graph cuts method.

Sriram Subramanian
Spoken Document Retrieval
A Spoken Document Retrieval system allows text search on audio (speech) content. This involves preprocessing the audio files and retrieving text using a speech-to-text engine. Sphinx ASR (v4.0) is used here along with the Wall Street Journal acoustic model; the language model is produced using the ground truth transcripts. A TFIDF-based search technique is employed and the results are compared against the ground truth transcripts.

Yoh Suzuki
Clustering Traffic: Analysis of Images from a Time-Lapse Camera
Information about the kinds of cars people drive can be useful to many organizations (e.g., car manufacturers can use the information to understand consumer demand for certain products). This work is the beginning of the analysis of image data to extract meaningful information about the traffic on the street the camera overlooks. We demonstrate a simple way to separate foreground from background, identify cars, and condense dozens of gigabytes of image data into a meaningful feature vector representation, which is used to cluster cars into groups of similar colors. Improvements and further analysis for future work are suggested.

Zhuo Tao
Some Methods for Word Sense Disambiguation
Word sense disambiguation (WSD) is one of the major tasks in natural language processing; it identifies the intended meaning of an ambiguous word in a given context.