2007 Class Projects for CS 769

Anne Jorstad
I propose to use the spectral clustering method to divide a corpus of Shakespeare's plays, treated as bags of words, into two categories (two-way clustering). I am curious to discover what qualities will distinguish the two groups: whether they will be divided largely along comedy/tragedy lines, or whether there will be no obvious distinction at all. As this is a collection of only 37 documents, this project will also test the limits of spectral clustering as applied to a small number of data points. I will use a corpus from the following website containing the works of Shakespeare: http://www.it.usyd.edu.au/~matty/Shakespeare/ Another interesting, related application that could be tested with the same implementation would be segmenting Aesop's fables, treating each fable and its accompanying "moral" (such as "slow and steady wins the race") as separate documents, and seeing how many pairs the algorithm ends up placing in the same cluster. A corpus of the 21 most famous fables can be found here: http://www.bygosh.com/aesop/index.htm

Chris Hinrichs
For this project I will be analyzing acoustic data with latent semantic space algorithms: LSA, pLSA, and LDA if time permits. In order to do this, the acoustic data stream will have to be split into repeating terms. The sounds will include music, speech, and noise. Once this is done, the semantic space representation of the sounds will yield "topics" in the latent space, which could (depending on how the sound is translated into "terms") be thought of as distributions of cadences, rhythm structures, or chord and harmonic structures. A natural predictive task is whether or not the sound is "musical", i.e. whether it has a repeating cadence and chord structure, or is more "vocal", or is just "noise". To classify a new document, i.e. a new sound, the document would first be translated into terms, and then into latent topics. Finally, a tf-idf score could be used to match it to the classes.

Daniel Wong
Consider the problem of a journal receiving papers: one concern is whether or not these papers have been published before. Tools exist today, but few use the power of web-based search; others are commercial packages that target cheating on high school reports and papers. This is a difficult problem, since many modifications can be made to documents that are hard to detect, and searching manually by hand through all the papers received is far too time-consuming to be practical. A proposed method is to take random n-grams from a submission, use these word sequences as search seeds in a search tool, collect a list of documents in which each n-gram appears, and repeat. An annealing-like method can be used, starting with very long n-grams that are unlikely to have search hits and then reducing n, to collect a sequence of documents and a count of co-occurrences between random n-grams and specific documents. Some intuitive probabilistic statements can be made: the probability of a long n-gram matching any given document is smaller than that of a short n-gram matching it, and the probability that many randomly selected n-grams from the seed document all produce search hits in a single document is small if the two documents are unrelated and larger if they are related.
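As a rough illustration of the sampling step in this proposal (not a definitive implementation), the sketch below draws random n-grams from a suspect document and tallies how often each external document is returned, annealing n downward. The `web_search` function is a hypothetical stand-in for whatever phrase-search backend would actually be used, and the starting and ending values of n are arbitrary.

```python
import random
from collections import Counter

def sample_ngram(words, n):
    """Pick a random contiguous n-gram from the suspect document."""
    start = random.randrange(len(words) - n + 1)
    return " ".join(words[start:start + n])

def collect_hits(words, web_search, n_start=12, n_stop=4, queries_per_n=5):
    """Anneal n downward, counting how often each external document is hit
    by a randomly sampled n-gram.  `web_search(phrase)` is a hypothetical
    stand-in assumed to return a list of matching document identifiers."""
    hits = Counter()
    for n in range(n_start, n_stop - 1, -1):
        if len(words) < n:
            continue
        for _ in range(queries_per_n):
            phrase = sample_ngram(words, n)
            for doc_id in web_search(phrase):
                hits[doc_id] += 1
    # Documents that co-occur with many random n-grams are plagiarism suspects.
    return hits
```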
Derek Gjertson
The problem I am examining is the clustering of eBay auctions. Given a set of eBay auctions, I want to be able to cluster individual auctions that refer to the same item being sold. Clustering items in this way is beneficial when comparison shopping, to quickly find the best price for a particular item. Planned solution: To collect the data to cluster, I will use the eBay SOAP API to retrieve a number of items and their descriptions. I will use a leaf node of the eBay category hierarchy to maximize the number of records that can be linked together. After getting the data, I will first try to cluster based on the title of the item and its basic information. I will use a bag of words representation, supplemented with the more structured data, to calculate a similarity score, compare that similarity to a threshold, and use the result to decide whether two items can be linked together. I will then find which auctions this model works well on and which auctions it fails on. If I find a class of auctions that fails to cluster properly, I will look at the items' extended information; this extended data will need to be preprocessed to remove irrelevant information, and I will see whether it improves the accuracy of the model.
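A minimal sketch of the threshold-based linking step described above, assuming the auction titles have already been retrieved via the eBay API. It uses scikit-learn for the bag-of-words vectors; the tf-idf weighting and the 0.6 threshold are illustrative choices, not part of the proposal.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def link_auctions(titles, threshold=0.6):
    """Group auction titles whose pairwise cosine similarity exceeds a
    threshold (single-link: one similar pair is enough to merge groups)."""
    vectors = TfidfVectorizer().fit_transform(titles)
    sim = cosine_similarity(vectors)

    # Simple union-find over the "similar enough" graph.
    parent = list(range(len(titles)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            if sim[i, j] >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, title in enumerate(titles):
        clusters.setdefault(find(i), []).append(title)
    return list(clusters.values())
```

In practice the threshold would need tuning on real listings; the proposal's plan of inspecting which classes of auctions fail to cluster is the natural way to do that.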
Giridhar Ravipati
Satire Recognition: What news are you reading?
In this project we propose to perform satire recognition by classifying news articles as coming from The Onion or from other news sources (CNN and Reuters in particular) using statistical machine learning techniques. Articles published in authentic news sources are sensible, true pieces of news, whereas those in The Onion are based on reality but are modified to give them a satirical and humorous touch. We would like to proceed in stages:
* Data Collection: The Onion, CNN, and Reuters allow external user agents to freely crawl their sites for news articles. Instead of crawling their sites ourselves, we would like to use the Google AJAX Search API to obtain the web addresses of articles from the three sites. The API returns an XML file with all the article URLs; once we have it, we can fetch each page from our program and parse it for the news article itself. We would like to get around 1000 articles from The Onion and a mixture of 1000 articles from CNN and Reuters, and then divide them into a training set and a test set in an 80%:20% ratio.
* Feature Vector Selection: We would like to use the bag of words representation for the news articles, but at the same time make some of the important features in each article prominent so that we obtain good classification accuracy. We will therefore weight each word in the article, such that the differentiating words receive higher weight than the others. Some of the features we think are important to consider at this stage are satire features: a very common, almost defining feature of satire is a strong vein of irony or sarcasm, and parody, burlesque, exaggeration, juxtaposition, comparison, and analogy are devices frequently used in satirical speech and writing. After going through articles from The Onion, we think that exaggeration and slang are the main features that can be used for satire recognition. The most typical way of exaggerating is by using a lot of adjectives, and these features are almost non-existent in articles from other news sources. We suspect these alone will not be enough to get good accuracy, so we plan to experiment with other features as we progress.
* Methodology: We will use support vector machines to train and classify. As discussed above, we will represent each article as a weighted bag of words to make some features prominent. Adjectives in the articles can be identified by using a part-of-speech tagger for English such as CLAWS (which has a free trial). A list of commonly used slang terms can be obtained from "The Online Slang Dictionary" from Berkeley (http://www.ocf.berkeley.edu/~wrader/slang/).
* Evaluation Methodology: We will use test set classification accuracy to evaluate our project. After we train the SVM classifier on the training data, we will classify the test set articles as coming from The Onion or from the other news sources. Since we already have labels for the test set, we can compute the test set classification accuracy.

Houssam Nassif
Word Sense Disambiguation Using WordNet
Many words in natural languages have multiple meanings. In English, a word like "crane" can refer to the animal or to the machine. Word Sense Disambiguation (WSD) is the task of determining which sense a word has in a given sentence. WordNet is a lexical database for the English language [1]. It is a database of word senses, grouped into synsets: sets of synonyms sharing a unique meaning. Given a word, its different senses can be determined using WordNet. The aim of this project is to use WordNet for WSD.

Jake Rosin
Author Identification
Classifying documents based on content is a common endeavor in natural language processing. The relevant element may be topic, sentiment, earnestness (identifying satire, for example), etc. Automated content-based evaluation has applications ranging from sorting news stories to finding instances of plagiarism. In the latter case, text in one document which closely matches text from another is considered suspect even if the two are not identical, the assumption being that a plagiarist would permute the text to help disguise the source. If the source material being stolen from is not available to the plagiarism detector, however, then even a verbatim copy will be undetectable. Personal style has a significant effect on written work, influencing word choice, sentence structure, grammatical (mis)constructions, etc. A classifier which recognizes specific writing styles independent of content could detect material taken from a known author even if the source itself is not included in the comparison data. Additionally, it could detect a change in authorship for other reasons; examples include one student doing portions of another's homework in academia, or, in the real world, editorial changes made to a news article or opinion piece. I propose a system for classifying documents based on author. Support Vector Machines provide good results for most classification systems, and will be used for this one. SVMs require that documents be preprocessed into feature vectors; finding a set of features which identifies authorial style will be the focus of this project. Generating a bag of words is a good place to start, but depending on the corpus used this may result in a traditional topic classification system. Deeper analysis of sentences will be aided by the use of the Stanford Parser: additional features may be formed by counting the uses of various parts of speech, or the appearances of specific parse subtrees.
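As a rough sketch of the style-feature idea (not the project's actual feature set), the snippet below builds a small stylometric vector per document from function-word frequencies, average sentence length, and punctuation rates, and trains a linear SVM with scikit-learn. The word list and features are illustrative assumptions; parse-tree features from the Stanford Parser are omitted.

```python
import re
from sklearn.svm import LinearSVC

# A handful of topic-neutral function words; a real system would use many more.
FUNCTION_WORDS = ["the", "of", "and", "to", "that", "which", "however",
                  "although", "because", "upon", "while", "thus"]

def style_features(text):
    """Map a document to a small vector of style markers."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = max(len(words), 1)
    feats = [words.count(w) / n for w in FUNCTION_WORDS]
    feats.append(len(words) / max(len(sentences), 1))   # average sentence length
    feats.append(text.count(",") / n)                   # comma rate
    feats.append(text.count(";") / n)                   # semicolon rate
    return feats

def train_author_classifier(documents, authors):
    """Fit a linear SVM mapping style vectors to author labels."""
    X = [style_features(d) for d in documents]
    clf = LinearSVC()
    clf.fit(X, authors)
    return clf
```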
Piramanayagam Arumugua Nainar
Discovering Topics in Software using Latent Dirichlet Allocation
The size of software is growing day by day, and development teams are becoming larger and larger. Very few developers have complete knowledge of the system; each one has an individual area (module or feature) of expertise. In projects involving more than a few dozen developers, it is difficult to assign new feature requests or bug reports to the right person, so it would be useful to automatically extract the expertise of each developer. The information available for this task is:
1. the messages written by the developer while checking code into the version control system;
2. the bug reports of the bugs fixed by the developer;
3. the actual source code written by the developer or, at a coarser level, just the source files touched by her.
For this project, we propose to learn the topics present in a software project and, if possible, learn a topic distribution for each developer. We plan to use commit messages and bug reports. The actual source code will be less useful, because we may find topics corresponding to coding style and variable naming conventions rather than the actual semantics of the code. But we can still use the coarser information, namely the file names, that is available as part of commit messages.

Pavan Kuppili, Rokas Venckevicius, Xiyang Chen
Our project will address the following problem: given an academic paper or any general textual query, find a researcher who would be best suited to talk to you about it (based on their previous research). For example, given a paper as a PDF file, we would like to know who in the UW Computer Science department would be best suited to review it. Or, given a phrase such as "probabilistic motion planning", who are the people to talk to in the top Computer Science departments in the US? To answer this question, we will build a statistical profile for a list of researchers based on their published papers and compute the cosine similarity between the query and the profiles to find the best match.

(The above list is incomplete.)
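A minimal sketch of the researcher-matching step from the Kuppili/Venckevicius/Chen proposal above, assuming each researcher's published papers have already been gathered into a single text profile. The tf-idf weighting and English stop-word filtering are illustrative choices made here, not details from the proposal.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_researchers(profiles, query, top_k=5):
    """Rank researchers by cosine similarity between the query text and
    each researcher's profile (the concatenated text of their papers).
    `profiles` is a dict mapping researcher name -> profile text."""
    names = list(profiles)
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform([profiles[n] for n in names])
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, matrix)[0]
    return sorted(zip(names, scores), key=lambda p: p[1], reverse=True)[:top_k]

# Example: a short phrase query; the full text of a paper works the same way.
# rank_researchers(profiles, "probabilistic motion planning")
```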