2007 Class Projects for CS 769

Anne Jorstad
I propose to use the spectral clustering method to divide a corpus of Shakespeare's plays, treated as bags of words, into two categories (two-way clustering). I am curious to discover what qualities will distinguish the two groups: whether they will be divided largely along comedy/tragedy lines, or whether there will be no obvious distinction at all. As this is a collection of only 37 documents, this project will also test the limits of spectral clustering as applied to a small number of data points. I will use a corpus from the following website containing the works of Shakespeare: http://www.it.usyd.edu.au/~matty/Shakespeare/ Another interesting, related application that could be tested with the same implementation would be segmenting Aesop's fables, treating each fable and its accompanying "moral" (such as "slow and steady wins the race") as separate documents, and seeing how many pairs the algorithm ends up placing in the same cluster. A corpus of the 21 most famous fables can be found here: http://www.bygosh.com/aesop/index.htm

Chris Hinrichs
For this project I will be analyzing acoustic data with latent semantic space algorithms: LSA, pLSA, and LDA if time permits. In order to do this, the acoustic data stream will have to be split into repeating terms. The sounds will include music, speech, and noise. Once this is done, the semantic space representation of the sounds will yield "topics" in the latent space, which could (depending on how the sound is translated into "terms") be thought of as distributions of cadences, rhythm structures, or chord and harmonic structures. A natural predictive task is whether or not the sound is "musical", i.e. whether it has a repeating cadence and chord structure, or is more "vocal", or is just "noise". To classify a new document, i.e. a new sound, the document would first be translated into terms, and then into latent topics. Finally, a tf-idf score could be used to match it to the classes.

Daniel Wong
Consider the problem of a journal receiving papers: one concern is whether or not these papers have been published before. Tools exist today, but few use the power of web-based search; others are commercial packages that target cheating on high school reports and papers. This is a difficult problem, since many modifications can be made to documents that are hard to detect, and searching manually by hand through all the papers received is far too time-consuming to be practical. A proposed method is to take random n-grams from a submission, use these word sequences as search seeds in a search tool, collect a list of documents in which each n-gram appears, and repeat. An annealing-like method can be used, starting with very long n-grams that are unlikely to have search hits and then reducing n, to collect a sequence of documents and a count of co-occurrences between random n-grams and specific documents. Some intuitive probabilistic statements can be made: the probability of a long n-gram matching any given document is smaller than that of a short n-gram matching it, and the probability that many randomly selected n-grams from the seed document all produce search hits in a single document is small if the two documents are unrelated and larger if they are related.
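As a rough illustration of the sampling step in this proposal (not a definitive implementation), the sketch below draws random n-grams from a suspect document and tallies how often each external document is returned, annealing n downward. The `web_search` function is a hypothetical stand-in for whatever phrase-search backend would actually be used, and the starting and ending values of n are arbitrary.

```python
import random
from collections import Counter

def sample_ngram(words, n):
    """Pick a random contiguous n-gram from the suspect document."""
    start = random.randrange(len(words) - n + 1)
    return " ".join(words[start:start + n])

def collect_hits(words, web_search, n_start=12, n_stop=4, queries_per_n=5):
    """Anneal n downward, counting how often each external document is hit
    by a randomly sampled n-gram.  `web_search(phrase)` is a hypothetical
    stand-in assumed to return a list of matching document identifiers."""
    hits = Counter()
    for n in range(n_start, n_stop - 1, -1):
        if len(words) < n:
            continue
        for _ in range(queries_per_n):
            phrase = sample_ngram(words, n)
            for doc_id in web_search(phrase):
                hits[doc_id] += 1
    # Documents that co-occur with many random n-grams are plagiarism suspects.
    return hits
```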
Derek Gjertson
The problem I am examining is the clustering of eBay auctions. Given a set of eBay auctions, I want to be able to cluster individual auctions that refer to the same item being sold. Clustering items in this way is beneficial when comparison shopping, to quickly find the best price for a particular item. Planned solution: To collect the data to cluster, I will use the eBay SOAP API to retrieve a number of items and their descriptions. I will use a leaf node of the eBay category hierarchy to maximize the number of records that can be linked together. After getting the data, I will first try to cluster based on the title of the item and its basic information. I will use a bag of words representation, supplemented with the more structured data, to calculate a similarity score, compare that similarity to a threshold, and use the result to decide whether two items can be linked together. I will then find which auctions this model works well on and which auctions it fails on. If I find a class of auctions that fails to cluster properly, I will look at the items' extended information; this extended data will need to be preprocessed to remove irrelevant information, and I will see whether it improves the accuracy of the model.
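A minimal sketch of the threshold-based linking step described above, assuming the auction titles have already been retrieved via the eBay API. It uses scikit-learn for the bag-of-words vectors; the tf-idf weighting and the 0.6 threshold are illustrative choices, not part of the proposal.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def link_auctions(titles, threshold=0.6):
    """Group auction titles whose pairwise cosine similarity exceeds a
    threshold (single-link: one similar pair is enough to merge groups)."""
    vectors = TfidfVectorizer().fit_transform(titles)
    sim = cosine_similarity(vectors)

    # Simple union-find over the "similar enough" graph.
    parent = list(range(len(titles)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            if sim[i, j] >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, title in enumerate(titles):
        clusters.setdefault(find(i), []).append(title)
    return list(clusters.values())
```

In practice the threshold would need tuning on real listings; the proposal's plan of inspecting which classes of auctions fail to cluster is the natural way to do that.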
Giridhar Ravipati
Satire Recognition: What news are you reading?
In this project we propose to perform satire recognition by classifying news articles as coming from The Onion or from other news sources (CNN and Reuters in particular) using statistical machine learning techniques. Articles published in authentic news sources are sensible, true pieces of news, whereas those in The Onion are based on reality but are modified to give them a satirical and humorous touch. We would like to proceed in stages:
* Data Collection: The Onion, CNN, and Reuters allow external user agents to freely crawl their sites for news articles. Instead of crawling their sites ourselves, we would like to use the Google AJAX Search API to obtain the web addresses of articles from the three sites. The API returns an XML file with all the article URLs; once we have it, we can fetch each page from our program and parse it for the news article itself. We would like to get around 1000 articles from The Onion and a mixture of 1000 articles from CNN and Reuters, and then divide them into a training set and a test set in an 80%:20% ratio.
* Feature Vector Selection: We would like to use the bag of words representation for the news articles, but at the same time make some of the important features in each article prominent so that we obtain good classification accuracy. We will therefore weight each word in the article, such that the differentiating words receive higher weight than the others. Some of the features we think are important to consider at this stage are satire features: a very common, almost defining feature of satire is a strong vein of irony or sarcasm, and parody, burlesque, exaggeration, juxtaposition, comparison, and analogy are devices frequently used in satirical speech and writing. After going through articles from The Onion, we think that exaggeration and slang are the main features that can be used for satire recognition. The most typical way of exaggerating is by using a lot of adjectives, and these features are almost non-existent in articles from other news sources. We suspect these alone will not be enough to get good accuracy, so we plan to experiment with other features as we progress.
* Methodology: We will use support vector machines to train and classify. As discussed above, we will represent each article as a weighted bag of words to make some features prominent. Adjectives in the articles can be identified by using a part-of-speech tagger for English such as CLAWS (which has a free trial). A list of commonly used slang terms can be obtained from "The Online Slang Dictionary" from Berkeley (http://www.ocf.berkeley.edu/~wrader/slang/).
* Evaluation Methodology: We will use test set classification accuracy to evaluate our project. After we train the SVM classifier on the training data, we will classify the test set articles as coming from The Onion or from the other news sources. Since we already have labels for the test set, we can compute the test set classification accuracy.

Houssam Nassif
Word Sense Disambiguation Using WordNet
Many words in natural languages have multiple meanings. In English, a word like "crane" can refer to the animal or to the machine. Word Sense Disambiguation (WSD) is the task of determining which sense a word has in a given sentence. WordNet is a lexical database for the English language [1]. It is a database of word senses, grouped into synsets: sets of synonyms sharing a unique meaning. Given a word, its different senses can be determined using WordNet. The aim of this project is to use WordNet for WSD.

Jake Rosin
Author Identification
Classifying documents based on content is a common endeavor in natural language processing. The relevant element may be topic, sentiment, earnestness (identifying satire, for example), etc. Automated content-based evaluation has applications ranging from sorting news stories to finding instances of plagiarism. In the latter case, text in one document which closely matches text from another is considered suspect even if the two are not identical, the assumption being that a plagiarist would permute the text to help disguise the source. If the source material being stolen from is not available to the plagiarism detector, however, then even a verbatim copy will be undetectable. Personal style has a significant effect on written work, influencing word choice, sentence structure, grammatical (mis)constructions, etc. A classifier which recognizes specific writing styles independent of content could detect material taken from a known author even if the source itself is not included in the comparison data. Additionally, it could detect a change in authorship for other reasons; examples include one student doing portions of another's homework in academia, or, in the real world, editorial changes made to a news article or opinion piece. I propose a system for classifying documents based on author. Support Vector Machines provide good results for most classification systems, and will be used for this one. SVMs require that documents be preprocessed into feature vectors; finding a set of features which identifies authorial style will be the focus of this project. Generating a bag of words is a good place to start, but depending on the corpus used this may result in a traditional topic classification system. Deeper analysis of sentences will be aided by the use of the Stanford Parser: additional features may be formed by counting the uses of various parts of speech, or the appearances of specific parse subtrees.
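As a rough sketch of the style-feature idea (not the project's actual feature set), the snippet below builds a small stylometric vector per document from function-word frequencies, average sentence length, and punctuation rates, and trains a linear SVM with scikit-learn. The word list and features are illustrative assumptions; parse-tree features from the Stanford Parser are omitted.

```python
import re
from sklearn.svm import LinearSVC

# A handful of topic-neutral function words; a real system would use many more.
FUNCTION_WORDS = ["the", "of", "and", "to", "that", "which", "however",
                  "although", "because", "upon", "while", "thus"]

def style_features(text):
    """Map a document to a small vector of style markers."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = max(len(words), 1)
    feats = [words.count(w) / n for w in FUNCTION_WORDS]
    feats.append(len(words) / max(len(sentences), 1))   # average sentence length
    feats.append(text.count(",") / n)                   # comma rate
    feats.append(text.count(";") / n)                   # semicolon rate
    return feats

def train_author_classifier(documents, authors):
    """Fit a linear SVM mapping style vectors to author labels."""
    X = [style_features(d) for d in documents]
    clf = LinearSVC()
    clf.fit(X, authors)
    return clf
```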
Piramanayagam Arumugua Nainar
Discovering Topics in Software using Latent Dirichlet Allocation
The size of software is growing day by day, and development teams are becoming larger and larger. Very few developers have complete knowledge of the system; each one has an individual area (module or feature) of expertise. In projects involving more than a few dozen developers, it is difficult to assign new feature requests or bug reports to the right person, so it would be useful to automatically extract the expertise of each developer. The information available for this task is:
1. the messages written by the developer while checking code into the version control system;
2. the bug reports of the bugs fixed by the developer;
3. the actual source code written by the developer or, at a coarser level, just the source files touched by her.
For this project, we propose to learn the topics present in a software project and, if possible, learn a topic distribution for each developer. We plan to use commit messages and bug reports. The actual source code will be less useful, because we may find topics corresponding to coding style and variable naming conventions rather than the actual semantics of the code. But we can still use the coarser information, namely the file names, that is available as part of commit messages.

Pavan Kuppili, Rokas Venckevicius, Xiyang Chen
Our project will address the following problem: given an academic paper or any general textual query, find a researcher who would be best suited to talk to you about it (based on their previous research). For example, given a paper as a PDF file, we would like to know who in the UW Computer Science department would be best suited to review it. Or, given a phrase such as "probabilistic motion planning", who are the people to talk to in the top Computer Science departments in the US? To answer this question, we will build a statistical profile for a list of researchers based on their published papers and compute the cosine similarity between the query and the profiles to find the best match.

(The above list is incomplete.)
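A minimal sketch of the researcher-matching step from the Kuppili/Venckevicius/Chen proposal above, assuming each researcher's published papers have already been gathered into a single text profile. The tf-idf weighting and English stop-word filtering are illustrative choices made here, not details from the proposal.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_researchers(profiles, query, top_k=5):
    """Rank researchers by cosine similarity between the query text and
    each researcher's profile (the concatenated text of their papers).
    `profiles` is a dict mapping researcher name -> profile text."""
    names = list(profiles)
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform([profiles[n] for n in names])
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, matrix)[0]
    return sorted(zip(names, scores), key=lambda p: p[1], reverse=True)[:top_k]

# Example: a short phrase query; the full text of a paper works the same way.
# rank_researchers(profiles, "probabilistic motion planning")
```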