CS 838-1: Advanced Natural Language Processing
Spring 2007
This course has two themes: applications in natural language processing,
and statistical machine learning methods. The applications include text
categorization, document summarization, sentiment analysis, word sense
disambiguation, machine translation, and speech recognition, while the
techniques include basic information theory and probabilistic modeling,
Expectation-Maximization, Support Vector Machines, probabilistic
context-free grammars, Hidden Markov Models, Conditional Random Fields,
latent Dirichlet allocation, graphical models (Markov Chain Monte Carlo
and variational inference), link analysis, and semi-supervised learning.
The learning methods are also applicable to bioinformatics, computer
vision, computer code analysis, and other fields. This course counts as
core AI credit.
Schedule
Lecture: 11:00am-12:15pm TR, CS 1325
Office hour: Thursday 3pm-4pm, CS 4369
Class mailing list: compsci838-1-s07@lists.wisc.edu (archive)
Email the instructor: jerryzhu@cs.wisc.edu
Teaching assistant: Chi-Man Liu, cx@cs.wisc.edu
Please feel free to send me email; I usually respond quickly.
Course Outline and Readings
The order and exact content are subject to change.
- Review of mathematical background
- [cB] (see 'Books' below) 1.2, Appendix B, C, E
- [dM] 2 or [MS] 2.1
- Iain Murray's excellent crib sheet.
- Statistics of the English language
- [MS] 4.2, 1.4.2, 1.4.3
- Zipf's law, literature
- Wentian Li, Comments to "Bell Curves and Monkey Languages", 1996.
- Wentian Li. Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Transactions on Information Theory, 38(6), 1842-1845, 1992.
- Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926-1928.
- Lillian Lee. "I'm sorry Dave, I'm afraid I can't do that": Linguistics, Statistics, and Natural Language Processing circa 2001. Computer Science: Reflections on the Field, Reflections from the Field, pp. 111--118, 2004.
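Zipf's law is easy to see empirically: rank words by frequency, and rank times frequency stays in the same ballpark. A minimal sketch (the text is an invented toy example; any large corpus shows the effect far more clearly):

```python
from collections import Counter

# Invented toy text; substitute any large text file for a real experiment.
text = """the quick brown fox jumps over the lazy dog the fox
runs and the dog sleeps while the quick fox jumps again"""

# Count word frequencies and sort from most to least frequent.
counts = Counter(text.split())
ranked = counts.most_common()

# Under Zipf's law the r-th most frequent word has frequency roughly
# proportional to 1/r, so rank * frequency is roughly constant.
for rank, (word, freq) in enumerate(ranked[:5], start=1):
    print(rank, word, freq, rank * freq)
```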
- Language modeling, smoothing
- The entropy of a language, information theory
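These two topics can be made concrete in a few lines: a unigram model with add-one (Laplace) smoothing, scored by the cross-entropy (in bits per word) it assigns to held-out text. The two word lists are invented toy data:

```python
import math
from collections import Counter

# Invented toy data: a tiny training text and a held-out test text.
train = "the cat sat on the mat".split()
test = "the dog sat".split()

vocab = set(train) | set(test)
counts = Counter(train)
N, V = len(train), len(vocab)

def prob(w):
    # Add-one (Laplace) smoothing: every word in the vocabulary,
    # seen in training or not, gets a nonzero probability.
    return (counts[w] + 1) / (N + V)

# Cross-entropy of the test text in bits per word; lower is better.
H = -sum(math.log2(prob(w)) for w in test) / len(test)
print(round(H, 3))
```

Smoothing is what keeps the unseen word "dog" from driving the cross-entropy to infinity.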
- Information retrieval and link analysis
- John Lafferty and Chengxiang Zhai, Probabilistic relevance models based on document and query generation, In Language Modeling and Information Retrieval, Kluwer International Series on Information Retrieval, Vol. 13, 2003.
- ChengXiang Zhai, John Lafferty, A study of smoothing methods for language models applied to information retrieval, ACM Transactions on Information Systems, Vol. 22, No. 2, April 2004.
- [MS] 15
- the Lemur toolkit
- The PageRank Citation Ranking: Bringing Order to the Web. Lawrence Page and Sergey Brin and Rajeev Motwani and Terry Winograd. Stanford Digital Library Technologies Project. 1998.
- Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604--632, 1999.
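The PageRank computation in the Page et al. report reduces to a simple power iteration. A minimal sketch on an invented four-page graph, with the damping factor 0.85 used in the paper:

```python
# Power-iteration PageRank on an invented four-page web graph.
# links[i] lists the pages that page i points to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = len(links)
d = 0.85              # damping factor, as in the PageRank paper
pr = [1.0 / n] * n    # start from the uniform distribution

for _ in range(100):
    # Each page keeps (1-d)/n baseline mass, plus shares from in-links.
    new = [(1 - d) / n] * n
    for page, outs in links.items():
        share = d * pr[page] / len(outs)
        for target in outs:
            new[target] += share
    pr = new

# Scores sum to 1; page 2, with the most in-links, ranks highest.
print([round(x, 3) for x in pr])
```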
- Document summarization
- P. Turney. Learning to extract keyphrases from text. Technical report, National Research Council, Institute for Information Technology, 1999.
- A. Hulth. Improved automatic keyword extraction given more linguistic knowledge. In Proc. Conf. Empirical Methods in Natural Language Processing, 2003.
- R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. In Proc. Conf. Empirical Methods in Natural Language Processing, 2004.
- G. Erkan and D. Radev. 2004. LexRank: Graph-based centrality as salience in text summarization. Journal of Artificial Intelligence Research.
- X. Zhu, A. Goldberg, J. Van Gael and D. Andrzejewski. Improving Diversity in Ranking using Absorbing Random Walks. NAACL-HLT, 2007.
- Text categorization: Naive Bayes, logistic regression
- [cB] 8.1, 8.2 for Naive Bayes; 4.3 for logistic regression.
- A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. AAAI-98 Workshop on "Learning for Text Categorization".
- Andrew McCallum's rainbow statistical text classification code
- Adam Berger, Stephen Della Pietra, and Vincent Della Pietra, 1996. A maximum entropy approach to natural language processing. Computational Linguistics 22(1).
- Ronald Rosenfeld. A Maximum Entropy Approach to Adaptive Statistical Language Modeling. Computer, Speech and Language 10, 187--228, 1996.
- Stanley Chen and Ronald Rosenfeld. Efficient Sampling and Feature Selection in Whole Sentence Maximum Entropy Language Models. In Proc. ICASSP '99, Phoenix, Arizona, March 1999.
- Zhang Le's MaxEnt page
- Y. Dan Rubenstein and Trevor Hastie, 1997. Discriminative vs Informative Learning. Proc. of KDD.
- Andrew Y. Ng and Michael Jordan, 2002. On discriminative vs. generative classifiers: A comparison of logistic regression and Naive Bayes. Proc. of NIPS.
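A multinomial Naive Bayes text classifier of the kind compared by McCallum and Nigam fits in a few lines. This sketch uses an invented toy sentiment-style training set and add-one smoothing:

```python
import math
from collections import Counter

# Invented toy training set: (document words, label).
docs = [
    ("good great fun".split(), "pos"),
    ("great story good acting".split(), "pos"),
    ("bad boring plot".split(), "neg"),
    ("boring bad bad".split(), "neg"),
]

labels = {y for _, y in docs}
vocab = {w for words, _ in docs for w in words}
word_counts = {y: Counter() for y in labels}
doc_counts = Counter()
for words, y in docs:
    word_counts[y].update(words)
    doc_counts[y] += 1

def score(words, y):
    # log P(y) + sum over words of log P(w | y), with add-one smoothing.
    total = sum(word_counts[y].values())
    s = math.log(doc_counts[y] / len(docs))
    for w in words:
        s += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
    return s

def classify(words):
    return max(labels, key=lambda y: score(words, y))

print(classify("good fun".split()))
```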
- Sentiment, humor, gender analysis with Support Vector Machines
- [cB] 7.1
- Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP, 2002.
- Rada Mihalcea and Carlo Strapparava. Making Computers Laugh: Investigations in Automatic Humor Recognition. EMNLP, 2005.
- Moshe Koppel, Shlomo Argamon, Anat Rachel Shimoni. Automatically Categorizing Written Texts by Author Gender. Literary and Linguistic Computing 17(4), November 2002, pp. 401-412.
- Chris Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.
- Alex J. Smola and Bernhard Scholkopf. A Tutorial on Support Vector Regression, NeuroCOLT Technical Report TR-98-030, 1998.
- Thorsten Joachims' SVM-light code
- Word sense disambiguation: unlabeled text as knowledge source
- [cB] 9 for the EM algorithm.
- Self-training for word sense disambiguation: David Yarowsky, 1995. Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp. 189--196.
- Text Classification from Labeled and Unlabeled Documents using EM. Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell. Machine Learning, 39(2/3), pp. 103-134, 2000.
- Combining Labeled and Unlabeled Data with Co-Training. Avrim Blum and Tom Mitchell. Proceedings of the 11th Annual Conference on Computational Learning Theory, pages 92--100, 1998.
- T. Joachims. Transductive Inference for Text Classification using Support Vector Machines. Proceedings of the International Conference on Machine Learning (ICML), 1999.
- Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions. Xiaojin Zhu, Zoubin Ghahramani, John Lafferty. The Twentieth International Conference on Machine Learning (ICML-2003).
- Semi-Supervised Learning Literature Survey. Xiaojin Zhu. Computer Sciences TR 1530, University of Wisconsin - Madison.
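Yarowsky-style self-training can be sketched in miniature: train on a few labeled seeds, label the unlabeled example the model is most confident about, add it to the training set, and repeat. The contexts below are invented toy data, and a simple word-overlap scorer stands in for a real classifier:

```python
from collections import Counter

# Minimal self-training sketch in the spirit of Yarowsky (1995).
# Invented seed contexts for two senses of the ambiguous word "bank".
labeled = [
    ("river water".split(), "river"),
    ("loan money interest".split(), "finance"),
]
unlabeled = [
    "river water fish bank".split(),
    "loan money cash interest bank".split(),
    "fish swim near bank".split(),
]

def predict(words, model):
    # Score each sense by word overlap with its training contexts;
    # the margin between the senses serves as a confidence measure.
    scores = {y: sum(c[w] for w in words) for y, c in model.items()}
    best = max(scores, key=scores.get)
    return best, scores[best] - min(scores.values())

results = []
while unlabeled:
    model = {}
    for words, y in labeled:
        model.setdefault(y, Counter()).update(words)
    # Take the unlabeled context with the largest confidence margin.
    i = max(range(len(unlabeled)),
            key=lambda j: predict(unlabeled[j], model)[1])
    words = unlabeled.pop(i)
    sense, _ = predict(words, model)
    labeled.append((words, sense))
    results.append((" ".join(words), sense))

print(results)
```

Note that "fish swim near bank" shares no word with either seed; it is still labeled correctly because a confidently self-labeled example bridges the gap, which is the point of self-training.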
- Semantic space via probabilistic Latent Semantic Analysis, latent Dirichlet allocation
- Part of Speech tagging with Hidden Markov Models
- Information extraction with Conditional Random Fields
- Parsing and context free grammars
- Machine Translation
- Adam L. Berger, Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, John R. Gillett, John D. Lafferty, Robert L. Mercer, Harry Printz, and Lubos Ures, 1994. The Candide System for Machine Translation. Proceedings of the 1994 ARPA Workshop on Human Language Technology.
- Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer, 1993. The Mathematics of Statistical Machine Translation. Computational Linguistics 19(2), pp. 263--311.
- Papineni, Roukos, Ward, Zhu. Bleu: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 311-318.
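The Bleu score of Papineni et al. combines modified n-gram precision with a brevity penalty. A minimal sketch, restricted to unigrams and bigrams, a single reference, invented example sentences, and without the smoothing needed when some n-gram precision is zero:

```python
import math
from collections import Counter

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(candidate, reference, max_n=2):
    # Modified n-gram precision: each candidate n-gram is credited
    # at most as many times as it appears in the reference.
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    # Brevity penalty discourages overly short candidates.
    bp = (1.0 if len(candidate) > len(reference)
          else math.exp(1 - len(reference) / len(candidate)))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "the cat is on the mat".split()
print(round(bleu("the cat sat on the mat".split(), reference), 3))
```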
Lecture notes
References
Books
- Textbook: [cB] Christopher M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
- [MS] Manning & Schutze, Foundations of Statistical Natural Language Processing. The MIT Press, 1999.
- [JM] Jurafsky & Martin, Speech and Language Processing. Prentice Hall, 2000.
- [dM] David MacKay, Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
Courses
- language and statistics, Rosenfeld, CMU
- human language technologies, Callan, Black, Lavie, CMU
- algorithms for NLP, Lavie, Frederking, CMU
- speech recognition and understanding, Schultz, Waibel, CMU
- information retrieval, Callan, Yang, CMU
- text data mining, Callan, CMU
- learning to turn words into data, Cohen, CMU
- machine learning for text analysis, Craven, Shavlik, U Wisconsin
- introduction to natural language processing, McCallum, U Mass
- statistical methods for artificial intelligence, McAllester, TTI-C
- statistical natural language processing: models and methods, Lee, Cornell
- natural language processing, Lee, Cornell
- natural language processing, Cardie, Cornell
- statistical foundations of machine learning, Lafferty, Wasserman, CMU
- Empirical Methods in Natural Language Processing, Koehn, Edinburgh
- Natural Language Processing, Mihalcea, University of North Texas
- Topics in Natural Language Processing, Ringger, BYU
- introduction to bioinformatics, Craven, U Wisconsin
- advanced bioinformatics, Craven, U Wisconsin
- machine learning, Shavlik, U Wisconsin
- advanced methods in artificial intelligence, Page, U Wisconsin
Grading
- Homework 35%.
- Exam 30%.
- Project 30%.
This year we will allow several types of projects:
- Completely open. If you have an idea for an interesting project, please send me email / talk to me. Last year's projects can be found here for your reference.
- Semi-open. Consider the Web contents of all users in our department. These can be easily obtained (if you have a CS account) in each user's ~/public/html directory. Do something interesting and creative with it.
- Focused. I will propose a few concrete project ideas:
- Given a researcher's name (e.g., Michael Jordan), find the person's Web page and build a statistical profile for the person (e.g., a unigram language model from the researcher's publication PDF files). With many profiles from different researchers, we can ask questions like "who is the best person to ask this question", "who can review this paper", etc.
- Paper search. Given a PDF paper file, identify near-identical papers in a collection of PDF files or online.
- Satire recognition. For example, distinguish articles from The Onion vs. CNN.
It is OK to work in groups, but each group member should make clear his or her distinct intellectual contribution.
- "Quality" class participation 5%.
Quotes
"I'm right now working at Google in the Search Quality team. I see a
lot of concepts we covered in CS838 being used here. Thanks again for offering
the course."
-- former student
"The best part of the course was the classes. Teaching was very good. Even the very difficult concepts appeared simple."
-- former student
"Dude, it is big... but lovely."
-- word samples from a unigram language model trained on movie reviews (punctuation added)