CS 769 - Advanced Natural Language Processing

Books and References for CS 769:

Matlab tutorials
A Very Elementary MATLAB Tutorial from MathWorks

Reference books
[cB] Christopher M. Bishop, Pattern Recognition and Machine Learning. Springer Verlag, 2006.
[MS] Manning & Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, 1999.
[JM] Jurafsky & Martin, Speech and Language Processing, Prentice Hall, 2000.
[HTF] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, 2009. Available online.
[dM] David MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2002.

Mathematical background
[cB] 1.2, Appendix B, C, E
[dM] 2 or [MS] 2.1
Iain Murray's crib sheet.
Sam Roweis' matrix identities.
Stephen Boyd and Lieven Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
Dan Klein's Lagrange Multipliers without Permanent Scarring.
Peter Doyle and Laurie Snell. Random Walks and Electric Networks. Mathematical Association of America, 1984

Statistics of the English language
[MS] 4.2, 1.4.2, 1.4.3
Zipf's law
Wentian Li, Comments to "Bell Curves and Monkey Languages", 1996
Wentian Li. Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Transactions on Information Theory, 38(6), 1842-1845, 1992
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926-1928.
Lillian Lee. "I'm sorry Dave, I'm afraid I can't do that": Linguistics, Statistics, and Natural Language Processing circa 2001. Computer Science: Reflections on the Field, Reflections from the Field, pp. 111--118, 2004.

Language modeling
[cB] 2.1, 2.2
[MS] 6 or [JM] 6
Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. TR-10-98, Computer Science Group, Harvard University, 1998.
Ronald Rosenfeld. Two decades of Statistical Language Modeling: Where Do We Go From Here? Proceedings of the IEEE, 88(8), 2000.
Yee Whye Teh. A Bayesian Interpretation of Interpolated Kneser-Ney. Technical Report TRA2/06, School of Computing, NUS.
Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och and Jeffrey Dean. Large Language Models in Machine Translation. EMNLP 2007.
A Hierarchical Dirichlet Language Model. David MacKay, Linda Peto. 1994.
The CMU-Cambridge Statistical Language Modeling toolkit v2

The entropy of a language, information theory
[cB] 1.6, including a nice introduction to differential entropy
[MS] 2.2 or [JM] 6.7
Brown, Della Pietra, Mercer, Della Pietra, Lai. An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1), pp. 31-40, 1992.
Claude Shannon. A mathematical theory of communication
Thomas Cover and Joy Thomas. Elements of information theory. ISBN 0471062596

Information retrieval and link analysis
John Lafferty and Chengxiang Zhai. Probabilistic relevance models based on document and query generation, In Language Modeling and Information Retrieval, Kluwer International Series on Information Retrieval, Vol. 13, 2003.
ChengXiang Zhai, John Lafferty. A study of smoothing methods for language models applied to information retrieval, ACM Transactions on Information Systems, Vol. 22, No. 2, April 2004.
[MS] 15
The Lemur toolkit
Lawrence Page and Sergey Brin and Rajeev Motwani and Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Library Technologies Project. 1998
Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604--632, 1999
C. Faloutsos, T. Kolda and J. Sun. Mining Large Time-evolving Data Using Matrix and Tensor Tools. ICML 2007 tutorial, Corvallis, OR, USA

Document summarization
P. Turney. Learning to extract keyphrases from text. Technical report, National Research Council, Institute for Information Technology, 1999.
A. Hulth. Improved automatic keyword extraction given more linguistic knowledge. In Proc. Conf. Empirical Methods in Natural Language Processing, 2003.
R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. In Proc. Conf. Empirical Methods in Natural Language Processing, 2004.
G. Erkan and D. Radev. 2004. LexRank: Graph-based centrality as salience in text summarization. Journal of Artificial Intelligence Research.
X. Zhu, A. Goldberg, J. Van Gael and D. Andrzejewski. Improving Diversity in Ranking using Absorbing Random Walks. NAACL-HLT, 2007.

Text categorization: Naive Bayes, logistic regression
[cB] 8.1, 8.2 for Naive Bayes; 4.3 for logistic regression.
A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. AAAI-98 Workshop on "Learning for Text Categorization".
Andrew McCallum's rainbow statistical text classification code
Adam Berger, Stephen Della Pietra, and Vincent Della Pietra, 1996. A maximum entropy approach to natural language processing. Computational Linguistics 22(1).
Ronald Rosenfeld. A Maximum Entropy Approach to Adaptive Statistical Language Modeling. Computer Speech and Language 10, 187--228, 1996
Stanley Chen and Ronald Rosenfeld. Efficient Sampling and Feature Selection in Whole Sentence Maximum Entropy Language Models. In Proc. ICASSP '99, Phoenix, Arizona, March 1999.
Zhang Le's MaxEnt page
Y. Dan Rubinstein and Trevor Hastie, 1997. Discriminative vs Informative Learning. Proc. of KDD.
Andrew Y. Ng and Michael Jordan, 2002. On discriminative vs. generative classifiers: A comparison of logistic regression and Naive Bayes. Proc. of NIPS.
Florian Wolf, Tomaso Poggio and Pawan Sinha, 2006. Human Document Classification Using Bags of Words. Tech report MIT-CSAIL-TR-2006-054.

Sentiment, humor, gender analysis with Support Vector Machines
[cB] 7.1
Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning (ECML), Springer, 1998.
Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2), pp. 1-135, 2008.
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP, 2002.
The Yahoo! SentimentAI group: Sentiment and Affect in Text. (need to join the group)
Bing Liu's Opinion Mining page.
Rada Mihalcea and Carlo Strapparava. Making Computers Laugh: Investigations in Automatic Humor Recognition. EMNLP, 2005.
Moshe Koppel, Shlomo Argamon, Anat Rachel Shimoni. Automatically Categorizing Written Texts by Author Gender. Literary and Linguistic Computing 17(4), November 2002, pp. 401-412.
Chris Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2), 1998.
Alex J. Smola and Bernhard Schölkopf. A Tutorial on Support Vector Regression, NeuroCOLT Technical Report TR-98-030. 1998
Thorsten Joachims' SVM-light code

Clustering
Ulrike von Luxburg. A Tutorial on Spectral Clustering. Statistics and Computing 17(4), 395-416, December 2007.
ICML 2004 tutorial on spectral clustering by Chris Ding
Fernando Pereira, Naftali Tishby and Lillian Lee. Distributional clustering of English words. Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, 1993.
C.J.C. Burges. Dimension Reduction: A Guided Tour. Foundations and Trends in Machine Learning, 2010.

Semi-supervised learning: using both labeled and unlabeled data
[cB] 9 for the EM algorithm.
Self-training for word sense disambiguation: David Yarowsky, 1995. Unsupervised word sense disambiguation rivaling supervised methods, Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp 189--196.
Text Classification from Labeled and Unlabeled Documents using EM. Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell. Machine Learning, 39(2/3). pp. 103-134. 2000.
Combining Labeled and Unlabeled Data with Co-Training. Avrim Blum and Tom Mitchell. Proceedings of the 11th Annual Conference on Computational Learning Theory, pages 92--100, 1998
T. Joachims, Transductive Inference for Text Classification using Support Vector Machines. Proceedings of the International Conference on Machine Learning (ICML), 1999.
Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions. Xiaojin Zhu, Zoubin Ghahramani, John Lafferty. The Twentieth International Conference on Machine Learning (ICML-2003)
Semi-Supervised Learning Literature Survey. Xiaojin Zhu, Computer Sciences TR 1530, University of Wisconsin - Madison.

Latent topic models
Semantic space via probabilistic Latent Semantic Analysis, latent Dirichlet allocation
Probabilistic Latent Semantic Analysis. Thomas Hofmann. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI'99)
Probabilistic Latent Semantic Indexing. Thomas Hofmann. Proceedings of the 22nd International Conference on Research and Development in Information Retrieval (SIGIR'99)
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.
Griffiths, T., & Steyvers, M. Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235. 2004

Part of Speech tagging with Hidden Markov Models
[MS] 10 for POS tagging
[cB] 13.2, [MS] 9, or [JM] 7.1-7.4 for HMM
Zoubin Ghahramani, 2001. An Introduction to Hidden Markov Models and Bayesian Networks, International Journal of Pattern Recognition and Artificial Intelligence 15(1):9-42.
Lawrence R. Rabiner, 1989. A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE 77(2), pp. 257-286. (An Erratum by Ali Rahimi)
David Elworthy, 1994. Does Baum-Welch Re-estimation help taggers? Proceedings of the 4th Conference on Applied Natural Language Processing.
Kevin Murphy's Hidden Markov Model (HMM) Toolbox for Matlab
Stanford Log-linear Part-Of-Speech Tagger

Information extraction with Conditional Random Fields
John Lafferty, Andrew McCallum, Fernando Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML-2001), 2001.
Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields for Relational Learning. In Introduction to Statistical Relational Learning. Edited by Lise Getoor and Ben Taskar. MIT Press. 2006.
Fei Sha and Fernando Pereira. Shallow Parsing with Conditional Random Fields. Proceedings of HLT-NAACL 2003.
Andrew McCallum. Efficiently Inducing Features of Conditional Random Fields. Uncertainty in AI, 2003.
Hanna Wallach's conditional random fields page
Andrew McCallum's MALLET code

Parsing and context free grammars
[MS] 11 or [JM] 9, 12
Detlef Prescher. A Tutorial on the Expectation-Maximization Algorithm Including Maximum-Likelihood Estimation and EM Training of Probabilistic Context-Free Grammars. The 15th European Summer School in Logic, Language and Information (ESSLLI-03).

Machine Translation
Adam L. Berger, Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, John R. Gillett, John D. Lafferty, Robert L. Mercer, Harry Printz, and Lubos Ures, 1994. The Candide System for Machine Translation. Proceedings of the 1994 ARPA Workshop on Human Language Technology
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer, 1993. The Mathematics of Statistical Machine Translation. Computational Linguistics 19(2), pp. 263--311.
Papineni, Roukos, Ward, Zhu. Bleu: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 311-318.

Speech
The CMU Pronouncing Dictionary

Spoken Document Retrieval
The TREC spoken document retrieval track: A success story, 1999
SpeechFind

Related Courses

text data mining, Callan, CMU
human language technologies, Callan, Black, Lavie, CMU
information retrieval, Callan, Yang, CMU
natural language processing, Cardie, Cornell
learning to turn words into data, Cohen, CMU
Machine Learning Approaches for Natural Language Processing, Collins, MIT
introduction to bioinformatics, Craven, U Wisconsin
advanced bioinformatics, Craven, U Wisconsin
machine learning for text analysis, Craven, Shavlik, U Wisconsin
Empirical Methods in Natural Language Processing, Koehn, Edinburgh
statistical foundations of machine learning, Lafferty, Wasserman, CMU
algorithms for NLP, Lavie, Frederking, CMU
statistical natural language processing: models and methods, Lee, Cornell
natural language processing, Lee, Cornell
statistical methods for artificial intelligence, McAllester, TTI-C
introduction to natural language processing, McCallum, U Mass
natural language processing, Mihalcea, University of North Texas
advanced methods in artificial intelligence, Page, U Wisconsin
topics in Natural Language Processing, Ringger, BYU
language and statistics, Rosenfeld, CMU
machine learning, Shavlik, U Wisconsin
speech recognition and understanding, Schultz, Waibel, CMU
Graphs and Networks, Spielman, Yale
Practical Machine Learning, Jordan, Berkeley
Topics in machine learning, Sha, USC
Computational Data Analysis: FOUNDATIONS OF MACHINE LEARNING & DATA MINING, Gray, Georgia Tech
Analysis of Social Media, Cohen and Glance, CMU
Text-Driven Forecasting, Smith, CMU