Books and References for CS 769:

   MATLAB tutorials
        A Very Elementary MATLAB Tutorial from MathWorks

   Reference books
	[cB] Christopher M. Bishop, Pattern Recognition and Machine Learning. Springer Verlag, 2006.
	[MS] Manning & Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
	[JM] Jurafsky & Martin, Speech and language processing, Prentice Hall, 2000.
	[HTF] Trevor Hastie, Robert Tibshirani, and Jerome Friedman.  The Elements of Statistical Learning: 
	      Data Mining, Inference, and Prediction.  Second Edition, 2009.  Available online.
	[dM] David MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.

   Mathematical background
	[cB] 1.2, Appendix B, C, E
	[dM] 2 or [MS] 2.1
	Iain Murray's crib sheet.
	Sam Roweis' matrix identities.
	Stephen Boyd and Lieven Vandenberghe, Convex Optimization.  Cambridge University Press, 2004.
	Dan Klein's Lagrange Multipliers without Permanent Scarring.
	Peter Doyle and Laurie Snell. Random Walks and Electric Networks. Mathematical Association of America, 1984

   Statistics of the English language
	[MS] 4.2, 1.4.2, 1.4.3
	Zipf's law
	Wentian Li, Comments to "Bell Curves and Monkey Languages", 1996
	Wentian Li.  Random texts exhibit Zipf's-law-like word frequency distribution.  IEEE Transactions on Information Theory, 38(6), 1842-1845, 1992
	Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996).  Statistical learning by 8-month-old infants.  Science, 274, 1926-1928.
	Lillian Lee.  "I'm sorry Dave, I'm afraid I can't do that": Linguistics, Statistics, and Natural Language Processing circa 2001. Computer Science: Reflections on the Field, Reflections from the Field, pp. 111--118, 2004. 
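	Zipf's law, the subject of the Li papers above, says the frequency of the r-th most frequent word falls off roughly as 1/r, so rank times count is roughly constant. A minimal sketch (the toy text and function name are ours, not from the readings):

```python
from collections import Counter

def rank_frequency(text):
    """Return (rank, count) pairs for words sorted by descending frequency."""
    counts = Counter(text.lower().split())
    ordered = sorted(counts.values(), reverse=True)
    return list(enumerate(ordered, start=1))

# Zipf's law predicts count ~ C / rank, i.e. rank * count roughly constant.
text = "the cat sat on the mat and the dog sat on the log"
pairs = rank_frequency(text)
# pairs[0] is (1, 4): 'the' is rank 1 with count 4
```

	On real corpora one would plot log rank against log count and look for a straight line of slope near -1.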

   Language modeling
	[cB] 2.1, 2.2 
	[MS] 6 or [JM] 6
	Stanley F. Chen and Joshua Goodman.  An empirical study of smoothing techniques for language modeling.  TR-10-98, Computer Science Group, Harvard University, 1998
	Ronald Rosenfeld.  Two decades of Statistical Language Modeling: Where Do We Go From Here?  Proceedings of the IEEE, 88(8), 2000.
	Yee Whye Teh. A Bayesian Interpretation of Interpolated Kneser-Ney. Technical Report TRA2/06, School of Computing, NUS.
	Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och and Jeffrey Dean.  Large Language Models in Machine Translation.  EMNLP 2007.
	David MacKay and Linda Peto.  A Hierarchical Dirichlet Language Model.  1994
	The CMU-Cambridge Statistical Language Modeling toolkit v2
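	One of the simplest schemes Chen & Goodman evaluate is linear (Jelinek-Mercer) interpolation of bigram and unigram maximum-likelihood estimates. A minimal sketch, with a hand-picked interpolation weight rather than one tuned on held-out data:

```python
from collections import Counter

def train_bigram(tokens):
    """Collect unigram and bigram counts from a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def interp_prob(w, prev, unigrams, bigrams, lam=0.7):
    """Linearly interpolated bigram probability: lam*P_ML(w|prev) + (1-lam)*P_ML(w)."""
    n = sum(unigrams.values())
    p_uni = unigrams[w] / n
    p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni

tokens = "the cat sat on the mat".split()
unigrams, bigrams = train_bigram(tokens)
p = interp_prob("cat", "the", unigrams, bigrams)  # 0.7*(1/2) + 0.3*(1/6) = 0.4
```

	Kneser-Ney and the other schemes surveyed above refine how the lower-order distribution is estimated and how the weights depend on context counts.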

   The entropy of a language, information theory
	[cB] 1.6, including a nice introduction to differential entropy
	[MS] 2.2 or [JM] 6.7
	Brown, Della Pietra, Mercer, Della Pietra, Lai.  An estimate of an upper bound for the entropy of English.  Computational Linguistics, 18(1), pp. 31-40, 1992
	Claude Shannon.  A mathematical theory of communication.  Bell System Technical Journal, 1948
	Thomas Cover and Joy Thomas.  Elements of information theory.  ISBN 0471062596
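	The Brown et al. paper above upper-bounds the entropy of English by the per-word cross-entropy of text under a language model. A minimal sketch with a unigram model and add-one smoothing (toy data and names are ours; real estimates use far better models and corpora):

```python
import math
from collections import Counter

def cross_entropy(model_counts, test_tokens):
    """Per-token cross-entropy (bits) of test text under a smoothed unigram model.
    By Shannon's argument this upper-bounds the entropy of the source."""
    n = sum(model_counts.values())
    vocab = len(model_counts)  # truly unseen words would need an UNK type
    h = 0.0
    for w in test_tokens:
        p = (model_counts[w] + 1) / (n + vocab)  # add-one smoothing
        h -= math.log2(p)
    return h / len(test_tokens)

train = "the cat sat on the mat".split()
h = cross_entropy(Counter(train), "the cat sat".split())  # a bit over 2 bits/word
```

	A better model never raises this bound in expectation, which is why cross-entropy (or its exponential, perplexity) is the standard language-model evaluation.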

   Information retrieval and link analysis
	John Lafferty and Chengxiang Zhai.  Probabilistic relevance models based on document and query generation, In Language Modeling and Information Retrieval, Kluwer International Series on Information Retrieval, Vol. 13, 2003.
	ChengXiang Zhai, John Lafferty.  A study of smoothing methods for language models applied to information retrieval, ACM Transactions on Information Systems, Vol. 22, No. 2, April 2004.
	[MS] 15
	The Lemur toolkit
	Lawrence Page and Sergey Brin and Rajeev Motwani and Terry Winograd.  The PageRank Citation Ranking: Bringing Order to the Web.  Stanford Digital Library Technologies Project. 1998
	Jon M. Kleinberg.  Authoritative sources in a hyperlinked environment.  Journal of the ACM, 46(5), 604--632, 1999
	C. Faloutsos, T. Kolda and J. Sun. Mining Large Time-evolving Data Using Matrix and Tensor Tools. ICML 2007 tutorial, Corvallis, OR, USA
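	The PageRank computation in the Page et al. report above is a power iteration on the web's link graph. A minimal sketch on a toy graph, with the usual damping factor 0.85 (the graph and names are ours):

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank on a dict of page -> list of outlinks."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = d * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:  # dangling page: spread its rank uniformly
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = pagerank(links)  # ranks sum to 1; "c", with two in-links, ranks highest
```

	The same iteration, run on a sentence-similarity graph instead of a link graph, is the core of the TextRank and LexRank summarizers listed in the next section.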

   Document summarization
	P. Turney. Learning to extract keyphrases from text. Technical report, National Research Council, Institute for Information Technology, 1999. 
	A. Hulth. Improved automatic keyword extraction given more linguistic knowledge. In Proc.  Conf. Empirical Methods in Natural Language Processing, 2003. 
	R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. In Proc. Conf. Empirical Methods in Natural Language Processing, 2004.
	G. Erkan and D. Radev. 2004. LexRank: Graph-based centrality as salience in text summarization.  Journal of Artificial Intelligence Research. 
	X. Zhu, A. Goldberg, J. Van Gael and D. Andrzejewski.  Improving Diversity in Ranking using Absorbing Random Walks.  NAACL-HLT, 2007. 

   Text categorization: Naive Bayes, logistic regression
	[cB] 8.1, 8.2 for Naive Bayes; 4.3 for logistic regression.
	A Comparison of Event Models for Naive Bayes Text Classification. Andrew McCallum and Kamal Nigam. AAAI-98 Workshop on "Learning for Text Categorization".
	Andrew McCallum's rainbow statistical text classification code
	Adam Berger, Stephen Della Pietra, and Vincent Della Pietra, 1996. A maximum entropy approach to natural language processing. Computational Linguistics 22(1).
	Ronald Rosenfeld. A Maximum Entropy Approach to Adaptive Statistical Language Modeling.  Computer Speech and Language 10, 187--228, 1996
	Stanley Chen and Ronald Rosenfeld. Efficient Sampling and Feature Selection in Whole Sentence Maximum Entropy Language Models. In Proc. ICASSP '99, Phoenix, Arizona, March 1999.
	Zhang Le's MaxEnt page
	Y. Dan Rubenstein and Trevor Hastie, 1997. Discriminative vs Informative Learning. Proc. of KDD.
	Andrew Y. Ng and Michael Jordan, 2002. On discriminative vs. generative classifiers: A comparison of logistic regression and Naive Bayes. Proc. of NIPS.
	Florian Wolf, Tomaso Poggio and Pawan Sinha, 2006. Human Document Classification Using Bags of Words. Tech report MIT-CSAIL-TR-2006-054.
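	The multinomial event model from the McCallum & Nigam paper above treats a document as a bag of word draws. A minimal sketch of training and classification with add-one smoothing (the toy labels and documents are ours):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (label, token list). Multinomial Naive Bayes counts."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def classify(tokens, class_docs, word_counts, vocab):
    """Pick the class maximizing log prior + sum of smoothed log likelihoods."""
    n_docs = sum(class_docs.values())
    best, best_lp = None, float("-inf")
    for c in class_docs:
        lp = math.log(class_docs[c] / n_docs)
        total = sum(word_counts[c].values())
        for w in tokens:
            lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

docs = [("pos", "great fun great".split()),
        ("neg", "boring dull boring".split())]
model = train_nb(docs)
label = classify("great movie".split(), *model)  # "pos"
```

	Logistic regression fits the same conditional P(class | words) directly; the Ng & Jordan paper above compares the two regimes.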

   Sentiment, humor, gender analysis with Support Vector Machines
	[cB] 7.1 
	Thorsten Joachims.  Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning (ECML), Springer, 1998.
	Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2), pp. 1-135, 2008.
	Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan.  Thumbs up? Sentiment Classification using Machine Learning Techniques.  EMNLP, 2002.
	The Yahoo! SentimentAI group: Sentiment and Affect in Text. (need to join the group)
	Bing Liu's Opinion Mining page.
	Rada Mihalcea and Carlo Strapparava.  Making Computers Laugh: Investigations in Automatic Humor Recognition.  EMNLP, 2005.
	Moshe Koppel, Shlomo Argamon, Anat Rachel Shimoni.   Automatically Categorizing Written Texts by Author Gender.  Literary and Linguistic Computing 17(4), November 2002, pp. 401-412. 
	Chris Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.
	Alex J. Smola and Bernhard Schölkopf. A Tutorial on Support Vector Regression, NeuroCOLT Technical Report TR-98-030.  1998
	Thorsten Joachims' SVM-light code

   Clustering and dimensionality reduction
	Ulrike von Luxburg. A Tutorial on Spectral Clustering. Statistics and Computing 17(4), 395-416, 2007.
	ICML 2004 tutorial on spectral clustering by Chris Ding
	Fernando Pereira, Naftali Tishby and Lillian Lee. Distributional clustering of English words.  Proceedings of the 31st annual meeting on Association for Computational Linguistics, 1993.
	C.J.C. Burges. Dimension Reduction: A Guided Tour. Foundations and Trends in Machine Learning, 2010.

   Semi-supervised learning: using both labeled and unlabeled data 
	[cB] 9 for the EM algorithm.
	Self-training for word sense disambiguation: David Yarowsky, 1995. Unsupervised word sense disambiguation rivaling supervised methods, Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp 189--196.
	Text Classification from Labeled and Unlabeled Documents using EM. Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell. Machine Learning, 39(2/3). pp. 103-134. 2000.
	Combining Labeled and Unlabeled Data with Co-Training. Avrim Blum and Tom Mitchell. Proceedings of the 11th Annual Conference on Computational Learning Theory, pages 92--100, 1998
	T. Joachims, Transductive Inference for Text Classification using Support Vector Machines. Proceedings of the International Conference on Machine Learning (ICML), 1999.
	Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions.  Xiaojin Zhu, Zoubin Ghahramani, John Lafferty.  The Twentieth International Conference on Machine Learning (ICML-2003) 
	Semi-Supervised Learning Literature Survey. Xiaojin Zhu, Computer Sciences TR 1530, University of Wisconsin - Madison.
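	The harmonic-function method of Zhu, Ghahramani & Lafferty above sets each unlabeled node's value to the average of its neighbors' values, with labeled nodes clamped. A minimal sketch solving it by simple iterative averaging, one standard way to compute the solution (the toy graph is ours):

```python
def harmonic(edges, labels, iters=200):
    """Clamp labeled nodes; relax each unlabeled node to the mean of its
    neighbours until convergence (the harmonic solution on the graph)."""
    nodes, nbrs = set(), {}
    for u, v in edges:
        nodes |= {u, v}
        nbrs.setdefault(u, []).append(v)
        nbrs.setdefault(v, []).append(u)
    f = {n: labels.get(n, 0.5) for n in nodes}
    for _ in range(iters):
        for n in nodes:
            if n not in labels:
                f[n] = sum(f[m] for m in nbrs[n]) / len(nbrs[n])
    return f

# chain a - b - c - d with a labeled 1 and d labeled 0:
# the harmonic solution interpolates, f(b) = 2/3, f(c) = 1/3
f = harmonic([("a", "b"), ("b", "c"), ("c", "d")], {"a": 1.0, "d": 0.0})
```

	Thresholding f at 1/2 gives the label prediction; the values can also be read as absorption probabilities of a random walk, connecting this to the random-walk readings in the mathematical background section.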

   Latent topic models
	Probabilistic Latent Semantic Analysis. Thomas Hofmann. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI'99)
	Probabilistic Latent Semantic Indexing. Thomas Hofmann. Proceedings of the 22nd International Conference on Research and Development in Information Retrieval (SIGIR'99)
	D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.
	Griffiths, T., & Steyvers, M. Finding Scientific Topics.  Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235. 2004

   Part of Speech tagging with Hidden Markov Models
	[MS] 10 for POS tagging 
	[cB] 13.2, [MS] 9, or [JM] 7.1-7.4 for HMM
	Zoubin Ghahramani, 2001. An Introduction to Hidden Markov Models and Bayesian Networks, International Journal of Pattern Recognition and Artificial Intelligence 15(1):9-42. 
	Lawrence R. Rabiner, 1989. A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE 77(2), pp. 257-286. (An Erratum by Ali Rahimi)
	David Elworthy, 1994. Does Baum-Welch Re-estimation help taggers? Proceedings of the 4th Conference on Applied Natural Language Processing.
	Kevin Murphy's Hidden Markov Model (HMM) Toolbox for Matlab
	Stanford Log-linear Part-Of-Speech Tagger
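	Decoding an HMM tagger means finding the most likely tag sequence, which the Viterbi algorithm (covered in the Rabiner tutorial above) computes by dynamic programming. A minimal sketch; the two-tag model and all its probabilities are a made-up toy, not from the readings:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for an observation sequence under an HMM."""
    # V[t][s] = (best prob of any path ending in s at time t, its predecessor)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][r][0] * trans_p[r][s] * emit_p[s][obs[t]], r)
                for r in states)
            V[t][s] = (prob, prev)
    # backtrack from the best final state
    state = max(states, key=lambda s: V[-1][s][0])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = V[t][state][1]
        path.append(state)
    return path[::-1]

states = ("N", "V")
start_p = {"N": 0.6, "V": 0.4}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"fish": 0.6, "swim": 0.4}, "V": {"fish": 0.3, "swim": 0.7}}
tags = viterbi(["fish", "swim"], states, start_p, trans_p, emit_p)  # ["N", "V"]
```

	Real taggers work in log space to avoid underflow on long sentences, and learn the parameters by counting (supervised) or Baum-Welch (unsupervised; see the Elworthy paper above on when that helps).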

   Information extraction with Conditional Random Fields
	John Lafferty, Andrew McCallum, Fernando Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML-2001), 2001.
	Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields for Relational Learning.  In Introduction to Statistical Relational Learning. Edited by Lise Getoor and Ben Taskar. MIT Press. 2006.
	Fei Sha and Fernando Pereira. Shallow Parsing with Conditional Random Fields. Proceedings of HLT-NAACL 2003.
	Andrew McCallum. Efficiently Inducing Features of Conditional Random Fields. Uncertainty in AI, 2003.
	Hanna Wallach's conditional random fields page
	Andrew McCallum's MALLET code

   Parsing and context-free grammars
	[MS] 11 or [JM] 9, 12
	Detlef Prescher. A Tutorial on the Expectation-Maximization Algorithm Including Maximum-Likelihood Estimation and EM Training of Probabilistic Context-Free Grammars.  The 15th European Summer School in Logic, Language and Information (ESSLLI-03).
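	The standard chart-parsing algorithm for grammars in Chomsky normal form is CKY, covered in the [MS] and [JM] chapters above. A minimal recognizer sketch (the tiny grammar and lexicon are ours):

```python
def cky(words, lexicon, rules):
    """CKY recognition for a grammar in Chomsky normal form.
    lexicon: word -> set of POS tags; rules: (B, C) -> set of parents A."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexicon[w])      # width-1 spans from the lexicon
    for span in range(2, n + 1):               # grow spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # try every split point
                for b in chart[i][k]:
                    for c in chart[k][j]:
                        chart[i][j] |= rules.get((b, c), set())
    return "S" in chart[0][n]

lexicon = {"she": {"NP"}, "eats": {"NP", "V"}, "fish": {"NP"}}
rules = {("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}
ok = cky("she eats fish".split(), lexicon, rules)  # True
```

	The probabilistic version keeps the best (or total) probability per nonterminal in each cell instead of a set; the inside-outside/EM training of those probabilities is the subject of the Prescher tutorial above.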

   Machine Translation
	Adam L. Berger, Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, John R. Gillett, John D. Lafferty, Robert L. Mercer, Harry Printz, and Lubos Ures, 1994. The Candide System for Machine Translation. Proceedings of the 1994 ARPA Workshop on Human Language Technology
	Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer, 1993. The Mathematics of Statistical Machine Translation. Computational Linguistics 19(2), pp. 263--311.
	Papineni, Roukos, Ward, Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 311-318.
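	The core quantity in BLEU (Papineni et al. above) is modified n-gram precision: candidate n-gram counts are clipped by their counts in the reference before computing precision. A minimal sketch for a single reference, omitting BLEU's brevity penalty and the geometric mean over n:

```python
from collections import Counter

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    # clip each candidate n-gram count by its count in the reference
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

cand = "the the cat".split()
ref = "the cat sat".split()
p1 = modified_precision(cand, ref, 1)  # min(2,1) + min(1,1) over 3 = 2/3
```

	Clipping is what stops a degenerate candidate like "the the the" from scoring perfect unigram precision; full BLEU multiplies the (geometric-mean) precisions for n = 1..4 by a brevity penalty.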

   	The CMU Pronouncing Dictionary

   Spoken Document Retrieval
   	The TREC spoken document retrieval track: A success story, 1999

Related Courses

Text data mining, Callan, CMU
Human language technologies, Callan, Black, Lavie, CMU
Information retrieval, Callan, Yang, CMU
Natural language processing, Cardie, Cornell
Learning to turn words into data, Cohen, CMU
Machine learning approaches for natural language processing, Collins, MIT
Introduction to bioinformatics, Craven, U Wisconsin
Advanced bioinformatics, Craven, U Wisconsin
Machine learning for text analysis, Craven, Shavlik, U Wisconsin
Empirical methods in natural language processing, Koehn, Edinburgh
Statistical foundations of machine learning, Lafferty, Wasserman, CMU
Algorithms for NLP, Lavie, Frederking, CMU
Statistical natural language processing: models and methods, Lee, Cornell
Natural language processing, Lee, Cornell
Statistical methods for artificial intelligence, McAllester, TTI-C
Introduction to natural language processing, McCallum, U Mass
Natural language processing, Mihalcea, University of North Texas
Advanced methods in artificial intelligence, Page, U Wisconsin
Topics in natural language processing, Ringger, BYU
Language and statistics, Rosenfeld, CMU
Machine learning, Shavlik, U Wisconsin
Speech recognition and understanding, Schultz, Waibel, CMU
Graphs and networks, Spielman, Yale
Practical machine learning, Jordan, Berkeley
Topics in machine learning, Sha, USC
Computational data analysis: foundations of machine learning and data mining, Gray, Georgia Tech
Analysis of social media, Cohen and Glance, CMU
Text-driven forecasting, Smith, CMU