CS 639 (Spring 2021) Topics in Sequential Decision Making and Learning



Description

In artificial intelligence, sequential decision making refers to agents that make decisions over time. Importantly, the world gives feedback to the agent after each decision and may change the environment surrounding the agent. Earlier decisions affect the availability and quality of future decision options, so the agent must learn from the feedback and make good decisions as time goes on. This is in contrast to supervised learning, where learning is typically done only once. The focus of the course is reinforcement learning, though we will also discuss active learning, multi-armed bandits, and stochastic games. Mathematical maturity (probability and statistics, linear algebra, calculus), programming skills (data structures, Python), and knowledge of machine learning at the level of CS540 or CS532 are necessary. This is an undergraduate-level course.

Prerequisites

CS540 or CS532.

Instructor

Professor Jerry Zhu, jerryzhu@cs.wisc.edu

Time and location

Syllabus

Week 1: Probably Approximately Correct; supervised learning, active learning
- Foundations of Machine Learning. Mohri, Rostamizadeh, Talwalkar. Second Edition, 2018. Ch 1, 2 (may skip 2.3 and beyond)
- Theory of Active Learning. Hanneke. 2014. Section 1.3
- Optional: Active Learning. Settles. 2012. (download from a UW IP address)

Week 2: multi-armed bandits
- Chapter 2.1-2.7 of the textbook
- Chapter 1, 4.1-4.5, 6, 7 (may skip proofs) in Bandit Algorithms. Lattimore and Szepesvari. 2020.

Week 3: contextual bandits, best arm identification
- Chapter 18.1, 19.1-19.2, 33.1 (may skip proofs) in Lattimore and Szepesvari

Week 4: Markov Decision Processes
- Ch 1; Part I Tabular Solution Methods intro (p. 23); Ch 3 in the textbook
- 1.1, 1.2, 1.4 (may skip proofs) in Reinforcement Learning: Theory and Algorithms. Agarwal et al. 2021.

Week 5: value functions, Bellman equations
- Ch 5, 6 in the textbook

Week 6: planning in MDPs (policy iteration, value iteration)
- Cuttlefish exert self-control in a delay of gratification task. Schnell et al. 2021.

Week 7: Monte Carlo, temporal difference
- Ch 9 in the textbook

Week 8: SARSA, Q-learning, function approximation
- Read the Monte Carlo tutorial at least to section 5.2. Then revisit sections 5.5 and 5.7 in the textbook.

Week 9: off-policy methods
- Ch 13 in the textbook

Week 10: policy gradient
- Read R-max and UCBVI

Week 11: exploration
- An Algorithmic Perspective on Imitation Learning. Osa et al. Foundations and Trends in Robotics, 2018. Sections 1.1-1.4, 2.2, 2.4, 2.6, 3.1, 3.4.3.3, 4.1

Week 12: imitation learning
- Algorithmic Game Theory. Nisan et al. 2007. Sections 1.1-1.7

Week 13: stochastic games
- An Overview of Multi-Agent Reinforcement Learning from Game Theoretical Perspective. Yang, Wang. 2021. Sections 1-4

Week 14: stochastic games

Textbook

Reinforcement Learning: An Introduction. Richard S. Sutton and Andrew G. Barto. Second Edition. MIT Press, Cambridge, MA, 2018.

Grading: weekly reading summary (40%), math/coding homework (40%), exams (20%)
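To give a concrete taste of the agent-feedback loop described above and of the bandit material in Weeks 2-3, here is a minimal epsilon-greedy simulation on a Bernoulli multi-armed bandit. This is an illustrative sketch only, not course-provided code; the arm means, parameter values, and function names are made up.

```python
import random

def eps_greedy_bandit(means, horizon=5000, eps=0.1, seed=0):
    """Simulate an epsilon-greedy agent on a Bernoulli multi-armed bandit.

    means: true success probability of each arm (unknown to the agent).
    Returns (estimates, counts): the agent's empirical mean reward per arm
    and how many times each arm was pulled.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k          # pulls per arm
    estimates = [0.0] * k     # running empirical mean reward per arm
    for _ in range(horizon):
        if rng.random() < eps:                       # explore: random arm
            arm = rng.randrange(k)
        else:                                        # exploit: best arm so far
            arm = max(range(k), key=lambda a: estimates[a])
        # the world gives feedback: a 0/1 reward drawn from the chosen arm
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        # incremental mean update: new = old + (reward - old) / n
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

estimates, counts = eps_greedy_bandit([0.2, 0.5, 0.8])
```

After enough rounds the agent's estimates concentrate near the true arm means and most pulls go to the best arm, illustrating how earlier decisions (which arms were tried) shape the quality of later ones.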
Homework

All assignments are on Canvas. There are two kinds of homework:

1. Weekly reading summary. Usually posted on Thursdays and due the following Monday at 5pm; students submit a paragraph on Canvas in response to each reading assignment. The reading assignment will specify the book chapters or papers to read, and may either ask a specific question or be open-ended. The grade will be based on evidence that you have done the readings thoughtfully. For an open-ended reading summary, you may pose insightful questions, relate the reading to previous classes, suggest in-depth discussion directions, summarize key points you learned, etc. The Monday 5pm deadline lets the instructor incorporate your responses into that week's teaching.

2. Math and coding problems. This is traditional homework, assigned about every two weeks. Homework is always due the minute before class starts on the due date (usually Thursdays at 9:29am). Late submissions will not be accepted. However, we will automatically drop your two lowest weekly reading summary scores and your two lowest math and coding problem scores from the final homework average. These drops are meant for emergencies; we do not provide additional drops, late days, or homework extensions.

Grading questions must be raised with the instructor within one week after the homework is returned. A regrading request for one part of a homework question may trigger the grader to regrade the entire homework and could take points off. Regrading is done on the originally submitted work; no changes are allowed.

We encourage you to use a study group for your homework. Students are expected to help each other out and, if desired, form ad-hoc homework groups.

Exam

Final exam: Tuesday, May 4, 12:25-2:25pm Madison time, via a Canvas assignment.

Academic Integrity

You are encouraged to discuss ideas, approaches, and techniques broadly with your peers, the TA, or the instructors.
However, all examinations, programming assignments, and written homework must be completed individually. For example, code for programming assignments must not be developed in groups, nor should code be shared. Make sure you work through all problems yourself and that your final write-up is your own. If you feel your peer discussions are too deep for comfort, declare it in your homework solution: "I discussed with X, Y, Z the following specific ideas: A, B, C; therefore our solutions may have similarities on D, E, F..." You may use books or legitimate online resources to help solve homework problems, but you must always credit all such sources in your write-up and you must never copy material verbatim. Do not bother to obfuscate plagiarism (e.g., changing variable names or code style): one application of AI is to develop sophisticated plagiarism detection techniques! Cheating and plagiarism will be dealt with in accordance with University procedures (see the UW-Madison Academic Misconduct Rules and Procedures).

Disability Information

The University of Wisconsin-Madison supports the right of all enrolled students to a full and equal educational opportunity. The Americans with Disabilities Act (ADA), Wisconsin State Statute (36.12), and UW-Madison policy (Faculty Document 1071) require that students with disabilities be reasonably accommodated in instruction and campus life. Reasonable accommodation for students with disabilities is a shared faculty and student responsibility. Students are expected to inform Professor Zhu of their need for instructional accommodations by the end of the third week of the semester, or as soon as possible after a disability has been incurred or recognized. Professor Zhu will work either directly with the student or in coordination with the McBurney Center to identify and provide reasonable instructional accommodations.
Disability information, including instructional accommodations as part of a student's educational record, is confidential and protected under FERPA.

Additional Course Information

Class learning outcomes. Students will be able to:
- gain familiarity with advanced learning paradigms, including active learning, multi-armed bandits, and reinforcement learning
- implement basic sequential decision making algorithms
- understand basic theoretical analysis in sequential decision making

Number of credits associated with the course: 3

How credit hours are met by the course: for each 50 minutes of classroom instruction, a minimum of two hours of out-of-class student work is expected. This course has two 75-minute classes each week over approximately 15 weeks, which meets the standard definition of a 3-credit course.
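As a small worked illustration of the homework drop policy described above (the two lowest scores of each kind are removed before averaging), here is a sketch of the calculation. The scores and the function name are hypothetical; this is not a course-provided tool.

```python
def homework_average(scores, drops=2):
    """Average a list of homework scores after dropping the `drops` lowest.

    Mirrors the syllabus policy of dropping the two lowest weekly reading
    summary scores and the two lowest math/coding scores before averaging.
    """
    if len(scores) <= drops:
        raise ValueError("need more scores than drops")
    kept = sorted(scores)[drops:]  # discard the lowest `drops` scores
    return sum(kept) / len(kept)

# hypothetical reading-summary scores out of 10; the two zeros
# (e.g., missed weeks) are dropped automatically
reading_avg = homework_average([10, 9, 0, 8, 10, 0, 9, 10])
```

Note that the drops absorb missed weeks without any penalty beyond the drop itself, which is why no additional late days or extensions are provided.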