CS 639 (Spring 2021) Topics in Sequential Decision Making and Learning



Description

In artificial intelligence, sequential decision making refers to agents that make decisions over time. Importantly, the world gives feedback to the agent after each decision and may change the environment surrounding the agent. Earlier decisions affect the availability and quality of future decision options, so the agent must learn from the feedback and make good decisions as time goes on. This is in contrast to supervised learning, where learning is typically done only once. The focus of the course is reinforcement learning, though we will also discuss active learning, multi-armed bandits, and stochastic games. Mathematical maturity (probability and statistics, linear algebra, calculus), programming skills (data structures, Python), and knowledge of machine learning at the level of CS540 or CS532 are necessary. This is an undergraduate-level course.

Prerequisites

CS540 or CS532.

Instructor

Professor Jerry Zhu, jerryzhu@cs.wisc.edu

Time and location

Syllabus

Week 1: Probably Approximately Correct; supervised learning, active learning
- Foundations of Machine Learning. Mohri, Rostamizadeh, Talwalkar. Second Edition, 2018. Ch 1, 2 (may skip 2.3 and beyond)
- Theory of Active Learning. Hanneke. 2014. Section 1.3
- Optional: Active Learning. Settles. 2012. (download from a UW IP address)

Week 2: multi-armed bandits
- Chapter 2.1-2.7 of the textbook
- Chapter 1, 4.1-4.5, 6, 7 (may skip proofs) in Bandit Algorithms. Lattimore and Szepesvari. 2020.

Week 3: contextual bandits, best arm identification
- Chapter 18.1, 19.1-19.2, 33.1 (may skip proofs) in Lattimore and Szepesvari

Week 4: Markov Decision Processes
- Ch 1; Part I Tabular Solution Methods intro (p. 23); Ch 3 in the textbook
- 1.1, 1.2, 1.4 (may skip proofs) in Reinforcement Learning: Theory and Algorithms. Agarwal et al. 2021.

Week 5: value functions, Bellman equations
- Ch 5, 6 in the textbook

Week 6: planning in MDPs (policy iteration, value iteration)
- Cuttlefish exert self-control in a delay of gratification task. Schnell et al. 2021.

Week 7: Monte Carlo, temporal difference
- Ch 9 in the textbook

Week 8: SARSA, Q-learning, function approximation
- Read the Monte Carlo tutorial at least to section 5.2. Then revisit sections 5.5 and 5.7 in the textbook.

Week 9: off-policy methods
- Ch 13 in the textbook

Week 10: policy gradient
- Read R-max and UCBVI

Week 11: exploration
- An Algorithmic Perspective on Imitation Learning. Osa et al. Foundations and Trends in Robotics, 2018. Sections 1.1-1.4, 2.2, 2.4, 2.6, 3.1, 3.4.3.3, 4.1

Week 12: imitation learning
- Algorithmic Game Theory. Nisan et al. 2007. Sections 1.1-1.7

Week 13: stochastic games
- An Overview of Multi-Agent Reinforcement Learning from Game Theoretical Perspective. Yang, Wang. 2021. Sections 1-4

Week 14: stochastic games

Textbook

Reinforcement Learning: An Introduction. Richard S. Sutton and Andrew G. Barto. Second Edition. MIT Press, Cambridge, MA, 2018.

Grading: weekly reading summary (40%), math/coding homework (40%), exams (20%)
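To give a concrete taste of the agent-feedback loop described above and of the bandit material in Weeks 2-3, here is a minimal epsilon-greedy simulation on a Bernoulli multi-armed bandit. This is an illustrative sketch only, not course-provided code; the arm means, parameter values, and function names are made up.

```python
import random

def eps_greedy_bandit(means, horizon=5000, eps=0.1, seed=0):
    """Simulate an epsilon-greedy agent on a Bernoulli multi-armed bandit.

    means: true success probability of each arm (unknown to the agent).
    Returns (estimates, counts): the agent's empirical mean reward per arm
    and how many times each arm was pulled.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k          # pulls per arm
    estimates = [0.0] * k     # running empirical mean reward per arm
    for _ in range(horizon):
        if rng.random() < eps:                       # explore: random arm
            arm = rng.randrange(k)
        else:                                        # exploit: best arm so far
            arm = max(range(k), key=lambda a: estimates[a])
        # the world gives feedback: a 0/1 reward drawn from the chosen arm
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        # incremental mean update: new = old + (reward - old) / n
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

estimates, counts = eps_greedy_bandit([0.2, 0.5, 0.8])
```

After enough rounds the agent's estimates concentrate near the true arm means and most pulls go to the best arm, illustrating how earlier decisions (which arms were tried) shape the quality of later ones.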
Homework

All assignments are on Canvas. There are two kinds of homework:

1. Weekly reading summary. Usually posted on Thursdays and due the following Monday at 5pm; students submit a paragraph on Canvas in response to each reading assignment. The reading assignment will specify the book chapters or papers to read, and may either ask a specific question or be open-ended. The grade will be based on evidence that you have done the readings thoughtfully. For an open-ended reading summary, you may pose insightful questions, relate the reading to previous classes, suggest in-depth discussion directions, summarize key points you learned, etc. The Monday 5pm deadline lets the instructor incorporate your responses into that week's teaching.

2. Math and coding problems. This is traditional homework, assigned about every two weeks. Homework is always due the minute before class starts on the due date (usually Thursdays at 9:29am). Late submissions will not be accepted. However, we will automatically drop your two lowest weekly reading summary scores and your two lowest math and coding problem scores from the final homework average. These drops are meant for emergencies; we do not provide additional drops, late days, or homework extensions.

Grading questions must be raised with the instructor within one week after the homework is returned. A regrading request for one part of a homework question may trigger the grader to regrade the entire homework and could take points off. Regrading is done on the originally submitted work; no changes are allowed.

We encourage you to use a study group for your homework. Students are expected to help each other out and, if desired, form ad-hoc homework groups.

Exam

Final exam: Tuesday, May 4, 12:25-2:25pm Madison time, via a Canvas assignment.

Academic Integrity

You are encouraged to discuss ideas, approaches, and techniques broadly with your peers, the TA, or the instructors.
However, all examinations, programming assignments, and written homework must be completed individually. For example, code for programming assignments must not be developed in groups, nor should code be shared. Make sure you work through all problems yourself and that your final write-up is your own. If you feel your peer discussions are too deep for comfort, declare it in your homework solution: "I discussed with X, Y, Z the following specific ideas: A, B, C; therefore our solutions may have similarities on D, E, F..." You may use books or legitimate online resources to help solve homework problems, but you must always credit all such sources in your write-up and you must never copy material verbatim. Do not bother to obfuscate plagiarism (e.g., changing variable names or code style): one application of AI is to develop sophisticated plagiarism detection techniques! Cheating and plagiarism will be dealt with in accordance with University procedures (see the UW-Madison Academic Misconduct Rules and Procedures).

Disability Information

The University of Wisconsin-Madison supports the right of all enrolled students to a full and equal educational opportunity. The Americans with Disabilities Act (ADA), Wisconsin State Statute (36.12), and UW-Madison policy (Faculty Document 1071) require that students with disabilities be reasonably accommodated in instruction and campus life. Reasonable accommodation for students with disabilities is a shared faculty and student responsibility. Students are expected to inform Professor Zhu of their need for instructional accommodations by the end of the third week of the semester, or as soon as possible after a disability has been incurred or recognized. Professor Zhu will work either directly with the student or in coordination with the McBurney Center to identify and provide reasonable instructional accommodations.
Disability information, including instructional accommodations as part of a student's educational record, is confidential and protected under FERPA.

Additional Course Information

Class learning outcomes. Students will be able to:
- gain familiarity with advanced learning paradigms, including active learning, multi-armed bandits, and reinforcement learning
- implement basic sequential decision making algorithms
- understand basic theoretical analysis in sequential decision making

Number of credits associated with the course: 3

How credit hours are met by the course: for each 50 minutes of classroom instruction, a minimum of two hours of out-of-class student work is expected. This course has two 75-minute classes each week over approximately 15 weeks, which meets the standard definition of a 3-credit course.
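As a small worked illustration of the homework drop policy described above (the two lowest scores of each kind are removed before averaging), here is a sketch of the calculation. The scores and the function name are hypothetical; this is not a course-provided tool.

```python
def homework_average(scores, drops=2):
    """Average a list of homework scores after dropping the `drops` lowest.

    Mirrors the syllabus policy of dropping the two lowest weekly reading
    summary scores and the two lowest math/coding scores before averaging.
    """
    if len(scores) <= drops:
        raise ValueError("need more scores than drops")
    kept = sorted(scores)[drops:]  # discard the lowest `drops` scores
    return sum(kept) / len(kept)

# hypothetical reading-summary scores out of 10; the two zeros
# (e.g., missed weeks) are dropped automatically
reading_avg = homework_average([10, 9, 0, 8, 10, 0, 9, 10])
```

Note that the drops absorb missed weeks without any penalty beyond the drop itself, which is why no additional late days or extensions are provided.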