Prev: L23, Next: L25

Zoom: Link, Piazza: Link, Google Form: Link.

Wisc ID for in-class quiz: (if your wisc email is "test@wisc.edu", please enter "test")
Token: (will be given during the lectures)



# Warning: this is a draft and will be updated one day before the lecture.


Slide:

# Multi-Agent Reinforcement Learning

📗 Multi-Agent Reinforcement Learning (MARL) is harder than Single Agent RL.
Single Agent RL Multi Agent RL
MDP stationary Non-stationary environment
Unique optimal value Multiple equilibria with different values
One Agent Problem scales exponentially with number of agents


# Markov Game

📗 Full observation MARL problems are modeled by Markov Games (MGs, or Stochastic Games): \((I, S, A, R, P)\).
➩ \(I\) is the set of players.
➩ \(S\) is the set of states (common to all players).
➩ \(A = \displaystyle\prod_{i \in I} A_{i}\) is the set of actions.
➩ \(R_{i} : S \times A \to  \mathbb{R}\) is the reward function for player \(i \in I\).
➩ \(P : S \times A \to  \Delta S\) is the state transition function (common to all players).
➩ \(P\left(\emptyset\right) \in \Delta S\) is the initial state distribution.

# Reduction to Single Agent RL

📗 Centralized Q-Learning: define joint reward (sometimes called social welfare function) and treat \(a \in A = \displaystyle\prod_{i \in I} A_{i}\) as the action space.
📗 Indepedent Q-Learning: single agent RL algorithm to control each agent, often does not converge.

# Value Iteration

📗 Value iteration for MGs, repeat until converge:
➩ \(Q_{i}\left(s, a\right) = R_{i}\left(s, a\right) + \beta \displaystyle\sum_{s' \in S} P\left(s' | s, a\right) V_{i}\left(s'\right)\).
➩ \(V_{i}\left(s\right) = \text{NE}\left(Q\left(s, \cdot\right)\right)\), where Nash Equilibrium (NE) can be replaced by other game equilibrium concepts (for example Correlated Equilibrium (CE) or Coarse Correlated Equilibrium (CCE)).
📗 The solution (Nash equilibrium policy, if converges) is called Markov Perfect Equilibrium (MPE).
📗 Learning, with for example, epsilon-Greedy:
➩ With probably \(\varepsilon\), choose a random action, otherwise, sample \(a_{t} \sim \pi\left(s_{t}\right) \in \text{NE}\left(Q\left(s_{t}, \cdot\right)\right)\), and observe reward \(r_{t}, s_{t+1}\).
➩ Update \(Q\left(s_{t}, a_{t}\right) = Q\left(s_{t}, a_{t}\right) + \alpha \left(r_{t} + \beta \text{NE}\left(Q\left(s_{t+1}, \cdot\right)\right) - Q\left(s_{t}, a_{t}\right)\right)\).

# Minimax Q-Learning

📗 \(\text{NE}\left(Q\left(s, \cdot\right)\right)\) is not unique for general-sum games, meaning for general-sum MGs, value functions are not defined.
📗 \(\text{NE}\left(Q\left(s, \cdot\right)\right)\) is unique for zero-sum games, minimax-Q (use minimax solver for stage game \(Q\left(s, \cdot\right)\)) converges to an MPE.
Soccer Game Example
📗 Grid-world soccer game (numbers are percentage won and episode length)
- minimax Q independent Q
vs random \(99.3 \left(13.89\right)\) \(99.5 \left(11.63\right)\)
vs hand-built \(53.7 \left(18.87\right)\) \(76.3 \left(30.30\right)\)
vs optimal \(37.5 \left(22.73\right)\) \(0 \left(83.33\right)\)


(Littman 1994: PDF)

# Nash Q-Learning

📗 Equilibrium selection by finding the Nash equilibrium with the highest sum of rewards (computationally intractable).
📗 Nash Q is guaranteed to converge under very restrictive assumptions.
➩ CE or CCE can be used in place of NE (CE and CCE can be solved with a linear program, easier than NE).
➩ No known condition for Correlated Q to converge.

# Deep MARL

📗 Most of the deep MARL models assume partial observability (recurrent units or attention modules) and uses either centralized or independent learning (through best response dynamics, similar to GAN).
➩ Hide and seek: Link.
➩ Adversarial attack on one player: Link.
➩ UW Madison Soccer Team: Link.




# Questions?

📗 If you have questions, please use (i) Zoom chat, (ii) Piazza: Link, (iii) Office hours and discussion sessions. Please do NOT use Canvas mail and use email only to the course instructor (not TAs) for grading issues.
Additional In-class Discussion
📗 Sometimes a question not in the notes will be asked during the lecture, you can submit your answer here:

Notes (not visible to other students):


Submit your answer to see other students answers (click the submit button to refresh): 

Additional In-class Quiz
📗 Sometimes a question not in the notes will be asked during the lecture, you can submit your answer here:
A.
B.
C.
D.
E.
Notes (not visible to other students):


Submit your answer to see other students answers (click the submit button to refresh): 






# In-class Quiz Instructions

📗 To get full points on the in-class quizzes for a lecture:
➩ Submit relevant answers to the questions discussed during the lecture: incorrect answers are okay.
➩ Some questions require [notes] to earn the point.
➩ Some questions require special ID (given during the lecture) to earn the point.
➩ Do not submit answers to questions that are not discussed during the lectures. Each such submission will result in a deduction of one point.
➩ Submissions after the lecture, before the midterm (first 14 lectures) and the final exam (last 14 lectures), are accepted. After the exams, no in-class quiz submissions will be accepted.
➩ The grade on Canvas Assignment Q24 is computed as number of points divided by the number of questions asked (out of 1) and updated on Canvas every weekend.
📗 If there are any issues with submission on the website, please use this Google form: Link.
📗 Bonus point opportunities during a few lectures (added to in-class quiz above 20 points).
📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Blerina Gkotse, Yudong Chen, Yingyu Liang, Charles Dyer. Some content are generated using Copilot .

Prev: L23, Next: L25





Last Updated: June 26, 2026 at 3:06 AM