Young Wu's Homepage

Prev: L23, Next: L25


# Warning: this is a draft and will be updated one day before the lecture.

# Multi-Agent Reinforcement Learning

📗 Multi-Agent Reinforcement Learning (MARL) is harder than Single Agent RL.

| Single Agent RL | Multi Agent RL |
| --- | --- |
| Stationary MDP | Non-stationary environment |
| Unique optimal value | Multiple equilibria with different values |
| One agent | Problem scales exponentially with the number of agents |

# Markov Game

📗 Full observation MARL problems are modeled by Markov Games (MGs, or Stochastic Games): \((I, S, A, R, P)\).

➩ \(I\) is the set of players.

➩ \(S\) is the set of states (common to all players).

➩ \(A = \displaystyle\prod_{i \in I} A_{i}\) is the set of actions.

➩ \(R_{i} : S \times A \to \mathbb{R}\) is the reward function for player \(i \in I\).

➩ \(P : S \times A \to \Delta S\) is the state transition function (common to all players).

➩ \(P\left(\emptyset\right) \in \Delta S\) is the initial state distribution.
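📗 The tuple above can be written down as a small data structure. The following is a minimal sketch, not a reference implementation: the class name `MarkovGame`, its field names, and the two-state example game are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Tuple

State = str
JointAction = Tuple[str, ...]


@dataclass
class MarkovGame:
    players: List[str]                                   # I
    states: List[State]                                  # S
    actions: Dict[str, List[str]]                        # A_i for each player i
    reward: Callable[[str, State, JointAction], float]   # R_i(s, a)
    transition: Callable[[State, JointAction], Dict[State, float]]  # P(. | s, a)
    initial: Dict[State, float]                          # initial state distribution

    def joint_actions(self) -> List[JointAction]:
        """Enumerate A as the Cartesian product of the A_i."""
        return list(product(*(self.actions[i] for i in self.players)))


# Hypothetical two-player example (assumed for illustration only).
def reward(i: str, s: State, a: JointAction) -> float:
    # Zero-sum: p1 gets +1 when the actions match and -1 otherwise; p2 gets the negation.
    sign = 1.0 if a[0] == a[1] else -1.0
    return sign if i == "p1" else -sign


def transition(s: State, a: JointAction) -> Dict[State, float]:
    # Deterministic: go to "s1" when the actions match, otherwise to "s0".
    return {"s1": 1.0} if a[0] == a[1] else {"s0": 1.0}


game = MarkovGame(
    players=["p1", "p2"],
    states=["s0", "s1"],
    actions={"p1": ["L", "R"], "p2": ["L", "R"]},
    reward=reward,
    transition=transition,
    initial={"s0": 1.0},
)
```

Here `game.joint_actions()` has \(2 \times 2 = 4\) elements; with \(n\) players and two actions each it would have \(2^{n}\), which is the exponential scaling mentioned earlier.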

# Reduction to Single Agent RL

📗 Centralized Q-Learning: define joint reward (sometimes called social welfare function) and treat \(a \in A = \displaystyle\prod_{i \in I} A_{i}\) as the action space.

📗 Independent Q-Learning: use a single agent RL algorithm to control each agent separately; this often does not converge.
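📗 The two reductions differ only in what the Q table is indexed by. The sketch below contrasts one tabular update of each kind; the function names, the step size and discount values, and the choice of the sum of rewards as the joint reward are assumptions made for illustration.

```python
from collections import defaultdict

ALPHA, BETA = 0.5, 0.9  # step size and discount factor (illustrative values)


def centralized_update(Q, s, joint_a, rewards, s_next, joint_actions):
    """One Q-learning step over joint actions, using the sum of rewards."""
    r = sum(rewards)  # joint reward (social welfare); an assumed choice
    target = r + BETA * max(Q[(s_next, b)] for b in joint_actions)
    Q[(s, joint_a)] += ALPHA * (target - Q[(s, joint_a)])


def independent_update(Qs, s, joint_a, rewards, s_next, actions):
    """Each agent i updates its own table, indexed by its own action only."""
    for i, Q in enumerate(Qs):
        target = rewards[i] + BETA * max(Q[(s_next, b)] for b in actions[i])
        Q[(s, joint_a[i])] += ALPHA * (target - Q[(s, joint_a[i])])


actions = [["L", "R"], ["L", "R"]]
joint_actions = [(a, b) for a in actions[0] for b in actions[1]]

# Centralized: one table over (state, joint action).
Q_joint = defaultdict(float)
centralized_update(Q_joint, "s0", ("L", "R"), [1.0, 2.0], "s1", joint_actions)

# Independent: one table per agent over (state, own action).
Q_ind = [defaultdict(float), defaultdict(float)]
independent_update(Q_ind, "s0", ("L", "R"), [1.0, 2.0], "s1", actions)
```

In the independent version each agent's target ignores what the other agent did, so from its point of view the environment changes as the other agent learns, which is one way to see why convergence can fail.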

# Value Iteration

📗 Value iteration for MGs, repeat until convergence:

➩ \(Q_{i}\left(s, a\right) = R_{i}\left(s, a\right) + \beta \displaystyle\sum_{s' \in S} P\left(s' | s, a\right) V_{i}\left(s'\right)\).

➩ \(V_{i}\left(s\right) = \text{NE}\left(Q\left(s, \cdot\right)\right)\), where \(\text{NE}\) denotes player \(i\)'s payoff at a Nash Equilibrium (NE) of the stage game with payoffs \(Q\left(s, \cdot\right)\). NE can be replaced by other game equilibrium concepts (for example Correlated Equilibrium (CE) or Coarse Correlated Equilibrium (CCE)).

📗 The solution (the Nash equilibrium policy, if the iteration converges) is called a Markov Perfect Equilibrium (MPE).
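📗 The two update steps can be sketched in code as follows. This is a simplified illustration, not a general solver: it assumes every stage game has a pure-strategy NE and simply takes the first one found, and the names `pure_nash` and `value_iteration` and the nested-dict layout of `R` and `P` are hypothetical.

```python
from itertools import product


def pure_nash(payoffs, n_actions):
    """Return the first pure-strategy NE (a joint action), or None if there is none.

    payoffs[i][a] is player i's payoff at joint action a (a tuple of indices).
    """
    for a in product(*(range(n) for n in n_actions)):
        stable = True
        for i, n in enumerate(n_actions):
            for dev in range(n):
                b = a[:i] + (dev,) + a[i + 1:]  # player i deviates to dev
                if payoffs[i][b] > payoffs[i][a]:
                    stable = False
        if stable:
            return a
    return None


def value_iteration(states, n_actions, R, P, beta, iters=100):
    """R[i][s][a] is player i's reward; P[s][a] maps next states to probabilities."""
    n_players = len(n_actions)
    joint = list(product(*(range(n) for n in n_actions)))
    V = [{s: 0.0 for s in states} for _ in range(n_players)]
    for _ in range(iters):
        new_V = [dict() for _ in range(n_players)]
        for s in states:
            # Q_i(s, a) = R_i(s, a) + beta * sum_{s'} P(s' | s, a) V_i(s')
            Q = [
                {a: R[i][s][a] + beta * sum(p * V[i][s2] for s2, p in P[s][a].items())
                 for a in joint}
                for i in range(n_players)
            ]
            ne = pure_nash(Q, n_actions)
            if ne is None:
                raise ValueError(f"no pure NE in state {s}")
            for i in range(n_players):
                new_V[i][s] = Q[i][ne]  # V_i(s) = player i's payoff at the NE
        V = new_V
    return V


# Example: a one-state game whose stage game is a prisoner's dilemma
# (action 0 = cooperate, 1 = defect); the state always loops back to itself.
PD = {(0, 0): (3, 3), (0, 1): (0, 5), (1, 0): (5, 0), (1, 1): (1, 1)}
states = ["s"]
R = [{"s": {a: PD[a][i] for a in PD}} for i in range(2)]
P = {"s": {a: {"s": 1.0} for a in PD}}
V = value_iteration(states, [2, 2], R, P, beta=0.5)
```

In this example the only stage NE is mutual defection with payoff 1, so both values approach \(1 / \left(1 - \beta\right) = 2\).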

📗 Learning with, for example, epsilon-greedy exploration:

➩ With probability \(\varepsilon\), choose a random action; otherwise, sample \(a_{t} \sim \pi\left(s_{t}\right) \in \text{NE}\left(Q\left(s_{t}, \cdot\right)\right)\). Then observe the reward \(r_{t}\) and the next state \(s_{t+1}\).

➩ Update \(Q\left(s_{t}, a_{t}\right) = Q\left(s_{t}, a_{t}\right) + \alpha \left(r_{t} + \beta \text{NE}\left(Q\left(s_{t+1}, \cdot\right)\right) - Q\left(s_{t}, a_{t}\right)\right)\).
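📗 A minimal sketch of one learning step for a single player, assuming the equilibrium policy at \(s_{t}\) and the equilibrium value at \(s_{t+1}\) have already been computed by some stage-game solver. The function names and the numbers in the example are hypothetical.

```python
import random


def epsilon_greedy(policy, actions, epsilon, rng=random):
    """With probability epsilon pick uniformly at random, else sample from policy.

    policy maps each action to its probability under the equilibrium policy pi(s_t).
    """
    if rng.random() < epsilon:
        return rng.choice(actions)
    weights = [policy[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def q_update(q_sa, r, next_stage_value, alpha, beta):
    """Q(s_t, a_t) <- Q(s_t, a_t) + alpha * (r_t + beta * NE(Q(s_{t+1}, .)) - Q(s_t, a_t))."""
    return q_sa + alpha * (r + beta * next_stage_value - q_sa)


# Example numbers (illustrative only).
action = epsilon_greedy({"L": 1.0, "R": 0.0}, ["L", "R"], epsilon=0.0)
new_q = q_update(q_sa=1.0, r=2.0, next_stage_value=3.0, alpha=0.5, beta=0.9)
```

With these numbers the update gives \(1 + 0.5 \left(2 + 0.9 \cdot 3 - 1\right) = 2.85\).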

# Minimax Q-Learning

📗 \(\text{NE}\left(Q\left(s, \cdot\right)\right)\) is not unique for general-sum games, so for general-sum MGs the value functions are not well defined.

📗 The value \(\text{NE}\left(Q\left(s, \cdot\right)\right)\) is unique for two-player zero-sum games, so minimax-Q (which uses a minimax solver for the stage game \(Q\left(s, \cdot\right)\)) converges to an MPE.
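📗 For larger stage games the minimax solver is usually a linear program. For a \(2 \times 2\) stage game there is a closed form, which the sketch below uses; the function name `solve_2x2_zero_sum` is hypothetical, and the restriction to two actions per player is an assumption made to keep the example short.

```python
def solve_2x2_zero_sum(M):
    """Return (value, p) for the row (maximizing) player of a 2x2 zero-sum game.

    M[i][j] is the row player's payoff when row i and column j are played;
    p is the probability the row player puts on row 0.
    """
    (a, b), (c, d) = M
    maximin = max(min(a, b), min(c, d))  # best pure guarantee for the row player
    minimax = min(max(a, c), max(b, d))  # best pure guarantee for the column player
    if maximin == minimax:               # saddle point: a pure equilibrium exists
        p = 1.0 if min(a, b) >= min(c, d) else 0.0
        return float(maximin), p
    # No saddle point: the row player mixes so that the column player is indifferent.
    denom = a + d - b - c
    value = (a * d - b * c) / denom
    p = (d - c) / denom
    return value, p


# Matching pennies: value 0, mix 50/50.
value, p = solve_2x2_zero_sum([[1, -1], [-1, 1]])
```

The returned value plays the role of \(\text{NE}\left(Q\left(s, \cdot\right)\right)\) in the minimax-Q update, with `M` set to the row player's \(Q\left(s, \cdot\right)\).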

Soccer Game Example

📗 Grid-world soccer game (each entry is the percentage of games won, with the episode length in parentheses):

| Opponent | minimax Q | independent Q |
| --- | --- | --- |
| vs random | \(99.3 \left(13.89\right)\) | \(99.5 \left(11.63\right)\) |
| vs hand-built | \(53.7 \left(18.87\right)\) | \(76.3 \left(30.30\right)\) |
| vs optimal | \(37.5 \left(22.73\right)\) | \(0 \left(83.33\right)\) |

(Littman 1994: PDF)

# Nash Q-Learning

📗 Nash Q-learning handles equilibrium selection by finding the Nash equilibrium with the highest sum of rewards, which is computationally intractable.

📗 Nash Q is guaranteed to converge under very restrictive assumptions.

➩ CE or CCE can be used in place of NE (CE and CCE can be solved with a linear program, easier than NE).

➩ No known condition for Correlated Q to converge.

# Deep MARL

📗 Most deep MARL models assume partial observability (handled with recurrent units or attention modules) and use either centralized or independent learning (through best response dynamics, similar to GAN).

➩ Hide and seek: Link.

➩ Adversarial attack on one player: Link.

➩ UW Madison Soccer Team: Link.

# Questions?

📗 If you have questions, please use (i) Zoom chat, (ii) Piazza: Link, (iii) Office hours and discussion sessions. Please do NOT use Canvas mail. Use email only for grading issues, and send it only to the course instructor (not the TAs).

# In-class Quiz Instructions

📗 To get full points on the in-class quizzes for a lecture:

➩ Submit relevant answers to the questions discussed during the lecture: incorrect answers are okay.

➩ Some questions require [notes] to earn the point.

➩ Some questions require special ID (given during the lecture) to earn the point.

➩ Do not submit answers to questions that are not discussed during the lectures. Each such submission will result in a deduction of one point.

➩ Submissions after the lecture, before the midterm (first 14 lectures) and the final exam (last 14 lectures), are accepted. After the exams, no in-class quiz submissions will be accepted.

➩ The grade on Canvas Assignment Q24 is computed as number of points divided by the number of questions asked (out of 1) and updated on Canvas every weekend.

📗 If there are any issues with submission on the website, please use this Google form: Link.

📗 Bonus point opportunities during a few lectures (added to in-class quiz above 20 points).

📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Blerina Gkotse, Yudong Chen, Yingyu Liang, Charles Dyer. Some content is generated using Copilot.

Last Updated: July 19, 2026 at 1:41 PM