📗 Centralized Q-Learning: define a joint reward (sometimes called a social welfare function) and treat the joint action space \(A = \displaystyle\prod_{i \in I} A_{i}\), with joint actions \(a \in A\), as the action space of a single-agent problem.
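📗 A minimal sketch of tabular centralized Q-learning over the joint action space, assuming a small hypothetical environment interface (env.reset() returning a hashable state, env.step(joint_action) returning the next state, the joint reward, and a done flag); the names env, action_sets, and all hyperparameters are illustrative placeholders, not part of the notes.

```python
import itertools
import random
from collections import defaultdict

def centralized_q_learning(env, action_sets, episodes=1000,
                           alpha=0.1, gamma=0.95, eps=0.1):
    # Joint action space A = prod_i A_i, enumerated explicitly.
    joint_actions = list(itertools.product(*action_sets))
    Q = defaultdict(float)  # Q[(s, a)] with a a joint action
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:  # explore
                a = random.choice(joint_actions)
            else:                      # exploit the joint Q-table
                a = max(joint_actions, key=lambda b: Q[(s, b)])
            s_next, r, done = env.step(a)  # r is the joint (social) reward
            target = r + gamma * max(Q[(s_next, b)] for b in joint_actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```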
📗 Independent Q-Learning: use a single-agent RL algorithm to control each agent separately; this often does not converge, because each agent faces a non-stationary environment created by the other learning agents.
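📗 For contrast, a minimal sketch of independent Q-learning under the same hypothetical interface, except that env.step is assumed to return one reward per agent; each agent learns its own table over its own actions only.

```python
import random
from collections import defaultdict

def independent_q_learning(env, action_sets, episodes=1000,
                           alpha=0.1, gamma=0.95, eps=0.1):
    n = len(action_sets)
    Q = [defaultdict(float) for _ in range(n)]  # one Q-table per agent
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Each agent picks its own action epsilon-greedily from its own table.
            a = tuple(
                random.choice(action_sets[i]) if random.random() < eps
                else max(action_sets[i], key=lambda ai: Q[i][(s, ai)])
                for i in range(n)
            )
            s_next, rewards, done = env.step(a)  # rewards[i] for agent i
            for i in range(n):
                target = rewards[i] + gamma * max(
                    Q[i][(s_next, b)] for b in action_sets[i]
                )
                Q[i][(s, a[i])] += alpha * (target - Q[i][(s, a[i])])
            s = s_next
    return Q  # no convergence guarantee: each agent's target keeps moving
```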
➩ \(V_{i}\left(s\right) = \text{NE}\left(Q\left(s, \cdot\right)\right)\), the value to agent \(i\) of an equilibrium of the stage game \(Q\left(s, \cdot\right)\); the Nash Equilibrium (NE) can be replaced by other game equilibrium concepts (for example Correlated Equilibrium (CE) or Coarse Correlated Equilibrium (CCE)).
📗 The solution (the Nash equilibrium policy, if the learning converges) is called a Markov Perfect Equilibrium (MPE).
📗 Learning with, for example, epsilon-greedy exploration:
➩ With probability \(\varepsilon\), choose a random action; otherwise, sample \(a_{t} \sim \pi\left(s_{t}\right) \in \text{NE}\left(Q\left(s_{t}, \cdot\right)\right)\); then observe the reward \(r_{t}\) and next state \(s_{t+1}\).
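📗 A minimal sketch of this epsilon-greedy step, assuming a hypothetical helper nash_policy(Q, s) that returns an equilibrium mixed policy over joint actions (for zero-sum games it could be the minimax solver sketched further below).

```python
import random

def epsilon_greedy_step(Q, s, joint_actions, nash_policy, eps=0.1):
    # With probability eps, explore with a uniformly random joint action.
    if random.random() < eps:
        return random.choice(joint_actions)
    # Otherwise sample a_t ~ pi(s_t), where pi(s_t) is in NE(Q(s_t, .)).
    pi = nash_policy(Q, s)  # assumed: dict mapping joint action -> probability
    weights = [pi[a] for a in joint_actions]
    return random.choices(joint_actions, weights=weights)[0]
```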
📗 \(\text{NE}\left(Q\left(s, \cdot\right)\right)\) is not unique for general-sum games, so for general-sum MGs the value functions above are not well defined.
📗 The value of \(\text{NE}\left(Q\left(s, \cdot\right)\right)\) is unique for zero-sum games (by the minimax theorem, all equilibria share the same value), so minimax-Q (which uses a minimax solver for the stage game \(Q\left(s, \cdot\right)\)) converges to an MPE.
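📗 A minimal sketch of the stage-game solver inside minimax-Q, phrased as a linear program with scipy.optimize.linprog: the maximizing player's mixed strategy \(x\) solves \(\displaystyle\max_{x} \min_{j} \displaystyle\sum_{i} x_{i} Q\left(s, i, j\right)\), and the optimal objective is the (unique) game value. The matrix Q_s and the matching-pennies example are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def minimax_stage_game(Q_s):
    """Return the row player's maximin mixed strategy and the game value
    for the zero-sum stage game with payoff matrix Q_s (rows maximize)."""
    n, m = Q_s.shape
    # Variables: [x_1, ..., x_n, v]; maximize v  <=>  minimize -v.
    c = np.zeros(n + 1)
    c[-1] = -1.0
    # For every column action j: v - sum_i x_i Q_s[i, j] <= 0.
    A_ub = np.hstack([-Q_s.T, np.ones((m, 1))])
    b_ub = np.zeros(m)
    # Probabilities sum to one (the v coefficient is zero).
    A_eq = np.ones((1, n + 1))
    A_eq[0, -1] = 0.0
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], res.x[-1]

# Matching pennies: the unique value is 0 with the uniform strategy.
pi, v = minimax_stage_game(np.array([[1.0, -1.0], [-1.0, 1.0]]))
```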
Soccer Game Example
📗 Grid-world soccer game (the numbers reported are the percentage of games won and the episode length).
📗 Most deep MARL models assume partial observability (handled with recurrent units or attention modules) and use either centralized or independent learning (through best-response dynamics, similar to GAN training).
📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Yudong Chen, Yingyu Liang, and Charles Dyer.
📗 Content from note blocks marked "optional" and content from Wikipedia and other demo links are helpful for understanding the materials, but will not be explicitly tested on the exams.