# Overview

📗 Readings: MARL Chapter 6.
📗 Wikipedia page: Link

# Minimax Q Learning

📗 Training can be centralized, where all players jointly compute the Nash equilibrium, or decentralized, where each player learns its own policy while treating the other players as part of the environment.
📗 For zero-sum games, centralized training can be carried out similarly to single-agent Q learning with the update: \(Q\left(s_{t}, a_{t}\right) = \left(1 - \alpha\right) Q\left(s_{t}, a_{t}\right) + \alpha \left(r_{t} + \gamma \displaystyle\min_{a_{2}} \displaystyle\max_{a_{1}} Q\left(s_{t+1}, a_{1}, a_{2}\right)\right)\), where \(a_{t} = \left(a_{1}, a_{2}\right)\) is the joint action and \(Q\) stores the value to player 1. For general-sum games, without assumptions on equilibrium selection, it is not clear how \(V\left(s_{t+1}\right)\) should be computed.
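➭ Below is a minimal tabular sketch of the centralized zero-sum update above, assuming two players with small finite action sets; the names (minimax_q_update, Q indexed by state and then joint action) are illustrative only. It uses the pure-strategy min-max from the formula; the standard minimax-Q algorithm instead computes the minimax value over player 1's mixed strategies with a small linear program.
```python
import numpy as np

def minimax_q_update(Q, s, a1, a2, r, s_next, alpha=0.1, gamma=0.9):
    """One centralized update for a two-player zero-sum Markov game.

    Q[s] is a matrix indexed by (player 1 action, player 2 action) and
    stores the value to player 1 (player 2 receives the negative).
    The backup target uses the pure-strategy minimax value of the
    next-state matrix, matching the update rule above; full minimax-Q
    would solve a linear program over mixed strategies instead.
    """
    # V(s') = min over player 2's action of (max over player 1's action)
    v_next = Q[s_next].max(axis=0).min()
    Q[s][a1, a2] = (1 - alpha) * Q[s][a1, a2] + alpha * (r + gamma * v_next)
    return Q

# Example: 2 states, 2 actions per player, all entries initialized to 0.
Q = {s: np.zeros((2, 2)) for s in range(2)}
Q = minimax_q_update(Q, s=0, a1=1, a2=0, r=1.0, s_next=1)
```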

📗 Not all decentralized algorithms converge to a Markov perfect equilibrium policy, but many learning algorithms are decentralized and treat the Markov game as a set of independent MDPs, one for each player (see the first sketch after this list).
➭ For bandit games (repeated matrix games), no-regret learning converges, in time-averaged play, to the Nash equilibrium of the stage game in the zero-sum case; and for a special class of general-sum games called potential games, best-response dynamics converge to a pure Nash equilibrium (see the second sketch below).
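➭ First sketch: a minimal decentralized (independent) Q learning update, assuming each player keeps its own table over (state, own action) and ignores the other players' actions; the names independent_q_update, Q_i, r_i are illustrative placeholders, and this scheme is not guaranteed to converge to a Markov perfect equilibrium.
```python
import numpy as np

def independent_q_update(Q_i, s, a_i, r_i, s_next, alpha=0.1, gamma=0.9):
    """One update of player i's own table Q_i[state, own action].

    The other players' actions are not used, so from player i's point
    of view the Markov game looks like a single-agent (and possibly
    non-stationary) MDP.
    """
    target = r_i + gamma * Q_i[s_next].max()
    Q_i[s, a_i] = (1 - alpha) * Q_i[s, a_i] + alpha * target
    return Q_i

# Each of two players maintains its own table: 3 states, 2 own actions.
Q_tables = [np.zeros((3, 2)) for _ in range(2)]
Q_tables[0] = independent_q_update(Q_tables[0], s=0, a_i=1, r_i=0.5, s_next=2)
```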
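➭ Second sketch: best-response dynamics on a made-up two-player congestion game (an exact potential game). Players take turns switching to a best response against the others' current actions; because the game admits a potential function, the process terminates at a pure Nash equilibrium.
```python
# A 2-player, 2-route congestion game: each player picks route 0 or 1
# and pays a cost equal to the number of players on its chosen route.
def cost(profile, i):
    return sum(1 for a_j in profile if a_j == profile[i])

def best_response_dynamics(profile=(0, 0), max_iters=100):
    profile = list(profile)
    for _ in range(max_iters):
        improved = False
        for i in range(len(profile)):
            # Player i deviates to the action minimizing its own cost,
            # holding the other players' actions fixed.
            candidates = [profile[:i] + [a] + profile[i + 1:] for a in range(2)]
            best = min(range(2), key=lambda a: cost(candidates[a], i))
            if cost(candidates[best], i) < cost(profile, i):
                profile[i], improved = best, True
        if not improved:
            break   # no profitable deviation: a pure Nash equilibrium
    return tuple(profile)

print(best_response_dynamics())   # -> (1, 0): the players split across routes
```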

Last Updated: May 07, 2024 at 12:22 AM