
# Minimax Q Learning

📗 Training can be centralized, where all players jointly compute the Nash equilibrium, or decentralized, where each player computes its own optimal policy treating the other players' actions as a part of the state.
📗 For zero-sum games, centralized training can be executed similarly to single-agent Q learning with the update: \(Q\left(s_{t}, a_{t}\right) = \left(1 - \alpha\right) Q\left(s_{t}, a_{t}\right) + \alpha \left(r_{t} + \gamma \displaystyle\min_{a_{2}} \displaystyle\max_{a_{1}} Q\left(s_{t+1}, \left(a_{1}, a_{2}\right)\right)\right)\), where \(a_{t}\) is the joint action of the two players. For general-sum games, without assumptions on equilibrium selection, it is not clear how \(V\left(s_{t+1}\right)\) can be computed.
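A minimal sketch of this centralized update, assuming a tabular Q indexed by state and joint action; the names (n_states, n_a1, n_a2, alpha, gamma) are illustrative, and the pure-strategy min-max below is a simplification of the zero-sum stage-game value, which in general requires a linear program over mixed strategies:

```python
import numpy as np

# Illustrative sizes and learning parameters (not from the notes).
n_states, n_a1, n_a2 = 5, 3, 3
alpha, gamma = 0.1, 0.9

# Q table over (state, maximizer action a1, minimizer action a2).
Q = np.zeros((n_states, n_a1, n_a2))

def minimax_value(Q_s):
    # Pure-strategy approximation of the zero-sum value at a state:
    # min over the opponent's action of the maximizer's best response.
    return np.min(np.max(Q_s, axis=0))

def update(s, a1, a2, r, s_next):
    # Centralized minimax Q update from the notes:
    # Q(s, a) <- (1 - alpha) Q(s, a) + alpha (r + gamma * minimax value of next state).
    target = r + gamma * minimax_value(Q[s_next])
    Q[s, a1, a2] = (1 - alpha) * Q[s, a1, a2] + alpha * target
```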

📗 Not all decentralized algorithms converge to a Markov perfect equilibrium policy, but many learning algorithms are decentralized and treat the Markov game as independent MDPs, one for each player (see the sketch after this list).
➩ For bandit games, no-regret learning converges to the Nash equilibrium of zero-sum stage games, and for a special class of general-sum games called potential games, best response dynamics converge to a Nash equilibrium.
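A minimal sketch of decentralized (independent) Q learning, assuming two players who each keep a Q table over only their own actions and fold the other player into the environment; the names (n_states, n_actions, alpha, gamma, epsilon) are illustrative assumptions:

```python
import numpy as np

# Illustrative sizes and learning parameters (not from the notes).
n_states, n_actions = 5, 3
alpha, gamma, epsilon = 0.1, 0.9, 0.1

# One Q table per player, indexed by (state, own action) only.
Q = [np.zeros((n_states, n_actions)) for _ in range(2)]

def choose(i, s, rng):
    # Epsilon-greedy action for player i, ignoring the other player.
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[i][s]))

def update(i, s, a, r, s_next):
    # Standard single-agent Q update applied independently by each player,
    # treating the Markov game as that player's own MDP.
    target = r + gamma * np.max(Q[i][s_next])
    Q[i][s, a] = (1 - alpha) * Q[i][s, a] + alpha * target
```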

Last Updated: December 10, 2024 at 3:36 AM