
# Overview

📗 Readings: MARL Chapter 4.

# Markov Games

📗 The game environment is modeled by a Markov Game (MG): \(M = \left(S, A, R, P\right)\), where \(S\) is the set of states, \(A = A_{1} \times A_{2} \times ... \times A_{n}\) is the set of action profiles (one action for each player), \(R\) is the reward function where \(R_{i}\left(s, \left(a_{i}, a_{-i}\right)\right)\) is the reward for player \(i\) from state \(s \in S\) when \(i\) uses \(a_{i} \in A_{i}\) and the other players use \(a_{-i} \in A_{-i}\), and \(P\) is the transition function where \(P\left(s' | s, a\right)\) is the probability that the state becomes \(s' \in S\) given action profile \(a \in A\) is used in state \(s \in S\), and \(P\left(s\right)\) is the probability that the initial state is \(s\).
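📗 As a concrete (hypothetical) illustration, the arrays below show one way a small two-player Markov game could be stored in Python with NumPy; the sizes and entries are made up for the example and are not taken from the demo below.
```python
import numpy as np

n_states, n_actions = 3, 2   # |S| = 3 states, |A_i| = 2 actions per player (illustrative sizes)

# R[i, s, a1, a2] = reward R_i(s, (a1, a2)) to player i in state s under joint action (a1, a2)
R = np.zeros((2, n_states, n_actions, n_actions))
R[0, 0, 1, 0] = 1.0          # example entry: player 1 gets +1 for joint action (1, 0) in state 0
R[1] = -R[0]                 # make the game zero-sum: player 2's reward is the negative of player 1's

# P[s, a1, a2, s'] = probability P(s' | s, (a1, a2)) that the next state is s'
P = np.full((n_states, n_actions, n_actions, n_states), 1.0 / n_states)

# p0[s] = probability P(s) that the initial state is s
p0 = np.full(n_states, 1.0 / n_states)

assert np.allclose(P.sum(axis=-1), 1.0) and np.isclose(p0.sum(), 1.0)
```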
📗 The stage game in state \(s\) given policy \(\pi\) has payoffs \(Q^{\pi}_{i}\left(s, a\right) = R_{i}\left(s, a\right) + \gamma \displaystyle\sum_{s'} P\left(s' | s, a\right) Q^{\pi}_{i}\left(s', \pi\left(s'\right)\right)\) for each player \(i\). Note that without a given policy, the \(Q\) functions, and thus the stage games, are not defined.
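📗 A minimal sketch of evaluating one player's \(Q^{\pi}\) for a fixed deterministic joint policy by iterating the equation above; the random reward and transition arrays are placeholders, not the car game from the demo.
```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

R = rng.normal(size=(n_states, n_actions, n_actions))   # player i's reward R_i(s, (a1, a2)), placeholder values
P = rng.random((n_states, n_actions, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)                      # normalize so P(. | s, a) is a distribution

pi = np.zeros((2, n_states), dtype=int)                 # fixed deterministic joint policy pi_j(s), placeholder

# Iterate Q(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) Q(s', pi(s')) until approximate convergence
Q = np.zeros((n_states, n_actions, n_actions))
for _ in range(1000):
    next_q = Q[np.arange(n_states), pi[0], pi[1]]       # Q(s', pi(s')) for every next state s'
    Q = R + gamma * np.einsum("sabt,t->sab", P, next_q)
```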

📗 In general, a Markov game has no single optimal policy, since each player's best policy depends on the other players' policies. A policy \(\pi\) is a Markov perfect equilibrium policy if \(\pi\left(s\right)\) is a Nash equilibrium of the stage game in every state \(s\), meaning \(\pi_{i}\left(s\right) = \mathop{\mathrm{argmax}}_{a} Q^{\pi}_{i}\left(s, \left(a, \pi_{-i}\left(s\right)\right)\right)\) for every player \(i\).
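📗 A sketch of one way a pure-strategy Nash check could be done for each stage game, assuming both players' stage-game \(Q^{\pi}_{i}\) values have already been computed (the demo below reports a similar "Check Nash" output, though its exact method is not shown here). Random placeholder values are used, so this check will usually report a profitable deviation.
```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 3, 2
Q1 = rng.normal(size=(n_states, n_actions, n_actions))  # placeholder Q_1^pi(s, (a1, a2))
Q2 = rng.normal(size=(n_states, n_actions, n_actions))  # placeholder Q_2^pi(s, (a1, a2))
pi = np.zeros((2, n_states), dtype=int)                 # candidate deterministic joint policy

def is_markov_perfect(Q1, Q2, pi):
    """Return True if pi(s) is a pure-strategy Nash equilibrium of every stage game."""
    for s in range(Q1.shape[0]):
        a1, a2 = pi[0, s], pi[1, s]
        # player 1 deviates holding a2 fixed; player 2 deviates holding a1 fixed
        if Q1[s, a1, a2] < Q1[s, :, a2].max() or Q2[s, a1, a2] < Q2[s, a1, :].max():
            return False
    return True

print(is_markov_perfect(Q1, Q2, pi))
```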
📗 Note: the value functions can be defined for zero-sum games since the Nash equilibrium value is unique, but they are not well defined for general-sum games. In some algorithms, the value of a game is defined as the value of the Nash equilibrium with the largest total reward for all players (called social welfare), but this is not a standard definition in game theory: the players do not have to coordinate on any particular equilibrium when there are many.
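📗 For the zero-sum case, the stage-game value is the minimax value, so value iteration can be run on the stage games (Shapley's value iteration). The rough sketch below solves each stage matrix game with a linear program via scipy.optimize.linprog; the random game data is a placeholder with small finite state and action sets, not the car game from the demo.
```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Minimax value of the zero-sum matrix game M, where the row player maximizes M[a1, a2]."""
    m, n = M.shape
    # variables z = (x, v): row player's mixed strategy x and game value v; maximize v
    c = np.r_[np.zeros(m), -1.0]                        # linprog minimizes, so minimize -v
    A_ub = np.c_[-M.T, np.ones(n)]                      # for every column a2: v <= sum_a1 x[a1] * M[a1, a2]
    b_ub = np.zeros(n)
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)        # x sums to 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]           # x >= 0, v unrestricted
    return linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds).x[-1]

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 3, 2, 0.9
R = rng.normal(size=(n_states, n_actions, n_actions))   # player 1's reward; player 2 receives -R
P = rng.random((n_states, n_actions, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)

V = np.zeros(n_states)
for _ in range(200):                                     # "number of value iterations"
    Q = R + gamma * np.einsum("sabt,t->sab", P, V)       # stage-game payoff matrix at each state
    V = np.array([matrix_game_value(Q[s]) for s in range(n_states)])
```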

# Car Too

📗 [Interactive demo: a two-player car game; the page shows each car's position, each car's reward in the current period, and an animation of the cars with an adjustable speed.]

# Output:

📗 [The demo displays the stage-game Q function (rows and columns are the actions U, D, L, R), the V function, the computed policy, and a check of whether the policy is a Nash equilibrium.]

# Settings:

📗 [Adjustable settings: the reward function (with a zero-sum option), the collision rewards for the two cars, the collision transition, the boundaries, the discount factor, and the number of value iterations.]

Last Updated: May 07, 2024 at 12:22 AM