Zero-sum Game with Unique Nash and Minimal Modification#
- game_redesign.RAP(policy_1, policy_2, original_reward, reward_bound=inf, value_lower_bound=-inf, value_upper_bound=inf, value_gap=None, perturbation_bound=None, perturbation_value=None, condition_number_bound=100)#
Find a zero-sum game redesigned_reward with (policy_1, policy_2) being the unique Nash equilibrium. The game that is the closest to original_reward, measured by L1 norm, is returned if original_reward is not None.

- Parameters:
  - policy_1 ((M,) array) – Target policy for player 1. policy_1[i] is the probability that player 1 uses action i.
  - policy_2 ((N,) array) – Target policy for player 2. policy_2[j] is the probability that player 2 uses action j.
  - original_reward ((M, N) array, optional) – The original reward matrix. original_reward[i][j] is the reward for player 1 when player 1 uses action i and player 2 uses action j.
  - reward_bound (float or inf, optional) – All entries of redesigned_reward will lie in the interval [-reward_bound, reward_bound] (the bound is symmetric since the game is zero-sum). The entries of original_reward are not required to lie in this interval.
  - value_lower_bound (float or inf, optional) – The lower bound on the Nash equilibrium value of redesigned_reward.
  - value_upper_bound (float or inf, optional) – The upper bound on the Nash equilibrium value of redesigned_reward.
  - value_gap (float or inf, optional) – The Nash equilibrium value of redesigned_reward will be at least value_gap larger than the value achievable by actions outside the support of policy_1 and policy_2. If value_gap is None, the default value 0.01 * numpy.linalg.norm(original_reward, numpy.inf) is used. This is the \(\iota\) in [1] and [2].
  - perturbation_bound (float, optional) – The maximum absolute perturbation added to redesigned_reward. The perturbation_value will be a uniform random number in the interval [-perturbation_bound, perturbation_bound]. If perturbation_bound is None, the default value 0.01 * numpy.linalg.norm(original_reward, numpy.inf) is used. This is the \(\lambda\) in [1].
  - perturbation_value (float, optional) – The perturbation value to use. If perturbation_value is None, a uniform random number in the interval [-perturbation_bound, perturbation_bound] is drawn; otherwise, if the supplied value does not lead to a game with a unique Nash equilibrium, InvalidPerturbationError is raised. This is the \(\epsilon\) in [1].
  - condition_number_bound (float, optional) – A matrix whose condition number exceeds condition_number_bound is treated as singular when checking the invertibility condition for Nash uniqueness.
- Returns:
  The zero-sum game redesigned_reward with (policy_1, policy_2) being the unique Nash equilibrium. The game that is the closest to original_reward, measured by L1 norm, is returned if original_reward is not None (a numerical sanity check of this guarantee is sketched after the Raises list below).
- Return type:
  (M, N) array
- Raises:
  - InvalidPolicyError – Raised if policy_1 and policy_2 are not probability distributions.
  - InvalidOriginalRewardError – Raised if policy_1 has M elements and policy_2 has N elements but original_reward is not an M by N matrix.
  - UnequalSupportSizeError – Raised if the sizes of the supports of policy_1 and policy_2 are not equal; in that case the problem is infeasible.
  - ValueRangeError – Raised if the Nash value range [value_lower_bound, value_upper_bound] and the interior of the adjusted reward value range (-reward_bound + value_gap + perturbation_bound, reward_bound - value_gap - perturbation_bound) have an empty intersection; in that case the problem is infeasible.
  - InvalidPerturbationError – Raised if perturbation_value is supplied but the resulting redesigned_reward does not have a unique Nash equilibrium (value_gap and condition_number_bound are used to check the SIISOW and INV conditions; see [1] for details).
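As referenced in the Returns entry above, the equilibrium guarantee can be checked numerically with plain numpy: at a Nash equilibrium of a zero-sum game, no pure action beats the equilibrium value for the maximizing player 1 and no pure action undercuts it for the minimizing player 2. This is a minimal sketch assuming the signature documented above; the target policies and original reward below are illustrative, not part of the library.

import numpy as np
import game_redesign

# Illustrative inputs: any probability vectors with equally sized supports work as targets.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
R_orig = np.array([[1.0, -1.0, 0.0],
                   [0.0, 2.0, -1.0],
                   [-1.0, 0.0, 1.0]])

R = np.asarray(game_redesign.RAP(p, q, R_orig))
v = p @ R @ q  # Nash equilibrium value of the redesigned game

# (p, q) is an equilibrium: no pure action beats v for player 1 (the maximizer),
# and no pure action undercuts v for player 2 (the minimizer).
assert np.max(R @ q) <= v + 1e-6
assert np.min(p @ R) >= v - 1e-6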
Notes#
Given policy_1 (\(\mathbf{p}\) with support \(\mathcal{I} \subseteq \mathcal{A}_{1}\)), policy_2 (\(\mathbf{q}\) with support \(\mathcal{J} \subseteq \mathcal{A}_{2}\)), original_reward (\(R^\circ\)), reward_bound (\(b\)), value_lower_bound (\(\underline{v}\)), value_upper_bound (\(\overline{v}\)), value_gap (\(\iota\)), perturbation_value (\(\epsilon\)), and perturbation_bound (\(\lambda\)), the output redesigned_reward (\(R^\dagger\)) is computed as follows:
Step 1 (Relax): compute an intermediate reward matrix \(R\) that is as close as possible to \(R^\circ\) in L1 norm while satisfying the reward bound, Nash value bound, and value gap constraints described above; see [1] for the exact formulation.
Step 2 (Perturb):
Start with \(R^\dagger = R\) and, while the condition number of \(\begin{bmatrix} R_{\mathcal{I} \mathcal{J}} & -{\mathbf{1}}_{\left| \mathcal{I} \right|} \\ {\mathbf{1}}^\top_{\left| \mathcal{J} \right|} & 0 \end{bmatrix}\) is larger than condition_number_bound, repeat the perturbation step. The \(R^{\text{eRPS}}\) in the perturbation step is the extended rock paper scissors game; see game_redesign.eRPS(). Details of the algorithm can be found in [1].
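The invertibility (INV) check in Step 2 can be reproduced with numpy by forming the bordered support matrix from the formula above and computing its condition number. The sketch below is illustrative and not part of the library; the reward matrix, supports, and helper name are made up for the example.

import numpy as np

def inv_condition_holds(reward, support_1, support_2, condition_number_bound=100):
    """Check the INV condition from the Notes: the bordered matrix
    [[R_IJ, -1], [1^T, 0]] must not be numerically singular."""
    R_IJ = np.asarray(reward, dtype=float)[np.ix_(support_1, support_2)]
    k = len(support_1)  # the supports are required to have equal size
    bordered = np.block([
        [R_IJ, -np.ones((k, 1))],
        [np.ones((1, k)), np.zeros((1, 1))],
    ])
    return np.linalg.cond(bordered) <= condition_number_bound

# Made-up example: a scaled rock paper scissors game with full supports.
R_dagger = 0.05 * np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
print(inv_condition_holds(R_dagger, [0, 1, 2], [0, 1, 2]))  # True: condition number is about 20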
If perturbation_value is None, the function uses a random perturbation and therefore returns a different game every time it is called.
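Supplying a fixed perturbation_value removes that source of randomness. A minimal sketch, assuming the signature above and assuming the optimization in Step 1 is deterministic (an assumption, not something this page states):

import numpy as np
import game_redesign

p = [1/3, 1/3, 1/3]
q = [1/3, 1/3, 1/3]
R_orig = [[0, 1, -1], [-1, 0, 1], [1, -1, 0]]

# With perturbation_value=None each call would draw a fresh random perturbation;
# supplying it explicitly makes repeated calls comparable.
A = game_redesign.RAP(p, q, R_orig, reward_bound=2, perturbation_value=0.02)
B = game_redesign.RAP(p, q, R_orig, reward_bound=2, perturbation_value=0.02)
print(np.allclose(A, B))  # expected True under the determinism assumption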
References#
Examples#
Example 1 from [1]: The two finger morra example.
game_redesign.RAP([7/12, 5/12], [7/12, 5/12], [[2, -3], [-3, 4]], float("inf"), 0, 0)
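In this call the last three positional arguments are reward_bound = inf, value_lower_bound = 0, and value_upper_bound = 0, so the redesigned game is forced to be fair even though the original Morra game favors player 2 under the target policies. A quick check of that property, assuming the call above returns a numpy-compatible array:

import numpy as np
import game_redesign

p = np.array([7/12, 5/12])
q = np.array([7/12, 5/12])
R_orig = np.array([[2.0, -3.0], [-3.0, 4.0]])

R = np.asarray(game_redesign.RAP(p, q, R_orig, float("inf"), 0, 0))
print(p @ R_orig @ q)  # -1/12: the original game's value under the target policies
print(p @ R @ q)       # ~0: the value bounds pin the redesigned Nash value to 0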
Example 2 from [1]: The rock paper scissors fire water example.
game_redesign.RAP([1/5, 1/5, 1/5, 1/5, 1/5], [1/5, 1/5, 1/5, 1/5, 1/5], [[0, -1, 1, -1, 1], [1, 0, -1, -1, 1], [-1, 1, 0, -1, 1], [1, 1, 1, 0, -1], [-1, -1, -1, 1, 0]], 1)
An example with a constant zero original reward, so the redesigned reward matrix will be the extended rock paper scissors game scaled by perturbation_value.
game_redesign.RAP([1/3, 1/3, 1/3], [1/3, 1/3, 1/3], [[0, 0, 0], [0, 0, 0], [0, 0, 0]], reward_bound=1, perturbation_value=0.01)
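A hedged check of the claim above, assuming the three-action extended rock paper scissors game is the familiar antisymmetric rock paper scissors matrix with entries in {-1, 0, 1} (this page does not spell out eRPS, so treat the expected outputs as assumptions):

import numpy as np
import game_redesign

R = np.asarray(game_redesign.RAP([1/3, 1/3, 1/3], [1/3, 1/3, 1/3],
                                 [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
                                 reward_bound=1, perturbation_value=0.01))

print(np.allclose(R, -R.T))  # expected True: the perturbation is a fair, antisymmetric game
print(np.abs(R).max())       # expected 0.01: perturbation_value times the unit payoffs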
An example where UnequalSupportSizeError is raised since the supports of policy_1 and policy_2 have different sizes.
game_redesign.RAP([1/3, 1/3, 1/3], [1/2, 0, 1/2], [[0, 0, 0], [0, 0, 0], [0, 0, 0]])
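Programs that construct the target policies at runtime may want to handle these errors explicitly. A minimal sketch, assuming the exception classes are importable from the game_redesign module (their actual location is not shown on this page):

import game_redesign

try:
    game_redesign.RAP([1/3, 1/3, 1/3], [1/2, 0, 1/2],
                      [[0, 0, 0], [0, 0, 0], [0, 0, 0]])
except game_redesign.UnequalSupportSizeError:
    # policy_1 has a support of size 3 but policy_2 only of size 2,
    # so the problem is infeasible per the Raises section above.
    print("support sizes differ; choose target policies with equally sized supports")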
An example where ValueRangeError is raised since the adjusted reward value range is empty: with reward_bound = 1, value_gap = 0.5, and perturbation_bound = 0.5, the interior (-reward_bound + value_gap + perturbation_bound, reward_bound - value_gap - perturbation_bound) = (0, 0) contains no points.
game_redesign.RAP([1/2, 1/2], [1/2, 1/2], [[1, 0], [0, 1]], reward_bound=1, value_gap=0.5, perturbation_bound=0.5)
An example where InvalidPerturbationError is raised since the supplied perturbation_value is too small for the SIISOW or INV conditions to be satisfied numerically, given value_gap and condition_number_bound.
game_redesign.RAP([1/2, 1/2], [1/2, 1/2], [[0, 0], [0, 0]], value_gap=0.2, perturbation_value=0.1, condition_number_bound=100)