Zero-sum Game with Unique Nash and Minimal Modification#
- game_redesign.RAP(policy_1, policy_2, original_reward, reward_bound=inf, value_lower_bound=-inf, value_upper_bound=inf, value_gap=None, perturbation_bound=None, perturbation_value=None, condition_number_bound=100)#
Find a zero-sum game redesigned_reward with (policy_1, policy_2) being the unique Nash equilibrium. The game that is the closest to original_reward, measured by L1 norm, is returned if original_reward is not None.

- Parameters:
  - policy_1 ((M,) array) – Target policy for player 1. policy_1[i] is the probability that player 1 uses action i.
  - policy_2 ((N,) array) – Target policy for player 2. policy_2[j] is the probability that player 2 uses action j.
  - original_reward ((M, N) array, optional) – The original reward matrix. original_reward[i][j] is the reward for player 1 when player 1 uses action i and player 2 uses action j.
  - reward_bound (float or inf, optional) – All entries of redesigned_reward will lie in the interval [-reward_bound, reward_bound] (the bound is symmetric since the game is zero-sum). The entries of original_reward are not required to lie in this interval.
  - value_lower_bound (float or inf, optional) – The lower bound on the Nash equilibrium value of redesigned_reward.
  - value_upper_bound (float or inf, optional) – The upper bound on the Nash equilibrium value of redesigned_reward.
  - value_gap (float or inf, optional) – The Nash equilibrium value of redesigned_reward will be at least value_gap larger than the value achievable by actions outside the support of policy_1 and policy_2. If value_gap is None, the default value 0.01 * numpy.linalg.norm(original_reward, numpy.inf) is used. This is the \(\iota\) in [1] and [2].
  - perturbation_bound (float, optional) – The maximum absolute perturbation added to redesigned_reward. The perturbation_value will be a uniform random number in the interval [-perturbation_bound, perturbation_bound]. If perturbation_bound is None, the default value 0.01 * numpy.linalg.norm(original_reward, numpy.inf) is used. This is the \(\lambda\) in [1].
  - perturbation_value (float, optional) – The perturbation value to use. If perturbation_value is None, a uniform random number in the interval [-perturbation_bound, perturbation_bound] is drawn; otherwise, if the supplied value does not lead to a game with a unique Nash equilibrium, InvalidPerturbationError is raised. This is the \(\epsilon\) in [1].
  - condition_number_bound (float, optional) – A matrix whose condition number exceeds condition_number_bound is treated as singular when checking the invertibility condition for Nash uniqueness.
- Returns:
  The zero-sum game redesigned_reward with (policy_1, policy_2) being the unique Nash equilibrium. The game that is the closest to original_reward, measured by L1 norm, is returned if original_reward is not None (a numerical sanity check of this guarantee is sketched after the Raises list below).
- Return type:
  (M, N) array
- Raises:
  - InvalidPolicyError – Raised if policy_1 and policy_2 are not probability distributions.
  - InvalidOriginalRewardError – Raised if policy_1 has M elements and policy_2 has N elements but original_reward is not an M by N matrix.
  - UnequalSupportSizeError – Raised if the sizes of the supports of policy_1 and policy_2 are not equal; in that case the problem is infeasible.
  - ValueRangeError – Raised if the Nash value range [value_lower_bound, value_upper_bound] and the interior of the adjusted reward value range (-reward_bound + value_gap + perturbation_bound, reward_bound - value_gap - perturbation_bound) have an empty intersection; in that case the problem is infeasible.
  - InvalidPerturbationError – Raised if perturbation_value is supplied but the resulting redesigned_reward does not have a unique Nash equilibrium (value_gap and condition_number_bound are used to check the SIISOW and INV conditions; see [1] for details).
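As referenced in the Returns entry above, the equilibrium guarantee can be checked numerically with plain numpy: at a Nash equilibrium of a zero-sum game, no pure action beats the equilibrium value for the maximizing player 1 and no pure action undercuts it for the minimizing player 2. This is a minimal sketch assuming the signature documented above; the target policies and original reward below are illustrative, not part of the library.

import numpy as np
import game_redesign

# Illustrative inputs: any probability vectors with equally sized supports work as targets.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
R_orig = np.array([[1.0, -1.0, 0.0],
                   [0.0, 2.0, -1.0],
                   [-1.0, 0.0, 1.0]])

R = np.asarray(game_redesign.RAP(p, q, R_orig))
v = p @ R @ q  # Nash equilibrium value of the redesigned game

# (p, q) is an equilibrium: no pure action beats v for player 1 (the maximizer),
# and no pure action undercuts v for player 2 (the minimizer).
assert np.max(R @ q) <= v + 1e-6
assert np.min(p @ R) >= v - 1e-6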
Notes#
Given policy_1 (\(\mathbf{p}\) with support \(\mathcal{I} \subseteq \mathcal{A}_{1}\)), policy_2 (\(\mathbf{q}\) with support \(\mathcal{J} \subseteq \mathcal{A}_{2}\)), original_reward (\(R^\circ\)), reward_bound (\(b\)), value_lower_bound (\(\underline{v}\)), value_upper_bound (\(\overline{v}\)), value_gap (\(\iota\)), perturbation_value (\(\epsilon\)), and perturbation_bound (\(\lambda\)), the output redesigned_reward (\(R^\dagger\)) is computed as follows:
Step 1 (Relax): compute an intermediate reward matrix \(R\) that is as close as possible to \(R^\circ\) in L1 norm while satisfying the reward bound, Nash value bound, and value gap constraints described above; see [1] for the exact formulation.
Step 2 (Perturb):
Start with \(R^\dagger = R\) and, while the condition number of \(\begin{bmatrix} R_{\mathcal{I} \mathcal{J}} & -{\mathbf{1}}_{\left| \mathcal{I} \right|} \\ {\mathbf{1}}^\top_{\left| \mathcal{J} \right|} & 0 \end{bmatrix}\) is larger than condition_number_bound, repeat the perturbation step. The \(R^{\text{eRPS}}\) in the perturbation step is the extended rock paper scissors game; see game_redesign.eRPS(). Details of the algorithm can be found in [1].
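The invertibility (INV) check in Step 2 can be reproduced with numpy by forming the bordered support matrix from the formula above and computing its condition number. The sketch below is illustrative and not part of the library; the reward matrix, supports, and helper name are made up for the example.

import numpy as np

def inv_condition_holds(reward, support_1, support_2, condition_number_bound=100):
    """Check the INV condition from the Notes: the bordered matrix
    [[R_IJ, -1], [1^T, 0]] must not be numerically singular."""
    R_IJ = np.asarray(reward, dtype=float)[np.ix_(support_1, support_2)]
    k = len(support_1)  # the supports are required to have equal size
    bordered = np.block([
        [R_IJ, -np.ones((k, 1))],
        [np.ones((1, k)), np.zeros((1, 1))],
    ])
    return np.linalg.cond(bordered) <= condition_number_bound

# Made-up example: a scaled rock paper scissors game with full supports.
R_dagger = 0.05 * np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
print(inv_condition_holds(R_dagger, [0, 1, 2], [0, 1, 2]))  # True: condition number is about 20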
If perturbation_value is None, the function uses a random perturbation and therefore returns a different game every time it is called.
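Supplying a fixed perturbation_value removes that source of randomness. A minimal sketch, assuming the signature above and assuming the optimization in Step 1 is deterministic (an assumption, not something this page states):

import numpy as np
import game_redesign

p = [1/3, 1/3, 1/3]
q = [1/3, 1/3, 1/3]
R_orig = [[0, 1, -1], [-1, 0, 1], [1, -1, 0]]

# With perturbation_value=None each call would draw a fresh random perturbation;
# supplying it explicitly makes repeated calls comparable.
A = game_redesign.RAP(p, q, R_orig, reward_bound=2, perturbation_value=0.02)
B = game_redesign.RAP(p, q, R_orig, reward_bound=2, perturbation_value=0.02)
print(np.allclose(A, B))  # expected True under the determinism assumption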
References#
Examples#
Example 1 from [1]: The two finger morra example.
game_redesign.RAP([7/12, 5/12], [7/12, 5/12], [[2, -3], [-3, 4]], float("inf"), 0, 0)
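In this call the last three positional arguments are reward_bound = inf, value_lower_bound = 0, and value_upper_bound = 0, so the redesigned game is forced to be fair even though the original Morra game favors player 2 under the target policies. A quick check of that property, assuming the call above returns a numpy-compatible array:

import numpy as np
import game_redesign

p = np.array([7/12, 5/12])
q = np.array([7/12, 5/12])
R_orig = np.array([[2.0, -3.0], [-3.0, 4.0]])

R = np.asarray(game_redesign.RAP(p, q, R_orig, float("inf"), 0, 0))
print(p @ R_orig @ q)  # -1/12: the original game's value under the target policies
print(p @ R @ q)       # ~0: the value bounds pin the redesigned Nash value to 0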
Example 2 from [1]: The rock paper scissors fire water example.
game_redesign.RAP([1/5, 1/5, 1/5, 1/5, 1/5], [1/5, 1/5, 1/5, 1/5, 1/5], [[0, -1, 1, -1, 1], [1, 0, -1, -1, 1], [-1, 1, 0, -1, 1], [1, 1, 1, 0, -1], [-1, -1, -1, 1, 0]], 1)
An example with a constant zero original reward, so the redesigned reward matrix will be the extended rock paper scissors game scaled by perturbation_value.
game_redesign.RAP([1/3, 1/3, 1/3], [1/3, 1/3, 1/3], [[0, 0, 0], [0, 0, 0], [0, 0, 0]], reward_bound=1, perturbation_value=0.01)
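A hedged check of the claim above, assuming the three-action extended rock paper scissors game is the familiar antisymmetric rock paper scissors matrix with entries in {-1, 0, 1} (this page does not spell out eRPS, so treat the expected outputs as assumptions):

import numpy as np
import game_redesign

R = np.asarray(game_redesign.RAP([1/3, 1/3, 1/3], [1/3, 1/3, 1/3],
                                 [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
                                 reward_bound=1, perturbation_value=0.01))

print(np.allclose(R, -R.T))  # expected True: the perturbation is a fair, antisymmetric game
print(np.abs(R).max())       # expected 0.01: perturbation_value times the unit payoffs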
An example where UnequalSupportSizeError is raised since the supports of policy_1 and policy_2 have different sizes.
game_redesign.RAP([1/3, 1/3, 1/3], [1/2, 0, 1/2], [[0, 0, 0], [0, 0, 0], [0, 0, 0]])
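Programs that construct the target policies at runtime may want to handle these errors explicitly. A minimal sketch, assuming the exception classes are importable from the game_redesign module (their actual location is not shown on this page):

import game_redesign

try:
    game_redesign.RAP([1/3, 1/3, 1/3], [1/2, 0, 1/2],
                      [[0, 0, 0], [0, 0, 0], [0, 0, 0]])
except game_redesign.UnequalSupportSizeError:
    # policy_1 has a support of size 3 but policy_2 only of size 2,
    # so the problem is infeasible per the Raises section above.
    print("support sizes differ; choose target policies with equally sized supports")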
An example where ValueRangeError is raised since the adjusted reward value range is empty: with reward_bound = 1, value_gap = 0.5, and perturbation_bound = 0.5, the interior (-reward_bound + value_gap + perturbation_bound, reward_bound - value_gap - perturbation_bound) = (0, 0) contains no points.
game_redesign.RAP([1/2, 1/2], [1/2, 1/2], [[1, 0], [0, 1]], reward_bound=1, value_gap=0.5, perturbation_bound=0.5)
An example where InvalidPerturbationError is raised since the supplied perturbation_value is too small for the SIISOW or INV conditions to be satisfied numerically, given value_gap and condition_number_bound.
game_redesign.RAP([1/2, 1/2], [1/2, 1/2], [[0, 0], [0, 0]], value_gap=0.2, perturbation_value=0.1, condition_number_bound=100)