Young Wu's Homepage

Prev: L19, Next: L21
Course Links: Canvas, Piazza, TopHat (212925)
Zoom Links: MW 4:00, TR 1:00, TR 2:30.


# Static Evaluation Function

📗 The heuristic to estimate the value at an internal state for games is called a static (board) evaluation (SBE) function: Wikipedia.

➩ For zero-sum games, SBE for one player should be the negative of the SBE for the other player.

➩ At terminal states, the SBE should agree with the cost or reward at that state.
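📗 The two properties above can be illustrated with a minimal sketch for Tic-Tac-Toe. The heuristic (difference in the number of still-open lines, scaled by 10) and the names `LINES`, `winner`, and `sbe` are assumptions made for illustration, not the evaluation function used in the course.

```python
# All eight winning lines on a 3x3 board indexed 0..8 (row-major).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def sbe(board, player):
    """Hypothetical SBE from `player`'s point of view, always in [-1, 1]."""
    other = 'O' if player == 'X' else 'X'
    w = winner(board)
    if w is not None:
        # Terminal state: agree with the actual reward (+1 win, -1 loss).
        return 1.0 if w == player else -1.0
    if ' ' not in board:
        return 0.0  # Terminal state: full board with no winner is a draw.
    # A line is still open for a player if it has none of the opponent's marks.
    mine = sum(all(board[i] != other for i in line) for line in LINES)
    theirs = sum(all(board[i] != player for i in line) for line in LINES)
    # Assumed scaling: at most 8 lines, so the estimate stays inside (-1, 1).
    return (mine - theirs) / 10.0
```

Because the two counts swap when the player changes, `sbe(board, 'X')` is always the negative of `sbe(board, 'O')`, which is the zero-sum property.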

📗 For Chess, the SBE can be computed by a neural network based on features such as material, mobility, king safety, and center control; or by a convolutional neural network that treats the board as an image.

📗 IDS can be used with SBE.

➩ In iteration \(d\), the search depth is limited to \(d\), and the SBE values of the internal states at depth \(d\) are used as their costs or rewards.
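📗 A minimal sketch of IDS with an SBE, using a toy game chosen only for illustration: players alternately take 1 or 2 stones from a pile, and whoever takes the last stone wins. Player 0 is the max player. The game, the constant-zero `sbe`, and all function names are assumptions, not part of the lecture.

```python
def moves(state):
    pile, _ = state
    return [k for k in (1, 2) if k <= pile]

def play(state, k):
    pile, to_move = state
    return (pile - k, 1 - to_move)

def is_terminal(state):
    return state[0] == 0

def reward(state):
    # The player who just moved took the last stone and wins.
    # state[1] is the player to move now, so the previous mover is 1 - state[1].
    return 1.0 if 1 - state[1] == 0 else -1.0

def sbe(state):
    # Placeholder heuristic (assumption): every internal state looks even.
    return 0.0

def minimax(state, depth):
    """Depth-limited minimax: use the SBE when the depth limit is reached."""
    if is_terminal(state):
        return reward(state)
    if depth == 0:
        return sbe(state)
    values = [minimax(play(state, k), depth - 1) for k in moves(state)]
    return max(values) if state[1] == 0 else min(values)

def ids_best_move(state, max_depth):
    """Run depth limits 1, 2, ..., max_depth and keep the latest best move."""
    sign = 1 if state[1] == 0 else -1   # min player prefers low values
    best = None
    for d in range(1, max_depth + 1):
        best = max(moves(state),
                   key=lambda k: sign * minimax(play(state, k), d - 1))
    return best
```

With a pile of 4 and the max player to move, a depth limit of 4 is enough to reach every terminal state, so the SBE is no longer needed and the search returns the exact minimax move.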

TopHat Discussion

📗 [4 points] The subscripts are heuristics (static evaluation, or estimated alpha and beta values) at the internal nodes.

TopHat Discussion

📗 What are some heuristics (SBE) for the game of Teeko: Link, Wikipedia?


📗 [1 points] Find the optimal strategy against a min player that uses a random strategy with probability \(p\):


# Monte Carlo Tree Search

📗 Random subgames can be simulated by selecting random moves for both players: Wikipedia.

📗 The move corresponding to the highest expected reward (win rates) can be picked.

➩ Alternatively, the move corresponding to the highest optimistic estimate of the reward (win rate) can be picked.
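📗 The idea of picking the move with the highest simulated win rate can be sketched on a toy game (take 1 or 2 stones from a pile, the player who takes the last stone wins). This is only the random play-out step, not a full tree search, and the game, the number of play-outs, and the function names are assumptions for illustration.

```python
import random

def random_playout(pile, to_move, rng):
    """Play uniformly random moves until the pile is empty; return the winner."""
    while True:
        k = rng.choice([m for m in (1, 2) if m <= pile])
        pile -= k
        if pile == 0:
            return to_move          # whoever takes the last stone wins
        to_move = 1 - to_move

def win_rates(pile, n_playouts=2000, seed=0):
    """Estimate player 0's win rate for each first move by random play-outs."""
    rng = random.Random(seed)
    rates = {}
    for k in (m for m in (1, 2) if m <= pile):
        if pile - k == 0:
            rates[k] = 1.0          # taking the last stone wins immediately
            continue
        # After player 0 takes k stones, player 1 moves next.
        wins = sum(random_playout(pile - k, 1, rng) == 0
                   for _ in range(n_playouts))
        rates[k] = wins / n_playouts
    return rates

def best_move(pile):
    """Pick the first move with the highest estimated win rate."""
    rates = win_rates(pile)
    return max(rates, key=rates.get)
```

For a pile of 4, the exact win probabilities under random play are 0.75 after taking 1 stone and 0.5 after taking 2 stones, so the estimates should favor taking 1.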

Example

📗 AlphaGo uses Monte Carlo Tree Search with more than \(10^{5}\) play-outs: Wikipedia.

📗 AlphaGo uses a convolutional neural network to compute the SBE: Link and Link.

Math Notes (Optional)

📗 The optimistic estimate of the reward is called the upper confidence bound (UCB) of the rewards (or win rates here): \(\dfrac{w_{s}}{n_{s}} + c \sqrt{\dfrac{\log T}{n_{s}}}\), where \(w_{s}\) is the number of wins after state \(s\), \(n_{s}\) is the number of simulations after \(s\), \(T\) is the total number of simulations, and \(c\) is a constant that controls the amount of exploration.

➩ More details will be discussed in the reinforcement learning lectures.
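📗 The upper confidence bound above can be computed directly from the win and simulation counts. In this sketch, the default \(c = \sqrt{2}\), the treatment of unvisited states as infinitely optimistic, and the names `ucb` and `select` are assumptions made for illustration.

```python
import math

def ucb(wins, sims, total, c=math.sqrt(2)):
    """Win rate plus exploration bonus: w_s / n_s + c * sqrt(log T / n_s)."""
    if sims == 0:
        return math.inf             # assumption: always try unvisited states first
    return wins / sims + c * math.sqrt(math.log(total) / sims)

def select(stats, c=math.sqrt(2)):
    """stats maps each move to (wins, simulations); return the highest-UCB move."""
    total = sum(n for _, n in stats.values())
    return max(stats, key=lambda a: ucb(stats[a][0], stats[a][1], total, c))
```

With `stats = {'a': (7, 10), 'b': (1, 2)}`, setting `c=0` selects `'a'` (the higher win rate), while the default `c` selects `'b'`, because the rarely simulated move receives a larger exploration bonus.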

# Rationalizability

📗 Unlike sequential games, for simultaneous move games, one player (agent) does not know the action taken by the other player.

📗 Given the actions of the other players, the optimal action is called the best response.
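📗 As a small illustration, best responses can be read off a payoff matrix by fixing the other player's action and maximizing over one's own actions. The 2x3 game below and the function names are hypothetical, chosen only for this sketch.

```python
# Hypothetical game: rows are player A's actions (U, D), columns are
# player B's actions (L, C, R); each entry is (A's payoff, B's payoff).
PAYOFFS = [
    [(1, 0), (1, 2), (0, 1)],   # U
    [(0, 3), (0, 1), (2, 0)],   # D
]

def best_responses_row(payoffs, col):
    """Row indices that maximize A's payoff when B plays column `col`."""
    values = [row[col][0] for row in payoffs]
    best = max(values)
    return [r for r, v in enumerate(values) if v == best]

def best_responses_col(payoffs, row):
    """Column indices that maximize B's payoff when A plays row `row`."""
    values = [cell[1] for cell in payoffs[row]]
    best = max(values)
    return [c for c, v in enumerate(values) if v == best]
```

Both functions return lists because several actions can tie for the highest payoff, in which case each of them is a best response.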

📗 An action is dominated if it is worse than another action given all actions of the other players.

➩ For finite games (finite number of players and finite number of actions), an action is dominated if and only if it is never a best response.

➩ An action is strictly dominated if it is strictly worse than another action given all actions of the other players. A dominated action is weakly dominated if it is not strictly dominated.

📗 Rationalizability (IESDS, Iterated Elimination of Strictly Dominated Strategies): iteratively remove the actions that are dominated (or never best responses for finite games): Wikipedia.
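📗 The elimination procedure can be sketched for a two-player game as follows. This version only checks domination by other pure actions, which is a simplifying assumption, and the example game and the name `iesds` are hypothetical.

```python
def iesds(payoffs):
    """payoffs[r][c] = (row player's payoff, column player's payoff).

    Repeatedly delete a row or column that is strictly dominated by another
    remaining pure action; return the surviving row and column indices.
    """
    rows = list(range(len(payoffs)))
    cols = list(range(len(payoffs[0])))
    changed = True
    while changed:
        changed = False
        for r in rows:
            # r is strictly dominated if some r2 is better against every column.
            if any(all(payoffs[r2][c][0] > payoffs[r][c][0] for c in cols)
                   for r2 in rows if r2 != r):
                rows.remove(r)
                changed = True
                break
        for c in cols:
            # c is strictly dominated if some c2 is better against every row.
            if any(all(payoffs[r][c2][1] > payoffs[r][c][1] for r in rows)
                   for c2 in cols if c2 != c):
                cols.remove(c)
                changed = True
                break
    return rows, cols

# Hypothetical game: rows U, D for player A; columns L, C, R for player B.
GAME = [
    [(1, 0), (1, 2), (0, 1)],   # U
    [(0, 3), (0, 1), (2, 0)],   # D
]
```

For `GAME`, column R is removed first (C is strictly better for B), then row D (U is strictly better for A once R is gone), then column L, leaving only (U, C) with payoffs (1, 2).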

TopHat Discussion

📗 [1 points] Write down an integer between and that is the closest to two thirds \(\dfrac{2}{3}\) of the average of everyone's (including yours) integers.

📗 Answer: .

TopHat Quiz

(Past Exam Question) ID:

📗 [4 points] Perform iterated elimination of strictly dominated strategies (i.e. find rationalizable actions). Player A's strategies are the rows. The two numbers are (A, B)'s payoffs, respectively. Recall that each player wants to maximize their own payoff. Enter the payoff pair that survives the process. If there is more than one rationalizable action, enter the pair that leads to the largest payoff for player A.

📗 Answer (comma separated vector): .

# Prisoner's Dilemma

📗 A symmetric simultaneous move game is a prisoner's dilemma game if the Nash equilibrium (using strictly dominant actions) is strictly worse for all players than another outcome: Link, Wikipedia.

➩ For two players, the game can be represented by a game matrix: \(\begin{bmatrix} - & C & D \\ C & \left(x, x\right) & \left(0, y\right) \\ D & \left(y, 0\right) & \left(1, 1\right) \end{bmatrix}\), where C stands for Cooperate (or Deny) and D stands for Defect (or Confess), and \(y > x > 1\). Here, \(\left(D, D\right)\) is the only Nash equilibrium (using strictly dominant actions) but \(\left(C, C\right)\) is strictly preferred by both players.
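📗 The claims about the game matrix above can be checked numerically for particular payoffs. The values \(x = 3\) and \(y = 5\) are assumptions that satisfy \(y > x > 1\), and the function names are hypothetical.

```python
ACTIONS = ('C', 'D')

def pd_matrix(x, y):
    """Payoffs (row player, column player) of the game matrix, with y > x > 1."""
    return {('C', 'C'): (x, x), ('C', 'D'): (0, y),
            ('D', 'C'): (y, 0), ('D', 'D'): (1, 1)}

def is_nash(game, a, b):
    """Neither player can gain by deviating alone from (a, b)."""
    row_ok = all(game[(a, b)][0] >= game[(a2, b)][0] for a2 in ACTIONS)
    col_ok = all(game[(a, b)][1] >= game[(a, b2)][1] for b2 in ACTIONS)
    return row_ok and col_ok

def defect_strictly_dominant(game):
    """D gives the row player strictly more than C against every column action."""
    return all(game[('D', b)][0] > game[('C', b)][0] for b in ACTIONS)

game = pd_matrix(3, 5)
equilibria = [(a, b) for a in ACTIONS for b in ACTIONS if is_nash(game, a, b)]
```

Here `equilibria` contains only `('D', 'D')`, with payoffs (1, 1), even though `('C', 'C')` gives both players 3.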

Example

📗 Split or Steal games: (YouTube playlist: Link, Solution: Link).


📗 Notes and code adapted from the course taught by Professors Jerry Zhu, Yingyu Liang, and Charles Dyer.

📗 Content from note blocks marked "optional" and content from Wikipedia and other demo links are helpful for understanding the materials, but will not be explicitly tested on the exams.


📗 If there is an issue with TopHat during the lectures, please submit your answers on paper (include your Wisc ID and answers) or this Google form Link at the end of the lecture.

📗 Anonymous feedback can be submitted to: Form.


Last Updated: July 01, 2025 at 1:47 AM
