
# M31 Past Exam Problems

📗 Enter your ID (the wisc email ID without @wisc.edu) here and click the button (or hit the Enter key).
📗 If the questions are not generated correctly, try refreshing the page using the button at the top left corner.
📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You can print the page, solve the problems, then enter all your answers at the end.
📗 Please do not refresh the page: your answers will not be saved.

# Warning: please enter your ID before you start!



📗 [3 points] Consider the Grid World with terminal states "RED" and "GREEN" and 7 other states shown in the table below.
RED 1 2
3 4 5
6 7 GREEN

There are four actions UP, DOWN, LEFT, RIGHT describing the movement between the states on the grid. The grid does not wrap around, i.e. using the action UP in state 1 results in state 1, not state 7.
Suppose the reward on all transitions (from actions UP, DOWN, LEFT, RIGHT) is \(R_{t}\) = , and the discount factor is \(\gamma\) = . The current policy \(\pi\) (probabilities of actions UP, DOWN, LEFT, RIGHT when in each state) is given in the following table.
State UP DOWN LEFT RIGHT
1
2
3
4
5
6
7

The current value function \(V_{k}\) is given in the table below.
\(0\)
\(0\)

Find the value of state in the next step of value iteration (i.e. \(V_{k+1}\) for state ). Enter one number.
📗 Answer: .
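The parameters in this question (rewards, discount factor, policy, and current values) are generated with the question, so the sketch below uses made-up numbers. It performs one backup under the given policy, \(V_{k+1}\left(s\right) = \sum_{a} \pi\left(a \mid s\right) \left(R_{t} + \gamma V_{k}\left(s'\right)\right)\), where \(s'\) is the state reached from \(s\) under action \(a\); if the intended update is the optimal-value backup instead, replace the sum over actions with a max over actions.

```python
# Sketch of one backup for the 3 x 3 grid world above.
# All numbers (reward, gamma, policy, V_k) are made-up examples.

# Grid layout: row 1 = RED, 1, 2; row 2 = 3, 4, 5; row 3 = 6, 7, GREEN.
# Moves that would leave the grid keep the agent in place (no wrap-around).
neighbors = {                    # state -> {action: next state}
    1: {"UP": 1, "DOWN": 4, "LEFT": "RED", "RIGHT": 2},
    # ... entries for states 2 through 7 would be filled in the same way
}

R = -1.0                         # example reward on every transition
gamma = 0.9                      # example discount factor
V = {"RED": 0.0, "GREEN": 0.0, 1: 2.0, 2: 1.0, 4: 3.0}   # example V_k entries
pi = {1: {"UP": 0.25, "DOWN": 0.25, "LEFT": 0.25, "RIGHT": 0.25}}  # example policy

def backup(state):
    """One Bellman backup of V_k at the given state under policy pi."""
    return sum(p * (R + gamma * V[neighbors[state][a]])
               for a, p in pi[state].items())

print(backup(1))                 # V_{k+1}(1) for the example numbers
```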
📗 [1 point] The boxes have different mean rewards between 0 and 1. Click on one of them to collect the reward from the box. The goal is to maximize the total reward (or minimize the regret) given a fixed number of clicks. You have clicks left. Your current total reward is . Refresh the page to restart. Which box has the largest mean reward?


📗 Answer: .
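The game above is a small multi-armed bandit with hidden box means, so the sketch below is only an illustration with made-up means and a stand-in `click` function: it explores every box a few times, then commits to the box with the highest empirical mean. The UCB1 question later on this page replaces the fixed exploration phase with confidence bounds.

```python
import random

# Toy version of the box-clicking game. The true means are made up; on the
# real page they are hidden behind the boxes.
true_means = [0.3, 0.7, 0.5]
clicks, totals = [0] * 3, [0.0] * 3

def click(box):                  # stand-in for clicking a box on the page
    return 1.0 if random.random() < true_means[box] else 0.0

for t in range(30):              # pretend we have 30 clicks in total
    if t < 15:                   # exploration: round-robin over the boxes
        box = t % 3
    else:                        # exploitation: best empirical mean so far
        box = max(range(3), key=lambda k: totals[k] / max(clicks[k], 1))
    totals[box] += click(box)
    clicks[box] += 1

print([totals[k] / clicks[k] for k in range(3)])   # empirical means
```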
📗 [4 points] Consider the following Markov Decision Process. It has two states \(s\), A and B. It has two actions \(a\): move and stay. The state transition is deterministic: "move" moves to the other state, while "stay" stays at the current state. The reward \(r\) is for move, for stay. Suppose the discount rate is \(\beta\) = .
Find the Q table \(Q_{i}\) after \(i\) = updates of every entry using Q value iteration (\(i = 0\) initializes all values to \(0\)) in the format described by the following table. Enter a two by two matrix.
State \ Action stay move
A ? ?
B ? ?

📗 Answer (matrix with multiple lines, each line is a comma separated vector): .
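The rewards and discount rate are generated with the question, so the sketch below uses made-up numbers. Each sweep of Q value iteration applies \(Q_{i+1}\left(s, a\right) = r\left(s, a\right) + \beta \max_{a'} Q_{i}\left(s', a'\right)\), where \(s'\) is the deterministic next state; the same sketch applies to the next variant, where the reward may also depend on the current state.

```python
# Q value iteration for the two-state MDP (states A, B; actions stay, move).
# Rewards, discount rate, and the number of updates are made-up examples.
r = {"stay": 1.0, "move": 0.0}   # example rewards (same from A and from B)
beta = 0.5                       # example discount rate
i_updates = 3                    # example number of updates of every entry

def next_state(s, a):            # deterministic transitions
    return s if a == "stay" else ("B" if s == "A" else "A")

Q = {s: {"stay": 0.0, "move": 0.0} for s in "AB"}   # i = 0 initialization
for _ in range(i_updates):
    Q = {s: {a: r[a] + beta * max(Q[next_state(s, a)].values())
             for a in ("stay", "move")}
         for s in "AB"}

print(Q)                         # rows: states A, B; columns: stay, move
```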
📗 [4 points] Consider the following Markov Decision Process. It has two states \(s\), A and B. It has two actions \(a\): move and stay. The state transition is deterministic: "move" moves to the other state, while "stay" stays at the current state. The reward \(r\) is for move (from A and B), for stay (in A and B). Suppose the discount rate is \(\beta\) = .

Find the Q table \(Q_{i}\) after \(i\) = updates of every entry using Q value iteration (\(i = 0\) initializes all values to \(0\)) in the format described by the following table. Enter a two by two matrix.
State \ Action stay move
A ? ?
B ? ?

📗 Answer (matrix with multiple lines, each line is a comma separated vector): .
📗 [2 points] A robot initializes Q-learning by setting \(q\left(s, a\right) = 0\) for all states \(s\) and actions \(a\). It has a learning rate \(\alpha\) = , and discount factor \(\gamma\) = . The robot senses that it is in state \(s_{105}\) and decides to perform action \(a_{540}\). For this action, the robot receives reward and arrives at state \(s_{7331}\). What is the value of \(q\left(s_{105}, a_{540}\right)\) after this one step of Q-learning?
📗 Answer: .
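Whatever the generated numbers are, the update follows the standard one-step Q-learning rule \(q\left(s, a\right) \leftarrow q\left(s, a\right) + \alpha \left(r + \gamma \max_{a'} q\left(s', a'\right) - q\left(s, a\right)\right)\); because every entry is initialized to \(0\), the max term is \(0\) and the result reduces to \(\alpha r\). A minimal sketch with made-up \(\alpha\), \(\gamma\), and reward:

```python
# One step of Q-learning starting from an all-zero Q table.
# alpha, gamma, and the observed reward are made-up examples.
alpha, gamma, reward = 0.5, 0.9, 2.0
q_sa = 0.0                       # q(s_105, a_540) before the update
max_next = 0.0                   # max_a q(s_7331, a); still 0 at initialization

q_sa = q_sa + alpha * (reward + gamma * max_next - q_sa)
print(q_sa)                      # equals alpha * reward here
```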
📗 [4 points] There are 3 states \(s_{0}, s_{1}, s_{2}\) and 3 actions \(a_{0}, a_{1}, a_{2}\). We start from , choose , get the reward and then move to , choose . What are the updated Q values for (, ) based on the current Q table and the transition above, using SARSA and using Q-learning (enter two numbers, comma separated)? The reward decay (discount rate) is \(\gamma\) = , and the step size (learning rate) is \(\alpha\) = .
State \ Action \(a_{0}\) \(a_{1}\) \(a_{2}\)
\(s_{0}\)
\(s_{1}\)
\(s_{2}\)

📗 Answer (comma separated vector): .
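A minimal sketch contrasting the two updates on one transition \(\left(s, a, r, s', a'\right)\), with a made-up Q table, \(\gamma\), and \(\alpha\): SARSA uses the Q value of the action actually chosen next, \(Q\left(s, a\right) \leftarrow Q\left(s, a\right) + \alpha \left(r + \gamma Q\left(s', a'\right) - Q\left(s, a\right)\right)\), while Q-learning uses the greedy value, \(Q\left(s, a\right) \leftarrow Q\left(s, a\right) + \alpha \left(r + \gamma \max_{a'} Q\left(s', a'\right) - Q\left(s, a\right)\right)\).

```python
# SARSA vs Q-learning update for one observed transition (s, a, r, s', a').
# The Q table and all numbers are made-up examples.
Q = {"s0": [1.0, 0.0, 2.0],      # rows: states s0, s1, s2
     "s1": [0.0, 3.0, 1.0],      # columns: actions a0, a1, a2
     "s2": [2.0, 1.0, 0.0]}
gamma, alpha = 0.9, 0.5
s, a, r, s_next, a_next = "s0", 0, 1.0, "s1", 2   # example transition

sarsa     = Q[s][a] + alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
qlearning = Q[s][a] + alpha * (r + gamma * max(Q[s_next])    - Q[s][a])
print(sarsa, qlearning)          # the two requested numbers, in that order
```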
📗 [3 points] In an infinite horizon MDP (Markov Decision Process), there are \(n\) = states: the initial state \(s_{0}\), and absorbing states \(s_{1}, s_{2}, ..., s_{n-1}\). In state \(s_{0}\), the agent can stay or move to any other state, but in the absorbing states the agent can only choose to stay. The rewards from staying in these states are summarized in the following table. Compute the Q value (under the optimal policy, not from Q learning) \(Q\left(s_{0}, \text{stay}\right)\). Use the discount factor \(\gamma\) = .
State \(s_{0}\) \(s_{1}\) \(s_{2}\) \(s_{3}\) \(s_{4}\)
Reward from stay
Reward from move - - - -

📗 Answer: .
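With made-up rewards and discount factor, the sketch below follows the structure of the problem: staying forever in an absorbing state \(s_{k}\) is worth the geometric sum \(r_{k} / \left(1 - \gamma\right)\), and \(Q\left(s_{0}, \text{stay}\right) = r\left(s_{0}, \text{stay}\right) + \gamma V^*\left(s_{0}\right)\), where \(V^*\left(s_{0}\right)\) is found by iterating the Bellman optimality equation at \(s_{0}\).

```python
# Q(s0, stay) under the optimal policy, with made-up rewards and gamma.
gamma = 0.8
stay_reward = {0: 1.0, 1: 0.0, 2: 2.0, 3: 0.5, 4: 1.5}   # reward from "stay"
move_reward = 0.0                # reward from "move" out of s0

# Absorbing states: the only option is to stay forever (geometric series).
V = {k: stay_reward[k] / (1 - gamma) for k in range(1, 5)}

# Iterate the Bellman optimality equation at s0 until it converges.
V[0] = 0.0
for _ in range(100):
    q_stay = stay_reward[0] + gamma * V[0]
    q_move = max(move_reward + gamma * V[k] for k in range(1, 5))
    V[0] = max(q_stay, q_move)

print(stay_reward[0] + gamma * V[0])   # Q(s0, stay)
```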
📗 [3 points] Consider state space \(S = \left\{s_{1}, s_{2}\right\}\) and action space \(A\) = {left, right}. In \(s_{1}\) the action "right" sends the agent to \(s_{2}\) and collects reward \(r = 1\). In \(s_{2}\) the action "left" sends the agent to \(s_{1}\) but with zero reward. All other state-action pairs keep the agent in the same state with zero reward. With discount factor \(\gamma\) = , what is the value \(v\left(s_{2}\right)\) under the optimal policy?

📗 Answer: .
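One way to see the answer (a sketch, with \(\gamma\) left symbolic): as long as \(\gamma > 0\), the optimal policy cycles between the two states, collecting reward \(1\) on every "right" move out of \(s_{1}\), so \(v\left(s_{1}\right) = 1 + \gamma v\left(s_{2}\right)\) and \(v\left(s_{2}\right) = 0 + \gamma v\left(s_{1}\right)\), which gives \(v\left(s_{2}\right) = \dfrac{\gamma}{1 - \gamma^{2}}\). A quick numerical check with a made-up \(\gamma\):

```python
# Value iteration check of v(s2) for the two-state chain, with a made-up gamma.
gamma = 0.9
v1 = v2 = 0.0
for _ in range(1000):
    # s1: "right" gives reward 1 and goes to s2; "left" stays with reward 0.
    # s2: "left" gives reward 0 and goes to s1; "right" stays with reward 0.
    v1, v2 = max(1 + gamma * v2, gamma * v1), max(gamma * v1, gamma * v2)

print(v2, gamma / (1 - gamma ** 2))    # the two values should agree
```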
📗 [3 points] Suppose the UCB1 (Upper Confidence Bound) algorithm is used to select arms in a multi-armed bandit problem. In round \(t\) = , the arm pulls and empirical means \(\hat{\mu}\) of the arms are summarized in the following table; in period \(t + 1\), an arm is pulled according to the UCB1 algorithm and the reward is . Compute the updated empirical means of the arms after period \(t + 1\), i.e. the updated \(\hat{\mu}_{1}, \hat{\mu}_{2}, ...\). Use \(c\) = .
Arms arm pulls (\(n_{k}\)) empirical means \(\hat{\mu}_{k}\) upper confidence bounds \(\hat{\mu}_{k} + c \sqrt{2 \dfrac{\log t}{n_{k}}}\)
\(k = 1\)
\(k = 2\)
\(k = 3\)

📗 Answer (comma separated vector): .
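A minimal sketch of one UCB1 round with made-up counts, means, reward, and \(c\): the arm with the largest upper confidence bound \(\hat{\mu}_{k} + c \sqrt{2 \log t / n_{k}}\) is pulled, and only that arm's pull count and empirical mean change, via \(\hat{\mu}_{k} \leftarrow \dfrac{n_{k} \hat{\mu}_{k} + r}{n_{k} + 1}\).

```python
import math

# One round of UCB1 with made-up pull counts, empirical means, reward, and c.
t = 10
n = [4, 3, 3]                    # pulls of arms 1, 2, 3 so far
mu = [0.5, 0.6, 0.2]             # empirical means so far
c, reward = 1.0, 1.0             # exploration constant and observed reward

ucb = [mu[k] + c * math.sqrt(2 * math.log(t) / n[k]) for k in range(3)]
k = max(range(3), key=lambda i: ucb[i])         # arm pulled in period t + 1

mu[k] = (n[k] * mu[k] + reward) / (n[k] + 1)    # updated empirical mean
n[k] += 1
print(mu)                        # updated empirical means of all three arms
```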
📗 [1 point] Blank.
📗 Answer: .
📗 [1 point] Blank.
📗 Answer: .
📗 [1 point] Blank.
📗 Answer: .
📗 [1 point] Blank.
📗 Answer: .
📗 [1 point] Blank.
📗 Answer: .
📗 [1 point] Blank.
📗 Answer: .
📗 [1 point] Blank.
📗 Answer: .
📗 [1 point] Blank.
📗 Answer: .
📗 [1 point] Blank.
📗 Answer: .
📗 [1 point] Blank.
📗 Answer: .
📗 [1 point] Blank.
📗 Answer: .
📗 [1 point] Blank.
📗 Answer: .
📗 [1 point] Blank.
📗 Answer: .
📗 [1 point] Blank.
📗 Answer: .
📗 [1 point] Blank.
📗 Answer: .
📗 [1 point] Blank.
📗 Answer: .

# Grade




📗 You can save the text in the text box above to a file using the button, or copy and paste it into a file yourself.
📗 You can load your answers from the text (or a txt file) in the text box below using the button. The first two lines should be "##m: 31" and "##id: your id", and each remaining line should be "##1: your answer to question 1", "##2: your answer to question 2", and so on, one answer per line. Please make sure that your answers are loaded correctly before submitting them.


📗 You can find videos going through the questions on Link.





Last Updated: January 20, 2025 at 3:12 AM