📗 Enter your ID (the wisc email ID without @wisc.edu) here, then click the button (or hit the "Enter" key) to generate your questions.
📗 You can also load your answers from your saved file and then click the button.
📗 If the questions are not generated correctly, try refreshing the page using the button at the top left corner.
📗 The official deadline is August 26; late submissions within one week will be accepted without penalty.
📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could either copy and paste (or load) your program outputs into the text boxes for individual questions, or print all your outputs to a single text file and load it using the button at the bottom of the page.
📗 Please do not refresh the page: your answers will not be saved.
📗 You should implement the algorithms using the mathematical formulas from the slides. You can use packages and libraries to preprocess and read the data and format the outputs. It is not recommended that you use machine learning packages or libraries, but you will not lose points for doing so.
📗 (Introduction) In this programming homework, you will use reinforcement learning planning algorithms to find the optimal path of an autonomous vehicle moving on a grid. You will compare the optimal policy found using value iteration with the myopic policy in which the vehicle uses the action that maximizes the immediate reward. The state space is an \(n\) by \(n\) grid of cells where \(n\) = ?, with indices shown in the following diagram. The action set is \(\left\{U, D, L, R\right\}\) (up, down, left, right, in this order, action \(0\) is \(U\)) and the grid wraps around (moving left from cells in column \(0\) will lead to cells in column \(n - 1\); moving up from cells in row \(0\) will lead to cells in row \(n - 1\)).
The reward matrix is drawn in the diagram above (red is negative, blue is positive) and the numerical values are: . Use the discount factor \(\beta\) = ?.
📗 (Part 1) Compute the value function based on a given randomized policy in which the vehicle moves in all four directions with equal probabilities.
📗 (Part 1) Compute the myopic (deterministic) policy in which the vehicle moves to the neighboring cell with the highest reward. Break ties in the order: up, down, left, right (that is, if up and down lead to the same reward, move up).
📗 (Part 2) Compute the optimal policy using value iteration (or Q learning). Report the Q function, the value function, and the policy. Break ties in the order: up, down, left, right.
📗 [20 points] Enter the value function based on the randomized policy in which the vehicle moves in all four directions with equal probabilities, one number for each state, \(n^{2}\) numbers in one line, rounded to four decimal places.
Hint
📗 Use the value iteration formula: \(V^{\pi}\left(s\right) = \mathbb{E}_{\pi}\left[R\left(s, a\right) + \beta V^{\pi}\left(s'\right)\right]\) where \(s'\) is the state reached by using action \(a\) in state \(s\).
📗 For the particular policy for this question, \(V^{\pi}\left(s\right) = \dfrac{1}{4} \left(R\left(s, U\right) + \beta V^{\pi}\left(T\left(s, U\right)\right) + R\left(s, D\right) + \beta V^{\pi}\left(T\left(s, D\right)\right) + R\left(s, L\right) + \beta V^{\pi}\left(T\left(s, L\right)\right) + R\left(s, R\right) + \beta V^{\pi}\left(T\left(s, R\right)\right)\right)\).
📗 Note: implementing a transition function \(T\left(s, a\right) \to s'\) will be helpful. In particular, the state \(s\) should be converted into \(\left(x, y\right)\) coordinate form, then action \(a\) (one of U, D, L, R) applied to \(\left(x, y\right)\) to get \(\left(x, y - 1\right), \left(x, y + 1\right), \left(x - 1, y\right), \left(x + 1, y\right)\) respectively (wrapping around modulo \(n\)), and the result converted back to \(s'\) at the end.
📗 Value iteration can be repeatedly applied to all states until convergence (when all \(V\left(s\right)\) values no longer change).
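📗 As a rough illustration (not the official solution), the transition function and the evaluation of the random policy could be sketched in plain Python as follows. Here R_sa[s][a] is a hypothetical nested list of immediate rewards, the state index is assumed to be row-major (s = row * n + col, which you should check against the diagram for your questions), and actions 0, 1, 2, 3 stand for U, D, L, R:

def transition(s, a, n):
    # return s' reached by taking action a in state s, wrapping around the grid
    row, col = divmod(s, n)
    if a == 0:
        row = (row - 1) % n  # up
    elif a == 1:
        row = (row + 1) % n  # down
    elif a == 2:
        col = (col - 1) % n  # left
    else:
        col = (col + 1) % n  # right
    return row * n + col

def evaluate_random_policy(R_sa, n, beta, tol=1e-10):
    # iterate V(s) = (1/4) * sum over a of [R(s, a) + beta * V(T(s, a))] until convergence
    V = [0.0] * (n * n)
    while True:
        V_new = [0.25 * sum(R_sa[s][a] + beta * V[transition(s, a, n)] for a in range(4))
                 for s in range(n * n)]
        if max(abs(V_new[s] - V[s]) for s in range(n * n)) < tol:
            return V_new
        V = V_new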
📗 [5 points] Enter the myopic policy in which the vehicle moves to the neighboring cell with the highest reward, one number for each state, \(n^{2}\) integers (\(0, 1, 2, 3\)) in one line.
Hint
📗 Find the action that leads to the largest reward, \(\pi\left(s\right) = \mathop{\mathrm{argmax}}_{a} R\left(s, a\right)\) or \(\pi\left(s\right) = \mathop{\mathrm{argmax}} \left\{R\left(s, U\right), R\left(s, D\right), R\left(s, L\right), R\left(s, R\right)\right\}\).
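📗 A minimal sketch, using the same hypothetical R_sa reward table as in the Question 1 sketch:

def myopic_policy(R_sa, n):
    # list.index returns the first maximizer, so ties break in the order U, D, L, R (0, 1, 2, 3)
    return [R_sa[s].index(max(R_sa[s])) for s in range(n * n)]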
📗 [10 points] Enter the value function based on the myopic policy, one number for each state, \(n^{2}\) numbers in one line, rounded to four decimal places.
Hint
📗 Use the value iteration from Question 1 on the policy from Question 2.
📗 There is no randomness, so the formula can be simplified as \(V^{\pi}\left(s\right) = R\left(s, \pi\left(s\right)\right) + \beta V^{\pi}\left(s'\right)\) where \(s' = T\left(s, \pi\left(s\right)\right)\).
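📗 A possible sketch of this simplified evaluation, reusing the transition function and the hypothetical R_sa table from the Question 1 sketch:

def evaluate_policy(policy, R_sa, n, beta, tol=1e-10):
    # iterate V(s) = R(s, pi(s)) + beta * V(T(s, pi(s))) until convergence
    V = [0.0] * (n * n)
    while True:
        V_new = [R_sa[s][policy[s]] + beta * V[transition(s, policy[s], n)]
                 for s in range(n * n)]
        if max(abs(V_new[s] - V[s]) for s in range(n * n)) < tol:
            return V_new
        V = V_new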
You can click the button to plot the value function and the policy.
📗 [2 points] Enter the initial Q function, four numbers for each state (in the order up, down, left, right), \(n^{2}\) lines with \(4\) numbers in one line, rounded to four decimal places. It can be all zeros or all random.
📗 [2 points] Enter the Q function after one iteration (each entry is updated once), four numbers for each state (in the order up, down, left, right), \(n^{2}\) lines with \(4\) numbers in one line, rounded to four decimal places. Do not use a learning rate (i.e. use \(\alpha = 1\)).
Hint
📗 Use the Q iteration formula (similar to value iteration): \(Q\left(s, a\right) = R\left(s, a\right) + \beta \displaystyle\max_{a'} Q\left(s', a'\right)\) where \(s' = T\left(s, a\right)\).
📗 Apply the same formula on each \(s, a\) pair, but keep two copies of the Q function instead of replacing the values in a single Q matrix. This means \(Q^{\left(t\right)}\left(s, a\right) = R\left(s, a\right) + \beta \displaystyle\max_{a'} Q^{\left(t-1\right)}\left(s', a'\right)\), not \(Q^{\left(t\right)}\left(s, a\right) = R\left(s, a\right) + \beta \displaystyle\max_{a'} Q^{\left(t\right)}\left(s', a'\right)\), where \(t\) is the iteration index.
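📗 One way to sketch a single synchronous update (again assuming the transition function and hypothetical R_sa table from the Question 1 sketch):

def q_iteration_step(Q_prev, R_sa, n, beta):
    # Q_t(s, a) = R(s, a) + beta * max over a' of Q_{t-1}(s', a'), computed only from the old copy
    return [[R_sa[s][a] + beta * max(Q_prev[transition(s, a, n)]) for a in range(4)]
            for s in range(n * n)]

📗 As a quick sanity check, starting from the all-zero Q of Question 4, a single call of q_iteration_step returns \(Q\left(s, a\right) = R\left(s, a\right)\).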
📗 [10 points] Enter the Q function after value iteration converges, four numbers for each state (in the order up, down, left, right), \(n^{2}\) lines with \(4\) numbers in one line, rounded to four decimal places.
Hint
📗 Repeat Question 5 until the Q matrix converges (does not change after updates).
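📗 A rough sketch of the convergence loop, built on the q_iteration_step function sketched under Question 5:

def q_value_iteration(R_sa, n, beta, tol=1e-10):
    # repeat the synchronous update until the Q matrix stops changing
    Q = [[0.0] * 4 for _ in range(n * n)]
    while True:
        Q_new = q_iteration_step(Q, R_sa, n, beta)
        if max(abs(Q_new[s][a] - Q[s][a]) for s in range(n * n) for a in range(4)) < tol:
            return Q_new
        Q = Q_new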
📗 [10 points] Enter the optimal policy consistent with the Q function from the previous question, one number for each state, \(n^{2}\) integers (\(0, 1, 2, 3\)) in one line.
Hint
📗 Find the index of the maximum Q value from each row in the Q matrix from Question 6.
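📗 A one-line sketch, assuming the converged Q is stored as a list of rows (four action values per state):

def greedy_policy(Q):
    # the first maximizer in each row gives the tie-break order U, D, L, R
    return [row.index(max(row)) for row in Q]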
📗 [10 points] Enter the value function based on the optimal policy, one number for each state, \(n^{2}\) numbers in one line, rounded to four decimal places.
Hint
📗 Compute the maximum Q value from each row in the Q matrix from Question 6.
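📗 Similarly, on the same list-of-rows Q:

def value_from_q(Q):
    # V(s) = max over a of Q(s, a)
    return [max(row) for row in Q]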
You can click the button to plot the value function and the policy.
📗 [1 point] Please confirm that you are going to submit the code on Canvas under Assignment A8, and make sure you give attribution for all blocks of code you did not write yourself (see bottom of the page for details and examples).
📗 [1 point] Please enter any comments and suggestions including possible mistakes and bugs with the questions and the auto-grading, and materials relevant to solving the question that you think are not covered well during the lectures. If you have no comments, please enter "None": do not leave it blank.
📗 Please do not modify the content in the above text field: use the "Grade" button to update.
📗 Warning: grading may take around 10 to 20 seconds. Please be patient and do not click "Grade" multiple times.
📗 You could submit multiple times (but please do not submit too often): only the latest submission will be counted.
📗 Please also save the text in the above text box to a file using the button, or copy and paste it into a file yourself. You can also include the resulting file with your code on Canvas Assignment A8.
📗 You could load your answers from the text (or txt file) in the text box below using the button. The first two lines should be "##a: 8" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.
📗 Saving and loading may take around 10 to 20 seconds. Please be patient and do not click "Load" multiple times.
📗 The sample solution in Java and Python will be posted on Piazza around the deadline. You are allowed to copy and use parts of the solution with attribution. You are allowed to use code from other people (with their permission) and from the Internet, but you must give attribution at the beginning of your code. You are allowed to use large language models such as GPT4 to write parts of the code for you, but you have to include the prompts you used in the code submission. For example, you can put the following comments at the beginning of your code:
% Code attribution: (TA's name)'s A8 example solution.
% Code attribution: (student name)'s A8 solution.
% Code attribution: (student name)'s answer on Piazza: (link to Piazza post).
% Code attribution: (person or account name)'s answer on Stack Overflow: (link to page).
% Code attribution: (large language model name e.g. GPT4): (include the prompts you used).
📗 You can get help on understanding the algorithm from any of the office hours; to get help with debugging, please go to the TA's office hours. For times and locations see the Home page. You are encouraged to work with other students, but if you use their code, you must give attribution at the beginning of your code.