📗 Enter your ID (the wisc email ID without @wisc.edu) here: and click (or hit the "Enter" key)
📗 You can also load from your saved file and click the load button.
📗 If the questions are not generated correctly, try refreshing the page using the button at the top left corner.
📗 The official deadline is August 12; late submissions within one week will be accepted without penalty.
📗 The same ID should generate the same set of questions. Your answers are not saved when you close the browser. You could either copy and paste or load your program outputs into the text boxes for individual questions or print all your outputs to a single text file and load it using the button at the bottom of the page.
📗 Please do not refresh the page: your answers will not be saved.
📗 You should implement the algorithms using the mathematical formulas from the slides. You can use packages and libraries to preprocess and read the data and format the outputs. It is not recommended that you use machine learning packages or libraries, but you will not lose points for doing so.
📗 (Introduction) In this project, you will use either supervised learning or reinforcement learning techniques to train an autonomous vehicle to move on a continuous state space, similar to this project: Link, Link or Link. You will submit a policy network (the neural network that produces the actions based on the state of the environment) that outputs a deterministic Markov policy ("turn left", "turn right", "speed up" or "no action" given the state of the vehicle, including position, velocity and distances to walls or other vehicles observed by the sensors). The neural network should be fully connected with two hidden layers (ReLU activation, a maximum of 100 units in each layer), an input layer with \(4 + k\) units, and an output layer with 4 units (softmax activation), where \(k\) is the number of sensors in front of the car, an integer between \(1\) and \(9\).
📗 (Part 1) Given a neural network with random weights, make sure you can produce the correct Markov policy to control the vehicle.
📗 (Part 2) Train your network to replicate a simple behavioral policy for a simple environment without other vehicles. The vehicle has 5 sensors: (i) if one of the two sensors on the left has the highest value, turn left; (ii) if one of the two sensors on the right has the highest value, turn right; (iii) if the sensor in the middle has the highest value, speed up. Tie break rule for the highest value: speed up > left > right.
📗 (Competition) Submit a policy network to compete in a random environment with other students. Game theoretic considerations can be made to modify your policy (through retraining with different training sets). The competition will be held twice, once around the midterm (no other players) and once around the final exam (with other players). You can optionally submit two policy networks and the maximum score from the two will be used in determining your ranking (the second vehicle can be designed to crash into other players and slow them down too).
Your submission should contain (i) your player name (not necessarily your real name), (ii) your player icon (single emoji from this list: Link), (iii) your team (a random number between 0 and 1, rounded to four decimal places), (iv) your network weights, (v) [optional] your second network weights, and have the following format in a .txt text file:
➩ Small example:
➩ Large example:
Your score will be the total distance traveled within a fixed number of frames, that is,
➩ Your score will be higher if you speed up more often.
➩ Your score will be higher if you crash fewer times.
Your project grade is based on your submission to this assignment (out of 5) plus your ranking within your team (out of 5):
Top 10% gets 5/5 in each team.
Next 10% gets 4/5 in each team.
Next 10% gets 3/5 in each team.
Next 10% gets 2/5 in each team.
Next 10% gets 1/5 in each team.
(The students who do not participate in the competition will be evenly split into each of the teams with scores of 0s when computing the rankings).
You can use the following demo to test policies or generate training sets (each line is ["action", "x", "y", "vx", "vy", "s1", "s2", ... "sk"], where "si" is the distance measured by sensor i). The dark red lines represent how the sensors compute the distances for the vehicle you control. To collect data points, enter your network, the number of data points and the number of frames to skip, then check the "Collect" checkbox to start.
Neural network to control the car:
Data set: Collect (max 1000) every frames.
Number of other cars (or networks to control them, separated by "====="):
Race track:
You can also manually control one vehicle (arrow keys or "wasd" to use actions "speed up", "turn left", "no action", "turn right"). Not pressing any of the keys will perform "no action" too. Click anywhere inside the square (or check the "Collect" box) before you start.
Number of sensors:
Data set: Collect (max 1000) every frames.
Number of other cars (or networks to control them, separated by "====="):
📗 [1 point] Enter a random set of weights (biases in the last row) of your network in the correct format (three matrices separated by -----, each matrix has rows separated by lines, columns separated by commas, the first matrix should be \(4 + k + 1\) by \(h_{1}\), second matrix should be \(h_{1} + 1\) by \(h_{2}\), and the last matrix should be \(h_{2} + 1\) by \(4\), all numbers rounded to 4 decimal places).
Hint
📗 TBA.
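A minimal sketch of serializing three NumPy weight matrices into this submission format (five-dash separator between matrices, comma separated columns, numbers rounded to 4 decimal places):

```python
import numpy as np

def format_weights(W1, W2, W3):
    """Serialize the three weight matrices (biases in the last row) into
    the required text format: matrices separated by '-----', one row per
    line, columns comma separated, 4 decimal places."""
    def fmt(W):
        return "\n".join(",".join(f"{v:.4f}" for v in row) for row in W)
    return "\n-----\n".join(fmt(W) for W in (W1, W2, W3))
```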
You can use the demo below to test if your network predictions are correct.
📗 [10 points] Use the network from the previous question on the feature matrix you provided, and output the stochastic policy (300 lines, 4 numbers in each line, [probability of 0 (turn left), probability of 1 (turn right), probability of 2 (speed up), probability of 3 (no action)], rounded to 4 decimal places, comma separated).
📗 [5 points] Enter the feature matrix of 300 training items from a single training episode (300 lines, \(4 + k\) numbers in each line, comma separated). You can use the same one from Part 1 to get the points for this question, but you should train your policy network first, then use the demo to produce the training items instead.
📗 [10 points] Enter the actions based on behavior policy provided in the instructions (not your network) for the 300 training items from the previous question. (300 numbers in one line, 0 (turn left), 1 (turn right), 2 (speed up), 3 (no action), comma separated).
📗 [5 points] Enter the set of weights (biases in the last row) of your network after training on your training set to clone the behavior policy (three matrices separated by -----, each matrix has rows separated by lines, columns separated by commas, the first matrix should be \(4 + k + 1\) by \(h_{1}\), second matrix should be \(h_{1} + 1\) by \(h_{2}\), and the last matrix should be \(h_{2} + 1\) by \(4\), all numbers rounded to 4 decimal places).
📗 [10 points] Use the network from the previous question on the feature matrix you provided, and output the stochastic policy (300 lines, 4 numbers in each line, [probability of 0 (turn left), probability of 1 (turn right), probability of 2 (speed up), probability of 3 (no action)], rounded to 4 decimal places, comma separated).
📗 [15 points] Find the actions with the highest probability based on the stochastic policy in the previous question. The actions should be the same as the ones from your behavior policy. This question is graded based on the consistency with your stochastic policy from the previous question and the consistency with your behavior policy.
📗 [1 point] Please confirm that you are going to submit the code on Canvas under Assignment A9, and make sure you give attribution for all blocks of code you did not write yourself (see bottom of the page for details and examples).
📗 [1 point] Please enter any comments and suggestions, including possible mistakes and bugs with the questions and the auto-grading, and materials relevant to solving the question that you think are not covered well during the lectures. If you have no comments, please enter "None": do not leave it blank.
📗 Please do not modify the content in the above text field: use the "Grade" button to update.
📗 Warning: grading may take around 10 to 20 seconds. Please be patient and do not click "Grade" multiple times.
📗 You could submit multiple times (but please do not submit too often): only the latest submission will be counted.
📗 Please also save the text in the above text box to a file using the button, or copy and paste it into a file yourself. You can also include the resulting file with your code on Canvas Assignment A9.
📗 You could load your answers from the text (or txt file) in the text box below using the button. The first two lines should be "##a: 9" and "##id: your id", and the format of the remaining lines should be "##1: your answer to question 1" newline "##2: your answer to question 2", etc. Please make sure that your answers are loaded correctly before submitting them.
📗 Saving and loading may take around 10 to 20 seconds. Please be patient and do not click "Load" multiple times.
📗 The sample solution in Java and Python will be posted on Piazza around the deadline. You are allowed to copy and use parts of the solution with attribution. You are allowed to use code from other people (with their permission) and from the Internet, but you must give attribution at the beginning of your code. You are allowed to use large language models such as GPT4 to write parts of the code for you, but you have to include the prompts you used in the code submission. For example, you can put the following comments at the beginning of your code:
% Code attribution: (TA's name)'s A9 example solution.
% Code attribution: (student name)'s A9 solution.
% Code attribution: (student name)'s answer on Piazza: (link to Piazza post).
% Code attribution: (person or account name)'s answer on Stack Overflow: (link to page).
% Code attribution: (large language model name e.g. GPT4): (include the prompts you used).
📗 You can get help on understanding the algorithm from any of the office hours; to get help with debugging, please go to the TA's office hours. For times and locations see the Home page. You are encouraged to work with other students, but if you use their code, you must give attribution at the beginning of your code.