Each week you have an assigned reading to which you will submit an email response. These responses will be used to shape the week's lecture and discussion.
Logistics:
- Write your response in the email body, not as an attachment. Use attachments sparingly: you may submit solutions to exercises as a PDF if it is not possible to include them in the email body.
- CC the course graders, Ritu (rraut2 'at' wisc 'dot' edu) and Shubham (agarwal68 'at' wisc 'dot' edu)
- Submit by 4pm Central Time on the Monday the reading is due.
- Send your email to jphanna@cs.wisc.edu with the subject line “CS 839: Response for mm/dd”, where mm/dd is the date of the Monday when the reading is due.
Possible response types:
- Questions.
- Solutions to exercises in the book.
- Critiques or suggestions for extensions.
- What you want to learn more about.
- Thoughts on what you find most important.
In general, there is no specific required length or number of responses. For your grade, I'm looking for evidence that the reading was completed, so it will be helpful to mention a variety of the assigned sections across your responses. For example, week 1 had two assigned chapters; if your response only concerned chapter 2, then there isn't evidence that chapter 3 was read.
The following are comments / questions from students in week 1 that can serve as useful models for crafting your responses.
- In the plots on page 29, which compare the % optimal arm of the epsilon-greedy strategies for different epsilon, eps=0.1 clearly outperforms eps=0.01. However, this is a function of the horizon: if we took, say, 10^9 steps instead of 10^4, then eps=0.1 would asymptote at playing the optimal arm with roughly 0.9 probability, while eps=0.01 would eventually play the optimal arm with probability roughly 0.99, which is obviously better; with that many steps, the fact that eps=0.1 does better over the first 1000 steps would be insignificant. Thus it seems like a better strategy would be epsilon-greedy with a decaying epsilon. From a more theoretical perspective, any fixed epsilon-greedy strategy will incur linear regret, but maybe there is a sublinear regret bound for an epsilon-greedy strategy with an appropriate decay of epsilon? (See the simulation sketch after this list.)
- I would love more insight into why baselines work. Is it easy to determine what the optimal baseline is, and does it have a simple interpretation?
- I don't know anything about Thompson sampling; I'd be very interested in what these algorithms look like. I'm aware that, for example, UCB motivates algorithms for more general RL settings; is the same true of Thompson sampling? (See the Thompson sampling sketch after this list.)
- In section 2.3, we can see how various values of epsilon impact how well an agent learns. As mentioned, the task (e.g., the variance of the reward for each arm) can determine how well each epsilon value does. Is there some special reason why the ten-armed testbed used ten arms and drew the rewards from a standard Gaussian? That is, would we see much different behavior if we used a different k or drew rewards from a different distribution?
- "The reward signal is your way of communicating to the agent what you want achieved, not how you want it achieved". I can see what the authors mean by this, but how would we enforce certain behaviors for very sparse rewards (eg. winning a chess match)? Wouldn't having smaller rewards for achieving subgoals (eg. taking opposing pieces) be helpful?