📗 Number of periods (H):
📗 Number of states (|S|):
📗 Number of actions (|A_1|, |A_2|, ...):
📗 Range of reward: min, max
📗 Constraints: bounds, worst case, zero sum
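The parameters above specify the problem instance. As a rough sketch, they could be collected in a single container like the one below; the names (`GameSpec` and its fields) are illustrative assumptions, not the tool's actual interface:

```python
from dataclasses import dataclass

@dataclass
class GameSpec:
    # Illustrative container for the instance parameters listed above.
    H: int                 # number of periods
    n_states: int          # |S|
    n_actions: tuple       # (|A_1|, |A_2|, ...), one entry per player
    r_min: float           # reward lower bound
    r_max: float           # reward upper bound
    zero_sum: bool = False # whether players' rewards must sum to zero

# Example: a two-player, zero-sum instance with 3 periods and 2 states.
spec = GameSpec(H=3, n_states=2, n_actions=(2, 2),
                r_min=-1.0, r_max=1.0, zero_sum=True)
assert spec.r_min <= spec.r_max  # the bounds constraint
```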
(Uniform transitions)
📗 Mean rewards (R):
📗 Transition probabilities (T):
📗 Initial state (Mu):
📗 Number of episodes (K):
(Uniformly distributed actions)
📗 Policy (P0):
📗 Variance of reward (Gaussian):
(Coverage)
📗 Simulated data (E0, based on H, S, A, R, T, Mu, P0):
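The simulation step above draws K episodes of length H using the mean rewards R, transitions T, initial distribution Mu, and policy P0, with Gaussian noise on the rewards. A minimal single-agent sketch, assuming tabular arrays and a stationary policy (the function name `simulate` and its signature are assumptions, not the tool's API):

```python
import numpy as np

def simulate(K, H, R, T, Mu, P0, sigma, rng=None):
    """Draw K episodes of length H from a tabular MDP.

    R[s, a]  : mean reward, T[s, a] : next-state distribution over S,
    Mu       : initial-state distribution, P0[s] : action distribution,
    sigma    : std. dev. of the Gaussian reward noise.
    """
    rng = rng or np.random.default_rng(0)
    n_states, n_actions = R.shape
    episodes = []
    for _ in range(K):
        s = rng.choice(n_states, p=Mu)          # initial state from Mu
        traj = []
        for _ in range(H):
            a = rng.choice(n_actions, p=P0[s])  # action from policy P0
            r = R[s, a] + sigma * rng.standard_normal()
            s_next = rng.choice(n_states, p=T[s, a])
            traj.append((s, a, r))
            s = s_next
        episodes.append(traj)
    return episodes

# Toy 2-state, 2-action instance with uniform transitions
# and uniformly distributed actions, matching the options above.
R = np.array([[0.0, 1.0], [1.0, 0.0]])
T = np.full((2, 2, 2), 0.5)   # T[s, a] uniform over next states
Mu = np.array([0.5, 0.5])
P0 = np.full((2, 2), 0.5)
E0 = simulate(K=10, H=3, R=R, T=T, Mu=Mu, P0=P0, sigma=0.1)
```

Each episode in `E0` is a list of `(state, action, reward)` tuples, which is one plausible shape for the simulated dataset.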