Paper 2, Section II, I
In the context of stochastic dynamic programming, explain what is meant by an average-reward optimal policy.
A player has a fair coin and a six-sided die. At each epoch he may choose either to toss the coin or to roll the die. If he tosses the coin and it shows heads then he adds 1 to his total score, and if it shows tails then he adds 0 . If he rolls the die then he adds the number showing. He wins a reward of whenever his total score is divisible by 3 .
Suppose the player always tosses the coin. Find his average reward per toss.
Still using the above policy, and given that he starts with a total score of , let be the expected total reward over the next epochs. Find the value of
Use the policy improvement algorithm to find a policy that produces a greater average reward than the policy of only tossing the coin.
Find the average-reward optimal policy.