Paper 2, Section II, I

Optimization and Control
Part II, 2009

In the context of stochastic dynamic programming, explain what is meant by an average-reward optimal policy.

A player has a fair coin and a six-sided die. At each epoch he may choose either to toss the coin or to roll the die. If he tosses the coin and it shows heads then he adds 1 to his total score, and if it shows tails then he adds 0 . If he rolls the die then he adds the number showing. He wins a reward of £1£ 1 whenever his total score is divisible by 3 .

Suppose the player always tosses the coin. Find his average reward per toss.

Still using the above policy, and given that he starts with a total score of xx, let Fs(x)F_{s}(x) be the expected total reward over the next ss epochs. Find the value of

lims[Fs(x)Fs(0)].\lim _{s \rightarrow \infty}\left[F_{s}(x)-F_{s}(0)\right] .

Use the policy improvement algorithm to find a policy that produces a greater average reward than the policy of only tossing the coin.

Find the average-reward optimal policy.