Paper 2, Section II, J
Describe the elements of a discrete-time stochastic dynamic programming equation for the problem of maximizing the expected sum of non-negative rewards over an infinite horizon. Give an example to show that there may not exist an optimal policy. Prove that if a policy has a value function that satisfies the dynamic programming equation, then the policy is optimal.
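[For orientation, a sketch of the objects involved, in one standard notation; the symbols $x_t$, $u_t$, $r$ and $F$ are assumed here, not fixed by the question.]

% Positive programming: state x_t, control u_t, non-negative rewards r(x_t, u_t).
% Value function: the supremum over policies pi of the expected total reward,
%   F(x) = sup_pi E_pi [ sum_{t >= 0} r(x_t, u_t) | x_0 = x ].
% The dynamic programming (optimality) equation is
\[
  F(x) \;=\; \sup_{u}\Bigl\{\, r(x,u) \;+\; \mathbb{E}\bigl[\, F(x_1) \;\big|\; x_0 = x,\ u_0 = u \,\bigr] \Bigr\}.
\]
% A standard example with no optimal policy: at each time t one may stop and
% receive reward 1 - 2^{-t}, or continue (reward 0). Then F = 1, but stopping
% at any finite t gives 1 - 2^{-t} < 1 and never stopping gives 0, so no
% policy attains the supremum.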
A squirrel collects nuts for the coming winter. There are plenty of nuts lying around, but each time the squirrel leaves its lair it risks being caught by a predator. Assume that the outcomes of the squirrel's journeys are independent, that it is caught with probability $p$, and that it returns safely with a random weight of nuts, exponentially distributed with parameter $\lambda$. By solving the dynamic programming equation for the value function $F(x)$, find a policy maximizing the expected weight of nuts collected for the winter. Here the state variable $x$ takes values in $[0,\infty)$ (the weight of nuts so far collected) or $-1$ (a no-return state when the squirrel is caught).
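[One plausible shape of the solution, under the reading that in state $x \geq 0$ the squirrel may stop with its current hoard $x$ or make another journey, and that the hoard is worth nothing if it is caught; the question leaves the terminal reward implicit, so this is an assumption.]

% Stopping problem: in state x >= 0, stop (reward x) or go out again.
% A journey is fatal with probability p; otherwise it adds W ~ Exp(lambda),
% with E[W] = 1/lambda. The dynamic programming equation then reads
\[
  F(x) \;=\; \max\Bigl\{\, x,\ (1-p)\,\mathbb{E}\bigl[F(x+W)\bigr] \Bigr\},
  \qquad W \sim \operatorname{Exp}(\lambda).
\]
% One-step comparison: going out once more and then stopping is worth
% (1-p)(x + 1/lambda), so stopping looks better exactly when
%   x >= (1-p)(x + 1/lambda)  <=>  x >= (1-p)/(lambda p),
% suggesting a threshold policy: keep collecting until the hoard first
% reaches (1-p)/(lambda p), then stay in the lair.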