Paper 2, Section II, J

Optimization and Control
Part II, 2014

Describe the elements of a discrete-time stochastic dynamic programming equation for the problem of maximizing the expected sum of non-negative rewards over an infinite horizon. Give an example to show that there may not exist an optimal policy. Prove that if a policy has a value function that satisfies the dynamic programming equation, then the policy is optimal.
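As an aside, a minimal sketch in standard notation (the symbols $x_t$, $u_t$, $r$ and $F$ are assumptions of this sketch, not given in the question): for state $x_t$, control $u_t$ and non-negative reward $r(x_t, u_t)$, the dynamic programming (optimality) equation for positive programming reads
\[
F(x) \;=\; \sup_{u}\Big\{\, r(x,u) \;+\; \mathbb{E}\big[\,F(x_{1})\,\big|\, x_{0}=x,\; u_{0}=u\,\big] \Big\},
\]
where $F(x) = \sup_{\pi} \mathbb{E}_{\pi}\big[\sum_{t=0}^{\infty} r(x_t,u_t)\,\big|\,x_0=x\big]$ is the optimal value function. One standard example of non-existence (offered as one possibility, not necessarily the one intended here): states $x \in \{1,2,\dots\}$, where one may either stop and collect reward $1 - 1/x$, or continue to state $x+1$ at zero reward. Then $F(x)=1$ for every $x$, but any policy that ever stops earns strictly less than $1$, and the policy that never stops earns $0$, so no policy attains the supremum.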

A squirrel collects nuts for the coming winter. There are plenty of nuts lying around, but each time the squirrel leaves its lair it risks being caught by a predator. Assume that the outcomes of the squirrel's journeys are independent, that it is caught with probability $p$, and that it returns safely with a random weight of nuts, exponentially distributed with parameter $\lambda$. By solving the dynamic programming equation for the value function $F(x)$, find a policy maximizing the expected weight of nuts collected for the winter. Here the state variable $x$ takes values in $\mathbb{R}_{+}$ (the weight of nuts so far collected) or $-1$ (a no-return state when the squirrel is caught).
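A hedged sketch of how the solution goes (the threshold below comes from the one-step look-ahead rule under the stated model, and is offered as a check rather than a model answer): taking $F(-1)=0$ and letting $Y \sim \mathrm{Exp}(\lambda)$ denote the weight of nuts brought back from a safe journey, the dynamic programming equation is
\[
F(x) \;=\; \max\Big\{\, x,\; (1-p)\,\mathbb{E}\big[F(x+Y)\big] \Big\}, \qquad x \in \mathbb{R}_{+}.
\]
Comparing stopping now (reward $x$) with making exactly one more journey and then stopping (expected reward $(1-p)(x + 1/\lambda)$), another journey is worthwhile iff $x \le (1-p)/(p\lambda)$. Since $x$ can only increase on a safe journey, the set $\{x : x \ge (1-p)/(p\lambda)\}$ is closed under the dynamics, so the one-step look-ahead policy is optimal: keep foraging while $x < (1-p)/(p\lambda)$ and stay in the lair once $x \ge (1-p)/(p\lambda)$.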