Paper 2, Section II, J
Describe the elements of a discrete-time stochastic dynamic programming equation for the problem of maximizing the expected sum of non-negative rewards over an infinite horizon. Give an example to show that there may not exist an optimal policy. Prove that if a policy has a value function that satisfies the dynamic programming equation, then the policy is optimal.
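[For orientation, a sketch of the objects involved, in one standard notation; the symbols $x_t$, $u_t$, $r$ and $F$ are assumed here, not fixed by the question.]

% Positive programming: state x_t, control u_t, non-negative rewards r(x_t, u_t).
% Value function: the supremum over policies pi of the expected total reward,
%   F(x) = sup_pi E_pi [ sum_{t >= 0} r(x_t, u_t) | x_0 = x ].
% The dynamic programming (optimality) equation is
\[
  F(x) \;=\; \sup_{u}\Bigl\{\, r(x,u) \;+\; \mathbb{E}\bigl[\, F(x_1) \;\big|\; x_0 = x,\ u_0 = u \,\bigr] \Bigr\}.
\]
% A standard example with no optimal policy: at each time t one may stop and
% receive reward 1 - 2^{-t}, or continue (reward 0). Then F = 1, but stopping
% at any finite t gives 1 - 2^{-t} < 1 and never stopping gives 0, so no
% policy attains the supremum.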
A squirrel collects nuts for the coming winter. There are plenty of nuts lying around, but each time the squirrel leaves its lair it risks being caught by a predator. Assume that the outcomes of the squirrel's journeys are independent, that it is caught with probability $p$, and that it returns safely with a random weight of nuts, exponentially distributed with parameter $\lambda$. By solving the dynamic programming equation for the value function $F(x)$, find a policy maximizing the expected weight of nuts collected for the winter. Here the state variable $x$ takes values in $[0,\infty)$ (the weight of nuts so far collected) or $-1$ (a no-return state when the squirrel is caught).
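[One plausible shape of the solution, under the reading that in state $x \geq 0$ the squirrel may stop with its current hoard $x$ or make another journey, and that the hoard is worth nothing if it is caught; the question leaves the terminal reward implicit, so this is an assumption.]

% Stopping problem: in state x >= 0, stop (reward x) or go out again.
% A journey is fatal with probability p; otherwise it adds W ~ Exp(lambda),
% with E[W] = 1/lambda. The dynamic programming equation then reads
\[
  F(x) \;=\; \max\Bigl\{\, x,\ (1-p)\,\mathbb{E}\bigl[F(x+W)\bigr] \Bigr\},
  \qquad W \sim \operatorname{Exp}(\lambda).
\]
% One-step comparison: going out once more and then stopping is worth
% (1-p)(x + 1/lambda), so stopping looks better exactly when
%   x >= (1-p)(x + 1/lambda)  <=>  x >= (1-p)/(lambda p),
% suggesting a threshold policy: keep collecting until the hoard first
% reaches (1-p)/(lambda p), then stay in the lair.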