Paper 2, Section II,
(a) A ball may be in one of $n$ boxes. A search of the $i$th box costs $c_i$ and finds the ball with probability $\alpha_i$ if the ball is in that box. We are given initial probabilities $(P_1^1, P_2^1, \ldots, P_n^1)$ that the ball is in the $i$th box.
Show that the policy which at time $t = 1, 2, \ldots$ searches the box with the maximal value of $\alpha_i P_i^t / c_i$ minimises the expected searching cost until the ball is found, where $P_i^t$ is the probability (given everything that has occurred up to time $t$) that the ball is in box $i$.
(b) Next suppose that a reward $R_i$ is earned if the ball is found in the $i$th box. Suppose also that we may decide to stop at any time. Develop the dynamic programming equation for the value function $F(P_1, \ldots, P_n)$ starting from the probability distribution $(P_1, \ldots, P_n)$.
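For orientation, one standard form of the dynamic programming equation is sketched below; the notation $p^{(i)}$ for the Bayes-updated distribution after an unsuccessful search of box $i$ is introduced here, not taken from the question.

$$F(p) = \max\Big\{0,\ \max_i \big[\alpha_i p_i R_i - c_i + (1 - \alpha_i p_i)\,F(p^{(i)})\big]\Big\},$$

where $p^{(i)}_j = p_j/(1 - \alpha_i p_i)$ for $j \neq i$ and $p^{(i)}_i = p_i(1 - \alpha_i)/(1 - \alpha_i p_i)$. The outer $0$ corresponds to stopping; each inner term corresponds to searching box $i$ once and continuing optimally.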
Show that if $\sum_i c_i/(\alpha_i R_i) < 1$ then it is never optimal to stop searching until the ball is found. In this case, is the policy defined in part (a) optimal?
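Again purely as an illustration (assuming the reconstructed condition $\sum_i c_i/(\alpha_i R_i) < 1$), the sketch below checks numerically, for one instance satisfying the condition, that the never-stopping index policy earns a positive expected net reward, so stopping immediately (which yields $0$) cannot be optimal there. The function name is hypothetical.

```python
def expected_net_reward(p, alpha, c, R, tol=1e-12, max_steps=100000):
    """Expected reward minus search cost for the policy that never stops
    and always searches the box maximising alpha_i * p_i / c_i."""
    p = list(p)
    value = 0.0
    not_found = 1.0  # probability the ball has not yet been found
    for _ in range(max_steps):
        j = max(range(len(p)), key=lambda i: alpha[i] * p[i] / c[i])
        # pay c_j; with conditional probability alpha_j * p_j collect R_j
        value += not_found * (alpha[j] * p[j] * R[j] - c[j])
        not_found *= 1.0 - alpha[j] * p[j]
        # Bayes update of the posterior after an unsuccessful search of box j
        denom = 1.0 - alpha[j] * p[j]
        p = [(p[i] * (1.0 - alpha[j]) if i == j else p[i]) / denom
             for i in range(len(p))]
        if not_found < tol:
            break
    return value

# Instance with sum_i c_i / (alpha_i * R_i) = 0.4 < 1
alpha, c, R, p0 = [0.5, 0.5], [1.0, 1.0], [10.0, 10.0], [0.5, 0.5]
```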