Paper 2, Section II,
(a) A ball may be in one of $n$ boxes. A search of the $i$th box costs $c_i$ and finds the ball with probability $\alpha_i$ if the ball is in that box. We are given initial probabilities $(P_1^1, P_2^1, \ldots, P_n^1)$ that the ball is in the $i$th box.
Show that the policy which at time $t = 1, 2, \ldots$ searches the box with the maximal value of $\alpha_i P_i^t / c_i$ minimises the expected searching cost until the ball is found, where $P_i^t$ is the probability (given everything that has occurred up to time $t$) that the ball is in box $i$.
(b) Next suppose that a reward $R_i$ is earned if the ball is found in the $i$th box. Suppose also that we may decide to stop at any time. Develop the dynamic programming equation for the value function $F(P_1, \ldots, P_n)$ starting from the probability distribution $(P_1, \ldots, P_n)$.
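For orientation, one standard form of the dynamic programming equation is sketched below; the notation $p^{(i)}$ for the Bayes-updated distribution after an unsuccessful search of box $i$ is introduced here, not taken from the question.

$$F(p) = \max\Big\{0,\ \max_i \big[\alpha_i p_i R_i - c_i + (1 - \alpha_i p_i)\,F(p^{(i)})\big]\Big\},$$

where $p^{(i)}_j = p_j/(1 - \alpha_i p_i)$ for $j \neq i$ and $p^{(i)}_i = p_i(1 - \alpha_i)/(1 - \alpha_i p_i)$. The outer $0$ corresponds to stopping; each inner term corresponds to searching box $i$ once and continuing optimally.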
Show that if $\sum_i c_i/(\alpha_i R_i) < 1$ then it is never optimal to stop searching until the ball is found. In this case, is the policy defined in part (a) optimal?
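Again purely as an illustration (assuming the reconstructed condition $\sum_i c_i/(\alpha_i R_i) < 1$), the sketch below checks numerically, for one instance satisfying the condition, that the never-stopping index policy earns a positive expected net reward, so stopping immediately (which yields $0$) cannot be optimal there. The function name is hypothetical.

```python
def expected_net_reward(p, alpha, c, R, tol=1e-12, max_steps=100000):
    """Expected reward minus search cost for the policy that never stops
    and always searches the box maximising alpha_i * p_i / c_i."""
    p = list(p)
    value = 0.0
    not_found = 1.0  # probability the ball has not yet been found
    for _ in range(max_steps):
        j = max(range(len(p)), key=lambda i: alpha[i] * p[i] / c[i])
        # pay c_j; with conditional probability alpha_j * p_j collect R_j
        value += not_found * (alpha[j] * p[j] * R[j] - c[j])
        not_found *= 1.0 - alpha[j] * p[j]
        # Bayes update of the posterior after an unsuccessful search of box j
        denom = 1.0 - alpha[j] * p[j]
        p = [(p[i] * (1.0 - alpha[j]) if i == j else p[i]) / denom
             for i in range(len(p))]
        if not_found < tol:
            break
    return value

# Instance with sum_i c_i / (alpha_i * R_i) = 0.4 < 1
alpha, c, R, p0 = [0.5, 0.5], [1.0, 1.0], [10.0, 10.0], [0.5, 0.5]
```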