Consider a Markov decision problem with finite state space X, value function F and dynamic programming equation F=LF, where
(Lϕ)(i)=a∈{0,1}min⎩⎪⎨⎪⎧c(i,a)+βj∈X∑Pij(a)ϕ(j)⎭⎪⎬⎪⎫.
Suppose 0<β<1, and ∣c(i,a)∣⩽B for all i∈X,a∈{0,1}. Prove there exists a deterministic stationary Markov policy that is optimal, explaining what the italicised words mean.
Let Fn=LnF0, where F0=0, and Mn=maxi∈X∣F(i)−Fn(i)∣. Prove that
Mn⩽βMn−1⩽βnB/(1−β).
Deduce that the value iteration algorithm converges to an optimal policy in a finite number of iterations.