Consider an infinite-horizon controlled Markov process having per-period costs c(x, u) ⩾ 0, where x ∈ X is the state of the system and u ∈ U is the control. Costs are discounted at rate β ∈ (0, 1], so that the objective to be minimized is
\[
E\Bigl[\,\sum_{t\geqslant 0}\beta^{t}c(X_t,u_t)\,\Bigm|\,X_0=x\Bigr].
\]
What is meant by a policy π for this problem?
Let L denote the dynamic programming operator
\[
Lf(x) \equiv \inf_{u\in U}\bigl\{\,c(x,u)+\beta\,E[f(X_1)\mid X_0=x,\,u_0=u]\,\bigr\}.
\]
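When the state and control spaces are finite, the operator L can be written down directly. The sketch below assumes tabular costs c[x, u] and transition matrices P[u, x, y]; all names and shapes are illustrative and not part of the question.

```python
import numpy as np

def bellman_operator(f, c, P, beta):
    """One application of L: (Lf)(x) = inf_u { c(x,u) + beta * E[f(X_1) | X_0=x, u_0=u] }.

    f    : (n_states,)                     current value function
    c    : (n_states, n_actions)           per-period costs c(x, u) >= 0
    P    : (n_actions, n_states, n_states) P[u, x, y] = P(X_1 = y | X_0 = x, u_0 = u)
    beta : discount factor in (0, 1]
    """
    # Q[x, u] = c(x, u) + beta * sum_y P[u, x, y] * f[y]
    Q = c + beta * np.einsum('uxy,y->xu', P, f)
    # over a finite control set the infimum is a minimum
    return Q.min(axis=1)
```

With f ≡ 0 this returns min_u c(x, u) at each state, i.e. the first value-iteration step F_1 below.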
Further, let F denote the value of the optimal control problem:
\[
F(x)=\inf_{\pi} E^{\pi}\Bigl[\,\sum_{t\geqslant 0}\beta^{t}c(X_t,u_t)\,\Bigm|\,X_0=x\Bigr],
\]
where the infimum is taken over all policies π, and Eπ denotes expectation under policy π. Show that the functions Ft defined by
\[
F_{t+1}=LF_t \quad (t\geqslant 0), \qquad F_0\equiv 0,
\]
increase to a limit F_∞ ∈ [0, ∞]. Prove that F_∞ ⩽ F. Prove that F = LF.
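The monotone convergence of the iterates F_t is easy to observe numerically. The sketch below assumes a finite MDP with β < 1 (so the limit is finite and the fixed-point equation F = LF can be checked to numerical precision); the data are illustrative, not part of the question.

```python
import numpy as np

def value_iteration(c, P, beta, n_iter=200):
    """Iterate F_{t+1} = L F_t starting from F_0 = 0.

    Since c >= 0 and L is monotone, the iterates are nondecreasing in t.
    Returns the final iterate and the history F_0, F_1, ..., F_{n_iter}.
    """
    F = np.zeros(c.shape[0])  # F_0 = 0
    history = [F]
    for _ in range(n_iter):
        Q = c + beta * np.einsum('uxy,y->xu', P, F)  # Q(x,u) = c(x,u) + beta E[F(X_1)]
        F = Q.min(axis=1)                            # F_{t+1}(x) = (L F_t)(x)
        history.append(F)
    return F, history
```

On any such example one can verify that the history is pointwise nondecreasing and that the final iterate satisfies F = LF up to the contraction error β^n.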
Suppose that Φ = LΦ ⩾ 0. Prove that Φ ⩾ F.
[You may assume that there is a function u∗:X→U such that
\[
L\Phi(x)=c\bigl(x,u^{*}(x)\bigr)+\beta\,E[\Phi(X_1)\mid X_0=x,\,u_0=u^{*}(x)],
\]
though the result remains true without this simplifying assumption.]
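The nonnegativity hypothesis on Φ is not vacuous: when β = 1 the operator L can have many fixed points (shifting F by any constant gives another one), and the result says the nonnegative fixed points all dominate F. Here is a small illustrative check, using a made-up two-state chain whose second state is absorbing and cost-free so that F is finite even without discounting:

```python
import numpy as np

# Illustrative two-state MDP with beta = 1.
# Costs c[x, u] and transitions P[u, x, y]; state 1 is absorbing with zero
# cost, so the optimal value is F = (1, 0).
c = np.array([[1.0, 3.0],   # state 0: action 0 costs 1, action 1 costs 3
              [0.0, 0.0]])  # state 1: zero cost under both actions
P = np.array([[[0.0, 1.0],  # action 0 sends state 0 to the absorbing state 1
               [0.0, 1.0]],
              [[1.0, 0.0],  # action 1 keeps state 0 at state 0
               [0.0, 1.0]]])

def L(f, beta=1.0):
    Q = c + beta * np.einsum('uxy,y->xu', P, f)
    return Q.min(axis=1)

F = np.array([1.0, 0.0])
assert np.allclose(L(F), F)          # F = LF
for k in (0.5, 2.0, 10.0):
    Phi = F + k                      # F shifted by a constant k >= 0
    assert np.allclose(L(Phi), Phi)  # Phi = L Phi and Phi >= 0 ...
    assert np.all(Phi >= F)          # ... and indeed Phi >= F
```

With β < 1 this degeneracy disappears: L is then a contraction on bounded functions and F is its unique fixed point.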