As a function of policy π and initial state x, let
$$F(\pi,x) = E_\pi\left[\sum_{t=0}^{\infty} \beta^t r(x_t,u_t) \,\middle|\, x_0 = x\right],$$
where β⩾1 and r(x,u)⩾0 for all x,u. Suppose that for a specific policy π, and all x,
$$F(\pi,x) = \sup_u \left\{ r(x,u) + \beta E\left[F(\pi,x_1) \mid x_0 = x,\ u_0 = u\right] \right\}.$$
Prove that F(π,x)⩾F(π′,x) for all π′ and x.
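The following is a sketch of one standard route to the proof, given as an outline rather than a complete argument. Let π′ be any policy, possibly history-dependent, and compare it with π one step at a time. The optimality equation holds with a supremum over u, so for every history $h_t$ ending in state $x_t$,

$$F(\pi,x_t) \geqslant E_{\pi'}\left[r(x_t,u_t) + \beta F(\pi,x_{t+1}) \,\middle|\, h_t\right].$$

Multiply by $\beta^t$, take expectations, and chain these inequalities for $t = 0,\dots,s-1$. This gives

$$F(\pi,x) \geqslant E_{\pi'}\left[\sum_{t=0}^{s-1}\beta^t r(x_t,u_t) + \beta^s F(\pi,x_s) \,\middle|\, x_0 = x\right] \geqslant E_{\pi'}\left[\sum_{t=0}^{s-1}\beta^t r(x_t,u_t) \,\middle|\, x_0 = x\right].$$

The last step drops the term $\beta^s F(\pi,x_s)$, which is nonnegative because $r \geqslant 0$. Letting $s \to \infty$ and applying monotone convergence (again using $r \geqslant 0$) gives $F(\pi,x) \geqslant F(\pi',x)$.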
A gambler plays games in which he may bet 1 or 2 pounds, but no more than his present wealth. Suppose he has $x_t$ pounds after $t$ games. If he bets $i$ pounds, then $x_{t+1} = x_t + i$ or $x_{t+1} = x_t - i$, with probabilities $p_i$ and $1 - p_i$ respectively. Gambling terminates at the first $\tau$ such that $x_\tau = 0$ or $x_\tau = 100$. His final reward is $(9/8)^{\tau/2} x_\tau$. Let $\pi$ be the policy of always betting 1 pound. Given $p_1 = 1/3$, show that $F(\pi,x) \propto x\,2^{x/2}$.
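The check below is a numerical sanity check, not a proof. It reads the terminal reward $(9/8)^{\tau/2} x_\tau$ as $\beta^\tau x_\tau$ with $\beta = \sqrt{9/8}$ and zero reward per game, which is an assumption about how the problem maps onto the general setup. Under that reading, always betting 1 pound must satisfy $F(x) = \beta\,[p_1 F(x+1) + (1-p_1) F(x-1)]$ for $0 < x < 100$, with $F(0) = 0$ and $F(100) = 100$. The names `BETA`, `f_candidate`, and `satisfies_bet_one_recursion` are hypothetical and chosen for illustration.

```python
import math

# Assumed discount: (9/8)^(tau/2) = BETA**tau with BETA = sqrt(9/8).
BETA = math.sqrt(9 / 8)
P1 = 1 / 3   # probability of winning a 1-pound bet
TOP = 100    # absorbing upper barrier


def f_candidate(x: int) -> float:
    """Unnormalised candidate x * 2^(x/2) for F(pi, x)."""
    return x * 2 ** (x / 2)


def satisfies_bet_one_recursion(p1: float = P1) -> bool:
    """Check F(x) = BETA * (p1 F(x+1) + (1 - p1) F(x-1)) for 0 < x < TOP."""
    for x in range(1, TOP):
        rhs = BETA * (p1 * f_candidate(x + 1) + (1 - p1) * f_candidate(x - 1))
        if not math.isclose(f_candidate(x), rhs, rel_tol=1e-12):
            return False
    return True


# The recursion is linear, so any positive multiple also satisfies it.
# Rescaling so that F(100) = 100 fixes the constant at 2^-50.
scale = TOP / f_candidate(TOP)
```

For example, `satisfies_bet_one_recursion()` returns `True`, `f_candidate(0)` is `0.0`, and `scale * f_candidate(100)` is `100.0`. Together these match both boundary conditions.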
Is $\pi$ optimal when $p_2 = 1/4$?
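One way to probe this question is a one-step improvement check. At each wealth $x$, bet 2 pounds once and then follow $\pi$. If that strictly beats $F(\pi,x)$ anywhere, then $F(\pi,\cdot)$ violates the optimality equation and $\pi$ is not optimal. The sketch below uses the unnormalised $x\,2^{x/2}$; the positive constant $2^{-50}$ does not affect the comparison. It also assumes $\beta = \sqrt{9/8}$ and considers only $2 \le x \le 98$, because the problem does not say what happens if a 2-pound bet at $x = 99$ overshoots 100. All function names here are hypothetical.

```python
import math

BETA = math.sqrt(9 / 8)  # assumed per-game discount from (9/8)^(tau/2)
P1, P2 = 1 / 3, 1 / 4    # win probabilities for 1- and 2-pound bets
TOP = 100


def f_pi(x: int) -> float:
    """Value of always betting 1 pound, up to the positive factor 2^-50."""
    return x * 2 ** (x / 2)


def one_step_value(x: int, bet: int, p: float) -> float:
    """Bet `bet` pounds once at wealth x, then follow pi (bet 1 thereafter)."""
    return BETA * (p * f_pi(x + bet) + (1 - p) * f_pi(x - bet))


def improving_states() -> list[int]:
    """Wealths where a single 2-pound bet strictly beats F(pi, x)."""
    # x >= 2: cannot bet more than current wealth.
    # x <= 98: assumption, so a 2-pound bet cannot overshoot the barrier.
    return [x for x in range(2, TOP - 1) if one_step_value(x, 2, P2) > f_pi(x)]


print(improving_states())  # [2, 3]
```

Under these assumptions the check flags $x = 2$ and $x = 3$, which suggests that $\pi$ is not optimal when $p_2 = 1/4$. By hand, a 2-pound bet gives $\beta(7x+2)/8$ against $x$, after dividing both sides by $2^{x/2}$.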