Paper 1, Section II, J

Principles of Statistics
Part II, 2019

In a regression problem, for a given $X \in \mathbb{R}^{n \times p}$ fixed, we observe $Y \in \mathbb{R}^{n}$ such that

$$Y = X\theta_{0} + \varepsilon$$

for an unknown $\theta_{0} \in \mathbb{R}^{p}$ and $\varepsilon$ random such that $\varepsilon \sim \mathcal{N}(0, \sigma^{2} I_{n})$ for some known $\sigma^{2} > 0$.

(a) When $p \leqslant n$ and $X$ has rank $p$, compute the maximum likelihood estimator $\hat{\theta}_{MLE}$ for $\theta_{0}$. When $p > n$, what issue is there with the likelihood maximisation approach and how many maximisers of the likelihood are there (if any)?
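[A minimal numpy sketch of the closed form asked for in part (a), $\hat{\theta}_{MLE} = (X^{\top}X)^{-1}X^{\top}Y$, which comes from maximising the Gaussian likelihood being equivalent to minimising $\|Y - X\theta\|_{2}^{2}$; the dimensions, seed and simulated data below are illustrative assumptions, not part of the question.]

```python
import numpy as np

# Illustrative model: Y = X theta_0 + eps with eps ~ N(0, sigma^2 I_n).
rng = np.random.default_rng(0)
n, p, sigma = 50, 5, 1.0
X = rng.standard_normal((n, p))
theta0 = rng.standard_normal(p)
Y = X @ theta0 + sigma * rng.standard_normal(n)

# For p <= n and rank(X) = p, maximising the likelihood is equivalent to
# minimising ||Y - X theta||_2^2, giving theta_hat = (X^T X)^{-1} X^T Y.
theta_mle = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta_mle)
```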

(b) For any $\lambda > 0$ fixed, we consider $\hat{\theta}_{\lambda}$ minimising

$$\|Y - X\theta\|_{2}^{2} + \lambda\|\theta\|_{2}^{2}$$

over $\mathbb{R}^{p}$. Derive an expression for $\hat{\theta}_{\lambda}$ and show it is well defined, i.e., there is a unique minimiser for every $X$, $Y$ and $\lambda$.
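[For reference: setting the gradient $2X^{\top}(X\theta - Y) + 2\lambda\theta$ to zero gives $\hat{\theta}_{\lambda} = (X^{\top}X + \lambda I_{p})^{-1}X^{\top}Y$, which is well defined because $X^{\top}X + \lambda I_{p}$ is positive definite for every $\lambda > 0$. A small numpy sketch follows; the helper name ridge and the simulated data are illustrative assumptions.]

```python
import numpy as np

def ridge(X, Y, lam):
    """Ridge estimator (X^T X + lam I)^{-1} X^T Y.

    X^T X + lam I is positive definite for any lam > 0, so the
    minimiser exists and is unique for every X, Y and lam,
    including when p > n."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Illustrative usage on simulated data.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
Y = rng.standard_normal(50)
print(ridge(X, Y, lam=1.0))
```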

Assume $p \leqslant n$ and that $X$ has rank $p$. Let $\Sigma = X^{\top} X$ and note that $\Sigma = V \Lambda V^{\top}$ for some orthogonal matrix $V$ and some diagonal matrix $\Lambda$ whose diagonal entries satisfy $\Lambda_{1,1} \geqslant \Lambda_{2,2} \geqslant \ldots \geqslant \Lambda_{p,p}$. Assume that the columns of $X$ have mean zero.

(c) Denote the columns of $U = XV$ by $u_{1}, \ldots, u_{p}$. Show that they are sample principal components, i.e., that their pairwise sample correlations are zero and that they have sample variances $n^{-1}\Lambda_{1,1}, \ldots, n^{-1}\Lambda_{p,p}$, respectively. [Hint: the sample covariance between $u_{i}$ and $u_{j}$ is $n^{-1} u_{i}^{\top} u_{j}$.]
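[A numerical check of part (c): since $U^{\top}U = V^{\top}\Sigma V = \Lambda$ is diagonal, the columns of $U$ have zero pairwise sample covariance and sample variances $n^{-1}\Lambda_{i,i}$. The simulated, column-centred $X$ below is an illustrative assumption.]

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 4
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                  # columns of X have mean zero

Lam, V = np.linalg.eigh(X.T @ X)     # eigh returns ascending eigenvalues
order = np.argsort(Lam)[::-1]        # reorder so Lambda_{1,1} >= ... >= Lambda_{p,p}
Lam, V = Lam[order], V[:, order]

U = X @ V
# U^T U = V^T (X^T X) V = Lambda, so off-diagonal sample covariances vanish
# and the sample variance of u_i is Lambda_{i,i} / n.
print(np.allclose(U.T @ U, np.diag(Lam)))
print(np.allclose(U.var(axis=0), Lam / n))
```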

(d) Show that

$$\hat{Y}_{MLE} = X\hat{\theta}_{MLE} = U \Lambda^{-1} U^{\top} Y.$$

Conclude that the prediction $\hat{Y}_{MLE}$ is the closest point to $Y$ within the subspace spanned by the normalised sample principal components of part (c).
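[A numerical check of part (d): $U\Lambda^{-1}U^{\top} = X(X^{\top}X)^{-1}X^{\top}$ is the orthogonal projector onto the column space of $X$, and the normalised components $u_{i}/\|u_{i}\|_{2}$, i.e. the columns of $U\Lambda^{-1/2}$, form an orthonormal basis of that space. The simulated data below are an illustrative assumption.]

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
Y = rng.standard_normal(n)

Lam, V = np.linalg.eigh(X.T @ X)
U = X @ V

# MLE fit and its principal-component form U Lambda^{-1} U^T Y.
Y_mle = X @ np.linalg.solve(X.T @ X, X.T @ Y)
print(np.allclose(Y_mle, U @ np.diag(1.0 / Lam) @ U.T @ Y))

# W has orthonormal columns (the normalised principal components), so
# W W^T Y is the orthogonal projection of Y onto their span.
W = U / np.sqrt(Lam)
print(np.allclose(Y_mle, W @ (W.T @ Y)))
```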

(e) Show that

$$\hat{Y}_{\lambda} = X\hat{\theta}_{\lambda} = U(\Lambda + \lambda I_{p})^{-1} U^{\top} Y.$$

Assume $\Lambda_{1,1}, \Lambda_{2,2}, \ldots, \Lambda_{q,q} \gg \lambda \gg \Lambda_{q+1,q+1}, \ldots, \Lambda_{p,p}$ for some $1 \leqslant q < p$. Conclude that the prediction $\hat{Y}_{\lambda}$ is approximately the closest point to $Y$ within the subspace spanned by the $q$ normalised sample principal components of part (c) with the greatest variance.
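[A numerical illustration of part (e): writing $\hat{Y}_{\lambda} = \sum_{i} \frac{\Lambda_{i,i}}{\Lambda_{i,i} + \lambda}\, w_{i} w_{i}^{\top} Y$ with $w_{i} = u_{i}/\|u_{i}\|_{2}$, the shrinkage factors are close to $1$ when $\Lambda_{i,i} \gg \lambda$ and close to $0$ when $\Lambda_{i,i} \ll \lambda$. The column scalings below, chosen to create such an eigenvalue gap, are illustrative assumptions.]

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q, lam = 200, 6, 2, 1.0
# Scale the columns so the top q eigenvalues of X^T X are far above lam
# and the remaining ones are far below it (illustrative construction).
scales = np.array([30.0, 20.0, 0.02, 0.015, 0.01, 0.005])
X = rng.standard_normal((n, p)) * scales
X -= X.mean(axis=0)
Y = rng.standard_normal(n)

Lam, V = np.linalg.eigh(X.T @ X)
order = np.argsort(Lam)[::-1]
Lam, V = Lam[order], V[:, order]
U = X @ V
W = U / np.sqrt(Lam)                           # normalised principal components

Y_ridge = U @ np.diag(1.0 / (Lam + lam)) @ U.T @ Y
print(Lam / (Lam + lam))                       # ~1 for the first q entries, ~0 after

# Projection of Y onto the span of the q components with the greatest variance.
Y_proj_q = W[:, :q] @ (W[:, :q].T @ Y)
print(np.linalg.norm(Y_ridge - Y_proj_q) / np.linalg.norm(Y))   # small
```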