In a regression problem, for a fixed $X \in \mathbb{R}^{n \times p}$, we observe $Y \in \mathbb{R}^n$ such that
\[
Y = X\theta_0 + \varepsilon
\]
for an unknown $\theta_0 \in \mathbb{R}^p$ and a random $\varepsilon$ such that $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$ for some known $\sigma^2 > 0$.
(a) When $p \leqslant n$ and $X$ has rank $p$, compute the maximum likelihood estimator $\hat\theta_{\mathrm{MLE}}$ for $\theta_0$. When $p > n$, what issue is there with the likelihood maximisation approach, and how many maximisers of the likelihood are there (if any)?
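[Sketch of one route to the first part: since $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$, maximising the likelihood over $\theta$ is equivalent to minimising $\|Y - X\theta\|_2^2$, so
\[
\hat\theta_{\mathrm{MLE}} = \arg\min_{\theta \in \mathbb{R}^p} \|Y - X\theta\|_2^2 = (X^\top X)^{-1} X^\top Y,
\]
which is well defined because $X^\top X$ is invertible when $X$ has rank $p \leqslant n$.]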
(b) For any fixed $\lambda > 0$, we consider $\hat\theta_\lambda$ minimising
\[
\|Y - X\theta\|_2^2 + \lambda \|\theta\|_2^2
\]
over $\mathbb{R}^p$. Derive an expression for $\hat\theta_\lambda$ and show it is well defined, i.e., there is a unique minimiser for every $X$, $Y$ and $\lambda$.
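[Sketch: for $\lambda > 0$ the objective is strictly convex in $\theta$, with Hessian $2(X^\top X + \lambda I_p) \succ 0$, so setting the gradient to zero yields the unique minimiser
\[
\hat\theta_\lambda = (X^\top X + \lambda I_p)^{-1} X^\top Y,
\]
which exists for every $X$, $Y$ and $\lambda > 0$ since $X^\top X + \lambda I_p$ is always invertible.]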
Assume $p \leqslant n$ and that $X$ has rank $p$. Let $\Sigma = X^\top X$ and note that $\Sigma = V \Lambda V^\top$ for some orthogonal matrix $V$ and some diagonal matrix $\Lambda$ whose diagonal entries satisfy $\Lambda_{1,1} \geqslant \Lambda_{2,2} \geqslant \dots \geqslant \Lambda_{p,p}$. Assume that the columns of $X$ have mean zero.
(c) Denote the columns of $U = XV$ by $u_1, \dots, u_p$. Show that they are sample principal components, i.e., that their pairwise sample correlations are zero and that they have sample variances $n^{-1}\Lambda_{1,1}, \dots, n^{-1}\Lambda_{p,p}$, respectively. [Hint: the sample covariance between $u_i$ and $u_j$ is $n^{-1} u_i^\top u_j$.]
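[Sketch: since $V$ is orthogonal,
\[
U^\top U = V^\top X^\top X V = V^\top \Sigma V = \Lambda,
\]
so $u_i^\top u_j = 0$ for $i \neq j$ and $u_i^\top u_i = \Lambda_{i,i}$; as the columns of $X$ have mean zero, so do those of $U = XV$, and the hint then gives the stated sample covariances and variances.]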
(d) Show that
\[
\hat Y_{\mathrm{MLE}} = X \hat\theta_{\mathrm{MLE}} = U \Lambda^{-1} U^\top Y.
\]
Conclude that the prediction $\hat Y_{\mathrm{MLE}}$ is the closest point to $Y$ within the subspace spanned by the normalised sample principal components of part (c).
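[Sketch, writing $\tilde u_i := \Lambda_{i,i}^{-1/2} u_i$ for the normalised principal components: substituting $\hat\theta_{\mathrm{MLE}} = (X^\top X)^{-1} X^\top Y = V \Lambda^{-1} V^\top X^\top Y$ gives
\[
\hat Y_{\mathrm{MLE}} = X V \Lambda^{-1} V^\top X^\top Y = U \Lambda^{-1} U^\top Y = \sum_{i=1}^p \tilde u_i \tilde u_i^\top Y,
\]
the orthogonal projection of $Y$ onto $\operatorname{span}(\tilde u_1, \dots, \tilde u_p)$, since the $\tilde u_i$ are orthonormal by part (c).]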
(e) Show that
\[
\hat Y_\lambda = X \hat\theta_\lambda = U(\Lambda + \lambda I_p)^{-1} U^\top Y.
\]
Assume $\Lambda_{1,1}, \Lambda_{2,2}, \dots, \Lambda_{q,q} \gg \lambda \gg \Lambda_{q+1,q+1}, \dots, \Lambda_{p,p}$ for some $1 \leqslant q < p$. Conclude that the prediction $\hat Y_\lambda$ is approximately the closest point to $Y$ within the subspace spanned by the $q$ normalised sample principal components of part (c) with the greatest variance.
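[Sketch, with $\tilde u_i := \Lambda_{i,i}^{-1/2} u_i$ denoting the normalised principal components: expanding the expression above,
\[
\hat Y_\lambda = \sum_{i=1}^{p} \frac{\Lambda_{i,i}}{\Lambda_{i,i} + \lambda}\, \tilde u_i \tilde u_i^\top Y \approx \sum_{i=1}^{q} \tilde u_i \tilde u_i^\top Y,
\]
since $\Lambda_{i,i}/(\Lambda_{i,i} + \lambda) \approx 1$ for $i \leqslant q$ and $\approx 0$ for $i > q$; the right-hand side is the orthogonal projection of $Y$ onto the span of the $q$ normalised principal components with the greatest sample variance.]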