Paper 1, Section I, K

Statistical Modelling
Part II, 2016

The body mass index (BMI) of your closest friend is a good predictor of your own BMI. A scientist applies polynomial regression to understand the relationship between these two variables among 200 students in a sixth form college. The R commands

> fit.1 <- lm(BMI ~ poly(friendBMI, 2, raw=T))

> fit.2 <- lm(BMI ~ poly(friendBMI, 3, raw=T))

fit the models $Y=\beta_{0}+\beta_{1} X+\beta_{2} X^{2}+\varepsilon$ and $Y=\beta_{0}+\beta_{1} X+\beta_{2} X^{2}+\beta_{3} X^{3}+\varepsilon$, respectively, with $\varepsilon \sim N\left(0, \sigma^{2}\right)$ in each case.
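As a minimal sketch of what `raw=T` means here, the following R snippet checks that `poly(..., raw = TRUE)` returns the plain powers of its argument, so the fit agrees with an explicitly written quadratic. The data are simulated stand-ins (the sample itself is not given in the question), and the linear relationship used to generate `BMI`, as well as the name `fit.1b`, are assumptions made purely for illustration.

```r
# Simulated stand-in data: the real sample of 200 students is not available.
set.seed(1)
friendBMI <- rnorm(200, mean = 22, sd = 3)
BMI <- 5 + 0.8 * friendBMI + rnorm(200)   # hypothetical relationship

# With raw = TRUE, the columns of poly() are simply X and X^2.
P <- poly(friendBMI, 2, raw = TRUE)
same_powers <- isTRUE(all.equal(unname(P[, 1]), friendBMI)) &&
  isTRUE(all.equal(unname(P[, 2]), friendBMI^2))

# Hence the model is the explicit quadratic Y = b0 + b1 X + b2 X^2 + eps.
fit.1  <- lm(BMI ~ poly(friendBMI, 2, raw = TRUE))
fit.1b <- lm(BMI ~ friendBMI + I(friendBMI^2))   # fit.1b is a hypothetical name
same_coef <- isTRUE(all.equal(unname(coef(fit.1)), unname(coef(fit.1b))))
```

Both `same_powers` and `same_coef` should come out `TRUE`, which is one way to confirm that the coefficients of `fit.1` are the $\beta_j$ of the raw-power model.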

Setting the parameter raw to FALSE:

> fit.3 <- lm(BMI ~ poly(friendBMI, 2, raw=F))

> fit.4 <- lm(BMI ~ poly(friendBMI, 3, raw=F))

fits the models $Y=\beta_{0}+\beta_{1} P_{1}(X)+\beta_{2} P_{2}(X)+\varepsilon$ and $Y=\beta_{0}+\beta_{1} P_{1}(X)+\beta_{2} P_{2}(X)+\beta_{3} P_{3}(X)+\varepsilon$, with $\varepsilon \sim N\left(0, \sigma^{2}\right)$. The function $P_{i}$ is a polynomial of degree $i$. Furthermore, the design matrix output by the function poly with raw=F satisfies:

> t(poly(friendBMI, 3, raw=F)) %*% poly(a, 3, raw=F)

$$\begin{array}{rrrr} & 1 & 2 & 3 \\ 1 & 1.000000\mathrm{e}+00 & 1.288032\mathrm{e}-16 & 3.187554\mathrm{e}-17 \\ 2 & 1.288032\mathrm{e}-16 & 1.000000\mathrm{e}+00 & -6.201636\mathrm{e}-17 \\ 3 & 3.187554\mathrm{e}-17 & -6.201636\mathrm{e}-17 & 1.000000\mathrm{e}+00\end{array}$$
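The displayed matrix is the identity up to rounding error, i.e. the columns produced by poly with raw=F are orthonormal. A rough way to reproduce this check on simulated data is sketched below; the vector `x` is a hypothetical stand-in for friendBMI, and `Q`, `G`, `max_dev` and `col_sums` are illustrative names that do not appear in the question.

```r
# Simulated stand-in for friendBMI (the real data are not available).
set.seed(1)
x <- rnorm(200, mean = 22, sd = 3)

# Orthogonal polynomial basis of degree 3 (raw = FALSE is the default).
Q <- poly(x, 3, raw = FALSE)

# Gram matrix t(Q) %*% Q; it should equal the 3 x 3 identity up to rounding.
G <- crossprod(Q)
max_dev <- max(abs(G - diag(3)))

# The columns are also orthogonal to the intercept column of ones.
col_sums <- colSums(Q)
```

Here `max_dev` and the entries of `col_sums` should be of the order of machine precision, which matches the size of the off-diagonal entries shown above.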

How does the variance of $\hat{\beta}$ differ in the models fit.2 and fit.4? What about the variance of the fitted values $\hat{Y}=X \hat{\beta}$? Finally, consider the output of the commands

> anova(fit.1, fit.2)

> anova(fit.3, fit.4)

Define the test statistic computed by this function and specify its distribution. Which command yields a higher statistic?