Paper 1, Section II, J

Statistical Modelling
Part II, 2013

A cricket ball manufacturing company conducts the following experiment. Every day, a bowling machine is set to one of three levels, "Medium", "Fast" or "Spin", and then bowls 100 balls towards the stumps. The number of times the ball hits the stumps and the average wind speed (in kilometres per hour) during the experiment are recorded, yielding the following data (abbreviated):

 Day  Wind  Level  Stumps 110 Medium 2628 Medium 375012 Medium 32517 Fast 311203 Fast 281215 Spin 351506 Spin 31\begin{array}{llll}\text { Day } & \text { Wind } & \text { Level } & \text { Stumps } \\ 1 & 10 & \text { Medium } & 26 \\ 2 & 8 & \text { Medium } & 37 \\ \vdots & \vdots & \vdots & \vdots \\ 50 & 12 & \text { Medium } & 32 \\ 51 & 7 & \text { Fast } & 31 \\ \vdots & \vdots & \vdots & \vdots \\ 120 & 3 & \text { Fast } & 28 \\ 121 & 5 & \text { Spin } & 35 \\ \vdots & \vdots & \vdots & \vdots \\ 150 & 6 & \text { Spin } & 31\end{array}

Write down a reasonable model for Y1,,Y150Y_{1}, \ldots, Y_{150}, where YiY_{i} is the number of times the ball hits the stumps on the ithi^{t h} day. Explain briefly why we might want to include interactions between the variables. Write RR code to fit your model.

The company's statistician fitted her own generalized linear model using R\mathrm{R}, and obtained the following summary (abbreviated):

 >summary(ball)  Coefficients:  Estimate  Std. Error  z value Pr(>z) (Intercept) 0.372580.053886.9164.66e12 Wind 0.090550.015955.6761.38e08 LevelFast 0.100050.080441.2440.213570 LevelSpin 0.298810.082683.6140.000301 Wind: LevelFast 0.036660.023641.5510.120933 Wind: LevelSpin 0.076970.028452.7050.006825\begin{array}{lrrrrr}\text { >summary(ball) } & & & & \\ \text { Coefficients: } & & & & & \\ & \text { Estimate } & \text { Std. Error } & \text { z value } & \operatorname{Pr}(>|z|) & \\ \text { (Intercept) } & -0.37258 & 0.05388 & -6.916 & 4.66 \mathrm{e}-12 & * * * \\ \text { Wind } & 0.09055 & 0.01595 & 5.676 & 1.38 \mathrm{e}-08 & * * * \\ \text { LevelFast } & -0.10005 & 0.08044 & -1.244 & 0.213570 & \\ \text { LevelSpin } & 0.29881 & 0.08268 & 3.614 & 0.000301 & * * * \\ \text { Wind: LevelFast } & 0.03666 & 0.02364 & 1.551 & 0.120933 & \\ \text { Wind: LevelSpin } & -0.07697 & 0.02845 & -2.705 & 0.006825 & * *\end{array}

Why are LevelMedium and Wind: LevelMedium not listed?

Suppose that, on another day, the bowling machine is set to "Spin", and the wind speed is 5 kilometres per hour. What linear function of the parameters should the statistician use in constructing a predictor of the number of times the ball hits the stumps that day?

Based on the above output, how might you improve the model? How could you fit your new model in RR ?