A cricket ball manufacturing company conducts the following experiment. Every day, a bowling machine is set to one of three levels, "Medium", "Fast" or "Spin", and then bowls 100 balls towards the stumps. The number of times the ball hits the stumps and the average wind speed (in kilometres per hour) during the experiment are recorded, yielding the following data (abbreviated):
Day    Wind   Level    Stumps
1      10     Medium   26
2      8      Medium   37
⋮      ⋮      ⋮        ⋮
50     12     Medium   32
51     7      Fast     31
⋮      ⋮      ⋮        ⋮
120    3      Fast     28
121    5      Spin     35
⋮      ⋮      ⋮        ⋮
150    6      Spin     31
Write down a reasonable model for $Y_1, \ldots, Y_{150}$, where $Y_i$ is the number of times the ball hits the stumps on the $i$th day. Explain briefly why we might want to include interactions between the variables. Write R code to fit your model.
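One possible reading of this part, offered as a sketch rather than a model answer: take $Y_i \sim \mathrm{Bin}(100, p_i)$ independently, with $\mathrm{logit}(p_i)$ linear in wind speed and with an intercept and slope that both depend on the machine level. The interaction lets the effect of wind differ between levels. The data frame name `cricket` is hypothetical, and because the question reproduces only a few rows, the Wind and Stumps values below are simulated placeholders used solely to make the call runnable.

```r
# Placeholder data: only the day ranges per level (1-50, 51-120, 121-150)
# come from the question; Wind and Stumps are SIMULATED, not the real data.
set.seed(1)
cricket <- data.frame(
  Day   = 1:150,
  Wind  = round(runif(150, min = 0, max = 15)),
  Level = factor(rep(c("Medium", "Fast", "Spin"), times = c(50, 70, 30)),
                 levels = c("Medium", "Fast", "Spin"))  # Medium as baseline
)
cricket$Stumps <- rbinom(150, size = 100, prob = 0.3)

# Binomial GLM with logit link (assumed choice of family and link).
# Wind * Level expands to Wind + Level + Wind:Level.
fit <- glm(cbind(Stumps, 100 - Stumps) ~ Wind * Level,
           family = binomial, data = cricket)
summary(fit)
```

Setting the factor levels explicitly matters here: by default R orders levels alphabetically, which would make "Fast" rather than "Medium" the baseline.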
The company's statistician fitted her own generalized linear model using R, and obtained the following summary (abbreviated):
> summary(ball)
Coefficients:
                 Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)      -0.37258  0.05388     -6.916   4.66e-12  ***
Wind              0.09055  0.01595      5.676   1.38e-08  ***
LevelFast        -0.10005  0.08044     -1.244   0.213570
LevelSpin         0.29881  0.08268      3.614   0.000301  ***
Wind:LevelFast    0.03666  0.02364      1.551   0.120933
Wind:LevelSpin   -0.07697  0.02845     -2.705   0.006825  **
Why are LevelMedium and Wind:LevelMedium not listed?
Suppose that, on another day, the bowling machine is set to "Spin", and the wind speed is 5 kilometres per hour. What linear function of the parameters should the statistician use in constructing a predictor of the number of times the ball hits the stumps that day?
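As a hedged illustration of this part: with "Medium" as the baseline level, a "Spin" day with wind speed 5 has linear predictor (Intercept) + 5 Wind + LevelSpin + 5 Wind:LevelSpin. The sketch below evaluates that combination at the estimates printed in the summary. The variable names `est` and `x_new` are hypothetical, and the final step assumes a binomial model with logit link, which the abbreviated output does not confirm.

```r
# Estimates copied from the printed summary(ball) output
est <- c("(Intercept)"    = -0.37258,
         "Wind"           =  0.09055,
         "LevelFast"      = -0.10005,
         "LevelSpin"      =  0.29881,
         "Wind:LevelFast" =  0.03666,
         "Wind:LevelSpin" = -0.07697)

# Covariate vector for Level = Spin, Wind = 5 (same order as est)
x_new <- c(1, 5, 0, 1, 0, 5)

# Linear predictor on the scale of the link function
eta <- sum(x_new * est)
eta

# ASSUMPTION: binomial response with logit link and 100 balls per day.
# Under a different family or link this back-transformation would change.
100 * plogis(eta)
```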
Based on the above output, how might you improve the model? How could you fit your new model in R?
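One candidate simplification, sketched under assumptions: neither LevelFast nor Wind:LevelFast is individually significant in the output, which suggests that "Medium" and "Fast" might share an intercept and slope. Individual z tests do not by themselves justify dropping both terms together, so the sketch compares the reduced and full models by their deviances. The names `cricket` and `IsSpin` are hypothetical, the binomial family is an assumed choice, and the Wind and Stumps values are simulated placeholders because the full data set is not reproduced in the question.

```r
# Placeholder data (SIMULATED; only the level layout follows the question)
set.seed(2)
cricket <- data.frame(
  Wind  = round(runif(150, min = 0, max = 15)),
  Level = factor(rep(c("Medium", "Fast", "Spin"), times = c(50, 70, 30)),
                 levels = c("Medium", "Fast", "Spin"))
)
cricket$Stumps <- rbinom(150, size = 100, prob = 0.3)

# Hypothetical two-level factor: Spin versus Medium and Fast combined
cricket$IsSpin <- factor(ifelse(cricket$Level == "Spin", "Spin", "Other"),
                         levels = c("Other", "Spin"))

full    <- glm(cbind(Stumps, 100 - Stumps) ~ Wind * Level,
               family = binomial, data = cricket)
reduced <- glm(cbind(Stumps, 100 - Stumps) ~ Wind * IsSpin,
               family = binomial, data = cricket)

# Deviance (likelihood ratio) comparison of the nested models on 2 df
cmp <- anova(reduced, full, test = "Chisq")
cmp
```

A large p-value in the comparison would support the reduced model; a small one would suggest keeping separate "Medium" and "Fast" terms.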