Paper 4, Section I, J

Statistical Modelling
Part II, 2011

The numbers of ear infections observed among beach and non-beach (mostly pool) swimmers were recorded, along with explanatory variables: frequency, location, age, and sex. The data are aggregated by group, with a total of 24 groups defined by the explanatory variables.

$$
\begin{array}{ll}
\text{freq} & \mathrm{F} = \text{frequent, } \mathrm{NF} = \text{infrequent} \\
\text{loc} & \mathrm{NB} = \text{non-beach, } \mathrm{B} = \text{beach} \\
\text{age} & \text{15-19, 20-24, 25-29} \\
\text{sex} & \mathrm{F} = \text{female, } \mathrm{M} = \text{male} \\
\text{count} & \text{the number of infections reported over a fixed time period} \\
\mathrm{n} & \text{the total number of swimmers}
\end{array}
$$

The data look like this:

$$
\begin{array}{lrrrrrr}
 & \text{count} & \text{n} & \text{freq} & \text{loc} & \text{sex} & \text{age} \\
1 & 68 & 31 & \text{F} & \text{NB} & \text{M} & \text{15-19} \\
2 & 14 & 4 & \text{F} & \text{NB} & \text{F} & \text{15-19} \\
3 & 35 & 12 & \text{F} & \text{NB} & \text{M} & \text{20-24} \\
4 & 16 & 11 & \text{F} & \text{NB} & \text{F} & \text{20-24} \\
{[\ldots]} & & & & & & \\
23 & 5 & 15 & \text{NF} & \text{B} & \text{M} & \text{25-29} \\
24 & 6 & 6 & \text{NF} & \text{B} & \text{F} & \text{25-29}
\end{array}
$$

Let $\mu_{j}$ denote the expected number of ear infections of a person in group $j$. Explain why it is reasonable to model $\text{count}_{j}$ as Poisson with mean $n_{j} \mu_{j}$.
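A sketch of the usual argument, under assumptions that the question does not state explicitly: the $n_{j}$ swimmers in group $j$ are taken to report infections independently, each with a Poisson number of infections of mean $\mu_{j}$. The symbol $Y_{ji}$ (the count for swimmer $i$ in group $j$) is introduced here for illustration only.

$$
\text{count}_{j} = \sum_{i=1}^{n_{j}} Y_{ji}, \qquad Y_{ji} \stackrel{\text{ind.}}{\sim} \text{Poisson}(\mu_{j}) \quad \Longrightarrow \quad \text{count}_{j} \sim \text{Poisson}(n_{j} \mu_{j})
$$

This uses the fact that a sum of independent Poisson variables is Poisson with mean equal to the sum of the individual means.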

We fit the following Poisson model:

$$
\log \left(\mathbb{E}\left(\operatorname{count}_{j}\right)\right)=\log \left(n_{j} \mu_{j}\right)=\log \left(n_{j}\right)+\mathbf{x}_{j} \beta
$$

where $\log \left(n_{j}\right)$ is an offset, i.e. an explanatory variable with known coefficient 1. R produces the following (abbreviated) summary for the main effects model:

Why are the expressions freqF, locB, age15-19, and sexF not listed?

Suppose that we plan to observe a group of 20 female, non-frequent, beach swimmers, aged 20-24. Give an expression (using the coefficient estimates from the model fitted above) for the expected number of ear infections in this group.
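A sketch of how such an expression can be assembled, assuming R's default treatment contrasts, so that the unlisted levels (freqF, locB, age15-19, sexF) act as baselines and the fitted coefficients are labelled (Intercept), freqNF, locNB, age20-24, age25-29 and sexM. These labels are an assumption, since the summary itself is not reproduced here. For a group of $n$ swimmers with covariate vector $\mathbf{x}$, the fitted mean under the model is

$$
\widehat{\mathbb{E}}(\text{count}) = n \exp \left(\mathbf{x} \hat{\beta}\right)
$$

For female, non-frequent, beach swimmers aged 20-24, only the freqNF and age20-24 indicators are non-zero, so with $n = 20$:

$$
\widehat{\mathbb{E}}(\text{count}) = 20 \exp \left(\hat{\beta}_{\text{(Intercept)}} + \hat{\beta}_{\text{freqNF}} + \hat{\beta}_{\text{age20-24}}\right)
$$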

Now, suppose that we allow for interaction between variables age and sex. Give the R command for fitting this model. We test for the effect of this interaction by producing the following (abbreviated) ANOVA table:

Briefly explain what test is performed, and what you would conclude from it. Does either of these models fit the data well?
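As a hedged reminder of the quantities typically involved, assume the ANOVA table reports the residual deviances of the two nested models. The symbols $D_{0}$ (main effects model) and $D_{1}$ (model with the age and sex interaction) are introduced here for illustration. Under the null hypothesis that the interaction coefficients are all zero, asymptotically

$$
D_{0} - D_{1} \sim \chi^{2}_{2}, \qquad \text{since the interaction adds } (3-1)(2-1) = 2 \text{ parameters.}
$$

For goodness of fit, each residual deviance can be compared with a $\chi^{2}$ distribution on $24 - p$ degrees of freedom, where $p$ is the number of fitted coefficients:

$$
p = 1 + 1 + 1 + 2 + 1 = 6 \ \Rightarrow\ 24 - 6 = 18 \quad \text{(main effects)}, \qquad p = 6 + 2 = 8 \ \Rightarrow\ 24 - 8 = 16 \quad \text{(with interaction)}
$$

These are large-sample approximations, so they are less reliable when the expected counts in some groups are small.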