Paper 1, Section II, J

Mathematics of Machine Learning
Part II, 2020

(a) Let Z1,,ZnZ_{1}, \ldots, Z_{n} be i.i.d. random elements taking values in a set Z\mathcal{Z} and let F\mathcal{F} be a class of functions f:ZRf: \mathcal{Z} \rightarrow \mathbb{R}. Define the Rademacher complexity Rn(F)\mathcal{R}_{n}(\mathcal{F}). Write down an inequality relating the Rademacher complexity and

E(supfF1ni=1n(f(Zi)Ef(Zi)))\mathbb{E}\left(\sup _{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n}\left(f\left(Z_{i}\right)-\mathbb{E} f\left(Z_{i}\right)\right)\right)

State the bounded differences inequality.

(b) Now given i.i.d. input-output pairs (X1,Y1),,(Xn,Yn)X×{1,1}\left(X_{1}, Y_{1}\right), \ldots,\left(X_{n}, Y_{n}\right) \in \mathcal{X} \times\{-1,1\} consider performing empirical risk minimisation with misclassification loss over the class H\mathcal{H} of classifiers h:X{1,1}h: \mathcal{X} \rightarrow\{-1,1\}. Denote by h^\hat{h} the empirical risk minimiser [which you may assume exists]. For any classifier hh, let R(h)R(h) be its misclassification risk and suppose this is minimised over H\mathcal{H} by hHh^{*} \in \mathcal{H}. Prove that with probability at least 1δ1-\delta,

R(h^)R(h)2Rn(F)+2log(2/δ)nR(\hat{h})-R\left(h^{*}\right) \leqslant 2 \mathcal{R}_{n}(\mathcal{F})+\sqrt{\frac{2 \log (2 / \delta)}{n}}

for δ(0,1]\delta \in(0,1], where F\mathcal{F} is a class of functions f:X×{1,1}{0,1}f: \mathcal{X} \times\{-1,1\} \rightarrow\{0,1\} related to H\mathcal{H} that you should specify.

(c) Let Zi=(Xi,Yi)Z_{i}=\left(X_{i}, Y_{i}\right) for i=1,,ni=1, \ldots, n. Define the empirical Rademacher complexity R^(F(Z1:n))\hat{\mathcal{R}}\left(\mathcal{F}\left(Z_{1: n}\right)\right). Show that with probability at least 1δ1-\delta,

R(h^)R(h)2R^(F(Z1:n))+22log(3/δ)nR(\hat{h})-R\left(h^{*}\right) \leqslant 2 \hat{\mathcal{R}}\left(\mathcal{F}\left(Z_{1: n}\right)\right)+2 \sqrt{\frac{2 \log (3 / \delta)}{n}}