Suppose we have input-output pairs $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^p \times \{-1, 1\}$. Consider the empirical risk minimisation problem with hypothesis class
\[
\mathcal{H} = \{x \mapsto x^T \beta : \beta \in C\},
\]
where $C$ is a non-empty closed convex subset of $\mathbb{R}^p$, and logistic loss
\[
\ell(h(x), y) = \log_2\bigl(1 + e^{-y h(x)}\bigr),
\]
for $h \in \mathcal{H}$ and $(x, y) \in \mathbb{R}^p \times \{-1, 1\}$.
(i) Show that the objective function
\[
f(\beta) = \frac{1}{n} \sum_{i=1}^{n} \ell(x_i^T \beta, y_i) = \frac{1}{n} \sum_{i=1}^{n} \log_2\bigl(1 + e^{-y_i x_i^T \beta}\bigr)
\]
of the optimisation problem is convex.
(ii) Let $\pi_C(x)$ denote the projection of $x$ onto $C$. Describe the stochastic gradient descent (SGD) procedure for minimising $f$ above, giving explicit forms for any gradients used in the algorithm.
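As a concrete illustration (not part of the question), here is a minimal NumPy sketch of the procedure. It assumes, purely for illustration, that $C$ is the $\ell_2$ ball of radius $R$, so that the projection has the closed form $\pi_C(\beta) = \beta \min\{1, R/\|\beta\|_2\}$; it returns the averaged iterate used in part (iii).

\begin{verbatim}
import numpy as np

def projected_sgd(X, y, R, eta, k, seed=None):
    """Projected SGD for f(beta) = (1/n) sum_i log2(1 + exp(-y_i x_i^T beta)).

    Illustrative only: C is taken to be the l2 ball of radius R, so
    pi_C(b) = b * min(1, R / ||b||_2).  Returns the average of the k iterates.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)                    # beta_1 = 0 lies in C
    running_sum = np.zeros(p)
    for _ in range(k):
        i = rng.integers(n)               # sample one data point uniformly
        margin = y[i] * (X[i] @ beta)
        # d/du log2(1 + e^{-u}) = -1 / ((1 + e^u) log 2), so the chain rule
        # gives the stochastic gradient -y_i x_i / ((1 + e^margin) log 2).
        sigma = 0.5 * (1.0 - np.tanh(margin / 2.0))  # stable 1 / (1 + e^margin)
        grad = -y[i] * sigma / np.log(2.0) * X[i]
        beta = beta - eta * grad          # gradient step ...
        norm = np.linalg.norm(beta)
        if norm > R:                      # ... then project back onto C
            beta = beta * (R / norm)
        running_sum += beta
    return running_sum / k                # the averaged iterate
\end{verbatim}

Returning the average of the iterates, rather than the final one, is the form of output to which the bounds in parts (iii) and (iv) refer.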
(iii) Suppose $\hat{\beta}$ minimises $f(\beta)$ over $\beta \in C$. Suppose $\max_{i=1,\ldots,n} \|x_i\|_2 \leqslant M$ and $\sup_{\beta \in C} \|\beta\|_2 \leqslant R$. Prove that the output $\bar{\beta}$ of $k$ iterations of the SGD algorithm with some fixed step size $\eta$ (which you should specify) satisfies
\[
\mathbb{E} f(\bar{\beta}) - f(\hat{\beta}) \leqslant \frac{2MR}{\log(2)\sqrt{k}}.
\]
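As a sanity check on the constants (not part of the question), the stochastic gradients here are uniformly bounded: for each $i$,
\[
\bigl\| \nabla_\beta \log_2\bigl(1 + e^{-y_i x_i^T \beta}\bigr) \bigr\|_2
= \frac{1}{\log 2} \cdot \frac{e^{-y_i x_i^T \beta}}{1 + e^{-y_i x_i^T \beta}} \, \|x_i\|_2
\leqslant \frac{M}{\log 2},
\]
and this Lipschitz constant is how $M$ and $\log(2)$ enter the bound above.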
(iv) Now suppose that the step size at iteration $s$ is $\eta_s > 0$ for each $s = 1, \ldots, k$. Show that, writing $\beta_s$ for the $s$th iterate of SGD, we have
\[
\mathbb{E} f(\tilde{\beta}) - f(\hat{\beta}) \leqslant \frac{A_2 M^2}{2 A_1 \{\log(2)\}^2} + \frac{2R^2}{A_1},
\]
where
\[
\tilde{\beta} = \frac{1}{A_1} \sum_{s=1}^{k} \eta_s \beta_s, \qquad A_1 = \sum_{s=1}^{k} \eta_s \quad \text{and} \quad A_2 = \sum_{s=1}^{k} \eta_s^2.
\]
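A hint on the first step (valid by the convexity established in part (i)): Jensen's inequality gives
\[
f(\tilde{\beta}) = f\Bigl(\frac{1}{A_1} \sum_{s=1}^{k} \eta_s \beta_s\Bigr) \leqslant \frac{1}{A_1} \sum_{s=1}^{k} \eta_s f(\beta_s),
\]
so it suffices to bound the $\eta_s$-weighted average of $\mathbb{E} f(\beta_s) - f(\hat{\beta})$.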
[You may use standard properties of convex functions and projections onto closed convex sets without proof provided you state them.]
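For instance, two standard facts commonly invoked in this argument are the non-expansiveness of the projection onto a closed convex set,
\[
\|\pi_C(u) - \pi_C(v)\|_2 \leqslant \|u - v\|_2 \quad \text{for all } u, v \in \mathbb{R}^p,
\]
and the first-order characterisation of convexity for differentiable $f$,
\[
f(\beta') \geqslant f(\beta) + \nabla f(\beta)^T (\beta' - \beta) \quad \text{for all } \beta, \beta'.
\]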