Paper 4, Section II, J

Mathematics of Machine Learning
Part II, 2020

Suppose we have input-output pairs $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^p \times \{-1, 1\}$. Consider the empirical risk minimisation problem with hypothesis class

$$\mathcal{H} = \left\{x \mapsto x^T \beta : \beta \in C\right\}$$

where $C$ is a non-empty closed convex subset of $\mathbb{R}^p$, and logistic loss

$$\ell(h(x), y) = \log_2\left(1 + e^{-y h(x)}\right),$$

for $h \in \mathcal{H}$ and $(x, y) \in \mathbb{R}^p \times \{-1, 1\}$.
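Written out explicitly under these definitions (with the usual $1/n$ normalisation), the empirical risk to be minimised, referred to as $f$ in the parts below, is

$$f(\beta) = \frac{1}{n} \sum_{i=1}^{n} \ell(x_i^T \beta, y_i) = \frac{1}{n} \sum_{i=1}^{n} \log_2\left(1 + e^{-y_i x_i^T \beta}\right), \qquad \beta \in C.$$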

(i) Show that the objective function $f$ of the optimisation problem is convex.

(ii) Let $\pi_C(x)$ denote the projection of $x$ onto $C$. Describe the procedure of stochastic gradient descent (SGD) for minimisation of $f$ above, giving explicit forms for any gradients used in the algorithm.
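A minimal illustrative sketch of the projected SGD procedure in (ii), assuming for concreteness that $C$ is the Euclidean ball of radius $R$ so that $\pi_C$ has a closed form; the names `proj_ball`, `grad_loss` and `projected_sgd`, the zero starting point and the choice to average all iterates are illustrative assumptions made here, not part of the question.

```python
import numpy as np

def proj_ball(beta, R):
    """Euclidean projection onto {b : ||b||_2 <= R} (illustrative choice of C)."""
    norm = np.linalg.norm(beta)
    return beta if norm <= R else (R / norm) * beta

def grad_loss(beta, x, y):
    """Gradient in beta of the logistic loss log_2(1 + exp(-y * x^T beta))."""
    # d/dbeta log_2(1 + e^{-y x^T beta}) = -(y / log 2) * sigma(-y x^T beta) * x,
    # where sigma is the standard sigmoid function.
    z = -y * (x @ beta)
    sig = 1.0 / (1.0 + np.exp(-z))
    return (-y / np.log(2)) * sig * x

def projected_sgd(X, y, R, eta, k, seed=None):
    """k iterations of projected SGD with fixed step size eta; returns the average iterate."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)                 # beta_1: any starting point in C would do
    iterates = []
    for _ in range(k):
        iterates.append(beta)
        i = rng.integers(n)            # sample a training pair uniformly at random
        g = grad_loss(beta, X[i], y[i])
        beta = proj_ball(beta - eta * g, R)   # gradient step, then project back onto C
    return np.mean(iterates, axis=0)   # bar{beta}: average of beta_1, ..., beta_k
```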

(iii) Suppose $\hat{\beta}$ minimises $f(\beta)$ over $\beta \in C$. Suppose $\max_{i=1, \ldots, n} \|x_i\|_2 \leqslant M$ and $\sup_{\beta \in C} \|\beta\|_2 \leqslant R$. Prove that the output $\bar{\beta}$ of $k$ iterations of the SGD algorithm with some fixed step size $\eta$ (which you should specify) satisfies

$$\mathbb{E} f(\bar{\beta}) - f(\hat{\beta}) \leqslant \frac{2MR}{\log(2)\sqrt{k}}.$$
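As a sanity check of this rate, one could run the sketch above on synthetic data; the data-generating process and the step-size scaling $\eta \propto R/(M\sqrt{k})$ below are assumptions made for illustration, and the precise constant in $\eta$ is exactly what part (iii) asks you to determine.

```python
# Illustrative run on synthetic data (the data model and step-size scaling are assumptions).
rng = np.random.default_rng(0)
n, p, R, k = 200, 5, 1.0, 10_000
X = rng.normal(size=(n, p))
y = np.where(X @ rng.normal(size=p) > 0, 1, -1)
M = np.linalg.norm(X, axis=1).max()
eta = R * np.log(2) / (M * np.sqrt(k))   # eta ~ R / (M sqrt(k)) scaling suggested by the bound
beta_bar = projected_sgd(X, y, R, eta, k, seed=1)
```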

(iv) Now suppose that the step size at iteration $s$ is $\eta_s > 0$ for each $s = 1, \ldots, k$. Show that, writing $\beta_s$ for the $s$th iterate of SGD, we have

$$\mathbb{E} f(\tilde{\beta}) - f(\hat{\beta}) \leqslant \frac{A_2 M^2}{2 A_1 \{\log(2)\}^2} + \frac{2 R^2}{A_1},$$

where

$$\tilde{\beta} = \frac{1}{A_1} \sum_{s=1}^{k} \eta_s \beta_s, \qquad A_1 = \sum_{s=1}^{k} \eta_s \qquad \text{and} \qquad A_2 = \sum_{s=1}^{k} \eta_s^2.$$
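A corresponding sketch for varying step sizes, reusing `grad_loss` and `proj_ball` from the earlier block and again assuming $C$ is the Euclidean ball of radius $R$; it returns the weighted average $\tilde{\beta}$ defined above.

```python
def projected_sgd_weighted(X, y, R, etas, seed=None):
    """Projected SGD with step size etas[s-1] at iteration s; returns the weighted average tilde{beta}."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    weighted_sum = np.zeros(p)
    for eta in etas:                      # eta_1, ..., eta_k
        weighted_sum += eta * beta        # accumulate eta_s * beta_s before the s-th update
        i = rng.integers(n)
        g = grad_loss(beta, X[i], y[i])
        beta = proj_ball(beta - eta * g, R)
    A1 = float(np.sum(etas))
    return weighted_sum / A1              # tilde{beta} = (1/A_1) sum_s eta_s beta_s
```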

[You may use standard properties of convex functions and projections onto closed convex sets without proof provided you state them.]