We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This setting includes standard machine learning problems such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms: stochastic gradient descent (a.k.a. the Robbins-Monro algorithm) and a simple modification where iterates are averaged (a.k.a. Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or to the setting of the proportionality constant. This situation is remedied when slower decays are used together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.

Our contributions are the following:

− We provide a non-asymptotic analysis of stochastic gradient descent for standard machine learning problems, including kernel least-squares regression and logistic regression (see Section 2), with strong convexity assumptions (Section 3) and without (Section 4).

− We provide a non-asymptotic analysis of Polyak-Ruppert averaging [4,5], with and without strong convexity (Sections 3.3 and 4.2). In particular, we show that slower decays of the learning rate, together with averaging, are crucial to robustly obtain fast convergence rates.

− We illustrate our theoretical results through experiments on synthetic and non-synthetic examples in Section 5.

Notation. We consider a Hilbert space $H$ with a scalar product $\langle \cdot, \cdot \rangle$. We denote by $\| \cdot \|$ the associated norm and use the same notation for the operator norm on bounded linear operators from $H$ to $H$, defined as $\|A\| = \sup_{\|x\| \leq 1} \|Ax\|$ (if $H$ is a Euclidean space, then $\|A\|$ is the largest singular value of $A$). We also use the notation "w.p.1" to mean "with probability one". We denote by $\mathbb{E}$ the expectation or conditional expectation with respect to the underlying probability space.
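As a quick illustration of the operator norm in the Euclidean case, the following sketch approximates $\sup_{\|x\| \leq 1} \|Ax\|$ by sampling unit vectors and compares it with the largest singular value; the random matrix and sample size are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Check that, in a Euclidean space, the operator norm
# ||A|| = sup_{||x|| <= 1} ||Ax|| coincides with the largest singular value.
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))            # illustrative random matrix

sigma_max = np.linalg.svd(A, compute_uv=False)[0]  # largest singular value

# Approximate the supremum by evaluating ||Ax|| over many random unit vectors.
xs = rng.normal(size=(100_000, 4))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
sup_estimate = np.linalg.norm(xs @ A.T, axis=1).max()

print(sigma_max, sup_estimate)  # the estimate approaches sigma_max from below
```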
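To make the contrast drawn above concrete, here is a minimal sketch of stochastic gradient descent with and without Polyak-Ruppert averaging on a synthetic least-squares problem; the model, noise level, step-size constants, and all names below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Sketch: SGD (Robbins-Monro) with and without Polyak-Ruppert averaging
# on a synthetic least-squares problem with unbiased stochastic gradients.
rng = np.random.default_rng(0)
d, n_iter = 5, 10_000
theta_star = rng.normal(size=d)          # ground-truth parameter (assumed)

def sample():
    """Draw one observation (x, y) with y = <theta_star, x> + noise."""
    x = rng.normal(size=d)
    y = x @ theta_star + 0.1 * rng.normal()
    return x, y

def sgd(step_size, average):
    """Run SGD with step sizes step_size(n); optionally average the iterates."""
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for n in range(1, n_iter + 1):
        x, y = sample()
        grad = (x @ theta - y) * x              # unbiased gradient estimate
        theta -= step_size(n) * grad
        theta_bar += (theta - theta_bar) / n    # running average of iterates
    return theta_bar if average else theta

# Step size proportional to 1/n, no averaging: fast under strong convexity,
# but sensitive to the proportionality constant.
theta_inv = sgd(lambda n: 1.0 / n, average=False)
# Slower 1/sqrt(n) decay with averaging: the robust choice advocated above.
theta_avg = sgd(lambda n: 1.0 / np.sqrt(n), average=True)

print("error, step 1/n, last iterate:", np.linalg.norm(theta_inv - theta_star))
print("error, step 1/sqrt(n), averaged:", np.linalg.norm(theta_avg - theta_star))
```

On this strongly convex toy problem a well-tuned 1/n step size can match the averaged variant; the point made above is that the slower decay combined with averaging remains accurate when strong convexity is absent or the proportionality constant is misspecified.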