No More Pesky Learning Rates
Tom Schaul, Sixin Zhang, Yann LeCun
Courant Institute of Mathematical Sciences, New York University, 715 Broadway, New York, NY 10003, USA
arXiv:1206.1106v2 [stat.ML] 18 Feb 2013
The performance of stochastic gradient descent (SGD) depends critically on how learning rates are tuned and decreased over time. We propose a method to automatically adjust multiple learning rates so as to minimize the expected error at any one time. The method relies on local gradient variations across samples. In our approach, learning rates can increase as well as decrease, making it suitable for non-stationary problems. Using a number of convex and non-convex learning tasks, we show that the resulting algorithm matches the performance of SGD or other adaptive approaches with their best settings obtained through systematic search, and effectively removes the need for learning rate tuning.
Large-scale learning problems require algorithms that scale benignly (e.g. sub-linearly) with the size of the dataset and the number of trainable parameters. This has led to a recent resurgence of interest in stochastic gradient descent (SGD) methods. Besides fast convergence, SGD has sometimes been observed to yield significantly better generalization errors than batch methods (Bottou & Bousquet, 2011). In practice, getting good performance with SGD requires some manual adjustment of the initial value of the learning rate (or step size) for each model and each problem, as well as the design of an annealing schedule for stationary data. The problem is particularly acute for non-stationary data.

The contribution of this paper is a novel method to automatically adjust learning rates (possibly different learning rates for different parameters), so as to minimize some estimate of the expectation of the loss at any one time. Starting from an idealized scenario where every sample's contribution to the loss is quadratic and separable, we derive a formula for the optimal learning rates for SGD, based on estimates of the variance of the gradient. The formula has two components: one that captures variability across samples, and one that captures the local curvature; both can be estimated in practice. The method can be used to derive a single common learning rate, local learning rates for each parameter, or rates for each block of parameters, leading to five variations of the basic algorithm, none of which needs any parameter tuning. The performance of the methods, obtained without any manual tuning, is reported on a variety of convex and non-convex learning models and tasks. They compare favorably with an "ideal SGD", where the best possible learning rate was obtained through systematic search, as well as with previous adaptive schemes.

SGD methods have a long history in adaptive signal processing, neural networks, and machine learning, with an extensive literature (see Bottou, 1998; Bottou & Bousquet, 2011 for recent reviews). While the practical advantages of SGD for machine learning applications have long been known (LeCun et al., 1998), interest in SGD has increased in recent years due to the ever-increasing amounts of streaming data, to theoretical optimality results for generalization error (Bottou & LeCun, 2004), and to competitions won by SGD methods, such as the PASCAL Large Scale Learning Challenge (Bordes et al., 2009), in which a Quasi-Newton approximation of the Hessian was used within SGD. Still, practitioners must deal with a sensitive hyper-parameter tuning phase to get top performance: each of the PASCAL tasks used
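As a rough illustration of the idea of learning rates driven by gradient variance (a simplified sketch, not the paper's exact algorithm: the exponential-moving-average estimators and the assumption of a known per-parameter curvature estimate h are simplifications introduced here), one can maintain a per-parameter rate of the form η_i = ḡ_i² / (h_i · v̄_i), where ḡ and v̄ are running averages of the gradient and the squared gradient:

```python
import numpy as np

def adaptive_rates(grads, h, decay=0.9, eps=1e-8):
    """Sketch: per-parameter learning rates eta_i = g_bar_i^2 / (h_i * v_bar_i),
    estimated from a stream of per-sample gradients via exponential moving
    averages. `h` is a per-parameter curvature estimate, assumed given here."""
    g_bar = np.zeros_like(grads[0])      # running mean of gradients
    v_bar = np.full_like(grads[0], eps)  # running mean of squared gradients
    for g in grads:
        g_bar = decay * g_bar + (1 - decay) * g
        v_bar = decay * v_bar + (1 - decay) * g ** 2
    return g_bar ** 2 / (h * v_bar + eps)
```

The ratio ḡ²/v̄ captures the sample-variability component: when per-sample gradients agree, ḡ² ≈ v̄ and the rate approaches 1/h; when they are noisy and largely cancel, ḡ² ≪ v̄ and the rate shrinks automatically.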
very different parameter settings. This tuning is very costly, as every parameter setting is typically tested over multiple epochs. Learning rates in SGD are generally decreased according to a schedule of the form η(t) = η0 (1 + γt)^{-1}. Originally proposed as η(t)...
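The annealing schedule above is easy to state concretely (η0 and γ below are illustrative values, not settings from the paper):

```python
def annealed_rate(t, eta0=0.1, gamma=0.01):
    """Learning rate schedule eta(t) = eta0 * (1 + gamma * t)^(-1):
    roughly constant while t << 1/gamma, then decaying like 1/t."""
    return eta0 / (1.0 + gamma * t)
```

Note that both η0 and γ must still be chosen per problem, which is exactly the hand-tuning burden the paper's method aims to remove.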