No More Pesky Learning Rates

Tom Schaul, Sixin Zhang, Yann LeCun
Courant Institute of Mathematical Sciences, New York University
715 Broadway, New York, NY 10003, USA

schaul@cims.nyu.edu zsx@cims.nyu.edu yann@cims.nyu.edu

arXiv:1206.1106v2 [stat.ML] 18 Feb 2013

Abstract
The performance of stochastic gradient descent (SGD) depends critically on how learning rates are tuned and decreased over time. We propose a method to automatically adjust multiple learning rates so as to minimize the expected error at any one time. The method relies on local gradient variations across samples. In our approach, learning rates can increase as well as decrease, making it suitable for non-stationary problems. Using a number of convex and non-convex learning tasks, we show that the resulting algorithm matches the performance of SGD or other adaptive approaches with their best settings obtained through systematic search, and effectively removes the need for learning rate tuning.

1. Introduction
Large-scale learning problems require algorithms that scale benignly (e.g., sub-linearly) with the size of the dataset and the number of trainable parameters. This has led to a recent resurgence of interest in stochastic gradient descent (SGD) methods. Besides fast convergence, SGD has sometimes been observed to yield significantly better generalization errors than batch methods (Bottou & Bousquet, 2011). In practice, getting good performance with SGD requires some manual adjustment of the initial value of the learning rate (or step size) for each model and each problem, as well as the design of an annealing schedule for stationary data. The problem is particularly acute for non-stationary data. The contribution of this paper is a novel method to automatically adjust learning rates (possibly different learning rates for different parameters), so as to minimize some estimate of the expectation of the loss at any one time. Starting from an idealized scenario where every sample's contribution to the loss is quadratic and separable, we derive a formula for the optimal learning rates for SGD, based on estimates of the variance of the gradient. The formula has two components: one that captures variability across samples, and one that captures the local curvature, both of which can be estimated in practice. The method can be used to derive a single common learning rate, or local learning rates for each parameter or each block of parameters, leading to five variations of the basic algorithm, none of which need any parameter tuning. The performance of the methods obtained without any manual tuning is reported on a variety of convex and non-convex learning models and tasks. They compare favorably with an "ideal SGD", where the best possible learning rate was obtained through systematic search, as well as with previous adaptive schemes.
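
The excerpt states the shape of this formula (one term for cross-sample gradient variability, one for local curvature) without reproducing it. As a rough illustration of that idea, the sketch below applies per-parameter rates of the form eta_i = gbar_i^2 / (h_i * E[g_i^2]) to a toy separable quadratic problem; the fixed-decay running averages, the assumption that the diagonal curvature h is known, and all names used here are illustrative simplifications, not the paper's actual estimators.

    import numpy as np

    def adaptive_rates(g_avg, g2_avg, curv, eps=1e-12):
        # Per-parameter rates of the form eta_i = gbar_i^2 / (h_i * E[g_i^2]).
        # Since E[g^2] = gbar^2 + Var[g], the rate approaches 1/h_i when the
        # gradient agrees across samples and shrinks toward zero when it is
        # dominated by sample-to-sample noise.
        return g_avg ** 2 / (curv * g2_avg + eps)

    # Toy separable quadratic: each sample j contributes 0.5 * h * (theta - c_j)^2,
    # with sample optima c_j scattered around zero, so gradients are noisy.
    rng = np.random.default_rng(0)
    h = np.array([1.0, 10.0])            # diagonal curvature, assumed known here
    theta = np.array([5.0, 5.0])
    g_avg, g2_avg = np.zeros(2), np.ones(2)
    decay = 0.95                         # fixed memory for the running estimates

    for t in range(2000):
        c = rng.normal(0.0, 1.0, size=2)              # draw one sample
        grad = h * (theta - c)                        # its stochastic gradient
        g_avg = decay * g_avg + (1 - decay) * grad
        g2_avg = decay * g2_avg + (1 - decay) * grad ** 2
        theta = theta - adaptive_rates(g_avg, g2_avg, h) * grad

    print(theta)   # moves from [5, 5] toward the mean of the sample optima, ~[0, 0]

The qualitative behavior is the one the abstract advertises: when the gradient is consistent across samples the rate approaches the Newton-like value 1/h_i, and when it is mostly noise the rate shrinks toward zero, so no hand-set schedule is needed for this toy problem.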

2. Background
SGD methods have a long history in adaptive signal processing, neural networks, and machine learning, with an extensive literature (see Bottou, 1998; Bottou & Bousquet, 2011, for recent reviews). While the practical advantages of SGD for machine learning applications have been known for a long time (LeCun et al., 1998), interest in SGD has increased in recent years due to the ever-increasing amounts of streaming data, to theoretical optimality results for generalization error (Bottou & LeCun, 2004), and to competitions being won by SGD methods, such as the PASCAL Large Scale Learning Challenge (Bordes et al., 2009), where a quasi-Newton approximation of the Hessian was used within SGD. Still, practitioners need to deal with a sensitive hyper-parameter tuning phase to get top performance: each of the PASCAL tasks used


very different parameter settings. This tuning is very costly, as every parameter setting is typically tested over multiple epochs. Learning rates in SGD are generally decreased according to a schedule of the form $\eta(t) = \eta_0 (1 + \gamma t)^{-1}$. Originally proposed as $\eta(t)$...
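
For concreteness, here is a minimal sketch of plain SGD under a schedule of this form; the objective, the initial rate eta0, the decay constant gamma, and the step count are arbitrary illustrative choices, not values from the paper.

    import numpy as np

    def annealed_sgd(grad_fn, theta0, eta0=0.1, gamma=1e-3, num_steps=10000, seed=0):
        # Plain SGD with the decreasing schedule eta(t) = eta0 * (1 + gamma * t)^(-1).
        # grad_fn(theta, rng) must return a stochastic gradient estimate.
        rng = np.random.default_rng(seed)
        theta = np.asarray(theta0, dtype=float)
        for t in range(num_steps):
            eta = eta0 / (1.0 + gamma * t)            # annealed step size
            theta = theta - eta * grad_fn(theta, rng)
        return theta

    # Example: noisy gradients of f(theta) = 0.5 * ||theta||^2
    noisy_grad = lambda theta, rng: theta + rng.normal(0.0, 0.5, size=theta.shape)
    print(annealed_sgd(noisy_grad, np.ones(3)))   # components end up close to 0

The difficulty the paper targets is that eta0 and gamma in such a schedule must be tuned separately for each model and task; the proposed method is meant to remove that tuning step.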