Comparison of Optimized Backpropagation Algorithms
W. Schiffmann, M. Joost, R. Werner
University of Koblenz
Institute of Physics
Rheinau 3–4
W-5400 Koblenz
e-mail: evol@infko.uni-koblenz.de
Presented at ESANN 93, Brüssel

Abstract
Backpropagation is one of the best-known training algorithms for multilayer perceptrons. Unfortunately, it can be very slow for practical applications. Over the last years, many improvement strategies have been developed to speed up backpropagation. Comparing these techniques is difficult because most of them have been tested only on specific data sets, and most reported results are based on small, artificial training sets such as XOR, encoder or decoder problems. It is doubtful whether these results carry over to more complicated practical applications. This report gives an overview of many different speedup techniques. All of them were assessed on a very hard practical classification task based on a large medical data set. As will be seen, many of these optimized algorithms fail to learn the data set.

1 Introduction

This report summarizes our experience with many different speedup techniques for the backpropagation algorithm. We have tested 16 different algorithms on a very hard classification task. Most of these algorithms use many parameters that have to be tuned by hand, so hundreds of test runs had to be performed. It is beyond the scope of this paper to discuss every approach in detail; instead, we group the different approaches into classes of algorithms and discuss these classes. A much more detailed report will be available via ftp.

2 Thyroid-Data

In order to compare many different approaches we have used measurements of the thyroid gland [Quinlan, 1987]. Each measurement vector consists of 21 values: 15 binary and 6 analog. One of three classes is assigned to each measurement vector, corresponding to hyper-, hypo- and normal function of the thyroid gland. Since approximately 92% of all patients have a normal function, a useful classifier must achieve significantly more than 92% correct classifications. The training set consists of 3772 measurement vectors, and another 3428 vectors are available for testing. The training period was limited to 5000 epochs using a fixed 3-layer network architecture with 21 input, 10 hidden and 3 output units. The network was fully interconnected. On a SPARC2 CPU, training takes from 12 to 24 hours. The weights of the network were drawn randomly from a normal distribution (μ = 0.0, σ = 0.1). The bias of each unit was computed as follows: first, the average input pattern of the whole training set was calculated; then, while propagating this averaged pattern through the network, the bias of each unit was tuned so that every hidden and output unit is half-activated. By this means the slope of the sigmoid activation function of every unit is maximized, which benefits the gradient descent during training.

This work is supported by the Deutsche Forschungsgemeinschaft (DFG) as part of the project FE-generator (grant Schi 304/1-1).
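To make the bias initialization concrete, here is a minimal NumPy sketch (not from the paper; the variable names, random seed, and the placeholder data matrix X are assumptions for illustration). Each bias is chosen so that the unit's net input for the averaged pattern is zero; the sigmoid then outputs 0.5 and its derivative is maximal:

```python
import numpy as np

# Sketch of the described bias initialization (names and shapes assumed).
# Network: 21 inputs, 10 hidden units, 3 output units, sigmoid activations.
rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.1, size=(10, 21))  # input -> hidden weights
W2 = rng.normal(0.0, 0.1, size=(3, 10))   # hidden -> output weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Placeholder for the 3772 thyroid training vectors (21 values each).
X = rng.normal(size=(3772, 21))
x_mean = X.mean(axis=0)             # average input pattern of the whole set

# Pick each bias so the net input for the averaged pattern is zero:
# sigmoid(0) = 0.5 (half activation) and sigmoid'(0) = 0.25 is maximal.
b1 = -(W1 @ x_mean)
h_mean = sigmoid(W1 @ x_mean + b1)  # equals 0.5 for every hidden unit
b2 = -(W2 @ h_mean)
```

Initialized this way, every unit starts in the steepest region of its sigmoid, so the early weight updates are not damped by near-zero activation derivatives.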

3 Standard Backpropagation

Basically, backpropagation [Rumelhart, 1986] is a gradient descent technique that minimizes some error criterion E. In the batched-mode variant the descent is based on the gradient ∇E over the total training set:

\Delta w_{ij}(n) = -\eta \, \frac{\partial E}{\partial w_{ij}} + \alpha \, \Delta w_{ij}(n-1)

Here η and α are two non-negative constant parameters called the learning rate and the momentum. The momentum can speed up training in very flat regions of the error surface and suppresses weight oscillations in steep valleys or ravines. Unfortunately, it is necessary to propagate the whole training set through the network to calculate ∇E, which can slow down training for larger training sets. For some tasks (e.g. neural controllers) no finite training set is available. Therefore an online variant is often used, which updates the connections based on the gradient...
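A minimal sketch of this batched-mode update follows (not from the paper; the function names and the values of eta and alpha are illustrative, and grad_fn stands for any routine that returns ∂E/∂w accumulated over the whole training set):

```python
import numpy as np

def train_batched(W, grad_fn, X, T, eta=0.1, alpha=0.9, epochs=5000):
    """Batched-mode gradient descent with momentum.

    grad_fn(W, X, T) must return dE/dW summed over the *whole* training
    set, which is why every single update needs a full pass over the
    data. eta (learning rate) and alpha (momentum) are illustrative.
    """
    delta = np.zeros_like(W)                 # Delta w(0) = 0
    for _ in range(epochs):
        grad = grad_fn(W, X, T)              # dE/dw over all patterns
        delta = -eta * grad + alpha * delta  # Dw(n) = -eta dE/dw + alpha Dw(n-1)
        W = W + delta
    return W
```

The α Δw(n-1) term carries over a fraction of the previous step, so gradient components that oscillate from update to update (across a ravine) cancel out, while components that point consistently in one direction (along a flat region) accumulate.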