W. Schiffmann, M. Joost, R. Werner
University of Koblenz
Institute of Physics
Presented at ESANN 93, Brüssel
Backpropagation is one of the best-known training algorithms for multilayer perceptrons. Unfortunately, it can be very slow in practical applications. In recent years many improvement strategies have been developed to speed up backpropagation. It is very difficult to compare these techniques, because most of them have been tested on different specific data sets. Most of the reported results are based on tiny, artificial training sets such as XOR, encoder or decoder. It is doubtful whether these results hold for more complicated practical applications. This report gives an overview of many different speedup techniques. All of them were assessed on a very hard practical classification task based on a large medical data set. As will be seen, many of these optimized algorithms fail to learn the data set.
This report summarizes our experience with many different speedup techniques for the backpropagation algorithm. We have tested 16 different algorithms on a very hard classification task. Most of these algorithms have many parameters, which must be tuned by hand, so hundreds of test runs had to be performed. It is beyond the scope of this paper to discuss every approach in detail. Instead, we group the approaches into classes of algorithms and discuss these classes. A much more detailed report will be available via ftp.
This work is supported by the Deutsche Forschungsgemeinschaft (DFG) as part of the project FE-generator (grant Schi 304/1-1).
In order to compare many different approaches, we have used measurements of the thyroid gland [Quinlan, 1987]. Each measurement vector consists of 21 values: 15 binary and 6 analog. Each measurement vector is assigned to one of three classes, which correspond to the hyper-, hypo- and normal function of the thyroid gland. Since approximately 92% of all patients have a normal function, a useful classifier must achieve significantly more than 92% correct classifications. The training set consists of 3772 measurement vectors, and another 3428 vectors are available for testing. The training period was limited to 5000 epochs using a fixed 3-layer network architecture with 21 input, 10 hidden and 3 output units. The network was fully interconnected. On a SPARC2 CPU, training takes from 12 to 24 hours. The initial weights of the network were drawn at random from a normal distribution ($\mu = 0.0$, $\sigma = 0.1$). The bias of each unit was computed as follows. First, the average input pattern of the whole learning set was calculated. While this averaged pattern is propagated through the network, the bias of each unit is tuned so that every hidden and output unit is half activated. In this way the gradient of the sigmoid activation function of every unit is maximized, which benefits the gradient descent during training.
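The initialization described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the code used for the experiments: the function names `init_network` and `forward` are hypothetical, the logistic sigmoid is assumed as the activation function, connections are assumed to run only between adjacent layers, and the input data are random stand-ins for the thyroid measurements.

```python
import numpy as np


def sigmoid(x):
    """Logistic activation; its slope is largest at x = 0, where it equals 0.5."""
    return 1.0 / (1.0 + np.exp(-x))


def init_network(inputs, layer_sizes=(21, 10, 3), sigma=0.1, seed=0):
    """Draw weights from a normal distribution and centre every unit on the mean pattern."""
    rng = np.random.default_rng(seed)
    activation = inputs.mean(axis=0)  # average input pattern of the learning set
    weights, biases = [], []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        w = rng.normal(0.0, sigma, size=(n_in, n_out))
        # Choose the bias so the net input is zero for the averaged pattern,
        # which makes the unit output sigmoid(0) = 0.5 (half activated).
        b = -(activation @ w)
        activation = sigmoid(activation @ w + b)
        weights.append(w)
        biases.append(b)
    return weights, biases


def forward(pattern, weights, biases):
    """Propagate one pattern and return the activations of every layer."""
    activations = []
    for w, b in zip(weights, biases):
        pattern = sigmoid(pattern @ w + b)
        activations.append(pattern)
    return activations


# Hypothetical stand-in data: 100 random vectors with 21 values each.
demo_inputs = np.random.default_rng(1).random((100, 21))
demo_weights, demo_biases = init_network(demo_inputs)
hidden, output = forward(demo_inputs.mean(axis=0), demo_weights, demo_biases)
print(hidden.round(3), output.round(3))
```

Setting each bias to the negative net input of the averaged pattern is one simple way to obtain half activation; an iterative tuning of the bias would reach the same operating point.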
Basically, backpropagation [Rumelhart, 1986] is a gradient descent technique to minimize some error criterion $E$. In the batched mode variant the descent is based on the gradient $\nabla E$ for the total training set:
$$\Delta w_{ij}(n) = -\eta \, \frac{\partial E}{\partial w_{ij}} + \alpha \, \Delta w_{ij}(n-1)$$
$\eta$ and $\alpha$ are two non-negative constant parameters called learning rate and momentum. The momentum can speed up training in very flat regions of the error surface and suppresses weight oscillation in steep valleys or ravines. Unfortunately, it is necessary to propagate the whole training set through the network to calculate $\nabla E$. This can slow down training for larger training sets. For some tasks (e.g. neural controllers) no finite training set is available. Therefore an online variant is often used, which updates the connections based on the gradient...
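The effect of the momentum term in the batch update can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the experimental setup of this paper: the quadratic error surface, the function names, and the parameter values (learning rate 0.01, momentum 0.9, 500 steps) are all chosen only for illustration.

```python
import numpy as np


def momentum_step(w, grad, prev_delta, learning_rate, momentum):
    """One batch update: delta(n) = -learning_rate * grad + momentum * delta(n-1)."""
    delta = -learning_rate * grad + momentum * prev_delta
    return w + delta, delta


def quadratic_gradient(w, curvature):
    """Gradient of the toy error E(w) = 0.5 * sum(curvature * w**2)."""
    return curvature * w


def descend(momentum, steps=500, learning_rate=0.01):
    """Run gradient descent with the given momentum and return the final weights."""
    curvature = np.array([1.0, 100.0])  # one flat direction, one steep ravine
    w = np.array([1.0, 1.0])
    delta = np.zeros_like(w)
    for _ in range(steps):
        grad = quadratic_gradient(w, curvature)
        w, delta = momentum_step(w, grad, delta, learning_rate, momentum)
    return w


w_plain = descend(momentum=0.0)
w_momentum = descend(momentum=0.9)
# Distance from the minimum at the origin, without and with momentum.
print(np.linalg.norm(w_plain), np.linalg.norm(w_momentum))
```

With these assumed settings, the run with momentum should end much closer to the minimum, because progress along the flat direction is no longer limited by the small learning rate that the steep direction requires.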