Hessian-Free Optimization: Supplementary Materials

Contents

1 Pseudo-code for the damped Gauss-Newton vector product

2 Details of the pathological synthetic problems
2.1 The addition, multiplication, and XOR problem
2.2 The temporal order problem
2.3 The 3-bit temporal order problem
2.4 The random permutation problem
2.5 Noiseless memorization

3 Details of the natural problems
3.1 The bouncing balls problem
3.2 The MIDI dataset
3.3 The speech dataset

1 Pseudo-code for the damped Gauss-Newton vector product

Algorithm 1 Computation of the matrix-vector product of the structurally-damped Gauss-Newton matrix with the vector $v$, for the case when $e$ is the tanh non-linearity, $g$ the logistic sigmoid, and $D$ and $L$ are the corresponding matching loss functions. The notation reflects the “convex approximation” interpretation of the GN matrix, so that we are applying the $R$ operator to the forwards-backwards pass through the linearized and structurally damped objective $\tilde{k}$, and the desired matrix-vector product is given by $R\frac{d\tilde{k}}{d\theta}$. All derivatives are implicitly evaluated at $\theta = \theta_n$. The previously defined parameter symbols $W_{ph}$, $W_{hx}$, $W_{hh}$, $b_h$, $b_p$, $b_h^{\mathrm{init}}$ correspond to the parameter vector $\theta_n$ if they have no superscript and to the input parameter vector $v$ if they have the ‘$v$’ superscript. The $Rz$ notation follows Pearlmutter [1994] and, for the purposes of reading the pseudo-code, can be interpreted as merely defining a new symbol. We assume that intermediate quantities of the network (e.g. $h_i$) have already been computed (from $\theta_n$). The operator $\odot$ is coordinate-wise multiplication. Line 17 (underlined) is responsible for structural damping.

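1: for $i = 1$ to $T$ do
2:   if $i = 1$ then
3:     $Rt_i \leftarrow b_h^v + b_h^{\mathrm{init},v} + W_{hx}^v x_i$
4:   else
5:     $Rt_i \leftarrow b_h^v + W_{hx}^v x_i + W_{hh}^v h_{i-1} + W_{hh} Rh_{i-1}$
6:   end if
7:   $Rh_i \leftarrow (1 + h_i) \odot (1 - h_i) \odot Rt_i$
8:   $Rs_i \leftarrow b_p^v + W_{ph}^v h_i + W_{ph} Rh_i$
9:   $R\hat{y}_i \leftarrow \hat{y}_i \odot (1 - \hat{y}_i) \odot Rs_i$
10: end for
11: $R\frac{d\tilde{k}}{d\theta} \leftarrow 0$
12: $R\frac{d\tilde{k}}{dt_{T+1}} \leftarrow 0$
13: for $i = T$ down to $1$ do
14:   $R\frac{d\tilde{k}}{ds_i} \leftarrow R\hat{y}_i$
15:   $R\frac{d\tilde{k}}{dh_i} \leftarrow W_{hh}^\top R\frac{d\tilde{k}}{dt_{i+1}} + W_{ph}^\top R\frac{d\tilde{k}}{ds_i}$
16:   $R\frac{d\tilde{k}}{dt_i} \leftarrow (1 + h_i) \odot (1 - h_i) \odot R\frac{d\tilde{k}}{dh_i}$
17:   $\underline{R\frac{d\tilde{k}}{dt_i} \leftarrow R\frac{d\tilde{k}}{dt_i} + \lambda\mu\,(1 + h_i) \odot (1 - h_i) \odot Rt_i}$
18:   $R\frac{d\tilde{k}}{dW_{ph}} \leftarrow R\frac{d\tilde{k}}{dW_{ph}} + R\hat{y}_i\, h_i^\top$
19:   $R\frac{d\tilde{k}}{dW_{hh}} \leftarrow R\frac{d\tilde{k}}{dW_{hh}} + R\frac{d\tilde{k}}{dt_{i+1}}\, h_i^\top$
20:   $R\frac{d\tilde{k}}{dW_{hx}} \leftarrow R\frac{d\tilde{k}}{dW_{hx}} + R\frac{d\tilde{k}}{dt_i}\, x_i^\top$
21:   $R\frac{d\tilde{k}}{db_h} \leftarrow R\frac{d\tilde{k}}{db_h} + R\frac{d\tilde{k}}{dt_i}$
22:   $R\frac{d\tilde{k}}{db_p} \leftarrow R\frac{d\tilde{k}}{db_p} + R\hat{y}_i$
23: end for
24: $R\frac{d\tilde{k}}{db_h^{\mathrm{init}}} \leftarrow R\frac{d\tilde{k}}{dt_1}$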
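As an illustration only, the forward and backward passes of Algorithm 1 might be transcribed into NumPy roughly as follows. This is a minimal sketch for a single sequence, not the authors' implementation: the dictionary layout of the parameters, the function names `forward`, `gn_vector_product` and `inner`, the argument `damping` (standing in for whatever coefficient scales the structural damping term on line 17), and the toy dimensions are all assumptions made for the example.

```python
import numpy as np

def forward(params, xs):
    """Run the RNN at the current parameters; return hidden states h_i and outputs y_i."""
    hs, ys, h_prev = [], [], None
    for i, x in enumerate(xs):
        t = params["b_h"] + params["W_hx"] @ x
        t = t + (params["b_init"] if i == 0 else params["W_hh"] @ h_prev)
        h_prev = np.tanh(t)
        hs.append(h_prev)
        s = params["b_p"] + params["W_ph"] @ h_prev
        ys.append(1.0 / (1.0 + np.exp(-s)))
    return hs, ys

def gn_vector_product(params, v, xs, hs, ys, damping=0.0):
    """Sketch of Algorithm 1; `v` and the result are dicts shaped like `params`."""
    T = len(xs)
    Rt, Ry, Rh_prev = [], [], None
    # Forward R-pass (lines 1-10).
    for i in range(T):
        Rt_i = v["b_h"] + v["W_hx"] @ xs[i]
        if i == 0:
            Rt_i = Rt_i + v["b_init"]
        else:
            Rt_i = Rt_i + v["W_hh"] @ hs[i - 1] + params["W_hh"] @ Rh_prev
        Rh_prev = (1 + hs[i]) * (1 - hs[i]) * Rt_i
        Rs_i = v["b_p"] + v["W_ph"] @ hs[i] + params["W_ph"] @ Rh_prev
        Rt.append(Rt_i)
        Ry.append(ys[i] * (1 - ys[i]) * Rs_i)
    # Backward pass (lines 11-24).
    out = {k: np.zeros_like(p) for k, p in params.items()}
    dt_next = np.zeros_like(params["b_h"])  # plays the role of the t_{T+1} term
    for i in reversed(range(T)):
        dh = params["W_hh"].T @ dt_next + params["W_ph"].T @ Ry[i]
        dt = (1 + hs[i]) * (1 - hs[i]) * dh
        dt = dt + damping * (1 + hs[i]) * (1 - hs[i]) * Rt[i]  # line 17
        out["W_ph"] += np.outer(Ry[i], hs[i])
        out["W_hh"] += np.outer(dt_next, hs[i])
        out["W_hx"] += np.outer(dt, xs[i])
        out["b_h"] += dt
        out["b_p"] += Ry[i]
        dt_next = dt
    out["b_init"] = dt_next
    return out

def inner(a, b):
    """Inner product of two parameter-shaped dicts."""
    return sum(float(np.sum(a[k] * b[k])) for k in a)

# Toy problem with hypothetical sizes: 3 inputs, 5 hidden units, 2 outputs, T = 4.
rng = np.random.default_rng(0)
shapes = {"W_hx": (5, 3), "W_hh": (5, 5), "W_ph": (2, 5),
          "b_h": (5,), "b_p": (2,), "b_init": (5,)}
params = {k: 0.5 * rng.standard_normal(s) for k, s in shapes.items()}
u = {k: rng.standard_normal(s) for k, s in shapes.items()}
v = {k: rng.standard_normal(s) for k, s in shapes.items()}
xs = [rng.standard_normal(3) for _ in range(4)]
hs, ys = forward(params, xs)

Gu = gn_vector_product(params, u, xs, hs, ys, damping=0.1)
Gv = gn_vector_product(params, v, xs, hs, ys, damping=0.1)
uGv, vGu, vGv = inner(u, Gv), inner(v, Gu), inner(v, Gv)
```

Because the product has the form of a Jacobian transpose times a positive semi-definite matrix times a Jacobian, a cheap sanity check on any transcription of this kind is that `uGv` and `vGu` agree up to rounding and that `vGv` is non-negative.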


2 Details of the pathological synthetic problems

We begin by describing the experimental setup that was used in all the pathological problems. In every experiment, the RNN had 100 hidden units and a little over 10,000 parameters. It was initialized with a sparse initialization [Martens, 2010]: each weight matrix ($W_{hx}$, $W_{hh}$, and $W_{ho}$) is sparsely initialized so that each unit is connected to 15 other units. The nonzero connections of $W_{hh}$ and $W_{ho}$, as well as the biases, are sampled independently from a Gaussian with mean 0 and variance 1/15, while the nonzero connections of $W_{hx}$ are sampled independently from a Gaussian with mean 0 and variance 1. These constants were chosen so that the initial gradient wouldn’t vanish or explode too strongly for the optimizer to cope with. The occasional failures of the HF approach for some random seeds on a few of the harder synthetic problems were likely due to a deficiency in this initialization scheme, and we are currently investigating ones that may be more robust.
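The initialization described above could be sketched in NumPy as follows. This is a hedged illustration rather than the code used in the experiments: the helper name `sparse_init` is hypothetical, "connected to 15 other units" is read here as 15 nonzero incoming weights per unit, the fan-in is capped at the number of available inputs when there are fewer than 15, and the input dimension of 2 is chosen purely for the example.

```python
import numpy as np

def sparse_init(n_out, n_in, n_connections=15, variance=1.0 / 15, rng=None):
    """Return an (n_out, n_in) matrix with at most `n_connections` nonzero
    Gaussian entries per row (assumed reading: incoming connections per unit)."""
    rng = np.random.default_rng() if rng is None else rng
    W = np.zeros((n_out, n_in))
    k = min(n_connections, n_in)  # assumption: cap fan-in at the input size
    for row in range(n_out):
        cols = rng.choice(n_in, size=k, replace=False)
        W[row, cols] = rng.normal(0.0, np.sqrt(variance), size=k)
    return W

rng = np.random.default_rng(0)
n_hidden, n_input = 100, 2  # the input size here is a hypothetical choice

W_hh = sparse_init(n_hidden, n_hidden, variance=1.0 / 15, rng=rng)
W_hx = sparse_init(n_hidden, n_input, variance=1.0, rng=rng)
b_h = rng.normal(0.0, np.sqrt(1.0 / 15), size=n_hidden)  # biases: variance 1/15
```

Note that `rng.normal` takes a standard deviation, so the variances quoted in the text are passed through a square root.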

The gradient was computed on 10,000 sequences, of which 1,000 were used to evaluate the curvature...