We present a new algorithm for nding low complexity neural networks with high generalization capability. The algorithm searches for a \ at" minimum of the error function. A at minimum is a large connected region in weight-space where the error remains approximately constant. An MDL-based, Bayesian argument suggests that at minima correspond to \simple" networks and low expected over tting. The argument is based on a Gibbs algorithm variant and a novel way of splitting generalization error into under tting and over tting error. Unlike many previous approaches, ours does not require Gaussian assumptions and does not depend on a \good" weight prior { instead we have a prior over input/output functions, thus taking into account net architecture and training set. Although our algorithm requires the computation of second order derivatives, it has backprop's order of complexity. Automatically, it e ectively prunes units, weights, and input lines. Various experiments with feedforward and recurrent nets are described. In an application to stock market prediction, at minimum search outperforms (1) conventional backprop, (2) weight decay, (3) \optimal brain surgeon" / \optimal brain damage". We also provide pseudo code of the algorithm (omitted from the NC-version).The appendix presents a detailed theoretical justi cation of our approach. Using a variant of the Gibbs algorithm, appendix A.1 de nes generalization, under tting and over tting error in a novel way. By de ning an appropriate prior over input-output functions, we postulate that the most probable network is a \ at" one. Appendix A.2 formally justi es the error function minimized by our algorithm. Appendix A.3 describes an e cient implementation of the algorithm. Appendix A.4 nally presents pseudo code of the algorithm.
TASK / ARCHITECTURE / BOXESGeneralization task. The task is to approximate an unknown function f X Y mapping a nite set of possible inputs X R N to a nite set of possible outputs Y R K . A data set D is obtained from f (see appendix A.1). All training information is given by a nite set D 0 D. D 0 is called the training set. The pth element of D 0 is denoted by an input/target pair (x p ; y p ).Architecture/ Net functions. For simplicity, we will focus on a standard feedforward net (but in the experiments, we will use recurrent nets as well). The net has N input units, K output units, L weights, and di erentiable activation functions. It maps input vectors x 2 R N to output vectors o(w; x) 2 R K , where w is the L-dimensional weight vector, and the weight on the connection from unit j to i is denoted w ij . The net function induced by w is denoted net(w): for x 2 R N , net(w)(x) = o(w; x) = o 1 (w; x); o 2 (w; x); : : : ; o K 1 (w; x); o K (w; x) , where o i (w; x) denotes the i-th component of o(w; x), corresponding to output unit i. Training error. We use squared error E(net(w); D 0 ) := P (xp;yp)2D0 k y p o(w; x p ) k 2 , where k : k denotes the Euclidean norm.Tolerable error. To
translated by 谷歌翻译