This letter introduces a novel algorithm for training deep neural networks with many nonlinear layers. Our method uses an approximation of the integrated gradient, averaged over the range of the weight update, to capture more accurately the loss change resulting from a parameter update. Compared with the standard gradient, this average gradient improves learning efficiency in certain scenarios. We incorporate the approximated average gradient into RMSProp and compare the resulting algorithm to conventional RMSProp and Adam. We evaluate the approach on deep models with many nonlinear activations and no skip connections or other residual structure, where traditional methods typically encounter difficulties. These models, which focus on extracting high-order features, create a loss landscape more akin to that of a biological brain. Our method requires significantly fewer iterations to reach a target training loss on the MNIST, Fashion MNIST, and IMDb benchmarks for both convolutional and fully connected architectures across different initialization schemes. While our approach incurs moderately higher computational and memory costs than standard RMSProp, its performance on shallow models remains comparable. Nevertheless, our main contributions are (1) introducing the average gradient concept as an efficient alternative to computing high-order derivatives, (2) offering a novel factorization formula for approximating the average gradient, accompanied by a formal derivation, and (3) showing an example algorithm that leverages this formula to enhance the efficiency of RMSProp for some models, as validated by our evaluation.
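The idea of a gradient averaged over the range of a weight update can be illustrated numerically. The sketch below is not the factorization formula from the letter, which the abstract does not state. It assumes the average gradient is the integral of the gradient along the straight segment from w to w + delta, and approximates that integral with ordinary Gauss-Legendre quadrature. The toy loss, the function names, and the values of w and delta are all hypothetical and chosen only for illustration.

```python
import numpy as np


def loss(w):
    # Hypothetical non-quadratic toy loss (illustration only).
    return float(np.sum(w ** 4) + np.sum(np.sin(w)))


def grad(w):
    # Analytic gradient of the toy loss above.
    return 4.0 * w ** 3 + np.cos(w)


def average_gradient(grad_fn, w, delta, num_points=8):
    """Approximate the integral of grad_fn(w + t * delta) for t in [0, 1].

    Uses Gauss-Legendre quadrature as a stand-in; the letter's own
    approximation is a different (factorized) formula.
    """
    nodes, weights = np.polynomial.legendre.leggauss(num_points)
    t = 0.5 * (nodes + 1.0)      # map nodes from [-1, 1] to [0, 1]
    scaled = 0.5 * weights       # rescale weights for the new interval
    return sum(c * grad_fn(w + s * delta) for s, c in zip(t, scaled))


w = np.array([0.5, -1.0, 2.0])        # assumed current weights
delta = np.array([-0.3, 0.4, -0.5])   # assumed weight update

true_change = loss(w + delta) - loss(w)
first_order = float(grad(w) @ delta)                       # standard gradient
averaged = float(average_gradient(grad, w, delta) @ delta)  # average gradient

print(f"true loss change:        {true_change:.6f}")
print(f"standard-gradient guess: {first_order:.6f}")
print(f"average-gradient guess:  {averaged:.6f}")
```

Under these assumptions the average-gradient estimate matches the true loss change up to quadrature error, while the standard-gradient estimate is only a first-order approximation and drifts as the update grows. Quadrature needs several extra gradient evaluations per step, which is the kind of cost a cheaper approximation would aim to avoid.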
Wolniak et al. studied this question.
www.synapsesocial.com/papers/69d893896c1944d70ce04798 — DOI: https://doi.org/10.1162/neco.a.1514
Rafał Wolniak
Bożena Kostek
Neural Computation
Gdańsk University of Technology