Efficient Quasi-Newton Methods in Trust-Region Frameworks for Training Deep Neural Networks / Yousefi, Mahsa. - (2023 Sep 28).
Efficient Quasi-Newton Methods in Trust-Region Frameworks for Training Deep Neural Networks
YOUSEFI, MAHSA
2023-09-28
Abstract
Deep Learning (DL), which relies on Deep Neural Networks (DNNs), has gained significant popularity in Machine Learning (ML) due to its wide range of applications across various domains. DL applications typically involve large-scale, highly nonlinear, and non-convex optimization problems. The objective of these problems, often expressed as a finite-sum function, is to minimize the overall prediction error by optimizing the parameters of the neural network. For solving such DL optimization problems, i.e., for training DNNs, stochastic second-order methods have recently attracted much attention. These methods exploit curvature information about the objective function and employ practical subsampling schemes to approximately evaluate the objective function and its gradient on random subsets of the available training data. Within this context, active research focuses on Quasi-Newton strategies within both line-search and trust-region optimization frameworks. A trust-region approach is often preferred over a line-search approach because it can make progress even when some iterates are rejected, and because it is compatible with both positive definite and indefinite Hessian approximations. Focusing on Quasi-Newton Hessian approximations, this thesis studies two classes of second-order trust-region methods, in stochastic variants, for training DNNs. In the first class, that of standard trust-region methods, we consider the well-known limited-memory Quasi-Newton Hessian approximations L-BFGS and L-SR1 and apply a half-overlapping subsampling scheme to the computations. We present an extensive experimental study of the resulting methods, discussing how various factors affect the training of different DNNs and filling a gap in the literature regarding which method yields more effective training. We then present a modified L-BFGS trust-region method, obtained through a simple modification of the secant condition that enhances the curvature information of the objective function, and extend it to a stochastic setting for training tasks. Finally, we devise a novel stochastic method that combines a trust-region L-SR1 second-order direction with a first-order variance-reduced stochastic gradient. In the second class, our focus is on extending standard trust-region methods to non-monotone and stochastic settings. Using regular fixed-size subsampling, we investigate the efficiency of a non-monotone L-SR1 trust-region method for training, considering different approaches for computing the curvature information. Lastly, we propose a non-monotone trust-region algorithm that employs an additional sampling strategy to control the error in function and gradient approximations caused by subsampling. This novel method features an adaptive sample-size procedure and achieves almost sure convergence under standard assumptions. The efficiency of the algorithms presented in this thesis, all implemented in MATLAB, is assessed by training different DNNs on problems such as image recognition and regression and by comparing their performance with well-known first- and second-order methods, including Adam and STORM.
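For context, the block below gives a minimal LaTeX sketch of the standard finite-sum objective and the subsampled trust-region subproblem that the abstract refers to; the notation (w, f_i, S_k, B_k, delta_k) is illustrative and not taken verbatim from the thesis.

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Finite-sum training objective over n samples; w collects the DNN parameters
% and f_i is the loss on the i-th training example (illustrative notation).
\begin{equation*}
  \min_{w \in \mathbb{R}^d} \; F(w) \;=\; \frac{1}{n} \sum_{i=1}^{n} f_i(w).
\end{equation*}
% Subsampled trust-region subproblem at iterate w_k: g_{S_k} is the gradient
% averaged over a random subset S_k of the data, B_k is a limited-memory
% quasi-Newton (L-BFGS or L-SR1) approximation of the Hessian, and \delta_k
% is the trust-region radius.
\begin{equation*}
  \min_{p \in \mathbb{R}^d} \; m_k(p) \;=\; g_{S_k}^{\top} p
  + \tfrac{1}{2}\, p^{\top} B_k p
  \quad \text{s.t.} \quad \|p\|_2 \le \delta_k,
  \qquad g_{S_k} = \frac{1}{|S_k|} \sum_{i \in S_k} \nabla f_i(w_k).
\end{equation*}
\end{document}
```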
| File | Access | Description | Type | Size | Format |
|---|---|---|---|---|---|
| PhDThesis_MahsaYousefi.pdf | Open access | PhD Thesis | Doctoral thesis | 8.54 MB | Adobe PDF |