431 result(s) for "stochastic gradient methods"
Optimization Methods for Large-Scale Machine Learning
This paper provides a review and commentary on the past, present, and future of numerical optimization algorithms in the context of machine learning applications. Through case studies on text classification and the training of deep neural networks, we discuss how optimization problems arise in machine learning and what makes them challenging. A major theme of our study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient (SG) method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter. Based on this viewpoint, we present a comprehensive theory of a straightforward, yet versatile SG algorithm, discuss its practical behavior, and highlight opportunities for designing algorithms with improved performance. This leads to a discussion about the next generation of optimization methods for large-scale machine learning, including an investigation of two main streams of research on techniques that diminish noise in the stochastic directions and methods that make use of second-order derivative approximations.
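The stochastic gradient (SG) method this survey centers on reduces, in its basic form, to a very short loop. A minimal NumPy sketch on a synthetic least-squares problem (the toy data and the diminishing 1/k step size are illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.1 * rng.normal(size=n)

x = np.zeros(d)
for k in range(1, 5001):
    i = rng.integers(n)                 # draw one data point at random
    g = (A[i] @ x - b[i]) * A[i]        # stochastic gradient of (1/2)(a_i.x - b_i)^2
    x -= (1.0 / k) * g                  # classical diminishing step size
print(np.linalg.norm(x - x_true))       # distance to the true solution
```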
Minimizing finite sums with the stochastic average gradient
We analyze the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method's iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values, the SAG method achieves a faster convergence rate than black-box SG methods. The convergence rate is improved from O(1/√k) to O(1/k) in general, and when the sum is strongly convex the convergence rate is improved from the sub-linear O(1/k) to a linear convergence rate of the form O(ρ^k) for ρ < 1. Further, in many cases the convergence rate of the new method is also faster than black-box deterministic gradient methods, in terms of the number of gradient evaluations. This extends our earlier work (Le Roux et al., Adv Neural Inf Process Syst, 2012), which only led to a faster rate for well-conditioned strongly convex problems. Numerical experiments indicate that the new algorithm often dramatically outperforms existing SG and deterministic gradient methods, and that the performance may be further improved through the use of non-uniform sampling strategies.
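The "memory of previous gradient values" is the heart of SAG: keep the last gradient seen for each term and step along their average. A rough sketch under illustrative assumptions (least-squares terms, and a step size based on the largest per-term smoothness constant):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d)

L_max = np.max(np.sum(A ** 2, axis=1))   # smoothness constant of the worst term
alpha = 1.0 / (16 * L_max)               # conservative step size in the spirit of SAG theory

x = np.zeros(d)
y = np.zeros((n, d))      # memory: last gradient evaluated for each term
g_sum = np.zeros(d)       # running sum of the stored gradients

for _ in range(30 * n):
    j = rng.integers(n)
    g_new = (A[j] @ x - b[j]) * A[j]     # fresh gradient of term j only
    g_sum += g_new - y[j]                # swap it into the stored sum
    y[j] = g_new
    x -= alpha * g_sum / n               # step along the average of all stored gradients
```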
Deep Learning: An Introduction for Applied Mathematicians
Multilayered artificial neural networks are becoming a pervasive tool in a host of application fields. At the heart of this deep learning revolution are familiar concepts from applied and computational mathematics, notably from calculus, approximation theory, optimization, and linear algebra. This article provides a very brief introduction to the basic ideas that underlie deep learning from an applied mathematics perspective. Our target audience includes postgraduate and final-year undergraduate students in mathematics who are keen to learn about the area. The article may also be useful for instructors in mathematics who wish to enliven their classes with references to the application of deep learning techniques. We focus on three fundamental questions: What is a deep neural network? How is a network trained? What is the stochastic gradient method? We illustrate the ideas with a short MATLAB code that sets up and trains a network. We also demonstrate the use of state-of-the-art software on a large-scale image classification problem. We finish with references to the current literature.
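The article's demonstration code is in MATLAB; a loose Python analogue of the same idea trains a tiny one-hidden-layer sigmoid network with single-sample stochastic gradient steps on made-up 2-D data (all sizes and the learning rate here are illustrative, not the article's values):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # toy 2-D inputs
y = (X[:, 0] * X[:, 1] > 0).astype(float)     # labels that are not linearly separable

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))    # sigmoid activation
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8) # one hidden layer of width 8
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

eta = 0.5
for _ in range(20000):
    i = rng.integers(len(X))                  # pick one training point (the SG step)
    a1 = sigma(X[i] @ W1 + b1)                # forward pass
    out = sigma(a1 @ W2.flatten() + b2)[0]
    # backward pass for the squared loss (out - y_i)^2 / 2
    d2 = (out - y[i]) * out * (1 - out)
    d1 = (W2.flatten() * d2) * a1 * (1 - a1)
    W2 -= eta * np.outer(a1, d2); b2 -= eta * d2
    W1 -= eta * np.outer(X[i], d1); b1 -= eta * d1
```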
An Effective Optimization Method for Machine Learning Based on ADAM
A machine is trained by finding the minimum of a cost function induced by the learning data. Unfortunately, as the amount of learning data grows, so do the non-linearity of the activation functions in the artificial neural network (ANN), the complexity of the network structure, and the non-convexity of the cost function. A non-convex function has local minima, and the first derivative of the cost function is zero at a local minimum. Methods based on gradient descent therefore make no further progress once they fall into a local minimum, because they rely only on the first derivative of the cost function. This paper introduces a novel optimization method to make machine learning more efficient; in other words, we construct an effective optimization method for non-convex cost functions. The proposed method avoids getting trapped in a local minimum by adding the cost function to the parameter update rule of the ADAM method. We prove the convergence of the sequences generated by the proposed method, and we demonstrate its superiority by numerical comparison with gradient descent (GD), ADAM, and AdaMax.
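The abstract does not spell out how the cost-function value enters the modified update rule, so the sketch below shows only the standard ADAM baseline that the proposed method modifies, applied to a toy non-convex scalar function (the function and hyperparameters are illustrative):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One standard ADAM update (the baseline the paper builds on)."""
    m = b1 * m + (1 - b1) * grad           # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - b1 ** t)              # bias corrections
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# usage on the non-convex function f(x) = x^4 - 3x^2 + x
theta, m, v = np.array([2.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    grad = 4 * theta ** 3 - 6 * theta + 1  # f'(theta)
    theta, m, v = adam_step(theta, grad, m, v, t)
```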
DAoG: decayed adaptation over gradients for parameter-free step size control
As the scale of parameters in deep learning models continues to grow, the cost of training such models increases accordingly, posing increasingly significant challenges for stochastic optimization methods. A central issue in gradient-based optimization lies in the selection of the step size, whose appropriateness directly affects training efficiency and model performance. To address this issue, a series of parameter-free optimization methods that do not require manual tuning of the step size have been proposed in recent years. Among them, DoG and its improved variant DoWG are the most representative. Despite demonstrating strong performance across various tasks, DoG and DoWG still suffer from performance instability or slow convergence under certain model architectures or training conditions. This paper introduces Decayed Adaptation over Gradients (DAoG), a novel parameter-free optimization method that systematically addresses these limitations. Our key innovation lies in incorporating a principled step size decay mechanism for the first time within the parameter-free optimization framework, which substantially enhances both optimization stability and model generalization. Additionally, a parameter compression strategy is employed to reduce sensitivity to the initial step size. Theoretical analysis demonstrates that DAoG exhibits favorable convergence properties under L-smooth and G-Lipschitz conditions. Empirical studies across representative tasks in natural language processing and computer vision demonstrate that DAoG outperforms both DoG and DoWG in terms of convergence speed and generalization performance. Notably, it even rivals or surpasses Adam with cosine annealing in several challenging scenarios. These theoretical and experimental results suggest that DAoG effectively mitigates the overly conservative step size issue in DoG and the instability problem in DoWG, thereby advancing the development of parameter-free optimization methods in deep learning.
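DAoG's decay mechanism is not specified in the abstract, but the DoG baseline it improves on is well documented: the step size is the maximum distance travelled from the starting point, divided by the root of the accumulated squared gradient norms. A sketch of that baseline only (the toy quadratic and the tiny initial distance r_eps are illustrative conventions):

```python
import numpy as np

def dog(grad_fn, x0, steps=1000, r_eps=1e-6):
    """Parameter-free DoG step sizes: eta_t = rbar_t / sqrt(sum of squared grad norms)."""
    x = x0.copy()
    rbar = r_eps                  # initial distance estimate, tiny by convention
    g_sq_sum = 0.0
    for _ in range(steps):
        g = grad_fn(x)
        g_sq_sum += np.dot(g, g)
        eta = rbar / np.sqrt(g_sq_sum + 1e-12)
        x = x - eta * g
        rbar = max(rbar, np.linalg.norm(x - x0))  # max distance travelled from x0
    return x

# usage on a toy quadratic: minimize ||x - 1||^2 with no hand-tuned step size
x_star = dog(lambda x: 2 * (x - np.ones(3)), np.zeros(3))
```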
Adjoint-Based Calibration of Nonlinear Stochastic Differential Equations
To study the nonlinear properties of complex natural phenomena, the evolution of the quantity of interest can often be represented by systems of coupled nonlinear stochastic differential equations (SDEs). These SDEs typically contain several parameters that have to be chosen carefully to match the experimental data and to validate the effectiveness of the model. In the present paper the calibration of these parameters is described by nonlinear SDE-constrained optimization problems. In the optimize-before-discretize setting a rigorous analysis is carried out to ensure the existence of optimal solutions and to derive necessary first-order optimality conditions. For the numerical solution a Monte Carlo method is applied, using parallelization strategies to compensate for the high computational cost. In the numerical examples an Ornstein–Uhlenbeck and a stochastic Prandtl–Tomlinson bath model are considered.
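For the Ornstein–Uhlenbeck example, the Monte Carlo forward solve amounts to simulating many Euler–Maruyama paths of dX = θ(μ − X) dt + σ dW; calibration would then adjust (θ, μ, σ) to match observed statistics. A sketch of the forward simulation only (all parameter values are made up, and this is not the paper's adjoint machinery):

```python
import numpy as np

def simulate_ou(theta, mu, sigma, x0=0.0, T=1.0, n_steps=1000, n_paths=500, seed=0):
    """Euler-Maruyama Monte Carlo simulation of dX = theta*(mu - X) dt + sigma dW."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    X = np.full(n_paths, x0)
    for _ in range(n_steps):
        dW = rng.normal(scale=np.sqrt(dt), size=n_paths)  # Brownian increments
        X = X + theta * (mu - X) * dt + sigma * dW
    return X   # terminal values across the Monte Carlo paths

paths = simulate_ou(theta=1.5, mu=0.3, sigma=0.2)
print(paths.mean(), paths.var())   # statistics to match against data when calibrating
```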
Momentum-Based Variance-Reduced Proximal Stochastic Gradient Method for Composite Nonconvex Stochastic Optimization
Stochastic gradient methods (SGMs) have been extensively used for solving stochastic problems or large-scale machine learning problems. Recent works employ various techniques to improve the convergence rate of SGMs for both convex and nonconvex cases. Most of them require a large number of samples in some or all iterations of the improved SGMs. In this paper, we propose a new SGM, named PStorm, for solving nonconvex nonsmooth stochastic problems. With a momentum-based variance reduction technique, PStorm can achieve the optimal complexity result O(ε^{-3}) to produce a stochastic ε-stationary solution, if a mean-squared smoothness condition holds. Different from existing optimal methods, PStorm can achieve the O(ε^{-3}) result by using only one or O(1) samples in every update. With this property, PStorm can be applied to online learning problems that favor real-time decisions based on one or O(1) new observations. In addition, for large-scale machine learning problems, PStorm can generalize better by small-batch training than other optimal methods that require large-batch training and the vanilla SGM, as we demonstrate on training a sparse fully-connected neural network and a sparse convolutional neural network.
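PStorm's single-sample property comes from a STORM-style momentum estimator combined with a proximal step for the nonsmooth part. A sketch of that general pattern on an ℓ1-regularized least-squares problem (the problem and the values of the step size η and momentum weight β are illustrative, not the paper's tuned choices):

```python
import numpy as np

def prox_l1(x, lam):
    """Proximal operator of lam * ||x||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(0)
n, dim = 1000, 20
A = rng.normal(size=(n, dim))
b = A @ (rng.normal(size=dim) * (rng.random(dim) < 0.3)) + 0.05 * rng.normal(size=n)

def stoch_grad(x, i):
    return (A[i] @ x - b[i]) * A[i]      # one-sample gradient of the smooth part

x, x_prev, d = np.zeros(dim), np.zeros(dim), np.zeros(dim)
eta, beta, lam = 0.01, 0.1, 0.01
for t in range(5000):
    i = rng.integers(n)                  # a single fresh sample per update
    if t == 0:
        d = stoch_grad(x, i)
    else:
        # STORM-style estimator: correct the old direction using the same sample
        d = stoch_grad(x, i) + (1 - beta) * (d - stoch_grad(x_prev, i))
    x_prev = x.copy()
    x = prox_l1(x - eta * d, eta * lam)  # proximal step handles the nonsmooth l1 term
```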
Statistical inference for the population landscape via moment-adjusted stochastic gradients
Modern statistical inference tasks often require iterative optimization methods to compute the solution. Convergence analysis from an optimization viewpoint informs us only how well the solution is approximated numerically but overlooks the sampling nature of the data. In contrast, recognizing the randomness in the data, statisticians are keen to provide uncertainty quantification, or confidence, for the solution obtained by using iterative optimization methods. The paper makes progress in this direction by introducing moment-adjusted stochastic gradient descent: a new stochastic optimization method for statistical inference. We establish non-asymptotic theory that characterizes the statistical distribution for certain iterative methods with optimization guarantees. On the statistical front, the theory allows for model mis-specification, with very mild conditions on the data. For optimization, the theory is flexible for both convex and non-convex cases. Remarkably, the moment adjusting idea, motivated by 'error standardization' in statistics, achieves a similar effect to acceleration in first-order optimization methods that are used to fit generalized linear models. We also demonstrate this acceleration effect in the non-convex setting through numerical experiments.
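One plausible reading of "moment adjusting" is to standardize the stochastic gradient by an inverse square root of a running second-moment estimate before stepping; the abstract does not give the paper's precise update rule, so everything in this sketch beyond plain SGD is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=n)

theta = np.zeros(d)
S = np.eye(d)             # running estimate of the gradient second-moment matrix (assumed form)
eta, rho = 0.05, 0.01
for t in range(1, 20001):
    i = rng.integers(n)
    g = (X[i] @ theta - y[i]) * X[i]          # stochastic gradient (least squares)
    S = (1 - rho) * S + rho * np.outer(g, g)  # track second moments of the gradient noise
    # "standardize" the gradient by the inverse square root of S (illustrative adjustment)
    w, V = np.linalg.eigh(S)
    S_inv_half = V @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-8))) @ V.T
    theta -= (eta / np.sqrt(t)) * S_inv_half @ g
```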
Near-optimal stochastic approximation for online principal component estimation
Principal component analysis (PCA) has been a prominent tool for high-dimensional data analysis. Online algorithms that estimate the principal component by processing streaming data are of tremendous practical and theoretical interest. Despite its rich applications, theoretical convergence analysis remains largely open. In this paper, we cast online PCA as a stochastic nonconvex optimization problem, and we analyze the online PCA algorithm as a stochastic approximation iteration. The stochastic approximation iteration processes data points incrementally and maintains a running estimate of the principal component. We prove for the first time a nearly optimal finite-sample error bound for the online PCA algorithm. Under the subgaussian assumption, we show that the finite-sample error bound closely matches the minimax information lower bound.
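The classical stochastic approximation iteration for the top principal component is Oja's rule: nudge the estimate toward x(xᵀw) and renormalize after each streaming sample. A sketch on synthetic streaming data with a known covariance (the step-size schedule is an illustrative choice, not the one analyzed in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
cov = np.diag([5.0] + [1.0] * (d - 1))   # top eigenvector is e_1, with a clear eigengap
L = np.linalg.cholesky(cov)

w = rng.normal(size=d)
w /= np.linalg.norm(w)
for t in range(1, 50001):
    x = L @ rng.normal(size=d)           # one streaming sample from the distribution
    w += (1.0 / (t + 100)) * x * (x @ w) # Oja update: w += eta_t * x x^T w
    w /= np.linalg.norm(w)               # project back to the unit sphere
print(abs(w[0]))                         # alignment with the true top component
```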