Catalogue Search | MBRL

Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods

by Loizou, Nicolas , Richtárik, Peter in Algorithms , Ascent , Asymptotic methods

2020

In this paper we study several classes of stochastic optimization algorithms enriched with heavy ball momentum. Among the methods studied are: stochastic gradient descent, stochastic Newton, stochastic proximal point and stochastic dual subspace ascent. This is the first time momentum variants of several of these methods are studied. We choose to perform our analysis in a setting in which all of the above methods are equivalent: convex quadratic problems. We prove global non-asymptotic linear convergence rates for all methods and various measures of success, including primal function values, primal iterates, and dual function values. We also show that the primal iterates converge at an accelerated linear rate in a somewhat weaker sense. This is the first time a linear rate is shown for the stochastic heavy ball method (i.e., stochastic gradient descent method with momentum). Under somewhat weaker conditions, we establish a sublinear convergence rate for Cesàro averages of primal iterates. Moreover, we propose a novel concept, which we call stochastic momentum, aimed at decreasing the cost of performing the momentum step. We prove linear convergence of several stochastic methods with stochastic momentum, and show that in some sparse data regimes and for sufficiently small momentum parameters, these methods enjoy better overall complexity than methods with deterministic momentum. Finally, we perform extensive numerical testing on artificial and real datasets, including data coming from average consensus problems.
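As a rough illustration of the heavy ball idea described in this abstract, the sketch below applies a momentum term to randomized Kaczmarz, which can be viewed as stochastic gradient descent on a consistent linear system (a convex quadratic problem). The function name, the relaxation parameter `omega`, the momentum parameter `beta`, and the test matrix are illustrative assumptions, not taken from the paper.

```python
import random


def kaczmarz_heavy_ball(A, b, omega=1.0, beta=0.1, iters=2000, seed=0):
    """Randomized Kaczmarz with a heavy ball momentum term (illustrative sketch).

    Update: x_next = x - omega * (a_i.x - b_i) / ||a_i||^2 * a_i + beta * (x - x_prev)
    Assumes A x = b is consistent and every row of A is nonzero.
    """
    rng = random.Random(seed)
    n = len(A[0])
    x_prev = [0.0] * n
    x = [0.0] * n
    for _ in range(iters):
        i = rng.randrange(len(A))          # sample one row uniformly at random
        row = A[i]
        residual = sum(a * xj for a, xj in zip(row, x)) - b[i]
        scale = omega * residual / sum(a * a for a in row)
        # stochastic gradient (projection) step plus momentum beta * (x - x_prev)
        x_next = [xj - scale * a + beta * (xj - xp)
                  for xj, a, xp in zip(x, row, x_prev)]
        x_prev, x = x, x_next
    return x


# Small consistent system with a known solution (made-up data).
A = [[2.0, 1.0], [1.0, 3.0], [1.0, -1.0]]
x_true = [1.0, -2.0]
b = [sum(a * t for a, t in zip(row, x_true)) for row in A]
x_hat = kaczmarz_heavy_ball(A, b)
print(max(abs(u - v) for u, v in zip(x_hat, x_true)))
```

Setting `beta=0.0` recovers plain randomized Kaczmarz; the small momentum value used here is only a conservative choice for the toy problem.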

Journal Article

Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function

by Richtárik, Peter , Takáč, Martin in Blocking , Calculus of Variations and Optimal Control; Optimization , Combinatorics

2014

In this paper we develop a randomized block-coordinate descent method for minimizing the sum of a smooth and a simple nonsmooth block-separable convex function and prove that it obtains an ε-accurate solution with probability at least 1 − ρ in at most O((n/ε) log(1/ρ)) iterations, where n is the number of blocks. This extends recent results of Nesterov (SIAM J Optim 22(2): 341–362, 2012), which cover the smooth case, to composite minimization, while at the same time improving the complexity by the factor of 4 and removing ε from the logarithmic term. More importantly, in contrast with the aforementioned work in which the author achieves the results by applying the method to a regularized version of the objective function with an unknown scaling factor, we show that this is not necessary, thus achieving the first true iteration complexity bounds. For strongly convex functions the method converges linearly. In the smooth case we also allow for arbitrary probability vectors and non-Euclidean norms. Finally, we demonstrate numerically that the algorithm is able to solve huge-scale ℓ1-regularized least squares problems with a billion variables.
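To make the composite setting concrete, here is a minimal sketch of randomized coordinate descent (blocks of size one, uniform sampling) for a least squares problem with an ℓ1 penalty, where the nonsmooth part is handled by soft-thresholding. The function names, the uniform sampling, and the toy data are assumptions for illustration and are not claimed to match the paper's implementation.

```python
import random


def soft_threshold(z, t):
    """Proximal operator of t * |.| evaluated at z."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def rcd_lasso(A, b, lam, iters=500, seed=0):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||x||_1 one random coordinate at a time.

    Assumes every column of A is nonzero (so each coordinate constant L[j] > 0).
    """
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    # Coordinate-wise Lipschitz constants of the smooth part: squared column norms.
    L = [sum(A[i][j] ** 2 for i in range(m)) for j in range(n)]
    x = [0.0] * n
    r = [-bi for bi in b]                  # residual A x - b at x = 0
    for _ in range(iters):
        j = rng.randrange(n)               # pick a coordinate uniformly at random
        g = sum(A[i][j] * r[i] for i in range(m))   # partial derivative of smooth part
        new_xj = soft_threshold(x[j] - g / L[j], lam / L[j])
        delta = new_xj - x[j]
        if delta != 0.0:
            for i in range(m):             # keep the residual in sync with x
                r[i] += A[i][j] * delta
            x[j] = new_xj
    return x


# Toy diagonal problem (made-up data) whose minimizer can be checked by hand.
print(rcd_lasso([[2.0, 0.0], [0.0, 1.0]], [4.0, 0.5], lam=1.0))
```

Maintaining the residual incrementally keeps each update at the cost of one column of `A`, which is the property that makes coordinate methods attractive at large scale.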

Journal Article

Randomized Iterative Methods for Linear Systems

by Richtárik, Peter , Gower, Robert M.

2015

Journal Article

A Randomized Exchange Algorithm for Computing Optimal Approximate Designs of Experiments

by Harman, Radoslav , Filová, Lenka , Richtárik, Peter in A-optimality , Algorithms , Approximation

2020

We propose a class of subspace ascent methods for computing optimal approximate designs that covers existing algorithms as well as new and more efficient ones. Within this class of methods, we construct a simple, randomized exchange algorithm (REX). Numerical comparisons suggest that the performance of REX is comparable or superior to that of state-of-the-art methods across a broad range of problem structures and sizes. We focus on the most commonly used criterion of D-optimality, which also has applications beyond experimental design, such as the construction of the minimum-volume ellipsoid containing a given set of data points. For D-optimality, we prove that the proposed algorithm converges to the optimum. We also provide formulas for the optimal exchange of weights in the case of the criterion of A-optimality, which enable one to use REX and some other algorithms for computing A-optimal and I-optimal designs. Supplementary materials for this article are available online.

Journal Article

Randomized Quasi-Newton Updates Are Linearly Convergent Matrix Inversion Algorithms

by Richtárik, Peter , Gower, Robert M.

2017

Journal Article

Stochastic Reformulations of Linear Systems: Algorithms and Convergence Theory

by Richtárik, Peter , Takáč, Martin

2020

Journal Article

Dualize, Split, Randomize: Toward Fast Nonsmooth Optimization Algorithms

by Richtárik, Peter , Condat, Laurent , Mishchenko, Konstantin in Algorithms , Convergence , Convexity

2022

We consider minimizing the sum of three convex functions, where the first one, F, is smooth, the second one is nonsmooth and proximable, and the third one is the composition of a nonsmooth proximable function with a linear operator L. This template problem has many applications, for instance, in image processing and machine learning. First, we propose a new primal–dual algorithm, which we call PDDY, for this problem. It is constructed by applying Davis–Yin splitting to a monotone inclusion in a primal–dual product space, where the operators are monotone under a specific metric depending on L. We show that three existing algorithms (the two forms of the Condat–Vũ algorithm and the PD3O algorithm) have the same structure, so that PDDY is the fourth missing link in this self-consistent class of primal–dual algorithms. This representation eases the convergence analysis: it allows us to derive sublinear convergence rates in general, and linear convergence results in the presence of strong convexity. Moreover, within our broad and flexible analysis framework, we propose new stochastic generalizations of the algorithms, in which a variance-reduced random estimate of the gradient of F is used instead of the true gradient. Furthermore, we obtain, as a special case of PDDY, a linearly converging algorithm for the minimization of a strongly convex function F under a linear constraint; we discuss its important application to decentralized optimization.

Journal Article

Accelerated Bregman proximal gradient methods for relatively smooth convex optimization

by Hanzely, Filip , Richtárik, Peter , Xiao, Lin in Convergence , Convex analysis , Convexity

2021

We consider the problem of minimizing the sum of two convex functions: one is differentiable and relatively smooth with respect to a reference convex function, and the other can be nondifferentiable but simple to optimize. We investigate a triangle scaling property of the Bregman distance generated by the reference convex function and present accelerated Bregman proximal gradient (ABPG) methods that attain an O(k^{-γ}) convergence rate, where γ ∈ (0, 2] is the triangle scaling exponent (TSE) of the Bregman distance. For the Euclidean distance, we have γ = 2 and recover the convergence rate of Nesterov’s accelerated gradient methods. For non-Euclidean Bregman distances, the TSE can be much smaller (say γ ≤ 1), but we show that a relaxed definition of intrinsic TSE is always equal to 2. We exploit the intrinsic TSE to develop adaptive ABPG methods that converge much faster in practice. Although theoretical guarantees on a fast convergence rate seem to be out of reach in general, our methods obtain empirical O(k^{-2}) rates in numerical experiments on several applications and provide posterior numerical certificates for the fast rates.
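The Bregman distance mentioned in this abstract is, in its standard form, D_h(x, y) = h(x) − h(y) − ⟨∇h(y), x − y⟩. The sketch below evaluates it for two common reference functions and numerically checks the Euclidean case of the triangle scaling property, stated here as we understand it: D_h((1−θ)x + θz, (1−θ)x + θz̃) ≤ θ^γ D_h(z, z̃), which holds with equality and γ = 2 for h = ½‖·‖². All function names and sample points are illustrative assumptions.

```python
import math


def bregman(h, grad_h, x, y):
    """Bregman distance D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>."""
    g = grad_h(y)
    return h(x) - h(y) - sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))


def half_sq_norm(x):
    return 0.5 * sum(v * v for v in x)


def half_sq_norm_grad(x):
    return list(x)


def neg_entropy(x):
    # Defined for vectors with strictly positive entries.
    return sum(v * math.log(v) for v in x)


def neg_entropy_grad(x):
    return [math.log(v) + 1.0 for v in x]


def combine(theta, x, z):
    """Convex combination (1 - theta) * x + theta * z."""
    return [(1.0 - theta) * xi + theta * zi for xi, zi in zip(x, z)]


# Euclidean reference function: both sides of the triangle scaling relation
# coincide when gamma = 2 (sample points are made up).
x, z, z_tilde, theta = [1.0, 0.0], [0.0, 2.0], [3.0, 1.0], 0.25
lhs = bregman(half_sq_norm, half_sq_norm_grad,
              combine(theta, x, z), combine(theta, x, z_tilde))
rhs = theta ** 2 * bregman(half_sq_norm, half_sq_norm_grad, z, z_tilde)
print(lhs, rhs)

# Negative entropy reference function: the Bregman distance is the
# (generalized) Kullback-Leibler divergence.
print(bregman(neg_entropy, neg_entropy_grad, [0.2, 0.8], [0.5, 0.5]))
```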

Journal Article

Parallel coordinate descent methods for big data optimization

by Richtárik, Peter , Takáč, Martin in Algorithms , Big Data , Blocking

2016

In this work we show that randomized (block) coordinate descent methods can be accelerated by parallelization when applied to the problem of minimizing the sum of a partially separable smooth convex function and a simple separable convex function. The theoretical speedup, as compared to the serial method, and referring to the number of iterations needed to approximately solve the problem with high probability, is a simple expression depending on the number of parallel processors and a natural and easily computable measure of separability of the smooth component of the objective function. In the worst case, when no degree of separability is present, there may be no speedup; in the best case, when the problem is separable, the speedup is equal to the number of processors. Our analysis also works in the mode when the number of blocks being updated at each iteration is random, which allows for modeling situations with busy or unreliable processors. We show that our algorithm is able to solve a LASSO problem involving a matrix with 20 billion nonzeros in 2 h on a large memory node with 24 cores.

Journal Article

Fastest rates for stochastic mirror descent methods

by Hanzely, Filip , Richtárik, Peter in Algorithms , Computational geometry , Convex analysis

2021

Relative smoothness, a notion introduced in Birnbaum et al. (Proceedings of the 12th ACM conference on electronic commerce, ACM, pp 127–136, 2011) and recently rediscovered in Bauschke et al. (Math Oper Res 330–348, 2016) and Lu et al. (Relatively-smooth convex optimization by first-order methods, and applications, arXiv:1610.05708, 2016), generalizes the standard notion of smoothness typically used in the analysis of gradient-type methods. In this work we take ideas from the well-studied field of stochastic convex optimization and use them to obtain faster algorithms for minimizing relatively smooth functions. We propose and analyze two new algorithms: Relative Randomized Coordinate Descent (relRCD) and Relative Stochastic Gradient Descent (relSGD), both generalizing famous algorithms in the standard smooth setting. The methods we propose can in fact be seen as particular instances of stochastic mirror descent algorithms, which have usually been analyzed under stronger assumptions: Lipschitzness of the objective and strong convexity of the reference function. As a consequence, one of the proposed methods, relRCD, corresponds to the first stochastic variant of the mirror descent algorithm with a linear convergence rate.
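For orientation, the standard definition of relative smoothness from the cited literature, together with the deterministic Bregman (mirror descent) step it suggests, can be written as follows. The symbols f, h, L and D_h are generic notation assumed here, and the display is a sketch of the general notion rather than of the specific relRCD or relSGD updates.

```latex
% Bregman distance generated by a differentiable convex reference function h
D_h(x, y) = h(x) - h(y) - \langle \nabla h(y),\, x - y \rangle

% f is L-smooth relative to h if, for all x and y,
f(x) \le f(y) + \langle \nabla f(y),\, x - y \rangle + L\, D_h(x, y)

% Minimizing this upper bound at y = x_k gives the Bregman gradient step
x_{k+1} = \arg\min_{x} \bigl\{ \langle \nabla f(x_k),\, x - x_k \rangle + L\, D_h(x, x_k) \bigr\}

% With h(x) = \tfrac{1}{2}\|x\|^2 we have D_h(x, y) = \tfrac{1}{2}\|x - y\|^2,
% and the step reduces to gradient descent: x_{k+1} = x_k - \tfrac{1}{L} \nabla f(x_k)
```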

Journal Article
