The magic of proximal operators

Constrained optimization made easy

This note is the first in a series summarizing one of my deep dives and the basic elements of my python package proxmin. The problem at hand is minimizing a function under a non-differentiable constraint. The simplest common case is non-negativity, which one can express by an element-wise penalty function:

$$ g(x_i) = \begin{cases} 0 & \text{if } x_i \geq 0\\ +\infty & \text{otherwise,} \end{cases} $$

so that the problem is to minimize $f(x) + g(x)$. If the objective function is the likelihood of a linear model with a known coefficient matrix $A$, e.g. $f(x) = \tfrac{1}{2}\|Ax - b\|_2^2$, the problem is known as Non-negative Least Squares, which can be solved with Quadratic Programming. However, we want to find ways to work with other, and potentially several simultaneous, non-differentiable constraints.

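For the NNLS special case, a direct solver is readily available. A minimal sketch (the matrix and data here are made up purely for illustration):

```python
import numpy as np
from scipy.optimize import nnls

# small made-up problem: 3 data points, 2 non-negative coefficients
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])

# minimizes ||Ax - b||_2 subject to x >= 0
x, rnorm = nnls(A, b)
assert np.all(x >= 0)
```
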
Instead, let us find out what makes this problem hard. In short, it stems from two aspects, namely that $g$ is not differentiable at 0, otherwise we could simply follow the gradient of $f + g$, and, for good measure, that $g$ doesn't have a finite value for $x_i < 0$.

Here's where the proximal operators (alternatively called "proximity operators") come to the rescue. In what follows I will paraphrase key statements of the excellent review paper by Parikh & Boyd (2013), which I highly recommend for a more thorough and still readable introduction to this topic. Instead of minimizing $g$ directly, we can search for

$$ \mathrm{prox}_{\lambda g}(v) = \underset{x}{\mathrm{argmin}} \left( g(x) + \frac{1}{2\lambda} \|x - v\|_2^2 \right). $$

As long as $g$ is a closed proper convex function, the argument of the minimization is strongly convex and not everywhere infinite, so it has a unique minimizer for every $v \in \mathbb{R}^n$. Even more so, one can show that the fixed points of the proximal operator of $g$ are precisely the minimizers of $g$. In other words, $\mathrm{prox}_{\lambda g}(x^\star) = x^\star$ if and only if $x^\star$ minimizes $g$. That's absolutely magical! Let me emphasize it again:

The proximal operator of $g$ yields finite results for all $v$, even if $g$ doesn't. Its fixed points are minimizers of $g$.

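A quick numerical way to convince yourself: evaluate the prox definition by brute-force minimization for the toy case $g(x) = |x|$, whose unique minimizer is 0. The function name is mine, not established API:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def prox_by_minimization(g, v, lam):
    """Evaluate prox_{lam*g}(v) directly from its definition."""
    return minimize_scalar(lambda x: g(x) + (x - v) ** 2 / (2 * lam)).x

lam = 0.5
# the minimizer of g(x) = |x| is 0, and 0 is indeed a fixed point
assert abs(prox_by_minimization(abs, 0.0, lam)) < 1e-5
# away from the minimizer, the prox pulls v towards it: prox(2) = 1.5
assert abs(prox_by_minimization(abs, 2.0, lam) - 1.5) < 1e-5
```
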
Now you may ask: OK, that's all fine, but haven't we just moved the problem of minimizing $g$ into the definition of its proximal operator? Or in other words: how can we solve the problem? That's the next astonishing piece about proximal operators: many of them have analytic forms. Chief among them are the operators of indicator functions

$$ I_{\mathcal{C}}(x) = \begin{cases} 0 & \text{if } x \in \mathcal{C}\\ +\infty & \text{otherwise} \end{cases} $$

of a closed non-empty convex set $\mathcal{C}$. For them, it's quite obvious that the proximal operator is simply the Euclidean projection operator onto the set

$$ \mathrm{prox}_{\lambda I_{\mathcal{C}}}(v) = \underset{x \in \mathcal{C}}{\mathrm{argmin}} \|x - v\|_2 \equiv \Pi_{\mathcal{C}}(v), $$

and the scaling parameter $\lambda$ becomes meaningless. For non-negativity, the proximal operator is thus simply the element-wise projection onto the non-negative numbers, which we'll call $\mathrm{prox}_+$. Many other projections are known and analytic (think back to your geometry classes: straight lines, circles, cones; more complicated ones like norm balls and ellipsoids …). A quite comprehensive list is in section 6 of the Parikh paper and in the appendix of Combettes & Pesquet (2010).

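Two of these analytic projections as one-liners (function names are mine); note the fixed-point property in action:

```python
import numpy as np

def project_nonneg(v):
    """Element-wise projection onto the non-negative orthant."""
    return np.maximum(v, 0.0)

def project_l2_ball(v, radius=1.0):
    """Projection onto the centered Euclidean ball: rescale if outside."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

v = np.array([3.0, -4.0])
p = project_nonneg(v)    # [3., 0.]
q = project_l2_ball(v)   # [0.6, -0.8]
# projecting a second time changes nothing: these points are fixed points
assert np.array_equal(project_nonneg(p), p)
assert np.allclose(project_l2_ball(q), q)
```
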
Sometimes the subset projection becomes very tough, as my summer student this year quickly found out. We were looking for a projection onto the set of monotonic functions, whose values obey $x_i \leq x_{i+1}$ (or the reverse). This is a projection onto a polyhedron, which has no closed-form expression. Active-set or Quadratic Programming methods can solve it iteratively, but we decided to go a different route (subject of a later post).

Besides indicators, there are the penalty functions, which usually depend on the scaling parameter $\lambda$ to set the strength of the penalty. Classic non-differentiable examples are the $\ell_1$ or $\ell_0$ sparsity penalty, whose proximal operators even have quite obvious names, namely the "soft thresholding" and "hard thresholding" operators:

$$ \mathrm{prox}_{\lambda \|\cdot\|_1}(v)_i = \mathrm{sign}(v_i)\,\max(|v_i| - \lambda,\, 0) $$

$$ \mathrm{prox}_{\lambda \|\cdot\|_0}(v)_i = \begin{cases} v_i & \text{if } |v_i| \geq \sqrt{2\lambda}\\ 0 & \text{otherwise} \end{cases} $$

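Both operators in a few lines of numpy (function names are mine, not established API):

```python
import numpy as np

def prox_soft(v, lam):
    """Soft thresholding: prox of lam * ||.||_1, shrinks each entry by lam."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_hard(v, lam):
    """Hard thresholding: prox of lam * ||.||_0, zeroes entries below sqrt(2*lam)."""
    return np.where(np.abs(v) >= np.sqrt(2.0 * lam), v, 0.0)

v = np.array([-2.0, -0.5, 0.1, 1.5])
assert np.allclose(prox_soft(v, 1.0), [-1.0, 0.0, 0.0, 0.5])
assert np.allclose(prox_hard(v, 1.0), [-2.0, 0.0, 0.0, 1.5])
```
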
More complicated ones can be constructed. To see how, we need to introduce another concept that keeps popping up in the constrained optimization literature: the subderivative and its higher-dimensional version, the subgradient. It is the generalization of the gradient to non-differentiable functions. Again:

The subderivative generalizes the derivative to functions that are not differentiable.

It's quite straightforward, really: A subgradient of a function $f$ at a point $x$ with $f(x) < \infty$ is a vector $u$ such that $f(y) \geq f(x) + u^\top (y - x)$ for all $y$. The set of all such subgradients, which may be empty, is called the subdifferential $\partial f(x)$. It handles infinite values, such as the ones that arise in the indicator functions, by yielding an empty set if $f(x) = \infty$. On the other hand, if $f$ is convex and differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.

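For a concrete example: at the kink of $f(x) = |x|$, every $u \in [-1, 1]$ is a subgradient, which the defining inequality confirms numerically:

```python
import numpy as np

y = np.linspace(-2.0, 2.0, 401)
# subgradient inequality at x = 0: |y| >= |0| + u * (y - 0)
for u in (-1.0, -0.3, 0.0, 0.7, 1.0):
    assert np.all(np.abs(y) >= u * y)
```
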
The concept of the subdifferential enables existence and convergence proofs. But, in addition, it can be used to validate and even derive the analytic form of a proximal operator. Because of the fixed-point property of the proximal operators, we have

$$ p = \mathrm{prox}_{\lambda f}(v) \iff 0 \in \partial f(p) + \frac{1}{\lambda}(p - v) $$

and, if $f$ is differentiable at $p$,

$$ p = v - \lambda \nabla f(p). $$

If we wanted to, say, derive the proximal-operator form of the Maximum Entropy regularization, that is we want to minimize $f(x) + H(x)$ with $H(x) = \sum_i x_i \ln x_i$, we can use the second equation above because the entropy is differentiable for $x_i > 0$ and find that

$$ \mathrm{prox}_{\lambda H}(v)_i = \lambda\, W\!\left(\frac{1}{\lambda}\, e^{\,v_i/\lambda - 1}\right), $$

where $W$ denotes the Lambert-W function.

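Since derivations like this are easy to get wrong, here's a numerical sanity check (my own, not part of proxmin) that the Lambert-W expression satisfies the stationarity condition of the prox objective, $\lambda(\ln p + 1) + p - v = 0$:

```python
import numpy as np
from scipy.special import lambertw

def prox_entropy(v, lam):
    """prox of H(x) = sum_i x_i ln(x_i), element-wise."""
    return lam * lambertw(np.exp(v / lam - 1.0) / lam).real

v = np.array([0.1, 1.0, 5.0])
lam = 0.5
p = prox_entropy(v, lam)
# stationarity: lam * (ln p + 1) + (p - v) = 0
assert np.allclose(lam * (np.log(p) + 1.0) + p - v, 0.0, atol=1e-8)
```
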
Proximal minimization

We can now go to the most practically relevant aspect of proximal operators, namely their close relation to functional minimization. If $f$ is differentiable, one can write the small-$\lambda$ limit as

$$ \mathrm{prox}_{\lambda f}(v) = v - \lambda \nabla f(v) + o(\lambda). $$

If $f$ isn't differentiable, the subdifferential formally replaces the gradient and the following argument still holds in principle, but one has to be more careful (see Parikh & Boyd (2013), section 3.2, for what's called the "resolvent of the subdifferential operator").

Back to the simpler case of a differentiable $f$. The first-order approximation

$$ \hat{f}_1(x) = f(v) + \nabla f(v)^\top (x - v) \quad\text{with}\quad \mathrm{prox}_{\lambda \hat{f}_1}(v) = v - \lambda \nabla f(v) $$

is equivalent to the usual gradient update with step size $\lambda$. The second-order form

$$ \hat{f}_2(x) = \hat{f}_1(x) + \tfrac{1}{2}(x - v)^\top \nabla^2 f(v)\,(x - v) \quad\text{with}\quad \mathrm{prox}_{\lambda \hat{f}_2}(v) = v - \left( \nabla^2 f(v) + \tfrac{1}{\lambda} I \right)^{-1} \nabla f(v) $$

has a Tikhonov regularization with parameter $\lambda^{-1}$, also known as the Levenberg-Marquardt update.

Gradient and Levenberg-Marquardt updates are proximal operators of first- and second-order approximations of $f$.

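A quick numerical check of this statement for a made-up smooth function, $f(x) = \tfrac{1}{2}\|x\|^2$ (so $\nabla f(v) = v$): the prox of its first-order model at $v$, computed by brute force, equals a plain gradient step.

```python
import numpy as np
from scipy.optimize import minimize

v = np.array([1.0, -2.0])
lam = 0.3
grad_v = v  # gradient of 1/2 ||x||^2 at v

# prox of the linearization f(v) + grad_v^T (x - v), evaluated numerically
objective = lambda x: grad_v @ (x - v) + np.sum((x - v) ** 2) / (2 * lam)
p = minimize(objective, x0=v).x

assert np.allclose(p, v - lam * grad_v, atol=1e-4)
```
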
Now also the fixed-point property makes intuitive sense: just run gradient updates until there's no change, and you'll end up at the minimum. This sets us up with the means to compute the minimum of $f + g$, both functions being closed proper convex, and $f$ being differentiable. Connecting the concepts introduced in this note, we'll use the fixed-point optimization of $f$ with the proximal operator of $g$:

$$ x^{k+1} = \mathrm{prox}_{\lambda g}\!\left( x^k - \lambda \nabla f(x^k) \right). $$

This update sequence (from step $k$ to $k+1$) is called the Proximal Gradient Method (PGM). It constitutes a sequence of two proximal operators and is thus related to the method of Alternating Projections. It's simple and elegant, and works with any constraint (not just simple things like non-negativity). One critical aspect is the upper limit on $\lambda$: it is customarily set as $\lambda = 1/L$, where $L$ is the Lipschitz constant of $\nabla f$, but it only needs to be smaller than $2/L$. Otherwise, PGM will blow up rapidly without any warning! There are several tweaks to speed up the convergence, like over-relaxation and Nesterov acceleration, but I won't go into these here.

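To close the loop, here is a minimal sketch of PGM applied to the NNLS problem from the beginning. Everything here (function names, test matrix) is my own illustration, not the proxmin API:

```python
import numpy as np

def prox_plus(v):
    """prox of the non-negativity indicator: element-wise projection."""
    return np.maximum(v, 0.0)

def pgm(A, b, n_steps=5000):
    """PGM for f(x) = 1/2 ||Ax - b||^2 with a non-negativity constraint."""
    # the gradient A^T (Ax - b) has Lipschitz constant = top eigenvalue of A^T A
    L = np.linalg.eigvalsh(A.T @ A).max()
    lam = 1.0 / L  # customary step size; must stay below 2/L
    x = np.zeros(A.shape[1])
    for _ in range(n_steps):
        x = prox_plus(x - lam * (A.T @ (A @ x - b)))
    return x

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
x = pgm(A, b)
assert np.all(x >= 0)
```

For this well-posed toy problem, PGM recovers the same solution as a dedicated NNLS solver; the appeal is that swapping `prox_plus` for any other analytic proximal operator changes the constraint without touching the rest of the loop.
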
This is just an ultra-short summary of the rich field of constrained optimization with proximal operators. I am absolutely stoked by what they allow me to do, which I will present in future notes.