The magic of proximal operators

Constrained optimization made easy

This note is the first in a series summarizing one of my deep dives and the basic elements of my python package proxmin. The problem at hand is minimizing a function under a non-differentiable constraint. The simplest common case is non-negativity, which one can express by an element-wise penalty function:

$$ g(x_i) = \begin{cases} 0 & \text{if } x_i \geq 0\\ +\infty & \text{otherwise,} \end{cases} $$

so that the problem is to minimize $f(x) + g(x)$. If the objective function is the likelihood of a linear model with a known coefficient matrix $A$, e.g. $f(x) = \tfrac{1}{2}\|Ax - b\|_2^2$, the problem is known as Non-negative Least Squares, which can be solved with Quadratic Programming. However, we want to find ways to work with other, and potentially several simultaneous, non-differentiable constraints.

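For the NNLS special case, a direct solver is readily available. A minimal sketch (the matrix and data here are made up purely for illustration):

```python
import numpy as np
from scipy.optimize import nnls

# small made-up problem: 3 data points, 2 non-negative coefficients
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])

# minimizes ||Ax - b||_2 subject to x >= 0
x, rnorm = nnls(A, b)
assert np.all(x >= 0)
```
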
Instead, let us find out what makes this problem hard. In short, it stems from two aspects, namely that $g$ is not differentiable at 0, otherwise we could simply follow the gradient of $f + g$, and, for good measure, that $g$ doesn't have a finite value for $x_i < 0$.

Here's where the proximal operators (alternatively called "proximity operators") come to the rescue. In what follows I will paraphrase key statements of the excellent review paper by Parikh & Boyd (2013), which I highly recommend for a more thorough and still readable introduction to this topic. Instead of minimizing $g$ directly, we can search for

$$ \mathrm{prox}_{\lambda g}(v) = \underset{x}{\mathrm{argmin}} \left( g(x) + \frac{1}{2\lambda} \|x - v\|_2^2 \right). $$

As long as $g$ is a closed proper convex function, the argument of the minimization is strongly convex and not everywhere infinite, so it has a unique minimizer for every $v \in \mathbb{R}^n$. Even more so, one can show that the fixed points of the proximal operator of $g$ are precisely the minimizers of $g$. In other words, $\mathrm{prox}_{\lambda g}(x^\star) = x^\star$ if and only if $x^\star$ minimizes $g$. That's absolutely magical! Let me emphasize it again:

The proximal operator of $g$ yields finite results for all $v$, even if $g$ doesn't. Its fixed points are minimizers of $g$.

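A quick numerical way to convince yourself: evaluate the prox definition by brute-force minimization for the toy case $g(x) = |x|$, whose unique minimizer is 0. The function name is mine, not established API:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def prox_by_minimization(g, v, lam):
    """Evaluate prox_{lam*g}(v) directly from its definition."""
    return minimize_scalar(lambda x: g(x) + (x - v) ** 2 / (2 * lam)).x

lam = 0.5
# the minimizer of g(x) = |x| is 0, and 0 is indeed a fixed point
assert abs(prox_by_minimization(abs, 0.0, lam)) < 1e-5
# away from the minimizer, the prox pulls v towards it: prox(2) = 1.5
assert abs(prox_by_minimization(abs, 2.0, lam) - 1.5) < 1e-5
```
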
Now you may ask: OK, that's all fine, but haven't we just moved the problem of minimizing $g$ into the definition of its proximal operator? Or in other words: how can we solve the problem? That's the next astonishing piece about proximal operators: many of them have analytic forms. Chief among them are the operators of indicator functions

$$ I_{\mathcal{C}}(x) = \begin{cases} 0 & \text{if } x \in \mathcal{C}\\ +\infty & \text{otherwise} \end{cases} $$

of a closed non-empty convex set $\mathcal{C}$. For them, it's quite obvious that the proximal operator is simply the Euclidean projection operator onto the set

$$ \mathrm{prox}_{\lambda I_{\mathcal{C}}}(v) = \underset{x \in \mathcal{C}}{\mathrm{argmin}} \|x - v\|_2 \equiv \Pi_{\mathcal{C}}(v), $$

and the scaling parameter $\lambda$ becomes meaningless. For non-negativity, the proximal operator is thus simply the element-wise projection onto the non-negative numbers, which we'll call $\mathrm{prox}_+$. Many other projections are known and analytic (think back to your geometry classes: straight lines, circles, cones; more complicated ones like norm balls and ellipsoids …). A quite comprehensive list is in section 6 of the Parikh paper and in the appendix of Combettes & Pesquet (2010).

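Two of these analytic projections as one-liners (function names are mine); note the fixed-point property in action:

```python
import numpy as np

def project_nonneg(v):
    """Element-wise projection onto the non-negative orthant."""
    return np.maximum(v, 0.0)

def project_l2_ball(v, radius=1.0):
    """Projection onto the centered Euclidean ball: rescale if outside."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

v = np.array([3.0, -4.0])
p = project_nonneg(v)    # [3., 0.]
q = project_l2_ball(v)   # [0.6, -0.8]
# projecting a second time changes nothing: these points are fixed points
assert np.array_equal(project_nonneg(p), p)
assert np.allclose(project_l2_ball(q), q)
```
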
Sometimes the subset projection becomes very tough, as my summer student this year quickly found out. We were looking for a projection onto the set of monotonic functions, whose values obey $x_i \leq x_{i+1}$ (or the reverse). This is a projection onto a polyhedron, which has no closed-form expression. Active-set or Quadratic Programming methods can solve it iteratively, but we decided to go a different route (subject of a later post).

Besides indicators, there are the penalty functions, which usually depend on the scaling parameter $\lambda$ to set the strength of the penalty. Classic non-differentiable examples are the $\ell_1$ or $\ell_0$ sparsity penalty, whose proximal operators even have quite obvious names, namely the "soft thresholding" and "hard thresholding" operators:

$$ \mathrm{prox}_{\lambda \|\cdot\|_1}(v)_i = \mathrm{sign}(v_i)\,\max(|v_i| - \lambda,\, 0) $$

$$ \mathrm{prox}_{\lambda \|\cdot\|_0}(v)_i = \begin{cases} v_i & \text{if } |v_i| \geq \sqrt{2\lambda}\\ 0 & \text{otherwise} \end{cases} $$

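Both operators in a few lines of numpy (function names are mine, not established API):

```python
import numpy as np

def prox_soft(v, lam):
    """Soft thresholding: prox of lam * ||.||_1, shrinks each entry by lam."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_hard(v, lam):
    """Hard thresholding: prox of lam * ||.||_0, zeroes entries below sqrt(2*lam)."""
    return np.where(np.abs(v) >= np.sqrt(2.0 * lam), v, 0.0)

v = np.array([-2.0, -0.5, 0.1, 1.5])
assert np.allclose(prox_soft(v, 1.0), [-1.0, 0.0, 0.0, 0.5])
assert np.allclose(prox_hard(v, 1.0), [-2.0, 0.0, 0.0, 1.5])
```
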
More complicated ones can be constructed. To see how, we need to introduce another concept that keeps popping up in the constrained optimization literature: the subderivative and its higher-dimensional version, the subgradient. It is the generalization of the gradient to non-differentiable functions. Again:

The subderivative generalizes the derivative to functions that are not differentiable.

It's quite straightforward, really: A subgradient of a function $f$ at a point $x$ with $f(x) < \infty$ is a vector $u$ such that $f(y) \geq f(x) + u^\top (y - x)$ for all $y$. The set of all such subgradients, which may be empty, is called the subdifferential $\partial f(x)$. It handles infinite values, such as the ones that arise in the indicator functions, by yielding an empty set if $f(x) = \infty$. On the other hand, if $f$ is convex and differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.

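For a concrete example: at the kink of $f(x) = |x|$, every $u \in [-1, 1]$ is a subgradient, which the defining inequality confirms numerically:

```python
import numpy as np

y = np.linspace(-2.0, 2.0, 401)
# subgradient inequality at x = 0: |y| >= |0| + u * (y - 0)
for u in (-1.0, -0.3, 0.0, 0.7, 1.0):
    assert np.all(np.abs(y) >= u * y)
```
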
The concept of the subdifferential enables existence and convergence proofs. But, in addition, it can be used to validate and even derive the analytic form of a proximal operator. Because of the fixed-point property of the proximal operators, we have

$$ p = \mathrm{prox}_{\lambda f}(v) \iff 0 \in \partial f(p) + \frac{1}{\lambda}(p - v) $$

and, if $f$ is differentiable at $p$,

$$ p = v - \lambda \nabla f(p). $$

If we wanted to, say, derive the proximal-operator form of the Maximum Entropy regularization, that is we want to minimize $f(x) + H(x)$ with $H(x) = \sum_i x_i \ln x_i$, we can use the second equation above because the entropy is differentiable for $x_i > 0$ and find that

$$ \mathrm{prox}_{\lambda H}(v)_i = \lambda\, W\!\left(\frac{1}{\lambda}\, e^{\,v_i/\lambda - 1}\right), $$

where $W$ denotes the Lambert-W function.

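Since derivations like this are easy to get wrong, here's a numerical sanity check (my own, not part of proxmin) that the Lambert-W expression satisfies the stationarity condition of the prox objective, $\lambda(\ln p + 1) + p - v = 0$:

```python
import numpy as np
from scipy.special import lambertw

def prox_entropy(v, lam):
    """prox of H(x) = sum_i x_i ln(x_i), element-wise."""
    return lam * lambertw(np.exp(v / lam - 1.0) / lam).real

v = np.array([0.1, 1.0, 5.0])
lam = 0.5
p = prox_entropy(v, lam)
# stationarity: lam * (ln p + 1) + (p - v) = 0
assert np.allclose(lam * (np.log(p) + 1.0) + p - v, 0.0, atol=1e-8)
```
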
Proximal minimization

We can now go to the most practically relevant aspect of proximal operators, namely their close relation to functional minimization. If $f$ is differentiable, one can write the small-$\lambda$ limit as

$$ \mathrm{prox}_{\lambda f}(v) = v - \lambda \nabla f(v) + o(\lambda). $$

If $f$ isn't differentiable, the subdifferential formally replaces the gradient and the following argument still holds in principle, but one has to be more careful (see Parikh & Boyd (2013), section 3.2, for what's called the "resolvent of the subdifferential operator").

Back to the simpler case of a differentiable $f$. The first-order approximation

$$ \hat{f}_1(x) = f(v) + \nabla f(v)^\top (x - v) \quad\text{with}\quad \mathrm{prox}_{\lambda \hat{f}_1}(v) = v - \lambda \nabla f(v) $$

is equivalent to the usual gradient update with step size $\lambda$. The second-order form

$$ \hat{f}_2(x) = \hat{f}_1(x) + \tfrac{1}{2}(x - v)^\top \nabla^2 f(v)\,(x - v) \quad\text{with}\quad \mathrm{prox}_{\lambda \hat{f}_2}(v) = v - \left( \nabla^2 f(v) + \tfrac{1}{\lambda} I \right)^{-1} \nabla f(v) $$

has a Tikhonov regularization with parameter $\lambda^{-1}$, also known as the Levenberg-Marquardt update.

Gradient and Levenberg-Marquardt updates are proximal operators of first- and second-order approximations of $f$.

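A quick numerical check of this statement for a made-up smooth function, $f(x) = \tfrac{1}{2}\|x\|^2$ (so $\nabla f(v) = v$): the prox of its first-order model at $v$, computed by brute force, equals a plain gradient step.

```python
import numpy as np
from scipy.optimize import minimize

v = np.array([1.0, -2.0])
lam = 0.3
grad_v = v  # gradient of 1/2 ||x||^2 at v

# prox of the linearization f(v) + grad_v^T (x - v), evaluated numerically
objective = lambda x: grad_v @ (x - v) + np.sum((x - v) ** 2) / (2 * lam)
p = minimize(objective, x0=v).x

assert np.allclose(p, v - lam * grad_v, atol=1e-4)
```
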
Now also the fixed-point property makes intuitive sense: just run gradient updates until there's no change, and you'll end up at the minimum. This sets us up with the means to compute the minimum of $f + g$, both functions being closed proper convex, and $f$ being differentiable. Connecting the concepts introduced in this note, we'll use the fixed-point optimization of $f$ with the proximal operator of $g$:

$$ x^{k+1} = \mathrm{prox}_{\lambda g}\!\left( x^k - \lambda \nabla f(x^k) \right). $$

This update sequence (from step $k$ to $k+1$) is called the Proximal Gradient Method (PGM). It constitutes a sequence of two proximal operators and is thus related to the method of Alternating Projections. It's simple and elegant, and works with any constraint (not just simple things like non-negativity). One critical aspect is the upper limit on $\lambda$: it is customarily set as $\lambda = 1/L$, where $L$ is the Lipschitz constant of $\nabla f$, but it only needs to be smaller than $2/L$. Otherwise, PGM will blow up rapidly without any warning! There are several tweaks to speed up the convergence, like over-relaxation and Nesterov acceleration, but I won't go into these here.

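To close the loop, here is a minimal sketch of PGM applied to the NNLS problem from the beginning. Everything here (function names, test matrix) is my own illustration, not the proxmin API:

```python
import numpy as np

def prox_plus(v):
    """prox of the non-negativity indicator: element-wise projection."""
    return np.maximum(v, 0.0)

def pgm(A, b, n_steps=5000):
    """PGM for f(x) = 1/2 ||Ax - b||^2 with a non-negativity constraint."""
    # the gradient A^T (Ax - b) has Lipschitz constant = top eigenvalue of A^T A
    L = np.linalg.eigvalsh(A.T @ A).max()
    lam = 1.0 / L  # customary step size; must stay below 2/L
    x = np.zeros(A.shape[1])
    for _ in range(n_steps):
        x = prox_plus(x - lam * (A.T @ (A @ x - b)))
    return x

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
x = pgm(A, b)
assert np.all(x >= 0)
```

For this well-posed toy problem, PGM recovers the same solution as a dedicated NNLS solver; the appeal is that swapping `prox_plus` for any other analytic proximal operator changes the constraint without touching the rest of the loop.
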
This is just an ultra-short summary of the rich field of constrained optimization with proximal operators. I am absolutely stoked by what they allow me to do, which I will present in future notes.