The magic of proximal operators

This note is the first in a series, which summarizes of one of my deep dives and the basic elements of my python package [proxmin](https://github.com/pmelchior/proxmin). The problem at hand is **minimizing a function $f$ under a non-differentiable constraint**. The simplest common case is non-negativity, which one can express by a element-wise penalty function: $$ g(x) = \begin{cases} 0 & \text{if}\ x \geq 0\\ \infty &\text{else}\end{cases} $$ so that the problem is to minimize $f+g$. If the objective function is the likelihood of a linear model $Ax=b$ with a known coefficient matrix $A$, the problem is known as Non-negative Least Squares, which can be solved with Quadratic Programming. However, we want to find ways to work with other, and potentially several simultaneous, non-differentiable constraints. Instead, let us find out what makes this problem hard. In short, it stems from two aspects, namely that $g$ is not differentiable at 0, otherwise we could simply follow the gradient of $f+g$, and, for good measure, that $g$ doesn't have a finite value for $x<0$. Here's where the **proximal operators** (alternatively called "proximity operators") come to the rescue. In what follows I will paraphrase key statements of the excellent review paper by [Parikh & Boyd (2013)](http://web.stanford.edu/~boyd/papers/prox_algs.html), which I highly recommend for a more thorough and still readable introduction to this topic. Instead of minimizing $g(x)$ directly, we can search for $$ \underset{u}{\text{argmin}}\left\lbrace g(u) + \frac{1}{2\lambda}\lVert x-u\rVert_2^2\right\rbrace \equiv \text{prox}_{\lambda g}(x). $$ As long as a $g$ is a [closed proper convex function](https://en.wikipedia.org/wiki/Closed_convex_function), the argument of the minimization is strongly convex and not everywhere infinite, so it has a *unique minimizer* for every $x\in\mathbb{R}^N$. Even more so, one can show that the fixed points of the proximal operator $\text{prox}_{\lambda g}(x)$ are precisely the minimizers of $g$. In other words, $\text{prox}_{\lambda g}(x^\star)=x^\star$ if and only if $x^\star$ minimizes $g$. That's absolutely magical! Let me emphasize it again: > The proximal operator of $g$ yields finite results for all $x\in\mathbb{R}^N$, even if $g$ doesn't. Its fixed points are minimizers of $g$. Now you may ask: OK, that's all fine, but haven't we just moved the problem of minimizing $g​$ into the definition of its proximal operator? Or in other words: how can we solve the problem? That's the next astonishing piece about proximal operators: many of them have analytic forms. Chief among them are the operators of indicator functions $$ I_\mathcal{C}(x) = \begin{cases}0 & \text{if}\ x\in\mathcal{C}\\\infty& \text{else}\end{cases} $$ of a closed non-empty convex set $\mathcal{C}$. For them, it's quite obvious that the proximal operator is simply the *Euclidean projection operator* onto the set $$ \text{prox}_{I_\mathcal{C}}(x) = \underset{u\in\mathcal{C}}{\text{argmin}} \lVert x-u\rVert_2^2, $$ and the scaling parameter $\lambda$ becomes meaningless. For non-negativity, the proximal operator is thus simply the element-wise projection onto the non-negative numbers, which we'll call $[.]_+$. Many other projections are known and analytic (think back to your geometry classes: straight lines, circles, cones; more complicated ones like norms, and ellipsoids … A quite comprehensive list is in section 6 of the Parikh paper and in the appendix of [Combettes & Pesquet (2010)](https://arxiv.org/abs/0912.3522v4)). Sometimes the subset projection becomes very tough, as my summer student this year has found out quickly. We were looking for a projection onto the set of monotonic functions, whose values $y_i = y(x_i)$ obey $\forall i: y_i \leq y_{i+1}\ \text{if}\ x_i \leq x_{i+1}$. This is a projection onto a polyhedron, which has no closed-form expression. Active-set or Quadratic Programming methods can solve it iteratively, but we decided to go with a different route (subject of a later post). Besides indicators, there are the penalty functions, which usually depend on the scaling parameter to set the strength of the penalty. Classic non-differentiable examples are the $\ell_1$ or $\ell_0$ sparsity penalty, whose proximal operators even have quite obvious names, namely the "soft thresholding" and "hard thresholding" operators: $$ \begin{align} &\text{prox}_{\lambda \ell_0}(x) = \text{Hard}_\lambda(x) = \begin{cases}0 & \text{if}\ |x|<\lambda \\ x & \text{else}\end{cases}\\ &\text{prox}_{\lambda \ell_1}(x) = \text{Soft}_\lambda(x) = \text{sign}(x)\left[|x|-\lambda\right]_+ \end{align} $$ More complicated ones can be constructed. To see how we need to introduce another concept that keeps popping up in the constrained optimization literature: the subderivative and its higher-dimensional version, the subgradient. It the generalization of the gradient to non-differentiable functions. Again: > The subderivative generalizes the derivative to functions that are not differentiable. It's quite straightforward, really: A subgradient of a function $g: \mathbb{R}^N\rightarrow(-\infty,\infty]$ at a point $x_0$ with $g(x_0) < \infty$ is a vector $\tau\in\mathbb{R}^N$ such that $g(x)\geq g(x_0) + \tau^\top (x-x_0)$. The *set* of all such subgradients, which may be empty, is called the **subdifferential** $\partial g(x_0)$. It handles infinite values, such as the ones that arise in the indicator functions, by yielding an empty set $\partial g(x_0)=\emptyset$ if $g(x_0) = \infty$. On the other hand, if $g$ is convex and differentiable at $x_0$, then $\partial g(x_0) = \lbrace\nabla g(x_0)\rbrace$. The concept of the subdifferential enables existence and convergence proofs. But, in addition, it can be used to validate and even derive the analytic form of a proximal operator. Because of the fixed-point property of the proximal operators, we have $$ \begin{align} &p = \text{prox}_g(x) \Leftrightarrow x-p\in\partial g(p)\\ &p = \text{prox}_g(x) \Leftrightarrow x-p = \nabla g(p)\ \text{if}\ g\ \text{is differentiable at}\ p. \end{align} $$ If we were interested to, say, derive the proximal-operator form of the **Maximum Entropy regularization**, that is we want to minimize $\text{S}(x) = -\sum_i^N x_i\ln(x_i)$, we can use the second equation above because the entropy is differentiable for $x_i>0$ and find that $$ \text{prox}_{\lambda S}(x) = -\lambda \Re\left[W\left(-\exp\left(-\frac{x+\lambda}{\lambda}\right) \Big/ \lambda\right)\right], $$ with the Lambert-$W$ function. ## Proximal minimization We can now go to the most practically relevant aspects of proximal operators, namely the close relation to functional minimization. If $f$ is differentiable, one can write the limit $\lambda\rightarrow 0$ as $$ \text{prox}_{\lambda f}(x) = (I + \lambda\nabla f)^{-1}(x). $$ If $f$ isn't differentiable, the subdifferential will formally replace the gradient and the following argument still holds in principle, but one has to be more careful (see Parikh & Boyd (2013), section 3.2, for what's called "resolvent of the subdifferential operator"). Back to the simpler case of a differentiable $f$. The first-order approximation $$ \text{prox}_{\lambda f}(x) = x- \lambda\nabla f(x) $$ is equivalent to the usual gradient update with step size $\lambda$. The second-order form $$ \text{prox}_{\lambda f}(x) = x- \left(\nabla^2f(x) + (1/\lambda)I\right)^{-1} \nabla f(x). $$ has a Tikhonov regularization with parameter $\lambda$, also known as the Levenberg-Marquard update. > Gradient and Levenberg-Marquard updates are proximal operators of first- and second-order approximations of $f$. Now also the fixed-point property makes intuitive sense: just run gradient updates until there's no change, and you'll end up at the minimum. This sets us up with the means to compute the minimum of $f+g$, both functions being closed proper convex, and $f$ being differentiable. Connecting the concepts introduced this note, we'll use the fixed-point optimization of $f$ with proximal operator of $g$: $$ x^{k+1} \leftarrow \text{prox}_{\lambda^k g}\left(x^k- \lambda^k\nabla f(x^k)\right) $$ This update sequence (from step $k$ to $k+1$) is called **Proximal Gradient Method** (PGM). It constitutes a sequence of two proximal operators and is thus related to the method of [Alternating Projections](https://en.wikipedia.org/wiki/Projections_onto_convex_sets). It's simple and elegant, and works with *any* constraint $g$ (not just simple things like non-negativity). One critical aspect is the upper limit on $\lambda^k$: It is customarily set as $\lambda^k \in (0,1/L]$, where $L$ is the Lipschitz constant of $ \nabla f(x^k)$, but it *needs* to be smaller than $2/L$. Otherwise, PGM will blow up rapidly without any warning! There a several tweaks to speed up the convergence, like over-relaxation and Nesterov acceleration, but I won't go into these here. This is just a ultra-short summary of the rich field of constrained optimization with proximal operators. I am absolutely stoked by what they allow me to do, which I will present in future notes.