
# Project to L1 ball of specified radius

If $x \in \mathbb{R}^d$ is a $d$-dimensional vector, recall that the $\ell_1$ norm of $x$ is given by

$$||x||_1 = |x_1| + |x_2| + \dots + |x_d|.$$

The $\ell_1$-ball of radius $\lambda$ is the set of points $x$ such that $||x||_1 \le \lambda$.

Input: $x \in \mathbb{R}^d$, $\lambda > 0$
Desired output: $y \in \mathbb{R}^d$ such that $||y||_1 \le \lambda$ and $||x-y||_2$ is minimized

In other words, given $x,\lambda$, the goal is to project $x$ to the nearest point in the $\ell_1$-ball of radius $\lambda$.

This problem comes up when we want to learn a linear SVM with L1 regularization using stochastic gradient descent.

## A proposed solution in the literature

The following paper proposes a solution:

D. Sculley, Matthew Eric Otey, Michael Pohl, Bridget Spitznagel, John Hainsworth, Yunkai Zhou. Detecting Adversarial Advertisements in the Wild. KDD 2011.

They propose an iterative algorithm $A$, with the property that iterating $A$ on $x$ eventually converges to the desired $y$: i.e., the sequence $x,A(x),A(A(x)),A(A(A(x))),\dots$ converges to the desired value $y$. Their algorithm is in Figure 2 and is as follows:

Algorithm $A$, with input $x$:
1. $c := \max(\lambda - ||x||_1,0)$
2. $d := ||x||_0$
3. $\tau := c/d$
4. for each non-zero element $i$ of $x$, do:
5. $\qquad s := \text{sign}(x_i)$
6. $\qquad x_i := s \times \max(|x_i - \tau|,0)$
7. return $x$

However, I don't see why this algorithm works. For instance, at minimum we'd want the output to satisfy $\lambda \le ||A(x)||_1 < ||x||_1$ whenever $||x||_1 > \lambda$, but this doesn't seem to hold. As written, the algorithm does nothing to the input if $||x||_1 > \lambda$ (since then $c = 0$ and $\tau = 0$), and it typically shrinks $x$ if $||x||_1 \le \lambda$; this seems backwards (we want the reverse).
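To make the concern concrete, here is a sketch in Python of one pass of the algorithm exactly as printed in Figure 2 (the function name `A_published` is mine, not the authors'). On any input with $||x||_1 > \lambda$, line 1 gives $c = 0$, hence $\tau = 0$, and line 6 reduces to $x_i := \text{sign}(x_i) \cdot |x_i| = x_i$, so the pass returns its input unchanged:

```python
def A_published(x, lam):
    """One pass of the algorithm as printed in the paper (Figure 2)."""
    l1 = sum(abs(v) for v in x)
    c = max(lam - l1, 0.0)           # line 1 as printed
    d = sum(1 for v in x if v != 0)  # ||x||_0
    tau = c / d
    out = []
    for v in x:
        if v != 0:
            s = 1.0 if v > 0 else -1.0
            v = s * max(abs(v - tau), 0.0)  # line 6 as printed
        out.append(v)
    return out

x = [3.0, -1.0, 2.0]        # ||x||_1 = 6 > lambda = 1
print(A_published(x, 1.0))  # -> [3.0, -1.0, 2.0]: a no-op
```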

Also, $\max(|x_i - \tau|,0)$ is an odd expression, as we always have $\max(|x_i - \tau|,0) = |x_i - \tau|$, so I don't understand why the max is there.

## My question

Does the algorithm from Sculley et al. actually work? Is there a typo or an error in this algorithm? If so, can it be fixed?

## Answer

There are two typos in their algorithm. Fixing them yields an algorithm that does work. Here's the fixed algorithm:

Algorithm $A$, with input $x$:
1. $c := \max(||x||_1 - \lambda,0)$
2. $d := ||x||_0$
3. $\tau := c/d$
4. for each non-zero element $i$ of $x$, do:
5. $\qquad s := \text{sign}(x_i)$
6. $\qquad x_i := s \times \max(|x_i| - \tau,0)$
7. return $x$

The first typo was in line 1: $\lambda - ||x||_1$ should have been $||x||_1 - \lambda$.

The second typo was in line 6: $\max(|x_i - \tau|,0)$ should have been $\max(|x_i| - \tau,0)$; the absolute value belongs around $x_i$ alone, with the subtraction outside it. Line 6 is then the standard soft-thresholding update, which shrinks each nonzero $|x_i|$ by $\tau$ and clamps at zero.
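Here is a minimal Python sketch of the corrected pass, iterated until the ball constraint holds (the helper names `A_fixed` and `project_l1` are mine, not from the paper):

```python
def A_fixed(x, lam):
    """One pass of the corrected algorithm: soft-threshold by tau = c/d."""
    l1 = sum(abs(v) for v in x)
    c = max(l1 - lam, 0.0)           # corrected line 1
    d = sum(1 for v in x if v != 0)  # ||x||_0
    if d == 0 or c == 0.0:
        return list(x)               # already inside the ball (or x = 0)
    tau = c / d
    # Corrected line 6: shrink each nonzero |x_i| by tau, clamping at zero.
    return [(1.0 if v > 0 else -1.0) * max(abs(v) - tau, 0.0) if v != 0 else v
            for v in x]

def project_l1(x, lam, tol=1e-12):
    """Iterate A until ||x||_1 <= lam (up to a small tolerance)."""
    while sum(abs(v) for v in x) > lam + tol:
        x = A_fixed(x, lam)
    return x

print(project_l1([0.5, -2.5], 1.0))  # -> [0.0, -1.0]
```

On this input the exact projection onto the $\ell_1$-ball of radius 1 is $(0, -1)$ (soft-thresholding with $\theta = 1.5$), and the iteration reaches it in two passes: the first pass zeroes the small coordinate, the second finishes shrinking the large one.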

With these changes, everything makes sense and the algorithm now seems to work correctly. For instance:

• If $||x||_1 \le \lambda$, then $A(x)=x$ (since $c=0$ in this case).

• If $||x||_1 > \lambda$, then $\lambda \le ||A(x)||_1 < ||x||_1$: each nonzero $|x_i|$ strictly shrinks, but by at most $\tau$, so the total decrease in the norm is at most $d\tau = ||x||_1 - \lambda$.

• The sequence $x,A(x),A(A(x)),\dots$ converges to the desired $y$.

• Convergence should typically be pretty fast. In particular, if $\max_i |x_i|$ is not too much larger than $\lambda / ||x||_0$, then this procedure should converge exponentially fast.

(There are definitely some bad cases where this procedure converges very slowly: e.g., if $x$ has many elements whose absolute value is tiny but non-zero, and a single element whose absolute value is enormous. In this case, the algorithm still converges, but it requires $\Theta(||x||_0)$ iterations, which might be as bad as $\Theta(d)$. However, these bad cases are pretty rare.)
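The first two bullet points above can be spot-checked numerically. The snippet below is a self-contained transcription of the corrected pass (an illustrative sketch, not the authors' code):

```python
def one_pass(x, lam):
    """One pass of the corrected algorithm (soft-threshold by tau = c/d)."""
    l1 = sum(abs(v) for v in x)
    c = max(l1 - lam, 0.0)
    d = sum(1 for v in x if v != 0)
    if c == 0.0 or d == 0:
        return list(x)  # ||x||_1 <= lambda: nothing to do
    tau = c / d
    return [(1.0 if v > 0 else -1.0) * max(abs(v) - tau, 0.0) if v != 0 else 0.0
            for v in x]

# Bullet 1: a point inside the ball is left untouched.
inside = [0.2, -0.3, 0.1]               # ||x||_1 = 0.6 <= 1
assert one_pass(inside, 1.0) == inside

# Bullet 2: a point outside the ball shrinks, but not below radius lambda.
outside = [2.0, -3.0, 0.0, 1.0]         # ||x||_1 = 6 > 1
y = one_pass(outside, 1.0)
assert 1.0 <= sum(abs(v) for v in y) < 6.0
```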