
## The task

If $x \in \mathbb{R}^d$ is a $d$-dimensional vector, recall that the $\ell_1$ norm of $x$ is given by

$$||x||_1 = |x_1| + |x_2| + \dots + |x_d|.$$

The $\ell_1$-ball of radius $\lambda$ is the set of points $x$ such that $||x||_1 \le \lambda$.

Consider the following algorithmic task:

Input: $x \in \mathbb{R}^d$, $\lambda > 0$

Desired output: $y \in \mathbb{R}^d$ such that $||y||_1 \le \lambda$ and $||x-y||_2$ is minimized

In other words, given $x,\lambda$, the goal is to project $x$ to the nearest point in the $\ell_1$-ball of radius $\lambda$.

This problem comes up when we want to learn a linear SVM with L1 regularization using stochastic gradient descent.

## A proposed solution in the literature

The following paper proposes a solution:

D. Sculley, Matthew Eric Otey, Michael Pohl, Bridget Spitznagel, John Hainsworth, Yunkai Zhou. Detecting Adversarial Advertisements in the Wild. KDD 2011.

They propose an iterative algorithm $A$, with the property that iterating $A$ on $x$ eventually converges to the desired $y$: i.e., the sequence $x,A(x),A(A(x)),A(A(A(x))),\dots$ converges to the desired value $y$. Their algorithm is in Figure 2 and is as follows:

Algorithm $A$, with input $x$:

1. $c := \max(\lambda - ||x||_1,0)$

2. $d := ||x||_0$

3. $\tau := c/d$

4. for each non-zero element $i$ of $x$, do:

5. $\qquad s := \text{sign}(x_i)$

6. $\qquad x_i := s \times \max(|x_i - \tau|,0)$

7. return $x$

However, I don't see why this algorithm works. At minimum, whenever $||x||_1 > \lambda$ we'd want the output to satisfy $\lambda \le ||A(x)||_1 < ||x||_1$, but this doesn't seem to hold: as written, the algorithm does nothing when $||x||_1 > \lambda$ (since then $c = \max(\lambda - ||x||_1, 0) = 0$ and so $\tau = 0$), and it typically modifies $x$ when $||x||_1 \le \lambda$. That seems exactly backwards.

Also, $\max(|x_i - \tau|,0)$ is an odd expression, as we always have $\max(|x_i - \tau|,0) = |x_i - \tau|$, so I don't understand why the max is there.
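To double-check, here is a quick Python sketch of Figure 2 exactly as printed (the function name and test input are mine); on an input with $||x||_1 > \lambda$ it returns the input unchanged:

```python
def A_published(x, lam):
    """Figure 2 of Sculley et al., as printed (my transcription)."""
    c = max(lam - sum(abs(v) for v in x), 0.0)   # step 1 as printed
    nz = sum(1 for v in x if v != 0)             # ||x||_0
    tau = c / nz
    # step 6 as printed: s * max(|x_i - tau|, 0)
    return [(1.0 if v > 0 else -1.0) * max(abs(v - tau), 0.0) if v != 0 else v
            for v in x]

x = [3.0, -2.0, 1.0]          # ||x||_1 = 6 > lambda = 2
print(A_published(x, 2.0))    # prints [3.0, -2.0, 1.0]: nothing happens
```

Here step 1 gives $c = \max(2 - 6, 0) = 0$, so $\tau = 0$ and step 6 leaves every entry alone.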

## My question

Does the algorithm from Sculley et al. actually work? Is there a typo or an error in this algorithm? If so, can it be fixed?

###### Asked By: D.W.

###### Answered By: D.W.

There are two typos in their algorithm. Fixing them yields an algorithm that does work. Here's the fixed algorithm:

Algorithm $A$, with input $x$:

1. $c := \max(||x||_1 - \lambda,0)$

2. $d := ||x||_0$

3. $\tau := c/d$

4. for each non-zero element $i$ of $x$, do:

5. $\qquad s := \text{sign}(x_i)$

6. $\qquad x_i := s \times \max(|x_i| - \tau,0)$

7. return $x$

The first typo was in line 1: $\lambda - ||x||_1$ should have been $||x||_1 - \lambda$.

The second typo was in line 6: $\max(|x_i - \tau|, 0)$ should have been $\max(|x_i| - \tau, 0)$. The absolute value belongs on $x_i$ alone, so that each entry's magnitude is shrunk by $\tau$ and clamped at zero; this is standard soft-thresholding. (With $\max(x_i - \tau, 0)$ instead, every negative entry would be zeroed outright.)

With these changes, everything makes sense and the algorithm now seems to work correctly. For instance:

If $||x||_1 \le \lambda$, then $A(x)=x$ (since $c=0$ in this case).

If $||x||_1 > \lambda$, then $\lambda \le ||A(x)||_1 < ||x||_1$: each non-zero entry's absolute value decreases by $\min(|x_i|, \tau) > 0$, so the total decrease in $||x||_1$ is positive but at most $||x||_0 \cdot \tau = ||x||_1 - \lambda$.

The sequence $x,A(x),A(A(x)),\dots$ converges to the desired $y$.
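As a sanity check, here is a minimal Python sketch of the corrected step and the iteration (the function names and stopping tolerance are mine, not from the paper):

```python
def A_fixed(x, lam):
    """One step of the corrected algorithm: soft-threshold the non-zeros by tau."""
    c = max(sum(abs(v) for v in x) - lam, 0.0)   # excess ell_1 mass (fixed step 1)
    nz = sum(1 for v in x if v != 0)             # ||x||_0
    if c == 0.0 or nz == 0:
        return list(x)                           # already inside the ball
    tau = c / nz
    # Fixed step 6: shrink each non-zero magnitude by tau, clamping at zero.
    return [(1.0 if v > 0 else -1.0) * max(abs(v) - tau, 0.0) if v != 0 else 0.0
            for v in x]

def project_l1(x, lam, tol=1e-12):
    """Iterate A until ||x||_1 <= lambda (up to a small tolerance)."""
    while sum(abs(v) for v in x) > lam + tol:
        x = A_fixed(x, lam)
    return x

y = project_l1([3.0, -2.0, 1.0], 2.0)
```

On this input the exact projection onto the ball of radius $2$ is $[1.5, -0.5, 0]$ (soft-threshold at $1.5$), and the iteration reaches it, up to floating point, in two steps.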

Convergence should typically be pretty fast. In particular, if $\max_i |x_i|$ is not too much larger than $\lambda / ||x||_0$, then this procedure should converge exponentially fast.

(There are definitely some bad cases where this procedure converges slowly: e.g., if $x$ has many elements whose absolute value is tiny but non-zero, and a single element whose absolute value is enormous. Note that each iteration either lands exactly on the ball (when no coordinate crosses zero, the total shrinkage is exactly $||x||_0 \cdot \tau = ||x||_1 - \lambda$) or zeroes at least one more coordinate, so the algorithm always converges, but in the bad case it can require $\Theta(||x||_0)$ iterations, which might be as bad as $\Theta(d)$. However, these bad cases are pretty rare.)

Question Source : http://cs.stackexchange.com/questions/59470

