1. The L1-norm

Use absolute error:

\[\arg\min_{w\in \mathbb{R}^d}\sum_{i=1}^{n}{|w^\top x_i -y_i|}\]

as the target function. The model then puts less weight on far-away points (outliers) than squared error does, which is a form of robust regression.

In matrix form, this is an L1-norm minimization:

\[\arg\min_{w\in \mathbb{R}^d}{\|Xw -y\|}_{1}\]

where \(Xw=[w^\top x_1,w^\top x_2,\cdots, w^\top x_n]^\top\) and \(X\) is the matrix whose rows are the \(x_i^\top\).
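The two ways of writing the objective agree. A minimal NumPy sketch (random data, names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # rows are the data points x_i
y = rng.normal(size=5)
w = np.zeros(3)              # a candidate weight vector

# sum-of-absolute-errors form: sum_i |w^T x_i - y_i|
l1_error = np.abs(X @ w - y).sum()

# matrix / L1-norm form: ||Xw - y||_1
l1_norm = np.linalg.norm(X @ w - y, ord=1)

print(l1_error, l1_norm)  # identical values
```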

2. The non-smooth issue

The gradient does not exist at \(x=0\). The absolute value is still a convex function, but it is not differentiable at \(x=0\), which makes minimizing the absolute error harder for gradient-based methods. One way to handle this is to transform the absolute-error problem into a linear program.
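A quick numerical check (plain Python, a sketch) shows why no gradient exists at zero: the one-sided slopes of \(|x|\) disagree there.

```python
# |x| is convex, but its one-sided derivatives at 0 disagree,
# so no single gradient exists at that point.
h = 1e-6
right_slope = (abs(0 + h) - abs(0)) / h   # slope approaching from the right: +1
left_slope = (abs(0) - abs(0 - h)) / h    # slope approaching from the left: -1
print(right_slope, left_slope)
```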

3. The constrained problems

We can express the absolute value as the maximum of two smooth (linear) functions, applied elementwise to get the L1-norm:

\[|w| \to \max\{w, -w\}\]
  1. Replace the maximum with a new variable \(r\), constrained to upper-bound the max:
\[|w| \to \arg\min_{r\in \mathbb{R}} r \quad \text{s.t. } r\ge \max\{w, -w\}\]

where, at the optimum, the auxiliary variable \(r=|w|\).

  2. Replace the single max constraint with one constraint for each element of the max:
\[|w| \to \arg\min_{r\in \mathbb{R}} r\] \[\text{s.t. } r\ge w\] \[r\ge -w\]
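Applying this elementwise to the residuals \(w^\top x_i - y_i\) turns the whole L1 regression into a linear program in the variables \((w, r)\): minimize \(\sum_i r_i\) subject to \(r_i \ge w^\top x_i - y_i\) and \(r_i \ge -(w^\top x_i - y_i)\). A minimal sketch using SciPy's `linprog` on synthetic data with one outlier (data and variable names are illustrative, not from the original text):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d = 50, 2
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0])
y = X @ w_true + 0.01 * rng.normal(size=n)
y[0] += 100.0  # one large outlier

# Stack the LP variables as z = [w (d entries), r (n entries)];
# the objective sum(r) touches only the residual variables.
c = np.concatenate([np.zeros(d), np.ones(n)])

# Constraints in linprog's A_ub @ z <= b_ub form:
#   Xw - y <= r   ->  [ X, -I] z <=  y
#  -(Xw - y) <= r ->  [-X, -I] z <= -y
I = np.eye(n)
A_ub = np.block([[X, -I], [-X, -I]])
b_ub = np.concatenate([y, -y])

bounds = [(None, None)] * d + [(0, None)] * n  # w free, r >= 0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
w_l1 = res.x[:d]

# Ordinary least squares for comparison: pulled toward the outlier.
w_l2 = np.linalg.lstsq(X, y, rcond=None)[0]
print("L1:", w_l1, "L2:", w_l2)
```

On this data the L1 solution stays close to `w_true` while the least-squares fit is dragged by the single corrupted point, which is the robustness the first section described.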