The most fundamental technique in statistical learning is ordinary least squares (OLS) regression. If we have a vector of observations \(y\) and a matrix of features associated with each observation \(X\), then we assume the observations are a linear function of the features plus some (iid) random noise, \(\epsilon\):

\[ y = Xb + \epsilon \]

If the noise is Gaussian, the maximum likelihood estimate of the regression coefficients \(b\) is the one that minimizes the sum-of-squares error \(e(b)\) in our reconstruction of \(y\), i.e.:

\[ e(b) = (y - Xb)^T (y - Xb) \]

You can minimize \(e\) analytically by setting the derivative with respect to \(b\) equal to zero:

\[ \begin{align} e(b) &= y^T y - 2y^T Xb + b^T X^T Xb \\ \frac{de}{db} &= -2X^T y + 2 X^T Xb \\ &= 0 \\ X^T Xb &= X^T y \\ b &= (X^T X)^{-1} X^T y \end{align} \]
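In numpy this closed-form solution is a couple of lines (a minimal sketch; solving the normal equations with np.linalg.solve is preferable to forming the matrix inverse explicitly):

import numpy as np

def ols(X, y):
    # Solve the normal equations X^T X b = X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)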

So far so straightforward. Things get more interesting when you consider a small variation on this scheme that is very useful in practice: L1 penalized regression aka the “lasso”. In this scheme, we augment the \(e\) function we are trying to minimize with a small penalty proportional to the size of the \(b\) coefficients:

\[ e(b) = (y - Xb)^T (y - Xb) + \gamma \sum_i |b_i| \]

This has the effect of driving many of the \(e\)-minimizing \(b\) components to exactly zero. Sparse solutions like this have advantages in interpretability and may result in regression models that generalise better out-of-sample than naive OLS estimates do. L1 penalties also have applications beyond simple regression: for example, they are the foundational tool in compressed sensing.

But: how do we find the \(b\) that minimizes this L1-penalised \(e\)? It’s a bit tricky because the \(|b|\) term is not differentiable. However, we know that a global optimum does exist, because \(e\) is a convex function: it is the sum of two convex components.

This technical post looks at minimizing this convex but non-smooth function through 6 different methods. My developments will be based on Parikh and Boyd’s notes on “Proximal Algorithms”, Boyd and Vandenberghe’s book “Convex Optimization”, Boyd, Xiao and Mutapcic’s notes on “Subgradient Methods” and Ryan Tibshirani’s video lecture series on convex optimization.

Subgradient descent

Even though functions like \(|b|\) are not differentiable, you can still define a subgradient for them. A subgradient \(g\) of \(f\) at \(x\) is any vector satisfying: \[ \forall y.~ f(y) \geq f(x) + g^T(y - x) \]

So the set of subgradients of \(|b|\) (with respect to \(b\)) at \(b = 0\) is \([-1,1]\). Abusing notation, we can say that the subgradient set for our \(e\) is: \[ 2 X^T Xb - 2X^T y + \gamma ~ \textbf{if}(b = 0, [-1,1], \textbf{sign}(b)) \]

The subgradient method for minimization starts with an initial guess \(x^{(0)}\) and then updates that guess by: \[ x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)} \] where \(\alpha_k\) is a step size chosen according to some schedule (e.g. \(\alpha_k = a/\sqrt{k}\) for fixed \(a\)), and \(g^{(k)}\) is any subgradient of the objective at \(x^{(k)}\). With a suitable diminishing step-size schedule, the subgradient method is guaranteed to converge to the optimal value for any convex objective (provided you keep track of the best iterate seen so far), though convergence can be extremely slow.

In Python:

import numpy as np

# Example (hypothetical) problem data: N observations of F features
N, F = 200, 10
X = np.random.randn(N, F)
y = X @ np.random.randn(F) + np.random.randn(N)
alpha = 0.1
gamma = (2 * N) * alpha

def evaluate_objective(betas):
    # Evaluate the full objective e(b) for each row of `betas`
    # (one candidate coefficient vector per row)
    errors = y[None, :] - np.dot(betas, X.T)
    sse = (errors**2).sum(axis=-1) + gamma*np.abs(betas).sum(axis=-1)
    return sse

def subgradient_descent():
    betas = np.empty((2000, F))
    betas[0, :] = 0
    
    a = 1e-4
    XTX = X.T @ X
    XTy = np.dot(X.T, y)
    
    eps = np.finfo(betas.dtype).eps
    
    for t in range(1, len(betas)):
        # At b_i = 0 any value in [-1, 1] is a valid subgradient of |b_i|,
        # so we just pick one at random
        penalty_subg = np.where(
            np.abs(betas[t-1, :]) <= eps,
            np.random.uniform(-1, 1, size=F),
            np.sign(betas[t-1, :])
        )
        
        # Subgradient of e(b): 2 X^T X b - 2 X^T y + gamma * (L1 subgradient)
        subg = 2*np.dot(XTX, betas[t-1, :]) - 2*XTy + gamma * penalty_subg
        # Diminishing step size schedule: alpha_t = a / sqrt(t)
        alpha = a/np.sqrt(t)
        betas[t] = betas[t-1] - alpha*subg

    # The subgradient method is not a descent method, so return the best iterate found
    sse = evaluate_objective(betas)
    return betas[np.argmin(sse)]

Coordinate descent

Coordinate descent iterates over each dimension to be optimized in turn, minimizing the objective with respect to that single variable while holding all the others fixed. This method is popular with practitioners due to its simplicity, though it is of less interest to theorists because its convergence can be slow.

Let’s optimize our objective, \(e(b)\) over input \(b_k\). Rewriting the objective:

\[ \begin{align} e(b) &= (y - Xb)^T (y - Xb) + \gamma \sum_i \|b_i\| \\ &= \sum_{j} (y_j - \sum_{i \neq k} X_{j,i} b_i - X_{j,k} b_k)^2 + \gamma \sum_{i \neq k} \|b_i\| + \gamma \|b_k\| \end{align} \]

We can minimize this under the assumption that \(b_k>0\):

\[ \begin{align} \frac{de(b)}{db_k} &= -\sum_{j} 2 X_{j,k} (y_j - \sum_{i \neq k} X_{j,i} b_i - X_{j,k} b_k) + \gamma \\ 0 &= \gamma - 2 \sum_{j} X_{j,k} (y_j - \sum_{i \neq k} X_{j,i} b_i) + 2 \sum_{j} X_{j,k} X_{j,k} b_k \\ b_k &= \frac{\sum_{j} X_{j,k} (y_j - \sum_{i \neq k} X_{j,i} b_i) - \frac{1}{2}\gamma}{\sum_{j} X_{j,k}^2} \end{align} \]

By a similar argument we can show that if \(b_k<0\), the minimizer is:

\[ \begin{align} b_k &= \frac{\sum_{j} X_{j,k} (y_j - \sum_{i \neq k} X_{j,i} b_i) + \frac{1}{2}\gamma}{\sum_{j} X_{j,k}^2} \end{align} \]

Finally, if \(b_k=0\) we can use subgradients: 0 will be in the subgradient set of \(e(b)\) with respect to \(b\) if and only if:

\[ \begin{align} \sum_{j} X_{j,k} (y_j - \sum_{i \neq k} X_{j,i} b_i) &\in [-\frac{\gamma}{2},\frac{\gamma}{2}] \end{align} \]

Putting it together, we can define the minimizing \(b_k\) by cases:

\[ b_k = \begin{cases} \frac{\sum_{j} X_{j,k} (y_j - \sum_{i \neq k} X_{j,i} b_i) - \frac{1}{2}\gamma}{\sum_{j} X_{j,k}^2} & \sum_{j} X_{j,k} (y_j - \sum_{i \neq k} X_{j,i} b_i) > \frac{1}{2}\gamma \\ \frac{\sum_{j} X_{j,k} (y_j - \sum_{i \neq k} X_{j,i} b_i) + \frac{1}{2}\gamma}{\sum_{j} X_{j,k}^2} & \sum_{j} X_{j,k} (y_j - \sum_{i \neq k} X_{j,i} b_i) < -\frac{1}{2}\gamma \\ 0 & \text{otherwise} \end{cases} \]

This definition can be simplified by defining a “soft thresholding” function as follows (this will also be useful later on):

\[ \begin{align} \text{threshold}(x, \mu) &= \text{sign}(x)\text{max}(|x| - \mu, 0) \\ b_k &= \frac{\text{threshold}(\sum_{j} X_{j,k} (y_j - \sum_{i \neq k} X_{j,i} b_i), \frac{1}{2}\gamma)}{\sum_{j} X_{j,k}^2} \end{align} \]

As a Python program:

def soft_threshold(beta, thold):
    return np.sign(beta)*np.fmax(0, np.abs(beta)-thold)

def coordinate_descent():
    betas = np.empty((20, F))
    betas[0, :] = 0
    
    for r in range(1, len(betas)):
        beta = betas[r-1, :].copy()
        for k in range(F):
            # sum_j X_{j,k} * (y_j - sum_{i != k} X_{j,i} b_i)
            projection = np.dot(
                X[:, k],
                y - np.dot(X[:, :k], beta[:k]) - np.dot(X[:, k+1:], beta[k+1:])
            )
            threshold = 0.5*gamma
            scale = np.dot(X[:, k], X[:, k])
            beta[k] = soft_threshold(projection, threshold)/scale
        
        betas[r] = beta
    
    sse = evaluate_objective(betas)
    return betas[np.argmin(sse)]

Proximal gradient descent

To explain proximal gradient descent, let’s first refresh our memory about standard gradient descent. For a smooth function \(f(x)\), we find the minimum of the function by starting with an initial guess \(x^{(0)}\) and then iteratively updating it by subtracting \(t\) times the gradient: \[ x^{(k+1)} = x^{(k)} - t\nabla f(x^{(k)}) \]

Now notice that each of these update steps is equivalent to minimizing a (simplified) second-order Taylor expansion of \(f\) around \(x^{(k)}\), i.e.:

\[ f(x^+) \approx f(x^{(k)}) + (x^+ - x^{(k)})^T \nabla f(x^{(k)}) + \frac{1}{2}(x^+ - x^{(k)})^T \nabla^2 f(x^{(k)}) (x^+ - x^{(k)}) \]

We approximate further by making the simplifying assumption that \(\nabla^2 f(x) = \frac{1}{t}I\), in which case: \[ \frac{df}{dx^+} \approx \nabla f(x^{(k)}) - \frac{1}{t} x^{(k)} + \frac{1}{t} x^+ \]

And this approximation is minimized when the gradient is zero, i.e.: \[ x^+ = x^{(k)} - t \nabla f(x^{(k)}) \]
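As a quick illustration, this update rule is only a few lines of numpy (a sketch: `grad_f` stands for whatever function computes \(\nabla f\), and the fixed step size `t` is a hypothetical choice):

def gradient_descent(grad_f, x0, t=0.01, iterations=1000):
    # Repeatedly step in the direction of steepest descent
    x = x0
    for _ in range(iterations):
        x = x - t * grad_f(x)
    return x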

With the proximal gradient method, we begin by rewriting \(e(b)\) as the sum of two parts \(e(b) = f(b) + g(b)\) where \(f\) is differentiable. So for our example L1-penalized objective:

\[ \begin{align} f(b) &= (y - Xb)^T (y - Xb) \\ g(b) &= \gamma \sum_i \|b_i\| \end{align} \]

We now proceed as with gradient descent, but perform the simplified Taylor expansion on the smooth (\(f\)) part only. At each gradient descent step we therefore want to solve the problem:

\begin{align} x^{(k+1)} &= \text{argmin}_{x^+} f(x^{(k)}) + (x^+-x^{(k)})^T \nabla f(x^{(k)}) + \frac{1}{2t}\|x^+-x^{(k)}\|_2^2 + g(x^+) \\ &= \text{argmin}_{x^+} \frac{1}{2t}\|x^+ - (x^{(k)} - t \nabla f(x^{(k)}))\|_2^2 + g(x^+) \\ &= \text{prox}_{tg} (x^{(k)} - t \nabla f(x^{(k)})) \end{align}

Here \(\text{prox}_{tg}\) is just a compact way of denoting this minimization: it is called the “proximal operator” of \(g\) with scale parameter \(t\).

Often it’s much easier to analytically solve the minimization being done by the proximal operator, even when it is very hard or impossible to solve the original \(e\) minimization directly. For our L1-penalized case, we want to be able to solve problems like this:

\begin{align} \text{prox}_{tg}(x) &= \text{argmin}_{z} \frac{1}{2t}\|z - x\|_2^2 + \gamma \sum_i \|z_i\| \end{align}

In our case the minimization inside the proximal operator is indeed much easier than the original \(e\) minimization: it has a closed-form solution that can be readily derived via subgradients, whereas setting the subgradient of the original \(e\) to zero yields no closed form. The proximal gradient method has managed to turn a hard problem into an easy one!

The solution to \(\text{prox}_{tg}(x)\) is just the soft-thresholding function again:

\begin{align} [\text{prox}_{tg}(x)]_i &= \text{threshold}(x_i, t\gamma) \end{align}
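To see where this comes from (the subgradient derivation alluded to above): the minimization separates over the components of \(z\), so for each \(i\) we need \(0\) to lie in the subgradient set of \(\frac{1}{2t}(z_i - x_i)^2 + \gamma |z_i|\):

\[ 0 \in \frac{1}{t}(z_i - x_i) + \gamma ~ \textbf{if}(z_i = 0, [-1,1], \textbf{sign}(z_i)) \]

If \(z_i > 0\) this gives \(z_i = x_i - t\gamma\) (which requires \(x_i > t\gamma\)); if \(z_i < 0\) it gives \(z_i = x_i + t\gamma\) (requiring \(x_i < -t\gamma\)); and \(z_i = 0\) is consistent exactly when \(x_i \in [-t\gamma, t\gamma]\). These three cases together are exactly \(\text{threshold}(x_i, t\gamma)\).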

We can put this method together as a Python program as follows:

def evaluate_sse(beta):
    # The smooth part of the objective: f(b) = ||y - X b||^2
    resid = y - np.dot(X, beta)
    return np.dot(resid, resid)

def proximal_gradient_descent():
    betas = np.empty((20, F))
    betas[0] = 0
    
    XTX = X.T @ X
    XTy = np.dot(X.T, y)
    
    for k in range(1, len(betas)):
        # Gradient of the smooth part: grad f(b) = 2 (X^T X b - X^T y)
        g = 2*(np.dot(XTX, betas[k-1]) - XTy)
        
        # Backtracking line search: shrink the step size `t` until the
        # smooth part satisfies its quadratic upper bound at the new point
        t = 1
        last_e = evaluate_sse(betas[k-1])
        while True:
            betas[k] = soft_threshold(betas[k-1] - t * g, t*gamma)
            
            # G is the "generalized gradient" of the objective,
            # so our update looks like a "gradient descent step":
            #   betas[k] = betas[k-1] - tG
            G = (betas[k-1] - betas[k])/t
            e = evaluate_sse(betas[k])
            max_e = last_e - t * np.dot(g, G) + (t/2)*np.dot(G, G)
            if e <= max_e:
                break
            
            t *= 0.8
    
    sse = evaluate_objective(betas)
    return betas[np.argmin(sse)]

Accelerated proximal gradient descent

A popular variant of proximal gradient descent is to use some “Nesterov momentum”:

\begin{align} v &= x^{(k-1)} + \frac{k-2}{k+1}(x^{(k-1)} - x^{(k-2)}) \\ x^{(k)} &= \text{prox}_{tg} (v - t \nabla f(v)) \end{align}

Unlike proximal gradient descent, this is no longer a descent method: some iterations might increase the value of the objective! Nonetheless, the method can be proven to converge so long as the gradient of the smooth part \(f\) does not change too quickly: technically, \(\nabla f\) should be Lipschitz continuous (which holds for our quadratic \(f\), since \(\nabla f(b) = 2X^T(Xb - y)\) is linear in \(b\)).

In Python:

def accelerated_proximal_gradient_descent():
    betas = np.empty((20, F))
    betas[0, :] = 0
    
    XTX = X.T @ X
    XTy = np.dot(X.T, y)
    
    for k in range(1, len(betas)):
        # Momentum step (for k = 1 there is no previous-but-one iterate,
        # so the momentum term vanishes)
        prev2 = betas[k-2] if k >= 2 else betas[0]
        v = betas[k-1] + ((k-2)/(k+1))*(betas[k-1] - prev2)
        
        # Gradient of the smooth part, evaluated at the momentum point v
        g = 2*(np.dot(XTX, v) - XTy)
        
        # Backtracking line search: shrink the step size `t` until the
        # smooth part satisfies its quadratic upper bound at the new point
        t = 1
        last_e = evaluate_sse(v)
        while True:
            betas[k] = soft_threshold(v - t * g, t*gamma)
            
            e = evaluate_sse(betas[k])
            max_e = last_e + np.dot(g, betas[k] - v) + (1/(2*t))*((betas[k] - v)**2).sum()
            if e <= max_e:
                break
            
            t *= 0.8
    
    sse = evaluate_objective(betas)
    return betas[np.argmin(sse)]

Alternating direction method of multipliers (ADMM)

ADMM is a method of solving minimization problems of the form \(f(x) + g(z)\) subject to linear constraints \(Ax+Bz=c\). The augmented Lagrangian for this problem is just the standard Lagrangian plus a quadratic penalty for the constraint violation:

\begin{align} L_\rho(x, z, \lambda) &= f(x) + g(z) + \lambda^T(Ax+Bz-c) + (\rho/2) \|Ax+Bz-c\|_2^2 \end{align}

This quadratic penalty will prove to be key, because just like in the proximal gradient descent case it will help ensure that the Lagrangian has a sensible gradient.

To minimize using ADMM, we keep estimates for \(x\), \(z\) and \(\lambda\) which we iteratively update as follows:

\begin{align} x^{(k+1)} &= \text{argmin}_x L_\rho(x, z^{(k)}, \lambda^{(k)}) \\ z^{(k+1)} &= \text{argmin}_z L_\rho(x^{(k+1)}, z, \lambda^{(k)}) \\ \lambda^{(k+1)} &= \lambda^{(k)} + \rho(Ax^{(k+1)} + Bz^{(k+1)} - c) \end{align}

Under weak conditions (mostly just convexity of \(f\) and \(g\)), it can be shown that this procedure converges on the constrained global minimum!

How can we solve the L1 regression problem in this framework? It’s not at first obvious, because in the L1 problem there are no constraints at all, but ADMM seems to be useful only for optimization with linear constraints. The small trick we need is to instantiate the constraint \(Ax+Bz=c\) to enforce that \(x=z\) i.e. \(A=I\), \(B=-I\) and \(c=0\). If \(f\) and \(g\) are the residual error and regularization functions – as they were in the proximal gradient case – then we have:

\begin{align} L_\rho(x, z, \lambda) &= (y - Xx)^T (y - Xx) + \gamma \sum_i \|z_i\| + \lambda^T(x-z) + (\rho/2) \|x-z\|_2^2 \end{align}

Minimizing \(L_\rho(x, z, \lambda)\) with respect to \(x\) is straightforward:

\begin{align} 0 &= 2 X^T X x - 2 X^T y + \lambda + \rho(x - z) \\ (2 X^T X + \rho I) x &= 2 X^T y - \lambda + \rho z \\ x &= (2 X^T X + \rho I)^{-1}(2 X^T y - \lambda + \rho z) \end{align}

The problem of minimizing \(L_\rho(x, z, \lambda)\) with respect to \(z\) is trickier, but you can show it is equivalent to evaluating the proximal operator of \(g\), which is helpful because we already know how to do that:

\begin{align} & \text{argmin}_z L_\rho(x, z, \lambda) \\ =& \text{argmin}_z \gamma \sum_i \|z_i\| - \lambda^T z + (\rho/2) (z^T z - 2 x^T z) \\ =& \text{argmin}_z \gamma \sum_i \|z_i\| + \rho/2(z^T z - 2 z^T (x + \lambda/\rho) + (x + \lambda/\rho)^T (x + \lambda/\rho)) \\ =& \text{argmin}_z \gamma \sum_i \|z_i\| + \rho/2 \|z - (x + (1/\rho)\lambda)\|^2_2 \\ =& \text{prox}_{\frac{1}{\rho}g} (x + (1/\rho)\lambda) \end{align}

Putting this together as a Python function:

def admm():
    betas = np.empty((50, F))
    betas[0, :] = 0
    
    XTX = X.T @ X
    XTy = np.dot(X.T, y)
    
    # Works well on my examples, but algorithm can be sensitive to choice of rho: 
    # Section 3.4.1 of https://web.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf
    # has some ideas here
    rho = 100
    
    def argmin_x(z, lamb):
        # Solve (2 X^T X + rho * I) x = 2 X^T y - lambda + rho * z
        return np.linalg.solve(2*XTX + rho*np.eye(F), 2*XTy - lamb + rho*z)
    def argmin_z(x, lamb):
        # prox_{(1/rho) g}(x + lambda / rho) is just soft thresholding
        t = 1/rho
        return soft_threshold(x + t*lamb, t*gamma)
    
    x, z, lamb = np.zeros(F), np.zeros(F), np.zeros(F)
    for k in range(len(betas)):
        x = argmin_x(z, lamb)
        z = argmin_z(x, lamb)
        lamb += rho*(x-z)
        betas[k] = z
    
    sse = evaluate_objective(betas)
    return betas[np.argmin(sse)]

LARS

LARS, or “least angle regression”, is an algorithm that can be used to efficiently solve the L1 regression problem for a whole range of \(\gamma\) values at once. I defer to Wikipedia and the very readable original paper by Efron, Hastie, Johnstone and Tibshirani to describe this approach. There is less value in understanding the details of LARS than of ADMM and the other descent methods, because the descent methods can be applied to many problems, not just the L1 problem.
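If you do want to experiment with LARS, scikit-learn exposes a LARS-based lasso solver; a minimal sketch (assuming scikit-learn is installed):

from sklearn.linear_model import lars_path

# Trace out the whole lasso regularization path using the LARS algorithm:
# coefs[:, i] holds the coefficients at penalty value alphas[i]
alphas, active, coefs = lars_path(X, y, method='lasso')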

Conclusion

If you’re trying to use L1 regression in practice, just throw Scikit-Learn at the problem :-). But personally I’m not completely satisfied by just using a black box: it’s fascinating to know something about what is going on under the hood to solve this tricky optimization problem.
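For reference, here is a minimal sketch of that black-box route (assuming scikit-learn is installed). Scikit-Learn’s Lasso minimizes \(\frac{1}{2N}\|y - Xb\|_2^2 + \alpha \|b\|_1\), which matches our \(e(b)\) up to an overall scale factor when \(\gamma = 2N\alpha\), the same relationship used to set gamma in the setup code above:

from sklearn.linear_model import Lasso

# fit_intercept=False to match the implementations above, which fit no intercept term
model = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
model.fit(X, y)
print(model.coef_)  # should be close to the betas found by the methods above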