Any optimal solution $w^*$ can be expressed as:

$$w^* = \sum^N_{j = 1} \alpha^{(j)}x^{(j)}$$

for some coefficients $\alpha^{(1)}, \dots, \alpha^{(N)} \in \R$,

considering the regularized empirical risk minimization (ERM) problem:

$$\min_{w \in \R^d} \mathcal{J}(w) = \sum^N_{i = 1} L(y^{(i)}, w^\intercal x^{(i)}) + \lambda ||w||^2$$

where

  • $L$: a loss function
  • $\{x^{(i)}, y^{(i)}\}_{i = 1}^{N}$: the training examples
  • $\lambda > 0$: a regularization parameter
  • $w \in \R^d$: the weight vector
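As a numerical sanity check (not part of the proof), the squared-loss case of this objective is ridge regression, whose closed-form solution can be written either in the primal form $w^* = (X^\intercal X + \lambda I)^{-1}X^\intercal y$ or as a combination of training points $w^* = X^\intercal \alpha$ with $\alpha = (XX^\intercal + \lambda I)^{-1}y$. A minimal NumPy sketch on toy data (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, lam = 5, 3, 0.1
X = rng.normal(size=(N, d))   # rows are the training points x^(i)
y = rng.normal(size=N)

# Primal ridge solution: (X^T X + λI) w* = X^T y
w_star = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual coefficients: α = (X X^T + λI)^{-1} y, so w* = Σ_j α^(j) x^(j) = X^T α
alpha = np.linalg.solve(X @ X.T + lam * np.eye(N), y)
w_from_alpha = X.T @ alpha

# The two solutions coincide: w* lies in the span of the training points
assert np.allclose(w_star, w_from_alpha)
```

The theorem below shows this span property holds for *any* loss $L$, not just the squared loss used in this check.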

Proof

1. Decompose w

Any vector $w \in \R^d$ can be uniquely decomposed into two orthogonal components:

$$w = w_{||} + w_{\perp}$$

where

  • $w_{||} \in \text{span}\{x^{(1)}, \dots, x^{(N)}\}$ (parallel component)
  • $w_{\perp} \perp \text{span}\{x^{(1)}, \dots, x^{(N)}\}$ (orthogonal component)
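This decomposition can be computed explicitly by projecting $w$ onto the span of the training points, e.g. via the pseudoinverse. A small NumPy sketch (toy data; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 3, 5                      # fewer points than dimensions, so w_perp can be nonzero
X = rng.normal(size=(N, d))      # rows x^(i) span a proper subspace of R^d
w = rng.normal(size=d)

# Orthogonal projector onto span{x^(1), ..., x^(N)} = column space of X^T
P = X.T @ np.linalg.pinv(X.T)

w_par = P @ w                    # parallel component
w_perp = w - w_par               # orthogonal component

assert np.allclose(w_par + w_perp, w)   # the decomposition reconstructs w
assert np.allclose(X @ w_perp, 0)       # w_perp is orthogonal to every x^(i)
```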

2. Rewrite $w_{||}$

$$w_{||} = \sum^N_{j = 1} \alpha^{(j)}x^{(j)}$$

for some coefficients $\alpha^{(1)}, \dots, \alpha^{(N)} \in \R$

Tip

The full proof can be found in the Orthogonal Decomposition Theorem.

3. Analyze predictions on training data

$$w^\intercal x^{(i)} = (w_{||} + w_{\perp})^\intercal x^{(i)}$$

Given that $w_\perp \perp x^{(i)}$ for every $i$, the second term vanishes:

$$
\begin{align*}
w^\intercal x^{(i)} &= w_{||}^\intercal x^{(i)} + w_\perp^\intercal x^{(i)} \\
&= w_{||}^\intercal x^{(i)}
\end{align*}
$$
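This step can also be checked numerically: dropping $w_\perp$ leaves every training prediction unchanged. A minimal sketch using the same projection idea as above (toy data, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 3, 5
X = rng.normal(size=(N, d))      # rows are the training points x^(i)
w = rng.normal(size=d)

# Projector onto the span of the training points
P = X.T @ np.linalg.pinv(X.T)
w_par = P @ w

# Training predictions from w and from w_par agree on every x^(i)
assert np.allclose(X @ w, X @ w_par)
```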

4. Compare ERM functions

Given that $w_{||} \perp w_{\perp}$, by the Pythagorean theorem:

$$||w||^2 = ||w_{||} + w_\perp||^2 = ||w_{||}||^2 + ||w_\perp||^2$$

Then, substitute both identities back into the risk minimization objective:

$$
\begin{align*}
&\mathcal{J}(w) = \sum^N_{i = 1} L(y^{(i)}, w_{||}^\intercal x^{(i)}) + \lambda||w_{||}||^2 + \lambda||w_{\perp}||^2 \\
&\mathcal{J}(w_{||}) = \sum^N_{i = 1} L(y^{(i)}, w_{||}^\intercal x^{(i)}) + \lambda||w_{||}||^2 \\
&\because \lambda ||w_{\perp}||^2 \ge 0 \text{ and } \mathcal{J}(w) = \mathcal{J}(w_{||}) + \lambda ||w_{\perp}||^2 \\
&\therefore \mathcal{J}(w) \ge \mathcal{J}(w_{||}) \\
&\implies w^* = \sum^N_{j = 1} \alpha^{(j)}x^{(j)}
\end{align*}
$$

Since $\lambda > 0$, the inequality is strict whenever $w_\perp \ne 0$, so every minimizer must have $w_\perp = 0$ and therefore lies in the span of the training points, which gives the claimed form.
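The key identity $\mathcal{J}(w) = \mathcal{J}(w_{||}) + \lambda ||w_\perp||^2$ can be verified numerically for a concrete choice of loss. The sketch below uses the squared loss (one possible $L$, chosen only for this check) with toy data:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, lam = 3, 5, 0.5
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
w = rng.normal(size=d)

def J(v):
    """Regularized empirical risk with squared loss (illustrative choice of L)."""
    return np.sum((y - X @ v) ** 2) + lam * np.dot(v, v)

# Decompose w against the span of the training points
P = X.T @ np.linalg.pinv(X.T)
w_par = P @ w
w_perp = w - w_par

# J(w) = J(w_par) + λ||w_perp||^2, so J(w) >= J(w_par)
assert np.isclose(J(w), J(w_par) + lam * np.dot(w_perp, w_perp))
assert J(w) >= J(w_par)
```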