Consider the regularized empirical risk minimization problem

$$\min_{w \in \mathbb{R}^d} J(w) = \sum_{i=1}^{N} L\big(y^{(i)}, w^\top x^{(i)}\big) + \lambda \|w\|^2$$

where

- $L$: a loss function
- $\{x^{(i)}, y^{(i)}\}_{i=1}^{N}$: the training examples
- $\lambda > 0$: a regularization parameter
- $w \in \mathbb{R}^d$: the weight vector

Any optimal solution $w^*$ can be expressed as

$$w^* = \sum_{j=1}^{N} \alpha^{(j)} x^{(j)}$$

for some coefficients $\alpha^{(1)}, \dots, \alpha^{(N)} \in \mathbb{R}$.
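Before the proof, a quick numerical sanity check. This is a minimal NumPy sketch, assuming squared loss (i.e., ridge regression) as a concrete choice of $L$ and random data: the primal minimizer does lie exactly in the span of the training examples.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 20                                  # fewer examples than features
X = rng.normal(size=(N, d))                   # rows are the x^(i)
y = rng.normal(size=N)
lam = 0.1

# Primal ridge solution: argmin_w sum_i (y_i - w^T x_i)^2 + lam * ||w||^2
w_star = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Try to write w* as X^T alpha, i.e. as a combination of the x^(i)
alpha, *_ = np.linalg.lstsq(X.T, w_star, rcond=None)
print(np.allclose(X.T @ alpha, w_star))       # True: w* is in the span
```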
Proof
1. Decompose $w$
Any vector $w \in \mathbb{R}^d$ can be uniquely decomposed into two orthogonal components:

$$w = w_\parallel + w_\perp$$

where

- $w_\parallel \in \operatorname{span}\{x^{(1)}, \dots, x^{(N)}\}$ (parallel component)
- $w_\perp \perp \operatorname{span}\{x^{(1)}, \dots, x^{(N)}\}$ (orthogonal component)
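A small sketch of this decomposition (same illustrative random-data setup as above): the orthogonal projector onto the span of the $x^{(i)}$ can be built from the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 5, 20
X = rng.normal(size=(N, d))                   # rows are the x^(i)
w = rng.normal(size=d)

P = X.T @ np.linalg.pinv(X.T)                 # projector onto span{x^(1), ..., x^(N)}
w_par = P @ w                                 # parallel component
w_perp = w - w_par                            # orthogonal component

print(np.allclose(w, w_par + w_perp))         # True: the decomposition is exact
print(np.allclose(X @ w_perp, 0))             # True: w_perp is orthogonal to every x^(i)
```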
2. Rewrite $w_\parallel$
Since $w_\parallel$ lies in the span of the training examples,

$$w_\parallel = \sum_{j=1}^{N} \alpha^{(j)} x^{(j)}$$

for some coefficients $\alpha^{(1)}, \dots, \alpha^{(N)}$.
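Continuing the sketch above (same assumptions), these coefficients can be recovered explicitly with a least-squares solve, since $w_\parallel$ lies in the row space of $X$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 5, 20
X = rng.normal(size=(N, d))
w = rng.normal(size=d)
w_par = X.T @ np.linalg.pinv(X.T) @ w         # parallel component from Step 1

# w_par is in the row space of X, so X^T alpha = w_par has an exact solution
alpha, *_ = np.linalg.lstsq(X.T, w_par, rcond=None)
print(np.allclose(X.T @ alpha, w_par))        # True: w_par = sum_j alpha^(j) x^(j)
```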
3. Analyze predictions on training data
$$w^\top x^{(i)} = (w_\parallel + w_\perp)^\top x^{(i)}$$

Given that $w_\perp \perp x^{(i)}$,

$$w^\top x^{(i)} = w_\parallel^\top x^{(i)} + w_\perp^\top x^{(i)} = w_\parallel^\top x^{(i)}$$
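Numerically (same illustrative setup as before), dropping $w_\perp$ leaves every training prediction unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 5, 20
X = rng.normal(size=(N, d))
w = rng.normal(size=d)
w_par = X.T @ np.linalg.pinv(X.T) @ w         # parallel component from Step 1

# Predictions on the training set depend only on w_par
print(np.allclose(X @ w, X @ w_par))          # True
```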
4. Compare objective values
Given that $w_\parallel \perp w_\perp$,

$$\|w\|^2 = \|w_\parallel + w_\perp\|^2 = \|w_\parallel\|^2 + \|w_\perp\|^2$$
Then, plug these back into the objective:

$$J(w) = \sum_{i=1}^{N} L\big(y^{(i)}, w_\parallel^\top x^{(i)}\big) + \lambda \|w_\parallel\|^2 + \lambda \|w_\perp\|^2$$

$$J(w_\parallel) = \sum_{i=1}^{N} L\big(y^{(i)}, w_\parallel^\top x^{(i)}\big) + \lambda \|w_\parallel\|^2$$

Since $\lambda \|w_\perp\|^2 \ge 0$ and $J(w) = J(w_\parallel) + \lambda \|w_\perp\|^2$,

$$J(w) \ge J(w_\parallel)$$

In particular, if $w^*$ is optimal, then $J(w^*) \le J(w^*_\parallel) \le J(w^*)$ forces $\lambda \|w^*_\perp\|^2 = 0$; with $\lambda > 0$ this gives $w^*_\perp = 0$, so

$$w^* = \sum_{j=1}^{N} \alpha^{(j)} x^{(j)}$$
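A last sketch (again with squared loss as a concrete stand-in for $L$; the argument itself is loss-agnostic) checking the two identities used above, $J(w) = J(w_\parallel) + \lambda\|w_\perp\|^2$ and hence $J(w) \ge J(w_\parallel)$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 5, 20
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
lam = 0.1

def J(w):
    """Objective with squared loss standing in for L."""
    return np.sum((y - X @ w) ** 2) + lam * np.dot(w, w)

w = rng.normal(size=d)
w_par = X.T @ np.linalg.pinv(X.T) @ w         # parallel component
w_perp = w - w_par

print(np.isclose(J(w), J(w_par) + lam * np.dot(w_perp, w_perp)))  # True
print(J(w) >= J(w_par))                       # True
```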