Full SVM Derivation
1) Primal (soft-margin, L1 slack)
$$
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_{i=1}^N \xi_i
\quad\text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0\ \ (i = 1,\dots,N).
$$
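A minimal NumPy sketch of this objective, using the fact that at the optimum the slack equals the hinge loss, $\xi_i = \max(0,\ 1 - y_i(w^\top x_i + b))$; the function and variable names here are illustrative, not part of the derivation:

```python
import numpy as np

def primal_objective(w, b, X, y, C):
    """Soft-margin primal objective with slacks written as hinge losses.

    X: (N, d) data matrix, y: (N,) labels in {-1, +1}. Illustrative names.
    """
    margins = y * (X @ w + b)            # y_i (w^T x_i + b)
    xi = np.maximum(0.0, 1.0 - margins)  # optimal slack = hinge loss
    return 0.5 * np.dot(w, w) + C * xi.sum()
```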
2) Lagrangian and KKT
Introduce multipliers $\alpha_i \ge 0$ for the constraints $1 - \xi_i - y_i(w^\top x_i + b) \le 0$ and $\mu_i \ge 0$ for $-\xi_i \le 0$:

$$
\mathcal{L}(w, b, \xi, \alpha, \mu) = \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i + \sum_i \alpha_i\bigl(1 - \xi_i - y_i(w^\top x_i + b)\bigr) - \sum_i \mu_i \xi_i.
$$
Stationarity:
$$
\frac{\partial \mathcal{L}}{\partial w} = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i,
\qquad
\frac{\partial \mathcal{L}}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0,
$$
$$
\frac{\partial \mathcal{L}}{\partial \xi_i} = 0 \;\Rightarrow\; C - \alpha_i - \mu_i = 0 \;\Rightarrow\; 0 \le \alpha_i \le C.
$$
Complementary slackness:
$$
\alpha_i\bigl(1 - \xi_i - y_i(w^\top x_i + b)\bigr) = 0,
\qquad
\mu_i \xi_i = 0.
$$
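A rough numeric sanity check of these two conditions, as a sketch that assumes the slacks are taken to be the hinge losses and reuses the illustrative names from above:

```python
import numpy as np

def kkt_violation(alpha, w, b, X, y, C):
    """Largest violation of the complementary-slackness conditions (illustrative helper)."""
    margins = y * (X @ w + b)
    xi = np.maximum(0.0, 1.0 - margins)        # slacks implied by the margins
    mu = C - alpha                             # from stationarity in xi: mu_i = C - alpha_i
    comp_alpha = alpha * (1.0 - xi - margins)  # alpha_i (1 - xi_i - y_i f(x_i)) should be 0
    comp_mu = mu * xi                          # mu_i xi_i should be 0
    return max(np.abs(comp_alpha).max(), np.abs(comp_mu).max())
```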
3) Dual (linear features)
Eliminate $(w, b, \xi)$ using stationarity:

$$
\max_{\alpha}\ \sum_{i=1}^N \alpha_i - \tfrac{1}{2}\sum_{i,j=1}^N \alpha_i \alpha_j y_i y_j\, x_i^\top x_j
\quad\text{s.t.}\quad \sum_i \alpha_i y_i = 0,\quad 0 \le \alpha_i \le C.
$$

Matrix form: let $K_{ij} = x_i^\top x_j$, $Y = \operatorname{diag}(y)$, $\alpha \in \mathbb{R}^N$,

$$
L(\alpha) = \mathbf{1}^\top \alpha - \tfrac{1}{2}\,\alpha^\top (Y K Y)\,\alpha,
\quad\text{with}\quad \sum_i \alpha_i y_i = 0,\ \ 0 \le \alpha_i \le C.
$$
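A direct NumPy transcription of this matrix form (the function name is illustrative; `(y[:, None] * y[None, :]) * K` is the elementwise version of $Y K Y$):

```python
import numpy as np

def dual_objective(alpha, K, y):
    """L(alpha) = 1^T alpha - 0.5 * alpha^T (Y K Y) alpha."""
    YKY = (y[:, None] * y[None, :]) * K   # entry (i, j) is y_i y_j K_ij
    return alpha.sum() - 0.5 * alpha @ YKY @ alpha
```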
4) Kernel trick
Replace $x_i^\top x_j$ by a kernel $K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$. Define the Gram matrix

$$
K_{ij} = K(x_i, x_j).
$$

Then the dual is identical, with this $K$:

$$
L(\alpha) = \sum_i \alpha_i - \tfrac{1}{2}\,\alpha^\top (Y K Y)\,\alpha,
\qquad \sum_i \alpha_i y_i = 0,\ \ 0 \le \alpha_i \le C.
$$
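As one concrete example (the RBF kernel and the `gamma` default are assumptions for illustration, not something the derivation requires), a vectorized Gram-matrix builder might look like:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gram matrix with entries exp(-gamma * ||a_i - b_j||^2).

    A: (n, d), B: (m, d); gamma is an illustrative default.
    """
    sq_dists = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))  # clamp tiny negatives from rounding
```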
5) Dual gradient (for updates)
$$
\frac{\partial L}{\partial \alpha_i} = 1 - \sum_{j=1}^N \alpha_j y_i y_j K_{ij} = 1 - y_i \sum_{j=1}^N \alpha_j y_j K(x_i, x_j).
$$
Vector form:
$$
\nabla_\alpha L = \mathbf{1} - (Y K Y)\,\alpha.
$$
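The same expression in NumPy, reusing the illustrative $YKY$ construction from above:

```python
import numpy as np

def dual_gradient(alpha, K, y):
    """grad L(alpha) = 1 - (Y K Y) alpha."""
    YKY = (y[:, None] * y[None, :]) * K
    return np.ones_like(alpha) - YKY @ alpha
```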
6) Projected gradient ascent on $\alpha$
Ascent step (step size $\eta > 0$):

$$
\tilde{\alpha} \leftarrow \alpha + \eta\, \nabla_\alpha L.
$$

Project onto the affine constraint $\sum_i \alpha_i y_i = 0$:

$$
\tilde{\alpha} \leftarrow \tilde{\alpha} - y\,\frac{y^\top \tilde{\alpha}}{y^\top y}.
$$

Project onto the box $0 \le \alpha_i \le C$:

$$
\alpha_i \leftarrow \min\bigl(\max(\tilde{\alpha}_i,\, 0),\, C\bigr).
$$

(For hard margin, use only $\alpha_i \ge 0$.)
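A minimal training loop that follows these three steps literally (step size, iteration count, and function name are illustrative; note that applying the two projections one after the other, as above, only approximately projects onto their intersection, which is usually adequate for a sketch):

```python
import numpy as np

def train_dual(K, y, C=1.0, eta=1e-3, n_iters=1000):
    """Projected gradient ascent on the dual; K is the (N, N) Gram matrix, y in {-1, +1}^N."""
    alpha = np.zeros(len(y))
    YKY = (y[:, None] * y[None, :]) * K
    for _ in range(n_iters):
        alpha = alpha + eta * (1.0 - YKY @ alpha)   # ascent step along grad L
        alpha = alpha - y * (y @ alpha) / (y @ y)   # project onto sum_i alpha_i y_i = 0
        alpha = np.clip(alpha, 0.0, C)              # project onto the box [0, C]
    return alpha
```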
7) Recover $w$ and $b$
Linear kernel (explicit $w$):

$$
w^* = \sum_{i=1}^N \alpha_i^* y_i x_i.
$$

For any support vector $k$ with $0 < \alpha_k^* < C$, KKT gives

$$
y_k(w^{*\top} x_k + b^*) = 1 \;\Rightarrow\; b^* = y_k - w^{*\top} x_k.
$$

In practice, average over all such $k$.

Kernel case (no explicit $w$):

$$
b^* = y_k - \sum_{j=1}^N \alpha_j^* y_j K(x_j, x_k),
\quad\text{averaged over } k \text{ with } 0 < \alpha_k^* < C.
$$
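A sketch of both recoveries (the tolerance and names are illustrative; it assumes at least one margin support vector with $0 < \alpha_k < C$ exists, and `K` is the training Gram matrix, so `K = X @ X.T` in the linear case):

```python
import numpy as np

def recover_w_b(alpha, X, y, K, C, tol=1e-6):
    """Recover w (linear case) and b from the optimal alpha."""
    w = (alpha * y) @ X                            # w* = sum_i alpha_i y_i x_i
    sv = (alpha > tol) & (alpha < C - tol)         # margin support vectors: 0 < alpha_k < C
    b = np.mean(y[sv] - (alpha * y) @ K[:, sv])    # b* = y_k - sum_j alpha_j y_j K(x_j, x_k), averaged
    return w, b
```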
8) Decision function and prediction
Linear:
$$
f(x) = w^{*\top} x + b^*, \qquad \hat{y} = \operatorname{sign}\bigl(f(x)\bigr).
$$
Kernel:
$$
f(x) = \sum_{i=1}^N \alpha_i^* y_i\, K(x_i, x) + b^*, \qquad \hat{y} = \operatorname{sign}\bigl(f(x)\bigr).
$$
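In NumPy, with `K_test` holding $K(x_i, x^{\text{test}}_j)$ for training point $i$ and test point $j$ (illustrative names):

```python
import numpy as np

def predict(alpha, y_train, K_test, b):
    """Decision values and predicted labels; K_test has shape (N_train, N_test)."""
    f = (alpha * y_train) @ K_test + b
    return f, np.sign(f)
```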
9) If your kernel is a pairwise function (no batching)
Build Gram matrices by pairs:
$$
K_{ij} = K(x_i, x_j),
\qquad
K^{\text{test}}_{ij} = K(x_i, x^{\text{test}}_j),
$$

then use the same formulas above for $L(\alpha)$, $\nabla_\alpha L$, $b^*$, and $f(x)$.
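If `kernel_fn(a, b)` only accepts a single pair of points, a plain double loop builds the same matrices (names are illustrative; slow but sufficient for small $N$):

```python
import numpy as np

def gram_from_pairwise(kernel_fn, A, B):
    """Gram matrix with entries kernel_fn(A[i], B[j])."""
    return np.array([[kernel_fn(a, b) for b in B] for a in A])

# Usage: K = gram_from_pairwise(k, X_train, X_train)
#        K_test = gram_from_pairwise(k, X_train, X_test)
```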
That’s the full derivation, start to finish.