5. SVM

Linear Classifiers

Caution

Many different decision boundaries can separate the same training data, so we need a criterion for choosing the best one.


Maximizing the margin


  • $\gamma$: the distance from a data point to the decision boundary

Decision Rule

Hyperplane

$$H = \{x: w^\intercal x + b = 0\}$$

w

  • a vector perpendicular to the decision boundary (the median line of the "street")


u

  • an unknown example


$$w^\intercal u \ge -b$$


  • Project $u$ onto $w$ and classify by which side of the hyperplane it falls on:
$$\begin{cases} \text{positive} & (w^\intercal u + b \ge 0) \\ \text{negative} & (\text{otherwise}) \end{cases}$$
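A minimal sketch of this rule in PyTorch; the vectors `w`, `b`, and `u` below are made-up values for illustration:

```python
import torch

# Hypothetical parameters of a learned hyperplane and an unknown example.
w = torch.tensor([2.0, 1.0])   # normal vector of the hyperplane
b = -1.0                       # bias
u = torch.tensor([0.5, 0.5])   # query point

# Decision rule: project u onto w and check which side of the hyperplane it lands on.
label = "positive" if torch.dot(w, u) + b >= 0 else "negative"
print(label)  # positive, since 2*0.5 + 1*0.5 - 1 = 0.5 >= 0
```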

How to calculate w and b?

Add more constraints


$$\begin{align*}
&w^\intercal x^+ + b \ge a \\
&w^\intercal x^- + b \le -a \\
&\text{Let } y \text{ be the label: } y = \begin{cases} +1 & \text{(positive sample)}\\ -1 & \text{(negative sample)} \end{cases} \\
&y^+(w^\intercal x^+ + b) \ge a \\
&y^-(w^\intercal x^- + b) \ge a
\end{align*}$$
$$\begin{align*}
\implies\ & y^{(i)}(w^\intercal x^{(i)} + b) \ge a \\
& y^{(i)}(w^\intercal x^{(i)} + b) = a \quad (\text{for } x^{(i)} \text{ in the gutter})
\end{align*}$$

Width of the Street


  • Given that $y^{(i)}(w^\intercal x^{(i)} + b) = a$ for $x^{(i)}$ in the gutter
$$\begin{align*}
\text{width} &= \frac{w^\intercal}{\|w\|_2} (x^+ - x^-) \\
&= \frac{(a - b) - (-a - b)}{\|w\|_2} \qquad (\because w^\intercal x^+ = a - b,\ \ w^\intercal x^- = -a - b \ \text{ for gutter points}) \\
&= \frac{2a}{\|w\|_2}
\end{align*}$$
$$\max_{w, b} \frac{2a}{\|w\|_2} \iff \max_{w, b} \frac{a}{\|w\|_2} \iff \min_{w, b} \frac{\|w\|_2}{a} \quad \text{s.t. } y^{(i)}(w^\intercal x^{(i)} + b) \ge a$$
  • Since $a$ is an arbitrary positive scale, we can normalize by $a$ (equivalently, set $a = 1$); minimizing $\|w\|_2$ is equivalent to minimizing $\frac{1}{2}\|w\|_2^2$:
$$\min_{w, b} \frac{1}{2} \|w\|^2_2 \quad \text{s.t. } y^{(i)}(w^\intercal x^{(i)} + b) \ge 1$$

Support Vectors


  • Support vectors lie on the margin hyperplanes:
$$\begin{cases} w^\intercal x + b = 1 \\ w^\intercal x + b = -1 \end{cases}$$
  • Under the KKT conditions, only a few $\alpha_i$ can be non-zero:
$$\alpha_i^* g_i(w^*) = 0, \quad i = 1, \dots, N$$

where

  • $g_i(w^*) = 1 - y^{(i)}(w^\intercal x^{(i)} + b)$


  • Training data are called support vectors if their $\alpha_i$ are non-zero
  • $\because$
      • only a few $\alpha_i$ can be non-zero
      • $y^{(i)}(w^\intercal x^{(i)} + b) = 1$ when $\alpha_i$ is non-zero
  • $\therefore$
      • $b = y^{(t)} - w^\intercal x^{(t)} \quad (\alpha_t > 0)$

Hard Margin SVM

  • Assume that data are linearly separable


$$\argmin_{w, b} \frac{1}{2} \|w\|^2_2 \quad \text{s.t. } y^{(i)}(w^\intercal x^{(i)} + b) \ge 1$$
  • This is a convex quadratic programming problem with linear constraints

Tip

What is Quadratic Programming?

  • A type of mathematical optimization problem
  • Similar to linear programming but minimizes a quadratic function
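To see the QP structure of the hard-margin problem concretely, here is a minimal sketch that solves the primal with CVXPY on a tiny made-up data set (CVXPY and the toy data are my own choices for illustration, not part of the lecture):

```python
import numpy as np
import cvxpy as cp

# Tiny linearly separable toy data (hypothetical).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Hard-margin primal: minimize (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)  # the resulting margin width is 2 / ||w||
```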

Soft Margin SVM

  • In real-world data, there may be:
  • Overlapping classes
  • Noisy points
  • Outliers
  • Assume that the data may not be linearly separable
  • Introduce slack variables to measure how much a point violates the margin constraint, plus a trade-off parameter between error and margin


$$\begin{align*} &\argmin_{w, b, \xi} \frac{1}{2} \|w\|^2_2 + C \sum^N_{i = 1} \xi_i\\ &\text{s.t. } y^{(i)}(w^\intercal x^{(i)} + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \end{align*}$$

C


  • regularization parameter
  • tolerance controller for errors
  • controls the trade-off between a large margin (generalization) and few margin violations (accuracy on the training data); see the sketch below
  • $C \rightarrow \infty$:
      • Hard Margin SVM
  • Small $C$:
      • The optimizer allows larger $\xi_i$, tolerating misclassifications to get a wider margin.
  • Large $C$:
      • The optimizer will do everything it can to make $\xi_i = 0$, even if it means a smaller margin.
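A quick illustration of this trade-off using scikit-learn's `SVC` (an outside tool chosen for demonstration; the blob data are hypothetical): a smaller `C` generally leaves more points inside the margin and hence yields more support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping 2-D Gaussian blobs (hypothetical data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=(-1, -1), scale=1.0, size=(50, 2)),
               rng.normal(loc=(1, 1), scale=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Larger C -> fewer margin violations tolerated -> usually fewer support vectors.
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors")
```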

$\xi_i$

  • slack variables, one for each data point
  • pay a linear penalty for each violated margin constraint (this is the hinge loss; see the sketch after this list):
$$\begin{align*} \xi_i &= \begin{cases} 1 - y^{(i)}(w^\intercal x^{(i)} + b) & \big(y^{(i)}(w^\intercal x^{(i)} + b) < 1\big) \\ 0 & \big(y^{(i)}(w^\intercal x^{(i)} + b) \ge 1\big) \end{cases} \\ &= \max\big(0,\ 1 - y^{(i)}(w^\intercal x^{(i)} + b)\big) \end{align*}$$
  • Interpretation of the resulting $\xi_i$:
$$\begin{cases} \xi_i = 0 & \text{(correct, outside/on the margin)} \\ 0 < \xi_i < 1 & \text{(correct, but inside the margin)} \\ \xi_i > 1 & \text{(misclassified, wrong side of the decision boundary)} \end{cases}$$
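Since $\xi_i = \max(0,\ 1 - y^{(i)}(w^\intercal x^{(i)} + b))$, the soft-margin primal can also be minimized directly by (sub)gradient descent on the hinge loss. A minimal PyTorch sketch on made-up toy tensors:

```python
import torch

# Hypothetical toy data: N x d features, labels in {+1, -1}.
X = torch.tensor([[2.0, 2.0], [1.0, 2.5], [-1.0, -2.0], [-2.0, -0.5]])
y = torch.tensor([1.0, 1.0, -1.0, -1.0])
C = 1.0

w = torch.zeros(2, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([w, b], lr=0.1)

for _ in range(200):
    margins = y * (X @ w + b)                 # y_i (w^T x_i + b)
    slack = torch.clamp(1 - margins, min=0)   # xi_i = max(0, 1 - y_i f(x_i))
    loss = 0.5 * w.dot(w) + C * slack.sum()   # soft-margin primal objective
    opt.zero_grad()
    loss.backward()
    opt.step()

print(w.detach(), b.detach())
```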

New Optimization Problem

  • Make the margin as large as possible
  • Keep the total slack small

Margin Support Vector

  • $\xi_i = 0,\ y^{(i)} (w^\intercal x + b) = 1$
  • don’t contribute to objective but enforce constraints on solution
  • correctly classified

Digression to Lagrange Duality

Primal Problem vs. Dual Problem

The Primal Problem

  • The SVM formulation expressed directly in terms of the weights ($w$), bias ($b$), and slack variables ($\xi_i$)
  • It’s primal because it directly optimizes the parameters of the model.

Caution

It's expensive to solve when the feature space is large, since you are optimizing over $w \in \mathbb{R}^d$.

The Dual Problem

  • Use Lagrange multipliers $\alpha_i$ to handle the constraints
  • Depends only on dot products between data points, allowing the kernel trick
  • $\alpha_i$ correspond to the importance (weights) of each training point

From Primal to Dual

1. The original (Soft-Margin) SVM Primal Problem

$$\min_{\theta} f(\theta) \quad \text{s.t. } g_i(\theta) \le 0,\ h_i(\theta) = 0$$

where

  • $\theta = (w, b, \xi)$
  • $f(\theta) = \frac{1}{2}\|w\|^2 + C \sum_i^N \xi_i$
  • Constraints:
      • $g_i(\theta) \le 0, \quad i = 1, \dots, k$
      • $h_i(\theta) = 0, \quad i = 1, \dots, l$

2. Generalized Lagrangian:

$$L(\theta, \alpha, \beta) = f(\theta) + \sum^k_{i = 1} \alpha_i g_i (\theta) + \sum^{l}_{i = 1} \beta_i h_i(\theta)$$

3. Derive the Re-written Primal

  • Given the Lagrange multipliers $\alpha$ (with $\alpha_i \ge 0$) and $\beta$:
$$\begin{align*} &\max_{\alpha, \beta:\ \alpha_i \ge 0} L(\theta, \alpha, \beta) = \begin{cases} f(\theta) & (\text{if } \theta \text{ satisfies the primal constraints}) \\ \infty & (\text{otherwise}) \end{cases} \\ &\implies \min_\theta \max_{\alpha, \beta:\ \alpha_i \ge 0} L(\theta, \alpha, \beta) \quad \text{(re-written primal)} \end{align*}$$

Lagrange Duality

The Re-written Primal

$$\min_{\theta} \max_{\alpha, \beta:\ \alpha_i \ge 0} L(\theta, \alpha, \beta)$$

The Dual Optimization Problem

  • The value of the dual problem is always a lower bound on the value of the primal problem
$$\max_{\alpha, \beta:\ \alpha_i \ge 0} \min_{\theta} L(\theta, \alpha, \beta)$$

Connection between Primal and Dual

Weak Duality

  • Always holds
$$\min_{\theta} \max_{\alpha, \beta:\ \alpha_i \ge 0} L(\theta, \alpha, \beta) \ge \max_{\alpha, \beta:\ \alpha_i \ge 0} \min_{\theta} L(\theta, \alpha, \beta)$$
  • Maximizing over the multipliers after minimizing over the parameters (the dual) can never exceed minimizing after maximizing (the primal)

Strong Duality

$$\min_{\theta} \max_{\alpha, \beta:\ \alpha_i \ge 0} L(\theta, \alpha, \beta) = \max_{\alpha, \beta:\ \alpha_i \ge 0} \min_{\theta} L(\theta, \alpha, \beta)$$
  • Whenever strong duality holds, the Karush-Kuhn-Tucker (KKT) conditions are true for $\theta^*, \alpha^*, \beta^*$
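A one-dimensional worked example (my own illustration, not from the lecture) makes this concrete: minimize $f(x) = x^2$ subject to $g(x) = 1 - x \le 0$, so $x^* = 1$ and the primal value is $p^* = 1$.

$$\begin{align*}
L(x, \alpha) &= x^2 + \alpha(1 - x), \quad \alpha \ge 0 \\
\frac{\partial L}{\partial x} = 2x - \alpha = 0 &\implies x = \tfrac{\alpha}{2} \implies d(\alpha) = \min_x L(x, \alpha) = \alpha - \tfrac{\alpha^2}{4} \\
\max_{\alpha \ge 0} d(\alpha) &\implies \alpha^* = 2, \quad d^* = 1 = p^*
\end{align*}$$

Every $d(\alpha)$ lower-bounds $p^*$ (weak duality), the optimal values coincide (strong duality), and $\alpha^* g(x^*) = 2 \cdot 0 = 0$ (complementary slackness).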

Karush-Kuhn-Tucker (KKT)

Stationarity

$$\frac{\partial}{\partial \theta_i} L(\theta^*, \alpha^*, \beta^*) = 0$$
  • gradient equilibrium
  • No more improvement condition

Primal Feasibility

$$g_i(\theta^*) \le 0, \quad h_j(\theta^*) = 0$$
  • obey the original constraints

Dual Feasibility

$$\alpha_i^* \ge 0$$
  • obey the dual constraints
  • No negative penalties

Complementary Slackness

$$\alpha_i^* g_i(\theta^*) = 0$$
  • If a constraint is active ($g_i(\theta^*) = 0$):
      • $\alpha_i^*$ may be positive
  • If a constraint is inactive ($g_i(\theta^*) < 0$):
      • $\alpha^*_i$ must be zero
| KKT Condition | Meaning in SVM |
| --- | --- |
| Stationarity | $w = \sum_i \alpha_i y_i x_i$, $\sum_i \alpha_i y_i = 0$ |
| Primal feasibility | $y_i(w^\top x_i + b) - 1 + \xi_i \ge 0$, $\xi_i \ge 0$ |
| Dual feasibility | $\alpha_i \ge 0$ |
| Complementary slackness | $\alpha_i\,[y_i(w^\top x_i + b) - 1 + \xi_i] = 0$, $\mu_i\xi_i = 0$ |

How to Minimize without thinking about the constraints?

$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2_2 + \sum^N_{i= 1} \alpha_i \big(1- y^{(i)}(w^\intercal x^{(i)} + b )\big)$$

1. Minimize $L(w, b, \alpha)$ w.r.t. $w, b$

$$\begin{align*} \frac{\partial L(w, b, \alpha)}{\partial w} &= 0 \\ w - \sum^N_{i = 1} \alpha_i y^{(i)}x^{(i)} &= 0 \\ w &= \sum^N_{i = 1} \alpha_i y^{(i)}x^{(i)} \end{align*}$$
$$\begin{align*} \frac{\partial L(w, b, \alpha)}{\partial b} &= 0 \\ -\sum^N_{i = 1} \alpha_i y^{(i)} &= 0 \\ \sum^N_{i = 1} \alpha_i y^{(i)} &= 0 \end{align*}$$

2. Plug them back to derive SVM Lagrangian

$$\begin{align*} L(w, b, \alpha) &= \frac{1}{2} \left(\sum^N_{i = 1} \alpha_i y^{(i)} x^{(i)} \right)^{\!\intercal} \left(\sum^N_{j = 1} \alpha_j y^{(j)} x^{(j)} \right) + \sum^N_{i = 1} \alpha_i - \sum^{N}_{i = 1} \alpha_i y^{(i)} \left(\sum^{N}_{j = 1}\alpha_j y^{(j)}x^{(j)}\right)^{\!\intercal} x^{(i)} - \sum^N_{i = 1} \alpha_i y^{(i)}b \\ &= -\frac{1}{2} \sum^N_{i=1}\sum^N_{j=1} \alpha_i \alpha_j y^{(i)}y^{(j)} x^{(i)\intercal}x^{(j)} + \sum^N_{i = 1} \alpha_i \end{align*}$$

3. Derive Hard Margin SVM Dual

$$\begin{align*} \max_{\alpha}\ & \sum^N_{i = 1} \alpha_i - \frac{1}{2} \sum^N_{i = 1} \sum^N_{j = 1} \alpha_i \alpha_j y^{(i)}y^{(j)}x^{(i)\intercal}x^{(j)} \\ \text{s.t. }\ & \alpha_i \ge 0 \quad (i = 1, \dots, N), \quad \sum^N_{i = 1} \alpha_i y^{(i)} = 0 \end{align*}$$
  • The dual problem is also a QP; its solution gives the global maximizer $\alpha^*$
  • $w^*$ can be recovered (see the sketch after this list):
$$w^* = \sum^{N}_{i = 1} \alpha_i^* y^{(i)}x^{(i)}$$
  • Dependence of $w^*$ on the training samples:
$$\begin{cases} \text{dependent on }(x^{(i)}, y^{(i)}) & (\alpha_i^* > 0) \\ \text{independent of } (x^{(i)}, y^{(i)}) & (\alpha_i^* = 0) \end{cases}$$
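A short PyTorch sketch of this recovery (the names `alpha`, `x_train`, `y_train` mirror the Implementation section below; this is an illustration, not required code):

```python
import torch

def recover_w(alpha, x_train, y_train):
    # w* = sum_i alpha_i * y_i * x_i; only support vectors (alpha_i > 0) contribute.
    return ((alpha * y_train).unsqueeze(1) * x_train).sum(dim=0)
```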

4. Derive Soft Margin SVM Dual

$$\begin{align*} \max_{\alpha}\ & \sum^N_{i = 1} \alpha_i - \frac{1}{2} \sum^N_{i = 1} \sum^N_{j = 1} \alpha_i \alpha_j y^{(i)}y^{(j)}x^{(i)\intercal}x^{(j)} \\ \text{s.t. }\ & 0 \le \alpha_i \le C \quad (i = 1, \dots, N), \quad \sum^N_{i = 1} \alpha_i y^{(i)} = 0 \end{align*}$$
  • It's almost the same as the hard-margin dual, but each $\alpha_i$ now has an upper bound $C$

Results: Primal ↔ Dual (Hard vs. Soft) + Prediction

Summary Table

| Variant | Objective Function | Constraints | Solution / Prediction | KKT Interpretation |
| --- | --- | --- | --- | --- |
| Hard-Margin (Primal) | $\min_{w,b} \tfrac{1}{2}\Vert w\Vert^2$ | $y_i(w^\top x_i + b) \ge 1$ | Margin $= \tfrac{2}{\Vert w\Vert}$ | |
| Hard-Margin (Dual) | $\max_{\alpha \ge 0} \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, x_i^\top x_j$ | $\sum_i \alpha_i y_i = 0$ | $w^* = \sum_i \alpha_i^* y_i x_i$ | $\alpha_i > 0 \Rightarrow y_i(w^\top x_i + b) = 1$ |
| Soft-Margin (Primal) | $\min_{w,b,\xi} \tfrac{1}{2}\Vert w\Vert^2 + C\sum_i \xi_i$ | $y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0$ | Equivalent to minimizing the hinge loss $\tfrac{1}{2}\Vert w\Vert^2 + C\sum_i \max(0,1-y_i f(x_i))$ | |
| Soft-Margin (Dual) | $\max_{0 \le \alpha_i \le C} \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, x_i^\top x_j$ | $\sum_i \alpha_i y_i = 0$ | $w^* = \sum_i \alpha_i^* y_i x_i$ | $\alpha_i = 0 \Rightarrow y_i f(x_i) \ge 1$ <br> $0<\alpha_i<C \Rightarrow y_i f(x_i) = 1$ <br> $\alpha_i=C \Rightarrow y_i f(x_i) \le 1$ |
| Kernelized Dual (Both) | Replace $x_i^\top x_j$ with $K(x_i, x_j)$ | Same constraints | $f(x) = \text{sign}\big(\sum_i \alpha_i^* y_i\, K(x_i, x) + b^*\big)$ | Only support vectors ($\alpha_i^* > 0$) contribute |

Recovering $w$ and $b$

  • Weight vector (linear case):
$$w^* = \sum_i \alpha_i^* y_i x_i$$
  • Bias term: for any margin support vector $x_t$ with $0 < \alpha_t^* < C$:
$$b^* = y_t - \sum_j \alpha_j^* y_j K(x_j, x_t)$$

If multiple margin SVs exist, take the average $b^*$.
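A small PyTorch sketch of this averaging, assuming a dual solution `alpha`, labels `y`, an `N x N` Gram matrix `K`, and the soft-margin bound `C` (all hypothetical names; pass `C = float('inf')` for the hard-margin case):

```python
import torch

def recover_bias(alpha, y, K, C, eps=1e-6):
    # Margin support vectors satisfy 0 < alpha_t < C: they sit exactly on the margin.
    margin_sv = (alpha > eps) & (alpha < C - eps)
    # b* = y_t - sum_j alpha_j y_j K(x_j, x_t), averaged over all margin SVs.
    b_vals = y[margin_sv] - ((alpha * y).unsqueeze(1) * K[:, margin_sv]).sum(dim=0)
    return b_vals.mean()
```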


KKT Conditions (Soft-Margin)

| Condition | Mathematical Form | Meaning in SVM |
| --- | --- | --- |
| Stationarity | $w = \sum_i \alpha_i y_i x_i$, $\sum_i \alpha_i y_i = 0$, $C - \alpha_i - \mu_i = 0$ | Gradient equilibrium |
| Primal Feasibility | $y_i(w^\top x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$ | Must satisfy primal constraints |
| Dual Feasibility | $\alpha_i \ge 0$, $\mu_i \ge 0$ | Non-negative multipliers |
| Complementary Slackness | $\alpha_i [1 - \xi_i - y_i(w^\top x_i + b)] = 0$, $\mu_i \xi_i = 0$ | Active constraints ↔ positive multipliers |

Interpretation

| Condition | $y_i f(x_i)$ | Classification | Margin Status | Slack Variable $\xi_i$ | Interpretation |
| --- | --- | --- | --- | --- | --- |
| Outside margin | $> 1$ | ✅ Correct | Outside margin | $\xi_i = 0$ | Confidently correct (no penalty) |
| On margin | $= 1$ | ✅ Correct | On boundary | $\xi_i = 0$ | Support vector (defines margin) |
| Inside margin (correct side) | $0 < y_i f(x_i) < 1$ | ✅ Correct | Inside margin | $0 < \xi_i < 1$ | Correct but too close to boundary (penalized) |
| Misclassified (wrong side) | $y_i f(x_i) < 0$ | ❌ Incorrect | Inside margin (wrong side) | $\xi_i > 1$ | Misclassified (heavily penalized) |
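These cases can be checked numerically; a small sketch assuming already-trained parameters `w`, `b` and data tensors `X`, `y` (hypothetical names):

```python
import torch

def categorize(w, b, X, y):
    """Report margin status and slack for each training point."""
    yf = y * (X @ w + b)                 # functional margin y_i f(x_i)
    xi = torch.clamp(1 - yf, min=0)      # slack xi_i = max(0, 1 - y_i f(x_i))
    for margin, slack in zip(yf.tolist(), xi.tolist()):
        if margin > 1 + 1e-6:
            status = "correct, outside the margin"
        elif margin > 1 - 1e-6:
            status = "correct, on the margin (support vector)"
        elif margin > 0:
            status = "correct, but inside the margin"
        else:
            status = "misclassified"
        print(f"y_i f(x_i) = {margin:+.2f}, xi_i = {slack:.2f} -> {status}")
    return yf, xi
```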

Final Decision Function

  • Linear Case:
$$f(x) = \text{sign}\!\left(\sum_i \alpha_i^* y_i\, x_i^\top x + b^*\right)$$
  • Kernelized Case:
$$f(x) = \text{sign}\!\left(\sum_i \alpha_i^* y_i\, K(x_i, x) + b^*\right)$$

Margin and Limit Behavior

  • Margin width: $\frac{2}{\|w^*\|}$
  • As $C \to \infty$: Soft-margin SVM → Hard-margin SVM (if the data are separable)
  • As $C \to 0$: The margin term dominates; high tolerance for errors (underfitting)

Intuitive Summary

  • Hard-Margin: No misclassification; all points must be correctly separated.
  • Soft-Margin: Allows some violations; balances accuracy and generalization via $C$.
  • Support Vectors: Points with $\alpha_i > 0$ define the decision boundary.
  • Dual Form: Enables the kernel trick; optimization depends only on dot products or $K(x_i, x_j)$.

Implementation

Helper

```python
import torch


# Polynomial kernel: k(x, y) = (1 + <x, y>)^degree
def poly_implementation(x, y, degree):
    assert x.size() == y.size(), 'The dimensions of inputs do not match!'
    with torch.no_grad():
        return (1 + (x * y).sum()).pow(degree)


def poly(degree):
    return lambda x, y: poly_implementation(x, y, degree)


# RBF kernel: k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
def rbf_implementation(x, y, sigma):
    assert x.size() == y.size(), 'The dimensions of inputs do not match!'
    with torch.no_grad():
        return (-(x - y).norm().pow(2) / 2 / sigma / sigma).exp()


def rbf(sigma):
    return lambda x, y: rbf_implementation(x, y, sigma)


# XOR toy data: four points whose labels are not linearly separable.
def xor_data():
    x = torch.tensor([[1, 1], [-1, 1], [-1, -1], [1, -1]], dtype=torch.float)
    y = torch.tensor([1, -1, 1, -1], dtype=torch.float)
    return x, y
```
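The solver and predictor below also call a `create_kernel` helper and an `_EPS` constant that are not shown in these notes; here is a plausible sketch consistent with how they are used (the original definitions may differ):

```python
import torch

_EPS = 1e-6  # threshold below which an alpha is treated as zero


def create_kernel(x1, x2, n1, n2, kernel):
    """Builds the n1 x n2 Gram matrix K[i, j] = kernel(x1[i], x2[j])."""
    K = torch.zeros(n1, n2)
    for i in range(n1):
        for j in range(n2):
            K[i, j] = kernel(x1[i], x2[j])
    return K
```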

SVM

```python
import torch

import hw3_utils


def svm_solver(
    x_train, y_train, lr, num_iters, kernel=hw3_utils.poly(degree=1), c=None
):
    """
    Computes an SVM given a training set, training labels, the number of
    iterations to perform projected gradient descent, a kernel, and a trade-off
    parameter for soft-margin SVM.

    Arguments:
        x_train: 2d tensor with shape (N, d).
        y_train: 1d tensor with shape (N,), whose elements are +1 or -1.
        lr: The learning rate.
        num_iters: The number of gradient descent steps.
        kernel: The kernel function.
            The default kernel function is 1 + <x, y>.
        c: The trade-off parameter in soft-margin SVM.
            The default value is None, referring to the basic, hard-margin SVM.

    Returns:
        alpha: a 1d tensor with shape (N,), denoting an optimal dual solution.
            Initialize alpha to be 0.
            Returning alpha.detach() could possibly help you save some time
            when you try to use alpha in other places.
    """
    N = x_train.shape[0]
    alpha = torch.zeros(N, requires_grad=True)
    Y = torch.diag(y_train)

    # Build the kernel (Gram) matrix beforehand.
    K = create_kernel(x_train, x_train, N, N, kernel)
    for _ in range(num_iters):
        # The matrix form of the dual objective: sum(alpha) - 0.5 * alpha^T (Y K Y) alpha
        L = alpha.sum() - 0.5 * alpha @ (Y @ K @ Y) @ alpha
        L.backward()

        with torch.no_grad():
            # Gradient ascent step on the dual objective.
            alpha += lr * alpha.grad

            # Project back onto the feasible set: 0 <= alpha (<= c for soft margin).
            if c:
                alpha = torch.clamp_(alpha, min=0, max=c)
            else:
                alpha = torch.clamp_(alpha, min=0)
            alpha.grad.zero_()
        alpha.requires_grad_()

    return alpha.detach()


def svm_predictor(alpha, x_train, y_train, x_test, kernel=hw3_utils.poly(degree=1)):
    """
    Returns the kernel SVM's predictions for x_test using the SVM trained on
    x_train, y_train with computed dual variables alpha.

    Arguments:
        alpha: 1d tensor with shape (N,), denoting an optimal dual solution.
        x_train: 2d tensor with shape (N, d), denoting the training set.
        y_train: 1d tensor with shape (N,), whose elements are +1 or -1.
        x_test: 2d tensor with shape (M, d), denoting the test set.
        kernel: The kernel function.
            The default kernel function is 1 + <x, y>.

    Return:
        A 1d tensor with shape (M,), the outputs of SVM on the test set.
    """
    N = x_train.shape[0]
    M = x_test.shape[0]

    K_train = create_kernel(x_train, x_train, N, N, kernel)
    K_test = create_kernel(x_train, x_test, N, M, kernel)

    # Support vectors are the training points with non-zero alpha.
    support_idx = torch.where(alpha > _EPS)[0]
    if len(support_idx) == 0:
        return torch.zeros(M)

    # Use the support vector with the smallest alpha to recover the bias.
    min_alpha_idx = support_idx[torch.argmin(alpha[support_idx])]

    # b = y_t - \sum^{N}_{j = 1} alpha_j * y_j * K(x_j, x_t)
    b = y_train[min_alpha_idx] - (alpha * y_train * K_train[:, min_alpha_idx]).sum()

    # Compute predictions: f(x) = \sum_{i=1}^{N} alpha_i * y_i * K(x_i, x) + b
    f = torch.sum((alpha * y_train).unsqueeze(1) * K_test, dim=0) + b
    return torch.sign(f)
```
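A usage sketch on the XOR toy data with the RBF kernel, assuming the helper functions above live in `hw3_utils` (as the default kernel argument suggests); the learning rate, iteration count, and `sigma` values are illustrative choices:

```python
# Train a hard-margin kernel SVM on the XOR data and predict on the training points.
x_train, y_train = hw3_utils.xor_data()
alpha = svm_solver(x_train, y_train, lr=0.1, num_iters=10000,
                   kernel=hw3_utils.rbf(sigma=1.0), c=None)
preds = svm_predictor(alpha, x_train, y_train, x_train,
                      kernel=hw3_utils.rbf(sigma=1.0))
print(preds)  # with enough iterations, this should match y_train: [1, -1, 1, -1]
```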