Problem 1: Support Vector Machines (25 pt)

1.1 Soft Margin with Hinge Loss

Given dataset:

  • $x^{(1)} = \begin{bmatrix}0\\0\end{bmatrix}, \quad y^{(1)} = +1$
  • $x^{(2)} = \begin{bmatrix}2\\0\end{bmatrix}, \quad y^{(2)} = +1$
  • $x^{(3)} = \begin{bmatrix}1\\1\end{bmatrix}, \quad y^{(3)} = -1$

Functional margin: $\gamma_i := y^{(i)}(w^\intercal x^{(i)}+b)$. Hinge loss (slack): $\xi_i := \max\{0, 1-\gamma_i\}$.

(i) With $w = \begin{bmatrix}1\\0\end{bmatrix}$, $b = 0$, $C = 1$

Computations:

  • $f(x^{(1)}) = w^\intercal x^{(1)} + b = 0$
  • $f(x^{(2)}) = 2$
  • $f(x^{(3)}) = 1$

Functional margins:

  • $\gamma_1 = 0$
  • $\gamma_2 = 2$
  • $\gamma_3 = -1$

Slack variables:

  • $\xi_1 = \max(0, 1-0) = 1$
  • $\xi_2 = \max(0, 1-2) = 0$
  • $\xi_3 = \max(0, 1-(-1)) = 2$

Objective value:

$$\frac{1}{2}\|w\|^2_2 + C\sum^N_{i=1}\xi_i = \frac{1}{2}(1) + 1 \cdot (1 + 0 + 2) = 3.5$$

(ii) With $w = \begin{bmatrix}1\\-1\end{bmatrix}$, $b = 0$, $C = 1$

Computations:

  • $f(x^{(1)}) = 0$
  • $f(x^{(2)}) = 2$
  • $f(x^{(3)}) = 0$

Functional margins:

  • $\gamma_1 = 0$
  • $\gamma_2 = 2$
  • $\gamma_3 = 0$

Slack variables:

  • $\xi_1 = 1$
  • $\xi_2 = 0$
  • $\xi_3 = 1$

Objective value:

$$\frac{1}{2}\|w\|^2_2 + C\sum^N_{i=1}\xi_i = \frac{1}{2}(2) + 1 \cdot (1 + 0 + 1) = 3.0$$

Answer: The second parameter choice ($w = \begin{bmatrix}1\\-1\end{bmatrix}$) has the smaller objective value ($3.0 < 3.5$).

(iii) Effect of C on Regularization Trade-off

With $C = 0.5$:

  • Case (i): $\frac{1}{2} + 0.5 \times 3 = 2.0$
  • Case (ii): $1.0 + 0.5 \times 2 = 2.0$

With $C = 2$:

  • Case (i): $\frac{1}{2} + 2 \times 3 = 6.5$
  • Case (ii): $1.0 + 2 \times 2 = 5.0$
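
A quick NumPy check of the numbers in (i)–(iii) above (the three training points and both weight vectors are hard-coded from this problem):

import numpy as np

X = np.array([[0, 0], [2, 0], [1, 1]], dtype=float)
y = np.array([1, 1, -1], dtype=float)


def soft_margin_objective(w, b, C):
    """(1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))."""
    margins = y * (X @ w + b)
    xi = np.maximum(0.0, 1.0 - margins)
    return 0.5 * w @ w + C * xi.sum()


for C in (0.5, 1.0, 2.0):
    obj_i = soft_margin_objective(np.array([1.0, 0.0]), 0.0, C)
    obj_ii = soft_margin_objective(np.array([1.0, -1.0]), 0.0, C)
    print(f"C={C}: case (i) = {obj_i}, case (ii) = {obj_ii}")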

Analysis:

Increasing $C$ leads to smaller $\xi_i$ (fewer margin violations) but a smaller margin, whereas decreasing $C$ tolerates larger $\xi_i$ in exchange for a larger margin. As $C \to \infty$, the soft-margin SVM becomes the hard-margin SVM.

  • Higher $C$: emphasizes minimizing training violations (forces $\xi_i \to 0$), resulting in a smaller margin
  • Lower $C$: allows more violations and prioritizes a larger margin
  • Trade-off: margin width vs. training error

1.2 Importance Weighted Soft Margin SVMs

(a) Primal Optimization

Answer:

$$\begin{aligned} & \min_{w, b, \xi}\ \frac{1}{2} \|w\|^2_2 + C\sum^N_{i = 1} p^{(i)}\xi_i \\ & \text{s.t. } y^{(i)}(w^\intercal x^{(i)} + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, N \end{aligned}$$

Each example's slack penalty is scaled by its importance weight $p^{(i)}$.

(b) Dual Problem

Answer: The weight $p^{(i)}$ changes the upper bound on the dual variables: $0 \le \alpha_i \le Cp^{(i)}$.

Introduce Lagrange multipliers $\alpha_i \ge 0$ and $\mu_i \ge 0$:

$$L(w, b, \xi, \alpha, \mu) = \frac{1}{2} \|w\|^2_2 + C\sum^{N}_{i = 1}p^{(i)} \xi_i + \sum^{N}_{i = 1}\alpha_i \big(1 - \xi_i - y^{(i)}(w^\intercal x^{(i)} + b)\big) - \sum^{N}_{i=1} \mu_i \xi_i$$

Taking partial derivatives and setting to zero:

$$\begin{aligned} & \frac{\partial L}{\partial w} = 0 \implies w = \sum^{N}_{i = 1} \alpha_i y^{(i)}x^{(i)} \\ & \frac{\partial L}{\partial b} = 0 \implies \sum^{N}_{i = 1} \alpha_i y^{(i)} = 0 \\ & \frac{\partial L}{\partial \xi_i} = 0 \implies Cp^{(i)} = \alpha_i + \mu_i \end{aligned}$$

Since $\alpha_i \ge 0$ and $\mu_i \ge 0$, we have $0 \le \alpha_i \le Cp^{(i)}$.

The dual problem becomes:

$$\begin{aligned} & \max_{\alpha} \sum^N_{i = 1} \alpha_i - \frac{1}{2} \sum^N_{i=1}\sum^N_{j=1} \alpha_i \alpha_j y^{(i)}y^{(j)} x^{(i)\intercal}x^{(j)} \\ & \text{s.t. } \sum^{N}_{i = 1} \alpha_i y^{(i)} = 0, \quad 0 \le \alpha_i \le Cp^{(i)}, \quad i = 1, \dots, N \end{aligned}$$
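
The only change relative to the standard soft-margin dual is the per-example upper bound, so a projected-gradient solver simply clamps each $\alpha_i$ to $[0, Cp^{(i)}]$ instead of $[0, C]$. A minimal sketch of one ascent step, assuming a precomputed kernel matrix K, a label vector y, and a weight vector p (all torch tensors, with alpha having requires_grad=True):

import torch


def weighted_dual_step(alpha, K, y, p, C, lr):
    """One projected gradient ascent step on the importance-weighted dual."""
    Y = torch.diag(y)
    obj = alpha.sum() - 0.5 * alpha @ (Y @ K @ Y) @ alpha
    grad = torch.autograd.grad(obj, alpha)[0]
    with torch.no_grad():
        alpha = alpha + lr * grad
        # Per-example box constraint: 0 <= alpha_i <= C * p_i
        alpha = torch.minimum(torch.clamp(alpha, min=0.0), C * p)
    return alpha.requires_grad_()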

(c) Feasible Sets with Given Weights

Given $p^{(1)} = 1$, $p^{(2)} = \frac{1}{2}$, $p^{(3)} = 0$ and $C = 2$:

Answer:

  • $0 \le \alpha_1 \le 2$
  • $0 \le \alpha_2 \le 1$
  • $\alpha_3 = 0$

(d) L2 Soft-Margin SVM Dual

For the objective:

$$\min_{w, b, \xi}\ \frac{1}{2}\|w\|^2_2+\frac{C}{2}\sum_{i=1}^{N} p^{(i)}\xi_i^2 \quad \text{s.t. } y^{(i)}(w^\intercal x^{(i)} + b) \ge 1-\xi_i, \quad \xi_i \ge 0$$

Answer:

Following similar steps, the Lagrangian is:

$$L(w, b, \xi, \alpha, \mu) = \frac{1}{2} \|w\|^2_2 + \frac{C}{2}\sum^{N}_{i = 1} p^{(i)}\xi_i^2 + \sum^{N}_{i = 1}\alpha_i \big(1 - \xi_i - y^{(i)}(w^\intercal x^{(i)} + b)\big) - \sum^{N}_{i=1} \mu_i \xi_i$$

Taking partial derivatives:

$$\begin{aligned} & \frac{\partial L}{\partial w} = 0 \implies w = \sum^{N}_{i = 1} \alpha_i y^{(i)}x^{(i)} \\ & \frac{\partial L}{\partial b} = 0 \implies \sum^{N}_{i = 1} \alpha_i y^{(i)} = 0 \\ & \frac{\partial L}{\partial \xi_i} = 0 \implies \xi_i = \frac{\alpha_i + \mu_i}{Cp^{(i)}} \end{aligned}$$

Substituting $\xi_i = \frac{\alpha_i + \mu_i}{Cp^{(i)}}$ back into the Lagrangian and maximizing over $\mu_i \ge 0$ gives $\mu_i^* = 0$, so the dual becomes:

$$\begin{aligned} \max_{\alpha} \quad & \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y^{(i)} y^{(j)} x^{(i)\intercal} x^{(j)} - \frac{1}{2C}\sum_{i=1}^N \frac{\alpha_i^2}{p^{(i)}} \\ \text{s.t.} \quad & \sum_{i=1}^N \alpha_i y^{(i)} = 0, \quad \alpha_i \ge 0, \quad i = 1, \dots, N \end{aligned}$$
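
For completeness, the substitution step (assuming $p^{(i)} > 0$): the $\xi$-dependent part of the Lagrangian is $\frac{C}{2} p^{(i)} \xi_i^2 - (\alpha_i + \mu_i)\xi_i$, which at $\xi_i = \frac{\alpha_i + \mu_i}{Cp^{(i)}}$ evaluates to $-\frac{(\alpha_i + \mu_i)^2}{2Cp^{(i)}}$. This is maximized over $\mu_i \ge 0$ at $\mu_i = 0$, leaving the extra term $-\frac{\alpha_i^2}{2Cp^{(i)}}$ in the dual objective above.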

Problem 2: Implementing Support Vector Machine (25 pt)

2.1 Projection Operators

Prove $(\Pi_{[0,\infty)^N}[\alpha])_i = \max\{\alpha_i, 0\}$

Since $\|\alpha' - \alpha\|_2^2 = \sum_i (\alpha_i' - \alpha_i)^2$ and the constraint set is a product of identical intervals, the projection decomposes coordinate-wise. For each coordinate $i$:

$$\alpha_i' = \argmin_{\alpha_i' \ge 0} (\alpha_i' - \alpha_i)^2$$

The minimizer is:

$$\alpha_i' = \begin{cases} \alpha_i & \text{if } \alpha_i \ge 0 \\ 0 & \text{if } \alpha_i < 0 \end{cases} = \max\{\alpha_i, 0\}$$

Prove $(\Pi_{[0,C]^N}[\alpha])_i = \min\{\max\{\alpha_i, 0\}, C\}$

For each coordinate $i$:

$$\alpha_i' = \argmin_{0 \le \alpha_i' \le C} (\alpha_i' - \alpha_i)^2$$

The minimizer is:

$$\alpha_i' = \begin{cases} C & \text{if } \alpha_i \ge C \\ \alpha_i & \text{if } 0 \le \alpha_i < C \\ 0 & \text{if } \alpha_i < 0 \end{cases} = \min\{\max\{\alpha_i, 0\}, C\}$$
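
These componentwise projections are exactly the clamping operations used by the dual solver in the code section below; a two-line NumPy illustration:

import numpy as np

alpha = np.array([-1.5, 0.3, 2.7])
C = 1.0

proj_nonneg = np.maximum(alpha, 0)   # projection onto [0, inf)^N -> [0. , 0.3, 2.7]
proj_box = np.clip(alpha, 0, C)      # projection onto [0, C]^N   -> [0. , 0.3, 1. ]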

Problem 3: Linear Regression and ERM (25 pt)

3.1 Robustness of Linear Regression

(a) Dataset Without Outlier

Given $\{(x^{(i)}, y^{(i)})\}^{5}_{i=1} = \{(1,2), (2,3), (3,6), (4,7), (5,10)\}$ with $w_0 = 1$.

Answer: $w_1 = \frac{89}{55} \approx 1.618$

Minimizing $\ell(w) = \sum^{N}_{i = 1}(y^{(i)} - w_1 x^{(i)} - 1)^2$:

$$\frac{\partial \ell}{\partial w_1} = -2 \sum^{N}_{i = 1}(y^{(i)} - w_1 x^{(i)} - 1)x^{(i)} = 0$$

$$w_1 = \frac{\sum^N_{i = 1} (y^{(i)} - 1) x^{(i)}}{\sum^N_{i = 1} (x^{(i)})^2} = \frac{89}{55}$$

(b) Dataset With Outlier

Adding the outlier $(6, 180)$:

Answer: $w_1 = \frac{1163}{91} \approx 12.78$

The outlier significantly increases the slope, demonstrating that L2 loss is sensitive to outliers.
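
A quick NumPy check of both slopes, with the intercept fixed at $w_0 = 1$ as given:

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 3, 6, 7, 10], dtype=float)

# Closed form with w0 = 1 fixed: w1 = sum((y - 1) * x) / sum(x^2)
w1_clean = np.sum((y - 1) * x) / np.sum(x**2)                 # 89/55  ~ 1.618

x_out = np.append(x, 6.0)
y_out = np.append(y, 180.0)
w1_outlier = np.sum((y_out - 1) * x_out) / np.sum(x_out**2)   # 1163/91 ~ 12.78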

(c) L1 Loss With Outlier

Using $\ell(w) = \sum_{i=1}^N \big|y^{(i)} - w_1 x^{(i)} - w_0\big|$:

Answer: The L1 loss is more robust to outliers because residuals enter the objective linearly rather than squared, so a single large residual cannot dominate. With $w_0 = 1$ fixed as in parts (a) and (b), the L1-optimal slope is the weighted median of $(y^{(i)} - 1)/x^{(i)}$ with weights $x^{(i)}$, which gives $w_1 = 1.8$ even with the outlier, close to the outlier-free value $\approx 1.618$, whereas the L2 fit jumps to $\approx 12.78$.
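
A small sketch comparing the two losses on the outlier dataset, again fixing $w_0 = 1$ and minimizing over $w_1$ by a dense grid search (the grid is for illustration only):

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2, 3, 6, 7, 10, 180], dtype=float)

grid = np.linspace(0, 20, 20001)                          # candidate slopes
residuals = y[None, :] - grid[:, None] * x[None, :] - 1.0

w1_l2 = grid[np.argmin((residuals**2).sum(axis=1))]       # ~ 12.78
w1_l1 = grid[np.argmin(np.abs(residuals).sum(axis=1))]    # ~ 1.8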

3.2 Lasso Regression

Given $X^\intercal X = I$, solve:

$$w^* = \argmin_w \|y - Xw\|_2^2 + \lambda\|w\|_1$$

(a) Show $w^*_i$ Depends Only on $X_{\cdot i}$, $y$, $\lambda$

Answer: With $X^\intercal X = I$, the objective separates across coordinates; the term involving $w_i$ is $w_i^2 - 2(X_{\cdot i}^\intercal y)w_i + \lambda |w_i|$, which depends only on $X_{\cdot i}$, $y$, and $\lambda$.

Expanding the objective:

$$\|y - Xw\|_2^2 + \lambda\|w\|_1 = y^\intercal y - 2y^\intercal Xw + w^\intercal X^\intercal X w + \lambda \|w\|_1$$

With $X^\intercal X = I$, the quadratic term is $w^\intercal w = \sum_i w_i^2$, so up to the constant $y^\intercal y$ the objective equals

$$\sum_{i} \left( w_i^2 - 2(X_{\cdot i}^\intercal y)\, w_i + \lambda |w_i| \right)$$

Each $w_i$ can therefore be optimized independently.

(b) Case $w^*_i > 0$

Answer: $w_i^* = X_{\cdot i}^\intercal y - \frac{\lambda}{2}$, valid when $X_{\cdot i}^\intercal y > \frac{\lambda}{2}$

For $w_i > 0$, $|w_i| = w_i$, so:

$$\frac{\partial}{\partial w_i}\left(w_i^2 - 2(X_{\cdot i}^\intercal y)w_i + \lambda w_i\right) = 0$$

$$2w_i^* - 2X_{\cdot i}^\intercal y + \lambda = 0 \implies w_i^* = X_{\cdot i}^\intercal y - \frac{\lambda}{2}$$

(c) Case $w^*_i < 0$

Answer: $w_i^* = X_{\cdot i}^\intercal y + \frac{\lambda}{2}$, valid when $X_{\cdot i}^\intercal y < -\frac{\lambda}{2}$

For $w_i < 0$, $|w_i| = -w_i$, so:

$$2w_i^* - 2X_{\cdot i}^\intercal y - \lambda = 0 \implies w_i^* = X_{\cdot i}^\intercal y + \frac{\lambda}{2}$$

(d) Condition for $w^*_i = 0$

Answer: $w_i^* = 0 \iff -\frac{\lambda}{2} \le X_{\cdot i}^\intercal y \le \frac{\lambda}{2}$

From (b) and (c):

  • $w_i^* > 0 \implies X_{\cdot i}^\intercal y > \frac{\lambda}{2}$
  • $w_i^* < 0 \implies X_{\cdot i}^\intercal y < -\frac{\lambda}{2}$

Therefore: $w_i^* = 0 \iff -\frac{\lambda}{2} \le X_{\cdot i}^\intercal y \le \frac{\lambda}{2}$

Interpretation: Features with small correlation to the target (within the threshold $\lambda/2$) are set exactly to zero, performing automatic feature selection.
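
Parts (b)–(d) combine into the soft-thresholding operator $w_i^* = \operatorname{sign}(X_{\cdot i}^\intercal y)\max\{|X_{\cdot i}^\intercal y| - \frac{\lambda}{2}, 0\}$, the same soft-thresholding operation that appears inside ista_lasso in the code section below. A vectorized sketch for the orthonormal case $X^\intercal X = I$:

import numpy as np


def lasso_orthonormal(X, y, lam):
    """Closed-form Lasso solution when X^T X = I (coordinate-wise soft-thresholding)."""
    a = X.T @ y  # per-coordinate correlations X_{.i}^T y
    return np.sign(a) * np.maximum(np.abs(a) - lam / 2.0, 0.0)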

3.3 Ridge Regression

(a) Ridge for $d = 1$

Answer: $w^* = \frac{\sum_{i=1}^{N}x^{(i)}y^{(i)}}{\sum_{i=1}^{N}(x^{(i)})^2 + N\lambda}$, $\quad w_0^* = 0$

With centered data ($\sum_i y^{(i)} = 0$, $\sum_i x^{(i)} = 0$):

$$\frac{\partial \ell}{\partial w_0} = -\frac{2}{N} \sum^N_{i = 1} (y^{(i)} - wx^{(i)} - w_0) = 0 \implies w_0^* = 0$$

$$\frac{\partial \ell}{\partial w} = -\frac{2}{N} \sum_{i=1}^N x^{(i)}(y^{(i)} - wx^{(i)}) + 2\lambda w = 0$$

$$w^* = \frac{\sum_{i=1}^{N}x^{(i)}y^{(i)}}{\sum_{i=1}^{N}(x^{(i)})^2 + N\lambda}$$

(b) Ridge for $d > 1$

Answer: $w^* = \left(\frac{1}{N}X^\intercal X + \lambda I\right)^{-1}\frac{1}{N}X^\intercal y$, $\quad w_0^* = 0$

$$\frac{\partial \ell}{\partial w_0} = -\frac{2}{N}\mathbf{1}^\intercal(y - Xw - w_0\mathbf{1}) = 0 \implies w_0^* = 0$$

$$\frac{\partial \ell}{\partial w} = -\frac{2}{N}X^\intercal(y - Xw) + 2\lambda w = 0$$

$$w^* = \left(\frac{1}{N}X^\intercal X + \lambda I\right)^{-1}\frac{1}{N}X^\intercal y$$
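
A quick consistency check that the $d = 1$ formula agrees with the matrix solution (centered synthetic data; $\lambda$ is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
N, lam = 50, 0.3
x = rng.normal(size=N)
x -= x.mean()
y = 2.0 * x + 0.1 * rng.normal(size=N)
y -= y.mean()

w_scalar = np.sum(x * y) / (np.sum(x**2) + N * lam)

X = x.reshape(-1, 1)
w_matrix = np.linalg.solve(X.T @ X / N + lam * np.eye(1), X.T @ y / N)

assert np.allclose(w_scalar, w_matrix)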

Code Implementation

SVM Solver (hw3_q2.py)

import torch

import hw3_utils


def svm_solver(x_train, y_train, lr, num_iters,
               kernel=hw3_utils.poly(degree=1), c=None):
    """Solve the SVM dual with projected gradient ascent.

    Maximizes sum(alpha) - 0.5 * alpha^T (Y K Y) alpha subject to
    alpha >= 0 (hard margin) or 0 <= alpha <= c (soft margin).
    """
    N = x_train.shape[0]
    alpha = torch.zeros(N, requires_grad=True)
    Y = torch.diag(y_train)

    # Precompute the N x N kernel (Gram) matrix. create_kernel is a helper
    # in hw3_q2.py; a possible implementation is sketched after svm_predictor.
    K = create_kernel(x_train, x_train, N, N, kernel)

    for _ in range(num_iters):
        # Dual objective: sum(alpha) - 0.5 * alpha^T (Y K Y) alpha
        L = alpha.sum() - 0.5 * alpha @ (Y @ K @ Y) @ alpha
        L.backward()

        with torch.no_grad():
            # Gradient ascent step (we are maximizing the dual)
            alpha += lr * alpha.grad

            # Project onto the feasible set
            if c is not None:
                alpha.clamp_(min=0, max=c)
            else:
                alpha.clamp_(min=0)

            alpha.grad.zero_()

    return alpha.detach()

def svm_predictor(alpha, x_train, y_train, x_test, kernel=hw3_utils.poly(degree=1)):
    """Predict using trained SVM"""
    N = x_train.shape[0]
    M = x_test.shape[0]

    K_train = create_kernel(x_train, x_train, N, N, kernel)
    K_test = create_kernel(x_train, x_test, N, M, kernel)

    # Find support vectors
    support_idx = torch.where(alpha > 1e-5)[0]
    min_alpha_idx = support_idx[torch.argmin(alpha[support_idx])]

    # Calculate bias: b = y_k - sum(alpha_j * y_j * K(x_j, x_k))
    b = y_train[min_alpha_idx] - (alpha * y_train * K_train[:, min_alpha_idx]).sum()

    # Predictions: f(x) = sum(alpha_i * y_i * K(x_i, x)) + b
    return torch.sum((alpha * y_train).unsqueeze(1) * K_test, dim=0) + b
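
The helper create_kernel used above is not shown; a minimal sketch consistent with how it is called (returning an n x m Gram matrix, assuming kernel(x, x') returns a scalar, and relying on the import torch at the top of hw3_q2.py) is:

def create_kernel(x1, x2, n, m, kernel):
    """Build the n x m Gram matrix K[i, j] = kernel(x1[i], x2[j])."""
    K = torch.empty(n, m)
    for i in range(n):
        for j in range(m):
            K[i, j] = kernel(x1[i], x2[j])
    return K

Typical usage: alpha = svm_solver(x_train, y_train, lr=0.1, num_iters=10000, c=1.0) followed by scores = svm_predictor(alpha, x_train, y_train, x_test); the sign of each score is the predicted label.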

Linear Regression Implementations (hw3_q4.py)

import numpy as np


def ols_solve_predict(X_train_b, y_train, X_test_b):
    """OLS via the normal equations (pinv handles rank-deficient X)."""
    w_ols = np.linalg.pinv(X_train_b.T @ X_train_b) @ X_train_b.T @ y_train
    y_pred = X_test_b @ w_ols
    return w_ols, y_pred

def ridge_precompute(X_train_b, y_train):
    """Precompute matrices for Ridge regression"""
    n_features = X_train_b.shape[1]
    I = np.eye(n_features)
    I[0, 0] = 0  # Don't penalize bias

    XtX = X_train_b.T @ X_train_b
    Xty = X_train_b.T @ y_train
    return I, XtX, Xty

def ridge_solve_predict(XtX, Xty, I, lambda_val, X_test_b):
    """Ridge closed-form solution"""
    w_ridge = np.linalg.pinv(XtX + lambda_val * I) @ Xty
    y_pred_ridge = X_test_b @ w_ridge
    return w_ridge, y_pred_ridge

def ista_lasso(X, y, alpha, n_iterations, lambda_val, tol=1e-8):
    """ISTA (proximal gradient) for Lasso with early stopping.

    Minimizes (1/n) * ||y - X w||_2^2 + lambda_val * ||w[1:]||_1,
    where alpha is the step size and the bias w[0] is unpenalized.
    """
    n, d = X.shape
    w = np.zeros(d, dtype=np.float64)
    thr = alpha * lambda_val  # soft-threshold level = step size * lambda

    for _ in range(n_iterations):
        w_prev = w.copy()

        # Gradient step
        grad = 2 / n * (X.T @ (X @ w - y))
        w -= alpha * grad

        # Soft-thresholding (skip bias)
        z = w[1:].copy()
        w[1:] = np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

        # Early stopping
        if np.linalg.norm(w - w_prev, ord=np.inf) < tol:
            break

    return w

def smear_back_transform(y_train_log, train_pred_log, X_test_b, w):
    """Apply Duan's smearing estimator for log transform"""
    residuals = y_train_log - train_pred_log
    smear = np.mean(np.exp(residuals))

    y_pred_log_test = X_test_b @ w
    y_pred_back = np.expm1(y_pred_log_test) * smear

    return smear, y_pred_back
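
A minimal usage sketch for the regression helpers above on synthetic data, assuming they are in scope (the helpers expect a feature matrix whose first column is the bias):

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = np.array([1.0, 2.0, 0.0, 0.0, -3.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Prepend the bias column expected by the helpers.
X_b = np.hstack([np.ones((n, 1)), X])

w_ols, _ = ols_solve_predict(X_b, y, X_b)
I, XtX, Xty = ridge_precompute(X_b, y)
w_ridge, _ = ridge_solve_predict(XtX, Xty, I, lambda_val=1.0, X_test_b=X_b)
w_lasso = ista_lasso(X_b, y, alpha=0.01, n_iterations=5000, lambda_val=0.1)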