Problem 1: Naive Bayes (25 pt)

1.1 Conditional Independence Properties

(a) Conditional Independence Implies Joint Conditional Factorization

Question: If $X \perp Y \mid Z$, can we conclude that $P(X, Y \mid Z) = P(X \mid Z)P(Y \mid Z)$?

Answer: Yes

$$\begin{aligned} P(X, Y \mid Z) &= \frac{P(X, Y, Z)}{P(Z)} \\ &= \frac{P(Z)\,P(Y \mid Z)\,P(X \mid Y, Z)}{P(Z)} \quad \text{(chain rule)} \\ &= \frac{P(Z)\,P(Y \mid Z)\,P(X \mid Z)}{P(Z)} \quad (X \perp Y \mid Z) \\ &= P(Y \mid Z)\,P(X \mid Z) \end{aligned}$$

(b) Conditional Independence Does Not Imply Marginal Independence

Question: If $X \perp Y \mid Z$, can we conclude that $P(X, Y) = P(X)P(Y)$?

Answer: No

$P(X, Y) = P(X)P(Y)$ holds iff $X \perp Y$ (marginal independence), and $X \perp Y \mid Z$ does not imply $X \perp Y$. In particular, a common cause $Z$ can make $X$ and $Y$ conditionally independent given $Z$ while leaving them marginally dependent, so the factorization cannot be concluded in general.

Conditional independence does not imply marginal independence.
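
A quick numerical check of the common-cause counterexample (all numbers are hypothetical, chosen only to make the effect visible):

import numpy as np

# Hypothetical common cause: Z drives both X and Y.
p_z = np.array([0.5, 0.5])           # P(Z = z)
p_x1_given_z = np.array([0.1, 0.9])  # P(X = 1 | Z = z)
p_y1_given_z = np.array([0.1, 0.9])  # P(Y = 1 | Z = z)

# Joint P(X = x, Y = y, Z = z), built so that X ⊥ Y | Z holds by construction.
joint = np.zeros((2, 2, 2))
for x in range(2):
    for y in range(2):
        for z in range(2):
            px = p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]
            py = p_y1_given_z[z] if y == 1 else 1 - p_y1_given_z[z]
            joint[x, y, z] = p_z[z] * px * py

p_xy = joint.sum(axis=2)  # marginal P(X, Y)
p_x = p_xy.sum(axis=1)    # marginal P(X)
p_y = p_xy.sum(axis=0)    # marginal P(Y)

print(p_xy[1, 1])         # 0.41
print(p_x[1] * p_y[1])    # 0.25, so P(X, Y) != P(X)P(Y)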

(c) Number of Parameters for Boolean Features

Question: How many independent $\theta_{jc}$ parameters must be estimated for $d$ Boolean attributes and $C$ classes?

Answer: $C \times d$ independent $\theta_{jc}$ parameters

  • $X$ has $d$ attributes
  • For each class $c$, we estimate one $\theta_{jc} = P(X_j = 1 \mid Y = c)$ per attribute $X_j$
  • Therefore, the total number of independent parameters is $C \times d$

(d) Number of Parameters for Gaussian Features

Question: How many distinct $\mu_{jc}$ and $\sigma_{jc}$ parameters must be estimated for Gaussian features?

Answer: $2 \times C \times d$ distinct parameters ($C \times d$ each for $\mu_{jc}$ and $\sigma_{jc}$)

  • Each attribute $X_j$ requires a distinct $\mu_{jc}$ and $\sigma_{jc}$ for every class $c$
  • There are $C \times d$ parameters of each kind
  • Therefore, the total number of distinct parameters is $2 \times C \times d$

(e) Why Can We Omit the Denominator?

Question: Why is omitting the denominator in the classification rule acceptable?

Answer: The denominator is a normalization constant that does not depend on the class label, so dropping it cannot change which class maximizes the posterior.

When computing $\arg\max_c P(Y = c \mid X)$, the denominator $\sum_{v=1}^C P(Y = v)\prod_{k=1}^{d} P(X_k^{\text{new}} \mid Y = v)$ is the same for all classes $c$, so it doesn't affect which class achieves the maximum.

(f) Computing P(X) from Naive Bayes Parameters

Question: Is it possible to calculate $P(X)$ with Naive Bayes?

Answer: Yes

Using the law of total probability and the Naive Bayes likelihood:

$$P(X = x) = \sum^C_{c=1} P(X = x \mid Y = c)\,P(Y = c) = \sum^C_{c=1} P(Y = c) \prod^d_{j=1} P(X_j = x_j \mid Y = c)$$
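
As a sketch of how this computation looks in code (the function and argument names are illustrative, not from the assignment), for Boolean features with class priors prior[c] and per-class Bernoulli parameters theta[c, j]:

import torch

def nb_marginal(x: torch.Tensor, prior: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """P(X = x) for a Boolean Naive Bayes model (illustrative sketch).

    x: (d,) 0/1 feature vector; prior: (C,) with P(Y = c);
    theta: (C, d) with theta[c, j] = P(X_j = 1 | Y = c).
    """
    # P(X_j = x_j | Y = c) for every class c and feature j
    lik = torch.where(x.bool(), theta, 1 - theta)  # (C, d)
    # Sum over classes of P(Y = c) * prod_j P(X_j = x_j | Y = c)
    return (prior * lik.prod(dim=1)).sum()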

1.2 Joint Distribution Factorization

Question: Express the joint distribution $P(X_1, X_2, X_3, Y)$ as a product of simpler conditional probabilities.

Given:

  • $P(X_1 \mid X_2, X_3, Y) = P(X_1 \mid Y) \;\Rightarrow\; X_1 \perp X_2, X_3 \mid Y$
  • $P(X_2 \mid X_1, X_3, Y) = P(X_2 \mid Y) \;\Rightarrow\; X_2 \perp X_1, X_3 \mid Y$
  • $P(X_3 \mid X_1, X_2, Y) = P(X_3 \mid X_1) \;\Rightarrow\; X_3 \perp X_2, Y \mid X_1$

Answer: $P(Y)\,P(X_1 \mid Y)\,P(X_3 \mid X_1)\,P(X_2 \mid Y)$

$$\begin{aligned} P(X_1, X_2, X_3, Y) &= P(Y)\,P(X_1, X_2, X_3 \mid Y) \\ &= P(Y)\,P(X_1 \mid Y)\,P(X_2, X_3 \mid X_1, Y) \\ &= P(Y)\,P(X_1 \mid Y)\,\underbrace{P(X_3 \mid X_1, Y)}_{=\, P(X_3 \mid X_1)}\,\underbrace{P(X_2 \mid X_3, X_1, Y)}_{=\, P(X_2 \mid Y)} \\ &= P(Y)\,P(X_1 \mid Y)\,P(X_3 \mid X_1)\,P(X_2 \mid Y) \end{aligned}$$

Maximum Likelihood Estimators

Let $\mathbb{I}\{e\}$ be the indicator function: $\mathbb{I}\{e\} = \begin{cases} 1 & \text{if } e \text{ is true} \\ 0 & \text{if } e \text{ is false} \end{cases}$

(i) $P(Y = 1)$

Answer: $\dfrac{\sum^{N}_{i=1} \mathbb{I}\{y^{(i)} = 1\}}{N}$

(ii) $P(X_1 = 1 \mid Y = y)$ for $y \in \{0, 1\}$

Answer: $\dfrac{\sum^{N}_{i=1} \mathbb{I}\{x_1^{(i)} = 1,\, y^{(i)} = y\}}{\sum^{N}_{i=1} \mathbb{I}\{y^{(i)} = y\}}$

(iii) $P(X_3 = 1 \mid Y = y)$ for $y \in \{0, 1\}$

Answer: $\sum^{1}_{x_1 = 0} \dfrac{\sum^{N}_{i=1} \mathbb{I}\{x^{(i)}_1 = x_1,\, x_3^{(i)} = 1\}}{\sum^{N}_{i=1} \mathbb{I}\{x^{(i)}_1 = x_1\}} \cdot \dfrac{\sum^{N}_{i=1} \mathbb{I}\{y^{(i)} = y,\, x^{(i)}_1 = x_1\}}{\sum^{N}_{i=1} \mathbb{I}\{y^{(i)} = y\}}$

Using marginalization:

$$\begin{aligned} P(X_3 \mid Y) &= \sum^{1}_{x_1 = 0} P(X_3, X_1 = x_1 \mid Y) \\ &= \sum^{1}_{x_1 = 0} P(X_3 \mid X_1 = x_1, Y)\, P(X_1 = x_1 \mid Y) \\ \Rightarrow P(X_3 = 1 \mid Y = y) &= \sum^{1}_{x_1 = 0} P(X_3 = 1 \mid X_1 = x_1)\,P(X_1 = x_1 \mid Y = y) \end{aligned}$$

where the last step uses $X_3 \perp Y \mid X_1$.
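
A minimal plug-in version of this estimator (the array names are ours; x1, x3, y are 0/1 numpy arrays of length $N$):

import numpy as np

def estimate_p_x3_given_y(x1: np.ndarray, x3: np.ndarray, y: np.ndarray, y_val: int) -> float:
    """Plug-in MLE of P(X3 = 1 | Y = y_val) via marginalization over X1."""
    total = 0.0
    for v in (0, 1):
        # Empirical P(X3 = 1 | X1 = v) and P(X1 = v | Y = y_val)
        p_x3_given_x1 = ((x1 == v) & (x3 == 1)).sum() / (x1 == v).sum()
        p_x1_given_y = ((x1 == v) & (y == y_val)).sum() / (y == y_val).sum()
        total += p_x3_given_x1 * p_x1_given_y
    return total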

1.3 MAP Rule for Gaussian Naive Bayes

Given:

  • $P(Y=0) = P(Y=1) = 0.5$
  • $(X_1 \mid Y=y) \sim \mathcal{N}(\mu_{1y}, 1)$, where $\mu_{10}=0$, $\mu_{11}=1$
  • $(X_2 \mid Y=y) \sim \mathcal{N}(\mu_{2y}, 1)$, where $\mu_{20}=0$, $\mu_{21}=1$
  • $(X_3 \mid X_1=x_1) \sim \mathcal{N}(2x_1, 1)$ (independent of $Y$ given $X_1$)

(a) Derive the MAP Rule

Answer: $\hat{y}(x) = \begin{cases} 1 & \text{if } x_1 + x_2 > 1 \\ 0 & \text{otherwise} \end{cases}$

Using the same conditional independence structure:

$$P(X_1, X_2, X_3, Y) = P(Y)\,P(X_1 \mid Y)\,P(X_2 \mid Y)\,P(X_3 \mid X_1)$$

For classification, we choose $Y = 1$ when:

$$\begin{aligned} & P(Y=1 \mid X_1, X_2, X_3) > P(Y=0 \mid X_1, X_2, X_3) \\ & \iff P(X_1 \mid Y=1)\,P(X_2 \mid Y=1) > P(X_1 \mid Y=0)\,P(X_2 \mid Y=0) \\ & \iff (\mu_{11}-\mu_{10})x_1 - \tfrac{1}{2}(\mu_{11}^2-\mu_{10}^2) + (\mu_{21}-\mu_{20})x_2 - \tfrac{1}{2}(\mu_{21}^2-\mu_{20}^2) > 0 \end{aligned}$$

The first equivalence holds because the equal priors and the common factor $P(X_3 \mid X_1)$ cancel from both sides; the second follows from taking logs of the unit-variance Gaussian densities.

Substituting the $\mu$ values:

$$x_1 - \tfrac{1}{2} + x_2 - \tfrac{1}{2} > 0 \implies x_1 + x_2 > 1$$

(b) Classify Two Points

Points to classify:

  • $X^{(a)} = \langle 0.2, 0.7, -10 \rangle$
  • $X^{(b)} = \langle 0.2, 0.7, 10 \rangle$

Answer: Both $X^{(a)}$ and $X^{(b)}$ are classified as $y = 0$: for both points $x_1 + x_2 = 0.9 < 1$, and the classifier never consults $X_3$.

The $X_3$ feature doesn't affect the classification because it is conditionally independent of $Y$ given $X_1$.
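
A tiny check of the rule on both points (a sketch, not part of the assignment code):

points = {"a": (0.2, 0.7, -10.0), "b": (0.2, 0.7, 10.0)}
for name, (x1, x2, x3) in points.items():
    label = 1 if x1 + x2 > 1 else 0  # x3 is never consulted
    print(name, label)               # both print 0, since 0.2 + 0.7 = 0.9 <= 1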

Problem 2: Gaussian Naive Bayes Implementation (25 pt)

Show Linear Classifier Equivalence

Question: Show that Gaussian Naive Bayes with equal class priors and equal variances across classes is equivalent to a linear classifier.

Answer:

For classification, we choose $Y = 1$ when $P(Y = 1 \mid x) > P(Y = 0 \mid x)$:

$$\begin{aligned} & P(Y=1 \mid x) > P(Y=0 \mid x) \\ & \iff P(x \mid Y=1) > P(x \mid Y=0) \quad \text{(equal priors)} \\ & \iff \sum_{j=1}^d \log P(X_j = x_j \mid Y=1) > \sum_{j=1}^d \log P(X_j = x_j \mid Y=0) \\ & \iff \sum_{j=1}^d \frac{(x_j-\mu_{0j})^2 - (x_j-\mu_{1j})^2}{2\sigma_j^2} > 0 \\ & \iff \sum_{j=1}^d \left(\frac{\mu_{1j}-\mu_{0j}}{\sigma_j^2}\right) x_j > \sum_{j=1}^d \frac{\mu_{1j}^2 - \mu_{0j}^2}{2\sigma_j^2} \end{aligned}$$

Let $w_j = \frac{\mu_{1j}-\mu_{0j}}{\sigma_j^2}$ and $\tau = \sum_{j=1}^d \frac{\mu_{1j}^2 - \mu_{0j}^2}{2\sigma_j^2}$. Then:

$$\hat{y}(x) = \begin{cases} 1 & \text{if } \sum^{d}_{j=1} w_j x_j > \tau \\ 0 & \text{otherwise} \end{cases}$$

This shows the decision boundary is linear in the feature space.
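
As a sketch (the names are ours), the weights and threshold can be read directly off estimated parameters, after which prediction reduces to a dot-product test. This assumes equal priors and per-feature variances shared across the two classes:

import numpy as np

def gnb_to_linear(mu: np.ndarray, sigma2: np.ndarray):
    """Convert shared-variance GNB parameters to (w, tau) per the derivation above.

    mu: (2, d) class-conditional means; sigma2: (d,) per-feature variances
    shared by both classes (illustrative sketch).
    """
    w = (mu[1] - mu[0]) / sigma2
    tau = ((mu[1] ** 2 - mu[0] ** 2) / (2 * sigma2)).sum()
    return w, tau

# Predict class 1 iff w @ x > tau, i.e. a linear decision rule:
# y_hat = (X @ w > tau).astype(int)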

Problem 3: Logistic Regression Theory (25 pt)

3.1 Sigmoid Function Properties

(a) Prove $\sigma(-s) = 1 - \sigma(s)$

$$\begin{aligned} \sigma(-s) &= \frac{1}{1 + e^s} \\ &= \frac{1 + e^s}{1 + e^s} - \frac{e^s}{1 + e^s} \\ &= 1 - \frac{1}{e^{-s} + 1} \\ &= 1 - \sigma(s) \end{aligned}$$

This proves $P(y^{(i)} = -1 \mid x^{(i)}) + P(y^{(i)} = 1 \mid x^{(i)}) = \sigma(-w^\intercal x^{(i)}) + \sigma(w^\intercal x^{(i)}) = 1$.

(b) Prove $\sigma'(s) = \sigma(s)(1 - \sigma(s))$

$$\begin{aligned} \sigma'(s) &= -(1 + e^{-s})^{-2}(-e^{-s}) \\ &= (1 + e^{-s})^{-1}(1 + e^{-s})^{-1} e^{-s} \\ &= \sigma(s)\,\frac{e^{-s}}{1 + e^{-s}} \\ &= \sigma(s)(1 - \sigma(s)) \end{aligned}$$

3.2 Gradient of Log-Likelihood

Answer: $X^\intercal(\sigma(-y \odot Xw) \odot y)$

where $\odot$ denotes element-wise multiplication, $X \in \mathbb{R}^{N \times d}$, and $y \in \{-1, +1\}^N$ is the label vector.

$$\begin{aligned} \log P(y \mid X, w) &= \sum^{N}_{i=1} \log \sigma(y^{(i)} w^\intercal x^{(i)}) \\ \nabla_w \log P(y \mid X, w) &= \sum^{N}_{i=1} (1 - \sigma(y^{(i)} w^\intercal x^{(i)}))\, y^{(i)} x^{(i)} \\ &= \sum^{N}_{i=1} \sigma(-y^{(i)} w^\intercal x^{(i)})\, y^{(i)} x^{(i)} \\ &= X^\intercal(\sigma(-y \odot Xw) \odot y) \end{aligned}$$
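
The vectorized form translates directly to numpy (a sketch with our own names; X is $N \times d$ and y takes values in $\{-1, +1\}$):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood_grad(w: np.ndarray, X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Gradient X^T (sigma(-y ⊙ Xw) ⊙ y) of the log-likelihood."""
    return X.T @ (sigmoid(-y * (X @ w)) * y)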

3.3 Hessian Matrix

Answer: $-X^\intercal D X$, where $D$ is a diagonal matrix with $D_{ii} = \sigma(y^{(i)} w^\intercal x^{(i)})(1 - \sigma(y^{(i)} w^\intercal x^{(i)}))$. With $X \in \mathbb{R}^{N \times d}$ as in 3.2, this is a $d \times d$ matrix.

Taking the derivative of the gradient:

$$\begin{aligned} \nabla^2_w \log P(y \mid X, w) &= \sum^{N}_{i=1} (1 - \sigma(y^{(i)} w^\intercal x^{(i)}))\,\sigma(y^{(i)} w^\intercal x^{(i)})\,(-x^{(i)})\, x^{(i)\intercal} \\ &= -X^\intercal D X \end{aligned}$$
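
A numpy sketch of the same quantity, scaling by the diagonal of $D$ instead of materializing the $N \times N$ diagonal matrix:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood_hessian(w: np.ndarray, X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Hessian -X^T D X with D_ii = sigma(y_i w^T x_i)(1 - sigma(y_i w^T x_i))."""
    s = sigmoid(y * (X @ w))
    d = s * (1 - s)        # diagonal of D, shape (N,)
    return -(X.T * d) @ X  # (d, d), without forming diag(d)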

3.4 Prove Hessian is Negative Semi-Definite

For any vector zz:

$$\begin{aligned} z^\intercal H z &= -z^\intercal X^\intercal D X z \\ &= -(Xz)^\intercal D (Xz) \\ &= -\sum^{N}_{i=1} (1 - \sigma(y^{(i)} w^\intercal x^{(i)}))\,\sigma(y^{(i)} w^\intercal x^{(i)})\, a_i^2 \end{aligned}$$

where $a = Xz$. Each term is a product of two probabilities and $a_i^2 \ge 0$, so $z^\intercal H z \le 0$, proving the Hessian is negative semi-definite. Hence the log-likelihood is concave, and any stationary point is a global maximum.

3.5 Gradient Descent Update Rule

Answer: $w_{t+1} = w_t + \alpha\, X^\intercal(\sigma(-y \odot Xw_t) \odot y)$ at iteration $t$, where $\alpha$ is the step size (an ascent step, since we maximize the log-likelihood).
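
A minimal ascent loop using this update (the step size and iteration count are arbitrary placeholders):

import numpy as np

def gradient_ascent(X: np.ndarray, y: np.ndarray, alpha: float = 0.1, iters: int = 1000) -> np.ndarray:
    """Maximize the log-likelihood with the update w <- w + alpha * grad."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w + alpha * (X.T @ (sigmoid(-y * (X @ w)) * y))
    return w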

3.6 Newton’s Method Update Rule

Answer: $w_{t+1} = w_t + (X^\intercal D X)^{-1} \nabla l(w_t)$

Newton's method approximates the gradient with a first-order Taylor expansion:

$$\nabla l(w_{t+1}) \approx \nabla l(w_t) + \nabla^2 l(w_t)(w_{t+1} - w_t)$$

Setting this to zero and solving for $w_{t+1}$:

$$w_{t+1} = w_t - H^{-1} g = w_t - (-X^\intercal D X)^{-1} \nabla l(w_t) = w_t + (X^\intercal D X)^{-1} \nabla l(w_t)$$

where $g = \nabla l(w_t)$ and $H = \nabla^2 l(w_t)$.
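
A sketch of one Newton step (assuming $X^\intercal D X$ is invertible; in practice a small ridge term can be added for stability):

import numpy as np

def newton_step(w: np.ndarray, X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """One update w <- w + (X^T D X)^{-1} grad l(w)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    s = sigmoid(y * (X @ w))
    grad = X.T @ ((1 - s) * y)              # = X^T (sigma(-y ⊙ Xw) ⊙ y)
    XtDX = (X.T * (s * (1 - s))) @ X        # X^T D X
    return w + np.linalg.solve(XtDX, grad)  # solve rather than invert explicitly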

Problem 4: Programming - Optimization (25 pt)

4.1 Logistic Regression Implementation

Analysis of Feature Transformation

Because the input data follow a circular distribution that is not linearly separable, logistic regression, a linear model, cannot separate the classes with a linear decision boundary in the original $(x_1, x_2)$ space.

However, by mapping the data points to a higher-dimensional space using polynomial features ($x_1^2$, $x_2^2$, $x_1 x_2$), we enable the linear model to capture the non-linear pattern in the data. The transformation creates a feature space where the classes become linearly separable: a circular boundary $x_1^2 + x_2^2 = r^2$ is linear in the squared features.

4.2 L1 vs. L2 Regularization

Both L1 and L2 regularizations reduce overfitting by penalizing large weights:

L1 Regularization (Lasso): Adds penalty $\frac{\lambda}{m}\|w\|_1$

  • Gradient penalty: $\frac{\lambda}{m}\,\text{sign}(w)$
  • Drives insignificant features to exactly zero (feature selection)
  • Preferred when only certain features matter

L2 Regularization (Ridge): Adds penalty $\frac{\lambda}{2m}\|w\|_2^2$

  • Gradient penalty: $\frac{\lambda}{m}\, w$
  • Shrinks weights proportionally to their current values
  • Preserves all features with reduced magnitudes
  • Preferred when all features need to be considered

The key difference is that L1 produces sparse solutions (many weights become zero), while L2 produces dense solutions (all weights remain non-zero but small).
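
In gradient form, the two penalties differ only in the term added to the base gradient; a sketch with our own names, matching the penalties above:

import numpy as np

def add_reg_gradient(base_grad: np.ndarray, w: np.ndarray, lam: float, m: int,
                     kind: str = "L2") -> np.ndarray:
    """Append the L1 or L2 penalty gradient to a base cost gradient."""
    if kind == "L1":
        return base_grad + (lam / m) * np.sign(w)  # constant-magnitude push toward 0
    if kind == "L2":
        return base_grad + (lam / m) * w           # push proportional to each weight
    return base_grad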

4.3 Hyperparameter Tuning Results

Best Performing Hyperparameter Combination:

  • Learning rate: 0.1
  • Lambda: 0.0001
  • Best validation accuracy: 100.00%
  • Test Set Accuracy: 98.67%

4.4 Gradient Descent Variants Comparison

Convergence Speed: Stochastic GD > Mini-batch GD > Batch GD

  • More parameter updates per epoch lead to faster convergence

Stability: Batch GD > Mini-batch GD > Stochastic GD

  • Batch GD provides the most stable and predictable convergence
  • Stochastic and Mini-batch GD have noisier updates but converge faster

The plot shows:

  • Batch GD: Smooth convergence curve, slowest to converge
  • Mini-batch GD: Moderate noise, balanced speed and stability
  • Stochastic GD: Fastest convergence but most oscillation in individual updates
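
The three variants differ only in how many examples feed each parameter update; a sketch of the batching helper (the names are ours):

import numpy as np

def minibatches(n: int, batch_size: int, rng: np.random.Generator):
    """Yield shuffled index arrays covering one epoch."""
    order = rng.permutation(n)
    for start in range(0, n, batch_size):
        yield order[start:start + batch_size]

# batch GD:      batch_size = n   -> one smooth update per epoch
# mini-batch GD: batch_size ~ 32  -> several moderately noisy updates per epoch
# stochastic GD: batch_size = 1   -> n noisy updates per epoch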

Code Implementation

Gaussian Naive Bayes (hw2_q2.py)

import torch
from torch import Tensor, FloatTensor, LongTensor
from typing import Tuple

def gaussian_theta(X: FloatTensor, y: LongTensor) -> Tuple[Tensor, Tensor]:
    """Calculate the MLE for mu and sigma2 in Gaussian Naive Bayes"""
    # Per-class, per-feature sample means: mu[c, j] = mean of X_j over class c
    mu = torch.stack([
        X[y == 0].mean(dim=0),
        X[y == 1].mean(dim=0),
    ], dim=0)

    # Per-class, per-feature MLE variances (biased: divide by N_c, not N_c - 1)
    sigma2 = torch.stack([
        ((X[y == 0] - mu[0]) ** 2).mean(dim=0),
        ((X[y == 1] - mu[1]) ** 2).mean(dim=0),
    ], dim=0)

    return mu, sigma2

def gaussian_p(y: LongTensor):
    """Calculate the MLE for P(Y=0): the fraction of class-0 examples"""
    return (y == 0).sum() / len(y)

def gaussian_classify(mu: Tensor, sigma2: Tensor, p: FloatTensor, X: Tensor):
    """Classify using Gaussian Naive Bayes (argmax of the log-posterior)"""
    log_p = torch.stack([torch.log(p), torch.log(1 - p)])

    posteriors = []
    for c in range(2):
        # Sum over features of the Gaussian log-density log N(x_j; mu_jc, sigma2_jc)
        log_likelihoods = (
            -0.5 * torch.log(2 * torch.pi * sigma2[c])
            - (X - mu[c]) ** 2 / (2 * sigma2[c])
        ).sum(dim=1)
        posteriors.append(log_likelihoods + log_p[c])

    posteriors = torch.stack(posteriors, dim=1)
    return torch.argmax(posteriors, dim=1)

Logistic Regression Core Functions (hw2_q4.py)

# Methods of the logistic regression model class (excerpted)
import numpy as np

def _sigmoid(self, z):
    """Sigmoid activation function"""
    return 1 / (1 + np.exp(-z))

def transform(self, X):
    """Map (x1, x2) to the degree-2 polynomial feature space"""
    col_v1 = X[:, 0].reshape(-1, 1)
    col_v2 = X[:, 1].reshape(-1, 1)
    return np.hstack([
        col_v1, col_v2,
        col_v1**2, col_v2**2,
        col_v1 * col_v2,
    ])

def _compute_cost(self, y, y_pred):
    """Binary cross-entropy with regularization"""
    m = len(y)
    # EPSILON is a small module-level constant (e.g. 1e-15) guarding against log(0)
    y_pred_clipped = np.clip(y_pred, EPSILON, 1 - EPSILON)
    cost = -np.sum(y * np.log(y_pred_clipped) +
                   (1 - y) * np.log(1 - y_pred_clipped)) / m

    # Add the L2 or L1 penalty from Section 4.2 to the data term
    reg_cost = 0
    if self.regularization == "L2":
        reg_cost = self.lambda_val / (2 * m) * np.linalg.norm(self.weights) ** 2
    elif self.regularization == "L1":
        reg_cost = self.lambda_val / m * np.linalg.norm(self.weights, ord=1)

    return cost + reg_cost