Machine Learning Homework 1 Solutions

This post contains my solutions to the first homework assignment for CS 446/ECE 449 Machine Learning. The problems cover fundamental concepts in K-Nearest Neighbors (K-NN), the Perceptron Algorithm, and Maximum Likelihood Estimation (MLE) / Maximum A Posteriori (MAP) estimation.

K-Nearest Neighbors (K-NN)

Problem 1: Binary Search for Class Distribution

Question: Given a dataset with two non-overlapping circles (blue labeled 0, green labeled 1), we have N training samples total and a single test point from the green circle. What is the minimum number of queries to a K-NN classifier needed to determine the number of training points from each class?

Solution: The minimum number of queries is $\log N$.

Since the two circles do not overlap, data points are closer to neighbors within their own circle, so with $K = N$ the classifier simply outputs the majority label over the whole training set. Given that:

  • The number of samples in the green circle is less than $N/2$
  • We can query with any value of K

We can perform a binary search on the size of the green circle by:

  1. Starting with $K = N/2$
  2. Halving the search interval for K in each iteration, based on the classifier's output
  3. Each output tells us whether there are more or fewer than $K/2$ green points

This effectively performs binary search on the class distribution, requiring only $\log N$ queries (a small simulation sketch follows).
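The sketch below assumes an idealized query oracle: because the circles do not overlap, a K-NN query for a test point inside the green circle returns green exactly when the green points form the majority among the K nearest neighbors, i.e. when the green count exceeds K/2. The function names and the specific N are illustrative, not part of the assignment.

def green_count_by_binary_search(N, n_green):
    """Recover the green-class count with ~log2(N) queries to an idealized K-NN oracle."""
    def knn_query(K):
        # With non-overlapping circles, the K nearest neighbors of a green test
        # point are the green points first, so the output is green (True) iff
        # the green points form the majority: min(K, n_green) > K / 2.
        return min(K, n_green) > K / 2

    lo, hi, queries = 1, N // 2, 0          # green is the minority class, so its count <= N/2
    while lo < hi:                          # find the largest m with knn_query(2m - 1) True
        mid = (lo + hi + 1) // 2
        queries += 1
        if knn_query(2 * mid - 1):          # odd K avoids ties in the majority vote
            lo = mid
        else:
            hi = mid - 1
    return lo, queries

print(green_count_by_binary_search(N=1000, n_green=137))   # -> (137, 9); log2(1000) is about 10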

Part (b): Can the 1-NN classifier fail in any setting?

Answer: Yes, if the data points are imbalanced between the two circles. If the blue circle is dense while the green circle is sparse, a green test point may end up closer to a point in the blue circle and be misclassified as blue.

Problem 2: K-NN with Specific Data Points

Given the following dataset in $\mathbb{R}^3$:

| Index | $x_1$ | $x_2$ | $x_3$ | Label |
| --- | --- | --- | --- | --- |
| 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 0 | 1 | 1 |
| 3 | 0 | 0 | 0 | -1 |
| 4 | 1 | 0 | 0 | -1 |
| 5 | 0 | 1 | 0 | -1 |

Part (a): For $K = 1$ and test point $\mathbf{x} = (0.4, 0.4, 1.5)$, what is the predicted label?

Answer: Label 1

The data point with index 2 is the closest to the input $\mathbf{x}$, with Euclidean distance:

$$\sqrt{(0 - 0.4)^2 + (0 - 0.4)^2 + (1 - 1.5)^2} = \sqrt{0.57} \approx 0.755$$
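For completeness, the same distances in a few lines of NumPy (variable names are just illustrative):

import numpy as np

# Training points and labels from the table above.
X = np.array([[1, 1, 1], [0, 0, 1], [0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=float)
y = np.array([1, 1, -1, -1, -1])

x_test = np.array([0.4, 0.4, 1.5])
dists = np.linalg.norm(X - x_test, axis=1)
print(dists.round(3))                       # [0.985 0.755 1.603 1.664 1.664]
print("1-NN label:", y[np.argmin(dists)])   # -> 1 (the point with index 2 is closest)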

Part (b): When there is more noise in the training dataset, should we choose smaller or larger K?

Answer: Larger K.

To avoid overfitting, a larger K should be used so that the prediction is not dominated by the labels of outliers. Because K-NN classifies an input by the most common label among its K nearest neighbors, the more neighbors are considered, the less influence any single (possibly mislabeled) neighbor has. Increasing K therefore suppresses the negative impact of label noise.
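To make this concrete, here is a small self-contained experiment on entirely made-up data (two Gaussian blobs with 20% of the training labels flipped), comparing 1-NN against 15-NN; on data like this the larger K is usually the more accurate of the two on clean test labels.

import numpy as np

rng = np.random.default_rng(0)

def make_blobs(n):
    # Two 2-D Gaussian blobs: class 0 centered at (-1, 0), class 1 at (+1, 0).
    X0 = rng.normal(loc=[-1.0, 0.0], scale=0.8, size=(n // 2, 2))
    X1 = rng.normal(loc=[+1.0, 0.0], scale=0.8, size=(n // 2, 2))
    return np.vstack([X0, X1]), np.repeat([0, 1], n // 2)

X_train, y_train = make_blobs(200)
flip = rng.random(len(y_train)) < 0.2           # flip 20% of training labels (label noise)
y_noisy = np.where(flip, 1 - y_train, y_train)
X_test, y_test = make_blobs(2000)               # clean labels for evaluation

def knn_predict(X_tr, y_tr, X_te, k):
    # Plain majority-vote K-NN with Euclidean distance (odd k avoids ties).
    d = np.linalg.norm(X_te[:, None, :] - X_tr[None, :, :], axis=2)
    votes = y_tr[np.argsort(d, axis=1)[:, :k]]
    return (votes.mean(axis=1) > 0.5).astype(int)

for k in (1, 15):
    acc = (knn_predict(X_train, y_noisy, X_test, k) == y_test).mean()
    print(f"k = {k:2d}: test accuracy = {acc:.3f}")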

Problem 3: Probability Analysis with 1-NN

Setting: Input space $S = \{x^{(i)}\}_{i=1}^N$ with all distinct points. For each $x^{(i)}$, the label $y^{(i)} \in \{1, 2\}$ is drawn independently with $\mathbb{P}(y^{(i)} = 1 \mid x^{(i)}) = 0.9$.

Question: What is the probability of correct prediction for a test sample $x \in S$ using 1-NN?

Answer: 0.82

Since the test point $x$ is in the input space $S$, there exists a training sample $x^{(j)} = x$. The probability of correct prediction requires that two independent draws of the label (one for the true label $y$ and one for the classifier's prediction $\hat{y}$) match:

$$
\begin{align*}
\mathbb{P}(\hat{y} = y) &= \mathbb{P}(y = 1, \hat{y} = 1) + \mathbb{P}(y = 2, \hat{y} = 2) \\
&= 0.9^2 + 0.1^2 = 0.82
\end{align*}
$$

Problem 4: 3-NN with Replicated Labels

Setting: Same as Problem 3, but now each $x^{(i)}$ has three independent labeled replicas $(x^{(i)}, y^{(i,1)})$, $(x^{(i)}, y^{(i,2)})$, $(x^{(i)}, y^{(i,3)})$.

Question: What is the probability of correct prediction using 3-NN? How does it compare to 1-NN?

Answer: 0.8776 (increased by 0.0576)

The 3-NN classifier’s prediction distribution:

$$
\begin{cases}
\mathbb{P}(\hat{y} = 1) = \binom{3}{2} \cdot 0.9^2 \cdot 0.1 + \binom{3}{3} \cdot 0.9^3 = 0.972 \\
\mathbb{P}(\hat{y} = 2) = 1 - \mathbb{P}(\hat{y} = 1) = 0.028
\end{cases}
$$

Updated probability of correct predictions:

$$
\begin{align*}
\mathbb{P}(\hat{y} = y) &= \mathbb{P}(y = 1, \hat{y} = 1) + \mathbb{P}(y = 2, \hat{y} = 2) \\
&= 0.9 \times 0.972 + 0.1 \times 0.028 = 0.8776
\end{align*}
$$

The 3-NN classifier improves accuracy by approximately 5.76 percentage points compared to 1-NN.
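Both numbers are easy to verify with a quick Monte Carlo simulation of the label model described above (the trial count is arbitrary):

import numpy as np

rng = np.random.default_rng(1)
n_trials, p = 1_000_000, 0.9

# 1-NN on a point of S: the prediction is an independent redraw of the label.
y_true = rng.random(n_trials) < p
y_pred_1nn = rng.random(n_trials) < p
print("1-NN accuracy:", (y_true == y_pred_1nn).mean())     # ~0.82

# 3-NN with three independent replicas: predict the majority of three redraws.
replicas = rng.random((n_trials, 3)) < p
y_pred_3nn = replicas.sum(axis=1) >= 2
print("3-NN accuracy:", (y_true == y_pred_3nn).mean())     # ~0.8776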

Problem 5: Handling Missing Features

Question: Is it possible to apply K-NN when test data points have missing features?

Answer: Yes, using K-NN imputation. The algorithm:

  1. Finds the K nearest neighbors
  2. Replaces missing values with an aggregation (mean, median, or mode) of the neighbors’ values

This approach allows K-NN to handle incomplete data effectively.
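For illustration, here is a minimal imputation sketch in NumPy; the function `knn_impute` and the toy matrix are made up for this post, and in practice a library routine such as scikit-learn's `KNNImputer` plays the same role.

import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs with the mean of the k nearest fully observed rows,
    measuring distance only over the features the row actually has."""
    X = X.astype(float).copy()
    complete = X[~np.isnan(X).any(axis=1)]            # rows without missing values
    for i, row in enumerate(X):
        missing = np.isnan(row)
        if not missing.any():
            continue
        d = np.linalg.norm(complete[:, ~missing] - row[~missing], axis=1)
        neighbors = complete[np.argsort(d)[:k]]
        X[i, missing] = neighbors[:, missing].mean(axis=0)
    return X

X = np.array([[1.0, 2.0, 3.0],
              [2.0, np.nan, 4.0],
              [1.5, 2.5, 3.5],
              [8.0, 9.0, 10.0]])
print(knn_impute(X, k=2))   # the NaN becomes 2.25, the mean of its two closest complete rows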

Perceptron Algorithm

Problem 1: Perceptron on 3D Data

Using the same dataset from the K-NN section with the “hacked notation” (augmented vectors):

Part (a): What is $\mathbf{w} \in \mathbb{R}^4$ after the first iteration over the first data point?

Solution:

  • $x^{(1)} = [1, 1, 1]$, $y^{(1)} = 1$, $w^{(1)} = [0, 0, 0, 0]$
  • Augmented: $\hat{x}^{(1)} = [1, 1, 1, 1]$
  • Since $y^{(1)}(w^{(1)\top}\hat{x}^{(1)}) = 0 \leq 0$, update: $w^{(1)} \leftarrow w^{(1)} + y^{(1)}\hat{x}^{(1)}$

Answer: $\mathbf{w} = [1, 1, 1, 1]$

Part (b): Compute $\mathbf{w}$ after the algorithm converges.

Answer: $\mathbf{w} = [-2, 0, 0, 3]$
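As a quick check (assuming the augmented vector is ordered $[1, x_1, x_2, x_3]$ with the bias first, matching the implementation later in this post), this weight vector classifies all five training points correctly:

import numpy as np

w = np.array([-2, 0, 0, 3])                 # [bias, w1, w2, w3]
X_aug = np.array([[1, 1, 1, 1],             # augmented points from the table
                  [1, 0, 0, 1],
                  [1, 0, 0, 0],
                  [1, 1, 0, 0],
                  [1, 0, 1, 0]])
y = np.array([1, 1, -1, -1, -1])
print(y * (X_aug @ w))                      # [1 1 2 2 2] -- all positive, so all correct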

Part (c): If we switch labels of points 2 and 3, is the perceptron algorithm still applicable?

Answer: No, the data becomes linearly inseparable.

After swapping the labels:

  • Point 2: $(0, 0, 1)$ with label -1
  • Point 3: $(0, 0, 0)$ with label 1

The updated dataset cannot be separated by any hyperplane in 3D space: a separating $w^\top x + b$ would need $b > 0$ (point 3 at the origin is positive), while points 2, 4 and 5 require $w_3 + b < 0$, $w_1 + b < 0$ and $w_2 + b < 0$; adding these gives $w_1 + w_2 + w_3 + b < -2b < 0$, contradicting point 1 (label 1 at $(1,1,1)$) needing $w_1 + w_2 + w_3 + b > 0$.

Part (d): Will K-NN and perceptron always have the same results on the test set?

Answer: No.

The perceptron algorithm is designed for linearly separable data and finds a global linear boundary. K-NN is an instance-based method that assigns labels based on local neighborhood information. When data is not linearly separable, the perceptron may fail to find a suitable hyperplane and misclassify points, while K-NN can often handle such cases more effectively by relying on local structure.

Problem 2: Adapted Perceptron Convergence Proof

This problem analyzes an adapted perceptron algorithm with margin condition $\gamma$.

Assumptions:

  1. $\exists\, \mathbf{w}^*$ such that $y^{(i)}\mathbf{w}^{*\top}\mathbf{x}^{(i)} > 0$ for all training samples
  2. $\|\mathbf{w}^*\| = 1$ and $\|\mathbf{x}^{(i)}\| \leq 1$
  3. Margin $\gamma = \min_i |\mathbf{w}^{*\top}\mathbf{x}^{(i)}|$

Part (a): Prove that $\mathbf{w}_{new}^\top\mathbf{w}^* \geq \mathbf{w}^\top\mathbf{w}^* + \gamma$.

Proof:

$$
\begin{align*}
w_{new}^\top w^* &= (w + yx)^\top w^* \\
&= w^\top w^* + y\,w^{*\top}x \\
&\geq w^\top w^* + \gamma
\end{align*}
$$

The last step holds because $w^*$ classifies every training sample correctly, so $\text{sign}(y) = \text{sign}(w^{*\top}x)$ and hence $y\,w^{*\top}x = |w^{*\top}x| \geq \gamma$ by the definition of the margin.

Part (b): Prove that if $a \geq 0$, $b \geq 0$, $\gamma \geq 0$ and $a^2 \leq b^2 + b\gamma + 1$, then $a \leq b + \frac{1}{2b} + \frac{\gamma}{2}$.

Proof:

$$
\begin{align*}
a^2 &\leq b^2 + b\gamma + 1 \\
&= \left(b + \frac{\gamma}{2}\right)^2 + 1 - \frac{\gamma^2}{4} \\
&\leq \left(b + \frac{\gamma}{2}\right)^2 + 1
\end{align*}
$$

Since $\left(b + \frac{1}{2b} + \frac{\gamma}{2}\right)^2 = \left(b + \frac{\gamma}{2}\right)^2 + 2\left(b + \frac{\gamma}{2}\right)\frac{1}{2b} + \frac{1}{4b^2} > \left(b + \frac{\gamma}{2}\right)^2 + 1$, we have $a \leq b + \frac{1}{2b} + \frac{\gamma}{2}$.

Part (c): Prove that $\|\mathbf{w}_{new}\| \leq \|\mathbf{w}\| + \frac{1}{2\|\mathbf{w}\|} + \frac{\gamma}{2}$.

Proof:

$$
\begin{align*}
\|w_{new}\|^2 &= \|w + yx\|^2 \\
&= \|w\|^2 + 2y\,w^\top x + \|yx\|^2
\end{align*}
$$

The algorithm only updates when $y\,w^\top x \leq \frac{\gamma}{2}\|w\|$, and $\|x\| \leq 1$, so:

$$\|w_{new}\|^2 \leq \|w\|^2 + \gamma\|w\| + 1$$

Applying the result from part (b) with $a = \|w_{new}\|$ and $b = \|w\|$:

$$\|w_{new}\| \leq \|w\| + \frac{1}{2\|w\|} + \frac{\gamma}{2}$$

Part (d): Prove that if $M \geq \frac{2}{\gamma^2}$, then $\exists\, t \leq M$ such that after $t$ updates, $\|\mathbf{w}\| \geq \frac{2}{\gamma}$.

Proof:

From part (a), after $t$ updates:

$$w_t^\top w^* \geq t\gamma$$

By Cauchy-Schwarz inequality:

$$w_t^\top w^* \leq \|w_t\| \cdot \|w^*\| = \|w_t\|$$

Therefore: $t\gamma \leq \|w_t\|$

Assume for contradiction that $\|w_M\| < \frac{2}{\gamma}$ when $t = M$:

$$M\gamma \leq \|w_M\| < \frac{2}{\gamma}$$

This implies $M < \frac{2}{\gamma^2}$, contradicting our assumption that $M \geq \frac{2}{\gamma^2}$. Therefore, $\|w_t\| \geq \frac{2}{\gamma}$ for some $t \leq M$.

Part (e): Prove that $M \leq \frac{8}{\gamma^2}$.

Proof:

From the analysis above:

$$M\gamma \leq w_M^\top w^* \leq \|w_M\|$$

From part (c), we can show that:

$$\|w_M\|^2 \leq 2M$$

(Since each update increases $\|w\|^2$ by at most $\gamma + 1 \leq 2$ when $\gamma \leq 1$)

Therefore:

$$M\gamma \leq \|w_M\| \leq \sqrt{2M}$$

Squaring both sides:

$$M^2\gamma^2 \leq 2M \quad\Longrightarrow\quad M \leq \frac{2}{\gamma^2}$$

Since $\frac{2}{\gamma^2} \leq \frac{8}{\gamma^2}$, the claimed bound follows:

$$M \leq \frac{8}{\gamma^2}$$

Python Implementation

Here’s a Python implementation of the perceptron algorithm for the 3D data:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Data points from the table
X = np.array([
    [1, 1, 1],  # index 1
    [0, 0, 1],  # index 2
    [0, 0, 0],  # index 3
    [1, 0, 0],  # index 4
    [0, 1, 0],  # index 5
])

y = np.array([1, 1, -1, -1, -1])  # labels

# Add bias term -> augmented vector [1, x1, x2, x3]
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

# Initialize weight vector w ∈ R^4
w = np.zeros(X_aug.shape[1])

def predict(x):
    return 1 if np.dot(w, x) > 0 else -1

# Perceptron learning
changed = True
iteration = 0
while changed:
    changed = False
    for i in range(len(X_aug)):
        if y[i] * np.dot(w, X_aug[i]) <= 0:
            w = w + y[i] * X_aug[i]
            changed = True
    iteration += 1
    if iteration > 100:  # safeguard
        break

print("Final weight vector w:", w)
print("Iterations:", iteration)

# Plotting 3D decision boundary
pos = X[y == 1]
neg = X[y == -1]

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(pos[:, 0], pos[:, 1], pos[:, 2],
           c='b', marker='o', label='Positive (+1)')
ax.scatter(neg[:, 0], neg[:, 1], neg[:, 2],
           c='r', marker='x', label='Negative (-1)')

xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 10),
                      np.linspace(-0.5, 1.5, 10))
zz = (-w[0] - w[1]*xx - w[2]*yy) / w[3]
ax.plot_surface(xx, yy, zz, alpha=0.3, color='green')

ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.set_zlabel('x3')
ax.set_title('3D Data + Perceptron Decision Boundary')
ax.legend()

plt.show()

Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP)

Problem 1: Triangle Distribution

Part (a): Find the MLE estimate $\hat{b}$ for the Triangle$(a, b)$ distribution.

Given PDF:

$$
p(x \mid a, b) =
\begin{cases}
\frac{2(x-a)}{(b-a)^2} & \text{for } x \in [a, b] \\
0 & \text{otherwise}
\end{cases}
$$

Solution:

The likelihood function:

$$
\begin{align*}
L(b) &= \prod_{i=1}^N p(x^{(i)} \mid a, b) \\
&= (b - a)^{-2N} \prod_{i=1}^N 2(x^{(i)} - a)
\end{align*}
$$

Taking the logarithm:

$$\ln L(b) = -2N \ln(b - a) + \sum_{i=1}^N \ln 2(x^{(i)} - a)$$

The derivative:

$$\frac{\partial \ln L(b)}{\partial b} = -\frac{2N}{b - a} < 0$$

Since the derivative is always negative, $L(b)$ is decreasing in $b$. The likelihood is therefore maximized at the smallest feasible $b$, which must satisfy $b \geq x^{(i)}$ for every observation (otherwise the likelihood is zero):

Answer: $\hat{b}_{MLE} = \max_i x^{(i)}$ (the maximum observed value)
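As a numerical sanity check, one can draw samples from Triangle$(a, b)$ by inverse-CDF sampling (the CDF is $((x-a)/(b-a))^2$) and confirm that the log-likelihood over the feasible values of $b$ peaks at the largest observation; the parameters below are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
a, b_true = 0.0, 2.0
x = a + (b_true - a) * np.sqrt(rng.random(500))       # inverse-CDF samples of Triangle(a, b)

b_grid = np.linspace(x.max(), 3.0, 200)               # only b >= max(x) is feasible
loglik = np.sum(np.log(2 * (x - a))) - 2 * len(x) * np.log(b_grid - a)
print("max sample:             ", round(x.max(), 4))
print("argmax of likelihood, b:", round(b_grid[loglik.argmax()], 4))   # equals the max sample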

Problem 2: Discrete Distribution Estimation

Given 100 samples with the following counts of joint observations $(Z, X)$:

|  | X = 1 | X = 2 | X = 3 |
| --- | --- | --- | --- |
| Z = 1 | 18 | 7 | 6 |
| Z = 2 | 9 | 12 | 3 |
| Z = 3 | 10 | 2 | 9 |
| Z = 4 | 3 | 19 | 2 |

Part (a): Given measurement $Z = 3$, what is the MLE for $X$?

Solution:

Column totals: $n_{X=1} = 40$, $n_{X=2} = 40$, $n_{X=3} = 20$

Estimated probabilities:

$$
\begin{cases}
\hat{P}(Z = 3 \mid X = 1) = \frac{10}{40} = 0.25 \\
\hat{P}(Z = 3 \mid X = 2) = \frac{2}{40} = 0.05 \\
\hat{P}(Z = 3 \mid X = 3) = \frac{9}{20} = 0.45
\end{cases}
$$

Since MLE ignores priors:

Answer: $X_{MLE} = 3$

Part (b): What is the MAP estimate given priors $P(X=1) = 0.2$, $P(X=2) = 0.5$, $P(X=3) = 0.3$?

Solution:

Joint probabilities:

$$
\begin{cases}
P(Z = 3, X = 1) = 0.25 \times 0.20 = 0.05 \\
P(Z = 3, X = 2) = 0.05 \times 0.50 = 0.025 \\
P(Z = 3, X = 3) = 0.45 \times 0.30 = 0.135
\end{cases}
$$

Normalizing by $P(Z = 3) = 0.21$:

$$
\begin{cases}
P(X = 1 \mid Z = 3) \approx 0.238 \\
P(X = 2 \mid Z = 3) \approx 0.119 \\
P(X = 3 \mid Z = 3) \approx 0.643
\end{cases}
$$

Answer: $X_{MAP} = 3$
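Both parts amount to a few lines of arithmetic over the count table; here is a small NumPy version with the same layout as the table above:

import numpy as np

counts = np.array([[18,  7,  6],      # rows: Z = 1..4, columns: X = 1..3
                   [ 9, 12,  3],
                   [10,  2,  9],
                   [ 3, 19,  2]])

likelihood_z3 = counts[2] / counts.sum(axis=0)   # P(Z = 3 | X): [0.25, 0.05, 0.45]
print("MLE: X =", np.argmax(likelihood_z3) + 1)  # -> 3

prior = np.array([0.2, 0.5, 0.3])                # P(X = 1), P(X = 2), P(X = 3)
joint = likelihood_z3 * prior                    # [0.05, 0.025, 0.135]
posterior = joint / joint.sum()                  # divide by P(Z = 3) = 0.21
print("posterior:", posterior.round(3))          # [0.238 0.119 0.643]
print("MAP: X =", np.argmax(posterior) + 1)      # -> 3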

Problem 3: Poisson Distribution

Part (a): Derive MLE estimate for Poisson distribution.

Given PMF: $P(k \mid \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$

Solution:

Likelihood:

$$
\begin{align*}
L(\lambda) &= \prod_{i=1}^N \frac{\lambda^{k^{(i)}} e^{-\lambda}}{k^{(i)}!} \\
&= e^{-\lambda N} \cdot \lambda^{N\bar{k}} \cdot C
\end{align*}
$$

where $\bar{k} = \frac{1}{N}\sum_{i=1}^N k^{(i)}$ and $C = \frac{1}{\prod_i k^{(i)}!}$

Log-likelihood:

$$\ln L(\lambda) = -\lambda N + N\bar{k}\ln \lambda + \ln C$$

Derivative:

$$\frac{\partial \ln L(\lambda)}{\partial \lambda} = -N + \frac{N\bar{k}}{\lambda}$$

Setting to zero:

Answer: $\hat{\lambda}_{MLE} = \bar{k}$ (the sample mean)
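A quick grid check on synthetic counts (arbitrary true rate) confirms that the sample mean maximizes the Poisson log-likelihood:

import numpy as np

rng = np.random.default_rng(3)
k = rng.poisson(lam=4.2, size=1000)                        # synthetic counts

lam_grid = np.linspace(0.1, 10, 2000)
loglik = k.sum() * np.log(lam_grid) - len(k) * lam_grid    # drop the constant -sum(log k!)
print("sample mean:   ", round(k.mean(), 4))
print("grid maximizer:", round(lam_grid[loglik.argmax()], 4))   # matches up to grid resolution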

Part (b): Derive MAP estimate with Gamma prior.

Given prior: $p(\lambda) = \frac{\lambda^{\alpha-1} e^{-\lambda/\beta}}{\Gamma(\alpha)\beta^\alpha}$

Solution:

Posterior kernel:

$$p(\lambda \mid k) \propto \lambda^{\alpha + N\bar{k} - 1}\, e^{-(N + 1/\beta)\lambda}$$

Log-posterior:

$$l(\lambda) = (\alpha + N\bar{k} - 1)\ln \lambda - (N + 1/\beta)\lambda$$

Derivative:

$$l'(\lambda) = \frac{\alpha + N\bar{k} - 1}{\lambda} - (N + 1/\beta) = 0$$

Answer: $\hat{\lambda}_{MAP} = \frac{\alpha + N\bar{k} - 1}{N + 1/\beta}$
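The closed form can be sanity-checked against a direct grid maximization of the log-posterior; the Gamma hyperparameters below are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(4)
alpha, beta = 3.0, 2.0                         # hypothetical Gamma(alpha, beta) prior
k = rng.poisson(lam=4.2, size=50)
N, k_bar = len(k), k.mean()

lam_map = (alpha + N * k_bar - 1) / (N + 1 / beta)          # closed-form MAP

lam_grid = np.linspace(0.1, 10, 5000)
log_post = (alpha + N * k_bar - 1) * np.log(lam_grid) - (N + 1 / beta) * lam_grid
print(round(lam_map, 4), round(lam_grid[log_post.argmax()], 4))   # the two agree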

Part (c): What happens as $N \to \infty$?

Solution:

$$
\begin{align*}
\hat{\lambda}_{MAP} - \hat{\lambda}_{MLE} &= \frac{\alpha - 1 - \bar{k}/\beta}{N + 1/\beta} \\
&\to 0 \text{ as } N \to \infty
\end{align*}
$$

Answer: As $N \to \infty$, the gap vanishes: $\hat{\lambda}_{MAP} \to \hat{\lambda}_{MLE} = \bar{k}$, which in turn converges to the true rate $\lambda$. The influence of the prior disappears with enough data.

Problem 4: Gaussian Sensor Measurements

Part (a): Derive MLE for 1D position from sonar measurements.

Given: $p(z^{(i)} \mid x) = \frac{1}{\sigma_i\sqrt{2\pi}}\, e^{-\frac{(z^{(i)} - x)^2}{2\sigma_i^2}}$

Solution:

Log-likelihood:

$$\ln L(x) = -\sum_{i=1}^N \ln(\sigma_i\sqrt{2\pi}) - \sum_{i=1}^N \frac{(z^{(i)} - x)^2}{2\sigma_i^2}$$

Derivative:

$$\frac{d \ln L(x)}{dx} = \sum_{i=1}^N \frac{z^{(i)} - x}{\sigma_i^2}$$

Setting the derivative to zero gives the precision-weighted (inverse-variance weighted) average:

Answer: $\hat{x}_{MLE} = \dfrac{\sum_{i=1}^N z^{(i)}/\sigma_i^2}{\sum_{i=1}^N 1/\sigma_i^2}$, which reduces to the sample mean $\bar{z} = \frac{1}{N}\sum_{i=1}^N z^{(i)}$ when all the variances are equal.

Part (b): Derive MAP estimate with Gaussian prior $x \sim \mathcal{N}(\theta_0, \sigma_0^2)$.

Solution:

Log-posterior:

$$l(x) = -\frac{1}{2}\left[\frac{(x - \theta_0)^2}{\sigma_0^2} + \sum_{i=1}^N \frac{(z^{(i)} - x)^2}{\sigma_i^2}\right]$$

Derivative:

$$l'(x) = -\frac{x - \theta_0}{\sigma_0^2} + \sum_{i=1}^N \frac{z^{(i)} - x}{\sigma_i^2} = 0$$

Solving:

Answer: $\hat{x}_{MAP} = \dfrac{\frac{\theta_0}{\sigma_0^2} + \sum_{i=1}^N \frac{z^{(i)}}{\sigma_i^2}}{\frac{1}{\sigma_0^2} + \sum_{i=1}^N \frac{1}{\sigma_i^2}}$

This is a weighted average of the prior mean and the measurements, with weights inversely proportional to variances.
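A small numerical example (made-up sensor noise levels and prior) shows the two estimators side by side; the MAP estimate is the MLE pulled toward the prior mean $\theta_0$, with the pull shrinking as measurements accumulate.

import numpy as np

rng = np.random.default_rng(5)
x_true = 2.0
sigma = np.array([0.5, 1.0, 0.2, 0.8, 0.3])               # per-sensor noise std deviations
z = x_true + sigma * rng.standard_normal(len(sigma))      # one noisy reading per sensor

theta0, sigma0 = 0.0, 1.0                                  # Gaussian prior N(theta0, sigma0^2)

w = 1 / sigma**2                                           # measurement precisions
x_mle = np.sum(w * z) / np.sum(w)                          # inverse-variance weighted mean
x_map = (theta0 / sigma0**2 + np.sum(w * z)) / (1 / sigma0**2 + np.sum(w))
print("MLE:", round(x_mle, 4), " MAP:", round(x_map, 4))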

Part (c): What happens as $N \to \infty$?

Solution:

Let $W = \sum_{i=1}^N \frac{1}{\sigma_i^2}$. Rewrite:

$$\hat{x}_{MAP} = \frac{W}{W + \frac{1}{\sigma_0^2}}\,\hat{x}_{MLE} + \frac{\frac{1}{\sigma_0^2}}{W + \frac{1}{\sigma_0^2}}\,\theta_0$$

As $N \to \infty$, $W \to \infty$, so $\frac{W}{W + 1/\sigma_0^2} \to 1$.

Answer: As $N \to \infty$, $\hat{x}_{MAP} \to \hat{x}_{MLE}$. The influence of the prior diminishes as we gather more data.

Key Takeaways

  1. K-NN: Binary search can optimize query complexity; larger K reduces overfitting; replication improves robustness
  2. Perceptron: Converges for linearly separable data with bounded number of updates; fails on non-separable data
  3. MLE vs MAP: MLE maximizes likelihood alone; MAP incorporates prior knowledge; both converge as sample size increases

These fundamental concepts form the foundation for more advanced machine learning algorithms.