This post contains my solutions to the first homework assignment for CS 446/ECE 449 Machine Learning. The problems cover fundamental concepts in K-Nearest Neighbors (K-NN), the Perceptron Algorithm, and Maximum Likelihood Estimation (MLE) / Maximum A Posteriori (MAP) estimation.
K-Nearest Neighbors (K-NN)
Problem 1: Binary Search for Class Distribution
Question: Given a dataset with two non-overlapping circles (blue labeled 0, green labeled 1), we have N training samples total and a single test point from the green circle. What is the minimum number of queries to a K-NN classifier needed to determine the number of training points from each class?
Solution: The minimum number of queries is $\log_2 N$.
Since the two circles do not overlap, the neighbors of the green test point are the green training points first, followed by the blue ones; with K=N the classifier simply outputs the overall majority label. Given that:
The number of samples in the green circle is less than N/2
We can query with any value of K
We can recover the number of green points with a binary search over K:
Start with K=N/2
Halve the remaining search interval for K after each query
Each output (green or blue) tells us whether the green points still form a majority among the K nearest neighbors, i.e., whether there are at least K/2 green points
This is a binary search on the class counts, so it needs only $\log_2 N$ queries; a short sketch of the query procedure follows.
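A minimal sketch of this procedure, assuming a hypothetical oracle `knn_predict(K)` that returns the K-NN prediction for our fixed test point (1 for green, 0 for blue), and assuming ties among the K neighbors are broken toward green; a different tie rule shifts the final arithmetic by one:

```python
def count_green(knn_predict, N):
    """Binary-search the number of green training points using O(log N) K-NN queries.

    knn_predict(K) is assumed to return 1 (green) iff the green points are a
    majority among the K nearest neighbors, i.e. iff K <= 2 * (#green).
    """
    lo, hi = 1, N                      # search for the largest K still predicted green
    while lo < hi:
        K = (lo + hi + 1) // 2
        if knn_predict(K) == 1:        # still green -> at least K/2 green points
            lo = K
        else:                          # blue majority -> fewer green points
            hi = K - 1
    return lo // 2                     # the largest green-K equals 2 * (#green)


# Example with a fake oracle: 30 green points out of N = 100
N, g = 100, 30
print(count_green(lambda K: 1 if K <= 2 * g else 0, N))   # -> 30
```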
Part (b): Can the 1-NN classifier fail in any setting?
Answer: Yes, if the sampling of the two circles is imbalanced. If the blue circle is densely sampled while the green circle is sparse, a green test point may lie closer to a blue training point than to any green one and be misclassified as blue.
Problem 2: K-NN with Specific Data Points
Given the following dataset in $\mathbb{R}^3$:

| Index | $x_1$ | $x_2$ | $x_3$ | Label |
|-------|-------|-------|-------|-------|
| 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 0 | 1 | 1 |
| 3 | 0 | 0 | 0 | -1 |
| 4 | 1 | 0 | 0 | -1 |
| 5 | 0 | 1 | 0 | -1 |
Part (a): For K=1 and test point x=(0.4,0.4,1.5), what is the predicted label?
Answer: Label 1
The data point with index 2 is the closest to the input $x$, with Euclidean distance:
$\sqrt{(0-0.4)^2 + (0-0.4)^2 + (1-1.5)^2} = \sqrt{0.57} \approx 0.755$
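For reference, the same computation in NumPy, using the table above:

```python
import numpy as np

# Distances from the test point to every training sample; 1-NN picks the closest.
X = np.array([[1, 1, 1], [0, 0, 1], [0, 0, 0], [1, 0, 0], [0, 1, 0]])
y = np.array([1, 1, -1, -1, -1])
x_test = np.array([0.4, 0.4, 1.5])

dists = np.linalg.norm(X - x_test, axis=1)
print(np.round(dists, 3))                        # index 2 (second row) is closest, ~0.755
print("1-NN prediction:", y[np.argmin(dists)])   # -> 1
```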
Part (b): When there is more noise in the training dataset, should we choose smaller or larger K?
Answer: Larger K.
To avoid overfitting, a larger K value should be adopted so the input is not affected by the labels of outliers. As the K-NN algorithm classifies an input based on the most common labels among K nearest neighbors, the more neighbors considered, the less influence a single neighbor has. Thus, increasing K suppresses the negative impact of noise.
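To illustrate (my own example, not from the assignment), a small scikit-learn experiment with deliberately flipped labels usually shows the larger K scoring higher:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# flip_y=0.2 randomly flips ~20% of the labels, simulating a noisy training set.
X, y = make_classification(n_samples=500, n_features=5, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 15):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_test, y_test)
    print(f"K={k:2d}: test accuracy = {acc:.3f}")   # larger K typically wins under label noise
```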
Problem 3: Probability Analysis with 1-NN
Setting: Input space $S = \{x^{(i)}\}_{i=1}^{N}$ with all points distinct. For each $x^{(i)}$, the label $y^{(i)} \in \{1, 2\}$ is drawn independently with $P(y^{(i)} = 1 \mid x^{(i)}) = 0.9$.
Question: What is the probability of correct prediction for a test sample x∈S using 1-NN?
Answer: 0.82
Since the test point $x$ is in the input space $S$, there is a training sample $x^{(j)} = x$, and the 1-NN classifier returns its stored label. The prediction is correct exactly when two independent label draws (the true test label $y$ and the stored label $\hat y$) agree:
$P(\hat y = y) = 0.9 \times 0.9 + 0.1 \times 0.1 = 0.82$
Problem 4: Repeating the Analysis with 3-NN
With $K = 3$, the prediction is the majority vote of three independent stored labels, which equals 1 with probability $0.9^3 + 3 \cdot 0.9^2 \cdot 0.1 = 0.972$. The probability of a correct prediction is therefore $0.9 \times 0.972 + 0.1 \times 0.028 = 0.8776$, so the 3-NN classifier improves accuracy by approximately 5.76% (absolute) compared to 1-NN.
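A quick Monte Carlo sanity check of both figures (a sketch of the setting above; the 3-NN vote is taken over three independent stored labels):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# The true test label and the stored training labels are independent draws
# with P(label = 1) = 0.9.
y_true = rng.random(n) < 0.9
y_1nn = rng.random(n) < 0.9
print("1-NN accuracy ~", (y_true == y_1nn).mean())    # ~0.82

# 3-NN: majority vote over three independent stored labels.
maj = (rng.random((n, 3)) < 0.9).sum(axis=1) >= 2
print("3-NN accuracy ~", (y_true == maj).mean())      # ~0.8776
```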
Problem 5: Handling Missing Features
Question: Is it possible to apply K-NN when test data points have missing features?
Answer: Yes, using K-NN imputation. The algorithm:
Finds the K nearest neighbors of the incomplete sample, computing distances over the features that are present
Replaces missing values with an aggregation (mean, median, or mode) of the neighbors’ values
This approach allows K-NN to handle incomplete data effectively.
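One off-the-shelf realization is scikit-learn's KNNImputer; a minimal sketch on toy data (not from the assignment):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data with missing entries (np.nan); distances use the observed features only.
X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

imputer = KNNImputer(n_neighbors=2)   # fill each gap with the mean of the 2 nearest neighbors
print(imputer.fit_transform(X))
```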
Perceptron Algorithm
Problem 1: Perceptron on 3D Data
Using the same dataset from the K-NN section with the “hacked notation” (augmented vectors):
Part (a): What is w∈R4 after the first iteration over the first data point?
Solution:
$x^{(1)} = [1, 1, 1]$, $y^{(1)} = 1$, initial weight $w^{(1)} = [0, 0, 0, 0]$
Augmented: $\hat{x}^{(1)} = [1, 1, 1, 1]$
Since $y^{(1)} \left( w^{(1)\top} \hat{x}^{(1)} \right) = 0 \le 0$, update: $w^{(1)} \leftarrow w^{(1)} + y^{(1)} \hat{x}^{(1)}$
Answer: $w = [1, 1, 1, 1]$
Part (b): Compute w after the algorithm converges.
Answer: $w = [-2, 0, 0, 3]$
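A one-line check that this weight vector separates all five (augmented) training points:

```python
import numpy as np

# Quick check that w = [-2, 0, 0, 3] classifies the augmented data [1, x1, x2, x3]
w = np.array([-2, 0, 0, 3])
X_aug = np.array([[1, 1, 1, 1],
                  [1, 0, 0, 1],
                  [1, 0, 0, 0],
                  [1, 1, 0, 0],
                  [1, 0, 1, 0]])
y = np.array([1, 1, -1, -1, -1])
print(y * (X_aug @ w))   # all entries positive -> every point classified correctly
```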
Part (c): If we switch labels of points 2 and 3, is the perceptron algorithm still applicable?
Answer: No, the data becomes linearly inseparable.
After swapping the labels:
Point 2: (0,0,1) with label -1
Point 3: (0,0,0) with label 1
The updated dataset cannot be separated by any hyperplane in 3D: writing the decision rule as $\operatorname{sign}(w_1 x_1 + w_2 x_2 + w_3 x_3 + b)$, the positive point $(0,0,0)$ requires $b > 0$, the negative points $(0,0,1)$, $(1,0,0)$, $(0,1,0)$ require $w_3 + b < 0$, $w_1 + b < 0$, $w_2 + b < 0$, and the positive point $(1,1,1)$ requires $w_1 + w_2 + w_3 + b > 0$. Summing the three middle inequalities gives $w_1 + w_2 + w_3 < -3b$, hence $w_1 + w_2 + w_3 + b < -2b < 0$, a contradiction, so the perceptron never converges on these labels.
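As an empirical confirmation (a sketch reusing the same perceptron loop as the implementation below), the algorithm keeps updating on the swapped labels and only stops at a safeguard:

```python
import numpy as np

# Same perceptron loop, with the labels of points 2 and 3 swapped.
X_aug = np.array([[1, 1, 1, 1],
                  [1, 0, 0, 1],   # point 2, now labelled -1
                  [1, 0, 0, 0],   # point 3, now labelled +1
                  [1, 1, 0, 0],
                  [1, 0, 1, 0]])
y = np.array([1, -1, 1, -1, -1])

w = np.zeros(4)
for epoch in range(100):
    changed = False
    for xi, yi in zip(X_aug, y):
        if yi * (w @ xi) <= 0:
            w = w + yi * xi
            changed = True
    if not changed:
        break
print("converged:", not changed, "after", epoch + 1, "epochs")   # converged: False
```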
Part (d): Will K-NN and perceptron always have the same results on the test set?
Answer: No.
The perceptron algorithm is designed for linearly separable data and finds a global linear boundary. K-NN is an instance-based method that assigns labels based on local neighborhood information. When data is not linearly separable, the perceptron may fail to find a suitable hyperplane and misclassify points, while K-NN can often handle such cases more effectively by relying on local structure.
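A classic illustration (my own example, not from the assignment): XOR labels are not linearly separable, so a linear perceptron cannot fit them, while 1-NN trivially memorizes the training points:

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier

# XOR: no hyperplane separates these four points
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, 1, 1, -1])

print("Perceptron:", Perceptron().fit(X, y).score(X, y))                 # at most 0.75
print("1-NN      :", KNeighborsClassifier(n_neighbors=1).fit(X, y).score(X, y))  # 1.0
```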
Problem 2: Adapted Perceptron Convergence Proof
This problem analyzes an adapted perceptron algorithm with margin condition γ.
Assumptions:
$\exists\, w^*$ such that $y^{(i)} \left( w^{*\top} x^{(i)} \right) > 0$ for all training samples
$\|w^*\| = 1$ and $\|x^{(i)}\| \le 1$
Margin $\gamma = \min_i \left| w^{*\top} x^{(i)} \right|$
Part (a): Prove that $w_{\text{new}}^\top w^* \ge w^\top w^* + \gamma$.
Proof:
$w_{\text{new}}^\top w^* = (w + yx)^\top w^* = w^\top w^* + y\, w^{*\top} x \ge w^\top w^* + \gamma$
The last step holds because $w^*$ correctly classifies every training sample, so $\operatorname{sign}(y) = \operatorname{sign}(w^{*\top} x)$ and therefore $y\, w^{*\top} x = \left| w^{*\top} x \right| \ge \gamma$.
Part (b): Prove that if $a \ge 0$, $b \ge 0$, $\gamma \ge 0$ and $a^2 \le b^2 + b\gamma + 1$, then $a \le b + \frac{1}{2b} + \frac{\gamma}{2}$.
Proof:
$a^2 \le b^2 + b\gamma + 1 = \left( b + \tfrac{\gamma}{2} \right)^2 + 1 - \tfrac{\gamma^2}{4} \le \left( b + \tfrac{\gamma}{2} \right)^2 + 1$
Since $\left( b + \tfrac{1}{2b} + \tfrac{\gamma}{2} \right)^2 = \left( b + \tfrac{\gamma}{2} \right)^2 + 2\left( b + \tfrac{\gamma}{2} \right)\tfrac{1}{2b} + \tfrac{1}{4b^2} \ge \left( b + \tfrac{\gamma}{2} \right)^2 + 1 \ge a^2$, we have $a \le b + \tfrac{1}{2b} + \tfrac{\gamma}{2}$.
Part (c): Prove that $\|w_{\text{new}}\| \le \|w\| + \frac{1}{2\|w\|} + \frac{\gamma}{2}$.
Proof:
$\|w_{\text{new}}\|^2 = \|w + yx\|^2 = \|w\|^2 + 2y\, w^\top x + \|x\|^2$ (using $y \in \{-1, +1\}$)
The algorithm only updates when $y\, w^\top x \le \frac{\gamma}{2}\|w\|$, and $\|x\| \le 1$, so:
$\|w_{\text{new}}\|^2 \le \|w\|^2 + \gamma \|w\| + 1$
Applying the result from part (b) with $a = \|w_{\text{new}}\|$ and $b = \|w\|$:
$\|w_{\text{new}}\| \le \|w\| + \frac{1}{2\|w\|} + \frac{\gamma}{2}$
Part (d): Prove that if $M \ge \frac{2}{\gamma^2}$, then $\exists\, t \le M$ such that after $t$ updates, $\|w_t\| \ge \frac{2}{\gamma}$.
Proof:
From part (a), after $t$ updates:
$w_t^\top w^* \ge t\gamma$
By the Cauchy–Schwarz inequality:
$w_t^\top w^* \le \|w_t\| \cdot \|w^*\| = \|w_t\|$
Therefore: $t\gamma \le \|w_t\|$
Assume for contradiction that $\|w_M\| < \frac{2}{\gamma}$ at $t = M$:
$M\gamma \le \|w_M\| < \frac{2}{\gamma}$
This implies $M < \frac{2}{\gamma^2}$, contradicting our assumption that $M \ge \frac{2}{\gamma^2}$. Therefore, $\|w_t\| \ge \frac{2}{\gamma}$ for some $t \le M$.
Part (e): Prove that $M \le \frac{8}{\gamma^2}$.
Proof:
From the analysis above:
$M\gamma \le w_M^\top w^* \le \|w_M\|$
From part (c), we can show that:
$\|w_M\|^2 \le 2M$
(since each update increases $\|w\|^2$ by at most $\gamma + 1 \le 2$ when $\gamma \le 1$)
Therefore:
$M\gamma \le \|w_M\| \le \sqrt{2M}$
Squaring both sides:
$M^2 \gamma^2 \le 2M \;\Rightarrow\; M \le \frac{2}{\gamma^2}$
Because the adapted algorithm triggers updates under the more permissive condition $y\, w^\top x \le \frac{\gamma}{2}\|w\|$ (rather than $y\, w^\top x \le 0$), the constant in the bound grows, giving:
$M \le \frac{8}{\gamma^2}$
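To make the adapted update rule concrete, here is a small synthetic experiment (my own sketch, not part of the assignment): generate data that a unit-norm $w^*$ separates with margin $\gamma$, run the rule "update whenever $y\, w^\top x \le \frac{\gamma}{2}\|w\|$", and compare the number of updates with the bound $8/\gamma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Separable toy data: w* = (1, 0) with ||w*|| = 1, points scaled so ||x|| <= 1,
# and points closer than 0.2 to the boundary discarded to enforce a margin.
w_star = np.array([1.0, 0.0])
X = rng.uniform(-1.0, 1.0, size=(500, 2))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)   # ensure ||x|| <= 1
keep = np.abs(X @ w_star) >= 0.2
X, y = X[keep], np.sign(X[keep] @ w_star)
gamma = np.min(np.abs(X @ w_star))

# Adapted perceptron: update whenever y * w.x <= (gamma / 2) * ||w||
w = np.zeros(2)
updates = 0
changed = True
while changed:
    changed = False
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= (gamma / 2) * np.linalg.norm(w):
            w += yi * xi
            updates += 1
            changed = True

print(f"gamma = {gamma:.3f}, updates = {updates}, bound 8/gamma^2 = {8 / gamma**2:.1f}")
```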
Python Implementation
Here’s a Python implementation of the perceptron algorithm for the 3D data:
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 -- registers the 3D projection

# Data points from the table
X = np.array([
    [1, 1, 1],   # index 1
    [0, 0, 1],   # index 2
    [0, 0, 0],   # index 3
    [1, 0, 0],   # index 4
    [0, 1, 0],   # index 5
])
y = np.array([1, 1, -1, -1, -1])   # labels

# Add bias term -> augmented vector [1, x1, x2, x3]
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

# Initialize weight vector w in R^4
w = np.zeros(X_aug.shape[1])

def predict(x):
    return 1 if np.dot(w, x) > 0 else -1

# Perceptron learning: sweep until a full pass makes no updates
changed = True
iteration = 0
while changed:
    changed = False
    for i in range(len(X_aug)):
        if y[i] * np.dot(w, X_aug[i]) <= 0:
            w = w + y[i] * X_aug[i]
            changed = True
    iteration += 1
    if iteration > 100:   # safeguard against non-separable data
        break

print("Final weight vector w:", w)
print("Iterations:", iteration)

# Plotting the data and the learned decision boundary in 3D
pos = X[y == 1]
neg = X[y == -1]

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(pos[:, 0], pos[:, 1], pos[:, 2],
           c='b', marker='o', label='Positive (+1)')
ax.scatter(neg[:, 0], neg[:, 1], neg[:, 2],
           c='r', marker='x', label='Negative (-1)')

xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 10),
                     np.linspace(-0.5, 1.5, 10))
zz = (-w[0] - w[1] * xx - w[2] * yy) / w[3]   # plane w0 + w1*x1 + w2*x2 + w3*x3 = 0
ax.plot_surface(xx, yy, zz, alpha=0.3, color='green')

ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.set_zlabel('x3')
ax.set_title('3D Data + Perceptron Decision Boundary')
ax.legend()
plt.show()
```
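On this dataset the loop exits after a few passes and prints the final weight vector [-2, 0, 0, 3], matching the hand-derived answer in Part (b); the iteration safeguard only matters for non-separable label assignments such as the swapped labels in Part (c).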
Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP)
Problem 1: Triangle Distribution
Part (a): Find the MLE estimate $\hat b$ for the $\mathrm{Triangle}(a, b)$ distribution.
Given PDF:
$p(x \mid a, b) = \begin{cases} \dfrac{2(x-a)}{(b-a)^2} & \text{for } x \in [a, b] \\ 0 & \text{otherwise} \end{cases}$
Solution:
The likelihood function:
$L(b) = \prod_{i=1}^{N} p(x^{(i)} \mid a, b) = (b-a)^{-2N} \prod_{i=1}^{N} 2(x^{(i)} - a)$
Taking the logarithm:
$\ln L(b) = -2N \ln(b-a) + \sum_{i=1}^{N} \ln 2(x^{(i)} - a)$
The derivative:
$\frac{\partial \ln L(b)}{\partial b} = -\frac{2N}{b-a} < 0$
Since the derivative is always negative, $L(b)$ decreases as $b$ grows. The likelihood is zero unless every sample lies in $[a, b]$, i.e., unless $b \ge \max_i x^{(i)}$, so the maximum occurs at the smallest feasible value:
Answer: $\hat b_{\text{MLE}} = \max_i x^{(i)}$ (the maximum observed value)
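As a numerical sanity check (my own sketch, with hypothetical values $a = 0$ and true $b = 2$): draw samples by inverse-transform sampling from the CDF $F(x) = \left(\frac{x-a}{b-a}\right)^2$ and scan the log-likelihood over candidate $b$; the maximizer sits at the largest sample.

```python
import numpy as np

a, b_true = 0.0, 2.0                                   # hypothetical parameters
rng = np.random.default_rng(0)
u = rng.uniform(size=1000)
x = a + (b_true - a) * np.sqrt(u)                      # F^{-1}(u) = a + (b-a)*sqrt(u)

def log_likelihood(b):
    if b < x.max():                                    # a sample outside [a, b] -> likelihood 0
        return -np.inf
    return -2 * len(x) * np.log(b - a) + np.sum(np.log(2 * (x - a)))

bs = np.linspace(x.max(), 3.0, 1000)
lls = np.array([log_likelihood(b) for b in bs])
print("argmax over grid:", bs[np.argmax(lls)])         # equals x.max(), the MLE
print("max sample      :", x.max())
```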
Problem 2: Discrete Distribution Estimation
Given 100 samples, with the number of observations of each $(X, Z)$ pair:

|       | X = 1 | X = 2 | X = 3 |
|-------|-------|-------|-------|
| Z = 1 | 18 | 7 | 6 |
| Z = 2 | 9 | 12 | 3 |
| Z = 3 | 10 | 2 | 9 |
| Z = 4 | 3 | 19 | 2 |
Part (a): Given measurement Z=3, what is the MLE for X?