Problem 1: Naive Bayes (25 pt)
1.1 Conditional Independence Properties
(a) Conditional Independence Implies Joint Conditional Factorization
Question: If $X \perp Y \mid Z$, can we conclude that $P(X, Y \mid Z) = P(X \mid Z)P(Y \mid Z)$?
Answer: Yes
$$
\begin{aligned}
P(X, Y \mid Z) &= \frac{P(X, Y, Z)}{P(Z)} \\
&= \frac{P(Z)P(X, Y \mid Z)}{P(Z)} \\
&= \frac{P(Z)P(Y \mid Z)P(X \mid Y, Z)}{P(Z)} \quad \text{(chain rule)} \\
&= \frac{P(Z)P(Y \mid Z)P(X \mid Z)}{P(Z)} \quad (Y \perp X \mid Z) \\
&= P(Y \mid Z)P(X \mid Z)
\end{aligned}
$$
(b) Conditional Independence Does Not Imply Marginal Independence
Question: If $X \perp Y \mid Z$, can we conclude that $P(X, Y) = P(X)P(Y)$?
Answer: No
Marginal independence $P(X, Y) = P(X)P(Y)$ is equivalent to $X \perp Y$, and $X \perp Y \mid Z$ does not imply $X \perp Y$. A counterexample: let $Z$ be a fair coin flip and let $X = Y = Z$. Given $Z$, both $X$ and $Y$ are constant, so $X \perp Y \mid Z$ holds trivially; marginally, however, $P(X = 1, Y = 1) = \tfrac{1}{2} \neq \tfrac{1}{4} = P(X = 1)P(Y = 1)$.
Conditional independence does not imply marginal independence.
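As a numeric sanity check (not part of the original solution), the counterexample can be verified by enumerating the joint distribution:

```python
import numpy as np

# Joint P(X, Y, Z) for the counterexample X = Y = Z with a fair coin Z.
# Axes: (x, y, z), each in {0, 1}.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 0.5  # Z=0 forces X=Y=0
P[1, 1, 1] = 0.5  # Z=1 forces X=Y=1

# Conditional independence given Z: P(x, y | z) == P(x | z) P(y | z)
for z in range(2):
    Pz = P[:, :, z] / P[:, :, z].sum()        # P(X, Y | Z=z)
    px = Pz.sum(axis=1)                       # P(X | Z=z)
    py = Pz.sum(axis=0)                       # P(Y | Z=z)
    assert np.allclose(Pz, np.outer(px, py))  # holds for both z

# Marginal independence fails: P(x, y) != P(x) P(y)
Pxy = P.sum(axis=2)
assert not np.allclose(Pxy, np.outer(Pxy.sum(axis=1), Pxy.sum(axis=0)))
print("X indep. of Y given Z holds, but marginal independence fails")
```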
(c) Number of Parameters for Boolean Features
Question: How many independent $\theta_{jc}$ parameters must be estimated for $d$ Boolean attributes and $C$ classes?
Answer: $C \times d$ independent $\theta_{jc}$ parameters
$X$ has $d$ attributes
For each class $c$, we estimate one Bernoulli parameter $\theta_{jc}$ for each attribute $X_j$
Therefore, the total number of independent parameters is $C \times d$
(d) Number of Parameters for Gaussian Features
Question: How many distinct $\mu_{jc}$ and $\sigma_{jc}$ parameters must be estimated for Gaussian features?
Answer: $2 \times C \times d$ distinct parameters (both $\mu_{jc}$ and $\sigma_{jc}$ have $C \times d$ parameters each)
Each attribute $X_j$ requires a distinct pair $(\mu_{jc}, \sigma_{jc})$ for each class $c$
The respective counts of $\mu_{jc}$ and $\sigma_{jc}$ parameters are both $C \times d$
Therefore, the total number of distinct parameters is $2 \times C \times d$
(e) Why Can We Omit the Denominator?
Question: Why is omitting the denominator in the classification rule acceptable?
Answer: The denominator is a normalization constant that does not depend on the candidate class, so omitting it cannot change which class maximizes the posterior.
When computing $\arg\max_c P(Y = c \mid X)$, the denominator $\sum_{v=1}^C P(Y = v)\prod_{k=1}^{d} P(X_k^{\text{new}} \mid Y = v)$ is the same for all classes $c$, so it doesn't affect which class achieves the maximum.
(f) Computing P(X) from Naive Bayes Parameters
Question: Is it possible to calculate $P(X)$ with Naive Bayes?
Answer: Yes
Using the law of total probability and the Naive Bayes likelihood:
$$
P(X = x) = \sum^C_{c = 1} P(X = x \mid Y = c)P(Y = c) = \sum^C_{c = 1} P(Y = c) \prod^d_{j = 1} P(X_j \mid Y = c)
$$
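A minimal sketch of this computation for Boolean features, assuming Bernoulli parameters `theta[c, j]` $= P(X_j = 1 \mid Y = c)$ and a prior vector `prior[c]` $= P(Y = c)$ (illustrative names, not from the assignment):

```python
import numpy as np

def nb_marginal(x, theta, prior):
    """P(X = x) under Naive Bayes via the law of total probability.

    x:     (d,) binary feature vector
    theta: (C, d) Bernoulli parameters, theta[c, j] = P(X_j = 1 | Y = c)
    prior: (C,) class prior P(Y = c)
    """
    # Per-class likelihood: prod_j P(X_j = x_j | Y = c)
    likelihood = np.prod(np.where(x == 1, theta, 1 - theta), axis=1)
    return np.dot(prior, likelihood)

# Example: 2 classes, 3 Boolean features
theta = np.array([[0.9, 0.2, 0.5],
                  [0.1, 0.7, 0.5]])
prior = np.array([0.5, 0.5])
print(nb_marginal(np.array([1, 0, 1]), theta, prior))
```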
1.2 Joint Distribution Factorization
Question: Express the joint distribution $P(X_1, X_2, X_3, Y)$ as a product of simpler conditional probabilities.
Given:
$P(X_1 \mid X_2, X_3, Y) = P(X_1 \mid Y) \Rightarrow X_1 \perp X_2, X_3 \mid Y$
$P(X_2 \mid X_1, X_3, Y) = P(X_2 \mid Y) \Rightarrow X_2 \perp X_1, X_3 \mid Y$
$P(X_3 \mid X_1, X_2, Y) = P(X_3 \mid X_1) \Rightarrow X_3 \perp X_2, Y \mid X_1$
Answer: $P(Y)P(X_1 \mid Y)P(X_3 \mid X_1)P(X_2 \mid Y)$
$$
\begin{aligned}
P(X_1, X_2, X_3, Y) &= P(Y)P(X_1, X_2, X_3 \mid Y) \\
&= P(Y)P(X_1 \mid Y)P(X_2, X_3 \mid X_1, Y) \\
&= P(Y)P(X_1 \mid Y)\underbrace{P(X_3 \mid X_1, Y)}_{= P(X_3 \mid X_1)}\underbrace{P(X_2 \mid X_3, X_1, Y)}_{= P(X_2 \mid Y)} \\
&= P(Y)P(X_1 \mid Y)P(X_3 \mid X_1)P(X_2 \mid Y)
\end{aligned}
$$
Maximum Likelihood Estimators
Let $\mathbb{I}\{e\}$ be the indicator function: $\mathbb{I}\{e\} = \begin{cases}1 & \text{if } e \text{ is true} \\ 0 & \text{if } e \text{ is false}\end{cases}$
(i) $P(Y = 1)$
Answer: $\frac{\sum^{N}_{i = 1} \mathbb{I}\{y^{(i)} = 1\}}{N}$
(ii) $P(X_1 = 1 \mid Y = y)$ for $y \in \{0, 1\}$
Answer: $\frac{\sum^{N}_{i = 1}\mathbb{I}\{x_1^{(i)} = 1,\, y^{(i)} = y\}}{\sum^{N}_{i = 1} \mathbb{I}\{y^{(i)} = y\}}$
(iii) $P(X_3 = 1 \mid Y = y)$ for $y \in \{0, 1\}$
Answer: $\sum^{1}_{x_1 = 0} \frac{\sum^{N}_{i = 1} \mathbb{I}\{x^{(i)}_1 = x_1,\, x_3^{(i)} = 1\}}{\sum^{N}_{i = 1} \mathbb{I}\{x^{(i)}_1 = x_1\}} \cdot \frac{\sum^{N}_{i = 1} \mathbb{I}\{y^{(i)} = y,\, x^{(i)}_1 = x_1\}}{\sum^{N}_{i = 1} \mathbb{I}\{y^{(i)} = y\}}$
Using marginalization:
$$
\begin{aligned}
P(X_3 \mid Y) &= \sum^{1}_{x_1 = 0} P(X_3, X_1 = x_1 \mid Y) \\
&= \sum^{1}_{x_1 = 0} P(X_3 \mid X_1 = x_1, Y) P(X_1 = x_1 \mid Y) \\
\Rightarrow P(X_3 = 1 \mid Y = y) &= \sum^{1}_{x_1 = 0} P(X_3 = 1 \mid X_1 = x_1)P(X_1 = x_1 \mid Y = y)
\end{aligned}
$$
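A sketch of these plug-in estimators on binary data, assuming `x1`, `x3`, `y` are 0/1 integer arrays over the same $N$ samples (illustrative names, not assignment code):

```python
import numpy as np

def mle_p_x3_given_y(x1, x3, y, y_val):
    """Plug-in MLE of P(X3 = 1 | Y = y_val) via marginalization over X1."""
    total = 0.0
    for v in (0, 1):
        p_x3_given_x1 = x3[x1 == v].mean()           # MLE of P(X3=1 | X1=v)
        p_x1_given_y = (x1[y == y_val] == v).mean()  # MLE of P(X1=v | Y=y_val)
        total += p_x3_given_x1 * p_x1_given_y
    return total

# Toy data drawn to exercise the estimator
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
x1 = (rng.random(1000) < np.where(y == 1, 0.8, 0.3)).astype(int)
x3 = (rng.random(1000) < np.where(x1 == 1, 0.9, 0.1)).astype(int)
print(mle_p_x3_given_y(x1, x3, y, y_val=1))
```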
1.3 MAP Rule for Gaussian Naive Bayes
Given:
$P(Y=0) = P(Y=1) = 0.5$
$(X_1 \mid Y=y) \sim \mathcal{N}(\mu_{1y}, 1)$, where $\mu_{10} = 0,\ \mu_{11} = 1$
$(X_2 \mid Y=y) \sim \mathcal{N}(\mu_{2y}, 1)$, where $\mu_{20} = 0,\ \mu_{21} = 1$
$(X_3 \mid X_1=x_1) \sim \mathcal{N}(2x_1, 1)$ (independent of $Y$ given $X_1$)
(a) Derive the MAP Rule
Answer: $\hat{y}(x) = \begin{cases} 1 & \text{if } x_1 + x_2 > 1 \\ 0 & \text{otherwise} \end{cases}$
Using the same conditional independence structure:
$$
P(X_1, X_2, X_3, Y) = P(Y)P(X_1 \mid Y)P(X_2 \mid Y)P(X_3 \mid X_1)
$$
For classification, we choose $Y = 1$ when:
$$
\begin{aligned}
& P(Y=1 \mid X_1, X_2, X_3) > P(Y=0 \mid X_1, X_2, X_3) \\
& \iff P(X_1 \mid Y=1)P(X_2 \mid Y=1) > P(X_1 \mid Y=0)P(X_2 \mid Y=0) \quad \text{(equal priors; } P(X_3 \mid X_1) \text{ cancels)} \\
& \iff (\mu_{11}-\mu_{10})x_1 - \tfrac{1}{2}(\mu_{11}^2-\mu_{10}^2) + (\mu_{21}-\mu_{20})x_2 - \tfrac{1}{2}(\mu_{21}^2-\mu_{20}^2) > 0
\end{aligned}
$$
Substituting the $\mu$ values:
$$
x_1 - \tfrac{1}{2} + x_2 - \tfrac{1}{2} > 0 \implies x_1 + x_2 > 1
$$
(b) Classify Two Points
Points to classify:
$X^{(a)} = \langle 0.2, 0.7, -10 \rangle$
$X^{(b)} = \langle 0.2, 0.7, 10 \rangle$
Answer: Both $X^{(a)}$ and $X^{(b)}$ are classified as $y = 0$: in each case $x_1 + x_2 = 0.9 < 1$, and the classifier never consults $X_3$.
The $X_3$ feature doesn't affect the classification because it's conditionally independent of $Y$ given $X_1$, so it cancels out of the posterior ratio (see the sketch below).
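A tiny sketch of the derived rule on the two points (hypothetical helper, not assignment code):

```python
import numpy as np

def map_rule(x):
    """MAP rule derived above: predict 1 iff x1 + x2 > 1 (x3 is unused)."""
    return int(x[0] + x[1] > 1)

print(map_rule(np.array([0.2, 0.7, -10])))  # 0
print(map_rule(np.array([0.2, 0.7, 10])))   # 0
```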
Problem 2: Gaussian Naive Bayes Implementation (25 pt)
Show Linear Classifier Equivalence
Question: Show that Gaussian Naive Bayes with equal class priors and equal variances across classes is equivalent to a linear classifier.
Answer:
For classification, we choose $Y = 1$ when $P(Y = 1 \mid x) > P(Y = 0 \mid x)$:
$$
\begin{aligned}
& P(Y=1 \mid x) > P(Y=0 \mid x) \\
& \iff P(x \mid Y=1) > P(x \mid Y=0) \quad \text{(equal priors)} \\
& \iff \sum_{j=1}^d \log P(X_j = x_j \mid Y=1) > \sum_{j=1}^d \log P(X_j = x_j \mid Y=0) \\
& \iff \sum_{j=1}^d \frac{(x_j-\mu_{0j})^2 - (x_j-\mu_{1j})^2}{2\sigma_j^2} > 0 \\
& \iff \sum_{j=1}^d \left(\frac{\mu_{1j}-\mu_{0j}}{\sigma_j^2}\right) x_j > \sum_{j=1}^d \frac{\mu_{1j}^2 - \mu_{0j}^2}{2\sigma^2_j}
\end{aligned}
$$
Let $w_j = \frac{\mu_{1j}-\mu_{0j}}{\sigma_j^2}$ and $\tau = \sum_{j=1}^d \frac{\mu_{1j}^2 - \mu_{0j}^2}{2\sigma^2_j}$. Then:
$$
\hat{y}(x) = \begin{cases} 1 & \text{if } \sum^{d}_{j = 1} w_j x_j > \tau \\ 0 & \text{otherwise} \end{cases}
$$
This shows the decision boundary is linear in the feature space.
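A quick numeric check of the equivalence, assuming shared per-feature variances `sigma2` and equal priors (illustrative setup; the class-independent normalization terms cancel and are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=(2, d))       # mu[c, j]: class-conditional means
sigma2 = rng.uniform(0.5, 2, d)    # shared across classes, per feature

# Linear form derived above
w = (mu[1] - mu[0]) / sigma2
tau = ((mu[1] ** 2 - mu[0] ** 2) / (2 * sigma2)).sum()

# Compare against the direct Gaussian NB log-likelihood comparison
X = rng.normal(size=(1000, d))

def log_lik(c):
    return (-(X - mu[c]) ** 2 / (2 * sigma2)).sum(axis=1)

gnb_pred = (log_lik(1) > log_lik(0)).astype(int)
lin_pred = (X @ w > tau).astype(int)
assert (gnb_pred == lin_pred).all()  # the two rules agree everywhere
```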
Problem 3: Logistic Regression Theory (25 pt)
3.1 Sigmoid Function Properties
(a) Prove $\sigma(-s) = 1 - \sigma(s)$
$$
\begin{aligned}
\sigma(-s) &= \frac{1}{1 + e^s} \\
&= \frac{1 + e^s}{1 + e^s} - \frac{e^s}{1 + e^s} \\
&= 1 - \frac{1}{e^{-s} + 1} \\
&= 1 - \sigma(s)
\end{aligned}
$$
This proves $P(y^{(i)} = -1 \mid x^{(i)}) + P(y^{(i)} = 1 \mid x^{(i)}) = \sigma(-w^\intercal x^{(i)}) + \sigma(w^\intercal x^{(i)}) = 1$.
(b) Prove $\sigma'(s) = \sigma(s)(1 - \sigma(s))$
$$
\begin{aligned}
\sigma'(s) &= -(1 + e^{-s})^{-2}(-e^{-s}) \\
&= (1 + e^{-s})^{-1}(1 + e^{-s})^{-1} e^{-s} \\
&= \sigma(s)\,\frac{e^{-s}}{1 + e^{-s}} \\
&= \sigma(s)(1 - \sigma(s))
\end{aligned}
$$
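Both identities are easy to spot-check numerically; a quick sanity check, not part of the proof:

```python
import numpy as np

def sigmoid(s):
    return 1 / (1 + np.exp(-s))

s = np.linspace(-5, 5, 101)
# (a) symmetry: sigma(-s) == 1 - sigma(s)
assert np.allclose(sigmoid(-s), 1 - sigmoid(s))
# (b) derivative: central finite differences vs sigma(s)(1 - sigma(s))
h = 1e-6
numeric = (sigmoid(s + h) - sigmoid(s - h)) / (2 * h)
assert np.allclose(numeric, sigmoid(s) * (1 - sigmoid(s)), atol=1e-8)
```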
3.2 Gradient of Log-Likelihood
Answer: $X^\intercal(\sigma(-y \odot Xw) \odot y)$
where $\odot$ denotes element-wise multiplication.
$$
\begin{aligned}
\log P(y \mid X, w) &= \sum^{N}_{i = 1} \log \sigma(y^{(i)}w^\intercal x^{(i)}) \\
\nabla_w \log P(y \mid X, w) &= \sum^{N}_{i = 1} (1 - \sigma(y^{(i)}w^\intercal x^{(i)}))\, y^{(i)}x^{(i)} \\
&= \sum^{N}_{i = 1} \sigma(-y^{(i)}w^\intercal x^{(i)})\, y^{(i)}x^{(i)} \\
&= X^\intercal(\sigma(-y \odot Xw) \odot y)
\end{aligned}
$$
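A vectorized sketch of this gradient with a finite-difference check, assuming labels $y^{(i)} \in \{-1, +1\}$ and $X$ of shape $(N, d)$ (illustrative code):

```python
import numpy as np

def sigmoid(s):
    return 1 / (1 + np.exp(-s))

def log_likelihood(w, X, y):
    return np.sum(np.log(sigmoid(y * (X @ w))))

def gradient(w, X, y):
    # X^T (sigma(-y . Xw) . y), with "." elementwise
    return X.T @ (sigmoid(-y * (X @ w)) * y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.choice([-1, 1], size=50)
w = rng.normal(size=4)

# Compare against central finite differences, coordinate by coordinate
h, g = 1e-6, gradient(w, X, y)
for j in range(4):
    e = np.zeros(4)
    e[j] = h
    num = (log_likelihood(w + e, X, y) - log_likelihood(w - e, X, y)) / (2 * h)
    assert abs(num - g[j]) < 1e-5
```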
3.3 Hessian Matrix
Answer: $-X^\intercal D X$, where $X$ is the $N \times d$ design matrix from above and $D$ is the $N \times N$ diagonal matrix with $D_{ii} = \sigma(y^{(i)}w^\intercal x^{(i)})(1 - \sigma(y^{(i)}w^\intercal x^{(i)}))$
Taking the derivative of the gradient:
$$
\begin{aligned}
\nabla^2_w \log P(y \mid X, w) &= \sum^{N}_{i = 1} (1 - \sigma(y^{(i)}w^\intercal x^{(i)}))\,\sigma(y^{(i)}w^\intercal x^{(i)})\,(-x^{(i)})x^{(i)\intercal} \\
&= -X^\intercal D X
\end{aligned}
$$
3.4 Prove Hessian is Negative Semi-Definite
For any vector $z$:
$$
\begin{aligned}
z^\intercal H z &= -z^\intercal X^\intercal D X z \\
&= -(Xz)^\intercal D (Xz) \\
&= -\sum^{N}_{i = 1} (1 - \sigma(y^{(i)}w^\intercal x^{(i)}))\,\sigma(y^{(i)}w^\intercal x^{(i)})\, a_i^2
\end{aligned}
$$
where $a = Xz$. Since both factors in each term are probabilities and hence non-negative, we have $z^\intercal H z \le 0$, proving the Hessian is negative semi-definite. This means the log-likelihood is concave, so every local maximum is a global maximum.
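A quick numeric check of both the Hessian formula and its negative semi-definiteness, under the same illustrative setup as the gradient check above:

```python
import numpy as np

def sigmoid(s):
    return 1 / (1 + np.exp(-s))

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = rng.choice([-1, 1], size=50)
w = rng.normal(size=4)

s = sigmoid(y * (X @ w))
D = np.diag(s * (1 - s))   # D_ii = sigma(...)(1 - sigma(...))
H = -X.T @ D @ X           # Hessian of the log-likelihood

# All eigenvalues of the symmetric Hessian should be <= 0 (up to rounding)
eigvals = np.linalg.eigvalsh(H)
assert (eigvals <= 1e-10).all()
```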
3.5 Gradient Descent Update Rule
Answer: $w_{t+1} = w_t + \alpha\, X^\intercal(\sigma(-y \odot Xw_t) \odot y)$ at iteration $t$, where $\alpha$ is the learning rate. The plus sign reflects gradient ascent on the log-likelihood, equivalently gradient descent on the negative log-likelihood.
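A minimal training-loop sketch of this update on synthetic separable data (illustrative, not assignment code):

```python
import numpy as np

def sigmoid(s):
    return 1 / (1 + np.exp(-s))

def fit_logistic(X, y, alpha=0.01, iters=1000):
    """Maximize the log-likelihood by gradient ascent; labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w += alpha * (X.T @ (sigmoid(-y * (X @ w)) * y))  # the update rule above
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X @ np.array([2.0, -1.0]) > 0, 1, -1)
w = fit_logistic(X, y)
print((np.sign(X @ w) == y).mean())  # training accuracy, ~1.0
```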
3.6 Newton’s Method Update Rule
Answer: $w_{t+1} = w_t + (X^\intercal D X)^{-1} \nabla l(w_t)$
Newton's method approximates the gradient at $w_{t+1}$ with a first-order Taylor expansion around $w_t$:
$$
\nabla l(w_{t+1}) \approx \nabla l(w_t) + \nabla^2 l(w_t)(w_{t+1} - w_t)
$$
Setting this to zero and solving for $w_{t+1}$:
$$
w_{t+1} = w_t - H^{-1}\nabla l(w_t) = w_t - (-X^\intercal D X)^{-1}\nabla l(w_t) = w_t + (X^\intercal D X)^{-1}\nabla l(w_t)
$$
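A sketch of the corresponding Newton iteration; the small ridge term is an assumption added in case $X^\intercal D X$ is singular:

```python
import numpy as np

def sigmoid(s):
    return 1 / (1 + np.exp(-s))

def fit_newton(X, y, iters=10, ridge=1e-8):
    """Newton's method for logistic regression; labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        s = sigmoid(y * (X @ w))
        grad = X.T @ ((1 - s) * y)                 # = X^T (sigma(-y . Xw) . y)
        XtDX = X.T @ ((s * (1 - s))[:, None] * X)  # X^T D X
        w += np.linalg.solve(XtDX + ridge * np.eye(len(w)), grad)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(rng.random(200) < sigmoid(X @ np.array([2.0, -1.0])), 1, -1)
print(fit_newton(X, y))  # should land near [2, -1]
```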
Problem 4: Programming - Optimization (25 pt)
4.1 Logistic Regression Implementation
Analysis of Feature Transformation
Because the input data are distributed in concentric circles, they are not linearly separable, so logistic regression, a linear model, cannot separate them with a linear decision boundary.
However, by mapping the data points into a higher-dimensional space with polynomial features ($x_1^2$, $x_2^2$, $x_1 x_2$), we enable the linear model to capture the non-linear pattern: the transformation creates a feature space in which the classes become linearly separable (see `transform` in the code section below).
4.2 L1 vs. L2 Regularization
Both L1 and L2 regularizations reduce overfitting by penalizing large weights:
L1 Regularization (Lasso): adds the penalty $\frac{\lambda}{m}\|w\|_1$
Gradient penalty: $\frac{\lambda}{m}\,\text{sign}(w)$
Drives insignificant weights to exactly zero (feature selection)
Preferred when only a subset of the features matters
L2 Regularization (Ridge): adds the penalty $\frac{\lambda}{2m}\|w\|_2^2$
Gradient penalty: $\frac{\lambda}{m} w$
Shrinks weights in proportion to their current values
Preserves all features with reduced magnitudes
Preferred when all features should contribute
The key difference is that L1 produces sparse solutions (many weights become zero), while L2 produces dense solutions (all weights remain non-zero but small).
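A sketch of how the two penalties enter the gradient step, using the same $\frac{\lambda}{m}$ scaling as the cost terms (illustrative helper, not assignment code):

```python
import numpy as np

def regularization_grad(weights, lambda_val, m, kind):
    """Gradient of the penalty term that gets added to the data gradient."""
    if kind == "L2":
        return lambda_val / m * weights           # shrink proportionally
    if kind == "L1":
        return lambda_val / m * np.sign(weights)  # constant pull toward zero
    return np.zeros_like(weights)

w = np.array([0.5, -0.01, 3.0])
print(regularization_grad(w, 0.1, 100, "L1"))  # same magnitude for every weight
print(regularization_grad(w, 0.1, 100, "L2"))  # scales with each weight
```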
4.3 Hyperparameter Tuning Results
Best Performing Hyperparameter Combination:
Learning rate: 0.1
Lambda: 0.0001
Best validation accuracy: 100.00%
Test Set Accuracy: 98.67%
4.4 Gradient Descent Variants Comparison
Convergence Speed: Stochastic GD > Mini-batch GD > Batch GD
More parameter updates per epoch lead to faster convergence
Stability: Batch GD > Mini-batch GD > Stochastic GD
Batch GD provides the most stable and predictable convergence
Stochastic and Mini-batch GD have noisier updates but converge faster
The plot shows:
Batch GD: Smooth convergence curve, slowest to converge
Mini-batch GD: Moderate noise, balanced speed and stability
Stochastic GD: Fastest convergence but most oscillation in individual updates
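A sketch showing that the three variants differ only in how many samples feed each parameter update (`grad` stands for any per-batch gradient function; illustrative code):

```python
import numpy as np

def run_epoch(w, X, y, grad, alpha, batch_size):
    """One epoch of (mini-batch) gradient descent.

    batch_size=1      -> stochastic GD (most updates, noisiest)
    batch_size=len(X) -> batch GD (one smooth update per epoch)
    anything between  -> mini-batch GD
    """
    idx = np.random.permutation(len(X))  # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        w = w - alpha * grad(w, X[batch], y[batch])
    return w
```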
Code Implementation
Gaussian Naive Bayes (hw2_q2.py)
```python
from typing import Tuple

import torch
from torch import Tensor, FloatTensor, LongTensor


def gaussian_theta(X: FloatTensor, y: LongTensor) -> Tuple[Tensor, Tensor]:
    """Calculate MLE for mu and sigma2 in Gaussian Naive Bayes"""
    # Per-class feature means, stacked into a (2, d) tensor
    mu = torch.stack([
        X[y == 0].mean(dim=0),
        X[y == 1].mean(dim=0),
    ], dim=0)
    # MLE variance: mean squared deviation from the class mean (no Bessel correction)
    sigma2 = torch.stack([
        ((X[y == 0] - mu[0]) ** 2).mean(dim=0),
        ((X[y == 1] - mu[1]) ** 2).mean(dim=0),
    ], dim=0)
    return mu, sigma2


def gaussian_p(y: LongTensor) -> Tensor:
    """Calculate MLE for P(Y=0)"""
    return (y == 0).sum() / len(y)


def gaussian_classify(mu: Tensor, sigma2: Tensor, p: FloatTensor, X: Tensor) -> LongTensor:
    """Classify using Gaussian Naive Bayes"""
    log_p = torch.stack([torch.log(p), torch.log(1 - p)])
    posteriors = []
    for c in range(2):
        # Sum of per-feature Gaussian log-densities for class c
        log_likelihoods = (
            -0.5 * torch.log(2 * torch.pi * sigma2[c])
            - (X - mu[c]) ** 2 / (2 * sigma2[c])
        ).sum(dim=1)
        posteriors.append(log_likelihoods + log_p[c])
    posteriors = torch.stack(posteriors, dim=1)
    return torch.argmax(posteriors, dim=1)
```
Logistic Regression Core Functions (hw2_q4.py)
```python
import numpy as np

EPSILON = 1e-15  # clipping constant for log stability (value assumed; defined elsewhere in hw2_q4.py)


def _sigmoid(self, z):
    """Sigmoid activation function"""
    return 1 / (1 + np.exp(-z))


def transform(self, X):
    """Transform features to polynomial space (x1, x2, x1^2, x2^2, x1*x2)"""
    col_v1 = X[:, 0].reshape(-1, 1)
    col_v2 = X[:, 1].reshape(-1, 1)
    return np.hstack([
        col_v1, col_v2,
        col_v1 ** 2, col_v2 ** 2,
        col_v1 * col_v2,
    ])


def _compute_cost(self, y, y_pred):
    """Binary cross-entropy with regularization"""
    m = len(y)
    # Clip predictions away from 0 and 1 so the logs stay finite
    y_pred_clipped = np.clip(y_pred, EPSILON, 1 - EPSILON)
    cost = -np.sum(y * np.log(y_pred_clipped) +
                   (1 - y) * np.log(1 - y_pred_clipped)) / m
    reg_cost = 0
    if self.regularization == "L2":
        reg_cost = self.lambda_val / (2 * m) * np.linalg.norm(self.weights) ** 2
    elif self.regularization == "L1":
        reg_cost = self.lambda_val / m * np.linalg.norm(self.weights, ord=1)
    return cost + reg_cost
```
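As a quick usage sketch (not from the assignment), the polynomial transform can be checked on circularly distributed data; since `self` is unused in `transform`, passing `None` lets us call the excerpted method standalone:

```python
import numpy as np

# Build circularly distributed data: inside vs. outside the unit circle
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
labels = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

# `self` is unused in transform, so None suffices for a standalone call
Phi = transform(None, X)
# In the transformed space, the circle x1^2 + x2^2 = 1 is the hyperplane
# Phi[:, 2] + Phi[:, 3] = 1, so a linear rule recovers the labels exactly
scores = Phi[:, 2] + Phi[:, 3]
print((labels == (scores > 1)).mean())  # 1.0
```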