3. Naive Bayes

MLE and MAP

Components

  • Dataset $\mathcal{D}$: samples drawn i.i.d. (independent and identically distributed) from some unknown distribution
  • $P_\theta(X, Y)$ approximates this unknown distribution

MLE (Maximum Likelihood Estimate)

  • Choose the $\theta$ that maximizes the probability of the observed data $\mathcal{D}$
$$\hat{\theta}_{MLE} = \argmax_{\theta} P(\mathcal{D} \mid \theta)$$

MAP (Maximum A Posteriori)

  • Choose the $\theta$ that is most probable given the prior probability and the observed data.
$$\begin{align*} \hat{\theta}_{MAP} &= \argmax_{\theta} P(\theta \mid \mathcal{D}) \\ &= \argmax_{\theta} \frac{P(\theta)P(\mathcal{D} \mid \theta)}{P(\mathcal{D})} \\ &= \argmax_{\theta} P(\mathcal{D} \mid \theta) P(\theta) \quad (P(\mathcal{D})\text{ does not depend on }\theta) \end{align*}$$

Example

  • A dataset $\mathcal{D}$ of i.i.d. coin flips produces $\alpha_H$ heads and $\alpha_T$ tails.
  • MLE (exactly the observed frequency of heads):
$$\begin{align*} \hat{\theta}_{MLE} &= \argmax_{\theta}P( \mathcal{D} \mid \theta) \\ &= \frac{\alpha_H}{\alpha_H + \alpha_T} \end{align*}$$
  • MAP (with a $Beta(\beta_H, \beta_T)$ prior on $\theta$):
$$\begin{align*} \hat{\theta}_{MAP} &= \argmax_{\theta}P(\theta \mid \mathcal{D}) \\ &= \frac{\alpha_H + \beta_H - 1}{(\alpha_H + \beta_H - 1) + (\alpha_T + \beta_T - 1)} \end{align*}$$

Note

$\beta_H - 1$ and $\beta_T - 1$ act as the numbers of hallucinated heads and tails contributed by the prior.
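The two coin estimates can be compared numerically. The following is a minimal sketch, assuming a Beta prior with parameters `beta_h` and `beta_t`; the function names and the counts are illustrative and not part of the notes.

```python
def coin_mle(alpha_h, alpha_t):
    """MLE of P(heads): the observed frequency of heads."""
    return alpha_h / (alpha_h + alpha_t)


def coin_map(alpha_h, alpha_t, beta_h, beta_t):
    """MAP of P(heads) under an assumed Beta(beta_h, beta_t) prior."""
    heads = alpha_h + beta_h - 1  # observed plus hallucinated heads
    tails = alpha_t + beta_t - 1  # observed plus hallucinated tails
    return heads / (heads + tails)


# Hypothetical data: 3 heads, 1 tail, and a Beta(3, 3) prior.
mle_estimate = coin_mle(3, 1)
map_estimate = coin_map(3, 1, 3, 3)
print(mle_estimate, map_estimate)
```

With a Beta(1, 1) prior no samples are hallucinated, so the MAP estimate coincides with the MLE.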

Example 2

  • $\mathcal{D}$: dataset of i.i.d. rolls of an $M$-sided die, where side $m$ was rolled $\alpha_m$ times.
  • $P(\mathcal{D} \mid \theta)$: likelihood of a Multinomial with parameters $\theta = \{\theta_1, \theta_2, \dots, \theta_M\}$
$$P(\mathcal{D} \mid \theta) \propto \theta^{\alpha_1}_1 \cdot \theta^{\alpha_2}_2 \cdots \theta^{\alpha_M}_M$$
  • $\theta_m$: probability of rolling side $m$

$\bullet$ The Prior

A Dirichlet distribution is used in this case:

$$P(\theta) = \frac{\theta^{\beta_1 - 1}_1\theta^{\beta_2 - 1}_2 \cdots \theta^{\beta_M - 1}_M}{B(\beta_1, \dots, \beta_M)}, \quad \theta \sim Dirichlet(\beta_1, \dots, \beta_M)$$

where $B(\beta_1, \dots, \beta_M)$ is the multivariate Beta function

$\bullet$ The Posterior

$$P(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta) P(\theta), \quad \theta \mid \mathcal{D} \sim Dirichlet(\alpha_1 + \beta_1, \dots, \alpha_M + \beta_M)$$

$\bullet$ MLE

$$\hat{\theta}^{MLE}_{m} = \frac{\alpha_m}{\sum^M_{v = 1}\alpha_v}$$

$\bullet$ MAP

$$\hat{\theta}^{MAP}_{m} = \frac{\alpha_m + \beta_m - 1}{\sum^M_{v = 1}(\alpha_v + \beta_v - 1)}$$
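As a small illustration of the die example, the sketch below computes both estimates from a list of counts. The counts and the symmetric prior with every $\beta_m = 2$ are made-up values, and the function names are hypothetical.

```python
def multinomial_mle(counts):
    """MLE for each side: its observed frequency."""
    total = sum(counts)
    return [a / total for a in counts]


def dirichlet_map(counts, betas):
    """MAP for each side under an assumed Dirichlet(betas) prior."""
    adjusted = [a + b - 1 for a, b in zip(counts, betas)]
    total = sum(adjusted)
    return [v / total for v in adjusted]


# Hypothetical rolls of a 6-sided die; sides 2 and 6 were never observed.
counts = [3, 0, 2, 1, 4, 0]
betas = [2] * 6  # one hallucinated roll per side

theta_mle = multinomial_mle(counts)
theta_map = dirichlet_map(counts, betas)
print(theta_mle)
print(theta_map)
```

The MLE assigns probability zero to the unseen sides, while the MAP estimate keeps every side possible.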

Tip

You can find the full derivations of MLE and MAP in MLE and MAP Proofs

How to Choose the Prior Distribution $P(\theta)$

  • Use prior knowledge about the domain (e.g. the coin is believed to be unbiased)
  • Or choose a mathematically convenient form (e.g. a conjugate prior):
  • If $P(\theta)$ is a conjugate prior for $P(\mathcal{D} \mid \theta)$, then the posterior has the same form as the prior
$$P(\theta \mid \mathcal{D}) \;\; \propto \;\; P(\mathcal{D} \mid \theta) \times P(\theta)$$

| Posterior | Likelihood | Prior |
| --- | --- | --- |
| Beta | Bernoulli | Beta |
| Beta | Binomial | Beta |
| Dirichlet | Multinomial | Dirichlet |
| Gaussian | Gaussian | Gaussian |

MLE Visualization

Note

INDICATOR FUNCTION

$$\mathbb{I} (e) = \begin{cases}1 \quad (e \text{ is true}) \\ 0 \quad (e \text{ is false})\end{cases}$$

MLE

$$\hat P^{MLE} (X = x) = \frac{\sum^N_{i = 1} \mathbb{I}(x^{(i)} = x)}{N}$$


Given $\mathcal{D} \sim P(Y, X)$, find the MLE of $P(Y \mid X)$:

$$\begin{align*} &\hat P^{MLE}(Y = y \mid X = x) \\ &= \frac{\hat P^{MLE}(Y = y, X = x)}{\hat P^{MLE}(X = x)} \\ &= \frac{\sum^N_{i = 1} \mathbb{I}(x^{(i)} = x, y^{(i)} = y)}{\sum^N_{i = 1} \mathbb{I}(x^{(i)} = x)} \end{align*}$$
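The counting estimate of the conditional probability can be sketched as follows. The weather data is made up for illustration, and returning `None` when the feature value was never observed is a choice of this sketch, not something the notes specify.

```python
def conditional_mle(xs, ys, x, y):
    """Estimate P(Y = y | X = x) as a ratio of indicator counts."""
    joint = sum(1 for xi, yi in zip(xs, ys) if xi == x and yi == y)
    marginal = sum(1 for xi in xs if xi == x)
    if marginal == 0:
        return None  # x never appears, so the ratio is undefined
    return joint / marginal


# Hypothetical dataset: one discrete feature and a binary label.
xs = ["sunny", "sunny", "rainy", "sunny", "rainy"]
ys = [1, 0, 0, 1, 0]

p_sunny = conditional_mle(xs, ys, "sunny", 1)
print(p_sunny)
```

The undefined case is the reason this estimator struggles once the number of possible feature configurations grows.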


D-dimensional Space

$$\hat P^{MLE}(Y = y \mid X_1 = x_1, \dots, X_d = x_d) = \frac{\sum^N_{i = 1} \mathbb{I}(x^{(i)}_1 = x_1, x^{(i)}_2 = x_2, \dots, x^{(i)}_d = x_d, y^{(i)} = y)}{\sum^N_{i = 1} \mathbb{I}(x^{(i)}_1 = x_1, x^{(i)}_2 = x_2, \dots, x^{(i)}_d = x_d)}$$

Caution

This estimate is only reliable when many training examples share exactly the same feature vector $x$, which is rarely the case in a high-dimensional space.

Important

Suppose $X_1, \dots, X_d$ and $Y$ are boolean random variables. How many parameters must we estimate for $P(Y \mid X_1, \dots, X_d)$? Answer: $2^d$, one value of $P(Y = 1 \mid X = x)$ for each of the $2^d$ possible configurations of the features.


Bayes Rule

$$\begin{align*} &(\forall k, j)\quad \\ &P(Y = y_k \mid X = x_j)\\ &= \frac{P(X = x_j \mid Y = y_k) P(Y = y_k)}{P(X = x_j)} \\ &= \frac{P(X = x_j \mid Y = y_k) P(Y = y_k)}{\sum_{i} P(X = x_j \mid Y = y_i) P(Y = y_i)} \end{align*}$$

Unfortunately, Bayes’ Rule Alone Does Not Reduce the Number of Parameters Needed

$\bullet$ Rewrite $P(Y \mid X_1, \dots, X_d)$ with Bayes’ Rule

$$P(Y \mid X_1, \dots, X_d) = \frac{P(X_1, \dots, X_d \mid Y)P(Y)}{P(X_1, \dots, X_d)}$$

$\bullet$ Parameter Counts

  • $P(X_1, \dots, X_d \mid Y = 1)$: $2^d - 1$
  • $P(X_1, \dots, X_d \mid Y = 0)$: $2^d - 1$
  • $P(Y)$: $1$

Caution

Therefore, the total number of parameters needed to obtain $P(Y \mid X)$ via Bayes’ Rule is $2(2^d - 1) + 1$, which is more than the original $2^d$.

Naive Bayes

Assumption

$$P(X_1, \dots, X_d \mid Y) = \prod^d_{j = 1} P(X_j \mid Y)$$

where

  • $X_1, \dots, X_d$ are conditionally independent given $Y$

What Is Conditional Independence?

$$\forall (j, k, t), \quad P(X = x_j \mid Y = y_k, Z = z_t) = P(X = x_j \mid Z = z_t)$$
  • When this holds, $X$ and $Y$ are conditionally independent given $Z$

Naive Bayes Successfully Reduces the Number of Parameters Needed

$\bullet$ Rewrite $P(Y \mid X_1, \dots, X_d)$ with Naive Bayes

$$P(Y \mid X_1, \dots, X_d) = \frac{\prod^{d}_{j = 1} P(X_j \mid Y)P(Y)}{P(X_1, \dots, X_d)}$$

$\bullet$ Parameter Counts

  • $\prod^d_{j = 1} P(X_j \mid Y = 1)$: $d$
  • $\prod^d_{j = 1} P(X_j \mid Y = 0)$: $d$
  • $P(Y)$: $1$

Tip

The total number of parameters is brought down to $2d + 1$
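To get a feel for how quickly the three parameter counts diverge for boolean features and a boolean label, here is a small sketch; the function names and the choice $d = 30$ are illustrative.

```python
def params_direct(d):
    """Estimating P(Y | X_1..X_d) directly: one value per feature configuration."""
    return 2 ** d


def params_bayes_rule(d):
    """Full joint P(X_1..X_d | Y) for both classes, plus P(Y)."""
    return 2 * (2 ** d - 1) + 1


def params_naive_bayes(d):
    """One P(X_j | Y) per feature and class, plus P(Y)."""
    return 2 * d + 1


d = 30
print(params_direct(d), params_bayes_rule(d), params_naive_bayes(d))
```

For 30 boolean features the first two counts exceed one billion, while Naive Bayes needs 61 parameters.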

For Optimization

$$\begin{align*} &P(Y = y_k \mid X_1, \dots, X_d) \\ &= \frac{\prod^d_{j = 1} P(X_j \mid Y = y_k)P(Y = y_k)}{P(X_1, \dots, X_d)} \\ &\propto P(Y = y_k) \prod^d_{j = 1} P(X_j \mid Y = y_k) \\ &\Rightarrow Y_{new} \leftarrow \argmax_{y_k} P(Y = y_k) \prod^d_{j = 1} P(X^{new}_j \mid Y = y_k) \quad \text{(given a new instance)} \end{align*}$$
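The classification rule can be sketched in a few lines. This version sums logarithms instead of multiplying probabilities, which selects the same class and avoids numerical underflow; the probability tables are made-up values, the data layout is an assumption of this sketch, and all probabilities are assumed strictly positive.

```python
import math


def nb_predict(x_new, prior, cond):
    """Return the class maximizing P(Y=c) * prod_j P(X_j = x_j | Y=c).

    prior[c] = P(Y = c); cond[c][j][v] = P(X_j = v | Y = c).
    """
    best_class, best_score = None, -math.inf
    for c in prior:
        score = math.log(prior[c])
        for j, v in enumerate(x_new):
            score += math.log(cond[c][j][v])
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical model with two boolean features.
prior = {0: 0.6, 1: 0.4}
cond = {
    0: [{0: 0.8, 1: 0.2}, {0: 0.5, 1: 0.5}],
    1: [{0: 0.1, 1: 0.9}, {0: 0.3, 1: 0.7}],
}

print(nb_predict((1, 1), prior, cond))
print(nb_predict((0, 0), prior, cond))
```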

Naive Bayes - Discrete Features

$$\begin{align*} X_j &\in \{1, 2, \dots, K_j\}, \forall j \in \{1, 2, \dots, d\} \\ Y &\in \{1, 2, \dots, C\} \\ P(X_j = k \mid Y = c) &= \theta_{jkc} \end{align*}$$

where

  • $\sum^{K_j}_{k = 1} \theta_{jkc} = 1$: for each feature $j$ and label $c$, the probabilities of the $K_j$ possible values of $X_j$ sum to 1

Maximum Likelihood Estimates (MLE)

Prior

$$\begin{align*} \hat{\pi}^{MLE}_c &= \hat{P}^{MLE} (Y = c) \\ &= \frac{\# \text{ of samples in class } c}{\# \text{ of samples}} \\ &= \frac{\sum^{N}_{i = 1} \mathbb{I}\{y^{(i)} = c\}}{N} \end{align*}$$

Hallucinated Prior

$$\hat{\pi}^{MLE}_c = \frac{\sum^{N}_{i = 1} \mathbb{I}\{y^{(i)} = c\} + l}{N + lC}$$

where $l$ is the number of hallucinated samples added to each of the $C$ classes.

Likelihood

$$\begin{align*} \hat{\theta}^{MLE}_{jkc} &= \hat{P}^{MLE}(X_j = k \mid Y = c) \\ &= \frac{\# \text{ of samples with label } c \text{ and feature } X_j = k}{\# \text{ of samples with label } c} \\ &= \frac{\sum^{N}_{i = 1} \mathbb{I}\{y^{(i)} = c,\ x^{(i)}_j = k\}}{\sum^{N}_{i = 1} \mathbb{I}\{y^{(i)} = c\}} \end{align*}$$

Hallucinated Likelihood

$$\hat{\theta}^{MLE}_{jkc} = \frac{\sum^{N}_{i = 1} \mathbb{I}\{y^{(i)} = c,\ x^{(i)}_j = k\} + l}{\sum^{N}_{i = 1} \mathbb{I}\{y^{(i)} = c\} + lK_j}$$
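The prior and likelihood estimates for discrete features, with and without hallucinated samples, can be sketched as below. The function name, the dictionary layout and the tiny dataset are assumptions for illustration; `l = 0` gives the plain counting estimates, and every class is assumed to appear at least once in that case.

```python
from collections import Counter


def fit_discrete_nb(X, y, num_values, classes, l=0.0):
    """Estimate pi_c and theta_jkc from counts, adding l hallucinated samples.

    X[i][j] takes values in 1..num_values[j]; y[i] is a label in classes.
    """
    n = len(y)
    class_count = Counter(y)
    joint_count = Counter()
    for xi, yi in zip(X, y):
        for j, k in enumerate(xi):
            joint_count[(j, k, yi)] += 1

    pi = {c: (class_count[c] + l) / (n + l * len(classes)) for c in classes}
    theta = {}
    for c in classes:
        for j, k_j in enumerate(num_values):
            for k in range(1, k_j + 1):
                theta[(j, k, c)] = (joint_count[(j, k, c)] + l) / (
                    class_count[c] + l * k_j
                )
    return pi, theta


# Hypothetical data: two features with two values each, two classes.
X = [(1, 2), (1, 1), (2, 2), (2, 1)]
y = [1, 1, 2, 2]
pi, theta = fit_discrete_nb(X, y, num_values=[2, 2], classes=[1, 2], l=1)
print(pi[1], theta[(0, 1, 1)], theta[(0, 2, 1)])
```

Feature value 2 of the first feature never occurs with class 1, yet its smoothed probability stays above zero.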

Learning to Classify Documents: $P(Y \mid X)$

Question

  • Given a document of length $M$
  • $Y$ takes discrete values
  • $X = \langle X_1, \dots, X_d \rangle$
  • $\forall j \in \{1, 2, \dots, d\}$, $X_j$ is a random variable describing what? Example document:
I am pleased to announce that Bob Frederking of the Language
Technologies Institute is our new Associate Dean for Graduate
Programs. In this role, he oversees the many issues that arise
with our multiple masters and PhD programs. Bob brings to this
positions considerable expereince with the masters and PhD
programs in the LTI.

I would like to thank Frank Pfenning, who has served ably in this
role for the past two years.

Answer

  • $d$: size of the vocabulary (assuming word positions are independent)
  • $X_j$: the count of word $j$ in the document (here, an email)
  • $M = \sum^d_{j = 1} X_j$

$\bullet$ Likelihood:

$$P(X \mid M, Y = c) = \frac{M!}{X_1!X_2! \cdots X_d!} \prod_{j = 1}^{d} (\theta_{jc})^{X_j}$$

where

  • $\theta_{jc}$ is the probability of selecting word $j$ under label $c$
  • $\sum^d_{j = 1} \theta_{jc} = 1$
  • $\prod_{j = 1}^{d} (\theta_{jc})^{X_j}$:
      • the product of the word probabilities $\theta_{jc}$, each raised to the number of times $X_j$ that word $j$ occurs in the document
      • the probability that one particular ordering (pattern) of these words occurs under label $Y = c$
  • $\frac{M!}{X_1!X_2! \cdots X_d!}$: the number of distinct orderings of the pattern
      • $M!$: the number of permutations of the $M$ word positions
      • $\frac{1}{X_1!X_2!\cdots X_d!}$: cancels out the permutations among copies of the same word

$\bullet$ MLE:

$$\begin{align*} \hat{\theta}_{jc}^{MLE} &= \frac{\# \text{ of times word } j \text{ appears in emails with label } c}{\# \text{ of words in all emails with label } c} \\ &= \frac{\sum^N_{i = 1}\mathbb{I}\{y^{(i)} = c\}x_j^{(i)}}{\sum^{N}_{i = 1} \mathbb{I}\{y^{(i)} = c\}\sum^d_{v = 1} x_v^{(i)}} \end{align*}$$

$\bullet$ MAP:

$$\hat{\theta}^{MAP}_{jc} = \frac{\sum^N_{i = 1}\mathbb{I}\{y^{(i)} = c\}x_j^{(i)} + \beta_{jc}}{\sum^{N}_{i = 1} \mathbb{I}\{y^{(i)} = c\}\sum^d_{v = 1}x^{(i)}_v + \sum^d_{v = 1}\beta_{vc}}$$

Note

  • Here $\beta_{jc}$ is the number of hallucinated occurrences of word $j$ in class $c$
  • $\beta_{jc} = 1$ gives Laplace smoothing
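The word-probability estimates can be sketched on a toy corpus. As a simplifying assumption, this sketch uses one shared hallucination count `beta` for every word instead of a separate $\beta_{jc}$; the word-count matrix is made up, and the class is assumed to contain at least one word when `beta` is 0.

```python
def fit_word_probs(X, y, c, beta=0.0):
    """Estimate theta_jc for class c from word counts.

    X[i][j] is the count of word j in document i; beta is the number of
    hallucinated occurrences added to every word (beta = 0 gives the MLE).
    """
    d = len(X[0])
    word_totals = [0] * d
    for xi, yi in zip(X, y):
        if yi == c:
            for j in range(d):
                word_totals[j] += xi[j]
    denom = sum(word_totals) + beta * d
    return [(w + beta) / denom for w in word_totals]


# Hypothetical corpus: 3 documents over a 3-word vocabulary.
X = [[2, 1, 0], [1, 0, 0], [0, 3, 1]]
y = [1, 1, 0]

theta_mle = fit_word_probs(X, y, c=1)
theta_laplace = fit_word_probs(X, y, c=1, beta=1)
print(theta_mle)
print(theta_laplace)
```

With `beta=1` the third word, which never appears in class 1, still receives a nonzero probability.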

Continuous Data

  • For continuous data, we assume that the likelihood of each feature follows a Gaussian distribution
$$P(Y = c \mid X_1, \dots, X_d) = \frac{P(Y = c)\prod^d_{j = 1} P(X_j\mid Y = c)}{\sum^C_{k = 1} P(Y = k)\prod^d_{j = 1} P(X_j \mid Y = k)}$$

where

  • $P(X_j \mid Y = c) \sim N(\mu_{jc}, \sigma^2_{jc})$

Normal Distribution

$$X \sim N(\mu, \sigma^2), \quad p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}(\frac{x - \mu}{\sigma})^2}$$

where

  • $\int^{\infty}_{-\infty} p(x) dx = 1$

Gaussian Naive Bayes (GNB)

$$P(X_j = x\mid Y = c) = \frac{1}{\sqrt{2\pi\sigma^2_{jc}}} e^{-\frac{1}{2}(\frac{x - \mu_{jc}}{\sigma_{jc}})^2}$$

Note

Sometimes we assume the variance $\sigma^2_{jc}$ is independent of the feature $X_j$ (giving $\sigma^2_c$), of the class $Y$ (giving $\sigma^2_j$), or of both (giving $\sigma^2$).

Estimating Parameters: Discrete Y, Continuous X

  • Given dataset: $\{x^{(i)}, y^{(i)}\}_{i =1}^N$
  • $x^{(i)} \in \R^d$
  • $y^{(i)} \in \{1, 2, \dots, C\}$
  • $j$: feature index
  • $i$: sample index
  • $c$: class index

$\bullet$ $\hat{\mu}_{jc}$

$$\hat{\mu}_{jc} = \frac{1}{\sum^N_{i = 1} \mathbb{I}\{y^{(i)} = c\}} \sum^{N}_{i = 1} x_j^{(i)} \mathbb{I}\{y^{(i)} = c\}$$

$\bullet$ $\hat{\sigma}_{jc}^2$

$$\hat{\sigma}_{jc}^2 = \frac{1}{\sum^N_{i = 1} \mathbb{I}\{y^{(i)} = c\}} \sum^N_{i =1} (x_j^{(i)} - \hat{\mu}_{jc})^2 \mathbb{I}\{y^{(i)} = c\}$$
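Both estimators are per-class averages, as the following sketch shows. The function name and the data are hypothetical, and the class is assumed to contain at least one sample.

```python
def fit_gaussian_nb(X, y, c):
    """Return per-feature mean and variance (MLE) for the samples of class c."""
    rows = [xi for xi, yi in zip(X, y) if yi == c]
    n_c = len(rows)  # number of samples with label c
    d = len(X[0])
    mu = [sum(r[j] for r in rows) / n_c for j in range(d)]
    # MLE variance divides by n_c, not n_c - 1.
    var = [sum((r[j] - mu[j]) ** 2 for r in rows) / n_c for j in range(d)]
    return mu, var


# Hypothetical data: 2 features, labels in {0, 1}.
X = [[1.0, 2.0], [3.0, 2.0], [10.0, 0.0]]
y = [1, 1, 0]

mu_1, var_1 = fit_gaussian_nb(X, y, c=1)
print(mu_1, var_1)
```

A feature that is constant within a class gets variance zero, which makes the Gaussian density undefined, so practical implementations usually add a small constant to the variance.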

If the variance of each feature is independent of the class ($\sigma^2_{jc} = \sigma^2_j$), a new instance is classified by

$$Y_{new} \leftarrow \argmax_{y \in \{0, 1\}} P(Y =y)\prod^d_{j = 1} P(X^{new}_j\mid Y = y)$$


Decision Boundary

  • The decision boundary occurs where $P(Y = 0) \prod^d_{j = 1}P(X_j \mid Y = 0) = P(Y =1) \prod^d_{j =1} P(X_j \mid Y= 1)$


Multinomial Naive Bayes

Tip

When is an input classified with label 1?

$$\begin{align*} &P(Y = 1 \mid X) > P(Y = 0 \mid X) \\ &\iff P(X\mid Y = 1) P(Y = 1) > P(X \mid Y = 0 ) P(Y = 0) \\ &\iff \frac{M!}{X_1!X_2!\cdots X_d!} \prod^d_{j = 1} (\theta_{j1})^{X_j} P(Y = 1) > \frac{M!}{X_1!X_2!\cdots X_d!} \prod^d_{j = 1} (\theta_{j0})^{X_j} P(Y = 0) \\ &\iff \prod^d_{j = 1} (\theta_{j1})^{X_j} P(Y = 1) > \prod^d_{j = 1} (\theta_{j0})^{X_j} P(Y = 0) \\ &\iff \sum^d_{j = 1} {X_j}\ln(\theta_{j1}) + \ln P(Y = 1) > \sum^d_{j = 1} {X_j}\ln(\theta_{j0}) + \ln P(Y = 0) \\ &\iff \sum^d_{j = 1} {X_j}(\ln(\theta_{j1}) - \ln(\theta_{j0})) + \ln P(Y = 1) - \ln P(Y = 0) > 0 \\ &\text{let } w_0 = \ln P(Y = 1) - \ln P(Y = 0),\ w_j = \ln(\theta_{j1}) - \ln(\theta_{j0}) \\ &\iff w_0 + \sum^d_{j = 1} X_jw_j > 0 \quad (\text{a linear classifier}) \end{align*}$$
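The equivalence between the probabilistic rule and the linear rule can be checked numerically. The word probabilities, the class prior and the helper names below are made up for illustration, and all probabilities are assumed strictly positive so the logarithms exist.

```python
import math


def linear_weights(theta1, theta0, p1):
    """Turn multinomial NB parameters into weights w_0 and w_j."""
    w0 = math.log(p1) - math.log(1 - p1)
    w = [math.log(a) - math.log(b) for a, b in zip(theta1, theta0)]
    return w0, w


def predict_linear(x, w0, w):
    """Label 1 iff w_0 + sum_j x_j * w_j > 0."""
    return 1 if w0 + sum(xj * wj for xj, wj in zip(x, w)) > 0 else 0


def predict_direct(x, theta1, theta0, p1):
    """Label 1 iff P(X | Y=1) P(Y=1) > P(X | Y=0) P(Y=0), up to the shared factor."""
    score1 = p1 * math.prod(t ** xj for t, xj in zip(theta1, x))
    score0 = (1 - p1) * math.prod(t ** xj for t, xj in zip(theta0, x))
    return 1 if score1 > score0 else 0


# Hypothetical parameters for a 3-word vocabulary.
theta1 = [0.7, 0.2, 0.1]
theta0 = [0.2, 0.3, 0.5]
w0, w = linear_weights(theta1, theta0, p1=0.5)

docs = [[3, 1, 0], [0, 1, 3]]
print([predict_linear(x, w0, w) for x in docs])
```

The multinomial coefficient never enters the code because it is identical on both sides of the inequality.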

Gaussian Naive Bayes

  • Consider $f: X \rightarrow Y$
  • $X$: a vector of real-valued features $\langle X_1, \dots, X_d \rangle$
  • $Y$: a boolean variable
  • Assume all $X_j$ are conditionally independent given $Y$
  • Model $P(X_j \mid Y = c) \sim \mathcal{N}(\mu_{jc}, \sigma^2_{j})$
  • Model $P(Y) \sim Bernoulli(\pi)$, i.e. $\pi = P(Y = 1)$
$$\begin{align*} P(Y = 0 \mid X) &= \frac{P(Y = 0)P(X \mid Y = 0)}{P(Y = 0)P(X \mid Y = 0) + P(Y = 1) P(X \mid Y = 1)} \\ &= \frac{1}{1 + \frac{P(Y = 1) P(X \mid Y = 1)}{P(Y = 0)P(X \mid Y = 0)}} \\ &= \frac{1}{1 + \frac{\pi P(X \mid Y = 1)}{(1 - \pi)P(X \mid Y = 0)}} \\ &= \frac{1}{1 + \exp \left(\ln \frac{\pi}{1 - \pi} + \ln\frac{P(X \mid Y = 1)}{P(X \mid Y = 0)}\right)} \\ &= \frac{1}{1 + \exp \left(\ln \frac{\pi}{1 - \pi} + \sum^{d}_{j = 1} \ln \frac{\exp\left(\frac{- (X_j - \mu_{j1})^2}{2\sigma^2_j}\right)}{\exp\left(\frac{- (X_j - \mu_{j0})^2}{2\sigma^2_j}\right)}\right)} \quad\quad \left(\text{using } P(X_j \mid Y = c) = \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left(- \frac{(X_j - \mu_{jc})^2}{2\sigma^2_{j}}\right)\right) \\ &= \frac{1}{1 + \exp \left(\ln \frac{\pi}{1 - \pi} + \sum^{d}_{j = 1} \left(\frac{-X_j^2 + 2X_j\mu_{j1} - \mu^2_{j1}}{2\sigma^2_j} + \frac{X_j^2 - 2X_j\mu_{j0} + \mu^2_{j0}}{2\sigma^2_j}\right)\right)} \\ &= \frac{1}{1 + \exp \left(\ln \frac{\pi}{1 - \pi} + \sum^{d}_{j = 1} \left(\frac{\cancel{-X_j^2} + 2X_j\mu_{j1} - \mu^2_{j1}}{2\sigma^2_j} + \frac{\cancel{X_j^2} - 2X_j\mu_{j0} + \mu^2_{j0}}{2\sigma^2_j}\right)\right)} \\ &= \frac{1}{1 + \exp \left(\ln \frac{\pi}{1 - \pi} + \sum^{d}_{j = 1} \left(X_j\frac{\mu_{j1} - \mu_{j0}}{\sigma^2_j} + \frac{\mu_{j0}^2 - \mu_{j1}^2}{2\sigma^2_j}\right)\right)} \\ &= \frac{1}{1 + \exp \left(\ln \frac{\pi}{1 - \pi} + \sum^{d}_{j = 1} \frac{\mu_{j0}^2 - \mu_{j1}^2}{2\sigma^2_j} + \sum^{d}_{j = 1} X_j\frac{\mu_{j1} - \mu_{j0}}{\sigma^2_j}\right)} \\ &= \frac{1}{1 + e^{w_0 + \sum^d_{j = 1} w_j X_j}} \end{align*}$$

where

$$w_0 = \ln \frac{\pi}{1 - \pi} + \sum^{d}_{j = 1} \frac{\mu_{j0}^2 - \mu_{j1}^2}{2\sigma^2_j}, \quad w_j = \frac{\mu_{j1} - \mu_{j0}}{\sigma^2_j}$$

$$\begin{align*} P(Y = 1 \mid X) &= 1 - \frac{1}{1 + e^{w_0 + \sum^d_{j = 1} w_j X_j}} \\ &= \frac{e^{w_0 + \sum^d_{j = 1} w_j X_j}}{1 + e^{w_0 + \sum^d_{j = 1} w_j X_j}} \end{align*}$$

$$\begin{align*} &\Rightarrow \frac{P(Y = 1 \mid X)}{P(Y = 0 \mid X)} = e^{w_0 + \sum^d_{j = 1} w_j X_j} \gtrless 1 \\ &\Rightarrow \ln \frac{P(Y = 1 \mid X)}{P(Y = 0 \mid X)} = w_0 + \sum^d_{j = 1} w_j X_j \gtrless 0 \quad\quad \text{(Linear Classification Rule)} \end{align*}$$
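As a sanity check on the derivation, the sketch below evaluates $P(Y = 0 \mid X)$ twice: once directly from Bayes' rule with Gaussian likelihoods, and once through the weights $w_0 = \ln \frac{\pi}{1 - \pi} + \sum_j \frac{\mu_{j0}^2 - \mu_{j1}^2}{2\sigma^2_j}$ and $w_j = \frac{\mu_{j1} - \mu_{j0}}{\sigma^2_j}$. All parameter values and function names are made up for illustration.

```python
import math


def gaussian_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def posterior_y0_bayes(x, pi, mu0, mu1, var):
    """P(Y=0 | X) from Bayes' rule with class-independent variances."""
    like0 = math.prod(gaussian_pdf(xj, m, v) for xj, m, v in zip(x, mu0, var))
    like1 = math.prod(gaussian_pdf(xj, m, v) for xj, m, v in zip(x, mu1, var))
    return (1 - pi) * like0 / ((1 - pi) * like0 + pi * like1)


def posterior_y0_logistic(x, pi, mu0, mu1, var):
    """P(Y=0 | X) as 1 / (1 + exp(w_0 + sum_j w_j x_j))."""
    w0 = math.log(pi / (1 - pi)) + sum(
        (m0 ** 2 - m1 ** 2) / (2 * v) for m0, m1, v in zip(mu0, mu1, var)
    )
    w = [(m1 - m0) / v for m0, m1, v in zip(mu0, mu1, var)]
    z = w0 + sum(wj * xj for wj, xj in zip(w, x))
    return 1 / (1 + math.exp(z))


# Hypothetical parameters: pi = P(Y = 1), two features.
x = [1.0, -0.5]
pi, mu0, mu1, var = 0.3, [0.0, 1.0], [2.0, -1.0], [1.0, 4.0]

p_bayes = posterior_y0_bayes(x, pi, mu0, mu1, var)
p_logistic = posterior_y0_logistic(x, pi, mu0, mu1, var)
print(p_bayes, p_logistic)
```

The two values agree up to floating-point error only because the variance is shared across classes; with class-specific variances the quadratic terms no longer cancel.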

Important

$P(Y = 0 \mid X)$ has the form of the logistic function:

$$\frac{1}{1 + e^{w_0 + \sum^d_{j = 1} w_jX_j}}$$


| Aspect | Logistic Regression | Naive Bayes |
| --- | --- | --- |
| Type of model | Discriminative: models $P(Y \mid X)$ directly | Generative: models $P(X \mid Y)$ and $P(Y)$, then applies Bayes’ rule to get $P(Y \mid X)$ |
| Key assumption | No independence assumption between features | Conditional independence of features given the class |
| Decision boundary | Linear in feature space (unless nonlinear terms are added) | Linear if variances are equal across classes (e.g., Gaussian NB), otherwise nonlinear |
| Output | Direct estimate of $P(Y=1 \mid X)$ | Computed from $P(X \mid Y) P(Y)$ |
| Interpretability | Coefficients show the log-odds effect of features | Parameters correspond to class-conditional feature distributions |
| Data requirement | Needs more data for stable estimates | Performs well even with small datasets |
| Handling correlated features | Can handle correlations and interactions | Breaks down if features are correlated |
| Training speed | Slower: requires iterative optimization | Very fast: closed-form parameter estimation |
| Probability calibration | Usually well-calibrated probabilities | Often overconfident, may need calibration |
| Regularization | Supports L1/L2 regularization | Usually no regularization (can add smoothing) |
| Common variants | L1/L2 Logistic Regression, Multinomial LR | Gaussian NB, Multinomial NB, Bernoulli NB |
| Typical use cases | Continuous or mixed data, interpretability needed | Text classification, spam detection, sentiment analysis |