MLE and MAP Proofs

MLE and MAP for Coin Flipping

  • $\mathcal{D}$: $\alpha_H$ heads, $\alpha_T$ tails.
  • $P(\mathcal{D} \mid \theta)$:

$$P(\mathcal{D} \mid \theta) = \binom{\alpha_H + \alpha_T}{\alpha_H} \theta^{\alpha_H} (1 - \theta)^{\alpha_T}$$

MLE Proof

$$\begin{align*}
\hat{\theta}^{MLE} &= \arg\max_{\theta} \binom{\alpha_H + \alpha_T}{\alpha_H} \theta^{\alpha_H} (1 - \theta)^{\alpha_T} \\
&= \arg\max_{\theta} \left\{ \ln\left(\theta^{\alpha_H}(1 - \theta)^{\alpha_T}\right) \right\} \quad \left(\text{the binomial coefficient does not depend on } \theta\right) \\
&= \arg\max_{\theta} \left\{ \alpha_H \ln\theta + \alpha_T \ln(1 - \theta) \right\} \\
&\Rightarrow 0 = \frac{\partial}{\partial \theta}\left(\alpha_H \ln\theta + \alpha_T \ln(1 - \theta)\right) = \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1 - \theta} \\
&\Rightarrow \alpha_H (1 - \theta) = \alpha_T\, \theta \\
&\Rightarrow \hat{\theta}^{MLE} = \frac{\alpha_H}{\alpha_H + \alpha_T}
\end{align*}$$
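As a quick numerical sanity check, here is a minimal Python sketch (the counts and grid resolution are made up for illustration, not from the notes) that maximizes the log-likelihood over a grid and compares the result to the closed form:

```python
# Numerical check of the coin-flip MLE against the closed form
# theta = alpha_H / (alpha_H + alpha_T). Counts here are invented.
import numpy as np

alpha_H, alpha_T = 7, 3  # observed heads / tails (illustrative values)

# Log-likelihood up to the constant binomial coefficient.
theta = np.linspace(1e-6, 1 - 1e-6, 100_000)
log_lik = alpha_H * np.log(theta) + alpha_T * np.log(1 - theta)

print(theta[np.argmax(log_lik)])      # numerical argmax, ~0.7
print(alpha_H / (alpha_H + alpha_T))  # closed form, 0.7
```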

MAP Proof

  • Prior (Beta Distribution)
$$P(\theta) = \frac{\theta^{\beta_H - 1} (1 - \theta)^{\beta_T - 1}}{B(\beta_H, \beta_T)}, \qquad \theta \sim \mathrm{Beta}(\beta_H, \beta_T)$$
  • Posterior
$$\begin{align*}
P(\theta \mid \mathcal{D}) &\propto P(\mathcal{D} \mid \theta)\, P(\theta) \\
&\propto \theta^{\alpha_H}(1-\theta)^{\alpha_T} \cdot \frac{\theta^{\beta_H - 1} (1 - \theta)^{\beta_T - 1}}{B(\beta_H, \beta_T)} \\
&\propto \theta^{\alpha_H + \beta_H - 1} (1 - \theta)^{\alpha_T + \beta_T - 1}
\end{align*}$$

$$\begin{align*}
\hat{\theta}^{MAP} &= \arg\max_{\theta} \frac{\theta^{\alpha_H + \beta_H - 1} (1 - \theta)^{\alpha_T + \beta_T - 1}}{B(\beta_H, \beta_T)} \\
&= \arg\max_{\theta}\; \theta^{\alpha_H + \beta_H - 1} (1 - \theta)^{\alpha_T + \beta_T - 1} \quad (B(\beta_H, \beta_T) \text{ does not depend on } \theta) \\
&= \arg\max_{\theta} \left\{ (\alpha_H + \beta_H - 1)\ln\theta + (\alpha_T + \beta_T - 1)\ln(1 - \theta) \right\} \\
&\Rightarrow 0 = \frac{\partial}{\partial\theta}\left((\alpha_H + \beta_H - 1)\ln\theta + (\alpha_T + \beta_T - 1)\ln(1 - \theta)\right) = \frac{\alpha_H + \beta_H - 1}{\theta} - \frac{\alpha_T + \beta_T - 1}{1 - \theta} \\
&\Rightarrow \hat{\theta}^{MAP} = \frac{\alpha_H + \beta_H - 1}{(\alpha_H + \beta_H - 1) + (\alpha_T + \beta_T - 1)}
\end{align*}$$
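The same kind of grid check works for the MAP estimate; the counts and Beta hyperparameters below are arbitrary illustrative choices:

```python
# Numerical check of the coin-flip MAP estimate under a Beta prior.
import numpy as np

alpha_H, alpha_T = 7, 3  # observed heads / tails (made up)
beta_H, beta_T = 3, 3    # Beta prior hyperparameters (made up)

# Log-posterior up to additive constants (ln B(.,.) and the binomial
# coefficient do not depend on theta, so they drop out of the argmax).
theta = np.linspace(1e-6, 1 - 1e-6, 100_000)
log_post = (alpha_H + beta_H - 1) * np.log(theta) \
         + (alpha_T + beta_T - 1) * np.log(1 - theta)

print(theta[np.argmax(log_post)])  # numerical argmax, ~0.643
print((alpha_H + beta_H - 1) / (alpha_H + beta_H - 1 + alpha_T + beta_T - 1))  # 9/14
```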

Tip

  • $\beta_H - 1$: # of hallucinated heads
  • $\beta_T - 1$: # of hallucinated tails

Note

See the full derivation of the Beta distribution in Beta and Gamma Distribution Proofs.

Beta Distribution - Illustration

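A minimal matplotlib/scipy sketch of what such an illustration typically shows: Beta densities for a few hand-picked $(\beta_H, \beta_T)$ settings (the parameter choices here are assumptions, not the original figures):

```python
# Plot Beta prior densities for several (beta_H, beta_T) choices to show
# how the hyperparameters shape the prior over theta.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

theta = np.linspace(0, 1, 500)
for b_H, b_T in [(1, 1), (2, 2), (5, 2), (2, 5), (10, 10)]:
    plt.plot(theta, beta.pdf(theta, b_H, b_T), label=f"Beta({b_H}, {b_T})")
plt.xlabel(r"$\theta$")
plt.ylabel("density")
plt.legend()
plt.show()
```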

Important

  • When $\alpha_H + \alpha_T$ is small, MAP can work better than MLE if our prior is accurate
  • If the prior is wrong $\rightarrow$ MAP can be very wrong
  • When $\alpha_H + \alpha_T \rightarrow \infty$, $\hat{\theta}^{MAP} \rightarrow \hat{\theta}^{MLE}$ ($\beta_H$ and $\beta_T$ become irrelevant in this limit); see the simulation below
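A small simulation illustrating the last point (the true bias, prior hyperparameters, and sample sizes are all invented): as the number of flips grows, the MAP estimate approaches the MLE and the prior stops mattering.

```python
# MAP -> MLE as the sample size grows; the prior's pseudo-counts get
# swamped by the observed counts.
import numpy as np

rng = np.random.default_rng(0)
true_theta = 0.7          # assumed true bias
beta_H, beta_T = 10, 10   # a prior centered on 0.5 (deliberately off)

for n in (10, 100, 10_000):
    heads = rng.binomial(n, true_theta)
    mle = heads / n
    map_est = (heads + beta_H - 1) / (n + beta_H + beta_T - 2)
    print(f"n={n:>6}  MLE={mle:.4f}  MAP={map_est:.4f}")
```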

MLE and MAP of Rolling an M-sided Die

  • $\mathcal{D}$: Dataset of i.i.d. rolls of an M-sided die.
  • $P(\mathcal{D} \mid \theta)$: likelihood under a multinomial with parameter vector $\theta = (\theta_1, \theta_2, \dots, \theta_M)$:
$$P(\mathcal{D} \mid \theta) \propto \theta_1^{\alpha_1} \theta_2^{\alpha_2} \cdots \theta_M^{\alpha_M}$$
  • $\theta_m$: probability of rolling side $m$; $\alpha_m$: number of observed rolls of side $m$

The Prior

The Dirichlet distribution is used in this case:

$$P(\theta) = \frac{\theta_1^{\beta_1 - 1}\,\theta_2^{\beta_2 - 1} \cdots \theta_M^{\beta_M - 1}}{B(\beta_1, \dots, \beta_M)}, \qquad \theta \sim \mathrm{Dirichlet}(\beta_1, \dots, \beta_M)$$

where $B(\beta_1, \dots, \beta_M)$ is the multivariate Beta function.

The Posterior

$$P(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta)\, P(\theta) \;\Rightarrow\; \theta \mid \mathcal{D} \sim \mathrm{Dirichlet}(\alpha_1 + \beta_1, \dots, \alpha_M + \beta_M)$$
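Conjugacy makes the update a one-liner; here is a minimal sketch (counts and hyperparameters invented for illustration) that forms the posterior parameters and draws from the posterior with NumPy:

```python
# Dirichlet conjugate update: posterior parameters are the prior
# hyperparameters plus the observed counts. All numbers are made up.
import numpy as np

counts = np.array([12, 7, 5, 6])         # alpha_m: observed rolls, 4-sided die
prior  = np.array([2.0, 2.0, 2.0, 2.0])  # beta_m: Dirichlet hyperparameters

posterior = prior + counts               # Dirichlet(alpha_m + beta_m)
samples = np.random.default_rng(0).dirichlet(posterior, size=5)
print(posterior)   # [14.  9.  7.  8.]
print(samples)     # each row is a draw of theta; rows sum to 1
```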

MLE Proof

  • Given that $L(\theta) = P(\mathcal{D} \mid \theta)$ must be maximized subject to the constraint $\sum_{m=1}^{M} \theta_m = 1$, we form the Lagrangian of the log-likelihood (up to an additive constant):
$$\ell(\theta, \lambda) = \sum_{m=1}^{M} \alpha_m \ln \theta_m - \lambda\left(\sum_{m=1}^{M} \theta_m - 1\right)$$
  • Then we take partial derivatives with respect to $\theta_m$ and $\lambda$ and set each to zero:
$$\begin{align*}
\frac{\partial \ell}{\partial \theta_m} &= \frac{\alpha_m}{\theta_m} - \lambda = 0 \;\Rightarrow\; \hat{\theta}_m^{MLE} = \frac{\alpha_m}{\lambda} \\
\frac{\partial \ell}{\partial \lambda} &= -\left(\sum_{m=1}^{M} \theta_m - 1\right) = 0 \;\Rightarrow\; \sum_{m=1}^{M} \hat{\theta}_m^{MLE} = 1 \;\Rightarrow\; \sum_{m=1}^{M} \frac{\alpha_m}{\lambda} = 1 \;\Rightarrow\; \lambda = \sum_{m=1}^{M} \alpha_m
\end{align*}$$
  • Therefore, we can derive that:
$$\hat{\theta}_m^{MLE} = \frac{\alpha_m}{\sum_{v=1}^{M} \alpha_v}$$
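To double-check the closed form numerically, a short sketch (counts are made up) that maximizes the log-likelihood on the probability simplex with scipy.optimize:

```python
# Constrained maximization of the multinomial log-likelihood; the optimum
# should match the closed form theta_m = alpha_m / sum_v alpha_v.
import numpy as np
from scipy.optimize import minimize

counts = np.array([12, 7, 5, 6])  # alpha_m for a 4-sided die (made up)

def neg_log_lik(theta):
    return -np.sum(counts * np.log(theta))

res = minimize(
    neg_log_lik,
    x0=np.full(4, 0.25),                                       # start at uniform
    bounds=[(1e-9, 1)] * 4,                                    # theta_m > 0
    constraints={"type": "eq", "fun": lambda t: t.sum() - 1},  # simplex
)
print(res.x)                  # numerical optimum
print(counts / counts.sum())  # closed form: [0.4, 0.2333, 0.1667, 0.2]
```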

MAP Proof

  • The derivation is similar to the MLE case, but the log-likelihood in the Lagrangian is replaced by the log-posterior (up to an additive constant):
$$\ell(\theta, \lambda) = \sum_{m=1}^{M} (\alpha_m + \beta_m - 1) \ln \theta_m - \lambda\left(\sum_{m=1}^{M} \theta_m - 1\right)$$
  • Take partial derivatives
$$\begin{align*}
\frac{\partial \ell}{\partial \theta_m} &= \frac{\alpha_m + \beta_m - 1}{\theta_m} - \lambda = 0 \;\Rightarrow\; \hat{\theta}_m^{MAP} = \frac{\alpha_m + \beta_m - 1}{\lambda} \\
\frac{\partial \ell}{\partial \lambda} &= 0 \;\Rightarrow\; \sum_{m=1}^{M} \frac{\alpha_m + \beta_m - 1}{\lambda} = 1 \;\Rightarrow\; \lambda = \sum_{m=1}^{M} (\alpha_m + \beta_m - 1)
\end{align*}$$
  • Therefore, we get:
$$\hat{\theta}_m^{MAP} = \frac{\alpha_m + \beta_m - 1}{\sum_{v=1}^{M} (\alpha_v + \beta_v - 1)}$$
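The same constrained-optimization check works here, with the objective swapped for the log-posterior (counts and hyperparameters again invented):

```python
# Constrained maximization of the Dirichlet-multinomial log-posterior;
# the optimum should match (alpha_m + beta_m - 1) / sum_v (alpha_v + beta_v - 1).
import numpy as np
from scipy.optimize import minimize

counts = np.array([12, 7, 5, 6])         # alpha_m (made up)
prior  = np.array([3.0, 3.0, 3.0, 3.0])  # beta_m (made up)

def neg_log_post(theta):
    return -np.sum((counts + prior - 1) * np.log(theta))

res = minimize(
    neg_log_post,
    x0=np.full(4, 0.25),
    bounds=[(1e-9, 1)] * 4,
    constraints={"type": "eq", "fun": lambda t: t.sum() - 1},
)
print(res.x)                                              # numerical optimum
print((counts + prior - 1) / (counts + prior - 1).sum())  # closed form
```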