9. Bias-Variance

Training Data

  • $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^N$
  • drawn i.i.d. from some joint distribution $P(X, Y)$
  • Assume a regression setting, $Y \in \mathbb{R}$
  • For a given $x$, there may be many possible $y$
  • The same features may appear with different labels: two examples with $x^{(i)} = x^{(j)}$ may still have $y^{(i)} \neq y^{(j)}$ (see the sketch below)
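To make the setup concrete, here is a minimal sketch of such a dataset in Python. The distribution ($y = \sin(2\pi x) + \varepsilon$ with Gaussian noise) and the noise level $0.3$ are illustrative assumptions, not from the lecture; the same names are reused in the sketches below.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dataset(N):
    """Draw N i.i.d. pairs (x, y) from a toy P(X, Y).

    y = sin(2*pi*x) + Gaussian noise, so the same x can yield different y.
    """
    x = rng.uniform(0.0, 1.0, size=N)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=N)
    return x, y

x_train, y_train = sample_dataset(100)
```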

Tip

Given a certain $x$, which $y$ should you predict?


Expected Label (given $x \in \mathbb{R}^d$)

  • For a given $x \in \mathbb{R}^d$, the expected label is:
$$\bar{y}(x) = \mathbb{E}_{y \mid x}[Y] = \int_y y \, P(y \mid x) \, dy$$
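For the toy distribution above, $\bar{y}(x) = \sin(2\pi x)$ in closed form; a quick Monte Carlo estimate of this integral (assuming the sketch above) looks like this:

```python
def y_bar_mc(x0, n=100_000):
    """Monte Carlo estimate of E[Y | x = x0] under the toy P(y | x)."""
    ys = np.sin(2 * np.pi * x0) + rng.normal(0.0, 0.3, size=n)
    return ys.mean()

print(y_bar_mc(0.25))  # approx sin(pi/2) = 1
```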

Expected Test Error (given $h_\mathcal{D}$)

  • Now we have some ML algorithm $\mathcal{A}$, which takes the dataset $\mathcal{D}$ as input and generates the predictor $h_\mathcal{D}$:
$$h_\mathcal{D} = \mathcal{A}(\mathcal{D})$$
  • With the predictor $h_\mathcal{D}$, we can write its expected test error as follows:
$$\mathbb{E}_{(x, y) \sim P}\left[(h_\mathcal{D}(x) - y)^2\right] = \int_x \int_y (h_\mathcal{D}(x) - y)^2 \, P(x, y) \, dy \, dx$$
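Continuing the toy example, here is a sketch with one (assumed, for illustration) algorithm choice, $\mathcal{A}$ = degree-1 least squares, plus a Monte Carlo estimate of the expected test error of a single fitted $h_\mathcal{D}$:

```python
def fit(x, y):
    """A(D): fit h_D by degree-1 least squares (linear regression)."""
    return np.polyfit(x, y, deg=1)   # returns [slope, intercept]

def test_error(coefs, n=100_000):
    """Monte Carlo estimate of E_{(x,y)~P}[(h_D(x) - y)^2] for a fixed h_D."""
    xs, ys = sample_dataset(n)
    return np.mean((np.polyval(coefs, xs) - ys) ** 2)

h_D = fit(*sample_dataset(100))      # one predictor from one dataset draw
print(test_error(h_D))
```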

Caution

The expected test error above still depends on the true data distribution $P$, which we do not know.


Expected Predictor (given $\mathcal{A}$)

  • Assume that:
  • $\mathcal{A}$: linear regression
  • $\mathcal{D}$: sales data from the previous year
  • $h_\mathcal{D}$: a random variable, since it depends on the random draw of $\mathcal{D}$
  • Expected predictor given $\mathcal{A}$:
$$\bar{h} = \mathbb{E}_{\mathcal{D} \sim P^N}[h_\mathcal{D}] = \int_\mathcal{D} h_\mathcal{D} \, P(\mathcal{D}) \, d\mathcal{D}$$
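A sketch of $\bar{h}$ for the toy setup: since a degree-1 fit's predictions are linear in its coefficients, averaging the fitted coefficients over many fresh datasets gives a Monte Carlo estimate of $\mathbb{E}_\mathcal{D}[h_\mathcal{D}]$. The dataset count 2000 is an arbitrary illustrative choice.

```python
# E_D[h_D(x)] = polyval(E_D[coefs], x), because predictions are linear
# in the coefficients for this model class.
coef_bar = np.mean([fit(*sample_dataset(100)) for _ in range(2_000)], axis=0)

def h_bar(xs):
    """Monte Carlo estimate of h-bar(x) = E_{D ~ P^N}[h_D(x)]."""
    return np.polyval(coef_bar, xs)
```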

Expected Test Error (given $\mathcal{A}$)

$$\begin{align*} &\mathbb{E}_{(x, y) \sim P,\, \mathcal{D} \sim P^N}\left[(h_\mathcal{D}(x) - y)^2\right] \\ &= \int_x \int_y \int_\mathcal{D} (h_\mathcal{D}(x) - y)^2 \, P(x, y) \, P(\mathcal{D}) \, d\mathcal{D} \, dy \, dx \end{align*}$$
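The same quantity for the toy setup, now also averaging over the draw of $\mathcal{D}$ (the Monte Carlo sizes are arbitrary):

```python
# E_{(x,y)~P, D~P^N}[(h_D(x) - y)^2]: retrain on fresh datasets and average.
errs = [test_error(fit(*sample_dataset(100)), n=10_000) for _ in range(200)]
print(np.mean(errs))
```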

Full Derivation

  1. Rewrite the formula by plugging $\bar{h}(x)$ in:
$$\begin{align*} &\mathbb{E}_{x, y, \mathcal{D}}\left[(h_\mathcal{D}(x) - y)^2\right] \\ &= \mathbb{E}_{x, y, \mathcal{D}}\left[\left[(h_\mathcal{D}(x) - \bar{h}(x)) + (\bar{h}(x) - y)\right]^2\right] \\ &= \mathbb{E}_{x, \mathcal{D}}\left[(h_\mathcal{D}(x) - \bar{h}(x))^2\right] + 2\,\mathbb{E}_{x, y, \mathcal{D}}\left[(h_\mathcal{D}(x) - \bar{h}(x))(\bar{h}(x) - y)\right] + \mathbb{E}_{x, y}\left[(\bar{h}(x) - y)^2\right] \end{align*}$$
  2. Show that the cross term $\mathbb{E}_{x, y, \mathcal{D}}\left[(h_\mathcal{D}(x) - \bar{h}(x))(\bar{h}(x) - y)\right]$ vanishes (the expectation over $\mathcal{D}$ can be taken first because $\bar{h}(x) - y$ does not depend on $\mathcal{D}$):
$$\begin{align*} &\mathbb{E}_{x, y, \mathcal{D}}\left[(h_\mathcal{D}(x) - \bar{h}(x))(\bar{h}(x) - y)\right] \\ &= \mathbb{E}_{x, y}\left[\mathbb{E}_{\mathcal{D}}\left[h_\mathcal{D}(x) - \bar{h}(x)\right](\bar{h}(x) - y)\right] \\ &= \mathbb{E}_{x, y}\left[\left(\mathbb{E}_{\mathcal{D}}\left[h_\mathcal{D}(x)\right] - \bar{h}(x)\right)(\bar{h}(x) - y)\right] \\ &= \mathbb{E}_{x, y}\left[(\bar{h}(x) - \bar{h}(x))(\bar{h}(x) - y)\right] \\ &= \mathbb{E}_{x, y}[0] = 0 \end{align*}$$
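A numerical sanity check of this step on the toy setup (using `h_bar` from the sketch above): the cross term averaged over fresh draws of $(x, y)$ and $\mathcal{D}$ should be near zero.

```python
vals = []
for _ in range(500):
    xs, ys = sample_dataset(1_000)                    # (x, y) ~ P
    hD = np.polyval(fit(*sample_dataset(100)), xs)    # h_D for a fresh D ~ P^N
    vals.append(np.mean((hD - h_bar(xs)) * (h_bar(xs) - ys)))
print(np.mean(vals))  # close to 0, up to Monte Carlo error
```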
  3. Expand $\mathbb{E}_{x, y}\left[(\bar{h}(x) - y)^2\right]$ by plugging $\bar{y}(x)$ in:
$$\begin{align*} &\mathbb{E}_{x, y}\left[(\bar{h}(x) - y)^2\right] \\ &= \mathbb{E}_{x, y}\left[\left[(\bar{h}(x) - \bar{y}(x)) + (\bar{y}(x) - y)\right]^2\right] \\ &= \mathbb{E}_{x}\left[(\bar{h}(x) - \bar{y}(x))^2\right] + 2\,\mathbb{E}_{x, y}\left[(\bar{h}(x) - \bar{y}(x))(\bar{y}(x) - y)\right] + \mathbb{E}_{x, y}\left[(\bar{y}(x) - y)^2\right] \end{align*}$$
  4. Show that the cross term $\mathbb{E}_{x, y}\left[(\bar{h}(x) - \bar{y}(x))(\bar{y}(x) - y)\right]$ vanishes (conditioned on $x$, the factor $\bar{h}(x) - \bar{y}(x)$ is constant, so the expectation over $y \mid x$ applies only to $\bar{y}(x) - y$):
$$\begin{align*} &\mathbb{E}_{x, y}\left[(\bar{h}(x) - \bar{y}(x))(\bar{y}(x) - y)\right] \\ &= \mathbb{E}_{x}\left[(\bar{h}(x) - \bar{y}(x))\,\mathbb{E}_{y \mid x}\left[\bar{y}(x) - y\right]\right] \\ &= \mathbb{E}_{x}\left[(\bar{h}(x) - \bar{y}(x))\left(\bar{y}(x) - \mathbb{E}_{y \mid x}[y]\right)\right] \\ &= \mathbb{E}_{x}\left[(\bar{h}(x) - \bar{y}(x))(\bar{y}(x) - \bar{y}(x))\right] \\ &= \mathbb{E}_{x}[0] = 0 \end{align*}$$
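The analogous check for this cross term; here $\bar{y}(x) = \sin(2\pi x)$ is known in closed form for the toy distribution:

```python
xs, ys = sample_dataset(200_000)
ybar = np.sin(2 * np.pi * xs)                    # y-bar(x) in closed form
print(np.mean((h_bar(xs) - ybar) * (ybar - ys)))  # close to 0
```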
  5. Plug all the solved terms back in:
$$\begin{align*} &\mathbb{E}_{x, y, \mathcal{D}}\left[(h_\mathcal{D}(x) - y)^2\right] \\ &= \mathbb{E}_{x, \mathcal{D}}\left[(h_\mathcal{D}(x) - \bar{h}(x))^2\right] + \mathbb{E}_{x}\left[(\bar{h}(x) - \bar{y}(x))^2\right] + \mathbb{E}_{x, y}\left[(\bar{y}(x) - y)^2\right] \end{align*}$$

Breaking Down the Expected Test Error

$$\begin{align*} &\mathbb{E}_{x, y, \mathcal{D}}\left[(h_\mathcal{D}(x) - y)^2\right] \\ &= \mathbb{E}_{x, \mathcal{D}}\left[(h_\mathcal{D}(x) - \bar{h}(x))^2\right] + \mathbb{E}_{x}\left[(\bar{h}(x) - \bar{y}(x))^2\right] + \mathbb{E}_{x, y}\left[(\bar{y}(x) - y)^2\right] \end{align*}$$

The expected test error decomposes into:

  • Variance (due to randomness in the training data)
  • Bias² (due to model misspecification)
  • Irreducible noise (due to inherent randomness in the data)
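A minimal Monte Carlo check of the full decomposition on the toy setup (all names come from the earlier sketches; the sizes are arbitrary): the three terms should sum to approximately the total expected test error.

```python
xs = rng.uniform(0.0, 1.0, size=20_000)                  # x ~ P(x)
ybar = np.sin(2 * np.pi * xs)                            # y-bar(x), closed form here

# h_D(x) for many independently drawn training sets D
preds = np.stack([np.polyval(fit(*sample_dataset(100)), xs) for _ in range(500)])
hbar = preds.mean(axis=0)                                # h-bar(x)

variance = np.mean((preds - hbar) ** 2)                  # E_{x,D}[(h_D(x) - hbar(x))^2]
bias_sq  = np.mean((hbar - ybar) ** 2)                   # E_x[(hbar(x) - ybar(x))^2]
noise    = 0.3 ** 2                                      # E_{x,y}[(ybar(x) - y)^2] = Var(eps)

ys = ybar + rng.normal(0.0, 0.3, size=preds.shape)       # fresh y | x per model
total = np.mean((preds - ys) ** 2)                       # E_{x,y,D}[(h_D(x) - y)^2]
print(total, variance + bias_sq + noise)                 # the two should roughly agree
```

For this toy setup a linear fit cannot capture $\sin(2\pi x)$, so the bias² term dominates, which previews the per-term discussion below.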

1. Expected Test Error

$\mathbb{E}_{x, y, \mathcal{D}}\left[(h_\mathcal{D}(x) - y)^2\right]$
  • This is the overall expected test error: how far the model's predictions $h_\mathcal{D}(x)$ are from the true values $y$, averaged over:
  • all possible datasets $\mathcal{D}$ you could have trained on,
  • all possible inputs $x$, and
  • all possible outputs $y$ drawn from the true data distribution.

2. First term:

$\mathbb{E}_{x, \mathcal{D}}\left[(h_\mathcal{D}(x) - \bar{h}(x))^2\right]$
  • This measures how much the predictions from different datasets fluctuate around their average prediction $\bar{h}(x)$.
  • In words:
  • variance
  • how sensitive the model is to the particular training data it saw.
  • If this term is large, your model changes a lot depending on the training data (i.e., it’s unstable or overfits).
  • If it’s small, your model is consistent across different datasets.

3. Second term:

$\mathbb{E}_{x}\left[(\bar{h}(x) - \bar{y}(x))^2\right]$
  • unlike variance, this term is not reduced by adding more training data
  • reducing it requires a more expressive model class or better features
  • This measures how far the model's average prediction $\bar{h}(x)$ is from the expected label $\bar{y}(x) = \mathbb{E}_{y \mid x}[Y]$.
  • In words:
  • bias
  • the systematic error of your model.
  • If your model’s structure can’t capture the true relationship, this term is large (underfitting).
  • If it’s small, your model’s mean prediction is close to the truth.

4. Third term:

$\mathbb{E}_{x, y}\left[(\bar{y}(x) - y)^2\right]$
  • This measures how much the true data itself varies around its expected value.
  • In words:
  • irreducible noise
  • randomness or natural variability in the data that no model can ever predict perfectly.
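For the toy distribution this term is exactly $\mathrm{Var}(\varepsilon) = 0.3^2 = 0.09$, which a direct Monte Carlo estimate confirms:

```python
xs, ys = sample_dataset(500_000)
print(np.mean((np.sin(2 * np.pi * xs) - ys) ** 2))   # approx 0.09
```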

Warning

Any of the error terms can dominate the entire test error


Illustration of Bias and Variance

![Illustration of bias and variance](clipboard.png)