9.1 Cross Validation and Model Selection

Example: Sine Target

  • Let’s say we have the following two models:
$$\begin{align*} &H_0: h(x) = b \\ &H_1: h(x) = ax + b \end{align*}$$

Important

Which is better?

  • Main question $\rightarrow$ better for what?

Approximation

  • Let’s say we want to approximate the following function
$$f: [-1, 1] \rightarrow \mathbb{R}, \;\; f(x) = \sin(\pi x)$$

clipboard.png

• Approximate with $H_1$

clipboard.png

  • the yellow part tells us how far we are from the sine function
  • $Error = 0.20$

• Approximate with $H_0$

clipboard.png

  • $Error = 0.50$

Tip

From the approximation perspective, the linear model wins.
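
As a quick numerical check of these two error values (a minimal sketch; the grid size and least-squares fitting are my assumptions, not part of the notes):

```python
import numpy as np

# Dense grid over [-1, 1]; averaging over it approximates the expectation for uniform x.
x = np.linspace(-1.0, 1.0, 100_001)
f = np.sin(np.pi * x)

# H0: the best constant b in the squared-error sense is the mean of f (here ~0).
b0 = f.mean()
err_h0 = np.mean((f - b0) ** 2)

# H1: the best line ax + b via least squares.
a1, b1 = np.polyfit(x, f, deg=1)
err_h1 = np.mean((f - (a1 * x + b1)) ** 2)

print(f"H0 error ≈ {err_h0:.2f}")  # about 0.50
print(f"H1 error ≈ {err_h1:.2f}")  # about 0.20
```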

Learning

  • For learning, we give our two models two data points and train them

clipboard.png clipboard.png clipboard.png

  • Which one is better? We can't tell, since it depends on which two data points are given. We therefore need the bias-variance analysis to determine which model produces less error on average.

Bias Variance

clipboard.png

  • From all the possible trained models (one per training set), we calculate the expected model (the mean)
  • From the mean model, we calculate the variance and the bias (a simulation sketch follows below)

clipboard.png
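
A minimal simulation of this decomposition (my own sketch, assuming two-point noiseless training sets drawn uniformly from the target): fit both hypotheses on many datasets, average them to get the expected model, then measure bias and variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_datasets = 10_000
x_grid = np.linspace(-1.0, 1.0, 1001)
f_grid = np.sin(np.pi * x_grid)

preds = {"H0": np.empty((n_datasets, x_grid.size)),
         "H1": np.empty((n_datasets, x_grid.size))}

for d in range(n_datasets):
    # Two noiseless training points drawn uniformly from [-1, 1].
    x = rng.uniform(-1.0, 1.0, size=2)
    y = np.sin(np.pi * x)

    preds["H0"][d] = y.mean()                        # best constant = mean of the two targets
    a = (y[1] - y[0]) / (x[1] - x[0])                # line through the two points
    preds["H1"][d] = a * x_grid + (y[0] - a * x[0])

for name, p in preds.items():
    g_bar = p.mean(axis=0)                           # expected model (mean over datasets)
    bias2 = np.mean((g_bar - f_grid) ** 2)           # how far the mean model is from f
    var = np.mean((p - g_bar) ** 2)                  # spread of the models around the mean
    print(f"{name}: bias^2 ≈ {bias2:.2f}, variance ≈ {var:.2f}")
```

In this two-point setup the simple model tends to come out ahead overall: its bias is larger, but its variance is far smaller than that of the line fitted exactly through the two points.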


Learning in Practice

  • In practice we have more training examples, and they contain noise

clipboard.png clipboard.png clipboard.png

  • Models to the left of the best model may underfit
  • Models to the right of the best model may overfit (see the sketch below)

clipboard.png
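
One way to reproduce this picture (a sketch under my own assumptions: 20 noisy samples of the sine target, polynomial models of increasing degree, and a separate validation sample): the training error keeps falling as the degree grows, while the validation error typically bottoms out at a moderate degree.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n, noise=0.2):
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, np.sin(np.pi * x) + noise * rng.standard_normal(n)

x_tr, y_tr = sample(20)     # small, noisy training set
x_va, y_va = sample(200)    # held-out points to estimate the test error

for degree in range(10):
    coeffs = np.polyfit(x_tr, y_tr, deg=degree)      # model complexity = polynomial degree
    tr_err = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    va_err = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)
    print(f"degree {degree}: train {tr_err:.3f}, validation {va_err:.3f}")
```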


REM, Overfitting, and Underfitting

$$\text{REM} = \min_w \frac{1}{N} \sum^N_{i=1} l(h_w(x^{(i)}), y^{(i)}) + \lambda \gamma(w)$$

clipboard.png
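
As one concrete instance of this objective (my choice of loss and regularizer, not necessarily the lecture's): squared loss with an L2 penalty, i.e. ridge regression, for which the minimizer even has a closed form.

```python
import numpy as np

def regularized_risk(w, X, y, lam):
    """(1/N) * sum of squared losses + lambda * ||w||^2."""
    residuals = X @ w - y
    return np.mean(residuals ** 2) + lam * np.dot(w, w)

def ridge_fit(X, y, lam):
    """Minimizer of the objective above: solves (X^T X / N + lambda*I) w = X^T y / N."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X / len(y) + lam * np.eye(d), X.T @ y / len(y))
```

Larger $\lambda$ pulls $w$ toward zero (more regularization, risk of underfitting); smaller $\lambda$ leaves the fit closer to plain least squares (risk of overfitting).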

Important

How to identify the sweet spot?


Hold-out Method

How

  • We can estimate the test error by using an independent sample of data
  • Split the data into a training set and a validation set
  • Use the two sets for training and validation, respectively

Important

Telescopic Search

Find the best order of magnitude for $\lambda$ first, then search more finely within that range (see the sketch below).
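
A hold-out version of that search might look like this (a sketch; the synthetic data, the candidate grids, and the ridge model are all my assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in data, split into training and validation sets.
X = rng.standard_normal((200, 10))
y = X @ rng.standard_normal(10) + 0.5 * rng.standard_normal(200)
X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X / len(y) + lam * np.eye(d), X.T @ y / len(y))

def val_error(w):
    return np.mean((X_va @ w - y_va) ** 2)

# Pass 1: coarse search over orders of magnitude.
coarse = [10.0 ** k for k in range(-6, 3)]
best_lam = min(coarse, key=lambda lam: val_error(ridge_fit(X_tr, y_tr, lam)))

# Pass 2: finer grid around the best order of magnitude.
fine = np.linspace(best_lam / 10, best_lam * 10, 25)
best_lam = min(fine, key=lambda lam: val_error(ridge_fit(X_tr, y_tr, lam)))
print(f"selected lambda ≈ {best_lam:.4g}")
```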

Drawback

  • We may not have enough data to afford setting a subset aside just for estimating generalization
  • The validation error may be misleading (a bad estimate of the test error) if we get an unfortunate split

K-Fold Cross-Validation

  • Create K-fold partition of the dataset

clipboard.png

  • Train using $K - 1$ partitions and calculate the validation error using the remaining partition
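
A minimal K-fold loop might look like this (my sketch; the `fit`/`error` callables stand in for whatever model and loss are being validated):

```python
import numpy as np

def k_fold_cv(X, y, fit, error, k=10, seed=0):
    """Average validation error over k folds.
    fit(X, y) returns a model; error(model, X, y) returns its error on that data."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        val = folds[i]                                               # held-out partition
        tr = np.concatenate([folds[j] for j in range(k) if j != i])  # the other k-1 partitions
        model = fit(X[tr], y[tr])
        errs.append(error(model, X[val], y[val]))
    return float(np.mean(errs))

# Example use with a trivial "predict the mean" model, just to show the interface:
X = np.linspace(-1, 1, 100).reshape(-1, 1)
y = np.sin(np.pi * X).ravel()
cv_err = k_fold_cv(X, y,
                   fit=lambda X, y: y.mean(),
                   error=lambda m, X, y: np.mean((y - m) ** 2))
```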

Large K

  • Validation error can approximate test error well
  • Observed validation error will be unstable (few validation points)
  • The computational time will be very large as well

Small K

  • The # of runs and computational time are reduced
  • Observed validation error will be stable
  • Validation error cannot approximate test error well

Tip

$K = 10$ is a common choice.

Leave-One-Out (LOO) Cross-Validation

  • Special case of K-fold validation with $K = N$ partitions
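
With a helper like the `k_fold_cv` sketch above (my own, not a library routine), leave-one-out is simply the $K = N$ case:

```python
# Leave-one-out: every example is its own validation fold.
loo_err = k_fold_cv(X, y,
                    fit=lambda X, y: y.mean(),
                    error=lambda m, X, y: np.mean((y - m) ** 2),
                    k=len(y))
```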

clipboard.png

Early Stopping

  • Stop your optimization after $M \ge 0$ gradient steps, even if the optimization has not converged yet.
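
A minimal sketch of the idea (plain gradient descent on squared loss; the step size, the step budget, and the patience rule for watching the validation error are my assumptions):

```python
import numpy as np

def early_stopped_gd(X_tr, y_tr, X_va, y_va, lr=0.05, max_steps=1000, patience=20):
    """Gradient descent on the training squared loss, stopped early when the
    validation error has not improved for `patience` steps (or after max_steps)."""
    w = np.zeros(X_tr.shape[1])
    best_w, best_err, since_best = w.copy(), np.inf, 0
    for step in range(max_steps):
        grad = 2.0 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
        w -= lr * grad
        val_err = np.mean((X_va @ w - y_va) ** 2)
        if val_err < best_err:
            best_w, best_err, since_best = w.copy(), val_err, 0
        else:
            since_best += 1
            if since_best >= patience:        # stop before the training loss converges
                break
    return best_w, best_err
```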

What’s the connection between early stopping and regularization?

Early Stopping

clipboard.png

Regularization

clipboard.png

  • Regularization restricts the predictions from going outside of the green area.

The plot of early stopping

clipboard.png

Think about the variance

clipboard.png

  • If we stop early, we can limit the variance caused by training on different datasets.