1. Introduction and KNN
What is ML?
Definition
- A computer program is said to learn from experience E
- with respect to some class of tasks T
- and performance measure P
- IF its performance at tasks in T, as measured by P, improves with E.
Note
Design algorithms that:
- improve their performance
- on some task
- with experience (training data)
Categories
Supervised Learning
- Given labeled data, find a function that maps from the data to its label.
- Given labeled examples $(x_1, y_1), \ldots, (x_n, y_n)$, construct a prediction rule that predicts $y$ from $x$
Discrete Labels
Binary classification: 2 categories
Multi-class classification: more than 2 categories
Continuous Labels
Regression: Analyze the relationship between dependent variables and independent variables
Unsupervised Learning
- Given unlabeled data $x_1, \ldots, x_n$, learn structure in the data
- learning without a teacher
- e.g. clustering and PCA
Good ML Algorithm
- SHOULD: Generalize well on test data
- SHOULD NOT: Overfit the training data

KNN
- similar points are likely to have the same labels
- Dataset: $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$
- New datapoint: $x$
- Prediction: $\hat{y}$, the label of $x$ inferred from its nearest neighbors in $D$

Algorithm
- Find the top $k$ nearest neighbors of $x$ under a distance metric $d$
- Return the most common label among these neighbors
- For regression, return the average value of the neighbors (see the sketch below)
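A minimal sketch of this procedure in Python/NumPy, using Euclidean distance; the helper names `knn_predict` and `knn_regress` and the toy data are mine, not from the notes:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by a majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training point: O(Nd) per query
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

def knn_regress(X_train, y_train, x_new, k=3):
    """Regression variant: return the average value of the neighbors."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

# Tiny usage example
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0]])
y = np.array([0, 0, 0, 1])
print(knn_predict(X, y, np.array([0.2, 0.3]), k=3))   # -> 0
```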
How to measure closeness?
- Rely on a distance metric.
Minkowski Distance
- the most commonly used distance metric
- For each dimension $r$, calculate the gap between $x$ and $z$ and combine: $d(x, z) = \left( \sum_{r=1}^{d} |x_r - z_r|^p \right)^{1/p}$
- Special cases: $p = 1$ (Manhattan), $p = 2$ (Euclidean), $p \to \infty$ (max over dimensions); see the sketch below
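A small sketch of the Minkowski distance and its usual special cases (the helper name `minkowski` and the example vectors are mine):

```python
import numpy as np

def minkowski(x, z, p=2):
    """d(x, z) = (sum_r |x_r - z_r|^p)^(1/p)."""
    if np.isinf(p):
        return np.max(np.abs(x - z))      # p -> infinity: largest coordinate gap
    return np.sum(np.abs(x - z) ** p) ** (1.0 / p)

x, z = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(minkowski(x, z, p=1))       # Manhattan: 3 + 2 + 0 = 5.0
print(minkowski(x, z, p=2))       # Euclidean: sqrt(13) ≈ 3.61
print(minkowski(x, z, p=np.inf))  # max over dimensions: 3.0
```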

The choice of K
- Small K -> predictions are sensitive to label noise
- Large K -> The boundary becomes smoother
Caution
Very large K may cause the algorithm to include examples that are really far off.
What’s the best K
- K is a hyperparameter; pick it using a validation set (see Hyperparameters below)
Issues
- Memory issue: the entire training set must be stored
- sensitive to outliers and easily fooled by irrelevant attributes
- 0 training time, but prediction is computationally expensive: O(Nd) per query
- If d is large -> curse of dimensionality
Hyperparameters
- We DO NOT CHOOSE hyperparameters to minimize the training error (that overfits) or the test error (that contaminates the test set)
Solution
- Randomly take out 10~50% of the training data and use it, instead of the test set, to estimate the test error (sketched below)
- Validation set: the held-out portion used to estimate test error and select hyperparameters
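A sketch of selecting K on such a held-out validation set; the helper names, the 20% split, and the candidate K values are illustrative choices, not from the notes:

```python
import numpy as np
from collections import Counter

def _knn_vote(X_tr, y_tr, x, k):
    # Euclidean KNN majority vote (same idea as the earlier sketch)
    d = np.linalg.norm(X_tr - x, axis=1)
    return Counter(y_tr[np.argsort(d)[:k]]).most_common(1)[0][0]

def choose_k(X, y, candidate_ks=(1, 3, 5, 7, 9, 15), val_frac=0.2, seed=0):
    """Pick K on a held-out validation set, never on the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(val_frac * len(X))          # e.g. hold out 20% for validation
    val_idx, tr_idx = idx[:n_val], idx[n_val:]
    X_tr, y_tr, X_val, y_val = X[tr_idx], y[tr_idx], X[val_idx], y[val_idx]

    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        preds = np.array([_knn_vote(X_tr, y_tr, x, k) for x in X_val])
        acc = np.mean(preds == y_val)       # validation accuracy for this K
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc
```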
Curse of Dimensionality

- Assume data lives in the unit cube $[0,1]^d$ and all training data is sampled uniformly, and we want the nearest neighbors to fall inside a small cube of side length $\ell$ around the query point
- The probability of sampling a point inside the small cube is roughly $\ell^d$
- $n$: the total number of data points that we sample
- $k$: the number of nearest neighbors that should fall inside the small cube
- So we need $n \ell^d \approx k$, i.e. $\ell \approx (k/n)^{1/d}$
- If $k = 10$ and $n = 1000$, how big is $\ell$?
- $d = 2$: $\ell = 0.1$
- $d = 10$: $\ell \approx 0.63$
- $d = 100$: $\ell \approx 0.955$
- $d = 1000$: $\ell \approx 0.9954$
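A quick numerical check of the values above, assuming $k = 10$ and $n = 1000$ (any $k/n = 0.01$ gives the same table):

```python
# Side length l of the cube expected to contain k of n uniform points in [0,1]^d:
# l^d ≈ k / n  =>  l ≈ (k / n) ** (1 / d)
k, n = 10, 1000
for d in (2, 10, 100, 1000):
    l = (k / n) ** (1.0 / d)
    print(d, round(l, 4))   # -> 0.1, 0.631, 0.955, 0.9954
```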
Caution
When d is large, the nearest neighbors will be almost all over the place

- In high dimensional space, you don’t have neighbors anymore

Data may have low dimensional structure
- High dimensional space may contain low dimensional subspaces
- Your data may lie in a low-dimensional subspace or on a low-dimensional manifold
KNN vs. Linear Classifier

KNN Summary
- KNN is simple and effective if the distance metric reflects dissimilarity
- Works when data is low-dimensional
- DOES NOT work for high-dimensional data due to sparsity.