K-Nearest Neighbors (KNN) Algorithm: Simple Math and Example
Let's break down the mathematical concept behind the KNN algorithm and walk through two simple examples: one for classification and one for regression.
1. Mathematical Concept
KNN classifies or predicts a new data point by finding the K closest points to it in the feature space, measured with a distance metric such as Euclidean distance. The basic procedure (illustrated in code right after the list below) is:
- Step 1: Choose a number K (the number of nearest neighbors).
- Step 2: Calculate the distance between the new data point and every point in the training dataset.
- Step 3: Sort the distances and pick the K smallest distances.
- Step 4: For classification, assign the class label based on the majority of the K neighbors.
- Step 5: For regression, calculate the average value of the K neighbors and assign it as the prediction.
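In practice these steps are rarely coded by hand; a library such as scikit-learn wraps them behind a fit/predict interface. The sketch below is only an illustration of that workflow and assumes scikit-learn is installed; the small arrays are made up for demonstration.

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Classification: Steps 1-4 (choose K, measure distances, pick K nearest, majority vote)
X_train = [[1, 2], [2, 3], [6, 6], [7, 8]]   # made-up 2D feature vectors
y_labels = ["Red", "Red", "Blue", "Blue"]
clf = KNeighborsClassifier(n_neighbors=3)    # K = 3; Euclidean distance by default
clf.fit(X_train, y_labels)
print(clf.predict([[5, 5]]))                 # label decided by majority vote

# Regression: Step 5 (average the K nearest neighbors' target values)
X_vals = [[1], [2], [3], [5], [9]]           # made-up 1D inputs
y_vals = [2, 3, 4, 7, 10]
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_vals, y_vals)
print(reg.predict([[4]]))                    # mean of the 3 nearest Y values
```

The from-scratch snippets later in this section show the same computation carried out by hand.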
2. Distance Metric: Euclidean Distance
For classification and regression, KNN typically uses the Euclidean distance to measure how "far" a point is from others.
The Euclidean distance between two points (x₁, y₁) and (x₂, y₂) in 2D space is given by:
d = √((x₁ - x₂)² + (y₁ - y₂)²)
For higher-dimensional spaces (e.g., 3D, 4D), it’s generalized as:
d = √((x₁ - x₂)² + (y₁ - y₂)² + (z₁ - z₂)² + ...)
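As a minimal sketch in plain Python (the helper name `euclidean_distance` is just an illustrative choice):

```python
import math

def euclidean_distance(p, q):
    # Sum the squared differences over every coordinate, then take the square root;
    # this works in 2D, 3D, or any number of dimensions.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance((1, 2), (4, 6)))         # sqrt(3² + 4²) = 5.0
print(euclidean_distance((1, 2, 0), (4, 6, 12)))  # sqrt(9 + 16 + 144) = 13.0
```

From Python 3.8 onward, the standard library's `math.dist` computes the same quantity.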
3. Example of KNN for Classification (2D Space)
Imagine you have a simple 2D dataset with two classes: Red and Blue. Your goal is to classify a new point based on its K nearest neighbors.
Step-by-Step Example:
Dataset:
We have a set of 6 points in 2D space with labels:
| X₁ | X₂ | Label |
|---|---|---|
| 1 | 2 | Red |
| 2 | 3 | Red |
| 3 | 3 | Blue |
| 6 | 6 | Blue |
| 7 | 8 | Blue |
| 8 | 8 | Red |
New Point:
Let's classify the new point P = (5, 5).
Distance Calculation (using Euclidean distance):
d(P, (1, 2)) = √((5 - 1)² + (5 - 2)²) = √(16 + 9) = √25 = 5
d(P, (2, 3)) = √((5 - 2)² + (5 - 3)²) = √(9 + 4) = √13 ≈ 3.61
d(P, (3, 3)) = √((5 - 3)² + (5 - 3)²) = √(4 + 4) = √8 ≈ 2.83
d(P, (6, 6)) = √((5 - 6)² + (5 - 6)²) = √(1 + 1) = √2 ≈ 1.41
d(P, (7, 8)) = √((5 - 7)² + (5 - 8)²) = √(4 + 9) = √13 ≈ 3.61
d(P, (8, 8)) = √((5 - 8)² + (5 - 8)²) = √(9 + 9) = √18 ≈ 4.24
Find K Nearest Neighbors:
Let’s choose K = 3. The 3 closest points to P = (5, 5) are:
- (6, 6) with distance 1.41 (Label: Blue)
- (3, 3) with distance 2.83 (Label: Blue)
- (2, 3) with distance 3.61 (Label: Red)
Note that (7, 8) is also at distance 3.61 (Label: Blue); the two are tied for third place. Whichever one we keep, the majority vote below still comes out Blue, so the tie does not change the result.
Majority Voting:
Among the 3 nearest neighbors, 2 are Blue and 1 is Red. According to majority voting, we classify the point P = (5, 5) as Blue.
Final Classification:
The new point P = (5, 5) is classified as Blue based on the majority label of its K nearest neighbors.
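For readers who want to verify the numbers, here is a small self-contained Python sketch of the same calculation (variable names are illustrative; `math.dist` from the standard library, Python 3.8+, computes the Euclidean distance):

```python
import math
from collections import Counter

points = [((1, 2), "Red"), ((2, 3), "Red"), ((3, 3), "Blue"),
          ((6, 6), "Blue"), ((7, 8), "Blue"), ((8, 8), "Red")]
P = (5, 5)
K = 3

# Step 2: Euclidean distance from P to every training point
distances = [(math.dist(P, xy), label) for xy, label in points]

# Step 3: sort by distance and keep the K smallest.
# Python's sort is stable, so the tie between (2, 3) and (7, 8) at ≈3.61
# is broken by dataset order, matching the choice made above.
nearest = sorted(distances, key=lambda d: d[0])[:K]
print(nearest)  # three nearest: ≈1.41 Blue, ≈2.83 Blue, ≈3.61 Red

# Step 4: majority vote among the K labels
votes = Counter(label for _, label in nearest)
print(votes.most_common(1)[0][0])  # Blue
```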
4. Example of KNN for Regression (1D Space)
Let’s now use KNN for regression to predict a value instead of a class label.
Step-by-Step Example:
Dataset:
We have a dataset of 5 data points:
| X | Y |
|---|---|
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 5 | 7 |
| 6 | 8 |
New Point:
The goal is to predict the value of Y for a new point X = 4.
Distance Calculation:
d(4, 1) = |4 - 1| = 3
d(4, 2) = |4 - 2| = 2
d(4, 3) = |4 - 3| = 1
d(4, 5) = |4 - 5| = 1
d(4, 6) = |4 - 6| = 2
Find K Nearest Neighbors:
Let’s choose K = 3. The 3 closest points to X = 4 are:
- (3, 4) with distance 1
- (5, 7) with distance 1
- (2, 3) with distance 2
Note that (6, 8) is also at distance 2, tied for third place. Here we break the tie in favor of (2, 3), which appears earlier in the dataset; including (6, 8) instead would give a slightly different prediction ((4 + 7 + 8) / 3 ≈ 6.33).
Prediction (Regression):
The predicted value of Y is the average of the 3 nearest neighbors' Y-values:
Y_pred = (4 + 7 + 3) / 3 = 14 / 3 ≈ 4.67
Final Prediction:
The predicted value for X = 4 is approximately 4.67.
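A matching Python sketch verifies the regression prediction (again, the names are illustrative):

```python
data = [(1, 2), (2, 3), (3, 4), (5, 7), (6, 8)]
x_new = 4
K = 3

# Step 2: absolute (1D Euclidean) distance from x_new to every X in the dataset
distances = [(abs(x - x_new), y) for x, y in data]

# Step 3: sort by distance and keep the K smallest.
# Python's stable sort breaks the tie at distance 2 in favor of (2, 3),
# which appears earlier in the dataset, matching the choice made above.
nearest = sorted(distances, key=lambda d: d[0])[:K]

# Step 5: average of the K neighbors' Y values
y_pred = sum(y for _, y in nearest) / K
print(round(y_pred, 2))  # (4 + 7 + 3) / 3 ≈ 4.67
```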