A way to classify a data point based on its “similarity distance” to the nearest neighbors.
Topics
Steps to Classify (based on this animation)
- Initialize
- Place the data point in a feature space
- Calculate Distance
- Find the Euclidean distance between the data point and every other labeled data point
- May have to normalize feature values to z-scores along each coordinate so that no single feature dominates the distance
- Sort Distance
- Sort the distances between the data point and the other labeled data points in increasing order
- Predict
- For classification, the most common class among the $k$-nearest neighbors determines the data point's class
- For regression, use the average of the $k$-nearest neighbors' labels
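The steps above can be sketched in plain Python. This is a minimal illustration, not the course's reference implementation; `zscore` and `knn_classify` are hypothetical helper names:

```python
import math
from collections import Counter

def zscore(points):
    """Normalize each feature (coordinate) to mean 0 and std 1."""
    d = len(points[0])
    means = [sum(p[i] for p in points) / len(points) for i in range(d)]
    stds = [math.sqrt(sum((p[i] - means[i]) ** 2 for p in points) / len(points)) or 1.0
            for i in range(d)]
    return [[(p[i] - means[i]) / stds[i] for i in range(d)] for p in points]

def knn_classify(query, data, labels, k=3):
    """Compute distances, sort them, then majority-vote over the k nearest labels."""
    dists = [(math.dist(query, x), y) for x, y in zip(data, labels)]
    dists.sort(key=lambda t: t[0])            # sort by distance only
    votes = Counter(y for _, y in dists[:k])  # labels of the k nearest points
    return votes.most_common(1)[0][0]
```

In practice you would z-score the training data first and apply the same means and standard deviations to the query point before classifying.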
Choices of
- Too small
- Results in overfitting: the prediction is sensitive to individual noisy points
- Too large
- Results in underfitting: the prediction is dominated by the overall majority of the data
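A tiny self-contained example of this tradeoff (the data set and the `predict` helper are made up for illustration): with $k = 1$ a single mislabeled point flips the prediction, while $k = 3$ smooths it out.

```python
import math
from collections import Counter

def predict(query, data, labels, k):
    """Majority label among the k nearest points by Euclidean distance."""
    order = sorted(range(len(data)), key=lambda i: math.dist(query, data[i]))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

# A class-'a' cluster near the origin with one mislabeled 'b' point inside it,
# plus a genuine 'b' cluster far away.
data   = [[0, 0], [0, 1], [1, 0], [0.5, 0.5], [5, 5], [5, 6], [6, 5]]
labels = ['a',    'a',    'a',    'b',        'b',    'b',    'b']

predict([0.4, 0.4], data, labels, k=1)  # the single noisy neighbor decides: 'b'
predict([0.4, 0.4], data, labels, k=3)  # two 'a' votes outweigh it: 'a'
```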
CS M146
- $\mathrm{nn}_i(x)$ gives the index in $\{1, \dots, N\}$ of the $i$-th nearest neighbor of $x$, the data point we're interested in classifying
- $\mathrm{nn}(x)$ is an array of indices, one for each data sample point, sorted by distance to $x$
- We use the square of the L2 norm, $\|x - x_n\|_2^2$, to calculate distance here, but other measures could work
- $\mathrm{knn}(x) = \{\mathrm{nn}_1(x), \dots, \mathrm{nn}_k(x)\}$ gives the set of indices of the $k$ nearest neighbors to $x$
- $v_c = \sum_{n \in \mathrm{knn}(x)} \mathbb{1}[y_n = c]$ gives the number of "votes" for a particular class/label $c$
- We iterate over each of the $k$-nearest neighbors and increment $v_c$ by one every time the neighbor's label equals the class of interest
- The prediction is the class that maximizes the number of votes: $\hat{y} = \arg\max_c v_c$
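These definitions translate almost line-for-line into code. A sketch, with function names (`sq_l2`, `knn_indices`, `predict_label`) that are my own, not the course's:

```python
def sq_l2(x, z):
    """Squared L2 norm ||x - z||^2, the distance measure used above."""
    return sum((a - b) ** 2 for a, b in zip(x, z))

def knn_indices(x, data, k):
    """The set knn(x): indices of the k nearest data points to x."""
    return sorted(range(len(data)), key=lambda n: sq_l2(x, data[n]))[:k]

def predict_label(x, data, labels, k):
    """Count v_c for each class c among the k nearest, then take the argmax."""
    votes = {}
    for n in knn_indices(x, data, k):
        votes[labels[n]] = votes.get(labels[n], 0) + 1
    return max(votes, key=votes.get)
```

Note that the squared L2 norm gives the same neighbor ordering as the plain L2 norm, so the square root can be skipped.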
Training Time Complexity
- Training takes $O(1)$ time: kNN is a lazy learner that just stores the training set, so there is nothing to fit. The spatial complexity, however, is definitely $O(Nd)$, since all $N$ points with $d$ features each must be kept
Classifying Time Complexity
- To classify, we need to calculate the distances from our point to all $N$ data points, where each distance calculation scales linearly with the number of features $d$, so this step takes $O(Nd)$ time. We must also sort these distances before finding the $k$-nearest neighbors, which takes $O(N \log N)$ time
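As a side note, a full sort isn't strictly required: a heap can select the $k$ smallest distances in $O(N \log k)$ instead of $O(N \log N)$. A sketch of this standard optimization:

```python
import heapq
import math

def k_nearest(query, data, k):
    """Return the k smallest (distance, index) pairs, in increasing order."""
    dists = [(math.dist(query, x), i) for i, x in enumerate(data)]  # O(N * d)
    return heapq.nsmallest(k, dists)                                # O(N log k)
```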