This article is contribute by Antonis Maronikolakis
Objective: We are given a data set containing items, with numeric values, categorized into classes. We are also given a new item, without a class. We want to categorize it into one of the given classes.
Approach: We will use the k Nearest Neighbors algorithm (kNN for short). The algorithm classifies a new item based on its closest neighbors. In other words, the algorithm looks what class of items is closer to the new item, and it classifies the new item to that class.
Read data from file Given a new item: Compute distances from new item to all other items Pick k closest items Find the most frequent class in these k items Categorize new item in that class
The algorithm will be implemented in Python.
Input: Our input is a text file. The first line holds the feature names and the rest of the lines hold the item information for the features, plus the class the item is categorized in. An example file can be found here. The data set is called Fisher’s Iris, a popular data set in Machine Learning exercises.
We will read the data from the file (stored in “data.txt”) and we will split it by lines.
We will store the feature names (appearing on the first line) in a list:
We will save the data into a list, named items. Each item in the list is represented as a dictionary, whose keys are the feature names, plus “Class” to store the class the item is categorized in.
We will also shuffle the items to ensure they are in a random order.
For the classification of a new item, we want to calculate the distances between the new item and every item in the data set. The distances in this tutorial are calculated via the generalized Euclidean formula for n dimensions. We will hold the k shortest distances in a list and in the end we will pick the class that is most common in that list.
In the list to hold the nearest neighbors, the elements are 2-tuples: (distance, class).
The external functions we need to implement are EuclideanDistance, UpdateNeighbors, CalculateNeighborsClass and FindMax.
Given a new distance to an item, we check if we should add it. First of all, if the list of neighbors has a length less than n we automatically add it to the list, as it is not full yet.
If the list is full (length = n) we check if the distance is longer than all the distances in the list. If yes, we do not add it and we move on.
Otherwise, we remove the longest distance and replace it with the new one, keeping the list sorted in ascending order. That way we can access the largest element by the index -1 (in Python), which gets the last item in the list.
A quick note: We want to keep the list of closest neighbors sorted so that we can more easily update it. To do this, we could implement an Insertion Sort method, but even though it is simple, it takes up quite a bit of space. For the purposes of this tutorial we will go for something easier to write.
We want to calculate the class that appears most frequently in the list of closest neighbors. We will use another dictionary, count, whose keys are the class names appearing in the list of neighbors. As we iterate through the neighbors, if a class name is not in the keys, we will add it. Otherwise, we will increment its count by one.
This function receives as input the dictionary count we build previously and returns its max.
To classify a new item, you must create a dictionary, with keys the feature names and as values the corresponding data and pass it as a parameter to the function Classify.