
k-Nearest Neighbors

This article is contributed by Antonis Maronikolakis.
Objective: We are given a data set containing items with numeric values, categorized into classes. We are also given a new item, without a class. We want to categorize it into one of the given classes.

Approach: We will use the k Nearest Neighbors algorithm (kNN for short). The algorithm classifies a new item based on its closest neighbors. In other words, the algorithm looks at which class the items closest to the new item belong to, and it classifies the new item into that class.

Algorithm:

Read data from file
Given a new item:
    Compute distances from new item to all other items
    Pick k closest items
    Find the most frequent class in these k items
    Categorize new item in that class

The algorithm will be implemented in Python.

Input: Our input is a text file. The first line holds the feature names and the rest of the lines hold the item information for the features, plus the class the item is categorized in. An example file can be found here. The data set is called Fisher’s Iris, a popular data set in Machine Learning exercises.

We will read the data from the file (stored in “data.txt”) and we will split it by lines.

We will store the feature names (appearing on the first line) in a list:

We will save the data into a list, named items. Each item in the list is represented as a dictionary, whose keys are the feature names, plus “Class” to store the class the item is categorized in.

We will also shuffle the items to ensure they are in a random order.
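Putting these steps together, a minimal sketch of the loading code could look like the one below. The function name ReadData, the file name "data.txt" and the comma delimiter are assumptions about the file format, not the original listing.

import random


def ReadData(fileName):
    # Read the whole file and split it into lines
    with open(fileName, 'r') as f:
        lines = f.read().splitlines()

    # The first line holds the feature names; the last column is the class
    features = [name.strip() for name in lines[0].split(',')][:-1]

    items = []
    for line in lines[1:]:
        values = [v.strip() for v in line.split(',')]
        # Each item is a dictionary: feature name -> numeric value,
        # plus "Class" for the category the item belongs to
        item = {'Class': values[-1]}
        for name, value in zip(features, values):
            item[name] = float(value)
        items.append(item)

    # Shuffle so the items end up in a random order
    random.shuffle(items)
    return items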


Classification

For the classification of a new item, we want to calculate the distances between the new item and every item in the data set. The distances in this tutorial are calculated via the generalized Euclidean formula for n dimensions. We will hold the k shortest distances in a list and in the end we will pick the class that is most common in that list.

In the list to hold the nearest neighbors, the elements are 2-tuples: (distance, class).
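As a rough outline, the classification routine could look like the sketch below, assuming the helper functions introduced in the next sections.

def Classify(nItem, k, Items):
    # The k nearest neighbors, stored as (distance, class) 2-tuples
    neighbors = []

    for item in Items:
        # Distance from the new item to this item
        distance = EuclideanDistance(nItem, item)

        # Keep only the k shortest distances seen so far
        neighbors = UpdateNeighbors(neighbors, item, distance, k)

    # Count how often each class appears among the neighbors
    count = CalculateNeighborsClass(neighbors, k)

    # The most frequent class wins
    return FindMax(count)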


The external functions we need to implement are EuclideanDistance, UpdateNeighbors, CalculateNeighborsClass and FindMax.

EuclideanDistance
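For two items x and y with n numeric features, the distance is d(x, y) = sqrt((x1 - y1)^2 + ... + (xn - yn)^2). A possible sketch of this helper, skipping the "Class" key so that only numeric features are compared:

import math


def EuclideanDistance(x, y):
    # Sum of squared differences over the numeric features
    S = 0
    for key in x:
        if key == 'Class':
            continue
        S += math.pow(x[key] - y[key], 2)
    return math.sqrt(S)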


UpdateNeighbors

Given a new distance to an item, we check if we should add it. First of all, if the list of neighbors has a length less than k, we automatically add it to the list, as it is not full yet.

If the list is full (length = k), we check if the distance is longer than all the distances in the list. If so, we do not add it and we move on.

Otherwise, we remove the longest distance and replace it with the new one, keeping the list sorted in ascending order. That way we can access the largest element at index -1 (in Python), which gets the last item in the list.

A quick note: We want to keep the list of closest neighbors sorted so that we can more easily update it. To do this, we could implement an Insertion Sort method, but even though it is simple, it takes up quite a bit of space. For the purposes of this tutorial we will go for something easier to write.

Code:
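A possible way to write it, keeping the (distance, class) tuples in ascending order with a plain list sort rather than an explicit Insertion Sort:

def UpdateNeighbors(neighbors, item, distance, k):
    if len(neighbors) < k:
        # The list is not full yet, so the new pair is always added
        neighbors.append((distance, item['Class']))
        neighbors.sort(key=lambda pair: pair[0])
    elif neighbors[-1][0] > distance:
        # The list is full: replace the longest distance (index -1)
        # only if the new distance is shorter
        neighbors[-1] = (distance, item['Class'])
        neighbors.sort(key=lambda pair: pair[0])
    return neighbors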


CalculateNeighborsClass

We want to calculate the class that appears most frequently in the list of closest neighbors. We will use another dictionary, count, whose keys are the class names appearing in the list of neighbors. As we iterate through the neighbors, if a class name is not in the keys, we will add it. Otherwise, we will increment its count by one.
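A sketch of this counting step:

def CalculateNeighborsClass(neighbors, k):
    count = {}
    for i in range(k):
        # neighbors[i] is a (distance, class) tuple
        cls = neighbors[i][1]
        if cls not in count:
            # First occurrence of this class among the neighbors
            count[cls] = 1
        else:
            count[cls] += 1
    return count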

FindMax

This function receives as input the dictionary count we built previously and returns the class with the highest count.
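A simple sketch of this lookup:

def FindMax(countList):
    maximum = -1         # highest count seen so far
    classification = ''  # class name with that count
    for key in countList:
        if countList[key] > maximum:
            maximum = countList[key]
            classification = key
    return classification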


Conclusion:

To classify a new item, you must create a dictionary whose keys are the feature names and whose values are the corresponding data, and pass it as a parameter to the Classify function.
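For example, with hypothetical Iris-style feature names (adjust the keys to match the first line of your data file):

newItem = {'PetalLength': 1.4, 'PetalWidth': 0.2,
           'SepalLength': 5.1, 'SepalWidth': 3.5}
items = ReadData('data.txt')
print(Classify(newItem, 3, items))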


Complete Code:
